When Simple Becomes Hard: What Four-Digit Multiplication Reveals About the Limits of Today’s AI
Why a task taught in elementary school exposes deep architectural blind spots in modern language models
Large language models have become the engines behind some of the most impressive feats in contemporary computing. They write complex software, summarize scientific papers, and navigate intricate chains of reasoning. Yet as a recent study shows, these same systems falter on a task that most ten-year-olds can perform with pencil and paper. According to a new article from TechXplore and the accompanying research paper “Why Can’t Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls,” even state-of-the-art models fail almost completely at multiplying two four-digit numbers. This surprising failure opens a window into how these systems learn, why they get stuck, and what it takes to help them move beyond their limitations.
The researchers describe this tension as part of AI’s “jagged frontier,” a landscape where models can excel at sophisticated reasoning yet stumble on seemingly simple tasks. Four-digit multiplication turns out to be a perfect example. Humans learn it by breaking the problem into smaller pieces, computing partial products, carrying digits, and keeping track of intermediate sums in their heads. All of this requires holding information in mind across many steps. In machine learning terms, this is called a long-range dependency. It is the ability to store something early in a sequence and retrieve it later when it becomes relevant.
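To make that dependency concrete, here is a small Python sketch of the grade-school procedure described above; the function and variable names are illustrative and not taken from the paper.

```python
def long_multiply(a: int, b: int) -> int:
    """Grade-school long multiplication: compute one partial product per
    digit of b, shift it to its place value, and fold it into a running sum.
    The running sum started at the first step must be remembered until the
    very last step -- the long-range dependency the researchers point to."""
    running_sum = 0
    for place, digit_char in enumerate(reversed(str(b))):
        digit = int(digit_char)
        partial_product = a * digit                    # one row of the written method
        running_sum += partial_product * 10 ** place   # shift and accumulate
    return running_sum

assert long_multiply(4738, 9205) == 4738 * 9205
```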
This is where today’s models struggle. Standard language models learn by recognizing patterns in the data they are trained on. They excel when the next step in a sequence can be predicted from nearby context. But as the TechXplore article notes, the more complex a problem becomes, the less likely a model is to have seen that exact pattern before. Four-digit multiplication requires a model to remember earlier computations while generating later digits, and that is something pattern matching alone cannot accomplish.
To understand why, the researchers tested models with anywhere from two to twelve layers. (A “layer” is one pass of computation inside an AI model where information is transformed before being passed to the next step.) Regardless of size, every model trained with standard fine-tuning achieved less than one percent accuracy on four-digit multiplication. Fine-tuning is the common method for teaching a model a new task. It involves feeding the model many four-digit multiplication examples and adjusting its parameters so it predicts the correct output. This approach works well when the task can be learned by scaling up data or depth. But in this case, scaling did nothing. The models consistently converged on what the researchers call a local optimum. This is a solution that seems good from the model’s perspective but is fundamentally flawed because it lacks the ability to store and retrieve intermediate information.
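The article does not specify the exact training format, but a standard fine-tuning setup for this task might look roughly like the sketch below, which simply generates prompt-and-answer pairs for ordinary next-token prediction; the prompt format here is an assumption for illustration.

```python
import random

def make_example() -> tuple[str, str]:
    """One hypothetical fine-tuning example: a four-digit multiplication
    prompt and the answer string the model is trained to emit, digit by
    digit, via ordinary next-token prediction."""
    a, b = random.randint(1000, 9999), random.randint(1000, 9999)
    return f"{a}*{b}=", str(a * b)

# A fine-tuning corpus is just many such pairs; training adjusts the
# model's parameters so the answer tokens become the likeliest continuation.
dataset = [make_example() for _ in range(100_000)]
print(dataset[0])
```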
The heart of the problem is that multiplication requires a model to carry information forward. If it cannot remember partial products or running sums, it cannot compute the correct digits later in the sequence. The researchers confirmed this by probing the internal states of the models. They attempted to decode intermediate values, such as the running sum that humans compute naturally. In the standard fine-tuned models, these values were nowhere to be found. The models simply had not learned to represent them.
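The article does not detail the probing method, but a common approach is a linear probe: fit a simple linear readout from a layer’s hidden states to the quantity of interest and check whether it can be decoded at all. The sketch below uses random vectors as stand-ins for real hidden states, so everything beyond the probing recipe itself is placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for hidden states at one layer and position (n examples, d dims)
# and the running sums we try to decode from them. In a real probe these
# come from the trained model and from the arithmetic ground truth.
hidden_states = rng.normal(size=(5000, 256))
running_sums = rng.integers(0, 100, size=5000).astype(float)

# Ridge-regression probe: if the model represents the running sum, a
# linear readout should recover it far better than chance.
X = np.hstack([hidden_states, np.ones((len(hidden_states), 1))])
w = np.linalg.solve(X.T @ X + 1e-2 * np.eye(X.shape[1]), X.T @ running_sums)
predictions = X @ w
r2 = 1 - np.sum((running_sums - predictions) ** 2) / np.sum(
    (running_sums - running_sums.mean()) ** 2)
print(f"probe R^2: {r2:.3f}")  # small here, since these stand-in states carry no information
```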
To explore what successful learning looks like, the team examined a model trained using a different method called Implicit Chain of Thought, or ICoT. This technique begins by giving the model explicit step-by-step reasoning during training. Over time, those intermediate steps are gradually removed. The model is forced to internalize the reasoning process rather than rely on visible hints. The result was striking. Whereas standard fine-tuning achieved less than one percent accuracy, the ICoT model reached one hundred percent accuracy.
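A hypothetical sketch of that curriculum is shown below; the step notation and the removal schedule are invented for illustration, not taken from the ICoT method itself.

```python
def icot_target(a: int, b: int, steps_to_drop: int) -> str:
    """Training target with explicit intermediate steps (one shifted partial
    product per digit of b). As training progresses, steps are dropped from
    the front, so the model must internalize the reasoning it no longer sees."""
    steps = [
        f"{a}*{d}e{i}={a * int(d) * 10 ** i}"
        for i, d in enumerate(reversed(str(b)))
    ]
    visible = steps[steps_to_drop:]
    return " ; ".join(visible + [f"answer={a * b}"])

# Early in training: full chain of thought. By the end: answer only.
for steps_to_drop in (0, 2, 4):
    print(icot_target(4738, 9205, steps_to_drop))
```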
By reverse-engineering the ICoT model, the researchers uncovered how it had succeeded. The model learned to track long-range dependencies by organizing its attention patterns into structured pathways across time. In early layers, it computed products of digit pairs and stored them in specific positions, almost like placing documents into labeled folders. In later layers, it retrieved exactly the information needed to compute each digit of the final answer. This internal filing system never emerged in the standard model.
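The actual attention circuits are specific to the trained model, but the store-then-retrieve idea can be caricatured in a few lines of Python: an early pass files every digit-pair product into a slot keyed by its place value, and a later pass reads back only the slots each output digit needs. The slot layout here is a toy, not the model’s.

```python
a_digits = [4, 7, 3, 8]          # 4738, most significant digit first
b_digits = [9, 2, 0, 5]          # 9205

# "Early layer": compute every digit-pair product and file it under a
# slot keyed by the combined place value i + j (a toy labeled folder).
slots: dict[int, list[int]] = {}
for i, da in enumerate(reversed(a_digits)):
    for j, db in enumerate(reversed(b_digits)):
        slots.setdefault(i + j, []).append(da * db)

# "Later layer": each output digit reads back only the slots it needs --
# its own place value plus the carry propagated from lower places.
carry, out_digits = 0, []
for place in range(len(a_digits) + len(b_digits)):
    total = carry + sum(slots.get(place, []))
    out_digits.append(total % 10)
    carry = total // 10

result = int("".join(map(str, reversed(out_digits))))
assert result == 4738 * 9205
print(result)
```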
Even more intriguing, the ICoT model developed elegant internal representations. Instead of treating digits as simple symbols, it encoded them as wave-like patterns known as Fourier bases. (A Fourier basis is a set of simple wave patterns that can be combined to represent numbers or signals in a compact, structured way.) When multiplying digit pairs, it used a geometric operation called a Minkowski sum. None of this was programmed by the researchers. It emerged naturally as the model learned to solve the task. It is as if the model invented its own compact mathematical language for arithmetic.
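For readers who want to see the two mathematical objects the article names, here is a toy numpy illustration, not a reconstruction of the model’s actual representations: digits encoded as points on a few cosine and sine waves, and a Minkowski sum, which adds every element of one set to every element of another.

```python
import numpy as np

def fourier_encode(digit: int, frequencies=(1, 2, 5)) -> np.ndarray:
    """Encode a digit 0-9 as points on a few cosine/sine waves over the
    ten possible values: a simple Fourier-style representation."""
    angles = [2 * np.pi * f * digit / 10 for f in frequencies]
    return np.array([fn(a) for a in angles for fn in (np.cos, np.sin)])

print(fourier_encode(7).round(3))

def minkowski_sum(A: set[int], B: set[int]) -> set[int]:
    """Minkowski sum of two sets: every element of A added to every
    element of B -- the geometric operation the article mentions."""
    return {a + b for a in A for b in B}

print(minkowski_sum({0, 1, 2, 3}, {0, 1, 2, 3}))  # {0, 1, ..., 6}
```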
Armed with this understanding, the researchers asked whether the failing models could be rescued with the right guidance. If the core issue was the inability to track intermediate values, perhaps the model simply needed a training signal that encouraged it to do so. They introduced a small auxiliary objective that required the model to predict the running sum at each step. This gentle nudge provided the missing inductive bias. The result was dramatic. A two-layer model that previously achieved less than one percent accuracy suddenly reached ninety-nine percent accuracy without any explicit chain-of-thought supervision.
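A minimal PyTorch-style sketch of what such an auxiliary objective can look like is below: alongside the usual next-token loss, a small extra head predicts a running-sum value at each position, and the two losses are combined. The model size, the mod-10 simplification of the running sum, and the 0.1 weighting are all invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, D_MODEL, SEQ_LEN = 14, 64, 16   # digits 0-9 plus a few symbol tokens

class TinyArithmeticModel(nn.Module):
    """Two-layer transformer with two heads: the usual next-token head and
    an auxiliary head predicting a running-sum value (mod 10 here, as a
    simplification) at every step. The auxiliary target is what supplies
    the missing inductive bias in this sketch."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Embedding(SEQ_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)   # next-token logits
        self.aux_head = nn.Linear(D_MODEL, 10)     # running-sum digit logits

    def forward(self, tokens):
        positions = torch.arange(tokens.size(1), device=tokens.device)
        h = self.embed(tokens) + self.pos(positions)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.encoder(h, mask=causal)
        return self.lm_head(h), self.aux_head(h)

model = TinyArithmeticModel()
tokens = torch.randint(0, VOCAB, (8, SEQ_LEN))       # placeholder batch
next_tokens = torch.randint(0, VOCAB, (8, SEQ_LEN))  # placeholder LM targets
running_sum = torch.randint(0, 10, (8, SEQ_LEN))     # placeholder aux targets

lm_logits, aux_logits = model(tokens)
lm_loss = F.cross_entropy(lm_logits.reshape(-1, VOCAB), next_tokens.reshape(-1))
aux_loss = F.cross_entropy(aux_logits.reshape(-1, 10), running_sum.reshape(-1))
loss = lm_loss + 0.1 * aux_loss   # gentle nudge: small weight on the auxiliary term
loss.backward()
```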
When the team examined the attention patterns of this newly successful model, they found that it had learned mechanisms similar to the ICoT model. It stored and retrieved partial products and even developed additional strategies, such as tracking multiple digit pairs simultaneously. This confirmed that the right architectural guidance can unlock capabilities that scaling alone cannot reach.
The implications extend far beyond multiplication. Long-range dependencies appear throughout language modeling, scientific reasoning, planning, and any task that requires information to be carried across many steps. The study shows that standard training methods can trap models in shallow solutions that look correct locally but fail globally. It also shows that carefully designed inductive biases can help models escape these traps and learn deeper, more structured reasoning.
The researchers recommend that future work focus on developing general-purpose inductive biases that help models track information across long sequences. Rather than relying solely on more data or larger models, the field may need to incorporate architectural insights that encourage models to build internal representations that support reasoning.
In practical terms, this research could influence how future AI systems are designed. Models that can reliably store and retrieve intermediate information will be better equipped for tasks like multi-step planning, mathematical reasoning, scientific analysis, and complex decision making. The real-world impact could be significant, especially in domains where accuracy and reliability matter.
The path ahead involves exploring new training techniques, refining architectural components, and developing tools that help models learn processes rather than patterns. This study provides a clear example of how understanding a model’s internal mechanics can lead to breakthroughs in capability. It reminds us that AI is still in its infancy, and that progress in AI is not just about making models bigger. It is about making them smarter in how they learn, remember, and reason.



