New Research Exposes Reasoning Limitations in Popular LLMs

Recent advancements in large language models (LLMs) like GPT-4 have led to impressive capabilities across a wide range of tasks, including complex reasoning. However, a new study from researchers at Apple sheds light on concerning limitations in the reasoning abilities of even the most advanced LLMs available today.

The study, titled “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models” (https://arxiv.org/abs/2410.05229), introduces a benchmark called GSM-Symbolic that evaluates LLMs on grade-school-level math problems. Unlike existing benchmarks, which report only a single accuracy number, GSM-Symbolic uses symbolic templates to generate diverse variants of each question, allowing model performance to be analyzed across many scenarios.
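To make the idea concrete, here is a minimal Python sketch of how a symbolic template can spawn many variants of one problem by swapping names and numbers. The template, name list, and number ranges are illustrative assumptions, not the paper’s actual templates.

```python
import random

# Illustrative only: a hypothetical template in the spirit of GSM-Symbolic,
# not the paper's actual template format or variable ranges.
TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "How many apples does {name} have in total?"
)

def generate_variant(rng: random.Random) -> tuple[str, int]:
    """Instantiate the template with fresh names and numbers, returning
    the question text and its ground-truth answer."""
    name = rng.choice(["Oliver", "Sophie", "Liam", "Mia"])
    a, b = rng.randint(5, 50), rng.randint(5, 50)
    question = TEMPLATE.format(name=name, a=a, b=b)
    return question, a + b

rng = random.Random(0)
for question, answer in (generate_variant(rng) for _ in range(5)):
    print(question, "->", answer)
```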

The researchers evaluated over 20 state-of-the-art open-source and closed-source LLMs, including models from major AI labs like OpenAI, Google, and Meta. Their findings reveal several key insights:

  1. High performance variability: The researchers generated multiple versions of the same underlying math problems, changing only names and numerical values. When tested on these variants, all models showed significant variability in accuracy. For example, the Gemma2-9B model’s accuracy ranged from 75% to 87% across different question variants (a minimal evaluation sketch follows this list). This suggests LLMs may be relying more on pattern matching than on true logical reasoning.
  2. Sensitivity to numerical values: The study compared model performance when changing only proper names in questions versus changing numerical values. While models showed some robustness to name changes, they were much more sensitive to alterations in numerical values. This further indicates a lack of deep conceptual understanding.
  3. Declining performance with complexity: Researchers created versions of problems with varying difficulty by adding or removing clauses. As they increased question difficulty by adding more components, model performance consistently decreased while variance increased. This hints at fundamental limitations in handling multi-step reasoning.
  4. Confusion from irrelevant information: The researchers created a dataset called GSM-NoOp by adding extra clauses to math word problems that appeared relevant but did not actually impact the solution. Even the most advanced models struggled immensely with these questions, with performance drops of up to 65%.
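As referenced in the first finding above, what surfaces this variability is reporting a distribution of accuracies across generated variant sets rather than a single score. The sketch below assumes a hypothetical ask_model() helper that returns a model’s numeric answer; it is a placeholder for whatever model client you use, not an API from the paper.

```python
from statistics import mean

def ask_model(question: str) -> int:
    """Placeholder: replace with a call to your model of choice."""
    raise NotImplementedError("plug in your model client here")

def accuracy_on_set(variant_set: list[tuple[str, int]]) -> float:
    """Fraction of questions in one generated variant set answered correctly."""
    correct = sum(1 for question, answer in variant_set
                  if ask_model(question) == answer)
    return correct / len(variant_set)

def accuracy_spread(variant_sets: list[list[tuple[str, int]]]) -> None:
    """Report the per-set accuracy distribution rather than a single number."""
    scores = [accuracy_on_set(s) for s in variant_sets]
    print(f"mean={mean(scores):.3f}  min={min(scores):.3f}  max={max(scores):.3f}")
```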

The fourth finding, on irrelevant information, deserves a closer look.

For example, consider this sample problem from the GSM-NoOp dataset:

“Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?”

The clause about five kiwis being smaller is completely irrelevant to the total count. However, multiple top LLMs tested, including OpenAI’s latest o1-preview model, incorrectly subtracted these five kiwis from the total in their calculations.
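For reference, the intended solution simply ignores the size remark, while the reported failure mode corresponds to subtracting the five smaller kiwis. A quick arithmetic check makes the gap explicit.

```python
# Worked arithmetic for the kiwi problem above: the size remark is a no-op.
friday = 44
saturday = 58
sunday = 2 * friday                        # "double the number he did on Friday" = 88

correct_total = friday + saturday + sunday # 44 + 58 + 88 = 190
distracted_total = correct_total - 5       # the erroneous subtraction = 185

print(correct_total, distracted_total)     # 190 185
```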

This propensity to blindly incorporate irrelevant information reveals a critical flaw in how current LLMs approach problem-solving. Rather than truly understanding the conceptual underpinnings of a problem, they appear to be applying learned patterns in ways that can easily lead them astray when presented with novel scenarios.

The implications of these findings are significant. As LLMs are increasingly deployed in real-world applications that involve critical decision-making or complex reasoning, their brittleness in the face of irrelevant information could lead to dangerous errors. Whether in healthcare, finance, or other high-stakes domains, the ability to discern relevant from irrelevant information is crucial.

Moreover, these results call into question how much genuine reasoning capability current LLMs actually possess. While they can certainly produce impressive-looking step-by-step solutions to many problems, their sensitivity to minor changes and inability to ignore distractors suggest their apparent reasoning may be more superficial than previously thought.

The researchers argue that addressing these limitations will require fundamental advances in AI architecture and training approaches. Simply scaling up existing models or fine-tuning on more data is unlikely to bridge the gap to true conceptual understanding and robust reasoning.

As AI continues to advance at a rapid pace, studies like this serve as crucial reality checks on the current state of the technology. While LLMs have made remarkable strides in recent years, achieving human-like reasoning capabilities remains a formidable challenge. Researchers and practitioners alike must remain cognizant of these limitations as they develop and deploy AI systems.

Moving forward, the GSM-Symbolic benchmark introduced in this study provides a valuable new tool for assessing and improving the mathematical reasoning abilities of language models. By enabling more nuanced evaluation across diverse problem variants, it can help AI developers better understand the strengths and weaknesses of their models.

Ultimately, bridging the gap between pattern matching and genuine conceptual reasoning may prove to be one of the most important frontiers in AI research. As this study demonstrates, there is still much work to be done before we can confidently rely on AI systems for complex reasoning tasks in the real world.

To learn more about how to use LLMs to advance applications at your organization, visit us at https://www.njii.com/ai-ml-overview/ or contact us today!