Why GPT Falls Short of Human Thinking

Universiteit van Amsterdam

Artificial Intelligence (AI), particularly large language models like GPT-4, has shown impressive performance on reasoning tasks. But does AI truly understand abstract concepts, or is it just mimicking patterns? A new study from the University of Amsterdam and the Santa Fe Institute reveals that while GPT models perform well on some analogy tasks, they fall short when the problems are altered, highlighting key weaknesses in AI's reasoning capabilities.

Analogical reasoning is the ability to draw a comparison between two different things based on their similarities in certain respects. It is one of the most common ways in which people try to understand the world and make decisions. An example of analogical reasoning: cup is to coffee as soup is to ??? (the answer being: bowl).

Large language models like GPT-4 perform well on various tests, including those requiring analogical reasoning. But can AI models truly engage in general, robust reasoning or do they over-rely on patterns from their training data? This study by language and AI experts Martha Lewis (Institute for Logic, Language and Computation at the University of Amsterdam) and Melanie Mitchell (Santa Fe Institute) examined whether GPT models are as flexible and robust as humans in making analogies. 'This is crucial, as AI is increasingly used for decision-making and problem-solving in the real world', explains Lewis.

Comparing AI models to human performance

Lewis and Mitchell compared the performance of humans and GPT models on three different types of analogy problems:

  1. Letter sequences – Identifying patterns in letter sequences and completing them correctly (a toy illustration follows this list).
  2. Digit matrices – Analyzing number patterns and determining the missing numbers.
  3. Story analogies – Understanding which of two stories best corresponds to a given example story.
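
To make the first task type concrete, here is a minimal sketch, assuming a simple "replace the last letter with its successor" rule and a hypothetical reversed alphabet as the modified variant; it is illustrative only and not the study's actual evaluation materials or code.

```python
# Illustrative sketch of a letter-sequence analogy (not the study's own code).
# Rule inferred from the example pair "abc -> abd": replace the last letter
# with its successor. A robust reasoner should apply the same abstract rule
# under any alphabet ordering, not just the familiar a-z one.

def advance_last(s: str, alphabet: str) -> str:
    """Replace the last letter of s with its successor in the given alphabet."""
    succ = {c: alphabet[(i + 1) % len(alphabet)] for i, c in enumerate(alphabet)}
    return s[:-1] + succ[s[-1]]

standard = "abcdefghijklmnopqrstuvwxyz"
modified = standard[::-1]  # hypothetical variant: the same letters in reverse order

# Standard problem: a b c -> a b d, therefore i j k -> ?
print(advance_last("ijk", standard))  # -> 'ijl'

# Modified problem: the same abstract rule over the reversed ordering,
# where the successor of 'k' is now 'j'.
print(advance_last("ijk", modified))  # -> 'ijj'
```

A system that has genuinely abstracted the rule answers both versions correctly, while one that matches surface patterns from the familiar alphabet tends to fail on the modified version.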

A system that truly understands analogies should maintain high performance even on variations

In addition to testing whether GPT models could solve the original problems, the study examined how well they performed when the problems were subtly modified. 'A system that truly understands analogies should maintain high performance even on these variations', state the authors in their article.

GPT models struggle with robustness

Humans maintained high performance on most modified versions of the problems, but GPT models, while performing well on standard analogy problems, struggled with variations. 'This suggests that AI models often reason less flexibly than humans and their reasoning is less about true abstract understanding and more about pattern matching', explains Lewis.

In digit matrices, GPT models showed a significant drop in performance when the position of the missing number changed. Humans had no difficulty with this. In story analogies, GPT-4 tended to select the first given answer as correct more often, whereas humans were not influenced by answer order. Additionally, GPT-4 struggled more than humans when key elements of a story were reworded, suggesting a reliance on surface-level similarities rather than deeper causal reasoning.

On simpler analogy tasks, GPT models showed a decline in performance when tested on modified versions, while humans remained consistent. However, for more complex analogical reasoning tasks, both humans and AI struggled.

Weaker than human cognition

This research challenges the widespread assumption that AI models like GPT-4 can reason in the same way humans do. 'While AI models demonstrate impressive capabilities, this does not mean they truly understand what they are doing', conclude Lewis and Mitchell. 'Their ability to generalize across variations is still significantly weaker than human cognition. GPT models often rely on superficial patterns rather than deep comprehension.'

This is a critical warning for the use of AI in important decision-making areas such as education, law, and healthcare. AI can be a powerful tool, but it is not yet a replacement for human thinking and reasoning.

Article details

Martha Lewis and Melanie Mitchell, 2025, 'Evaluating the Robustness of Analogical Reasoning in Large Language Models', Transactions on Machine Learning Research.
