Apple Finds AI Is Great at Pretending to Think—But Not Much Else

The study challenges the prevailing belief that such models truly “think” like humans


Apple researchers have published a revealing study titled “The Illusion of Thinking,” shedding light on the actual reasoning capabilities of large language models (LLMs) and their more advanced counterparts, large reasoning models (LRMs).

The study challenges the prevailing belief that such models truly “think” like humans, demonstrating that their apparent intelligence may be more superficial than previously assumed.

The research team, led by Parshin Shojaee, constructed a set of synthetic puzzles with adjustable complexity to rigorously test how well LRMs handle multi-step logical reasoning.
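The paper's puzzle environments include classics such as Tower of Hanoi, where difficulty is controlled by a single knob (the number of disks) and every proposed solution can be checked automatically. The sketch below is not the authors' actual harness; it is a minimal illustration, with hypothetical helper names, of how a controllable-complexity puzzle and a verifier for a model's proposed move sequence could look.

```python
# Illustrative sketch (not the paper's test harness): a Tower of Hanoi
# environment whose difficulty scales with the disk count. A model's answer
# is a list of (source_peg, target_peg) moves that the verifier checks for
# legality and completion.

def initial_state(n_disks):
    """Three pegs; all disks start on peg 0, largest at the bottom."""
    return [list(range(n_disks, 0, -1)), [], []]

def verify(n_disks, moves):
    """Return True if `moves` legally transfers every disk to peg 2."""
    pegs = initial_state(n_disks)
    for src, dst in moves:
        if not pegs[src]:
            return False                      # nothing to move from this peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                      # cannot place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_disks, 0, -1))

def optimal_moves(n_disks, src=0, dst=2, aux=1):
    """Reference solution: the classic 2^n - 1 move recursion."""
    if n_disks == 0:
        return []
    return (optimal_moves(n_disks - 1, src, aux, dst)
            + [(src, dst)]
            + optimal_moves(n_disks - 1, aux, dst, src))

if __name__ == "__main__":
    for n in (3, 7, 12):                      # complexity grows with disk count
        plan = optimal_moves(n)
        print(f"{n} disks: {len(plan)} moves, valid={verify(n, plan)}")
```

Because the verifier is exact, this kind of setup lets researchers score reasoning traces objectively and ratchet up difficulty one disk at a time, rather than relying on benchmarks the models may have memorized.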

While these models often produce what looks like coherent “chain-of-thought” responses, the paper argues that this reasoning may be largely decorative—lacking the depth and adaptability of real human cognition.

One of the study’s most striking findings is that LRMs begin to collapse under increasing task complexity. While they perform well on simple and moderately difficult tasks, their performance degrades significantly as puzzles become more complex.

In fact, as problem difficulty rises past a certain point, the models often expend fewer reasoning tokens rather than more, suggesting they are not even attempting to "try harder." This behavior indicates a fundamental limitation in current reasoning architectures.
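One way to see this "giving up" effect concretely is to plot the length of a model's reasoning trace against puzzle difficulty. The snippet below is a hypothetical measurement sketch, not the paper's code; `solve_with_reasoning` stands in for whatever LRM API exposes both a final answer and an intermediate reasoning trace.

```python
# Hypothetical sketch: track how many reasoning ("thinking") tokens a model
# emits as puzzle difficulty grows. A curve that rises and then falls at high
# difficulty would mirror the collapse the Apple paper describes.

def solve_with_reasoning(puzzle_prompt: str) -> tuple[str, str]:
    """Placeholder: return (final_answer, reasoning_trace) from an LRM API."""
    raise NotImplementedError

def reasoning_effort_curve(puzzles_by_difficulty: dict[int, str]) -> dict[int, int]:
    """Map each difficulty level to the length of the model's reasoning trace."""
    effort = {}
    for level, prompt in sorted(puzzles_by_difficulty.items()):
        _, trace = solve_with_reasoning(prompt)
        effort[level] = len(trace.split())   # crude proxy: whitespace-separated words
    return effort
```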

The researchers observed three clear performance regimes: LLMs outperform LRMs on very simple tasks, LRMs excel at moderate complexity, but both types of models fail on high-complexity problems. Perhaps most alarmingly, even when models are provided with a complete algorithm or a step-by-step reasoning structure, they frequently fail to follow through correctly.

The study concludes that LRMs give an “illusion of thinking”—they mimic the form of reasoning without engaging in the substance. This raises questions about the validity of using chain-of-thought prompting as evidence of deep understanding or cognitive ability.

Reacting to the paper, Ethan Mollick, AI influencer and associate professor at The Wharton School, posted on X (formerly Twitter) that the Apple paper on the limits of reasoning models in particular tests is useful and important, but that the "LLMs are hitting a wall" narrative building around it feels premature at best.

"Reminds me of the buzz over model collapse - limitations that were overcome quickly in practice," he remarked.

To which LLM critic and AI scientist Gary Marcus replied, "AI is not hitting a wall. But LLMs probably are (or at least a point of diminishing returns). We need new approaches, and to diversify which roads are being actively explored."

He further pointed out in a Substack post, "At least for the next decade, LLMs (with and without inference time "reasoning") will continue to have their uses, especially for coding and brainstorming and writing."

Last year, Apple released another paper, "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models," in which researchers found that LLMs, despite sounding smart, fail basic tests of logical reasoning.

In experiments, inserting irrelevant details into questions caused popular AI systems to give wrong or nonsensical answers. The study suggested LLMs don’t actually “understand” queries—they mimic intelligence by predicting likely responses, not through genuine comprehension or reasoning.
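The perturbation idea behind that finding can be paraphrased with the paper's well-known kiwi example: an irrelevant detail (some kiwis being slightly smaller) tempts models into subtracting it from the total. The harness below is an illustrative sketch, not the study's actual evaluation code; `ask_model` is a placeholder for any LLM call.

```python
# Illustrative sketch of the GSM-Symbolic-style perturbation: compare a model's
# answers on a base word problem and on a variant with an irrelevant detail.

BASE = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
        "On Sunday he picks double what he picked on Friday. "
        "How many kiwis does Oliver have?")

DISTRACTOR = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
              "On Sunday he picks double what he picked on Friday, "
              "but five of them are a bit smaller than average. "
              "How many kiwis does Oliver have?")

CORRECT = 44 + 58 + 2 * 44  # 190 in both cases; the size detail changes nothing

def ask_model(prompt: str) -> int:
    """Placeholder: call your LLM of choice and parse an integer answer."""
    raise NotImplementedError

if __name__ == "__main__":
    for name, prompt in (("base", BASE), ("with distractor", DISTRACTOR)):
        answer = ask_model(prompt)
        print(f"{name}: answered {answer}, correct={answer == CORRECT}")
```

A model that genuinely parses the problem answers 190 in both cases; the reported failures show up as the distractor variant pulling answers off target.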