Apple’s “The Illusion of Thinking” Is a Wake-Up Call for the AI Industry


Just days before Apple’s much-anticipated WWDC 2025, the tech giant quietly dropped a research paper that may have done more to shake up the AI world than anything announced on stage.

Titled “The Illusion of Thinking,” the research paper delivers a blunt and unsettling message: large reasoning models (LRMs)—including OpenAI’s o1 and o3, Claude 3.7 Sonnet Thinking, DeepSeek R1, and Google’s Gemini Flash Thinking—completely collapse when confronted with harder logic puzzles.

Yes, collapse. Not underperform. Not make a few mistakes. They give up.

And if you’re one of those riding the AGI hype train? This may be the cold bucket of reality you didn’t see coming.

Apple’s shocking findings on AI reasoning

The research, conducted by Apple’s AI team, revisits classic logic puzzles:

  • Tower of Hanoi—Move stacked disks from one peg to another using strict rules.
  • Jumping Checkers—Move pieces over others into empty spaces using logic.
  • River Crossing Problem—Classic chicken-fox-grain problem where items must cross in a boat with constraints.
  • Block Stacking—Arrange blocks in a specific sequence using spatial reasoning.

They increased difficulty systematically—for instance, adding more disks to the Tower of Hanoi or more constraints in the river crossing puzzle.
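To see why adding disks makes the Tower of Hanoi so much harder, note that the optimal solution length doubles (plus one) with every disk: 2^n − 1 moves for n disks. Here is a minimal recursive solver, not from Apple's paper, just the textbook algorithm, that makes the scaling concrete:

```python
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Solve Tower of Hanoi for n disks; return the optimal move list."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, dst, aux, moves)   # park the top n-1 disks on the spare peg
    moves.append((src, dst))             # move the largest disk to the target
    hanoi(n - 1, aux, src, dst, moves)   # restack the n-1 disks on top of it
    return moves

for n in range(3, 9):
    # len(hanoi(n)) == 2**n - 1: each extra disk doubles the work
    print(n, len(hanoi(n)))
```

An 8-disk instance already needs 255 perfect moves in sequence, which is why a single threshold disk can tip a model from solvable to hopeless.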

You probably solved these puzzles in a school math class or stumbled across them in a logic game online.

But when Apple’s researchers used them to test the “reasoning” of top AI models, the results were damning:

“Accuracy progressively declines as problem complexity increases until reaching complete collapse (zero accuracy) beyond a model-specific complexity threshold,” Apple’s paper states.

Apple found a three-stage behavior pattern across all LRMs:

  • Low complexity—Surprisingly, traditional LLMs outperformed LRMs, possibly due to overthinking or misalignment in their “reasoning” mechanisms.
  • Medium complexity—LRMs improved and outperformed LLMs. This is their “sweet spot.”
  • High complexity—All models collapsed in accuracy (approaching 0%): total reasoning breakdown.
  • Claude 3.7 Sonnet (Thinking) and DeepSeek R1 began to fail once a fifth disk was added to the Tower of Hanoi puzzle.
  • Even with more computing power, the models still failed to solve more complex problems.
  • The models initially spend more tokens (i.e., “thinking effort”) as problems get harder — but after a certain point, they begin spending fewer tokens, essentially giving up.

Let that sink in: as the puzzle gets tougher, the AI literally thinks less.

AI can code and write, but can it really reason?

You might be wondering, “Why is this a big deal? AI still performs well on math and code, right?”

Sure. But logic and reasoning are cornerstones of human intelligence—and more importantly, of Artificial General Intelligence (AGI).

And if these LRMs—fine-tuned to do thinking tasks—fail on puzzles many humans can solve, it throws a wrench into the narrative that AGI is around the corner.

Even more alarming: when researchers fed the models the correct algorithm upfront, accuracy didn’t improve.

Yes, even with step-by-step instructions included in the prompt, the models still failed.

Apple’s reluctance to join the AI arms race

Until now, Apple has seemed like a reluctant participant in the generative AI boom.

While Google and Samsung have packed their devices with flashy AI features, Apple stayed relatively quiet—until WWDC 2025, where it finally launched “Apple Intelligence.”

But even then, reactions were lukewarm. “Underwhelming,” some said.

Now, this research may explain why. Apple may be signaling a radically different—and cautious—approach to AI: don’t fake thinking. Don’t pretend reasoning is solved. Build from first principles. Test it rigorously.

And to be fair, it’s not all bad news. Apple’s paper doesn’t say that AI doesn’t reason at all. It just points out that reasoning ability plateaus fast—and then collapses.

The AGI skeptics are having a moment

For AGI skeptics and critics of the AI hype cycle, this research is pure vindication.

The skeptics aren’t wrong. LLMs, however powerful, aren’t magic. They’re statistical machines predicting the next token. Coding and summarizing? Sure. Deep reasoning? Not yet.

And here’s where it gets spicier: Apple’s researchers didn’t compare the models to human results. Why? Because many humans also fail at, say, Tower of Hanoi with 8 discs.

So yes, AI models aren’t “superintelligent.” But they might not be worse than average humans either—depending on the task.

So, is AGI a mirage?

Maybe. Or maybe we’re just not there yet.

This isn’t the death of AI. But it is a signal—a loud, inconvenient, and data-backed signal—that much of what’s being marketed as “thinking” by AI tools is really just a performance illusion.

It performs well… until it doesn’t.
It reasons well… until it breaks.
It tries hard… until it gives up entirely.

The real question is, are we ready to stop pretending?

Final thoughts

Let’s be clear. Apple didn’t release this paper to dunk on its competitors. They built these LRMs too. The real value of this research is the transparency and realism it brings to a field too often intoxicated by its own marketing.

Here’s what the research teaches us:

  • Reasoning is fragile in current AI models.
  • More tokens ≠ better thinking.
  • AGI is not just a scale problem. Throwing more data or more parameters may not solve it.

This paper is not an obituary for artificial intelligence. But it is a reality check. And we needed one.

Written by Aakash Jethwani, Founder & Creative Director, Octet Design Studio

With over 12 years of experience and 300+ successful projects, Aakash Jethwani is a recognized design expert. As the founder and creative director of Octet Design Studio, he leads a team of 28+ designers and developers, delivering pixel-perfect designs that balance creativity and technology. Aakash is known for crafting tailored design solutions that help businesses stand out in competitive markets. His commitment to innovative strategies and exceptional customer experiences drives sustainable growth for his clients, making him a trusted partner for business transformation.
