The Mirage of Machine Reasoning: How Artificial Intelligence Simulates Understanding Without Comprehension
The Architecture of Illusion
The contemporary discourse surrounding artificial intelligence has become dominated by an intoxicating narrative: machines are learning to think. From OpenAI’s o-series to Claude’s thinking models and DeepSeek’s reasoning architectures, the latest generation of large language models appears to engage in genuine contemplation, working through problems step-by-step, reconsidering their approaches, and arriving at solutions through processes that mirror human reasoning. These systems generate elaborate chains of thought, engage in apparent self-correction, and demonstrate sophisticated problem-solving capabilities that have convinced many observers that we stand at the threshold of artificial general intelligence.
Yet beneath this veneer of rational deliberation lies a more troubling reality. Recent empirical investigations using controlled puzzle environments reveal that these reasoning models, despite their impressive surface behaviors, exhibit fundamental limitations that challenge the very foundations of the AGI narrative. When subjected to systematic analysis across varying complexity levels, these systems demonstrate not the robust, generalizable reasoning abilities their proponents claim, but rather a sophisticated form of pattern matching that breaks down in predictable and illuminating ways.
The illusion of machine reasoning emerges from a fundamental confusion between two distinct cognitive phenomena: the simulation of reasoning processes and the actual execution of logical operations. Current large language models excel at the former while failing systematically at the latter. They have learned to mimic the linguistic signatures of human thought (the tentative explorations, the apparent insights, the structured argumentation) without developing the underlying computational architecture necessary for genuine logical processing.
This distinction becomes clearer when we examine how these models approach problem-solving. Rather than implementing algorithmic procedures that could reliably navigate logical challenges, they generate text sequences that statistically resemble the patterns produced by reasoning humans. The models have internalized vast corpora of human problem-solving discourse, enabling them to reproduce the surface features of rational deliberation with remarkable fidelity. They can generate statements like “Let me think through this step by step” and “Upon reflection, I realize my error” not because they engage in actual step-by-step analysis or genuine reflection, but because these phrases appear in optimal positions within the statistical landscape they have learned to navigate.
The mechanisms underlying this simulation operate through linguistic reasoning proxies—textual patterns that correlate with rational discourse without requiring the cognitive processes that originally generated them. When a reasoning model encounters a mathematical problem, it does not perform mathematical operations in any meaningful sense. Instead, it recognizes textual cues that indicate mathematical content and generates sequences that statistically align with the linguistic patterns surrounding mathematical solutions in its training data. The model has learned that certain problem descriptions correlate with certain solution structures, enabling it to produce outputs that appear to demonstrate mathematical understanding while lacking any genuine computational foundation.
This pattern extends across domains. In logical puzzles, the models recognize structural features that indicate puzzle-like content and generate response patterns that mirror human puzzle-solving discourse. In causal reasoning tasks, they identify linguistic markers associated with causal relationships and produce text that exhibits the grammatical and semantic features of causal explanation. The sophistication of these simulations can be extraordinary, incorporating apparent false starts, self-corrections, and insights that closely mimic the phenomenology of human reasoning.
The training process that creates these capabilities operates through a form of behavioral cloning at the linguistic level. By learning to predict the next token in vast sequences of human-generated text, these models implicitly acquire the ability to reproduce the surface patterns of human reasoning without developing the underlying cognitive architecture that originally produced those patterns. They become expert mimics of human reasoning discourse while remaining fundamentally incapable of the logical operations that such discourse represents.
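In more concrete terms, the training signal behind this behavioral cloning is the standard next-token objective: for a text sequence $x_1, \ldots, x_T$, the model’s parameters $\theta$ are adjusted to minimize the cross-entropy loss

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)$$

Nothing in this objective rewards the correctness of any computation the text describes; it rewards only the likelihood of reproducing text that looks like what humans have written, including text shaped like reasoning.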
This limitation becomes particularly apparent when examining the models’ relationship to explicit algorithmic procedures. Even when provided with step-by-step algorithms for solving specific problems, reasoning models frequently fail to execute these procedures reliably. They can reproduce the textual description of algorithmic steps with impressive accuracy, but they struggle to maintain the logical consistency required for faithful implementation. This suggests that their apparent reasoning capabilities operate primarily at the level of linguistic pattern matching rather than genuine logical processing.
The implications of this analysis extend far beyond technical assessments of model capabilities. The widespread belief that current reasoning models represent significant progress toward artificial general intelligence rests on a category error—the confusion of linguistic sophistication with cognitive capability. These systems excel at producing text that exhibits the surface features of rational discourse while lacking the underlying computational architecture necessary for genuine reasoning. They have achieved reasoning ventriloquism—the ability to speak in the voice of rational agents without possessing rational agency.
This distinction carries profound consequences for how we interpret model behaviors and project future capabilities. If reasoning models truly engaged in logical processing, we might expect their capabilities to scale smoothly with computational resources and training data. We might anticipate that providing explicit algorithms would reliably improve performance, that complex problems would be solved through systematic decomposition, and that reasoning traces would exhibit genuine logical coherence. Instead, empirical investigation reveals patterns that align more closely with sophisticated pattern matching operating near its fundamental limits.
The seductive power of the reasoning simulation lies in its alignment with human cognitive biases. When we observe systems generating text that resembles our own thought processes, we naturally project intentionality and understanding onto these productions. The anthropomorphic tendency to attribute mental states to systems that exhibit familiar behavioral patterns operates with particular force when those patterns specifically mimic the linguistic expressions of human cognition. The result is a systematic overestimation of model capabilities that obscures their actual limitations and operational principles.
Understanding these systems as reasoning simulators rather than reasoning agents provides a more accurate framework for evaluating their capabilities and limitations. This perspective explains why reasoning models can produce impressive performances on many tasks while failing catastrophically on others that appear superficially similar. It illuminates why providing explicit algorithms often fails to improve performance, why reasoning traces frequently lack genuine logical coherence, and why model capabilities plateau in characteristic ways as problem complexity increases.
The recognition that current reasoning models operate primarily through linguistic simulation rather than logical processing does not diminish their technical sophistication or practical utility. These systems represent remarkable achievements in pattern recognition and text generation. However, accurate assessment of their capabilities requires distinguishing between the simulation of reasoning and reasoning itself—a distinction that proves crucial for understanding both their current limitations and the challenges facing genuine progress toward artificial general intelligence.
The Three Regimes of Systematic Failure
Empirical investigation of reasoning models through controlled puzzle environments reveals a striking pattern that undermines optimistic projections about their reasoning capabilities. Rather than demonstrating smooth improvement with increased computational resources or gradual degradation under mounting complexity, these systems exhibit behavior that partitions cleanly into three distinct regimes, each revealing different aspects of their fundamental limitations.
The first regime encompasses problems of low computational complexity—puzzles with fewer than five sequential steps, simple logical relationships, or minimal state-space exploration requirements. Counterintuitively, reasoning models frequently underperform their non-reasoning counterparts in this domain. Standard language models, operating without elaborate chain-of-thought mechanisms or reflective prompts, often solve these problems more efficiently and accurately than their reasoning-enhanced variants. This phenomenon appears across multiple problem types: basic arithmetic, simple logical puzzles, straightforward planning tasks, and elementary pattern recognition challenges.
The efficiency disparity in this regime proves particularly illuminating. Where a standard model might generate a direct solution in fifty tokens, a reasoning model might expend several thousand tokens on elaborate deliberation before arriving at the same answer, if it arrives at the correct answer at all. This computational overhead represents more than mere inefficiency; it reveals that the reasoning simulation actively interferes with performance on problems that can be solved through direct pattern matching. The models have learned to engage in performative reasoning even when such performance provides no benefit and sometimes actively harms accuracy.
Analysis of reasoning traces in this low-complexity regime reveals an overthinking pathology—a systematic tendency to complicate straightforward problems through unnecessary elaboration. Models identify correct solutions early in their reasoning process but continue generating text that explores alternatives, second-guesses initial insights, or pursues irrelevant tangents. This behavior superficially mimics intellectual humility and thoroughness while actually degrading performance through error accumulation and attention diffusion.
The overthinking phenomenon exposes a fundamental disconnect between the models’ linguistic sophistication and their logical capabilities. They have learned that extended deliberation correlates with careful reasoning in human discourse, leading them to generate elaborate reasoning traces even when simple problems require only direct application of learned patterns. The result is a form of cognitive theater where models perform reasoning behaviors without executing reasoning processes.
The second regime emerges around medium-complexity problems—challenges requiring ten to fifty sequential operations, moderate state-space exploration, or multi-step logical inference. Here, reasoning models demonstrate their most compelling capabilities, consistently outperforming standard language models while generating reasoning traces that appear genuinely insightful. Problems in this range include multi-step arithmetic with intermediate dependencies, planning puzzles requiring temporary moves to achieve final goals, and logical reasoning chains with several inferential steps.
The success of reasoning models in this medium-complexity regime provides the empirical foundation for claims about their reasoning capabilities. The models generate coherent solution strategies, recover from initial errors through apparent self-correction, and demonstrate flexible adaptation when initial approaches prove inadequate. Their reasoning traces exhibit temporal structure that mirrors human problem-solving: initial exploration, hypothesis formation, testing and revision, and convergence toward solutions. This behavioral resemblance to human cognition proves sufficiently compelling that many researchers interpret it as evidence of genuine reasoning capability.
However, detailed analysis reveals that even in this regime of apparent success, the models’ operations remain fundamentally limited. Their “self-correction” often involves cycling through alternative linguistic formulations rather than genuine error detection and logical revision. Their “flexible adaptation” typically manifests as switching between different pattern templates rather than systematic strategy modification. Most tellingly, their success remains closely tied to the availability of similar problems in their training data—they excel at generating variations on familiar solution patterns while struggling with genuinely novel problem structures.
The medium-complexity regime represents the sweet spot where reasoning simulation achieves maximum effectiveness. Problems in this range remain within the interpolative capabilities of sophisticated pattern matching while being complex enough to benefit from the structured exploration that reasoning traces provide. The models can draw upon rich repositories of similar problems from their training data while adapting these patterns to accommodate modest variations in problem structure. This creates the appearance of flexible problem-solving while operating within fundamentally constrained pattern recognition systems.
Critically, even successful performance in this regime depends heavily on linguistic cues that orient the models toward appropriate solution strategies. When problems are reformulated to remove or alter these cues while maintaining logical equivalence, performance often degrades significantly. This sensitivity to surface features rather than deep structure reveals that apparent reasoning success reflects sophisticated template matching rather than genuine logical processing.
The third regime encompasses high-complexity problems—challenges requiring more than fifty sequential operations, extensive state-space exploration, or deep logical inference chains. Here, both reasoning and non-reasoning models experience catastrophic failure, typically achieving near-zero accuracy regardless of computational resources allocated. Tower of Hanoi puzzles with more than eight disks, complex planning problems requiring extensive temporary moves, and logical reasoning chains exceeding certain depth thresholds all trigger this collapse pattern.
The failure mode in the high-complexity regime proves particularly revealing. Rather than degrading gracefully as problems exceed their capabilities, reasoning models exhibit abrupt discontinuities in performance. Problems requiring forty-nine sequential steps might be solved reliably, while problems requiring fifty-one steps result in complete failure. This cliff-like degradation pattern suggests fundamental architectural constraints rather than gradual capacity limitations that might be overcome through increased computational resources or training.
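To make the scale of these problems concrete, the minimal Tower of Hanoi solution for n disks requires 2^n − 1 moves, so each additional disk roughly doubles the required sequential depth. A few lines of plain arithmetic (an illustration of the growth rate, not data from the study) show how quickly problems cross from the tractable band into the collapse regime:

```python
# Minimum number of moves to solve an n-disk Tower of Hanoi: 2**n - 1.
# Illustrates how sequential depth grows exponentially with puzzle size.
for n in range(3, 13):
    print(f"{n:2d} disks -> {2**n - 1:5d} moves")
```

Six disks already demand 63 sequential moves; ten disks demand 1,023. A system whose competence ends at a few dozen coherent steps has no gradual path into this territory.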
Most remarkably, the collapse in the high-complexity regime correlates with a counterintuitive reduction in reasoning effort. As problems become more complex and presumably require more extensive deliberation, reasoning models actually generate shorter reasoning traces. They appear to “give up” prematurely, producing abbreviated attempts that abandon systematic exploration in favor of cursory pattern matching. This behavior directly contradicts the expectation that genuine reasoning systems would allocate more computational resources to more challenging problems.
The premature abandonment phenomenon suggests that reasoning models possess implicit recognition of their own limitations, even if this recognition operates below the level of explicit awareness. When confronted with problems that exceed their pattern matching capabilities, they default to generating minimal responses rather than sustaining the elaborate reasoning simulations that characterize their performance in the medium-complexity regime. This adaptive response prevents the generation of extremely long but ultimately futile reasoning traces, but it also reveals the absence of genuine problem-solving persistence that characterizes human reasoning under difficulty.
The three-regime pattern appears consistently across different model architectures, training approaches, and problem domains. OpenAI’s o-series, Claude’s thinking models, and DeepSeek’s reasoning systems all exhibit similar performance profiles, suggesting that the limitations reflect fundamental constraints of the current paradigm rather than implementation details of particular models. The pattern persists even when models operate well below their context length limits and with abundant computational budgets, indicating that the constraints operate at a more fundamental level than resource availability.
This consistency across implementations provides strong evidence that the three-regime structure reflects intrinsic properties of reasoning simulation rather than contingent features of specific architectures. The universality of the pattern suggests that current approaches to reasoning enhancement may have reached fundamental limits that cannot be overcome through incremental improvements in training procedures, architectural modifications, or computational scaling.
The temporal dynamics within each regime also prove instructive. In the low-complexity regime, reasoning models often identify correct solutions early but subsequently introduce errors through continued elaboration. In the medium-complexity regime, correct solutions typically emerge later in the reasoning process, after initial exploration and apparent error correction. In the high-complexity regime, models rarely generate coherent solution attempts at any point in their reasoning traces. These temporal patterns reveal how reasoning simulation interacts with problem complexity to produce characteristic failure modes.
Understanding the three-regime structure illuminates why reasoning models can produce impressive demonstrations on carefully selected problems while failing systematically on seemingly similar challenges. The impressive demonstrations typically draw from the medium-complexity regime where reasoning simulation operates most effectively, while systematic failures occur in the low-complexity regime where simulation interferes with direct pattern matching or in the high-complexity regime where simulation exceeds its effective scope.
This empirical pattern provides crucial evidence against interpretations of reasoning models as nascent artificial general intelligences. Genuine reasoning systems should demonstrate monotonic improvement with increased computational resources, graceful degradation under mounting complexity, and consistent application of learned strategies across problem scales. Instead, reasoning models exhibit efficiency losses in simple cases, catastrophic failures in complex cases, and a narrow band of apparent competence that likely reflects sophisticated pattern matching rather than genuine logical processing.
The three-regime framework also suggests specific strategies for evaluating reasoning claims. Rather than focusing on cherry-picked examples of impressive performance, evaluation should systematically probe all three regimes to assess whether apparent reasoning capabilities reflect genuine logical processing or sophisticated simulation. This approach would reveal both the capabilities and limitations of current systems while providing more accurate expectations for their practical deployment.
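A minimal sketch of what such a three-regime probe might look like follows, assuming a hypothetical query_model wrapper that returns a proposed move sequence and a puzzle simulator that can verify it; the names and interfaces here are illustrative, not the actual harness used in the study:

```python
from typing import Callable, List, Tuple

Move = Tuple[str, str]  # (source_peg, target_peg), e.g. ("A", "C")

def sweep_complexity(
    query_model: Callable[[int], List[Move]],              # hypothetical model wrapper
    is_valid_solution: Callable[[int, List[Move]], bool],  # puzzle simulator / checker
    disk_range=range(3, 13),
    trials: int = 10,
) -> dict:
    """Measure accuracy across a sweep of problem sizes, not cherry-picked examples."""
    accuracy = {}
    for n in disk_range:
        correct = sum(is_valid_solution(n, query_model(n)) for _ in range(trials))
        accuracy[n] = correct / trials
    return accuracy
```

Plotting accuracy against problem size from such a sweep is what exposes the three-regime signature: flat or degraded performance on the easiest instances, a competent middle band, and an abrupt collapse beyond it.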
The Algorithmic Void and Cognitive Pathologies
The most damning evidence against the reasoning capabilities of current artificial intelligence systems emerges not from their failures on novel problems, but from their inability to execute explicit algorithmic procedures reliably. When provided with step-by-step algorithms that completely specify solution procedures, reasoning models frequently fail to implement these instructions faithfully, revealing a fundamental disconnect between linguistic comprehension and computational execution that strikes at the heart of claims about machine reasoning.
Consider the Tower of Hanoi puzzle, a classic test of sequential reasoning that admits a straightforward recursive solution. When reasoning models are provided with complete algorithmic specifications, including pseudocode that explicitly details each step of the recursive procedure, their performance often fails to improve significantly compared to attempts without algorithmic guidance. In some cases, performance actually degrades as models become confused by the gap between their pattern-matching operations and the systematic logical processing that algorithm execution requires.
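For reference, the explicit procedure at issue is short. A standard recursive Tower of Hanoi solver (a generic textbook formulation, not the exact pseudocode provided in the experiments) fits in a few lines:

```python
def solve_hanoi(n: int, source: str = "A", target: str = "C", spare: str = "B") -> list:
    """Return the complete move sequence for n disks using the standard recursion."""
    if n == 0:
        return []
    moves = solve_hanoi(n - 1, source, spare, target)   # move the top n-1 disks out of the way
    moves.append((source, target))                      # move the largest disk to the target
    moves += solve_hanoi(n - 1, spare, target, source)  # move the n-1 disks back on top of it
    return moves

print(len(solve_hanoi(8)))  # 255 moves for eight disks
```

The procedure leaves nothing to discover: every move is determined by the recursion. Executing it faithfully requires only bookkeeping, which is precisely what the models fail to sustain.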
This algorithmic execution failure reveals something profound about the nature of these systems. The ability to follow explicit procedures represents perhaps the most basic requirement of genuine reasoning systems. Human reasoners, even when struggling with complex problems, can typically implement clear algorithmic instructions with high fidelity. The systematic failure of reasoning models to execute provided algorithms suggests that their apparent reasoning capabilities operate at a fundamentally different level than the logical processing that algorithms require.
The failure manifests in characteristic ways across different algorithmic contexts. Models demonstrate impressive facility with describing algorithmic steps in natural language, often reproducing provided pseudocode with remarkable accuracy and elaborating on the logical structure of recursive procedures with apparent sophistication. However, when tasked with implementing these same procedures to solve specific instances, they frequently lose track of state information, skip essential steps, or apply operations in incorrect sequences. The disconnect between algorithmic description and algorithmic execution exposes the linguistic nature of their operations.
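This is also why controlled puzzle environments can detect such failures mechanically: every proposed move can be replayed against simulated state. A minimal validator sketch (an assumed interface, with moves given as (source_peg, target_peg) pairs) looks like this:

```python
def is_valid_hanoi_sequence(n_disks: int, moves: list) -> bool:
    """Replay a proposed move sequence, checking legality and final completion."""
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}  # disks bottom-to-top
    for src, dst in moves:
        if not pegs[src]:                       # no disk available to move
            return False
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:  # cannot place a larger disk on a smaller one
            return False
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n_disks, 0, -1))  # all disks stacked on the target peg
```

Replaying the output of the recursive solver above through this check returns True; a trace that loses track of state fails at the first illegal move, which is exactly the signature the evaluations record.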
This pattern extends beyond mathematical algorithms to encompass any domain requiring systematic procedural execution. Logical inference rules, scientific methodologies, legal reasoning procedures, and diagnostic protocols all exhibit similar implementation failures despite accurate linguistic reproduction. The models can discourse about procedures with remarkable fluency while struggling to execute them reliably, suggesting that their capabilities remain fundamentally bound to the level of linguistic representation rather than extending to the operational logic that language describes.
The algorithmic execution deficit provides crucial insight into the nature of reasoning simulation. These systems excel at generating text that exhibits the surface features of procedural reasoning (references to steps, acknowledgment of dependencies, recognition of recursive structure) while lacking the computational architecture necessary for faithful procedure implementation. They have learned to speak the language of algorithmic reasoning without developing the capacity for algorithmic execution.
Perhaps even more revealing than execution failures are the cognitive pathologies that emerge from detailed analysis of reasoning traces. These pathologies manifest as systematic patterns of dysfunction that mirror psychiatric and neurological conditions more than they resemble the occasional errors of competent reasoners. The pathologies appear across different model architectures and problem domains, suggesting fundamental limitations in how reasoning simulation operates under stress.
The most prevalent pathology manifests as perseveration—the pathological repetition of responses despite changed circumstances. In reasoning traces, this appears as models repeatedly attempting identical solution strategies after these approaches have clearly failed, cycling through the same ineffective procedures without adapting to negative feedback. Unlike human reasoners who typically modify strategies after repeated failures, reasoning models often persist with dysfunctional approaches in ways that suggest impaired cognitive flexibility.
Confabulation represents another systematic pathology where models generate detailed explanations for their reasoning processes that bear little relationship to their actual operations. When asked to explain their solution strategies, reasoning models often produce elaborate justifications that sound plausible but do not accurately describe the pattern-matching processes that generated their responses. This confabulatory tendency reveals the gap between the models’ linguistic sophistication and their self-awareness of their own operations.
Attention fragmentation manifests as the inability to maintain focus on relevant problem features while filtering out irrelevant information. In complex reasoning traces, models frequently become distracted by peripheral details, pursuing elaborate tangents that contribute nothing to problem solution while neglecting essential logical relationships. This attentional dysfunction contrasts sharply with competent human reasoning, which typically maintains focus on goal-relevant information even under cognitive load.
Working memory failures appear as the inability to maintain and manipulate multiple pieces of information simultaneously during reasoning processes. Models frequently lose track of intermediate results, contradict previously established facts, or fail to integrate information across different parts of their reasoning traces. These failures occur even when models operate well below their context length limits, suggesting fundamental limitations in how they maintain and manipulate symbolic information during extended reasoning.
Perhaps most concerning is the emergence of reasoning rigidity—the inability to adapt problem-solving approaches when initial strategies prove ineffective. While human reasoners typically demonstrate cognitive flexibility by modifying their approaches based on feedback, reasoning models often persist with inappropriate strategies or cycle between a limited set of alternatives without genuine adaptation. This rigidity suggests that reasoning simulation operates through template matching rather than flexible problem-solving.
The temporal dynamics of these pathologies prove particularly illuminating. In early stages of reasoning traces, models often exhibit relatively coherent behavior that gradually degrades as processing continues. Initial problem analysis may appear systematic and insightful, but subsequent elaboration frequently introduces contradictions, irrelevancies, and circular reasoning patterns. This degradation trajectory suggests that reasoning simulation becomes increasingly unstable as it extends beyond the immediate scope of pattern recognition.
The pathologies also interact in ways that compound their individual effects. Perseveration combines with confabulation to produce elaborate justifications for ineffective strategies. Attention fragmentation amplifies working memory failures by reducing the coherence of information maintenance. Reasoning rigidity prevents recovery from the accumulated effects of other pathologies. These interactions create cascading failures that transform initially promising reasoning attempts into incoherent productions.
Comparison with human cognitive pathologies reveals important distinctions. While humans with neurological or psychiatric conditions may exhibit similar individual symptoms, the combination and temporal dynamics differ significantly. Human pathologies typically result from damage to specific cognitive systems, creating characteristic deficit patterns. In contrast, reasoning model pathologies appear to reflect the fundamental limitations of reasoning simulation rather than damage to underlying reasoning capabilities.
The pathologies also vary in systematic ways across the three performance regimes identified earlier. In the low-complexity regime, perseveration and attention fragmentation predominate as models unnecessarily complicate straightforward problems. In the medium-complexity regime, working memory failures and confabulation become more prominent as models struggle to maintain coherence across extended reasoning chains. In the high-complexity regime, reasoning rigidity dominates as models fail to adapt to problems that exceed their pattern-matching capabilities.
These pathological patterns provide additional evidence against interpretations of reasoning models as nascent artificial general intelligences. Genuine reasoning systems should demonstrate robust performance across complexity levels, faithful implementation of explicit procedures, and coherent self-monitoring capabilities. Instead, reasoning models exhibit systematic pathologies that suggest fundamental architectural limitations rather than temporary shortcomings that might be overcome through scaling or refinement.
The implications extend beyond technical assessment to broader questions about the nature of intelligence and reasoning. The pathological patterns suggest that reasoning requires more than sophisticated pattern matching over linguistic representations. It demands systematic logical processing capabilities, robust working memory systems, flexible strategy adaptation, and accurate self-monitoring—capabilities that appear absent from current reasoning models despite their impressive linguistic sophistication.
Understanding reasoning models through the lens of cognitive pathology provides more accurate expectations for their capabilities and limitations while highlighting the fundamental challenges facing genuine progress toward artificial reasoning. The pathologies reveal not merely technical limitations but conceptual gaps between reasoning simulation and reasoning execution that may require fundamentally different approaches to address.
The recognition of these pathologies also suggests important directions for future research. Rather than pursuing incremental improvements to reasoning simulation, the field might benefit from developing architectures explicitly designed for systematic logical processing, working memory maintenance, and flexible strategy adaptation. Such architectures would need to move beyond pattern matching over linguistic representations toward genuine computational reasoning capabilities.
The cognitive pathologies exhibited by reasoning models ultimately reveal the profound challenge of creating artificial reasoning systems. While these models represent remarkable achievements in natural language processing and pattern recognition, their systematic pathologies demonstrate that the simulation of reasoning differs fundamentally from reasoning itself. Understanding this distinction proves crucial for developing more accurate assessments of current capabilities while charting realistic paths toward artificial systems capable of genuine logical processing.
Primary Source
Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., & Farajtabar, M. (2025). The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. Apple.
Supporting Technical Sources
OpenAI Models:
Jaech, A., Kalai, A., Lerer, A., Richardson, A., et al. (2024). OpenAI o1 System Card. arXiv preprint arXiv:2412.16720.
OpenAI. (2024). Introducing OpenAI o1. OpenAI Blog.
DeepSeek Models:
Guo, D., Yang, D., Zhang, H., Song, J., et al. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
Claude Models:
Anthropic. (2025). Claude 3.7 Sonnet. Anthropic Documentation.
Foundational Research on LLM Limitations
Pattern Matching vs. Reasoning:
Mirzadeh, S.I., Alizadeh, K., Shahrokhi, H., Tuzel, O., Bengio, S., & Farajtabar, M. (2025). GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models. In The Thirteenth International Conference on Learning Representations.
Reasoning Evaluation:
Dziri, N., Lu, X., Sclar, M., Li, X.L., et al. (2023). Faith and fate: Limits of transformers on compositionality. In Advances in Neural Information Processing Systems 36.
Cognitive Architecture:
McCoy, R.T., Yao, S., Friedman, D., Hardy, M., & Griffiths, T.L. (2023). Embers of autoregression: Understanding large language models through the problem they are trained to solve.
Methodology and Evaluation
Controlled Environments:
Valmeekam, K., Olmo Hernandez, A., Sreedharan, S., & Kambhampati, S. (2022). Large language models still can’t plan (A benchmark for LLMs on planning and reasoning about change). CoRR, abs/2206.10498.
Puzzle-Based Evaluation:
Estermann, B., Lanzendörfer, L.A., Niedermayr, Y., & Wattenhofer, R. (2024). Puzzles: A benchmark for neural algorithmic reasoning.
Theoretical Framework
Chain-of-Thought Research:
Wei, J., Wang, X., Schuurmans, D., Bosma, M., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35.
Reasoning vs. Pattern Matching:
Nezhurina, M., Cipolina-Kun, L., Cherti, M., & Jitsev, J. (2024). Alice in wonderland: Simple tasks showing complete reasoning breakdown in state-of-the-art large language models. arXiv preprint arXiv:2406.02061.
Additional Context
Overthinking Phenomenon:
Chen, X., Xu, J., Liang, T., He, Z., et al. (2024). Do not think that much for 2+3=? On the overthinking of o1-like LLMs. arXiv preprint arXiv:2412.21187.
Scaling Limitations:
Ballon, M., Algaba, A., & Ginis, V. (2025). The relationship between reasoning and performance in large language models–o3 (mini) thinks harder, not longer. arXiv preprint arXiv:2502.15631.
Classical Planning Context:
Amarel, S. (1981). On representations of problems of reasoning about actions. In Readings in artificial intelligence (pp. 2–22). Elsevier.