Research Explainer

On the Reasoning Gaps of Large Language Models

A formal characterization of why LLMs fail at specific reasoning tasks, which failures chain-of-thought can fix, and which are fundamentally beyond reach.

NeurIPS 2026 · Oddur Sigurdsson

The Problem

LLMs fail at certain types of reasoning — they can't reliably reverse strings, track state through long chains, or solve SAT problems. Everyone knows this, but nobody has a unified explanation for why they fail at these specific things and not others.

The Insight

Transformers, architecturally, are constant-depth circuits. No matter how big you make the model, a single forward pass computes everything in parallel through a fixed number of layers. This puts a hard ceiling on what it can compute in one shot — under standard formalizations, a forward pass lies in the complexity class TC⁰ (constant-depth threshold circuits).

Some problems require sequential depth that a parallel architecture fundamentally can't provide. Sorting a long list requires stepping through it. Evaluating deeply nested boolean expressions requires resolving inner brackets first. These aren't just "hard problems" — they're problems whose structure mismatches the architecture.
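The depth mismatch can be made concrete with a toy evaluator (illustrative code, not from the paper): resolving a nested boolean formula inner-brackets-first takes as many sequential rounds as the tree is deep, which a fixed-depth parallel pass cannot supply for arbitrarily deep inputs.

```python
def eval_formula(node):
    """Recursively evaluate ('AND'|'OR', left, right) trees with bool leaves."""
    if isinstance(node, bool):
        return node
    op, left, right = node
    l, r = eval_formula(left), eval_formula(right)
    return (l and r) if op == "AND" else (l or r)

def depth(node):
    """Tree depth = number of inner-first resolution rounds needed."""
    if isinstance(node, bool):
        return 0
    _, left, right = node
    return 1 + max(depth(left), depth(right))

f = ("AND", ("OR", True, False), ("AND", True, ("OR", False, True)))
print(eval_formula(f))  # True
print(depth(f))         # 3
```

A circuit with a fixed number of layers can only handle formulas up to a fixed depth; the recursion above grows with the input.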

The Taxonomy

The paper maps six types of reasoning failures to specific complexity-theoretic boundaries:

Type  Name                 Boundary                     CoT Helps?
1     Sensitivity Gap      AC⁰ boundary                 Yes, short CoT
2     Depth Gap            TC⁰/NC¹ boundary             Yes, O(log n) steps
3     Serial Composition   Needs O(n) steps             Yes, O(n) steps
4     Algorithmic Gap      Complex within P             In theory; tools better
5     Intractability Gap   NP-hard                      No
6     Architectural Gap    Autoregressive constraint    No

The Prediction

CoT should dramatically help on Types 1–3 (where the model just needs more sequential steps) and barely help on Types 5–6 (where the problem is fundamentally hard or the architecture itself is the constraint). This is a testable, falsifiable claim.

The Experiment

9 diagnostic benchmarks spanning the six gap types, each with procedurally generated instances at controlled difficulty levels. Tested across 10 model families (including GPT-4o, Claude Haiku, o3, Llama, Qwen, and Mistral) under three conditions: direct answer, chain-of-thought, and budget-limited CoT. ~135,000 total evaluations.
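A hedged sketch of what procedural generation at controlled difficulty might look like, using parity as a stand-in task (the function name and prompt format are hypothetical, not the paper's):

```python
import random

def gen_parity_instance(n, seed=None):
    """Hypothetical sketch of one procedurally generated instance.
    Fresh random bits avoid training contamination; n tunes difficulty."""
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(n)]
    answer = sum(bits) % 2  # ground truth, computed exactly by the generator
    prompt = f"Is the number of 1s in {''.join(map(str, bits))} odd?"
    return prompt, answer

prompt, answer = gen_parity_instance(16, seed=0)
```

Because the generator knows the exact answer, scoring is mechanical, and the seed makes every instance reproducible.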

The Result

CoT Lift — Types 1–3 (depth & serial gaps): +0.271

CoT Lift — Types 5–6 (intractability & architectural gaps): +0.037

The theory predicts exactly which failures are fixable and which aren't. The data matches.

Why It Matters

Instead of saying "LLMs are bad at reasoning," you can now say precisely which reasoning they're bad at, why (complexity-theoretic mismatch), and whether prompting techniques like CoT can help. It gives practitioners a principled framework for knowing when to use chain-of-thought vs. tool use vs. "this just won't work."
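As a sketch, that practitioner takeaway can be reduced to a lookup paraphrasing the taxonomy table above (the wording of the recommendations is mine, not an artifact of the paper):

```python
# Paraphrase of the paper's taxonomy table, one recommendation per gap type.
RECOMMENDATION = {
    1: "CoT helps: a short chain suffices (sensitivity gap).",
    2: "CoT helps: budget O(log n) steps (depth gap).",
    3: "CoT helps: budget O(n) steps (serial composition).",
    4: "CoT helps in theory; tool use is usually better (algorithmic gap).",
    5: "Don't rely on CoT: NP-hard (intractability gap).",
    6: "Don't rely on CoT: architectural constraint.",
}

def advise(gap_type: int) -> str:
    """Map a diagnosed gap type (1-6) to the framework's recommendation."""
    return RECOMMENDATION.get(gap_type, "Unknown gap type.")

print(advise(5))
```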


Glossary

TC⁰ — Threshold Circuits, constant depth

The set of problems solvable by constant-depth circuits with AND, OR, NOT, and MAJORITY gates. This is what a transformer's forward pass can compute. "Constant depth" means adding more parameters doesn't add more sequential steps.
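The MAJORITY (threshold) gate is the extra primitive TC⁰ has over AC⁰; a minimal sketch:

```python
def majority(bits):
    """A single MAJORITY (threshold) gate: 1 iff more than half the inputs are 1.
    TC⁰ circuits may use these gates; AC⁰ circuits may not."""
    return int(sum(bits) > len(bits) / 2)

print(majority([1, 1, 0, 1]))  # 1
print(majority([1, 0, 0, 0]))  # 0
```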

NC¹ — Nick's Class, logarithmic depth

Problems solvable by logarithmic-depth circuits. Believed to be slightly more powerful than TC⁰ (the separation is unproven) — includes things like evaluating boolean formulas and arithmetic expressions. The required sequential depth grows logarithmically with input size.

AC⁰ — Alternating Circuits, constant depth

Weaker than TC⁰ — constant-depth circuits without MAJORITY gates. Can't even compute parity (is the count of 1s odd or even?). Relevant because some transformer limitations trace to this boundary.
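Parity itself is a one-line sequential loop — the hardness is specific to constant-depth parallel circuits, not to the function (illustrative sketch):

```python
def parity(bits):
    """Running XOR over the input: trivial with n sequential steps, but
    provably impossible for poly-size constant-depth AND/OR/NOT circuits."""
    acc = 0
    for b in bits:
        acc ^= b
    return acc

print(parity([1, 0, 1, 1]))  # 1  (three 1s: odd)
```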

P — Polynomial Time

Everything solvable in polynomial time by a standard sequential computer. CoT with polynomial steps can theoretically reach P.

NP-hard

Problems at least as hard as everything in NP (e.g., 3-SAT); no polynomial-time algorithm is known, and none is believed to exist. Neither more layers nor more CoT steps will reliably solve these.

CoT — Chain-of-Thought

Prompting the model to show its work step by step. Mechanistically, each generated token is an extra compute step, giving the model sequential depth it otherwise lacks.
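A toy illustration of that mechanism (not the paper's code): writing intermediate state into the output turns one long serial problem into many trivial single steps, one per generated token.

```python
def track_state(start, swaps):
    """Follow an object through a chain of swaps, emitting one
    'scratchpad' line per step — each line is one extra compute step."""
    pos, trace = start, []
    for a, b in swaps:
        if pos == a:
            pos = b
        elif pos == b:
            pos = a
        trace.append(f"after swapping {a} and {b}: ball is at {pos}")
    return pos, trace

final, trace = track_state("cup1", [("cup1", "cup2"),
                                    ("cup2", "cup3"),
                                    ("cup1", "cup3")])
print(final)       # cup1
print(len(trace))  # 3
```

Answering directly requires composing all the swaps at once; answering with the trace requires only one easy update at a time.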

Forward Pass

One run through all the transformer's layers, producing a single output token. Fixed depth regardless of model size — a 7B model and a 400B model both do the same number of sequential steps per token.

Complexity Class

A category of problems grouped by the computational resources (time, depth, parallelism) needed to solve them. The hierarchy AC⁰ ⊊ TC⁰ ⊆ NC¹ ⊆ P ⊆ NP defines increasingly powerful levels of computation (only the first containment is known to be strict).

Sensitivity

How many input bits can flip the output when changed individually. High-sensitivity functions (like parity) are hard for shallow circuits, which is why transformers struggle with them.
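This can be measured directly. The sketch below uses average sensitivity (the variant used in classic AC⁰ lower bounds); parity maxes it out while AND barely registers — all names are illustrative:

```python
from itertools import product

def avg_sensitivity(f, n):
    """Average over all n-bit inputs of how many single-bit flips change f."""
    total = 0
    for x in product([0, 1], repeat=n):
        for i in range(n):
            y = list(x)
            y[i] ^= 1
            if f(list(x)) != f(y):
                total += 1
    return total / 2 ** n

par = lambda bits: sum(bits) % 2
and_all = lambda bits: int(all(bits))
print(avg_sensitivity(par, 4))      # 4.0 — every flip matters, everywhere
print(avg_sensitivity(and_all, 4))  # 0.5 — flips rarely change the output
```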

Serial Composition

Chaining operations where each step depends on the previous one — e.g., applying a function k times. Requires k sequential steps, which a constant-depth architecture can't do in one pass.
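A minimal sketch (toy task, not from the benchmark): k applications of a function, where each depends on the previous result.

```python
def iterate(f, x, k):
    """Apply f to x, k times in a row: f(f(...f(x)...)).
    Each application needs the previous output — k inherently serial steps."""
    for _ in range(k):
        x = f(x)
    return x

# Hypothetical toy step: a repeated affine map mod a prime.
step = lambda x: (3 * x + 1) % 97
print(iterate(step, 5, 4))  # 5 -> 16 -> 49 -> 51 -> 57
```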

Autoregressive Generation

How LLMs produce text — one token at a time, left to right. This creates architectural constraints (e.g., can't "look ahead"), which is the source of Type 6 gaps like string reversal.

Phase Transition (3-SAT)

At a specific clause-to-variable ratio (~4.27), random 3-SAT instances shift from almost always satisfiable to almost always unsatisfiable. The benchmark generates instances at this boundary, where problems are empirically hardest.
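A hedged sketch of generating instances at this ratio and checking them by brute force (the encoding and function names are assumptions, not the paper's):

```python
import random
from itertools import product

def random_3sat(n_vars, ratio=4.27, seed=None):
    """Random 3-SAT at the phase-transition clause/variable ratio (~4.27).
    A clause is 3 distinct variables, each with a random polarity."""
    rng = random.Random(seed)
    n_clauses = round(ratio * n_vars)
    return [
        [(v, rng.choice([True, False]))      # (variable index, polarity)
         for v in rng.sample(range(n_vars), 3)]
        for _ in range(n_clauses)
    ]

def satisfiable(clauses, n_vars):
    """Brute force over all 2^n assignments — exponential, which is the
    point: no known shortcut exists for instances at this boundary."""
    for assignment in product([True, False], repeat=n_vars):
        if all(any(assignment[v] == pol for v, pol in cl) for cl in clauses):
            return True
    return False

inst = random_3sat(n_vars=10, seed=1)
print(satisfiable(inst, 10))  # roughly a coin flip at this ratio
```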

ReasonGap Benchmark

The diagnostic suite built for this paper. 9 tasks (B1–B9), procedurally generated to avoid training contamination, each targeting a specific complexity boundary with tunable difficulty.