On the Reasoning Gaps of Large Language Models

NeurIPS 2026 IDLE

Phase Progress

Literature Review
Framework
Benchmark Design
Analysis Infra
Evaluation
Paper Writing
Submission

Models: 10/12

Combinations: 270/324

Instances: 134,572 evaluated

Completion: 83%

Evaluation Pipeline

Models Complete: 10/12

Instances Run: 134,572

Overall Progress: 83%

Running Now: 0

Decisions

2026-03-11: Run Sonnet 4 and o3 evaluations; defer Opus 4.6

2026-03-11: Switch to autonomous decision workflow

2026-03-11: Deploy VPS infrastructure for 24/7 operation