On the Reasoning Gaps of Large Language Models
NeurIPS 2026
Status: IDLE
Phase Progress
✓ Literature Review
✓ Framework
✓ Benchmark Design
✓ Analysis Infra
Evaluation
Paper Writing
Submission
Models: 10/12
Combinations: 270/324
Instances: 134,572 evaluated
Completion: 83%
Evaluation Pipeline
Models Complete: 10/12
Instances Run: 134,572
Overall Progress: 83%
Running Now: 0
Decisions
2026-03-11: Run Sonnet 4 and o3 evaluations; defer Opus 4.6
2026-03-11: Switch to autonomous decision workflow
2026-03-11: Deploy VPS infrastructure for 24/7 operation