NeurIPS 2026
On the Reasoning Gaps of Large Language Models
Phase 5 — Evaluation
2 projects · $0 / $1000 this month
Projects: 1 (0 active)
Active Sessions: 0 agent processes
Papers in Pipeline: 1 (targeting venues)
Monthly Spend: $0 of $1000 budget
Projects
NeurIPS 2026
Phase 5 — Evaluation
ACL 2027
Phase 1 — Literature Review
No active evaluations
Recent Decisions
Run Sonnet 4 and o3 evaluations; defer Opus 4.6
Sonnet 4 and o3 together ($95 combined) provide more marginal value than Opus 4.6 alone ($272). Two diverse additions are better than one expensive one.
Switch to autonomous decision workflow
The human-in-the-loop bottleneck was slowing evaluation progress. All decisions are logged in status.yaml.
Deploy VPS infrastructure for 24/7 operation
A remote-first setup lets the daemon, API, and PostgreSQL run independently of the laptop.
Run Haiku 4.5 and GPT-4o-mini in parallel
Both APIs have sufficient rate limits, so running them in parallel maximizes throughput across providers.
Use 100 instances per task-model-strategy combination
A sample of 100 instances per combination balances statistical significance against budget constraints.
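As a rough illustration of what 100 instances per combination buys statistically, the sketch below computes the half-width of a 95% confidence interval for an accuracy estimate. It assumes the metric is a per-instance pass/fail accuracy and uses the normal (Wald) approximation; neither assumption comes from the dashboard, and the function name `margin_of_error` is hypothetical.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation (Wald) interval for a proportion.

    p: observed accuracy in [0, 1] (assumed binary pass/fail per instance)
    n: number of instances evaluated
    z: critical value; 1.96 corresponds to a 95% interval
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    return z * math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    # Illustrative accuracies only; these are not results from the project.
    for p in (0.5, 0.7, 0.9):
        print(f"accuracy={p:.1f}, n=100 -> +/- {margin_of_error(p, 100):.3f}")
```

Under these assumptions the worst case (accuracy 0.5) gives a half-width of about 0.098, so differences between combinations smaller than roughly ten percentage points would be hard to separate at n = 100. Halving that width would require about four times as many instances.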
System Health