Dashboard — Deepwork Research

Projects: 1 (0 active)

Active Sessions: 0 agent processes

Papers in Pipeline: 1 targeting venues

Monthly Spend: $0 of $1000 budget

Projects

NeurIPS 2026

On the Reasoning Gaps of Large Language Models

IDLE

Phase 5 — Evaluation

Eval progress 83%

10/12 models, 270/324 combos

Last activity: 3d ago (reasoning-gaps)
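The progress figures on this card can be cross-checked with a few lines of arithmetic. This is a minimal sketch using only the numbers shown above; the even split of combos across models is an assumption, not something the dashboard states.

```python
# Figures copied from the project card above.
models_done, models_total = 10, 12
combos_done, combos_total = 270, 324

# Percentage of combos completed, rounded as the dashboard displays it.
progress_pct = round(100 * combos_done / combos_total)

# Work still outstanding.
combos_remaining = combos_total - combos_done

# Assumption: combos are spread evenly across models.
combos_per_model = combos_total // models_total

print(f"{progress_pct}% done, {combos_remaining} combos left, "
      f"{combos_per_model} combos per model")
```

Under the even-split assumption, the remaining combos correspond exactly to the two models not yet evaluated.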

ACL 2027

Taxonomy of LLM Agent Failures

PAUSED

Phase 1 — Literature Review

No active evaluations

Last activity: 5d ago (agent-failure-taxonomy)

Recent Decisions

2026-03-11

reasoning-gaps

Run Sonnet 4 and o3 evaluations; defer Opus 4.6

Sonnet 4 + o3 ($95 combined) provide more marginal value than Opus alone ($272). Two diverse additions are better than one expensive one.
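The cost side of this decision can be laid out as simple arithmetic. The sketch below uses only the dollar figures quoted in the rationale and the $1000 monthly budget shown on the dashboard; the variable names are illustrative, and the per-model figure is an average, not a quoted price for either model.

```python
MONTHLY_BUDGET = 1000   # dollars, from the Monthly Spend card
SONNET_O3_COST = 95     # dollars, Sonnet 4 + o3 combined
OPUS_COST = 272         # dollars, Opus alone

# Dollars saved by choosing the two-model option over Opus.
savings = OPUS_COST - SONNET_O3_COST

# Average cost per added model in the two-model option.
avg_cost_per_model = SONNET_O3_COST / 2

# Share of the monthly budget each option would consume, in percent.
pair_budget_pct = 100 * SONNET_O3_COST / MONTHLY_BUDGET
opus_budget_pct = 100 * OPUS_COST / MONTHLY_BUDGET

print(f"savings=${savings}, avg per model=${avg_cost_per_model}, "
      f"budget share: {pair_budget_pct}% vs {opus_budget_pct}%")
```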

2026-03-11

reasoning-gaps

Switch to autonomous decision workflow

A human-in-the-loop bottleneck was slowing evaluation progress. All decisions are logged in status.yaml.

2026-03-11

reasoning-gaps

Deploy VPS infrastructure for 24/7 operation

A remote-first setup lets the daemon, API, and PostgreSQL run independently of the laptop.

2026-03-10

reasoning-gaps

Run Haiku 4.5 and GPT-4o-mini in parallel

Both APIs have sufficient rate limits. Maximize throughput across providers.
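One way to run two provider evaluations side by side is a small thread pool. This is a sketch only: `run_evaluation` is a hypothetical stand-in, not the project's actual evaluation entry point, and a real version would call each provider's API and respect its rate limits.

```python
from concurrent.futures import ThreadPoolExecutor


def run_evaluation(model: str) -> str:
    """Hypothetical placeholder for a provider-specific evaluation run."""
    # A real implementation would issue API requests for this model here.
    return f"{model}: done"


models = ["Haiku 4.5", "GPT-4o-mini"]

# One worker per model so both evaluations proceed concurrently.
with ThreadPoolExecutor(max_workers=len(models)) as pool:
    # pool.map returns results in the same order as the input list.
    results = list(pool.map(run_evaluation, models))

for line in results:
    print(line)
```

Threads suit this case because the work is dominated by waiting on network responses, not by local computation.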

2026-03-10

reasoning-gaps

Use 100 instances per task-model-strategy combination

Balances statistical significance with budget constraints.
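The trade-off behind the choice of 100 instances can be made concrete. The sketch below combines that figure with the 324 combos shown on the project card, and uses the standard worst-case binomial standard error (at accuracy 0.5) as a rough indication of per-combo precision. The normal-approximation interval is an illustrative assumption, not the project's stated analysis method.

```python
import math

INSTANCES_PER_COMBO = 100   # from the decision above
TOTAL_COMBOS = 324          # from the project card

# Total number of evaluated instances across the whole grid.
total_instances = INSTANCES_PER_COMBO * TOTAL_COMBOS

# Worst-case standard error of an accuracy estimate (true accuracy = 0.5).
worst_case_se = math.sqrt(0.5 * 0.5 / INSTANCES_PER_COMBO)

# Approximate 95% confidence half-width under a normal approximation.
half_width_95 = 1.96 * worst_case_se

print(f"{total_instances} instances, SE <= {worst_case_se:.3f}, "
      f"95% half-width <= {half_width_95:.3f}")
```

Halving the half-width would require roughly four times as many instances per combo, which is where the budget constraint comes in.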

System Health

Orchestrator: operational

Database: operational

Memory: 27% used
