EVAL MONITOR

Agent-failure-taxonomy benchmark: evaluation across models and prompting strategies.


Accuracy Heatmap — Direct Prompting

Columns: (1) haiku-4.5, small; (2) gpt-4o-mini, small; (3) gpt-4o, large; (4) o3, large; (5) llama-3.1-8b, small; (6) llama-3.1-70b, large; (7) ministral-8b, small; (8) mistral-small-24b, large; (9) qwen-2.5-7b, small; (10) qwen-2.5-72b, large.

Task                                       (1)  (2)  (3)  (4)  (5)  (6)  (7)  (8)  (9) (10)
B1  Modus Tollens (Propositional Logic)     59   61   77    0   52   61   73   75   61   81
B2  Syllogistic (Categorical Reasoning)     53   76   83   11   65   76   70   65   68   79
B3  Temporal Ordering (Temporal Logic)      16   21   24    0   21   22   19   20   15   24
B4  Spatial Rotation (Spatial Reasoning)    35   46   39    0   27   45   26   46   39   46
B5  Counting (Arithmetic)                   75   77   79   11   65   78   75   74   72   82
B6  Set Membership (Set Theory)             24   28   42    0   24   18   18   16   28   28
B7  Recursive Depth (Recursion)             10   66   66    5   47   32   53   68   62   67
B8  Pattern Matching (Induction)            99   95  100   44   94   99   98   94   92   99
B9  Compositional (Compositionality)        53   54   56    0   54   59   57   50   58   65

Color legend (accuracy buckets): <40% | 40-69% | 70-89% | 90%+ | Pending
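The heatmap legend above bins each accuracy cell into one of four color buckets plus a Pending state. A minimal sketch of that binning, assuming the bucket boundaries shown in the legend and a hypothetical `heatmap_bucket` helper name:

```python
def heatmap_bucket(accuracy):
    """Map an accuracy percentage (or None for a missing run) to its legend bucket."""
    if accuracy is None:
        return "Pending"      # run not yet scored
    if accuracy < 40:
        return "<40%"
    if accuracy < 70:
        return "40-69%"
    if accuracy < 90:
        return "70-89%"
    return "90%+"
```

For example, haiku-4.5's 59% on B1 falls in the 40-69% bucket, while gpt-4o's 100% on B8 lands in 90%+.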

Chain-of-Thought Lift Analysis

Each panel shows Direct accuracy, CoT accuracy, and Lift = CoT - Direct (percentage points).

haiku-4.5 (Anthropic / small)

        Direct   CoT   Lift
  B1      59      81    +22
  B2      53      93    +40
  B3      16      93    +77
  B4      35      89    +54
  B5      75     100    +25
  B6      24      50    +26
  B7      10      70    +60
  B8      99     100     +1
  B9      53      95    +42

gpt-4o-mini (OpenAI / small)

        Direct   CoT   Lift
  B1      61      78    +17
  B2      76      87    +11
  B3      21      81    +60
  B4      46      64    +18
  B5      77      79     +2
  B6      28      31     +3
  B7      66      60     -6
  B8      95      94     -1
  B9      54      55     +1

gpt-4o (OpenAI / large)

        Direct   CoT   Lift
  B1      77      81     +4
  B2      83      90     +7
  B3      24      96    +72
  B4      39      62    +23
  B5      79      90    +11
  B6      42      40     -2
  B7      66      67     +1
  B8     100      99     -1
  B9      56      58     +2

o3 (OpenAI / large)

        Direct   CoT   Lift
  B1       0      39    +39
  B2      11      78    +67
  B3       0      73    +73
  B4       0      76    +76
  B5      11      81    +70
  B6       0      51    +51
  B7       5      16    +11
  B8      44     100    +56
  B9       0      91    +91

llama-3.1-8b (OpenRouter / small)

        Direct   CoT   Lift
  B1      52      56     +4
  B2      65      69     +4
  B3      21      33    +12
  B4      27      61    +34
  B5      65      46    -19
  B6      24      19     -5
  B7      47      26    -21
  B8      94      98     +4
  B9      54      51     -3

llama-3.1-70b (OpenRouter / large)

        Direct   CoT   Lift
  B1      61      76    +15
  B2      76      88    +12
  B3      22      83    +61
  B4      45      81    +36
  B5      78      72     -6
  B6      18      33    +15
  B7      32      36     +4
  B8      99     100     +1
  B9      59      84    +25

ministral-8b (OpenRouter / small)

        Direct   CoT   Lift
  B1      73      73     +0
  B2      70      82    +12
  B3      19      83    +64
  B4      26      89    +63
  B5      75      86    +11
  B6      18      31    +13
  B7      53      39    -14
  B8      98      99     +1
  B9      57      50     -7

mistral-small-24b (OpenRouter / large)

        Direct   CoT   Lift
  B1      75      76     +1
  B2      65      84    +19
  B3      20      83    +63
  B4      46      80    +34
  B5      74      76     +2
  B6      16      34    +18
  B7      68      45    -23
  B8      94     100     +6
  B9      50      70    +20

qwen-2.5-7b (OpenRouter / small)

        Direct   CoT   Lift
  B1      61      75    +14
  B2      68      81    +13
  B3      15      45    +30
  B4      39      60    +21
  B5      72      67     -5
  B6      28      29     +1
  B7      62      54     -8
  B8      92      97     +5
  B9      58      59     +1

qwen-2.5-72b (OpenRouter / large)

        Direct   CoT   Lift
  B1      81      80     -1
  B2      79      90    +11
  B3      24      84    +60
  B4      46      84    +38
  B5      82      88     +6
  B6      28      46    +18
  B7      67      62     -5
  B8      99      99     +0
  B9      65      82    +17
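The Lift column in each panel is simply CoT accuracy minus Direct accuracy, per task. A minimal sketch of that computation, with a hypothetical `cot_lift` helper name; the sample figures are haiku-4.5's B1 and B8 scores from the first panel:

```python
def cot_lift(direct, cot):
    """Per-task lift: CoT accuracy minus Direct accuracy (percentage points)."""
    return {task: cot[task] - direct[task] for task in direct}

# haiku-4.5, tasks B1 and B8 (Direct / CoT accuracies from the panel above)
direct = {"B1": 59, "B8": 99}
cot = {"B1": 81, "B8": 100}
# cot_lift(direct, cot) returns {"B1": 22, "B8": 1}
```

A negative value (e.g. gpt-4o-mini's B7) simply means the model scored lower with chain-of-thought than with direct prompting.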

Scale Analysis — Small vs Large (Within Family)

Family    Model                        B1   B2   B3   B4   B5   B6   B7   B8   B9
GPT       gpt-4o-mini (small)          61   76   21   46   77   28   66   95   54
          gpt-4o (large)               77   83   24   39   79   42   66  100   56
Llama     llama-3.1-8b (small)         52   65   21   27   65   24   47   94   54
          llama-3.1-70b (large)        61   76   22   45   78   18   32   99   59
Mistral   ministral-8b (small)         73   70   19   26   75   18   53   98   57
          mistral-small-24b (large)    75   65   20   46   74   16   68   94   50
Qwen      qwen-2.5-7b (small)          61   68   15   39   72   28   62   92   58
          qwen-2.5-72b (large)         81   79   24   46   82   28   67   99   65

Values shown are Direct-prompting accuracy (%). Scale effect = large-model accuracy - small-model accuracy.
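The scale effect defined above is a per-task difference of Direct accuracies within a family. A minimal sketch, with a hypothetical `scale_effect` helper name; the sample figures are the Llama family's B7 scores from the table (47 for llama-3.1-8b, 32 for llama-3.1-70b):

```python
def scale_effect(small, large):
    """Per-task scale effect: large-model minus small-model Direct accuracy."""
    return {task: large[task] - small[task] for task in small}

# Llama family, task B7: scale_effect({"B7": 47}, {"B7": 32}) returns {"B7": -15},
# i.e. the larger Llama scores 15 points lower on the recursion task under
# Direct prompting.
```

Negative entries like this one flag tasks where scaling up within a family did not help.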