# EVAL MONITOR

Agent-failure-taxonomy benchmark evaluation across models and prompting strategies.

## Accuracy Heatmap — Direct Prompting
| Task | haiku-4.5 (small) | gpt-4o-mini (small) | gpt-4o (large) | o3 (large) | llama-3.1-8b (small) | llama-3.1-70b (large) | ministral-8b (small) | mistral-small-24b (large) | qwen-2.5-7b (small) | qwen-2.5-72b (large) |
|---|---|---|---|---|---|---|---|---|---|---|
| B1 Modus Tollens (Propositional Logic) | 59% | 61% | 77% | 0% | 52% | 61% | 73% | 75% | 61% | 81% |
| B2 Syllogistic (Categorical Reasoning) | 53% | 76% | 83% | 11% | 65% | 76% | 70% | 65% | 68% | 79% |
| B3 Temporal Ordering (Temporal Logic) | 16% | 21% | 24% | 0% | 21% | 22% | 19% | 20% | 15% | 24% |
| B4 Spatial Rotation (Spatial Reasoning) | 35% | 46% | 39% | 0% | 27% | 45% | 26% | 46% | 39% | 46% |
| B5 Counting (Arithmetic) | 75% | 77% | 79% | 11% | 65% | 78% | 75% | 74% | 72% | 82% |
| B6 Set Membership (Set Theory) | 24% | 28% | 42% | 0% | 24% | 18% | 18% | 16% | 28% | 28% |
| B7 Recursive Depth (Recursion) | 10% | 66% | 66% | 5% | 47% | 32% | 53% | 68% | 62% | 67% |
| B8 Pattern Matching (Induction) | 99% | 95% | 100% | 44% | 94% | 99% | 98% | 94% | 92% | 99% |
| B9 Compositional (Compositionality) | 53% | 54% | 56% | 0% | 54% | 59% | 57% | 50% | 58% | 65% |
Legend: <40% | 40-69% | 70-89% | 90%+ | Pending
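The legend bins above can be reproduced in code. This is a minimal sketch; the thresholds are taken from the legend line, and the function name `heatmap_bucket` is a hypothetical helper, not part of the dashboard.

```python
def heatmap_bucket(accuracy):
    """Map a Direct-prompting accuracy (0-100) to its heatmap legend bucket.

    `None` marks a cell whose evaluation run has not completed yet.
    Thresholds follow the legend: <40%, 40-69%, 70-89%, 90%+.
    """
    if accuracy is None:
        return "Pending"
    if accuracy >= 90:
        return "90%+"
    if accuracy >= 70:
        return "70-89%"
    if accuracy >= 40:
        return "40-69%"
    return "<40%"
```

For example, gpt-4o's 100% on B8 falls in the 90%+ bucket, while o3's 0% on B1 falls in the <40% bucket.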
## Chain-of-Thought Lift Analysis
| Model | Provider / Size | B1 | B2 | B3 | B4 | B5 | B6 | B7 | B8 | B9 |
|---|---|---|---|---|---|---|---|---|---|---|
| haiku-4.5 | Anthropic / small | 59/81 (+22) | 53/93 (+40) | 16/93 (+77) | 35/89 (+54) | 75/100 (+25) | 24/50 (+26) | 10/70 (+60) | 99/100 (+1) | 53/95 (+42) |
| gpt-4o-mini | OpenAI / small | 61/78 (+17) | 76/87 (+11) | 21/81 (+60) | 46/64 (+18) | 77/79 (+2) | 28/31 (+3) | 66/60 (-6) | 95/94 (-1) | 54/55 (+1) |
| gpt-4o | OpenAI / large | 77/81 (+4) | 83/90 (+7) | 24/96 (+72) | 39/62 (+23) | 79/90 (+11) | 42/40 (-2) | 66/67 (+1) | 100/99 (-1) | 56/58 (+2) |
| o3 | OpenAI / large | 0/39 (+39) | 11/78 (+67) | 0/73 (+73) | 0/76 (+76) | 11/81 (+70) | 0/51 (+51) | 5/16 (+11) | 44/100 (+56) | 0/91 (+91) |
| llama-3.1-8b | OpenRouter / small | 52/56 (+4) | 65/69 (+4) | 21/33 (+12) | 27/61 (+34) | 65/46 (-19) | 24/19 (-5) | 47/26 (-21) | 94/98 (+4) | 54/51 (-3) |
| llama-3.1-70b | OpenRouter / large | 61/76 (+15) | 76/88 (+12) | 22/83 (+61) | 45/81 (+36) | 78/72 (-6) | 18/33 (+15) | 32/36 (+4) | 99/100 (+1) | 59/84 (+25) |
| ministral-8b | OpenRouter / small | 73/73 (+0) | 70/82 (+12) | 19/83 (+64) | 26/89 (+63) | 75/86 (+11) | 18/31 (+13) | 53/39 (-14) | 98/99 (+1) | 57/50 (-7) |
| mistral-small-24b | OpenRouter / large | 75/76 (+1) | 65/84 (+19) | 20/83 (+63) | 46/80 (+34) | 74/76 (+2) | 16/34 (+18) | 68/45 (-23) | 94/100 (+6) | 50/70 (+20) |
| qwen-2.5-7b | OpenRouter / small | 61/75 (+14) | 68/81 (+13) | 15/45 (+30) | 39/60 (+21) | 72/67 (-5) | 28/29 (+1) | 62/54 (-8) | 92/97 (+5) | 58/59 (+1) |
| qwen-2.5-72b | OpenRouter / large | 81/80 (-1) | 79/90 (+11) | 24/84 (+60) | 46/84 (+38) | 82/88 (+6) | 28/46 (+18) | 67/62 (-5) | 99/99 (+0) | 65/82 (+17) |

Cells show Direct / CoT accuracy (%); Lift = CoT - Direct.
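The lift values above are simple differences, and it is worth flagging tasks where lift is negative (CoT regresses the model). A minimal sketch, using the llama-3.1-8b numbers from the analysis; `cot_lift` is a hypothetical helper name, not part of the eval harness.

```python
# Per-task (Direct %, CoT %) pairs for llama-3.1-8b, copied from the card above.
llama_3_1_8b = {
    "B1": (52, 56), "B2": (65, 69), "B3": (21, 33),
    "B4": (27, 61), "B5": (65, 46), "B6": (24, 19),
    "B7": (47, 26), "B8": (94, 98), "B9": (54, 51),
}

def cot_lift(card):
    """Lift = CoT - Direct for each task; negative lift means CoT regressed."""
    return {task: cot - direct for task, (direct, cot) in card.items()}

lifts = cot_lift(llama_3_1_8b)
regressions = [task for task, lift in lifts.items() if lift < 0]
# regressions -> ["B5", "B6", "B7", "B9"]
```

This surfaces the pattern visible in the table: the smallest Llama model loses accuracy under CoT on four of the nine tasks.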
## Scale Analysis — Small vs Large (Within Family)

| Family | Model | B1 | B2 | B3 | B4 | B5 | B6 | B7 | B8 | B9 |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT | gpt-4o-mini (small) | 61 | 76 | 21 | 46 | 77 | 28 | 66 | 95 | 54 |
| GPT | gpt-4o (large) | 77 | 83 | 24 | 39 | 79 | 42 | 66 | 100 | 56 |
| Llama | llama-3.1-8b (small) | 52 | 65 | 21 | 27 | 65 | 24 | 47 | 94 | 54 |
| Llama | llama-3.1-70b (large) | 61 | 76 | 22 | 45 | 78 | 18 | 32 | 99 | 59 |
| Mistral | ministral-8b (small) | 73 | 70 | 19 | 26 | 75 | 18 | 53 | 98 | 57 |
| Mistral | mistral-small-24b (large) | 75 | 65 | 20 | 46 | 74 | 16 | 68 | 94 | 50 |
| Qwen | qwen-2.5-7b (small) | 61 | 68 | 15 | 39 | 72 | 28 | 62 | 92 | 58 |
| Qwen | qwen-2.5-72b (large) | 81 | 79 | 24 | 46 | 82 | 28 | 67 | 99 | 65 |
Values shown are Direct-prompting accuracy (%). Scale effect = Large - Small accuracy.
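The scale effect defined above can be computed per task. A minimal sketch using the Qwen family's Direct numbers from the table; the variable names are illustrative, not part of the dashboard.

```python
# Direct-prompting accuracy (%) per task, copied from the scale-analysis table.
small = {"B1": 61, "B2": 68, "B3": 15, "B4": 39, "B5": 72,
         "B6": 28, "B7": 62, "B8": 92, "B9": 58}   # qwen-2.5-7b
large = {"B1": 81, "B2": 79, "B3": 24, "B4": 46, "B5": 82,
         "B6": 28, "B7": 67, "B8": 99, "B9": 65}   # qwen-2.5-72b

# Scale effect = Large - Small accuracy, per task.
scale_effect = {task: large[task] - small[task] for task in small}
# e.g. scale_effect["B1"] == 20 (scaling helps), scale_effect["B6"] == 0 (no gain)
```

For Qwen, scaling from 7B to 72B helps on eight of nine tasks, with B6 (Set Membership) flat at 28%.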