# EVAL MONITOR

Agent-failure-taxonomy benchmark evaluation across models and prompting strategies.

## Accuracy Heatmap — Direct Prompting
| Task | haiku-4.5 (small) | gpt-4o-mini (small) | gpt-4o (large) | o3 (large) | llama-3.1-8b (small) | llama-3.1-70b (large) | ministral-8b (small) | mistral-small-24b (large) | qwen-2.5-7b (small) | qwen-2.5-72b (large) |
|---|---|---|---|---|---|---|---|---|---|---|
| B1 Modus Tollens (Propositional Logic) | 59% | 61% | 77% | 0% | 52% | 61% | 73% | 75% | 61% | 81% |
| B2 Syllogistic (Categorical Reasoning) | 53% | 76% | 83% | 11% | 65% | 76% | 70% | 65% | 68% | 79% |
| B3 Temporal Ordering (Temporal Logic) | 16% | 21% | 24% | 0% | 21% | 22% | 19% | 20% | 15% | 24% |
| B4 Spatial Rotation (Spatial Reasoning) | 35% | 46% | 39% | 0% | 27% | 45% | 26% | 46% | 39% | 46% |
| B5 Counting (Arithmetic) | 75% | 77% | 79% | 11% | 65% | 78% | 75% | 74% | 72% | 82% |
| B6 Set Membership (Set Theory) | 24% | 28% | 42% | 0% | 24% | 18% | 18% | 16% | 28% | 28% |
| B7 Recursive Depth (Recursion) | 10% | 66% | 66% | 5% | 47% | 32% | 53% | 68% | 62% | 67% |
| B8 Pattern Matching (Induction) | 99% | 95% | 100% | 44% | 94% | 99% | 98% | 94% | 92% | 99% |
| B9 Compositional (Compositionality) | 53% | 54% | 56% | 0% | 54% | 59% | 57% | 50% | 58% | 65% |
Legend: <40% | 40-69% | 70-89% | 90%+ | Pending
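The legend bins above can be reproduced in code. This is a minimal sketch; the thresholds are taken from the legend line, and the function name `heatmap_bucket` is a hypothetical helper, not part of the dashboard.

```python
def heatmap_bucket(accuracy):
    """Map a Direct-prompting accuracy (0-100) to its heatmap legend bucket.

    `None` marks a cell whose evaluation run has not completed yet.
    Thresholds follow the legend: <40%, 40-69%, 70-89%, 90%+.
    """
    if accuracy is None:
        return "Pending"
    if accuracy >= 90:
        return "90%+"
    if accuracy >= 70:
        return "70-89%"
    if accuracy >= 40:
        return "40-69%"
    return "<40%"
```

For example, gpt-4o's 100% on B8 falls in the 90%+ bucket, while o3's 0% on B1 falls in the <40% bucket.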
## Chain-of-Thought Lift Analysis
| Model | Provider / Size | B1 | B2 | B3 | B4 | B5 | B6 | B7 | B8 | B9 |
|---|---|---|---|---|---|---|---|---|---|---|
| haiku-4.5 | Anthropic / small | 59/81 (+22) | 53/93 (+40) | 16/93 (+77) | 35/89 (+54) | 75/100 (+25) | 24/50 (+26) | 10/70 (+60) | 99/100 (+1) | 53/95 (+42) |
| gpt-4o-mini | OpenAI / small | 61/78 (+17) | 76/87 (+11) | 21/81 (+60) | 46/64 (+18) | 77/79 (+2) | 28/31 (+3) | 66/60 (-6) | 95/94 (-1) | 54/55 (+1) |
| gpt-4o | OpenAI / large | 77/81 (+4) | 83/90 (+7) | 24/96 (+72) | 39/62 (+23) | 79/90 (+11) | 42/40 (-2) | 66/67 (+1) | 100/99 (-1) | 56/58 (+2) |
| o3 | OpenAI / large | 0/39 (+39) | 11/78 (+67) | 0/73 (+73) | 0/76 (+76) | 11/81 (+70) | 0/51 (+51) | 5/16 (+11) | 44/100 (+56) | 0/91 (+91) |
| llama-3.1-8b | OpenRouter / small | 52/56 (+4) | 65/69 (+4) | 21/33 (+12) | 27/61 (+34) | 65/46 (-19) | 24/19 (-5) | 47/26 (-21) | 94/98 (+4) | 54/51 (-3) |
| llama-3.1-70b | OpenRouter / large | 61/76 (+15) | 76/88 (+12) | 22/83 (+61) | 45/81 (+36) | 78/72 (-6) | 18/33 (+15) | 32/36 (+4) | 99/100 (+1) | 59/84 (+25) |
| ministral-8b | OpenRouter / small | 73/73 (+0) | 70/82 (+12) | 19/83 (+64) | 26/89 (+63) | 75/86 (+11) | 18/31 (+13) | 53/39 (-14) | 98/99 (+1) | 57/50 (-7) |
| mistral-small-24b | OpenRouter / large | 75/76 (+1) | 65/84 (+19) | 20/83 (+63) | 46/80 (+34) | 74/76 (+2) | 16/34 (+18) | 68/45 (-23) | 94/100 (+6) | 50/70 (+20) |
| qwen-2.5-7b | OpenRouter / small | 61/75 (+14) | 68/81 (+13) | 15/45 (+30) | 39/60 (+21) | 72/67 (-5) | 28/29 (+1) | 62/54 (-8) | 92/97 (+5) | 58/59 (+1) |
| qwen-2.5-72b | OpenRouter / large | 81/80 (-1) | 79/90 (+11) | 24/84 (+60) | 46/84 (+38) | 82/88 (+6) | 28/46 (+18) | 67/62 (-5) | 99/99 (+0) | 65/82 (+17) |

Cells show Direct / CoT accuracy (%); Lift = CoT - Direct.
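The lift values above are simple differences, and it is worth flagging tasks where lift is negative (CoT regresses the model). A minimal sketch, using the llama-3.1-8b numbers from the analysis; `cot_lift` is a hypothetical helper name, not part of the eval harness.

```python
# Per-task (Direct %, CoT %) pairs for llama-3.1-8b, copied from the card above.
llama_3_1_8b = {
    "B1": (52, 56), "B2": (65, 69), "B3": (21, 33),
    "B4": (27, 61), "B5": (65, 46), "B6": (24, 19),
    "B7": (47, 26), "B8": (94, 98), "B9": (54, 51),
}

def cot_lift(card):
    """Lift = CoT - Direct for each task; negative lift means CoT regressed."""
    return {task: cot - direct for task, (direct, cot) in card.items()}

lifts = cot_lift(llama_3_1_8b)
regressions = [task for task, lift in lifts.items() if lift < 0]
# regressions -> ["B5", "B6", "B7", "B9"]
```

This surfaces the pattern visible in the table: the smallest Llama model loses accuracy under CoT on four of the nine tasks.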
## Scale Analysis — Small vs Large (Within Family)

| Family | Model | B1 | B2 | B3 | B4 | B5 | B6 | B7 | B8 | B9 |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT | gpt-4o-mini (small) | 61 | 76 | 21 | 46 | 77 | 28 | 66 | 95 | 54 |
| GPT | gpt-4o (large) | 77 | 83 | 24 | 39 | 79 | 42 | 66 | 100 | 56 |
| Llama | llama-3.1-8b (small) | 52 | 65 | 21 | 27 | 65 | 24 | 47 | 94 | 54 |
| Llama | llama-3.1-70b (large) | 61 | 76 | 22 | 45 | 78 | 18 | 32 | 99 | 59 |
| Mistral | ministral-8b (small) | 73 | 70 | 19 | 26 | 75 | 18 | 53 | 98 | 57 |
| Mistral | mistral-small-24b (large) | 75 | 65 | 20 | 46 | 74 | 16 | 68 | 94 | 50 |
| Qwen | qwen-2.5-7b (small) | 61 | 68 | 15 | 39 | 72 | 28 | 62 | 92 | 58 |
| Qwen | qwen-2.5-72b (large) | 81 | 79 | 24 | 46 | 82 | 28 | 67 | 99 | 65 |
Values shown are Direct-prompting accuracy (%). Scale effect = Large - Small accuracy.
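The scale effect defined above can be computed per task. A minimal sketch using the Qwen family's Direct numbers from the table; the variable names are illustrative, not part of the dashboard.

```python
# Direct-prompting accuracy (%) per task, copied from the scale-analysis table.
small = {"B1": 61, "B2": 68, "B3": 15, "B4": 39, "B5": 72,
         "B6": 28, "B7": 62, "B8": 92, "B9": 58}   # qwen-2.5-7b
large = {"B1": 81, "B2": 79, "B3": 24, "B4": 46, "B5": 82,
         "B6": 28, "B7": 67, "B8": 99, "B9": 65}   # qwen-2.5-72b

# Scale effect = Large - Small accuracy, per task.
scale_effect = {task: large[task] - small[task] for task in small}
# e.g. scale_effect["B1"] == 20 (scaling helps), scale_effect["B6"] == 0 (no gain)
```

For Qwen, scaling from 7B to 72B helps on eight of nine tasks, with B6 (Set Membership) flat at 28%.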