Where opus and sonnet fail, and what vario recipes could fix
| Category | GPQA opus (n=15) | MMLU-Pro opus (n=15) | MMLU-Pro sonnet (n=31) | MATH sonnet (n=13) | Total |
|---|---|---|---|---|---|
| reasoning_error | 12 (80%) | 5 (33%) | 13 (42%) | 3 (23%) | 33 |
| knowledge_gap | 0 | 6 (40%) | 12 (39%) | 0 | 18 |
| format_mismatch | 0 | 0 | 0 | 10 (77%) | 10 |
| ambiguous_question | 1 (7%) | 2 (13%) | 4 (13%) | 0 | 7 |
| extraction_failure | 2 (13%) | 1 (7%) | 1 (3%) | 0 | 4 |
| careless_mistake | 0 | 1 (7%) | 1 (3%) | 0 | 2 |
| Total | 15 | 15 | 31 | 13 | 74 |
GPQA is a reasoning benchmark, not a knowledge benchmark. 80% of opus failures are reasoning errors: wrong stereochemistry assignments, incorrect mechanism tracking, sign errors in calculations. The model has the domain knowledge but applies it incorrectly through multi-step chains. This is important because it implies retrieval-augmentation won't help — only better step-by-step verification will.
MMLU-Pro splits knowledge vs. reasoning roughly 40/40. Knowledge gaps are genuinely obscure facts — specific survey statistics, communication model terminology, legal precedents that require memorized detail. Reasoning errors are multi-step calculations and complex legal analysis. Both categories are present at similar rates across opus and sonnet, suggesting they reflect the benchmark structure, not model-specific weaknesses.
MATH failures are format noise, not real errors. 77% of sonnet's "failures" are format mismatches: 1000000 vs 1,000,000, equivalent LaTeX representations. Only 3 of 13 failures are genuine reasoning errors. The benchmark scorer needs to be fixed before drawing conclusions about MATH performance — current numbers understate actual accuracy by ~8pp.
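A minimal sketch of the normalization pass the scorer needs, assuming a Python scorer (function and names are illustrative). It handles the two mismatch classes above: thousands separators and display-variant LaTeX macros. Genuinely different-but-equivalent expressions still need symbolic comparison (e.g. via sympy), which this does not attempt:

```python
import re

def normalize_answer(ans: str) -> str:
    r"""Canonicalize an answer string before exact-match comparison.

    Covers the mismatch classes seen in the audit: thousands
    separators (1,000,000 vs 1000000) and display-variant LaTeX
    macros (\dfrac vs \frac). Structurally different equivalents
    like \frac{7}{6}\pi vs \dfrac{7\pi}{6} are NOT handled here;
    those need a symbolic-equality check.
    """
    s = ans.strip()
    # Drop thousands separators inside numbers: 1,000,000 -> 1000000
    s = re.sub(r'(?<=\d),(?=\d{3}\b)', '', s)
    # \dfrac and \tfrac are display variants of \frac
    s = s.replace(r'\dfrac', r'\frac').replace(r'\tfrac', r'\frac')
    # Collapse whitespace and strip math-mode delimiters
    s = re.sub(r'\s+', '', s).strip('$')
    return s
```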
Extraction failures are fixable infrastructure bugs. 4 cases where the model was reasoning toward the correct answer but response truncation or extractor bugs grabbed the wrong letter. These represent free accuracy gains — fix the extractor, recover the points. Not a model capability issue at all.
`\dfrac{7\pi}{6}` and `\frac{7}{6}\pi` render identically, but the extractor doesn't normalize LaTeX before comparison.

Which failure modes can vario recipes address, and how much improvement is realistic?
| Failure Mode | Count | Addressable? | Recipe | Est. Gain |
|---|---|---|---|---|
| reasoning_error (multi-step) | 33 | HIGH | multi_perspective + observe | +3–5pp GPQA |
| reasoning_error (stereo/sign) | ~15 | MEDIUM | verify_and_check | +1–2pp GPQA |
| knowledge_gap | 18 | LOW | enhance_and_solve + search | +1–2pp MMLU-Pro |
| format_mismatch | 10 | HIGH | Fix scorer (not vario) | N/A — free +8pp |
| extraction_failure | 4 | HIGH | Fix extractor (not vario) | N/A — free +3pp |
The old strategies (best_of_n, majority_vote) generate N solutions using the same reasoning approach. The model makes the same stereochemistry error every time — majority vote just amplifies the wrong answer. multi_perspective forces genuinely different approaches: one model solves via mechanism, another via symmetry, a third via elimination. This is structurally different and the most promising untested recipe for GPQA. The structural difference matters more than the sample size.
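The recipe shape can be sketched as follows, with a hypothetical `ask(prompt)` callable standing in for the real model API (the perspective prompts are illustrative). The point is that each sample is forced down a structurally different solution path *before* any vote:

```python
from collections import Counter
from typing import Callable

# Each perspective forces a structurally different solution path,
# so errors are less correlated than N identical samples would be.
PERSPECTIVES = [
    "Solve by working through the reaction mechanism step by step.",
    "Solve by symmetry/invariant arguments before any computation.",
    "Solve by eliminating answer choices that violate constraints.",
]

def multi_perspective(question: str,
                      ask: Callable[[str], str]) -> str:
    """Query the model once per perspective, then majority-vote.

    `ask` is a stand-in for the real model call: it takes a full
    prompt and returns a final answer string.
    """
    answers = [ask(f"{approach}\n\nQuestion: {question}")
               for approach in PERSPECTIVES]
    # Plain majority vote, but over *diverse* samples this time
    return Counter(answers).most_common(1)[0][0]
```

Contrast with best_of_n: there, all N prompts are identical, so a systematic stereochemistry error repeats N times and the vote ratifies it.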
Revised baseline: 87.1% (previously estimated ~85%). Weighted_vote lift revised from +7pp to +5.0pp.
| Strategy | Score | vs Baseline | Notes |
|---|---|---|---|
| baseline (single-shot) | 87.1% | — | Revised from ~85% estimate |
| weighted_vote | 92.1% | +5.0pp | Lift revised down from +7pp |
| best_of_n | ~90% | +3pp est. | Same approach × N; plateau expected |
| multi_perspective | TBD | Untested | Structurally different — highest potential |
| verify_and_check | TBD | Untested | Targets sign/stereo errors specifically |
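Since verify_and_check is untested, here is only a sketch of the intended control flow, with hypothetical `solve` and `check` callables standing in for model calls. The design choice it encodes: the checker re-derives the sign-sensitive step independently, and a failed check triggers a fresh attempt rather than a patch-up of the first chain:

```python
from typing import Callable

def verify_and_check(question: str,
                     solve: Callable[[str], str],
                     check: Callable[[str, str], bool],
                     max_retries: int = 2) -> str:
    """Solve, then run an independent verification pass.

    `solve` and `check` are stand-ins for model calls. `check`
    should re-derive the answer (e.g. plug it back in, redo the
    sign- or stereo-sensitive step) instead of re-reading the
    original reasoning chain.
    """
    answer = solve(question)
    for _ in range(max_retries):
        if check(question, answer):
            return answer
        answer = solve(question)  # fresh attempt, not a patch-up
    return answer  # best effort after exhausting retries
```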