Frontier Model Failure Taxonomy — March 2026

Where opus and sonnet fail, and which vario recipes could fix them

Generated 2026-03-26  ·  74 failures across 4 strong baseline runs  ·  GPQA, MMLU-Pro, MATH

Total failures analyzed: 74 · Benchmark runs: 4
Reasoning errors: 33 (45%) · Knowledge gaps: 18 (24%) · Format mismatches: 10 (14%) · Extraction failures: 4 (5%)

Cross-Benchmark Failure Taxonomy

| Category | GPQA opus (n=15) | MMLU-Pro opus (n=15) | MMLU-Pro sonnet (n=31) | MATH sonnet (n=13) | Total |
| --- | --- | --- | --- | --- | --- |
| reasoning_error | 12 (80%) | 5 (33%) | 13 (42%) | 3 (23%) | 33 |
| knowledge_gap | 0 | 6 (40%) | 12 (39%) | 0 | 18 |
| format_mismatch | 0 | 0 | 0 | 10 (77%) | 10 |
| ambiguous_question | 1 (7%) | 2 (13%) | 4 (13%) | 0 | 7 |
| extraction_failure | 2 (13%) | 1 (7%) | 1 (3%) | 0 | 4 |
| careless_mistake | 0 | 1 (7%) | 1 (3%) | 0 | 2 |
| Total | 15 | 15 | 31 | 13 | 74 |
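As a sanity check, the per-cell and per-category counts in the table can be reproduced from raw triage records with a few lines of Python. The records below are a hypothetical subset covering two of the four runs, not the full failure log:

```python
from collections import Counter

# Hypothetical triage records: (run, category) pairs, one per failure.
# Counts mirror the GPQA/opus and MATH/sonnet columns of the table above.
failures = (
    [("GPQA/opus", "reasoning_error")] * 12
    + [("GPQA/opus", "ambiguous_question")] * 1
    + [("GPQA/opus", "extraction_failure")] * 2
    + [("MATH/sonnet", "format_mismatch")] * 10
    + [("MATH/sonnet", "reasoning_error")] * 3
)

def tabulate(records):
    """Count failures per (run, category) cell and per category overall."""
    by_cell = Counter(records)
    by_category = Counter(cat for _, cat in records)
    return by_cell, by_category

by_cell, by_category = tabulate(failures)
print(by_cell[("GPQA/opus", "reasoning_error")])  # 12
print(by_category["reasoning_error"])             # 15
```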

Distribution by Category

reasoning_error: 33 · knowledge_gap: 18 · format_mismatch: 10 · ambiguous_question: 7 · extraction_failure: 4 · careless_mistake: 2

Key Findings

Finding 1 — GPQA

GPQA is a reasoning benchmark, not a knowledge benchmark. 80% of opus failures are reasoning errors: wrong stereochemistry assignments, incorrect mechanism tracking, sign errors in calculations. The model has the domain knowledge but applies it incorrectly through multi-step chains. This is important because it implies retrieval-augmentation won't help — only better step-by-step verification will.

Finding 2 — MMLU-Pro

MMLU-Pro splits roughly 40% knowledge gaps vs. 40% reasoning errors. Knowledge gaps are genuinely obscure facts — specific survey statistics, communication-model terminology, legal precedents that require memorized detail. Reasoning errors are multi-step calculations and complex legal analysis. Both categories appear at similar rates for opus and sonnet, suggesting they reflect the benchmark's structure rather than model-specific weaknesses.

Finding 3 — MATH

MATH failures are mostly format noise, not real errors. 77% of sonnet's "failures" are format mismatches (e.g., 1000000 vs. 1,000,000, or equivalent LaTeX representations). Only 3 of 13 failures are genuine reasoning errors. The benchmark scorer needs to be fixed before drawing conclusions about MATH performance — current numbers understate actual accuracy by ~8pp.

Finding 4 — Extraction

Extraction failures are fixable infrastructure bugs. 4 cases where the model was reasoning toward the correct answer but response truncation or extractor bugs grabbed the wrong letter. These represent free accuracy gains — fix the extractor, recover the points. Not a model capability issue at all.

Concrete Failure Examples

reasoning_error  — Applying the right knowledge incorrectly

reasoning_error · GPQA Q#3 · opus · Chemistry · Diels-Alder Stereochemistry
"What are the stereodescriptors of the EXO product in the following Diels-Alder reaction... [structure given]"
Model answer: B · Correct: D
Correctly identifies the reaction type and product skeleton, correctly draws the endo/exo distinction — but assigns wrong stereodescriptors (R/S) for the EXO product. The multi-step stereochemical tracking fails at the final labeling step.

reasoning_error · GPQA Q#57 · opus · Chemistry · Multi-step Synthesis Route
"Which of the following synthesis routes would give the highest yield of the target compound... [4 routes given]"
Model answer: C · Correct: A
Correctly identifies directing effects for each substituent individually, but gets confused evaluating the combined selectivity across multi-step synthesis options. The error is in integrating multiple constraints simultaneously.

reasoning_error · GPQA Q#51 · opus · Chemistry · Regiochemistry / LDA Alkylation
"What is the major product when the following compound is treated with LDA then MeI... [structure given]"
Model answer: B · Correct: D
Correctly identifies most steps in the mechanism but makes a regiochemistry error in the LDA/MeI alkylation — the deprotonation site is wrong, leading to incorrect methylation regiochemistry.

reasoning_error · MMLU-Pro · opus · Economics · Consumer Surplus
"The Cincinnati Reds sell 1,000 tickets at $10 each night. Given the following demand schedule... what is the consumer surplus?"
Model answer: A · Correct: C
Has the relevant knowledge of consumer surplus and correctly identifies the formula — but makes an inferential error applying it to the stepped demand schedule. The arithmetic is correct; the setup is wrong.

knowledge_gap  — Genuinely obscure facts the model doesn't have

knowledge_gap · MMLU-Pro · sonnet · Social Science · Survey Statistics
"According to a 2002 Pew Research survey, what percentage of Italians said a free media was very important?"
Model answer: B (68%) · Correct: D (84%)
Requires specific survey data from a 2002 Pew report. No amount of reasoning recovers this — it's a memorization question. The model's guess is plausible but wrong.

knowledge_gap · MMLU-Pro · opus · Communication · Westley-MacLean Model
"In the Westley-MacLean model of communication, what role does the 'C' play?"
Model answer: A · Correct: B
Requires memorized knowledge of a specific academic communication model's notation. The model is uncertain and picks the wrong designation for C (the channel/gatekeeper distinction).

format_mismatch  — Correct answers marked wrong by the scorer

format_mismatch · MATH · sonnet · Counting & Probability
"How many ways can 10 people be seated in a row of 10 chairs if two specific people must not sit next to each other?"
Model answer: 3110400 · Expected: 3,110,400
Numerically identical. The scorer does string matching and fails on comma formatting. This is a scorer bug — the model is correct.

format_mismatch · MATH · sonnet · Trigonometry / Geometry
"Express the general solution for sin(x) = -1/2 in the range [0, 2π]..."
Model answer: \dfrac{7\pi}{6} · Expected: \frac{7}{6}\pi
Equivalent LaTeX representations: \dfrac{7\pi}{6} and \frac{7}{6}\pi render identically. The extractor doesn't normalize LaTeX before comparison.
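Both mismatch classes above disappear with a small normalization pass before string comparison. A minimal sketch, assuming the scorer compares final-answer strings; the benchmark's actual scorer code is not shown here:

```python
import re

def normalize_answer(ans: str) -> str:
    """Canonicalize an answer string before comparison (sketch only)."""
    s = ans.strip()
    # Thousands separators: 3,110,400 -> 3110400.
    if re.fullmatch(r"-?\d{1,3}(,\d{3})+", s):
        s = s.replace(",", "")
    # \dfrac and \tfrac render differently but denote the same fraction.
    s = s.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")
    # \frac{a\pi}{b} -> \frac{a}{b}\pi so both common fraction forms agree.
    s = re.sub(
        r"\\frac\{(\d*)\\pi\}\{(\d+)\}",
        lambda m: "\\frac{%s}{%s}\\pi" % (m.group(1) or "1", m.group(2)),
        s,
    )
    return s

def answers_match(model: str, expected: str) -> bool:
    """String equality after normalization, instead of raw matching."""
    return normalize_answer(model) == normalize_answer(expected)
```

A fuller fix would parse both sides with a CAS (e.g., sympy) and compare symbolically, but even these two rewrites would recover both examples above.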

Vario Recipe Mapping

Which failure modes can vario recipes address, and how much improvement is realistic?

| Failure Mode | Count | Addressable? | Recipe | Est. Gain |
| --- | --- | --- | --- | --- |
| reasoning_error (multi-step) | 33 | HIGH | multi_perspective + observe | +3–5pp GPQA |
| reasoning_error (stereo/sign) | ~15 | MEDIUM | verify_and_check | +1–2pp GPQA |
| knowledge_gap | 18 | LOW | enhance_and_solve + search | +1–2pp MMLU-Pro |
| format_mismatch | 10 | HIGH | Fix scorer (not vario) | N/A (free +8pp) |
| extraction_failure | 4 | HIGH | Fix extractor (not vario) | N/A (free +3pp) |
Key Insight — Why multi_perspective is different

The old strategies (best_of_n, majority_vote) generate N solutions using the same reasoning approach. The model makes the same stereochemistry error every time — majority vote just amplifies the wrong answer. multi_perspective forces genuinely different approaches: one model solves via mechanism, another via symmetry, a third via elimination. This is structurally different and the most promising untested recipe for GPQA. The structural difference matters more than the sample size.
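The aggregation step can be sketched as a vote over labeled perspectives rather than anonymous resamples. The data and function name below are hypothetical; the actual multi_perspective recipe interface isn't specified in this report:

```python
from collections import Counter

# Hypothetical per-perspective answers for one GPQA question. Each key is
# a genuinely different solving strategy, not another sample of the same one.
perspective_answers = {
    "mechanism": "D",
    "symmetry": "D",
    "elimination": "B",
}

def multi_perspective_vote(answers: dict[str, str]) -> str:
    """Return the answer agreed on by the most distinct perspectives.

    Unlike best_of_n, each ballot comes from a structurally different
    approach, so a systematic error in one approach (the same
    stereochemistry slip on every resample) only costs one vote.
    """
    counts = Counter(answers.values())
    return counts.most_common(1)[0][0]

print(multi_perspective_vote(perspective_answers))  # D
```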

Haiku Amplification

Revised baseline: 87.1% (previously estimated ~85%). Weighted_vote lift revised from +7pp to +5.0pp.

Clean baseline (revised): 87.1% · weighted_vote lift (revised): +5.0pp · Best strategy target: 92.1%
| Strategy | Score | vs Baseline | Notes |
| --- | --- | --- | --- |
| baseline (single-shot) | 87.1% | – | Revised from ~85% estimate |
| weighted_vote | 92.1% | +5.0pp | Lift revised down from +7pp |
| best_of_n | ~90% | +3pp est. | Same approach × N; plateau expected |
| multi_perspective | TBD | Untested | Structurally different; highest potential |
| verify_and_check | TBD | Untested | Targets sign/stereo errors specifically |
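For reference, the weighted_vote aggregation amounts to a weight-sum argmax over ballots. A sketch, assuming each sample carries a confidence or verifier weight; the report doesn't specify the actual weighting scheme:

```python
from collections import defaultdict

def weighted_vote(ballots):
    """Aggregate (answer, weight) ballots and return the heaviest answer.

    Plain majority vote is the special case where every weight is 1;
    non-uniform weights let a confident minority outvote a shaky majority.
    """
    totals = defaultdict(float)
    for answer, weight in ballots:
        totals[answer] += weight
    return max(totals, key=totals.get)

print(weighted_vote([("A", 0.4), ("B", 0.9), ("A", 0.7)]))  # A: 0.4+0.7 beats B: 0.9
```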