Where opus and sonnet fail, and what vario recipes could fix
| Category | GPQA opus (n=15) | MMLU-Pro opus (n=15) | MMLU-Pro sonnet (n=31) | MATH sonnet (n=13) | Total |
|---|---|---|---|---|---|
| reasoning_error | 12 (80%) | 5 (33%) | 13 (42%) | 3 (23%) | 33 |
| knowledge_gap | 0 | 6 (40%) | 12 (39%) | 0 | 18 |
| format_mismatch | 0 | 0 | 0 | 10 (77%) | 10 |
| ambiguous_question | 1 (7%) | 2 (13%) | 4 (13%) | 0 | 7 |
| extraction_failure | 2 (13%) | 1 (7%) | 1 (3%) | 0 | 4 |
| careless_mistake | 0 | 1 (7%) | 1 (3%) | 0 | 2 |
| Total | 15 | 15 | 31 | 13 | 74 |
GPQA is a reasoning benchmark, not a knowledge benchmark. 80% of opus failures are reasoning errors: wrong stereochemistry assignments, incorrect mechanism tracking, sign errors in calculations. The model has the domain knowledge but applies it incorrectly through multi-step chains. This is important because it implies retrieval-augmentation won't help — only better step-by-step verification will.
MMLU-Pro splits knowledge vs. reasoning roughly 40/40. Knowledge gaps are genuinely obscure facts — specific survey statistics, communication model terminology, legal precedents that require memorized detail. Reasoning errors are multi-step calculations and complex legal analysis. Both categories are present at similar rates across opus and sonnet, suggesting they reflect the benchmark structure, not model-specific weaknesses.
MATH failures are format noise, not real errors. 77% of sonnet's "failures" are format mismatches: 1000000 vs 1,000,000, equivalent LaTeX representations. Only 3 of 13 failures are genuine reasoning errors. The benchmark scorer needs to be fixed before drawing conclusions about MATH performance — current numbers understate actual accuracy by ~8pp.
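A minimal sketch of the normalization pass the scorer needs, assuming a Python scorer (function and names are illustrative). It handles the two mismatch classes above: thousands separators and display-variant LaTeX macros. Genuinely different-but-equivalent expressions still need symbolic comparison (e.g. via sympy), which this does not attempt:

```python
import re

def normalize_answer(ans: str) -> str:
    r"""Canonicalize an answer string before exact-match comparison.

    Covers the mismatch classes seen in the audit: thousands
    separators (1,000,000 vs 1000000) and display-variant LaTeX
    macros (\dfrac vs \frac). Structurally different equivalents
    like \frac{7}{6}\pi vs \dfrac{7\pi}{6} are NOT handled here;
    those need a symbolic-equality check.
    """
    s = ans.strip()
    # Drop thousands separators inside numbers: 1,000,000 -> 1000000
    s = re.sub(r'(?<=\d),(?=\d{3}\b)', '', s)
    # \dfrac and \tfrac are display variants of \frac
    s = s.replace(r'\dfrac', r'\frac').replace(r'\tfrac', r'\frac')
    # Collapse whitespace and strip math-mode delimiters
    s = re.sub(r'\s+', '', s).strip('$')
    return s
```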
Extraction failures are fixable infrastructure bugs. 4 cases where the model was reasoning toward the correct answer but response truncation or extractor bugs grabbed the wrong letter. These represent free accuracy gains — fix the extractor, recover the points. Not a model capability issue at all.
`\dfrac{7\pi}{6}` and `\frac{7}{6}\pi` render identically, but the extractor doesn't normalize LaTeX before comparison.

Which failure modes can vario recipes address, and how much improvement is realistic?
| Failure Mode | Count | Addressable? | Recipe | Est. Gain |
|---|---|---|---|---|
| reasoning_error (multi-step) | 33 | HIGH | multi_perspective + observe | +3–5pp GPQA |
| reasoning_error (stereo/sign) | ~15 | MEDIUM | verify_and_check | +1–2pp GPQA |
| knowledge_gap | 18 | LOW | enhance_and_solve + search | +1–2pp MMLU-Pro |
| format_mismatch | 10 | HIGH | Fix scorer (not vario) | N/A — free +8pp |
| extraction_failure | 4 | HIGH | Fix extractor (not vario) | N/A — free +3pp |
The old strategies (best_of_n, majority_vote) generate N solutions using the same reasoning approach. The model makes the same stereochemistry error every time — majority vote just amplifies the wrong answer. multi_perspective forces genuinely different approaches: one model solves via mechanism, another via symmetry, a third via elimination. This is structurally different and the most promising untested recipe for GPQA. The structural difference matters more than the sample size.
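The recipe shape can be sketched as follows, with a hypothetical `ask(prompt)` callable standing in for the real model API (the perspective prompts are illustrative). The point is that each sample is forced down a structurally different solution path *before* any vote:

```python
from collections import Counter
from typing import Callable

# Each perspective forces a structurally different solution path,
# so errors are less correlated than N identical samples would be.
PERSPECTIVES = [
    "Solve by working through the reaction mechanism step by step.",
    "Solve by symmetry/invariant arguments before any computation.",
    "Solve by eliminating answer choices that violate constraints.",
]

def multi_perspective(question: str,
                      ask: Callable[[str], str]) -> str:
    """Query the model once per perspective, then majority-vote.

    `ask` is a stand-in for the real model call: it takes a full
    prompt and returns a final answer string.
    """
    answers = [ask(f"{approach}\n\nQuestion: {question}")
               for approach in PERSPECTIVES]
    # Plain majority vote, but over *diverse* samples this time
    return Counter(answers).most_common(1)[0][0]
```

Contrast with best_of_n: there, all N prompts are identical, so a systematic stereochemistry error repeats N times and the vote ratifies it.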
Revised baseline: 87.1% (previously estimated ~85%). Weighted_vote lift revised from +7pp to +5.0pp.
| Strategy | Score | vs Baseline | Notes |
|---|---|---|---|
| baseline (single-shot) | 87.1% | — | Revised from ~85% estimate |
| weighted_vote | 92.1% | +5.0pp | Lift revised down from +7pp |
| best_of_n | ~90% | +3pp est. | Same approach × N; plateau expected |
| multi_perspective | TBD | Untested | Structurally different — highest potential |
| verify_and_check | TBD | Untested | Targets sign/stereo errors specifically |
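Since verify_and_check is untested, here is only a sketch of the intended control flow, with hypothetical `solve` and `check` callables standing in for model calls. The design choice it encodes: the checker re-derives the sign-sensitive step independently, and a failed check triggers a fresh attempt rather than a patch-up of the first chain:

```python
from typing import Callable

def verify_and_check(question: str,
                     solve: Callable[[str], str],
                     check: Callable[[str, str], bool],
                     max_retries: int = 2) -> str:
    """Solve, then run an independent verification pass.

    `solve` and `check` are stand-ins for model calls. `check`
    should re-derive the answer (e.g. plug it back in, redo the
    sign- or stereo-sensitive step) instead of re-reading the
    original reasoning chain.
    """
    answer = solve(question)
    for _ in range(max_retries):
        if check(question, answer):
            return answer
        answer = solve(question)  # fresh attempt, not a patch-up
    return answer  # best effort after exhausting retries
```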