Rivus Eval Runner — Multi-model accuracy benchmarks across 9 datasets
We ran Anthropic Claude Haiku 4.5 (fast, no reasoning) on 20 questions from each dataset as a difficulty probe. The results establish a natural ordering from hardest to easiest, used throughout this report.
| Benchmark | Difficulty | Accuracy | Correct | Time (s) |
|---|---|---|---|---|
| GPQA Diamond | Extreme | 25.0% | 5/20 | 23.6 |
| MMLU-Pro | Hard | 45.0% | 9/20 | 10.8 |
| MATH | Medium-Hard | 65.0% | 13/20 | 23.1 |
| MMLU | Medium | 80.0% | 16/20 | 0.3* |
| HumanEval | Medium | 85.0% | 17/20 | 7.6 |
| ARC Challenge | Easy | 85.0% | 17/20 | 6.6 |
| HellaSwag | Easy | 85.0% | 17/20 | 4.9 |
| TruthfulQA | Easy | 90.0% | 18/20 | 5.0 |
| Winogrande | Easy | 95.0% | 19/20 | 4.7 |
GPQA Diamond at 25% = random chance on 4-choice questions. These are graduate-level science questions designed to stump PhD students outside their field.
MMLU-Pro has 10 answer choices (random = 10%), making it much harder than standard MMLU (4-choice, random = 25%).
* MMLU time of 0.3s = cache hit from prior run.
Only the three hardest datasets above (GPQA Diamond, MMLU-Pro, and MATH) meaningfully separate frontier models. On the remaining benchmarks most models score 80–95%, leaving little room to differentiate. The rest of this report concentrates on the hard trio, where reasoning capabilities, thinking budgets, and model architecture actually matter.
MMLU-Pro is our most comprehensive comparison: 100 questions, all major frontier models. It uses 10 answer choices (vs MMLU's 4), making random guessing worth only 10%.
| Model | Accuracy | Correct | Wrong | No Extract | Time (s) |
|---|---|---|---|---|---|
| OpenAI GPT-5.2 Pro | 86.0% | 86 | 14 | 0 | 247.4 |
| Google Gemini 3 Pro | 84.0% | 84 | 11 | 5 | 308.7 |
| Anthropic Claude Opus 4.6 | 80.0% | 80 | 8 | 12 | 49.7 |
| xAI Grok 4.1 Fast Reasoning | 78.0% | 78 | 8 | 14 | 171.0 |
| Google Gemini 3 Flash | 74.0% | 74 | 26 | 0 | 242.5 |
| Anthropic Claude Haiku 4.5 | 71.0% | 71 | 26 | 3 | ~10 |
| OpenAI GPT-5.2 | 55.0% | 55 | 45 | 0 | 10.4 |
No timeouts in this run (all models responded within limit). "No Extract" = our regex couldn't find a letter answer in the model's verbose response — a parsing failure, not a model failure. Opus (12) and Grok (14) are most affected; with improved extraction most of these would be correct. See Error Breakdown for details.
Flagship/Reasoning tier (OpenAI GPT-5.2 Pro, Google Gemini 3 Pro, Anthropic Claude Opus 4.6, xAI Grok 4.1 Fast Reasoning): all use thinking/reasoning tokens. Scores cluster between 78–86%. Claude Opus is 5x faster than GPT-5.2 Pro at only 6pp lower accuracy.
Fast/Cheap tier (Google Gemini 3 Flash, Anthropic Claude Haiku 4.5): much faster, 10–15pp below flagships.
OpenAI GPT-5.2 non-Pro (55%) is anomalously weak; it may not be activating reasoning despite reasoning_effort: high.
These are not fully maxed configs. reasoning: high = ~4096 thinking tokens. Opus uses 16K thinking budget (could go 32K+).
GPT-5.2 Pro at xhigh is the closest to true max. Gemini 3 Pro's score is depressed by answer-extraction failures, and Gemini 3 Flash's by truncation.
See Model Configurations and Error Breakdown for details.
Simpler 4-choice benchmark. Less discriminating than MMLU-Pro but useful as a cross-check.
| Model | Accuracy | Correct | Total | Time (s) |
|---|---|---|---|---|
| Google Gemini 3 Flash | 93.0% | 93 | 100 | 0.3 |
| OpenAI GPT-5 Mini | 82.0% | 82 | 100 | 0.3 |
| Anthropic Claude Haiku 4.5 | 81.0% | 81 | 100 | 0.3 |
| Random Baseline | 20.0% | 20 | 100 | 0.0 |
The random baseline lands at 20%, consistent with the 25% expected value for 4-choice questions: at n=100 the standard error of a random guesser is about 4.3pp, so 20% is roughly one standard error below expectation. Wall times near 0 indicate cached responses from earlier runs.
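As a quick check on that claim, the gap between 20% observed and 25% expected is within normal sampling noise. A short standalone calculation (plain Python, not part of the runner):

```python
import math

# Random guessing on 4-choice questions: p = 0.25, over n = 100 questions.
p, n = 0.25, 100
se = math.sqrt(p * (1 - p) / n)   # standard error of the observed proportion

observed = 0.20
z = (observed - p) / se           # standard errors below the expected 25%
print(f"SE = {se:.3f}, z = {z:.2f}")  # SE = 0.043, z = -1.15 -> consistent with chance
```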
MiniMax-M2.5 always uses internal chain-of-thought reasoning. It is much slower, but significantly more accurate on hard questions; on GPQA, adjusted accuracy reaches 93.3%, versus the random-chance 25% scored by the fast non-reasoning probe above.
| Benchmark | Accuracy | Adj. Accuracy | Correct | Answered | Total | Time (s) | Timeouts |
|---|---|---|---|---|---|---|---|
| GPQA Diamond | 70.0% | 93.3% | 14 | 15 | 20 | 660.1 | 5 |
| MMLU-Pro | 80.0% | 80.0% | 16 | 20 | 20 | 300.2 | 0 |
| MATH | 65.0% | 76.5% | 13 | 17 | 20 | 360.0 | 3 |
| MMLU | 85.0% | 89.5% | 17 | 19 | 20 | 262.7 | 1 |
| HumanEval | 55.0% | 55.0% | 11 | 20 | 20 | 130.0 | 0 |
| ARC Challenge | 95.0% | 95.0% | 19 | 20 | 20 | 50.7 | 0 |
| HellaSwag | 90.0% | 90.0% | 18 | 20 | 20 | 151.6 | 0 |
| TruthfulQA | 95.0% | 95.0% | 19 | 20 | 20 | 63.2 | 0 |
| Winogrande | 95.0% | 95.0% | 19 | 20 | 20 | 97.6 | 0 |
Timed-out questions count as incorrect in the "Accuracy" column. "Adj. Accuracy" excludes timeouts entirely: correct / answered. For example, GPQA shows 14 correct out of 20 total (70% raw) but 5 timed out — so 14/15 answered = 93.3% adjusted. This reveals the model's capability when it actually responds in time. The 120s timeout was too aggressive for MiniMax on GPQA/MATH; a 600s timeout would likely eliminate most timeouts.
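The two accuracy columns are just two ways of dividing the same counts. A minimal sketch (standalone Python, not the runner's actual scoring code) using the GPQA row as a worked example:

```python
def accuracy_pair(correct: int, total: int, timeouts: int) -> tuple[float, float]:
    """Raw accuracy counts timeouts as wrong; adjusted accuracy excludes them."""
    answered = total - timeouts
    raw = correct / total
    adjusted = correct / answered if answered else 0.0
    return raw, adjusted

# GPQA Diamond row: 14 correct out of 20 total, 5 timeouts -> 15 answered.
raw, adjusted = accuracy_pair(correct=14, total=20, timeouts=5)
print(f"raw = {raw:.1%}, adjusted = {adjusted:.1%}")  # raw = 70.0%, adjusted = 93.3%
```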
Both tested on the same 20-question samples (seed 42). Fast non-reasoning model vs always-thinking model.
| Benchmark | Haiku Acc. | Haiku Time | MiniMax Acc. | MiniMax Adj. Acc. | MiniMax Time | Delta (raw, pp) |
|---|---|---|---|---|---|---|
| GPQA Diamond | 25.0% | 23.6s | 70.0% | 93.3% | 660.1s | +45.0 |
| MMLU-Pro | 45.0% | 10.8s | 80.0% | 80.0% | 300.2s | +35.0 |
| MATH | 65.0% | 23.1s | 65.0% | 76.5% | 360.0s | 0.0 |
| MMLU | 80.0% | 0.3s | 85.0% | 89.5% | 262.7s | +5.0 |
| HumanEval | 85.0% | 7.6s | 55.0% | 55.0% | 130.0s | -30.0 |
| ARC Challenge | 85.0% | 6.6s | 95.0% | 95.0% | 50.7s | +10.0 |
| HellaSwag | 85.0% | 4.9s | 90.0% | 90.0% | 151.6s | +5.0 |
| TruthfulQA | 90.0% | 5.0s | 95.0% | 95.0% | 63.2s | +5.0 |
| Winogrande | 95.0% | 4.7s | 95.0% | 95.0% | 97.6s | 0.0 |
Thinking models earn their keep on hard problems: MiniMax gains +45pp on GPQA and +35pp on MMLU-Pro over Haiku. On easy benchmarks (Winogrande, TruthfulQA), thinking provides no measurable lift — both models saturate at 95%.
The HumanEval surprise: MiniMax (55%) is much worse than Haiku (85%) on code generation. Extended thinking doesn't appear to help with code here, and MiniMax may lack code-focused training data.
MiniMax is 10–30x slower due to internal reasoning. The 120s timeout was too low for GPQA/MATH; with a 600s+ timeout, MiniMax's adjusted GPQA accuracy of 93.3% would likely hold as raw accuracy as well.
Not all errors are created equal. A model can lose points because it chose the wrong answer, because our regex couldn't extract a letter from a verbose response, or because the response was cut off.
| Model | Config | Correct | Wrong | No Extract | Timeout | Truncated* | True Acc Est. |
|---|---|---|---|---|---|---|---|
| OpenAI GPT-5.2 Pro | reasoning: xhigh | 86 | 14 | 0 | 0 | 0 | 86% |
| Google Gemini 3 Pro | reasoning: high | 84 | 11 | 5 | 0 | 5 | ~87% |
| Anthropic Claude Opus 4.6 | thinking: 16K tokens | 80 | 8 | 12 | 0 | 0 | ~90% |
| xAI Grok 4.1 Fast Reasoning | always-think | 78 | 8 | 14 | 0 | 0 | ~89% |
| Google Gemini 3 Flash | reasoning: high | 74 | 26 | 0 | 0 | 28 | ~85% |
| Anthropic Claude Haiku 4.5 | no reasoning | 71 | 26 | 3 | 0 | 0 | ~73% |
| OpenAI GPT-5.2 | reasoning: high | 55 | 45 | 0 | 0 | 0 | 55% |
* Truncated = response cut off by the max_tokens limit (see Model Configurations); truncated responses are still scored, so these counts overlap with the Wrong and No Extract columns.
Opus and Grok have 12–14 extraction failures each. With improved extraction (LLM fallback when regex fails), both would likely reach ~89–90% true accuracy. This suggests the real leaderboard is tighter than the raw scores indicate: GPT-5.2 Pro 86%, Opus ~90%, Gemini Pro ~87%, Grok ~89% — all within a few points of each other.
Config caveat: These are not fully maxed configs. reasoning: high = ~4096 thinking tokens.
Opus thinking: 16K is better but not max. GPT-5.2 Pro at xhigh is closest to true max.
A "truly maxed" run would use thinking: 32K+ for Opus and longer timeouts for MiniMax.
Earlier ensemble/MoE experiments comparing single model vs multi-model voting and synthesis strategies.
| Strategy | Accuracy | Correct | Total |
|---|---|---|---|
| Model | Model ID | Reasoning Config | Timeout | Max Tokens | Notes |
|---|---|---|---|---|---|
| OpenAI GPT-5.2 Pro | openai/gpt-5.2-pro-2025-12-11 | reasoning_effort: xhigh | 600s | default | Highest reasoning tier |
| Google Gemini 3 Pro | gemini/gemini-3-pro-preview | reasoning_effort: high | 600s | default | Could use higher max_tokens |
| Anthropic Claude Opus 4.6 | anthropic/claude-opus-4-6 | thinking: {budget: 16384} | 600s | auto | 4x default; could go 32K+ |
| xAI Grok 4.1 Fast Reasoning | xai/grok-4-1-fast-reasoning | always-on thinking | 600s | default | No config needed |
| Google Gemini 3 Flash | gemini/gemini-3-flash-preview | reasoning_effort: high | 600s | default (too low) | Truncation — needs higher max_tokens |
| Anthropic Claude Haiku 4.5 | anthropic/claude-haiku-4-5-20251001 | none (no thinking) | 60s | default | Fast, no reasoning support |
| OpenAI GPT-5.2 | openai/gpt-5.2 | reasoning_effort: high | 600s | default | Unexpectedly weak at 55% |
| MiniMax-M2.5 | minimax/MiniMax-M2.5 | always-on thinking | 120s | default | Needs 600s for hard questions |
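For orientation, the knobs in this table would be passed to the provider roughly as follows. This is a hedged sketch, not the actual lib.llm.call_llm() implementation: the reasoning_effort and thinking parameters follow litellm's conventions, support varies by provider and litellm version, and the MODEL_CONFIGS mapping shown here is illustrative.

```python
import litellm

# Illustrative per-model settings mirroring the configuration table above.
MODEL_CONFIGS = {
    "openai/gpt-5.2-pro-2025-12-11": {"reasoning_effort": "xhigh", "timeout": 600},
    "anthropic/claude-opus-4-6": {
        "thinking": {"type": "enabled", "budget_tokens": 16384},
        "timeout": 600,
    },
    "anthropic/claude-haiku-4-5-20251001": {"timeout": 60},  # no thinking support
}

def call_model(model_id: str, question: str) -> str:
    """Send one benchmark question with that model's reasoning config applied."""
    extra = MODEL_CONFIGS.get(model_id, {"timeout": 600})
    response = litellm.completion(
        model=model_id,
        messages=[{"role": "user", "content": question}],
        **extra,  # seed/temperature handling elided (fixed seed 42, temperature 0 in the real runs)
    )
    return response.choices[0].message.content
```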
python -m benchmarks.eval.run — custom runner using lib.llm.call_llm() via litellm. Supports all major providers (Anthropic, OpenAI, Google, xAI, MiniMax).
Fixed seed (42), temperature 0. Smoke tests use 20 questions; full runs use 100. All models see the exact same questions for a given benchmark+sample size.
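A minimal sketch of the deterministic sampling this implies (plain Python; load_benchmark is a hypothetical loader, not a real function in the repo):

```python
import random

def sample_questions(questions: list[dict], n: int, seed: int = 42) -> list[dict]:
    """Draw the same n questions on every run for a given benchmark and sample size."""
    rng = random.Random(seed)  # fixed seed -> identical sample for every model
    return rng.sample(questions, k=min(n, len(questions)))

# Example: a 20-question smoke test; every model answers exactly these questions.
# smoke_set = sample_questions(load_benchmark("mmlu_pro"), n=20)
```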
Answer extraction: \boxed{} parsing from the model response. Timeout: 120s per question with 1 retry; code execution: 30s for HumanEval (prevents infinite loops). The 120s question timeout is more aggressive than standard; most eval frameworks use no timeout. MiniMax needs 600s+ for hard questions.
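A sketch of the timeout-and-retry behavior described above. The exception handling is generic because litellm surfaces provider timeouts as exceptions; the function name and structure are illustrative, not the runner's actual code:

```python
import litellm

def answer_with_retry(model_id: str, question: str,
                      timeout_s: float = 120.0, retries: int = 1) -> str | None:
    """One attempt plus `retries` retries; a still-unanswered question is a timeout."""
    for attempt in range(retries + 1):
        try:
            response = litellm.completion(
                model=model_id,
                messages=[{"role": "user", "content": question}],
                timeout=timeout_s,  # 120s here; 600s+ recommended for MiniMax on GPQA/MATH
            )
            return response.choices[0].message.content
        except Exception:           # timeouts and transient provider errors land here
            if attempt == retries:
                return None         # scored as incorrect in the raw accuracy column
    return None
```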
For comparison, answer-extraction failure rates across eval frameworks:
| Framework | Method | Failure Rate |
|---|---|---|
| OpenAI simple-evals | Single regex | ~25–30% |
| Our runner | 6 cascading patterns | ~15% |
| HLE | LLM-as-judge + structured output | ~5% |
TODO: Add cheap LLM extraction fallback when regex fails (~$0.001/question using Haiku).
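A hedged sketch of what the cascading extraction plus the proposed LLM fallback could look like. The three patterns below only illustrate the cascade idea (the runner's actual six patterns are not reproduced here), and the fallback prompt and model are assumptions:

```python
import re

# Ordered from strict to permissive; the real runner uses six cascading patterns.
_PATTERNS = [
    re.compile(r"(?:final answer|answer)\s*(?:is|:)\s*\(?([A-J])\)?\b", re.IGNORECASE),
    re.compile(r"\\boxed\{\(?([A-J])\)?\}"),
    re.compile(r"^\s*\(?([A-J])[).]\s*$", re.MULTILINE),
]

def extract_letter(response: str) -> str | None:
    """Return the last match of the first pattern that hits, else None ("No Extract")."""
    for pattern in _PATTERNS:
        matches = pattern.findall(response)
        if matches:
            return matches[-1].upper()
    return None

def extract_with_fallback(response: str) -> str | None:
    """Regex cascade first; on failure, the TODO above proposes a cheap LLM judge."""
    letter = extract_letter(response)
    if letter is not None:
        return letter
    # Hypothetical fallback (~$0.001/question with a small model such as Haiku):
    # verdict = call_model("anthropic/claude-haiku-4-5-20251001",
    #                      f"Which single letter (A-J) does this response pick?\n\n{response}")
    # return verdict.strip()[:1].upper() if verdict else None
    return None
```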