Benchmark Results

Rivus Eval Runner — Multi-model accuracy benchmarks across 9 datasets

Generated: 2026-02-13 • Sample: 20q smoke tests (Haiku, MiniMax) • 100q full runs (MMLU, MMLU-Pro)

Datasets: 9
Models tested: 9 (Claude Haiku 4.5, Claude Opus 4.6, GPT-5.2, GPT-5.2 Pro, Gemini 3 Flash, Gemini 3 Pro, Grok 4.1 Fast Reasoning, MiniMax-M2.5, Random)
Total evaluations: 1,340 questions answered across all runs

Benchmark Difficulty — Anthropic Claude Haiku 4.5 Baseline

We ran Anthropic Claude Haiku 4.5 (fast, no reasoning) on 20 questions from each dataset as a difficulty probe. The results establish a natural ordering from hardest to easiest, used throughout this report.

At a glance: GPQA Diamond 25% • MMLU-Pro 45% • MATH 65% • MMLU 80% • HumanEval 85% • ARC / HellaSwag 85% • TruthfulQA / Winogrande 90–95%

Full difficulty ladder

| Benchmark | Difficulty | Accuracy | Correct | Time (s) |
|---|---|---|---|---|
| GPQA Diamond | Extreme | 25.0% | 5/20 | 23.6 |
| MMLU-Pro | Hard | 45.0% | 9/20 | 10.8 |
| MATH | Medium-Hard | 65.0% | 13/20 | 23.1 |
| MMLU | Medium | 80.0% | 16/20 | 0.3* |
| HumanEval | Medium | 85.0% | 17/20 | 7.6 |
| ARC Challenge | Easy | 85.0% | 17/20 | 6.6 |
| HellaSwag | Easy | 85.0% | 17/20 | 4.9 |
| TruthfulQA | Easy | 90.0% | 18/20 | 5.0 |
| Winogrande | Easy | 95.0% | 19/20 | 4.7 |

GPQA Diamond at 25% = random chance on 4-choice questions. These are graduate-level science questions designed to stump PhD students outside their field.
MMLU-Pro has 10 answer choices (random = 10%), making it much harder than standard MMLU (4-choice, random = 25%).
* MMLU time of 0.3s = cache hit from prior run.

We focus on the 3 hardest benchmarks

The three rows at the top of the ladder — GPQA Diamond, MMLU-Pro, and MATH — are the only datasets that meaningfully separate frontier models. On the remaining benchmarks, most models score 80–95%, leaving little room to differentiate. The rest of this report concentrates on the hard trio, where reasoning capabilities, thinking budgets, and model architecture actually matter.

MMLU-Pro — Multi-Model Comparison (100 questions)

Our most comprehensive comparison: 100 questions, all major frontier models. MMLU-Pro uses 10 answer choices (vs MMLU's 4), making random guessing worth only 10%.

At a glance: GPT-5.2 Pro 86% • Gemini 3 Pro 84% • Claude Opus 4.6 80% • Grok 4.1 Fast Reasoning 78% • Gemini 3 Flash 74% • Claude Haiku 4.5 71% • GPT-5.2 55%
| Model | Accuracy | Correct | Wrong | No Extract | Time (s) |
|---|---|---|---|---|---|
| OpenAI GPT-5.2 Pro | 86.0% | 86 | 14 | 0 | 247.4 |
| Google Gemini 3 Pro | 84.0% | 84 | 11 | 5 | 308.7 |
| Anthropic Claude Opus 4.6 | 80.0% | 80 | 8 | 12 | 49.7 |
| xAI Grok 4.1 Fast Reasoning | 78.0% | 78 | 8 | 14 | 171.0 |
| Google Gemini 3 Flash | 74.0% | 74 | 26 | 0 | 242.5 |
| Anthropic Claude Haiku 4.5 | 71.0% | 71 | 26 | 3 | ~10 |
| OpenAI GPT-5.2 | 55.0% | 55 | 45 | 0 | 10.4 |

No timeouts in this run (all models responded within limit). "No Extract" = our regex couldn't find a letter answer in the model's verbose response — a parsing failure, not a model failure. Opus (12) and Grok (14) are most affected; with improved extraction most of these would be correct. See Error Breakdown for details.

Tier analysis

Flagship/Reasoning tier (OpenAI GPT-5.2 Pro, Google Gemini 3 Pro, Anthropic Claude Opus 4.6, xAI Grok 4.1 Fast Reasoning): all use thinking/reasoning tokens. Scores cluster in the 78–86% range. Claude Opus is roughly 5x faster than GPT-5.2 Pro (49.7s vs 247.4s per run) at only 6pp lower accuracy.

Fast/Cheap tier (Google Gemini 3 Flash, Anthropic Claude Haiku 4.5): much faster, 10–15pp below the flagships. OpenAI GPT-5.2 non-Pro (55%) is anomalously weak — it may not be activating reasoning despite reasoning_effort: high.

These are not fully maxed configs. reasoning: high corresponds to roughly 4,096 thinking tokens. Opus uses a 16K thinking budget (could go to 32K+). GPT-5.2 Pro at xhigh is the closest to a true maximum. Gemini scores are depressed by answer-extraction failures. See Model Configurations and Error Breakdown for details.

MMLU — Multi-Model Comparison (100 questions)

Simpler 4-choice benchmark. Less discriminating than MMLU-Pro but useful as a cross-check.

At a glance: Gemini 3 Flash 93% • GPT-5 Mini 82% • Claude Haiku 4.5 81% • Random Baseline 20%
| Model | Accuracy | Correct | Total | Time (s) |
|---|---|---|---|---|
| Google Gemini 3 Flash | 93.0% | 93 | 100 | 0.3 |
| OpenAI GPT-5 Mini | 82.0% | 82 | 100 | 0.3 |
| Anthropic Claude Haiku 4.5 | 81.0% | 81 | 100 | 0.3 |
| Random Baseline | 20.0% | 20 | 100 | 0.0 |

The random baseline at 20% is consistent with the 4-choice expected value (25% theoretical; 20% observed at n=100 is within sampling noise). Wall times near 0 indicate cached responses from earlier runs.

MiniMax-M2.5 — Thinking Model (20q each)

MiniMax-M2.5 always uses internal chain-of-thought reasoning. It is much slower, but significantly more accurate on hard questions — especially GPQA, where its adjusted accuracy reaches 93.3% versus Haiku's random-chance 25% baseline.

Headline results on the hard trio

GPQA adj. 93.3% • MMLU-Pro 80.0% • MATH adj. 76.5%
| Benchmark | Accuracy | Adj. Accuracy | Correct | Answered | Total | Time (s) | Timeouts |
|---|---|---|---|---|---|---|---|
| GPQA Diamond | 70.0% | 93.3% | 14 | 15 | 20 | 660.1 | 5 |
| MMLU-Pro | 80.0% | 80.0% | 16 | 20 | 20 | 300.2 | 0 |
| MATH | 65.0% | 76.5% | 13 | 17 | 20 | 360.0 | 3 |
| MMLU | 85.0% | 89.5% | 17 | 19 | 20 | 262.7 | 1 |
| HumanEval | 55.0% | 55.0% | 11 | 20 | 20 | 130.0 | 0 |
| ARC Challenge | 95.0% | 95.0% | 19 | 20 | 20 | 50.7 | 0 |
| HellaSwag | 90.0% | 90.0% | 18 | 20 | 20 | 151.6 | 0 |
| TruthfulQA | 95.0% | 95.0% | 19 | 20 | 20 | 63.2 | 0 |
| Winogrande | 95.0% | 95.0% | 19 | 20 | 20 | 97.6 | 0 |

How timeouts are scored

Timed-out questions count as incorrect in the "Accuracy" column. "Adj. Accuracy" excludes timeouts entirely: correct / answered. For example, GPQA shows 14 correct out of 20 total (70% raw) but 5 timed out — so 14/15 answered = 93.3% adjusted. This reveals the model's capability when it actually responds in time. The 120s timeout was too aggressive for MiniMax on GPQA/MATH; a 600s timeout would likely eliminate most timeouts.
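As a minimal sketch of that scoring rule (field names are illustrative, not the runner's actual schema):

```python
def score_run(results, total):
    """Raw vs. adjusted accuracy as described above.

    `results` is a list of per-question outcomes with illustrative
    keys "correct" (bool) and "timed_out" (bool).
    """
    answered = [r for r in results if not r["timed_out"]]
    correct = sum(1 for r in answered if r["correct"])

    raw_acc = correct / total                               # timeouts count as wrong
    adj_acc = correct / len(answered) if answered else 0.0  # timeouts excluded
    return raw_acc, adj_acc

# GPQA row above: 14 correct, 15 answered, 20 total
# raw = 14/20 = 70.0%, adjusted = 14/15 = 93.3%
```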

Head-to-Head: Anthropic Claude Haiku 4.5 vs MiniMax-M2.5

Both tested on the same 20-question samples (seed 42). Fast non-reasoning model vs always-thinking model.

| Benchmark | Haiku 4.5 Acc. | Haiku Time | MiniMax Acc. | MiniMax Adj. | MiniMax Time | Delta (raw, pp) |
|---|---|---|---|---|---|---|
| GPQA Diamond | 25.0% | 23.6s | 70.0% | 93.3% | 660.1s | +45.0 |
| MMLU-Pro | 45.0% | 10.8s | 80.0% | 80.0% | 300.2s | +35.0 |
| MATH | 65.0% | 23.1s | 65.0% | 76.5% | 360.0s | 0.0 |
| MMLU | 80.0% | 0.3s | 85.0% | 89.5% | 262.7s | +5.0 |
| HumanEval | 85.0% | 7.6s | 55.0% | 55.0% | 130.0s | -30.0 |
| ARC Challenge | 85.0% | 6.6s | 95.0% | 95.0% | 50.7s | +10.0 |
| HellaSwag | 85.0% | 4.9s | 90.0% | 90.0% | 151.6s | +5.0 |
| TruthfulQA | 90.0% | 5.0s | 95.0% | 95.0% | 63.2s | +5.0 |
| Winogrande | 95.0% | 4.7s | 95.0% | 95.0% | 97.6s | 0.0 |

Key takeaways

Thinking models earn their keep on hard problems: MiniMax gains +45pp on GPQA and +35pp on MMLU-Pro over Haiku. On easy benchmarks (Winogrande, TruthfulQA), thinking provides no measurable lift — both models saturate at 95%.

The HumanEval surprise: MiniMax (55%) is much worse than Haiku (85%) on code generation. Extended thinking doesn't help with code, and MiniMax may lack code-focused training data.

MiniMax is 10–30x slower due to internal reasoning. The 120s timeout was too low for GPQA/MATH — with 600s+ timeout, MiniMax's GPQA adjusted 93.3% would likely become raw accuracy too.

Error Breakdown by Model (MMLU-Pro, 100 questions)

Not all errors are created equal. A model can lose points because it chose the wrong answer, because our regex couldn't extract a letter from a verbose response, or because the response was cut off.

| Model | Config | Correct | Wrong | No Extract | Timeout | Truncated* | True Acc Est. |
|---|---|---|---|---|---|---|---|
| OpenAI GPT-5.2 Pro | reasoning: xhigh | 86 | 14 | 0 | 0 | 0 | 86% |
| Google Gemini 3 Pro | reasoning: high | 84 | 11 | 5 | 0 | 5 | ~87% |
| Anthropic Claude Opus 4.6 | thinking: 16K tokens | 80 | 8 | 12 | 0 | 0 | ~90% |
| xAI Grok 4.1 Fast Reasoning | always-think | 78 | 8 | 14 | 0 | 0 | ~89% |
| Google Gemini 3 Flash | reasoning: high | 74 | 26 | 0 | 0 | 28 | ~85% |
| Anthropic Claude Haiku 4.5 | no reasoning | 71 | 26 | 3 | 0 | 0 | ~73% |
| OpenAI GPT-5.2 | reasoning: high | 55 | 45 | 0 | 0 | 0 | 55% |

What the error types mean

Wrong Answer: the model responded with a letter but chose the wrong one. This is a genuine model error.
No Answer Extracted: the model gave an explanation but our regex couldn't find a letter. This is a parsing failure, not a model failure; improved extraction recovers ~80% of these as correct.
Truncated*: Gemini models give long explanations that get cut off before stating the answer. A config/extraction issue, not a model capability issue.

Opus and Grok have 12–14 extraction failures each. With improved extraction (LLM fallback when regex fails), both would likely reach ~89–90% true accuracy. This suggests the real leaderboard is tighter than raw scores suggest: GPT-5.2 Pro 86%, Opus ~90%, Gemini Pro ~87%, Grok ~89% — all within a few points of each other.
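A back-of-the-envelope version of that estimate, treating the ~80% recovery rate above as a fixed assumption rather than a per-model measurement:

```python
def true_accuracy_estimate(correct, no_extract, total=100, recovery_rate=0.8):
    """Estimate accuracy if extraction failures were recovered.

    recovery_rate is the assumed fraction of "No Extract" responses
    that would score as correct with better parsing (~80% above).
    """
    return (correct + recovery_rate * no_extract) / total

# Opus: (80 + 0.8 * 12) / 100 = 89.6%  -> "~90%"
# Grok: (78 + 0.8 * 14) / 100 = 89.2%  -> "~89%"
```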

Config caveat: These are not fully maxed configs. reasoning: high = ~4096 thinking tokens. Opus thinking: 16K is better but not max. GPT-5.2 Pro at xhigh is closest to true max. A "truly maxed" run would use thinking: 32K+ for Opus and longer timeouts for MiniMax.

Ensemble Experiments (MMLU-Pro, 100 questions)

Earlier ensemble/MoE experiments comparing single model vs multi-model voting and synthesis strategies.

| Strategy | Accuracy | Correct | Total |
|---|---|---|---|

Model Configurations Used

| Model | Model ID | Reasoning Config | Timeout | Max Tokens | Notes |
|---|---|---|---|---|---|
| OpenAI GPT-5.2 Pro | openai/gpt-5.2-pro-2025-12-11 | reasoning_effort: xhigh | 600s | default | Highest reasoning tier |
| Google Gemini 3 Pro | gemini/gemini-3-pro-preview | reasoning_effort: high | 600s | default | Could use higher max_tokens |
| Anthropic Claude Opus 4.6 | anthropic/claude-opus-4-6 | thinking: {budget: 16384} | 600s | auto | 4x default; could go 32K+ |
| xAI Grok 4.1 Fast Reasoning | xai/grok-4-1-fast-reasoning | always-on thinking | 600s | default | No config needed |
| Google Gemini 3 Flash | gemini/gemini-3-flash-preview | reasoning_effort: high | 600s | default (too low) | Truncation — needs higher max_tokens |
| Anthropic Claude Haiku 4.5 | anthropic/claude-haiku-4-5-20251001 | none (no thinking) | 60s | default | Fast, no reasoning support |
| OpenAI GPT-5.2 | openai/gpt-5.2 | reasoning_effort: high | 600s | default | Unexpectedly weak at 55% |
| MiniMax-M2.5 | minimax/MiniMax-M2.5 | always-on thinking | 120s | default | Needs 600s for hard questions |
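For illustration, the reasoning configs above map onto litellm call arguments roughly as follows. This is a sketch, not the runner's code: the exact plumbing inside lib.llm.call_llm() may differ, and the effort-level strings a given model accepts depend on the provider.

```python
import litellm

question = "..."  # one formatted benchmark question (placeholder)

# OpenAI-style reasoning models take a reasoning_effort level.
resp = litellm.completion(
    model="openai/gpt-5.2-pro-2025-12-11",
    messages=[{"role": "user", "content": question}],
    reasoning_effort="high",  # the table's "xhigh" is a higher tier where supported
)

# Anthropic extended thinking takes an explicit token budget.
resp = litellm.completion(
    model="anthropic/claude-opus-4-6",
    messages=[{"role": "user", "content": question}],
    thinking={"type": "enabled", "budget_tokens": 16384},  # 32K+ for a "truly maxed" run
)
```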

Methodology

Runner

python -m benchmarks.eval.run — custom runner using lib.llm.call_llm() via litellm. Supports all major providers (Anthropic, OpenAI, Google, xAI, MiniMax).
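A stripped-down sketch of a single question call through litellm; lib.llm.call_llm()'s actual signature and behavior may differ.

```python
import litellm

def ask_model(model_id: str, prompt: str, timeout_s: int = 120) -> str:
    """Send one benchmark question to a provider via litellm.

    model_id uses litellm's provider/model form, e.g.
    "anthropic/claude-haiku-4-5-20251001" (see Model Configurations).
    """
    resp = litellm.completion(
        model=model_id,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,      # deterministic sampling (see Sampling below)
        timeout=timeout_s,  # per-question limit (see Timeouts below)
    )
    return resp.choices[0].message.content
```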

Sampling

Fixed seed (42), temperature 0. Smoke tests use 20 questions; full runs use 100. All models see the exact same questions for a given benchmark+sample size.
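A sketch of the sampling step, assuming a benchmark's questions are already loaded as a list (the dataset loaders are omitted):

```python
import random

def sample_questions(questions, n, seed=42):
    """Draw a reproducible sample so every model sees the same
    n questions for a given benchmark and sample size."""
    rng = random.Random(seed)  # fixed seed -> identical sample across runs
    return rng.sample(questions, min(n, len(questions)))
```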

Formats

Timeouts

120s per question with 1 retry. Code execution: 30s for HumanEval (prevents infinite loops). The 120s question timeout is more aggressive than standard — most eval frameworks use no timeout. MiniMax needs 600s+ for hard questions.
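One way to enforce the per-question limit with a single retry (a sketch; the runner's implementation may differ):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def call_with_timeout(fn, *args, timeout_s=120, retries=1):
    """Run fn(*args) under a wall-clock limit with up to `retries` retries.
    Returns None if every attempt times out (scored as incorrect).
    Note: an abandoned attempt keeps running in its worker thread."""
    for _ in range(retries + 1):
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(fn, *args)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            pool.shutdown(wait=False)  # give up on this attempt and retry
    return None
```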

Answer extraction comparison

| Framework | Method | Failure Rate |
|---|---|---|
| OpenAI simple-evals | Single regex | ~25–30% |
| Our runner | 6 cascading patterns | ~15% |
| HLE | LLM-as-judge + structured output | ~5% |

TODO: Add cheap LLM extraction fallback when regex fails (~$0.001/question using Haiku).
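A hypothetical version of that fallback, with an illustrative pattern cascade (not the runner's actual six patterns) and the cheap model passed in as a plain prompt-to-text callable:

```python
import re

# Illustrative cascade -- not the runner's actual six patterns.
ANSWER_PATTERNS = [
    r"answer\s*(?:is|:)\s*\(?([A-J])\)?",   # "The answer is (C)"
    r"^\s*\(?([A-J])\)?\s*\.?\s*$",         # bare letter on its own line
    r"\b([A-J])\)\s",                       # "C) because ..."
]

def extract_letter(response: str):
    """Try each regex in order; return the first captured letter, else None."""
    for pattern in ANSWER_PATTERNS:
        m = re.search(pattern, response, re.IGNORECASE | re.MULTILINE)
        if m:
            return m.group(1).upper()
    return None

def extract_with_fallback(response: str, question: str, ask_cheap_llm):
    """Regex first; if that fails, ask a cheap model (e.g. Haiku) to read the
    letter out of the verbose response. `ask_cheap_llm` is any prompt -> text
    callable, e.g. ask_model from the Runner sketch bound to a cheap model id."""
    letter = extract_letter(response)
    if letter is not None:
        return letter
    reply = ask_cheap_llm(
        f"Question:\n{question}\n\nModel response:\n{response}\n\n"
        "Which single letter (A-J) did the response choose? Reply with only the letter."
    ).strip().upper()
    return reply[0] if reply and reply[0] in "ABCDEFGHIJ" else None
```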