Rivus Eval Runner — Multi-model accuracy benchmarks across 9 datasets
We ran Anthropic Claude Haiku 4.5 (fast, no reasoning) on 20 questions from each dataset as a difficulty probe. The results establish a natural ordering from hardest to easiest, used throughout this report.
| Benchmark | Difficulty | Accuracy | Correct | Time (s) |
|---|---|---|---|---|
| GPQA Diamond | Extreme | 25.0% | 5/20 | 23.6 |
| MMLU-Pro | Hard | 45.0% | 9/20 | 10.8 |
| MATH | Medium-Hard | 65.0% | 13/20 | 23.1 |
| MMLU | Medium | 80.0% | 16/20 | 0.3* |
| HumanEval | Medium | 85.0% | 17/20 | 7.6 |
| ARC Challenge | Easy | 85.0% | 17/20 | 6.6 |
| HellaSwag | Easy | 85.0% | 17/20 | 4.9 |
| TruthfulQA | Easy | 90.0% | 18/20 | 5.0 |
| Winogrande | Easy | 95.0% | 19/20 | 4.7 |
GPQA Diamond at 25% = random chance on 4-choice questions. These are graduate-level science questions designed to stump PhD students outside their field.
MMLU-Pro has 10 answer choices (random = 10%), making it much harder than standard MMLU (4-choice, random = 25%).
* MMLU time of 0.3s = cache hit from prior run.
Only the three hardest datasets above (GPQA Diamond, MMLU-Pro, and MATH) meaningfully separate frontier models. On the remaining benchmarks most models score 80–95%, leaving little room to differentiate. The rest of this report concentrates on the hard trio, where reasoning capabilities, thinking budgets, and model architecture actually matter.
MMLU-Pro is our most comprehensive comparison: 100 questions, all major frontier models. It uses 10 answer choices (vs MMLU's 4), making random guessing worth only 10%.
| Model | Accuracy | Correct | Wrong | No Extract | Time (s) |
|---|---|---|---|---|---|
| OpenAI GPT-5.2 Pro | 86.0% | 86 | 14 | 0 | 247.4 |
| Google Gemini 3 Pro | 84.0% | 84 | 11 | 5 | 308.7 |
| Anthropic Claude Opus 4.6 | 80.0% | 80 | 8 | 12 | 49.7 |
| xAI Grok 4.1 Fast Reasoning | 78.0% | 78 | 8 | 14 | 171.0 |
| Google Gemini 3 Flash | 74.0% | 74 | 26 | 0 | 242.5 |
| Anthropic Claude Haiku 4.5 | 71.0% | 71 | 26 | 3 | ~10 |
| OpenAI GPT-5.2 | 55.0% | 55 | 45 | 0 | 10.4 |
No timeouts in this run (all models responded within limit). "No Extract" = our regex couldn't find a letter answer in the model's verbose response — a parsing failure, not a model failure. Opus (12) and Grok (14) are most affected; with improved extraction most of these would be correct. See Error Breakdown for details.
Flagship/Reasoning tier (OpenAI GPT-5.2 Pro, Google Gemini 3 Pro, Anthropic Claude Opus 4.6, xAI Grok 4.1 Fast Reasoning): all use thinking/reasoning tokens. Scores cluster between 78–86%. Claude Opus is 5x faster than GPT-5.2 Pro at only 6pp lower accuracy.
Fast/Cheap tier (Google Gemini 3 Flash, Anthropic Claude Haiku 4.5): much faster, 10–15pp below flagships.
OpenAI GPT-5.2 non-Pro (55%) is anomalously weak; it may not be activating reasoning despite reasoning_effort: high.
These are not fully maxed configs. reasoning: high = ~4096 thinking tokens. Opus uses 16K thinking budget (could go 32K+).
GPT-5.2 Pro at xhigh is the closest to true max. Gemini 3 Pro's score is depressed by answer-extraction failures, and Gemini 3 Flash's by truncation.
See Model Configurations and Error Breakdown for details.
Simpler 4-choice benchmark. Less discriminating than MMLU-Pro but useful as a cross-check.
| Model | Accuracy | Correct | Total | Time (s) |
|---|---|---|---|---|
| Google Gemini 3 Flash | 93.0% | 93 | 100 | 0.3 |
| OpenAI GPT-5 Mini | 82.0% | 82 | 100 | 0.3 |
| Anthropic Claude Haiku 4.5 | 81.0% | 81 | 100 | 0.3 |
| Random Baseline | 20.0% | 20 | 100 | 0.0 |
The random baseline lands at 20%, consistent with the 25% expected value for 4-choice questions: at n=100 the standard error of a random guesser is about 4.3pp, so 20% is roughly one standard error below expectation. Wall times near 0 indicate cached responses from earlier runs.
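As a quick check on that claim, the gap between 20% observed and 25% expected is within normal sampling noise. A short standalone calculation (plain Python, not part of the runner):

```python
import math

# Random guessing on 4-choice questions: p = 0.25, over n = 100 questions.
p, n = 0.25, 100
se = math.sqrt(p * (1 - p) / n)   # standard error of the observed proportion

observed = 0.20
z = (observed - p) / se           # standard errors below the expected 25%
print(f"SE = {se:.3f}, z = {z:.2f}")  # SE = 0.043, z = -1.15 -> consistent with chance
```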
MiniMax-M2.5 always uses internal chain-of-thought reasoning. It is much slower, but significantly more accurate on hard questions; on GPQA, adjusted accuracy reaches 93.3%, versus the random-chance 25% scored by the fast non-reasoning probe above.
| Benchmark | Accuracy | Adj. Accuracy | Correct | Answered | Total | Time (s) | Timeouts |
|---|---|---|---|---|---|---|---|
| GPQA Diamond | 70.0% | 93.3% | 14 | 15 | 20 | 660.1 | 5 |
| MMLU-Pro | 80.0% | 80.0% | 16 | 20 | 20 | 300.2 | 0 |
| MATH | 65.0% | 76.5% | 13 | 17 | 20 | 360.0 | 3 |
| MMLU | 85.0% | 89.5% | 17 | 19 | 20 | 262.7 | 1 |
| HumanEval | 55.0% | 55.0% | 11 | 20 | 20 | 130.0 | 0 |
| ARC Challenge | 95.0% | 95.0% | 19 | 20 | 20 | 50.7 | 0 |
| HellaSwag | 90.0% | 90.0% | 18 | 20 | 20 | 151.6 | 0 |
| TruthfulQA | 95.0% | 95.0% | 19 | 20 | 20 | 63.2 | 0 |
| Winogrande | 95.0% | 95.0% | 19 | 20 | 20 | 97.6 | 0 |
Timed-out questions count as incorrect in the "Accuracy" column. "Adj. Accuracy" excludes timeouts entirely: correct / answered. For example, GPQA shows 14 correct out of 20 total (70% raw) but 5 timed out — so 14/15 answered = 93.3% adjusted. This reveals the model's capability when it actually responds in time. The 120s timeout was too aggressive for MiniMax on GPQA/MATH; a 600s timeout would likely eliminate most timeouts.
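The two accuracy columns are just two ways of dividing the same counts. A minimal sketch (standalone Python, not the runner's actual scoring code) using the GPQA row as a worked example:

```python
def accuracy_pair(correct: int, total: int, timeouts: int) -> tuple[float, float]:
    """Raw accuracy counts timeouts as wrong; adjusted accuracy excludes them."""
    answered = total - timeouts
    raw = correct / total
    adjusted = correct / answered if answered else 0.0
    return raw, adjusted

# GPQA Diamond row: 14 correct out of 20 total, 5 timeouts -> 15 answered.
raw, adjusted = accuracy_pair(correct=14, total=20, timeouts=5)
print(f"raw = {raw:.1%}, adjusted = {adjusted:.1%}")  # raw = 70.0%, adjusted = 93.3%
```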
Both tested on the same 20-question samples (seed 42). Fast non-reasoning model vs always-thinking model.
| Benchmark | Haiku Acc. | Haiku Time | MiniMax Acc. | MiniMax Adj. Acc. | MiniMax Time | Delta (raw, pp) |
|---|---|---|---|---|---|---|
| GPQA Diamond | 25.0% | 23.6s | 70.0% | 93.3% | 660.1s | +45.0 |
| MMLU-Pro | 45.0% | 10.8s | 80.0% | 80.0% | 300.2s | +35.0 |
| MATH | 65.0% | 23.1s | 65.0% | 76.5% | 360.0s | 0.0 |
| MMLU | 80.0% | 0.3s | 85.0% | 89.5% | 262.7s | +5.0 |
| HumanEval | 85.0% | 7.6s | 55.0% | 55.0% | 130.0s | -30.0 |
| ARC Challenge | 85.0% | 6.6s | 95.0% | 95.0% | 50.7s | +10.0 |
| HellaSwag | 85.0% | 4.9s | 90.0% | 90.0% | 151.6s | +5.0 |
| TruthfulQA | 90.0% | 5.0s | 95.0% | 95.0% | 63.2s | +5.0 |
| Winogrande | 95.0% | 4.7s | 95.0% | 95.0% | 97.6s | 0.0 |
Thinking models earn their keep on hard problems: MiniMax gains +45pp on GPQA and +35pp on MMLU-Pro over Haiku. On easy benchmarks (Winogrande, TruthfulQA), thinking provides no measurable lift — both models saturate at 95%.
The HumanEval surprise: MiniMax (55%) is much worse than Haiku (85%) on code generation. Extended thinking doesn't appear to help with code here, and MiniMax may lack code-focused training data.
MiniMax is 10–30x slower due to internal reasoning. The 120s timeout was too low for GPQA/MATH; with a 600s+ timeout, MiniMax's adjusted GPQA accuracy of 93.3% would likely hold as raw accuracy as well.
Not all errors are created equal. A model can lose points because it chose the wrong answer, because our regex couldn't extract a letter from a verbose response, or because the response was cut off.
| Model | Config | Correct | Wrong | No Extract | Timeout | Truncated* | True Acc Est. |
|---|---|---|---|---|---|---|---|
| OpenAI GPT-5.2 Pro | reasoning: xhigh | 86 | 14 | 0 | 0 | 0 | 86% |
| Google Gemini 3 Pro | reasoning: high | 84 | 11 | 5 | 0 | 5 | ~87% |
| Anthropic Claude Opus 4.6 | thinking: 16K tokens | 80 | 8 | 12 | 0 | 0 | ~90% |
| xAI Grok 4.1 Fast Reasoning | always-think | 78 | 8 | 14 | 0 | 0 | ~89% |
| Google Gemini 3 Flash | reasoning: high | 74 | 26 | 0 | 0 | 28 | ~85% |
| Anthropic Claude Haiku 4.5 | no reasoning | 71 | 26 | 3 | 0 | 0 | ~73% |
| OpenAI GPT-5.2 | reasoning: high | 55 | 45 | 0 | 0 | 0 | 55% |
* Truncated = response cut off by the max_tokens limit (see Model Configurations); truncated responses are still scored, so these counts overlap with the Wrong and No Extract columns.
Opus and Grok have 12–14 extraction failures each. With improved extraction (LLM fallback when regex fails), both would likely reach ~89–90% true accuracy. This suggests the real leaderboard is tighter than the raw scores indicate: GPT-5.2 Pro 86%, Opus ~90%, Gemini Pro ~87%, Grok ~89% — all within a few points of each other.
Config caveat: These are not fully maxed configs. reasoning: high = ~4096 thinking tokens.
Opus thinking: 16K is better but not max. GPT-5.2 Pro at xhigh is closest to true max.
A "truly maxed" run would use thinking: 32K+ for Opus and longer timeouts for MiniMax.
Earlier ensemble/MoE experiments comparing single model vs multi-model voting and synthesis strategies.
| Strategy | Accuracy | Correct | Total |
|---|---|---|---|
| Model | Model ID | Reasoning Config | Timeout | Max Tokens | Notes |
|---|---|---|---|---|---|
| OpenAI GPT-5.2 Pro | openai/gpt-5.2-pro-2025-12-11 | reasoning_effort: xhigh | 600s | default | Highest reasoning tier |
| Google Gemini 3 Pro | gemini/gemini-3-pro-preview | reasoning_effort: high | 600s | default | Could use higher max_tokens |
| Anthropic Claude Opus 4.6 | anthropic/claude-opus-4-6 | thinking: {budget: 16384} | 600s | auto | 4x default; could go 32K+ |
| xAI Grok 4.1 Fast Reasoning | xai/grok-4-1-fast-reasoning | always-on thinking | 600s | default | No config needed |
| Google Gemini 3 Flash | gemini/gemini-3-flash-preview | reasoning_effort: high | 600s | default (too low) | Truncation — needs higher max_tokens |
| Anthropic Claude Haiku 4.5 | anthropic/claude-haiku-4-5-20251001 | none (no thinking) | 60s | default | Fast, no reasoning support |
| OpenAI GPT-5.2 | openai/gpt-5.2 | reasoning_effort: high | 600s | default | Unexpectedly weak at 55% |
| MiniMax-M2.5 | minimax/MiniMax-M2.5 | always-on thinking | 120s | default | Needs 600s for hard questions |
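For orientation, the knobs in this table would be passed to the provider roughly as follows. This is a hedged sketch, not the actual lib.llm.call_llm() implementation: the reasoning_effort and thinking parameters follow litellm's conventions, support varies by provider and litellm version, and the MODEL_CONFIGS mapping shown here is illustrative.

```python
import litellm

# Illustrative per-model settings mirroring the configuration table above.
MODEL_CONFIGS = {
    "openai/gpt-5.2-pro-2025-12-11": {"reasoning_effort": "xhigh", "timeout": 600},
    "anthropic/claude-opus-4-6": {
        "thinking": {"type": "enabled", "budget_tokens": 16384},
        "timeout": 600,
    },
    "anthropic/claude-haiku-4-5-20251001": {"timeout": 60},  # no thinking support
}

def call_model(model_id: str, question: str) -> str:
    """Send one benchmark question with that model's reasoning config applied."""
    extra = MODEL_CONFIGS.get(model_id, {"timeout": 600})
    response = litellm.completion(
        model=model_id,
        messages=[{"role": "user", "content": question}],
        **extra,  # seed/temperature handling elided (fixed seed 42, temperature 0 in the real runs)
    )
    return response.choices[0].message.content
```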
python -m benchmarks.eval.run — custom runner using lib.llm.call_llm() via litellm. Supports all major providers (Anthropic, OpenAI, Google, xAI, MiniMax).
Fixed seed (42), temperature 0. Smoke tests use 20 questions; full runs use 100. All models see the exact same questions for a given benchmark+sample size.
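A minimal sketch of the deterministic sampling this implies (plain Python; load_benchmark is a hypothetical loader, not a real function in the repo):

```python
import random

def sample_questions(questions: list[dict], n: int, seed: int = 42) -> list[dict]:
    """Draw the same n questions on every run for a given benchmark and sample size."""
    rng = random.Random(seed)  # fixed seed -> identical sample for every model
    return rng.sample(questions, k=min(n, len(questions)))

# Example: a 20-question smoke test; every model answers exactly these questions.
# smoke_set = sample_questions(load_benchmark("mmlu_pro"), n=20)
```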
Answer extraction: \boxed{} parsing from the model response. Timeout: 120s per question with 1 retry; code execution: 30s for HumanEval (prevents infinite loops). The 120s question timeout is more aggressive than standard; most eval frameworks use no timeout. MiniMax needs 600s+ for hard questions.
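A sketch of the timeout-and-retry behavior described above. The exception handling is generic because litellm surfaces provider timeouts as exceptions; the function name and structure are illustrative, not the runner's actual code:

```python
import litellm

def answer_with_retry(model_id: str, question: str,
                      timeout_s: float = 120.0, retries: int = 1) -> str | None:
    """One attempt plus `retries` retries; a still-unanswered question is a timeout."""
    for attempt in range(retries + 1):
        try:
            response = litellm.completion(
                model=model_id,
                messages=[{"role": "user", "content": question}],
                timeout=timeout_s,  # 120s here; 600s+ recommended for MiniMax on GPQA/MATH
            )
            return response.choices[0].message.content
        except Exception:           # timeouts and transient provider errors land here
            if attempt == retries:
                return None         # scored as incorrect in the raw accuracy column
    return None
```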
For comparison, answer-extraction failure rates across eval frameworks:
| Framework | Method | Failure Rate |
|---|---|---|
| OpenAI simple-evals | Single regex | ~25–30% |
| Our runner | 6 cascading patterns | ~15% |
| HLE | LLM-as-judge + structured output | ~5% |
TODO: Add cheap LLM extraction fallback when regex fails (~$0.001/question using Haiku).
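A hedged sketch of what the cascading extraction plus the proposed LLM fallback could look like. The three patterns below only illustrate the cascade idea (the runner's actual six patterns are not reproduced here), and the fallback prompt and model are assumptions:

```python
import re

# Ordered from strict to permissive; the real runner uses six cascading patterns.
_PATTERNS = [
    re.compile(r"(?:final answer|answer)\s*(?:is|:)\s*\(?([A-J])\)?\b", re.IGNORECASE),
    re.compile(r"\\boxed\{\(?([A-J])\)?\}"),
    re.compile(r"^\s*\(?([A-J])[).]\s*$", re.MULTILINE),
]

def extract_letter(response: str) -> str | None:
    """Return the last match of the first pattern that hits, else None ("No Extract")."""
    for pattern in _PATTERNS:
        matches = pattern.findall(response)
        if matches:
            return matches[-1].upper()
    return None

def extract_with_fallback(response: str) -> str | None:
    """Regex cascade first; on failure, the TODO above proposes a cheap LLM judge."""
    letter = extract_letter(response)
    if letter is not None:
        return letter
    # Hypothetical fallback (~$0.001/question with a small model such as Haiku):
    # verdict = call_model("anthropic/claude-haiku-4-5-20251001",
    #                      f"Which single letter (A-J) does this response pick?\n\n{response}")
    # return verdict.strip()[:1].upper() if verdict else None
    return None
```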