February 17, 2026 — rivus/benchmarks
Each model was run on the same 100-question GPQA Diamond subset (seed=0) twice:
once closed-book (no tools), once with provider-native web search enabled.
All runs used --no-cache to ensure fresh API calls.
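The fixed-seed subset selection can be sketched as follows (a minimal illustration, not the harness's actual code; sample_subset is a hypothetical helper):

```python
import random

def sample_subset(question_ids, k=100, seed=0):
    # With a fixed seed, every run draws the identical k-question
    # subset, so closed-book and web-search runs are directly comparable.
    rng = random.Random(seed)
    return sorted(rng.sample(question_ids, k))
```

Sorting the draw keeps question order stable across runs regardless of sampling order.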
| Model | Closed-Book | + Native Web | Delta | Latency (closed) | Latency (web) |
|---|---|---|---|---|---|
| Gemini 3 Pro | 93% | 90% | −3% | 216s | 563s |
| GPT-5.2 Pro (reasoning: high) | 89% | 89% | +0% | 330s | 460s |
| Opus 4.6 | 85% | 87% | +2% | 92s | 89s |
| Grok 4.1 Fast Reasoning | 85% | 87% | +2% | 148s | 182s |
| GPT-5.2 | 75% | 68% | −7% | 67s | 73s |
GPQA Diamond contains PhD-level questions in physics, chemistry, and biology. The correct answers require deep domain knowledge and multi-step reasoning, not facts that can be Googled. Web results for these topics tend to be surface-level (Wikipedia summaries, study guides) that can mislead a model that already has the correct reasoning chain from training. The strongest model (Gemini 3 Pro at 93%) was hurt most — it had the most to lose from second-guessing itself with web noise.
The largest drop came from GPT-5.2 (non-Pro, no reasoning budget), which appears particularly susceptible to web-search distraction. Without extended reasoning to filter irrelevant search results, it incorporates misleading information into its answers. The Pro version with reasoning_effort: high was unaffected (89% → 89%), suggesting that reasoning budget acts as a filter against web noise.
| Model | Accuracy | Avg Latency | Cost | Tier |
|---|---|---|---|---|
| Gemini 3 Pro | 93% | 216s | $0* | Frontier |
| Gemini 3 Pro + web | 90% | 563s | $3.51 | Frontier |
| GPT-5.2 Pro (re:high) | 89% | 330s | $0* | Frontier |
| GPT-5.2 Pro + web | 89% | 460s | $0* | Frontier |
| Opus 4.6 + web | 87% | 89s | $3.31 | Frontier |
| Grok 4.1 Reasoning + web | 87% | 182s | $0* | Frontier |
| Opus 4.6 (think 16K) | 86% | 304s | $11.00 | Frontier |
| Opus 4.6 | 85% | 92s | $3.35 | Strong |
| Grok 4.1 Reasoning | 85% | 148s | $0* | Strong |
| MiniMax-M2.5 | 84% | 742s | $0 | Strong |
| Gemini 3 Flash | 83% | 48s | $0.29 | Strong |
| GPT-5.2 | 75% | 67s | $0* | Mid |
| GPT-5 Mini | 74% | 98s | $0* | Mid |
| GPT-5.2 + web | 68% | 73s | $0* | Hurt by web |
| Grok 4.1 Non-Reasoning | 64% | 25s | $0* | Budget |
| Haiku 4.5 | 58% | — | $0 | Budget |
* $0 cost = subscription/free tier or Responses API not returning cost metadata with web search tools.
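The tiering above is informal, but the accuracy-for-cost tradeoff behind it can be made explicit with a trivial ratio (a hypothetical helper, not part of the harness; only meaningful for rows with a real cost figure):

```python
def accuracy_per_dollar(accuracy_pct, cost_usd):
    # Percentage points of accuracy bought per dollar of API spend.
    # Undefined for $0 rows (subscription runs or missing cost metadata).
    if cost_usd <= 0:
        raise ValueError("cost must be positive")
    return accuracy_pct / cost_usd
```

For example, Gemini 3 Flash buys roughly 286 accuracy points per dollar (83/0.29), versus roughly 25 for Opus 4.6 (85/3.35).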
We also tested whether an explicit thinking budget improves Opus 4.6 on GPQA Diamond.
| Configuration | Accuracy | Cost | Latency |
|---|---|---|---|
| Opus 4.6 (no thinking) | 85% | $3.35 | 92s |
| Opus 4.6 (think 16K budget) | 86% | $11.00 | 304s |
| Opus 4.6 + web search | 87% | $3.31 | 89s |
A 16K thinking budget costs 3.3x more ($11.00 vs $3.35) and takes 3.3x longer (304s vs 92s) for only +1% accuracy (85% → 86%); web search achieved +2% at lower cost. Per-question analysis showed 89% agreement between the think and no-think runs: thinking helped on 6 questions but hurt on 5 (overthinking/second-guessing).
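For reference, extended thinking on the Anthropic Messages API is enabled by a thinking block in the request. A minimal sketch of building those request parameters (thinking_request is a hypothetical helper; field names follow the Anthropic API):

```python
def thinking_request(prompt, budget_tokens=16384, max_tokens=20000):
    # Request parameters for an Anthropic Messages API call with extended
    # thinking. max_tokens must exceed the thinking budget, and the API
    # enforces a minimum budget of 1024 tokens.
    if budget_tokens < 1024:
        raise ValueError("budget_tokens must be >= 1024")
    if max_tokens <= budget_tokens:
        raise ValueError("max_tokens must exceed budget_tokens")
    return {
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }
```

The default of 16384 matches the think-16K configuration benchmarked above.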
On the 105-question MATH subset:

| Model | Accuracy | Cost | Avg Latency |
|---|---|---|---|
| Gemini 3 Flash | 95.2% | $0.16 | 5.3s |
| GPT-5 Mini | 93.3% | $0* | 8.6s |
| Haiku 4.5 | 92.4% | $0.27 | 3.2s |
| Grok 4.1 Non-Reasoning | 70.5% | $0* | 3.6s |
70% of questions were answered identically by all 4 models. Of the 32 disagreements, Grok was uniquely wrong on 21 (66%) — dragging down the agreement rate. The top 3 models (Gemini Flash, GPT-5 Mini, Haiku) agreed on 91% of questions.
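The disagreement breakdown above can be computed from a grid of per-model correctness flags. A sketch (uniquely_wrong is a hypothetical helper, not the harness's code):

```python
from collections import Counter

def uniquely_wrong(correct):
    # correct: {model_name: [bool per question]}, all lists the same length.
    # A model is "uniquely wrong" on a question when it is the only model
    # to miss it while every other model answers correctly.
    counts = Counter()
    n_questions = len(next(iter(correct.values())))
    for q in range(n_questions):
        wrong = [m for m, flags in correct.items() if not flags[q]]
        if len(wrong) == 1:
            counts[wrong[0]] += 1
    return counts
```

Run over the four models' MATH results, this is the calculation that attributes 21 of the 32 disagreements to Grok alone.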
Setup notes (web-search runs used native_web_search=True):
- Web search: web_search_options parameter or web_search tool, depending on provider
- --no-cache for fresh API calls
- Thinking: thinking.budget_tokens=16384 via the Anthropic API
- Model layer: lib/llm via litellm, with Anthropic OAuth subscription path + API billing fallback
- Harness: benchmarks/eval/ — AsyncBatchRunner with rich progress bar
- Run: python -m benchmarks.eval.gpqa_official <model> --sample 100 --native-web-search --no-cache

| Scenario | Recommendation |
|---|---|
| Best GPQA accuracy | Gemini 3 Pro, closed-book (93%) |
| Best accuracy/cost ratio | Gemini 3 Flash (83% at $0.29, 48s) |
| When to use web search | Current-events benchmarks, not knowledge-heavy science. Web search helps when training data is stale, not when deep reasoning is needed. |
| When to use thinking tokens | When you need the highest possible accuracy and cost/latency don't matter. +1% with 3.3x overhead. |
| Fast MATH evaluations | Gemini 3 Flash (95.2%, 5.3s, $0.16/105q) |
Generated from benchmarks/results/gpqa_official/ and benchmarks/results/math_official/.
Run CLI: python -m benchmarks.eval.gpqa_official <model> [--native-web-search] [--sample N]