The Vario Advantage

Reasoning strategies measurably beat direct model calls. Here's the data.

  • +25pp -- best accuracy lift
  • 19 strategies
  • $0.02 -- cost per problem
Same model. Different reasoning. +25pp accuracy.
MATH Benchmark -- Haiku 4.5 -- 16 balanced problems. Strategies ordered by accuracy.
Accuracy by strategy (cost per problem in parentheses):

  • baseline: 50.0% (~$0)
  • temp_sweep: 62.5% ($0.010)
  • lens_ensemble: 62.5% ($0.018)
  • majority_vote: 68.8% ($0.017)
  • best_of_n: 68.8% ($0.017)
  • weighted_vote: 68.8% ($0.017)
  • gen_&_verify: 75.0% ($0.017) -- +25pp over baseline

Cross-Model Comparison

The same strategies applied across four different models. The lift depends on the baseline -- mid-range models benefit most.

Baseline vs. Best Strategy Across Models
MATH Benchmark. Haiku & Gemini: 16 problems. 100-problem runs: generate & verify.
Baseline → best strategy:

  • Haiku (16p): 50% → 75% (+25pp)
  • Gemini (16p): 81.3% → 87.5% (+6.2pp)
  • Gemini (100p): 87% → 91% (+4pp)
  • MiniMax (100p): 91% → 88% (-3pp)

Mid-range models benefit most

Haiku at 50% baseline gained +25pp. Gemini Flash at 81-87% gained +4-6pp. The more room there is to improve, the more strategies help.

Ceiling effects are real

Haiku at 92% baseline on 100 problems saw +0pp from generate & verify. When a model already solves nearly everything, there's nothing left to gain.

Overconfident verification hurts

MiniMax M2.5 dropped 3pp with generate & verify. Its self-verification systematically rejected correct answers. Not all models verify well.


Cost & Latency Reality Check

Strategies cost more. Here's how much.

  • $0.01-0.02 -- cost per problem
  • $1-2 -- per 100-problem batch
  • 5-90s -- latency per problem
  • ~1ms -- baseline latency
Detailed Cost & Latency Breakdown
Haiku 4.5, 16-problem MATH benchmark. Sorted by cost per problem.
Strategy | Accuracy | Cost / Problem | Latency | Problems / $1
baseline | 50.0% | ~$0 | 1.1ms | --
temperature_sweep | 62.5% | $0.010 | 16.4s | 100
majority_vote | 68.8% | $0.017 | 5.1s | 59
best_of_n | 68.8% | $0.017 | 14.9s | 59
weighted_vote | 68.8% | $0.017 | 6.4s | 59
generate_and_verify | 75.0% | $0.017 | 10.1s | 59
lens_ensemble | 62.5% | $0.018 | 61.9s | 56
The math

For a batch of 100 problems, the total cost is $1-2. The quality improvement is 4-25 percentage points. At $0.017 per problem, majority_vote and generate_and_verify solve ~59 problems per dollar -- while delivering 19-25pp more accuracy than a free baseline call.
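The arithmetic above can be sketched in a few lines. The figures ($0.017 per problem, +25pp lift) come from the Haiku 4.5 table; the function names are illustrative, not part of Vario's API.

```python
# Back-of-the-envelope cost/quality math for a 100-problem batch,
# using the Haiku 4.5 figures from the table above.

def batch_cost(cost_per_problem: float, n_problems: int = 100) -> float:
    """Total dollars for a batch of problems."""
    return round(cost_per_problem * n_problems, 2)

def problems_per_dollar(cost_per_problem: float) -> int:
    """How many problems one dollar buys at this strategy's price."""
    return round(1 / cost_per_problem)

def cost_per_point(cost_per_problem: float, lift_pp: float,
                   n_problems: int = 100) -> float:
    """Batch dollars spent per percentage point of accuracy gained."""
    return round(batch_cost(cost_per_problem, n_problems) / lift_pp, 3)

print(batch_cost(0.017))            # 1.7   -- inside the $1-2 range
print(problems_per_dollar(0.017))   # 59    -- matches the table
print(cost_per_point(0.017, 25.0))  # 0.068 -- about 7 cents per point
```

At these prices the dominant question is not cost but whether the lift exists at all, which is why the ceiling and verifier effects below matter more than the per-problem price.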

Latency context

Strategies add seconds, not minutes. majority_vote is the fastest strategy at 5.1s -- practical for interactive use. lens_ensemble at 62s is best suited to batch processing. The right trade-off depends on whether you need real-time answers or batch quality.


When Strategies Help Most

Not all situations benefit equally. Here's what the data shows about when to use strategies -- and when not to.

Sweet spot: 50-85% baselines

Models scoring in this range have the most headroom. Haiku at 50% gained +25pp. Gemini Flash at 81% gained +6pp. This is where strategies deliver the highest ROI.

Diminishing returns above 90%

Haiku at 92% (100-problem run) saw +0pp lift. When the model already solves 92 out of 100 problems, strategies can't find many errors to fix. The ceiling is real.

Overconfident verifiers backfire

MiniMax M2.5 lost 3pp with generate & verify. The model's verification step systematically rejected correct answers. Strategy selection should be model-aware.

Best bang for the buck

majority_vote: fastest (5.1s) with a solid +18.8pp on Haiku -- practical for interactive use. generate_and_verify: highest absolute accuracy at the same cost ($0.017 for +25pp), making it the cheapest per percentage point gained. Start with these two.
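The core of majority_vote fits in a few lines. This is a minimal sketch of the idea -- sample several answers and keep the most common one; `sample_fn` stands in for a model call at nonzero temperature, and none of this is Vario's actual interface.

```python
from collections import Counter

def majority_vote(sample_fn, problem: str, n: int = 5) -> str:
    """Sample n candidate answers and return the most common one.

    sample_fn: a stand-in for a model call (problem -> answer string),
    typically run at temperature > 0 so samples differ.
    """
    answers = [sample_fn(problem) for _ in range(n)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Toy stand-in: a "model" whose 5 samples agree 3 times out of 5.
fake_samples = iter(["42", "41", "42", "42", "40"])
result = majority_vote(lambda p: next(fake_samples), "6 * 7 = ?")
print(result)  # 42
```

The intuition: independent errors rarely agree, so the mode of several samples is more often correct than any single sample.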


19 Strategies. 10 Stages. 9 Lenses.

Strategies are built from composable stages and analytical lenses. Mix them like building blocks to create new reasoning approaches.

Voting

  • majority_vote
  • weighted_vote
  • best_of_n
  • temperature_sweep

Verification

  • generate_and_verify
  • self_consistency
  • chain_of_verification
  • stepwise_verification

Iteration

  • critique_and_refine
  • debate
  • progressive_refinement
  • red_team_blue_team
  • iterative_deepening

Ensemble & Hybrid

  • lens_ensemble
  • mixture_of_strategies
  • adaptive_strategy
  • meta_reasoning
  • tournament
  • cascade

10 Composable Stages

generate, score, critique, refine, verify, vote, filter, rank, synthesize, decompose

9 Analytical Lenses

game theory, economics, evolution, ecology, information theory, systems thinking, adversarial, Bayesian, thermodynamic

Composability

Each strategy is a pipeline of stages. generate_and_verify = generate + verify + vote. lens_ensemble = generate (per lens) + score + synthesize. New strategies emerge from new combinations -- no code changes needed, just a pipeline definition.
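The composition idea can be sketched as ordinary function composition. This is a minimal illustration, assuming stages are plain functions over a shared state dict -- the stage names mirror the list above, but the stage bodies are toy stubs and none of this is Vario's actual API.

```python
from typing import Callable

# A stage maps a state dict to a new state dict.
Stage = Callable[[dict], dict]

def pipeline(*stages: Stage) -> Stage:
    """Compose stages left-to-right into a single strategy."""
    def run(state: dict) -> dict:
        for stage in stages:
            state = stage(state)
        return state
    return run

# Toy stage implementations (real ones would call a model).
def generate(state: dict) -> dict:
    return {**state, "candidates": [f"cand-{i}" for i in range(3)]}

def verify(state: dict) -> dict:
    # Keep candidates passing a (stub) check.
    kept = [c for c in state["candidates"] if c.endswith(("0", "1"))]
    return {**state, "verified": kept}

def vote(state: dict) -> dict:
    return {**state, "answer": state["verified"][0]}

# generate_and_verify = generate + verify + vote, as described above.
generate_and_verify = pipeline(generate, verify, vote)
print(generate_and_verify({"problem": "..."})["answer"])  # cand-0
```

A new strategy is then literally a new `pipeline(...)` call, which is what "no code changes, just a pipeline definition" means in practice.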


Full Data: Gemini 3-Flash

For completeness -- the same benchmark run on a stronger baseline model.

MATH Benchmark -- Gemini 3-Flash (16 problems)
Higher baseline, smaller lift, longer latency. Note: generate & verify underperformed here.
Strategy | Accuracy | Cost / Problem | Latency
baseline | 81.3% | ~$0 | 0.8ms
majority_vote | 87.5% | $0.009 | 35.9s
best_of_n | 87.5% | $0.009 | 57.4s
generate_and_verify | 77.8% | $0.009 | 88.2s
lens_ensemble | 68.8% | $0.010 | 210.2s
Observation

On Gemini Flash, majority_vote and best_of_n outperform generate_and_verify. The optimal strategy is model-dependent. This is why Vario supports 19 strategies -- the best one varies by model, domain, and problem difficulty.


100-Problem Validation

Larger-scale runs with generate & verify across four models. More problems, more statistical confidence.

Generate & Verify -- 100-Problem MATH Benchmark
Model | Baseline | Gen & Verify | Lift | Cost / Run
Haiku 4.5 | 92% | 92% | +0pp | $1.79
Gemini 3-Flash | 87% | 91% | +4pp | $1.01
Grok-fast | 74% | 74% | +0pp | --
MiniMax M2.5 | 91% | 88% | -3pp | $1.33

Gemini Flash: best cost-effectiveness

+4pp lift at $1.01 total. That's $0.25 per percentage point gained -- the best ROI in this cohort.

Haiku: ceiling effect confirmed

At 92% baseline, there's almost no room to improve. The strategy correctly identifies most answers but can't fix what the model already gets right.


Case Study: Multi-Model Decisions in Practice

Real architectural decisions from a VC intelligence project. Four frontier models consulted independently via Vario's maxthink preset.

Setup: Building a VC intelligence tool for TenOneTen Ventures. Four design questions needed answering:

  • (A) how to get full portfolio data
  • (B) script vs. job handler architecture
  • (C) how to discover founders of portfolio companies
  • (D) how to show network reachability

Each question was sent to four frontier models (Opus, GPT Pro, Grok, Gemini) via Vario's maxthink preset. Every model answered independently.

Two patterns emerged that demonstrate the value of multi-model consultation.

Unique Insight Changes the Plan

Decision C (Founder Discovery): All 4 models agreed to use Bright Data Crunchbase lookups ($0.01/company) for founder data. But Gemini alone noticed a critical detail the other 3 missed:

"Crucially, Bright Data returns contacts with LinkedIn IDs. Because your Apollo free tier obfuscates org searches but allows lookups by LinkedIn URL, C1 perfectly bridges the gap: use BD to get the founders' LinkedIn URLs, then feed those URLs into your Apollo integration for the rich 8-role employment history." — Gemini

This is a data-pipeline insight -- Crunchbase returns LinkedIn IDs, and Apollo enriches by LinkedIn URL. The other models recommended Crunchbase for structured data but didn't see the Crunchbase→Apollo handoff that makes the whole pipeline work better. Without multi-model consultation, this connection would have been missed.

Consensus Validates the Bold Choice

Decision D (Network/Reachability): The tempting shortcut was D2 -- skip network analysis for the demo and show a placeholder. Building the real thing looked risky because it requires inference from sparse data (only 40 LinkedIn connections imported).

All 4 models independently said: Build it. It's the demo's punchline.

Opus: "Don't skip this — it's the demo's punchline. Even 1-2 connections is more impressive than a placeholder."

GPT Pro: "Label tiers: direct → 2-hop → inferred. Keep conservative."

Grok: "Existing data, quick inference, cost-free."

Gemini: "A placeholder defeats the purpose. Tip: manually verify at least one real path exists."

4/4 unanimous against the easy shortcut. This consensus gave confidence to invest the extra effort, knowing it wasn't just one model's opinion.

Consensus Exposed a Bad Assumption

Decision A (Getting all 155 investments): GPT Pro recommended using an existing investments.csv Crunchbase export. We checked -- the CSV exists but contains stale data from 2013-2015, missing TenOneTen's recent investments entirely.

The other models that recommended live scraping were correct. Multi-model didn't prevent the bad recommendation, but the split (2 said CSV, 2 said live) flagged it as worth verifying -- which caught the error.

Takeaway

The value isn't that every model is right -- it's that disagreement flags uncertainty and unique insights surface connections no single model sees.
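The "disagreement flags uncertainty" rule is mechanical enough to sketch. The function below tallies recommendations and marks any non-unanimous decision for human verification; the vote assignments besides GPT Pro's are illustrative (the case study only reports the 2/2 split), and this is not part of Vario.

```python
from collections import Counter

def consensus_check(recommendations: dict[str, str]) -> tuple[str, bool]:
    """Return (top recommendation, needs_verification).

    A split vote doesn't say who is right -- it says go check.
    """
    tally = Counter(recommendations.values())
    top, count = tally.most_common(1)[0]
    unanimous = count == len(recommendations)
    return top, not unanimous

# Decision A: 2 models said use the CSV export, 2 said scrape live.
# (Only GPT Pro's vote is known from the case study; the rest are made up.)
votes = {"GPT Pro": "csv", "Opus": "live", "Grok": "live", "Gemini": "csv"}
top, needs_check = consensus_check(votes)
print(needs_check)  # True -- the split is what flagged the stale CSV
```

Decision D was the opposite case: 4/4 unanimous, so `needs_verification` would be False and the consensus could be trusted with more confidence.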


What's Next

The benchmark results are the starting point. Here's where the research goes from here.

Data note

The benchmarks above are from early runs on MATH with budget-tier models and a small problem set (16-100 problems). Comprehensive benchmarks are in progress: more models (including frontier-tier Opus, GPT-Pro, Gemini Pro), more strategies head-to-head, more domains (MMLU-Pro, GPQA, WritingBench), larger sample sizes for statistical significance, and full cost/latency profiling. This page will be updated as better data lands.

Cross-domain validation

MMLU-Pro, GPQA, WritingBench. Does the lift hold on reasoning, science, and creative tasks?

Adaptive strategy selection

Pick the best strategy per problem type automatically. Route easy problems to baseline, hard ones to gen & verify.
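The routing idea above can be sketched as a threshold rule. The difficulty signal and threshold here are placeholders -- choosing a good difficulty estimator is exactly the open research question -- and the callables are stand-ins, not Vario's API.

```python
def route(problem: str, difficulty: float,
          baseline, generate_and_verify, threshold: float = 0.6):
    """Send easy problems to a cheap baseline call, hard ones to a
    heavier strategy. difficulty is assumed in [0, 1]; both strategies
    are callables problem -> answer."""
    strategy = generate_and_verify if difficulty >= threshold else baseline
    return strategy(problem)

# Toy usage with stub strategies that just report which path ran.
cheap = lambda p: ("baseline", p)
heavy = lambda p: ("generate_and_verify", p)
print(route("2 + 2", 0.1, cheap, heavy)[0])   # baseline
print(route("IMO #6", 0.9, cheap, heavy)[0])  # generate_and_verify
```

The payoff follows from the ceiling data above: most of a batch's cost goes to problems the baseline already solves, so routing keeps the lift while cutting spend.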

Strategy tournaments

Head-to-head comparison on the same problems. Statistical significance testing across strategy pairs.

Ablation studies

Which stage contributes most? Is it the generation diversity, the verification, or the voting? Isolate the signal.

Vario -- part of Rivus. Data from MATH benchmark runs, February 2026.