Cross-Model Comparison
The same strategies were applied across four different models. How much they lift depends on the baseline.
Mid-range models benefit most
Haiku at 50% baseline gained +25pp. Gemini Flash at 81-87% gained +4-6pp. The more room there is to improve, the more strategies help.
Ceiling effects are real
Haiku at 92% baseline on 100 problems saw +0pp from generate & verify. When a model already solves nearly everything, there's nothing left to gain.
Overconfident verification hurts
MiniMax M2.5 dropped 3pp with generate & verify. Its self-verification systematically rejected correct answers. Not all models verify well.
Cost & Latency Reality Check
Strategies cost more. Here's how much.
| Strategy | Accuracy | Cost / Problem | Latency | Problems / $1 |
|---|---|---|---|---|
| baseline | 50.0% | ~$0 | 1.1ms | -- |
| temperature_sweep | 62.5% | $0.010 | 16.4s | 100 |
| majority_vote | 68.8% | $0.017 | 5.1s | 59 |
| best_of_n | 68.8% | $0.017 | 14.9s | 59 |
| weighted_vote | 68.8% | $0.017 | 6.4s | 59 |
| generate_and_verify | 75.0% | $0.017 | 10.1s | 59 |
| lens_ensemble | 62.5% | $0.018 | 61.9s | 56 |
For a batch of 100 problems, the total cost is $1-2. The quality improvement is 4-25 percentage points. At $0.017 per problem, majority_vote and generate_and_verify solve ~59 problems per dollar -- while delivering 19-25pp more accuracy than a free baseline call.
Most strategies add seconds, not minutes. majority_vote is the fastest strategy at 5.1s -- practical for interactive use. lens_ensemble, at just over a minute (61.9s), is best suited for batch processing. The right trade-off depends on whether you need realtime answers or batch quality.
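The per-dollar figures reduce to simple arithmetic. A back-of-envelope sketch, using the accuracy and cost numbers from the table above (the dict layout and variable names are just for illustration):

```python
# Cost math from the benchmark table: (accuracy %, cost per problem in $).
# The numbers are the benchmark's; the code structure is illustrative.
BASELINE_ACC = 50.0

strategies = {
    "temperature_sweep": (62.5, 0.010),
    "majority_vote": (68.8, 0.017),
    "generate_and_verify": (75.0, 0.017),
    "lens_ensemble": (62.5, 0.018),
}

results = {}
for name, (acc, cost) in strategies.items():
    lift = acc - BASELINE_ACC  # percentage points over baseline
    results[name] = {
        "lift_pp": lift,
        "problems_per_dollar": round(1 / cost),
        "cost_per_pp": cost / lift,  # dollars per point of accuracy gained
    }

for name, r in results.items():
    print(f"{name}: +{r['lift_pp']:.1f}pp, "
          f"{r['problems_per_dollar']} problems/$, "
          f"${r['cost_per_pp']:.5f}/pp")
```

Running this reproduces the table's problems-per-dollar column (100, 59, 59, 56) and shows why cost per point gained, not raw cost, is the number to compare.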
When Strategies Help Most
Not all situations benefit equally. Here's what the data shows about when to use strategies -- and when not to.
Sweet spot: 50-85% baselines
Models scoring in this range have the most headroom. Haiku at 50% gained +25pp. Gemini Flash at 81% gained +6pp. This is where strategies deliver the highest ROI.
Diminishing returns above 90%
Haiku at 92% (100-problem run) saw +0pp lift. When the model already solves 92 out of 100 problems, strategies can't find many errors to fix. The ceiling is real.
Overconfident verifiers backfire
MiniMax M2.5 lost 3pp with generate & verify. The model's verification step systematically rejected correct answers. Strategy selection should be model-aware.
Best bang for the buck
majority_vote: fastest (5.1s) and +18.8pp on Haiku at $0.017 per problem. generate_and_verify: highest absolute accuracy (+25pp) at the same cost, which makes it the cheapest per point gained, at moderate latency. Start with these two.
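The voting step at the heart of majority_vote is tiny. A minimal sketch of the idea, not Vario's implementation:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Pick the most common answer among N sampled generations.
    Ties resolve to the first answer seen (Counter preserves insertion order)."""
    return Counter(answers).most_common(1)[0][0]

# Five samples, three agree: the vote smooths over the two wrong ones.
print(majority_vote(["42", "41", "42", "42", "17"]))  # -> 42
```

The strategy's power comes entirely from sampling diversity upstream; the aggregation itself is three lines.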
19 Strategies. 10 Stages. 9 Lenses.
Strategies are built from composable stages and analytical lenses. Mix them like building blocks to create new reasoning approaches.
Voting
- majority_vote
- weighted_vote
- best_of_n
- temperature_sweep
Verification
- generate_and_verify
- self_consistency
- chain_of_verification
- stepwise_verification
Iteration
- critique_and_refine
- debate
- progressive_refinement
- red_team_blue_team
- iterative_deepening
Ensemble & Hybrid
- lens_ensemble
- mixture_of_strategies
- adaptive_strategy
- meta_reasoning
- tournament
- cascade
10 Composable Stages
9 Analytical Lenses
Each strategy is a pipeline of stages. generate_and_verify = generate + verify + vote. lens_ensemble = generate (per lens) + score + synthesize. New strategies emerge from new combinations -- no code changes needed, just a pipeline definition.
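As a sketch of what "a pipeline of stages" can mean, here is a hypothetical composition. The `Stage` type, the `pipeline` combinator, and the toy stages are illustrative assumptions, not Vario's actual API; real stages would call models rather than return stubs:

```python
from collections import Counter
from typing import Any, Callable

Stage = Callable[[Any], Any]

def pipeline(*stages: Stage) -> Stage:
    """Compose stages left-to-right into a single callable strategy."""
    def run(state: Any) -> Any:
        for stage in stages:
            state = stage(state)
        return state
    return run

# Toy stages; real ones would sample models, check answers, and tally votes.
def generate(problem: str) -> dict:
    return {"problem": problem, "candidates": ["4", "5", "4"]}

def verify(state: dict) -> dict:
    # Keep candidates that pass a (stubbed) verification check.
    return {**state, "verified": list(state["candidates"])}

def vote(state: dict) -> str:
    return Counter(state["verified"]).most_common(1)[0][0]

# A new strategy is a new pipeline definition, not a new code path.
generate_and_verify = pipeline(generate, verify, vote)
print(generate_and_verify("2+2"))  # -> 4
```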
Full Data: Gemini 3-Flash
For completeness -- the same benchmark run on a stronger baseline model.
| Strategy | Accuracy | Cost / Problem | Latency |
|---|---|---|---|
| baseline | 81.3% | ~$0 | 0.8ms |
| majority_vote | 87.5% | $0.009 | 35.9s |
| best_of_n | 87.5% | $0.009 | 57.4s |
| generate_and_verify | 77.8% | $0.009 | 88.2s |
| lens_ensemble | 68.8% | $0.010 | 210.2s |
On Gemini Flash, majority_vote and best_of_n (+6.2pp each) beat generate_and_verify, which actually falls 3.5pp below baseline. The optimal strategy is model-dependent. This is why Vario supports 19 strategies -- the best one varies by model, domain, and problem difficulty.
100-Problem Validation
Larger-scale runs with generate & verify across four models. More problems, more statistical confidence.
| Model | Baseline | Gen & Verify | Lift | Cost / Run |
|---|---|---|---|---|
| Haiku 4.5 | 92% | 92% | +0pp | $1.79 |
| Gemini 3-Flash | 87% | 91% | +4pp | $1.01 |
| Grok-fast | 74% | 74% | +0pp | -- |
| MiniMax M2.5 | 91% | 88% | -3pp | $1.33 |
Gemini Flash: best cost-effectiveness
+4pp lift at $1.01 total. That's $0.25 per percentage point gained -- the best ROI in this cohort.
Haiku: ceiling effect confirmed
At 92% baseline, there's almost no room to improve. Verification confirms the answers the model already gets right, and the eight remaining errors are hard enough that it can't fix them either.
Case Study: Multi-Model Decisions in Practice
Real architectural decisions from a VC intelligence project. Four frontier models consulted independently via maxthink preset.
Setup: Building a VC intelligence tool for TenOneTen Ventures. Four design questions needed answering: (A) how to get full portfolio data, (B) script vs job handler architecture, (C) how to discover founders of portfolio companies, (D) how to show network reachability. Each question was sent to four frontier models — Opus, GPT Pro, Grok, Gemini — via Vario's maxthink preset. Every model answered independently.
Two patterns emerged that demonstrate the value of multi-model consultation.
Unique Insight Changes the Plan
Decision C (Founder Discovery): All 4 models agreed on Bright Data Crunchbase lookups ($0.01/company) for founder data. But Gemini alone noticed a critical detail the other 3 missed: Crunchbase returns LinkedIn IDs, and Apollo can enrich contacts by LinkedIn URL.
This is a data pipeline insight. The other models recommended Crunchbase for its structured data but didn't see the CB→Apollo handoff that makes the whole pipeline work better. Without multi-model consultation, this connection would have been missed.
Consensus Validates the Bold Choice
Decision D (Network/Reachability): The tempting shortcut was D2: skip network analysis for the demo and show a placeholder. Building the real thing is risky because it requires inference from sparse data (only 40 LinkedIn connections imported).
All 4 models independently said: Build it. It's the demo's punchline.
Opus: "Don't skip this — it's the demo's punchline. Even 1-2 connections is more impressive than a placeholder."
GPT Pro: "Label tiers: direct → 2-hop → inferred. Keep conservative."
Grok: "Existing data, quick inference, cost-free."
Gemini: "A placeholder defeats the purpose. Tip: manually verify at least one real path exists."
4/4 unanimous against the easy shortcut. This consensus gave confidence to invest the extra effort, knowing it wasn't just one model's opinion.
Disagreement Exposed a Bad Assumption
Decision A (Getting all 155 investments): GPT Pro recommended using an existing investments.csv Crunchbase export. We checked — the CSV exists but contains stale data from 2013-2015, missing TenOneTen's recent investments entirely.
The two models that recommended live scraping were correct. Multi-model consultation didn't prevent the bad recommendation, but the 2-2 split (two said CSV, two said live) flagged the question as worth verifying, which caught the error.
The value isn't that every model is right — it's that disagreement flags uncertainty and unique insights surface connections no single model sees.
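The disagreement signal itself is mechanical to detect. A minimal sketch, where the function name, model keys, and wording of the tiers are all hypothetical:

```python
from collections import Counter

def consensus_report(votes: dict[str, str]) -> str:
    """Classify multi-model agreement: unanimous, majority, or split."""
    counts = Counter(votes.values())
    top, n = counts.most_common(1)[0]
    if n == len(votes):
        return f"unanimous: {top}"
    if n * 2 > len(votes):
        return f"majority: {top} (check the dissenting answer)"
    return "split: flag for manual verification"

# A 2-2 split like Decision A's would be flagged automatically:
print(consensus_report({"opus": "live", "gpt_pro": "csv",
                        "grok": "live", "gemini": "csv"}))
```

Unanimity (Decision D) raises confidence; a split (Decision A) routes the question to a human check.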
What's Next
The benchmark results are the starting point. Here's where the research goes from here.
Data note
The benchmarks above are from early runs on MATH with budget-tier models and a small problem set (16-100 problems). Comprehensive benchmarks are in progress: more models (including frontier-tier Opus, GPT-Pro, Gemini Pro), more strategies head-to-head, more domains (MMLU-Pro, GPQA, WritingBench), larger sample sizes for statistical significance, and full cost/latency profiling. This page will be updated as better data lands.
Cross-domain validation
MMLU-Pro, GPQA, WritingBench. Does the lift hold on reasoning, science, and creative tasks?
Adaptive strategy selection
Pick the best strategy per problem type automatically. Route easy problems to baseline, hard ones to gen & verify.
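A routing sketch of that idea, assuming some upstream difficulty estimate exists; the function name, thresholds, and tiers are hypothetical, not measured:

```python
def route_strategy(estimated_difficulty: float) -> str:
    """Hypothetical router: spend nothing on easy problems,
    spend the full strategy budget only on the hard tail."""
    if estimated_difficulty < 0.3:
        return "baseline"           # model likely solves it in one shot
    if estimated_difficulty < 0.7:
        return "majority_vote"      # cheap smoothing for mid-range problems
    return "generate_and_verify"    # full verification for hard problems

print(route_strategy(0.1), route_strategy(0.5), route_strategy(0.9))
```

The open research question is the difficulty estimator itself: self-reported confidence, problem-type classifiers, and answer-distribution entropy are all candidates.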
Strategy tournaments
Head-to-head comparison on the same problems. Statistical significance testing across strategy pairs.
Ablation studies
Which stage contributes most? Is it the generation diversity, the verification, or the voting? Isolate the signal.