Vario: What If You Could Change One Number?

Every LLM call you make today is already a pipeline — just a very short one. Vario makes the pipeline longer, and everything interesting follows.

What We Know (from the research)

Five findings that shape how vario works. Details and citations in Section 7.

Generate → judge → select works. Up to +19 accuracy points on math reasoning. Gains scale with task difficulty. The single most effective pattern. (OpenAI o1, DeepMind ICLR 2025)
Smaller model + retries beats bigger model single-shot. A cheap model with best-of-N + verifier can match a model 10× its size at lower cost. (Inference Scaling Laws, ICLR 2025)
Self-critique is near-useless. Same model revising itself: +1.8% after 5 rounds. But a different model giving structured feedback: +80%. The bottleneck is error identification, not repair. (RefineBench, NVIDIA 2025)
Debate is oversold. Simple parallel generation + voting captures most gains attributed to multi-agent debate. Debate underperforms voting on 7/9 benchmarks. (NeurIPS 2025, ICLR 2025)
No single strategy wins everywhere. Easy tasks: skip multi-attempt (waste). Hard tasks: multi-attempt is the difference between "good enough" and "correct." The right approach adapts to the problem. (Art of Scaling Test-Time Compute, 2025)
Contents
  1. How You Use an LLM Today
  2. Change One Number
  3. What You Can Do With a Longer Pipeline
  4. What This Looks Like in Practice
  5. All the Way Down
  6. Why Streaming Matters
  7. What the Research Says (details)
  8. Demo: Vario Evaluating Itself
  9. How This Relates to Other Work
  10. Where This Goes
  11. Glossary — key terms defined

1. How You Use an LLM Today

Here's what happens when you call an LLM:

produce(n=1, model=sonnet) → done

That's a pipeline. A short one. You send a prompt, one model generates one response, and you get text back.

Described in pipeline notation, some constraints become visible that are easy to overlook: one attempt, one model, no scoring, no selection.

This isn't a criticism. For quick questions, a single call is efficient and often good enough. But notice: the pipeline notation makes the constraints explicit. Once you see them, you can change them.

2. Change One Number

What if you changed n from 1 to 5?

produce(n=5, model=sonnet) → ???

Now you have five candidates instead of one. But that creates a new question: which one is best? You need a way to evaluate them. So you add a scoring step:

produce(n=5, model=sonnet) → score(rubric) → ???

Now each candidate has a quality score. But you want one answer, not five. So you add selection:

produce(n=5, model=sonnet) → score(rubric) → reduce(top_1)

That's best-of-n: generate N candidates, score each against a rubric, keep the highest-scoring one. The simplest vario recipe. Nothing was invented — we changed a number and followed the implications.

The key insight: A single LLM call is the degenerate case of a pipeline — produce(n=1) with no scoring and no selection. Every improvement is either changing a parameter or adding a stage. Not a new concept.
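The whole recipe fits in a few lines of plain Python. This is a toy sketch, not vario's API: `generate` and `score` are deterministic stand-ins for a model call and a rubric judge.

```python
import random

def generate(prompt, model="sonnet", seed=0):
    # Stand-in for an LLM call; a real implementation would hit a provider API.
    rng = random.Random(seed)
    return {"text": f"answer {seed} to {prompt!r}", "quality": rng.randint(0, 100)}

def score(candidate):
    # Stand-in rubric scorer; a real one would use a judge model plus a rubric.
    return candidate["quality"]

def best_of_n(prompt, n=5):
    # produce(n) -> score(rubric) -> reduce(top_1)
    candidates = [generate(prompt, seed=i) for i in range(n)]   # produce
    scored = [(score(c), c) for c in candidates]                # score
    return max(scored, key=lambda pair: pair[0])                # reduce(top_1)

top_score, winner = best_of_n("What are the risks?", n=5)
```

Changing `n` is the only knob here; everything else follows from it, which is the point of the section.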

There's a second insight hiding here: explicit structure can be reasoned over. When the process is a pipeline — not a monolithic prompt hoping the LLM "figures it out" — you can inspect it, optimize it, and run thousands of items through it. An LLM that's told "generate five options, pick the best, and refine it" in a single prompt might do something vaguely like that. An explicit pipeline guarantees five candidates get generated, guarantees each gets scored against the rubric, and guarantees selection happens. The structure is auditable, repeatable, and the same pipeline that works on one question works on ten thousand.

What does this cost? About $0.03 and five seconds. The winning candidate typically scores 15–25 points higher than the median on a 0–100 rubric — because models make different tradeoffs on each generation, and selecting the best is better than hoping for the best.

3. What You Can Do With a Longer Pipeline

A longer pipeline lets you do things that humans naturally do when the stakes are high — but that a single LLM call can't. Each is a recognizable activity, not an abstraction:

Brainstorm

Generate many candidates in parallel, across multiple models. Different models have different training data, different biases, different strengths. You get genuine diversity, not five rephrases of the same idea.

produce(n=5, models=[sonnet, gemini, grok]) → all candidates

Research note: Heterogeneous multi-model generation produces 4–6% accuracy gains over single-model, and 30% fewer factual errors. The diversity is real, not cosmetic.
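A minimal sketch of the fan-out, assuming a hypothetical `call_model` stand-in rather than real provider clients: the same prompt goes to every model concurrently, and all candidates come back for scoring.

```python
import asyncio

MODELS = ["sonnet", "gemini", "grok"]  # model names from the text; stand-ins here

async def call_model(model, prompt):
    # Stand-in for a real provider call; returns (model, text).
    await asyncio.sleep(0)
    return model, f"{model} answer to {prompt!r}"

async def brainstorm(prompt, models=MODELS, n_per_model=2):
    # produce(n, models=[...]): fan the same prompt across models in parallel.
    calls = [call_model(m, prompt) for m in models for _ in range(n_per_model)]
    return await asyncio.gather(*calls)

candidates = asyncio.run(brainstorm("Name three acquisition risks"))
```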

Review

Score each candidate against explicit criteria. A different model acts as judge — this is critical. The TLDR above explains why: same-model self-critique adds almost nothing (+1.8%), but cross-model structured review is transformative (+80%).

produce(n=5) → score(rubric=[correctness, thoroughness, clarity])

Now each candidate has a quality signal. You can select the best, or use the scores as feedback for revision.

Select the best

Pick the winner. Or take a vote. Or combine the best parts. This is where the +19 accuracy points come from — not from generating better, but from selecting better.

produce(n=5) → score(rubric) → reduce(top_1)

Refine

Take the scoring feedback and use it to improve the output. Then score again. Repeat until quality converges or budget runs out. This is the editorial loop — draft, get feedback, revise — automated with explicit quality criteria.

produce → repeat(score → revise, until=converged)

Research note: Only works when the feedback comes from a different model or structured rubric. "Try again" without specific feedback is near-zero value.
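The loop can be sketched directly. The stand-in `score` and `revise` functions below are toys, but the convergence test (stop when a round improves the score by less than a threshold) is the real mechanism.

```python
def refine_until_converged(draft, score, revise, min_improvement=1.0, max_rounds=10):
    # repeat(score -> revise, until=converged): stop once a revision
    # improves the score by less than min_improvement, or the round
    # budget runs out.
    current, current_score = draft, score(draft)
    for _ in range(max_rounds):
        revised = revise(current, current_score)
        revised_score = score(revised)
        if revised_score - current_score < min_improvement:
            break  # converged
        current, current_score = revised, revised_score
    return current, current_score

# Toy stand-ins: quality is length, capped at 55; each revision adds detail.
toy_score = lambda text: min(len(text), 55)
toy_revise = lambda text, s: text + "!"
final, final_score = refine_until_converged("x" * 50, toy_score, toy_revise)
```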

Evaluate at scale

Run the same analysis across a corpus. 50 documents, 3 models, systematic comparison. Reveals tradeoffs that are invisible from a single model on a single document — like which model extracts more claims but at lower precision.

source(documents/) → fan_out(models) → task(analyze) → evaluate(vs_reference)
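Stripped of streaming and concurrency, the corpus evaluation above is a nested loop over documents and models. A toy sketch with stand-in `analyze` and `evaluate` functions:

```python
import tempfile
from pathlib import Path

def evaluate_at_scale(doc_dir, models, analyze, evaluate):
    # source(documents/) -> fan_out(models) -> task(analyze) -> evaluate(vs_reference)
    results = []
    for doc in sorted(Path(doc_dir).glob("*.txt")):        # source
        text = doc.read_text()
        for model in models:                               # fan_out
            analysis = analyze(text, model)                # task
            results.append((doc.name, model, evaluate(analysis)))  # evaluate
    return results

# Toy demo: two documents, two "models", evaluation = claim count.
with tempfile.TemporaryDirectory() as d:
    for name in ("a.txt", "b.txt"):
        Path(d, name).write_text("claim one. claim two.")
    rows = evaluate_at_scale(
        d, ["sonnet", "haiku"],
        analyze=lambda text, m: text.split("."),
        evaluate=lambda claims: sum(bool(c.strip()) for c in claims))
```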

Search deep

Each pipeline stage can itself contain a pipeline (Section 5). A search pipeline can spawn sub-searches on promising branches, each with its own scoring and selection. How deep to go is controlled by budget and convergence — hard problems get more compute, easy ones resolve quickly.

These are all compositions of the same nine ops: produce, score, revise, reduce, source, fan_out, task, evaluate, repeat. Each op is an async generator that transforms a stream of Items (Items in, Items out). No special machinery for each pattern — just different arrangements of the same pieces.

4. What This Looks Like in Practice

Quick quality boost: best-of-n

$ vario run "What are the risks of acquiring Acme Corp?" -r best_of_n -i acme_10k.pdf

  Results: 1 thing(s)
  [1]  score=91  model=sonnet  "Three primary risk categories emerge..."

  Stage            In  Out   Tokens      Cost     Latency
  produce           0    5    2,845    $0.021       3.2s
  score             5    5    1,230    $0.004       1.1s
  reduce            5    1        0    $0.000       0.0s

  Budget: $0.025 spent / $0.05 max

Five candidates generated, each scored on thoroughness and accuracy. Winner: score 91. Runner-up scored 84 and missed regulatory risk entirely. You see the difference because there is a comparison.

Nuanced decision: model debate

$ vario run "Is this founder's pivot history a red flag?" -i timeline.md -r model_debate

Four models generate independently. Each gets scored on reasoning depth and evidence quality. In one real run, Opus focused on the pattern (pivoting before testing product-market fit) while Gemini focused on the outcome (each pivot moved upmarket). Scoring identified which analysis was more actionable — not just which model sounded more confident.

High-stakes output: iterative refinement

Round 1: produce(3) → score → best=72 ("misses edge cases")
   ↓ revise with feedback
Round 2: score → best=81 ("stronger, regulatory risk understated")
   ↓ revise with feedback
Round 3: score → best=88 ("comprehensive")
   ↓ revise with feedback
Round 4: score → best=89 (improvement < 1.0 → converged, stop)

The repeat op handles the loop with stop conditions: score threshold, minimum improvement, drift detection, budget cap. The system decides when to stop — you specify quality criteria, not iteration counts.
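A sketch of how such stop conditions might compose. Drift detection is omitted, and the names here are illustrative, not vario's actual parameters:

```python
def should_stop(scores, spent=0.0, budget=0.05,
                threshold=90, min_improvement=1.0):
    # Stop conditions named in the text: score threshold, minimum
    # improvement, budget cap (drift detection omitted).
    if scores and scores[-1] >= threshold:
        return True                       # quality is good enough
    if len(scores) >= 2 and scores[-1] - scores[-2] < min_improvement:
        return True                       # converged: improvement too small
    return spent >= budget                # out of budget
```

Any one condition firing ends the loop, so the caller specifies quality criteria rather than iteration counts.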

Evaluate at scale

source(documents/) → fan_out(models=[sonnet, haiku, grok]) → task(extract_claims) → evaluate(vs_reference)

50 documents × 3 models = 150 evaluations. Reveals that Sonnet extracts 18% more claims than Haiku but at lower precision — a tradeoff invisible from a single model on a single document.

Large jobs: effort allocation beyond context windows

The same pipeline that improves one answer scales to thousands. Each item is processed independently — the pipeline doesn't hold the whole corpus in memory, so there's no context window limit on the job. A 10,000-document extraction runs the same way as a 10-document one.

What makes this more than a for-loop: budget enforcement across the whole job rather than per call, results streaming out as items complete, and per-item provenance and cost tracking.

This is where vario diverges most from "call the API five times." It's not just about making one answer better — it's infrastructure for running LLM-heavy jobs at scale with the same quality controls and observability you'd want for any production data pipeline.

5. All the Way Down

The task op takes any async function and makes it a pipeline stage. If that function internally runs a vario pipeline, the outer pipeline doesn't know or care. Items in, Items out.

This means vario pipelines nest:

Outer: source(questions) → fan_out([baseline, best_of_n]) → task(run_strategy) → evaluate
   ↓
Inner (when strategy=best_of_n): produce(n=5) → score → reduce(top_1)
   ↓
Inner scoring: produce(n=1, model=haiku) → done

vario ab already works this way: the outer evaluation pipeline calls call_with_strategy for each question, which runs an inner recipe pipeline for the strategy arm. Two levels deep, and there's no limit.
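The mechanism is small enough to sketch: a `task` wrapper turns any callable into a stage, and the callable is free to run a whole sub-pipeline inside. Toy code, not vario's implementation:

```python
def run_pipeline(stages, items):
    # Chain stages: each takes an iterable of Items, yields Items.
    for stage in stages:
        items = stage(items)
    return list(items)

def task(fn):
    # task op: wrap any callable as a stage; the outer pipeline
    # doesn't know whether fn runs a whole sub-pipeline internally.
    return lambda items: (fn(item) for item in items)

def inner_best_of_n(question, n=3):
    # An inner pipeline hiding behind a task stage:
    # produce(n) -> score -> reduce(top_1), with a toy score.
    candidates = [f"{question} :: answer {i}" for i in range(n)]
    return max(candidates, key=lambda c: int(c.rsplit(" ", 1)[-1]))

answers = run_pipeline([task(inner_best_of_n)], ["q1", "q2"])
```

From the outer pipeline's point of view, `inner_best_of_n` is just a function over Items; nesting costs nothing structurally.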

This nesting opens up patterns that are hard to build ad hoc:

Eval any LLM job
Wrap any function in task, score the output, compare strategies. The function could be a RAG pipeline, an agent loop, a code generator — anything callable.
Deep search
A steer op spawns sub-pipelines to explore promising branches. Each branch is a full pipeline that can itself branch.
Meta-optimization
Outer pipeline A/B tests which inner recipe works best for a problem type. The system learns which process to use, not just which answer to give.
Recursive refinement
A refinement round could use best-of-n internally — each revision candidate is itself the winner of a sub-pipeline.

The key property: every level gets the same observability. Inner pipelines produce traces, costs, and provenance that roll up to the outer level. You can audit the full tree.

6. Why Streaming Matters

A detail that changes the user experience: vario pipelines are streaming, not batched.

Batch (traditional)

Generate all 5 candidates. Wait. Score all 5. Wait. Reduce. Return.

You see nothing until everything is done.

Streaming (vario)

Sonnet finishes first — score it immediately. Gemini finishes — score it. Best answer updates in real time.

Fast models yield results while slow models are still thinking.

Technically: every op is an async generator. Functions yield values one at a time as they become available, enabling pipeline processing without waiting for all data. produce yields Items as models finish (as_completed). score evaluates each Item as it arrives. The Heap, an anytime priority queue, keeps the best answer always accessible — you can read it or display it at any point, even while the pipeline is still running.

Only reduce is a barrier (it needs all Items to make a selection). Everything before it streams.
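A stripped-down sketch of the streaming architecture, using toy ops and simulated latency rather than real model calls:

```python
import asyncio
import random

async def produce(n=3):
    # Yield Items as simulated "models" finish; fastest first.
    async def call(i):
        await asyncio.sleep(random.random() * 0.01)  # varying latency
        return {"id": i, "text": f"candidate {i}"}
    tasks = [asyncio.create_task(call(i)) for i in range(n)]
    for fut in asyncio.as_completed(tasks):
        yield await fut

async def score(items):
    # Score each Item the moment it arrives; no batch barrier.
    async for item in items:
        item["score"] = item["id"]  # toy score
        yield item

async def reduce_top1(items):
    # The only barrier: selection needs every Item.
    best = None
    async for item in items:
        if best is None or item["score"] > best["score"]:
            best = item
    return best

winner = asyncio.run(reduce_top1(score(produce(n=3))))
```

Because each stage is an async generator, an Item flows through produce and score as soon as it exists; only the final reduce waits for the full set.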

7. What the Research Says

The patterns above aren't speculative — each has been studied. Here are the details behind the TLDR.

Generate → judge → select: the strongest pattern

Best-of-N with a verifier dramatically outperforms single-shot on reasoning. AIME: pass@1 = 74%, reranked@1000 = 93%. A 19-point lift from selection alone. (OpenAI o1 system card)
Compute-optimal test-time scaling beats naive best-of-N by 4×. Allocating more attempts to harder problems is 4× more efficient than uniform sampling. (DeepMind + UC Berkeley, ICLR 2025)
Smaller model + best-of-N + verifier matches a larger model at lower cost. Pareto-optimal tradeoff: a cheap model with retries beats an expensive model single-shot. (Inference Scaling Laws, ICLR 2025)
SWE-bench: pass@5 significantly exceeds pass@1 on hard real-world code tasks. Frontier models drop from 70%+ (Verified) to <25% (Pro); the multi-attempt gap grows with difficulty. (SWE-bench Verified/Pro, 2025)
Majority vote closes the last gap on near-perfect models. AIME 2025: 99.5% pass@1, 100% consensus@8. Eight samples + voting = perfect. (OpenAI o4-mini)

This maps directly to brainstorm → review → select. The research confirms it's the single most effective pattern, and that the gains scale with task difficulty.

Refinement: only with external feedback

Self-refinement without external feedback is near-useless. Gemini 2.5 Pro: +1.8% after 5 iterations; DeepSeek-R1: −0.1% (got worse). (RefineBench, NVIDIA, Nov 2025)
Guided refinement with structured feedback yields +80% gains. Checklist-based feedback reaches near-perfect scores within 5 turns. (Same paper)
Extended reasoning hurts on tasks with distractors. More thinking amplifies noise on certain task structures. (Anthropic: Inverse Scaling in Test-Time Compute, 2025)

This is why the refine pattern uses a different model as judge. "Try again" is theater; structured cross-model feedback is transformative. The bottleneck is error identification, not error repair.

Debate: simple voting wins

Majority voting alone captures most gains attributed to multi-agent debate. (NeurIPS 2025 Spotlight: "Debate or Vote")
Multi-Agent Debate underperforms self-consistency on 7/9 benchmarks. (ICLR 2025 MAD evaluation)
Best-of-N degrades at high N (>16): the verifier gets gamed. (ICML 2025)

This shapes the brainstorm pattern: parallel independent generation + scoring beats complex debate protocols. Vario's default N (5) sits in the empirically validated sweet spot of 3–8.

When to skip multi-attempt

The gains scale with task difficulty. Easy tasks: skip. Hard tasks: multi-attempt is the difference between "good enough" and "correct."

8. Demo: Vario Evaluating Itself

The most direct way to measure whether vario improves LLM output: run a benchmark with and without it.

$ vario ab math haiku --strategy best_of_n --sample 5 --report

  Phase 1: BASELINE (single-shot) — 35 questions
  baseline: 24/35 = 68.6%

  Phase 2: STRATEGY (best_of_n) — 35 questions
  strategy: 28/35 = 80.0%

  ┌──────────┬──────────┬──────────┬─────────┐
  │ Metric   │ Baseline │ Strategy │ Delta   │
  ├──────────┼──────────┼──────────┼─────────┤
  │ Accuracy │ 68.6%    │ 80.0%    │ +11.4pp │
  │ Cost     │ $0.0082  │ $0.0387  │ 4.7x    │
  │ Latency  │ 1,204ms  │ 3,891ms  │ 3.2x    │
  └──────────┴──────────┴──────────┴─────────┘

  Wins: 6 | Losses: 2 | Net: +4

Vario runs the comparison through its own pipeline — loading benchmark questions, calling call_with_strategy for each arm, checking correctness against gold answers. The system evaluating itself with its own tools.

The output is a comparison.json + optional HTML report. Every number is traceable to specific questions that flipped from wrong to right (or right to wrong).

Note: the numbers above are placeholders. Reproduce with actuals via: vario ab math haiku --strategy best_of_n --sample 5 --report

9. How This Relates to Other Work

Vario doesn't compete with most LLM tools — it composes with them. The distinction is what layer each operates at:

DSPy (prompt optimization): complementary. A DSPy-optimized prompt works better inside a vario pipeline.
LMQL / Guidance (constrained generation): complementary. Constrain structure, then use vario to select the best among valid outputs.
LangChain (integration framework): different goal. LangChain connects LLMs to tools; vario makes LLM output quality better.
Best-of-N sampling (single technique): subsumed. Best-of-N is one recipe; vario generalizes to refinement, debate, corpus processing.
RLHF / Constitutional AI (model training): different phase. Training improves the baseline; vario improves the ceiling at inference time.

The simplest way to think about it: vario operates at the process layer, between the model (fixed) and the application (specific). It makes any model's output better through systematic generation, evaluation, and selection — without training, fine-tuning, or changing the model itself.

10. Where This Goes

Near-term: the recipe disappears

Today you choose a recipe: best_of_n, model_debate, refine_until_converged. Near-term, the system chooses for you. You describe the problem and specify quality criteria; vario designs the pipeline. This already works in prototype:

$ vario run "Evaluate this acquisition" -r "try multiple perspectives, verify facts"
  Designed: model_debate_with_verify — "Multi-model generation + factual verification pass"

Medium-term: the pipeline adapts mid-run

A steer op observes pipeline state and decides what to do next. If all candidates scored >90, skip refinement. If scores are bimodal, split into two debate tracks. If budget is running low, switch to cheaper models for remaining stages. The pipeline becomes adaptive, not just sequential.
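What a steer decision might look like as code. This is speculative by the section's own framing; the state keys and return values are invented for illustration, and the bimodal-split case is omitted:

```python
def steer(state):
    # Hypothetical mid-run controller, following the examples above:
    # inspect pipeline state, decide the next move.
    scores = state.get("scores", [])
    if scores and min(scores) > 90:
        return "skip_refinement"           # every candidate already strong
    if state.get("spent", 0) / state["budget"] > 0.8:
        return "switch_to_cheaper_models"  # budget running low
    return "continue"
```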

North star: specify outcomes, not process

You say: "I need a thorough analysis of this company. Budget: $2. Confidence threshold: 85." Vario designs the process, selects the models, runs the pipeline, detects when quality is sufficient, and delivers a result with full provenance. The recipe is emergent — not hand-authored.

Why this compounds: Every scored run generates data about which models and processes work best for which problem types. The system gets better at designing processes as it accumulates execution history. You shift from crafting prompts to specifying quality criteria — a more natural way to direct AI work.

11. Glossary

Item
Vario's universal datum. Content (the text) + accumulated properties (model, score, cost) + provenance history tracking what each pipeline stage added.
Op
A streaming function that transforms Items. Nine built-in: produce, score, revise, reduce, source, fan_out, task, evaluate, repeat.
Recipe
A named pipeline configuration: which ops, in what order, with what parameters. Defined in YAML, or auto-designed from natural language.
Best-of-N
The simplest recipe: produce N candidates, score each, keep the best. produce(n=5) → score → reduce(top_1).
Heap
An anytime priority queue. Items are sorted by score as they arrive; the best answer is always readable, even while the pipeline is still running.
Async Generator
Python pattern where functions yield values one at a time as they become available. Enables pipeline processing without waiting for all data.
Provenance
The audit trail of an Item: which model generated it, what score it received and why, how many rounds of revision, and what each stage cost.

Vario is part of Rivus. Built by Tim Chklovski, 2025–2026.
