Every LLM call you make today is already a pipeline — just a very short one. Vario makes the pipeline longer, and everything interesting follows.
TLDR: five findings shape how vario works. Details and citations in Section 7.
Here's what happens when you call an LLM:
That's a pipeline. A short one. You send a prompt, one model generates one response, and you get text back.
Write this in pipeline notation and some things become visible that are easy to overlook:
This isn't a criticism. For quick questions, a single call is efficient and often good enough. But notice: the pipeline notation makes the constraints explicit. Once you see them, you can change them.
What if you changed n from 1 to 5?
Now you have five candidates instead of one. But that creates a new question: which one is best? You need a way to evaluate them. So you add a scoring step:
Now each candidate has a quality score. But you want one answer, not five. So you add selection:
That's best-of-n: generate N candidates, score each against a rubric, keep the highest-scoring one. It's the simplest vario recipe. Nothing was invented; we changed a number and followed the implications.
A single call is produce(n=1) with no scoring and no selection. Every improvement is either changing a parameter or adding a stage, not a new concept.

There's a second insight hiding here: explicit structure can be reasoned over. When the process is a pipeline, not a monolithic prompt hoping the LLM "figures it out", you can inspect it, optimize it, and run thousands of items through it. An LLM that's told "generate five options, pick the best, and refine it" in a single prompt might do something vaguely like that. An explicit pipeline guarantees five candidates get generated, guarantees each gets scored against the rubric, and guarantees selection happens. The structure is auditable and repeatable, and the same pipeline that works on one question works on ten thousand.
What does this cost? About $0.03 and five seconds. The winning candidate typically scores 15–25 points higher than the median on a 0–100 rubric — because models make different tradeoffs on each generation, and selecting the best is better than hoping for the best.
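The best-of-n shape can be sketched in a few lines. This is a toy model of the recipe, not vario's API: generate, rubric_score, and the deterministic pseudo-scoring rule are all stand-ins for real model calls and a real rubric.

```python
import asyncio
import hashlib

async def generate(prompt: str, i: int) -> str:
    # Stand-in for an LLM call; a real pipeline would hit a model API here.
    await asyncio.sleep(0)
    return f"{prompt} :: candidate {i}"

def rubric_score(candidate: str) -> int:
    # Stand-in rubric: deterministic pseudo-score in 0..100.
    return int(hashlib.md5(candidate.encode()).hexdigest(), 16) % 101

async def best_of_n(prompt: str, n: int = 5) -> tuple[int, str]:
    # produce(n) -> score -> reduce(top_1)
    candidates = await asyncio.gather(*(generate(prompt, i) for i in range(n)))
    scored = [(rubric_score(c), c) for c in candidates]
    return max(scored)  # (best score, winning candidate)

winner = asyncio.run(best_of_n("What are the risks of acquiring Acme Corp?"))
```

Swapping the stubs for real model calls and a real judge changes nothing about the shape: five candidates, five scores, one winner.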
A longer pipeline lets you do things that humans naturally do when the stakes are high — but that a single LLM call can't. Each is a recognizable activity, not an abstraction:
Generate many candidates in parallel, across multiple models. Different models have different training data, different biases, different strengths. You get genuine diversity, not five rephrases of the same idea.
Research note: Heterogeneous multi-model generation produces 4–6% accuracy gains over single-model, and 30% fewer factual errors. The diversity is real, not cosmetic.
Score each candidate against explicit criteria. A different model acts as judge — this is critical. The TLDR above explains why: same-model self-critique adds almost nothing (+1.8%), but cross-model structured review is transformative (+80%).
Now each candidate has a quality signal. You can select the best, or use the scores as feedback for revision.
Pick the winner. Or take a vote. Or combine the best parts. This is where the +19 accuracy points come from — not from generating better, but from selecting better.
Take the scoring feedback and use it to improve the output. Then score again. Repeat until quality converges or budget runs out. This is the editorial loop — draft, get feedback, revise — automated with explicit quality criteria.
Research note: Only works when the feedback comes from a different model or structured rubric. "Try again" without specific feedback is near-zero value.
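A toy sketch of what guided refinement looks like under these findings: the judge returns the specific rubric items still missing (the actionable feedback), and the reviser addresses one per round. All names, the 10-points-per-item scoring, and the substring-matching "judge" are invented for illustration.

```python
def review(answer: str, rubric: list[str]) -> tuple[int, list[str]]:
    # Stand-in for a cross-model judge: 10 points per rubric item covered,
    # plus the list of items still missing (the actionable feedback).
    missing = [item for item in rubric if item not in answer]
    return 10 * (len(rubric) - len(missing)), missing

def refine_until(answer: str, rubric: list[str],
                 threshold: int = 30, max_rounds: int = 5) -> tuple[str, int]:
    # Draft -> feedback -> revise, repeated until quality converges
    # or the round budget runs out.
    for _ in range(max_rounds):
        score, missing = review(answer, rubric)
        if score >= threshold or not missing:
            break
        answer += " " + missing[0]  # stand-in reviser: fix one gap per round
    return answer, review(answer, rubric)[0]

final, score = refine_until("draft", ["risk", "cost", "timeline"])
```

The point of the structure: the loop only makes progress because review names what is missing. Replace the feedback with a bare "try again" and the reviser has nothing to act on.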
Run the same analysis across a corpus. 50 documents, 3 models, systematic comparison. Reveals tradeoffs that are invisible from a single model on a single document — like which model extracts more claims but at lower precision.
Each pipeline stage can itself contain a pipeline (Section 5). A search pipeline can spawn sub-searches on promising branches, each with its own scoring and selection. How deep to go is controlled by budget and convergence — hard problems get more compute, easy ones resolve quickly.
These are all compositions of the same nine ops (streaming operations that transform Items; each op is an async generator: Items in, Items out): produce, score, revise, reduce, source, fan_out, task, evaluate, repeat. No special machinery for each pattern, just different arrangements of the same pieces.
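The ops-as-async-generators idea can be sketched in a few lines. The Item shape (a dict) and the length-based scoring rule are placeholders, not vario's implementation; what matters is the composition.

```python
import asyncio

# Each op is an async generator: Items in -> Items out.

async def source(items):
    for item in items:
        yield item

async def score(stream):
    async for item in stream:
        yield {**item, "score": len(item["text"])}  # stand-in rubric

async def reduce_top1(stream):
    # The only barrier: must see every Item before selecting.
    best = None
    async for item in stream:
        if best is None or item["score"] > best["score"]:
            best = item
    yield best

async def main():
    pipeline = reduce_top1(score(source([{"text": "ok"}, {"text": "better"}])))
    return [item async for item in pipeline]

result = asyncio.run(main())
```

Because each stage consumes the previous stage's generator, rearranging ops is just rearranging function composition.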
```
$ vario run "What are the risks of acquiring Acme Corp?" -r best_of_n -i acme_10k.pdf

Results: 1 thing(s)
[1] score=91 model=sonnet "Three primary risk categories emerge..."

Stage    In  Out  Tokens  Cost    Latency
produce   0    5   2,845  $0.021  3.2s
score     5    5   1,230  $0.004  1.1s
reduce    5    1       0  $0.000  0.0s

Budget: $0.025 spent / $0.05 max
```
Five candidates generated, each scored on thoroughness and accuracy. Winner: score 91. Runner-up scored 84 and missed regulatory risk entirely. You see the difference because there is a comparison.
```
$ vario run "Is this founder's pivot history a red flag?" -i timeline.md -r model_debate
```
Four models generate independently. Each gets scored on reasoning depth and evidence quality. In one real run, Opus focused on the pattern (pivoting before testing product-market fit) while Gemini focused on the outcome (each pivot moved upmarket). Scoring identified which analysis was more actionable — not just which model sounded more confident.
The repeat op handles the loop with stop conditions: score threshold, minimum improvement, drift detection, budget cap. The system decides when to stop — you specify quality criteria, not iteration counts.
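A sketch of what such stop conditions might look like collapsed into a single predicate. The names, defaults, and the omission of drift detection are illustrative, not vario's actual API.

```python
def should_stop(scores, spent, *, threshold=90, min_gain=2.0, budget=0.05):
    # Stop conditions in the spirit of the text: score threshold reached,
    # improvement between rounds below min_gain, or budget cap hit.
    if scores and scores[-1] >= threshold:
        return "threshold"
    if len(scores) >= 2 and scores[-1] - scores[-2] < min_gain:
        return "converged"
    if spent >= budget:
        return "budget"
    return None  # keep iterating
```

The caller specifies quality criteria (threshold, min_gain, budget); the loop decides when to stop.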
```
source(documents/) → fan_out(models=[sonnet, haiku, grok]) → task(extract_claims) → evaluate(vs_reference)
```
50 documents × 3 models = 150 evaluations. Reveals that Sonnet extracts 18% more claims than Haiku but at lower precision — a tradeoff invisible from a single model on a single document.
The same pipeline that improves one answer scales to thousands. Each item is processed independently — the pipeline doesn't hold the whole corpus in memory, so there's no context window limit on the job. A 10,000-document extraction runs the same way as a 10-document one.
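The constant-memory property can be sketched with a fan_out-style op over a document stream. This is an illustrative toy (op names follow the text; the work items are never materialized as a list, and only a counter is held in memory).

```python
import asyncio

async def docs(n):
    # Streams documents one at a time; the corpus is never held in memory.
    for i in range(n):
        yield {"doc": f"doc_{i}.txt"}

async def fan_out(stream, models):
    # Cross-product op: each incoming item becomes one work item per model.
    async for item in stream:
        for model in models:
            yield {**item, "model": model}

async def main():
    count = 0
    async for work in fan_out(docs(50), ["sonnet", "haiku", "grok"]):
        count += 1  # each work item is processed independently as it arrives
    return count

n_work_items = asyncio.run(main())
```

Replace docs(50) with docs(10_000) and nothing else changes: the generator still yields one document at a time.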
What makes this more than a for-loop:
This is where vario diverges most from "call the API five times." It's not just about making one answer better — it's infrastructure for running LLM-heavy jobs at scale with the same quality controls and observability you'd want for any production data pipeline.
The task op takes any async function and makes it a pipeline stage. If that function internally runs a vario pipeline, the outer pipeline doesn't know or care. Items in, Items out.
This means vario pipelines nest:
vario ab already works this way: the outer evaluation pipeline calls call_with_strategy for each question, which runs an inner recipe pipeline for the strategy arm. Two levels deep, and there's no limit.
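The nesting property can be sketched with a hypothetical task op wrapping a function that runs its own mini best-of-n. All names here are illustrative; the point is that the outer level sees only Items in, Items out.

```python
import asyncio

async def task(stream, fn):
    # task op: wrap any async function as a pipeline stage. If fn internally
    # runs its own pipeline, this level doesn't know or care.
    async for item in stream:
        yield await fn(item)

async def inner_best_of_n(item):
    # A nested "pipeline": generate three stub candidates, keep the max.
    candidates = [f"{item['q']}#{i}" for i in range(3)]
    return {"q": item["q"], "answer": max(candidates)}

async def questions():
    yield {"q": "alpha"}
    yield {"q": "beta"}

async def main():
    return [item async for item in task(questions(), inner_best_of_n)]

out = asyncio.run(main())
```

Since inner_best_of_n is just an awaitable, it could equally be a full vario pipeline, an agent loop, or a RAG call; the outer pipeline's contract doesn't change.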
This nesting opens up patterns that are hard to build ad hoc:
| Pattern | How it nests |
|---|---|
| Eval any LLM job | Wrap any function in task, score the output, compare strategies. The function could be a RAG pipeline, an agent loop, a code generator — anything callable. |
| Deep search | A steer op spawns sub-pipelines to explore promising branches. Each branch is a full pipeline that can itself branch. |
| Meta-optimization | Outer pipeline A/B tests which inner recipe works best for a problem type. The system learns which process to use, not just which answer to give. |
| Recursive refinement | A refinement round could use best-of-n internally — each revision candidate is itself the winner of a sub-pipeline. |
The key property: every level gets the same observability. Inner pipelines produce traces, costs, and provenance that roll up to the outer level. You can audit the full tree.
A detail that changes the user experience: vario pipelines are streaming, not batched.
Batched: generate all 5 candidates. Wait. Score all 5. Wait. Reduce. Return. You see nothing until everything is done.

Streaming: Sonnet finishes first; it's scored immediately. Gemini finishes; it's scored. The best answer updates in real time. Fast models yield results while slow models are still thinking.
Technically: every op is an async generator (Python's async generator pattern: functions yield values one at a time as they become available, enabling pipeline processing without waiting for all data). produce yields Items as models finish (as_completed). score evaluates each Item as it arrives. The Heap, an anytime priority queue, keeps the best answer always accessible: you can read it or display it at any point, even while the pipeline is still running.
Only reduce is a barrier (it needs all Items to make a selection). Everything before it streams.
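The anytime-heap behavior can be sketched with asyncio.as_completed and heapq. Model names, latencies, and scores below are made up; the snapshot list shows the best answer being readable after every arrival, mid-run.

```python
import asyncio
import heapq

async def model_call(name, delay, score):
    await asyncio.sleep(delay)  # stand-in latency for each model
    return score, name

async def main():
    heap = []            # max-heap via negated scores
    best_over_time = []  # what a live display would show after each arrival
    calls = [("sonnet", 0.01, 84), ("opus", 0.02, 91), ("haiku", 0.03, 72)]
    tasks = [asyncio.ensure_future(model_call(*c)) for c in calls]
    for fut in asyncio.as_completed(tasks):
        score, name = await fut
        heapq.heappush(heap, (-score, name))
        best_over_time.append(heap[0][1])  # best answer, readable mid-run
    return best_over_time

best_over_time = asyncio.run(main())
```

Sonnet arrives first and leads; opus arrives second and takes over; haiku arrives last and changes nothing. A reader polling the heap at any moment gets the best answer so far.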
The patterns above aren't speculative — each has been studied. Here are the details behind the TLDR.
| Finding | Source | Numbers |
|---|---|---|
| Best-of-N with a verifier dramatically outperforms single-shot on reasoning | OpenAI o1 system card | AIME: pass@1 = 74%, reranked@1000 = 93%. A 19-point lift from selection alone. |
| Compute-optimal test-time scaling beats naive best-of-N by 4× | DeepMind + UC Berkeley, ICLR 2025 | Allocating more attempts to harder problems is 4× more efficient than uniform sampling. |
| Smaller model + best-of-N + verifier matches larger model at lower cost | Inference Scaling Laws, ICLR 2025 | Pareto-optimal tradeoff: cheap model with retries beats expensive model single-shot. |
| SWE-bench: pass@5 significantly exceeds pass@1 on hard real-world code tasks | SWE-bench Verified/Pro, 2025 | Frontier models drop from 70%+ (Verified) to <25% (Pro). Multi-attempt gap grows with difficulty. |
| Majority vote closes the last gap on near-perfect models | OpenAI o4-mini | AIME 2025: 99.5% pass@1, 100% consensus@8. Eight samples + voting = perfect. |
This maps directly to brainstorm → review → select. The research confirms it's the single most effective pattern, and that the gains scale with task difficulty.
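The consensus@N pattern from the last table row reduces to a few lines: a generic majority vote, not OpenAI's or vario's implementation.

```python
from collections import Counter

def consensus(samples):
    # consensus@N: majority vote over N independent samples.
    return Counter(samples).most_common(1)[0][0]

answer = consensus(["42", "42", "41", "42", "43", "42", "42", "42"])
```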
| Finding | Source | Numbers |
|---|---|---|
| Self-refinement without external feedback is near-useless | RefineBench, NVIDIA, Nov 2025 | Gemini 2.5 Pro: +1.8% after 5 iterations. DeepSeek-R1: −0.1% (got worse). |
| But guided refinement with structured feedback: +80% gains | Same paper | Checklist-based feedback → near-perfect scores within 5 turns. |
| Extended reasoning hurts on tasks with distractors | Anthropic: Inverse Scaling in Test-Time Compute, 2025 | More thinking amplifies noise on certain task structures. |
This is why the refine pattern uses a different model as judge. "Try again" is theater; structured cross-model feedback is transformative. The bottleneck is error identification, not error repair.
| Finding | Source |
|---|---|
| Majority voting alone captures most gains attributed to multi-agent debate | NeurIPS 2025 Spotlight: "Debate or Vote" |
| Multi-Agent Debate underperforms self-consistency on 7/9 benchmarks | ICLR 2025 MAD evaluation |
| Best-of-N degrades at high N (>16) — verifier gets gamed | ICML 2025 |
This shapes the brainstorm pattern: parallel independent generation plus scoring beats complex debate protocols. Vario's default N (5) sits in the empirically validated sweet spot of 3–8.
The gains scale with task difficulty. Easy tasks: skip. Hard tasks: multi-attempt is the difference between "good enough" and "correct."
The most direct way to measure whether vario improves LLM output: run a benchmark with and without it.
```
$ vario ab math haiku --strategy best_of_n --sample 5 --report

Phase 1: BASELINE (single-shot) — 35 questions
baseline: 24/35 = 68.6%
Phase 2: STRATEGY (best_of_n) — 35 questions
strategy: 28/35 = 80.0%

┌──────────┬──────────┬──────────┬─────────┐
│ Metric   │ Baseline │ Strategy │ Delta   │
├──────────┼──────────┼──────────┼─────────┤
│ Accuracy │ 68.6%    │ 80.0%    │ +11.4pp │
│ Cost     │ $0.0082  │ $0.0387  │ 4.7x    │
│ Latency  │ 1,204ms  │ 3,891ms  │ 3.2x    │
└──────────┴──────────┴──────────┴─────────┘

Wins: 6 | Losses: 2 | Net: +4
```
Vario runs the comparison through its own pipeline — loading benchmark questions, calling call_with_strategy for each arm, checking correctness against gold answers. The system evaluating itself with its own tools.
The output is a comparison.json + optional HTML report. Every number is traceable to specific questions that flipped from wrong to right (or right to wrong).
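Tracing the aggregate delta down to flipped questions can be sketched as follows. The question ids and the dict-of-booleans shape are illustrative, not comparison.json's actual schema.

```python
def flips(baseline, strategy):
    # Wrong->right under the strategy is a win; right->wrong is a loss.
    # Inputs map question id -> answered correctly? (bool).
    wins = [q for q, ok in baseline.items() if not ok and strategy[q]]
    losses = [q for q, ok in baseline.items() if ok and not strategy[q]]
    return wins, losses

wins, losses = flips(
    baseline={"q1": True, "q2": False, "q3": False, "q4": True},
    strategy={"q1": True, "q2": True, "q3": True, "q4": False},
)
```

This is the traceability property in miniature: the +11.4pp headline decomposes into a named list of questions that changed outcome.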
Vario doesn't compete with most LLM tools; it composes with them. The distinction is what layer each operates at:
| System | Layer | Relationship to vario |
|---|---|---|
| DSPy | Prompt optimization | Complementary — a DSPy-optimized prompt works better inside a vario pipeline |
| LMQL / Guidance | Constrained generation | Complementary — constrain structure, then use vario to select the best among valid outputs |
| LangChain | Integration framework | Different goal — LangChain connects LLMs to tools; vario makes LLM output quality better |
| Best-of-N sampling | Single technique | Subsumed — best-of-N is one recipe; vario generalizes to refinement, debate, corpus processing |
| RLHF / Constitutional AI | Model training | Different phase — training improves the baseline; vario improves the ceiling at inference time |
The simplest way to think about it: vario operates at the process layer, between the model (fixed) and the application (specific). It makes any model's output better through systematic generation, evaluation, and selection — without training, fine-tuning, or changing the model itself.
Today you choose a recipe: best_of_n, model_debate, refine_until_converged. Near-term, the system chooses for you. You describe the problem and specify quality criteria; vario designs the pipeline. This already works in prototype:
```
$ vario run "Evaluate this acquisition" -r "try multiple perspectives, verify facts"

Designed: model_debate_with_verify — "Multi-model generation + factual verification pass"
```
A steer op observes pipeline state and decides what to do next. If all candidates scored >90, skip refinement. If scores are bimodal, split into two debate tracks. If budget is running low, switch to cheaper models for remaining stages. The pipeline becomes adaptive, not just sequential.
Why this compounds: Every scored run generates data about which models and processes work best for which problem types. The system gets better at designing processes as it accumulates execution history. You shift from crafting prompts to specifying quality criteria — a more natural way to direct AI work.
produce(n=5) → score → reduce(top_1).

Vario is part of Rivus. Built by Tim Chklovski, 2025–2026.
Report v2.