A streaming pipeline engine that orchestrates multiple AI models — generating, scoring, debating, and refining — so that complex questions get the systematic treatment they deserve.
You paste a complex question into an AI chat. Maybe it's "Should we acquire this company?" or "What are the failure modes of this architecture?" or "Evaluate this founder's track record." You get back a single answer. It sounds confident. It might be excellent. It might be mediocre. You have no way to tell.
This is the fundamental problem with single-shot LLM usage: you're rolling the dice once and hoping for the best.
Human experts don't work this way. A good analyst writes a draft, gets a second opinion, argues with a colleague, revises their reasoning, and only then shares their conclusion. A good hiring committee doesn't rely on one interviewer. A good investment memo incorporates multiple perspectives and a devil's advocate review.
Yet the standard AI workflow is: prompt → single response → hope for the best.
What if we could give AI the same systematic rigor that makes human expert processes reliable?
Vario turns a single LLM call into a streaming pipeline of generation, scoring, debate, and refinement — so you get the best answer multiple models can produce, not just whatever one model happened to say first.
The analogy: Vario is to a single LLM call what a committee is to a single opinion. It doesn't just ask one model once — it asks several models, scores their answers, has them debate the hard parts, iteratively refines the best candidates, and gives you a ranked result with provenance for every judgment.
Three things make this non-obvious:
Items are plain dicts of props. An Item enters `produce` empty and comes out with a model and cost. When it passes through `score`, it gains a score and reason. Through `revise`, its content improves. No schema enforcement, no type hierarchies, just props accumulating as data flows.

Vario pipelines are chains of ops: streaming operations that transform Items. Each op is an async generator: it consumes Items from upstream and yields Items downstream. Nine built-in ops cover generation, scoring, revision, reduction, and more.
The key insight: ops don't wait. produce fires 5 parallel LLM calls and yields each result the instant it completes. score starts judging the first candidate while produce is still generating the rest. The pipeline is pull-driven — the collector at the end drives execution through the chain.
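That pull-driven streaming shape can be sketched with plain async generators. This is a minimal illustration, not Vario's actual implementation; `uppercase_op` and `source` are hypothetical stand-ins for real ops:

```python
import asyncio
from typing import AsyncIterator

# Hypothetical op: consume Items from upstream, yield transformed Items
# downstream the instant each one is ready.
async def uppercase_op(items: AsyncIterator[dict]) -> AsyncIterator[dict]:
    async for item in items:
        yield {**item, "content": item["content"].upper()}

# Hypothetical upstream source of Items.
async def source() -> AsyncIterator[dict]:
    for text in ["draft one", "draft two"]:
        yield {"content": text}

async def main() -> list[dict]:
    # The collector at the end pulls Items through the chain.
    return [item async for item in uppercase_op(source())]

results = asyncio.run(main())
```

Nothing runs until the collector pulls; each op processes one Item at a time instead of waiting for a full batch.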
| Op | What It Does | Streaming? |
|---|---|---|
| produce | Generate N candidates via parallel LLM calls | Yes — yields as models finish |
| score | Judge quality (numeric score, verification, or both) | Yes — scores each item as it arrives |
| revise | Improve content using feedback from scoring | Yes — revises each item independently |
| reduce | Select or combine: top-k, vote, consensus, synthesis | Barrier — collects all, emits best |
| source | Load corpus items from files, JSONL, or lists | Yes |
| fan_out | Cross-product: each item × each model | Yes |
| task | Call arbitrary Python functions | Yes |
| evaluate | Programmatic evaluation against references | Yes |
| repeat | Loop sub-stages with convergence detection | Barrier per round |
Every pipeline run produces a structured RunLog, a record of what happened at each stage: how many items entered and exited, how many tokens were used, how much it cost, and how long it took. RunLogs are persisted to SQLite for later analysis. This makes every run auditable and every cost predictable.
```
Recipe: best_of_n | Problem: "Evaluate this acquisition target"
Stage 0: produce → 0 in, 5 out | 2,845 tokens | $0.021 | 3.2s
Stage 1: score   → 5 in, 5 out | 1,230 tokens | $0.004 | 1.1s
Stage 2: reduce  → 5 in, 1 out | 0 tokens     | $0.000 | 0.0s
Outcome: 1 item, best score 91 | Total: $0.025 | 4.3s
```
Every Item carries a history — a list of what happened to it at each stage:
```python
item.history = [
    {"stage_id": "stage_0.produce", "added": {"model": "sonnet", "cost": 0.003}},
    {"stage_id": "stage_1.score", "added": {"score": 91, "reason": "thorough risk analysis"}},
]
```
You can always trace why an Item has a particular score, which model generated it, and how much each stage cost. No black boxes.
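Reading that provenance is a simple walk over the history list. A minimal sketch (the `explain` helper is hypothetical, not part of Vario's API):

```python
# Hypothetical helper: render an Item's history as a readable audit trail.
def explain(history: list[dict]) -> str:
    lines = []
    for entry in history:
        added = ", ".join(f"{k}={v!r}" for k, v in entry["added"].items())
        lines.append(f"{entry['stage_id']}: {added}")
    return "\n".join(lines)

trail = explain([
    {"stage_id": "stage_0.produce", "added": {"model": "sonnet", "cost": 0.003}},
    {"stage_id": "stage_1.score", "added": {"score": 91, "reason": "thorough risk analysis"}},
])
print(trail)
```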
The simplest pattern: ask multiple times, score each answer, keep the best.
Ask Claude once: "What are the risks of this acquisition?"
Get one answer. Sounds reasonable. But did it miss the regulatory angle? The integration risk? You don't know.
Generate 5 candidates across models. Score each on thoroughness, accuracy, actionability. The winner (score: 91) covers regulatory, integration, financial, and cultural risk. Runner-up (score: 84) missed cultural risk entirely.
```shell
vario run "What are the risks of acquiring Acme Corp?" -r best_of_n
```
Cost: ~$0.03. Time: ~5 seconds. Quality lift: The winning answer consistently outperforms any single call.
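The pattern itself is simple enough to sketch in plain asyncio. Here `generate` and `judge` are hypothetical stubs standing in for real LLM and rubric-scoring calls:

```python
import asyncio

# Stub generator: in a real pipeline this would be a parallel LLM call.
async def generate(prompt: str, model: str) -> dict:
    return {"content": f"[{model}] risks of: {prompt}", "model": model}

# Stub judge: in a real pipeline this would score against a rubric.
async def judge(item: dict) -> dict:
    return {**item, "score": len(item["content"])}

async def best_of_n(prompt: str, models: list[str]) -> dict:
    # Generate all candidates in parallel, score each, keep the best.
    candidates = await asyncio.gather(*(generate(prompt, m) for m in models))
    scored = await asyncio.gather(*(judge(c) for c in candidates))
    return max(scored, key=lambda item: item["score"])

winner = asyncio.run(best_of_n("acquiring Acme Corp", ["haiku", "sonnet", "gemini-pro"]))
```

The real recipe adds streaming (score candidates as they arrive) and provenance, but the generate-score-select skeleton is the same.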
For nuanced questions where different models have different strengths.
Four models (Opus, Gemini, GPT, Grok) each generate their assessment independently. Each gets scored on reasoning depth and evidence quality. A verification pass checks factual claims. The top-scored perspective wins — but you can also read the dissenting views.
In one real run, Opus and Gemini disagreed sharply on whether a three-pivot history was a red flag. Opus focused on the pattern (pivoting away from markets before testing product-market fit), while Gemini focused on the outcome (each pivot moved upmarket). The scoring stage identified that Opus's analysis was more actionable because it predicted a specific failure mode — not just described what happened.
```shell
vario run "Is this founder's pivot history a red flag?" \
  -i founder_timeline.md -r model_debate
```
For high-stakes output where "good enough" isn't good enough.
Vario's repeat op handles the loop automatically with multiple stop conditions:
```shell
vario run "Write an investment memo for Acme Corp" \
  -i acme_data.md -r refine_until_converged --budget 0.50
```
Key insight: Human experts naturally do this — write a draft, get feedback, revise, repeat. Vario automates the same loop with explicit quality criteria instead of gut feeling.
When you need to apply the same analysis to many documents.
```yaml
steps:
  - op: source        # Load 50 documents from directory
    params:
      directory: corpus/
  - op: fan_out       # Each doc × 3 models = 150 items
    params:
      models: [sonnet, haiku, grok-fast]
  - op: task          # Extract claims from each
    params:
      handler: lib.extract.extract_claims
  - op: evaluate      # Compare against reference claims
    params:
      metrics: [precision, recall, f1]
```
Result: 150 evaluations, each with metrics. Reveals that Sonnet extracts 18% more claims than Haiku but at lower precision — a tradeoff you wouldn't discover from a single model on a single document.
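The metrics in that last step are standard set-overlap scores. A minimal sketch of claim-level evaluation (an illustration, not Vario's `evaluate` implementation):

```python
# Compare extracted claims against reference claims as sets.
def claim_metrics(extracted: set[str], reference: set[str]) -> dict[str, float]:
    tp = len(extracted & reference)                       # true positives
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# A model that extracts more claims can gain recall while losing precision.
m = claim_metrics({"a", "b", "c", "d"}, {"b", "c", "d", "e", "f"})
```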
When you just need diverse viewpoints, fast.
```shell
vario run "What's the most overlooked risk in semiconductor supply chains?" -r ask
```
Fans out to 4+ top-tier models. No scoring, no reduction — just raw perspectives. Takes ~5 seconds, costs ~$0.04. Useful as input to your own thinking, not as a final answer.
| System | What It Does | How Vario Differs | Why It Matters |
|---|---|---|---|
| DSPy | Prompt optimization via gradient-like search | Vario optimizes the process (pipeline of ops), not the prompt. Compatible — a DSPy-optimized prompt can be used inside a Vario pipeline | Process improvement and prompt improvement are complementary; doing both > either alone |
| LangChain / LangGraph | Framework for chaining LLM calls with tools | Vario is streaming-native (async generators, not batch), quality-aware (built-in scoring + convergence), and focused on one thing: making LLM outputs better through process | LangChain is a general integration framework; Vario is a quality engine. Different goals. |
| LMQL / Guidance | Constrained generation (grammar, types) | Vario doesn't constrain generation — it evaluates and selects. Post-generation quality control vs pre-generation constraint | Complementary: constrain structure with LMQL, then use Vario to pick the best among valid outputs |
| Best-of-N sampling | Sample N from one model, pick by reward model | Vario generalizes beyond sampling: multi-model, iterative refinement, debate, corpus processing. Sampling is one recipe among many. | Sampling is a special case; the pipeline model is the general case |
| Constitutional AI / RLHF | Train models to be better via feedback | Vario works at inference time — no training required. Uses existing models as-is, composes their strengths through process | Training improves the baseline; Vario improves the ceiling for any given baseline |
| Mixture of Agents | Route or blend outputs from multiple models | Vario adds explicit scoring, iterative refinement, and convergence detection on top of multi-model generation. Not just blending — systematic improvement. | Blending without quality signals can average down; scoring ensures the best perspective wins |
Score distribution: In a best-of-5 run, the winning candidate typically scores 15-25 points higher (on a 0-100 rubric) than the median candidate. This is consistent across problem types — the spread exists because models make different tradeoffs on each generation.
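As a worked example of that spread (illustrative scores, not measured data):

```python
from statistics import median

scores = [91, 78, 72, 70, 65]          # one illustrative best-of-5 run
spread = max(scores) - median(scores)  # winner vs median candidate
```

Here the spread is 19 points, inside the typical 15-25 point range; a large spread is exactly the signal that selection, not regeneration, is doing the work.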
Convergence in refinement: The refine_until_converged recipe typically converges in 3-4 rounds. Round 1 catches the biggest gaps (10-15 point improvement). Round 2 addresses secondary issues (5-8 points). Rounds 3+ yield diminishing returns (<2 points), which triggers the min_improvement stop condition.
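The `min_improvement` stop condition can be sketched as follows (assumed semantics; the real `repeat` op's stop conditions may differ):

```python
# Stop refining once a round's gain drops below min_improvement.
def refine(round_scores: list[float], min_improvement: float = 2.0):
    best = round_scores[0]
    for rnd, score in enumerate(round_scores[1:], start=1):
        if score - best < min_improvement:
            return best, rnd           # diminishing returns: stop here
        best = score
    return best, len(round_scores) - 1

final, stopped_at = refine([72, 83, 89, 89.5])
```

With the score trajectory above, rounds 1 and 2 gain 11 and 6 points; round 3 gains only 0.5, which trips the stop condition.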
| Metric | What It Tells You |
|---|---|
| Score variance across candidates | How much quality varies per generation — high variance means process matters more |
| Convergence round count | How much refinement a problem type needs |
| Cost per quality point | ROI of additional pipeline stages |
| Model win rate | Which models tend to produce winning candidates for which problem types |
| Judge agreement | Whether scoring is consistent (high agreement = reliable rubrics) |
Without Vario: a single LLM call. No quality signal. No way to know whether the answer is the model's best effort or a mediocre generation. Retrying manually is ad hoc: you have no rubrics, so you're judging by feel.

With Vario: multiple candidates, explicit rubric-based scoring, and provenance for every judgment. The winning answer comes with a score, a reason, and the runners-up for comparison. Cost: pennies. Time: seconds.
```shell
# Fan out to top models, see all perspectives
$ vario run "What's the biggest risk in AI agent frameworks?" -r ask
# Items stream in as models respond:
# [sonnet] → "Security boundaries between agent and tools..."
# [gemini] → "State management across long-running tasks..."
# [grok]   → "Catastrophic action authorization..."
# [opus]   → "Eval-production gap in agent behavior..."
```

```shell
# Score 5 candidates against a rubric, get the best
$ vario run "Summarize the key risks" -i quarterly_report.pdf -r best_of_n
# Result:
# Score: 88 | Model: sonnet | Cost: $0.03
# "Three primary risk categories emerge from Q4..."
```

```shell
# Describe what you want, Vario designs the pipeline
$ vario design "generate 3 drafts, have them critique each other, \
  revise based on critiques, pick the best"
# Output: YAML recipe with produce → fan_out → score → revise → reduce
```

```shell
# Refine until quality converges or budget exhausted
$ vario run "Write an investment memo for Acme Corp" \
  -i acme_10k.pdf -r refine_until_converged --budget 0.50
# Round 1: score 72 → "misses competitive landscape"
# Round 2: score 83 → "stronger, but regulatory risk understated"
# Round 3: score 89 → "comprehensive" (converged, Δ < 1.0)
```
```python
import asyncio
from vario.run.runner import execute, load_recipe

async def main() -> None:
    recipe = load_recipe("best_of_n")
    result = await execute(recipe, "Evaluate this startup", limits={"usd": 0.10})
    best = result.items[0]
    print(f"Score: {best.score} | Model: {best.model}")
    print(f"Cost: ${result.total_cost:.3f} | Time: {result.duration_ms:.0f}ms")
    print(best.content)

asyncio.run(main())
```
Beyond YAML recipes, Vario offers:

- A Python DSL: pipelines compose as `produce("...") | score(rubric) | top(1)`. Enables IDE autocomplete, type checking, and Python-native composition.
- A shared prompt library: `brainstorm_angles()`, `steelman_counterargs()`, `extract_risks()`. The prompt engineering gets done once and shared.

Why this compounds: Every recipe that works well becomes a reusable pattern. Every scored run generates data about which models and processes work best for which problems. The system gets better at designing processes as it accumulates execution history. The human's job shifts from crafting prompts to specifying quality criteria, a much more natural way to direct AI work.
| Recipe | Pattern | Best For |
|---|---|---|
| best_of_n | produce → score → top-1 | Quick quality boost |
| confirm | fast produce → maxthink verify → revise | Draft fast, verify with frontier |
| ask | fan_out to 4+ top models | Diverse perspectives |
| ask_fast | fan_out to 5 cheap models | Quick brainstorm |
| model_debate | multi-model → score+verify → top-k | Nuanced decisions |
| majority_vote | produce(7) → score → vote | Consensus questions |
| weighted_vote | produce(7) → score → weighted | Score-weighted consensus |
| refine_once | produce → score → revise → top-1 | One-pass improvement |
| refine_until_converged | produce → repeat(score → revise) | High-stakes output |
| summarize | produce(3) → score → combine | Multi-perspective synthesis |
| generate_and_verify | produce(5) → verify → top-k | Factual accuracy |
| File | Purpose |
|---|---|
| vario/item.py | Item + Source definitions |
| vario/ops/*.py | Nine streaming operations |
| vario/run/runner.py | Recipe executor + repeat logic |
| vario/run/context.py | Execution context (budget, traces) |
| vario/workflows/*.yaml | Recipe library |
| vario/heap.py | Anytime priority queue |
| vario/cli.py | Click CLI |
| vario/dsl.py | Python DSL (alternative to YAML) |
Glossary:

- Op signature: `AsyncIterator[Item] → AsyncIterator[Item]`. Ops compose via `pipe()` or YAML recipes.
- Run history: persisted to `~/.vario/runs.db`.
- Async generator: the `async def f() -> AsyncIterator` pattern. Functions that yield values one at a time as they become available, enabling pipeline processing without waiting for all data to be ready.

Vario is part of Rivus, a system for amplifying human effort through AI orchestration. Built by Tim Chklovski, 2025–2026.
Report generated March 2026. View at static.localhost/present/vario/report.html.