Vario: Turning One LLM Call Into a Rigorous Process

A streaming pipeline engine that orchestrates multiple AI models — generating, scoring, debating, and refining — so that complex questions get the systematic treatment they deserve.

Contents
  1. The Problem: One Roll of the Dice
  2. The Core Idea
  3. How It Works
  4. Motivating Examples
  5. Comparison with Prior Work
  6. Measuring Usefulness
  7. Demo
  8. Vision: Where This Goes
  9. Appendix: Architecture Reference
  10. Glossary — definitions of recurring terms

1. The Problem: One Roll of the Dice

You paste a complex question into an AI chat. Maybe it's "Should we acquire this company?" or "What are the failure modes of this architecture?" or "Evaluate this founder's track record." You get back a single answer. It sounds confident. It might be excellent. It might be mediocre. You have no way to tell.

This is the fundamental problem with single-shot LLM usage: you're rolling the dice once and hoping for the best.

  1 perspective per call · 0 quality signal · 0 refinement rounds · ? confidence in result

Human experts don't work this way. A good analyst writes a draft, gets a second opinion, argues with a colleague, revises their reasoning, and only then shares their conclusion. A good hiring committee doesn't rely on one interviewer. A good investment memo incorporates multiple perspectives and a devil's advocate review.

Yet the standard AI workflow is: prompt → single response → hope for the best.

What if we could give AI the same systematic rigor that makes human expert processes reliable?

2. The Core Idea

Vario turns a single LLM call into a streaming pipeline of generation, scoring, debate, and refinement — so you get the best answer multiple models can produce, not just whatever one model happened to say first.

The analogy: Vario is to a single LLM call what a committee is to a single opinion. It doesn't just ask one model once — it asks several models, scores their answers, has them debate the hard parts, iteratively refines the best candidates, and gives you a ranked result with provenance for every judgment.

Three things make this non-obvious:

  1. Streaming, not batching. Traditional multi-model approaches wait for all responses before doing anything. Vario uses async generators — fast models yield results immediately while slow models are still thinking. The scoring phase starts the instant the first candidate arrives.
  2. One universal data type. Every piece of data flowing through the pipeline is an Item — content plus accumulated properties. An Item that enters produce empty gains a model and cost. When it passes through score, it gains a score and reason. Through revise, its content improves. No schema enforcement, no type hierarchies — just props accumulating as data flows.
  3. Recipes compose. Pipelines are defined as YAML recipes — reusable, shareable, and composable. But you can also describe what you want in natural language ("debate across 3 models") and Vario will design the recipe for you.
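The accumulate-as-you-flow behavior of Items can be sketched as a plain dataclass. This is an illustration under assumed names — the real `vario/item.py` may differ:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Item:
    """Minimal sketch of Vario's universal datum (illustrative, not the real class)."""
    content: str
    props: dict[str, Any] = field(default_factory=dict)
    history: list[dict[str, Any]] = field(default_factory=list)

    def add(self, stage_id: str, **added: Any) -> "Item":
        # Record what this stage contributed, then fold it into props.
        self.history.append({"stage_id": stage_id, "added": dict(added)})
        self.props.update(added)
        return self

item = Item(content="Risks: regulatory, integration, ...")
item.add("stage_0.produce", model="sonnet", cost=0.003)
item.add("stage_1.score", score=91, reason="thorough risk analysis")
```

No stage needs to know what earlier stages added — each simply appends to the same open-ended props dict.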

3. How It Works

The Pipeline Model

Vario pipelines are chains of ops — streaming operations that transform Items. Each op is an async generator: it consumes Items from upstream and yields Items downstream.

           ┌──────────┐     ┌─────────┐     ┌──────────┐
prompt ───▶│ produce  │────▶│  score  │────▶│  reduce  │──▶ best answer
           │ (5 models│     │ (judge) │     │ (top-1)  │
           │ parallel)│     │         │     │          │
           └──────────┘     └─────────┘     └──────────┘
                │                │                │
          Items stream      Props added      N items → 1
          as models        (score, reason)
          finish

The key insight: ops don't wait. produce fires 5 parallel LLM calls and yields each result the instant it completes. score starts judging the first candidate while produce is still generating the rest. The pipeline is pull-driven — the collector at the end drives execution through the chain.
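The yield-as-they-finish behavior can be sketched with `asyncio.as_completed`. This is a toy sketch — model calls are simulated with sleeps, and the scoring heuristic is a stand-in for an LLM judge:

```python
import asyncio
from typing import AsyncIterator

async def produce(prompt: str, models: list[str]) -> AsyncIterator[dict]:
    """Fire all model calls in parallel; yield each result the instant it completes."""
    async def call(model: str) -> dict:
        # Simulated latencies: fast models return first.
        await asyncio.sleep({"haiku": 0.01, "sonnet": 0.02, "opus": 0.03}[model])
        return {"content": f"{model} answer to: {prompt}", "model": model}

    tasks = [asyncio.create_task(call(m)) for m in models]
    for done in asyncio.as_completed(tasks):
        yield await done  # downstream sees fast models first

async def score(items: AsyncIterator[dict]) -> AsyncIterator[dict]:
    """Judge each candidate as it arrives -- no waiting for the full batch."""
    async for item in items:
        item["score"] = len(item["content"]) % 100  # stand-in for an LLM judge
        yield item

async def main() -> list[dict]:
    return [i async for i in score(produce("q", ["opus", "sonnet", "haiku"]))]

results = asyncio.run(main())
print([r["model"] for r in results])  # completion order, not submission order
```

Because `score` is itself an async generator, chaining more ops adds no extra waiting: each stage processes Item N while earlier stages are still producing Item N+1.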

The Nine Ops

Op       | What It Does                                          | Streaming?
produce  | Generate N candidates via parallel LLM calls          | Yes — yields as models finish
score    | Judge quality (numeric score, verification, or both)  | Yes — scores each item as it arrives
revise   | Improve content using feedback from scoring           | Yes — revises each item independently
reduce   | Select or combine: top-k, vote, consensus, synthesis  | Barrier — collects all, emits best
source   | Load corpus items from files, JSONL, or lists         | Yes
fan_out  | Cross-product: each item × each model                 | Yes
task     | Call arbitrary Python functions                       | Yes
evaluate | Programmatic evaluation against references            | Yes
repeat   | Loop sub-stages with convergence detection            | Barrier per round
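The distinction between streaming and barrier ops comes down to whether the op can emit before upstream is exhausted. A barrier op like reduce can be sketched as follows (a simplified top-k illustration, not the real implementation):

```python
import asyncio
from typing import AsyncIterator

async def reduce_top_k(items: AsyncIterator[dict], k: int = 1) -> AsyncIterator[dict]:
    """Barrier op sketch: must see every candidate before it can pick the best."""
    collected = [item async for item in items]  # blocks until upstream is exhausted
    collected.sort(key=lambda i: i.get("score", 0), reverse=True)
    for item in collected[:k]:
        yield item

async def demo() -> dict:
    async def scored():
        for s in (72, 91, 84):
            yield {"score": s}
    return [i async for i in reduce_top_k(scored(), k=1)][0]

best = asyncio.run(demo())
print(best)  # the highest-scored item
```

The barrier is the `async for` comprehension: everything upstream of a reduce still streams, but nothing downstream of it runs until the selection is made.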

Execution Trace

Every pipeline run produces a structured RunLog — a record of what happened at each stage: how many items entered and exited, how many tokens were used, how much it cost, and how long it took. This makes every run auditable and every cost predictable.

Example RunLog Summary
Recipe: best_of_n | Problem: "Evaluate this acquisition target"
  Stage 0: produce  → 0 in, 5 out  | 2,845 tokens | $0.021 | 3.2s
  Stage 1: score    → 5 in, 5 out  | 1,230 tokens | $0.004 | 1.1s
  Stage 2: reduce   → 5 in, 1 out  | 0 tokens     | $0.000 | 0.0s
Outcome: 1 item, best score 91 | Total: $0.025 | 4.3s

Item Provenance

Every Item carries a history — a list of what happened to it at each stage:

item.history = [
  {"stage_id": "stage_0.produce", "added": {"model": "sonnet", "cost": 0.003}},
  {"stage_id": "stage_1.score",   "added": {"score": 91, "reason": "thorough risk analysis"}},
]

You can always trace why an Item has a particular score, which model generated it, and how much each stage cost. No black boxes.
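Answering "which stage set this property?" is a simple walk over the history list. A small helper (hypothetical — not part of Vario's API) makes the idea concrete:

```python
from typing import Optional

def who_set(history: list[dict], prop: str) -> Optional[str]:
    """Find which pipeline stage first added a given property to an Item."""
    for entry in history:
        if prop in entry["added"]:
            return entry["stage_id"]
    return None

history = [
    {"stage_id": "stage_0.produce", "added": {"model": "sonnet", "cost": 0.003}},
    {"stage_id": "stage_1.score",   "added": {"score": 91, "reason": "thorough risk analysis"}},
]
print(who_set(history, "score"))  # stage_1.score
print(who_set(history, "model"))  # stage_0.produce
```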

4. Motivating Examples

Example 1: Best-of-N Selection

The simplest pattern: ask multiple times, score each answer, keep the best.

Before (single call)

Ask Claude once: "What are the risks of this acquisition?"

Get one answer. Sounds reasonable. But did it miss the regulatory angle? The integration risk? You don't know.

After (best_of_n)

Generate 5 candidates across models. Score each on thoroughness, accuracy, actionability. The winner (score: 91) covers regulatory, integration, financial, and cultural risk. Runner-up (score: 84) missed cultural risk entirely.

vario run "What are the risks of acquiring Acme Corp?" -r best_of_n

Cost: ~$0.03. Time: ~5 seconds. Quality lift: The winning answer consistently outperforms any single call.

Example 2: Multi-Model Debate

For nuanced questions where different models have different strengths.

Scenario: "Is this founder's pivot a sign of adaptability or lack of focus?"

Four models (Opus, Gemini, GPT, Grok) each generate their assessment independently. Each gets scored on reasoning depth and evidence quality. A verification pass checks factual claims. The top-scored perspective wins — but you can also read the dissenting views.

In one real run, Opus and Gemini disagreed sharply on whether a three-pivot history was a red flag. Opus focused on the pattern (pivoting away from markets before testing product-market fit), while Gemini focused on the outcome (each pivot moved upmarket). The scoring stage identified that Opus's analysis was more actionable because it predicted a specific failure mode — not just described what happened.

vario run "Is this founder's pivot history a red flag?" \
  -i founder_timeline.md -r model_debate

Example 3: Iterative Refinement

For high-stakes output where "good enough" isn't good enough.

Round 1: produce(3) → score → best=72 ("misses edge cases")
           ↓ revise with feedback
Round 2: score → best=81 ("stronger but ignores competitive response")
           ↓ revise with feedback
Round 3: score → best=88 ("comprehensive, minor clarity issues")
           ↓ revise with feedback
Round 4: score → best=89 (improvement < 1.0 → converged, stop)

Vario's repeat op handles the loop automatically with multiple stop conditions:

vario run "Write an investment memo for Acme Corp" \
  -i acme_data.md -r refine_until_converged --budget 0.50

Key insight: Human experts naturally do this — write a draft, get feedback, revise, repeat. Vario automates the same loop with explicit quality criteria instead of gut feeling.
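The loop itself is simple; the value is in making the stop condition explicit. A minimal sketch of the min-improvement rule (toy score and revise functions stand in for LLM calls):

```python
def refine_until_converged(score_fn, revise_fn, draft,
                           min_improvement=1.0, max_rounds=6):
    """Sketch of the repeat loop: revise until the score gain falls below a threshold."""
    best, best_score = draft, score_fn(draft)
    for _ in range(max_rounds):
        candidate = revise_fn(best)
        candidate_score = score_fn(candidate)
        if candidate_score - best_score < min_improvement:
            break  # converged: improvement too small to justify another round
        best, best_score = candidate, candidate_score
    return best, best_score

# Toy stand-ins: each revision adds detail; quality saturates near 89.
score_fn = lambda d: min(72 + 8 * (len(d) - 1), 89)
revise_fn = lambda d: d + "x"
best, final = refine_until_converged(score_fn, revise_fn, "x")
print(final)  # 89
```

A real pipeline would combine this with budget and round caps, since a noisy judge can otherwise keep reporting spurious small gains.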

Example 4: Corpus Processing

When you need to apply the same analysis to many documents.

Scenario: Evaluate claim extraction quality across 50 documents × 3 models
steps:
  - op: source          # Load 50 documents from directory
    params:
      directory: corpus/
  - op: fan_out         # Each doc × 3 models = 150 items
    params:
      models: [sonnet, haiku, grok-fast]
  - op: task            # Extract claims from each
    params:
      handler: lib.extract.extract_claims
  - op: evaluate        # Compare against reference claims
    params:
      metrics: [precision, recall, f1]

Result: 150 evaluations, each with metrics. Reveals that Sonnet extracts 18% more claims than Haiku but at lower precision — a tradeoff you wouldn't discover from a single model on a single document.
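The fan_out step is just a streaming cross-product. A sketch (illustrative signature, not Vario's actual op):

```python
import asyncio
from typing import AsyncIterator

async def fan_out(items: AsyncIterator[dict], models: list[str]) -> AsyncIterator[dict]:
    """Cross-product: every upstream item is duplicated once per model."""
    async for item in items:
        for model in models:
            yield {**item, "model": model}

async def demo() -> list[dict]:
    async def docs():
        for name in ("a.md", "b.md"):
            yield {"doc": name}
    return [i async for i in fan_out(docs(), ["sonnet", "haiku", "grok-fast"])]

items = asyncio.run(demo())
print(len(items))  # 2 docs x 3 models = 6 work items
```

Because it streams, the first document fans out to all three models before the second document is even loaded — downstream extraction starts immediately.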

Example 5: Rapid Multi-Perspective

When you just need diverse viewpoints, fast.

vario run "What's the most overlooked risk in semiconductor supply chains?" -r ask

Fans out to 4+ top-tier models. No scoring, no reduction — just raw perspectives. Takes ~5 seconds, costs ~$0.04. Useful as input to your own thinking, not as a final answer.

5. Comparison with Prior Work

DSPy — prompt optimization via gradient-like search.
  How Vario differs: Vario optimizes the process (pipeline of ops), not the prompt. Compatible — a DSPy-optimized prompt can be used inside a Vario pipeline.
  Why it matters: Process improvement and prompt improvement are complementary; doing both > either alone.

LangChain / LangGraph — framework for chaining LLM calls with tools.
  How Vario differs: Vario is streaming-native (async generators, not batch), quality-aware (built-in scoring + convergence), and focused on one thing: making LLM outputs better through process.
  Why it matters: LangChain is a general integration framework; Vario is a quality engine. Different goals.

LMQL / Guidance — constrained generation (grammar, types).
  How Vario differs: Vario doesn't constrain generation — it evaluates and selects. Post-generation quality control vs pre-generation constraint.
  Why it matters: Complementary — constrain structure with LMQL, then use Vario to pick the best among valid outputs.

Best-of-N sampling — sample N from one model, pick by reward model.
  How Vario differs: Vario generalizes beyond sampling: multi-model, iterative refinement, debate, corpus processing. Sampling is one recipe among many.
  Why it matters: Sampling is a special case; the pipeline model is the general case.

Constitutional AI / RLHF — train models to be better via feedback.
  How Vario differs: Vario works at inference time — no training required. It uses existing models as-is and composes their strengths through process.
  Why it matters: Training improves the baseline; Vario improves the ceiling for any given baseline.

Mixture of Agents — route or blend outputs from multiple models.
  How Vario differs: Vario adds explicit scoring, iterative refinement, and convergence detection on top of multi-model generation. Not just blending — systematic improvement.
  Why it matters: Blending without quality signals can average down; scoring ensures the best perspective wins.
Vario's niche: It doesn't compete with prompt engineering, model fine-tuning, or constrained generation. It sits above all of these — orchestrating existing tools into processes that reliably produce better results than any single tool alone. Like a project manager who doesn't code but makes the team more effective.

6. Measuring Usefulness

Direct Quality Metrics

  candidates per run · ~$0.03 typical best_of_n cost · ~5s streaming latency · 11 built-in recipes

Score distribution: In a best-of-5 run, the winning candidate typically scores 15-25 points higher (on a 0-100 rubric) than the median candidate. This is consistent across problem types — the spread exists because models make different tradeoffs on each generation.

Convergence in refinement: The refine_until_converged recipe typically converges in 3-4 rounds. Round 1 catches the biggest gaps (10-15 point improvement). Round 2 addresses secondary issues (5-8 points). Rounds 3+ yield diminishing returns (<2 points), which triggers the min_improvement stop condition.

Process Metrics

Metric                           | What It Tells You
Score variance across candidates | How much quality varies per generation — high variance means process matters more
Convergence round count          | How much refinement a problem type needs
Cost per quality point           | ROI of additional pipeline stages
Model win rate                   | Which models tend to produce winning candidates for which problem types
Judge agreement                  | Whether scoring is consistent (high agreement = reliable rubrics)
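Several of these metrics fall out of one run's scored candidates. A sketch of the computation (hypothetical helper, using the stdlib statistics module):

```python
from statistics import pvariance

def score_spread(scored: list[dict]) -> dict:
    """Compute simple process metrics from one run's scored candidates."""
    scores = [c["score"] for c in scored]
    return {
        "best": max(scores),
        # Gap between the winner and the median candidate -- the "quality lift"
        # you'd forfeit with a single random generation.
        "median_gap": max(scores) - sorted(scores)[len(scores) // 2],
        "variance": pvariance(scores),
    }

run = [{"score": s} for s in (91, 84, 78, 72, 66)]
m = score_spread(run)
print(m["best"], m["median_gap"])  # 91 13
```

High variance across candidates is the clearest signal that process matters for a given problem type: when all five candidates score within a point or two, best-of-N buys little.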

Ablation: What Happens Without Vario?

Without Vario

Single LLM call. No quality signal. No way to know if the answer is the model's best effort or a mediocre generation. Retrying manually is ad hoc — you don't have rubrics, you're judging by feel.

With Vario

Multiple candidates, explicit rubric-based scoring, provenance for every judgment. The winning answer comes with a score, a reason, and the runner-ups for comparison. Cost: pennies. Time: seconds.

Honest Limitations

7. Demo

CLI Walkthrough

1. Quick multi-model comparison
# Fan out to top models, see all perspectives
$ vario run "What's the biggest risk in AI agent frameworks?" -r ask

# Items stream in as models respond:
# [sonnet]  → "Security boundaries between agent and tools..."
# [gemini]  → "State management across long-running tasks..."
# [grok]    → "Catastrophic action authorization..."
# [opus]    → "Eval-production gap in agent behavior..."
2. Best-of-N with file context
# Score 5 candidates against a rubric, get the best
$ vario run "Summarize the key risks" -i quarterly_report.pdf -r best_of_n

# Result:
# Score: 88 | Model: sonnet | Cost: $0.03
# "Three primary risk categories emerge from Q4..."
3. Design a custom recipe from natural language
# Describe what you want, Vario designs the pipeline
$ vario design "generate 3 drafts, have them critique each other, \
  revise based on critiques, pick the best"

# Output: YAML recipe with produce → fan_out → score → revise → reduce
4. Iterative refinement with budget
# Refine until quality converges or budget exhausted
$ vario run "Write an investment memo for Acme Corp" \
  -i acme_10k.pdf -r refine_until_converged --budget 0.50

# Round 1: score 72 → "misses competitive landscape"
# Round 2: score 83 → "stronger, but regulatory risk understated"
# Round 3: score 89 → "comprehensive" (converged, Δ < 1.0)
5. Python API for integration
from vario.run.runner import execute, load_recipe

recipe = load_recipe("best_of_n")
result = await execute(recipe, "Evaluate this startup", limits={"usd": 0.10})

best = result.items[0]
print(f"Score: {best.score} | Model: {best.model}")
print(f"Cost: ${result.total_cost:.3f} | Time: {result.duration_ms:.0f}ms")
print(best.content)
TODO: Add a recorded terminal demo (VHS) showing a live best_of_n run with streaming output. The CLI already has rich formatting — capture it.

8. Vision: Where This Goes

Near-Term (3–6 months)

Medium-Term (6–18 months)

North Star

The vision: You describe a problem. Vario designs the right process, selects the right models, runs the pipeline, detects when quality is sufficient, and delivers a result with full provenance — at a cost/quality tradeoff you specified. The "recipe" becomes emergent, not hand-authored.

Why this compounds: Every recipe that works well becomes a reusable pattern. Every scored run generates data about which models and processes work best for which problems. The system gets better at designing processes as it accumulates execution history. The human's job shifts from crafting prompts to specifying quality criteria — a much more natural way to direct AI work.

Appendix: Architecture Reference

System Diagram

┌─────────────────────────────────────────────────────────────┐
│                    Vario Pipeline Engine                    │
│                                                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│  │  Recipe  │─▶│  Runner  │─▶│   Ops    │─▶│  RunLog  │     │
│  │  (YAML)  │  │          │  │ (stream) │  │ (traces) │     │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘     │
│                     │             │             │           │
│                ┌────┴────┐  ┌─────┴─────┐  ┌────┴────┐      │
│                │ Context │  │   Items   │  │ runs.db │      │
│                │ (budget,│  │ (content +│  │ (SQLite)│      │
│                │ traces) │  │  props +  │  └─────────┘      │
│                └─────────┘  │  history) │                   │
│  ┌─────────┐                └───────────┘                   │
│  │defaults │                ┌─────────┐                     │
│  │ .yaml   │                │  Heap   │                     │
│  │ (models,│                │(anytime │                     │
│  │  tiers) │                │priority)│                     │
│  └─────────┘                └─────────┘                     │
└─────────────────────────────────────────────────────────────┘
         │                         │
    ┌────┴────┐               ┌────┴────┐
    │   CLI   │               │ NiceGUI │
    │ (click) │               │   UI    │
    └─────────┘               └─────────┘

Recipe Library

Recipe                 | Pattern                                 | Best For
best_of_n              | produce → score → top-1                 | Quick quality boost
confirm                | fast produce → maxthink verify → revise | Draft fast, verify with frontier
ask                    | fan_out to 4+ top models                | Diverse perspectives
ask_fast               | fan_out to 5 cheap models               | Quick brainstorm
model_debate           | multi-model → score+verify → top-k      | Nuanced decisions
majority_vote          | produce(7) → score → vote               | Consensus questions
weighted_vote          | produce(7) → score → weighted           | Score-weighted consensus
refine_once            | produce → score → revise → top-1        | One-pass improvement
refine_until_converged | produce → repeat(score → revise)        | High-stakes output
summarize              | produce(3) → score → combine            | Multi-perspective synthesis
generate_and_verify    | produce(5) → verify → top-k             | Factual accuracy

File Map

File                   | Purpose
vario/item.py          | Item + Source definitions
vario/ops/*.py         | Nine streaming operations
vario/run/runner.py    | Recipe executor + repeat logic
vario/run/context.py   | Execution context (budget, traces)
vario/workflows/*.yaml | Recipe library
vario/heap.py          | Anytime priority queue
vario/cli.py           | Click CLI
vario/dsl.py           | Python DSL (alternative to YAML)

Glossary

Item
Vario's universal datum. Content (the actual text) plus accumulated properties (model, score, cost, etc.) plus a provenance history tracking what was added at each pipeline stage.
Op (Operation)
A streaming function that transforms Items. Type signature: AsyncIterator[Item] → AsyncIterator[Item]. Ops compose via pipe() or YAML recipes.
Recipe
A YAML-defined pipeline configuration specifying which ops to run, in what order, with what parameters. Can be hand-authored, loaded from the built-in library, or auto-designed from natural language.
RunLog
A structured execution narrative recording what happened during a pipeline run: stages, timing, token counts, costs, and outcome summary. Persisted to ~/.vario/runs.db.
Async Generator
Python's async def f() -> AsyncIterator pattern. Functions that yield values one at a time as they become available, enabling pipeline processing without waiting for all data to be ready.
Heap (Anytime Priority Queue)
A sorted collection of Items where the best item is always accessible, even while the pipeline is still running. Enables real-time UI updates and early termination.
Convergence
The point where additional refinement rounds stop improving quality. Detected via minimum improvement thresholds, drift windows, or score plateaus.
Provenance
The complete audit trail of an Item: which model generated it, what score it received and why, how many rounds of revision it went through, and what each stage cost.

Vario is part of Rivus, a system for amplifying human effort through AI orchestration. Built by Tim Chklovski, 2025–2026.

Report generated March 2026. View at static.localhost/present/vario/report.html.