Vario: Turning One LLM Call Into a Rigorous Process

A streaming pipeline engine that orchestrates multiple AI models — generating, scoring, debating, and refining — so that complex questions get the systematic treatment they deserve.

Contents
  1. The Problem: One Roll of the Dice
  2. The Core Idea
  3. How It Works
  4. Motivating Examples
  5. Comparison with Prior Work
  6. Measuring Usefulness
  7. Demo
  8. Vision: Where This Goes
  9. Appendix: Architecture Reference
  10. Glossary — definitions of recurring terms

1. The Problem: One Roll of the Dice

You paste a complex question into an AI chat. Maybe it's "Should we acquire this company?" or "What are the failure modes of this architecture?" or "Evaluate this founder's track record." You get back a single answer. It sounds confident. It might be excellent. It might be mediocre. You have no way to tell.

This is the fundamental problem with single-shot LLM usage: you're rolling the dice once and hoping for the best.

  1 perspective per call · 0 quality signal · 0 refinement rounds · ? confidence in result

Human experts don't work this way. A good analyst writes a draft, gets a second opinion, argues with a colleague, revises their reasoning, and only then shares their conclusion. A good hiring committee doesn't rely on one interviewer. A good investment memo incorporates multiple perspectives and a devil's advocate review.

Yet the standard AI workflow is: prompt → single response → hope for the best.

What if we could give AI the same systematic rigor that makes human expert processes reliable?

2. The Core Idea

Vario turns a single LLM call into a streaming pipeline of generation, scoring, debate, and refinement — so you get the best answer multiple models can produce, not just whatever one model happened to say first.

The analogy: Vario is to a single LLM call what a committee is to a single opinion. It doesn't just ask one model once — it asks several models, scores their answers, has them debate the hard parts, iteratively refines the best candidates, and gives you a ranked result with provenance for every judgment.

Three things make this non-obvious:

  1. Streaming, not batching. Traditional multi-model approaches wait for all responses before doing anything. Vario uses async generators — fast models yield results immediately while slow models are still thinking. The scoring phase starts the instant the first candidate arrives.
  2. One universal data type. Every piece of data flowing through the pipeline is an Item — content plus accumulated properties. An Item that enters produce empty gains a model and cost. When it passes through score, it gains a score and reason. Through revise, its content improves. No schema enforcement, no type hierarchies — just props accumulating as data flows.
  3. Recipes compose. Pipelines are defined as YAML recipes — reusable, shareable, and composable. But you can also describe what you want in natural language ("debate across 3 models") and Vario will design the recipe for you.
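The accumulate-as-you-flow behavior of Items can be sketched as a plain dataclass. This is an illustration under assumed names — the real `vario/item.py` may differ:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Item:
    """Minimal sketch of Vario's universal datum (illustrative, not the real class)."""
    content: str
    props: dict[str, Any] = field(default_factory=dict)
    history: list[dict[str, Any]] = field(default_factory=list)

    def add(self, stage_id: str, **added: Any) -> "Item":
        # Record what this stage contributed, then fold it into props.
        self.history.append({"stage_id": stage_id, "added": dict(added)})
        self.props.update(added)
        return self

item = Item(content="Risks: regulatory, integration, ...")
item.add("stage_0.produce", model="sonnet", cost=0.003)
item.add("stage_1.score", score=91, reason="thorough risk analysis")
```

No stage needs to know what earlier stages added — each simply appends to the same open-ended props dict.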

3. How It Works

The Pipeline Model

Vario pipelines are chains of ops — streaming operations that transform Items. Each op is an async generator: it consumes Items from upstream and yields Items downstream.

           ┌──────────┐     ┌─────────┐     ┌──────────┐
prompt ───▶│ produce  │────▶│  score  │────▶│  reduce  │──▶ best answer
           │ (5 models│     │ (judge) │     │ (top-1)  │
           │ parallel)│     │         │     │          │
           └──────────┘     └─────────┘     └──────────┘
                │                │                │
          Items stream      Props added      N items → 1
          as models        (score, reason)
          finish

The key insight: ops don't wait. produce fires 5 parallel LLM calls and yields each result the instant it completes. score starts judging the first candidate while produce is still generating the rest. The pipeline is pull-driven — the collector at the end drives execution through the chain.
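The yield-as-they-finish behavior can be sketched with `asyncio.as_completed`. This is a toy sketch — model calls are simulated with sleeps, and the scoring heuristic is a stand-in for an LLM judge:

```python
import asyncio
from typing import AsyncIterator

async def produce(prompt: str, models: list[str]) -> AsyncIterator[dict]:
    """Fire all model calls in parallel; yield each result the instant it completes."""
    async def call(model: str) -> dict:
        # Simulated latencies: fast models return first.
        await asyncio.sleep({"haiku": 0.01, "sonnet": 0.02, "opus": 0.03}[model])
        return {"content": f"{model} answer to: {prompt}", "model": model}

    tasks = [asyncio.create_task(call(m)) for m in models]
    for done in asyncio.as_completed(tasks):
        yield await done  # downstream sees fast models first

async def score(items: AsyncIterator[dict]) -> AsyncIterator[dict]:
    """Judge each candidate as it arrives -- no waiting for the full batch."""
    async for item in items:
        item["score"] = len(item["content"]) % 100  # stand-in for an LLM judge
        yield item

async def main() -> list[dict]:
    return [i async for i in score(produce("q", ["opus", "sonnet", "haiku"]))]

results = asyncio.run(main())
print([r["model"] for r in results])  # completion order, not submission order
```

Because `score` is itself an async generator, chaining more ops adds no extra waiting: each stage processes Item N while earlier stages are still producing Item N+1.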

The Nine Ops

Op       | What It Does                                          | Streaming?
produce  | Generate N candidates via parallel LLM calls          | Yes — yields as models finish
score    | Judge quality (numeric score, verification, or both)  | Yes — scores each item as it arrives
revise   | Improve content using feedback from scoring           | Yes — revises each item independently
reduce   | Select or combine: top-k, vote, consensus, synthesis  | Barrier — collects all, emits best
source   | Load corpus items from files, JSONL, or lists         | Yes
fan_out  | Cross-product: each item × each model                 | Yes
task     | Call arbitrary Python functions                       | Yes
evaluate | Programmatic evaluation against references            | Yes
repeat   | Loop sub-stages with convergence detection            | Barrier per round
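The distinction between streaming and barrier ops comes down to whether the op can emit before upstream is exhausted. A barrier op like reduce can be sketched as follows (a simplified top-k illustration, not the real implementation):

```python
import asyncio
from typing import AsyncIterator

async def reduce_top_k(items: AsyncIterator[dict], k: int = 1) -> AsyncIterator[dict]:
    """Barrier op sketch: must see every candidate before it can pick the best."""
    collected = [item async for item in items]  # blocks until upstream is exhausted
    collected.sort(key=lambda i: i.get("score", 0), reverse=True)
    for item in collected[:k]:
        yield item

async def demo() -> dict:
    async def scored():
        for s in (72, 91, 84):
            yield {"score": s}
    return [i async for i in reduce_top_k(scored(), k=1)][0]

best = asyncio.run(demo())
print(best)  # the highest-scored item
```

The barrier is the `async for` comprehension: everything upstream of a reduce still streams, but nothing downstream of it runs until the selection is made.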

Execution Trace

Every pipeline run produces a structured RunLog — a record of what happened at each stage: how many items entered and exited, how many tokens were used, how much it cost, and how long it took. This makes every run auditable and every cost predictable.

Example RunLog Summary
Recipe: best_of_n | Problem: "Evaluate this acquisition target"
  Stage 0: produce  → 0 in, 5 out  | 2,845 tokens | $0.021 | 3.2s
  Stage 1: score    → 5 in, 5 out  | 1,230 tokens | $0.004 | 1.1s
  Stage 2: reduce   → 5 in, 1 out  | 0 tokens     | $0.000 | 0.0s
Outcome: 1 item, best score 91 | Total: $0.025 | 4.3s

Item Provenance

Every Item carries a history — a list of what happened to it at each stage:

item.history = [
  {"stage_id": "stage_0.produce", "added": {"model": "sonnet", "cost": 0.003}},
  {"stage_id": "stage_1.score",   "added": {"score": 91, "reason": "thorough risk analysis"}},
]

You can always trace why an Item has a particular score, which model generated it, and how much each stage cost. No black boxes.
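Answering "which stage set this property?" is a simple walk over the history list. A small helper (hypothetical — not part of Vario's API) makes the idea concrete:

```python
from typing import Optional

def who_set(history: list[dict], prop: str) -> Optional[str]:
    """Find which pipeline stage first added a given property to an Item."""
    for entry in history:
        if prop in entry["added"]:
            return entry["stage_id"]
    return None

history = [
    {"stage_id": "stage_0.produce", "added": {"model": "sonnet", "cost": 0.003}},
    {"stage_id": "stage_1.score",   "added": {"score": 91, "reason": "thorough risk analysis"}},
]
print(who_set(history, "score"))  # stage_1.score
print(who_set(history, "model"))  # stage_0.produce
```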

4. Motivating Examples

Example 1: Best-of-N Selection

The simplest pattern: ask multiple times, score each answer, keep the best.

Before (single call)

Ask Claude once: "What are the risks of this acquisition?"

Get one answer. Sounds reasonable. But did it miss the regulatory angle? The integration risk? You don't know.

After (best_of_n)

Generate 5 candidates across models. Score each on thoroughness, accuracy, actionability. The winner (score: 91) covers regulatory, integration, financial, and cultural risk. Runner-up (score: 84) missed cultural risk entirely.

vario run "What are the risks of acquiring Acme Corp?" -r best_of_n

Cost: ~$0.03. Time: ~5 seconds. Quality lift: The winning answer consistently outperforms any single call.

Example 2: Multi-Model Debate

For nuanced questions where different models have different strengths.

Scenario: "Is this founder's pivot a sign of adaptability or lack of focus?"

Four models (Opus, Gemini, GPT, Grok) each generate their assessment independently. Each gets scored on reasoning depth and evidence quality. A verification pass checks factual claims. The top-scored perspective wins — but you can also read the dissenting views.

In one real run, Opus and Gemini disagreed sharply on whether a three-pivot history was a red flag. Opus focused on the pattern (pivoting away from markets before testing product-market fit), while Gemini focused on the outcome (each pivot moved upmarket). The scoring stage identified that Opus's analysis was more actionable because it predicted a specific failure mode — not just described what happened.

vario run "Is this founder's pivot history a red flag?" \
  -i founder_timeline.md -r model_debate

Example 3: Iterative Refinement

For high-stakes output where "good enough" isn't good enough.

Round 1: produce(3) → score → best=72 ("misses edge cases")
           ↓ revise with feedback
Round 2: score → best=81 ("stronger but ignores competitive response")
           ↓ revise with feedback
Round 3: score → best=88 ("comprehensive, minor clarity issues")
           ↓ revise with feedback
Round 4: score → best=89 (improvement < 1.0 → converged, stop)

Vario's repeat op handles the loop automatically with multiple stop conditions:

vario run "Write an investment memo for Acme Corp" \
  -i acme_data.md -r refine_until_converged --budget 0.50

Key insight: Human experts naturally do this — write a draft, get feedback, revise, repeat. Vario automates the same loop with explicit quality criteria instead of gut feeling.
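The loop itself is simple; the value is in making the stop condition explicit. A minimal sketch of the min-improvement rule (toy score and revise functions stand in for LLM calls):

```python
def refine_until_converged(score_fn, revise_fn, draft,
                           min_improvement=1.0, max_rounds=6):
    """Sketch of the repeat loop: revise until the score gain falls below a threshold."""
    best, best_score = draft, score_fn(draft)
    for _ in range(max_rounds):
        candidate = revise_fn(best)
        candidate_score = score_fn(candidate)
        if candidate_score - best_score < min_improvement:
            break  # converged: improvement too small to justify another round
        best, best_score = candidate, candidate_score
    return best, best_score

# Toy stand-ins: each revision adds detail; quality saturates near 89.
score_fn = lambda d: min(72 + 8 * (len(d) - 1), 89)
revise_fn = lambda d: d + "x"
best, final = refine_until_converged(score_fn, revise_fn, "x")
print(final)  # 89
```

A real pipeline would combine this with budget and round caps, since a noisy judge can otherwise keep reporting spurious small gains.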

Example 4: Corpus Processing

When you need to apply the same analysis to many documents.

Scenario: Evaluate claim extraction quality across 50 documents × 3 models
steps:
  - op: source          # Load 50 documents from directory
    params:
      directory: corpus/
  - op: fan_out         # Each doc × 3 models = 150 items
    params:
      models: [sonnet, haiku, grok-fast]
  - op: task            # Extract claims from each
    params:
      handler: lib.extract.extract_claims
  - op: evaluate        # Compare against reference claims
    params:
      metrics: [precision, recall, f1]

Result: 150 evaluations, each with metrics. Reveals that Sonnet extracts 18% more claims than Haiku but at lower precision — a tradeoff you wouldn't discover from a single model on a single document.
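The fan_out step is just a streaming cross-product. A sketch (illustrative signature, not Vario's actual op):

```python
import asyncio
from typing import AsyncIterator

async def fan_out(items: AsyncIterator[dict], models: list[str]) -> AsyncIterator[dict]:
    """Cross-product: every upstream item is duplicated once per model."""
    async for item in items:
        for model in models:
            yield {**item, "model": model}

async def demo() -> list[dict]:
    async def docs():
        for name in ("a.md", "b.md"):
            yield {"doc": name}
    return [i async for i in fan_out(docs(), ["sonnet", "haiku", "grok-fast"])]

items = asyncio.run(demo())
print(len(items))  # 2 docs x 3 models = 6 work items
```

Because it streams, the first document fans out to all three models before the second document is even loaded — downstream extraction starts immediately.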

Example 5: Rapid Multi-Perspective

When you just need diverse viewpoints, fast.

vario run "What's the most overlooked risk in semiconductor supply chains?" -r ask

Fans out to 4+ top-tier models. No scoring, no reduction — just raw perspectives. Takes ~5 seconds, costs ~$0.04. Useful as input to your own thinking, not as a final answer.

5. Comparison with Prior Work

DSPy — prompt optimization via gradient-like search.
  How Vario differs: Vario optimizes the process (pipeline of ops), not the prompt. Compatible — a DSPy-optimized prompt can be used inside a Vario pipeline.
  Why it matters: Process improvement and prompt improvement are complementary; doing both > either alone.

LangChain / LangGraph — framework for chaining LLM calls with tools.
  How Vario differs: Vario is streaming-native (async generators, not batch), quality-aware (built-in scoring + convergence), and focused on one thing: making LLM outputs better through process.
  Why it matters: LangChain is a general integration framework; Vario is a quality engine. Different goals.

LMQL / Guidance — constrained generation (grammar, types).
  How Vario differs: Vario doesn't constrain generation — it evaluates and selects. Post-generation quality control vs pre-generation constraint.
  Why it matters: Complementary — constrain structure with LMQL, then use Vario to pick the best among valid outputs.

Best-of-N sampling — sample N from one model, pick by reward model.
  How Vario differs: Vario generalizes beyond sampling: multi-model, iterative refinement, debate, corpus processing. Sampling is one recipe among many.
  Why it matters: Sampling is a special case; the pipeline model is the general case.

Constitutional AI / RLHF — train models to be better via feedback.
  How Vario differs: Vario works at inference time — no training required. It uses existing models as-is and composes their strengths through process.
  Why it matters: Training improves the baseline; Vario improves the ceiling for any given baseline.

Mixture of Agents — route or blend outputs from multiple models.
  How Vario differs: Vario adds explicit scoring, iterative refinement, and convergence detection on top of multi-model generation. Not just blending — systematic improvement.
  Why it matters: Blending without quality signals can average down; scoring ensures the best perspective wins.
Vario's niche: It doesn't compete with prompt engineering, model fine-tuning, or constrained generation. It sits above all of these — orchestrating existing tools into processes that reliably produce better results than any single tool alone. Like a project manager who doesn't code but makes the team more effective.

6. Measuring Usefulness

Direct Quality Metrics

  candidates per run · ~$0.03 typical best_of_n cost · ~5s streaming latency · 11 built-in recipes

Score distribution: In a best-of-5 run, the winning candidate typically scores 15-25 points higher (on a 0-100 rubric) than the median candidate. This is consistent across problem types — the spread exists because models make different tradeoffs on each generation.

Convergence in refinement: The refine_until_converged recipe typically converges in 3-4 rounds. Round 1 catches the biggest gaps (10-15 point improvement). Round 2 addresses secondary issues (5-8 points). Rounds 3+ yield diminishing returns (<2 points), which triggers the min_improvement stop condition.

Process Metrics

Metric                           | What It Tells You
Score variance across candidates | How much quality varies per generation — high variance means process matters more
Convergence round count          | How much refinement a problem type needs
Cost per quality point           | ROI of additional pipeline stages
Model win rate                   | Which models tend to produce winning candidates for which problem types
Judge agreement                  | Whether scoring is consistent (high agreement = reliable rubrics)
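Several of these metrics fall out of one run's scored candidates. A sketch of the computation (hypothetical helper, using the stdlib statistics module):

```python
from statistics import pvariance

def score_spread(scored: list[dict]) -> dict:
    """Compute simple process metrics from one run's scored candidates."""
    scores = [c["score"] for c in scored]
    return {
        "best": max(scores),
        # Gap between the winner and the median candidate -- the "quality lift"
        # you'd forfeit with a single random generation.
        "median_gap": max(scores) - sorted(scores)[len(scores) // 2],
        "variance": pvariance(scores),
    }

run = [{"score": s} for s in (91, 84, 78, 72, 66)]
m = score_spread(run)
print(m["best"], m["median_gap"])  # 91 13
```

High variance across candidates is the clearest signal that process matters for a given problem type: when all five candidates score within a point or two, best-of-N buys little.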

Ablation: What Happens Without Vario?

Without Vario

Single LLM call. No quality signal. No way to know if the answer is the model's best effort or a mediocre generation. Retrying manually is ad hoc — you don't have rubrics, you're judging by feel.

With Vario

Multiple candidates, explicit rubric-based scoring, provenance for every judgment. The winning answer comes with a score, a reason, and the runner-ups for comparison. Cost: pennies. Time: seconds.

Honest Limitations

7. Demo

CLI Walkthrough

1. Quick multi-model comparison
# Fan out to top models, see all perspectives
$ vario run "What's the biggest risk in AI agent frameworks?" -r ask

# Items stream in as models respond:
# [sonnet]  → "Security boundaries between agent and tools..."
# [gemini]  → "State management across long-running tasks..."
# [grok]    → "Catastrophic action authorization..."
# [opus]    → "Eval-production gap in agent behavior..."
2. Best-of-N with file context
# Score 5 candidates against a rubric, get the best
$ vario run "Summarize the key risks" -i quarterly_report.pdf -r best_of_n

# Result:
# Score: 88 | Model: sonnet | Cost: $0.03
# "Three primary risk categories emerge from Q4..."
3. Design a custom recipe from natural language
# Describe what you want, Vario designs the pipeline
$ vario design "generate 3 drafts, have them critique each other, \
  revise based on critiques, pick the best"

# Output: YAML recipe with produce → fan_out → score → revise → reduce
4. Iterative refinement with budget
# Refine until quality converges or budget exhausted
$ vario run "Write an investment memo for Acme Corp" \
  -i acme_10k.pdf -r refine_until_converged --budget 0.50

# Round 1: score 72 → "misses competitive landscape"
# Round 2: score 83 → "stronger, but regulatory risk understated"
# Round 3: score 89 → "comprehensive" (converged, Δ < 1.0)
5. Python API for integration
from vario.run.runner import execute, load_recipe

recipe = load_recipe("best_of_n")
result = await execute(recipe, "Evaluate this startup", limits={"usd": 0.10})

best = result.items[0]
print(f"Score: {best.score} | Model: {best.model}")
print(f"Cost: ${result.total_cost:.3f} | Time: {result.duration_ms:.0f}ms")
print(best.content)
TODO: Add a recorded terminal demo (VHS) showing a live best_of_n run with streaming output. The CLI already has rich formatting — capture it.

8. Vision: Where This Goes

Near-Term (3–6 months)

Medium-Term (6–18 months)

North Star

The vision: You describe a problem. Vario designs the right process, selects the right models, runs the pipeline, detects when quality is sufficient, and delivers a result with full provenance — at a cost/quality tradeoff you specified. The "recipe" becomes emergent, not hand-authored.

Why this compounds: Every recipe that works well becomes a reusable pattern. Every scored run generates data about which models and processes work best for which problems. The system gets better at designing processes as it accumulates execution history. The human's job shifts from crafting prompts to specifying quality criteria — a much more natural way to direct AI work.

Appendix: Architecture Reference

System Diagram

┌─────────────────────────────────────────────────────────────┐
│                    Vario Pipeline Engine                    │
│                                                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│  │  Recipe  │─▶│  Runner  │─▶│   Ops    │─▶│  RunLog  │     │
│  │  (YAML)  │  │          │  │ (stream) │  │ (traces) │     │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘     │
│                     │             │             │           │
│                ┌────┴────┐  ┌─────┴─────┐  ┌────┴────┐      │
│                │ Context │  │   Items   │  │ runs.db │      │
│                │ (budget,│  │ (content +│  │ (SQLite)│      │
│                │ traces) │  │  props +  │  └─────────┘      │
│                └─────────┘  │  history) │                   │
│  ┌─────────┐                └───────────┘                   │
│  │defaults │                ┌─────────┐                     │
│  │ .yaml   │                │  Heap   │                     │
│  │ (models,│                │(anytime │                     │
│  │  tiers) │                │priority)│                     │
│  └─────────┘                └─────────┘                     │
└─────────────────────────────────────────────────────────────┘
         │                         │
    ┌────┴────┐               ┌────┴────┐
    │   CLI   │               │ NiceGUI │
    │ (click) │               │   UI    │
    └─────────┘               └─────────┘

Recipe Library

Recipe                 | Pattern                                 | Best For
best_of_n              | produce → score → top-1                 | Quick quality boost
confirm                | fast produce → maxthink verify → revise | Draft fast, verify with frontier
ask                    | fan_out to 4+ top models                | Diverse perspectives
ask_fast               | fan_out to 5 cheap models               | Quick brainstorm
model_debate           | multi-model → score+verify → top-k      | Nuanced decisions
majority_vote          | produce(7) → score → vote               | Consensus questions
weighted_vote          | produce(7) → score → weighted           | Score-weighted consensus
refine_once            | produce → score → revise → top-1        | One-pass improvement
refine_until_converged | produce → repeat(score → revise)        | High-stakes output
summarize              | produce(3) → score → combine            | Multi-perspective synthesis
generate_and_verify    | produce(5) → verify → top-k             | Factual accuracy

File Map

File                   | Purpose
vario/item.py          | Item + Source definitions
vario/ops/*.py         | Nine streaming operations
vario/run/runner.py    | Recipe executor + repeat logic
vario/run/context.py   | Execution context (budget, traces)
vario/workflows/*.yaml | Recipe library
vario/heap.py          | Anytime priority queue
vario/cli.py           | Click CLI
vario/dsl.py           | Python DSL (alternative to YAML)

Glossary

Item
Vario's universal datum. Content (the actual text) plus accumulated properties (model, score, cost, etc.) plus a provenance history tracking what was added at each pipeline stage.
Op (Operation)
A streaming function that transforms Items. Type signature: AsyncIterator[Item] → AsyncIterator[Item]. Ops compose via pipe() or YAML recipes.
Recipe
A YAML-defined pipeline configuration specifying which ops to run, in what order, with what parameters. Can be hand-authored, loaded from the built-in library, or auto-designed from natural language.
RunLog
A structured execution narrative recording what happened during a pipeline run: stages, timing, token counts, costs, and outcome summary. Persisted to ~/.vario/runs.db.
Async Generator
Python's async def f() -> AsyncIterator pattern. Functions that yield values one at a time as they become available, enabling pipeline processing without waiting for all data to be ready.
Heap (Anytime Priority Queue)
A sorted collection of Items where the best item is always accessible, even while the pipeline is still running. Enables real-time UI updates and early termination.
Convergence
The point where additional refinement rounds stop improving quality. Detected via minimum improvement thresholds, drift windows, or score plateaus.
Provenance
The complete audit trail of an Item: which model generated it, what score it received and why, how many rounds of revision it went through, and what each stage cost.

Vario is part of Rivus, a system for amplifying human effort through AI orchestration. Built by Tim Chklovski, 2025–2026.

Report generated March 2026. View at static.localhost/present/vario/report.html.