How structured generate → evaluate → learn loops close the quality gap — turning AI output from "good enough" to measurably excellent, one iteration at a time.
AI systems make the same mistakes repeatedly. An LLM asked to draw a diagram will produce misaligned boxes. Asked again tomorrow, it will produce the same misaligned boxes. An extraction pipeline will let newsletter CTAs bleed into clean text. Run it next week — same CTAs survive.
The root cause is not capability. Modern LLMs can draw aligned diagrams and can clean text well. The problem is threefold:
No measurement: Without a score, you cannot tell whether output got better or worse. "Looks fine" is not a metric. A diagram that appears aligned in a proportional-width editor might be off by one character in a terminal — and you would never know.
No feedback loop: Even when someone notices a problem and fixes it manually, the fix stays in that session. The next session starts fresh. The insight — "use Unicode box-drawing, not ASCII +---+" — evaporates.
No compounding: Without measurement and a loop, improvements do not stack. Each improvement is local to one instance. Ten manual fixes produce ten slightly-better outputs — not one systematically-better system.
The result: human experts spend their time correcting the same classes of errors, session after session, instead of teaching the system to avoid them permanently.
A gym is a steering module that scores a system's output, surfaces failures, and closes the quality loop. It is not a training framework — it does not change model weights. It is not a prompt optimizer — it does not search prompt space. It is a quality feedback system that produces reusable artifacts: skills, conventions, validation tools.
One cycle of the loop: generate candidate outputs, evaluate them with scoring functions, and learn by encoding the discovered fixes as durable artifacts.
The key difference from "just iterate on the prompt": every cycle produces durable artifacts that benefit all future work, not just the next run. A skill file created in the ASCII diagram gym improves every diagram the system draws, forever.
Each gym is defined as a vario recipe — a YAML file that wires together source items, task functions, and scoring functions. (Vario is the pipeline engine that executes recipes.) The gym CLI orchestrates execution:
gym run ascii_diagrams # Run the full gen-eval-learn cycle
gym run extraction # Run extraction quality evaluation
gym list # Show all registered gyms and their status
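A recipe might look roughly like this. This is a hypothetical sketch: the field names and structure are illustrative guesses, not the actual vario schema.

```yaml
# Hypothetical gym recipe sketch. Field names are illustrative,
# not the real vario schema.
name: ascii_diagrams
source:
  path: learning/gyms/ascii_diagrams/challenges.jsonl  # diagram challenges
task:
  function: draw_ascii_diagram       # generate one diagram per challenge
evaluation:
  - function: check_alignment        # programmatic: equal line widths
  - function: check_features         # programmatic: required elements present
  - function: llm_judge              # qualitative: balance, readability
```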
Evaluation is hybrid by design. Programmatic validators catch structural defects (misaligned borders, missing content anchors) with 100% reliability and zero cost. LLM judges assess qualitative dimensions (visual balance, readability, semantic coherence) that no validator can capture. The combination covers more ground than either approach alone.
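A minimal sketch of how such a hybrid blend might be wired. The function name, weights, and checks here are illustrative assumptions, not the gym's actual code:

```python
# Hybrid scoring sketch (illustrative, not the gym's actual code):
# programmatic validators return hard structural signals, an LLM judge
# supplies the qualitative score, and the two are blended.

def score_diagram(diagram: str, judge_score: float) -> float:
    """Blend deterministic validators with an LLM judge score in [0, 1]."""
    lines = [line for line in diagram.splitlines() if line.strip()]
    # Structural check: every line has the same width (alignment).
    aligned = 1.0 if len({len(line) for line in lines}) == 1 else 0.0
    # Structural check: Unicode box-drawing instead of ASCII +---+ borders.
    unicode_borders = 1.0 if any(c in diagram for c in "╔║╠┌│╭━") else 0.0
    # Validators catch structure; the judge catches aesthetics.
    return 0.3 * aligned + 0.3 * unicode_borders + 0.4 * judge_score

good = "╔══════╗\n║ done ║\n╚══════╝"
print(score_diagram(good, judge_score=0.8))  # both validators pass
```

Note the division of labor: the deterministic checks are free and perfectly repeatable, so they carry the structural axes; the judge score is reserved for what code cannot measure.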
The ASCII diagram gym is the most complete demonstration of the gen-eval-learn loop. It took diagrams from a quality score of 0.42 to 0.86 — a 105% improvement — by identifying five systematic failure patterns and encoding fixes as a permanent skill.
Ten diagram challenges were generated without any skill or convention guidance. The prompt was simply: "Draw an ASCII diagram of: {description}." The results revealed five systematic failure patterns:
NAIVE Pipeline diagram
+----------+     +----------+     +--------+
| produce  | --> |  score   | --> | reduce |
+----------+     +----------+     +--------+
Score: 0.42 — Plain ASCII, no hierarchy, no container, ragged borders
SKILLED Pipeline diagram
╔═══════════════════════════════════════╗
║ Pipeline                              ║
╠═══════════════════════════════════════╣
║                                       ║
║   ╭──────────╮   ╭──────────╮         ║
║   │ produce  │━━▶│  score   │         ║
║   ╰──────────╯   ╰─────┬────╯         ║
║                        │              ║
║                        ▼              ║
║                  ╭──────────╮         ║
║                  │  reduce  │         ║
║                  ╰──────────╯         ║
╚═══════════════════════════════════════╝
Score: 0.88 — Border hierarchy, Unicode arrows, all lines width 41
NAIVE Nested containers
Cloud
|
+-- Region A
| +-- Service 1
| +-- Service 2
|
+-- Region B
+-- Service 3
+-- Service 4
Score: 0.28 — Plain text tree, zero box-drawing characters
SKILLED Nested containers
╔════════════════════════════════════════════════╗
║ Cloud                                          ║
╠════════════════════════════════════════════════╣
║                                                ║
║ ┌────────────────────┐  ┌───────────────────┐  ║
║ │ Region A           │  │ Region B          │  ║
║ │ ╭──────╮  ╭──────╮ │  │ ╭─────╮  ╭─────╮  │  ║
║ │ │Svc 1 │  │Svc 2 │ │  │ │Svc 3│  │Svc 4│  │  ║
║ │ ╰──────╯  ╰──────╯ │  │ ╰─────╯  ╰─────╯  │  ║
║ └────────────────────┘  └───────────────────┘  ║
║                                                ║
╚════════════════════════════════════════════════╝
Score: 0.83 — Three-level hierarchy, all lines width 50. Largest improvement: +0.55
| Pattern | Frequency | Root Cause | Fix |
|---|---|---|---|
| Single border style for all levels | 9/10 | No convention exists in training data | 4-level border lookup table in skill |
| Ragged right borders | 8/10 | LLMs generate left-to-right; no "ruler" | Post-draw verification procedure |
| ASCII arrows instead of Unicode | 7/10 | Training data favors --> over ━━▶ | Unicode arrow reference table |
| Missing structural elements | 5/10 | No systematic process; model "simplifies" | Inside-out drawing process (6 steps) |
| No self-verification | 10/10 | No instruction to re-examine output | Post-generation verification checklist |
All five patterns were encoded into a 187-line skill file (~/.claude/skills/ascii-diagram/SKILL.md) that is loaded into every future session. The skill includes:
- A display_width() Python function as a concrete example

| Axis | Naive (R1) | Skilled (R2) | Delta | Interpretation |
|---|---|---|---|---|
| Alignment | 0.43 | 0.85 | +0.42 | Verification procedure works |
| Features | 0.62 | 1.00 | +0.38 | Inside-out process captures all elements |
| Conventions | 0.23 | 0.83 | +0.60 | Biggest gain — border hierarchy taught |
| Judge | 0.39 | 0.78 | +0.39 | Cascading aesthetic improvement |
| Overall | 0.42 | 0.86 | +0.44 | — |
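The display_width() helper mentioned as part of the skill can be approximated as follows. This is an illustrative reimplementation using the standard library, not the skill's exact code:

```python
# Illustrative sketch of a display_width() helper, not the skill's exact
# code. Terminal column width differs from len() for wide CJK characters.
import unicodedata

def display_width(s: str) -> int:
    """Terminal column width of s: wide/fullwidth chars count as 2 columns."""
    return sum(2 if unicodedata.east_asian_width(c) in ("W", "F") else 1
               for c in s)

# Box-drawing characters are single-column, so len() and display_width() agree:
assert display_width("╔═══╗") == len("╔═══╗") == 5
# But a CJK label widens a line in ways naive len() misses:
assert display_width("│ 日本 │") == 8   # len() reports only 6
```

A helper like this is what makes "all lines width 41" verifiable rather than eyeballed.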
The extraction gym measures how well the system converts raw HTML pages — full of navigation bars, newsletter CTAs, social share buttons, and ad scaffolding — into clean article text. The challenge: strip the 90–99% boilerplate while preserving every word of actual content.
Extraction runs a three-stage pipeline: fetch (HTTP request) → extract (readability + markdownify, strips HTML structure) → clean (optional LLM pass to remove surviving UI chrome). Each stage is independently measurable.
The scoring function uses three deterministic checks, weighted by importance:
| Check | Weight | What It Measures | How |
|---|---|---|---|
| Anchor retention | 0.50 | Does the output contain expected content phrases? | Substring match against known article phrases |
| Intrusion removal | 0.30 | Were known CTA/UI patterns stripped? | Substring match against known intrusions ("Subscribe for free", "Share") |
| Must-not-contain | 0.20 | Are forbidden patterns absent? | Substring match against patterns such as like/comment/share counts |
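The three checks above can be sketched as a single weighted function. The function and its argument names are illustrative assumptions, not the gym's actual code:

```python
# Sketch of the three-check extraction scorer using the table's weights
# (illustrative; not the gym's actual implementation).

def score_extraction(text: str, anchors: list[str],
                     intrusions: list[str], forbidden: list[str]) -> float:
    """0.50 anchor retention + 0.30 intrusion removal + 0.20 must-not-contain."""
    anchor = sum(a in text for a in anchors) / len(anchors)
    removed = sum(i not in text for i in intrusions) / len(intrusions)
    clean = sum(f not in text for f in forbidden) / len(forbidden)
    return 0.50 * anchor + 0.30 * removed + 0.20 * clean

# Content preserved but the CTA survives: partial credit only.
text = "The key finding holds. Thanks for reading! Subscribe for free"
print(score_extraction(text, anchors=["The key finding holds"],
                       intrusions=["Subscribe for free"],
                       forbidden=["6 3 1 Share"]))
```

Because every check is a substring match, the score is fully deterministic: the same extraction always gets the same number.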
A 178KB HTML page yields 11KB of article text after readability extraction — 94% compression. But readability leaves behind the Substack CTA block: "Thanks for reading! Subscribe for free... 6 3 1 Share". The baseline score is 0.50 — content preserved, but intrusions survive.
With the LLM cleaning step (gemini-lite, temperature=0), the CTA block is reliably removed. Score jumps to 1.00. Cost per cleaning call: <$0.001. The LLM step takes ~10 seconds — acceptable for batch processing, too slow for interactive use.
| Site | Category | Baseline | + LLM Clean | Delta |
|---|---|---|---|---|
| substack_waxman | Newsletter | 0.50 | 1.00 | +0.50 |
| substack_mauboussin | Newsletter | 0.35 | 0.35 | — |
| reuters_article | News | 1.00 | 1.00 | — |
| paulgraham | Blog | 1.00 | 1.00 | — |
| medium_article | Blog | 1.00 | 1.00 | — |
| python_docs | Docs | 1.00 | 1.00 | — |
| Average | — | 0.81 | 0.89 | +0.08 |
Paywall detection: The substack_mauboussin site extracts only 76 characters from 89KB of HTML — a paywall blocks content access. The gym's scoring correctly identifies this as a failure (0.35), and the LLM cleaning step cannot fix a content-access problem. This surfaced the need for paywall-aware fetch strategies.
Compression ratios as diagnostics: The gym tracks raw HTML → extracted → cleaned character counts. A 178KB → 11KB → 10.6KB progression (94% compression) is healthy. A 89KB → 76 byte extraction signals a structural problem before any scoring runs.
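A diagnostic like this is simple to sketch. The thresholds below (500 bytes, 50%) are illustrative assumptions, not the gym's actual cutoffs:

```python
# Compression-ratio diagnostic sketch. Thresholds are illustrative
# assumptions, not the gym's actual cutoffs.

def extraction_health(raw_bytes: int, extracted_bytes: int) -> str:
    """Flag structural failures before any content scoring runs."""
    if extracted_bytes < 500:        # e.g. 76 bytes from 89KB: paywall or block
        return "structural failure"
    if extracted_bytes / raw_bytes > 0.5:   # almost no boilerplate removed
        return "suspicious: low compression"
    return "healthy"

print(extraction_health(178_000, 11_000))  # healthy (94% compression)
print(extraction_health(89_000, 76))       # structural failure
```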
This is the meta-gym — the system that checks its own checkers. If your LLM judge gives a score of 0.85, what does that mean? Is it well-calibrated? Would it give a lower score to objectively worse output? The judge calibration system answers these questions through two key mechanisms.
The core idea is elegant: if a judge is well-calibrated, deliberately degraded text should score lower than the original. The system applies seven perturbation types, scores both versions, and measures whether the judge consistently detects the degradation:
| Perturbation | What It Does | What It Tests |
|---|---|---|
| remove_evidence | Strips numbers, percentages, code references | Does the judge value specificity? |
| add_fluff | Inserts plausible but irrelevant filler sentences | Does the judge penalize verbosity? |
| vague_ify | Replaces specific values with "several", "some" | Does the judge detect loss of precision? |
| inject_errors | Randomly multiplies/divides numbers (0.1x–10x) | Does the judge catch factual corruption? |
| scramble_order | Shuffles paragraphs or lines | Does the judge value coherent structure? |
| duplicate_content | Repeats ~25% of non-empty lines | Does the judge detect redundancy? |
| strip_actionability | Removes imperative sentences (Use/Run/Always/Never) | Does the judge value actionable guidance? |
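One of these perturbations, vague_ify, can be sketched with a couple of regex substitutions. This is an illustrative reimplementation; the real perturbation's rules are presumably richer:

```python
# Illustrative sketch of the vague_ify perturbation (the real rules
# are presumably richer than two regexes).
import re

def vague_ify(text: str) -> str:
    """Replace specific numeric values with vague quantifiers, so a
    well-calibrated judge should score the result lower."""
    text = re.sub(r"\d+(\.\d+)?%", "some percentage", text)
    return re.sub(r"\b\d+(\.\d+)?\b", "several", text)

print(vague_ify("Quality rose 105% across 10 challenges."))
# "Quality rose some percentage across several challenges."
```

The perturbed text stays grammatical and plausible, which is the point: only a judge that actually values precision will penalize it.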
For each perturbation type, three metrics computed over the original/perturbed score pairs determine whether the judge passes.
A judge passes a perturbation type if mean_drop > 0 AND effect_size > 0.5. The overall calibration score is the fraction of perturbation types passed.
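The pass criterion can be sketched as follows, using mean_drop and effect_size as named in the text. The exact statistic for effect_size is an assumption here (a Cohen's-d-style ratio of mean drop to its standard deviation):

```python
# Pass-criterion sketch for one perturbation type. The effect_size
# formulation (mean drop / stdev of drops) is an assumption, not
# necessarily the calibration system's exact statistic.
import statistics

def passes(orig_scores: list[float], perturbed_scores: list[float]) -> bool:
    drops = [o - p for o, p in zip(orig_scores, perturbed_scores)]
    mean_drop = statistics.mean(drops)
    sd = statistics.stdev(drops) if len(drops) > 1 else 0.0
    effect_size = mean_drop / sd if sd else float("inf")
    return mean_drop > 0 and effect_size > 0.5

# A judge that reliably scores degraded text lower passes this type.
print(passes([0.85, 0.80, 0.90], [0.60, 0.55, 0.70]))  # True
```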
The screening criteria reject judges that cannot reliably distinguish original output from degraded output; a judge without discriminative power is screened out before it can influence scores.
| Approach | What It Does | How Gyms Differ | Why It Matters |
|---|---|---|---|
| Self-play (AlphaGo, AlphaZero) | Agent plays against itself to discover strategies | Gyms use domain-specific corpora, not self-generated challenges. Quality is measured against human-defined expectations, not game outcomes. | Real-world tasks have external quality criteria that self-play cannot discover. |
| RLHF (InstructGPT, ChatGPT) | Changes model weights using human preference signals | Gyms change context (skills, conventions), not weights. Faster iteration, fully interpretable, instantly reversible. | Weight updates require GPU clusters and are opaque. Skill files are text you can read and edit. |
| Constitutional AI (Anthropic) | AI critiques its own output against a constitution | Gyms use programmatic + LLM hybrid evaluation, not LLM-only critique. Structural issues require code, not opinions. | LLM judges cannot count characters. Off-by-one alignment errors require programmatic checks. |
| DSPy (Stanford NLP) | Compiles declarative LLM programs by optimizing prompts and demonstrations | Gyms produce skills and conventions, not optimized prompts. Output is human-readable knowledge, not opaque prompt strings. | A skill file teaches the model why to use double borders for containers. An optimized prompt just says "do this" without transferable understanding. |
| TextGrad (MIT) | Backpropagation through text — LLM-generated gradients for prompt optimization | Gyms are domain-specific with real corpora. TextGrad optimizes generic prompts; gyms build specialized expertise for specific capabilities. | A gym for ASCII diagrams produces a diagram skill. TextGrad would produce a marginally better prompt. |
| Voyager (Minecraft agent) | Grows a skill library through exploration and self-verification | Closest analog. Gyms add programmatic evaluation and calibrated LLM judges to the skill acquisition loop. | Voyager verifies by execution (did the code run?). Gyms verify by quality measurement (is the output good?). |
| Metric | Value | Notes |
|---|---|---|
| Overall quality | 0.42 → 0.86 | +105% across 10 challenges |
| Convention adherence | 0.23 → 0.83 | +260% — highest-leverage axis |
| Feature completeness | 0.62 → 1.00 | Inside-out process captures all structural elements |
| Alignment accuracy | 0.43 → 0.85 | Verification procedure works |
| LLM judge score | 0.39 → 0.78 | Judges are stricter on aesthetics than validators |
| Metric | Value | Notes |
|---|---|---|
| Average overall | 0.81 → 0.89 | LLM clean step at <$0.001 per call |
| Intrusion removal | 0.75 → 0.92 | +23% — newsletter CTAs reliably stripped |
| Anchor retention | 0.83 → 0.83 | LLM clean preserves all article content |
| Must-not-contain | 0.83 → 1.00 | Like/comment/share counts eliminated |
| Compression ratio | 67%–99.6% | 178KB HTML → 10.6KB clean text (Substack) |
Using the parallelogram scale for overall gym maturity:
| Gym | Corpus | Scoring | Iteration | Skill Output | Overall |
|---|---|---|---|---|---|
| ASCII Diagrams | ▰▰▰▰▱ | ▰▰▰▰▰ | ▰▰▰▰▰ | ▰▰▰▰▰ | ▰▰▰▰▱ |
| Extraction | ▰▰▱▱▱ | ▰▰▰▰▱ | ▰▰▰▱▱ | ▰▰▱▱▱ | ▰▰▰▱▱ |
| Judge Calibration | ▰▰▰▱▱ | ▰▰▰▰▰ | ▰▰▱▱▱ | ▰▰▰▱▱ | ▰▰▰▱▱ |
The gym system runs from the command line. Here is what a typical session looks like:
# Run the ASCII diagram gym
$ gym run ascii_diagrams
[INFO] Loading recipe: learning/gyms/ascii_diagrams/gym.yaml
[INFO] Source: 10 challenges loaded
[INFO] Generating: pipeline... architecture... nested_containers... rating...
[INFO] Evaluating: alignment=0.85, features=1.00, conventions=0.83, judge=0.78
[INFO] Overall: 0.86 (was 0.42 on last naive run)
# Run the extraction gym
$ gym run extraction
[INFO] Loading recipe: learning/gyms/extraction/gym.yaml
[INFO] Source: 6 sites loaded
[INFO] Fetching: substack_waxman... reuters_article... paulgraham...
[INFO] Scoring: intrusion=0.92, anchor=0.83, must_not=1.00
[INFO] Overall: 0.89
# Check all gym statuses
$ gym list
NAME STATUS LAST SCORE LAST RUN
ascii_diagrams Done 0.86 2026-03-21
extraction Active 0.89 2026-03-21
badge Done — 2026-02-28
fetchability Done — 2026-03-01
kb Done — 2026-03-02
assessor Done — 2026-03-05
# Check a specific file for ASCII diagram alignment issues
$ python tools/check_ascii.py present/gyms/report.html
Checking present/gyms/report.html...
Block at line 142: 19 lines, all width 64. ALIGNED.
236 code blocks scanned across repo.
585 alignment issues found in 236 blocks.
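The core of a checker like tools/check_ascii.py can be sketched in a few lines. This is illustrative only: the real tool also extracts blocks from HTML, and per the skill it should measure display widths rather than plain len():

```python
# Minimal sketch of the alignment check behind a tool like
# tools/check_ascii.py (illustrative; the real tool also extracts
# blocks from HTML and should use display-aware widths, not len()).

def check_block(block: str) -> str:
    """A diagram block is ALIGNED when every line has the same width."""
    lines = block.splitlines()
    widths = {len(line) for line in lines}
    if len(widths) == 1:
        return f"{len(lines)} lines, all width {widths.pop()}. ALIGNED."
    return f"{len(lines)} lines, widths {sorted(widths)}. RAGGED."

print(check_block("╔════╗\n║ ok ║\n╚════╝"))  # 3 lines, all width 6. ALIGNED.
```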
In a live demo, the key moments include the naive baseline: +---+ borders and ragged alignment. This is the starting point.

The system currently has 14 registered gyms across extraction, badge quality, fetchability, knowledge extraction, code cleanup, and more. The near-term goal is each gym actively running, each scoring its parent system's output, each producing measurable improvement deltas. Every gym produces reusable artifacts — skills, conventions, validators — that compound across the entire system.
| Gym | Status | What It Improves |
|---|---|---|
| Principle | Next | Agent behavior via evidence-backed principles |
| Extraction | Active | HTML extraction + LLM cleaning quality |
| ASCII Diagrams | Done | Box-drawing diagram quality |
| Badge | Done | Session topic summarization |
| Fetchability | Done | Proxy/fetch method selection |
| KB | Done | Knowledge extraction from web sources |
| Assessor | Done | Person scoring prompt calibration |
| Claim Extraction | Active | Claim completeness across 4 domains |
| Sidekick | TODO | Intervention timing and helpfulness |
| Code Cleanup | TODO | Codebase hygiene suggestions |
Each gym becomes a scheduled job that runs automatically — nightly, after code changes, or on-demand. Results flow into the system dashboard, creating a quality scoreboard where you can see at a glance which capabilities are improving and which are regressing. Regressions trigger alerts; improvements trigger celebrations.
The job integration also enables continuous calibration: the judge calibration gym runs whenever a new model is added to the evaluation pipeline, automatically screening it for discriminative power before it can influence scores.
The north star is a system where you wake up and it has gotten measurably better overnight. Not because someone pushed a model update or tuned a prompt — but because the gym loop ran, found failures, encoded fixes, verified them, and deployed. The improvement is automatic, measured, and auditable.
Each gym is a small instance of this vision. The ASCII diagram gym took quality from 0.42 to 0.86 in a single iteration cycle. Scale that to 14 gyms, each running continuously, each producing durable artifacts that benefit the entire system — and the compounding begins. A skill created by one gym improves the input quality for another. Better extraction feeds better knowledge. Better knowledge feeds better principles. Better principles feed better everything.
Report generated March 2026. Data from the rivus gym system. ASCII diagram gym data from 10 challenges with programmatic + LLM hybrid scoring. Extraction gym data from 6 corpus sites. Judge calibration framework from lib/eval/calibration/. Source: static.localhost/present/gyms/report.html.