The Gym System: Self-Improving AI Through Measured Practice

How structured generate → evaluate → learn loops close the quality gap — turning AI output from "good enough" to measurably excellent, one iteration at a time.

Contents
  1. The Problem: Same Mistakes, No Measurement
  2. The Core Idea: Gyms as Steering Modules
  3. How It Works
  4. Gym A: ASCII Diagrams
  5. Gym B: Content Extraction
  6. Gym C: Judge Calibration
  7. Prior Work & How Gyms Differ
  8. Metrics
  9. Demo
  10. Vision: Where This Goes

1. The Problem: Same Mistakes, No Measurement

AI systems make the same mistakes repeatedly. An LLM asked to draw a diagram will produce misaligned boxes. Asked again tomorrow, it will produce the same misaligned boxes. An extraction pipeline will let newsletter CTAs bleed into clean text. Run it next week — same CTAs survive.

The root cause is not capability. Modern LLMs can draw aligned diagrams and can clean text well. The problem is threefold:

Three compounding gaps

No measurement: Without a score, you cannot tell if output got better or worse. "Looks fine" is not a metric. A diagram that appears aligned in a proportional-width editor might be off by one character in a terminal — and you would never know.

No feedback loop: Even when someone notices a problem and fixes it manually, the fix stays in that session. The next session starts fresh. The insight — "use Unicode box-drawing, not ASCII +---+" — evaporates.

No compounding: Without measurement and a loop, improvements do not stack. Each improvement is local to one instance. Ten manual fixes produce ten slightly-better outputs — not one systematically-better system.

The result: human experts spend their time correcting the same classes of errors, session after session, instead of teaching the system to avoid them permanently.

The insight: You cannot improve what you cannot measure, and improvements do not compound without a loop. The gym system closes both gaps: it scores output programmatically, extracts patterns from failures, encodes fixes as reusable skills, and re-generates to verify the fix worked.

2. The Core Idea: Gyms as Steering Modules

A gym is a steering module that scores a system's output, surfaces failures, and closes the quality loop. It is not a training framework — it does not change model weights. It is not a prompt optimizer — it does not search prompt space. It is a quality feedback system that produces reusable artifacts: skills, conventions, validation tools.

One cycle of the loop:

  1. Generate: Run the system on a corpus of challenges. Produce output using the current approach (naive or skill-augmented).
  2. Evaluate: Score output using both programmatic validators (deterministic, fast, catches structural issues) and LLM judges (nuanced, catches aesthetic and semantic issues).
  3. Learn: Extract failure patterns. Group by root cause. Identify the 3–5 systematic issues that explain most failures.
  4. Apply: Encode fixes as permanent artifacts — a skill file, a convention, a validation tool, a set of examples. These persist across sessions.
  5. Re-generate: Run the same corpus again with fixes applied. Measure the delta. Did it actually improve? By how much?
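The five steps above can be sketched in a few lines of Python. The function names here (generate, evaluate, apply_fixes) are hypothetical stand-ins for the real recipe stages, not the gym's actual API:

```python
# Hypothetical sketch of one gen-eval-learn cycle; the real gym wires
# these stages together through vario recipes rather than direct calls.

def run_cycle(corpus, generate, evaluate, apply_fixes):
    # 1. Generate: produce output for every challenge in the corpus.
    outputs = {task: generate(task) for task in corpus}

    # 2. Evaluate: score each output (programmatic + judge, combined).
    scores = {task: evaluate(task, out) for task, out in outputs.items()}

    # 3. Learn: collect failing tasks for root-cause analysis.
    failures = [task for task, s in scores.items() if s < 0.8]

    # 4. Apply: encode fixes as durable artifacts (skills, conventions).
    apply_fixes(failures)

    # 5. Re-generate: rerun the same corpus and measure the delta.
    rescored = {task: evaluate(task, generate(task)) for task in corpus}
    before = sum(scores.values()) / len(scores)
    after = sum(rescored.values()) / len(rescored)
    return before, after
```

The return value makes the delta explicit: a cycle that does not move the average score is a cycle whose "fix" did not work.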

The key difference from "just iterate on the prompt": every cycle produces durable artifacts that benefit all future work, not just the next run. A skill file created in the ASCII diagram gym improves every diagram the system draws, forever.

3. How It Works

╔══════════════════════════════════════════════════════════════╗
║                          Gym System                          ║
╠══════════════════════════════════════════════════════════════╣
║                                                              ║
║  ┌──────────┐    ┌──────────────┐    ┌───────────────────┐   ║
║  │ Corpus   │───▶│  Generate    │───▶│  Evaluate         │   ║
║  │ (tasks)  │    │  (LLM / sys) │    │  (prog + judge)   │   ║
║  └──────────┘    └──────────────┘    └─────────┬─────────┘   ║
║                                                │             ║
║                    ┌───────────────────────────┘             ║
║                    ▼                                         ║
║             ╭─────────────╮        ╭──────────────╮          ║
║             │   Learn     │───────▶│    Apply     │          ║
║             │ (patterns)  │        │ (skill/conv) │          ║
║             ╰─────────────╯        ╰──────┬───────╯          ║
║                                           │                  ║
║              ┌────────────────────────────┘                  ║
║              ▼                                               ║
║       ┌──────────────┐                                       ║
║       │ Re-generate  │─── measure delta ──▶ done             ║
║       └──────────────┘                                       ║
║                                                              ║
╚══════════════════════════════════════════════════════════════╝

Each gym is defined as a vario recipe — a YAML configuration that wires together source items, task functions, and scoring functions. (A recipe defines a data pipeline: source items, processing steps, and evaluation functions; vario is the pipeline engine that executes recipes.) The gym CLI orchestrates execution:

gym run ascii_diagrams    # Run the full gen-eval-learn cycle
gym run extraction        # Run extraction quality evaluation
gym list                  # Show all registered gyms and their status

Evaluation is hybrid by design. Programmatic validators catch structural defects (misaligned borders, missing content anchors) with 100% reliability and zero cost. LLM judges assess qualitative dimensions (visual balance, readability, semantic coherence) that no validator can capture. The combination covers more ground than either approach alone.

Why not just LLM judges? An LLM judge rates text visually — it reads characters, not pixel positions. A diagram line at width 40 vs 41 looks identical to an LLM. But in a monospace terminal, it is visibly broken. Programmatic validators catch the structural issues that LLM judges literally cannot see. Conversely, an LLM judge catches "this diagram is technically correct but visually lopsided" — a qualitative assessment no regex can make.
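A minimal width validator shows the kind of structural check involved. This is a sketch in the spirit of the repo's checker, not the actual tools/check_ascii.py:

```python
def check_alignment(diagram: str) -> list[str]:
    """Report lines whose width differs from the block's first line.

    A monospace diagram is aligned only if every non-empty line has
    the same character width; proportional-font rendering hides this.
    """
    lines = [ln for ln in diagram.splitlines() if ln.strip()]
    if not lines:
        return []
    width = len(lines[0])
    return [
        f"line {i}: width {len(ln)} != {width} (off by {len(ln) - width})"
        for i, ln in enumerate(lines, 1)
        if len(ln) != width
    ]
```

An aligned block returns an empty list; every off-width line produces one precise, actionable message.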

4. Gym A: ASCII Diagrams

The ASCII diagram gym is the most complete demonstration of the gen-eval-learn loop. It took diagrams from a quality score of 0.42 to 0.86 — a 105% improvement — by identifying five systematic failure patterns and encoding fixes as a permanent skill.

  0.42   naive quality score
  0.86   skilled quality score
  +105%  overall improvement
  +0.60  convention adherence gain
  585    alignment issues found repo-wide

Round 1: Naive Generation

Ten diagram challenges were generated without any skill or convention guidance. The prompt was simply: "Draw an ASCII diagram of: {description}." The results revealed five systematic failure patterns:

NAIVE Pipeline diagram

+----------+     +----------+     +--------+
| produce  | --> |  score   | --> | reduce |
+----------+     +----------+     +--------+

Score: 0.42 — Plain ASCII, no hierarchy, no container, ragged borders

SKILLED Pipeline diagram

╔═══════════════════════════════════════╗
║            Pipeline                   ║
╠═══════════════════════════════════════╣
║                                       ║
║  ╭──────────╮   ╭──────────╮          ║
║  │ produce  │━━▶│  score   │          ║
║  ╰──────────╯   ╰─────┬────╯          ║
║                       │               ║
║                       ▼               ║
║                 ╭──────────╮          ║
║                 │  reduce  │          ║
║                 ╰──────────╯          ║
╚═══════════════════════════════════════╝

Score: 0.88 — Border hierarchy, Unicode arrows, all lines width 41

NAIVE Nested containers

Cloud
|
+-- Region A
|   +-- Service 1
|   +-- Service 2
|
+-- Region B
    +-- Service 3
    +-- Service 4

Score: 0.28 — Plain text tree, zero box-drawing characters

SKILLED Nested containers

╔════════════════════════════════════════════════╗
║                     Cloud                      ║
╠════════════════════════════════════════════════╣
║                                                ║
║  ┌────────────────────┐ ┌───────────────────┐  ║
║  │   Region A         │ │   Region B        │  ║
║  │ ╭──────╮ ╭──────╮  │ │ ╭─────╮ ╭─────╮   │  ║
║  │ │Svc 1 │ │Svc 2 │  │ │ │Svc 3│ │Svc 4│   │  ║
║  │ ╰──────╯ ╰──────╯  │ │ ╰─────╯ ╰─────╯   │  ║
║  └────────────────────┘ └───────────────────┘  ║
║                                                ║
╚════════════════════════════════════════════════╝

Score: 0.83 — Three-level hierarchy, all lines width 50. Largest improvement: +0.55

The Five Failure Patterns

  Pattern                             | Frequency | Root Cause                                  | Fix
  Single border style for all levels  | 9/10      | No convention exists in training data       | 4-level border lookup table in skill
  Ragged right borders                | 8/10      | LLMs generate left-to-right; no "ruler"     | Post-draw verification procedure
  ASCII arrows instead of Unicode     | 7/10      | Training data favors --> over ━━▶           | Unicode arrow reference table
  Missing structural elements         | 5/10      | No systematic process; model "simplifies"   | Inside-out drawing process (6 steps)
  No self-verification                | 10/10     | No instruction to re-examine output         | Post-generation verification checklist
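The first fix, the border lookup table, is the simplest to illustrate. The sketch below infers three levels from the example diagrams in this section; the actual four-level table lives in the skill file and may differ:

```python
# Hypothetical border lookup by nesting level, inferred from the example
# diagrams above; the real SKILL.md table may differ.
BORDER_STYLES = {
    1: {"corners": "╔╗╚╝", "h": "═", "v": "║"},  # outer container: double
    2: {"corners": "┌┐└┘", "h": "─", "v": "│"},  # mid-level grouping: light
    3: {"corners": "╭╮╰╯", "h": "─", "v": "│"},  # leaf node: rounded
}

def box(text: str, level: int) -> str:
    """Draw a single-line box using the border style for its nesting level."""
    s = BORDER_STYLES[level]
    tl, tr, bl, br = s["corners"]
    top = tl + s["h"] * (len(text) + 2) + tr
    mid = f'{s["v"]} {text} {s["v"]}'
    bot = bl + s["h"] * (len(text) + 2) + br
    return "\n".join([top, mid, bot])
```

The point of the lookup table is exactly what the gym found: border hierarchy is injected knowledge, not something the model reasons its way to.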

The Skill: From Patterns to Permanent Fix

All five patterns were encoded into a 187-line skill file (~/.claude/skills/ascii-diagram/SKILL.md) that is loaded into every future session. The skill includes the four-level border lookup table, the inside-out drawing process, the Unicode arrow reference, and the post-generation verification checklist.

Round 2: Measured Improvement

  Axis        | Naive (R1) | Skilled (R2) | Delta | Interpretation
  Alignment   | 0.43       | 0.85         | +0.42 | Verification procedure works
  Features    | 0.62       | 1.00         | +0.38 | Inside-out process captures all elements
  Conventions | 0.23       | 0.83         | +0.60 | Biggest gain — border hierarchy taught
  Judge       | 0.39       | 0.78         | +0.39 | Cascading aesthetic improvement
  Overall     | 0.42       | 0.86         | +0.44 |
The killer detail: Even hand-crafted "correct" diagrams needed 3–6 validation iterations to achieve perfect alignment. The skill's own example diagram had alignment bugs that the validator caught immediately. If a careful human cannot get alignment right on the first try, an LLM certainly cannot either. This validates the need for programmatic post-generation validation.
Convention adherence is the highest-leverage axis. The +0.60 improvement from 0.23 to 0.83 means the model essentially had no concept of border hierarchy until explicitly taught. This is pure knowledge injection — there is no emergent reasoning that would discover "containers should use double borders." The fix is a lookup table, not a reasoning breakthrough.

5. Gym B: Content Extraction

The extraction gym measures how well the system converts raw HTML pages — full of navigation bars, newsletter CTAs, social share buttons, and ad scaffolding — into clean article text. The challenge: strip the 90–99% boilerplate while preserving every word of actual content.

  0.81  baseline score (no LLM)
  0.89  with LLM clean
  +10%  overall improvement
  94%   compression (Substack)
  6     corpus sites scored

The Pipeline

Extraction runs a three-stage pipeline: fetch (HTTP request) → extract (readability + markdownify, strips HTML structure) → clean (optional LLM pass to remove surviving UI chrome). Each stage is independently measurable.
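The three-stage shape can be sketched as a simple composition. The stub stages below stand in for the real HTTP fetch, readability + markdownify extraction, and LLM cleaning calls; per-stage sizes are tracked because the gym uses them as diagnostics:

```python
# Stub three-stage extraction pipeline. The real stages are an HTTP
# fetch, readability + markdownify extraction, and an optional LLM
# cleaning pass; stage sizes are recorded so compression ratios can
# be inspected independently of scoring.

def run_pipeline(url, fetch, extract, clean=None):
    html = fetch(url)                      # stage 1: fetch
    text = extract(html)                   # stage 2: extract
    cleaned = clean(text) if clean else text  # stage 3: optional clean
    sizes = {"raw": len(html), "extracted": len(text), "cleaned": len(cleaned)}
    return cleaned, sizes
```

Keeping the clean stage optional mirrors the finding below: most site categories do not need it.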

Programmatic Scoring

The scoring function uses three deterministic checks, weighted by importance:

  Check             | Weight | What It Measures                                  | How
  Anchor retention  | 0.50   | Does the output contain expected content phrases? | Substring match against known article phrases
  Intrusion removal | 0.30   | Were known CTA/UI patterns stripped?              | Substring match against known intrusions ("Subscribe for free", "Share")
  Must-not-contain  | 0.20   | Are forbidden patterns absent?                    | Substring match against like/comment/share count patterns
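A sketch of the weighted scorer follows; the phrase lists passed in are illustrative, since the gym's real corpus data defines the actual anchors and intrusions:

```python
# Minimal sketch of the weighted three-check extraction scorer.
# Anchor/intrusion/forbidden lists are illustrative placeholders.

def score_extraction(text, anchors, intrusions, forbidden):
    def frac_present(phrases):
        return sum(p in text for p in phrases) / len(phrases)

    anchor_score = frac_present(anchors)            # these should survive
    intrusion_score = 1 - frac_present(intrusions)  # these should be stripped
    forbidden_score = 1 - frac_present(forbidden)   # these must be absent
    return 0.5 * anchor_score + 0.3 * intrusion_score + 0.2 * forbidden_score
```

Because every check is a substring match, the scorer is deterministic, costs nothing, and can run on every pipeline output.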

The Motivating Example: Substack Newsletters

substack_waxman: "Money at Machine Speed"

A 178KB HTML page yields 11KB of article text after readability extraction — 94% compression. But readability leaves behind the Substack CTA block: "Thanks for reading! Subscribe for free... 6 3 1 Share". The baseline score is 0.50 — content preserved, but intrusions survive.

With the LLM cleaning step (gemini-lite, temperature=0), the CTA block is reliably removed. Score jumps to 1.00. Cost per cleaning call: <$0.001. The LLM step takes ~10 seconds — acceptable for batch processing, too slow for interactive use.

Baseline vs. LLM-Cleaned Results

  Site                | Category   | Baseline | + LLM Clean | Delta
  substack_waxman     | Newsletter | 0.50     | 1.00        | +0.50
  substack_mauboussin | Newsletter | 0.35     | 0.35        | —
  reuters_article     | News       | 1.00     | 1.00        | —
  paulgraham          | Blog       | 1.00     | 1.00        | —
  medium_article      | Blog       | 1.00     | 1.00        | —
  python_docs         | Docs       | 1.00     | 1.00        | —
  Average             |            | 0.81     | 0.89        | +0.08

What the Gym Revealed

Category pattern: Newsletters are the hardest extraction target. News sites, blogs, and documentation extract cleanly with just readability. The LLM cleaning step only matters for sites with heavy UI chrome — meaning a targeted approach (apply LLM clean only to newsletters/social) is more cost-effective than a blanket strategy.

Paywall detection: The substack_mauboussin site extracts only 76 characters from 89KB of HTML — a paywall blocks content access. The gym's scoring correctly identifies this as a failure (0.35), and the LLM cleaning step cannot fix a content-access problem. This surfaced the need for paywall-aware fetch strategies.

Compression ratios as diagnostics: The gym tracks raw HTML → extracted → cleaned character counts. A 178KB → 11KB → 10.6KB progression (94% compression) is healthy. A 89KB → 76 byte extraction signals a structural problem before any scoring runs.

Next steps: Expand the corpus from 6 to 20+ sites for statistically meaningful coverage. Add more newsletter and social media targets where LLM cleaning provides the most value. Run multiple iterations to track score trajectories over time.

6. Gym C: Judge Calibration

This is the meta-gym — the system that checks its own checkers. If your LLM judge gives a score of 0.85, what does that mean? Is it well-calibrated? Would it give a lower score to objectively worse output? The judge calibration system answers these questions through two key mechanisms.

Monotonicity Testing

The core idea is elegant: if a judge is well-calibrated, deliberately degraded text should score lower than the original. The system applies seven perturbation types, scores both versions, and measures whether the judge consistently detects the degradation:

  Perturbation        | What It Does                                        | What It Tests
  remove_evidence     | Strips numbers, percentages, code references        | Does the judge value specificity?
  add_fluff           | Inserts plausible but irrelevant filler sentences   | Does the judge penalize verbosity?
  vague_ify           | Replaces specific values with "several", "some"     | Does the judge detect loss of precision?
  inject_errors       | Randomly multiplies/divides numbers (0.1x–10x)      | Does the judge catch factual corruption?
  scramble_order      | Shuffles paragraphs or lines                        | Does the judge value coherent structure?
  duplicate_content   | Repeats ~25% of non-empty lines                     | Does the judge detect redundancy?
  strip_actionability | Removes imperative sentences (Use/Run/Always/Never) | Does the judge value actionable guidance?
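One perturbation is enough to show the shape. A plausible sketch of vague_ify (the gym's actual implementation may differ):

```python
import re

def vague_ify(text: str) -> str:
    """Replace specific numeric values with a vague quantifier.

    Sketch of the vague_ify perturbation: a well-calibrated judge
    should score the output of this function below the original.
    """
    return re.sub(r"\d+(\.\d+)?%?", "several", text)
```

The perturbation is deliberately crude: it must degrade quality in a way a human would agree is worse, so that the judge's response can be interpreted.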

Calibration Metrics

For each perturbation type, three calibration metrics are computed; two of them gate whether the judge passes:

A judge passes a perturbation type if mean_drop > 0 AND effect_size > 0.5. The overall calibration score is the fraction of perturbation types passed.
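Under the assumption that effect_size means the mean drop divided by the standard deviation of the drops (a Cohen's-d-style measure; the report does not pin down the exact formula), the pass rule can be computed as:

```python
from statistics import mean, stdev

def judge_passes(original_scores, degraded_scores):
    """Apply the pass rule: mean_drop > 0 AND effect_size > 0.5.

    effect_size here is mean drop / stdev of drops, a Cohen's-d-style
    measure; this definition is an assumption, not the report's.
    """
    drops = [o - d for o, d in zip(original_scores, degraded_scores)]
    mean_drop = mean(drops)
    spread = stdev(drops) if len(drops) > 1 else 0.0
    if spread == 0:
        # Perfectly consistent drops: infinite effect if positive.
        effect_size = float("inf") if mean_drop > 0 else 0.0
    else:
        effect_size = mean_drop / spread
    return mean_drop > 0 and effect_size > 0.5
```

The overall calibration score is then the fraction of perturbation types for which this function returns True.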

Discrimination Analysis

Why this matters: Without calibration, you do not know if your evaluation system is trustworthy. A judge that clusters 60%+ of scores in a 20-point band cannot distinguish good from mediocre output. A judge that gives identical scores to original and degraded text is measuring noise, not quality. The calibration gym ensures every judge used in every other gym meets minimum discrimination standards.

The screening criteria reject judges that cluster the bulk of their scores in a narrow band, fail to score degraded text below the original, or otherwise lack discriminative power.

Next steps: Run a full calibration report across all models used as judges in the gym system (gemini-lite, Sonnet, Grok). Publish per-model monotonicity pass rates and effect sizes. This data exists in the calibration test suite but has not been aggregated into a presentation-ready report.

7. Prior Work & How Gyms Differ

Self-play (AlphaGo, AlphaZero)
  What it does: Agent plays against itself to discover strategies.
  How gyms differ: Gyms use domain-specific corpora, not self-generated challenges. Quality is measured against human-defined expectations, not game outcomes.
  Why it matters: Real-world tasks have external quality criteria that self-play cannot discover.

RLHF (InstructGPT, ChatGPT)
  What it does: Changes model weights using human preference signals.
  How gyms differ: Gyms change context (skills, conventions), not weights. Faster iteration, fully interpretable, instantly reversible.
  Why it matters: Weight updates require GPU clusters and are opaque. Skill files are text you can read and edit.

Constitutional AI (Anthropic)
  What it does: AI critiques its own output against a constitution.
  How gyms differ: Gyms use programmatic + LLM hybrid evaluation, not LLM-only critique. Structural issues require code, not opinions.
  Why it matters: LLM judges cannot count characters. Off-by-one alignment errors require programmatic checks.

DSPy (Stanford NLP)
  What it does: Compiles declarative LLM programs by optimizing prompts and demonstrations.
  How gyms differ: Gyms produce skills and conventions, not optimized prompts. Output is human-readable knowledge, not opaque prompt strings.
  Why it matters: A skill file teaches the model why to use double borders for containers. An optimized prompt just says "do this" without transferable understanding.

TextGrad (Stanford)
  What it does: Backpropagation through text — LLM-generated gradients for prompt optimization.
  How gyms differ: Gyms are domain-specific with real corpora. TextGrad optimizes generic prompts; gyms build specialized expertise for specific capabilities.
  Why it matters: A gym for ASCII diagrams produces a diagram skill. TextGrad would produce a marginally better prompt.

Voyager (Minecraft agent)
  What it does: Grows a skill library through exploration and self-verification.
  How gyms differ: Closest analog. Gyms add programmatic evaluation and calibrated LLM judges to the skill acquisition loop.
  Why it matters: Voyager verifies by execution (did the code run?). Gyms verify by quality measurement (is the output good?).
What is genuinely novel: The combination of (1) domain-specific evaluation corpora, (2) hybrid programmatic + LLM scoring, (3) durable artifact output (skills/conventions, not weight updates or optimized prompts), and (4) calibrated judges with monotonicity testing. Each element exists elsewhere; the combination — and the emphasis on producing human-readable, transferable knowledge — is distinctive.

8. Metrics

  +0.44  ASCII quality improvement
  +0.08  extraction quality improvement
  585    alignment issues found repo-wide
  7      perturbation types for calibration

ASCII Diagram Gym

  Metric               | Value       | Notes
  Overall quality      | 0.42 → 0.86 | +105% across 10 challenges
  Convention adherence | 0.23 → 0.83 | +260% — highest-leverage axis
  Feature completeness | 0.62 → 1.00 | Inside-out process captures all structural elements
  Alignment accuracy   | 0.43 → 0.85 | Verification procedure works
  LLM judge score      | 0.39 → 0.78 | Judges are stricter on aesthetics than validators

Extraction Gym

  Metric            | Value       | Notes
  Average overall   | 0.81 → 0.89 | LLM clean step at <$0.001 per call
  Intrusion removal | 0.75 → 0.92 | +23% — newsletter CTAs reliably stripped
  Anchor retention  | 0.83 → 0.83 | LLM clean preserves all article content
  Must-not-contain  | 0.83 → 1.00 | Like/comment/share counts eliminated
  Compression ratio | 67%–99.6%   | 178KB HTML → 10.6KB clean text (Substack)

Quality Ratings (where applicable)

Using the parallelogram scale for overall gym maturity:

  Gym               | Corpus | Scoring | Iteration | Skill Output | Overall
  ASCII Diagrams    | ▰▰▰▰▱  | ▰▰▰▰▰   | ▰▰▰▰▰     | ▰▰▰▰▰        | ▰▰▰▰▱
  Extraction        | ▰▰▱▱▱  | ▰▰▰▰▱   | ▰▰▰▱▱     | ▰▰▱▱▱        | ▰▰▰▱▱
  Judge Calibration | ▰▰▰▱▱  | ▰▰▰▰▰   | ▰▰▱▱▱     | ▰▰▰▱▱        | ▰▰▰▱▱

9. Demo

The gym system runs from the command line. Here is what a typical session looks like:

Running a Gym

# Run the ASCII diagram gym
$ gym run ascii_diagrams
[INFO] Loading recipe: learning/gyms/ascii_diagrams/gym.yaml
[INFO] Source: 10 challenges loaded
[INFO] Generating: pipeline... architecture... nested_containers... rating...
[INFO] Evaluating: alignment=0.85, features=1.00, conventions=0.83, judge=0.78
[INFO] Overall: 0.86 (was 0.42 on last naive run)

# Run the extraction gym
$ gym run extraction
[INFO] Loading recipe: learning/gyms/extraction/gym.yaml
[INFO] Source: 6 sites loaded
[INFO] Fetching: substack_waxman... reuters_article... paulgraham...
[INFO] Scoring: intrusion=0.92, anchor=0.83, must_not=1.00
[INFO] Overall: 0.89

# Check all gym statuses
$ gym list
NAME              STATUS    LAST SCORE    LAST RUN
ascii_diagrams    Done      0.86          2026-03-21
extraction        Active    0.89          2026-03-21
badge             Done      —             2026-02-28
fetchability      Done      —             2026-03-01
kb                Done      —             2026-03-02
assessor          Done      —             2026-03-05

Running the Validator Directly

# Check a specific file for ASCII diagram alignment issues
$ python tools/check_ascii.py present/gyms/report.html
Checking present/gyms/report.html...
  Block at line 142: 19 lines, all width 64. ALIGNED.
  236 code blocks scanned across repo.
  585 alignment issues found in 236 blocks.

What to Watch For

In a live demo, the key moments are:

  1. The naive generation — watch the model produce a diagram with plain +---+ borders and ragged alignment. This is the starting point.
  2. The scoring output — watch the validator catch specific issues: "line 5: content width 40 != box width 41 (off by 1)." These are issues invisible to the human eye in a proportional font.
  3. The re-generation — with the skill loaded, watch the model produce a properly hierarchical diagram with aligned borders. Same model, same prompt, dramatically better output.
  4. The delta calculation — the system reports exactly how much each axis improved, making the improvement concrete and falsifiable.

10. Vision: Where This Goes

Near-term: All Gyms Active

The system currently has 14 registered gyms across extraction, badge quality, fetchability, knowledge extraction, code cleanup, and more. The near-term goal is each gym actively running, each scoring its parent system's output, each producing measurable improvement deltas. Every gym produces reusable artifacts — skills, conventions, validators — that compound across the entire system.

  Gym              | Status | What It Improves
  Principle        | Next   | Agent behavior via evidence-backed principles
  Extraction       | Active | HTML extraction + LLM cleaning quality
  ASCII Diagrams   | Done   | Box-drawing diagram quality
  Badge            | Done   | Session topic summarization
  Fetchability     | Done   | Proxy/fetch method selection
  KB               | Done   | Knowledge extraction from web sources
  Assessor         | Done   | Person scoring prompt calibration
  Claim Extraction | Active | Claim completeness across 4 domains
  Sidekick         | TODO   | Intervention timing and helpfulness
  Code Cleanup     | TODO   | Codebase hygiene suggestions

Medium-term: Gyms as Jobs with Dashboard Integration

Each gym becomes a scheduled job that runs automatically — nightly, after code changes, or on-demand. Results flow into the system dashboard, creating a quality scoreboard where you can see at a glance which capabilities are improving and which are regressing. Regressions trigger alerts; improvements trigger celebrations.

The job integration also enables continuous calibration: the judge calibration gym runs whenever a new model is added to the evaluation pipeline, automatically screening it for discriminative power before it can influence scores.

North Star: The System That Improves While You Sleep

The north star is a system where you wake up and it has gotten measurably better overnight. Not because someone pushed a model update or tuned a prompt — but because the gym loop ran, found failures, encoded fixes, verified them, and deployed. The improvement is automatic, measured, and auditable.

Each gym is a small instance of this vision. The ASCII diagram gym took quality from 0.42 to 0.86 in a single iteration cycle. Scale that to 14 gyms, each running continuously, each producing durable artifacts that benefit the entire system — and the compounding begins. A skill created by one gym improves the input quality for another. Better extraction feeds better knowledge. Better knowledge feeds better principles. Better principles feed better everything.

The compounding effect: When gym A (extraction) produces cleaner input text, gym B (knowledge extraction) works on higher-quality data and produces better knowledge. That knowledge makes gym C (principle) produce better principles. Those principles make the system smarter, which improves the output that gym A scores. Each gym's improvement amplifies the others — this is not linear progress, it is a positive feedback loop.

Report generated March 2026. Data from the rivus gym system. ASCII diagram gym data from 10 challenges with programmatic + LLM hybrid scoring. Extraction gym data from 6 corpus sites. Judge calibration framework from lib/eval/calibration/. Source: static.localhost/present/gyms/report.html.