The Gym System: Self-Improving AI Through Measured Practice

How structured generate → evaluate → learn loops close the quality gap — turning AI output from "good enough" to measurably excellent, one iteration at a time.

Contents
  1. The Problem: Same Mistakes, No Measurement
  2. The Core Idea: Gyms as Steering Modules
  3. How It Works
  4. Gym A: ASCII Diagrams
  5. Gym B: Content Extraction
  6. Gym C: Judge Calibration
  7. Prior Work & How Gyms Differ
  8. Metrics
  9. Demo
  10. Vision: Where This Goes

1. The Problem: Same Mistakes, No Measurement

AI systems make the same mistakes repeatedly. An LLM asked to draw a diagram will produce misaligned boxes. Asked again tomorrow, it will produce the same misaligned boxes. An extraction pipeline will let newsletter CTAs bleed into clean text. Run it next week — same CTAs survive.

The root cause is not capability. Modern LLMs can draw aligned diagrams and can clean text well. The problem is threefold:

Three compounding gaps

No measurement: Without a score, you cannot tell if output got better or worse. "Looks fine" is not a metric. A diagram that appears aligned in a proportional-width editor might be off by one character in a terminal — and you would never know.

No feedback loop: Even when someone notices a problem and fixes it manually, the fix stays in that session. The next session starts fresh. The insight — "use Unicode box-drawing, not ASCII +---+" — evaporates.

No compounding: Without measurement and a loop, improvements do not stack. Each improvement is local to one instance. Ten manual fixes produce ten slightly-better outputs — not one systematically-better system.

The result: human experts spend their time correcting the same classes of errors, session after session, instead of teaching the system to avoid them permanently.

The insight: You cannot improve what you cannot measure, and improvements do not compound without a loop. The gym system closes both gaps: it scores output programmatically, extracts patterns from failures, encodes fixes as reusable skills, and re-generates to verify the fix worked.

2. The Core Idea: Gyms as Steering Modules

A gym is a steering module that scores a system's output, surfaces failures, and closes the quality loop. It is not a training framework — it does not change model weights. It is not a prompt optimizer — it does not search prompt space. It is a quality feedback system that produces reusable artifacts: skills, conventions, validation tools.

One cycle of the loop:

  1. Generate: Run the system on a corpus of challenges. Produce output using the current approach (naive or skill-augmented).
  2. Evaluate: Score output using both programmatic validators (deterministic, fast, catches structural issues) and LLM judges (nuanced, catches aesthetic and semantic issues).
  3. Learn: Extract failure patterns. Group by root cause. Identify the 3–5 systematic issues that explain most failures.
  4. Apply: Encode fixes as permanent artifacts — a skill file, a convention, a validation tool, a set of examples. These persist across sessions.
  5. Re-generate: Run the same corpus again with fixes applied. Measure the delta. Did it actually improve? By how much?
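The five steps above can be sketched in a few lines of Python. The function names here (generate, evaluate, apply_fixes) are hypothetical stand-ins for the real recipe stages, not the gym's actual API:

```python
# Hypothetical sketch of one gen-eval-learn cycle; the real gym wires
# these stages together through vario recipes rather than direct calls.

def run_cycle(corpus, generate, evaluate, apply_fixes):
    # 1. Generate: produce output for every challenge in the corpus.
    outputs = {task: generate(task) for task in corpus}

    # 2. Evaluate: score each output (programmatic + judge, combined).
    scores = {task: evaluate(task, out) for task, out in outputs.items()}

    # 3. Learn: collect failing tasks for root-cause analysis.
    failures = [task for task, s in scores.items() if s < 0.8]

    # 4. Apply: encode fixes as durable artifacts (skills, conventions).
    apply_fixes(failures)

    # 5. Re-generate: rerun the same corpus and measure the delta.
    rescored = {task: evaluate(task, generate(task)) for task in corpus}
    before = sum(scores.values()) / len(scores)
    after = sum(rescored.values()) / len(rescored)
    return before, after
```

The return value makes the delta explicit: a cycle that does not move the average score is a cycle whose "fix" did not work.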

The key difference from "just iterate on the prompt": every cycle produces durable artifacts that benefit all future work, not just the next run. A skill file created in the ASCII diagram gym improves every diagram the system draws, forever.

3. How It Works

╔══════════════════════════════════════════════════════════════╗
║                          Gym System                          ║
╠══════════════════════════════════════════════════════════════╣
║                                                              ║
║  ┌──────────┐    ┌──────────────┐    ┌───────────────────┐   ║
║  │ Corpus   │───▶│  Generate    │───▶│  Evaluate         │   ║
║  │ (tasks)  │    │  (LLM / sys) │    │  (prog + judge)   │   ║
║  └──────────┘    └──────────────┘    └─────────┬─────────┘   ║
║                                                │             ║
║                    ┌───────────────────────────┘             ║
║                    ▼                                         ║
║             ╭─────────────╮        ╭──────────────╮          ║
║             │   Learn     │───────▶│    Apply     │          ║
║             │ (patterns)  │        │ (skill/conv) │          ║
║             ╰─────────────╯        ╰──────┬───────╯          ║
║                                           │                  ║
║              ┌────────────────────────────┘                  ║
║              ▼                                               ║
║       ┌──────────────┐                                       ║
║       │ Re-generate  │─── measure delta ──▶ done             ║
║       └──────────────┘                                       ║
║                                                              ║
╚══════════════════════════════════════════════════════════════╝

Each gym is defined as a vario recipe — a YAML configuration that wires together source items, task functions, and scoring functions. (A recipe defines a data pipeline: source items, processing steps, and evaluation functions; vario is the pipeline engine that executes recipes.) The gym CLI orchestrates execution:

gym run ascii_diagrams    # Run the full gen-eval-learn cycle
gym run extraction        # Run extraction quality evaluation
gym list                  # Show all registered gyms and their status

Evaluation is hybrid by design. Programmatic validators catch structural defects (misaligned borders, missing content anchors) with 100% reliability and zero cost. LLM judges assess qualitative dimensions (visual balance, readability, semantic coherence) that no validator can capture. The combination covers more ground than either approach alone.

Why not just LLM judges? An LLM judge rates text visually — it reads characters, not pixel positions. A diagram line at width 40 vs 41 looks identical to an LLM. But in a monospace terminal, it is visibly broken. Programmatic validators catch the structural issues that LLM judges literally cannot see. Conversely, an LLM judge catches "this diagram is technically correct but visually lopsided" — a qualitative assessment no regex can make.
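A minimal width validator shows the kind of structural check involved. This is a sketch in the spirit of the repo's checker, not the actual tools/check_ascii.py:

```python
def check_alignment(diagram: str) -> list[str]:
    """Report lines whose width differs from the block's first line.

    A monospace diagram is aligned only if every non-empty line has
    the same character width; proportional-font rendering hides this.
    """
    lines = [ln for ln in diagram.splitlines() if ln.strip()]
    if not lines:
        return []
    width = len(lines[0])
    return [
        f"line {i}: width {len(ln)} != {width} (off by {len(ln) - width})"
        for i, ln in enumerate(lines, 1)
        if len(ln) != width
    ]
```

An aligned block returns an empty list; every off-width line produces one precise, actionable message.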

4. Gym A: ASCII Diagrams

The ASCII diagram gym is the most complete demonstration of the gen-eval-learn loop. It took diagrams from a quality score of 0.42 to 0.86 — a 105% improvement — by identifying five systematic failure patterns and encoding fixes as a permanent skill.

  0.42   naive quality score
  0.86   skilled quality score
  +105%  overall improvement
  +0.60  convention adherence gain
  585    alignment issues found repo-wide

Round 1: Naive Generation

Ten diagram challenges were generated without any skill or convention guidance. The prompt was simply: "Draw an ASCII diagram of: {description}." The results revealed five systematic failure patterns:

NAIVE Pipeline diagram

+----------+     +----------+     +--------+
| produce  | --> |  score   | --> | reduce |
+----------+     +----------+     +--------+

Score: 0.42 — Plain ASCII, no hierarchy, no container, ragged borders

SKILLED Pipeline diagram

╔═══════════════════════════════════════╗
║            Pipeline                   ║
╠═══════════════════════════════════════╣
║                                       ║
║  ╭──────────╮   ╭──────────╮          ║
║  │ produce  │━━▶│  score   │          ║
║  ╰──────────╯   ╰─────┬────╯          ║
║                       │               ║
║                       ▼               ║
║                 ╭──────────╮          ║
║                 │  reduce  │          ║
║                 ╰──────────╯          ║
╚═══════════════════════════════════════╝

Score: 0.88 — Border hierarchy, Unicode arrows, all lines width 41

NAIVE Nested containers

Cloud
|
+-- Region A
|   +-- Service 1
|   +-- Service 2
|
+-- Region B
    +-- Service 3
    +-- Service 4

Score: 0.28 — Plain text tree, zero box-drawing characters

SKILLED Nested containers

╔════════════════════════════════════════════════╗
║                     Cloud                      ║
╠════════════════════════════════════════════════╣
║                                                ║
║  ┌────────────────────┐ ┌───────────────────┐  ║
║  │   Region A         │ │   Region B        │  ║
║  │ ╭──────╮ ╭──────╮  │ │ ╭─────╮ ╭─────╮   │  ║
║  │ │Svc 1 │ │Svc 2 │  │ │ │Svc 3│ │Svc 4│   │  ║
║  │ ╰──────╯ ╰──────╯  │ │ ╰─────╯ ╰─────╯   │  ║
║  └────────────────────┘ └───────────────────┘  ║
║                                                ║
╚════════════════════════════════════════════════╝

Score: 0.83 — Three-level hierarchy, all lines width 50. Largest improvement: +0.55

The Five Failure Patterns

  Pattern                             | Frequency | Root Cause                                  | Fix
  Single border style for all levels  | 9/10      | No convention exists in training data       | 4-level border lookup table in skill
  Ragged right borders                | 8/10      | LLMs generate left-to-right; no "ruler"     | Post-draw verification procedure
  ASCII arrows instead of Unicode     | 7/10      | Training data favors --> over ━━▶           | Unicode arrow reference table
  Missing structural elements         | 5/10      | No systematic process; model "simplifies"   | Inside-out drawing process (6 steps)
  No self-verification                | 10/10     | No instruction to re-examine output         | Post-generation verification checklist
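The first fix, the border lookup table, is the simplest to illustrate. The sketch below infers three levels from the example diagrams in this section; the actual four-level table lives in the skill file and may differ:

```python
# Hypothetical border lookup by nesting level, inferred from the example
# diagrams above; the real SKILL.md table may differ.
BORDER_STYLES = {
    1: {"corners": "╔╗╚╝", "h": "═", "v": "║"},  # outer container: double
    2: {"corners": "┌┐└┘", "h": "─", "v": "│"},  # mid-level grouping: light
    3: {"corners": "╭╮╰╯", "h": "─", "v": "│"},  # leaf node: rounded
}

def box(text: str, level: int) -> str:
    """Draw a single-line box using the border style for its nesting level."""
    s = BORDER_STYLES[level]
    tl, tr, bl, br = s["corners"]
    top = tl + s["h"] * (len(text) + 2) + tr
    mid = f'{s["v"]} {text} {s["v"]}'
    bot = bl + s["h"] * (len(text) + 2) + br
    return "\n".join([top, mid, bot])
```

The point of the lookup table is exactly what the gym found: border hierarchy is injected knowledge, not something the model reasons its way to.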

The Skill: From Patterns to Permanent Fix

All five patterns were encoded into a 187-line skill file (~/.claude/skills/ascii-diagram/SKILL.md) that is loaded into every future session. The skill includes the four-level border lookup table, the inside-out drawing process, the Unicode arrow reference, and the post-generation verification checklist.

Round 2: Measured Improvement

  Axis        | Naive (R1) | Skilled (R2) | Delta | Interpretation
  Alignment   | 0.43       | 0.85         | +0.42 | Verification procedure works
  Features    | 0.62       | 1.00         | +0.38 | Inside-out process captures all elements
  Conventions | 0.23       | 0.83         | +0.60 | Biggest gain — border hierarchy taught
  Judge       | 0.39       | 0.78         | +0.39 | Cascading aesthetic improvement
  Overall     | 0.42       | 0.86         | +0.44 |
The killer detail: Even hand-crafted "correct" diagrams needed 3–6 validation iterations to achieve perfect alignment. The skill's own example diagram had alignment bugs that the validator caught immediately. If a careful human cannot get alignment right on the first try, an LLM certainly cannot either. This validates the need for programmatic post-generation validation.
Convention adherence is the highest-leverage axis. The +0.60 improvement from 0.23 to 0.83 means the model essentially had no concept of border hierarchy until explicitly taught. This is pure knowledge injection — there is no emergent reasoning that would discover "containers should use double borders." The fix is a lookup table, not a reasoning breakthrough.

5. Gym B: Content Extraction

The extraction gym measures how well the system converts raw HTML pages — full of navigation bars, newsletter CTAs, social share buttons, and ad scaffolding — into clean article text. The challenge: strip the 90–99% boilerplate while preserving every word of actual content.

  0.81  baseline score (no LLM)
  0.89  with LLM clean
  +10%  overall improvement
  94%   compression (Substack)
  6     corpus sites scored

The Pipeline

Extraction runs a three-stage pipeline: fetch (HTTP request) → extract (readability + markdownify, strips HTML structure) → clean (optional LLM pass to remove surviving UI chrome). Each stage is independently measurable.
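The three-stage shape can be sketched as a simple composition. The stub stages below stand in for the real HTTP fetch, readability + markdownify extraction, and LLM cleaning calls; per-stage sizes are tracked because the gym uses them as diagnostics:

```python
# Stub three-stage extraction pipeline. The real stages are an HTTP
# fetch, readability + markdownify extraction, and an optional LLM
# cleaning pass; stage sizes are recorded so compression ratios can
# be inspected independently of scoring.

def run_pipeline(url, fetch, extract, clean=None):
    html = fetch(url)                      # stage 1: fetch
    text = extract(html)                   # stage 2: extract
    cleaned = clean(text) if clean else text  # stage 3: optional clean
    sizes = {"raw": len(html), "extracted": len(text), "cleaned": len(cleaned)}
    return cleaned, sizes
```

Keeping the clean stage optional mirrors the finding below: most site categories do not need it.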

Programmatic Scoring

The scoring function uses three deterministic checks, weighted by importance:

  Check             | Weight | What It Measures                                  | How
  Anchor retention  | 0.50   | Does the output contain expected content phrases? | Substring match against known article phrases
  Intrusion removal | 0.30   | Were known CTA/UI patterns stripped?              | Substring match against known intrusions ("Subscribe for free", "Share")
  Must-not-contain  | 0.20   | Are forbidden patterns absent?                    | Substring match against like/comment/share count patterns
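A sketch of the weighted scorer follows; the phrase lists passed in are illustrative, since the gym's real corpus data defines the actual anchors and intrusions:

```python
# Minimal sketch of the weighted three-check extraction scorer.
# Anchor/intrusion/forbidden lists are illustrative placeholders.

def score_extraction(text, anchors, intrusions, forbidden):
    def frac_present(phrases):
        return sum(p in text for p in phrases) / len(phrases)

    anchor_score = frac_present(anchors)            # these should survive
    intrusion_score = 1 - frac_present(intrusions)  # these should be stripped
    forbidden_score = 1 - frac_present(forbidden)   # these must be absent
    return 0.5 * anchor_score + 0.3 * intrusion_score + 0.2 * forbidden_score
```

Because every check is a substring match, the scorer is deterministic, costs nothing, and can run on every pipeline output.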

The Motivating Example: Substack Newsletters

substack_waxman: "Money at Machine Speed"

A 178KB HTML page yields 11KB of article text after readability extraction — 94% compression. But readability leaves behind the Substack CTA block: "Thanks for reading! Subscribe for free... 6 3 1 Share". The baseline score is 0.50 — content preserved, but intrusions survive.

With the LLM cleaning step (gemini-lite, temperature=0), the CTA block is reliably removed. Score jumps to 1.00. Cost per cleaning call: <$0.001. The LLM step takes ~10 seconds — acceptable for batch processing, too slow for interactive use.

Baseline vs. LLM-Cleaned Results

  Site                | Category   | Baseline | + LLM Clean | Delta
  substack_waxman     | Newsletter | 0.50     | 1.00        | +0.50
  substack_mauboussin | Newsletter | 0.35     | 0.35        | —
  reuters_article     | News       | 1.00     | 1.00        | —
  paulgraham          | Blog       | 1.00     | 1.00        | —
  medium_article      | Blog       | 1.00     | 1.00        | —
  python_docs         | Docs       | 1.00     | 1.00        | —
  Average             |            | 0.81     | 0.89        | +0.08

What the Gym Revealed

Category pattern: Newsletters are the hardest extraction target. News sites, blogs, and documentation extract cleanly with just readability. The LLM cleaning step only matters for sites with heavy UI chrome — meaning a targeted approach (apply LLM clean only to newsletters/social) is more cost-effective than a blanket strategy.

Paywall detection: The substack_mauboussin site extracts only 76 characters from 89KB of HTML — a paywall blocks content access. The gym's scoring correctly identifies this as a failure (0.35), and the LLM cleaning step cannot fix a content-access problem. This surfaced the need for paywall-aware fetch strategies.

Compression ratios as diagnostics: The gym tracks raw HTML → extracted → cleaned character counts. A 178KB → 11KB → 10.6KB progression (94% compression) is healthy. A 89KB → 76 byte extraction signals a structural problem before any scoring runs.

Next steps: Expand the corpus from 6 to 20+ sites for statistically meaningful coverage. Add more newsletter and social media targets where LLM cleaning provides the most value. Run multiple iterations to track score trajectories over time.

6. Gym C: Judge Calibration

This is the meta-gym — the system that checks its own checkers. If your LLM judge gives a score of 0.85, what does that mean? Is it well-calibrated? Would it give a lower score to objectively worse output? The judge calibration system answers these questions through two key mechanisms.

Monotonicity Testing

The core idea is elegant: if a judge is well-calibrated, deliberately degraded text should score lower than the original. The system applies seven perturbation types, scores both versions, and measures whether the judge consistently detects the degradation:

  Perturbation        | What It Does                                        | What It Tests
  remove_evidence     | Strips numbers, percentages, code references        | Does the judge value specificity?
  add_fluff           | Inserts plausible but irrelevant filler sentences   | Does the judge penalize verbosity?
  vague_ify           | Replaces specific values with "several", "some"     | Does the judge detect loss of precision?
  inject_errors       | Randomly multiplies/divides numbers (0.1x–10x)      | Does the judge catch factual corruption?
  scramble_order      | Shuffles paragraphs or lines                        | Does the judge value coherent structure?
  duplicate_content   | Repeats ~25% of non-empty lines                     | Does the judge detect redundancy?
  strip_actionability | Removes imperative sentences (Use/Run/Always/Never) | Does the judge value actionable guidance?
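One perturbation is enough to show the shape. A plausible sketch of vague_ify (the gym's actual implementation may differ):

```python
import re

def vague_ify(text: str) -> str:
    """Replace specific numeric values with a vague quantifier.

    Sketch of the vague_ify perturbation: a well-calibrated judge
    should score the output of this function below the original.
    """
    return re.sub(r"\d+(\.\d+)?%?", "several", text)
```

The perturbation is deliberately crude: it must degrade quality in a way a human would agree is worse, so that the judge's response can be interpreted.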

Calibration Metrics

For each perturbation type, three calibration metrics are computed; two of them gate whether the judge passes:

A judge passes a perturbation type if mean_drop > 0 AND effect_size > 0.5. The overall calibration score is the fraction of perturbation types passed.
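Under the assumption that effect_size means the mean drop divided by the standard deviation of the drops (a Cohen's-d-style measure; the report does not pin down the exact formula), the pass rule can be computed as:

```python
from statistics import mean, stdev

def judge_passes(original_scores, degraded_scores):
    """Apply the pass rule: mean_drop > 0 AND effect_size > 0.5.

    effect_size here is mean drop / stdev of drops, a Cohen's-d-style
    measure; this definition is an assumption, not the report's.
    """
    drops = [o - d for o, d in zip(original_scores, degraded_scores)]
    mean_drop = mean(drops)
    spread = stdev(drops) if len(drops) > 1 else 0.0
    if spread == 0:
        # Perfectly consistent drops: infinite effect if positive.
        effect_size = float("inf") if mean_drop > 0 else 0.0
    else:
        effect_size = mean_drop / spread
    return mean_drop > 0 and effect_size > 0.5
```

The overall calibration score is then the fraction of perturbation types for which this function returns True.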

Discrimination Analysis

Why this matters: Without calibration, you do not know if your evaluation system is trustworthy. A judge that clusters 60%+ of scores in a 20-point band cannot distinguish good from mediocre output. A judge that gives identical scores to original and degraded text is measuring noise, not quality. The calibration gym ensures every judge used in every other gym meets minimum discrimination standards.

The screening criteria reject judges that cluster the bulk of their scores in a narrow band, fail to score degraded text below the original, or otherwise lack discriminative power.

Next steps: Run a full calibration report across all models used as judges in the gym system (gemini-lite, Sonnet, Grok). Publish per-model monotonicity pass rates and effect sizes. This data exists in the calibration test suite but has not been aggregated into a presentation-ready report.

7. Prior Work & How Gyms Differ

Self-play (AlphaGo, AlphaZero)
  What it does: Agent plays against itself to discover strategies.
  How gyms differ: Gyms use domain-specific corpora, not self-generated challenges. Quality is measured against human-defined expectations, not game outcomes.
  Why it matters: Real-world tasks have external quality criteria that self-play cannot discover.

RLHF (InstructGPT, ChatGPT)
  What it does: Changes model weights using human preference signals.
  How gyms differ: Gyms change context (skills, conventions), not weights. Faster iteration, fully interpretable, instantly reversible.
  Why it matters: Weight updates require GPU clusters and are opaque. Skill files are text you can read and edit.

Constitutional AI (Anthropic)
  What it does: AI critiques its own output against a constitution.
  How gyms differ: Gyms use programmatic + LLM hybrid evaluation, not LLM-only critique. Structural issues require code, not opinions.
  Why it matters: LLM judges cannot count characters. Off-by-one alignment errors require programmatic checks.

DSPy (Stanford NLP)
  What it does: Compiles declarative LLM programs by optimizing prompts and demonstrations.
  How gyms differ: Gyms produce skills and conventions, not optimized prompts. Output is human-readable knowledge, not opaque prompt strings.
  Why it matters: A skill file teaches the model why to use double borders for containers. An optimized prompt just says "do this" without transferable understanding.

TextGrad (Stanford)
  What it does: Backpropagation through text — LLM-generated gradients for prompt optimization.
  How gyms differ: Gyms are domain-specific with real corpora. TextGrad optimizes generic prompts; gyms build specialized expertise for specific capabilities.
  Why it matters: A gym for ASCII diagrams produces a diagram skill. TextGrad would produce a marginally better prompt.

Voyager (Minecraft agent)
  What it does: Grows a skill library through exploration and self-verification.
  How gyms differ: Closest analog. Gyms add programmatic evaluation and calibrated LLM judges to the skill acquisition loop.
  Why it matters: Voyager verifies by execution (did the code run?). Gyms verify by quality measurement (is the output good?).
What is genuinely novel: The combination of (1) domain-specific evaluation corpora, (2) hybrid programmatic + LLM scoring, (3) durable artifact output (skills/conventions, not weight updates or optimized prompts), and (4) calibrated judges with monotonicity testing. Each element exists elsewhere; the combination — and the emphasis on producing human-readable, transferable knowledge — is distinctive.

8. Metrics

  +0.44  ASCII quality improvement
  +0.08  extraction quality improvement
  585    alignment issues found repo-wide
  7      perturbation types for calibration

ASCII Diagram Gym

  Metric               | Value       | Notes
  Overall quality      | 0.42 → 0.86 | +105% across 10 challenges
  Convention adherence | 0.23 → 0.83 | +260% — highest-leverage axis
  Feature completeness | 0.62 → 1.00 | Inside-out process captures all structural elements
  Alignment accuracy   | 0.43 → 0.85 | Verification procedure works
  LLM judge score      | 0.39 → 0.78 | Judges are stricter on aesthetics than validators

Extraction Gym

  Metric            | Value       | Notes
  Average overall   | 0.81 → 0.89 | LLM clean step at <$0.001 per call
  Intrusion removal | 0.75 → 0.92 | +23% — newsletter CTAs reliably stripped
  Anchor retention  | 0.83 → 0.83 | LLM clean preserves all article content
  Must-not-contain  | 0.83 → 1.00 | Like/comment/share counts eliminated
  Compression ratio | 67%–99.6%   | 178KB HTML → 10.6KB clean text (Substack)

Quality Ratings (where applicable)

Using the parallelogram scale for overall gym maturity:

  Gym               | Corpus | Scoring | Iteration | Skill Output | Overall
  ASCII Diagrams    | ▰▰▰▰▱  | ▰▰▰▰▰   | ▰▰▰▰▰     | ▰▰▰▰▰        | ▰▰▰▰▱
  Extraction        | ▰▰▱▱▱  | ▰▰▰▰▱   | ▰▰▰▱▱     | ▰▰▱▱▱        | ▰▰▰▱▱
  Judge Calibration | ▰▰▰▱▱  | ▰▰▰▰▰   | ▰▰▱▱▱     | ▰▰▰▱▱        | ▰▰▰▱▱

9. Demo

The gym system runs from the command line. Here is what a typical session looks like:

Running a Gym

# Run the ASCII diagram gym
$ gym run ascii_diagrams
[INFO] Loading recipe: learning/gyms/ascii_diagrams/gym.yaml
[INFO] Source: 10 challenges loaded
[INFO] Generating: pipeline... architecture... nested_containers... rating...
[INFO] Evaluating: alignment=0.85, features=1.00, conventions=0.83, judge=0.78
[INFO] Overall: 0.86 (was 0.42 on last naive run)

# Run the extraction gym
$ gym run extraction
[INFO] Loading recipe: learning/gyms/extraction/gym.yaml
[INFO] Source: 6 sites loaded
[INFO] Fetching: substack_waxman... reuters_article... paulgraham...
[INFO] Scoring: intrusion=0.92, anchor=0.83, must_not=1.00
[INFO] Overall: 0.89

# Check all gym statuses
$ gym list
NAME              STATUS    LAST SCORE    LAST RUN
ascii_diagrams    Done      0.86          2026-03-21
extraction        Active    0.89          2026-03-21
badge             Done      —             2026-02-28
fetchability      Done      —             2026-03-01
kb                Done      —             2026-03-02
assessor          Done      —             2026-03-05

Running the Validator Directly

# Check a specific file for ASCII diagram alignment issues
$ python tools/check_ascii.py present/gyms/report.html
Checking present/gyms/report.html...
  Block at line 142: 19 lines, all width 64. ALIGNED.
  236 code blocks scanned across repo.
  585 alignment issues found in 236 blocks.

What to Watch For

In a live demo, the key moments are:

  1. The naive generation — watch the model produce a diagram with plain +---+ borders and ragged alignment. This is the starting point.
  2. The scoring output — watch the validator catch specific issues: "line 5: content width 40 != box width 41 (off by 1)." These are issues invisible to the human eye in a proportional font.
  3. The re-generation — with the skill loaded, watch the model produce a properly hierarchical diagram with aligned borders. Same model, same prompt, dramatically better output.
  4. The delta calculation — the system reports exactly how much each axis improved, making the improvement concrete and falsifiable.

10. Vision: Where This Goes

Near-term: All Gyms Active

The system currently has 14 registered gyms across extraction, badge quality, fetchability, knowledge extraction, code cleanup, and more. The near-term goal is each gym actively running, each scoring its parent system's output, each producing measurable improvement deltas. Every gym produces reusable artifacts — skills, conventions, validators — that compound across the entire system.

  Gym              | Status | What It Improves
  Principle        | Next   | Agent behavior via evidence-backed principles
  Extraction       | Active | HTML extraction + LLM cleaning quality
  ASCII Diagrams   | Done   | Box-drawing diagram quality
  Badge            | Done   | Session topic summarization
  Fetchability     | Done   | Proxy/fetch method selection
  KB               | Done   | Knowledge extraction from web sources
  Assessor         | Done   | Person scoring prompt calibration
  Claim Extraction | Active | Claim completeness across 4 domains
  Sidekick         | TODO   | Intervention timing and helpfulness
  Code Cleanup     | TODO   | Codebase hygiene suggestions

Medium-term: Gyms as Jobs with Dashboard Integration

Each gym becomes a scheduled job that runs automatically — nightly, after code changes, or on-demand. Results flow into the system dashboard, creating a quality scoreboard where you can see at a glance which capabilities are improving and which are regressing. Regressions trigger alerts; improvements trigger celebrations.

The job integration also enables continuous calibration: the judge calibration gym runs whenever a new model is added to the evaluation pipeline, automatically screening it for discriminative power before it can influence scores.

North Star: The System That Improves While You Sleep

The north star is a system where you wake up and it has gotten measurably better overnight. Not because someone pushed a model update or tuned a prompt — but because the gym loop ran, found failures, encoded fixes, verified them, and deployed. The improvement is automatic, measured, and auditable.

Each gym is a small instance of this vision. The ASCII diagram gym took quality from 0.42 to 0.86 in a single iteration cycle. Scale that to 14 gyms, each running continuously, each producing durable artifacts that benefit the entire system — and the compounding begins. A skill created by one gym improves the input quality for another. Better extraction feeds better knowledge. Better knowledge feeds better principles. Better principles feed better everything.

The compounding effect: When gym A (extraction) produces cleaner input text, gym B (knowledge extraction) works on higher-quality data and produces better knowledge. That knowledge makes gym C (principle) produce better principles. Those principles make the system smarter, which improves the output that gym A scores. Each gym's improvement amplifies the others — this is not linear progress, it is a positive feedback loop.

Report generated March 2026. Data from the rivus gym system. ASCII diagram gym data from 10 challenges with programmatic + LLM hybrid scoring. Extraction gym data from 6 corpus sites. Judge calibration framework from lib/eval/calibration/. Source: static.localhost/present/gyms/report.html.