The Gym System

"We are what we repeatedly do. Excellence, then, is not an act, but a habit." — Will Durant, paraphrasing Aristotle

Turning LLM capabilities from "works" to "excellent" — systematically, with evidence, one workout at a time.

Your System Works. How Good Could It Be?

Your LLM system works. A prompt produces output, a pipeline processes data, an agent completes tasks. But is it half as good as it could be? Without measurement, you can't know. Without a method, you can't improve. Without a loop, improvements don't compound.

No measurement. "Looks fine" is not a score. A diagram that appears aligned may be off by one character in a monospace terminal. An extraction that seems clean may have silently dropped key content. Without numbers, you're guessing.

No method. Improvement is ad hoc. Someone notices a problem, tweaks a prompt, eyeballs the output. The fix stays in that session. The next session starts fresh. The insight — "use Unicode box-drawing, not ASCII +---+" — evaporates.

No compounding. Ten manual fixes produce ten slightly-better outputs — not one systematically-better system. Each fix is local. Nothing stacks.

A gym closes all three gaps. It scores output programmatically, diagnoses weaknesses, brings in training material, drills improvement, and the result persists as a reusable skill. The first workout measures. The second improves. By the third, the system is better — permanently.

A Coach, Not a Test Suite

A gym is a coach for LLM capabilities. It doesn't just measure quality — it systematically improves it. One cycle of the loop:

  1. Generate (GEN) — the LLM produces output from a task prompt
  2. Evaluate (EVAL) — score against the corpus with defined criteria
  3. Learn (LEARN) — extract failure patterns and codify them as conventions
  4. Re-generate (GEN) — apply the learned conventions and verify the improvement
| Gym Term | What It Means |
| --- | --- |
| Gym | Training facility — a corpus of known inputs + scoring criteria + feedback loop |
| Workout | One run through generate → evaluate → learn. Each workout produces a score and insights. |
| Skill | What training produces — reusable conventions that persist across sessions and apply to all future work |
| Coach | The gym's judgment — a scoring rubric (deterministic) or an LLM judge (when human judgment is needed) |
| Training material | External resources brought in to raise the ceiling — Elements of Style, CSS specs, domain textbooks |
| Personal best | Baseline score to beat — established on the first workout, improved on each subsequent one |

The key difference from a test suite: a test tells you pass or fail. A gym tells you why you scored 0.42, what would improve it, and proves the improvement worked. Then it packages that knowledge so every future run starts from the new baseline.

Workout A: ASCII Diagrams — 0.42 → 0.84

We lead with ASCII diagrams because the before/after is viscerally obvious — you can see the improvement. But this is a low-stakes example. You can sidestep ASCII diagrams entirely. The higher-value gyms — extraction quality, claim accuracy, writing style — tackle capabilities you can't sidestep. The loop is the same; the stakes are different.

An LLM asked to draw a diagram produces something that looks reasonable but has systematic problems — misaligned borders, inconsistent styles, missing structural elements. The ASCII diagram gym measured these failures, extracted patterns, codified them as conventions, and re-ran. Score doubled.

Round 1 (Naive): 0.42 · Round 2 (Skilled): 0.84 · Improvement: +100% · Challenges: 10

Round 1: Naive (pipeline challenge — 0.52)
+----------+     +----------+     +--------+
| produce  | --> |  score   | --> | reduce |
+----------+     +----------+     +--------+
Round 2: Skilled (pipeline challenge — 0.90)
╔═══════════════════════════════════════╗
║            Pipeline                   ║
╠═══════════════════════════════════════╣
║                                       ║
║  ╭──────────╮   ╭──────────╮          ║
║  │ produce  │━━▶│  score   │          ║
║  ╰──────────╯   ╰─────┬────╯          ║
║                        │              ║
║                        ▼              ║
║                 ╭──────────╮          ║
║                 │  reduce  │          ║
║                 ╰──────────╯          ║
╚═══════════════════════════════════════╝

What the Gym Found

| Failure Pattern | Found In | Convention Created |
| --- | --- | --- |
| Ragged right borders | 8/10 challenges | Every line must be same display width |
| Single border style for all levels | 9/10 | 4-level hierarchy: container=double, component=single, leaf=rounded, emphasis=heavy |
| ASCII arrows (-->) | 7/10 | Unicode arrows table (━━▶, ───▶, │, ▼) |
| Missing structural elements | 5/10 | Inside-out drawing process — identify hierarchy first |
| No self-verification | 10/10 | Post-drawing checklist + Python validation code |

These conventions became a reusable skill — injected into the system prompt for every future diagram request. The gym ran once. The improvement applies forever. That's the difference between a test and a coach.
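The self-verification convention is easy to make concrete. Below is a minimal width checker, a sketch rather than the skill's actual validation code, and `display_width` is a simplification of real terminal width rules:

```python
import unicodedata

def display_width(line: str) -> int:
    """Approximate display width: East Asian wide/fullwidth chars count as 2."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in line)

def check_alignment(diagram: str) -> list[str]:
    """Return a list of problems; an empty list means the diagram passes."""
    lines = [l for l in diagram.splitlines() if l.strip()]
    widths = {display_width(l) for l in lines}
    problems = []
    if len(widths) > 1:
        # the most common failure pattern: ragged right borders
        problems.append(f"ragged right border: line widths {sorted(widths)}")
    if "-->" in diagram:
        problems.append("ASCII arrow '-->' found; use Unicode arrows instead")
    return problems
```

Running a check like this after every draw is what "post-drawing checklist" means in practice: the model verifies its own output before returning it.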

Workout B: Content Extraction — 12 Sites Scored

Does the extraction pipeline preserve real content while stripping boilerplate? This gym scores each site on three axes — no LLM judge needed, just deterministic string matching.

Total Sites: 12 · OK (≥0.8): 9 · WARN (≥0.5): 2 · Avg Overall: 0.85

Three Scoring Axes

Each site is scored on intrusion (boilerplate that should have been stripped), anchor (key content that must survive), and must-not (content that must never appear in the output).
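A minimal sketch of this deterministic scoring, assuming each axis is the fraction of its expected strings that behave correctly (an assumption; the overall here is a plain average for illustration, while the matrix below clearly uses a different weighting):

```python
def score_site(extracted: str,
               intrusions: list[str],
               anchors: list[str],
               must_not: list[str]) -> dict[str, float]:
    """Deterministic per-site scoring: no LLM judge, just substring checks."""
    def fraction(items, ok):
        return 1.0 if not items else sum(ok(x) for x in items) / len(items)

    scores = {
        # boilerplate that should have been stripped
        "intrusion": fraction(intrusions, lambda x: x not in extracted),
        # key content that must survive extraction
        "anchor": fraction(anchors, lambda x: x in extracted),
        # content that must never appear
        "must_not": fraction(must_not, lambda x: x not in extracted),
    }
    scores["overall"] = sum(scores.values()) / 3
    return scores
```

Because every check is a substring test, a full 12-site run is free and perfectly reproducible.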

Full score matrix (12 sites)
| Site | Category | Overall | Intrusion | Anchor | Must-not |
| --- | --- | --- | --- | --- | --- |
| substack_mauboussin | newsletter | 0.35 | 0.50 | 0.00 | 1.00 |
| substack_waxman | newsletter | 1.00 | 1.00 | 1.00 | 1.00 |
| stratechery_sample | newsletter | 0.50 | 1.00 | 0.00 | 1.00 |
| sequoia_600b | blog | 0.85 | 0.50 | 1.00 | 1.00 |
| paulgraham | blog | 1.00 | 1.00 | 1.00 | 1.00 |
| a16z_ai_apps | blog | 0.70 | 0.50 | 0.67 | 1.00 |
| reuters_article | news | 1.00 | 1.00 | 1.00 | 1.00 |
| medium_article | blog | 1.00 | 1.00 | 1.00 | 1.00 |
| python_docs | docs | 1.00 | 1.00 | 1.00 | 1.00 |
| arxiv_attention | academic | 0.85 | 0.50 | 1.00 | 1.00 |
| hn_show | forum | 1.00 | 1.00 | 1.00 | 1.00 |
| hacker_news_comment | forum | 0.35 | 0.00 | 1.00 | 1.00 |

Before and After: Paul Graham Essay

Raw extraction (9,280 chars)
# Writes and Write-Nots

|  |  |  |  |
| --- | --- | --- | --- |
|  |  | |  | | --- | | Writes and
Write-Nots  October 2024  I'm
usually reluctant to make predictions
about technology, but I feel fairly
confident about this one...
After cleaning (3,058 chars — 33% of raw)
# Writes and Write-Nots

October 2024

I'm usually reluctant to make predictions
about technology, but I feel fairly
confident about this one: in a couple
decades there won't be many people who
can write.

One of the strangest things you learn if
you're a writer is how many people have
trouble writing...
The gym also caught something eyeballing wouldn't — 2 sites (Mauboussin, Stratechery) returned 404 or paywall content, scoring WARN/FAIL. The corpus itself had rotted. A gym doesn't just measure the pipeline — it measures the measurement.

Workout C: Claim Extraction — 6 Models × 20 Documents

Extract structured claims from investment documents — entities, raw facts, interpretations, and influence relationships. An Opus judge scores each candidate against a human-curated reference. 120 candidates total.

Models: 6 · Documents: 20 · Candidates: 120 · Best Score: 88.5 (opus)

| Model | Avg Score |
| --- | --- |
| opus | 88.5 |
| gpt | 81.7 |
| sonnet | 81.2 |
| haiku | 77.3 |
| grok-fast | 56.5 |
| gemini-pro | 0.0 |

Six Scoring Dimensions

| Criterion | Weight | What it measures |
| --- | --- | --- |
| Entity coverage | 15% | Did the model find all significant entities? |
| Atom completeness | 20% | Were key facts and data points captured? |
| Assertion quality | 20% | Are interpretations well-formed and distinct from raw facts? |
| Relationship accuracy | 20% | Are influence links correctly identified with appropriate weights? |
| Precision | 15% | No hallucinated claims, spurious entities, or invented relationships |
| Source grounding | 10% | Every claim backed by an accurate quote from the document |
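The weights combine the six per-dimension scores into the single number reported per candidate. A sketch using the weights from the table; the judge's exact aggregation and rounding are assumptions:

```python
# Weights from the scoring-dimensions table (they sum to 1.0)
WEIGHTS = {
    "entity_coverage": 0.15,
    "atom_completeness": 0.20,
    "assertion_quality": 0.20,
    "relationship_accuracy": 0.20,
    "precision": 0.15,
    "source_grounding": 0.10,
}

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores on a 0-100 scale."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[k] * dimension_scores[k] for k in WEIGHTS)
```

Plugging in the per-dimension Heliad scores below gives roughly 94, near the reported 93, which suggests the judge's aggregation is not purely mechanical.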
Example: Opus scores 93 on Heliad Equity Partners document
| Criterion | Score |
| --- | --- |
| Entity coverage | 100 |
| Atom completeness | 95 |
| Assertion quality | 93 |
| Relationship accuracy | 90 |
| Precision | 93 |
| Source grounding | 95 |

{
  "entities": [
    {"id": "e1", "name": "Heliad Equity Partners", "type": "company"},
    {"id": "e2", "name": "Andreas Lange", "type": "person"}
  ],
  "claims": [
    {
      "id": "c1", "claim_type": "atom",
      "text": "Heliad is a publicly traded investment firm with 37M Euro market cap",
      "confidence": 0.95,
      "source_quote": "Heliad Equity Partners is a publicly traded investment firm in Germany with a market cap of 37M Euro."
    },
    {
      "id": "c3", "claim_type": "assertion",
      "text": "Andreas Lange is considered a rising star in German private equity",
      "confidence": 0.75, "direction": "bullish",
      "source_quote": "Andreas Lange, a mid 30s executive who is considered to be a rising star..."
    }
  ]
}
Gemini Pro scored 0 — not because it can't extract claims, but because it returned the wrong output format. The gym caught a task-failure mode that spot-checking would miss.
The gym makes model selection evidence-based, not vibes-based. Before this workout, "which model should we use?" was a guess. After, it's a table.

Under the Hood

Technical details: gym.yaml anatomy, corpus conventions, scoring, CLI

The gym.yaml File

Every gym is defined by a single YAML file. Here's an annotated example from the extraction gym:

name: parse-extraction
description: HTML extraction + LLM cleaning quality

measures:
  capability: lib/parse                              # what code is being tested
  question: "Does extraction preserve content        # answerable with data
             while removing boilerplate?"
  baseline: "Raw HTML passed through — no cleaning"  # what we compare against

steps:
  - op: source          # load test data from corpus
    params:
      items:
        - id: substack_waxman
          url: "https://waxmand.substack.com/p/..."
          intrusions: ["Subscribe for free"]         # should be stripped
          anchors: ["Coinbase launched the first"]   # must survive

  - op: task            # run the extraction pipeline
    params:
      fn: lib.parse.gym.gym_ops.fetch_and_extract

  - op: evaluate        # score each result
    params:
      fn: lib.parse.gym.gym_ops.score
      per_item: true

limits:
  usd: 2.0              # cost guardrail per run

The measures block forces you to articulate what "good" means before writing evaluation code. The steps define a pipeline: source items → run the task → score results. The corpus (items with expected behaviors) is the ground truth.
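The three ops map naturally onto a small dispatcher. A hypothetical sketch of a runner, not the real gym CLI internals, assuming the config dict comes from `yaml.safe_load` of gym.yaml and that `fn` strings are dotted paths to Python callables:

```python
import importlib

def resolve(dotted: str):
    """Turn a dotted path like 'lib.parse.gym.gym_ops.score' into a callable."""
    module, _, attr = dotted.rpartition(".")
    return getattr(importlib.import_module(module), attr)

def run_gym(config: dict) -> list:
    """Thread items through the steps in gym.yaml order: source -> task -> evaluate."""
    items = []
    for step in config["steps"]:
        params = step.get("params", {})
        if step["op"] == "source":
            items = params["items"]               # corpus items are the ground truth
        elif step["op"] in ("task", "evaluate"):
            fn = resolve(params["fn"])
            items = [fn(item) for item in items]  # apply or score per item
    return items
```

A cost guardrail enforcing the `limits.usd` value would wrap the task step; it is omitted here.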

Corpus Conventions

A good corpus covers the variation the capability encounters in practice, includes edge cases, and documents known gaps. Each item pairs an input with its expected behaviors, like the intrusions and anchors in the YAML above.

Two Scoring Approaches

Deterministic scoring (string matching and programmatic checks) is fast, reproducible, and free; prefer it whenever criteria are definable. When a dimension needs human-like judgment, such as style, relevance, or coherence, a rubric judge, an LLM scoring against defined criteria, fills the gap.

CLI

gym list                    # discover all gyms in the project
gym run extraction          # execute the extraction gym pipeline
gym report extraction       # render an HTML report from results

Gyms co-locate with the code they test. lib/parse/gym/ tests the parsing library. draft/style/gym/ tests writing style analysis. Co-location keeps the quality contract visible to anyone working on the code.

How Gyms Compare to Eval Frameworks

Several excellent tools exist for evaluating LLM output. Gyms share their measurement philosophy but add the learn step — closing the loop from score to improvement.

| Tool | Measures | Diagnoses | Improves | Persists |
| --- | --- | --- | --- | --- |
| Promptfoo | yes | — | — | — |
| Braintrust | yes | partial | — | — |
| LangSmith | yes | — | — | — |
| pytest | yes | — | — | — |
| Gyms | yes | yes | yes | yes |

All of the above are measurement tools — they tell you how good your output is. Gyms include measurement but close the loop: the learn step produces skills, conventions, and validation code that compound across the entire system. Measurement without improvement is just observation.

The Two Compasses

Rivus has two quality measurement systems — benchmarks and gyms. They serve fundamentally different purposes but form a complete quality loop together.

Benchmarks measure potential: how good is this model at reasoning, math, writing? Standardized tasks, fixed data, someone else's definition of quality.

Gyms measure actuality: how good is your system at your task? Your corpus, your pipeline, your data, your definition of quality.

The Flywheel

  1. New model drops — benchmark tells you it's worth trying
  2. Gym re-run — confirms the improvement transfers to your specific task
  3. Gym stagnates despite better models — your prompt/pipeline is the bottleneck, not the model
  4. Gym exceeds what benchmarks predict — your corpus + engineering is adding real value
A system that only benchmarks is chasing leaderboards. A system that only gyms is navel-gazing. You need both compasses.

Where This Goes

The gym framework has three layers of maturity. Today we have the first. The others are emerging or envisioned — marked clearly.

Layer 1: Capability Gyms

EXISTS
  • Measure + improve a specific LLM task via gen → eval → learn
  • Three active today: ASCII diagrams (0.42 → 0.84), extraction (12 sites scored), claim extraction (6 models compared)
  • Each workout produces a scored result and a reusable skill

Layer 2: Resource-Enhanced Gyms

EMERGING
  • Bring in external knowledge to raise the ceiling
  • Example: Elements of Style for writing quality, CSS specifications for diagram alignment, domain textbooks for extraction accuracy
  • The gym identifies "what would help" and incorporates it as training material
  • Draft gyms (style evaluation, redline, revision) are the next priority

Layer 3: Self-Bootstrapping Gyms

VISION
  • Spot a gymmable task → scaffold gym → discover resources → iterate → deploy
  • Resource identification, discovery, and fetching are themselves gymmable
  • The system accelerates: 1st workout takes hours, nth takes minutes — because the gym accumulates knowledge about what to look for
Draft (writing assistance) is the primary product focus. Gyms for style evaluation, editorial quality, and revision voice-preservation are next — each naturally brings in external resources like Elements of Style, making them the first resource-enhanced gyms.

Gym Inventory

| Gym | Status | What It Improves |
| --- | --- | --- |
| ASCII Diagrams | EXISTS | Box-drawing diagram quality |
| Extraction | EXISTS | HTML extraction + LLM cleaning |
| Claim Extraction | EXISTS | Structured claim completeness |
| Style Evaluation | EMERGING | Writing quality scoring |
| Redline | EMERGING | Editorial suggestion quality |
| Revision | EMERGING | Revision voice-preservation |
| Fetchability | EXISTS | Proxy/fetch method selection |
| Badge | EXISTS | Session topic summarization |
| Recall | VISION | Knowledge retrieval quality |
| Sidekick | VISION | Intervention timing |

The Flywheel

Corpus = accumulated taste. Not just test cases — the specific edge cases and quality standards your domain demands. Each correction makes the gym more discriminating. Each discriminating test makes the capability more reliable. The corpus is the moat.

Skills persist across sessions. A gym workout produces reusable knowledge, not just a score. The ASCII diagram skill applies to every future diagram. The extraction conventions improve every future parse. Knowledge compounds.

Cross-gym amplification. Better extraction feeds better claims. Better claims feed better knowledge. Better knowledge feeds better everything. When gym A improves, gym B's inputs get better for free.

The system accelerates. The first workout takes hours — build the corpus, define scoring, iterate. The tenth workout takes minutes — the gym knows what to look for, the scoring is calibrated, the baseline is established. Improvement gets cheaper over time.

The moat is not the model (everyone has access to the same models). The moat is not the prompt (prompts are easy to copy). The moat is the corpus — your accumulated taste, your edge cases, your quality standards — and the skills the gym extracts from it. That compounds.

Glossary

Corpus
A curated set of inputs with known expected behaviors — the ground truth a gym scores against. A good corpus covers the variation the capability encounters in practice, includes edge cases, and documents known gaps.
LLM (Large Language Model)
An AI model that generates text from prompts. Examples: Claude, GPT, Gemini. Gyms measure and improve the quality of LLM output on specific tasks.
Prompt
Instructions given to an LLM to produce output. A gym improves the prompt (and the pipeline around it) by measuring what works and what doesn't.
Rubric Judge
An LLM that scores another LLM's output against defined criteria — like having a senior reviewer grade papers against a rubric. Used when quality dimensions require human-like judgment (style, relevance, coherence).
Intrusion
Boilerplate text that should be stripped during content extraction — ads, subscribe buttons, navigation, social sharing widgets. A gym scores how many intrusions survive extraction.
Anchor
Key content that must survive extraction — specific phrases, data points, or structural elements. If the article mentions "Coinbase launched the first crypto wallet," that phrase should appear in the extracted output.
Deterministic Scoring
Evaluation via string matching or programmatic checks — fast, reproducible, zero cost. "Does this phrase appear in the output?" Preferred over LLM judges when criteria are definable.
Co-location
Placing the gym directory next to the code it tests. lib/parse/gym/ tests the parsing library. Keeps the quality contract visible to anyone working on the code.
Fan-out
Running the same task across multiple models in parallel — like having six candidates take the same exam. Reveals which models actually work for a specific task.
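The parallel exam is straightforward with a thread pool, since model calls are I/O-bound. A minimal sketch, where `call_model` is a placeholder for whatever client function the pipeline actually uses, not a real API:

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(task_prompt: str, models: list[str], call_model) -> dict[str, str]:
    """Run the same task prompt across several models in parallel threads."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        # submit one exam per model, then collect every candidate's answer
        futures = {m: pool.submit(call_model, m, task_prompt) for m in models}
        return {m: f.result() for m, f in futures.items()}
```

Each result then goes to the judge, producing the per-model score table shown in Workout C.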