The Gym System

"We are what we repeatedly do. Excellence, then, is not an act, but a habit." — Will Durant, paraphrasing Aristotle

Turning LLM capabilities from "works" to "excellent" — systematically, with evidence, one workout at a time.

Your System Works. How Good Could It Be?

Your LLM system works. A prompt produces output, a pipeline processes data, an agent completes tasks. But is it half as good as it could be? Without measurement, you can't know. Without a method, you can't improve. Without a loop, improvements don't compound.

No measurement. "Looks fine" is not a score. A diagram that appears aligned may be off by one character in a monospace terminal. An extraction that seems clean may have silently dropped key content. Without numbers, you're guessing.

No method. Improvement is ad hoc. Someone notices a problem, tweaks a prompt, eyeballs the output. The fix stays in that session. The next session starts fresh. The insight — "use Unicode box-drawing, not ASCII +---+" — evaporates.

No compounding. Ten manual fixes produce ten slightly-better outputs — not one systematically-better system. Each fix is local. Nothing stacks.

A gym closes all three gaps. It scores output programmatically, diagnoses weaknesses, brings in training material, drills improvement, and the result persists as a reusable skill. The first workout measures. The second improves. By the third, the system is better — permanently.

A Coach, Not a Test Suite

A gym is a coach for LLM capabilities. It doesn't just measure quality — it systematically improves it. One cycle of the loop:

  1. Generate (GEN) — the LLM produces output from a task prompt
  2. Evaluate (EVAL) — score against the corpus with defined criteria
  3. Learn (LEARN) — extract failure patterns and codify them as conventions
  4. Re-generate (GEN) — apply the learned conventions and verify the improvement
| Gym Term | What It Means |
| --- | --- |
| Gym | Training facility — a corpus of known inputs + scoring criteria + feedback loop |
| Workout | One run through generate → evaluate → learn. Each workout produces a score and insights. |
| Skill | What training produces — reusable conventions that persist across sessions and apply to all future work |
| Coach | The gym's judgment — a scoring rubric (deterministic) or an LLM judge (when human judgment is needed) |
| Training material | External resources brought in to raise the ceiling — Elements of Style, CSS specs, domain textbooks |
| Personal best | Baseline score to beat — established on the first workout, improved on each subsequent one |

The key difference from a test suite: a test tells you pass or fail. A gym tells you why you scored 0.42, what would improve it, and proves the improvement worked. Then it packages that knowledge so every future run starts from the new baseline.

Workout A: ASCII Diagrams — 0.42 → 0.84

We lead with ASCII diagrams because the before/after is viscerally obvious — you can see the improvement. But this is a low-stakes example. You can sidestep ASCII diagrams entirely. The higher-value gyms — extraction quality, claim accuracy, writing style — tackle capabilities you can't sidestep. The loop is the same; the stakes are different.

An LLM asked to draw a diagram produces something that looks reasonable but has systematic problems — misaligned borders, inconsistent styles, missing structural elements. The ASCII diagram gym measured these failures, extracted patterns, codified them as conventions, and re-ran. Score doubled.

Round 1 (Naive): 0.42 · Round 2 (Skilled): 0.84 · Improvement: +100% · Challenges: 10

Round 1: Naive (pipeline challenge — 0.52)
+----------+     +----------+     +--------+
| produce  | --> |  score   | --> | reduce |
+----------+     +----------+     +--------+
Round 2: Skilled (pipeline challenge — 0.90)
╔═══════════════════════════════════════╗
║            Pipeline                   ║
╠═══════════════════════════════════════╣
║                                       ║
║  ╭──────────╮   ╭──────────╮          ║
║  │ produce  │━━▶│  score   │          ║
║  ╰──────────╯   ╰─────┬────╯          ║
║                        │              ║
║                        ▼              ║
║                 ╭──────────╮          ║
║                 │  reduce  │          ║
║                 ╰──────────╯          ║
╚═══════════════════════════════════════╝

What the Gym Found

| Failure Pattern | Found In | Convention Created |
| --- | --- | --- |
| Ragged right borders | 8/10 challenges | Every line must be same display width |
| Single border style for all levels | 9/10 | 4-level hierarchy: container=double, component=single, leaf=rounded, emphasis=heavy |
| ASCII arrows (-->) | 7/10 | Unicode arrows table (━━▶, ───▶, │, ▼) |
| Missing structural elements | 5/10 | Inside-out drawing process — identify hierarchy first |
| No self-verification | 10/10 | Post-drawing checklist + Python validation code |

These conventions became a reusable skill — injected into the system prompt for every future diagram request. The gym ran once. The improvement applies forever. That's the difference between a test and a coach.
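The self-verification convention is easy to make concrete. Below is a minimal width checker, a sketch rather than the skill's actual validation code, and `display_width` is a simplification of real terminal width rules:

```python
import unicodedata

def display_width(line: str) -> int:
    """Approximate display width: East Asian wide/fullwidth chars count as 2."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in line)

def check_alignment(diagram: str) -> list[str]:
    """Return a list of problems; an empty list means the diagram passes."""
    lines = [l for l in diagram.splitlines() if l.strip()]
    widths = {display_width(l) for l in lines}
    problems = []
    if len(widths) > 1:
        # the most common failure pattern: ragged right borders
        problems.append(f"ragged right border: line widths {sorted(widths)}")
    if "-->" in diagram:
        problems.append("ASCII arrow '-->' found; use Unicode arrows instead")
    return problems
```

Running a check like this after every draw is what "post-drawing checklist" means in practice: the model verifies its own output before returning it.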

Workout B: Content Extraction — 12 Sites Scored

Does the extraction pipeline preserve real content while stripping boilerplate? This gym scores each site on three axes — no LLM judge needed, just deterministic string matching.

Total Sites: 12 · OK (≥0.8): 9 · WARN (≥0.5): 2 · Avg Overall: 0.85

Three Scoring Axes

Each site is scored on intrusion (boilerplate that should have been stripped), anchor (key content that must survive), and must-not (content that must never appear in the output).
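A minimal sketch of this deterministic scoring, assuming each axis is the fraction of its expected strings that behave correctly (an assumption; the overall here is a plain average for illustration, while the matrix below clearly uses a different weighting):

```python
def score_site(extracted: str,
               intrusions: list[str],
               anchors: list[str],
               must_not: list[str]) -> dict[str, float]:
    """Deterministic per-site scoring: no LLM judge, just substring checks."""
    def fraction(items, ok):
        return 1.0 if not items else sum(ok(x) for x in items) / len(items)

    scores = {
        # boilerplate that should have been stripped
        "intrusion": fraction(intrusions, lambda x: x not in extracted),
        # key content that must survive extraction
        "anchor": fraction(anchors, lambda x: x in extracted),
        # content that must never appear
        "must_not": fraction(must_not, lambda x: x not in extracted),
    }
    scores["overall"] = sum(scores.values()) / 3
    return scores
```

Because every check is a substring test, a full 12-site run is free and perfectly reproducible.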

Full score matrix (12 sites)
| Site | Category | Overall | Intrusion | Anchor | Must-not |
| --- | --- | --- | --- | --- | --- |
| substack_mauboussin | newsletter | 0.35 | 0.50 | 0.00 | 1.00 |
| substack_waxman | newsletter | 1.00 | 1.00 | 1.00 | 1.00 |
| stratechery_sample | newsletter | 0.50 | 1.00 | 0.00 | 1.00 |
| sequoia_600b | blog | 0.85 | 0.50 | 1.00 | 1.00 |
| paulgraham | blog | 1.00 | 1.00 | 1.00 | 1.00 |
| a16z_ai_apps | blog | 0.70 | 0.50 | 0.67 | 1.00 |
| reuters_article | news | 1.00 | 1.00 | 1.00 | 1.00 |
| medium_article | blog | 1.00 | 1.00 | 1.00 | 1.00 |
| python_docs | docs | 1.00 | 1.00 | 1.00 | 1.00 |
| arxiv_attention | academic | 0.85 | 0.50 | 1.00 | 1.00 |
| hn_show | forum | 1.00 | 1.00 | 1.00 | 1.00 |
| hacker_news_comment | forum | 0.35 | 0.00 | 1.00 | 1.00 |

Before and After: Paul Graham Essay

Raw extraction (9,280 chars)
# Writes and Write-Nots

|  |  |  |  |
| --- | --- | --- | --- |
|  |  | |  | | --- | | Writes and
Write-Nots  October 2024  I'm
usually reluctant to make predictions
about technology, but I feel fairly
confident about this one...
After cleaning (3,058 chars — 33% of raw)
# Writes and Write-Nots

October 2024

I'm usually reluctant to make predictions
about technology, but I feel fairly
confident about this one: in a couple
decades there won't be many people who
can write.

One of the strangest things you learn if
you're a writer is how many people have
trouble writing...
The gym also caught something eyeballing wouldn't — 2 sites (Mauboussin, Stratechery) returned 404 or paywall content, scoring WARN/FAIL. The corpus itself had rotted. A gym doesn't just measure the pipeline — it measures the measurement.

Workout C: Claim Extraction — 6 Models × 20 Documents

Extract structured claims from investment documents — entities, raw facts, interpretations, and influence relationships. An Opus judge scores each candidate against a human-curated reference. 120 candidates total.

Models: 6 · Documents: 20 · Candidates: 120 · Best Score: 88.5 (opus)

| Model | Avg Score |
| --- | --- |
| opus | 88.5 |
| gpt | 81.7 |
| sonnet | 81.2 |
| haiku | 77.3 |
| grok-fast | 56.5 |
| gemini-pro | 0.0 |

Six Scoring Dimensions

| Criterion | Weight | What it measures |
| --- | --- | --- |
| Entity coverage | 15% | Did the model find all significant entities? |
| Atom completeness | 20% | Were key facts and data points captured? |
| Assertion quality | 20% | Are interpretations well-formed and distinct from raw facts? |
| Relationship accuracy | 20% | Are influence links correctly identified with appropriate weights? |
| Precision | 15% | No hallucinated claims, spurious entities, or invented relationships |
| Source grounding | 10% | Every claim backed by an accurate quote from the document |
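The weights combine the six per-dimension scores into the single number reported per candidate. A sketch using the weights from the table; the judge's exact aggregation and rounding are assumptions:

```python
# Weights from the scoring-dimensions table (they sum to 1.0)
WEIGHTS = {
    "entity_coverage": 0.15,
    "atom_completeness": 0.20,
    "assertion_quality": 0.20,
    "relationship_accuracy": 0.20,
    "precision": 0.15,
    "source_grounding": 0.10,
}

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores on a 0-100 scale."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[k] * dimension_scores[k] for k in WEIGHTS)
```

Plugging in the per-dimension Heliad scores below gives roughly 94, near the reported 93, which suggests the judge's aggregation is not purely mechanical.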
Example: Opus scores 93 on Heliad Equity Partners document
| Criterion | Score |
| --- | --- |
| Entity coverage | 100 |
| Atom completeness | 95 |
| Assertion quality | 93 |
| Relationship accuracy | 90 |
| Precision | 93 |
| Source grounding | 95 |

{
  "entities": [
    {"id": "e1", "name": "Heliad Equity Partners", "type": "company"},
    {"id": "e2", "name": "Andreas Lange", "type": "person"}
  ],
  "claims": [
    {
      "id": "c1", "claim_type": "atom",
      "text": "Heliad is a publicly traded investment firm with 37M Euro market cap",
      "confidence": 0.95,
      "source_quote": "Heliad Equity Partners is a publicly traded investment firm in Germany with a market cap of 37M Euro."
    },
    {
      "id": "c3", "claim_type": "assertion",
      "text": "Andreas Lange is considered a rising star in German private equity",
      "confidence": 0.75, "direction": "bullish",
      "source_quote": "Andreas Lange, a mid 30s executive who is considered to be a rising star..."
    }
  ]
}
Gemini Pro scored 0 — not because it can't extract claims, but because it returned the wrong output format. The gym caught a task-failure mode that spot-checking would miss.
The gym makes model selection evidence-based, not vibes-based. Before this workout, "which model should we use?" was a guess. After, it's a table.

Under the Hood

Technical details: gym.yaml anatomy, corpus conventions, scoring, CLI

The gym.yaml File

Every gym is defined by a single YAML file. Here's an annotated example from the extraction gym:

name: parse-extraction
description: HTML extraction + LLM cleaning quality

measures:
  capability: lib/parse                              # what code is being tested
  question: "Does extraction preserve content        # answerable with data
             while removing boilerplate?"
  baseline: "Raw HTML passed through — no cleaning"  # what we compare against

steps:
  - op: source          # load test data from corpus
    params:
      items:
        - id: substack_waxman
          url: "https://waxmand.substack.com/p/..."
          intrusions: ["Subscribe for free"]         # should be stripped
          anchors: ["Coinbase launched the first"]   # must survive

  - op: task            # run the extraction pipeline
    params:
      fn: lib.parse.gym.gym_ops.fetch_and_extract

  - op: evaluate        # score each result
    params:
      fn: lib.parse.gym.gym_ops.score
      per_item: true

limits:
  usd: 2.0              # cost guardrail per run

The measures block forces you to articulate what "good" means before writing evaluation code. The steps define a pipeline: source items → run the task → score results. The corpus (items with expected behaviors) is the ground truth.
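The three ops map naturally onto a small dispatcher. A hypothetical sketch of a runner, not the real gym CLI internals, assuming the config dict comes from `yaml.safe_load` of gym.yaml and that `fn` strings are dotted paths to Python callables:

```python
import importlib

def resolve(dotted: str):
    """Turn a dotted path like 'lib.parse.gym.gym_ops.score' into a callable."""
    module, _, attr = dotted.rpartition(".")
    return getattr(importlib.import_module(module), attr)

def run_gym(config: dict) -> list:
    """Thread items through the steps in gym.yaml order: source -> task -> evaluate."""
    items = []
    for step in config["steps"]:
        params = step.get("params", {})
        if step["op"] == "source":
            items = params["items"]               # corpus items are the ground truth
        elif step["op"] in ("task", "evaluate"):
            fn = resolve(params["fn"])
            items = [fn(item) for item in items]  # apply or score per item
    return items
```

A cost guardrail enforcing the `limits.usd` value would wrap the task step; it is omitted here.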

Corpus Conventions

A good corpus covers the variation the capability encounters in practice, includes edge cases, and documents known gaps. Each item pairs an input with its expected behaviors, like the intrusions and anchors in the YAML above.

Two Scoring Approaches

Deterministic scoring (string matching and programmatic checks) is fast, reproducible, and free; prefer it whenever criteria are definable. When a dimension needs human-like judgment, such as style, relevance, or coherence, a rubric judge, an LLM scoring against defined criteria, fills the gap.

CLI

gym list                    # discover all gyms in the project
gym run extraction          # execute the extraction gym pipeline
gym report extraction       # render an HTML report from results

Gyms co-locate with the code they test. lib/parse/gym/ tests the parsing library. draft/style/gym/ tests writing style analysis. Co-location keeps the quality contract visible to anyone working on the code.

How Gyms Compare to Eval Frameworks

Several excellent tools exist for evaluating LLM output. Gyms share their measurement philosophy but add the learn step — closing the loop from score to improvement.

| Tool | Measures | Diagnoses | Improves | Persists |
| --- | --- | --- | --- | --- |
| Promptfoo | yes | — | — | — |
| Braintrust | yes | partial | — | — |
| LangSmith | yes | — | — | — |
| pytest | yes | — | — | — |
| Gyms | yes | yes | yes | yes |

All of the above are measurement tools — they tell you how good your output is. Gyms include measurement but close the loop: the learn step produces skills, conventions, and validation code that compound across the entire system. Measurement without improvement is just observation.

The Two Compasses

Rivus has two quality measurement systems — benchmarks and gyms. They serve fundamentally different purposes but form a complete quality loop together.

Benchmarks measure potential: how good is this model at reasoning, math, writing? Standardized tasks, fixed data, someone else's definition of quality.

Gyms measure actuality: how good is your system at your task? Your corpus, your pipeline, your data, your definition of quality.

The Flywheel

  1. New model drops — benchmark tells you it's worth trying
  2. Gym re-run — confirms the improvement transfers to your specific task
  3. Gym stagnates despite better models — your prompt/pipeline is the bottleneck, not the model
  4. Gym exceeds what benchmarks predict — your corpus + engineering is adding real value
A system that only benchmarks is chasing leaderboards. A system that only gyms is navel-gazing. You need both compasses.

Where This Goes

The gym framework has three layers of maturity. Today we have the first. The others are emerging or envisioned — marked clearly.

Layer 1: Capability Gyms

EXISTS
  • Measure + improve a specific LLM task via gen → eval → learn
  • Three active today: ASCII diagrams (0.42 → 0.84), extraction (12 sites scored), claim extraction (6 models compared)
  • Each workout produces a scored result and a reusable skill

Layer 2: Resource-Enhanced Gyms

EMERGING
  • Bring in external knowledge to raise the ceiling
  • Example: Elements of Style for writing quality, CSS specifications for diagram alignment, domain textbooks for extraction accuracy
  • The gym identifies "what would help" and incorporates it as training material
  • Draft gyms (style evaluation, redline, revision) are the next priority

Layer 3: Self-Bootstrapping Gyms

VISION
  • Spot a gymmable task → scaffold gym → discover resources → iterate → deploy
  • Resource identification, discovery, and fetching are themselves gymmable
  • The system accelerates: 1st workout takes hours, nth takes minutes — because the gym accumulates knowledge about what to look for
Draft (writing assistance) is the primary product focus. Gyms for style evaluation, editorial quality, and revision voice-preservation are next — each naturally brings in external resources like Elements of Style, making them the first resource-enhanced gyms.

Gym Inventory

| Gym | Status | What It Improves |
| --- | --- | --- |
| ASCII Diagrams | EXISTS | Box-drawing diagram quality |
| Extraction | EXISTS | HTML extraction + LLM cleaning |
| Claim Extraction | EXISTS | Structured claim completeness |
| Style Evaluation | EMERGING | Writing quality scoring |
| Redline | EMERGING | Editorial suggestion quality |
| Revision | EMERGING | Revision voice-preservation |
| Fetchability | EXISTS | Proxy/fetch method selection |
| Badge | EXISTS | Session topic summarization |
| Recall | VISION | Knowledge retrieval quality |
| Sidekick | VISION | Intervention timing |

The Flywheel

Corpus = accumulated taste. Not just test cases — the specific edge cases and quality standards your domain demands. Each correction makes the gym more discriminating. Each discriminating test makes the capability more reliable. The corpus is the moat.

Skills persist across sessions. A gym workout produces reusable knowledge, not just a score. The ASCII diagram skill applies to every future diagram. The extraction conventions improve every future parse. Knowledge compounds.

Cross-gym amplification. Better extraction feeds better claims. Better claims feed better knowledge. Better knowledge feeds better everything. When gym A improves, gym B's inputs get better for free.

The system accelerates. The first workout takes hours — build the corpus, define scoring, iterate. The tenth workout takes minutes — the gym knows what to look for, the scoring is calibrated, the baseline is established. Improvement gets cheaper over time.

The moat is not the model (everyone has access to the same models). The moat is not the prompt (prompts are easy to copy). The moat is the corpus — your accumulated taste, your edge cases, your quality standards — and the skills the gym extracts from it. That compounds.

Glossary

Corpus
A curated set of inputs with known expected behaviors — the ground truth a gym scores against. A good corpus covers the variation the capability encounters in practice, includes edge cases, and documents known gaps.
LLM (Large Language Model)
An AI model that generates text from prompts. Examples: Claude, GPT, Gemini. Gyms measure and improve the quality of LLM output on specific tasks.
Prompt
Instructions given to an LLM to produce output. A gym improves the prompt (and the pipeline around it) by measuring what works and what doesn't.
Rubric Judge
An LLM that scores another LLM's output against defined criteria — like having a senior reviewer grade papers against a rubric. Used when quality dimensions require human-like judgment (style, relevance, coherence).
Intrusion
Boilerplate text that should be stripped during content extraction — ads, subscribe buttons, navigation, social sharing widgets. A gym scores how many intrusions survive extraction.
Anchor
Key content that must survive extraction — specific phrases, data points, or structural elements. If the article mentions "Coinbase launched the first crypto wallet," that phrase should appear in the extracted output.
Deterministic Scoring
Evaluation via string matching or programmatic checks — fast, reproducible, zero cost. "Does this phrase appear in the output?" Preferred over LLM judges when criteria are definable.
Co-location
Placing the gym directory next to the code it tests. lib/parse/gym/ tests the parsing library. Keeps the quality contract visible to anyone working on the code.
Fan-out
Running the same task across multiple models in parallel — like having six candidates take the same exam. Reveals which models actually work for a specific task.
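The parallel exam is straightforward with a thread pool, since model calls are I/O-bound. A minimal sketch, where `call_model` is a placeholder for whatever client function the pipeline actually uses, not a real API:

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(task_prompt: str, models: list[str], call_model) -> dict[str, str]:
    """Run the same task prompt across several models in parallel threads."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        # submit one exam per model, then collect every candidate's answer
        futures = {m: pool.submit(call_model, m, task_prompt) for m in models}
        return {m: f.result() for m, f in futures.items()}
```

Each result then goes to the judge, producing the per-model score table shown in Workout C.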