Turning LLM capabilities from "works" to "excellent" — systematically, with evidence, one workout at a time.
Your LLM system works. A prompt produces output, a pipeline processes data, an agent completes tasks. But is it as good as it could be? Without measurement, you can't know. Without a method, you can't improve. Without a loop, improvements don't compound.
No measurement. "Looks fine" is not a score. A diagram that appears aligned may be off by one character in a monospace terminal. An extraction that seems clean may have silently dropped key content. Without numbers, you're guessing.
No method. Improvement is ad hoc. Someone notices a problem, tweaks a prompt, eyeballs the output. The fix stays in that session. The next session starts fresh. The insight — "use Unicode box-drawing, not ASCII +---+" — evaporates.
No compounding. Ten manual fixes produce ten slightly-better outputs — not one systematically-better system. Each fix is local. Nothing stacks.
A gym is a coach for LLM capabilities. It doesn't just measure quality; it systematically improves it through a repeating loop: generate → evaluate → learn. The vocabulary:
| Gym Term | What It Means |
|---|---|
| Gym | Training facility — a corpus of known inputs + scoring criteria + feedback loop |
| Workout | One run through generate → evaluate → learn. Each workout produces a score and insights. |
| Skill | What training produces — reusable conventions that persist across sessions and apply to all future work |
| Coach | The gym's judgment — a scoring rubric (deterministic) or an LLM judge (when human judgment is needed) |
| Training material | External resources brought in to raise the ceiling — Elements of Style, CSS specs, domain textbooks |
| Personal best | Baseline score to beat — established on the first workout, improved on each subsequent one |
The key difference from a test suite: a test tells you pass or fail. A gym tells you why you scored 0.42, what would improve it, and proves the improvement worked. Then it packages that knowledge so every future run starts from the new baseline.
An LLM asked to draw a diagram produces something that looks reasonable but has systematic problems — misaligned borders, inconsistent styles, missing structural elements. The ASCII diagram gym measured these failures, extracted patterns, codified them as conventions, and re-ran. Score doubled.
```
+----------+     +----------+     +--------+
| produce  | --> | score    | --> | reduce |
+----------+     +----------+     +--------+
```
```
╔═══════════════════════════════════════╗
║               Pipeline                ║
╠═══════════════════════════════════════╣
║                                       ║
║   ╭──────────╮   ╭──────────╮         ║
║   │ produce  │━━▶│ score    │         ║
║   ╰──────────╯   ╰─────┬────╯         ║
║                        │              ║
║                        ▼              ║
║                   ╭──────────╮        ║
║                   │ reduce   │        ║
║                   ╰──────────╯        ║
╚═══════════════════════════════════════╝
```
| Failure Pattern | Found In | Convention Created |
|---|---|---|
| Ragged right borders | 8/10 challenges | Every line must be same display width |
| Single border style for all levels | 9/10 | 4-level hierarchy: container=double, component=single, leaf=rounded, emphasis=heavy |
| ASCII arrows (-->) | 7/10 | Unicode arrows table (━━▶, ───▶, │, ▼) |
| Missing structural elements | 5/10 | Inside-out drawing process — identify hierarchy first |
| No self-verification | 10/10 | Post-drawing checklist + Python validation code |
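The self-verification convention from the last row can be sketched in a few lines of Python. This is a minimal version that assumes the diagram arrives as one plain string and uses `len()` as a stand-in for true display width (accurate for box-drawing glyphs, wrong for CJK or emoji):

```python
def check_diagram(diagram: str) -> list[str]:
    """Flag lines whose width differs from the first line's.

    Minimal sketch: len() treats every character as width 1,
    which holds for box-drawing glyphs but not CJK or emoji.
    """
    lines = [ln for ln in diagram.splitlines() if ln.strip()]
    if not lines:
        return []
    expected = len(lines[0])
    return [
        f"line {i + 1}: width {len(ln)} != {expected}"
        for i, ln in enumerate(lines)
        if len(ln) != expected
    ]

ragged = "+----+\n| a |\n+----+"
print(check_diagram(ragged))  # → ['line 2: width 5 != 6']
```

An empty result means every line matches the first line's width, which is the "ragged right borders" convention made checkable.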
Does the extraction pipeline preserve real content while stripping boilerplate? This gym scores each site on three axes — no LLM judge needed, just deterministic string matching.
| Site | Category | Overall | Intrusion | Anchor | Must-not |
|---|---|---|---|---|---|
| substack_mauboussin | newsletter | 0.35 | 0.50 | 0.00 | 1.00 |
| substack_waxman | newsletter | 1.00 | 1.00 | 1.00 | 1.00 |
| stratechery_sample | newsletter | 0.50 | 1.00 | 0.00 | 1.00 |
| sequoia_600b | blog | 0.85 | 0.50 | 1.00 | 1.00 |
| paulgraham | blog | 1.00 | 1.00 | 1.00 | 1.00 |
| a16z_ai_apps | blog | 0.70 | 0.50 | 0.67 | 1.00 |
| reuters_article | news | 1.00 | 1.00 | 1.00 | 1.00 |
| medium_article | blog | 1.00 | 1.00 | 1.00 | 1.00 |
| python_docs | docs | 1.00 | 1.00 | 1.00 | 1.00 |
| arxiv_attention | academic | 0.85 | 0.50 | 1.00 | 1.00 |
| hn_show | forum | 1.00 | 1.00 | 1.00 | 1.00 |
| hacker_news_comment | forum | 0.35 | 0.00 | 1.00 | 1.00 |
Before cleaning (the raw pass-through baseline, layout-table debris intact):

```
# Writes and Write-Nots | | | | | | --- | --- | --- | --- | | | | | | | --- | | Writes and Write-Nots October 2024 I'm usually reluctant to make predictions about technology, but I feel fairly confident about this one...
```

After cleaning:

```
# Writes and Write-Nots October 2024 I'm usually reluctant to make predictions about technology, but I feel fairly confident about this one: in a couple decades there won't be many people who can write. One of the strangest things you learn if you're a writer is how many people have trouble writing...
```
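The three deterministic axes reduce to substring checks. A sketch under assumed semantics (the field names mirror the report table; the real gym's overall score combines these with weights not shown here): intrusion is the fraction of known boilerplate strings successfully stripped, anchor the fraction of must-survive strings present, must-not the fraction of forbidden strings absent.

```python
def score_axes(text: str, intrusions, anchors, must_not):
    """Score one site on the three deterministic axes.

    Sketch only: each axis is a fraction of passed substring
    checks; an empty check list scores a vacuous 1.0.
    """
    frac = lambda hits, total: 1.0 if total == 0 else hits / total
    return {
        "intrusion": frac(sum(s not in text for s in intrusions), len(intrusions)),
        "anchor":    frac(sum(s in text for s in anchors), len(anchors)),
        "must_not":  frac(sum(s not in text for s in must_not), len(must_not)),
    }

cleaned = "Coinbase launched the first regulated futures product."
print(score_axes(cleaned,
                 intrusions=["Subscribe for free"],
                 anchors=["Coinbase launched the first"],
                 must_not=["Sign in"]))
# → {'intrusion': 1.0, 'anchor': 1.0, 'must_not': 1.0}
```

No LLM judge is involved: a site scores 1.0 on an axis exactly when every one of its string checks passes.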
Extract structured claims from investment documents — entities, raw facts, interpretations, and influence relationships. An Opus judge scores each candidate against a human-curated reference. 120 candidates total.
| Model | Avg Score |
|---|---|
| opus | 88.5 |
| gpt | 81.7 |
| sonnet | 81.2 |
| haiku | 77.3 |
| grok-fast | 56.5 |
| gemini-pro | 0.0 |
| Criterion | Weight | What it measures |
|---|---|---|
| Entity coverage | 15% | Did the model find all significant entities? |
| Atom completeness | 20% | Were key facts and data points captured? |
| Assertion quality | 20% | Are interpretations well-formed and distinct from raw facts? |
| Relationship accuracy | 20% | Are influence links correctly identified with appropriate weights? |
| Precision | 15% | No hallucinated claims, spurious entities, or invented relationships |
| Source grounding | 10% | Every claim backed by an accurate quote from the document |
| Criterion | Score |
|---|---|
| Entity coverage | 100 |
| Atom completeness | 95 |
| Assertion quality | 93 |
| Relationship accuracy | 90 |
| Precision | 93 |
| Source grounding | 95 |
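Folding that scorecard through the rubric weights gives the overall score. A sketch (the dictionary keys are illustrative, not the gym's actual field names):

```python
# Rubric weights from the criteria table; scores from the example scorecard.
WEIGHTS = {
    "entity_coverage": 0.15, "atom_completeness": 0.20,
    "assertion_quality": 0.20, "relationship_accuracy": 0.20,
    "precision": 0.15, "source_grounding": 0.10,
}
scores = {
    "entity_coverage": 100, "atom_completeness": 95,
    "assertion_quality": 93, "relationship_accuracy": 90,
    "precision": 93, "source_grounding": 95,
}
overall = sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
print(round(overall, 2))  # → 94.05
```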
```json
{
  "entities": [
    {"id": "e1", "name": "Heliad Equity Partners", "type": "company"},
    {"id": "e2", "name": "Andreas Lange", "type": "person"}
  ],
  "claims": [
    {
      "id": "c1", "claim_type": "atom",
      "text": "Heliad is a publicly traded investment firm with 37M Euro market cap",
      "confidence": 0.95,
      "source_quote": "Heliad Equity Partners is a publicly traded investment firm in Germany with a market cap of 37M Euro."
    },
    {
      "id": "c3", "claim_type": "assertion",
      "text": "Andreas Lange is considered a rising star in German private equity",
      "confidence": 0.75, "direction": "bullish",
      "source_quote": "Andreas Lange, a mid 30s executive who is considered to be a rising star..."
    }
  ]
}
```
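The source-grounding criterion lends itself to a mechanical first pass before the judge runs: every claim's source_quote should appear verbatim in the document. A sketch (the trailing-ellipsis handling is an assumption about how truncated quotes are marked):

```python
def ungrounded(claims, document: str) -> list[str]:
    """Return ids of claims whose source_quote is not found verbatim
    in the document. A trailing '...' is treated as a truncation
    marker and stripped before matching."""
    missing = []
    for c in claims:
        quote = c["source_quote"].removesuffix("...")
        if quote not in document:
            missing.append(c["id"])
    return missing

doc = "Heliad Equity Partners is a publicly traded investment firm."
claims = [{"id": "c1", "source_quote": "publicly traded investment firm"},
          {"id": "c2", "source_quote": "rising star..."}]
print(ungrounded(claims, doc))  # → ['c2']
```

Anything this check flags is either a hallucinated quote or a paraphrase, both of which the precision and grounding criteria penalize.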
Every gym is defined by a single YAML file. Here's an annotated example from the extraction gym:
```yaml
name: parse-extraction
description: HTML extraction + LLM cleaning quality

measures:
  capability: lib/parse          # what code is being tested
  question: "Does extraction preserve content while removing boilerplate?"  # answerable with data
  baseline: "Raw HTML passed through — no cleaning"  # what we compare against

steps:
  - op: source                   # load test data from corpus
    params:
      items:
        - id: substack_waxman
          url: "https://waxmand.substack.com/p/..."
          intrusions: ["Subscribe for free"]        # should be stripped
          anchors: ["Coinbase launched the first"]  # must survive

  - op: task                     # run the extraction pipeline
    params:
      fn: lib.parse.gym.gym_ops.fetch_and_extract

  - op: evaluate                 # score each result
    params:
      fn: lib.parse.gym.gym_ops.score
      per_item: true

limits:
  usd: 2.0                       # cost guardrail per run
```
The measures block forces you to articulate what "good" means before writing evaluation code. The steps define a pipeline: source items → run the task → score results. The corpus (items with expected behaviors) is the ground truth.
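That steps pipeline maps naturally onto a tiny interpreter. A sketch, not the real runner: the op names follow the YAML above, but the toy op implementations here are stand-ins for the actual source, task, and evaluate functions:

```python
def run_gym(spec: dict, ops: dict):
    """Execute a gym spec: each step's op resolves to a function
    taking (params, items) and returning the new item list."""
    items = []
    for step in spec["steps"]:
        fn = ops[step["op"]]
        items = fn(step.get("params", {}), items)
    return items

# Toy ops mirroring source -> task -> evaluate:
ops = {
    "source":   lambda p, _: list(p["items"]),
    "task":     lambda p, items: [dict(it, output=it["id"].upper()) for it in items],
    "evaluate": lambda p, items: [dict(it, score=1.0) for it in items],
}
spec = {"steps": [
    {"op": "source", "params": {"items": [{"id": "substack_waxman"}]}},
    {"op": "task"},
    {"op": "evaluate"},
]}
print(run_gym(spec, ops))
# → [{'id': 'substack_waxman', 'output': 'SUBSTACK_WAXMAN', 'score': 1.0}]
```

The design choice worth noting: because every op has the same (params, items) shape, adding a new step kind is one dictionary entry, and a spec stays declarative data.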
```
gym list               # discover all gyms in the project
gym run extraction     # execute the extraction gym pipeline
gym report extraction  # render an HTML report from results
```
Gyms co-locate with the code they test. lib/parse/gym/ tests the parsing library. draft/style/gym/ tests writing style analysis. Co-location keeps the quality contract visible to anyone working on the code.
Several excellent tools exist for evaluating LLM output. Gyms share their measurement philosophy but add the learn step — closing the loop from score to improvement.
| Tool | Measures | Diagnoses | Improves | Persists |
|---|---|---|---|---|
| Promptfoo | ✓ | — | — | — |
| Braintrust | ✓ | partial | — | — |
| LangSmith | ✓ | ✓ | — | — |
| pytest | ✓ | — | — | — |
| Gyms | ✓ | ✓ | ✓ | ✓ |
Rivus has two quality measurement systems — benchmarks and gyms. They serve fundamentally different purposes but form a complete quality loop together.
**Benchmarks measure potential.** How good is this model at reasoning, math, writing? Standardized tasks, fixed data, someone else's definition of quality.

**Gyms measure actuality.** How good is your system at your task? Your corpus, your pipeline, your data, your definition of quality.
The gym framework has three layers of maturity. Today we have the first. The others are emerging or envisioned — marked clearly.
| Gym | Status | What It Improves |
|---|---|---|
| ASCII Diagrams | EXISTS | Box-drawing diagram quality |
| Extraction | EXISTS | HTML extraction + LLM cleaning |
| Claim Extraction | EXISTS | Structured claim completeness |
| Style Evaluation | EMERGING | Writing quality scoring |
| Redline | EMERGING | Editorial suggestion quality |
| Revision | EMERGING | Revision voice-preservation |
| Fetchability | EXISTS | Proxy/fetch method selection |
| Badge | EXISTS | Session topic summarization |
| Recall | VISION | Knowledge retrieval quality |
| Sidekick | VISION | Intervention timing |
Corpus = accumulated taste. Not just test cases — the specific edge cases and quality standards your domain demands. Each correction makes the gym more discriminating. Each discriminating test makes the capability more reliable. The corpus is the moat.
Skills persist across sessions. A gym workout produces reusable knowledge, not just a score. The ASCII diagram skill applies to every future diagram. The extraction conventions improve every future parse. Knowledge compounds.
Cross-gym amplification. Better extraction feeds better claims. Better claims feed better knowledge. Better knowledge feeds better everything. When gym A improves, gym B's inputs get better for free.
The system accelerates. The first workout takes hours — build the corpus, define scoring, iterate. The tenth workout takes minutes — the gym knows what to look for, the scoring is calibrated, the baseline is established. Improvement gets cheaper over time.