The Gen → Eval → Learn Loop in Action — how we went from naive HTML extraction to measured, reliable content quality
A typical web page is overwhelmingly boilerplate — navigation, CTAs, social widgets, cookie banners, subscribe prompts. The actual article content is a small fraction of the raw HTML.
Three scoring dimensions, weighted by importance:
| Dimension | Weight | What It Measures | Method |
|---|---|---|---|
| Anchor Retention | 0.50 | Key phrases that must survive extraction | substring match |
| Intrusion Removal | 0.30 | UI chrome that should be stripped | substring match |
| Must-Not-Contain | 0.20 | Exact patterns that must be absent | substring match |
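The weighting above can be sketched as a single scoring function. This is a minimal sketch, not the gym's actual API: the weights come from the table, but the function and field names are illustrative.

```python
# Minimal sketch of the weighted scorer; WEIGHTS mirror the table above,
# everything else (names, signature) is illustrative.
WEIGHTS = {"anchor": 0.50, "intrusion": 0.30, "must_not": 0.20}

def fraction_present(phrases, text):
    """Fraction of phrases found in text via plain substring match."""
    if not phrases:
        return 1.0
    return sum(p in text for p in phrases) / len(phrases)

def score(text, anchors, intrusions, must_not):
    """Weighted overall score in [0, 1]: anchors should be present,
    intrusions and must-not patterns should be absent."""
    dims = {
        "anchor": fraction_present(anchors, text),
        "intrusion": 1.0 - fraction_present(intrusions, text),
        "must_not": 1.0 - fraction_present(must_not, text),
    }
    return sum(WEIGHTS[k] * v for k, v in dims.items()), dims
```

With all anchors present but an intrusion phrase surviving and a must-not pattern present, this yields 0.50, matching the kind of baseline score seen below.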
The harness pipeline: gym.yaml → vario.run() → tasks.py → JSONL. Building it surfaced a cache-staleness bug. Each site is scored with and without LLM cleaning; bars show the overall score (0–1).
substack_mauboussin scores 0.35 in both modes — the content is behind a paywall. Only 76 chars extracted from 89KB of HTML. LLM cleaning can't fix what readability never received. This is an expected failure that validates our scoring: the gym correctly identifies extraction problems even when the root cause is access, not algorithm.
substack_waxman is the site where LLM cleaning makes the biggest difference. The article content is preserved perfectly, and the Substack boilerplate at the end is stripped:
Before (baseline):

```
...12. KITE AI: $18M Series A, $33M cumulative (Sep 2025) — PayPal Ventures,
General Catalyst; CoinDesk; Fortune Thanks for reading! Subscribe for free to
receive new posts and support my work. Subscribe 6 3 1 Share
```

After (+ clean):

```
...12. KITE AI: $18M Series A, $33M cumulative
(Sep 2025) — PayPal Ventures, General Catalyst;
CoinDesk; Fortune
```

✓ Clean ending — no CTA, no share counts, no subscribe prompts. Article content intact.
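For illustration, the tail cleanup shown above can be approximated deterministically. The real system uses an LLM, not regexes; these patterns are invented for the sketch and would over-strip real articles (e.g. any standalone number).

```python
import re

# Illustrative Substack footer patterns. The actual gym uses an LLM clean
# pass; this sketch only shows the kind of tail boilerplate it removes.
TAIL_PATTERNS = [
    r"Thanks for reading!.*?support my work\.",
    r"\bSubscribe\b",
    r"\bShare\b",
    r"(?<=\s)\d+(?=\s|$)",  # bare engagement counts like "6 3 1"
]

def strip_tail_boilerplate(text: str) -> str:
    for pat in TAIL_PATTERNS:
        text = re.sub(pat, "", text, flags=re.DOTALL)
    return re.sub(r"\s+", " ", text).strip()
```

Run on the baseline tail above, this leaves the article ending at "CoinDesk; Fortune" with the CTA, counts, and prompts gone.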
Even well-structured docs have extraction artifacts. The LLM removes permalink anchors and navigation chrome:
Before (baseline):

```
# `asyncio` — Asynchronous I/O[¶](#module-asyncio "Link to this heading")

---

asyncio is a library to write **concurrent** code using the **async/await** syntax.

asyncio is used as a foundation for multiple Python asynchronous frameworks...
```

After (+ clean):

```
# `asyncio` — Asynchronous I/O

asyncio is a library to write **concurrent** code using the **async/await** syntax.

asyncio is used as a foundation for multiple Python asynchronous frameworks...
```

✓ Permalink anchors removed, horizontal rules cleaned, navigation stripped. Content preserved.
| Site | Category | Mode | Overall | Intrusion | Anchor | Must-Not | Raw → Clean | Timing |
|---|---|---|---|---|---|---|---|---|
| substack_waxman | substack | baseline | 0.50 | 0.00 | 1.00 | 0.00 | 178K → 11K | 69ms |
| | | + clean | 1.00 | 1.00 | 1.00 | 1.00 | 178K → 10.6K | 10.7s |
| substack_mauboussin | substack | baseline | 0.35 | 0.50 | 0.00 | 1.00 | 89K → 76 | 417ms |
| | | + clean | 0.35 | 0.50 | 0.00 | 1.00 | 89K → 76 | 692ms |
| reuters_article | news | baseline | 1.00 | 1.00 | 1.00 | 1.00 | 586K → 2.2K | 60ms |
| | | + clean | 1.00 | 1.00 | 1.00 | 1.00 | 586K → 2.2K | 2.5s |
| paulgraham | blog | baseline | 1.00 | 1.00 | 1.00 | 1.00 | 9K → 3.2K | 11ms |
| | | + clean | 1.00 | 1.00 | 1.00 | 1.00 | 9K → 3.1K | 3.7s |
| medium_article | blog | baseline | 1.00 | 1.00 | 1.00 | 1.00 | 150K → 14.8K | 683ms |
| | | + clean | 1.00 | 1.00 | 1.00 | 1.00 | 150K → 14.3K | 11.2s |
| python_docs | docs | baseline | 1.00 | 1.00 | 1.00 | 1.00 | 25K → 3.7K | 67ms |
| | | + clean | 1.00 | 1.00 | 1.00 | 1.00 | 25K → 2.4K | 3.4s |
| Category | Sites | Baseline Avg | Clean Avg | Delta | Insight |
|---|---|---|---|---|---|
| substack | 2 | 0.43 | 0.68 | +0.25 | Hardest category. CTAs and social widgets survive readability. LLM clean is essential here. Paywalls remain unsolved. |
| news | 1 | 1.00 | 1.00 | 0.00 | Clean extraction from well-structured HTML. LLM clean adds no value but doesn't hurt. |
| blog | 2 | 1.00 | 1.00 | 0.00 | Blog platforms (Medium, personal sites) extract cleanly. Readability handles them well. |
| docs | 1 | 1.00 | 1.00 | 0.00 | Documentation sites have clean HTML structure. LLM clean removes permalink artifacts (cosmetic improvement, not scored). |
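The category rollup is just a group-by over the per-site overall scores. A sketch using the numbers from the results table (the "substack" label and the function names are assumptions; the scores are the ones reported above):

```python
from collections import defaultdict

# Per-site (category, baseline overall, + clean overall) from the results table.
RESULTS = {
    "substack_waxman": ("substack", 0.50, 1.00),
    "substack_mauboussin": ("substack", 0.35, 0.35),
    "reuters_article": ("news", 1.00, 1.00),
    "paulgraham": ("blog", 1.00, 1.00),
    "medium_article": ("blog", 1.00, 1.00),
    "python_docs": ("docs", 1.00, 1.00),
}

def rollup(results):
    """Average baseline/clean score per category, plus the delta."""
    buckets = defaultdict(list)
    for category, base, clean in results.values():
        buckets[category].append((base, clean))
    out = {}
    for category, rows in buckets.items():
        b = sum(r[0] for r in rows) / len(rows)
        c = sum(r[1] for r in rows) / len(rows)
        out[category] = (b, c, c - b)
    return out
```

This reproduces the table: substack averages 0.43 → 0.68 (+0.25), while news, blog, and docs sit at 1.00 in both modes.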
| Metric | Baseline | With LLM Clean | Delta |
|---|---|---|---|
| Average Score | 0.81 | 0.89 | +0.08 |
| Perfect Scores (1.00) | 4 / 6 | 5 / 6 | +1 |
| Failing Sites (<0.5) | 1 / 6 | 1 / 6 | 0 (paywall) |
| Cost per Run | $0.000 | $0.002 | +$0.002 |
| Avg Latency | 218ms | 5.4s | +5.2s |
The extraction code itself lives in `lib/fetch/` + `lib/parse/`; the gym is a quality feedback loop around it.
The Gradio UI (`app.py`) enables live testing, result browsing, and flagging bad extractions — which feeds directly back into `corpus/flagged.yaml` for future gym runs.