Extraction Gym

The Gen → Eval → Learn Loop in Action — how we went from naive HTML extraction to measured, reliable content quality

- Baseline Score: 0.81 (readability only)
- With LLM Clean: 0.89 (+0.08 avg improvement)
- Best Single-Site Lift: 0.50 → 1.00 (Substack newsletter: perfect after clean)
- Cost per Run: $0.002 (6 sites × gemini-lite)

The Problem: HTML is 90% Noise

A typical web page is overwhelmingly boilerplate — navigation, CTAs, social widgets, cookie banners, subscribe prompts. The actual article content is a small fraction of the raw HTML.

- Substack: 178 KB raw HTML → 11 KB after readability → 10.6 KB after LLM clean
- Reuters: 586 KB raw → 2.2 KB after cleaning

The Compression Story
Reuters: 586 KB → 2.2 KB (99.6% removed). Substack: 178 KB → 10.6 KB (94% removed). Paul Graham's minimal site: 9 KB → 3 KB (67% removed — already clean HTML). The noisier the page, the more value extraction delivers.

The Pipeline

1. Fetch: HTTP + cache, stealth headers
2. Extract: readability algorithm + markdownify
3. LLM Clean: gemini-lite, temp=0, removes UI chrome
4. Score: deterministic substring matching
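The four stages compose naturally. A minimal sketch of the pipeline shape — the `Pipeline` class and the stub stages below are illustrative, not the project's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Pipeline:
    fetch: Callable[[str], str]    # URL -> raw HTML (HTTP + cache, stealth headers)
    extract: Callable[[str], str]  # HTML -> markdown (readability + markdownify)
    clean: Callable[[str], str]    # markdown -> cleaned markdown (LLM pass, temp=0)

    def run(self, url: str) -> str:
        return self.clean(self.extract(self.fetch(url)))

# Stub stages stand in for the real fetch/readability/LLM calls.
pipe = Pipeline(
    fetch=lambda url: "<article><p>Body text.</p><div>Subscribe</div></article>",
    extract=lambda html: "Body text.\n\nSubscribe",
    clean=lambda md: md.replace("\n\nSubscribe", ""),
)
print(pipe.run("https://example.com"))  # -> Body text.
```

Keeping the stages as plain callables means each one can be swapped or disabled (e.g. skipping the LLM pass) without touching the others.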

Three scoring dimensions, weighted by importance:

| Dimension | Weight | What It Measures | Method |
|---|---|---|---|
| Anchor Retention | 0.50 | Key phrases that must survive extraction | substring match |
| Intrusion Removal | 0.30 | UI chrome that should be stripped | substring match |
| Must-Not-Contain | 0.20 | Exact patterns that must be absent | substring match |
Design Choice: Deterministic Scoring
Scoring uses simple substring matching — no LLM judge. This makes scores reproducible, fast, and free. The same extraction always gets the same score. Regressions are instantly detectable.
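A minimal sketch of such a scorer, using the three dimensions and weights from the table above (function and variable names are illustrative, and corpus phrase lists are assumed non-empty):

```python
WEIGHTS = {"anchor": 0.50, "intrusion": 0.30, "must_not": 0.20}

def frac_present(text: str, phrases: list[str]) -> float:
    """Fraction of phrases found verbatim in text (plain substring match)."""
    return sum(p in text for p in phrases) / len(phrases) if phrases else 0.0

def score(text, anchors, intrusions, must_not):
    parts = {
        "anchor": frac_present(text, anchors),              # must survive
        "intrusion": 1.0 - frac_present(text, intrusions),  # must be stripped
        "must_not": 1.0 - frac_present(text, must_not),     # must be absent
    }
    return sum(WEIGHTS[k] * v for k, v in parts.items()), parts

# Baseline-style text: the anchor survives, but so does the CTA chrome.
overall, _ = score(
    "12. KITE AI: $18M Series A ... Subscribe ... Share",
    anchors=["KITE AI"], intrusions=["Subscribe"], must_not=["Share"],
)
print(overall)  # -> 0.5
```

Because every check is a substring match, the same text always produces the same number: no sampling, no API calls, no drift.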

The Gen → Eval → Learn Loop

1. Generate: fetch + extract + clean
2. Evaluate: score against corpus
3. Learn: grow corpus, tune pipeline
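In code, one gym iteration is just that loop: generate an extraction per corpus entry, evaluate it, and collect low scorers as the review items that feed the next round. A sketch with illustrative names and stubbed stages:

```python
def gym_iteration(corpus, run_pipeline, score_fn, fail_below=0.5):
    results = []
    for site in corpus:
        text = run_pipeline(site["url"])                       # Generate
        results.append({"name": site["name"],
                        "score": score_fn(text, site)})        # Evaluate
    # Learn: low scorers surface as human-review items -> new corpus entries.
    review = [r for r in results if r["score"] < fail_below]
    return results, review

corpus = [{"name": "paywalled_site", "url": "https://example.com/a"}]
results, review = gym_iteration(
    corpus,
    run_pipeline=lambda url: "",        # stub: nothing extracted
    score_fn=lambda text, site: 0.35,   # stub: paywall-style score
)
print([r["name"] for r in review])  # -> ['paywalled_site']
```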

Iteration History

March 7, 2026
Run #1 — Pipeline Wiring
First end-to-end test. 1/6 sites fetched — rest hit cached error responses. Validated pipeline wiring: gym.yaml → vario.run() → tasks.py → JSONL. Found cache staleness bug.
March 11, 2026 — AM
Run #2 — Baseline Established
All 6 sites, no LLM cleaning. Average score: 0.81. Discovered newsletters are the hard category — Substack CTAs survive readability extraction.
March 11, 2026 — PM
Run #3 — LLM Clean Added
Same 6 sites with gemini-lite cleaning pass. Average: 0.89 (+0.08). Waxman newsletter: 0.50 → 1.00. Paywall site still fails (expected).
March 11, 2026
HTML Report Generator
Built report.py — standalone HTML reports from JSONL with score matrices, per-site cards, category breakdowns. Light theme (later migrated to dark).
Next
Corpus Expansion
Target: 20+ sites across 6 categories. Priority: more newsletters (hardest category), academic papers, forums. Production flagging (Gradio UI) feeds new entries.

Before & After: The Scores

Each site scored with and without LLM cleaning (overall score, 0–1):

| Site | Category | Baseline | + Clean |
|---|---|---|---|
| substack_waxman | newsletter | 0.50 | 1.00 |
| substack_mauboussin | newsletter | 0.35 | 0.35 |
| reuters_article | news | 1.00 | 1.00 |
| paulgraham | blog | 1.00 | 1.00 |
| medium_article | blog | 1.00 | 1.00 |
| python_docs | docs | 1.00 | 1.00 |
The Paywall Problem
substack_mauboussin scores 0.35 in both modes — the content is behind a paywall. Only 76 chars extracted from 89KB of HTML. LLM cleaning can't fix what readability never received. This is an expected failure that validates our scoring: the gym correctly identifies extraction problems even when the root cause is access, not algorithm.
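A cheap guard catches this failure class before scoring: when almost nothing survives extraction, flag the run for review rather than blaming the cleaner. The thresholds below are illustrative, not the gym's actual values:

```python
def looks_like_access_failure(raw_bytes: int, extracted_chars: int,
                              min_chars: int = 500,
                              max_ratio: float = 0.01) -> bool:
    """True when extraction yielded only a stub (e.g. a paywall teaser)."""
    return extracted_chars < min_chars and extracted_chars < raw_bytes * max_ratio

print(looks_like_access_failure(89_000, 76))    # mauboussin-style case -> True
print(looks_like_access_failure(9_000, 3_200))  # paulgraham-style case -> False
```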

The Money Shot: Before & After Text

substack_waxman — Newsletter CTA Removal

This is the site where LLM cleaning makes the biggest difference. The article content is preserved perfectly, but the Substack boilerplate at the end is stripped:

Baseline (readability only) — Score: 0.50
...12. KITE AI: $18M Series A, $33M cumulative
(Sep 2025) — PayPal Ventures, General Catalyst;
CoinDesk; Fortune

Thanks for reading! Subscribe for free to
receive new posts and support my work.

Subscribe

6

3

1

Share
With LLM Clean — Score: 1.00
...12. KITE AI: $18M Series A, $33M cumulative
(Sep 2025) — PayPal Ventures, General Catalyst;
CoinDesk; Fortune

✓ Clean ending — no CTA, no share counts,
  no subscribe prompts. Article content intact.

python_docs — Permalink Artifact Removal

Even well-structured docs have extraction artifacts. The LLM removes permalink anchors and navigation chrome:

Baseline — 3,746 chars
# `asyncio` — Asynchronous I/O[¶](#module-asyncio
"Link to this heading")

---

asyncio is a library to write **concurrent** code
using the **async/await** syntax.

asyncio is used as a foundation for multiple Python
asynchronous frameworks...
With LLM Clean — 2,381 chars (−36%)
# `asyncio` — Asynchronous I/O

asyncio is a library to write **concurrent** code
using the **async/await** syntax.

asyncio is used as a foundation for multiple Python
asynchronous frameworks...

✓ Permalink anchors removed, horizontal rules
  cleaned, navigation stripped. Content preserved.
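For this particular artifact class an LLM is not strictly required; a deterministic regex pass could strip Sphinx-style permalink anchors. This is a sketch of an alternative, not what the pipeline actually does:

```python
import re

# Matches markdown permalink anchors like: [¶](#module-asyncio "Link to this heading")
PERMALINK = re.compile(r'\[¶\]\([^)]*\)')

line = '# `asyncio` — Asynchronous I/O[¶](#module-asyncio "Link to this heading")'
print(PERMALINK.sub('', line))  # -> # `asyncio` — Asynchronous I/O
```

The tradeoff: regexes must be written per artifact pattern, while the LLM pass generalizes to chrome it has never seen.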

Full Score Matrix

| Site | Category | Mode | Overall | Intrusion | Anchor | Must-Not | Raw → Clean | Timing |
|---|---|---|---|---|---|---|---|---|
| substack_waxman | newsletter | baseline | 0.50 | 0.00 | 1.00 | 0.00 | 178K → 11K | 69ms |
| substack_waxman | newsletter | + clean | 1.00 | 1.00 | 1.00 | 1.00 | 178K → 10.6K | 10.7s |
| substack_mauboussin | newsletter | baseline | 0.35 | 0.50 | 0.00 | 1.00 | 89K → 76 | 417ms |
| substack_mauboussin | newsletter | + clean | 0.35 | 0.50 | 0.00 | 1.00 | 89K → 76 | 692ms |
| reuters_article | news | baseline | 1.00 | 1.00 | 1.00 | 1.00 | 586K → 2.2K | 60ms |
| reuters_article | news | + clean | 1.00 | 1.00 | 1.00 | 1.00 | 586K → 2.2K | 2.5s |
| paulgraham | blog | baseline | 1.00 | 1.00 | 1.00 | 1.00 | 9K → 3.2K | 11ms |
| paulgraham | blog | + clean | 1.00 | 1.00 | 1.00 | 1.00 | 9K → 3.1K | 3.7s |
| medium_article | blog | baseline | 1.00 | 1.00 | 1.00 | 1.00 | 150K → 14.8K | 683ms |
| medium_article | blog | + clean | 1.00 | 1.00 | 1.00 | 1.00 | 150K → 14.3K | 11.2s |
| python_docs | docs | baseline | 1.00 | 1.00 | 1.00 | 1.00 | 25K → 3.7K | 67ms |
| python_docs | docs | + clean | 1.00 | 1.00 | 1.00 | 1.00 | 25K → 2.4K | 3.4s |

Category Patterns

| Category | Sites | Baseline Avg | Clean Avg | Delta | Insight |
|---|---|---|---|---|---|
| newsletter | 2 | 0.43 | 0.68 | +0.25 | Hardest category. CTAs and social widgets survive readability; LLM clean is essential here. Paywalls remain unsolved. |
| news | 1 | 1.00 | 1.00 | 0.00 | Clean extraction from well-structured HTML. LLM clean adds no value but doesn't hurt. |
| blog | 2 | 1.00 | 1.00 | 0.00 | Blog platforms (Medium, personal sites) extract cleanly; readability handles them well. |
| docs | 1 | 1.00 | 1.00 | 0.00 | Documentation sites have clean HTML structure. LLM clean removes permalink artifacts (cosmetic improvement, not scored). |
Key Insight: LLM Clean is Targeted, Not Universal
LLM cleaning delivers massive value on newsletter content (the hardest category) while having zero negative impact on already-clean content. It's not a blunt instrument — it's a precision tool for the specific failure mode that readability can't handle: distinguishing article text from platform UI text that readability considers "content."

Cost-Benefit Analysis

| Metric | Baseline | With LLM Clean | Delta |
|---|---|---|---|
| Average Score | 0.81 | 0.89 | +0.08 |
| Perfect Scores (1.00) | 4 / 6 | 5 / 6 | +1 |
| Failing Sites (<0.5) | 1 / 6 | 1 / 6 | 0 (paywall) |
| Cost per Run | $0.000 | $0.002 | +$0.002 |
| Avg Latency | 218ms | 5.4s | +5.2s |
The Tradeoff
LLM cleaning costs $0.0003 per page and adds ~5 seconds of latency per page. For batch processing (the primary use case), this is negligible. For interactive use, consider cleaning only newsletter/social content, where it matters most.
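That tradeoff suggests a simple policy: always clean in batch mode, and clean selectively when interactive. A sketch — the category names follow the gym's corpus, but the function itself is hypothetical:

```python
# Categories where LLM clean measurably moved the score in the gym runs.
HARD_CATEGORIES = {"newsletter"}

def should_llm_clean(category: str, batch_mode: bool) -> bool:
    # Batch: ~$0.0003/page and ~5s latency are negligible, so always clean.
    # Interactive: spend the 5 seconds only where it pays.
    return batch_mode or category in HARD_CATEGORIES

print(should_llm_clean("newsletter", batch_mode=False))  # -> True
print(should_llm_clean("docs", batch_mode=False))        # -> False
```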


The System Architecture

Gym = Steering Module on a Job
The extraction gym doesn't run its own pipeline — it scores the output of the extraction pipeline (lib/fetch/ + lib/parse/). The gym is a quality feedback loop:
  1. Pipeline extracts content from URLs (production or corpus)
  2. Gym scores extraction quality against known expectations
  3. Low scores surface as review items
  4. Human corrections grow the corpus
  5. Gym re-evaluates — the loop closes

The Gradio UI (app.py) enables live testing, result browsing, and flagging bad extractions — which feeds directly back into corpus/flagged.yaml for future gym runs.
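A flagged-extraction corpus entry could look something like this — field names, phrases, and the URL are all illustrative, and the actual corpus/flagged.yaml schema may differ:

```yaml
# Illustrative entry; the real corpus/flagged.yaml schema may differ.
- name: substack_example
  category: newsletter
  url: https://example.substack.com/p/some-post
  anchors:                # key phrases that must survive (weight 0.50)
    - "the opening claim of the article"
  intrusions:             # UI chrome that must be stripped (weight 0.30)
    - "Thanks for reading! Subscribe for free"
  must_not_contain:       # patterns that must be absent (weight 0.20)
    - "Share this post"
```

Each human correction adds one such entry, so the deterministic scorer gets stricter with every loop.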