Extraction Gym

The Gen → Eval → Learn Loop in Action — how we went from naive HTML extraction to measured, reliable content quality

- Baseline Score: 0.81 (readability only)
- With LLM Clean: 0.89 (+0.08 avg improvement)
- Best Single-Site Lift: 0.50 → 1.00 (Substack newsletter: perfect after clean)
- Cost per Run: $0.002 (6 sites × gemini-lite)

The Problem: HTML is 90% Noise

A typical web page is overwhelmingly boilerplate — navigation, CTAs, social widgets, cookie banners, subscribe prompts. The actual article content is a small fraction of the raw HTML.

- Substack: 178 KB raw HTML → 11 KB after readability → 10.6 KB after LLM clean
- Reuters: 586 KB raw → 2.2 KB after cleaning

The Compression Story
Reuters: 586 KB → 2.2 KB (99.6% removed). Substack: 178 KB → 10.6 KB (94% removed). Paul Graham's minimal site: 9 KB → 3 KB (67% removed — already clean HTML). The noisier the page, the more value extraction delivers.

The Pipeline

1. Fetch: HTTP + cache, stealth headers
2. Extract: readability algorithm + markdownify
3. LLM Clean: gemini-lite, temp=0, removes UI chrome
4. Score: deterministic substring matching
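The four stages compose naturally. A minimal sketch of the pipeline shape — the `Pipeline` class and the stub stages below are illustrative, not the project's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Pipeline:
    fetch: Callable[[str], str]    # URL -> raw HTML (HTTP + cache, stealth headers)
    extract: Callable[[str], str]  # HTML -> markdown (readability + markdownify)
    clean: Callable[[str], str]    # markdown -> cleaned markdown (LLM pass, temp=0)

    def run(self, url: str) -> str:
        return self.clean(self.extract(self.fetch(url)))

# Stub stages stand in for the real fetch/readability/LLM calls.
pipe = Pipeline(
    fetch=lambda url: "<article><p>Body text.</p><div>Subscribe</div></article>",
    extract=lambda html: "Body text.\n\nSubscribe",
    clean=lambda md: md.replace("\n\nSubscribe", ""),
)
print(pipe.run("https://example.com"))  # -> Body text.
```

Keeping the stages as plain callables means each one can be swapped or disabled (e.g. skipping the LLM pass) without touching the others.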

Three scoring dimensions, weighted by importance:

| Dimension | Weight | What It Measures | Method |
|---|---|---|---|
| Anchor Retention | 0.50 | Key phrases that must survive extraction | substring match |
| Intrusion Removal | 0.30 | UI chrome that should be stripped | substring match |
| Must-Not-Contain | 0.20 | Exact patterns that must be absent | substring match |
Design Choice: Deterministic Scoring
Scoring uses simple substring matching — no LLM judge. This makes scores reproducible, fast, and free. The same extraction always gets the same score. Regressions are instantly detectable.
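A minimal sketch of such a scorer, using the three dimensions and weights from the table above (function and variable names are illustrative, and corpus phrase lists are assumed non-empty):

```python
WEIGHTS = {"anchor": 0.50, "intrusion": 0.30, "must_not": 0.20}

def frac_present(text: str, phrases: list[str]) -> float:
    """Fraction of phrases found verbatim in text (plain substring match)."""
    return sum(p in text for p in phrases) / len(phrases) if phrases else 0.0

def score(text, anchors, intrusions, must_not):
    parts = {
        "anchor": frac_present(text, anchors),              # must survive
        "intrusion": 1.0 - frac_present(text, intrusions),  # must be stripped
        "must_not": 1.0 - frac_present(text, must_not),     # must be absent
    }
    return sum(WEIGHTS[k] * v for k, v in parts.items()), parts

# Baseline-style text: the anchor survives, but so does the CTA chrome.
overall, _ = score(
    "12. KITE AI: $18M Series A ... Subscribe ... Share",
    anchors=["KITE AI"], intrusions=["Subscribe"], must_not=["Share"],
)
print(overall)  # -> 0.5
```

Because every check is a substring match, the same text always produces the same number: no sampling, no API calls, no drift.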

The Gen → Eval → Learn Loop

1. Generate: fetch + extract + clean
2. Evaluate: score against corpus
3. Learn: grow corpus, tune pipeline
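In code, one gym iteration is just that loop: generate an extraction per corpus entry, evaluate it, and collect low scorers as the review items that feed the next round. A sketch with illustrative names and stubbed stages:

```python
def gym_iteration(corpus, run_pipeline, score_fn, fail_below=0.5):
    results = []
    for site in corpus:
        text = run_pipeline(site["url"])                       # Generate
        results.append({"name": site["name"],
                        "score": score_fn(text, site)})        # Evaluate
    # Learn: low scorers surface as human-review items -> new corpus entries.
    review = [r for r in results if r["score"] < fail_below]
    return results, review

corpus = [{"name": "paywalled_site", "url": "https://example.com/a"}]
results, review = gym_iteration(
    corpus,
    run_pipeline=lambda url: "",        # stub: nothing extracted
    score_fn=lambda text, site: 0.35,   # stub: paywall-style score
)
print([r["name"] for r in review])  # -> ['paywalled_site']
```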

Iteration History

March 7, 2026
Run #1 — Pipeline Wiring
First end-to-end test. 1/6 sites fetched — rest hit cached error responses. Validated pipeline wiring: gym.yaml → vario.run() → tasks.py → JSONL. Found cache staleness bug.
March 11, 2026 — AM
Run #2 — Baseline Established
All 6 sites, no LLM cleaning. Average score: 0.81. Discovered newsletters are the hard category — Substack CTAs survive readability extraction.
March 11, 2026 — PM
Run #3 — LLM Clean Added
Same 6 sites with gemini-lite cleaning pass. Average: 0.89 (+0.08). Waxman newsletter: 0.50 → 1.00. Paywall site still fails (expected).
March 11, 2026
HTML Report Generator
Built report.py — standalone HTML reports from JSONL with score matrices, per-site cards, category breakdowns. Light theme (later migrated to dark).
Next
Corpus Expansion
Target: 20+ sites across 6 categories. Priority: more newsletters (hardest category), academic papers, forums. Production flagging (Gradio UI) feeds new entries.

Before & After: The Scores

Each site scored with and without LLM cleaning (overall score, 0–1):

| Site | Category | Baseline | + Clean |
|---|---|---|---|
| substack_waxman | newsletter | 0.50 | 1.00 |
| substack_mauboussin | newsletter | 0.35 | 0.35 |
| reuters_article | news | 1.00 | 1.00 |
| paulgraham | blog | 1.00 | 1.00 |
| medium_article | blog | 1.00 | 1.00 |
| python_docs | docs | 1.00 | 1.00 |
The Paywall Problem
substack_mauboussin scores 0.35 in both modes — the content is behind a paywall. Only 76 chars extracted from 89KB of HTML. LLM cleaning can't fix what readability never received. This is an expected failure that validates our scoring: the gym correctly identifies extraction problems even when the root cause is access, not algorithm.
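A cheap guard catches this failure class before scoring: when almost nothing survives extraction, flag the run for review rather than blaming the cleaner. The thresholds below are illustrative, not the gym's actual values:

```python
def looks_like_access_failure(raw_bytes: int, extracted_chars: int,
                              min_chars: int = 500,
                              max_ratio: float = 0.01) -> bool:
    """True when extraction yielded only a stub (e.g. a paywall teaser)."""
    return extracted_chars < min_chars and extracted_chars < raw_bytes * max_ratio

print(looks_like_access_failure(89_000, 76))    # mauboussin-style case -> True
print(looks_like_access_failure(9_000, 3_200))  # paulgraham-style case -> False
```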

The Money Shot: Before & After Text

substack_waxman — Newsletter CTA Removal

This is the site where LLM cleaning makes the biggest difference. The article content is preserved perfectly, but the Substack boilerplate at the end is stripped:

Baseline (readability only) — Score: 0.50
...12. KITE AI: $18M Series A, $33M cumulative
(Sep 2025) — PayPal Ventures, General Catalyst;
CoinDesk; Fortune

Thanks for reading! Subscribe for free to
receive new posts and support my work.

Subscribe

6

3

1

Share
With LLM Clean — Score: 1.00
...12. KITE AI: $18M Series A, $33M cumulative
(Sep 2025) — PayPal Ventures, General Catalyst;
CoinDesk; Fortune

✓ Clean ending — no CTA, no share counts,
  no subscribe prompts. Article content intact.

python_docs — Permalink Artifact Removal

Even well-structured docs have extraction artifacts. The LLM removes permalink anchors and navigation chrome:

Baseline — 3,746 chars
# `asyncio` — Asynchronous I/O[¶](#module-asyncio
"Link to this heading")

---

asyncio is a library to write **concurrent** code
using the **async/await** syntax.

asyncio is used as a foundation for multiple Python
asynchronous frameworks...
With LLM Clean — 2,381 chars (−36%)
# `asyncio` — Asynchronous I/O

asyncio is a library to write **concurrent** code
using the **async/await** syntax.

asyncio is used as a foundation for multiple Python
asynchronous frameworks...

✓ Permalink anchors removed, horizontal rules
  cleaned, navigation stripped. Content preserved.
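For this particular artifact class an LLM is not strictly required; a deterministic regex pass could strip Sphinx-style permalink anchors. This is a sketch of an alternative, not what the pipeline actually does:

```python
import re

# Matches markdown permalink anchors like: [¶](#module-asyncio "Link to this heading")
PERMALINK = re.compile(r'\[¶\]\([^)]*\)')

line = '# `asyncio` — Asynchronous I/O[¶](#module-asyncio "Link to this heading")'
print(PERMALINK.sub('', line))  # -> # `asyncio` — Asynchronous I/O
```

The tradeoff: regexes must be written per artifact pattern, while the LLM pass generalizes to chrome it has never seen.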

Full Score Matrix

| Site | Category | Mode | Overall | Intrusion | Anchor | Must-Not | Raw → Clean | Timing |
|---|---|---|---|---|---|---|---|---|
| substack_waxman | newsletter | baseline | 0.50 | 0.00 | 1.00 | 0.00 | 178K → 11K | 69ms |
| substack_waxman | newsletter | + clean | 1.00 | 1.00 | 1.00 | 1.00 | 178K → 10.6K | 10.7s |
| substack_mauboussin | newsletter | baseline | 0.35 | 0.50 | 0.00 | 1.00 | 89K → 76 | 417ms |
| substack_mauboussin | newsletter | + clean | 0.35 | 0.50 | 0.00 | 1.00 | 89K → 76 | 692ms |
| reuters_article | news | baseline | 1.00 | 1.00 | 1.00 | 1.00 | 586K → 2.2K | 60ms |
| reuters_article | news | + clean | 1.00 | 1.00 | 1.00 | 1.00 | 586K → 2.2K | 2.5s |
| paulgraham | blog | baseline | 1.00 | 1.00 | 1.00 | 1.00 | 9K → 3.2K | 11ms |
| paulgraham | blog | + clean | 1.00 | 1.00 | 1.00 | 1.00 | 9K → 3.1K | 3.7s |
| medium_article | blog | baseline | 1.00 | 1.00 | 1.00 | 1.00 | 150K → 14.8K | 683ms |
| medium_article | blog | + clean | 1.00 | 1.00 | 1.00 | 1.00 | 150K → 14.3K | 11.2s |
| python_docs | docs | baseline | 1.00 | 1.00 | 1.00 | 1.00 | 25K → 3.7K | 67ms |
| python_docs | docs | + clean | 1.00 | 1.00 | 1.00 | 1.00 | 25K → 2.4K | 3.4s |

Category Patterns

| Category | Sites | Baseline Avg | Clean Avg | Delta | Insight |
|---|---|---|---|---|---|
| newsletter | 2 | 0.43 | 0.68 | +0.25 | Hardest category. CTAs and social widgets survive readability; LLM clean is essential here. Paywalls remain unsolved. |
| news | 1 | 1.00 | 1.00 | 0.00 | Clean extraction from well-structured HTML. LLM clean adds no value but doesn't hurt. |
| blog | 2 | 1.00 | 1.00 | 0.00 | Blog platforms (Medium, personal sites) extract cleanly; readability handles them well. |
| docs | 1 | 1.00 | 1.00 | 0.00 | Documentation sites have clean HTML structure. LLM clean removes permalink artifacts (cosmetic improvement, not scored). |
Key Insight: LLM Clean is Targeted, Not Universal
LLM cleaning delivers massive value on newsletter content (the hardest category) while having zero negative impact on already-clean content. It's not a blunt instrument — it's a precision tool for the specific failure mode that readability can't handle: distinguishing article text from platform UI text that readability considers "content."

Cost-Benefit Analysis

| Metric | Baseline | With LLM Clean | Delta |
|---|---|---|---|
| Average Score | 0.81 | 0.89 | +0.08 |
| Perfect Scores (1.00) | 4 / 6 | 5 / 6 | +1 |
| Failing Sites (<0.5) | 1 / 6 | 1 / 6 | 0 (paywall) |
| Cost per Run | $0.000 | $0.002 | +$0.002 |
| Avg Latency | 218ms | 5.4s | +5.2s |
The Tradeoff
LLM cleaning costs $0.0003 per page and adds ~5 seconds of latency per page. For batch processing (the primary use case), this is negligible. For interactive use, consider cleaning only newsletter/social content, where it matters most.
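That tradeoff suggests a simple policy: always clean in batch mode, and clean selectively when interactive. A sketch — the category names follow the gym's corpus, but the function itself is hypothetical:

```python
# Categories where LLM clean measurably moved the score in the gym runs.
HARD_CATEGORIES = {"newsletter"}

def should_llm_clean(category: str, batch_mode: bool) -> bool:
    # Batch: ~$0.0003/page and ~5s latency are negligible, so always clean.
    # Interactive: spend the 5 seconds only where it pays.
    return batch_mode or category in HARD_CATEGORIES

print(should_llm_clean("newsletter", batch_mode=False))  # -> True
print(should_llm_clean("docs", batch_mode=False))        # -> False
```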


The System Architecture

Gym = Steering Module on a Job
The extraction gym doesn't run its own pipeline — it scores the output of the extraction pipeline (lib/fetch/ + lib/parse/). The gym is a quality feedback loop:
  1. Pipeline extracts content from URLs (production or corpus)
  2. Gym scores extraction quality against known expectations
  3. Low scores surface as review items
  4. Human corrections grow the corpus
  5. Gym re-evaluates — the loop closes

The Gradio UI (app.py) enables live testing, result browsing, and flagging bad extractions — which feeds directly back into corpus/flagged.yaml for future gym runs.
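A flagged-extraction corpus entry could look something like this — field names, phrases, and the URL are all illustrative, and the actual corpus/flagged.yaml schema may differ:

```yaml
# Illustrative entry; the real corpus/flagged.yaml schema may differ.
- name: substack_example
  category: newsletter
  url: https://example.substack.com/p/some-post
  anchors:                # key phrases that must survive (weight 0.50)
    - "the opening claim of the article"
  intrusions:             # UI chrome that must be stripped (weight 0.30)
    - "Thanks for reading! Subscribe for free"
  must_not_contain:       # patterns that must be absent (weight 0.20)
    - "Share this post"
```

Each human correction adds one such entry, so the deterministic scorer gets stricter with every loop.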