VIC DB Cleanup: Dedup + HTML Extraction

2026-02-22 — projects/vic

Problem

vic.db is 7.1 GB, mostly from 3.5 GB of raw HTML stored inline in two SQLite tables. Additionally, 10,042 ideas exist as duplicates — the same idea stored under both a numeric ID (1234567890) and a VC-prefixed ID (VC-1234567890). Neither current discovery strategy creates VC- prefixes; they're a legacy artifact from an earlier import.

- DB size before: 7.1 GB
- Duplicate ID pairs: 10,042
- ideas rows before: 35,483
- HTML blobs in DB: ~34K

Solution

  1. Extract HTML to compressed files — move raw_html from SQLite to .html.xz files on disk (lzma preset 6)
  2. Dedup IDs — merge VC-prefixed rows into their numeric counterparts, keeping the richer data
  3. Update code paths — 6 files modified to read/write HTML via the new file store
  4. VACUUM — reclaim space from NULL'd columns

File layout after migration

jobs/data/vic_ideas/
  html/
    {idea_id}_direct.html.xz    # HTML from direct VIC scrape
    {idea_id}_wayback.html.xz   # HTML from Wayback Machine
    {idea_id}.html.xz           # merged "best" HTML from ideas table

Phase 1: Dedup

Decision: Keep numeric IDs as canonical (not VC- as originally planned). All 26,664 work_items in jobs.db use numeric IDs. Keeping numeric avoids updating the tracker DB entirely.

Data analysis that drove the merge strategy

| Metric | Numeric IDs | VC- IDs | Winner |
|---|---|---|---|
| Has posted_at date | 8,659 / 10,042 | 9,821 / 10,042 | VC- |
| Has symbol | 8,629 | 9,821 | VC- |
| Description >100 chars | 8,624 | 6,517 | Numeric |
| Longer description | 8,618 | 195 | Numeric |
| Source conflicts (same source in both) | 113 | | |
| … numeric quality ≥ VC- | 79 | | |
| … VC- quality > numeric | 34 | | |
| Numeric=paywall, VC-=viewable | 23 | | |
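The duplicate pairs are detectable with a self-join on the VC- prefix. A minimal runnable sketch against an in-memory SQLite database (the sample rows are invented; the join is the same one used in the verification checklist):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ideas (idea_id TEXT PRIMARY KEY, description TEXT)")
# Invented sample data: one duplicate pair, one numeric-only row, one VC- orphan.
con.executemany(
    "INSERT INTO ideas VALUES (?, ?)",
    [
        ("1234567890", "long description kept on the numeric row"),
        ("VC-1234567890", "short desc"),
        ("1111111111", "numeric-only idea"),
        ("VC-2222222222", "orphan with no numeric twin"),
    ],
)

# A numeric row whose 'VC-' || idea_id twin also exists is a duplicate pair.
pairs = con.execute(
    """
    SELECT a.idea_id, b.idea_id
    FROM ideas a JOIN ideas b ON b.idea_id = 'VC-' || a.idea_id
    WHERE a.idea_id NOT LIKE 'VC-%'
    """
).fetchall()
print(pairs)  # -> [('1234567890', 'VC-1234567890')]
```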

Merge strategy (field-level)

| Field type | Rule |
|---|---|
| description, catalysts | Keep longer version. If numeric is paywalled and VC- is viewable, take VC-. |
| Structured fields (symbol, date, etc.) | Fill NULLs from VC- into numeric row |
| viewable | If numeric=0 and VC-=1, set to 1 (23 cases rescued from paywall) |
| idea_sources conflicts | Compare quality score, keep higher (not INSERT OR IGNORE) |
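The field-level rules can be sketched as a pure merge function. This is illustrative, not the actual migration code; merge_pair and the dict representation are assumptions, while the field names and rules come from the table above:

```python
def merge_pair(numeric: dict, vc: dict) -> dict:
    """Merge a VC- row into its numeric twin, field by field (illustrative)."""
    merged = dict(numeric)

    # Long-text fields: keep the longer version, unless the numeric copy is
    # paywalled and the VC- copy is viewable, in which case prefer the VC- text.
    for field in ("description", "catalysts"):
        n, v = numeric.get(field) or "", vc.get(field) or ""
        if (not numeric.get("viewable") and vc.get("viewable")) or len(v) > len(n):
            merged[field] = v

    # Structured fields: fill NULLs on the numeric row from the VC- row.
    for field in ("symbol", "posted_at"):
        if merged.get(field) is None:
            merged[field] = vc.get(field)

    # viewable: rescue from paywall if either copy was viewable.
    merged["viewable"] = int(bool(numeric.get("viewable")) or bool(vc.get("viewable")))
    return merged
```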

Results

| Action | Count | Status |
|---|---|---|
| Duplicate pairs merged | 10,042 | done |
| VC-only orphans renamed (prefix stripped) | 1,038 | done |
| Remaining VC- IDs | 0 | clean |
| Remaining duplicate pairs | 0 | clean |
| ideas rows after dedup | 25,441 (was 35,483) | |

Phase 2: HTML Extraction

Raw HTML moved from SQLite raw_html TEXT columns to .html.xz files on disk. Compressed with lzma preset=6 — good ratio (~10:1), fast enough for single-file reads.

New module: projects/vic/html_store.py

from projects.vic.html_store import save_html, load_html, html_exists

save_html("0001234567", html, "direct")    # -> html/0001234567_direct.html.xz
html = load_html("0001234567", "direct")   # -> decompress and return
exists = html_exists("0001234567", "direct") # -> True/False

Results

- HTML files on disk: 34,643
- Compressed size on disk: 3.0 GB
- DB size after: 461 MB
- HTML blobs remaining: 0
| Source | Rows with HTML | Extracted to files | Remaining in DB |
|---|---|---|---|
| idea_sources | 17,767 | 17,767 (done) | 0 |
| ideas (merged) | 16,876 | 16,876 (done) | 0 |

Bug found & fixed during migration: OFFSET-based pagination combined with WHERE raw_html IS NOT NULL and NULL-on-commit skips every other batch. Each batch NULLs 500 rows, so OFFSET=500 starts 500 rows past the start of the remaining data. Fix: remove OFFSET and just take LIMIT 500 from the start each time, since processed rows no longer match the WHERE clause. The script has been corrected for future reruns.

Phase 3: Code Path Updates

6 files updated to read/write HTML via html_store instead of SQLite raw_html.

projects/vic/html_store.py             NEW — compressed HTML file I/O
projects/vic/migrate_html_to_files.py  NEW — one-time migration script
projects/vic/db.py                     MOD — upsert_source saves to file; merge_idea drops raw_html
projects/vic/enrich.py                 MOD — check_enrich reads from files (3-level fallback)
jobs/handlers/vic_ideas.py             MOD — fetch/extract/idempotency via html_store
jobs/handlers/vic_wayback.py           MOD — fetch/extract via html_store


Bugs Found

VC- ID propagation in wayback extract
vic_wayback._stage_extract() had a read-fallback that checked for VC-prefixed files, then reassigned item_key to the VC- version. All subsequent writes (upsert_source, merge_idea) then wrote under the VC- ID — perpetuating the very duplicates we're trying to eliminate.

Fix: Read from any format (for backward compat) but never reassign the working ID. Always write under the numeric canonical format.

Principle extracted: data-quality/canonicalize-on-write-back — "Always persist data in its canonical format, even when the input was accepted via a legacy fallback path."
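The principle can be sketched in a few lines; canonical_id and read_candidates are illustrative helpers, not the actual handler code:

```python
def canonical_id(item_key: str) -> str:
    """Strip the legacy VC- prefix; numeric IDs are canonical on write."""
    return item_key[3:] if item_key.startswith("VC-") else item_key


def read_candidates(item_key: str) -> list[str]:
    """IDs to try on read: canonical first, then the legacy VC- form.

    The working key itself is never reassigned to the legacy form, so
    every subsequent write lands under the numeric canonical ID.
    """
    key = canonical_id(item_key)
    return [key, f"VC-{key}"]
```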
OFFSET + mutating WHERE clause
Pagination used OFFSET += 500 on a query with WHERE raw_html IS NOT NULL. After each batch NULLs 500 rows, the next OFFSET skips 500 still-unprocessed rows. Result: only ~50% of rows extracted per run.

Fix: Drop OFFSET entirely. Since processed rows are NULL'd, LIMIT 500 always gets the next unprocessed batch.
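The corrected loop, sketched against sqlite3 (extract_all is illustrative; the stand-in loop body replaces the real write to the file store):

```python
import sqlite3


def extract_all(con: sqlite3.Connection, batch_size: int = 500) -> int:
    """Drain rows WHERE raw_html IS NOT NULL in fixed-size batches.

    No OFFSET: each batch NULLs the rows it processed, so the next
    LIMIT-only query naturally returns the next unprocessed batch.
    """
    total = 0
    while True:
        rows = con.execute(
            "SELECT idea_id, raw_html FROM idea_sources "
            "WHERE raw_html IS NOT NULL LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            break
        for idea_id, raw_html in rows:
            pass  # stand-in: write raw_html out via the file store
        con.executemany(
            "UPDATE idea_sources SET raw_html = NULL WHERE idea_id = ?",
            [(r[0],) for r in rows],
        )
        con.commit()
        total += len(rows)
    return total
```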

Migration Script

# Dry run (counts only, no writes)
python projects/vic/migrate_html_to_files.py --dry-run

# Full run
python projects/vic/migrate_html_to_files.py

# Re-run after OFFSET bug fix (skips dedup since already done, extracts remainder)
python projects/vic/migrate_html_to_files.py --skip-dedup

# Phases can be skipped independently
python projects/vic/migrate_html_to_files.py --skip-dedup --skip-vacuum

Verification Checklist

| Check | Command / Query | Expected | Status |
|---|---|---|---|
| No VC- IDs remain | SELECT COUNT(*) FROM ideas WHERE idea_id LIKE 'VC-%' | 0 | pass |
| No duplicate pairs | SELECT COUNT(*) FROM ideas a JOIN ideas b ON b.idea_id = 'VC-' \|\| a.idea_id WHERE a.idea_id NOT LIKE 'VC-%' | 0 | pass |
| HTML files on disk | ls jobs/data/vic_ideas/html/*.xz \| wc -l | ~34K | 34,643 |
| No raw_html in DB | SELECT COUNT(*) FROM idea_sources WHERE raw_html IS NOT NULL | 0 | pass |
| DB size after VACUUM | stat -f%z jobs/data/vic_ideas/vic.db | < 1 GB | 461 MB |
| Spot-check: load random HTML | python -c "from projects.vic.html_store import load_html; print(len(load_html('0014652725', 'direct')))" | > 1000 chars | 3,089,346 chars |
| Handler test: vic_ideas | inv jobs.test -j vic_ideas -l 1 | passes | pending |
| Handler test: vic_wayback | inv jobs.test -j vic_wayback -l 1 | passes | pending |

Timeline

- Created html_store.py: save_html / load_html / html_exists (lzma preset 6)
- Created migrate_html_to_files.py: 3-phase dedup → extract → VACUUM; batch size 500, per-chunk commits
- Updated 6 code paths: db.py, enrich.py, vic_ideas.py, vic_wayback.py
- Found VC- propagation bug in wayback extract: read fallback reassigned item_key to VC- format, then wrote back under it. Fixed: read any format, always write canonical.
- Ran migration Phase 1 (dedup): 10,042 pairs merged + 1,038 orphans renamed in ~16s
- Found OFFSET pagination bug during Phase 2: OFFSET + mutating WHERE skipped every other batch. Fixed: removed OFFSET.
- Completed Phase 2 + VACUUM: 34,643 HTML files extracted (3.0 GB compressed). DB: 7,185 MB → 461 MB.