VIC DB Cleanup: Dedup + HTML Extraction

2026-02-22 — projects/vic

Problem

vic.db is 7.1 GB, mostly from 3.5 GB of raw HTML stored inline in two SQLite tables. Additionally, 10,042 ideas exist as duplicates — the same idea stored under both a numeric ID (1234567890) and a VC-prefixed ID (VC-1234567890). Neither current discovery strategy creates VC- prefixes; they're a legacy artifact from an earlier import.

- DB size before: 7.1 GB
- Duplicate ID pairs: 10,042
- ideas rows before: 35,483
- HTML blobs in DB: ~34K

Solution

  1. Extract HTML to compressed files — move raw_html from SQLite to .html.xz files on disk (lzma preset 6)
  2. Dedup IDs — merge VC-prefixed rows into their numeric counterparts, keeping the richer data
  3. Update code paths — 6 files modified to read/write HTML via the new file store
  4. VACUUM — reclaim space from NULL'd columns

File layout after migration

jobs/data/vic_ideas/
  html/
    {idea_id}_direct.html.xz    # HTML from direct VIC scrape
    {idea_id}_wayback.html.xz   # HTML from Wayback Machine
    {idea_id}.html.xz           # merged "best" HTML from ideas table

Phase 1: Dedup

Decision: Keep numeric IDs as canonical (not VC- as originally planned). All 26,664 work_items in jobs.db use numeric IDs. Keeping numeric avoids updating the tracker DB entirely.

Data analysis that drove the merge strategy

| Metric | Numeric IDs | VC- IDs | Winner |
|---|---|---|---|
| Has posted_at date | 8,659 / 10,042 | 9,821 / 10,042 | VC- |
| Has symbol | 8,629 | 9,821 | VC- |
| Description >100 chars | 8,624 | 6,517 | Numeric |
| Longer description | 8,618 | 195 | Numeric |
| Source conflicts (same source in both) | 113 | | |
| … numeric quality ≥ VC- | 79 | | |
| … VC- quality > numeric | 34 | | |
| Numeric=paywall, VC-=viewable | 23 | | |
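The duplicate pairs are detectable with a self-join on the VC- prefix. A minimal runnable sketch against an in-memory SQLite database (the sample rows are invented; the join is the same one used in the verification checklist):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ideas (idea_id TEXT PRIMARY KEY, description TEXT)")
# Invented sample data: one duplicate pair, one numeric-only row, one VC- orphan.
con.executemany(
    "INSERT INTO ideas VALUES (?, ?)",
    [
        ("1234567890", "long description kept on the numeric row"),
        ("VC-1234567890", "short desc"),
        ("1111111111", "numeric-only idea"),
        ("VC-2222222222", "orphan with no numeric twin"),
    ],
)

# A numeric row whose 'VC-' || idea_id twin also exists is a duplicate pair.
pairs = con.execute(
    """
    SELECT a.idea_id, b.idea_id
    FROM ideas a JOIN ideas b ON b.idea_id = 'VC-' || a.idea_id
    WHERE a.idea_id NOT LIKE 'VC-%'
    """
).fetchall()
print(pairs)  # -> [('1234567890', 'VC-1234567890')]
```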

Merge strategy (field-level)

| Field type | Rule |
|---|---|
| description, catalysts | Keep longer version. If numeric is paywalled and VC- is viewable, take VC-. |
| Structured fields (symbol, date, etc.) | Fill NULLs from VC- into numeric row |
| viewable | If numeric=0 and VC-=1, set to 1 (23 cases rescued from paywall) |
| idea_sources conflicts | Compare quality score, keep higher (not INSERT OR IGNORE) |
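The field-level rules can be sketched as a pure merge function. This is illustrative, not the actual migration code; merge_pair and the dict representation are assumptions, while the field names and rules come from the table above:

```python
def merge_pair(numeric: dict, vc: dict) -> dict:
    """Merge a VC- row into its numeric twin, field by field (illustrative)."""
    merged = dict(numeric)

    # Long-text fields: keep the longer version, unless the numeric copy is
    # paywalled and the VC- copy is viewable, in which case prefer the VC- text.
    for field in ("description", "catalysts"):
        n, v = numeric.get(field) or "", vc.get(field) or ""
        if (not numeric.get("viewable") and vc.get("viewable")) or len(v) > len(n):
            merged[field] = v

    # Structured fields: fill NULLs on the numeric row from the VC- row.
    for field in ("symbol", "posted_at"):
        if merged.get(field) is None:
            merged[field] = vc.get(field)

    # viewable: rescue from paywall if either copy was viewable.
    merged["viewable"] = int(bool(numeric.get("viewable")) or bool(vc.get("viewable")))
    return merged
```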

Results

| Action | Count | Status |
|---|---|---|
| Duplicate pairs merged | 10,042 | done |
| VC-only orphans renamed (prefix stripped) | 1,038 | done |
| Remaining VC- IDs | 0 | clean |
| Remaining duplicate pairs | 0 | clean |
| ideas rows after dedup | 25,441 (was 35,483) | |

Phase 2: HTML Extraction

Raw HTML moved from SQLite raw_html TEXT columns to .html.xz files on disk. Compressed with lzma preset=6 — good ratio (~10:1), fast enough for single-file reads.

New module: projects/vic/html_store.py

from projects.vic.html_store import save_html, load_html, html_exists

save_html("0001234567", html, "direct")    # -> html/0001234567_direct.html.xz
html = load_html("0001234567", "direct")   # -> decompress and return
exists = html_exists("0001234567", "direct") # -> True/False

Results

- HTML files on disk: 34,643
- Compressed size on disk: 3.0 GB
- DB size after: 461 MB
- HTML blobs remaining: 0
| Source | Rows with HTML | Extracted to files | Remaining in DB |
|---|---|---|---|
| idea_sources | 17,767 | 17,767 (done) | 0 |
| ideas (merged) | 16,876 | 16,876 (done) | 0 |

Bug found & fixed during migration: OFFSET-based pagination combined with WHERE raw_html IS NOT NULL and NULL-on-commit skips every other batch. Each batch NULLs 500 rows, so OFFSET=500 starts 500 rows past the start of the remaining data. Fix: remove OFFSET and just take LIMIT 500 from the start each time, since processed rows no longer match the WHERE clause. The script has been corrected for future reruns.

Phase 3: Code Path Updates

6 files updated to read/write HTML via html_store instead of SQLite raw_html.

projects/vic/html_store.py             NEW — compressed HTML file I/O
projects/vic/migrate_html_to_files.py  NEW — one-time migration script
projects/vic/db.py                     MOD — upsert_source saves to file; merge_idea drops raw_html
projects/vic/enrich.py                 MOD — check_enrich reads from files (3-level fallback)
jobs/handlers/vic_ideas.py             MOD — fetch/extract/idempotency via html_store
jobs/handlers/vic_wayback.py           MOD — fetch/extract via html_store


Bugs Found

VC- ID propagation in wayback extract
vic_wayback._stage_extract() had a read-fallback that checked for VC-prefixed files, then reassigned item_key to the VC- version. All subsequent writes (upsert_source, merge_idea) then wrote under the VC- ID — perpetuating the very duplicates we're trying to eliminate.

Fix: Read from any format (for backward compat) but never reassign the working ID. Always write under the numeric canonical format.

Principle extracted: data-quality/canonicalize-on-write-back — "Always persist data in its canonical format, even when the input was accepted via a legacy fallback path."
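The principle can be sketched in a few lines; canonical_id and read_candidates are illustrative helpers, not the actual handler code:

```python
def canonical_id(item_key: str) -> str:
    """Strip the legacy VC- prefix; numeric IDs are canonical on write."""
    return item_key[3:] if item_key.startswith("VC-") else item_key


def read_candidates(item_key: str) -> list[str]:
    """IDs to try on read: canonical first, then the legacy VC- form.

    The working key itself is never reassigned to the legacy form, so
    every subsequent write lands under the numeric canonical ID.
    """
    key = canonical_id(item_key)
    return [key, f"VC-{key}"]
```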
OFFSET + mutating WHERE clause
Pagination used OFFSET += 500 on a query with WHERE raw_html IS NOT NULL. After each batch NULLs 500 rows, the next OFFSET skips 500 still-unprocessed rows. Result: only ~50% of rows extracted per run.

Fix: Drop OFFSET entirely. Since processed rows are NULL'd, LIMIT 500 always gets the next unprocessed batch.
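The corrected loop, sketched against sqlite3 (extract_all is illustrative; the stand-in loop body replaces the real write to the file store):

```python
import sqlite3


def extract_all(con: sqlite3.Connection, batch_size: int = 500) -> int:
    """Drain rows WHERE raw_html IS NOT NULL in fixed-size batches.

    No OFFSET: each batch NULLs the rows it processed, so the next
    LIMIT-only query naturally returns the next unprocessed batch.
    """
    total = 0
    while True:
        rows = con.execute(
            "SELECT idea_id, raw_html FROM idea_sources "
            "WHERE raw_html IS NOT NULL LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            break
        for idea_id, raw_html in rows:
            pass  # stand-in: write raw_html out via the file store
        con.executemany(
            "UPDATE idea_sources SET raw_html = NULL WHERE idea_id = ?",
            [(r[0],) for r in rows],
        )
        con.commit()
        total += len(rows)
    return total
```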

Migration Script

# Dry run (counts only, no writes)
python projects/vic/migrate_html_to_files.py --dry-run

# Full run
python projects/vic/migrate_html_to_files.py

# Re-run after OFFSET bug fix (skips dedup since already done, extracts remainder)
python projects/vic/migrate_html_to_files.py --skip-dedup

# Phases can be skipped independently
python projects/vic/migrate_html_to_files.py --skip-dedup --skip-vacuum

Verification Checklist

| Check | Command / Query | Expected | Status |
|---|---|---|---|
| No VC- IDs remain | SELECT COUNT(*) FROM ideas WHERE idea_id LIKE 'VC-%' | 0 | pass |
| No duplicate pairs | SELECT COUNT(*) FROM ideas a JOIN ideas b ON b.idea_id = 'VC-' \|\| a.idea_id WHERE a.idea_id NOT LIKE 'VC-%' | 0 | pass |
| HTML files on disk | ls jobs/data/vic_ideas/html/*.xz \| wc -l | ~34K | 34,643 |
| No raw_html in DB | SELECT COUNT(*) FROM idea_sources WHERE raw_html IS NOT NULL | 0 | pass |
| DB size after VACUUM | stat -f%z jobs/data/vic_ideas/vic.db | < 1 GB | 461 MB |
| Spot-check: load random HTML | python -c "from projects.vic.html_store import load_html; print(len(load_html('0014652725', 'direct')))" | > 1000 chars | 3,089,346 chars |
| Handler test: vic_ideas | inv jobs.test -j vic_ideas -l 1 | passes | pending |
| Handler test: vic_wayback | inv jobs.test -j vic_wayback -l 1 | passes | pending |

Timeline

- Created html_store.py: save_html / load_html / html_exists (lzma preset 6)
- Created migrate_html_to_files.py: 3-phase dedup → extract → VACUUM; batch size 500, per-chunk commits
- Updated 6 code paths: db.py, enrich.py, vic_ideas.py, vic_wayback.py
- Found VC- propagation bug in wayback extract: read fallback reassigned item_key to VC- format, then wrote back under it. Fixed: read any format, always write canonical.
- Ran migration Phase 1 (dedup): 10,042 pairs merged + 1,038 orphans renamed in ~16s
- Found OFFSET pagination bug during Phase 2: OFFSET + mutating WHERE skipped every other batch. Fixed: removed OFFSET.
- Completed Phase 2 + VACUUM: 34,643 HTML files extracted (3.0 GB compressed). DB: 7,185 MB → 461 MB.