2026-02-22 — projects/vic
vic.db is 7.1 GB, mostly from 3.5 GB of raw HTML stored inline in two SQLite tables.
Additionally, 10,042 ideas exist as duplicates — the same idea stored under both a numeric ID
(1234567890) and a VC-prefixed ID (VC-1234567890). Neither current discovery strategy
creates VC- prefixes; they're a legacy artifact from an earlier import.
Plan: move raw_html from SQLite to .html.xz files on disk (lzma preset 6), and merge each duplicate pair under its numeric ID. References in jobs.db use numeric IDs, so keeping numeric avoids updating the tracker DB entirely.
| Metric | Numeric IDs | VC- IDs | Winner |
|---|---|---|---|
| Has posted_at date | 8,659 / 10,042 | 9,821 / 10,042 | VC- |
| Has symbol | 8,629 | 9,821 | VC- |
| Description >100 chars | 8,624 | 6,517 | Numeric |
| Longer description | 8,618 | 195 | Numeric |
| Source conflicts (same source in both) | 113 | — | |
| … numeric quality ≥ VC- | 79 | — | |
| … VC- quality > numeric | 34 | — | |
| Numeric=paywall, VC-=viewable | 23 | — | |
| Field type | Rule |
|---|---|
| description, catalysts | Keep the longer version. If numeric is paywalled and VC- is viewable, take VC-. |
| Structured fields (symbol, date, etc.) | Fill NULLs from the VC- row into the numeric row. |
| viewable | If numeric=0 and VC-=1, set to 1 (23 cases rescued from paywall). |
| idea_sources conflicts | Compare quality score, keep the higher (not INSERT OR IGNORE). |
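The merge rules above can be sketched as a pure function. This is a hedged illustration, not the actual script: the dict keys (`description`, `catalysts`, `symbol`, `posted_at`, `viewable`) follow the fields named in the table, but the real row shape may differ.

```python
def merge_pair(numeric: dict, vc: dict) -> dict:
    """Merge a VC- duplicate into its numeric row, per the rules above.

    Hypothetical row mappings; 'viewable' is 0/1, missing fields are None.
    """
    merged = dict(numeric)

    # description, catalysts: keep the longer version, unless the numeric
    # copy is paywalled and the VC- copy is viewable -- then prefer VC-.
    prefer_vc_text = numeric.get("viewable") == 0 and vc.get("viewable") == 1
    for field in ("description", "catalysts"):
        n, v = numeric.get(field) or "", vc.get(field) or ""
        merged[field] = v if prefer_vc_text or len(v) > len(n) else n

    # Structured fields: fill NULLs in the numeric row from the VC- row.
    for field in ("symbol", "posted_at"):
        if merged.get(field) is None:
            merged[field] = vc.get(field)

    # viewable: numeric=0 and VC-=1 means the idea is rescued from paywall.
    if prefer_vc_text:
        merged["viewable"] = 1
    return merged
```

Running all 10,042 pairs through a function like this, then deleting the VC- rows, gives the dedup counts below.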
| Action | Count | Status |
|---|---|---|
| Duplicate pairs merged | 10,042 | done |
| VC-only orphans renamed (prefix stripped) | 1,038 | done |
| Remaining VC- IDs | 0 | clean |
| Remaining duplicate pairs | 0 | clean |
| ideas rows after dedup | 25,441 (was 35,483) | |
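The orphan rename (stripping the `VC-` prefix where no numeric counterpart exists) can be sketched as a single UPDATE. A minimal sketch, assuming an `ideas(idea_id)` schema; the guard against existing numeric counterparts is defensive, since merged pairs should already be gone by this phase:

```python
import sqlite3

def rename_vc_orphans(conn: sqlite3.Connection) -> int:
    """Strip 'VC-' from ideas with no numeric counterpart; return count.

    Rows that still have a numeric twin are left alone (they should have
    been merged in the previous phase).
    """
    cur = conn.execute(
        """
        UPDATE ideas
        SET idea_id = substr(idea_id, 4)
        WHERE idea_id LIKE 'VC-%'
          AND substr(idea_id, 4) NOT IN (SELECT idea_id FROM ideas)
        """
    )
    return cur.rowcount
```

`substr(idea_id, 4)` works because SQLite's `substr` is 1-indexed, so position 4 is the first character after `VC-`.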
Raw HTML moved from SQLite raw_html TEXT columns to .html.xz files on disk.
Compressed with lzma preset=6 — good ratio (~10:1), fast enough for single-file reads.
`projects/vic/html_store.py`:

```python
from projects.vic.html_store import save_html, load_html, html_exists

save_html("0001234567", html, "direct")       # -> html/0001234567_direct.html.xz
html = load_html("0001234567", "direct")      # -> decompress and return
exists = html_exists("0001234567", "direct")  # -> True/False
```
| Source | Rows with HTML | Extracted to files | Remaining in DB |
|---|---|---|---|
| idea_sources | 17,767 | 17,767 done | 0 |
| ideas (merged) | 16,876 | 16,876 done | 0 |
Bug: paginating with OFFSET over `WHERE raw_html IS NOT NULL`, while NULLing raw_html on each commit, skips every other batch. Each batch NULLs 500 rows, so the next query's OFFSET=500 starts 500 rows past the remaining data. Fix: remove the OFFSET and just `LIMIT 500` from the start each time, since processed rows no longer match the WHERE clause. Script corrected for future reruns.
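The corrected drain loop can be sketched like this (assuming an `id` primary key on idea_sources; the extraction-to-file step is elided):

```python
import sqlite3

def extract_batches(conn: sqlite3.Connection, batch_size: int = 500) -> int:
    """Drain all rows matching the predicate in fixed-size batches.

    No OFFSET: each committed batch NULLs its rows out of the WHERE
    clause, so a plain `LIMIT batch_size` always returns the next
    unprocessed batch. (The buggy version did OFFSET += 500, skipping
    every other batch.)
    """
    done = 0
    while True:
        rows = conn.execute(
            "SELECT id, raw_html FROM idea_sources "
            "WHERE raw_html IS NOT NULL LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return done
        for row_id, html in rows:
            pass  # real script: compress html to its .html.xz file here
        conn.executemany(
            "UPDATE idea_sources SET raw_html = NULL WHERE id = ?",
            [(row_id,) for row_id, _ in rows],
        )
        conn.commit()
        done += len(rows)
```

The invariant that makes this terminate is that every committed batch shrinks the result set of the SELECT by exactly `batch_size` (or whatever remains).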
6 files updated to read/write HTML via html_store instead of SQLite raw_html.
`projects/vic/db.py`
- `upsert_source()`: When raw_html is passed, calls `save_html(idea_id, html, source)`. No longer stores HTML in the idea_sources table.
- `merge_idea()`: Removed the "best raw_html" selection loop and dropped raw_html from the INSERT/UPDATE SQL entirely.

`projects/vic/enrich.py`
- `check_enrich()`: Replaced the 3-level SQL fallback (idea_sources by source → idea_sources by quality → ideas table) with file reads: `load_html(id, source)` → `load_html(id, other_source)` → `load_html(id)` (merged).

`jobs/handlers/vic_ideas.py`
- `_stage_fetch()`: Idempotency check uses `html_exists()` instead of a DB query. Writes via `save_html()` plus a lightweight DB row (URL + timestamp, no HTML).
- `_stage_extract()`: Reads HTML from `load_html(id, "direct")` with fallback to the merged file. Removed DB-based duplicate-HTML detection (was querying `LENGTH(raw_html)` across all rows).

`jobs/handlers/vic_wayback.py`
- `_stage_fetch()`: Skip check uses `html_exists()` for both numeric and VC- IDs (legacy compat).
- `_stage_extract()`: Reads from file with a fallback chain. Fixed the VC- propagation bug (see below).

Bug: `vic_wayback._stage_extract()` had a read-fallback that checked for VC-prefixed files,
then reassigned item_key to the VC- version. All subsequent writes (upsert_source, merge_idea)
then wrote under the VC- ID — perpetuating the very duplicates we're trying to eliminate.
Lesson (data-quality/canonicalize-on-write-back):
"Always persist data in its canonical format, even when the input was accepted via a legacy fallback path."
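A sketch of what the fixed fallback might look like. The helper name and the injected `exists`/`load` callables are hypothetical (stand-ins for `html_exists`/`load_html`); the point is that the legacy key is used only for the read, never returned as the write key:

```python
from typing import Callable, Optional, Tuple

def load_with_legacy_fallback(
    item_key: str,
    exists: Callable[[str], bool],
    load: Callable[[str], str],
) -> Tuple[str, Optional[str]]:
    """Read HTML for an ID, falling back to its legacy VC- file.

    Returns (canonical_key, html_or_None). The VC- candidate is tried
    only as a read path; the caller always writes back under the
    canonical numeric key. (The bug: item_key itself was reassigned to
    the VC- form, so later upserts wrote under the legacy ID.)
    """
    canonical = item_key.removeprefix("VC-")
    for candidate in (canonical, f"VC-{canonical}"):
        if exists(candidate):
            return canonical, load(candidate)
    return canonical, None
```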
Lesson: `OFFSET += 500` on a query with `WHERE raw_html IS NOT NULL`. After each batch NULLs 500 rows, the next OFFSET skips 500 still-unprocessed rows, so only ~50% of rows were extracted per run. A plain `LIMIT 500` always gets the next unprocessed batch.
```shell
# Dry run (counts only, no writes)
python projects/vic/migrate_html_to_files.py --dry-run

# Full run
python projects/vic/migrate_html_to_files.py

# Re-run after OFFSET bug fix (skips dedup since already done, extracts remainder)
python projects/vic/migrate_html_to_files.py --skip-dedup

# Phases can be skipped independently
python projects/vic/migrate_html_to_files.py --skip-dedup --skip-vacuum
```
| Check | Command / Query | Expected | Status |
|---|---|---|---|
| No VC- IDs remain | `SELECT COUNT(*) FROM ideas WHERE idea_id LIKE 'VC-%'` | 0 | pass |
| No duplicate pairs | `SELECT COUNT(*) FROM ideas a JOIN ideas b ON b.idea_id = 'VC-' \|\| a.idea_id WHERE a.idea_id NOT LIKE 'VC-%'` | 0 | pass |
| HTML files on disk | `ls jobs/data/vic_ideas/html/*.xz \| wc -l` | ~34K | 34,643 |
| No raw_html in DB | `SELECT COUNT(*) FROM idea_sources WHERE raw_html IS NOT NULL` | 0 | pass |
| DB size after VACUUM | `stat -f%z jobs/data/vic_ideas/vic.db` | < 1 GB | 461 MB |
| Spot-check: load random HTML | `python -c "from projects.vic.html_store import load_html; print(len(load_html('0014652725', 'direct')))"` | > 1000 chars | 3,089,346 chars |
| Handler test: vic_ideas | `inv jobs.test -j vic_ideas -l 1` | passes | pending |
| Handler test: vic_wayback | `inv jobs.test -j vic_wayback -l 1` | passes | pending |
- `html_store.py`: `save_html` / `load_html` / `html_exists`, lzma preset 6.
- `migrate_html_to_files.py`: 3-phase: dedup → extract → VACUUM. Batch size 500, per-chunk commits.
- `db.py`, `enrich.py`, `vic_ideas.py`, `vic_wayback.py`: read/write HTML via html_store instead of SQLite raw_html.
- VC- propagation bug: read fallback reassigned item_key to VC- format, then wrote back under it. Fixed: read any format, always write canonical.
- Dedup: 10,042 pairs merged + 1,038 orphans renamed in ~16s.
- OFFSET bug: OFFSET + mutating WHERE skipped every other batch. Fixed: removed OFFSET.
- Extraction: 34,643 HTML files extracted (3.0 GB compressed). DB: 7,185 MB → 461 MB.