2026-03-02 | vic.db reset discovered, scaling path assessed
vic.db was reset — only 4 empty rows. The jobs pipeline has
17,698 processed wayback items and 334 direct scrape items, but the
extracted data (posted_at, description, catalysts) was lost with the DB. The analysis dataset
(875 ideas with returns in returns_cache.db) is unaffected — W1 results stand.
To scale, we must rebuild vic.db first.
The VIC prediction pipeline has two data paths:
batch_returns() reads ideas from vic.db (via open_raw_conn()),
fetches price data from Finnhub, and caches results in returns_cache.db. The 875 ideas
with returns were computed in a previous session when vic.db had data. Those cached results are intact.
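The cache-aside flow that batch_returns() presumably follows can be sketched as below. This is a minimal illustration, not the real module: the `returns` table schema, column names, and the `compute_fn` hook are hypothetical stand-ins for the Finnhub-backed computation.

```python
import sqlite3

def get_return_cached(conn, idea_id, horizon, compute_fn):
    """Cache-aside lookup: return a cached forward return if present,
    otherwise compute it and store it. Schema is hypothetical."""
    row = conn.execute(
        "SELECT ret FROM returns WHERE idea_id = ? AND horizon = ?",
        (idea_id, horizon),
    ).fetchone()
    if row is not None:
        return row[0]
    ret = compute_fn(idea_id, horizon)  # e.g. two price-API calls
    conn.execute(
        "INSERT INTO returns (idea_id, horizon, ret) VALUES (?, ?, ?)",
        (idea_id, horizon, ret),
    )
    conn.commit()
    return ret

# Demo against an in-memory DB with a stub compute function.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE returns (idea_id TEXT, horizon TEXT, ret REAL)")
calls = []
def fake_compute(idea_id, horizon):
    calls.append(idea_id)
    return 0.05

r1 = get_return_cached(conn, "idea1", "30d", fake_compute)
r2 = get_return_cached(conn, "idea1", "30d", fake_compute)  # served from cache
print(r1, r2, len(calls))  # the second call never hits compute_fn
```

This is why the 875 cached results survived the vic.db reset: once written, they are read back without touching the source DB.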
The jobs pipeline (wayback + direct scraping) writes to vic.db via
upsert_source() → merge_idea(). All 18K+ items completed successfully — but the
DB was reset after processing, losing the merged idea data.
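The merge-on-write pattern implied by upsert_source() → merge_idea() can be sketched with a SQLite UPSERT. The `ideas` schema and the `upsert_idea` helper here are hypothetical, only meant to show how a later source fills gaps without clobbering earlier non-null fields.

```python
import sqlite3

def upsert_idea(conn, idea_id, fields):
    """Merge newly-scraped fields into an ideas row, keeping an existing
    non-null value when the incoming value is NULL. `fields` keys must be
    trusted column names (they are interpolated into SQL). Hypothetical."""
    cols = ", ".join(fields)
    placeholders = ", ".join("?" for _ in fields)
    updates = ", ".join(
        f"{c} = COALESCE(excluded.{c}, ideas.{c})" for c in fields
    )
    conn.execute(
        f"INSERT INTO ideas (idea_id, {cols}) VALUES (?, {placeholders}) "
        f"ON CONFLICT(idea_id) DO UPDATE SET {updates}",
        (idea_id, *fields.values()),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ideas (idea_id TEXT PRIMARY KEY, symbol TEXT, posted_at TEXT)"
)
upsert_idea(conn, "i1", {"symbol": "AAPL", "posted_at": "2024-05-01"})
upsert_idea(conn, "i1", {"symbol": None, "posted_at": "2024-05-02"})  # keeps AAPL
row = conn.execute(
    "SELECT symbol, posted_at FROM ideas WHERE idea_id='i1'"
).fetchone()
print(row)  # ('AAPL', '2024-05-02')
```

The fragility is visible here too: everything lives in the target DB, so a reset erases the merged rows even though the jobs pipeline completed.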
Raw HTML was also written via save_html() to jobs/data/vic_ideas/html/*.html.xz. That directory doesn't exist locally; it may be on the
SanDisk backup at /Volumes/Sandisk-4TB/rivus-offload/jobs-data/vic_ideas/
(SanDisk directory exists but is too slow to inspect).
Investigation revealed two errors in documentation:
First, multiple docs cited 25,736 as the VIC idea count. That number came from a previous session when the SanDisk was mounted and was never independently verified. The verified numbers:
| Source | Count | Notes |
|---|---|---|
| projects/vic/README.md "known IDs" | 24,584 | Discovery-stage ID list |
| dschonholtz GitHub list | 13,925 | Public URL list |
| VIC target ("13K+ ideas") | ~13,000 | README goal statement |
| Jobs pipeline processed | 18,032 | 17,698 wayback + 334 direct |
| CDX unique ideas | 18,083 | From wayback_cdx.json (65,933 entries) |
The 25,736 was likely the vic.db row count before reset (including stubs without full data). Corrected all docs to use verified numbers.
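The CDX dedup that yields 18,083 unique ideas from 65,933 snapshot entries amounts to extracting an idea ID from each archived URL and collapsing to a set. The row format and URL shape below are illustrative assumptions, not the actual wayback_cdx.json layout.

```python
import re

# Hypothetical CDX rows: (timestamp, original_url) pairs. The real
# wayback_cdx.json holds 65,933 entries for 18,083 distinct ideas.
cdx_rows = [
    ("20220101000000", "https://valueinvestorsclub.com/idea/ACME-CORP/123456"),
    ("20230101000000", "https://valueinvestorsclub.com/idea/ACME-CORP/123456"),
    ("20230601000000", "https://valueinvestorsclub.com/idea/OTHER-CO/654321"),
]

def unique_idea_ids(rows):
    """Collapse CDX snapshots to distinct idea IDs taken from the URL tail."""
    ids = set()
    for _ts, url in rows:
        m = re.search(r"/idea/[^/]+/(\d+)", url)
        if m:
            ids.add(m.group(1))
    return ids

ids = unique_idea_ids(cdx_rows)
print(sorted(ids))  # two snapshots of the same idea collapse to one ID
```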
Second, LOGBOOK.md and other docs claimed the 875-idea analysis dataset was "tech-sector only." Investigation shows it spans all sectors:
| Sector | Count | % |
|---|---|---|
| consumer | 181 | 20.7% |
| technology | 141 | 16.1% |
| (unknown) | 113 | 12.9% |
| financials | 107 | 12.2% |
| industrials | 103 | 11.8% |
| healthcare | 79 | 9.0% |
| materials | 40 | 4.6% |
| real-estate | 39 | 4.5% |
| energy | 37 | 4.2% |
| telecom | 28 | 3.2% |
| utilities | 7 | 0.8% |
| Total | 875 | 100% |
This is actually good news — the W1 signal (30d Spearman 0.170, permutation p=0.000) is already cross-sector, not just tech. But it also means "scaling to all sectors" won't add as much new signal variety as expected — we already have sector diversity, just small sample size.
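For reference, the permutation p-value quoted with the 30d Spearman can be reproduced in principle like this. A numpy-only sketch on synthetic data: rank ties are ignored (fine for continuous returns), and the variable names are illustrative, not the predict_robust internals.

```python
import numpy as np

def spearman(x, y):
    # Spearman rho = Pearson correlation of ranks (no tie handling).
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return np.corrcoef(rx, ry)[0, 1]

def permutation_pvalue(scores, returns, n_perm=500, seed=0):
    # Shuffle returns, recompute rho each time, and count how often a
    # permuted rho reaches the observed one (one-sided, +1 smoothing so
    # the p-value is never exactly zero).
    rng = np.random.default_rng(seed)
    observed = spearman(scores, returns)
    count = sum(
        spearman(scores, rng.permutation(returns)) >= observed
        for _ in range(n_perm)
    )
    return observed, (count + 1) / (n_perm + 1)

# Synthetic data where scores genuinely predict returns.
rng = np.random.default_rng(42)
scores = rng.normal(size=1000)
returns = 0.3 * scores + rng.normal(size=1000)
rho, p = permutation_pvalue(scores, returns)
print(round(rho, 3), p)
```

A reported "p=0.000" under this scheme just means no permutation matched the observed rho, i.e. p is below 1/(n_perm+1), not literally zero.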
Database inventory after the reset:

| Database | Rows | Size | Status |
|---|---|---|---|
| returns_cache.db | 1,000 (875 ok) | 860 KB | OK |
| thesis_text.db | 875 | 11 MB | OK |
| embeddings.db | 771 | 6.1 MB | OK |
| fundamentals.db | 812 | 492 KB | OK |
| vic.db (local) | 4 | 36 KB | RESET |
Return-horizon coverage in returns_cache.db:

| Horizon | Non-null | % | Usable with embargo? |
|---|---|---|---|
| 1d | 857 | 98% | YES |
| 7d | 855 | 98% | YES |
| 30d | 853 | 97% | YES — confirmed signal |
| 90d | 841 | 96% | Marginal |
| 180d | 821 | 94% | Untested |
| 365d | 753 | 86% | Can't test — too few years |
| 730d | 338 | 39% | Insufficient data |
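The "usable with embargo" column comes down to a time-based split where a training idea's full forward-return window must resolve before the test period begins, so no label leaks across the boundary. A minimal sketch, with hypothetical data shapes:

```python
from datetime import date, timedelta

def embargoed_split(ideas, split_date, horizon_days):
    """Time-based split with an embargo equal to the return horizon.
    `ideas` is a list of (idea_id, posted_at) tuples. Ideas whose forward
    window straddles the boundary are dropped entirely."""
    embargo = timedelta(days=horizon_days)
    train = [i for i, d in ideas if d + embargo <= split_date]
    test = [i for i, d in ideas if d >= split_date]
    return train, test

ideas = [
    ("a", date(2023, 1, 10)),
    ("b", date(2023, 11, 20)),  # 30d window crosses the split -> dropped
    ("c", date(2024, 2, 1)),
]
train, test = embargoed_split(ideas, date(2023, 12, 1), horizon_days=30)
print(train, test)  # ['a'] ['c']
```

This is also why 365d can't be tested on a 2022-2025 sample: a one-year embargo consumes most of the available span, leaving almost nothing on either side of any split date.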
Idea distribution by posting year in the returns cache:

| Year | Count | Notes |
|---|---|---|
| 2022 | 284 | Earliest in returns cache |
| 2023 | 349 | |
| 2024 | 232 | |
| 2025 | 10 | Only recent posts |
The jobs system processed 18K+ VIC ideas through the full fetch → extract → check_enrich pipeline. The results table stores per-stage outputs:
| Job | Stage | Done | Pending | What's stored |
|---|---|---|---|---|
| vic_wayback | fetch | 17,700 | 4 | html_size, wayback_ts, static_url |
| vic_wayback | extract | 17,700 | — | symbol, trade_dir, company, quality |
| vic_wayback | check_enrich | 17,698 | 4 | thesis_type, sector, quality_score, etc. |
| vic_ideas | fetch | 339 | — | (same structure) |
| vic_ideas | extract | 334 | 7 | |
| vic_ideas | check_enrich | 334 | 10 | |
The results table preserves symbol, trade_dir, company, and quality — but NOT posted_at, description, or catalysts. Those were written directly to vic.db via upsert_source() and are now lost.
Without posted_at, we can't compute returns (need to know when the idea was posted
to measure forward price moves from that date).
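Concretely, posted_at is the anchor of every return computation: without it there is no date to measure from. A hedged sketch of the forward-return calculation, with a hypothetical `prices` mapping and trading-gap tolerance standing in for the real price source:

```python
from datetime import date, timedelta

def forward_return(prices, posted_at, horizon_days):
    """Forward return from the first close on/after posted_at to the first
    close on/after posted_at + horizon. `prices` maps date -> close.
    Returns None if either endpoint is missing."""
    def close_on_or_after(target, limit_days=7):
        # Walk forward past weekends/holidays, up to limit_days.
        for k in range(limit_days):
            d = target + timedelta(days=k)
            if d in prices:
                return prices[d]
        return None

    p0 = close_on_or_after(posted_at)
    p1 = close_on_or_after(posted_at + timedelta(days=horizon_days))
    if p0 is None or p1 is None:
        return None
    return p1 / p0 - 1.0

prices = {date(2024, 1, 2): 100.0, date(2024, 2, 1): 110.0}
r = forward_return(prices, date(2024, 1, 1), 30)
print(r)
```

With posted_at NULL there is simply no `posted_at` argument to pass, which is the whole blocker.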
Option A: restore from SanDisk backup. SanDisk has /Volumes/Sandisk-4TB/rivus-offload/jobs-data/vic_ideas/ with 9 items inside, likely vic.db plus the html/ directory. The drive was too slow to list contents during this investigation (directory listing timed out at 60s+).
Command: python -m finance.vic_analysis.scripts.rebuild_vicdb --sandisk
Blocker: SanDisk access speed. Try when the drive is warm / recently accessed.
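To probe the slow volume without hanging the whole session, the check can be wrapped in a timeout. A sketch with a hypothetical `listdir_with_timeout` helper (note the worker thread may linger until the blocked OS call finally returns; the caller just stops waiting):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def listdir_with_timeout(path, timeout_s=10.0):
    """List a directory but give up after timeout_s. Returns None on
    timeout or OS error instead of blocking indefinitely."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(os.listdir, path)
    try:
        return future.result(timeout=timeout_s)
    except (FutureTimeout, OSError):
        return None
    finally:
        pool.shutdown(wait=False)  # don't block on a stuck os.listdir

# Demo against a fast local directory; on the real volume, pass the
# SanDisk path and a generous timeout.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "vic.db"), "w").close()
    listing = listdir_with_timeout(d, timeout_s=5.0)
print(listing)  # ['vic.db']
```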
Option B: re-fetch from the Wayback Machine. Re-download each HTML page using the stored URLs and timestamps, re-parse to extract all fields, and write to vic.db via upsert_source().
Command: python -m finance.vic_analysis.scripts.rebuild_vicdb --refetch
Pro: guaranteed to work, with no dependency on the SanDisk backup. Con: ~14 hours, and it hits the Wayback Machine heavily.
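The re-fetch only needs the stored timestamp and original URL per item. A snapshot URL can be built like this; the `id_` flag is the Wayback Machine's request for the unmodified archived bytes (no archive.org toolbar injected). Which columns rebuild_vicdb actually reads is an assumption here.

```python
def wayback_snapshot_url(wayback_ts, original_url):
    """Build a Wayback Machine snapshot URL from a 14-digit timestamp
    (YYYYMMDDhhmmss) and the original URL."""
    return f"https://web.archive.org/web/{wayback_ts}id_/{original_url}"

url = wayback_snapshot_url(
    "20230115120000",
    "https://valueinvestorsclub.com/idea/ACME-CORP/123456",
)
print(url)
```

The actual loop would throttle requests and retry transient errors, which is where most of the ~14 hours goes.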
Option C: continue direct scraping. The vic_ideas job has 9,610 pending items. These are direct scrapes via residential proxy + guest account cookies, rate-limited to ~400 views/day per account.
Command: inv jobs.runner (with vic_ideas enabled)
Pro: gets new ideas not in wayback. Con: very slow, and requires proxy/cookies setup.
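The ~400 views/day cap amounts to a per-account rolling-window quota. A minimal in-memory sketch (the `DailyQuota` helper is hypothetical; the real runner presumably persists its counters):

```python
import time

class DailyQuota:
    """Allow at most `limit` actions per rolling window, tracked in memory."""
    def __init__(self, limit=400, window_s=86400.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.stamps = []

    def try_acquire(self):
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        self.stamps = [t for t in self.stamps if now - t < self.window_s]
        if len(self.stamps) >= self.limit:
            return False
        self.stamps.append(now)
        return True

# Demo with a fake clock so behavior is deterministic.
t = [0.0]
quota = DailyQuota(limit=2, window_s=10.0, clock=lambda: t[0])
a = quota.try_acquire()   # ok
b = quota.try_acquire()   # ok
c = quota.try_acquire()   # over quota
t[0] = 11.0               # window rolls over
d = quota.try_acquire()   # ok again
print(a, b, c, d)
```

At 400/day per account, 9,610 pending items is roughly 24 account-days of scraping, which is why this path is the slow one.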
Option D: partial rebuild from jobs results. Can populate vic.db with symbol, trade_dir, company, sector, and thesis_type from the jobs results table, but cannot compute returns without posted_at dates.
Command: python -m finance.vic_analysis.scripts.rebuild_vicdb --partial
Useful for getting the DB structure back and exploring what we have. Not for scaling prediction.
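The partial rebuild is essentially a column-subset copy from the jobs results into vic.db, leaving posted_at NULL. Both schemas below are hypothetical simplifications, just to show why the result can't feed returns computation:

```python
import sqlite3

# Hypothetical jobs `results` table with the fields that survived.
jobs = sqlite3.connect(":memory:")
jobs.execute(
    "CREATE TABLE results (idea_id TEXT, symbol TEXT, trade_dir TEXT, sector TEXT)"
)
jobs.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?)",
    [("i1", "AAPL", "long", "technology"), ("i2", "XOM", "short", "energy")],
)

# Hypothetical vic.db `ideas` table being partially rebuilt.
vic = sqlite3.connect(":memory:")
vic.execute(
    "CREATE TABLE ideas (idea_id TEXT PRIMARY KEY, symbol TEXT, "
    "trade_dir TEXT, sector TEXT, posted_at TEXT)"
)

# Copy what the jobs table preserved; posted_at stays NULL, which is
# exactly why this rebuild cannot feed returns computation.
for row in jobs.execute("SELECT idea_id, symbol, trade_dir, sector FROM results"):
    vic.execute(
        "INSERT OR REPLACE INTO ideas (idea_id, symbol, trade_dir, sector) "
        "VALUES (?, ?, ?, ?)",
        row,
    )
vic.commit()
n_total = vic.execute("SELECT COUNT(*) FROM ideas").fetchone()[0]
n_dated = vic.execute(
    "SELECT COUNT(*) FROM ideas WHERE posted_at IS NOT NULL"
).fetchone()[0]
print(n_total, n_dated)  # rows restored, but none carry a posting date
```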
3. Compute returns in batch: python -m finance.vic_analysis.returns --batch
└── ~18K ideas × 2 Finnhub calls = ~36K calls, ~20 min at 600/min
└── Fills returns_cache.db
4. Generate embeddings for new ideas
└── text-embedding-3-small, 1536-dim
└── ~18K ideas × ~$0.02/M tokens ≈ $4 total
5. Re-run W1 evaluation on expanded dataset
└── With 18K ideas (2000-2025), 365d embargo becomes viable
└── Can finally test whether the 365d signal is real

The analysis layer (returns_cache.db, thesis_text.db, embeddings.db) is intact. The 30d signal (Spearman 0.170, permutation p=0.000) stands. The vic.db reset only blocks W2 (scaling to the full corpus).
The sector correction (all sectors, not tech-only) is actually positive for the results — the signal is already cross-sector, making it more generalizable than previously assumed.
| File | Change |
|---|---|
| finance/vic_analysis/LOGBOOK.md | Fixed "tech-sector sample only" → all sectors with breakdown |
| finance/vic_analysis/CLAUDE.md | Fixed 25,736 count, documented vic.db reset, added pipeline status |
| finance/CLAUDE.md | Fixed "25K ideas" reference |
| docs/plans/2026-03-01-vic-alpha-v2.md | Fixed counts, added rebuild step to W2, documented blocker |
| finance/vic_analysis/scripts/rebuild_vicdb.py | NEW — rebuild script with 4 strategies |
```sh
# Check current state
python -m finance.vic_analysis.scripts.rebuild_vicdb --check

# Option A: Restore from SanDisk (try first)
python -m finance.vic_analysis.scripts.rebuild_vicdb --sandisk

# Option B: Re-fetch from wayback (slow but reliable)
python -m finance.vic_analysis.scripts.rebuild_vicdb --refetch

# Option D: Partial rebuild (quick, but can't compute returns)
python -m finance.vic_analysis.scripts.rebuild_vicdb --partial

# After rebuild: compute returns
python -m finance.vic_analysis.returns --batch

# After returns: re-run prediction
python -m finance.vic_analysis.predict_robust --horizon 30d --permutation-test
```