VIC Alpha — W2 Data Investigation

2026-03-02  |  vic.db reset discovered, scaling path assessed

TL;DR: vic.db was reset — only 4 empty rows. The jobs pipeline has 17,698 processed wayback items and 334 direct scrape items, but the extracted data (posted_at, description, catalysts) was lost with the DB. The analysis dataset (875 ideas with returns in returns_cache.db) is unaffected — W1 results stand. To scale, we must rebuild vic.db first.

What Happened

The VIC prediction pipeline has two data paths:

vic.db (4 rows, reset) → batch_returns() → returns_cache.db (875 ok) → predict_robust.py

batch_returns() reads ideas from vic.db (via open_raw_conn()), fetches price data from Finnhub, and caches results in returns_cache.db. The 875 ideas with returns were computed in a previous session when vic.db had data. Those cached results are intact.
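This separation is why W1 survives: returns_cache.db acts as a simple write-through cache keyed by idea, independent of the source DB. A minimal sketch of the pattern (schema, table, and function names here are illustrative, not the actual batch_returns() code):

```python
import sqlite3

def get_return(cache: sqlite3.Connection, idea_id: str, compute) -> float:
    """Return the cached value if present, else compute and cache it."""
    row = cache.execute(
        "SELECT ret FROM returns WHERE idea_id = ?", (idea_id,)
    ).fetchone()
    if row is not None:
        return row[0]          # cache hit: survives a vic.db reset
    ret = compute(idea_id)     # e.g. a Finnhub price lookup
    cache.execute(
        "INSERT INTO returns (idea_id, ret) VALUES (?, ?)", (idea_id, ret)
    )
    cache.commit()
    return ret

cache = sqlite3.connect(":memory:")
cache.execute("CREATE TABLE returns (idea_id TEXT PRIMARY KEY, ret REAL)")
first = get_return(cache, "idea-1", lambda _: 0.17)    # computes and caches
second = get_return(cache, "idea-1", lambda _: 999.0)  # served from cache
```

Because the compute step is only reached on a cache miss, resetting the upstream DB leaves previously cached returns untouched, which is exactly what happened here.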

The jobs pipeline (wayback + direct scraping) writes to vic.db via upsert_source() → merge_idea(). All 18K+ items completed successfully — but the DB was reset after processing, losing the merged idea data.

The raw HTML files are also gone. The handler stores raw HTML via save_html() to jobs/data/vic_ideas/html/*.html.xz. That directory doesn't exist locally. It may be on the SanDisk backup at /Volumes/Sandisk-4TB/rivus-offload/jobs-data/vic_ideas/ (the SanDisk directory exists but was too slow to inspect).
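If the backup holds the html/ directory, re-extraction needs no network at all: the *.html.xz files round-trip through stdlib lzma. A sketch of what save_html() plausibly does (the real signature and layout may differ):

```python
import lzma
import tempfile
from pathlib import Path

def save_html(path: Path, html: str) -> None:
    """Compress and store a page body, mirroring the *.html.xz layout."""
    path.write_bytes(lzma.compress(html.encode("utf-8")))

def load_html(path: Path) -> str:
    """Decompress a stored page body back to text for re-parsing."""
    return lzma.decompress(path.read_bytes()).decode("utf-8")

with tempfile.TemporaryDirectory() as d:
    p = Path(d) / "idea-12345.html.xz"
    save_html(p, "<html>thesis body</html>")
    restored = load_html(p)
```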

Data Corrections

Investigation revealed two errors in documentation:

1. "25,736 ideas" — unverified, likely wrong

Multiple docs cited 25,736 as the VIC idea count. This number was from a previous session when SanDisk was mounted, but was never independently verified. Actual verified numbers:

| Source | Count | Notes |
|---|---|---|
| projects/vic/README.md "known IDs" | 24,584 | Discovery-stage ID list |
| dschonholtz GitHub list | 13,925 | Public URL list |
| VIC target ("13K+ ideas") | ~13,000 | README goal statement |
| Jobs pipeline processed | 18,032 | 17,698 wayback + 334 direct |
| CDX unique ideas | 18,083 | From wayback_cdx.json (65,933 entries) |

The 25,736 was likely the vic.db row count before reset (including stubs without full data). Corrected all docs to use verified numbers.

2. "Tech-sector sample only" — wrong, it's all sectors

LOGBOOK.md and other docs claimed the 875-idea analysis dataset was "tech-sector only." Investigation shows it's all sectors:

| Sector | Count | % |
|---|---|---|
| consumer | 181 | 20.7% |
| technology | 141 | 16.1% |
| (unknown) | 113 | 12.9% |
| financials | 107 | 12.2% |
| industrials | 103 | 11.8% |
| healthcare | 79 | 9.0% |
| materials | 40 | 4.6% |
| real-estate | 39 | 4.5% |
| energy | 37 | 4.2% |
| telecom | 28 | 3.2% |
| utilities | 7 | 0.8% |
| **Total** | **875** | **100%** |

This is actually good news — the W1 signal (30d Spearman 0.170, permutation p=0.000) is already cross-sector, not just tech. But it also means "scaling to all sectors" won't add as much new signal variety as expected — we already have sector diversity, just small sample size.
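The permutation test behind that p-value can be re-run unchanged on an expanded all-sector sample. A self-contained sketch in plain NumPy (not the actual predict_robust.py implementation):

```python
import numpy as np

def spearman(x, y):
    """Spearman correlation via rank transform (assumes no ties)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

def permutation_p(scores, returns, n_perm=1000, seed=0):
    """One-sided p: fraction of label shuffles with correlation >= observed."""
    rng = np.random.default_rng(seed)
    obs = spearman(scores, returns)
    hits = sum(
        spearman(scores, rng.permutation(returns)) >= obs
        for _ in range(n_perm)
    )
    return obs, hits / n_perm

# Synthetic data with a planted signal, for illustration only
rng = np.random.default_rng(1)
scores = rng.normal(size=300)
returns = 0.2 * scores + rng.normal(size=300)
obs, p = permutation_p(scores, returns)
```

A reported p of 0.000 just means no shuffle matched the observed correlation, i.e. p < 1/n_perm.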

Current Analysis Dataset

- **875** ideas with returns
- **771** with embeddings
- **10** sectors covered
- **2022–2025** date range
| Database | Rows | Size | Status |
|---|---|---|---|
| returns_cache.db | 1,000 (875 ok) | 860 KB | OK |
| thesis_text.db | 875 | 11 MB | OK |
| embeddings.db | 771 | 6.1 MB | OK |
| fundamentals.db | 812 | 492 KB | OK |
| vic.db (local) | 4 | 36 KB | RESET |

Returns coverage by horizon

| Horizon | Non-null | % | Usable with embargo? |
|---|---|---|---|
| 1d | 857 | 98% | YES |
| 7d | 855 | 98% | YES |
| 30d | 853 | 97% | YES — confirmed signal |
| 90d | 841 | 96% | Marginal |
| 180d | 821 | 94% | Untested |
| 365d | 753 | 86% | Can't test — too few years |
| 730d | 338 | 39% | Insufficient data |
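The embargo question per horizon reduces to a date check: an idea's horizon-h return is usable only if its full forward window closed before the evaluation cutoff. A sketch (hypothetical helper; the real embargo logic in predict_robust.py may be stricter):

```python
from datetime import date, timedelta

def usable_with_embargo(posted_at: date, horizon_days: int, cutoff: date) -> bool:
    """An idea counts only if its full forward window closed before the cutoff."""
    return posted_at + timedelta(days=horizon_days) <= cutoff

cutoff = date(2025, 12, 31)
ok_30d = usable_with_embargo(date(2025, 6, 1), 30, cutoff)     # window closed
bad_365 = usable_with_embargo(date(2025, 6, 1), 365, cutoff)   # window still open
```

With only 2022–2025 data, most 365d windows for late ideas are still open, which is why that horizon can't be tested yet.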

Ideas by year

| Year | Count | Notes |
|---|---|---|
| 2022 | 284 | Earliest in returns cache |
| 2023 | 349 | |
| 2024 | 232 | |
| 2025 | 10 | Only recent posts |

What's in the Jobs Pipeline

The jobs system processed 18K+ VIC ideas through the full fetch → extract → check_enrich pipeline. The results table stores per-stage outputs:

| Job | Stage | Done | Pending | What's stored |
|---|---|---|---|---|
| vic_wayback | fetch | 17,700 | 4 | html_size, wayback_ts, static_url |
| | extract | 17,700 | | symbol, trade_dir, company, quality |
| | check_enrich | 17,698 | 4 | thesis_type, sector, quality_score, etc. |
| vic_ideas | fetch | 339 | | (same structure) |
| | extract | 334 | 7 | |
| | check_enrich | 334 | 10 | |
Critical gap: The results table stores only summaries — not the full parsed fields. The extract result has symbol, trade_dir, company, quality — but NOT posted_at, description, or catalysts. Those were written directly to vic.db via upsert_source() and are now lost.

Without posted_at, we can't compute returns (need to know when the idea was posted to measure forward price moves from that date).

Rebuild Options

A. Restore from SanDisk (fastest, if available)

Time: ~5 min copy  |  Completeness: Full (if backup exists)

SanDisk has /Volumes/Sandisk-4TB/rivus-offload/jobs-data/vic_ideas/ with 9 items inside. Likely contains vic.db + html/ directory. SanDisk was too slow to list contents during this investigation (directory listing timed out at 60s+).

Command: python -m finance.vic_analysis.scripts.rebuild_vicdb --sandisk

Blocker: SanDisk access speed. Try when drive is warm / recently accessed.

B. Re-run wayback fetch+extract (slow but complete)

Time: ~14 hours (17K items × ~3s/item)  |  Completeness: Full

Re-download each HTML page from the Wayback Machine using stored URLs and timestamps, re-parse to extract all fields, write to vic.db via upsert_source().

Command: python -m finance.vic_analysis.scripts.rebuild_vicdb --refetch

Pro: Guaranteed to work; no dependency on the SanDisk backup. Con: 14 hours, and it hits the Wayback Machine heavily.
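Re-fetching works because the Wayback Machine replays any capture at a stable URL built from its 14-digit timestamp (the id_ modifier returns the raw page body); only the example idea URL below is made up:

```python
WAYBACK = "https://web.archive.org/web/{ts}id_/{url}"  # id_ serves the raw capture

def snapshot_url(ts: str, original_url: str) -> str:
    """Build the stable replay URL for a capture from its stored wayback_ts."""
    return WAYBACK.format(ts=ts, url=original_url)

# Hypothetical idea URL; a fetch loop would GET each URL with ~3 s between
# requests, which is where the ~14 h estimate (17K items x 3 s) comes from.
url = snapshot_url("20230115120000",
                   "https://valueinvestorsclub.com/idea/EXAMPLE/1234567")
```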

C. Run direct scraper for pending items (additive)

Time: ~24 days at 400/day  |  Completeness: Adds 9,610 new ideas

The vic_ideas job has 9,610 pending items. These are direct scrapes via residential proxy + guest account cookies. Rate-limited to ~400 views/day per account.

Command: inv jobs.runner (with vic_ideas enabled)

Pro: Gets new ideas not in wayback. Con: Very slow, requires proxy/cookies setup.

D. Partial rebuild from results table (insufficient)

Time: ~30 seconds  |  Completeness: Missing posted_at, description, catalysts

Can populate vic.db with symbol, trade_dir, company, sector, thesis_type from the jobs results table. But cannot compute returns without posted_at dates.

Command: python -m finance.vic_analysis.scripts.rebuild_vicdb --partial

Useful for: Getting the DB structure back, exploring what we have. Not for scaling prediction.
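Option D amounts to projecting the jobs summaries into fresh vic.db rows. The sketch below (illustrative schemas, not the real tables) also shows why it's insufficient: posted_at has nothing to project from and stays NULL.

```python
import sqlite3

# Stand-in for the jobs results table: summary fields only
jobs = sqlite3.connect(":memory:")
jobs.execute("CREATE TABLE results "
             "(idea_id TEXT, symbol TEXT, trade_dir TEXT, company TEXT)")
jobs.execute("INSERT INTO results VALUES ('1', 'ACME', 'long', 'Acme Corp')")

# Stand-in for a rebuilt vic.db; posted_at exists but can never be filled
vic = sqlite3.connect(":memory:")
vic.execute("CREATE TABLE ideas (idea_id TEXT PRIMARY KEY, symbol TEXT, "
            "trade_dir TEXT, company TEXT, posted_at TEXT)")

for row in jobs.execute(
        "SELECT idea_id, symbol, trade_dir, company FROM results"):
    vic.execute("INSERT OR REPLACE INTO ideas "
                "(idea_id, symbol, trade_dir, company) VALUES (?, ?, ?, ?)", row)

restored = vic.execute("SELECT symbol, posted_at FROM ideas").fetchone()
```

The NULL posted_at is the blocker: with no posting date, no forward-return window can be computed for these rows.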

Recommended Path

1. Try SanDisk restore first (fastest if it works)
   └── SUCCESS → vic.db + html/ restored → proceed to step 3
   └── FAIL (SanDisk too slow, or vic.db not there) → step 2
2. Re-run wayback fetch+extract (14 hours, run overnight)
   └── Produces complete vic.db with 17K+ ideas
   └── Also restores html/ compressed files for future re-extraction
3. Compute returns for new ideas
   └── python -m finance.vic_analysis.returns --batch
   └── ~18K ideas × 2 Finnhub calls = ~36K calls, ~60 min at 600/min
   └── Fills returns_cache.db
4. Generate embeddings for new ideas
   └── text-embedding-3-small, 1536-dim
   └── ~18K ideas × ~$0.02/M tokens ≈ $4 total
5. Re-run W1 evaluation on expanded dataset
   └── With 18K ideas (2000–2025), 365d embargo becomes viable
   └── Can finally test whether the 365d signal is real
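The step-3 runtime is plain rate arithmetic (the 600 calls/min figure is the plan's assumed Finnhub budget, not a documented limit):

```python
def batch_minutes(n_ideas: int, calls_per_idea: int, calls_per_min: int) -> float:
    """Wall-clock minutes for a rate-limited batch of API calls."""
    return n_ideas * calls_per_idea / calls_per_min

minutes = batch_minutes(18_000, 2, 600)   # ~36K calls at 600/min -> 60.0 min
```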

Impact on W1 Results

W1 results are unaffected. The analysis dataset (returns_cache.db, thesis_text.db, embeddings.db) is intact. The 30d signal (Spearman 0.170, permutation p=0.000) stands. The vic.db reset only blocks W2 (scaling to full corpus).

The sector correction (all sectors, not tech-only) is actually positive for the results — the signal is already cross-sector, making it more generalizable than previously assumed.

Files Changed

| File | Change |
|---|---|
| finance/vic_analysis/LOGBOOK.md | Fixed "tech-sector sample only" → all sectors with breakdown |
| finance/vic_analysis/CLAUDE.md | Fixed 25,736 count, documented vic.db reset, added pipeline status |
| finance/CLAUDE.md | Fixed "25K ideas" reference |
| docs/plans/2026-03-01-vic-alpha-v2.md | Fixed counts, added rebuild step to W2, documented blocker |
| finance/vic_analysis/scripts/rebuild_vicdb.py | NEW — rebuild script with 4 strategies |

Rebuild Script Reference

```shell
# Check current state
python -m finance.vic_analysis.scripts.rebuild_vicdb --check

# Option A: Restore from SanDisk (try first)
python -m finance.vic_analysis.scripts.rebuild_vicdb --sandisk

# Option B: Re-fetch from wayback (slow but reliable)
python -m finance.vic_analysis.scripts.rebuild_vicdb --refetch

# Option D: Partial rebuild (quick, but can't compute returns)
python -m finance.vic_analysis.scripts.rebuild_vicdb --partial

# After rebuild: compute returns
python -m finance.vic_analysis.returns --batch

# After returns: re-run prediction
python -m finance.vic_analysis.predict_robust --horizon 30d --permutation-test
```