VIC Alpha — W2 Data Investigation

2026-03-02  |  vic.db reset discovered, scaling path assessed

TL;DR: vic.db was reset — only 4 empty rows. The jobs pipeline has 17,698 processed wayback items and 334 direct scrape items, but the extracted data (posted_at, description, catalysts) was lost with the DB. The analysis dataset (875 ideas with returns in returns_cache.db) is unaffected — W1 results stand. To scale, we must rebuild vic.db first.

What Happened

The VIC prediction pipeline has two data paths:

vic.db (4 rows, reset) → batch_returns() → returns_cache.db (875 ok) → predict_robust.py

batch_returns() reads ideas from vic.db (via open_raw_conn()), fetches price data from Finnhub, and caches results in returns_cache.db. The 875 ideas with returns were computed in a previous session when vic.db had data. Those cached results are intact.
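This separation is why W1 survives: returns_cache.db acts as a simple write-through cache keyed by idea, independent of the source DB. A minimal sketch of the pattern (schema, table, and function names here are illustrative, not the actual batch_returns() code):

```python
import sqlite3

def get_return(cache: sqlite3.Connection, idea_id: str, compute) -> float:
    """Return the cached value if present, else compute and cache it."""
    row = cache.execute(
        "SELECT ret FROM returns WHERE idea_id = ?", (idea_id,)
    ).fetchone()
    if row is not None:
        return row[0]          # cache hit: survives a vic.db reset
    ret = compute(idea_id)     # e.g. a Finnhub price lookup
    cache.execute(
        "INSERT INTO returns (idea_id, ret) VALUES (?, ?)", (idea_id, ret)
    )
    cache.commit()
    return ret

cache = sqlite3.connect(":memory:")
cache.execute("CREATE TABLE returns (idea_id TEXT PRIMARY KEY, ret REAL)")
first = get_return(cache, "idea-1", lambda _: 0.17)    # computes and caches
second = get_return(cache, "idea-1", lambda _: 999.0)  # served from cache
```

Because the compute step is only reached on a cache miss, resetting the upstream DB leaves previously cached returns untouched, which is exactly what happened here.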

The jobs pipeline (wayback + direct scraping) writes to vic.db via upsert_source() → merge_idea(). All 18K+ items completed successfully — but the DB was reset after processing, losing the merged idea data.

The raw HTML files are also gone. The handler stores raw HTML via save_html() to jobs/data/vic_ideas/html/*.html.xz. That directory doesn't exist locally. It may be on the SanDisk backup at /Volumes/Sandisk-4TB/rivus-offload/jobs-data/vic_ideas/ (the SanDisk directory exists but was too slow to inspect).
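If the backup holds the html/ directory, re-extraction needs no network at all: the *.html.xz files round-trip through stdlib lzma. A sketch of what save_html() plausibly does (the real signature and layout may differ):

```python
import lzma
import tempfile
from pathlib import Path

def save_html(path: Path, html: str) -> None:
    """Compress and store a page body, mirroring the *.html.xz layout."""
    path.write_bytes(lzma.compress(html.encode("utf-8")))

def load_html(path: Path) -> str:
    """Decompress a stored page body back to text for re-parsing."""
    return lzma.decompress(path.read_bytes()).decode("utf-8")

with tempfile.TemporaryDirectory() as d:
    p = Path(d) / "idea-12345.html.xz"
    save_html(p, "<html>thesis body</html>")
    restored = load_html(p)
```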

Data Corrections

Investigation revealed two errors in documentation:

1. "25,736 ideas" — unverified, likely wrong

Multiple docs cited 25,736 as the VIC idea count. This number was from a previous session when SanDisk was mounted, but was never independently verified. Actual verified numbers:

| Source | Count | Notes |
|---|---|---|
| projects/vic/README.md "known IDs" | 24,584 | Discovery-stage ID list |
| dschonholtz GitHub list | 13,925 | Public URL list |
| VIC target ("13K+ ideas") | ~13,000 | README goal statement |
| Jobs pipeline processed | 18,032 | 17,698 wayback + 334 direct |
| CDX unique ideas | 18,083 | From wayback_cdx.json (65,933 entries) |

The 25,736 was likely the vic.db row count before reset (including stubs without full data). Corrected all docs to use verified numbers.

2. "Tech-sector sample only" — wrong, it's all sectors

LOGBOOK.md and other docs claimed the 875-idea analysis dataset was "tech-sector only." Investigation shows it's all sectors:

| Sector | Count | % |
|---|---|---|
| consumer | 181 | 20.7% |
| technology | 141 | 16.1% |
| (unknown) | 113 | 12.9% |
| financials | 107 | 12.2% |
| industrials | 103 | 11.8% |
| healthcare | 79 | 9.0% |
| materials | 40 | 4.6% |
| real-estate | 39 | 4.5% |
| energy | 37 | 4.2% |
| telecom | 28 | 3.2% |
| utilities | 7 | 0.8% |
| **Total** | **875** | **100%** |

This is actually good news — the W1 signal (30d Spearman 0.170, permutation p=0.000) is already cross-sector, not just tech. But it also means "scaling to all sectors" won't add as much new signal variety as expected — we already have sector diversity, just small sample size.
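The permutation test behind that p-value can be re-run unchanged on an expanded all-sector sample. A self-contained sketch in plain NumPy (not the actual predict_robust.py implementation):

```python
import numpy as np

def spearman(x, y):
    """Spearman correlation via rank transform (assumes no ties)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

def permutation_p(scores, returns, n_perm=1000, seed=0):
    """One-sided p: fraction of label shuffles with correlation >= observed."""
    rng = np.random.default_rng(seed)
    obs = spearman(scores, returns)
    hits = sum(
        spearman(scores, rng.permutation(returns)) >= obs
        for _ in range(n_perm)
    )
    return obs, hits / n_perm

# Synthetic data with a planted signal, for illustration only
rng = np.random.default_rng(1)
scores = rng.normal(size=300)
returns = 0.2 * scores + rng.normal(size=300)
obs, p = permutation_p(scores, returns)
```

A reported p of 0.000 just means no shuffle matched the observed correlation, i.e. p < 1/n_perm.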

Current Analysis Dataset

- **875** ideas with returns
- **771** with embeddings
- **10** sectors covered
- **2022–2025** date range
| Database | Rows | Size | Status |
|---|---|---|---|
| returns_cache.db | 1,000 (875 ok) | 860 KB | OK |
| thesis_text.db | 875 | 11 MB | OK |
| embeddings.db | 771 | 6.1 MB | OK |
| fundamentals.db | 812 | 492 KB | OK |
| vic.db (local) | 4 | 36 KB | RESET |

Returns coverage by horizon

| Horizon | Non-null | % | Usable with embargo? |
|---|---|---|---|
| 1d | 857 | 98% | YES |
| 7d | 855 | 98% | YES |
| 30d | 853 | 97% | YES — confirmed signal |
| 90d | 841 | 96% | Marginal |
| 180d | 821 | 94% | Untested |
| 365d | 753 | 86% | Can't test — too few years |
| 730d | 338 | 39% | Insufficient data |
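The embargo question per horizon reduces to a date check: an idea's horizon-h return is usable only if its full forward window closed before the evaluation cutoff. A sketch (hypothetical helper; the real embargo logic in predict_robust.py may be stricter):

```python
from datetime import date, timedelta

def usable_with_embargo(posted_at: date, horizon_days: int, cutoff: date) -> bool:
    """An idea counts only if its full forward window closed before the cutoff."""
    return posted_at + timedelta(days=horizon_days) <= cutoff

cutoff = date(2025, 12, 31)
ok_30d = usable_with_embargo(date(2025, 6, 1), 30, cutoff)     # window closed
bad_365 = usable_with_embargo(date(2025, 6, 1), 365, cutoff)   # window still open
```

With only 2022–2025 data, most 365d windows for late ideas are still open, which is why that horizon can't be tested yet.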

Ideas by year

| Year | Count | Notes |
|---|---|---|
| 2022 | 284 | Earliest in returns cache |
| 2023 | 349 | |
| 2024 | 232 | |
| 2025 | 10 | Only recent posts |

What's in the Jobs Pipeline

The jobs system processed 18K+ VIC ideas through the full fetch → extract → check_enrich pipeline. The results table stores per-stage outputs:

| Job | Stage | Done | Pending | What's stored |
|---|---|---|---|---|
| vic_wayback | fetch | 17,700 | 4 | html_size, wayback_ts, static_url |
| | extract | 17,700 | | symbol, trade_dir, company, quality |
| | check_enrich | 17,698 | 4 | thesis_type, sector, quality_score, etc. |
| vic_ideas | fetch | 339 | | (same structure) |
| | extract | 334 | 7 | |
| | check_enrich | 334 | 10 | |
Critical gap: The results table stores only summaries — not the full parsed fields. The extract result has symbol, trade_dir, company, quality — but NOT posted_at, description, or catalysts. Those were written directly to vic.db via upsert_source() and are now lost.

Without posted_at, we can't compute returns (need to know when the idea was posted to measure forward price moves from that date).

Rebuild Options

A. Restore from SanDisk (fastest, if available)

Time: ~5 min copy  |  Completeness: Full (if backup exists)

SanDisk has /Volumes/Sandisk-4TB/rivus-offload/jobs-data/vic_ideas/ with 9 items inside. Likely contains vic.db + html/ directory. SanDisk was too slow to list contents during this investigation (directory listing timed out at 60s+).

Command: python -m finance.vic_analysis.scripts.rebuild_vicdb --sandisk

Blocker: SanDisk access speed. Try when drive is warm / recently accessed.

B. Re-run wayback fetch+extract (slow but complete)

Time: ~14 hours (17K items × ~3s/item)  |  Completeness: Full

Re-download each HTML page from the Wayback Machine using stored URLs and timestamps, re-parse to extract all fields, write to vic.db via upsert_source().

Command: python -m finance.vic_analysis.scripts.rebuild_vicdb --refetch

Pro: Guaranteed to work; no dependency on the SanDisk backup. Con: 14 hours, and it hits the Wayback Machine heavily.
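Re-fetching works because the Wayback Machine replays any capture at a stable URL built from its 14-digit timestamp (the id_ modifier returns the raw page body); only the example idea URL below is made up:

```python
WAYBACK = "https://web.archive.org/web/{ts}id_/{url}"  # id_ serves the raw capture

def snapshot_url(ts: str, original_url: str) -> str:
    """Build the stable replay URL for a capture from its stored wayback_ts."""
    return WAYBACK.format(ts=ts, url=original_url)

# Hypothetical idea URL; a fetch loop would GET each URL with ~3 s between
# requests, which is where the ~14 h estimate (17K items x 3 s) comes from.
url = snapshot_url("20230115120000",
                   "https://valueinvestorsclub.com/idea/EXAMPLE/1234567")
```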

C. Run direct scraper for pending items (additive)

Time: ~24 days at 400/day  |  Completeness: Adds 9,610 new ideas

The vic_ideas job has 9,610 pending items. These are direct scrapes via residential proxy + guest account cookies. Rate-limited to ~400 views/day per account.

Command: inv jobs.runner (with vic_ideas enabled)

Pro: Gets new ideas not in wayback. Con: Very slow, requires proxy/cookies setup.

D. Partial rebuild from results table (insufficient)

Time: ~30 seconds  |  Completeness: Missing posted_at, description, catalysts

Can populate vic.db with symbol, trade_dir, company, sector, thesis_type from the jobs results table. But cannot compute returns without posted_at dates.

Command: python -m finance.vic_analysis.scripts.rebuild_vicdb --partial

Useful for: Getting the DB structure back, exploring what we have. Not for scaling prediction.
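Option D amounts to projecting the jobs summaries into fresh vic.db rows. The sketch below (illustrative schemas, not the real tables) also shows why it's insufficient: posted_at has nothing to project from and stays NULL.

```python
import sqlite3

# Stand-in for the jobs results table: summary fields only
jobs = sqlite3.connect(":memory:")
jobs.execute("CREATE TABLE results "
             "(idea_id TEXT, symbol TEXT, trade_dir TEXT, company TEXT)")
jobs.execute("INSERT INTO results VALUES ('1', 'ACME', 'long', 'Acme Corp')")

# Stand-in for a rebuilt vic.db; posted_at exists but can never be filled
vic = sqlite3.connect(":memory:")
vic.execute("CREATE TABLE ideas (idea_id TEXT PRIMARY KEY, symbol TEXT, "
            "trade_dir TEXT, company TEXT, posted_at TEXT)")

for row in jobs.execute(
        "SELECT idea_id, symbol, trade_dir, company FROM results"):
    vic.execute("INSERT OR REPLACE INTO ideas "
                "(idea_id, symbol, trade_dir, company) VALUES (?, ?, ?, ?)", row)

restored = vic.execute("SELECT symbol, posted_at FROM ideas").fetchone()
```

The NULL posted_at is the blocker: with no posting date, no forward-return window can be computed for these rows.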

Recommended Path

1. Try SanDisk restore first (fastest if it works)
   └── SUCCESS → vic.db + html/ restored → proceed to step 3
   └── FAIL (SanDisk too slow, or vic.db not there) → step 2
2. Re-run wayback fetch+extract (14 hours, run overnight)
   └── Produces complete vic.db with 17K+ ideas
   └── Also restores html/ compressed files for future re-extraction
3. Compute returns for new ideas
   └── python -m finance.vic_analysis.returns --batch
   └── ~18K ideas × 2 Finnhub calls = ~36K calls, ~60 min at 600/min
   └── Fills returns_cache.db
4. Generate embeddings for new ideas
   └── text-embedding-3-small, 1536-dim
   └── ~18K ideas × ~$0.02/M tokens ≈ $4 total
5. Re-run W1 evaluation on expanded dataset
   └── With 18K ideas (2000–2025), 365d embargo becomes viable
   └── Can finally test whether the 365d signal is real
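The step-3 runtime is plain rate arithmetic (the 600 calls/min figure is the plan's assumed Finnhub budget, not a documented limit):

```python
def batch_minutes(n_ideas: int, calls_per_idea: int, calls_per_min: int) -> float:
    """Wall-clock minutes for a rate-limited batch of API calls."""
    return n_ideas * calls_per_idea / calls_per_min

minutes = batch_minutes(18_000, 2, 600)   # ~36K calls at 600/min -> 60.0 min
```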

Impact on W1 Results

W1 results are unaffected. The analysis dataset (returns_cache.db, thesis_text.db, embeddings.db) is intact. The 30d signal (Spearman 0.170, permutation p=0.000) stands. The vic.db reset only blocks W2 (scaling to full corpus).

The sector correction (all sectors, not tech-only) is actually positive for the results — the signal is already cross-sector, making it more generalizable than previously assumed.

Files Changed

| File | Change |
|---|---|
| finance/vic_analysis/LOGBOOK.md | Fixed "tech-sector sample only" → all sectors with breakdown |
| finance/vic_analysis/CLAUDE.md | Fixed 25,736 count, documented vic.db reset, added pipeline status |
| finance/CLAUDE.md | Fixed "25K ideas" reference |
| docs/plans/2026-03-01-vic-alpha-v2.md | Fixed counts, added rebuild step to W2, documented blocker |
| finance/vic_analysis/scripts/rebuild_vicdb.py | NEW — rebuild script with 4 strategies |

Rebuild Script Reference

```shell
# Check current state
python -m finance.vic_analysis.scripts.rebuild_vicdb --check

# Option A: Restore from SanDisk (try first)
python -m finance.vic_analysis.scripts.rebuild_vicdb --sandisk

# Option B: Re-fetch from wayback (slow but reliable)
python -m finance.vic_analysis.scripts.rebuild_vicdb --refetch

# Option D: Partial rebuild (quick, but can't compute returns)
python -m finance.vic_analysis.scripts.rebuild_vicdb --partial

# After rebuild: compute returns
python -m finance.vic_analysis.returns --batch

# After returns: re-run prediction
python -m finance.vic_analysis.predict_robust --horizon 30d --permutation-test
```