VIC Alpha Prediction — W1 Evaluation Results
2026-03-01 | Pipeline: finance/vic_analysis/predict_robust.py | Plan: docs/plans/2026-03-01-vic-alpha-v2.md
TL;DR: VIC thesis text predicts 30-day stock alpha (Spearman 0.17, p<0.01 permutation test, Q5 win rate 69%).
The 365-day signal we reported earlier was inflated by target leakage — with proper embargo it vanishes on current data.
Scaling from 875 tech ideas to the full 25K VIC corpus (2000–2025) is the critical next step.
Spearman r (30d, embargoed): 0.170
Permutation test (100 shuffles): p < 0.01
Spearman r (365d, embargoed): 0.067
What is the Embargo Gap?
The embargo gap prevents target leakage — when training data "sees" information from the test period through overlapping return windows.
Without embargo (LEAKY):
  Training idea posted 2023-06-01 ────── 365d return window ──────▶ 2024-06-01
  Test idea posted     2024-01-01 ────── 365d return window ──────▶ 2025-01-01
  ▲▲▲▲▲▲ OVERLAP ▲▲▲▲▲▲ Jan–Jun 2024: both windows share the same market moves
With 365-day embargo (CLEAN):
  Training idea posted 2023-01-01 ── 365d return ──▶ 2024-01-01
  ◄── embargo gap ──►
  Test idea posted     2024-01-01 ────── 365d return window ──────▶ 2025-01-01
  No overlap. Training returns fully resolved before test period begins.
Embargo rule: Training idea's posted_at + horizon_days < first test idea's posted_at
For a 30-day target, the embargo is just 30 days — barely affects training size.
For a 365-day target, the embargo pushes the training cutoff back a full year — devastating with only 3 years of data.
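The embargo rule can be illustrated with a small filter. This is a minimal sketch, not the code in predict_robust.py: the function name `embargoed_training_set` and the dict schema with a `posted_at` field are assumptions made for illustration.

```python
from datetime import date, timedelta

def embargoed_training_set(ideas, first_test_date, horizon_days):
    """Keep only ideas whose return window closes before the test period starts.

    `ideas` is a list of dicts with a `posted_at` date (hypothetical schema).
    """
    horizon = timedelta(days=horizon_days)
    return [
        idea for idea in ideas
        if idea["posted_at"] + horizon < first_test_date  # the embargo rule
    ]

# Toy ideas (made-up dates) evaluated against a test period starting 2024-01-01.
ideas = [
    {"id": "A", "posted_at": date(2022, 11, 15)},
    {"id": "B", "posted_at": date(2023, 6, 1)},
    {"id": "C", "posted_at": date(2023, 12, 15)},
]
first_test = date(2024, 1, 1)

kept_30d = [i["id"] for i in embargoed_training_set(ideas, first_test, 30)]
kept_365d = [i["id"] for i in embargoed_training_set(ideas, first_test, 365)]
print(kept_30d)   # ['A', 'B']
print(kept_365d)  # ['A']
```

With the 30-day horizon only the idea posted two weeks before the test period is dropped. With the 365-day horizon everything posted in 2023 is dropped, which is the effect described above.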
Why Only 3 Years of Data?
The full VIC database has 25,736 ideas from 2000–2025. We only computed returns for 1,000 tech-sector ideas from 2022–2025 as a proof-of-concept. This was a design choice for the initial experiment, not a fundamental limitation.
| Dataset | Ideas | Years | Sectors | Status |
| Current sample | 875 | 2022–2025 (3 yrs) | Tech only | Done |
| With embeddings | 587 | 2022–2025 | Tech only | Done |
| Full VIC DB | 25,736 | 2000–2025 (25 yrs) | All sectors | W2: next step |
| With descriptions | ~18,255 | 2000–2025 | All sectors | Need embed |
With the full corpus, a 365-day embargo would still leave 15+ years of training data per fold. The 365-day signal may well be real — we just can't test it honestly with 3 years.
Can We Use All Ideas That Have 30d Returns?
Yes — and we already are. The pipeline drops NaN targets per-horizon, so --horizon 30d uses all 853 ideas with 30d data (vs 753 for 365d). The gain isn't in idea count (853 vs 753 is small), it's in embargo impact:
| Horizon | Embargo | Ideas with target | Embargoed training (2024 fold) | Folds available |
| 30d | 30 days | 853 | 371 | 2 (2023, 2024) |
| 90d | 90 days | 841 | 320 | 2 |
| 365d | 365 days | 753 | 122 | 1 (only 2024) |
| 730d | 730 days | 338 | 0 | 0 (impossible) |
The real bottleneck isn't idea count — it's date range. With 2022–2025 data, a 365d embargo eats an entire year of training. With 2000–2025, it barely matters.
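How the embargo interacts with the date range can be seen by building the folds explicitly. The sketch below assumes calendar-year test periods and the half-open intervals described in the pipeline summary (train on [min_year, T-embargo), test on [T, T+1)). The function name and the toy data are hypothetical.

```python
from datetime import date, timedelta

def walk_forward_folds(ideas, test_years, horizon_days):
    """Yield (test_year, train_ids, test_ids) with an expanding training window."""
    embargo = timedelta(days=horizon_days)
    for year in test_years:
        test_start, test_end = date(year, 1, 1), date(year + 1, 1, 1)
        # Training ideas must be posted before T - embargo.
        train = [i["id"] for i in ideas if i["posted_at"] < test_start - embargo]
        test = [i["id"] for i in ideas if test_start <= i["posted_at"] < test_end]
        yield year, train, test

# Toy ideas (made-up dates) spanning 2022-2024.
ideas = [
    {"id": "a", "posted_at": date(2022, 3, 1)},
    {"id": "b", "posted_at": date(2022, 12, 20)},
    {"id": "c", "posted_at": date(2023, 5, 1)},
    {"id": "d", "posted_at": date(2023, 12, 10)},
    {"id": "e", "posted_at": date(2024, 6, 1)},
]

folds_30d = list(walk_forward_folds(ideas, [2023, 2024], 30))
folds_365d = list(walk_forward_folds(ideas, [2023, 2024], 365))
for label, folds in [("30d", folds_30d), ("365d", folds_365d)]:
    for year, train, test in folds:
        print(label, year, "train:", train, "test:", test)
```

In this toy set the 365d embargo leaves the 2023 fold with no training ideas at all, mirroring how the 365d horizon loses a fold in the table above.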
W1 Results: Signal by Horizon
| Horizon | Embargo | Folds | Spearman r | p-value | Q5-Q1 median | Q5 win% | Signal? |
| 30d | 30d | 2 | +0.170 | 0.000* | +4.0% | 69% | REAL |
| 90d | 90d | 2 | +0.065 | 0.069 | +13.5% | 67% | Marginal |
| 365d | 365d | 1 | +0.067 | 0.429 | +6.9% | 41% | Untestable |
| 365d | none | 2 | +0.199 | 0.002 | +39.1% | 55% | Leaked |
*Permutation test p-value (100 stratified shuffles); with 0 of 100 shuffles reaching the observed value, this is best read as p < 0.01. Other p-values are from the Spearman rank-correlation test.
30d Quintile Detail (2024 Fold)
| Quintile | n | Mean alpha | Median alpha | Win rate |
| Q1 (worst predicted) | 36 | -4.4% | -2.4% | 36% |
| Q2 | 36 | -5.4% | -8.6% | 33% |
| Q3 | 35 | +2.5% | +1.6% | 54% |
| Q4 | 36 | +1.5% | +0.6% | 53% |
| Q5 (best predicted) | 36 | +6.4% | +4.5% | 69% |
Broadly increasing quintile pattern, though not strictly monotonic (Q2 underperforms Q1, and Q4 trails Q3): the model's top picks (Q5) earn +4.5% median alpha in 30 days at a 69% win rate, while bottom picks (Q1) earn -2.4% at 36%.
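A quintile table of this kind can be produced by ranking ideas on predicted alpha and summarizing the realized alpha per bucket. The sketch below is illustrative only. It assumes equal-sized rank buckets and defines a "win" as realized alpha above zero, neither of which is confirmed by the pipeline source, and the numbers are made up.

```python
from statistics import mean, median

def quintile_report(predicted, actual, n_bins=5):
    """Sort by predicted alpha, split into n_bins rank buckets, summarize actual alpha."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    rows = []
    for q in range(n_bins):
        lo = q * len(order) // n_bins
        hi = (q + 1) * len(order) // n_bins
        alphas = [actual[i] for i in order[lo:hi]]
        rows.append({
            "quintile": f"Q{q + 1}",
            "n": len(alphas),
            "mean": mean(alphas),
            "median": median(alphas),
            "win_rate": sum(a > 0 for a in alphas) / len(alphas),  # assumed win definition
        })
    return rows

# Toy data: ten ideas, already roughly ordered by the model.
predicted = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
actual = [-0.05, -0.02, -0.03, 0.01, 0.00, 0.02, 0.01, 0.03, 0.04, 0.06]

rows = quintile_report(predicted, actual)
spread = rows[-1]["median"] - rows[0]["median"]  # Q5-Q1 median spread
for row in rows:
    print(row)
print(f"Q5-Q1 median spread: {spread:.3f}")
```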
Permutation Test
Stratified permutation test (30d alpha, 100 shuffles)
Shuffle y within (year × thesis_type) strata, run full pipeline
Observed mean Spearman: +0.170
Null distribution mean: +0.024 (std: 0.043)
95th percentile (null): +0.094
99th percentile (null): +0.107
z-score: 3.4σ above null
p-value: <0.01 (0/100 shuffles ≥ observed)
Verdict: Signal is genuine. Not an artifact of temporal structure or sector clustering.
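The mechanics of a stratified permutation test can be sketched in pure Python. Two simplifications are assumed here: the score is a Spearman correlation against fixed predictions, whereas the report re-runs the full pipeline on each shuffle, and the p-value uses the common (k + 1) / (n + 1) correction, which is why 100 shuffles cannot resolve anything below about 0.01. All function names and data are hypothetical.

```python
import random
from collections import defaultdict

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(x, y):
    """Spearman r computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def shuffle_within_strata(y, strata, rng):
    """Permute y only among observations sharing the same stratum label."""
    groups = defaultdict(list)
    for idx, label in enumerate(strata):
        groups[label].append(idx)
    shuffled = list(y)
    for idxs in groups.values():
        vals = [y[i] for i in idxs]
        rng.shuffle(vals)
        for i, v in zip(idxs, vals):
            shuffled[i] = v
    return shuffled

def permutation_p_value(score_fn, y, strata, n_shuffles=100, seed=0):
    rng = random.Random(seed)
    observed = score_fn(y)
    null = [score_fn(shuffle_within_strata(y, strata, rng)) for _ in range(n_shuffles)]
    exceed = sum(s >= observed for s in null)
    # The observed ordering counts as one permutation, so p >= 1 / (n_shuffles + 1).
    return observed, (exceed + 1) / (n_shuffles + 1)

# Toy data: predictions x, targets y correlated with x, two made-up strata.
x = [float(i) for i in range(40)]
y = [i + (i * 7 % 11) for i in range(40)]
strata = [i % 2 for i in range(40)]
observed, p = permutation_p_value(lambda yy: spearman(x, yy), y, strata)
print(f"observed Spearman = {observed:.3f}, p = {p:.4f}")
```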
How We Got Here: V1 → V2 → W1
| Stage | 365d Spearman | 30d Spearman | What changed |
| V1 (single split) | 0.357 | 0.242 | Initial result — inflated |
| V2 (walk-forward, no embargo) | 0.199 | 0.163 | Walk-forward CV, historical mcap, bootstrap |
| W1 (with embargo) | 0.067 | 0.170 | Embargo gap, block bootstrap, permutation test |
Lesson: V1's 365d Spearman of 0.357 was inflated by three things: data snooping (testing 10+ models on the same 137-sample test set), fundamentals look-ahead bias (using 2025 market cap to predict 2022 returns), and target leakage (overlapping return windows). Each fix independently reduced the number. The honest 365d signal is untestable with current data — not disproven, just insufficient.
Current Pipeline
Input: 587 tech-sector VIC ideas with thesis text + embeddings (2022–2025)
Features:
  Embeddings: 1536-dim (text-embedding-3-small) → PCA to 100 dims (train-only fit)
  Scalar (18): quality_score, is_long, posting_year, posting_month,
               desc_len, cat_len, has_catalysts, is_contrarian,
               time_horizon (short/long), thesis_type dummies
Model: RidgeCV (α auto-selected via internal LOO, always picks α=1000)
Evaluation:
  Walk-forward CV: train on [min_year, T-embargo), test on [T, T+1)
  Embargo gap: auto-matched to target horizon (30d → 30 day gap)
  Bootstrap: quarterly block bootstrap (not IID)
  Significance: stratified permutation test (year × sector)
  Metrics: Spearman r, Q5-Q1 median spread, quintile win rates
Key files:
  Pipeline: finance/vic_analysis/predict_robust.py
  Experiments: finance/vic_analysis/LOGBOOK.md
  Plan: docs/plans/2026-03-01-vic-alpha-v2.md
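The quarterly block bootstrap listed under Evaluation can be sketched as follows. This is an assumption-laden illustration, not the pipeline's code: it resamples whole calendar quarters with replacement and, for brevity, bootstraps a mean alpha rather than the Spearman r the report uses. The record schema and helper names are hypothetical and the alphas are made up.

```python
import random
from collections import defaultdict
from datetime import date
from statistics import mean

def quarter_key(d):
    """Map a date to its (year, quarter) block label."""
    return (d.year, (d.month - 1) // 3 + 1)

def block_bootstrap(records, stat_fn, n_boot=500, seed=0):
    """Resample whole quarters with replacement; recompute stat_fn on each resample."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for rec in records:
        blocks[quarter_key(rec["posted_at"])].append(rec)
    keys = sorted(blocks)
    stats = []
    for _ in range(n_boot):
        # Draw as many quarters as exist, keeping each quarter's records together.
        sample = [rec for k in rng.choices(keys, k=len(keys)) for rec in blocks[k]]
        stats.append(stat_fn(sample))
    return sorted(stats)

def percentile_interval(sorted_stats, level=0.95):
    """Simple percentile interval from sorted bootstrap statistics."""
    tail = (1 - level) / 2
    lo = sorted_stats[int(tail * len(sorted_stats))]
    hi = sorted_stats[int((1 - tail) * len(sorted_stats)) - 1]
    return lo, hi

# Toy records: two ideas per quarter across 2023.
records = [
    {"posted_at": date(2023, m, 15), "alpha": a}
    for m, a in [(1, 0.02), (2, -0.01), (4, 0.05), (6, 0.01),
                 (7, -0.03), (9, 0.04), (10, 0.00), (12, 0.03)]
]
stats = block_bootstrap(records, lambda s: mean(r["alpha"] for r in s))
lo, hi = percentile_interval(stats)
print(f"95% interval for mean alpha: [{lo:.3f}, {hi:.3f}]")
```

Resampling by quarter keeps ideas posted in the same market regime together, so the interval reflects between-quarter variation instead of treating each idea as independent.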
Next Steps — Prioritized
- W2: Scale to full 25K VIC corpus (Critical blocker)
  Extract all 25,736 ideas from SanDisk VIC DB. Compute returns for all sectors (not just tech). Embed all thesis texts (~$4 at text-embedding-3-small rates). This gives 20+ years of walk-forward folds and unlocks honest 365d evaluation.
  Effort: Medium. Requires SanDisk mounted. ~30 min Finnhub API for returns, ~$4 OpenAI for embeddings.
- Re-evaluate all horizons with 25K ideas (Depends on W2)
  With 2000–2025 data, a 365d embargo leaves ~15 years of training per fold. Run walk-forward CV with embargo at 30d, 90d, 365d. If 365d signal reappears, it's real. If not, 30d is the signal. Also test 7d (fastest actionable horizon).
- W4: Author track record features (Low effort, high value)
  Bayesian-shrunk hit rate per author, with horizon embargo (only count prior ideas whose return window has fully elapsed). Implementation code is in the plan doc. Requires W2 data for meaningful author sample sizes.
- W3: Survivorship bias fix (Data integrity)
  Delisted companies are missing from returns. SHORTs on bankruptcies (+100% win) and LONGs on failures (-100% loss) are both excluded. Cross-reference VIC symbols against delisting databases.
- W5: Direction-aware evaluation (Quick win)
  Single model with is_long feature, but report Spearman and quintiles separately for LONG and SHORT subsets. SHORT signal may be stronger (VIC members are better at finding overvalued stocks).
- W7: Portfolio-level backtest (Proves tradability)
  Monthly formation of top-decile LONG + bottom-decile SHORT portfolio. Compute Sharpe, drawdown, turnover with realistic constraints (slippage, borrow costs, position limits).
- W6: Discovery timing (Live pipeline prep)
  Measure empirical lag cost: compute alpha at T+0, T+1, T+3, T+7 and check if rankings change. Connect to moneygun event log (nvme/paper/o/) for real discovery timestamps.
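The Bayesian-shrunk author hit rate proposed under W4 could look roughly like the sketch below. This is not the implementation from the plan doc. It assumes a Beta-style prior expressed as a prior hit rate plus a pseudo-count, and it defines a hit as realized alpha above zero. The parameter names `prior_rate` and `prior_strength` and the record schema are hypothetical.

```python
from datetime import date, timedelta

def shrunk_hit_rate(author_ideas, as_of, horizon_days,
                    prior_rate=0.5, prior_strength=10):
    """Hit rate shrunk toward prior_rate, using only fully resolved prior ideas."""
    horizon = timedelta(days=horizon_days)
    # Horizon embargo: an idea counts only if its return window ended before as_of.
    resolved = [i for i in author_ideas if i["posted_at"] + horizon < as_of]
    wins = sum(i["alpha"] > 0 for i in resolved)
    return (wins + prior_rate * prior_strength) / (len(resolved) + prior_strength)

# Toy author history (made-up alphas); the third idea is still unresolved on 2023-07-01.
history = [
    {"posted_at": date(2023, 1, 10), "alpha": 0.10},
    {"posted_at": date(2023, 3, 1), "alpha": 0.04},
    {"posted_at": date(2023, 6, 20), "alpha": 0.20},
]
rate = shrunk_hit_rate(history, as_of=date(2023, 7, 1), horizon_days=30)
new_author = shrunk_hit_rate([], as_of=date(2023, 7, 1), horizon_days=30)
print(f"author: {rate:.3f}, new author: {new_author:.3f}")
```

An author with two resolved wins moves only modestly above the prior (7/12 here), and an author with no resolved ideas falls back to the prior itself, which is the point of shrinking small samples.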
Execution Order
        Now                           Then                        After
┌─────────────────┐           ┌──────────────────┐         ┌──────────────────┐
│ W2: Scale data  │──────────▶│ Re-eval all      │────────▶│ W7: Portfolio    │
│ 25K ideas       │           │ horizons w/      │         │ backtest         │
│ all sectors     │           │ embargo          │         │ (proves tradable)│
│ ~$4 embed cost  │           │ (365d testable!) │         └──────────────────┘
└─────────────────┘           └──────────────────┘
         │                             │
         ▼                             ▼
┌─────────────────┐           ┌──────────────────┐
│ W4: Author      │           │ W5: Direction    │
│ track record    │           │ split eval       │
│ (needs data)    │           │ (LONG vs SHORT)  │
└─────────────────┘           └──────────────────┘
         │
         ▼
┌─────────────────┐
│ W3: Survivorship│
│ bias fix        │
│ (delisted cos)  │
└─────────────────┘
CLI Reference
# Standard run (30d with embargo — the honest evaluation)
python -m finance.vic_analysis.predict_robust --no-fundamentals --horizon 30d
# Compare with/without embargo
python -m finance.vic_analysis.predict_robust --no-fundamentals --horizon 365d # with embargo (honest)
python -m finance.vic_analysis.predict_robust --no-fundamentals --horizon 365d --no-embargo # without (inflated)
# Nested PCA selection (slower, auto-picks dims)
python -m finance.vic_analysis.predict_robust --no-fundamentals --horizon 30d --auto-pca
# Full permutation test (~5 min, 100 shuffles)
python -m finance.vic_analysis.predict_robust --no-fundamentals --horizon 30d --permutation-test
# Single split mode (for quick V1-style comparison)
python -m finance.vic_analysis.predict_robust --no-fundamentals --single-split
Glossary
- Spearman r: Rank correlation between predicted and actual alpha. Measures whether the model correctly orders ideas from worst to best, regardless of the exact predicted values. Range: -1 to +1.
- Embargo gap: Time buffer between the last training idea's return window and the first test idea. Prevents target leakage from overlapping price observation periods.
- Walk-forward CV: Cross-validation that respects time order: train on past data, test on future data. Expanding window: each fold adds more past data. Never trains on future data.
- Q5-Q1 spread: Difference in actual returns between the model's top quintile (Q5, best predicted) and bottom quintile (Q1, worst predicted). Measures economic significance.
- Block bootstrap: Statistical resampling that preserves temporal structure by resampling whole time blocks (quarters) rather than individual observations.
- Permutation test: Significance test that shuffles the target variable within strata (year × sector) and re-runs the full pipeline. If the real result exceeds 95% of shuffled results, the signal is significant at the 5% level.
- PCA: Principal Component Analysis. Reduces 1536-dimensional embeddings to ~100 meaningful dimensions. Fit on training data only to prevent leakage.
- RidgeCV: Linear regression with L2 regularization. The "CV" means it automatically selects the regularization strength via internal leave-one-out cross-validation.
- Alpha: Excess return vs benchmark (SPY). Alpha = stock return − beta × SPY return. Positive alpha means the stock beat the market after adjusting for systematic risk.
Generated 2026-03-01 | VIC Alpha V2 Project | finance/vic_analysis/