# VIC Alpha Prediction — V2 Plan

**Status**: Active — W1 complete, W2 is the critical blocker
**Created**: 2026-03-01
**Reviewed**: Vario maxthink (Opus, GPT-Pro, Grok, Gemini) — fixes applied below
**Previous work**: `finance/vic_analysis/LOGBOOK.md`, `docs/plans/2026-02-25-vic-alpha-prediction-design.md`

## Context

V1 (E1-E3 experiments, PCA-100 Ridge) achieved Spearman r=0.357 on a single 2024 test split. A 4-model Vario review identified three sources of inflation:

1. **Data snooping**: 10+ models evaluated on the same 137-sample test set; best reported
2. **Fundamentals look-ahead**: Current market cap (2025) used to predict returns on 2020-2023 ideas — mechanical correlation with future returns
3. **Outlier-inflated metrics**: Q5-Q1 spread used means, dominated by SMCI (+1257%)

V2 robust evaluation (`predict_robust.py`) confirmed the signal is real but smaller: **Spearman ≈ 0.20, Q5-Q1 median spread ≈ +46%** (both folds significant at p<0.05). Signal comes entirely from **thesis text embeddings** — fundamentals add zero after look-ahead fix.

## Goal

Move from "we have a signal" to "we have a deployable prediction system" that:
1. Ranks new VIC ideas by predicted alpha with honest uncertainty estimates
2. Uses a single model with direction as a feature (split evaluation per direction)
3. Uses realistic entry timing (discovery date, not publication date)
4. Scales to full VIC corpus (~18K processed ideas) across all sectors

## Architecture

```
VIC idea (raw HTML)
  ↓
[Parse] → posted_at, symbol, thesis_type, quality_score, description, catalysts, author
  ↓
[Discovery lag] → entry_date = max(posted_at, discovery_date from o/ log)
  ↓
[Embed] → text-embedding-3-small (1536-dim)
  ↓
[Features]
  ├── PCA(n) on embeddings (fit on train only; n selected by inner 3-fold CV)
  ├── Scalar: quality_score, is_long, posting_year, posting_month
  ├── Thesis metadata: desc_len, has_catalysts, is_contrarian, time_horizon
  ├── Author track record: Bayesian-shrunk hit rate, n_eligible_ideas, avg alpha (horizon-embargoed)
  └── Market regime: VIX at entry, SPY trailing 12m return
  ↓
[Model] — single RidgeCV with is_long feature (direction interactions via Ridge)
  ↓
[Evaluate] → embargoed walk-forward CV, block bootstrap (quarterly), quintile medians
  ├── LONG evaluation: metrics on LONG subset of test set
  └── SHORT evaluation: metrics on SHORT subset, plus tail risk (CVaR, max loss)
  ↓
[Output] → ranked idea list with predicted alpha, confidence tier, position sizing
  ↓
[Post-hoc filters] — SHORT: exclude high short-interest / squeeze-risk stocks (rule-based)
```

## Data Pipeline

### Current state

| Data                   | Count      | Source                         | Location                        |
|------------------------|------------|--------------------------------|---------------------------------|
| Returns (with alpha)   | 875        | Finnhub daily candles          | `data/returns_cache.db`         |
| Thesis text            | 771 matched| SanDisk backup VIC DB          | `data/thesis_text.db`           |
| Embeddings             | 587 whole  | OpenAI text-embedding-3-small  | `data/embeddings.db`            |
| Fundamentals           | 750        | Finnhub company profiles       | `data/fundamentals.db`          |
| Full VIC DB            | ~18K processed (vic.db reset, 4 rows local) | Jobs pipeline results + SanDisk backup | `jobs/data/vic_ideas/vic_ideas.db` (needs rebuild) |
| Trading event log      | 17,967     | Moneygun (Twitter only)        | `nvme/paper/o/twitter.jsonl`    |
| Trade execution log    | 1,826      | Moneygun paper trading         | `nvme/paper/tlogs/recent-trades.jsonl` |

### Target state

| Data                   | Count      | Source                         | Notes                           |
|------------------------|------------|--------------------------------|---------------------------------|
| Returns (all sectors)  | ~20K+      | Finnhub                        | Expand from current 875 (all sectors, not tech-only) |
| Delisting returns      | ~500-2K    | Finnhub + manual               | Bankruptcies=-100%, acquisitions=buyout price |
| Thesis text            | ~20K       | Full VIC DB                    | All with description >100 chars |
| Embeddings             | ~20K       | OpenAI                         | ~$4 at $0.02/M tokens           |
| Author profiles        | ~2K authors| Computed from returns           | Bayesian-shrunk, horizon-embargoed |
| Market regime          | per idea   | Yahoo/Finnhub                  | VIX, SPY trailing return        |
| Discovery timing       | per idea   | `nvme/paper/o/` + jobs logs    | When we actually found each idea|
| Entity dedup map       | ~18K→~12K  | MinHash/SimHash                | Collapse near-duplicates        |

## Workstreams

### W1: Lock evaluation framework (first — prevents self-deception)

**Goal**: Bulletproof methodology on current 875-idea dataset before scaling

All 4 Vario models agreed: lock evaluation before scaling, otherwise you waste scaling effort and may need to re-run.

**Steps**:
1. **Embargo gap in walk-forward CV**: For 365d alpha target, insert a 365-day gap between last training idea's posted_at and first test idea's posted_at. Ideas whose return windows haven't fully resolved by the test cutoff must be excluded from training. Currently missing — this is a form of target leakage (Gemini, GPT-Pro, Opus all flagged)
2. **Block bootstrap with quarterly clusters**: Currently IID bootstrap. Switch to block bootstrap with 3-6 month blocks (all 4 models agree monthly is too granular). Use `arch` package's optimal block length estimator or fixed quarterly blocks
3. **Nested PCA dim selection**: Inner 3-fold CV on training set to select from {50, 100, 150, 200, 300}. Currently fixed at 100. Low risk (Ridge shrinks noisy PCs) but should be automated
4. **Permutation test**: Shuffle y within (year, sector) strata, run full pipeline 100x. Real Spearman must exceed 95th percentile. Respects time + clustering structure per GPT-Pro
5. **Calibration plot**: Predicted alpha decile bins vs actual alpha
6. **Regime-conditional reporting**: Split results by market regime (bull: SPY trailing 12m > +15%; bear: < 0%; neutral in between). Define minimum per-regime sample sizes
7. **Report mean Spearman across folds with single block-bootstrapped CI** (not per-fold significance tests — avoids multiple testing per Opus)
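A minimal sketch of the embargo split in step 1, assuming each idea carries a `posted_at` datetime; the `embargoed_walk_forward` helper and the dict schema are illustrative, not existing code:

```python
from datetime import timedelta

def embargoed_walk_forward(ideas, test_start, test_end, horizon_days=365):
    """Train/test split with a horizon-length embargo gap.

    A training idea is eligible only if its full return window has
    resolved before the test period opens:
    posted_at + horizon_days < test_start.
    """
    embargo_cutoff = test_start - timedelta(days=horizon_days)
    train = [i for i in ideas if i["posted_at"] < embargo_cutoff]
    test = [i for i in ideas if test_start <= i["posted_at"] < test_end]
    return train, test
```

Ideas posted inside the gap (resolved after the test window opens but before it closes) land in neither set, which is the point: they leak the target otherwise.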

**Metrics to report** (per fold and aggregate):
- Spearman r + block-bootstrap 95% CI
- Q5-Q1 median spread + CI
- Q5 win rate, Q1 win rate (separately for LONG and SHORT)
- MAE, median absolute error
- IC (Pearson correlation of predicted vs actual alpha — rank-based IC would duplicate the Spearman above)
- For SHORTs: CVaR at 95th percentile, max single-idea loss
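The block-bootstrap CI attached to these metrics could look like the following sketch, assuming a per-idea quarter label is available; the function name and signature are hypothetical:

```python
import numpy as np
from scipy.stats import spearmanr

def block_bootstrap_spearman_ci(y_pred, y_true, quarters, n_boot=2000, seed=0):
    """95% CI for Spearman r, resampling whole quarters with replacement
    so within-quarter dependence (regime, same-ticker clustering) is kept."""
    rng = np.random.default_rng(seed)
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    quarters = np.asarray(quarters)
    unique_q = np.unique(quarters)
    draws = []
    for _ in range(n_boot):
        sampled = rng.choice(unique_q, size=len(unique_q), replace=True)
        idx = np.concatenate([np.flatnonzero(quarters == q) for q in sampled])
        if len(np.unique(y_true[idx])) > 1:  # rank correlation needs variation
            draws.append(spearmanr(y_pred[idx], y_true[idx])[0])
    return np.percentile(draws, [2.5, 97.5])
```

The `arch` package's optimal block length estimator (step 2) can replace the fixed quarterly grouping once block sizes are tuned.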

### W2: Scale to full corpus

**Goal**: Full corpus → returns + embeddings → 20x more training data

**Blocker**: `vic.db` was reset (only 4 rows). Must rebuild before computing returns.

**Steps**:
0. **Rebuild vic.db** — see `finance/vic_analysis/scripts/rebuild_vicdb.py`. Options:
   a. Re-run `vic_wayback` fetch stage (re-download 17K pages from Wayback Machine — slowest but most complete)
   b. Restore from SanDisk backup at `/Volumes/Sandisk-4TB/rivus-offload/jobs-data/vic_ideas/` (need to check if vic.db + html/ exist there — SanDisk was too slow to verify)
   c. Run `vic_ideas` direct scraper for the 9,610 pending items (requires proxy + cookies, ~24 hrs at 400/day)
1. Extract all ideas from rebuilt vic.db into `thesis_text.db`
2. Run `batch_returns()` on all sectors. Estimate: 18K × 2 API calls = 36K Finnhub calls, ~20 min
3. Run symbol resolution (normalize + LLM batch) for failed symbols
4. **Entity dedup** (new — GPT-Pro): Deduplicate near-identical writeups on same ticker within 90 days using MinHash/SimHash. Multiple ideas on same ticker in same window should keep only the first, or be flagged for the model. Without this, same-ticker clustering creates dependence that inflates significance
5. Embed all ideas with text (~20K). Cost: ~$4 at text-embedding-3-small rates
6. Validate: coverage table by sector, year, thesis_type
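A stdlib sketch of the dedup rule in step 4, using exact Jaccard similarity on character shingles as a stand-in for MinHash (which approximates the same similarity cheaply at 18K-document scale); `dedup_ideas` and the dict schema are illustrative:

```python
def shingles(text, k=5):
    """Character k-shingles over whitespace-normalized text."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(len(t) - k + 1, 1))}

def is_near_duplicate(a, b, threshold=0.8):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) >= threshold

def dedup_ideas(ideas, window_days=90, threshold=0.8):
    """Keep the earliest writeup among near-duplicates on the same
    ticker within window_days."""
    kept = []
    for idea in sorted(ideas, key=lambda i: i["posted_at"]):
        dup = any(
            k["symbol"] == idea["symbol"]
            and (idea["posted_at"] - k["posted_at"]).days <= window_days
            and is_near_duplicate(k["text"], idea["text"], threshold)
            for k in kept
        )
        if not dup:
            kept.append(idea)
    return kept
```

The pairwise loop is O(n²) within a ticker; MinHash/LSH removes that cost at corpus scale, but the keep-the-first semantics stay the same.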

**Expected impact**: n_train 344 → ~7,000. Enables sector models, 7-fold CV, stable quintiles.

### W3: Survivorship bias & delisting (new workstream — all 4 models flagged)

**Goal**: Eliminate survivorship bias from delisted/acquired companies

**Problem**: Delisted companies are excluded from returns. SHORTs on companies that go bankrupt (+100% win) and LONGs on companies that fail (-100% loss) are both missing. This biases LONG alpha upward and SHORT alpha downward.

**Steps**:
1. Systematic scan: cross-reference all VIC symbols against Finnhub delisted endpoints and SEC EDGAR filings
2. Classify delisting reason: bankruptcy, acquisition, going-private, regulatory
3. Assign terminal returns: bankruptcy → -100%, acquisition → buyout premium (from last traded price), going-private → last price
4. Sensitivity analysis: measure Spearman change with vs without delisted imputation
5. Make "return availability rate" a first-class metric — track what % of ideas we have returns for, by sector and year
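The terminal-return conventions in step 3 could be encoded as a small helper; the function name is hypothetical and the conventions are the plan's assumptions, not exchange data:

```python
def terminal_return(reason, last_price=None, buyout_price=None):
    """Impute a terminal return for a delisted name (step 3 above)."""
    if reason == "bankruptcy":
        return -1.0  # equity assumed wiped out
    if reason == "acquisition":
        return buyout_price / last_price - 1.0  # settle at buyout price
    if reason in ("going_private", "regulatory"):
        return 0.0  # position exits at the last traded price
    raise ValueError(f"unknown delisting reason: {reason}")
```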

### W4: Author track record (high-value, low-effort)

**Goal**: Add persistent analyst skill signal

**Implementation with horizon embargo** (all 4 models provided specific leakage fixes):

```python
import numpy as np
from datetime import timedelta

def author_features(author_id, posting_date, horizon_days=365):
    """Compute author stats using ONLY fully-resolved prior ideas."""
    prior_ideas = get_ideas_by_author(author_id)

    # CRITICAL: Only include ideas whose full return window has elapsed
    # An idea posted 6 months ago does NOT have a resolved 365d alpha yet
    eligible = [
        idea for idea in prior_ideas
        if idea.posted_at + timedelta(days=horizon_days) < posting_date
    ]

    if len(eligible) < 3:
        return POPULATION_MEDIAN_FEATURES  # cold start: use global median, not zeros

    # Bayesian shrinkage to avoid 1-for-1 authors getting 100% hit rate
    global_hit_rate = 0.375  # population base rate (fit on training only)
    prior_weight = 5  # shrinkage strength
    raw_hits = sum(1 for i in eligible if i.alpha > 0)

    return {
        "n_eligible_ideas": len(eligible),
        "hit_rate_shrunk": (raw_hits + prior_weight * global_hit_rate) / (len(eligible) + prior_weight),
        "avg_alpha": np.mean([i.alpha for i in eligible]),
        "alpha_consistency": np.std([i.alpha for i in eligible]),  # std: lower = more consistent
        "has_track_record": 1,  # binary flag for cold-start distinction
        "days_since_last": (posting_date - max(i.posted_at for i in eligible)).days,
    }
```

Key requirements:
- `global_hit_rate` and `prior_weight` must be fit inside each training fold (Opus, GPT-Pro)
- Don't use `prior_best_alpha` — remove from original plan (leakage-prone via hindsight)
- Add `has_track_record` binary to let model weight track record features differently for new authors (Opus)
- Add `days_since_last` for recency (Opus) — stale edge detection

### W5: Direction-aware evaluation (replaces "separate LONG/SHORT models")

**Goal**: Understand direction-specific performance without splitting training data

All 4 models agreed: 119 SHORTs is not enough for a separate model. Even at 20K scale (~2,500-3,500 SHORTs), separate PCA-Ridge is borderline.

**Approach** (Opus recommended, others agreed):
1. **Single model** with `is_long` as a feature plus direction × feature interaction terms. A bare binary flag only shifts the intercept; the interactions are what let Ridge learn direction-specific coefficients
2. **Separate evaluation**: Report Spearman, quintiles, win rates, tail risk separately for LONG and SHORT test subsets
3. **Post-hoc SHORT risk filter**: After model ranks SHORTs, apply rule-based overlay — exclude stocks with >50% short interest, recent momentum squeeze indicators, hard-to-borrow flags
4. **Threshold for split**: Only consider separate models if full corpus yields 2,500+ SHORTs AND separate models outperform single model by >0.03 Spearman in paired walk-forward comparison
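A sketch of the single-model approach, assuming embeddings, scalar features, and an `is_long` flag arrive as NumPy arrays; stacking flag × feature interaction columns is one way to give Ridge per-direction slopes (the helper is illustrative, not `predict_robust.py`):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_direction_aware(emb, scalars, is_long, y, n_pca=100):
    """Single Ridge model over PCA(embeddings) + scalars + direction.

    The direction flag alone only shifts the intercept; the
    flag x feature columns give LONG and SHORT distinct slopes.
    """
    pca = PCA(n_components=n_pca).fit(emb)  # fit on train only
    base = np.hstack([pca.transform(emb), scalars])
    d = np.asarray(is_long, dtype=float).reshape(-1, 1)
    X = np.hstack([base, d, d * base])  # interactions: per-direction coefficients
    model = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-2, 3, 20)))
    model.fit(X, y)
    return pca, model
```

Evaluation then slices the test set by `is_long` (step 2) without ever splitting the training data.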

**Target variable clarification** (Opus):
- Train on raw alpha (not thesis_return) — alpha is the right measure of idea quality
- Report both alpha and direction-adjusted absolute return in output
- For SHORT portfolio construction, use absolute return (not alpha) — a SHORT "alpha win" doesn't mean the short made money in absolute terms

### W6: Discovery timing & live pipeline

**Goal**: Realistic entry dates and live scoring of new ideas

**Discovery lag impact** (Vario consensus): For VIC's fundamental, catalyst-driven ideas:
- Large cap: 1-3 day lag costs ~0-2% (noise-dominated)
- Small/mid cap: 1-3 day lag costs ~2-8% (VIC bump from institutional following)
- VIC 45-day exclusivity period means public ideas are already delayed for the best content
- Asymmetric: SHORT ideas may actually benefit from delayed entry (late entry skips the initial bounce as existing longs defend the position)
- Opus: "The real risk isn't lag — it's missing ideas entirely because the scraper was down"

**Empirical measurement first** (before building live pipeline):
```python
# For each idea, compute alpha at T+0, T+1, T+3, T+7
# Plot decay curve — does ranking change? (Spearman of T+0 alpha vs T+3 alpha)
# Stratify by market cap, thesis_type
```
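Once per-lag alphas exist, the decay check above reduces to rank correlations against the T+0 baseline; a hypothetical helper:

```python
import numpy as np
from scipy.stats import spearmanr

def lag_decay(alphas_by_lag):
    """Rank stability of lagged-entry alphas vs the T+0 baseline.

    alphas_by_lag: {lag_days: array of per-idea alphas}, aligned by idea.
    Returns {lag: Spearman r against the lag-0 alphas}.
    """
    base = np.asarray(alphas_by_lag[0])
    return {lag: spearmanr(base, a)[0]
            for lag, a in alphas_by_lag.items() if lag > 0}
```

If the T+3 Spearman stays near 1, the ranking is lag-insensitive and the live pipeline's latency budget can relax.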

**Data sources for discovery timing**:
- `nvme/paper/o/` — observation log schema (documented in `infra/README.md § Trading Data`)
- `jobs/` system — VIC scraping timestamps in job execution logs
- Future: VIC RSS/scraper producing events with `channel: "vic"`

**Live scoring pipeline**:
1. VIC scraper detects new idea → writes to `o/vic.jsonl`
2. Trigger: embed thesis text, compute features (author track record, market regime)
3. Run PCA-Ridge model → predicted alpha + confidence tier
4. Output: notification (Pushover) with symbol, predicted alpha, Q-tier, author track record
5. Position sizing: fractional Kelly based on model confidence and historical calibration
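Step 5's fractional Kelly sizing, sketched under the assumption that model confidence has already been calibrated into a win probability plus win/loss return estimates; the parameter defaults are illustrative:

```python
def fractional_kelly(p_win, win_return, loss_return, fraction=0.25, cap=0.05):
    """Fractional Kelly position size for a bet that wins +win_return
    with probability p_win and loses loss_return (negative) otherwise.

    f* = p/|loss| - q/win; the fraction and hard cap temper estimation
    error in p_win, which is where the W1 calibration plot feeds in.
    """
    q = 1.0 - p_win
    f_star = p_win / abs(loss_return) - q / win_return
    return max(0.0, min(fraction * f_star, cap))
```

Negative edge maps to a zero position rather than a short of the idea itself.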

### W7: Portfolio-level backtest (new — GPT-Pro, Grok)

**Goal**: Bridge idea-level metrics to tradable strategy

**Problem**: Spearman=0.20 and Q5-Q1=+46% are idea-level metrics. Tradability requires accounting for concentration, turnover, liquidity, and (for shorts) borrow costs.

**Steps**:
1. Monthly formation: top decile LONG + bottom decile SHORT, equal weight
2. Constraints: max 5% per name, sector neutrality, minimum market cap $200M
3. Assumptions: 0.1% round-trip slippage, 3% annual borrow cost for shorts, monthly rebalance
4. Metrics: CAGR, Sharpe, max drawdown, Calmar, turnover
5. Compare: Q5-only (long only) vs Q5 long + Q5 short (market neutral)
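A compact sketch of the monthly formation loop in steps 1-3, with scores and realized returns grouped by month; the cost assumptions follow step 3, and the function shape is hypothetical (the 5% name cap and sector neutrality are omitted for brevity):

```python
import numpy as np

def monthly_long_short(scores_by_month, returns_by_month,
                       slippage=0.001, borrow_annual=0.03):
    """Equal-weight top-decile-long / bottom-decile-short monthly returns.

    Costs: 0.1% round-trip slippage per leg and 3%/yr borrow on the
    short book, charged monthly.
    """
    monthly = []
    for month, scores in scores_by_month.items():
        rets = returns_by_month[month]
        order = np.argsort(scores)
        k = max(len(scores) // 10, 1)  # decile size
        longs, shorts = order[-k:], order[:k]
        gross = rets[longs].mean() - rets[shorts].mean()
        monthly.append(gross - 2 * slippage - borrow_annual / 12)
    return np.array(monthly)
```

CAGR, Sharpe, and drawdown (step 4) then come straight off the returned monthly series.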

### W8: Semantic claims extraction (speculative, last priority)

**Goal**: Test whether structured claims add signal beyond dense embeddings

**Steps**:
1. Use an LLM to extract structured claims from thesis texts (Haiku, ~$0.50 for the batch)
2. One-hot encode claim types as additional features
3. Add to PCA-Ridge model alongside embeddings
4. Compare: embeddings + claims vs embeddings alone

**Priority**: Last. Embeddings likely already capture this signal implicitly (all 4 models agree).

## Execution Priority

| Priority | Workstream | Rationale | Effort |
|----------|-----------|-----------|--------|
| 1 | W1: Lock evaluation | **DONE.** Embargo kills 365d (data too short), 30d signal survives (Sp=0.242, p=0.001) | Low (1-2 days) |
| 2 | W2: Scale corpus | 20x data transforms everything | Medium (infra exists) |
| 3 | W3: Survivorship fix | Critical data integrity (all 4 models flagged as missing) | Medium |
| 4 | W4: Author track record | Low effort, high expected value | Low |
| 5 | W5: Direction evaluation | Separate metrics, single model | Low |
| 6 | W6: Discovery timing | Bridges to live trading; measure lag cost empirically first | Medium |
| 7 | W7: Portfolio backtest | Proves tradability | Medium |
| 8 | W8: Claims extraction | Speculative | Medium |

## Success Criteria

1. **Spearman r ≥ 0.15** mean across ≥5 walk-forward folds, with no fold below 0.05 (Opus refinement)
2. **Q5 win rate lift**: LONG Q5 ≥ +10pp above base rate (not absolute 50% — GPT-Pro noted 50% may be aggressive). SHORT Q5 ≥ +5pp above base rate, with CVaR constraint (95th percentile loss < 100%)
3. **Q5-Q1 median spread ≥ +20%** at 365d across all sectors (reduced from +30% — Gemini noted tech-heavy sample inflates this; +30% may not hold across utilities, large caps)
4. **Stable across regimes**: point estimate Spearman > 0 in all regime buckets; statistically significant in majority regime (Opus: define minimum per-regime sample size)
5. **Live pipeline**: <24h from VIC publication to scored prediction
6. **Portfolio Sharpe ≥ 0.5** in walk-forward backtest with realistic constraints (new — W7)

## Key Files

| File | Purpose |
|------|---------|
| `finance/vic_analysis/predict_robust.py` | V2 evaluation pipeline (walk-forward, PCA-Ridge, bootstrap) |
| `finance/vic_analysis/predict_alpha.py` | E1: metadata regression (historical) |
| `finance/vic_analysis/predict_embeddings.py` | E3: embedding generation + NN |
| `finance/vic_analysis/predict_combined.py` | Combined NN (V1, Huber delta fixed to 15) |
| `finance/vic_analysis/fundamentals.py` | Finnhub profile fetcher |
| `finance/vic_analysis/returns.py` | Alpha computation from Finnhub candles |
| `finance/vic_analysis/LOGBOOK.md` | Full experiment history |
| `infra/README.md` § Trading Data | NVMe/moneygun data locations, o/ event schema |

## Dependencies

- **SanDisk 4TB**: Must be mounted for full VIC DB access (`/Volumes/Sandisk-4TB/`)
- **Finnhub API (paid tier)**: For daily candles, company profiles, delisted endpoints
- **OpenAI API**: For text-embedding-3-small embeddings (~$4 for full corpus)
- **NVMe drive**: For moneygun trading data and Finnhub cache
- **`arch` Python package**: For optimal block bootstrap length estimation

## Vario Review Summary

4 models reviewed the original V2 plan. Key changes applied:

| Issue | Raised by | Change |
|-------|-----------|--------|
| Lock evaluation before scaling | Opus, GPT-Pro | W1 and W2 swapped in priority |
| 365-day embargo gap in walk-forward | Gemini, GPT-Pro, Opus | Added to W1 step 1 |
| Block bootstrap quarterly (not monthly) | All 4 | Fixed in W1 step 2 |
| Survivorship bias = missing workstream | Gemini, Opus, GPT-Pro | Added as W3 |
| Entity dedup for same-ticker clustering | GPT-Pro | Added to W2 step 4 |
| Author features need horizon embargo | All 4 | Rewrote W4 with code example |
| Don't split LONG/SHORT models (n too small) | All 4 | Replaced W3 → W5 (single model, split eval) |
| Q5-Q1 +30% may be too aggressive cross-sector | Gemini | Reduced to +20% |
| Q5 win rate as lift over base (not absolute) | GPT-Pro | Reframed success criterion 2 |
| Portfolio-level backtest missing | GPT-Pro, Grok | Added as W7 |
| Permutation test must respect strata | GPT-Pro | Fixed in W1 step 4 |
| Discovery lag is small for fundamental theses | Opus, GPT-Pro, Grok | Added empirical measurement as first step in W6 |
| Target variable: train on alpha, report both | Opus | Added to W5 |
