# VIC Alpha Prediction — Design

## Goal

Predict which VIC investment theses will generate positive alpha (returns above SPY benchmark), track prediction accuracy, and iterate toward a tradeable signal.

## Data Foundation

- **875 ideas** with computed returns at 7 horizons (1d, 7d, 30d, 90d, 180d, 365d, 730d)
- Beta-adjusted alpha, excess returns, absolute returns per horizon
- Metadata: thesis_type, quality_score (1-10), trade_dir (LONG/SHORT), symbol, posted_at
- **25K ideas total** in VIC (backfill in progress for thesis text + HTML)

### Baseline Finding

VIC ideas **underperform SPY** at the median 365d horizon (median alpha = -17.6%, std = 91%). The wide dispersion means strong alpha exists in the tails — the challenge is predicting it.

## Experiment Layers

### E1: Metadata → Alpha (immediate, no additional data needed)

- **Features**: thesis_type (9 categories → one-hot), quality_score, trade_dir, posting_year, posting_month
- **Target**: alpha at 30d, 90d, 365d
- **Models**: OLS, Ridge, Random Forest, XGBoost
- **Validation**: Time-based split (train on pre-2024, test on 2024+)
- **Question**: Do VIC's own metadata features predict returns?
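A minimal sketch of E1 on synthetic data (column names like `thesis_type` and `alpha_365d` are placeholders, not the real schema): one-hot the categoricals, split by posting date, fit Ridge, score with Spearman.

```python
# Illustrative E1 sketch: one-hot metadata features, time-based split, Ridge.
# All column names and the synthetic data are assumptions for illustration.
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 875
df = pd.DataFrame({
    "thesis_type": rng.choice(["growth", "value", "distressed"], n),
    "quality_score": rng.integers(1, 11, n),
    "is_long": rng.integers(0, 2, n),
    "posted_at": pd.to_datetime("2018-01-01")
                 + pd.to_timedelta(rng.integers(0, 2500, n), "D"),
    "alpha_365d": rng.normal(-0.17, 0.9, n),  # matches reported median/std scale
})

X = pd.get_dummies(df[["thesis_type"]]).join(df[["quality_score", "is_long"]]).astype(float)
X["year"] = df.posted_at.dt.year
X["month"] = df.posted_at.dt.month

# Time-based split: train pre-2024, test 2024+ (no future leakage)
train = df.posted_at < "2024-01-01"
model = Ridge(alpha=1.0).fit(X[train], df.alpha_365d[train])
pred = model.predict(X[~train])
rho, _ = spearmanr(pred, df.alpha_365d[~train])
print(f"test Spearman: {rho:.3f}")
```

On real data, swap the synthetic frame for the 875-idea table and repeat per horizon.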

### E2: Fundamentals at Time of Writing → Alpha

- **Additional data**: Finnhub company profiles for each symbol (market_cap, industry, sector)
- **Features**: E1 features + market_cap_at_posting, finnhub_industry, beta_at_posting
- **Models**: Ridge, XGBoost
- **Question**: Do company fundamentals at posting time predict which VIC picks work?
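A sketch of turning a Finnhub-style profile into E2 features. The dict keys (`marketCapitalization` in $M, `finnhubIndustry`) mirror Finnhub's `company_profile2` response shape, but treat them as assumptions here; `profile_features` is a hypothetical helper.

```python
# Hypothetical E2 feature construction from a Finnhub-style company profile.
import math

def profile_features(profile: dict, industries: list[str]) -> list[float]:
    """Log market cap + one-hot industry, aligned to a fixed industry vocabulary."""
    log_mcap = math.log1p(profile.get("marketCapitalization", 0.0))
    one_hot = [1.0 if profile.get("finnhubIndustry") == ind else 0.0
               for ind in industries]
    return [log_mcap] + one_hot

feats = profile_features(
    {"marketCapitalization": 1500.0, "finnhubIndustry": "Semiconductors"},
    industries=["Banking", "Semiconductors", "Retail"],
)
print(feats)  # [log1p(1500), 0.0, 1.0, 0.0]
```

These vectors concatenate directly onto the E1 feature matrix.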

### E3: Embeddings → Alpha (blocked on thesis text backfill)

Mirror the moneygun approach:
- Embed the full writeup with text-embedding-ada-002 or text-embedding-3-small (1536-dim)
- Per-paragraph embeddings for finer granularity
- Concatenate with fundamentals + metadata features
- Train PyTorch NN (Linear → ReLU → Dropout → Linear, output = alpha prediction)
- **Key difference from moneygun**: VIC is longer-horizon (months, not minutes)

### E4: Semantic Claims → Alpha (blocked on thesis text)

- Extract atomic claims using existing VIC taxonomy (6 categories, 23 sub-types)
- Map claim type distribution to alpha
- Test: do writeups heavy on "Valuation Discount" claims outperform "Market Perception Error"?
- Aggregate claim vectors per writeup → features for prediction

### E5: Reasoning & Deep Context (future)

- Industry status at time of writing (macro regime, sector rotation)
- Crux of thesis → verifiable predictions → track hit rate
- Company profile built from contemporaneous data (Wayback Machine for financials)
- Rank company independent of thesis → does ranking alone correlate with alpha?

## Outlier Handling

- Flag |alpha| > 200% for investigation (stock splits, mergers, delistings)
- Winsorize at 1st/99th percentile for training
- Maintain uncapped data for analysis
- Check VIC "winner" badge correlation with actual alpha
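The flag-and-winsorize steps above can be sketched in a few lines (the sample values are made up; 12.6 stands in for an SMCI-scale outlier):

```python
# Flag extreme outliers, winsorize a training copy, keep the raw series intact.
import numpy as np

def winsorize(alpha: np.ndarray, lo: float = 1.0, hi: float = 99.0) -> np.ndarray:
    low, high = np.percentile(alpha, [lo, hi])
    return np.clip(alpha, low, high)

raw = np.array([-3.0, -0.2, 0.1, 0.4, 12.6])  # 12.6 ≈ a +1260% outlier
flagged = np.abs(raw) > 2.0                   # |alpha| > 200% → investigate
capped = winsorize(raw)                       # training copy; `raw` stays uncapped
print(flagged.sum(), capped.max())
```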

## Evaluation Framework

- **No future leakage**: Time-based train/test split, never train on data posted after test set
- **Metrics**: R², Spearman correlation, quintile excess returns (top quintile vs bottom)
- **Significance**: bootstrap confidence intervals on all metrics
- **Baseline**: random selection from VIC = median alpha (-17.6% at 365d)
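The quintile-spread metric with a bootstrap CI can be sketched as follows (inputs are synthetic with a planted signal; in practice `pred` and `alpha` come from the held-out test split):

```python
# Quintile spread (top minus bottom prediction quintile) with a bootstrap 95% CI.
import numpy as np

def quintile_spread(pred, alpha):
    """Mean alpha of top prediction quintile minus bottom quintile."""
    order = np.argsort(pred)
    k = len(pred) // 5
    return alpha[order[-k:]].mean() - alpha[order[:k]].mean()

rng = np.random.default_rng(1)
pred = rng.normal(size=500)
alpha = 0.3 * pred + rng.normal(size=500)  # planted signal for illustration

spreads = []
for _ in range(1000):                       # bootstrap: resample (pred, alpha) pairs
    idx = rng.integers(0, 500, 500)
    spreads.append(quintile_spread(pred[idx], alpha[idx]))
lo, hi = np.percentile(spreads, [2.5, 97.5])
print(f"quintile spread 95% CI: [{lo:.2f}, {hi:.2f}]")
```

A CI that excludes zero is the bar for calling a signal real.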

## Brainstorm — Full Collection of Approaches

### Already Executed (this session)

1. **Metadata → Alpha (E1)** — thesis_type, quality_score, trade_dir → alpha. Result: weak signal at 365d (Spearman=0.22)
2. **Direction-adjusted thesis return** — LONG: alpha, SHORT: -alpha. Key finding: SHORT theses have a 74.8% win rate at 365d
3. **Outlier analysis** — 21 extreme outliers, zero stock split artifacts. SMCI (+1257%) top outlier.
4. **Pattern analysis** — Growth is the only LONG thesis type in positive territory (+13.7%); distressed and event-driven perform worst.

### In Progress

5. **Finnhub fundamentals (E2)** — market_cap, industry, sector at time of posting → alpha regression
6. **Whole-doc embeddings (E3)** — text-embedding-3-small on full description+catalysts → NN
7. **Summary embeddings** — thesis_summary embedding (shorter, signal-dense variant)

### Next Up

8. **Per-paragraph embeddings** — Embed each paragraph separately, use attention-weighted aggregation. Some paragraphs may carry more signal (valuation analysis vs company description).
9. **Description vs catalysts** — Embed separately, compare which section predicts better. Catalysts may be more actionable.
10. **Embeddings + fundamentals fusion** — Concat embeddings + Finnhub features + metadata → bigger NN
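For idea 8, attention-weighted aggregation over per-paragraph embeddings could look like the sketch below (dimensions are placeholders; real embeddings would be 1536-dim, and `ParagraphAttention` is a hypothetical module name):

```python
# Learned attention pooling over per-paragraph embeddings -> one document vector.
import torch
import torch.nn as nn

class ParagraphAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # learned per-paragraph relevance score

    def forward(self, paras: torch.Tensor) -> torch.Tensor:
        # paras: (n_paragraphs, dim) -> softmax weights -> weighted sum: (dim,)
        w = torch.softmax(self.score(paras).squeeze(-1), dim=0)
        return w @ paras

pool = ParagraphAttention(dim=8)
doc = pool(torch.randn(5, 8))  # 5 paragraphs -> one document vector
print(doc.shape)               # torch.Size([8])
```

The learned weights also make paragraph importance inspectable, which speaks directly to the valuation-analysis-vs-company-description question.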

### Semantic / Reasoning Layer

11. **Claim extraction → alpha** — Use existing VIC taxonomy (6 categories, 23 sub-types) to extract atomic claims. Map claim type distributions to alpha. Hypothesis: ideas heavy on "Valuation Discount" claims outperform "Market Perception Error" ones.
12. **Crux/big bet extraction** — LLM identifies the 1-3 core bets in each thesis. Track whether those specific bets played out. Can we identify thesis cruxes that consistently predict alpha?
13. **Verifiable prediction tracking** — Extract concrete predictions ("revenue will grow 20%", "FDA approval by Q3"). Match against actuals (earnings data, FDA databases). Calibrate author/thesis accuracy.
14. **Contrarian signal** — VIC flags some theses as contrarian. Are contrarian theses more alpha-generative? Combine with sector momentum to identify "smart contrarian" vs "wrong contrarian."

### Company Context Layer

15. **Company profile without thesis** — Build a fundamentals-only score for each company (valuation, growth, quality metrics from Finnhub). Does our ranking correlate with alpha? If yes, we can then test whether the thesis text adds incremental value above fundamentals alone. If not, the thesis itself IS the signal.
16. **Industry status at time of writing** — What sector rotation regime was active? (Value vs growth, cyclicals vs defensive.) VIC thesis alpha may vary by macro regime.
17. **Wayback Machine financials** — Fetch IR pages / SEC filings from the time of posting. Build a contemporaneous financial profile. This avoids the look-ahead bias that Finnhub's current-snapshot profiles introduce.
18. **Peer comparison** — How did the VIC idea compare to its sector at time of posting? Were they picking the cheapest name in the sector or going against the grain?

### Cross-Reference & Meta

19. **Author alpha tracking** — Which VIC authors consistently generate alpha? Serial outperformance = skill signal. Use author identity as a feature.
20. **Idea clustering** — Embed all ideas, cluster. Do certain clusters of ideas (thematically similar) predict better? E.g., "undervalued cyclical with upcoming earnings catalyst" as an archetype.
21. **Comment/response quality** — If VIC ideas have comments, does engagement/pushback correlate with returns? Ideas that provoke debate may be more interesting.
22. **Temporal patterns** — Ideas posted during market drawdowns may outperform (value investors buy fear). Test posting_month + VIX/SPY context at time.
23. **Short squeeze detection** — For SHORT theses, can we identify which are vulnerable to squeezes (high short interest + momentum)? These are the catastrophic loss candidates.
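Idea 20 could start as simply as k-means over the idea embeddings, then per-cluster median alpha (the embeddings below are synthetic stand-ins for the real 1536-dim vectors):

```python
# Cluster idea embeddings and inspect median alpha per cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
emb = rng.normal(size=(200, 32))       # stand-in for 1536-dim text embeddings
alpha = rng.normal(-0.17, 0.9, 200)    # stand-in 365d alphas

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(emb)
for c in range(5):
    print(c, round(float(np.median(alpha[labels == c])), 3))
```

Clusters whose median alpha beats the -17.6% baseline would be candidate archetypes.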

### Moneygun-Mirrored NN Architecture

24. **Full pipeline (E3b)** — Mirror moneygun exactly:
    - Input: [emb_0...emb_1535] + [log_mcap, industry_one_hot, quality, is_long, year, month]
    - Hidden: [256, 64] with ReLU + Dropout(0.3)
    - Output: predicted alpha (regression) or win/loss (classification)
    - Training: time-split, winsorized targets, MSE loss, Adam optimizer
    - Evaluation: R², Spearman, quintile spread, win rate by predicted decile
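The architecture in item 24 can be sketched directly (feature dimensions and the 16-column tabular block are placeholders; `AlphaNet` is a hypothetical name):

```python
# Sketch of the E3b head: input -> 256 -> 64 -> 1 with ReLU + Dropout(0.3),
# MSE loss, Adam, winsorized targets, as item 24 specifies.
import torch
import torch.nn as nn

class AlphaNet(nn.Module):
    def __init__(self, in_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 64), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

# One optimization step on synthetic data ([embedding | tabular] concatenated)
x = torch.randn(128, 1536 + 16)           # [emb_0..emb_1535 | log_mcap, one-hots, ...]
y = torch.clamp(torch.randn(128), -2, 2)  # stand-in winsorized alpha targets
model = AlphaNet(x.shape[1])
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = nn.functional.mse_loss(model(x), y)
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```

The classification variant (item 25) only swaps the final layer's target and loss (BCE on alpha > 0); the multi-horizon variant (item 26) widens the output layer to three heads.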

25. **Classification variant** — Instead of predicting alpha magnitude, predict win/loss (alpha > 0 vs < 0). Binary classification may be easier and more actionable.
26. **Multi-horizon prediction** — Single model predicts [30d, 90d, 365d] simultaneously (multi-task learning). Shared representation may improve all horizons.
27. **Separate LONG/SHORT models** — Given the massive difference in base rates (37.5% vs 74.8% win rate), separate models may capture different dynamics.

## File Structure

```
finance/vic_analysis/
├── predict_alpha.py        # E1-E2: metadata + fundamentals regression
├── predict_embeddings.py   # E3: embeddings → NN (when text available)
├── predict_claims.py       # E4: semantic claims → alpha
├── outliers.py             # Outlier detection and investigation
├── fundamentals.py         # Finnhub fundamentals fetcher
└── LOGBOOK.md              # Experiment results and observations
```
