# CEO Quality Backtest — Design

**Date**: 2026-03-03
**Status**: Approved
**Location**: `finance/ceo_quality/`

## Goal

Discover which CEO/leadership quality signals predict stock returns by backtesting
point-in-time assessments against forward returns across hundreds of companies.

## Why This Matters

Strategy #3 from STRATEGIES.md: "Pick CEOs by psychological makeup, track consistency
of what they say vs do. Companies with bold culture and 'go through walls' mentality
outperform." This module tests that hypothesis with data.

## Key Insight: Point-in-Time Scoring

The initial `intel/eval/` correlation (10 semi companies) used current assessments
against trailing returns — the LLM may have been influenced by recent stock performance.
This module fixes that by scoring CEOs **as of a historical date**, using only information
available at that time, then measuring **forward** returns.

## Architecture

```
finance/ceo_quality/
├── CLAUDE.md           # Project instructions
├── dataset.py          # Build (company, as_of_date, ceo_name) tuples
├── assess.py           # Point-in-time CEO scoring (LLM + Wayback + date control)
├── features.py         # Extract numeric features from assessment JSON
├── predict.py          # Walk-forward CV: CEO features → returns
├── backtest.py         # Binary outcome analysis (winners vs losers)
├── iterate.py          # Vary rubric weights, prompt variations, feature combos
├── LOGBOOK.md
├── data/
│   ├── assessments/    # Per-company point-in-time assessment JSONs
│   ├── dataset.db      # SQLite: companies, assessments, returns, features
│   └── .share
└── reports/
    └── .share
```

## Data Sources

### Company Universe (3 pools)

| Source | Companies | Has Returns? | CEO Data Quality |
|--------|-----------|-------------|-----------------|
| S&P 500 (Finnhub) | ~500 | Yes (Finnhub daily) | High — well-covered |
| Semi universe (existing) | 10 | Yes | Excellent — full dossiers |
| VIC ideas with symbols | ~875 | Yes (returns_cache.db) | Variable — need to discover |

### Point-in-Time Data

| Data | Source | Point-in-Time Method |
|------|--------|---------------------|
| CEO name & tenure | Finnhub profile + SEC DEF 14A | Profile as of date |
| CEO interviews/bios | Company IR pages, YouTube | Wayback Machine (`lib/ingest/archive_fetch.py`) |
| Company financials | Finnhub fundamentals | Filter by reporting date |
| Stock returns | Finnhub daily candles | Forward from as_of date |
| CEO track record | Web search grounded to date | LLM system prompt: "only pre-{date} info" |

### Wayback Machine Integration

Use existing `lib/ingest/archive_fetch.py`:
```python
from lib.ingest.archive_fetch import find_snapshots, fetch_snapshot

# Find snapshots of a company's IR page near our as-of date
url = "https://ir.latticesemi.com/"
snapshots = find_snapshots(url, limit=20)
# Pick the closest snapshot before the as_of date
html, info = fetch_snapshot(url, timestamp="20240115")  # YYYYMMDD
```

For CEO assessment, combine:
1. Wayback snapshot of company website/IR page
2. LLM web_search with date constraint in system prompt
3. Finnhub profile (static — use the current profile as a proxy; CEO tenure dates confirm who held the role at the as-of date)

## Point-in-Time CEO Assessment

### Scoring Dimensions (from TFTF + founder deep dive)

| Dimension | Signal | How to Score Point-in-Time |
|-----------|--------|--------------------------|
| Track record | Prior exits, revenue milestones | Public record — stable over time |
| Decision quality | Strategic pivots, M&A outcomes | Wayback news + earnings transcripts |
| Technical depth | Patents, engineering background | LinkedIn/bio snapshots |
| Team building | Exec retention, key hires | SEC proxy + news |
| Drive / intensity | Ambition, work style | Interviews, conference talks |
| Communication | Earnings call clarity, honesty | Transcript analysis |

### Assessment Prompt Strategy

```python
system = f"""You are assessing the CEO of {{company}} as of {as_of_date}.
CRITICAL: Only use information that would have been publicly available
before {as_of_date}. Do not reference events after this date.
You are scoring leadership quality to predict future stock performance."""
```

The LLM gets:
- Wayback snapshot of company page (if available)
- CEO name + basic bio from Finnhub profile
- Constrained web_search (system prompt enforces date cutoff)

Returns structured JSON with scores (0-10) per dimension.
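The downstream feature extraction (`features.py`) can be sketched as a flattening step. The dimension keys follow the rubric table above, but the exact JSON layout is an assumption here, to be fixed by `assess.py`:

```python
# Hypothetical assessment schema: {"scores": {dimension: 0-10, ...}}.
# Dimension names mirror the scoring rubric; the real layout is set by assess.py.
DIMENSIONS = [
    "track_record", "decision_quality", "technical_depth",
    "team_building", "drive", "communication",
]

def to_feature_vector(assessment: dict) -> list[float]:
    """Flatten a scored assessment into a fixed-order vector,
    defaulting any missing dimension to the midpoint (5.0)."""
    scores = assessment.get("scores", {})
    return [float(scores.get(d, 5.0)) for d in DIMENSIONS]
```

A fixed ordering keeps feature columns stable across companies and dates, which the classifiers in `predict.py` would rely on.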

## Binary Outcome Framing

Instead of noisy continuous correlation:

1. Compute the 1y forward excess return (vs SPY) for each company from its as_of date
2. **Top quartile** (top 25% by excess return) → label = 1 ("winner")
3. **Bottom quartile** (bottom 25% by excess return) → label = 0 ("loser")
4. **Drop the middle 50%** — ambiguous cases add noise
5. Train classifiers on CEO features → binary outcome
6. Measure: AUC, precision/recall, feature importance

This gives cleaner signal and works better with moderate sample sizes.
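The labeling step above can be sketched in a few lines; `quartile_labels` is an illustrative helper, not the actual `backtest.py` interface:

```python
def quartile_labels(excess_returns: list[float]) -> dict[int, int]:
    """Map company index -> binary label for the top and bottom quartiles;
    the middle 50% are dropped (simply absent from the result)."""
    n = len(excess_returns)
    q = n // 4
    order = sorted(range(n), key=lambda i: excess_returns[i])
    labels = {i: 0 for i in order[:q]}           # bottom quartile -> "loser"
    labels.update({i: 1 for i in order[-q:]})    # top quartile -> "winner"
    return labels
```

With 8 companies this keeps the 2 best and 2 worst and discards the rest, so each training row is an unambiguous winner or loser.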

### Why Binary Is Better Here

- Continuous returns have fat tails and outliers that dominate regressions
- CEO quality is ordinal (exceptional > strong > competent) — better suited to classification
- Binary framing answers the actionable question: "Should I invest in this CEO?"
- Easier to interpret: "CEOs with trait X are 2x more likely to be in the top quartile"

## Walk-Forward Cross-Validation

Reuse pattern from `finance/vic_analysis/predict_robust.py`:

```
Train: 2020-2022 → Test: 2023 (with 365d embargo gap)
Train: 2020-2023 → Test: 2024
Train: 2020-2024 → Test: 2025
```

This prevents look-ahead bias in both:
- Price data (returns only computed forward from as_of)
- CEO assessment (Wayback ensures only pre-date information)
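A minimal sketch of the expanding-window splits, assuming yearly as-of cohorts; the function name and shape are illustrative, not taken from `predict_robust.py`:

```python
from datetime import date, timedelta

def walk_forward_splits(years: list[int], embargo_days: int = 365):
    """Yield (train_years, test_year, train_cutoff) expanding forward in time.
    The embargo means a training as_of date must sit at least embargo_days
    before the test year starts, so 1y forward returns cannot leak."""
    for i in range(3, len(years)):  # need at least 3 training years
        train = years[:i]
        test = years[i]
        cutoff = date(test, 1, 1) - timedelta(days=embargo_days)
        yield train, test, cutoff
```

For `[2020..2025]` this yields exactly the three splits listed above, with each training cutoff one year before the test window opens.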

## Iteration Loop

Vary these parameters and rank by AUC on held-out test set:

1. **Rubric weights**: Which CEO dimensions matter most?
2. **Assessment prompt variations**: What questions extract the most predictive info?
3. **Feature engineering**: Raw scores vs ratios vs interactions
4. **Universe filters**: Sector, market cap, CEO tenure ranges
5. **Outcome horizons**: 90d vs 180d vs 365d forward returns
6. **Model**: Logistic regression vs gradient boosting vs simple threshold rules
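The sweep itself is just a grid expansion; the keys below are illustrative placeholders, not the actual parameters exposed by `iterate.py` or the VariantConfig pattern:

```python
from itertools import product

# Hypothetical iteration grid (real parameter names live in iterate.py).
GRID = {
    "horizon_days": [90, 180, 365],
    "model": ["logistic", "gbm", "threshold"],
    "features": ["raw", "ratios", "interactions"],
}

def variants(grid: dict) -> list[dict]:
    """Expand a parameter grid into one config dict per combination."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*grid.values())]
```

Each resulting config would be evaluated with walk-forward CV and ranked by held-out AUC.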

## Budget

| Phase | LLM Cost | Finnhub Calls |
|-------|----------|---------------|
| Phase 1: 50 S&P companies × 1 date | ~$2 (grok-fast) | ~50 |
| Phase 2: 200 companies × 2 dates | ~$8 (grok-fast) | ~200 |
| Phase 3: 500 companies × 3 dates | ~$25 | ~500 |
| Quality spot-checks (opus) | ~$3 | 0 |

Stay under ~$10 for initial exploration (Phases 1-2).

## Integration Points

| System | Integration |
|--------|------------|
| `intel/eval/` | Quick correlation tool — keep as-is for exploratory analysis |
| `intel/companies/semi/` | Existing assessments — use as validation set |
| `intel/people/` | CEO dossier data source (for companies already profiled) |
| `finance/vic_analysis/` | VIC ideas as additional company universe |
| `finance/eval/` | VariantConfig pattern for iteration |
| `lib/ingest/archive_fetch` | Wayback Machine snapshots |
| `lib/finnhub/` | Stock prices + company profiles |
| `lib/llm/` | CEO assessment LLM calls |

## Success Criteria

1. **Minimum viable**: Run 50+ company assessments point-in-time, show correlation with forward returns
2. **Signal found**: At least one CEO dimension predicts returns with AUC > 0.55 out-of-sample
3. **Actionable**: Identify 2-3 CEO traits that are most predictive and can be scored on new companies
4. **Validated**: Forward returns (not trailing) with proper embargo, no look-ahead bias

## Relationship to Strategy #3

This directly implements the CEO quality strategy from STRATEGIES.md:
- **(1) company-to-team mapper** → `dataset.py` (Finnhub profiles → CEO names)
- **(2) person dossier from interviews/transcripts** → `assess.py` (Wayback + LLM)
- **(3) extract personality traits** → `features.py` (structured scores)
- **(4) correlate traits with returns** → `predict.py` + `backtest.py`

## Next Steps

1. Build `dataset.py` — pull S&P 500 company list + CEO names from Finnhub
2. Build `assess.py` — point-in-time CEO scoring with Wayback + LLM
3. Run Phase 1: 50 companies, 1 as-of date (e.g., 2024-01-15), 1y forward returns
4. Build `features.py` + `backtest.py` — extract features, run binary analysis
5. Iterate with `iterate.py` — find best dimensions
