# Plan: VIC Theses → pgmpy Inference Network

**Date**: 2026-02-25
**Status**: Draft
**Goal**: Convert ~10 VIC investment theses into a queryable pgmpy Bayesian network

## Context

- 17K+ enriched VIC theses in `vic.db` with summaries, key theses, sectors
- `lib/semnet/` has claim extraction infra (adapter, schema, prompts) — 0 claims extracted yet
- `projects/vic/semanticnet_adapter.py` ready to go with VIC-specific taxonomy
- pgmpy 1.0.0 installed and working
- `inference/example.py` proves the concept with a hand-built 4-node thesis network

## Architecture

```
VIC thesis (text)
    ↓
[Step 1] LLM extracts atomic claims
    ↓
claims: [{text, category, confidence, direction}, ...]
    ↓
[Step 2] LLM identifies causal edges between claims
    ↓
edges: [(claim_A, claim_B, strength), ...]
    ↓
[Step 3] LLM estimates conditional probability tables
    ↓
CPTs: {node: P(node | parents)}
    ↓
[Step 4] Build pgmpy DiscreteBayesianNetwork
    ↓
[Step 5] Query in English → formal query → answer
```

## Step 1: Extract Claims (use existing infra)

**Already built**: `lib/semnet/extract.py` + `VICAdapter`

Run the existing extraction pipeline on 10 selected theses. Each thesis yields 5-15 atomic claims like:

```
Thesis: FirstEnergy (FE) — LONG
Claims:
  1. "FE trades at 11x P/E vs 19x utility peer average" (valuation.relative_discount, 0.95)
  2. "FBI bribery scandal does not impair core regulated earnings" (valuation.perception_error, 0.80)
  3. "Potential regulatory penalties are bounded at $500M" (valuation.asymmetric, 0.70)
  4. "Core regulated utility earnings are ~$2.50/share normalized" (operational.margin_cost, 0.85)
```

**Output**: ~100 claims across 10 theses, stored in the semanticnet `claims` table.

**Implementation**: `inference/extract_claims.py` — thin wrapper around semanticnet batch extraction.
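The wrapper's core job is turning the LLM's JSON output into typed claim records and dropping malformed entries. A minimal sketch, assuming the claim fields shown above; the `Claim` dataclass and `parse_claims` helper are illustrative names, not existing semanticnet API:

```python
import json
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    category: str       # e.g. "valuation.relative_discount"
    confidence: float   # extractor's confidence, 0-1
    direction: str      # e.g. "supports_long" / "supports_short"

def parse_claims(raw_json: str) -> list[Claim]:
    """Validate and type the LLM's claim list; skip malformed entries."""
    claims = []
    for item in json.loads(raw_json):
        try:
            claim = Claim(
                text=item["text"].strip(),
                category=item["category"],
                confidence=float(item["confidence"]),
                direction=item.get("direction", "neutral"),
            )
        except (KeyError, TypeError, ValueError):
            continue  # entry missing required fields or mistyped
        if claim.text and 0.0 <= claim.confidence <= 1.0:
            claims.append(claim)
    return claims
```

Failing soft on bad entries matters here: one garbled claim shouldn't abort a whole thesis's extraction batch.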

## Step 2: Identify Causal Edges

**New code needed.** For each thesis, ask the LLM:

> Given these extracted claims from a single investment thesis, identify which claims causally support or depend on other claims. Return edges as (source_claim_id, target_claim_id, relationship_type, strength).

Relationship types:
- `supports` — claim A being true makes claim B more likely
- `requires` — claim B depends on claim A being true
- `undermines` — claim A being true makes claim B less likely
- `amplifies` — claim A magnifies the effect of claim B

Cross-thesis edges (e.g., "FE's regulatory risk" relates to "CAH's SEC investigation") are a later step. Start within-thesis.

**Output**: ~200 directed edges with strength (0-1).

**Implementation**: `inference/build_edges.py`
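pgmpy requires an acyclic graph, and an LLM can easily emit A→B→C→A, so the edge list needs a cycle check before Step 4. A sketch over the `(source, target, relationship_type, strength)` tuples above; keeping the strongest edges and dropping the weakest conflicting ones is one simple policy (flagging them for review is another):

```python
def find_cycle_free_edges(edges):
    """Greedily keep edges that don't close a cycle, strongest first.

    edges: list of (source, target, relationship_type, strength).
    Returns the kept subset, guaranteed to form a DAG.
    """
    adj = {}  # adjacency list of edges kept so far

    def reaches(start, goal):
        """DFS: is there already a path start -> ... -> goal?"""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(adj.get(node, ()))
        return False

    kept = []
    for src, dst, rel, strength in sorted(edges, key=lambda e: -e[3]):
        if reaches(dst, src):  # adding src -> dst would close a cycle
            continue
        adj.setdefault(src, []).append(dst)
        kept.append((src, dst, rel, strength))
    return kept
```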

## Step 3: Estimate CPTs

**New code needed.** For each node (claim) with parents (incoming edges), ask the LLM:

> Claim: "FE trades at 11x P/E vs 19x utility peer average"
> Parents: ["FBI scandal does not impair earnings" (supports, 0.8), "Regulatory penalties bounded" (supports, 0.7)]
>
> Estimate P(this claim holds | each combination of parents being true/false).

For binary nodes with 2 parents, that's 4 probabilities. LLM provides these as domain-expert estimates. We can calibrate later against actual outcomes (VIC tracks winners).

**Key insight**: LLMs are good at estimating these individual CPTs (3-5 variables). They fail at propagating through the whole network — that's pgmpy's job.
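As a cross-check on the LLM's raw numbers (or a fallback prior when it returns inconsistent ones), the Step 2 strengths can be compiled into a noisy-OR CPT. A sketch for `supports`-type parents; the `leak` baseline is an assumed default, not something the plan specifies:

```python
from itertools import product

def noisy_or_cpt(strengths, leak=0.05):
    """P(claim true | parent states) under a noisy-OR model.

    strengths: per-parent support strengths from Step 2 (0-1).
    leak: baseline probability the claim holds with all parents false.
    Returns one P(true) per parent-state combination, with the first
    parent varying slowest (the usual TabularCPD column convention).
    """
    rows = []
    for states in product([0, 1], repeat=len(strengths)):
        p_false = 1 - leak
        for strength, parent_on in zip(strengths, states):
            if parent_on:
                p_false *= 1 - strength
        rows.append(1 - p_false)
    return rows
```

For the FE example above (strengths 0.8 and 0.7), this yields four probabilities, one per parent combination, exactly the shape Step 4 needs.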

**Output**: CPT for each claim node.

**Implementation**: `inference/estimate_cpts.py`

## Step 4: Build pgmpy Network

Straightforward assembly:

```python
from pgmpy.models import DiscreteBayesianNetwork
from pgmpy.factors.discrete import TabularCPD

# pgmpy wants bare (source, target) pairs; drop type/strength from Step 2 edges
model = DiscreteBayesianNetwork([(src, dst) for src, dst, _type, _strength in edges])
for claim in claims:
    # claim.cpt shape: (2, 2^len(parents)); rows are claim states,
    # columns enumerate parent state combinations
    cpd = TabularCPD(
        claim.id, 2, claim.cpt,
        evidence=claim.parent_ids or None,
        evidence_card=[2] * len(claim.parent_ids) or None,
    )
    model.add_cpds(cpd)
model.check_model()  # raises if any CPT is inconsistent with the graph
```

**Implementation**: `inference/build_network.py`

## Step 5: Query Interface

Start simple — Python function that translates structured queries:

```python
from pgmpy.inference import VariableElimination

infer = VariableElimination(model)

# Direct pgmpy query
infer.query(["stock_rerate"], evidence={"revenue_growth": 1, "mgmt_quality": 1})

# Slightly higher-level wrapper (to be built in query.py)
query_network(
    model,
    question="Given FE's scandal is contained, what's P(stock re-rates)?",
    evidence={"fbi_scandal_contained": True},
    target="stock_rerate",
)
```

NL-to-query adapter (LLM translates English → pgmpy API call) comes after the network works.
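When the NL adapter lands, the LLM's translation can be a small JSON payload that gets validated against the network's node names before touching pgmpy, so a hallucinated node fails loudly. A sketch; the payload shape (`target` + `evidence`) is an assumption, not an existing schema:

```python
import json

def parse_nl_query(raw_json, node_names):
    """Validate an LLM-translated query against known network nodes.

    Expected payload: {"target": str, "evidence": {node: true/false}}.
    Returns (target, evidence as 0/1 state indices) or raises ValueError.
    """
    payload = json.loads(raw_json)
    target = payload["target"]
    if target not in node_names:
        raise ValueError(f"unknown target node: {target}")
    evidence = {}
    for node, value in payload.get("evidence", {}).items():
        if node not in node_names:
            raise ValueError(f"unknown evidence node: {node}")
        evidence[node] = int(bool(value))  # pgmpy takes state indices
    return target, evidence
```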

**Implementation**: `inference/query.py`

## Thesis Selection (10 theses)

Pick 10 with diversity across:
- Sectors (tech, healthcare, utilities, consumer, industrials)
- Types (value, short, special situation)
- Complexity (both simple ~5-claim and complex ~15-claim theses)
- Outcome (mix of winners and losers, for later calibration)

Selection query:
```sql
SELECT idea_id, symbol, company, trade_dir, sector, thesis_type, winner,
       length(description) as desc_len
FROM ideas
WHERE content_ok = 1 AND length(description) > 2000
  AND thesis_type IS NOT NULL AND sector IS NOT NULL
ORDER BY RANDOM()
-- then hand-pick for diversity
```
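The selection step can be a small script that pulls a random oversample and prints candidates for hand-picking. A sketch assuming the `ideas` columns named in the query above; `candidate_theses` is an illustrative name:

```python
import sqlite3

SELECTION_SQL = """
SELECT idea_id, symbol, company, trade_dir, sector, thesis_type, winner,
       length(description) AS desc_len
FROM ideas
WHERE content_ok = 1 AND length(description) > 2000
  AND thesis_type IS NOT NULL AND sector IS NOT NULL
ORDER BY RANDOM()
LIMIT ?
"""

def candidate_theses(db_path, oversample=50):
    """Pull a random oversample of eligible theses to hand-pick from."""
    with sqlite3.connect(db_path) as conn:
        conn.row_factory = sqlite3.Row  # rows addressable by column name
        return [dict(row) for row in conn.execute(SELECTION_SQL, (oversample,))]
```

Oversampling by ~5x keeps the hand-picking step cheap while still covering the sector/type/outcome axes above.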

## File Structure

```
inference/
├── README.md              # Vision doc (done)
├── example.py             # Minimal pgmpy example (done)
├── extract_claims.py      # Step 1: semanticnet claim extraction
├── build_edges.py         # Step 2: LLM identifies causal relationships
├── estimate_cpts.py       # Step 3: LLM estimates conditional probabilities
├── build_network.py       # Step 4: assemble pgmpy network
├── query.py               # Step 5: query interface
└── data/
    └── networks/          # Serialized pgmpy networks per thesis
```

## Implementation Order

1. **Select 10 theses** — query DB, pick diverse set
2. **Extract claims** — run semanticnet on 10 theses → ~100 claims
3. **Build edges for 1 thesis** — get the LLM prompt right on a single thesis first
4. **Estimate CPTs for 1 thesis** — same, iterate on prompt
5. **Build + query 1-thesis network** — prove it works end-to-end
6. **Scale to 10** — batch process, handle edge cases
7. **Cross-thesis edges** — connect claims across theses (same sector, same claim type)
8. **NL query adapter** — English → pgmpy query translation

## Open Questions

- **Binary vs multi-state**: Start binary (claim true/false). Some claims are naturally continuous ("revenue growth rate") — handle with discretization bins later.
- **Cross-thesis edges**: Same sector companies share macro factors. How to model?
- **Calibration**: VIC tracks winners. Can we backtest the network's predictions against actual outcomes?
- **Scale**: 10 theses × 10 claims × 2^(avg parents) CPT entries = manageable. 1000 theses? Need structure learning (pgmpy has `HillClimbSearch`, `PC` algorithm).
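The discretization flagged in the first question can start as fixed bins per claim category, turning a continuous claim into a multi-state node. A sketch; the bin edges are illustrative, not calibrated:

```python
def discretize(value, edges):
    """Map a continuous claim value to a bin index given ascending edges.

    e.g. revenue growth with edges [0.0, 0.05, 0.15] yields 4 states:
    negative, 0-5%, 5-15%, >15%.
    """
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)
```

The node's cardinality then becomes `len(edges) + 1` in the Step 4 `TabularCPD` call instead of 2.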
