# inference/ — Probabilistic Inference Over Knowledge Networks

## The Question

How do you build a system that can answer:

- "If TSMC raises capex 20% and AI chip demand holds, what happens to cloud compute margins?"
- "Given the latest tariff announcements and shipping cost data, which US manufacturers are most exposed?"
- "Based on Fed signaling, employment data, and yield curve, what's the probability of a recession by Q4 2026?"
- "What's the probability that the South China Sea dispute escalates to direct military confrontation within 12 months?"

These questions share a structure: they require combining facts from multiple sources, reasoning through causal chains with uncertainty at each step, and producing calibrated probabilistic answers with provenance. No single tool handles this today — but three categories of tools each handle a piece:

- **LLMs** can reason qualitatively and extract structured claims from text, but struggle to quantify large collections of evidence consistently (see [empirical evidence below](#what-frontier-models-can-and-cant-do-empirically))
- **Knowledge graphs** with query languages (SPARQL, Cypher, GraphQL) can store and traverse millions of facts deterministically — but can't propagate uncertainty or answer "what if" questions
- **Probabilistic inference engines** like pgmpy can compute exact posteriors over causal DAGs with hundreds of nodes — but need structured input, not raw text

This project wires them together.

## Overview

This document covers:

1. **[Two paths](#two-paths)** — LLM-only vs LLM+structure+inference. How they compare, when each wins, why we pursue both.
2. **[What each layer provides](#what-each-layer-provides)** — The natural progression: what LLMs give you → what RDF/SPARQL/GraphQL add → what pgmpy/DoWhy add. Each layer solves problems the previous one can't.
3. **[Architecture](#architecture)** — How the layers compose into a system.
4. **[Extraction: correlation, causation, or both?](#correlation-causation-or-both)** — What we extract from text, what we learn from data, how we merge them.
5. **[Toolkit](#toolkit)** — Core stack, supporting tools, what we skip.
6. **[Plans](#plans)** — MVP from what we have today. Improving fact extraction with gold-standard data.
7. **[Related work](#related-work)** — Who else is building pieces of this, what's missing.
8. **[Key people & events](#key-people--events)** — The field's center of gravity.
9. **[Appendix: worked examples](#appendix-worked-examples)** — Four end-to-end walkthroughs.

## Two Paths

There are two ways to answer the questions above. We need both, and we need to know when each one wins.

### Path A: LLM-Only

Feed the LLM all relevant context and ask it to reason.

```
User question
     ↓
LLM (with retrieved context from RAG)
     ↓
Answer with reasoning chain
```

**What this gives you:**
- Works today, zero infrastructure
- Handles qualitative reasoning well ("what are the main risks?")
- Good at short causal chains (reasoning models extend this to ~100+ steps, but with cliff-like collapse at model-specific thresholds)
- Excellent NL fluency in both input and output
- Can synthesize across heterogeneous sources

**Where it breaks** (with evidence on frontier models):

- **Consistency**: 15–30% of answers change with paraphrasing. 28 of 34 models perform worse on rephrased questions ([arxiv 2509.04013](https://arxiv.org/abs/2509.04013)). Different personas change confidence estimates without changing accuracy ([Xu et al. 2025](https://arxiv.org/abs/2506.00582)). Claude models become *more* overconfident across multi-step tasks ([Barkan 2025](https://arxiv.org/abs/2512.24661)).
- **Compositional scaling**: On 20–30 node causal graphs, GPT-4 shows **60% performance deviation** depending on how the graph is encoded (JSON vs GraphML) ([CausalGraph2LLM, NAACL 2025](https://arxiv.org/abs/2410.15939)). On EconCausal, context shifts cause a **33-point accuracy drop** (88% → 55%), and null-effect detection is only **9.5%** — models almost always hallucinate a causal relationship when none exists ([EconCausal 2025](https://arxiv.org/abs/2510.07231)). Adding a single irrelevant clause to math problems causes up to **65% accuracy drop** ([GSM-Symbolic, Apple/ICLR 2025](https://arxiv.org/abs/2410.05229)).
- **Calibration**: On [ForecastBench](https://www.forecastbench.org/) (ICLR 2025), the best LLM (GPT-4.5) achieves Brier 0.101 vs superforecasters at 0.081. ECE ranges 0.12–0.40 vs superforecasters at 0.03–0.05. GPT-4o in open-ended settings: ECE **0.45** ([arxiv 2502.11028](https://arxiv.org/abs/2502.11028)). Reasoning models (o1, o3) do *not* improve calibration over non-reasoning models.
- **Bayesian updating**: Out-of-the-box LLMs **do not update probabilities with new evidence** — no improvement after the first round of interaction. Specialized fine-tuning can produce 80% agreement with normative Bayesian predictions, but this is not default behavior ([Nature Communications 2025](https://arxiv.org/abs/2503.17523)).
- **Provenance**: The reasoning chain is plausible but not verifiable. You can't audit which facts drove which conclusions.
- **Accumulation**: Each query starts from scratch. There's no persistent knowledge structure that improves over time.

**But LLMs are also getting better fast — and hybrids work:**

- Reasoning models (o3, Claude thinking, DeepSeek-R1) push reliable chain length from ~5 to **~100–1000 steps** — the threshold is real but model-dependent ([arxiv 2509.09677](https://arxiv.org/abs/2509.09677)).
- LLM+data hybrids (LLM-CD) outperform data-only methods by **170% recall**, up to **400%** on some datasets ([Emory 2025](https://www.cs.emory.edu/~jyang71/files/llmcd.pdf)).
- With architectural decomposition (microagent subtasks), MAKER achieved **zero errors over 1M+ LLM steps** on Tower of Hanoi — smaller non-reasoning models provided best reliability-per-dollar ([arxiv 2511.09030](https://arxiv.org/abs/2511.09030)).
- Forecasting accuracy is improving by ~0.016 difficulty-adjusted Brier points/year (Brier scale: 0 = perfect, 0.25 = random guessing; GPT-4 at 0.131 in Mar 2023 → GPT-4.5 at 0.101 in Feb 2025; superforecasters at 0.081). Linear extrapolation puts parity at **Nov 2026** (95% CI: Dec 2025–Jan 2028). Source: [ForecastBench](https://forecastingresearch.substack.com/p/ai-llm-forecasting-model-forecastbench-benchmark) (ICLR 2025).

> **The honest take**: The limitation is real but not fixed — it's a rapidly-moving frontier. Our architecture bets on the part that *isn't* moving: **consistency** (KGs with SPARQL give you deterministic queries) and **calibration** (pgmpy gives you Bayesian inference by construction). These are structural properties of the tools, not benchmarks to chase.

### Path B: LLM → Structure → Inference

Use LLMs for what they're good at (NL↔formal translation), dedicated engines for what they're good at (exact inference).

```
User question
     ↓
LLM translates → formal query (SPARQL or GenSQL-style)
     ↓
Query executes against structured knowledge:
  Layer 1: KG (deterministic facts) → SPARQL
  Layer 2: Causal DAG + CPTs (pgmpy) → belief propagation
  Layer 3: Bayesian models (PyMC) → posterior sampling
     ↓
Answer with provenance chain + calibrated confidence
```

**What this gives you:**
- **Consistency**: Same query, same graph → same answer. Always.
- **Compositional scaling**: pgmpy handles 100+ node DAGs without breaking a sweat — exact inference cost depends on graph treewidth, not chain length, so the math scales where step-by-step LLM reasoning degrades.
- **Calibration**: Track predictions vs outcomes. Adjust CPTs. Get better over time.
- **Provenance**: Every answer traces back through specific edges, CPTs, and source documents.
- **Accumulation**: The graph grows. Each new source either confirms edges (strengthening CPTs), contradicts them (flagging for review), or adds new ones.

**What it costs:**
- Extraction pipeline to build (LLM → structured claims)
- Query translation layer to build (NL → GenSQL/SPARQL)
- Infrastructure: graph DB, pgmpy models, calibration tracking
- Cold start: the graph needs enough facts before it's useful

### How to compare them

Run both on the same questions. Measure:

| Metric | Path A (LLM-only) | Path B (structured) |
|--------|-------------------|---------------------|
| **Consistency** | Ask 10 times, measure variance | Should be zero variance for same evidence |
| **Calibration** | Track predicted probabilities vs outcomes over time | Same, but with mechanism to improve |
| **Provenance quality** | Human rates: "can I verify this reasoning?" | Provenance is machine-auditable |
| **Compositional accuracy** | Test on 5, 10, 20-variable problems | Same problems, compare |
| **Latency** | Fast (one LLM call) | Slower (extraction + query + inference) |
| **Cost per query** | Token cost only | Infrastructure + token cost |

**Priority assessment**: Start with Path A as the baseline (it works today). Build Path B incrementally — the MVP (see [Plans](#plans)) adds structure without requiring the full pipeline. The crossover point is when questions involve >5 variables or when calibration matters. For investment theses and geopolitical analysis, that's most questions.

### The hybrid reality

In practice, every query uses both paths. The LLM handles extraction and NL translation (Path A skills). The structured engine handles inference (Path B skills). The question is how much structure sits in between.

```
Pure Path A:  question → LLM → answer
Minimal B:    question → LLM → KG lookup → LLM summarizes → answer
Medium B:     question → LLM → SPARQL query → deterministic answer
Full B:       question → LLM → GenSQL query → pgmpy inference → calibrated answer
```

We move right along this spectrum as the knowledge graph and causal DAG grow.

## What Each Layer Provides

### Layer 0: What LLMs give you

LLMs are the linguistic glue. They solve three problems that blocked every prior attempt (CYC, Semantic Web, expert systems):

| Capability | Example | Quality |
|-----------|---------|---------|
| **Extraction** | Read an earnings call → structured facts + causal claims | Excellent for entity/value extraction; good for causal claims |
| **NL → formal translation** | "What causes revenue to drop?" → SPARQL or GenSQL query | Good for SPARQL; emerging for probabilistic queries |
| **Qualitative causal reasoning** | "High fixed costs + revenue growth → operating leverage" | Reliable for 3–5 variable chains |
| **Contradiction detection** | "Source A says X, source B says Y" | Good at spotting, poor at resolving quantitatively |
| **Schema suggestion** | "Given these facts, what's a plausible causal graph?" | Useful as priors; not authoritative |

#### What frontier models can and can't do (empirically)

LLMs are strong and getting stronger — the question is where they plateau, and where dedicated tools keep scaling:

| Dimension | Frontier LLM performance (2024–2025) | What dedicated tools give you |
|-----------|--------------------------------------|-------------------------------|
| **Pairwise causal discovery** | 97% accuracy (GPT-4, [Kiciman et al. 2023](https://arxiv.org/abs/2305.00050)) | pgmpy PC algorithm: comparable on data, complementary on text |
| **Multi-variable causal graphs** | 60% performance deviation across encodings on 20–30 node graphs; 9.5% null-effect detection | SPARQL over KG: deterministic traversal, no encoding sensitivity. pgmpy: exact inference, scales to 100+ nodes |
| **Chain length** | Reasoning models: ~100–1000 reliable steps. Standard LLMs: ~5–15 before cliff collapse | pgmpy belief propagation: exact at any DAG size. SPARQL: multi-hop queries limited only by graph size |
| **Forecasting accuracy** | GPT-4.5 Brier 0.101 vs superforecasters 0.081 ([ForecastBench](https://www.forecastbench.org/), ICLR 2025) | pgmpy + calibration layer: Bayesian updating by construction |
| **Calibration (ECE)** | 0.12–0.45 across models; 0.03–0.05 for superforecasters | pgmpy/PyMC: calibrated by construction (Bayesian posteriors) |
| **Consistency** | 15–30% of answers change with rephrasing | SPARQL: deterministic — same query, same answer. pgmpy: same evidence, same posterior. Always. |
| **Bayesian updating** | No improvement from additional evidence without fine-tuning | pgmpy: updates CPTs with new evidence by design |
| **Causal graph suggestion** | LLM+data hybrid (LLM-CD): 170% recall improvement over data-only | CausalNex/pgmpy structure learning: data-only baseline. Best combined. |

**The honest picture**: LLMs are excellent at extraction and qualitative reasoning, and LLM+formal hybrids dramatically outperform either alone. But **calibration and consistency** — the two things that matter most for a system that accumulates and compounds knowledge — are structural properties of the tools, not benchmarks to chase. Knowledge graphs with SPARQL give you consistency. Probabilistic engines like pgmpy give you calibration and Bayesian updating. The layers below exist to provide what LLMs can't.

### Layer 1: What RDF-star / SPARQL / GraphQL give you

Store extracted facts as structured triples. Query them deterministically.

**RDF-star** (not plain RDF) because we need to annotate triples with metadata:

```turtle
# Plain RDF: Axon revenue was $560M
:Axon :revenue "560M"^^xsd:decimal .

# RDF-star: same fact, with provenance and confidence
<< :Axon :revenue "560M"^^xsd:decimal >>
    :source :AXON_Q4_2025_transcript ;
    :confidence 0.99 ;
    :extractedAt "2026-02-15"^^xsd:date .
```

**What this unlocks:**

| Capability | SPARQL example | Why it matters |
|-----------|---------------|---------------|
| **Factual lookup** | `SELECT ?rev WHERE { :Axon :revenue ?rev }` | Deterministic, fast, auditable |
| **Multi-hop traversal** | `SELECT ?co WHERE { ?co :sources_from :China ; :gross_margin ?m . FILTER(?m < 0.30) }` | Compositional queries over the fact graph |
| **Provenance queries** | `SELECT ?src WHERE { << :Axon :revenue ?r >> :source ?src }` | Every fact traces to its source document |
| **Temporal queries** | `SELECT ?rev ?date WHERE { << :Axon :revenue ?rev >> :extractedAt ?date }` | Track how facts change over time |
| **Contradiction detection** | Two triples for same property, different values → flag | Structural, not heuristic |
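
To make the mechanics concrete, here's a dependency-free sketch of a quoted-triple store — a toy stand-in for a real RDF-star-capable triplestore, not an implementation of one; entity names and metadata fields are illustrative:

```python
from collections import defaultdict

class QuotedTripleStore:
    """Toy in-memory store for RDF-star-style annotated triples (illustration only)."""

    def __init__(self):
        self.triples = []                      # list of (subject, predicate, object)
        self.annotations = defaultdict(dict)   # (s, p, o) -> provenance metadata

    def add(self, s, p, o, **meta):
        self.triples.append((s, p, o))
        self.annotations[(s, p, o)].update(meta)

    def match(self, s=None, p=None, o=None):
        # Deterministic pattern matching: None acts as a wildcard, like a SPARQL variable
        return [t for t in self.triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]

    def provenance(self, s, p, o):
        return self.annotations.get((s, p, o), {})

    def contradictions(self):
        # Structural contradiction check: same (subject, predicate), different objects
        seen = defaultdict(set)
        for s, p, o in self.triples:
            seen[(s, p)].add(o)
        return {k: v for k, v in seen.items() if len(v) > 1}

store = QuotedTripleStore()
store.add("Axon", "revenue", "560M", source="AXON_Q4_2025_transcript", confidence=0.99)
store.add("Axon", "revenue", "575M", source="news_article_17", confidence=0.60)

print(store.match(s="Axon", p="revenue"))           # both annotated triples
print(store.provenance("Axon", "revenue", "560M"))  # source + confidence
print(store.contradictions())                       # flags the two conflicting values
```

The point of the sketch: provenance queries and contradiction detection fall out of the data model structurally — no heuristics, no LLM in the loop.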

**NL-to-SPARQL is mature.** LLMs translate English to SPARQL reliably. This layer handles the majority of factual queries without any probabilistic machinery.

**GraphQL** is the API layer — if we expose the KG to external consumers, GraphQL is the natural interface. But internally, SPARQL is more expressive for the graph traversal we need.

**The ceiling**: SPARQL can't propagate uncertainty, can't answer "what if" questions, can't distinguish correlation from causation, can't compute `P(Y | do(X))`. For that, you need Layer 2.

### Layer 2: What pgmpy / DoWhy give you

Store causal relationships as a Bayesian network. Run formal inference.

**pgmpy** is our primary inference engine. It handles:

| Capability | pgmpy API | What it does |
|-----------|----------|-------------|
| **Conditional probability** | `model.query(variables=['Y'], evidence={'X': 1})` | P(Y \| X=1) — exact belief propagation |
| **Causal intervention** | via DoWhy: `model.do(x={'treatment': 1})` | P(Y \| do(X=1)) — Pearl's do-calculus |
| **Counterfactuals** | DoWhy: `model.counterfactual(...)` | "Would Y have been different if X hadn't happened?" |
| **Structure learning** | `HillClimbSearch(data).estimate()` | Discover DAG structure from data |
| **Causal identification** | DoWhy: `model.identify_effect(...)` | "Can I even estimate this effect given my DAG?" — before wasting compute |
| **Simulation** | `BayesianNetwork.simulate(n_samples=1000)` | Monte Carlo samples from the joint distribution |
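
Conceptually, a conditional query is just the DAG's joint factorization, summed and renormalized. A dependency-free sketch over a hypothetical three-node chain (capex → capacity → margin, all binary, made-up CPT values) shows the computation by brute-force enumeration — pgmpy's inference objects (e.g. `VariableElimination(model).query(...)`) compute the same answer efficiently:

```python
from itertools import product

# Hypothetical chain DAG: capex -> capacity -> margin. CPTs as P(node=1 | parent).
p_capex1 = 0.3                  # P(capex=1)
p_capacity1 = {0: 0.1, 1: 0.8}  # P(capacity=1 | capex)
p_margin1 = {0: 0.2, 1: 0.6}    # P(margin=1 | capacity)

def joint(capex, capacity, margin):
    """P(capex, capacity, margin) via the chain-rule factorization of the DAG."""
    p = p_capex1 if capex else 1 - p_capex1
    p *= p_capacity1[capex] if capacity else 1 - p_capacity1[capex]
    p *= p_margin1[capacity] if margin else 1 - p_margin1[capacity]
    return p

def query_margin(evidence):
    """P(margin=1 | evidence) by enumeration -- what exact inference
    engines compute efficiently on much larger DAGs."""
    num = den = 0.0
    for capex, capacity, margin in product([0, 1], repeat=3):
        world = {"capex": capex, "capacity": capacity, "margin": margin}
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(capex, capacity, margin)
        den += p
        if margin == 1:
            num += p
    return num / den

print(round(query_margin({}), 3))            # prior P(margin=1)       -> 0.324
print(round(query_margin({"capex": 1}), 3))  # P(margin=1 | capex=1)   -> 0.52
```

Enumeration is exponential in node count; belief propagation and variable elimination exploit the DAG structure to avoid that, which is why the engine matters at 100+ nodes.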

**GenSQL-style queries compile to these calls.** The query layer we're building translates:

```sql
PROBABILITY OF cloud_margin_delta > 2pp
  GIVEN tsmc_capex_change = +0.20, ai_demand = 'sustained'
  FROM causal_graph
```

into:

```python
model.query(
    variables=['cloud_margin_delta'],
    evidence={'tsmc_capex_change': 'high', 'ai_demand': 'sustained'}
)
# then: P(cloud_margin_delta > 2pp) from the resulting distribution
```
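
A minimal sketch of that translation layer — a regex parser for the `PROBABILITY OF ... GIVEN ...` subset. The grammar here is our own illustrative subset, not GenSQL's actual grammar:

```python
import re

def parse_probability_query(q):
    """Parse a GenSQL-style 'PROBABILITY OF <target> [op thresh] GIVEN <evidence>'
    string into the arguments a pgmpy-style query expects. Illustrative subset only."""
    m = re.match(
        r"PROBABILITY OF\s+(?P<target>\w+)\s*(?P<op>[<>=]+)?\s*(?P<thresh>\S+)?"
        r"(?:\s+GIVEN\s+(?P<given>.+?))?(?:\s+FROM\s+\w+)?\s*$",
        q.strip(), re.IGNORECASE | re.DOTALL)
    if not m:
        raise ValueError(f"unparseable query: {q!r}")
    evidence = {}
    if m.group("given"):
        for clause in m.group("given").split(","):
            k, v = clause.split("=")
            evidence[k.strip()] = v.strip().strip("'")
    return {"variables": [m.group("target")],
            "condition": (m.group("op"), m.group("thresh")),
            "evidence": evidence}

q = """PROBABILITY OF cloud_margin_delta > 2pp
  GIVEN tsmc_capex_change = +0.20, ai_demand = 'sustained'
  FROM causal_graph"""
print(parse_probability_query(q))
```

The `variables`/`evidence` dict maps directly onto the pgmpy query call above; the `condition` is applied afterwards to the returned distribution.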

**The ceiling**: pgmpy works with discrete distributions and tabular CPTs. For continuous variables, hierarchical models, or MCMC-based inference, you escalate to PyMC/NumPyro (Layer 3). In practice, most of our queries stay in Layer 2.

### The full stack

```
Question: "If TSMC raises capex 20%, what happens to cloud margins?"

Step 1 (LLM):     Parse question → identify variables, intervention, target
Step 2 (SPARQL):   Look up current TSMC capex, cloud margin data from KG
Step 3 (GenSQL):   PROBABILITY OF cloud_margin_delta > 2pp GIVEN tsmc_capex_change = +0.20
Step 4 (pgmpy):    Belief propagation over causal DAG → distribution over cloud_margin_delta
Step 5 (LLM):      Render answer in English with provenance chain
```

Each layer does what it's best at. No layer tries to do what another layer does better.

## Architecture

```
Web Sources          Storage & Inference              Query Interface
─────────────        ──────────────────              ───────────────
                     ┌──────────────────────┐
Grokipedia      ┐    │ Layer 1: KG (facts)  │         NL question
Inv. theses     │LLM │ RDF-star + SPARQL    │←─────── ↓ (deterministic)
Earnings calls  ├──→ │                      │         LLM → SPARQL
News articles   │    │ Layer 2: Causal DAG  │←─────── ↓ (probabilistic)
Company filings ┘    │ pgmpy + CPTs         │         LLM → GenSQL-style
                     │                      │
  Data sources ──────│ Structure learning   │         DoWhy causal ID
  (time series,  ──→ │ (merge w/ extracted) │         pgmpy belief prop.
   financials)       │                      │
                     │ Layer 3: PyMC/NumPyro│←─────── ↓ (heavy inference)
                     │ (continuous/hier.)   │
                     └──────────────────────┘
                               ↓                      ↓
                     Calibration layer          NL answer + provenance
                     (predicted vs actual)      + confidence interval
```

### Five layers

1. **Extraction** — LLM reads source, outputs structured claims: facts, correlations, causal assertions, and inferences — each tagged by type and confidence. (See [Correlation, Causation, or Both?](#correlation-causation-or-both) for the taxonomy.)
2. **Knowledge graph** — Deterministic facts stored as RDF-star triples with provenance and confidence metadata (Layer 1). Queryable via SPARQL for factual lookups and multi-hop traversal. LLM translates NL → SPARQL.
3. **Causal DAG** — Causal claims and inferences stored as a pgmpy Bayesian network (Layer 2). Edges carry conditional probability tables. Structure learning from data merges with extracted claims. Contradictions flagged.
4. **Query translation** — User asks in English → LLM translates to GenSQL-style probabilistic query (`PROBABILITY OF ... GIVEN ...`, `SIMULATE ... GIVEN ...`) or SPARQL depending on query type → appropriate engine executes.
5. **Answer** — Results rendered back to English with provenance chain, confidence intervals, and source attribution.

## Correlation, Causation, or Both?

We do **three things**, and the distinction matters:

### 1. Extract claims from text (LLM)

The LLM reads sources and pulls out structured assertions. These come in flavors:

| Type | Example | How extracted | Confidence source |
|------|---------|--------------|-------------------|
| **Fact** | "Axon revenue was $560M in Q4 2025" | Named entity + value extraction | Source reliability |
| **Correlation claim** | "When semiconductor capex rises, chip prices tend to fall 18 months later" | Pattern described in analyst text | Author's track record, citation count |
| **Causal claim** | "TSMC's capex increase *causes* fab capacity expansion" | Explicit causal language ("causes", "leads to", "drives", "because") | Domain expertise of source |
| **Inference** | "Top-line growth + high fixed costs → operating leverage improves" | Causal reasoning chain in text | Logical validity + premise confidence |

The LLM is good at this — distinguishing "X correlates with Y" from "X causes Y" in natural language is exactly the kind of linguistic judgment LLMs excel at. Each extracted claim gets tagged with its type.
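
As a crude illustration of the routing step (not the extraction itself — the LLM does that), a cue-lexicon pre-filter can tag candidate sentences by claim type before they're sent for extraction. The cue lists here are illustrative, not exhaustive:

```python
import re

# Causal cues from the table above; a pre-filter, not a classifier --
# the LLM still does the actual extraction, this just routes candidates.
CAUSAL_CUES = re.compile(
    r"\b(causes?|caused|leads? to|led to|drives?|drove|because|results? in|due to)\b",
    re.IGNORECASE)

CORRELATION_CUES = re.compile(
    r"\b(correlat\w+|associated with|tends? to|co-?var\w+)\b", re.IGNORECASE)

def tag_sentence(sentence: str) -> str:
    """Tag a sentence as a causal or correlation claim candidate, or neither."""
    if CAUSAL_CUES.search(sentence):
        return "causal-candidate"
    if CORRELATION_CUES.search(sentence):
        return "correlation-candidate"
    return "other"

print(tag_sentence("TSMC's capex increase causes fab capacity expansion"))
print(tag_sentence("When capex rises, chip prices tend to fall 18 months later"))
print(tag_sentence("Axon revenue was $560M in Q4 2025"))
```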

### 2. Learn structure from data (algorithms)

Independent of text extraction, we can discover statistical patterns from data:

| Method | What it finds | Tool |
|--------|--------------|------|
| **Correlation discovery** | "Columns A and B co-vary" | pandas, scipy |
| **Conditional independence testing** | "A and B are independent given C" | pgmpy `PC` algorithm (constraint-based CI tests) |
| **Causal structure learning** | "The best-fitting DAG is A → B ← C" | CausalNex, pgmpy structure learning, DoWhy |
| **Intervention estimation** | "The causal effect of A on B is Δ" | DoWhy `identify_effect` + `estimate_effect` |

This is the "let the data speak" path. It discovers structure that no one explicitly stated.
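
A dependency-free sketch of why "let the data speak" needs conditional independence, not just correlation: in a simulated chain A → B → C, A and C correlate strongly, but the partial correlation controlling for B is near zero — evidence against a direct A → C edge. (A real pipeline would use pgmpy's CI tests; this is the idea in miniature, with made-up noise levels.)

```python
import math
import random

random.seed(42)

# Simulate a hypothetical chain A -> B -> C: A and C correlate, but only through B.
n = 5000
a = [random.gauss(0, 1) for _ in range(n)]
b = [x + random.gauss(0, 0.5) for x in a]
c = [y + random.gauss(0, 0.5) for y in b]

def corr(x, y):
    """Pearson correlation, computed from scratch."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

def partial_corr(x, y, z):
    """Correlation of x and y after controlling for z -- near zero means
    x and y are (linearly) independent given z."""
    rxy, rxz, ryz = corr(x, y), corr(x, z), corr(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz**2) * (1 - ryz**2))

print(round(corr(a, c), 2))             # strongly correlated marginally
print(round(partial_corr(a, c, b), 2))  # ~0: A is independent of C given B
```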

### 3. Merge extracted + learned (the hard part)

The real system uses **both** — and they complement each other:

```
Text extraction (expert knowledge)     Data-driven learning
─────────────────────────────────     ─────────────────────
"Analysts say X causes Y"             PC algorithm says X ⊥ Y | Z
         ↓                                      ↓
    Prior DAG structure              Statistical evidence for/against edges
         ↓                                      ↓
         └──────────── Merge ──────────────────┘
                         ↓
              Final causal DAG + CPTs
              (weighted by source quality + statistical evidence)
```

**Extracted causal claims become priors** on the DAG structure. Structure learning algorithms test them against data. When they agree, confidence is high. When they disagree — "the analyst says X causes Y but the data shows X ⊥ Y | Z" — that's a **finding**, flagged for human review or further investigation.

This is where the value compounds: each new source either confirms existing edges (strengthening CPTs), contradicts them (triggering review), or adds new ones (expanding the graph). Over time the network gets both larger and better calibrated.
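
The merge rule can be sketched as a Bayesian update on each edge's existence: prior from the extracted claim's source quality, likelihood ratio from the statistical evidence. All numbers below are illustrative:

```python
def update_edge_confidence(prior, likelihood_ratio):
    """Posterior odds = prior odds * likelihood ratio (Bayes' rule in odds form).
    prior: P(edge exists) from the extracted claim (source quality).
    likelihood_ratio: P(data | edge) / P(data | no edge) from structure learning."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Analyst claims "X causes Y" with 0.8 confidence; the data agrees (LR = 5):
print(round(update_edge_confidence(0.8, 5.0), 3))  # -> 0.952, edge strengthens

# Same claim, but the data says X is independent of Y given Z (LR = 0.1):
post = update_edge_confidence(0.8, 0.1)
print(round(post, 3))                              # -> 0.286
if post < 0.5:
    print("FLAG: extracted claim contradicted by data -- route to human review")
```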

### What we're NOT doing

- **Not training ML models to predict causality** — we're extracting claims and testing them, not building a "causal classifier"
- **Not assuming all extracted claims are causal** — correlations stay as correlations until validated
- **Not treating LLM confidence as calibrated** — LLM says "90% confident" but that gets recalibrated against actuals over time

## Toolkit

### Core Stack

> **Two pillars: pgmpy (inference engine) + GenSQL-inspired query layer (query semantics).**
> pgmpy gives us causal DAGs, belief propagation, and conditional queries — the computational spine.
> GenSQL (MIT, PLDI 2024) gives us the query language design — `PROBABILITY OF`, `SIMULATE`, `GIVEN` — which we implement in Python as a thin translation layer over pgmpy. PyMC/DoWhy extend the system for heavier inference and causal identification.

| Tool       | Role                                         | Why                                                    |
|------------|----------------------------------------------|--------------------------------------------------------|
| **pgmpy** ★ | Causal DAGs, belief propagation, CPTs        | **Inference engine.** Native DAG operations, `model.query(variables=['Y'], evidence={'X': 1})`, structure learning |
| **GenSQL semantics** ★ | Query language design target              | **Query layer.** SQL + probabilistic primitives (`PROBABILITY OF`, `SIMULATE`, `GIVEN`, `GENERATIVE JOIN`). We build a Python implementation; GenSQL itself is Clojure-only and dormant. See [GenSQL paper](https://arxiv.org/abs/2406.15652) |
| **DoWhy**  | Causal identification, do-calculus            | "Can I even estimate this effect given my DAG?" — identification before inference |
| **PyMC**   | Heavy-duty Bayesian inference, posteriors     | When pgmpy's discrete inference isn't enough — continuous distributions, hierarchical models |
| **NumPyro**| Fast inference (JAX-compiled)                 | Drop-in speed upgrade when PyMC is too slow            |

### Supporting

| Tool          | Role                                      |
|---------------|-------------------------------------------|
| **SPPL**      | Exact symbolic inference (Python). GenSQL uses it as a backend. Standalone pip package. [github](https://github.com/probsys/sppl) |
| **Bambi**     | R-formula-style model specification on PyMC ("y ~ x1 + x2") |
| **ArviZ**     | Diagnostics and visualization for Bayesian models |
| **CausalNex** | Bayesian network structure learning from data (McKinsey/QuantumBlack) |
| **Gen.jl**    | MIT probabilistic programming (Julia). Mansinghka's group — programmable inference, involutive MCMC. Separate from GenSQL. Most active repo in the ecosystem (1.8k stars). [gen.dev](https://www.gen.dev/) |
| **PyWhy-LLM** | Experimental: LLM capabilities for causal analysis (Microsoft) |

### Deliberately Skipped (for now)

- **Stan** — special syntax, slower iteration cycle
- **Turing.jl** — Julia ecosystem friction
- **BUGS/JAGS** — declining usage
- **Lean** — verification layer, practical payoff too far out
- **Hakaru/DICE/libDAI** — very specialized, limited training data

## Plans

### Plan 1: MVP from What We Have

**Goal**: Answer one of the motivating questions end-to-end, with provenance, using real data. Doesn't need to be fast or pretty — needs to work.

**What we have today**: LLM extraction (vario/), web fetching (lib/ingest/), financial data (finance/), company intel (intel/). No graph DB, no pgmpy models, no query layer.

**MVP scope**: One domain (investment thesis), one question type (conditional probability), hand-built causal DAG, real extracted facts.

| Step | What | Output | Effort |
|------|------|--------|--------|
| 1. **Extract** | LLM reads 3–5 earnings call transcripts, outputs structured facts + causal claims in JSON | `facts.json` with ~50 facts, ~10 causal edges | 1 day |
| 2. **Build DAG** | Hand-code a pgmpy BayesianNetwork from the extracted causal claims. ~10 nodes, ~12 edges. | `model.py` with working pgmpy model | 1 day |
| 3. **Query** | Run `model.query()` for the TSMC→cloud margins question. Compare to LLM-only answer. | Side-by-side comparison: LLM vs pgmpy | 0.5 day |
| 4. **Provenance** | Each pgmpy node links back to the source fact + document | Traceable answer | 0.5 day |
| 5. **GenSQL parser** | Minimal parser: `PROBABILITY OF ... GIVEN ...` → pgmpy `model.query()` | Working query language (subset) | 1 day |
| 6. **NL translation** | LLM translates English question → GenSQL query → execute → render English answer | End-to-end demo | 1 day |

**Total**: ~5 days to a working demo that answers "If TSMC raises capex 20%, what happens to cloud margins?" with a provenance chain back to earnings call transcripts.

**What this proves**: That the pipeline works end-to-end. That structured inference produces different (better?) answers than LLM-only. That provenance is achievable.

**What it doesn't prove**: That extraction scales. That the query language is expressive enough. That calibration works. Those are Plan 2+.

### Plan 2: Improving Fact Extraction

**The problem**: The MVP hand-codes the causal DAG. To scale, we need LLM extraction to be reliable enough to auto-populate the graph. That means we need to measure extraction quality against gold standards, then iterate.

**Extraction taxonomy** (what the LLM needs to output per source):

```yaml
- type: fact          # fact | correlation | causal | inference
  subject: TSMC
  predicate: capex_2025
  object: "$38B"
  confidence: 0.99
  source: "TSMC Q4 2025 earnings transcript"
  span: "We plan to invest approximately $38 billion..."

- type: causal
  cause: tsmc_capex_increase
  effect: fab_capacity_expansion
  direction: positive
  lag: "18-24 months"
  confidence: 0.85
  source: "Morgan Stanley semiconductor note, Jan 2026"
  mechanism: "Higher capex funds new N3/N2 fab construction"
```
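
A sketch of validating extracted claims against this taxonomy before they enter the graph — field names follow the YAML above; the required-field sets are assumptions, not a finished schema:

```python
VALID_TYPES = {"fact", "correlation", "causal", "inference"}

REQUIRED_BY_TYPE = {
    "fact":   {"subject", "predicate", "object"},
    "causal": {"cause", "effect", "direction"},
}

def validate_claim(claim: dict) -> list[str]:
    """Return a list of problems with an extracted claim (empty = valid)."""
    problems = []
    ctype = claim.get("type")
    if ctype not in VALID_TYPES:
        problems.append(f"unknown type: {ctype!r}")
    for key in REQUIRED_BY_TYPE.get(ctype, set()):
        if key not in claim:
            problems.append(f"missing required field: {key}")
    conf = claim.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        problems.append("confidence must be a number in [0, 1]")
    if "source" not in claim:
        problems.append("every claim needs provenance (source)")
    return problems

good = {"type": "causal", "cause": "tsmc_capex_increase",
        "effect": "fab_capacity_expansion", "direction": "positive",
        "confidence": 0.85, "source": "Morgan Stanley semiconductor note"}
bad = {"type": "causal", "cause": "x", "confidence": 1.3}

print(validate_claim(good))  # []
print(validate_claim(bad))   # missing effect/direction, bad confidence, no source
```

Rejected claims don't enter the graph; they go back through extraction with the problem list as feedback.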

**Gold standard datasets for tuning extraction**:

Causal relation extraction:

| Dataset | Size | Domain | Use for us |
|---------|------|--------|-----------|
| **SemEval 2010 Task 8** | 10,717 sentences, 9 relation types incl. Cause-Effect | General | Sentence-level causal vs non-causal classification |
| **BECauSE v2.1** | 629 sentences, rich causal cue annotation | News/wiki | Few-shot prompt examples for causal claim extraction |
| **EventStoryLine v1.5** | 2,608 event pairs across 258 docs | News | Document-level cross-sentence causality |
| **ADE Corpus v2** | 6,821 drug→condition causal relations | Biomedical | Domain-specific causal extraction (transferable patterns) |
| **PDTB 3.0** | 7,991 causal instances from 50K discourse relations | WSJ | Gold standard for discourse-level causality (LDC license) |

Knowledge graph / relation extraction:

| Dataset | Size | Domain | Use for us |
|---------|------|--------|-----------|
| **DocRED** | 5,053 docs, 132K entities, 56K relations | Wikipedia | **Primary**: document-level RE, 40%+ require cross-sentence reasoning |
| **Re-TACRED** | 106,264 sentences, 41 relation types (label-corrected) | News | Sentence-level RE baseline |
| **REBEL** | ~1.5GB, 220 relation types (Wikipedia→Wikidata) | General | Large-scale seq2seq extraction pretraining |
| **Text2KGBench** | 13,474 + 4,860 sentences with ontology schemas | Wiki/DBpedia | LLM-based KG construction with ontology compliance |

Financial domain (most directly relevant):

| Dataset | Size | Domain | Use for us |
|---------|------|--------|-----------|
| **FinCausal 2020–2023** | ~2–3K sentences per edition | Financial news + SEC | **Top priority**: causal extraction in finance. Up to 95.5 F1 achieved |
| **REFinD** | 28,676 instances, 22 relations, 8 entity types | SEC 10-X filings | Largest financial RE dataset. Structured facts from filings |
| **HiFi-KPI** (2025) | 41,211 quarterly + 14,188 annual reports | 10-K/10-Q filings | Very fresh. KPI extraction — metrics, values, temporal context |
| **FinReflectKG** (2024) | S&P 100 companies, 2024 10-K filings | Annual filings | Agentic KG construction benchmark — directly tests our pipeline |

Causal reasoning (evaluating inference quality, not extraction):

| Dataset | Size | Domain | Use for us |
|---------|------|--------|-----------|
| **CausalProbe-2024** | 3 difficulty levels, post-Jan-2024 text | General | **No contamination risk** — evaluates current LLMs fairly |
| **e-CARE** | 21,000 questions with causal explanations | General | Tests causal choice + explanation quality |
| **CausalBench** (2024) | Multi-dimensional (4 perspectives per problem) | Multi-domain | Cause→effect, effect→cause, + interventional variants |
| **EconCausal** (2025) | Economics-specific causal questions | Economics | Domain-specific causal reasoning — close to our use case |

Uncertainty/confidence (thin category — gap in the field):

| Dataset | Size | Notes |
|---------|------|-------|
| **FactBank** | 9,500 events, 208 docs | Assigns factuality values (certain/probable/possible) — closest to probabilistic claim extraction |
| **UW Factuality** | ~35K predicates, -3 to +3 scale | Continuous confidence. Small but directly relevant |
| **CoNLL-2010** | 14,541 + 11,110 sentences | Hedge/speculation detection — useful as extraction component |

**Gap**: No open dataset combines claim extraction + numeric confidence. FactBank is closest but old (2009). Financial causal + uncertainty is unserved — we'll likely need to build a small (200-example) gold set from our own sources.

**Extraction improvement plan**:

| Phase | What | Metric |
|-------|------|--------|
| **Baseline** | Run current LLM extraction (zero-shot) on FinCausal + BECauSE. Measure precision/recall for causal claim extraction. | P/R/F1 on causal edge detection |
| **Prompt engineering** | Iterate on extraction prompts using FinCausal as dev set. Add few-shot examples from financial domain. | F1 improvement on held-out test |
| **Confidence calibration** | Compare LLM-assigned confidence to ground truth. Plot calibration curve. | Expected Calibration Error (ECE) |
| **Structure validation** | Extract causal graph from a set of documents, compare to expert-annotated graph (if available) or to pgmpy structure learning output. | Edge overlap (Jaccard), SHD (structural Hamming distance) |
| **Domain tuning** | Build a small (200-example) gold set from our own financial sources. Use for few-shot and evaluation. | Domain-specific F1 |

**Key insight**: We don't need to train a model. We need to measure extraction quality, iterate on prompts, and calibrate confidence. LLMs are already capable extractors — the open questions are how good, and where the failure modes are.
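The calibration and structure-validation metrics from the table above fit in a few lines. A minimal sketch (function names, bin count, and toy inputs are our own choices, not an existing API):

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """ECE: bin predictions by confidence, then average |accuracy - mean
    confidence| per bin, weighted by bin size. 0.0 = perfectly calibrated."""
    bins = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0 into last bin
        bins[idx].append((c, o))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(o for _, o in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece


def edge_jaccard(extracted, reference):
    """Edge overlap between two causal graphs, edges as (cause, effect) tuples."""
    e, r = set(extracted), set(reference)
    return len(e & r) / len(e | r) if e | r else 1.0
```

An extractor that says 0.9 on every claim but is right only half the time scores ECE 0.4; that gap, not raw F1, is what the calibration phase targets.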

## Related Work

Nobody has built the full pipeline (extract → accumulate → query with calibrated probabilities). But pieces exist:

### Closest to Our Vision

| Project / Paper | What It Does | Gap |
|-----------------|-------------|-----|
| **GenSQL** (Mansinghka, MIT, PLDI 2024) | SQL + probabilistic primitives over generative table models | Clojure-only, tabular-only, dormant since mid-2024. Query semantics are right; implementation is wrong for us. **Our design target.** |
| **Causal-Copilot** (Huang et al.) | Autonomous causal analysis agent | Single-session analysis, not persistent accumulation |
| **PyWhy-LLM** (Sharma/Kiciman, Microsoft) | LLM capabilities for causal graph specification | Experimental; human-guided, not automated extraction |
| **UAG** ([arXiv:2410.08985](https://arxiv.org/abs/2410.08985)) | Uncertainty quantification + conformal prediction in KG-LLM reasoning | Focuses on retrieval error control, not causal inference |
| **Textual Bayes** ([arXiv:2506.10060](https://arxiv.org/abs/2506.10060)) | Bayesian framework treating LLM prompts as parameters for calibration | Statistical foundation for calibration but no KG or causal layer |

### Adjacent Systems

| System | Relationship |
|--------|-------------|
| **Metaculus** ([forecasting-tools](https://github.com/Metaculus/forecasting-tools)) | Forecasting platform with crowd calibration — but human-authored rationales, no automated extraction |
| **OpenSPG** ([github](https://github.com/OpenSPG/openspg)) | Ant Group's industrial KG engine with logic fusion — deterministic, no probabilistic inference |
| **KQA Pro** ([ACL 2022](https://aclanthology.org/2022.acl-long.422/)) | Compositional QA over knowledge bases — deterministic symbolic reasoning, no uncertainty |
| **CausalNLP** ([reading list](https://github.com/zhijing-jin/CausalNLP_Papers)) | Curated papers on causality + NLP — literature, not a system |

### What's Missing (Our Wedge)

The field has pieces: causal extraction from text (surveys exist), KG construction (mature), compositional QA (benchmarked), uncertainty estimation in LLMs (active research), and GenSQL proved the query semantics (PLDI 2024). But no one has **connected these into a single pipeline** where facts flow from text → causal DAG → GenSQL-style queries → calibrated answers with provenance. GenSQL got closest but is Clojure-only, tabular-only, and dormant. We build the Python version on causal graphs.

## Key People & Events

### At MIT — Start Here

| Person                  | Role                                                                    |
|-------------------------|-------------------------------------------------------------------------|
| **Vikash Mansinghka**   | Leads MIT Probabilistic Computing Project. Built Gen.jl (Julia), GenSQL (Clojure, PLDI 2024), BayesDB (predecessor). GenSQL's query semantics are our design target. [probcomp.csail.mit.edu](https://probcomp.csail.mit.edu/) |
| **Alexander Lew**       | Mansinghka's group — automatic integration/differentiation of probabilistic programs |
| **Caroline Uhler**      | MIT — causal inference + ML + biology intersection                      |
| **David Sontag**        | MIT CSAIL — causal inference applied to healthcare                      |
| **Victor Chernozhukov** | MIT Economics — ML methods for causal inference (double/debiased ML)    |

### Theoretical Foundations

| Person                  | Role                                                                    |
|-------------------------|-------------------------------------------------------------------------|
| **Judea Pearl**         | UCLA. Godfather. Invented do-calculus, structural causal models, Pearl Causal Hierarchy |
| **Elias Bareinboim**    | Columbia. Pearl's student. AAAI Fellow. Leads [causalai.net](https://causalai.net/) — Causal Artificial Intelligence lab |
| **Bernhard Schölkopf**  | MPI Tübingen. Pioneered the "causality for ML" perspective              |

### LLM + Causality Hybrid (Most Relevant)

| Person                  | Role                                                                    |
|-------------------------|-------------------------------------------------------------------------|
| **Amit Sharma**         | Microsoft Research India. Co-created DoWhy. Leading LLM+causality work. [amitsharma.in](https://amitsharma.in/) |
| **Emre Kiciman**        | Microsoft Research. Co-leader of DoWhy + LLM+causality. Co-authored "[Causal Reasoning and LLMs: Opening a New Frontier](https://par.nsf.gov/biblio/10574854)" |
| **Biwei Huang**         | Co-authored Causal-Copilot — autonomous causal analysis agent           |
| **Kun Zhang**           | CMU. Causal discovery algorithms. Works with Peter Spirtes              |

### Industry

| Entity                  | Role                                                                    |
|-------------------------|-------------------------------------------------------------------------|
| **causaLens**           | London startup (Darko Matovski). Most prominent causal AI company. Agentic AI platform. $45M raised |
| **PyMC Labs**           | Commercial arm of PyMC. Bayesian modeling consulting                    |
| **PyWhy ecosystem**     | Open-source causal ML ecosystem. Houses DoWhy, EconML, PyWhy-LLM. [github.com/py-why](https://github.com/py-why) |

### Events

- **CLeaR 2026** (Causal Learning and Reasoning) — Cambridge, MA, MIT/Harvard area. [cclear.cc](https://www.cclear.cc/)
- **Causal AI Conference** — causaLens-organized, London

---

## Appendix: Worked Examples

These walk through how queries flow from English through the system. Each shows: the question, what gets extracted from sources, the causal graph that forms, the formal inference, and the answer.

### Example 1: Investment Thesis — TSMC Capex → Cloud Margins

**Question**: "If TSMC raises capex 20% and AI chip demand holds, what happens to cloud compute margins over the next two years?"

**Extracted facts** (from earnings calls, analyst reports, news):
- `TSMC_capex_2025 = $38B` (TSMC Q4 2025 earnings, confidence: 0.99)
- `TSMC_capex_growth → fab_capacity_expansion` (analyst consensus, 0.90)
- `fab_capacity_expansion → chip_supply_increase` (lagged 18-24 months, 0.85)
- `AI_chip_demand_growth = 35% YoY` (multiple sources, 0.80)
- `chip_supply_increase ∧ demand_holds → unit_price_pressure` (economic reasoning, 0.75)
- `cloud_providers_pass_through_rate ≈ 0.6` (historical pattern, 0.70)

**Causal graph**:
```
TSMC_capex ──→ fab_capacity ──→ chip_supply ──┐
                                               ├──→ chip_unit_price ──→ cloud_compute_cost ──→ cloud_margin
AI_chip_demand ───────────────────────────────┘
```

**Formal query**:
```sql
PROBABILITY OF cloud_margin_delta > 2pp
  GIVEN tsmc_capex_change = +0.20, ai_demand = 'sustained'
  FROM causal_graph
```

**Answer**: "Cloud compute gross margins likely expand 2-4 percentage points by 2027 (68% CI: 1.5–5.5pp). The main mechanism: TSMC's capex increase leads to ~15% more advanced fab capacity by late 2026, which — with sustained demand — means higher utilization without proportional cost increases for hyperscalers. The uncertainty is dominated by the demand variable: if AI workload growth slows below 20% YoY, the margin expansion narrows to 0–2pp. Sources: TSMC Q4 2025 transcript, Morgan Stanley semiconductor note (Jan 2026), AWS re:Invent pricing signals."
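In production this graph goes to pgmpy, but at this size the inference fits in a sketch. A toy exact enumeration over the capex → capacity → supply chain, where every conditional probability is illustrative (loosely echoing the edge confidences above, not estimated from data):

```python
from itertools import product

# Illustrative CPTs. Each function returns P(child state | parent states).
def p_capacity(cap_up, capex_up):
    """P(fab capacity expands | capex raised 20%)."""
    p = 0.90 if capex_up else 0.20
    return p if cap_up else 1 - p

def p_supply(sup_up, cap_up):
    """P(chip supply rises | capacity expanded), 18-24 month lag folded in."""
    p = 0.85 if cap_up else 0.10
    return p if sup_up else 1 - p

def p_margin(margin_up, sup_up, demand_holds):
    """P(cloud margin expands > 2pp | supply, demand)."""
    p = 0.75 if (sup_up and demand_holds) else 0.15
    return p if margin_up else 1 - p

def query_margin(capex_up=True, demand_holds=True):
    """P(cloud_margin_delta > 2pp) by summing out the latent chain."""
    total = 0.0
    for cap, sup in product([True, False], repeat=2):
        total += (p_capacity(cap, capex_up)
                  * p_supply(sup, cap)
                  * p_margin(True, sup, demand_holds))
    return total
```

With these placeholder numbers the query returns about 0.6; the real system's credible interval comes from sampling over CPT uncertainty, not from a single point estimate like this.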

### Example 2: Supply Chain Exposure — Tariffs → Manufacturer Risk

**Question**: "Given the latest tariff announcements and shipping cost data, which US manufacturers are most exposed?"

**Extracted facts** (from trade policy news, SEC filings, shipping indices):
- `US_tariff_china_electronics = 35%` (USTR announcement, 0.95)
- `US_tariff_china_industrial = 25%` (0.95)
- `Freightos_Baltic_Index = $2,400/FEU` (up 40% QoQ, 0.99)
- `Company_X_china_sourcing_pct = 0.62` (10-K filing, 0.95)
- `Company_Y_china_sourcing_pct = 0.15` (10-K filing, 0.95)
- `Company_X_gross_margin = 0.28` (thin margin, 0.95)
- `high_china_sourcing ∧ thin_margin ∧ tariff_increase → margin_compression` (0.90)
- `shipping_cost_increase → COGS_increase` (pass-through ~0.8, 0.85)

**Causal graph**:
```
tariff_rate ─────────┐
china_sourcing_pct ──┼──→ input_cost_delta ──→ margin_impact ──→ exposure_score
shipping_cost ───────┘          │
gross_margin ───────────────────┘
```

**Formal query**:
```sql
SIMULATE exposure_score, margin_impact
  GIVEN tariff_electronics = 0.35, shipping_index_delta = +0.40
  ASSUMING DO(tariff_industrial = 0.25)
  FROM causal_graph
  GROUP BY company
  ORDER BY exposure_score DESC
```

**Answer**: "Highest exposure: Company X (exposure score 0.87 — 62% China-sourced inputs, 28% gross margin, directly in tariffed categories). Estimated margin hit: 4–7pp without repricing. Company Y is relatively insulated (score 0.23 — 15% China sourcing, 45% gross margin). The shipping cost spike adds ~1.5pp COGS pressure across the board but is secondary to tariff impact for high-China-sourced firms. Provenance: USTR Federal Register notice (Feb 2026), Company X 10-K §7A, Freightos index (Feb 24 2026)."
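The SIMULATE query above implies a per-company scoring function over the extracted facts. A toy version, where the functional form, the pass-through default, and the clamping are invented for illustration and do not reproduce the exact scores in the answer:

```python
def exposure_score(china_sourcing_pct, gross_margin, tariff_rate,
                   shipping_cogs_delta, pass_through=0.8):
    """Toy exposure score in [0, 1]: tariff plus passed-through shipping
    costs hit the China-sourced share of inputs, scaled by how little
    margin cushion the firm has. Weights are illustrative."""
    input_cost_delta = china_sourcing_pct * (
        tariff_rate + pass_through * shipping_cogs_delta)
    return min(input_cost_delta / max(gross_margin, 1e-6), 1.0)

# Facts from the filings above; ~1.5pp shipping COGS pressure assumed
company_x = exposure_score(0.62, 0.28, 0.35, 0.015)
company_y = exposure_score(0.15, 0.45, 0.35, 0.015)
```

Even this crude form reproduces the ranking: high sourcing share and thin margin compound, so Company X scores several times higher than Company Y.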

### Example 3: Macro — Recession Probability

**Question**: "Based on current Fed signaling, employment data, and yield curve, what's the probability of a US recession by Q4 2026?"

**Extracted facts** (from FOMC minutes, BLS data, Treasury rates):
- `fed_funds_rate = 4.25%` (FOMC, 0.99)
- `fed_dot_plot_median_2026 = 3.75%` (3 cuts signaled, 0.85)
- `unemployment_rate = 4.3%` (BLS Jan 2026, 0.99)
- `unemployment_trend = +0.4pp over 6 months` (approaching the Sahm rule's 0.5pp threshold, 0.95)
- `yield_curve_10y_2y = +0.15%` (recently uninverted, 0.99)
- `yield_curve_uninversion → recession_lag_6_18mo` (historical pattern, 0.70)
- `sahm_rule_triggered → recession_probability_elevated` (historical base rate ~0.85 when triggered, 0.80)
- `fed_cutting ∧ labor_weakening → demand_contraction` (0.65)

**Causal graph**:
```
fed_rate_path ────────────────┐
unemployment_trend ───────────┼──→ demand_trajectory ──→ recession_probability
yield_curve_signal ───────────┤
consumer_spending_trend ──────┘
```

**Formal query**:
```sql
PROBABILITY OF recession = TRUE
  GIVEN sahm_triggered = TRUE, curve_uninverted_months = 4, fed_cutting = TRUE
  FROM causal_graph
  WITH CONFIDENCE INTERVAL 0.90
```

**Answer**: "P(recession by Q4 2026) ≈ 0.42 (90% CI: 0.28–0.58). The Sahm rule trigger is the strongest single signal (historical base rate ~85% when triggered, but this cycle's trigger is marginal — 0.4pp vs the 0.5pp threshold, so we discount to ~60%). The yield curve uninversion 4 months ago is consistent with the 6–18 month recession lag window. Offsetting factors: Fed is actively cutting (3 cuts priced in), and consumer balance sheets remain relatively strong. The wide confidence interval reflects genuine uncertainty — the model's provenance shows 14 contributing facts from 8 sources, with the unemployment trajectory and consumer spending data pulling in opposite directions."
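One simple way to combine a base rate with several weak signals, as this example does, is naive-Bayes-style log-odds pooling. The likelihood ratios below are illustrative stand-ins (a marginal Sahm trigger, the uninverted curve, a mild Fed-easing offset), and the independence assumption behind the pooling is itself a modeling choice:

```python
import math

def pool_signals(base_rate, likelihood_ratios):
    """Combine a prior base rate with independent evidence in log-odds space.
    Each likelihood ratio > 1 pushes toward the event, < 1 pushes away."""
    log_odds = math.log(base_rate / (1 - base_rate))
    for lr in likelihood_ratios:
        log_odds += math.log(lr)
    return 1 / (1 + math.exp(-log_odds))

# Illustrative: ~15% unconditional 12-month recession base rate;
# marginal Sahm trigger (strong), recent uninversion (moderate),
# active Fed easing (mild offset)
p_recession = pool_signals(0.15, [3.0, 1.6, 0.85])
```

With these placeholder ratios the pooled estimate lands near the headline 0.42; the point is the mechanism (evidence multiplies odds, not probabilities), not the specific numbers.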

### Example 4: Geopolitical — Conflict Escalation

**Question**: "What's the probability that the South China Sea dispute escalates to a direct military confrontation between the US and China within 12 months?"

**Extracted facts** (from news, policy analysis, military tracking):
- `PLA_Navy_exercises_SCS_2026 = 12` (3x 2024 baseline, 0.90)
- `US_FONOPs_SCS_2026 = 8` (up from 5 in 2024, 0.90)
- `Philippines_mutual_defense_treaty_invocation_rhetoric = elevated` (0.80)
- `US_China_mil_to_mil_communication_channel = active` (restored 2024, 0.85)
- `economic_interdependence_trade_volume = $580B` (still high, 0.95)
- `historical_SCS_incidents → confrontation_rate ≈ 0.02` (low base rate, 0.75)
- `mil_communication_active → de_escalation_mechanism` (reduces confrontation risk ~40%, 0.70)
- `high_exercise_tempo → incident_probability_increase` (0.65)

**Causal graph**:
```
military_activity_tempo ──────┐
diplomatic_channel_status ────┼──→ incident_probability ──→ escalation_probability ──→ confrontation
economic_interdependence ─────┤                                    │
domestic_political_pressure ──┘                                    │
historical_base_rate ──────────────────────────────────────────────┘
```

**Formal query**:
```sql
PROBABILITY OF confrontation = TRUE
  GIVEN exercise_tempo = 'high', mil_comm = 'active', trade_volume = 'high'
  FROM causal_graph
  WITH CONFIDENCE INTERVAL 0.90, PROVENANCE
```

**Answer**: "P(direct military confrontation, 12 months) ≈ 0.04 (90% CI: 0.01–0.09). Despite elevated military activity (3x exercise tempo), the base rate for SCS incidents escalating to confrontation is very low (~2%), and active mil-to-mil communication channels reduce it further. The $580B bilateral trade volume creates strong deterrence on both sides. The model identifies domestic political pressure as the highest-uncertainty variable — a Taiwan-related crisis or election-year posturing could shift this estimate sharply. Provenance: 23 facts from CSIS Asia Maritime Initiative, DoD freedom-of-navigation reports, Reuters diplomatic coverage. Note: this estimate has wide tails — the 95th percentile reaches 0.14 if mil-to-mil channels break down."
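The structure of this estimate is a base rate adjusted by multiplicative risk factors. A sketch with the document's numbers plugged in as illustrative parameters (the multipliers are readings of the extracted facts, not fitted values):

```python
def confrontation_probability(base_rate=0.02, tempo_multiplier=3.0,
                              mil_comm_reduction=0.40):
    """Anchor on the historical incident-to-confrontation base rate,
    scale up for elevated exercise tempo, scale down for an active
    mil-to-mil de-escalation channel. All parameters illustrative."""
    return base_rate * tempo_multiplier * (1 - mil_comm_reduction)
```

This lands near the headline 0.04, and makes the tail scenario explicit: dropping `mil_comm_reduction` to zero (channel breakdown) is exactly the shift that pushes the estimate toward the 95th-percentile figure.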
