# Semantic Net — Design Document

**Date**: 2026-02-23
**Status**: Draft
**Location**: `lib/semnet/`

## Problem

We have 13K+ investment writeups (VIC) and want to:
1. Extract and classify the atomic claims/arguments within each
2. Browse similar items via embedding-based retrieval with optional hard filters
3. Read any single item with its claims decomposed and annotated
4. Let the taxonomy evolve from the data, not be imposed top-down

This same capability applies to other corpora (Healthy Gamer YouTube, earnings calls, research papers, brain/strategies experiments). We need a reusable library.

## Consumers

| Consumer | Content | Status |
|----------|---------|--------|
| `projects/vic/` | 13K+ VIC investment writeups | Primary, drives v1 |
| `projects/healthygamer/` | Healthy Gamer YouTube transcripts | Planned |
| `brain/strategies/` | Strategy experiment results, Vario outputs | Planned — brain also provides Vario for taxonomy refinement |
| `finance/` | Earnings call transcripts | Future |

Note: brain/strategies is both a **consumer** (searching across experiments) and a **provider** (Vario powers taxonomy refinement).

## Goals

1. **Semantic search**: Given text or a collection of claims, find the most similar items via embeddings + optional taxonomy/tag hard filters
2. **Structured extraction**: Decompose each document into atomic claims with taxonomy labels, freeform tags, confidence scores
3. **Evolving taxonomy**: Seed taxonomy bootstraps extraction; periodic clustering discovers new categories from the data
4. **Dual viewing**: Summary cards for scanning across corpus; annotated detail view for deep reading
5. **Domain-agnostic core**: VIC writeups and Healthy Gamer content are adapters, not special cases

## Non-Goals (Phase 1)

- Full knowledge graph with entity relationships
- Real-time streaming ingestion
- Multi-tenant / auth
- Production web deployment (local-first)

## Architecture

```
┌──────────────────────────────────────────────────────────┐
│                    lib/semnet/                           │
│                                                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │ extract  │  │ embed    │  │ store    │  │ evolve   │  │
│  │          │  │          │  │          │  │          │  │
│  │ Claims   │  │ 3-level  │  │ Qdrant + │  │ HDBSCAN  │  │
│  │ from docs│  │ vectors  │  │ SQLite   │  │ + LLM    │  │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘  │
│                                                          │
│  ┌──────────┐  ┌──────────┐                              │
│  │ query    │  │ present  │                              │
│  │          │  │          │                              │
│  │ Hybrid   │  │ Cards +  │                              │
│  │ retrieval│  │ Atlas    │                              │
│  └──────────┘  └──────────┘                              │
└──────────────────────────────────────────────────────────┘
        ▲                  ▲
        │                  │
┌───────┴──────┐  ┌───────┴──────┐
│ vic adapter  │  │ hg adapter   │
│              │  │              │
│ Extraction   │  │ Extraction   │
│ prompt +     │  │ prompt +     │
│ seed taxonomy│  │ seed taxonomy│
└──────────────┘  └──────────────┘
```

### Modules

| Module | Responsibility |
|--------|---------------|
| `extract` | LLM-based claim extraction from documents. Domain adapter provides: extraction prompt, seed taxonomy, content preprocessing |
| `embed` | 3-level embedding: document summary, contextualized paragraphs, atomic claims. Uses OpenAI text-embedding-3-small by default; domain-specific models (e.g., voyage-finance-2) are configurable |
| `store` | Dual storage: Qdrant (vectors + payloads) + SQLite (structured data). Schema management, collection setup |
| `evolve` | Taxonomy evolution: HDBSCAN clustering over claim embeddings, LLM labeling of new clusters, taxonomy health checks |
| `query` | Hybrid retrieval: semantic search, faceted search, "more like this". Wraps Qdrant's Query API with RRF fusion |
| `present` | Summary card generation, Apple Embedding Atlas integration, HTML output |

### Domain Adapters

Each domain provides:
```python
class DomainAdapter:
    name: str                          # "vic", "healthygamer"
    extraction_prompt: str             # LLM prompt for claim extraction
    seed_taxonomy: dict                # Initial taxonomy tree
    content_preprocessor: Callable     # Raw content → clean text
    metadata_schema: dict              # Domain-specific metadata fields
    embedding_model: str               # Default: "openai/text-embedding-3-small"
    domain_embedding_model: str | None  # Optional domain-specific (e.g., "voyage-finance-2" for VIC)
```
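A minimal runnable sketch of the adapter as a dataclass, with an illustrative VIC instance. The prompt, taxonomy, and preprocessor are abbreviated placeholders, and the text-embedding-3-small default matches the 1536-dim Qdrant config used elsewhere in this doc:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class DomainAdapter:
    name: str
    extraction_prompt: str
    seed_taxonomy: dict
    content_preprocessor: Callable[[str], str]
    metadata_schema: dict = field(default_factory=dict)
    embedding_model: str = "openai/text-embedding-3-small"
    domain_embedding_model: Optional[str] = None

# Illustrative VIC adapter -- prompt and taxonomy heavily abbreviated
vic = DomainAdapter(
    name="vic",
    extraction_prompt="Given this investment writeup, extract all distinct atomic claims...",
    seed_taxonomy={"valuation": ["sotp", "multiples"], "competitive_position": ["moat_dominance"]},
    content_preprocessor=lambda raw: raw.strip(),
    metadata_schema={"symbol": "str", "sector": "str", "trade_dir": "str"},
    domain_embedding_model="voyage-finance-2",
)
```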

## Data Model

### SQLite (structured data, in domain's data dir)

```sql
-- Documents (domain-specific table, e.g., ideas for VIC)
-- Already exists; semnet references documents by doc_id

-- Extracted claims (one per atomic argument)
claims (
    claim_id        INTEGER PRIMARY KEY AUTOINCREMENT,
    doc_id          TEXT NOT NULL,          -- references domain's document ID
    claim_text      TEXT NOT NULL,          -- atomic claim, 1-2 sentences
    evidence        TEXT,                   -- supporting excerpt from document
    category        TEXT,                   -- top-level taxonomy node
    sub_type        TEXT,                   -- sub-type within category
    tags            TEXT,                   -- JSON array of freeform tags
    confidence      REAL,                   -- extraction confidence 0-1
    direction       TEXT,                   -- "bullish"/"bearish" or domain equivalent
    pass_number     INTEGER DEFAULT 1,      -- which extraction pass produced this
    model           TEXT,                   -- which LLM extracted this
    created_at      TEXT DEFAULT (datetime('now'))
)

-- Taxonomy nodes (evolving)
taxonomy_nodes (
    node_id         TEXT PRIMARY KEY,       -- e.g., "valuation.sotp"
    parent_id       TEXT,                   -- NULL for top-level
    name            TEXT NOT NULL,
    description     TEXT,
    decision_rule   TEXT,                   -- one-line classification rule
    source          TEXT,                   -- "seed_v1", "cluster_discovered", "manual"
    examples        TEXT,                   -- JSON array of example claim_ids
    claim_count     INTEGER DEFAULT 0,
    created_at      TEXT DEFAULT (datetime('now')),
    retired_at      TEXT                    -- NULL if active
)

-- Evolution log
taxonomy_changes (
    change_id       INTEGER PRIMARY KEY AUTOINCREMENT,
    change_type     TEXT,                   -- "add_node", "merge_nodes", "rename", "retire"
    node_id         TEXT,
    details         TEXT,                   -- JSON with before/after
    cluster_id      TEXT,                   -- HDBSCAN cluster that triggered this
    approved        BOOLEAN DEFAULT FALSE,  -- human-in-the-loop approval
    created_at      TEXT DEFAULT (datetime('now'))
)

-- Holdout tracking
holdout_set (
    doc_id          TEXT PRIMARY KEY,
    holdout_group   TEXT                    -- "validation", "test"
)
```
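The schema translates directly to executable DDL. A sketch of how `schema.py` might initialize the database with the stdlib `sqlite3` module (the two operational tables shown; `taxonomy_changes` and `holdout_set` follow the same pattern, and the index names are illustrative):

```python
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS claims (
    claim_id    INTEGER PRIMARY KEY AUTOINCREMENT,
    doc_id      TEXT NOT NULL,
    claim_text  TEXT NOT NULL,
    evidence    TEXT,
    category    TEXT,
    sub_type    TEXT,
    tags        TEXT,
    confidence  REAL,
    direction   TEXT,
    pass_number INTEGER DEFAULT 1,
    model       TEXT,
    created_at  TEXT DEFAULT (datetime('now'))
);
CREATE TABLE IF NOT EXISTS taxonomy_nodes (
    node_id       TEXT PRIMARY KEY,
    parent_id     TEXT,
    name          TEXT NOT NULL,
    description   TEXT,
    decision_rule TEXT,
    source        TEXT,
    examples      TEXT,
    claim_count   INTEGER DEFAULT 0,
    created_at    TEXT DEFAULT (datetime('now')),
    retired_at    TEXT
);
CREATE INDEX IF NOT EXISTS idx_claims_doc ON claims(doc_id);
CREATE INDEX IF NOT EXISTS idx_claims_category ON claims(category, sub_type);
"""

def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create (or open) the semnet SQLite database and apply the schema."""
    conn = sqlite3.connect(path)
    conn.executescript(DDL)
    return conn
```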

### Qdrant (vectors + payloads)

Single collection per domain. Qdrant named vectors let OpenAI and domain-specific embeddings be stored side by side in the same point:

```python
collection_config = {
    "vectors": {
        "openai": VectorParams(size=1536, distance=COSINE),   # text-embedding-3-small via `ai embed` (default, interop)
        "domain": VectorParams(size=1024, distance=COSINE),   # optional domain-specific (voyage-finance-2 for VIC)
    },
    "sparse_vectors": {
        "bm25": SparseVectorParams(modifier=Modifier.IDF),
    },
}

# Each point's payload:
{
    "doc_id": "4040335785",
    "level": "claim",           # "document" | "paragraph" | "claim"
    "claim_id": 42,             # only for level=claim
    "text": "Olympus has 75% market share in GI endoscopes",
    "category": "competitive_position",
    "sub_type": "moat_dominance",
    "tags": ["medical_devices", "market_share", "switching_costs"],
    "direction": "bullish",
    "confidence": 0.95,
    "parent_doc_id": "4040335785",  # for linking paragraphs/claims back
    "domain": "vic",
    # Domain-specific metadata
    "symbol": "7733",
    "sector": "healthcare",
    "trade_dir": "LONG",
}
```

Payload indexes on: `level`, `category`, `sub_type`, `tags`, `domain`, `direction`, `doc_id`.

## Learning Process

### Pass 1 — Bootstrap Extraction

1. **Holdout**: Hold out a random 10% of documents, stratified by thesis_type for VIC, as a validation set
2. **Extract claims** from remaining 90% using Gemini Flash with seed taxonomy_v1
3. **Assign labels**: Each claim gets category + sub_type from taxonomy, plus freeform tags
4. **Store**: Claims to SQLite, skip embedding for now

Extraction prompt pattern (domain adapter provides):
```
Given this investment writeup, extract all distinct atomic claims.
For each claim, provide:
- claim_text: The argument in 1-2 sentences
- evidence: The relevant excerpt from the writeup (verbatim)
- category: One of {taxonomy_categories}
- sub_type: One of {taxonomy_subtypes_for_category}
- tags: 2-5 freeform descriptive tags
- confidence: 0-1 how confident you are in the classification
- direction: "bullish" or "bearish"

Taxonomy:
{seed_taxonomy}

IMPORTANT: If a claim doesn't fit any existing category well, use
category="uncategorized" and describe what it is in the tags.
```
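The prompt's per-claim fields map directly to a validation step on the model's structured JSON output. A sketch, with hypothetical helper names, of enforcing the `uncategorized` escape hatch and clamping confidence before anything reaches SQLite:

```python
import json
from dataclasses import dataclass, fields

@dataclass
class Claim:
    claim_text: str
    evidence: str
    category: str
    sub_type: str
    tags: list
    confidence: float
    direction: str

def parse_claims(llm_json: str, taxonomy: dict) -> list[Claim]:
    """Validate the model's JSON array: unknown categories fall back to
    'uncategorized' per the prompt's escape hatch; confidence is clamped to [0, 1]."""
    claims = []
    for raw in json.loads(llm_json):
        if raw.get("category") not in taxonomy and raw.get("category") != "uncategorized":
            raw["category"] = "uncategorized"
        raw["confidence"] = max(0.0, min(1.0, float(raw.get("confidence", 0.0))))
        claims.append(Claim(**{f.name: raw.get(f.name) for f in fields(Claim)}))
    return claims
```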

### Pass 2 — Embed

1. **Level 1**: LLM-generate document summary (100-200 words) → embed
2. **Level 2**: Split into ~400-token chunks, prepend context (Contextual Retrieval) → embed
3. **Level 3**: Embed each extracted claim
4. **Index** all into Qdrant with metadata payloads
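Steps 1-2 can be sketched with stdlib only. The chunker uses a word count as a rough proxy for tokens, and the Contextual Retrieval prefix is stubbed where the Haiku call would go; both function names are illustrative:

```python
def chunk_words(text: str, max_words: int = 300) -> list[str]:
    """Naive word-boundary splitter: ~400 tokens is roughly ~300 words.
    Production would use a real tokenizer and respect paragraph boundaries."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def contextualize(chunk: str, doc_summary: str, context_llm=None) -> str:
    """Prepend situating context to a chunk before embedding (Contextual Retrieval).
    In production the context sentence comes from a cheap LLM (Haiku); here it is
    stubbed with the document summary's first sentence."""
    if context_llm is not None:
        context = context_llm(doc_summary, chunk)
    else:
        context = doc_summary.split(". ")[0]
    return f"{context}\n\n{chunk}"
```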

### Pass 3 — Cluster & Discover

1. Collect all claim embeddings
2. HDBSCAN with min_cluster_size=15
3. For each cluster:
   - What taxonomy labels dominate? (>80% same label = good fit)
   - Mixed cluster = potential boundary problem or new category
   - Cluster with many "uncategorized" = definitely new category
4. LLM labels new clusters, proposes taxonomy additions
5. Log proposals to taxonomy_changes table
6. Human reviews (or auto-approve if confidence > threshold)
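The per-cluster triage in step 3 can be sketched with stdlib only, assuming HDBSCAN labels (-1 = noise) and claim categories are already in hand; the thresholds and function name are illustrative:

```python
from collections import Counter

def audit_clusters(cluster_ids, categories, dominance=0.80, uncat_frac=0.50):
    """Triage each HDBSCAN cluster: good fit, boundary problem, or new category.
    Cluster id -1 is HDBSCAN noise and is skipped."""
    by_cluster = {}
    for cid, cat in zip(cluster_ids, categories):
        if cid == -1:
            continue
        by_cluster.setdefault(cid, []).append(cat)

    verdicts = {}
    for cid, cats in by_cluster.items():
        if cats.count("uncategorized") / len(cats) >= uncat_frac:
            verdicts[cid] = "new_category"        # many uncategorized claims
        else:
            _, top_n = Counter(cats).most_common(1)[0]
            if top_n / len(cats) >= dominance:
                verdicts[cid] = "good_fit"        # >80% share one label
            else:
                verdicts[cid] = "boundary_problem"
    return verdicts
```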

### Pass 4 — Validate & Refine

1. Run extraction on holdout set with evolved taxonomy
2. Measure: inter-annotator agreement (LLM pass 1 vs pass 4)
3. If agreement > 85% → taxonomy has stabilized
4. If < 85% → iterate (update taxonomy, re-extract, re-cluster)
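A sketch of the agreement measure in step 2, assuming it is the exact-match rate of categories over shared claim_ids (a chance-corrected statistic such as Cohen's kappa would be stricter):

```python
def category_agreement(pass1: dict, pass4: dict) -> float:
    """Share of claims (keyed by claim_id) given the same category in both passes."""
    shared = pass1.keys() & pass4.keys()
    if not shared:
        return 0.0
    return sum(pass1[c] == pass4[c] for c in shared) / len(shared)
```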

### Pass 5+ — Steady State

- New documents flow through extract → embed → store
- Monthly re-clustering checks for taxonomy drift
- Periodic full re-extraction when taxonomy changes significantly

## Presentation

### Summary Cards (HTML)

Per document, show:
```
┌─────────────────────────────────────────┐
│ AMAT — Applied Materials    LONG        │
│ Technology | 2014 | Quality: 9          │
│─────────────────────────────────────────│
│ VALUATION                               │
│ • Attractive at 8x EPS, 6x EV/EBITDA    │
│   on FY17 pro forma numbers         95% │
│                                         │
│ COMPETITIVE POSITION                    │
│ • #1 market share, 2x larger than #2    │
│   with massive switching costs      98% │
│ • 3D architecture transition expands    │
│   addressable market (secular)      90% │
│                                         │
│ CATALYSTS                               │
│ • Tokyo Electron merger closes Q4,      │
│   synergies drive margins 19% → 25%  92%│
│                                         │
│ MARKET PERCEPTION                       │
│ • Deal complexity means analysts lag    │
│   on pro forma earnings power       88% │
│                                         │
│ [Find Similar] [View Original]          │
└─────────────────────────────────────────┘
```

### Constellation Browser

Apple Embedding Atlas rendered in a Gradio tab (or standalone HTML):
- Points = claims (or documents, toggle-able)
- Color = taxonomy category
- Size = confidence
- Hover = claim text + source document
- Click = navigates to summary card
- Search bar = semantic query → highlights nearest neighbors
- Filter panel = taxonomy facets + tags + direction

### Retrieval API

```python
from lib.semnet import query

# Semantic search
results = query.search("operating leverage in SaaS companies",
                       domain="vic", level="claim", limit=20)

# Faceted search
results = query.search("margin expansion",
                       domain="vic",
                       filters={"category": "operational", "direction": "bullish"},
                       level="claim", limit=20)

# More like this
results = query.similar(doc_id="4040335785", level="document", limit=10)
results = query.similar(claim_id=42, level="claim", limit=20)
```
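Inside `query.search`, the `filters` dict plus the implicit `domain`/`level` constraints have to become Qdrant filter conditions. A stdlib-only sketch producing Qdrant's REST filter shape (the helper name is hypothetical):

```python
def to_qdrant_filter(filters: dict, domain: str, level: str) -> dict:
    """Translate simple equality filters, plus the implicit domain/level
    constraints, into Qdrant's REST filter shape."""
    must = [
        {"key": "domain", "match": {"value": domain}},
        {"key": "level", "match": {"value": level}},
    ]
    for key, value in (filters or {}).items():
        if isinstance(value, list):  # any-of match, e.g. for tag lists
            must.append({"key": key, "match": {"any": value}})
        else:
            must.append({"key": key, "match": {"value": value}})
    return {"must": must}
```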

## File Layout

```
lib/semnet/
├── __init__.py          # Public API
├── extract.py           # Claim extraction pipeline
├── embed.py             # 3-level embedding (dense + BM25 sparse)
├── store.py             # Qdrant + SQLite dual storage
├── evolve.py            # HDBSCAN clustering + taxonomy evolution
├── query.py             # Hybrid retrieval (semantic + faceted)
├── present.py           # Summary cards + Atlas visualization
├── schema.py            # SQLite schema, Qdrant collection config
├── adapter.py           # DomainAdapter base class
├── models.py            # Claim, TaxonomyNode, etc. dataclasses
└── tests/
    ├── test_extract.py
    ├── test_embed.py
    ├── test_store.py
    ├── test_evolve.py
    └── test_query.py

projects/vic/
├── semanticnet_adapter.py   # VIC-specific adapter
├── taxonomy_v1.md           # Seed taxonomy (already created)
├── run_extraction.py        # CLI to run extraction passes
└── ...

# Future:
projects/healthygamer/
├── semanticnet_adapter.py   # HG-specific adapter
├── taxonomy_seed.md         # Seed taxonomy for mental health topics
└── run_extraction.py
```

## Technology Choices

| Component | Choice | Why |
|-----------|--------|-----|
| Vector DB | Qdrant (Docker, local) | Native hybrid search, named vectors, payload filtering, RRF fusion, free |
| Dense embeddings | OpenAI text-embedding-3-small via `ai embed` (default) | Already deployed in rivus LLM server (:8120), 1536-dim, interoperable across all domains. voyage-finance-2 available as domain-specific upgrade for VIC (+7% on financial tasks). Multi-model support via named vectors in Qdrant. |
| Sparse | Qdrant BM25 via FastEmbed | Built-in, real-time IDF |
| Claim extraction | Gemini Flash | Already used in enrich.py, $0.001/call, structured JSON |
| Chunk context | Contextual Retrieval (Haiku) | 35% better retrieval, $3-5 one-time |
| Visualization | Apple Embedding Atlas | MIT, WebGPU, handles millions of points, auto-clustering |
| Clustering | HDBSCAN | Standard for embedding clustering, density-based, handles noise |

## Estimated Costs

| Step | Cost |
|------|------|
| Claim extraction (13K × Gemini Flash) | ~$13 |
| Embeddings (voyage-finance-2, ~50K vectors) | ~$10-15 |
| Contextual Retrieval (Haiku, chunk context) | ~$3-5 |
| Qdrant (self-hosted Docker) | $0 |
| **Total one-time for VIC** | **~$30** |

## Prior Art

- **No complete system found**: to our knowledge, nothing combines extraction + embedding + taxonomy evolution + constellation browsing
- **YouTube semantic search**: yt-semantic-search, yt-fts, YT Navigator — embedding search only, no extraction
- **Claim extraction**: Fabric (per-doc), Elicit (academic) — no corpus-wide aggregation
- **Visualization**: Apple Embedding Atlas, WizMap, Nomic Atlas — embedding maps, no claim extraction
- **Healthy Gamer**: Only HGSearch (keyword grep over transcripts) exists. No semantic navigator built.

## Open Questions

1. Should we use `ai embed` (the existing rivus LLM server) or call voyage-finance-2 directly?
2. Qdrant Docker vs embedded mode for development?
3. How to handle claims that genuinely span 2+ categories (multi-label)?
4. Gradio integration for the constellation browser vs standalone HTML?

## Dependencies

- `qdrant-client` + `fastembed` (pip)
- `voyageai` (pip, for embeddings)
- `hdbscan` (pip)
- `embedding-atlas` (npm/pip, Apple)
- Existing: `lib/llm` (for Gemini Flash calls), brain/vario (for Vario-based taxonomy refinement)
