# Draft KB + References + Taxonomy — Design

**Date**: 2026-02-28
**Status**: Draft
**Builds on**: `2026-02-27-draft-review-design.md` (review lenses, orchestration, reports)

## Context

`draft/` does rhetorical role analysis (what each chunk of a doc *does*) and multi-lens review. Three pieces are missing:

1. **References/related work** — what else exists on this topic? Does the doc cite the right stuff?
2. **Topic KB** — a kept-fresh knowledge base of refs per topic, so analysis isn't one-shot
3. **Finer role taxonomy** — splitting "claim" into opinion vs fact vs data

These three reinforce each other: facts can be cross-referenced against the KB, claims can be matched to supporting/contradicting refs, and the KB grows richer as docs are analyzed.

---

## I. Role Taxonomy Refinement

### Current Roles (12)

claim, evidence, example, explanation, appeal, definition, concession, context, description, analogy, qualification, transition

### Proposed Split

| Old | New | What it captures |
|-----|-----|------------------|
| claim | **claim** | Debatable assertion the author advances ("well-positioned for growth") |
| *(new)* | **fact** | Verifiable statement, statistic, named event ("Revenue grew 23% YoY") |
| *(new)* | **data** | Specific numbers, dates, quantities, citations ("$4.2B in Q3 2025") |
| evidence | **evidence** | Fact/data used *in service of* a claim (structural role, not content type) |

### Why Split?

- **For the user**: "what's argued" vs "what's established" at a glance. Color-code differently in the role map.
- **For the KB**: Facts and data are cross-referenceable ("is this number correct?"). Claims are matchable to supporting/contradicting refs.
- **For review**: The logic lens can check claim-evidence alignment more precisely: "this claim has no supporting facts in the document."

### Overlap Rules

The same sentence can serve multiple roles depending on context:

| Text | Standalone | After a claim |
|------|-----------|---------------|
| "Revenue grew 23% YoY" | **fact** | **evidence** (supporting the claim) |
| "The market will grow 40% by 2028" | **claim** (prediction, not verifiable yet) | **claim** |
| "$4.2B" | **data** | **data** (evidence if contextually supporting) |

The LLM extractor assigns the **primary** role. The evidence role is additive — a fact tagged as evidence retains its fact nature. In the role map, evidence segments get a secondary badge showing they're also fact/data.
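
The additive evidence role can be modeled as a primary role plus an optional structural flag. A minimal sketch — the `Segment` shape here is illustrative, not the actual `draft/` data model:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One extracted chunk: a primary content role plus an optional
    structural evidence flag layered on top."""
    text: str
    primary_role: str                  # "claim" | "fact" | "data" | ...
    serves_as_evidence: bool = False   # additive: set when the segment
                                       # supports a nearby claim

    @property
    def badges(self) -> list[str]:
        """Roles to render in the role map: primary first, then the
        secondary evidence badge when applicable."""
        roles = [self.primary_role]
        if self.serves_as_evidence and self.primary_role != "evidence":
            roles.append("evidence")
        return roles
```

A fact tagged as evidence thus keeps its fact nature: `Segment("Revenue grew 23% YoY", "fact", serves_as_evidence=True).badges` yields both badges.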

### Implementation

- Update `draft/core/roles.py` ROLES dict: add `fact` and `data` with colors and descriptions
- Update extraction prompt to distinguish claim (opinion/argument/prediction) from fact (verifiable) from data (specific quantities)
- Test against 5+ real docs to verify the LLM reliably distinguishes them before shipping
- Role map UI: fact = blue tones, data = cyan, claim = warm tones (existing), evidence = gets secondary indicator
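
The ROLES addition might look like the sketch below. The real structure in `draft/core/roles.py` is not shown here; this assumes a role → metadata mapping, with colors matching the UI note above:

```python
# Sketch of the additions to draft/core/roles.py; the actual ROLES
# structure and color values may differ.
NEW_ROLES = {
    "fact": {
        "color": "#2563eb",  # blue tones
        "description": "Verifiable statement, statistic, or named event",
    },
    "data": {
        "color": "#0891b2",  # cyan
        "description": "Specific numbers, dates, quantities, citations",
    },
}

def register(roles: dict) -> dict:
    """Merge the new roles into an existing ROLES dict without
    clobbering roles that are already defined."""
    merged = dict(roles)
    for name, meta in NEW_ROLES.items():
        merged.setdefault(name, meta)
    return merged
```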

---

## II. Topic KB

### Purpose

A per-topic store of references (papers, articles, repos, tools) that stays fresh and grows as documents are analyzed. When reviewing a doc, the system knows what related work exists.

### Schema (`draft/data/refs.db`)

```sql
-- Topics are domain-level concepts (not per-document)
CREATE TABLE topics (
    id          INTEGER PRIMARY KEY,
    name        TEXT UNIQUE,            -- "LLM reasoning evaluation"
    description TEXT,                   -- one-paragraph scope
    embedding   BLOB,                   -- for similarity search (sqlite-vec)
    created_at  TEXT,
    updated_at  TEXT
);

-- References: papers, articles, repos, tools, datasets
CREATE TABLE refs (
    id          INTEGER PRIMARY KEY,
    url         TEXT UNIQUE,            -- canonical URL
    title       TEXT,
    authors     TEXT,                   -- comma-separated or JSON
    date        TEXT,                   -- publication date (YYYY-MM-DD or YYYY)
    ref_type    TEXT,                   -- paper | article | repo | tool | dataset | standard
    summary     TEXT,                   -- 2-3 sentence summary (LLM-generated)
    key_claims  TEXT,                   -- JSON array of main claims/findings
    cached_path TEXT,                   -- lib/ingest cache path (if fetched)
    quality     REAL,                   -- 0-1 source quality score
    created_at  TEXT,
    updated_at  TEXT
);

-- M:N: which refs are relevant to which topics
CREATE TABLE topic_refs (
    topic_id        INTEGER REFERENCES topics(id),
    ref_id          INTEGER REFERENCES refs(id),
    relevance       REAL,               -- 0-1 how relevant this ref is to the topic
    section_hint    TEXT,               -- which part of the ref is relevant ("section 3.2")
    relationship    TEXT,               -- supports | contradicts | extends | reviews | implements
    added_by        TEXT,               -- "discovery" | "analysis" | "manual"
    created_at      TEXT,
    PRIMARY KEY (topic_id, ref_id)
);

-- Per-document topic analysis (which topics does this doc touch?)
CREATE TABLE doc_topics (
    id              INTEGER PRIMARY KEY,
    doc_hash        TEXT,               -- hash of document content (dedup)
    doc_title       TEXT,
    topic_id        INTEGER REFERENCES topics(id),
    confidence      REAL,               -- 0-1 how strongly the doc relates to this topic
    extracted_claims TEXT,              -- JSON array of claims from this doc on this topic
    extracted_facts  TEXT,              -- JSON array of facts/data from this doc
    created_at      TEXT
);

-- Freshness tracking
CREATE TABLE refresh_log (
    id          INTEGER PRIMARY KEY,
    topic_id    INTEGER REFERENCES topics(id),
    strategy    TEXT,                   -- "serper" | "scholar" | "arxiv" | "manual"
    refs_found  INTEGER,
    refs_new    INTEGER,
    cost_usd    REAL,
    ts          TEXT
);
```
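
Opening the database can be a thin idempotent helper. A sketch, assuming the DDL above (with `IF NOT EXISTS`) lives in a module-level `SCHEMA` string; only the `topics` table is reproduced here for brevity, and the path/naming is illustrative:

```python
import sqlite3
from pathlib import Path

SCHEMA = """
CREATE TABLE IF NOT EXISTS topics (
    id          INTEGER PRIMARY KEY,
    name        TEXT UNIQUE,
    description TEXT,
    embedding   BLOB,
    created_at  TEXT,
    updated_at  TEXT
);
-- remaining tables (refs, topic_refs, doc_topics, refresh_log) as above
"""

def open_db(path: str = "draft/data/refs.db") -> sqlite3.Connection:
    """Open (and lazily create) the refs database."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce REFERENCES clauses
    conn.executescript(SCHEMA)                # idempotent via IF NOT EXISTS
    return conn
```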

### Topic Lifecycle

1. **Auto-extraction**: When analyzing a doc, the role extractor also identifies 2-5 topics it touches
2. **Manual creation**: User can name topics explicitly ("add topic: semiconductor supply chain resilience")
3. **Discovery**: On creation or refresh, search for related refs (Serper, Google Scholar, arxiv)
4. **Growth**: Each doc analyzed may surface new refs → added to relevant topics
5. **Refresh**: Periodic re-search to catch new publications (configurable per topic)

### Topic Extraction Prompt

Added to the role extraction pipeline (same LLM call or cheap follow-up):

```
Given this document, identify 2-5 domain-level topics it addresses.
Topics should be specific enough to be useful for finding related work,
but general enough to apply across multiple documents.

Return as JSON:
[
  {"topic": "name", "confidence": 0.0-1.0, "claims": ["..."], "facts": ["..."]}
]
```
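
The extractor's reply needs light validation before it touches the KB. A sketch of parsing the JSON contract shown in the prompt (the clamping and trimming rules are illustrative):

```python
import json

def parse_topics(raw: str, max_topics: int = 5) -> list[dict]:
    """Parse the topic extractor's JSON reply and enforce the contract:
    at most 5 topics, confidence clamped to [0, 1], claims/facts as lists."""
    cleaned = []
    for t in json.loads(raw)[:max_topics]:
        cleaned.append({
            "topic": str(t["topic"]).strip(),
            "confidence": min(1.0, max(0.0, float(t.get("confidence", 0.0)))),
            "claims": list(t.get("claims", [])),
            "facts": list(t.get("facts", [])),
        })
    return cleaned
```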

---

## III. Reference Discovery

### Discovery Strategies

| Strategy | Source | Best for |
|----------|--------|----------|
| `serper_web` | Serper API (Google) | Articles, blog posts, tools |
| `serper_scholar` | Serper scholar endpoint | Academic papers |
| `arxiv_search` | arxiv API | ML/CS preprints |
| `semantic_scholar` | S2 API | Citations, related papers |
| `manual` | User adds URL | Curated refs |
| `doc_extraction` | Refs cited in analyzed docs | Following citation chains |

### Discovery Flow

```
topic created/refreshed
    → search via strategies (parallel)
    → deduplicate by URL
    → for each candidate:
        → fetch summary page (lib/ingest, cached)
        → LLM scores relevance to topic (0-1)
        → if relevance > 0.3: add to topic_refs
        → extract key_claims from abstract/summary
    → log to refresh_log
```
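
The flow above can be sketched with the search strategies and the relevance scorer injected, so the orchestration stays testable without network calls (fetching and key-claim extraction omitted; names illustrative):

```python
def discover(topic: str, strategies, score_relevance, threshold: float = 0.3):
    """Run search strategies, dedupe candidates by URL, keep those the
    scorer rates above the relevance threshold.

    `strategies` is a list of callables topic -> [{"url", "title"}, ...];
    `score_relevance(topic, candidate)` returns a float in [0, 1].
    """
    seen, kept = set(), []
    for search in strategies:          # sequential here; parallel in practice
        for cand in search(topic):
            url = cand["url"]
            if url in seen:            # deduplicate by canonical URL
                continue
            seen.add(url)
            relevance = score_relevance(topic, cand)
            if relevance > threshold:  # drop weakly related candidates
                kept.append({**cand, "relevance": relevance})
    return kept
```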

### Cost Control

- Topic refresh is **on-demand by default** (when a doc mentions a topic, refresh if stale)
- Staleness threshold: configurable, default 7 days
- Search cost: ~$0.01 per Serper call, ~$0.005 per relevance-scoring call (Haiku)
- Full topic refresh: ~$0.10-0.50 depending on result count
- Optional: jobs handler for periodic refresh of high-priority topics

### Jobs Integration (Optional)

```yaml
# jobs/jobs.yaml
draft_refs:
  enabled: false  # manual/on-demand by default
  handler: handlers/draft_refs
  discovery:
    strategy: manual  # topics added via draft analysis
  stages:
    - name: search
      concurrency: 3
    - name: score
      concurrency: 5
  pacing:
    max_per_hour: 60
```

---

## IV. Analysis Integration

### Enhanced Review Flow

When `draft/` analyzes a document:

```
1. Load document (lib/ingest/loader.py)
2. Extract roles (existing) — now with fact/data/claim split
3. Extract topics (new) — 2-5 topics per doc
4. For each topic:
   a. Check KB freshness → refresh if stale
   b. Query matching refs (by relevance)
   c. Match doc claims against ref claims (support/contradict)
   d. Match doc facts against ref facts (verify/augment)
5. Role map (existing) — enhanced with ref annotations
6. Review lenses (existing) — enhanced with ref context
7. Refs sidebar (new) — grouped by topic, scored by relevance
```
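
The seven steps can be sketched as a single orchestration function with every stage injected; the stage names and the KB interface (`is_stale`/`refresh`/`matching_refs`) are assumptions for illustration, not the actual module APIs:

```python
def analyze(doc, extract_roles, extract_topics, kb, review, render):
    """Sketch of the enhanced review flow; dependency-free orchestration."""
    segments = extract_roles(doc)             # step 2: fact/data/claim split
    topics = extract_topics(doc)              # step 3: 2-5 topics
    refs_by_topic = {}
    for t in topics:                          # step 4: KB lookup per topic
        name = t["topic"]
        if kb.is_stale(name):                 # 4a: refresh if stale
            kb.refresh(name)
        refs_by_topic[name] = kb.matching_refs(name)  # 4b-d
    findings = review(doc, segments, refs_by_topic)   # steps 5-6
    return render(segments, findings, refs_by_topic)  # step 7: map + sidebar
```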

### Ref-Enhanced Review Lenses

The logic lens gets topic KB context:

```
## Additional Context: Related Work

The following references are relevant to this document's topics:

<topic name="LLM reasoning evaluation">
  <ref title="..." relationship="extends">
    Key claims: [...]
  </ref>
  <ref title="..." relationship="contradicts">
    Key claims: [...]
  </ref>
</topic>

When reviewing claims, note:
- Claims that align with or contradict known refs
- Important refs the document should cite but doesn't
- Facts that can be verified against ref data
```
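
Building that context block from KB rows is mechanical; a sketch, assuming each ref row carries `title`, `relationship`, and `key_claims` (the row shape is illustrative):

```python
from xml.sax.saxutils import escape, quoteattr

def build_ref_context(topics: dict) -> str:
    """Render topic -> refs into the <topic>/<ref> block injected into
    the logic lens prompt. `topics` maps a topic name to a list of
    {"title", "relationship", "key_claims"} dicts."""
    lines = []
    for name, refs in topics.items():
        lines.append(f"<topic name={quoteattr(name)}>")
        for ref in refs:
            lines.append(
                f"  <ref title={quoteattr(ref['title'])} "
                f"relationship={quoteattr(ref['relationship'])}>"
            )
            lines.append(f"    Key claims: {escape(str(ref['key_claims']))}")
            lines.append("  </ref>")
        lines.append("</topic>")
    return "\n".join(lines)
```

Escaping titles and topic names matters here: ref titles routinely contain quotes and angle brackets, and a malformed block degrades the lens prompt silently.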

### Report Additions

New section in the review report (after Prioritized Suggestions, before Per-Lens Findings):

```
┌──────────────────────────────────────────────────────┐
│ References & Related Work                            │
├──────────────────────────────────────────────────────┤
│                                                      │
│ Topics detected: 3                                   │
│                                                      │
│ 📚 LLM Reasoning Evaluation (8 refs, 3 high-rel)      │
│   ├─ ✅ Cited: Tam et al. 2025 — structured output    │
│   ├─ ✅ Cited: CriticGPT (OpenAI 2024)                │
│   ├─ ⚠️ Missing: DREAM (Feb 2025) — directly rel.    │
│   └─ 📖 3 more refs available                         │
│                                                      │
│ 📚 Multi-Agent Debate (5 refs, 2 high-rel)            │
│   ├─ ✅ Cited: FREE-MAD (Sep 2025)                    │
│   └─ ⚠️ Missing: llm-council — relevant pattern      │
│                                                      │
│ 📚 Document Review Systems (4 refs, 1 high-rel)       │
│   └─ 📖 All new refs (topic just discovered)          │
│                                                      │
│ Missing refs flagged: 2 (should cite but doesn't)    │
│ Contradicting refs: 0                                │
└──────────────────────────────────────────────────────┘
```

---

## V. Implementation Plan

### Phase 1: Taxonomy (small, ship fast)

| Step | What | Files |
|------|------|-------|
| 1 | Add `fact` and `data` roles to ROLES dict | `draft/core/roles.py` |
| 2 | Update extraction prompt to distinguish claim/fact/data | `draft/core/roles.py` |
| 3 | Update role map colors (blue/cyan for fact/data) | `draft/core/mapper.py` |
| 4 | Test against 5 real docs, tune prompt | manual |

### Phase 2: KB Schema + Basic Discovery

| Step | What | Files |
|------|------|-------|
| 1 | Create refs.db schema + migration | `draft/data/`, `draft/core/kb.py` |
| 2 | Topic CRUD (create, list, search) | `draft/core/kb.py` |
| 3 | Ref CRUD + dedup by URL | `draft/core/kb.py` |
| 4 | Serper-based discovery for a topic | `draft/core/refs.py` |
| 5 | Relevance scoring (haiku) | `draft/core/refs.py` |
| 6 | CLI: `draft topic add "name"`, `draft refs search TOPIC` | `draft/cli.py` |

### Phase 3: Analysis Integration

| Step | What | Files |
|------|------|-------|
| 1 | Topic extraction from doc analysis | `draft/core/roles.py` |
| 2 | Auto-refresh stale topics during analysis | `draft/core/refs.py` |
| 3 | Claim↔ref matching (support/contradict) | `draft/core/refs.py` |
| 4 | Refs sidebar in role map HTML | `draft/core/mapper.py` |
| 5 | Ref context injection into review lenses | `vario/review.py` |
| 6 | "References & Related Work" section in report | `vario/review_report.py` |

### Phase 4: Freshness + Jobs (optional)

| Step | What | Files |
|------|------|-------|
| 1 | Staleness check + auto-refresh | `draft/core/refs.py` |
| 2 | Jobs handler for periodic refresh | `jobs/handlers/draft_refs.py` |
| 3 | Multiple discovery strategies (scholar, arxiv, S2) | `draft/core/refs.py` |

---

## VI. Key Design Decisions

1. **Topics are global, not per-doc** — a topic like "supply chain resilience" spans many docs. Documents link to topics via `doc_topics`.

2. **Refs store summaries, not full text** — full content is cached by `lib/ingest`. The KB stores extracted claims and relevance scores for fast querying.

3. **Discovery is on-demand** — refreshes when a doc touches a stale topic. No always-on crawling unless explicitly enabled via jobs.

4. **Fact/data split is presentation, not storage** — both are stored as role segments in the same table. The split helps the user visually and helps the KB cross-reference.

5. **Relationship types are directional** — "ref X supports claim Y" vs "ref X contradicts claim Y". This powers the missing-refs and contradiction detection in the report.

6. **Section hints, not full-text search** — `topic_refs.section_hint` says "section 3.2 has the relevant data" so the user can jump directly to the useful part of a 40-page paper.
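
The directional relationship types (decision 5) surface directly in queries. A sketch of the lookup behind the report's contradicting-refs count, written against the `topic_refs`/`refs` columns defined in the schema:

```python
import sqlite3

CONTRADICTING = """
SELECT r.title, r.url
FROM topic_refs tr
JOIN refs r ON r.id = tr.ref_id
WHERE tr.topic_id = ?
  AND tr.relationship = 'contradicts'
ORDER BY tr.relevance DESC
"""

def contradicting_refs(conn: sqlite3.Connection, topic_id: int) -> list[tuple]:
    """Refs whose directional relationship to the topic is 'contradicts';
    feeds the 'Contradicting refs' line in the report."""
    return conn.execute(CONTRADICTING, (topic_id,)).fetchall()
```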

---

## VII. Reuse

| Component | From | For |
|-----------|------|-----|
| `lib/ingest` | fetch + cache | Fetching ref pages, caching content |
| `lib/ingest/loader.py` | doc loading | Loading docs for analysis |
| `lib/discovery_ops/serper` | search | Finding refs via Google/Scholar |
| `lib/llm` | LLM calls | Relevance scoring, topic extraction, claim matching |
| `/related-work` skill | discovery patterns | Reference discovery logic |
| `/cite-papers` skill | citation formatting | Proper citation in reports |
| `lib/vectors/` | sqlite-vec | Topic embedding similarity search |
| Vario review pipeline | lenses + synthesis | Enhanced with ref context |
