# Plan: Triplestore-First Architecture with NL In/Out

**Status**: Design — not yet implemented
**Date**: 2026-02-27
**Context**: Evolves the semanticnet dual-store design toward Oxigraph-first with natural language interfaces on both ends.

## What Changed

Previously: SQLite as source of truth, Oxigraph as secondary graph index.
Now: **Oxigraph-first** — claims and relationships live natively in the triplestore. SQLite stays for metadata/provenance that doesn't fit the graph model.

## The Stack

```bash
pip install pyoxigraph oxrdflib rdflib rdflib-endpoint
```

| Package | Role |
|---------|------|
| **pyoxigraph** | Rust-based embeddable triplestore. In-process, no server. SPARQL 1.1 + RDF 1.2 (triple terms). |
| **oxrdflib** | Makes pyoxigraph a drop-in rdflib store backend |
| **rdflib** | Standard Python RDF API — all ecosystem tools speak this |
| **rdflib-endpoint** | Expose any rdflib Graph as a SPARQL HTTP endpoint via FastAPI |

### What is Oxigraph?

A triplestore (graph database for RDF data) written in Rust. Embeddable via `pip install pyoxigraph` — runs in your Python process, no separate server, no JVM. Supports SPARQL queries and RDF 1.2 triple terms (claims about claims). Think "SQLite but for graph data." Benchmarks put its SPARQL engine far ahead of rdflib's pure-Python one (~38x in one comparison).

**Caveat**: On-disk storage format not yet stable between versions. For us this is fine — we rebuild from source data.

## Architecture

```
NL In (extraction):
  Document text → LLM → Pydantic claims (NL intermediate) → Turtle-star → pyoxigraph Store

NL Out (querying):
  NL question → LLM → SPARQL → pyoxigraph.query() → results → LLM → English answer

Optional:
  pyoxigraph → rdflib-endpoint → SPARQL HTTP endpoint → YASGUI web UI for browsing
```

## The Intermediate Format

### What is "Pydantic claims (NL intermediate)"?

Pydantic is a Python library for data validation. A Pydantic model defines a JSON schema that the LLM must match. The "NL intermediate" means the JSON fields contain **natural language strings** (entity names, claim sentences) rather than URIs or graph IDs, so a human can read and verify them.

The LLM outputs JSON like:
```json
{
  "subject": "First Energy",
  "predicate": "trades_at",
  "object": "11x P/E vs 19x peer average",
  "claim_text": "FE trades at 11x P/E, utility peers trade at 19x",
  "confidence": 0.95,
  "source_text": "First Energy currently trades at approximately 11x...",
  "source_id": "fe_thesis_42",
  "influence": 0.0
}
```

A thin conversion layer then resolves entity names to URIs and writes Turtle-star:
```turtle
:FirstEnergy :tradesAt "11x P/E vs 19x peer average" {|
    :confidence 0.95 ;
    :claimText "FE trades at 11x P/E, utility peers trade at 19x" ;
    :source :fe_thesis_42
|} .
```

### Why not have the LLM output Turtle/RDF directly?

LLMs botch URI syntax, prefix management, and escaping too often. Every working KG extraction tool (GraphRAG, LightRAG, Graphiti, Neo4j Builder) has the LLM output **names and descriptions in natural language**, then resolves to graph identifiers in post-processing.

### The Pydantic model

```python
from pydantic import BaseModel

class Entity(BaseModel):
    name: str                 # "First Energy (FE)"
    entity_type: str          # "company", "person", "concept"
    description: str = ""     # optional elaboration

class ClaimTriple(BaseModel):
    subject: str              # entity name (NL)
    predicate: str            # relationship type
    object: str               # entity name or literal value
    claim_text: str           # NL sentence a human can verify
    claim_type: str           # "atom", "assertion", "relationship"
    confidence: float         # 0.0-1.0
    direction: str = ""       # "bullish"/"bearish" (domain-specific)
    source_text: str = ""     # verbatim quote from document
    source_id: str = ""       # document identifier
    influence: float = 0.0    # -1.0 to +1.0 (for causal claims)

class ExtractionResult(BaseModel):
    entities: list[Entity]
    claims: list[ClaimTriple]
```
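The deterministic conversion step could look roughly like this. The `PREFIX` namespace and the naive `slug()` name-to-URI scheme are placeholders; a real resolver would consult an entity table, and literal escaping is omitted:

```python
# Hypothetical Pydantic -> Turtle-star converter sketch.
import re
from pydantic import BaseModel

class ClaimTriple(BaseModel):  # trimmed copy of the model above, for self-containment
    subject: str
    predicate: str
    object: str
    claim_text: str
    confidence: float
    source_id: str = ""

PREFIX = "http://example.org/"

def slug(name: str) -> str:
    # "First Energy" -> "FirstEnergy"; crude but deterministic placeholder
    return re.sub(r"[\W_]+", "", name.title())

def to_turtle_star(c: ClaimTriple) -> str:
    s = f"<{PREFIX}{slug(c.subject)}>"
    p = f"<{PREFIX}{slug(c.predicate)}>"
    return (
        f'{s} {p} "{c.object}" {{|\n'
        f"    <{PREFIX}confidence> {c.confidence} ;\n"
        f'    <{PREFIX}claimText> "{c.claim_text}" ;\n'
        f"    <{PREFIX}source> <{PREFIX}{c.source_id}>\n"
        f"|}} ."
    )
```

This is the "thin conversion layer" from above made concrete: all judgment lives in the LLM output; this step is pure string assembly.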

## NL Out: Querying

Two options explored:

1. **LLM generates SPARQL directly** — feed schema + few-shot examples → LLM produces SPARQL → execute → LLM synthesizes English answer. The `sparql-llm` package adds schema extraction, validation, retry loops.

2. **LangChain `GraphSparqlQAChain`** — NL question → SPARQL → endpoint → NL answer. Works with any SPARQL endpoint.

Both need a way to execute the generated SPARQL: either expose pyoxigraph as a SPARQL HTTP endpoint (via rdflib-endpoint) or call pyoxigraph's `.query()` directly in-process.

## What the Research Found

### How existing tools handle the intermediate format

| Tool | Format | Confidence? | Provenance? |
|------|--------|-------------|-------------|
| GraphRAG | Bespoke `<\|>` delimited tuples | Claim status (TRUE/FALSE/SUSPECTED) | Source text quotes |
| LightRAG | Same delimited format | No | No |
| **Graphiti/Zep** | **Pydantic JSON** with NL `fact` field | No | Temporal bounds |
| Neo4j Builder | JSON with graph schema | No | No |
| KGGen | DSPy structured output | No | No |

**Graphiti is closest** — its Edge model has a `fact` field (NL claim) + structured relation fields. We add confidence, influence, and provenance.

### The gap we fill

None of the existing LLM+KG tools handle **uncertainty on triples**. They extract binary triples for retrieval. We build confidence-scored claims with influence-weighted causal edges for inference.

### RDF-star serialization options

| Format | Human-Readable? | LLM-Friendly? | Tooling |
|--------|----------------|---------------|---------|
| Turtle-star | Yes (best) | No (URIs/prefixes) | rdflib, pyoxigraph |
| JSON-LD-star | Moderate | Possible but verbose | Partial |
| YAML-LD | Good | No star extension yet | Experimental |
| N3 | Good | Different semantics | Limited |

**Decision**: LLM outputs Pydantic JSON (human-readable, machine-validated). Conversion to Turtle-star is deterministic post-processing.

## Implementation Steps

### Phase 0: Extraction quality (FIRST PRIORITY)
Before triplestore, graph visualization, or SPARQL — get extraction right.

0a. **Add "assertions" extraction config to vario** — new prompt in `extract_prompts.yaml`
    that outputs structured JSON claims (entities, atoms, assertions, relationships).
    Builds on existing `causal_claims` and `facts` prompts but adds:
    - claim_type (atom/entity/assertion/relationship)
    - references between claims (subject, evidence, input with influence)
    - Pydantic validation of output

0b. **Wire into vario Extract UI** — new button "assertions" alongside existing
    facts/causal/critique buttons. Shows claims in a readable table/tree.

0c. **Build extraction gym** (`learning/gyms/claim_extraction/`) — follows the
    llm_task gym pattern:
    - Corpus: 10-20 documents with gold-standard assertions (manually annotated)
    - Generate: multiple models × multiple prompt variants
    - Evaluate: LLM judge on completeness, precision, relationship accuracy
    - Report: model × prompt score matrix, interesting misses

    Existing infrastructure to reuse:
    - `lib/gym/base.py` — Candidate, CorpusStore, GymBase
    - `lib/gym/judge.py` — RubricJudge with cached scoring
    - `learning/gyms/llm_task/` — task YAML config pattern

### Phase 1: Triplestore (after extraction is solid)
1. Install stack: `pyoxigraph oxrdflib rdflib rdflib-endpoint`
2. Write NL→URI resolver (entity names → graph URIs)
3. Write Pydantic→Turtle-star converter
4. Wire up pyoxigraph Store with persistent storage
5. Test with FE thesis example (round-trip: text → claims → triples → SPARQL query → English answer)
6. Expose SPARQL endpoint via rdflib-endpoint

### Phase 2: NL query interface
7. Build NL query interface (LLM generates SPARQL from questions)
8. Integrate with existing semanticnet extraction pipeline

## Open Questions

- Do we keep SQLite claims table as a parallel store, or fully migrate to Oxigraph?
- How to handle entity resolution (same entity mentioned differently across documents)?
- Where does pgmpy fit — does it query Oxigraph directly, or do we extract a subgraph into pgmpy format?
- Schema/ontology: do we define a formal OWL ontology for our predicates, or keep it lightweight?

## References

- DESIGN.md: `lib/semnet/DESIGN.md` — full schema design, KV comparison, lineage
- Prior plan: `docs/plans/2026-02-25-vic-to-inference-network.md` — VIC thesis conversion
- Oxigraph: https://github.com/oxigraph/oxigraph
- sparql-llm: https://github.com/sib-swiss/sparql-llm
- Graphiti: https://github.com/getzep/graphiti
