# Idea Evaluation System — Design

**Date**: 2026-03-02
**Status**: Approved design
**Location**: New module, likely `ideas/` at rivus root — crosses semanticnet (extraction, embeddings) and draft (evaluation pipeline pattern)

## Problem

Given a piece of writing, assess the originality, quality, and fertility of its ideas computationally. This requires:
1. Extracting ideas from the writing
2. Building a landscape corpus of existing ideas on the topic (from web search)
3. Scoring ideas along multiple dimensions, with novelty measured relative to the corpus

## System Flow

```
Stage 0: Extract       → Pull ideas from YOUR writing (claims + theses)
Stage 1: Landscape     → Interactive source gathering (search, present, iterate with user)
Stage 2: Corpus Extract → Pull ideas from confirmed landscape sources
Stage 3: Score         → Evaluate YOUR ideas against rubric + landscape corpus
```

### Stage 0: Idea Extraction

Extract two levels of ideas from the user's writing:
- **Claims**: Atomic factual assertions, causal arguments, predictions (reuse semanticnet `Claim` model)
- **Theses**: Higher-level insights, framings, mental models, surprising connections

Each extracted idea gets: `text`, `type` (claim | thesis), `source_span` (location in document), `confidence` (extraction confidence).

Reuse: semanticnet's `Claim` dataclass + `ClaimProvenance` for tracking extraction source.
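The per-idea record described above could be sketched as a dataclass; the field names follow the list, while the 0-1 confidence range and the example values are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ExtractedIdea:
    text: str                     # the idea as extracted
    type: str                     # "claim" | "thesis"
    source_span: tuple[int, int]  # character offsets in the source document
    confidence: float             # extraction confidence, assumed 0.0-1.0

# Illustrative instance, not real extraction output
idea = ExtractedIdea(
    text="Floor scores punish incoherence harder than averages do",
    type="thesis",
    source_span=(120, 176),
    confidence=0.85,
)
```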

### Stage 1: Landscape Gathering (Interactive)

This is an iterative, user-in-the-loop stage:

1. **Auto-generate search terms** from extracted ideas — key phrases, related concepts, domain terms
2. **Search the web** via Serper API — present results to user:
   - Search terms used
   - Sources found (title, URL, snippet, relevance score)
   - Suggested refinements (deeper in direction X, broaden to adjacent topic Y)
3. **User confirms/selects/corrects**:
   - Accept/reject individual sources
   - Add specific URLs manually
   - Adjust search terms, request deeper search in a direction
   - Signal "enough" when corpus is satisfactory
4. **Iterate** until user is satisfied with landscape coverage

Key design decision: present sources BEFORE extracting ideas from them. User should validate the corpus, not review 500 extracted ideas.
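The iterate-until-satisfied loop above could look roughly like this; `search` and `review` stand in for the Serper call and the user-facing confirmation step, so their signatures and the decision-dict shape are assumptions about the eventual interface:

```python
from dataclasses import dataclass

@dataclass
class Source:
    url: str
    title: str
    status: str = "pending"  # "pending" | "accepted" | "rejected"

def landscape_loop(search, review, seed_terms, max_rounds=5):
    """Run the search -> present -> confirm cycle until the user signals
    'enough' or the round budget is exhausted."""
    corpus, terms = [], seed_terms
    for _ in range(max_rounds):
        candidates = search(terms)
        decision = review(candidates)  # user accepts/rejects, adjusts terms
        for src in candidates:
            src.status = decision["verdicts"].get(src.url, "rejected")
            if src.status == "accepted":
                corpus.append(src)
        if decision.get("enough"):
            break
        terms = decision.get("terms", terms)
    return corpus
```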

### Stage 2: Corpus Extraction

Once user confirms sources, extract ideas from each:
- Fetch content via `lib/ingest/fetch`
- Extract claims + theses using same extraction pipeline as Stage 0
- Store with provenance (source URL, extraction date, model used)
- Embed all ideas into Qdrant (reuse semanticnet's 3-level embedding approach)
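A minimal orchestration sketch of the steps above, with `fetch`, `extract`, and `embed` injected so the real `lib/ingest/fetch`, the Stage 0 extraction pipeline, and the Qdrant client can be swapped in later; the exact signatures here are assumptions:

```python
def build_corpus(urls, fetch, extract, embed):
    """Stage 2: fetch each confirmed source, extract ideas, attach
    provenance, and embed. Dict-based ideas are a stand-in for the
    real models."""
    corpus = []
    for url in urls:
        text = fetch(url)
        for idea in extract(text):
            idea["source"] = url                 # provenance
            idea["vector"] = embed(idea["text"])
            corpus.append(idea)
    return corpus
```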

### Stage 3: Scoring

Score each of the user's ideas (from Stage 0) against the evaluation rubric, with novelty dimensions computed relative to the landscape corpus (from Stage 2).

## Evaluation Taxonomy

9 dimensions in 4 groups. Derived from 4-model vario consensus (opus, gpt-pro, grok-reasoning, gemini).

### Group 1: SUBSTANCE — Does the idea say something real?

**1.1 Claim Precision** (merges specificity + falsifiability)

How specific and falsifiable is the core claim?

| Score | Description |
|-------|-------------|
| 1-2   | Pure abstraction, no referent. "Things are connected." |
| 3-4   | Identifies domain but only directional claims. |
| 5-6   | Specific mechanism with identifiable conditions. "When X, then Y, because Z." |
| 7-8   | Could be operationalized into a test. Clear variables implied. |
| 9-10  | Fully operationalized: population, mechanism, magnitude, failure conditions. |

Computational approach: Parse claim into structured components (subject, verb, object, mechanism, conditions). Score by how many slots are filled with concrete entities; useful signals include NER density, quantifiers, and conditional structures.
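The slot-counting step could map onto the 1-10 rubric like this; the slot names follow the parse components above, while the two-points-per-slot weighting is an assumption to be tuned:

```python
def claim_precision(slots: dict) -> int:
    """Score a parsed claim by slot coverage. The parse itself would come
    from an LLM or dependency parser; only the scoring is sketched here."""
    names = ("subject", "verb", "object", "mechanism", "conditions")
    filled = sum(1 for name in names if slots.get(name))
    return max(1, 2 * filled)  # clamp to the rubric's 1-10 range
```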

**1.2 Internal Coherence**

Does the argument's structure hold together?

| Score | Description |
|-------|-------------|
| 1-2   | Conclusion unrelated to premises. Non sequitur. |
| 3-4   | Plausible emotional logic but no inferential chain. |
| 5-6   | Chain exists but relies on unstated assumptions many would reject. |
| 7-8   | Chain complete, key assumptions identified and partially defended. |
| 9-10  | Argument valid or reconstructable as valid with minimal charity. |

Computational approach: Extract premise-conclusion structure (LLM). Check term consistency. Identify unstated assumptions via entailment gap detection. Hardest dimension — use adversarial LLM passes ("find the weakest inferential step").

**1.3 Evidential Grounding**

Is the idea anchored in evidence, examples, or established knowledge?

| Score | Description |
|-------|-------------|
| 1-2   | Pure assertion with no reference to evidence. |
| 3-4   | Anecdotal: one cherry-picked example. |
| 5-6   | Accurately references evidence but ignores contradictory evidence. |
| 7-8   | Evidence well-selected, accurately represented, appropriately weighted. |
| 9-10  | Synthesizes evidence revealing a pattern not visible in any single source. |

Computational approach: Claim extraction → fact-checking against corpus. Citation detection. Ratio of supported-to-unsupported claims.
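The supported-to-unsupported ratio reduces to a one-liner once the fact-checking pass has run; the `supported` flag is an assumed output of that pass:

```python
def support_ratio(claims: list[dict]) -> float:
    """Fraction of extracted claims verified against the corpus. The
    per-claim `supported` flag would come from the fact-checking pipeline."""
    if not claims:
        return 0.0
    return sum(1 for c in claims if c.get("supported")) / len(claims)
```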

### Group 2: NOVELTY — Does the idea add something new?

**2.1 Semantic Novelty** (merges originality + surprise)

How distant is this idea from the existing landscape corpus?

| Score | Description |
|-------|-------------|
| 1-2   | Verbatim or near-verbatim restatement of common corpus claim. |
| 3-4   | Common claim applied to slightly different context. |
| 5-6   | Familiar framing with a non-obvious twist or inversion. |
| 7-8   | Proposes mechanism or framing not present in reference corpus. |
| 9-10  | Creates a genuinely new conceptual primitive — a tool for thinking. |

Computational approach: Embed idea, compute distance from nearest neighbors in corpus. Critical refinement: need *topical relevance + semantic distance* — close in topic-space but far in claim-space. Nonsense is also "distant." Decompose into (topic embedding, claim embedding) pairs.

**2.2 Framing Novelty**

Does the idea reframe the problem in a way that opens new avenues?

| Score | Description |
|-------|-------------|
| 1-2   | Uses default framing for topic (e.g., "pros and cons of X"). |
| 3-4   | Adopts a known alternative framing (economic lens on social question). |
| 5-6   | Shifts level of analysis (individual → structural, static → dynamic). |
| 7-8   | Proposes reframe that dissolves an apparent contradiction. |
| 9-10  | Reframe is paradigmatic: changes what counts as a good question. |

Computational approach: Extract the implicit question the idea answers (via LLM). Compare that question to typical questions asked about this topic in corpus. Distance between implied question and typical questions = framing novelty.

### Group 3: EXPRESSION — How well communicated? (Scored separately)

This group evaluates the TEXT, not the IDEA. Report separately from substance/novelty. A brilliant idea expressed poorly should score high on Groups 1-2 and low here.

**3.1 Clarity**

Can a reader extract the core claim without ambiguity?

| Score | Description |
|-------|-------------|
| 1-2   | Incomprehensible or self-contradictory. |
| 3-4   | Core claim identifiable but ambiguous (multiple valid interpretations). |
| 5-6   | Claim and structure clear; some terms undefined or loose. |
| 7-8   | Claim, structure, and terms all clear. Reader could accurately paraphrase. |
| 9-10  | Expression so clear it feels inevitable — can't imagine a better way to say it. |

Computational approach: LLM extracts core claim. Measure consistency across multiple extraction attempts (high variance = low clarity). Readability metrics, parse complexity.
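The consistency-across-attempts check could use a crude token-overlap measure as a stand-in for semantic agreement (embedding similarity would be the real measure); low agreement between extraction runs reads as low clarity of the source text:

```python
from itertools import combinations

def extraction_consistency(extractions: list[str]) -> float:
    """Mean pairwise token-Jaccard across repeated core-claim extractions
    of the same passage. 1.0 = every run extracted the same claim."""
    def jaccard(a, b):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb)
    pairs = list(combinations(extractions, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```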

**3.2 Rhetorical Force** (refined from eloquence)

How effectively does the expression serve the idea?

| Score | Description |
|-------|-------------|
| 1-2   | Flat prose that undermines the idea. |
| 3-4   | Technically correct but generic — could be about any topic. |
| 5-6   | Consistently well-written; maintains engagement. |
| 7-8   | Contains memorable formulations; quotable. |
| 9-10  | Form and content unified — the expression demonstrates the idea. |

Computational approach: Lexical diversity, metaphor detection, sentence rhythm variation. Comparative distinctiveness vs corpus baseline (does this sound like everything else on this topic?).
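Of the signals listed, lexical diversity is the cheapest; a type-token ratio sketch (metaphor detection and rhythm analysis need heavier machinery):

```python
def type_token_ratio(text: str) -> float:
    """Unique tokens over total tokens, a rough lexical-diversity signal.
    Whitespace tokenization is a simplification; a real pass would
    lemmatize first."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```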

### Group 4: FERTILITY — Does the idea open further thinking?

**4.1 Generativity**

Does the idea suggest new questions, research directions, or applications?

| Score | Description |
|-------|-------------|
| 1-2   | Dead end — if true, no implications beyond itself. |
| 3-4   | Implies several next questions, all within same narrow domain. |
| 5-6   | Implies testable hypotheses. |
| 7-8   | Opens questions in multiple domains. |
| 9-10  | A "generator" — produces a stream of non-obvious insights across domains. |

Computational approach: Prompt LLM to generate implications/next-questions. Score based on quantity, diversity (topic spread), and non-obviousness (distance from corpus). This is a "fertility test."
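The aggregation side of the fertility test could look like this; the field names, the 10-implication cap, and the equal weighting are all assumptions to be tuned, and the implication list itself would come from the LLM prompt:

```python
def fertility_score(implications: list[dict]) -> float:
    """Combine quantity, topic spread, and mean corpus distance of
    LLM-generated implications into a 0-10 score."""
    if not implications:
        return 0.0
    quantity = min(len(implications), 10) / 10
    diversity = len({i["topic"] for i in implications}) / len(implications)
    novelty = sum(i["corpus_distance"] for i in implications) / len(implications)
    return round(10 * (quantity + diversity + novelty) / 3, 1)
```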

**4.2 Composability**

Can this idea be productively combined with other ideas?

| Score | Description |
|-------|-------------|
| 1-2   | Standalone assertion with no connection points. |
| 3-4   | Could modify or extend an existing theory. |
| 5-6   | Creates useful interface between two previously separate ideas. |
| 7-8   | Functions as a building block — other ideas can be built on top. |
| 9-10  | New primitive that expands the combinatorial space of what can be thought. |

Computational approach: Measure how much the new idea changes embedding-space neighborhood structure. Ideas that cause "reorganization" (other ideas become more/less similar in presence of new idea) are highly composable.
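One way to operationalize "reorganization": count how many existing ideas change their nearest-neighbor set when the new idea's embedding is added. Plain Euclidean k-NN over raw vectors stands in for the Qdrant neighborhood query:

```python
import math

def knn(i, vecs, k):
    """Indices of the k nearest vectors to vecs[i], excluding itself."""
    others = sorted((j for j in range(len(vecs)) if j != i),
                    key=lambda j: math.dist(vecs[i], vecs[j]))
    return set(others[:k])

def reorganization(corpus, new_idea, k=2):
    """Fraction of existing ideas whose k-NN set changes once the new
    idea is present. Higher = more composable by this proxy."""
    before = [knn(i, corpus, k) for i in range(len(corpus))]
    extended = corpus + [new_idea]
    after = [knn(i, extended, k) for i in range(len(corpus))]
    return sum(b != a for b, a in zip(before, after)) / len(corpus)
```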

## Scoring Approach

### Aggregation

Do NOT use single weighted average. Report a profile with floor scores per group:

```
Substance:  [7, 6, 5]  → floor: 5
Novelty:    [8, 7]     → floor: 7
Expression: [6, 8]     → (reported separately)
Fertility:  [9, 6]     → floor: 6
```

Floor scores (weakest link) rather than averages — a highly novel idea with zero coherence is worthless, and averaging hides that.

### Evaluation Pipeline

Reuse draft/style's batched evaluation pattern: 2-3 dimensions per LLM call, which scores more accurately than one monolithic evaluation. Use multi-model consensus (3+ models, take the median) for dimensions 1.2, 2.2, 4.1, and 4.2, which are the hardest to score.
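The median-consensus step reduces to a small helper; the per-model dict shape is an assumption about how the scoring calls return:

```python
from statistics import median

def consensus_scores(per_model: list[dict]) -> dict:
    """Median score per dimension across model runs, damping
    single-model outliers. `per_model` holds one {dimension: score}
    dict per model."""
    dimensions = per_model[0]
    return {d: median(m[d] for m in per_model) for d in dimensions}
```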

### Computational Feasibility (easiest → hardest)

1. Claim Precision — mostly structural/syntactic signals
2. Clarity — extractability and consistency checks
3. Semantic Novelty — embedding distance (with topic/claim decomposition)
4. Rhetorical Force — stylistic features well-studied
5. Evidential Grounding — requires fact-checking pipeline
6. Generativity — LLM-as-judge with structured prompting
7. Composability — requires graph-level corpus analysis
8. Internal Coherence — requires deep logical analysis
9. Framing Novelty — requires implicit question extraction

For the four hardest (items 6-9 above: Generativity, Composability, Internal Coherence, Framing Novelty), use multi-agent LLM evaluation (3+ passes, different prompts, take median).

## Architecture

### Module Location

New `ideas/` module at rivus root:

```
ideas/
├── models.py          # Idea, IdeaRef dataclasses (extend semanticnet Claim)
├── extract.py         # Stage 0 + Stage 2: idea extraction from text
├── landscape.py       # Stage 1: search, source gathering, user iteration
├── eval.py            # Stage 3: 9-dimension evaluation engine
├── store.py           # SQLite + Qdrant storage (mirror semanticnet pattern)
├── api.py             # FastAPI endpoints for UI integration
├── ui/                # Gradio UI (could be tab in vario or standalone)
│   └── app.py
└── tests/
```

### Reused Infrastructure

| Component | Source | What we reuse |
|-----------|--------|---------------|
| Claim model + provenance | `lib/semnet` | Data structures, storage pattern |
| 3-level embeddings | `lib/semnet` | Qdrant embedding approach |
| Batched evaluation | `draft/style/evaluate.py` | 2-3 criteria per LLM call pattern |
| Multi-model consensus | `vario/blocks/critique.py` | Parallel scoring + synthesis |
| Web search | Serper API + `lib/ingest` | Source fetching |
| Content extraction | `lib/ingest/fetch` | HTML → text |
| RubricJudge | `lib/gym/judge.py` | Calibrated per-criterion scoring |

### Data Model

```python
@dataclass
class Idea:
    id: str                    # UUID
    text: str                  # The idea as extracted
    idea_type: str             # "claim" | "thesis"
    source: str                # "user" | URL
    source_span: tuple[int, int] | None  # Location in source document
    extraction_model: str      # Model used for extraction
    extracted_at: float        # Timestamp

@dataclass
class IdeaScore:
    idea_id: str
    dimension: str             # e.g., "claim_precision", "semantic_novelty"
    score: float               # 1-10
    rationale: str             # LLM explanation
    model: str                 # Scoring model
    scored_at: float

@dataclass
class LandscapeSource:
    url: str
    title: str
    snippet: str
    search_term: str           # What search found this
    status: str                # "pending" | "accepted" | "rejected"
    user_notes: str | None     # User annotation
```

## Open Questions

1. **UI**: Standalone app on its own port, or tab within vario? Leaning standalone given the interactive Stage 1.
2. **Idea deduplication**: How aggressively should we merge near-duplicate ideas within the landscape corpus? Threshold TBD.
3. **Incremental updates**: When user adds more sources to landscape, do we re-score everything? Probably yes for novelty dimensions.
4. **Thesis extraction**: The prompt for extracting theses/insights (vs atomic claims) needs careful design — use vario to iterate on extraction prompt quality.

## Vario Consensus Notes

Design validated by 4-model vario consultation (opus, gpt-5.2-pro, grok-4-reasoning, gemini-2.5-pro):

- **All 4 agreed**: merge originality + surprise, decompose insightfulness, separate expression from substance, add coherence + generativity
- **Opus** contributed: floor-scoring per group, claim-space vs topic-space decomposition for novelty, "fertility test" pattern for generativity
- **Gemini** contributed: "explanatory power" as replacement for insightfulness, computational proxies per dimension
- **GPT-Pro** contributed: scope calibration as dimension (deferred — add if needed), 10-core lean setup
- **Grok** contributed: relevance/impact as utility dimensions (partially captured in generativity)
