# Claim Quality Rating — Design Notes

**Status:** Idea
**Context:** After `draft/` extracts rhetorical segments (claims, evidence, examples...), can we rate individual claims by quality dimensions?

## The Question

Role extraction tells you **what** each chunk does (claim, evidence, analogy...). But not all claims are equal. A document might contain 15 claims — some are restated conventional wisdom, others are genuinely novel observations. Rating them surfaces the diamonds.

## Proposed Dimensions

| Dimension | What it captures | Example low | Example high |
|-----------|-----------------|-------------|-------------|
| **Novelty** | Is this a fresh observation or well-known? | "AI is transforming healthcare" | "Nurse scheduling constraints, not clinical adoption, are the binding bottleneck for hospital AI deployment" |
| **Insight** | Does it reveal something non-obvious? | "Revenue grew 15% YoY" | "Revenue grew 15% but entirely from price increases — unit volume declined, suggesting demand ceiling" |
| **Eloquence** | Is the writing itself compelling/memorable? | Dry restatement of facts | Vivid, quotable formulation that sticks |

### Additional dimensions to consider

- **Specificity** — concrete and measurable vs vague hand-waving
- **Actionability** — can someone act on this immediately?
- **Evidence support** — how well-backed within the document?
- **Surprise factor** — does it challenge assumptions or confirm priors?
- **Explanatory power** — does it make many other things make sense?

Not all dimensions apply to every use case. Rating an earnings call calls for novelty + insight + specificity; rating an essay might weight eloquence + explanatory power more heavily.
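
The per-use-case selection could live in a small profile table. A minimal sketch — the type names and groupings here are assumptions, not a settled schema:

```python
# Hypothetical profile table mapping document type to the dimensions worth
# rating; names and groupings are placeholders.
DIMENSION_PROFILES = {
    "earnings_call": ["novelty", "insight", "specificity"],
    "essay": ["eloquence", "explanatory_power", "novelty"],
    "default": ["novelty", "insight", "eloquence"],
}

def dimensions_for(doc_type: str) -> list[str]:
    """Fall back to the core trio when the document type is unrecognized."""
    return DIMENSION_PROFILES.get(doc_type, DIMENSION_PROFILES["default"])
```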

## Approaches

### A. Rubric-based LLM scoring (simplest)

Pass each claim + surrounding context through a rating prompt. Score 1-5 per dimension.

```
Rate this claim on three dimensions (1-5):

Claim: "{claim_text}"
Context: "{surrounding_paragraph}"
Document type: {doc_type}

1. Novelty (1=widely known, 5=genuinely original observation)
2. Insight (1=surface-level, 5=reveals non-obvious mechanism or implication)
3. Eloquence (1=forgettable phrasing, 5=memorable, quotable formulation)

Return JSON: {"novelty": N, "insight": N, "eloquence": N, "reasoning": "..."}
```

**Pros:** Simple, works now, cheap with flash.
**Cons:** Absolute scores drift, hard to calibrate across documents.

### B. Comparative ranking (ELO-style)

Instead of absolute scores, present pairs of claims and ask "which is more novel/insightful?" Build ranking from pairwise comparisons.

**Pros:** More robust than absolute scores, well-understood methodology (connects to judge-calibration work).
**Cons:** O(n²) comparisons, expensive for documents with many claims.
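
The update step is standard Elo; a minimal sketch, assuming a binary "which is more novel" verdict per comparison. Sampling roughly O(n log n) pairs instead of all n² is one way to soften the cost con:

```python
def elo_update(r_a: float, r_b: float, a_wins: bool,
               k: float = 32.0) -> tuple[float, float]:
    """One pairwise judgment: the judge preferred claim A iff a_wins.
    k controls how fast ratings move; updates are zero-sum."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * ((1.0 if a_wins else 0.0) - expected_a)
    return r_a + delta, r_b - delta
```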

### C. Multi-model consensus via Vario

Use Vario's multi-model infrastructure to get ratings from 3-4 models, then aggregate. Disagreement is itself an interesting signal (a claim one model finds novel but another finds obvious might be domain-dependent).

**Pros:** Reduces single-model bias, catches hallucinated confidence.
**Cons:** 3-4x cost, may be overkill for initial version.
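
A sketch of the aggregation step, treating cross-model spread as the disagreement signal. The linear mapping from spread to a 0-1 confidence is an arbitrary assumption:

```python
from statistics import mean, pstdev

def aggregate_ratings(scores: list[int]) -> dict[str, float]:
    """Combine one dimension's scores from several models.
    Population std dev doubles as the disagreement signal; high spread
    flags claims worth a human look."""
    spread = pstdev(scores)
    return {
        "score": mean(scores),
        "spread": spread,
        "confidence": max(0.0, 1.0 - spread / 2.0),  # crude 0-1 mapping
    }
```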

### D. Hybrid: batch rubric + selective comparison

1. Fast pass: rate all claims with rubric (approach A), cheap model
2. Interesting tier: claims scoring 4-5 on any dimension get multi-model validation (approach C)
3. Top tier: highest-rated claims get pairwise comparison within their tier (approach B)

This is the likely winner — cheap at scale, rigorous where it matters.
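
The routing logic after the fast pass could be as simple as this sketch (the >=4 threshold is a placeholder, not a tuned value):

```python
def route(rubric_scores: dict[str, int]) -> str:
    """Decide a claim's next stop after the cheap rubric pass."""
    if max(rubric_scores.values(), default=0) >= 4:
        return "multi_model"  # step 2: approach C validation
    return "done"             # rubric score stands
```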

## Data Model

Extend `RoleSegment` or add a companion:

```python
from dataclasses import dataclass

@dataclass
class ClaimRating:
    segment_id: str          # links to RoleSegment.id
    novelty: int             # 1-5
    insight: int             # 1-5
    eloquence: int           # 1-5
    reasoning: str           # LLM's rationale
    model: str               # which model rated
    confidence: float        # 0-1, derived from multi-model agreement
```

Keep ratings separate from segments — a segment can have multiple ratings (from different models, passes).

## Integration Points

- **draft/core/roles.py** — extraction already produces `RoleSegment`. Rating is a second pass on segments where `role == "claim"` (and optionally "evidence", "example")
- **vario** — multi-model orchestration for approach C/D
- **judge-calibration** — the judge framework (`docs/plans/2026-02-25-judge-calibration-stage-gyms.md`) could calibrate claim ratings against human annotations
- **triplestore** — rated claims with high novelty/insight are prime candidates for knowledge graph extraction (`docs/plans/2026-02-27-triplestore-nl-pipeline.md`)
- **UI** — role map sidebar could show novelty/insight badges, sort/filter by rating

## Open Questions

1. **Should rating apply only to "claim" segments or to others too?** Evidence can be novel. Examples can be insightful. But "transition" segments don't need rating.
2. **Document-relative vs absolute?** "Novel within this document" (compared to its own context section) vs "novel in general knowledge." The former is easier and probably more useful.
3. **Domain calibration?** A claim novel to a layperson might be textbook to a domain expert. Do we need a "reader profile" parameter?
4. **Batch vs per-claim?** Rating all claims in one LLM call (sees all claims, can rank relatively) vs one call per claim (simpler, parallelizable). Batch is probably better — the model can calibrate within the document.
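
The batch variant from question 4 could be sketched as one prompt carrying all claims, so the model can score them relative to each other. The layout below is an assumption:

```python
def build_batch_prompt(claims: list[str], doc_type: str) -> str:
    """Assemble a single rating prompt covering every claim in the document."""
    lines = [
        f"Document type: {doc_type}",
        "Rate each claim below on novelty, insight, and eloquence (1-5).",
        "Answer in plain prose, one line per claim.",
        "",
    ]
    lines += [f"{i}. {claim}" for i, claim in enumerate(claims, 1)]
    return "\n".join(lines)
```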

## Prior Art

- The draft review considerations doc notes: structured output causes **-18.3% insight, -17.1% novelty** penalty (Tam et al.). Rating prompts should use markdown, not strict JSON — extract scores from natural language.
- The newsflow design (`2026-02-21`) mentions "alert on novelty" for first-mention detection — different scope (corpus-level) but related concept.
- Triplestore pipeline (`2026-02-27`) builds confidence-scored claims — the quality rating here is orthogonal (quality of observation vs confidence in truth).

## Next Steps

1. Prototype approach A on a real document — see if ratings feel calibrated
2. Decide which dimensions matter most (start with novelty + insight, add eloquence later?)
3. Design the batch prompt that rates all claims in one call
4. Wire into the role map UI — sort/filter claims by quality
