Draft

Understand what a document does, not just what it says — rhetorical role mapping, claim quality rating, style evaluation

The Problem

You read a 3,000-word essay. You can tell it's "pretty good." But can you say why? Can you point to the three strongest claims? Can you tell which evidence actually supports a claim vs. which is decorative? Can you spot where the author sneaks in an undefended assertion between two well-evidenced ones?

Most document tools tell you what a document says: keywords, topics, summaries. Draft tells you what the document does — what rhetorical moves the author makes, how strong the claims are, and where the writing falls short.

Core insight: A document is not a bag of words. It's a sequence of moves — claims, evidence, concessions, appeals. Making those moves visible changes how you read, write, and revise.

How It Works

Four analysis layers, each usable on its own, with the role map as the foundation the others build on:

Layer 1: Role Map - what does each chunk do?
Layer 2: Claim Rating - how novel and insightful?
Layer 3: Style - how well written?
Layer 4: Review - multi-lens, multi-model

Layer 1: Role Map

Every sentence in a document serves a rhetorical function. An LLM reads the full text and classifies each contiguous chunk into one of 12 roles:

| Role | What it does | Example |
| --- | --- | --- |
| claim | Assertion the author wants you to believe | "Nurse scheduling is the binding bottleneck for hospital AI" |
| evidence | Facts, data, statistics supporting a claim | "Revenue grew 15% YoY, driven by a 23% price increase" |
| example | Concrete illustration or case study | "Consider Norway's universal healthcare system..." |
| explanation | Making a concept or mechanism clear | "This works because the incentive structure aligns..." |
| concession | Acknowledging a counterpoint or limitation | "Critics rightly note that correlation is not causation" |
| appeal | Call to action, emotional persuasion | "We cannot afford to wait another decade" |
| context | Background, setup, prior work | "Since the 2008 financial crisis, regulators have..." |

(Plus: definition, description, analogy, qualification, transition — 12 roles total.)
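As a minimal sketch, the taxonomy and a labeled segment might be represented like this (illustrative names, not the actual roles.py API):

```python
from dataclasses import dataclass

# The 12-role taxonomy from the table above.
ROLES = {
    "claim", "evidence", "example", "explanation", "concession", "appeal",
    "context", "definition", "description", "analogy", "qualification",
    "transition",
}

@dataclass
class Segment:
    """One contiguous chunk of the document, tagged with its rhetorical role."""
    text: str
    role: str

    def __post_init__(self):
        # Reject anything outside the fixed taxonomy.
        if self.role not in ROLES:
            raise ValueError(f"unknown role: {self.role}")
```

The LLM's job is then just to emit (text span, role) pairs; everything downstream works on this flat list.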

The Interactive Map

The output is a two-panel interactive view. Left: a sidebar of color-coded cards, one per segment. Right: the full document text with role-colored highlights. Click either side and the other scrolls to match. For example, sample segments from a document on the semiconductor supply chain:

The semiconductor supply chain has grown into one of the most complex and geographically concentrated systems in the global economy. Understanding its structure is essential for anyone assessing technology risk.

Taiwan produces 92% of the world's most advanced semiconductors (<7nm), making it the single most critical chokepoint in the global technology supply chain.

TSMC alone accounts for 56% of global foundry revenue, with its most advanced nodes serving Apple, NVIDIA, AMD, and Qualcomm. Samsung holds roughly 12% at comparable nodes.

Intel and Samsung are investing heavily in domestic fabrication capacity, with Intel's Ohio fab and Samsung's Texas expansion both targeting sub-5nm by 2027.

But new leading-edge fabs take 3-5 years to build and cost $20 billion or more, meaning no amount of investment today can change the concentration risk for the rest of this decade.

The claim about Taiwan's 92% share gets N:4 I:5 ★ — high novelty (most people don't know the exact number) and high insight (it reframes "supply chain risk" as "single-island dependency"). The TSMC market share evidence gets N:2 I:3 — well-known fact, moderate insight.

Layer 2: Claim Quality Rating

After role extraction, the system can rate every claim, evidence segment, and example on two dimensions:

| Dimension | Scale | What it captures |
| --- | --- | --- |
| Novelty | 1–5 | Is this a fresh observation or restated conventional wisdom? |
| Insight | 1–5 | Does it reveal something non-obvious — a mechanism, implication, or reframing? |

All claims are rated in a single batch prompt so the model can calibrate scores relative to the document's other claims. A claim that's insightful compared to the rest of the document will score higher than the same claim rated in isolation.

Why markdown output, not JSON? Research (Tam et al., 2025) found that forcing structured JSON output causes an 18% drop in insight scores and 17% drop in novelty scores. The rating prompt uses natural language output, then parses scores with regex. The quality tradeoff is worth the parsing complexity.

Claims scoring 4+ on either dimension get a ★ star in the sidebar. High-rated claims are the diamonds — the observations worth quoting, building on, or challenging.
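A sketch of the regex-parsing step, assuming the model emits scores in an "N:4 I:5" form like the example above (the exact line format and function name are illustrative, not rating.py's actual code):

```python
import re

# Match score pairs such as "N:4 I:5" in the model's natural-language output.
SCORE_RE = re.compile(r"N\s*:\s*([1-5])\s+I\s*:\s*([1-5])")

def parse_ratings(markdown: str) -> list[dict]:
    """Extract (novelty, insight) pairs and apply the 4+ star threshold."""
    ratings = []
    for match in SCORE_RE.finditer(markdown):
        novelty, insight = int(match.group(1)), int(match.group(2))
        ratings.append({
            "novelty": novelty,
            "insight": insight,
            # A claim scoring 4+ on either dimension earns a sidebar star.
            "star": novelty >= 4 or insight >= 4,
        })
    return ratings
```

Regex parsing is laxer than a JSON schema, but it lets the model write freely, which is the whole point of the markdown-output decision.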

Layer 3: Style Evaluation

Evaluates prose quality against ~215 curated principles extracted from canonical style guides (Strunk & White, Zinsser, Pinker, etc.). The key architectural insight: evaluating 2–3 principles per LLM call dramatically outperforms dumping all 215 at once.

Step 1: Orient - identify doc type & relevant principles
Step 2: Select - pick the 15–25 most applicable
Step 3: Batch Eval - score 2–3 principles per call
Step 4: Improve - rewrite addressing violations

Output: a per-violation report (principle, severity, specific quote, suggested fix) plus an improved version of the text with a changes summary.
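The batching behind Step 3 can be sketched in a few lines (a simplified stand-in for evaluate.py, not its actual interface):

```python
def batch_principles(principles: list[str], batch_size: int = 3) -> list[list[str]]:
    """Split the selected principles into small batches, one LLM call each.

    Small batches (2-3 principles) keep the model focused on a few criteria
    at a time, which outperforms one monolithic evaluate-everything prompt.
    """
    return [principles[i:i + batch_size] for i in range(0, len(principles), batch_size)]
```

Each batch then becomes one scoring call; the per-violation reports are concatenated afterwards.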

Layer 4: Multi-Lens Review

The most comprehensive analysis: runs 2–4 evaluation lenses (structure, logic, language, readability), each with multiple models, then synthesizes findings into a unified report. Built on the vario multi-model infrastructure.
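The fan-out across lenses and models might look roughly like this (a sketch using stdlib threading; vario.review's real API is not shown here, and `evaluate` stands in for the per-lens LLM call):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

LENSES = ["structure", "logic", "language", "readability"]

def review(text, models, evaluate):
    """Run every (lens, model) pair in parallel, then collect findings.

    `evaluate(text, lens, model)` is a placeholder for the real LLM call;
    synthesis of the unified report would happen on the returned dict.
    """
    pairs = list(product(LENSES, models))
    with ThreadPoolExecutor(max_workers=8) as pool:
        findings = list(pool.map(lambda p: evaluate(text, *p), pairs))
    return dict(zip(pairs, findings))
```

With 4 lenses and 2 models, that is 8 independent evaluations whose findings get merged into one report.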

4 review lenses · 12 rhetorical roles · 215 style principles · 2 rating dimensions

Architecture

draft/
+-- core/               # Pure logic, no UI deps
|   +-- roles.py        # 12-role taxonomy + LLM extraction
|   +-- mapper.py       # Self-contained HTML renderer (sidebar + highlights)
|   +-- rating.py       # Batch novelty/insight scoring
+-- style/              # Prose quality evaluation
|   +-- guides/         # 215 curated principles (YAML) + full text cache
|   +-- orient.py       # Document triage
|   +-- evaluate.py     # Few-principles-at-a-time scoring engine
|   +-- improve.py      # Rewrite engine
+-- ui/
|   +-- app.py          # Gradio server on :7980 (draft.localhost)
+-- tests/
+-- tasks.py            # inv draft.server, inv draft.map

Key Design Decisions

| Decision | Why |
| --- | --- |
| Core produces self-contained HTML strings | No UI framework dependency — works from CLI, API, tests, notebooks. The Gradio layer is a thin shell. |
| 12-role taxonomy (not 5, not 50) | Covers the full range of rhetorical moves without requiring expert training. Every educated reader knows what "claim" and "evidence" mean. |
| Batch claim rating (all claims in one call) | The model calibrates relative to the document — "novel compared to what the author already said." Per-claim calls lose this context. |
| Markdown output for ratings | Tam et al.: JSON output cuts novelty/insight scores by 17–18%. Parse with regex, not schema. |
| Style eval: 2–3 principles per call | CheckEval/DeCE research: small batches massively outperform monolithic "evaluate everything at once" prompts. |
| Document loading in lib/ingest | PDF, DOCX, HTML, URL, paste — shared across all rivus projects, not duplicated in draft. |
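The self-contained-HTML decision reduces the renderer to plain string assembly. A toy version (palette and markup are invented for illustration; mapper.py's actual output differs):

```python
# Illustrative role-to-color palette, not the real one.
ROLE_COLORS = {"claim": "#e57373", "evidence": "#64b5f6"}

def render_map(segments):
    """Return a single self-contained HTML string for (text, role) pairs.

    No framework dependency: the same string works in a browser, a test,
    or a notebook, and the Gradio layer just serves it.
    """
    cards = "".join(
        f'<div class="card" style="border-left:4px solid '
        f'{ROLE_COLORS.get(role, "#bdbdbd")}">{text}</div>'
        for text, role in segments
    )
    return f"<html><body>{cards}</body></html>"
```

Because the output is just a string, the CLI, tests, and notebooks can all consume it without importing any UI code.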

Reused Infrastructure

ComponentSourceWhat Draft Reuses
LLM callslib/llmcall_llm with model aliases, prompt caching
Document loadinglib/ingestPDF/DOCX/HTML/URL parsing with smart escalation
Multi-model reviewvario.reviewParallel lens evaluation across models
UI componentslib/gradioDocInput widget, emoji favicon, footer

What It Reveals

Structure becomes visible. Many well-written documents have a hidden imbalance: lots of claims, little evidence. Or evidence that doesn't actually support the adjacent claim. The role map makes this immediately visible — you can see a red (claim) followed by another red (claim) with no blue (evidence) in between.
Not all claims are equal. A document with 15 claims might have 2 genuinely insightful observations buried among 13 restatements of conventional wisdom. The rating layer surfaces those 2 with a star, so you know what's worth your attention.
Craft and substance are independent axes. Style evaluation measures how well something is written. Claim rating measures how good the ideas are. A beautifully written platitude still scores N:1 I:1. An awkwardly phrased breakthrough still scores N:5 I:5. The system keeps these separate so you know which lever to pull: improve the writing, or improve the thinking.

What's Next


Generated 2026-03-03 · Live at draft.localhost · Design doc: claim-quality-rating