Draft analyzes documents along three dimensions: rhetorical role mapping (what each paragraph does), style evaluation (prose quality against 215 curated principles), and claims analysis (factual accuracy + evidence linking).
This session focused on making Draft demo-ready using Paul Graham's "How to Do Great Work" (11,517 words) as the test document. The main finding: the role extraction prompt was labeling 87% of segments as "explanation", missing the essay's many claims, analogies, and examples. After prompt engineering, the distribution improved dramatically to 44% claims / 22% explanation / 34% other roles.
| Metric | Before | After | Change |
|---|---|---|---|
| Role diversity (non-explanation %) | 13% | 78% | +65pp |
| Total segments extracted | 62 | 73 | +11 |
| Claim segments identified | 4 | 32 | +28 |
| Explanation segments | 54 | 16 | -38 |
| Unique role types used | 6 | 10 | +4 |
| Style eval score (PG excerpt) | — | 6.8/10 | baseline |
The role map identifies the rhetorical function of each text chunk using a 12-role taxonomy (claim, evidence, example, explanation, and so on). It produces a self-contained interactive HTML file: a left sidebar lists segment cards, the right panel shows the highlighted document text, and two-way scroll sync keeps them aligned (click either side to jump to the matching position).
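Under the hood, a segment can be modeled as a small record. The sketch below is illustrative: field names and the role list are inferred from the tables in this report, not taken from Draft's actual schema.

```python
from dataclasses import dataclass

# Roles observed in this session's output. The full taxonomy has 12 roles;
# only these 11 appear in the before/after distributions shown here.
ROLES = {
    "claim", "explanation", "analogy", "definition", "qualification",
    "example", "concession", "appeal", "transition", "context", "evidence",
}

@dataclass
class Segment:
    """One rhetorical unit of the document (illustrative schema)."""
    label: str   # self-contained phrase, e.g. "four steps to great work"
    role: str    # one of ROLES
    start: int   # character offset into the source text
    end: int

def role_distribution(segments: list[Segment]) -> dict[str, float]:
    """Fraction of segments per role, as shown in the sidebar summary."""
    counts: dict[str, int] = {}
    for seg in segments:
        counts[seg.role] = counts.get(seg.role, 0) + 1
    total = len(segments) or 1
    return {role: n / total for role, n in counts.items()}
```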
| Role | Count | % | Description |
|---|---|---|---|
| ⚡ claim | 32 | 44% | Assertions PG wants you to accept |
| 🔍 explanation | 16 | 22% | Clarifying how/why mechanisms work |
| 🔗 analogy | 6 | 8% | Comparisons making abstract concrete |
| 📖 definition | 5 | 7% | Introducing key terms |
| ⚠️ qualification | 4 | 5% | Caveats and scope limits |
| 💡 example | 3 | 4% | Concrete illustrations |
| 🤝 concession | 3 | 4% | Acknowledging counterpoints |
| 📢 appeal | 2 | 3% | Calls to action |
| → transition | 1 | 1% | Section connectors |
| 📋 context | 1 | 1% | Background/framing |
The distribution matches what you'd expect from PG's writing: predominantly claims (he states positions) with explanations supporting them, sprinkled with analogies and definitions. The previous 87% explanation rate was clearly wrong — PG's essay is argumentative, not expository.
Labels improved from vague references ("the four steps", "intersection has shape") to self-contained phrases ("four steps to great work", "great work techniques overlap across fields"). The prompt now requires labels to be understandable without reading the document, and to use proper punctuation when combining ideas.
The map header now displays author name, clickable source URL, word count, and segment count; previously it showed only the filename and segment count.
The style evaluation scores prose quality against 215 curated principles from Strunk & White, Williams' Clarity, Pinker & Zinsser, Gopen & Swan, and Orwell. The pipeline: orient (triage what kind of writing this is) → select (pick relevant principles) → batch evaluate (2-3 principles per LLM call, for accuracy) → aggregate (category scores, top issues, top strengths).
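The batch-evaluate step can be sketched as simple chunking: a minimal version, assuming the principles arrive as a flat list (the real pipeline also carries per-principle metadata).

```python
from itertools import islice

def batch_principles(principles: list[str], size: int = 3):
    """Yield batches of at most `size` principles, so each LLM call
    evaluates only 2-3 of them; small batches keep the judge focused."""
    it = iter(principles)
    while batch := list(islice(it, size)):
        yield batch

# 215 principles at 3 per call -> 72 LLM calls per document
# (the final batch holds the 2 leftover principles)
```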
| Category | Score | Violations | Exemplars |
|---|---|---|---|
| Strunk & White | 6.2/10 | 5 | 8 |
| Williams' Clarity | 6.3/10 | 69 | 114 |
| Other (Pinker/Zinsser, Gopen/Swan, Orwell) | 7.5/10 | 54 | 130 |
| Severity | Principle | Issue |
|---|---|---|
| critical | No comma splice | Comma splice between independent clauses: "When you're young you don't know what you're good at or what..." |
| critical | Consistent topic strings | Topic shifts: 'you' → 'some kinds of work' → 'some people'/'most' within one paragraph |
| critical | One story per unit | Paragraph combines three stories: (1) young people lack self-knowledge, (2) some work doesn't exist yet, (3) discovery through working |
| critical | Eliminate zombie nouns | Nominalizations "intersection" and "shape" obscure the underlying action |
| Principle | Example |
|---|---|
| Avoid the not-un- formation | Uses direct "difficult" rather than hedging with "not straightforward" |
| Topic position establishes context | Opens with clear framing ("The first step") before the main point |
| Direct assertions | Makes concrete claims ("too conservative") without double negatives |
Assessment: the evaluation handles PG's informal conversational style well, catching genuine issues (comma splices, topic shifts in long paragraphs) while recognizing his strengths (direct assertions, clear framing). A 6.8/10 seems well calibrated for a conversational essay judged against formal style guides. Violations cite exact quotes, and the strengths explain what makes PG's writing effective.
| Issue | Severity | Example | Fix Applied |
|---|---|---|---|
| Role skew: 87% explanation | critical | "You should follow your interests" labeled as explanation instead of claim | Added disambiguation section to prompt differentiating claim/explanation/evidence with concrete tests |
| Distribution monotony | critical | Only 6 of 12 role types used | Added distribution check: "if >50% of segments share one role, reconsider" |
| Labels not self-contained | moderate | "intersection has shape" — intersection of what? | Updated prompt: labels must be understandable without the document, with good vs bad examples |
| Labels lack punctuation | moderate | "be prolific start lots of small things" — no comma/semicolon | Added punctuation guidance: "use commas, semicolons when combining ideas" |
| No author/source metadata | moderate | Header showed only filename and segment count | Added author and source_url params to render_role_map() + word count |
| Markdown artifacts in text | moderate | Table-border rows (`\| --- \| --- \|`) from markdownify in the document pane | Added clean_text_for_display() that strips table borders, pipe chars, excess whitespace |
| File | Change |
|---|---|
| `draft/core/roles.py` | Prompt: disambiguation section, distribution check, self-contained labels with punctuation |
| `draft/core/mapper.py` | Added `author`, `source_url` params to header; `clean_text_for_display()` |
| `draft/cli.py` | CLI `map`: added `--author` flag, text cleaning, source URL inference |
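The cleanup function might look roughly like this; the real implementation is in `draft/core/mapper.py`, so the exact rules here are assumptions based on the description above (table borders, pipe characters, excess whitespace).

```python
import re

def clean_text_for_display(text: str) -> str:
    """Strip markdown table artifacts left by markdownify (sketch)."""
    # Drop table separator rows such as | --- | --- |
    text = re.sub(r"^\s*\|?\s*:?-{2,}:?\s*(\|\s*:?-{2,}:?\s*)*\|?\s*$",
                  "", text, flags=re.MULTILINE)
    # Remove stray pipe characters left over from table cells
    text = text.replace("|", " ")
    # Collapse runs of whitespace
    return re.sub(r"\s+", " ", text).strip()
```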
Before: 54 explanation (87%) · 4 claim (6%) · 1 evidence (2%) · 1 context (2%) · 1 concession (2%) · 1 appeal (2%)
Only 6 of 12 role types used. Nearly everything tagged "explanation" regardless of whether the text was asserting a position, giving an analogy, or defining a term.
After: 32 claim (44%) · 16 explanation (22%) · 6 analogy (8%) · 5 definition (7%) · 4 qualification (5%) · 3 example (4%) · 3 concession (4%) · 2 appeal (3%) · 1 transition (1%) · 1 context (1%)
10 of 12 role types used. Claims correctly dominate (PG is argumentative), with explanations supporting them. Analogies, qualifications, and concessions properly identified.
Header before: `pg-great-work · 62 segments`
Header after: `pg-great-work · Paul Graham · paulgraham.com/greatwork.html · 11,638 words · 73 segments`
Three gyms measure extraction/evaluation quality across models. Each runs 3 corpus documents per model, comparing against reference extractions (role gym, claims gym) or meta-judge scoring (style gym).
Measures: role accuracy (40%), coverage (25%), boundary quality (20%), distribution realism (15%)
| Model | Overall | Accuracy | Coverage | Boundary | Distribution | Avg Segments |
|---|---|---|---|---|---|---|
| gemini-flash | 77.0 | 67.4 | 100.0 | 64.3 | 81.0 | 12 |
| sonnet | 73.5 | 64.2 | 100.0 | 57.0 | 75.9 | 16 |
| grok-fast | 71.0 | 62.1 | 100.0 | 49.3 | 75.2 | 16 |
| haiku | 67.7 | 58.4 | 100.0 | 37.3 | 78.7 | 21 |
Best single run: sonnet on narrative_report (82.9)
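The overall role-gym score is just the weighted sum of the four subscores listed above; applying the stated weights to a table row reproduces the published overall to within rounding of the subscores themselves.

```python
# Role-gym subscore weights from the section above.
WEIGHTS = {"accuracy": 0.40, "coverage": 0.25,
           "boundary": 0.20, "distribution": 0.15}

def overall(subscores: dict[str, float]) -> float:
    """Weighted sum of the four subscores, rounded to one decimal."""
    return round(sum(WEIGHTS[k] * v for k, v in subscores.items()), 1)

# gemini-flash row from the table above:
# overall({"accuracy": 67.4, "coverage": 100.0,
#          "boundary": 64.3, "distribution": 81.0})  -> 77.0
```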
Measures: specificity (30%), calibration (25%), coverage (25%), actionability (20%). Meta-judged by Claude Opus, the most capable model, used here as the meta-evaluator.
| Model | Overall | Avg Violations | Avg Latency |
|---|---|---|---|
| sonnet | 77.3 | 354.7 | 207.7s |
| grok-fast | 70.0 | 129.0 | 58.8s |
| haiku | 67.3 | 234.0 | 81.6s |
| Subscore | Average |
|---|---|
| Specificity | 72.8 |
| Actionability | 72.2 |
| Coverage | 71.8 |
| Calibration | 69.4 |
Best single run: sonnet on poor_academic (88.0)
Measures: claim detection recall, precision, evidence linking accuracy.
| Model | Overall | Detection | Precision | Evidence Linking |
|---|---|---|---|---|
| sonnet | 77.2 | 100.0 | 67.5 | 53.5 |
| grok-fast | 75.9 | 91.7 | 78.3 | 55.8 |
| haiku | 63.6 | 88.9 | 51.9 | 37.8 |
| gemini-flash | 59.7 | 66.7 | 61.1 | 44.4 |
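Detection recall and precision against the reference extractions can be sketched as set overlap. This is illustrative only: the real gym matches claims semantically against references, not by exact string equality.

```python
def detection_scores(predicted: set[str], reference: set[str]) -> tuple[float, float]:
    """Recall: share of reference claims the model found.
    Precision: share of the model's claims that match a reference claim."""
    hits = predicted & reference
    recall = 100.0 * len(hits) / len(reference) if reference else 0.0
    precision = 100.0 * len(hits) / len(predicted) if predicted else 0.0
    return recall, precision
```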
| Model | Roles | Style | Claims | Avg | Best For |
|---|---|---|---|---|---|
| sonnet | 73.5 | 77.3 | 77.2 | 76.0 | Style + Claims (quality over speed) |
| gemini-flash | 77.0 | — | 59.7 | 68.4 | Role extraction (fast + accurate) |
| grok-fast | 71.0 | 70.0 | 75.9 | 72.3 | Balanced (good claims, fast) |
| haiku | 67.7 | 67.3 | 63.6 | 66.2 | Budget option |
Recommendation: Use gemini-flash for role extraction (best accuracy, fastest). Use sonnet for style evaluation and claims (highest quality).
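The recommendation amounts to a small per-task routing table. The model IDs below are the shorthand used in this report, not exact API model names.

```python
# Per-task routing implied by the gym results above.
MODEL_FOR_TASK = {
    "roles": "gemini-flash",  # best role-gym score, fastest
    "style": "sonnet",        # best style-gym score
    "claims": "sonnet",       # best claims-gym score
}

def pick_model(task: str, budget: bool = False) -> str:
    """Route a pipeline stage to a model; haiku is the budget fallback."""
    return "haiku" if budget else MODEL_FOR_TASK[task]
```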
"Draft analyzes documents along three dimensions: what each paragraph does (rhetorical role), how well it's written (style), and whether claims are supported (evidence). Let me show you with Paul Graham's essay."
`draft map https://paulgraham.com/greatwork.html --author "Paul Graham"`
Show the two-panel view. Click a claim card in the sidebar → document scrolls to it with flash animation. Scroll through the document → sidebar follows. Point out the role distribution: "44% claims — PG is argumentative, and the tool catches that." Click an analogy to show it identified PG's metaphorical reasoning.
`draft style /tmp/pg-excerpt.md`
Show the score (6.8/10) and explain: "PG's informal style intentionally breaks some formal rules — comma splices, topic shifts — but the tool correctly identifies his strengths: direct assertions, clear framing, no hedging." Read one specific violation with its fix suggestion to show actionability.
`draft claims /tmp/pg-great-work.md`
Show claims with novelty/insight ratings. Point out: "N:4 I:5 means this is a genuinely novel insight — the tool filters out process descriptions and highlights the substantive claims."
"This pipeline works on any document — earnings calls, pitch memos, research papers, blog posts. The gym system continuously measures extraction quality across models."