Draft analyzes documents along three dimensions: rhetorical role mapping (what each paragraph does), style evaluation (prose quality against 215 curated principles), and claims analysis (factual accuracy + evidence linking).
This session focused on making Draft demo-ready using Paul Graham's "How to Do Great Work" (11,517 words) as the test document. The main finding: the role extraction prompt was labeling 87% of segments as "explanation", missing the essay's many claims, analogies, and examples. After prompt engineering, the distribution improved dramatically to 44% claims / 22% explanation / 34% other roles.
| Metric | Before | After | Change |
|---|---|---|---|
| Role diversity (non-explanation %) | 13% | 78% | +65pp |
| Total segments extracted | 62 | 73 | +11 |
| Claim segments identified | 4 | 32 | +28 |
| Explanation segments | 54 | 16 | -38 |
| Unique role types used | 6 | 10 | +4 |
| Style eval score (PG excerpt) | — | 6.8/10 | baseline |
The role map identifies the rhetorical function of each text chunk using a 12-role taxonomy (claim, evidence, example, explanation, and so on). It produces a self-contained interactive HTML file: a left sidebar lists segment cards, the right panel shows the highlighted document text, and two-way scroll sync keeps them aligned (click either side to jump to the matching position).
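Under the hood, a segment can be modeled as a small record. The sketch below is illustrative: field names and the role list are inferred from the tables in this report, not taken from Draft's actual schema.

```python
from dataclasses import dataclass

# Roles observed in this session's output. The full taxonomy has 12 roles;
# only these 11 appear in the before/after distributions shown here.
ROLES = {
    "claim", "explanation", "analogy", "definition", "qualification",
    "example", "concession", "appeal", "transition", "context", "evidence",
}

@dataclass
class Segment:
    """One rhetorical unit of the document (illustrative schema)."""
    label: str   # self-contained phrase, e.g. "four steps to great work"
    role: str    # one of ROLES
    start: int   # character offset into the source text
    end: int

def role_distribution(segments: list[Segment]) -> dict[str, float]:
    """Fraction of segments per role, as shown in the sidebar summary."""
    counts: dict[str, int] = {}
    for seg in segments:
        counts[seg.role] = counts.get(seg.role, 0) + 1
    total = len(segments) or 1
    return {role: n / total for role, n in counts.items()}
```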
| Role | Count | % | Description |
|---|---|---|---|
| ⚡ claim | 32 | 44% | Assertions PG wants you to accept |
| 🔍 explanation | 16 | 22% | Clarifying how/why mechanisms work |
| 🔗 analogy | 6 | 8% | Comparisons making abstract concrete |
| 📖 definition | 5 | 7% | Introducing key terms |
| ⚠️ qualification | 4 | 5% | Caveats and scope limits |
| 💡 example | 3 | 4% | Concrete illustrations |
| 🤝 concession | 3 | 4% | Acknowledging counterpoints |
| 📢 appeal | 2 | 3% | Calls to action |
| → transition | 1 | 1% | Section connectors |
| 📋 context | 1 | 1% | Background/framing |
The distribution matches what you'd expect from PG's writing: predominantly claims (he states positions) with explanations supporting them, sprinkled with analogies and definitions. The previous 87% explanation rate was clearly wrong — PG's essay is argumentative, not expository.
Labels improved from vague references ("the four steps", "intersection has shape") to self-contained phrases ("four steps to great work", "great work techniques overlap across fields"). The prompt now requires labels to be understandable without reading the document, and to use proper punctuation when combining ideas.
The map header now displays author name, clickable source URL, word count, and segment count; previously it showed only the filename and segment count.
The style evaluation scores prose quality against 215 curated principles from Strunk & White, Williams' Clarity, Pinker & Zinsser, Gopen & Swan, and Orwell. The pipeline: orient (triage what kind of writing this is) → select (pick relevant principles) → batch evaluate (2-3 principles per LLM call, for accuracy) → aggregate (category scores, top issues, top strengths).
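The batch-evaluate step can be sketched as simple chunking: a minimal version, assuming the principles arrive as a flat list (the real pipeline also carries per-principle metadata).

```python
from itertools import islice

def batch_principles(principles: list[str], size: int = 3):
    """Yield batches of at most `size` principles, so each LLM call
    evaluates only 2-3 of them; small batches keep the judge focused."""
    it = iter(principles)
    while batch := list(islice(it, size)):
        yield batch

# 215 principles at 3 per call -> 72 LLM calls per document
# (the final batch holds the 2 leftover principles)
```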
| Category | Score | Violations | Exemplars |
|---|---|---|---|
| Strunk & White | 6.2/10 | 5 | 8 |
| Williams' Clarity | 6.3/10 | 69 | 114 |
| Other (Pinker/Zinsser, Gopen/Swan, Orwell) | 7.5/10 | 54 | 130 |
| Severity | Principle | Issue |
|---|---|---|
| critical | No comma splice | Comma splice between independent clauses: "When you're young you don't know what you're good at or what..." |
| critical | Consistent topic strings | Topic shifts: 'you' → 'some kinds of work' → 'some people'/'most' within one paragraph |
| critical | One story per unit | Paragraph combines three stories: (1) young people lack self-knowledge, (2) some work doesn't exist yet, (3) discovery through working |
| critical | Eliminate zombie nouns | Nominalizations "intersection" and "shape" obscure the underlying action |
| Principle | Example |
|---|---|
| Avoid the not-un- formation | Uses direct "difficult" rather than hedging with "not straightforward" |
| Topic position establishes context | Opens with clear framing ("The first step") before the main point |
| Direct assertions | Makes concrete claims ("too conservative") without double negatives |
Assessment: the evaluation handles PG's informal conversational style well, catching genuine issues (comma splices, topic shifts in long paragraphs) while recognizing his strengths (direct assertions, clear framing). A 6.8/10 seems well calibrated for a conversational essay judged against formal style guides. Violations cite exact quotes, and the strengths explain what makes PG's writing effective.
| Issue | Severity | Example | Fix Applied |
|---|---|---|---|
| Role skew: 87% explanation | critical | "You should follow your interests" labeled as explanation instead of claim | Added disambiguation section to prompt differentiating claim/explanation/evidence with concrete tests |
| Distribution monotony | critical | Only 6 of 12 role types used | Added distribution check: "if >50% of segments share one role, reconsider" |
| Labels not self-contained | moderate | "intersection has shape" — intersection of what? | Updated prompt: labels must be understandable without the document, with good vs bad examples |
| Labels lack punctuation | moderate | "be prolific start lots of small things" — no comma/semicolon | Added punctuation guidance: "use commas, semicolons when combining ideas" |
| No author/source metadata | moderate | Header showed only filename and segment count | Added author and source_url params to render_role_map() + word count |
| Markdown artifacts in text | moderate | Table-border rows (`\| --- \| --- \|`) from markdownify in the document pane | Added clean_text_for_display() that strips table borders, pipe chars, excess whitespace |
| File | Change |
|---|---|
| `draft/core/roles.py` | Prompt: disambiguation section, distribution check, self-contained labels with punctuation |
| `draft/core/mapper.py` | Added `author`, `source_url` params to header; `clean_text_for_display()` |
| `draft/cli.py` | CLI `map`: added `--author` flag, text cleaning, source URL inference |
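The cleanup function might look roughly like this; the real implementation is in `draft/core/mapper.py`, so the exact rules here are assumptions based on the description above (table borders, pipe characters, excess whitespace).

```python
import re

def clean_text_for_display(text: str) -> str:
    """Strip markdown table artifacts left by markdownify (sketch)."""
    # Drop table separator rows such as | --- | --- |
    text = re.sub(r"^\s*\|?\s*:?-{2,}:?\s*(\|\s*:?-{2,}:?\s*)*\|?\s*$",
                  "", text, flags=re.MULTILINE)
    # Remove stray pipe characters left over from table cells
    text = text.replace("|", " ")
    # Collapse runs of whitespace
    return re.sub(r"\s+", " ", text).strip()
```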
Before: 54 explanation (87%) · 4 claim (6%) · 1 evidence (2%) · 1 context (2%) · 1 concession (2%) · 1 appeal (2%)
Only 6 of 12 role types used. Nearly everything tagged "explanation" regardless of whether the text was asserting a position, giving an analogy, or defining a term.
After: 32 claim (44%) · 16 explanation (22%) · 6 analogy (8%) · 5 definition (7%) · 4 qualification (5%) · 3 example (4%) · 3 concession (4%) · 2 appeal (3%) · 1 transition (1%) · 1 context (1%)
10 of 12 role types used. Claims correctly dominate (PG is argumentative), with explanations supporting them. Analogies, qualifications, and concessions properly identified.
Header before: `pg-great-work · 62 segments`
Header after: `pg-great-work · Paul Graham · paulgraham.com/greatwork.html · 11,638 words · 73 segments`
Three gyms measure extraction/evaluation quality across models. Each runs 3 corpus documents per model, comparing against reference extractions (role gym, claims gym) or meta-judge scoring (style gym).
Measures: role accuracy (40%), coverage (25%), boundary quality (20%), distribution realism (15%)
| Model | Overall | Accuracy | Coverage | Boundary | Distribution | Avg Segments |
|---|---|---|---|---|---|---|
| gemini-flash | 77.0 | 67.4 | 100.0 | 64.3 | 81.0 | 12 |
| sonnet | 73.5 | 64.2 | 100.0 | 57.0 | 75.9 | 16 |
| grok-fast | 71.0 | 62.1 | 100.0 | 49.3 | 75.2 | 16 |
| haiku | 67.7 | 58.4 | 100.0 | 37.3 | 78.7 | 21 |
Best single run: sonnet on narrative_report (82.9)
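The overall role-gym score is just the weighted sum of the four subscores listed above; applying the stated weights to a table row reproduces the published overall to within rounding of the subscores themselves.

```python
# Role-gym subscore weights from the section above.
WEIGHTS = {"accuracy": 0.40, "coverage": 0.25,
           "boundary": 0.20, "distribution": 0.15}

def overall(subscores: dict[str, float]) -> float:
    """Weighted sum of the four subscores, rounded to one decimal."""
    return round(sum(WEIGHTS[k] * v for k, v in subscores.items()), 1)

# gemini-flash row from the table above:
# overall({"accuracy": 67.4, "coverage": 100.0,
#          "boundary": 64.3, "distribution": 81.0})  -> 77.0
```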
Measures: specificity (30%), calibration (25%), coverage (25%), actionability (20%). Meta-judged by Claude Opus, the most capable model, used here as the meta-evaluator.
| Model | Overall | Avg Violations | Avg Latency |
|---|---|---|---|
| sonnet | 77.3 | 354.7 | 207.7s |
| grok-fast | 70.0 | 129.0 | 58.8s |
| haiku | 67.3 | 234.0 | 81.6s |
| Subscore | Average |
|---|---|
| Specificity | 72.8 |
| Actionability | 72.2 |
| Coverage | 71.8 |
| Calibration | 69.4 |
Best single run: sonnet on poor_academic (88.0)
Measures: claim detection recall, precision, evidence linking accuracy.
| Model | Overall | Detection | Precision | Evidence Linking |
|---|---|---|---|---|
| sonnet | 77.2 | 100.0 | 67.5 | 53.5 |
| grok-fast | 75.9 | 91.7 | 78.3 | 55.8 |
| haiku | 63.6 | 88.9 | 51.9 | 37.8 |
| gemini-flash | 59.7 | 66.7 | 61.1 | 44.4 |
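Detection recall and precision against the reference extractions can be sketched as set overlap. This is illustrative only: the real gym matches claims semantically against references, not by exact string equality.

```python
def detection_scores(predicted: set[str], reference: set[str]) -> tuple[float, float]:
    """Recall: share of reference claims the model found.
    Precision: share of the model's claims that match a reference claim."""
    hits = predicted & reference
    recall = 100.0 * len(hits) / len(reference) if reference else 0.0
    precision = 100.0 * len(hits) / len(predicted) if predicted else 0.0
    return recall, precision
```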
| Model | Roles | Style | Claims | Avg | Best For |
|---|---|---|---|---|---|
| sonnet | 73.5 | 77.3 | 77.2 | 76.0 | Style + Claims (quality over speed) |
| gemini-flash | 77.0 | — | 59.7 | 68.4 | Role extraction (fast + accurate) |
| grok-fast | 71.0 | 70.0 | 75.9 | 72.3 | Balanced (good claims, fast) |
| haiku | 67.7 | 67.3 | 63.6 | 66.2 | Budget option |
Recommendation: Use gemini-flash for role extraction (best accuracy, fastest). Use sonnet for style evaluation and claims (highest quality).
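The recommendation amounts to a small per-task routing table. The model IDs below are the shorthand used in this report, not exact API model names.

```python
# Per-task routing implied by the gym results above.
MODEL_FOR_TASK = {
    "roles": "gemini-flash",  # best role-gym score, fastest
    "style": "sonnet",        # best style-gym score
    "claims": "sonnet",       # best claims-gym score
}

def pick_model(task: str, budget: bool = False) -> str:
    """Route a pipeline stage to a model; haiku is the budget fallback."""
    return "haiku" if budget else MODEL_FOR_TASK[task]
```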
"Draft analyzes documents along three dimensions: what each paragraph does (rhetorical role), how well it's written (style), and whether claims are supported (evidence). Let me show you with Paul Graham's essay."
`draft map https://paulgraham.com/greatwork.html --author "Paul Graham"`
Show the two-panel view. Click a claim card in the sidebar → document scrolls to it with flash animation. Scroll through the document → sidebar follows. Point out the role distribution: "44% claims — PG is argumentative, and the tool catches that." Click an analogy to show it identified PG's metaphorical reasoning.
`draft style /tmp/pg-excerpt.md`
Show the score (6.8/10) and explain: "PG's informal style intentionally breaks some formal rules — comma splices, topic shifts — but the tool correctly identifies his strengths: direct assertions, clear framing, no hedging." Read one specific violation with its fix suggestion to show actionability.
`draft claims /tmp/pg-great-work.md`
Show claims with novelty/insight ratings. Point out: "N:4 I:5 means this is a genuinely novel insight — the tool filters out process descriptions and highlights the substantive claims."
"This pipeline works on any document — earnings calls, pitch memos, research papers, blog posts. The gym system continuously measures extraction quality across models."