# Draft Style Gym — Design

**Date**: 2026-02-28
**Status**: Approved
**Location**: `draft/style/`

## Context

The draft system analyzes documents across multiple dimensions. Style/craft evaluation
is one dimension alongside rhetorical structure (existing `core/roles.py`), with future
dimensions for cognitive load pacing, claim checking, and argument structure.

This design covers the **style/craft** dimension: evaluating and improving prose quality
against curated principles from canonical style guides.

## Key Research Finding

Evaluating 1-3 principles per LLM call dramatically outperforms dumping all criteria
at once (CheckEval: +0.45 inter-model agreement; DeCE: r=0.78 vs r=0.35 correlation
with human experts). The architecture follows this finding throughout.

References:
- CheckEval (EMNLP 2025): Binary decomposition of evaluation criteria
- DeCE (EMNLP 2025 Industry): Decomposed criteria-based evaluation, r=0.78
- LLM-Rubric (ACL 2024): Calibrated multi-dimensional aggregation
- WritingBench (NeurIPS 2025): Dynamic per-instance criteria generation
- G-Eval (EMNLP 2023): Chain-of-thought scoring with GPT-4

## Architecture

```
draft/style/
├── guides/
│   ├── checklists/              # Principle databases (YAML)
│   │   ├── strunk_white.yaml    # ~55 principles from Elements of Style
│   │   ├── williams_clarity.yaml # ~80 principles from Style: Clarity & Grace
│   │   ├── gopen_swan.yaml      # ~25 principles (scientific info flow)
│   │   ├── orwell.yaml          # ~15 principles (anti-pretension)
│   │   └── pinker_zinsser.yaml  # ~40 principles (distilled)
│   ├── cache/                   # Full texts for reference (gitignored)
│   │   ├── strunk_1918.txt      # Project Gutenberg, public domain
│   │   ├── gopen_swan_1990.pdf  # USENIX reprint, permission-granted
│   │   └── README.md            # Download URLs + instructions
│   └── __init__.py              # load_checklist(), list_guides(), filter()
├── orient.py                    # Document orientation/triage
├── evaluate.py                  # Few-principles-at-a-time scoring engine
├── gym.py                       # Gen→eval→learn loop (lib/gym integration)
├── report.py                    # HTML report rendering
├── __init__.py
└── tests/
```

## Three Layers

### Layer 1: Principle Library (`guides/`)

Each principle is structured YAML for programmatic use:

```yaml
source: "The Elements of Style (1918)"
url: "https://www.gutenberg.org/files/37134/37134-h/37134-h.htm"
focus: "Usage rules, composition principles, style reminders"
principles:
  - id: sw-13
    name: "Omit needless words"
    category: concision
    tags: [composition, clarity, filler]
    rule: >
      A sentence should contain no unnecessary words, a paragraph no
      unnecessary sentences. Every word should tell.
    detect: >
      Look for: 'the question as to whether' → 'whether',
      'there is no doubt but that' → 'doubtless',
      'the fact that' → cut, 'in order to' → 'to'
    examples:
      bad: ["owing to the fact that", "he is a man who", "the reason why is that"]
      good: ["since/because", "he", "because"]
    severity: moderate
```
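A minimal sketch of the loaders promised in `guides/__init__.py`, assuming the YAML has already been parsed into a dict (e.g. with PyYAML's `safe_load`); the `Principle` field set mirrors the example above, and `parse_checklist`/`filter_principles` are illustrative names, not the final API:

```python
from dataclasses import dataclass, field


@dataclass
class Principle:
    id: str
    name: str
    category: str
    rule: str
    tags: list[str] = field(default_factory=list)
    severity: str = "moderate"


def parse_checklist(raw: dict) -> list[Principle]:
    """Turn one parsed guide file (shape as in the YAML above) into records."""
    return [
        Principle(
            id=p["id"],
            name=p["name"],
            category=p["category"],
            rule=p["rule"].strip(),
            tags=p.get("tags", []),
            severity=p.get("severity", "moderate"),
        )
        for p in raw["principles"]
    ]


def filter_principles(principles: list[Principle], categories: set[str]) -> list[Principle]:
    """Keep only principles in the orientation-selected categories."""
    return [p for p in principles if p.category in categories]
```

Keeping the dataclass flat (no nested `examples`/`detect` objects at this layer) makes it cheap to serialize back into evaluation prompts.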

### Layer 2: Orientation (`orient.py`)

A single cheap LLM call (gemini-flash, ~500 tokens out) that triages the document:

```python
@dataclass
class Orientation:
    document_type: str        # technical_report, essay, proposal, narrative, ...
    register: str             # academic, professional, casual, literary
    audience: str             # expert, general, mixed
    language: str             # en, fr, de, ... (multilingual support)
    length_class: str         # short (<500w), medium, long (>3000w)
    effort_level: str         # light, standard, deep
    primary_concerns: list[str]   # top 3-5 relevant principle categories
    relevant_guides: dict[str, list[str]]  # guide → applicable categories
    skip_categories: list[str]    # what NOT to evaluate
    rationale: str            # one-paragraph explanation of choices
```

This is what WritingBench calls "dynamic criteria generation", except that orientation selects from our curated library rather than generating criteria from scratch.
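A sketch of the response-handling side of the triage call, assuming the model is asked to return JSON; the trimmed `Orientation` fields and the `parse_orientation` helper are illustrative, and the real dataclass carries the full field set shown above:

```python
import json
from dataclasses import dataclass, field, fields


@dataclass
class Orientation:  # trimmed to a few fields for the sketch
    document_type: str = "essay"
    language: str = "en"
    effort_level: str = "standard"
    primary_concerns: list[str] = field(default_factory=list)
    skip_categories: list[str] = field(default_factory=list)


def parse_orientation(reply: str) -> Orientation:
    """Parse the model's JSON reply, ignoring unexpected keys so a
    chatty model cannot break the pipeline; absent keys fall back to
    the dataclass defaults."""
    data = json.loads(reply)
    allowed = {f.name for f in fields(Orientation)}
    return Orientation(**{k: v for k, v in data.items() if k in allowed})
```

Defaulting every field means a partially malformed triage reply degrades to a generic "standard effort, English essay" orientation instead of crashing the run.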

### Layer 3: Principle-Level Evaluation (`evaluate.py`)

Core loop following CheckEval/DeCE architecture:

```
For each batch of 2-3 principles:
    → "Does this document violate principle X? Does it exemplify it?"
    → If violation: quote it, suggest fix, rate severity
    → Score: 0 (clear violation) to 10 (exemplary application)

Aggregate → per-category scores + overall score + prioritized fixes
```

Supports two modes:
- **Absolute**: score a single document
- **Comparative**: score doc A vs doc B on the same principles (version delta)
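The batching and roll-up around the core loop can be sketched as follows; the per-principle judge call itself is elided, and the unweighted means are an assumption (LLM-Rubric argues for calibrated aggregation, which could replace `mean` here):

```python
from itertools import islice
from statistics import mean


def batched(items, size):
    """Yield successive small batches of principles (CheckEval/DeCE style:
    2-3 per LLM call rather than the whole checklist at once)."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def aggregate(scores: dict[str, tuple[str, int]]) -> dict:
    """scores maps principle id -> (category, 0-10 score).
    Roll up to per-category means and an overall mean."""
    by_cat: dict[str, list[int]] = {}
    for cat, s in scores.values():
        by_cat.setdefault(cat, []).append(s)
    return {
        "per_category": {c: mean(v) for c, v in by_cat.items()},
        "overall": mean(s for _, s in scores.values()),
    }
```

Comparative mode falls out naturally: run the same batches over doc A and doc B and diff the two aggregates.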

## Integration Points

### Into existing review pipeline
The Language lens in `vario/review.py` gains an optional `style_principles` parameter.
Orientation selects the relevant principles; they're injected as `<style-guide>` XML.

### Into lib/gym infrastructure
- `GymBase` subclass for writing evaluation
- Results stored in JSONL corpus
- Track quality over document revisions
- Calibration: test judge monotonicity (degrade input → score should drop)
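The monotonicity check in the last bullet can be sketched as a one-line property test; `judge` and `degrade` are stand-ins for the real gym hooks, not existing lib/gym functions:

```python
def check_monotonic(judge, degrade, doc: str) -> bool:
    """Deliberately degrade the document and verify the judge's
    score drops. A judge that fails this is not trustworthy."""
    return judge(degrade(doc)) < judge(doc)
```

Running this over a corpus of known-good documents with several degradation operators (inject filler, shuffle sentences, flatten transitions) gives a cheap regression suite for the judge prompt.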

### UI
Two new tabs in draft/ Gradio UI:
- **Evaluate**: paste doc → orientation → principle scores → report
- **Guides**: browse principle library, view full texts

## Guide Sources

### Cacheable Full Texts (free, legal)

| Source | Format | URL | Status |
|--------|--------|-----|--------|
| Strunk (1918) | TXT | Project Gutenberg #37134 | Public domain |
| Orwell — Politics & Language | HTML | orwellfoundation.com | Estate-authorized |
| Gopen & Swan (1990) | PDF | usenix.org reprint | Permission-granted |

### Distill Only (copyrighted — our own principle extraction)

| Source | Focus | ~Principles |
|--------|-------|-------------|
| Williams — Clarity & Grace | Sentence craft, cohesion, stress position | ~80 |
| Pinker — Sense of Style | Classic style, curse of knowledge | ~25 |
| Zinsser — On Writing Well | Simplicity, clutter, audience | ~15 |
| Lanham — Revising Prose | "Paramedic method" for cutting lard | ~10 |

### Future / Reference

| Source | Focus | Priority |
|--------|-------|----------|
| Garner — Modern English Usage | Usage disputes, word choice | C |
| Fowler (1926) | Classic usage reference | C (public domain) |
| Chicago Manual of Style | Editorial conventions | C |
| Tufte — Visual Display | Data presentation in prose | B |

## Relationship to Other Draft Dimensions

Style/craft is one evaluation dimension. The `draft/` system will grow to cover:

| Dimension | Subdirectory | What it evaluates |
|-----------|-------------|-------------------|
| **Rhetorical structure** | `core/` (exists) | What each chunk *does* (claim, evidence, ...) |
| **Style/craft** | `style/` (this design) | Prose quality against curated principles |
| **Cognitive load** | `cognitive/` (future) | Information pacing, working memory load, density |
| **Claims** | `claims/` (future) | Factual accuracy, evidence-claim alignment |
| **Argument** | `argument/` (future) | Logical structure, counterarguments, completeness |

Each dimension has its own orientation, evaluation, and gym loop. Cross-dimension
synthesis happens at the review level (existing `vario/review.py`).

## Notes

### Multilingual advantage
This architecture generalizes to other languages. The orientation system detects language;
principle libraries can have per-language variants or language-specific guides (e.g.,
Académie française style for French, Duden for German). The few-at-a-time evaluation
approach works in any language — principles are expressed as natural language criteria,
not English-specific regex rules. **This is a differentiator** — most writing evaluation
tools are English-only.

### These evaluation methods will only get stronger
- LLM-as-judge accuracy improves with each model generation
- Finetuned critic models (WritingBench's Qwen-7B) will get cheaper and better
- The principle library is the durable asset — evaluation methods are swappable
- Today's ceiling (r=0.78 with human experts via DeCE) will be surpassed
- As models improve, the same principle library produces better evaluations for free

### Principle usage tracking (adaptive prioritization)
Track every principle's hit rate across real documents:
- **fired**: principle detected a violation or exemplary usage
- **skipped**: orientation excluded it as irrelevant
- **inert**: evaluated but found nothing (neither violation nor exemplar)

Over time, this produces a heat map: which principles actually matter in practice.
Principles that are consistently inert across many documents get deprioritized —
they're still in the library but dropped from default evaluation batches. Principles
that fire frequently get promoted to earlier evaluation batches.

Storage: append to the gym's JSONL corpus — each evaluation records per-principle
outcomes. Aggregation produces a `principle_stats.json` with hit rates, recency,
and a computed priority score.

This is the learning loop: the principle library self-prunes through usage, and
orientation can use historical hit rates as a signal when selecting principles
for a new document.
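The aggregation step can be sketched as a pass over the JSONL corpus; the per-record shape (a `"principles"` map of id to outcome) is an assumption about how the gym will log evaluations:

```python
import json
from collections import Counter


def principle_stats(jsonl_lines) -> dict:
    """Aggregate per-principle outcomes ('fired' / 'skipped' / 'inert')
    from the gym's JSONL corpus into hit rates."""
    counts: dict[str, Counter] = {}
    for line in jsonl_lines:
        record = json.loads(line)
        for pid, outcome in record["principles"].items():
            counts.setdefault(pid, Counter())[outcome] += 1
    stats = {}
    for pid, c in counts.items():
        evaluated = c["fired"] + c["inert"]  # 'skipped' never reached the judge
        stats[pid] = {
            "fired": c["fired"],
            "inert": c["inert"],
            "skipped": c["skipped"],
            "hit_rate": c["fired"] / evaluated if evaluated else 0.0,
        }
    return stats
```

Writing the result to `principle_stats.json` on a schedule (rather than per evaluation) keeps the hot path append-only.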

### Prompt ordering for caching

LLM prompt caches work on the prefix — put shared content first, varying content after.

**Within a single evaluation** (one doc, many principle batches):
- Document is shared → put it first as `<document>` prefix
- Principle batches vary → put them after as `<principles>` suffix
- Result: the document prefix is cached across all batch calls

**Cross-document evaluation** (same principles, many docs):
- Principles are shared → put them first
- Document varies → put it after

**General rule**: whichever operand is CONSTANT across a set of calls goes first.
This is a separate concern from evaluation logic — revisit ordering when usage
patterns change (e.g., batch-evaluating many documents on the same principles
for gym/corpus work would want the reverse order).
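The rule above reduces to a single ordering switch in prompt assembly; a minimal sketch, with the `<document>`/`<principles>` tags taken from this design and the function name illustrative:

```python
def build_prompt(document: str, principles: str, constant: str = "document") -> str:
    """Place the operand that is constant across a set of calls first,
    so it lands in the provider's prefix cache."""
    doc_block = f"<document>\n{document}\n</document>"
    prin_block = f"<principles>\n{principles}\n</principles>"
    if constant == "document":  # one doc, many principle batches
        return f"{doc_block}\n\n{prin_block}"
    return f"{prin_block}\n\n{doc_block}"  # many docs, same principles
```

With the default ordering, every batch call in a single-document evaluation reuses the cached document prefix and pays only for the small principle suffix.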

### Few-at-a-time is architecturally correct
The research is clear: decomposed evaluation beats monolithic evaluation. This means
the principle library's value scales with size — adding more principles doesn't degrade
quality because they're evaluated in small batches, not crammed into one prompt.

## Implementation Order

| # | What | Effort |
|---|------|--------|
| 1 | Principle YAML files (all 5 guides) | Medium — thorough principle extraction |
| 2 | `orient.py` + prompt | Small — single LLM call |
| 3 | `evaluate.py` + few-at-a-time engine | Medium — core scoring loop |
| 4 | Cache setup + download script | Small |
| 5 | Integrate into Language lens | Small — wire into existing review.py |
| 6 | Gym loop + JSONL storage | Small — extends lib/gym |
| 7 | UI tabs (Evaluate + Guides) | Medium |
| 8 | A/B comparison mode | Small — runs evaluate twice + diffs |
| 9 | HTML report | Medium |
