ASCII Diagram Gym — Iteration Demo

Demonstrating the gen → eval → learn loop for diagram quality improvement
Generated 2026-03-21 · 10 challenges · 2 rounds (naive vs skilled)

Round 1 (Naive): 0.42 avg score · no conventions
Round 2 (Skilled): 0.86 avg score · with skill conventions
Improvement: +105% (0.42 → 0.86 overall)
Challenges: 10 · pipelines, architecture, trees, ratings

1. The Gen → Eval → Learn Loop

GEN    Generate:    LLM produces a diagram from a text prompt
EVAL   Evaluate:    programmatic alignment check + feature detection + LLM judge
LEARN  Learn:       extract failure patterns → codify as skill conventions
GEN    Re-generate: LLM uses learned conventions in the system prompt

The gym runs a closed loop:

  1. Generate — An LLM receives a text prompt (e.g., "Draw a 3-stage pipeline") and produces a Unicode box-drawing diagram.
  2. Evaluate — The diagram is scored on four axes:
    • Alignment (programmatic): display-width validation of every box line
    • Features (programmatic): are expected structural elements present (boxes, arrows, branching)?
    • Conventions (programmatic): correct border hierarchy, padding, arrow usage?
    • Quality (LLM judge): Sonnet rates alignment, clarity, convention adherence, visual appeal (1-10 each)
  3. Learn — Failure patterns from evaluation are extracted and codified into a reusable skill document with explicit conventions, examples, and a verification procedure.
  4. Re-generate — The skill is injected into the system prompt. The same challenges are re-run and re-scored to measure improvement.
Why this matters: Without conventions, LLMs produce diagrams that look reasonable but have systematic alignment errors, inconsistent border styles, and missing structural elements. The gym makes these failures measurable, and the skill makes the fix reusable across every future diagram request.
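The alignment axis in step 2 can be checked mechanically. Below is a minimal sketch of such a display-width validator (the function names are illustrative, not the gym's actual code). It assumes every line of a container-wrapped diagram should share one display width, and that East Asian Wide/Fullwidth characters occupy two terminal cells while everything else, including box-drawing characters, occupies one:

```python
import unicodedata

def display_width(line: str) -> int:
    """Terminal cell width of a line (East Asian Wide/Fullwidth count as 2)."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in line)

def check_alignment(diagram: str) -> list[str]:
    """Report every non-blank line whose width differs from the first line's."""
    lines = [l for l in diagram.splitlines() if l.strip()]
    if not lines:
        return []
    expected = display_width(lines[0])
    return [f"line {i}: width {display_width(l)} != {expected}"
            for i, l in enumerate(lines, 1)
            if display_width(l) != expected]
```

A perfectly aligned box yields an empty report; a line one character too wide yields one message, which is exactly the ragged-border signal the evaluator scores against.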

2. Round 1 — Naive (No Conventions)

Prompt template: "Draw an ASCII diagram of: {description}" — no skill, no conventions, no verification instructions.

Challenge Alignment Features Conventions Judge Overall Issues
pipeline 0.40 1.00 0.25 0.50 0.52 Ragged right borders, plain dashes instead of box chars
architecture 0.25 0.67 0.25 0.40 0.38 Single borders for container, no inner structure
hierarchy 0.55 1.00 0.00 0.50 0.52 Used plain text tree, not box-drawing chars
data_flow 0.40 0.50 0.25 0.40 0.39 Missing error path, inconsistent spacing
rating 0.70 0.33 0.25 0.30 0.39 Used stars instead of parallelograms, no border
nested_containers 0.25 0.33 0.25 0.30 0.28 Flat layout, no nesting, single border style
bidirectional 0.55 0.67 0.25 0.50 0.49 Used ---> text arrows, not Unicode
comparison_table 0.55 0.33 0.25 0.30 0.36 Plain text table, no ratings, no borders
error_handling 0.40 0.67 0.25 0.40 0.42 Error branch missing, boxes misaligned
microservices 0.25 0.67 0.25 0.30 0.35 All boxes same style, gateway not differentiated
Average 0.43 0.62 0.23 0.39 0.42

Score Distribution (Round 1)

Alignment 0.43 · Features 0.62 · Conventions 0.23 · Judge Quality 0.39 · Overall 0.42

3. Learnings Extracted from Round 1

The evaluation surfaced five systematic failure patterns across all 10 challenges:

# Pattern Frequency Impact Example
1 Ragged right borders 8/10 High Box lines vary by 1-3 chars — looks broken in monospace
2 Single border style for all levels 9/10 High Containers, components, and leaves all use +---+ or |---|
3 ASCII arrows instead of Unicode 7/10 Medium ---> instead of ───▶, | instead of │
4 Missing structural elements 5/10 Medium Error branches omitted, nesting flattened, ratings as text
5 No verification step 10/10 High Model never checks its own output — alignment errors invisible to it
Key insight: The model knows Unicode box-drawing characters exist (it uses them occasionally) but defaults to plain ASCII without explicit instruction. Convention adherence (0.23) was the weakest axis — the model has no concept of border hierarchy without being told. The verification gap (pattern #5) means even when it uses the right characters, it doesn't check alignment.
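Pattern #3 is the easiest of these to flag programmatically. A minimal sketch of an arrow detector (the regexes and names are illustrative, not the gym's actual detectors):

```python
import re

# Illustrative patterns: ASCII arrows like "-->" / "<--" / "==>" versus
# Unicode arrows such as "───▶", "━━▶", or a bare "▼".
ASCII_ARROW = re.compile(r"-{2,}>|<-{2,}|={2,}>")
UNICODE_ARROW = re.compile(r"[─━]*[▶►→▼◀◄←▲]")

def arrow_report(diagram: str) -> dict:
    """Count ASCII-style versus Unicode arrows in a diagram."""
    return {
        "ascii": len(ASCII_ARROW.findall(diagram)),
        "unicode": len(UNICODE_ARROW.findall(diagram)),
    }
```

A nonzero "ascii" count on a diagram that should use Unicode arrows is an immediate conventions penalty.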

4. Skill Conventions (Codified from Learnings)

Each failure pattern was converted into an explicit convention in the ASCII diagram skill:

Failure Pattern Convention Added Mechanism
Ragged right borders Alignment Rules #1: every line between top-left and top-right corner must be the same display width Explicit counting instruction + display-width validation code
Single border style Box Borders by Level: 4-level hierarchy table (container=double, component=single, leaf=rounded, emphasis=heavy) Lookup table with chars and use-cases
ASCII arrows Arrows table: Unicode arrow chars by weight and direction Reference table with symbols and when to use each
Missing structure Process: identify hierarchy → sketch → draw inside-out → add arrows → align → verify Step-by-step procedure ensuring all elements are placed
No verification Verification Procedure: count char widths, confirm line widths match, check nesting fits, verify column alignment Post-drawing checklist + Python validation code

The skill also includes a Common Mistakes table mapping each mistake to why it's wrong and how to fix it — giving the model explicit error-correction knowledge.
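The four-level border hierarchy from the conventions table can be captured as a small lookup table. A sketch, where the data structure and the draw_box helper are illustrative rather than taken from the skill itself:

```python
# Four-level border hierarchy: corners, horizontal rule, vertical rule.
BORDER_LEVELS = {
    "container": {"corners": "╔╗╚╝", "h": "═", "v": "║"},  # double
    "component": {"corners": "┌┐└┘", "h": "─", "v": "│"},  # single
    "leaf":      {"corners": "╭╮╰╯", "h": "─", "v": "│"},  # rounded
    "emphasis":  {"corners": "┏┓┗┛", "h": "━", "v": "┃"},  # heavy
}

def draw_box(label: str, level: str) -> str:
    """Draw a one-line box at the given hierarchy level."""
    c = BORDER_LEVELS[level]
    tl, tr, bl, br = c["corners"]
    inner = f" {label} "
    return "\n".join([
        tl + c["h"] * len(inner) + tr,
        c["v"] + inner + c["v"],
        bl + c["h"] * len(inner) + br,
    ])
```

Making the hierarchy a lookup table is what turns "containers should look different from leaves" into a rule a checker can verify character by character.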

5. Round 2 — With Skill Conventions

Same 10 challenges, but now the system prompt includes the full ASCII diagram skill: border hierarchy, arrow reference, alignment rules, verification procedure, and common mistakes.

Challenge Alignment Features Conventions Judge Overall Notes
pipeline 0.85 1.00 1.00 0.80 0.90 Double container, rounded leaves, heavy arrows
architecture 0.85 1.00 1.00 0.80 0.90 Proper nesting, gateway differentiated
hierarchy 1.00 1.00 0.75 0.80 0.90 Leaf boxes with branching connectors
data_flow 0.85 1.00 0.75 0.80 0.85 Error branch present with dashed arrow
rating 1.00 1.00 0.75 0.90 0.92 Correct parallelogram symbols, bordered
nested_containers 0.70 1.00 1.00 0.70 0.83 3-level hierarchy: double > single > rounded
bidirectional 1.00 1.00 0.75 0.80 0.89 Unicode arrows both directions, right-side labels
comparison_table 0.70 1.00 0.75 0.70 0.78 Bordered table with parallelogram ratings
error_handling 0.85 1.00 0.75 0.80 0.85 Error branch with dashed connector
microservices 0.70 1.00 0.75 0.70 0.78 Gateway emphasized, services as components
Average 0.85 1.00 0.83 0.78 0.86

Score Distribution (Round 2)

Alignment 0.85 · Features 1.00 · Conventions 0.83 · Judge Quality 0.78 · Overall 0.86

6. Delta — Before vs After

Challenge Naive Skilled Δ Biggest Improvement
pipeline 0.52 0.90 +0.38 Alignment + conventions (box hierarchy)
architecture 0.38 0.90 +0.52 Container nesting, double borders
hierarchy 0.52 0.90 +0.38 Box-drawing chars instead of plain text
data_flow 0.39 0.85 +0.46 Error branch added, Unicode arrows
rating 0.39 0.92 +0.53 Parallelogram symbols, bordered display
nested_containers 0.28 0.83 +0.55 3-level border hierarchy (largest gain)
bidirectional 0.49 0.89 +0.40 Unicode arrows, right-side labels
comparison_table 0.36 0.78 +0.42 Bordered table with ratings
error_handling 0.42 0.85 +0.43 Error branch, consistent spacing
microservices 0.35 0.78 +0.43 Gateway emphasis, component hierarchy
Average 0.42 0.86 +0.44

Per-Axis Improvement

Axis Naive Avg Skilled Avg Δ Interpretation
Alignment 0.43 0.85 +0.42 Verification procedure forces self-checking
Features 0.62 1.00 +0.38 Hierarchy-first process ensures all elements placed
Conventions 0.23 0.83 +0.60 Biggest gain — model had no concept of border hierarchy without skill
Judge Quality 0.39 0.78 +0.39 Cascading effect: better alignment + conventions = better aesthetics
Overall 0.42 0.86 +0.44
Convention adherence saw the largest lift (+0.60). This makes sense: border hierarchy is a domain-specific convention that no model would discover without explicit instruction. The skill turns implicit knowledge ("containers should look different from leaves") into explicit, checkable rules.

7. Selected Challenges — Side-by-Side

Four representative challenges showing naive vs skilled output. Intentional mistakes in naive versions mirror real LLM failure modes.

pipeline · Naive: 0.52 · Skilled: 0.90
"Draw a 3-stage pipeline: produce → score → reduce"
Round 1 — Naive
+----------+     +----------+     +--------+
| produce  | --> |  score   | --> | reduce |
+----------+     +----------+     +--------+
  • Plain ASCII borders (+---+) instead of Unicode box-drawing
  • No border hierarchy — everything same style
  • Text arrows (-->) instead of Unicode
  • No container wrapping the pipeline
Round 2 — Skilled
╔═══════════════════════════════════════╗
║            Pipeline                   ║
╠═══════════════════════════════════════╣
║                                       ║
║  ╭──────────╮   ╭──────────╮          ║
║  │ produce  │━━▶│  score   │          ║
║  ╰──────────╯   ╰─────┬────╯          ║
║                        │              ║
║                        ▼              ║
║                 ╭──────────╮          ║
║                 │  reduce  │          ║
║                 ╰──────────╯          ║
╚═══════════════════════════════════════╝
  • Double-border container wraps pipeline
  • Rounded-corner leaf boxes for stages
  • Heavy arrow for primary flow
  • Aligned right borders throughout
architecture · Naive: 0.38 · Skilled: 0.90
"Draw a system with a main server connecting to 3 backends"
Round 1 — Naive
+--------+
| Server |
+--------+
    |
    +-------+-------+
    |       |       |
+------+ +------+ +------+
| DB   | | Cache| | Auth |
+------+ +------+ +------+
  • No container — server and backends are siblings
  • All boxes same border style
  • Right borders ragged (DB=8, Cache=8, Auth=8 but Server=10)
  • Plain ASCII + and | for everything
Round 2 — Skilled
╔════════════════════════════════════╗
║             System                 ║
╠════════════════════════════════════╣
║                                    ║
║         ┌──────────┐               ║
║         │  Server  │               ║
║         └────┬─────┘               ║
║              │                     ║
║      ┌───────┼───────┐             ║
║      ▼       ▼       ▼             ║
║  ╭──────╮ ╭──────╮ ╭──────╮        ║
║  │  DB  │ │Cache │ │ Auth │        ║
║  ╰──────╯ ╰──────╯ ╰──────╯        ║
╚════════════════════════════════════╝
  • Double-border container for the system
  • Single-border component for server
  • Rounded-corner leaves for backends
  • 3-level border hierarchy visible at a glance
rating · Naive: 0.39 · Skilled: 0.92
"Show a 3-item rating display using parallelogram indicators"
Round 1 — Naive
Ratings:
  Speed:   ***..  3/5
  Quality: **...  2/5
  Cost:    ****. 4/5
  • Plain text, no box borders at all
  • Stars instead of parallelogram symbols
  • Ragged alignment (Cost line misaligned)
  • No structural element — just text
Round 2 — Skilled
┌──────────────────────────────┐
│ Performance Ratings          │
├──────────────────────────────┤
│ Speed    ▰▰▰▱▱  3/5          │
│ Quality  ▰▰▱▱▱  2/5          │
│ Cost     ▰▰▰▰▱  4/5          │
└──────────────────────────────┘
  • Single-border component box (appropriate level)
  • Correct parallelogram symbols for ratings
  • Aligned columns and consistent spacing
  • Header with separator row
nested_containers · Naive: 0.28 · Skilled: 0.83
"Draw a deployment: Cloud contains two regions, each region contains 2 services"
Round 1 — Naive
Cloud
|
+-- Region A
|   +-- Service 1
|   +-- Service 2
|
+-- Region B
    +-- Service 3
    +-- Service 4
  • Plain text tree — no box-drawing at all
  • No visual containment (nesting only implied by indentation)
  • No borders, no hierarchy, no Unicode
  • Scored just 0.25 on conventions — lowest overall score of Round 1 (0.28)
Round 2 — Skilled
╔════════════════════════════════════════════════╗
║                     Cloud                      ║
╠════════════════════════════════════════════════╣
║                                                ║
║  ┌────────────────────┐ ┌───────────────────┐  ║
║  │   Region A         │ │   Region B        │  ║
║  │                    │ │                   │  ║
║  │ ╭──────╮ ╭──────╮  │ │ ╭─────╮ ╭─────╮   │  ║
║  │ │Svc 1 │ │Svc 2 │  │ │ │Svc 3│ │Svc 4│   │  ║
║  │ ╰──────╯ ╰──────╯  │ │ ╰─────╯ ╰─────╯   │  ║
║  │                    │ │                   │  ║
║  └────────────────────┘ └───────────────────┘  ║
║                                                ║
╚════════════════════════════════════════════════╝
  • Double border for Cloud (top-level container)
  • Single border for Regions (mid-level components)
  • Rounded corners for Services (leaf nodes)
  • Visual containment shows hierarchy at a glance

Takeaways

The loop works. One cycle of gen → eval → learn → re-gen more than doubled the average score, from 0.42 to 0.86. The skill document is 187 lines; the improvement is permanent and applies to every future diagram request.

What improved most

What still needs work

Next iterations

  1. Add display-width self-check to the skill's verification section (model runs its own validation)
  2. Expand corpus with 10 more challenges covering edge cases (wide CJK labels, deeply nested, very wide diagrams)
  3. Calibrate the LLM judge against human preferences using pairwise comparison
  4. Test skill transfer: does the skill improve performance on new challenge types not in the corpus?

Methodology

Scoring weights

Axis           Weight  Method                            What it catches
Alignment      30%     Programmatic (display-width)      Ragged borders, off-by-one padding
Features       25%     Programmatic (regex/counting)     Missing boxes, arrows, branches
Conventions    20%     Programmatic (char detection)     Wrong border level, missing padding
Judge Quality  25%     LLM (Sonnet, 1-10 per criterion)  Clarity, layout, visual appeal
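Under these weights the overall score is a straightforward weighted sum. A sketch, assuming each axis score is already normalized to [0, 1] (the function name is illustrative):

```python
# Axis weights from the scoring-weights table above.
WEIGHTS = {"alignment": 0.30, "features": 0.25,
           "conventions": 0.20, "judge": 0.25}

def overall(scores: dict) -> float:
    """Weighted overall score from four per-axis scores in [0, 1]."""
    return round(sum(WEIGHTS[axis] * scores[axis] for axis in WEIGHTS), 2)
```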

Evaluation details

Limitations

ASCII Diagram Gym · learning/gyms/ascii_diagrams/ · Mocked iteration demo · 2026-03-21