Demonstrating the gen → eval → learn loop for diagram quality improvement
Generated 2026-03-21 · 10 challenges · 2 rounds (naive vs skilled)
The gym runs a closed loop:
Prompt template: "Draw an ASCII diagram of: {description}" — no skill, no conventions, no verification instructions.
| Challenge | Alignment | Features | Conventions | Judge | Overall | Issues |
|---|---|---|---|---|---|---|
| pipeline | 0.40 | 1.00 | 0.25 | 0.50 | 0.52 | Ragged right borders, plain dashes instead of box chars |
| architecture | 0.25 | 0.67 | 0.25 | 0.40 | 0.38 | Single borders for container, no inner structure |
| hierarchy | 0.55 | 1.00 | 0.00 | 0.50 | 0.52 | Used plain text tree, not box-drawing chars |
| data_flow | 0.40 | 0.50 | 0.25 | 0.40 | 0.39 | Missing error path, inconsistent spacing |
| rating | 0.70 | 0.33 | 0.25 | 0.30 | 0.39 | Used stars instead of parallelograms, no border |
| nested_containers | 0.25 | 0.33 | 0.25 | 0.30 | 0.28 | Flat layout, no nesting, single border style |
| bidirectional | 0.55 | 0.67 | 0.25 | 0.50 | 0.49 | Used ---> text arrows, not Unicode |
| comparison_table | 0.55 | 0.33 | 0.25 | 0.30 | 0.36 | Plain text table, no ratings, no borders |
| error_handling | 0.40 | 0.67 | 0.25 | 0.40 | 0.42 | Error branch missing, boxes misaligned |
| microservices | 0.25 | 0.67 | 0.25 | 0.30 | 0.35 | All boxes same style, gateway not differentiated |
| Average | 0.43 | 0.62 | 0.23 | 0.39 | 0.42 |
The evaluation surfaced five systematic failure patterns across all 10 challenges:
| # | Pattern | Frequency | Impact | Example |
|---|---|---|---|---|
| 1 | Ragged right borders | 8/10 | High | Box lines vary by 1-3 chars — looks broken in monospace |
| 2 | Single border style for all levels | 9/10 | High | Containers, components, and leaves all use +---+ or |---| |
| 3 | ASCII arrows instead of Unicode | 7/10 | Medium | ---> instead of ───▶, | instead of │ |
| 4 | Missing structural elements | 5/10 | Medium | Error branches omitted, nesting flattened, ratings as text |
| 5 | No verification step | 10/10 | High | Model never checks its own output — alignment errors invisible to it |
Each failure pattern was converted into an explicit convention in the ASCII diagram skill:
| Failure Pattern | Convention Added | Mechanism |
|---|---|---|
| Ragged right borders | Alignment Rules #1: every line between top-left and top-right corner must be the same display width | Explicit counting instruction + display-width validation code |
| Single border style | Box Borders by Level: 4-level hierarchy table (container=double, component=single, leaf=rounded, emphasis=heavy) | Lookup table with chars and use-cases |
| ASCII arrows | Arrows table: Unicode arrow chars by weight and direction | Reference table with symbols and when to use each |
| Missing structure | Process: identify hierarchy → sketch → draw inside-out → add arrows → align → verify | Step-by-step procedure ensuring all elements are placed |
| No verification | Verification Procedure: count char widths, confirm line widths match, check nesting fits, verify column alignment | Post-drawing checklist + Python validation code |
The skill also includes a Common Mistakes table mapping each mistake to why it's wrong and how to fix it — giving the model explicit error-correction knowledge.
Same 10 challenges, but now the system prompt includes the full ASCII diagram skill: border hierarchy, arrow reference, alignment rules, verification procedure, and common mistakes.
| Challenge | Alignment | Features | Conventions | Judge | Overall | Notes |
|---|---|---|---|---|---|---|
| pipeline | 0.85 | 1.00 | 1.00 | 0.80 | 0.90 | Double container, rounded leaves, heavy arrows |
| architecture | 0.85 | 1.00 | 1.00 | 0.80 | 0.90 | Proper nesting, gateway differentiated |
| hierarchy | 1.00 | 1.00 | 0.75 | 0.80 | 0.90 | Leaf boxes with branching connectors |
| data_flow | 0.85 | 1.00 | 0.75 | 0.80 | 0.85 | Error branch present with dashed arrow |
| rating | 1.00 | 1.00 | 0.75 | 0.90 | 0.92 | Correct parallelogram symbols, bordered |
| nested_containers | 0.70 | 1.00 | 1.00 | 0.70 | 0.83 | 3-level hierarchy: double > single > rounded |
| bidirectional | 1.00 | 1.00 | 0.75 | 0.80 | 0.89 | Unicode arrows both directions, right-side labels |
| comparison_table | 0.70 | 1.00 | 0.75 | 0.70 | 0.78 | Bordered table with parallelogram ratings |
| error_handling | 0.85 | 1.00 | 0.75 | 0.80 | 0.85 | Error branch with dashed connector |
| microservices | 0.70 | 1.00 | 0.75 | 0.70 | 0.78 | Gateway emphasized, services as components |
| Average | 0.85 | 1.00 | 0.83 | 0.78 | 0.86 |
| Challenge | Naive | Skilled | Δ | Biggest Improvement |
|---|---|---|---|---|
| pipeline | 0.52 | 0.90 | +0.38 | Alignment + conventions (box hierarchy) |
| architecture | 0.38 | 0.90 | +0.52 | Container nesting, double borders |
| hierarchy | 0.52 | 0.90 | +0.38 | Box-drawing chars instead of plain text |
| data_flow | 0.39 | 0.85 | +0.46 | Error branch added, Unicode arrows |
| rating | 0.39 | 0.92 | +0.53 | Parallelogram symbols, bordered display |
| nested_containers | 0.28 | 0.83 | +0.55 | 3-level border hierarchy (largest gain) |
| bidirectional | 0.49 | 0.89 | +0.40 | Unicode arrows, right-side labels |
| comparison_table | 0.36 | 0.78 | +0.42 | Bordered table with ratings |
| error_handling | 0.42 | 0.85 | +0.43 | Error branch, consistent spacing |
| microservices | 0.35 | 0.78 | +0.43 | Gateway emphasis, component hierarchy |
| Average | 0.42 | 0.86 | +0.44 |
| Axis | Naive Avg | Skilled Avg | Δ | Interpretation |
|---|---|---|---|---|
| Alignment | 0.43 | 0.85 | +0.42 | Verification procedure forces self-checking |
| Features | 0.62 | 1.00 | +0.38 | Hierarchy-first process ensures all elements placed |
| Conventions | 0.23 | 0.83 | +0.60 | Biggest gain — model had no concept of border hierarchy without skill |
| Judge Quality | 0.39 | 0.78 | +0.39 | Cascading effect: better alignment + conventions = better aesthetics |
| Overall | 0.42 | 0.86 | +0.44 |
Four representative challenges showing naive vs skilled output. Intentional mistakes in naive versions mirror real LLM failure modes.
+----------+ +----------+ +--------+ | produce | --> | score | --> | reduce | +----------+ +----------+ +--------+
╔═══════════════════════════════════════╗ ║ Pipeline ║ ╠═══════════════════════════════════════╣ ║ ║ ║ ╭──────────╮ ╭──────────╮ ║ ║ │ produce │━━▶│ score │ ║ ║ ╰──────────╯ ╰─────┬────╯ ║ ║ │ ║ ║ ▼ ║ ║ ╭──────────╮ ║ ║ │ reduce │ ║ ║ ╰──────────╯ ║ ╚═══════════════════════════════════════╝
+--------+
| Server |
+--------+
|
+-------+-------+
| | |
+------+ +------+ +------+
| DB | | Cache| | Auth |
+------+ +------+ +------+
╔════════════════════════════════════╗ ║ System ║ ╠════════════════════════════════════╣ ║ ║ ║ ┌──────────┐ ║ ║ │ Server │ ║ ║ └────┬─────┘ ║ ║ │ ║ ║ ┌───────┼───────┐ ║ ║ ▼ ▼ ▼ ║ ║ ╭──────╮ ╭──────╮ ╭──────╮ ║ ║ │ DB │ │Cache │ │ Auth │ ║ ║ ╰──────╯ ╰──────╯ ╰──────╯ ║ ╚════════════════════════════════════╝
Ratings: Speed: ***.. 3/5 Quality: **... 2/5 Cost: ****. 4/5
┌──────────────────────────────┐ │ Performance Ratings │ ├──────────────────────────────┤ │ Speed ▰▰▰▱▱ 3/5 │ │ Quality ▰▰▱▱▱ 2/5 │ │ Cost ▰▰▰▰▱ 4/5 │ └──────────────────────────────┘
Cloud
|
+-- Region A
| +-- Service 1
| +-- Service 2
|
+-- Region B
+-- Service 3
+-- Service 4
╔════════════════════════════════════════════════╗ ║ Cloud ║ ╠════════════════════════════════════════════════╣ ║ ║ ║ ┌────────────────────┐ ┌───────────────────┐ ║ ║ │ Region A │ │ Region B │ ║ ║ │ │ │ │ ║ ║ │ ╭──────╮ ╭──────╮ │ │ ╭─────╮ ╭─────╮ │ ║ ║ │ │Svc 1 │ │Svc 2 │ │ │ │Svc 3│ │Svc 4│ │ ║ ║ │ ╰──────╯ ╰──────╯ │ │ ╰─────╯ ╰─────╯ │ ║ ║ │ │ │ │ ║ ║ └────────────────────┘ └───────────────────┘ ║ ║ ║ ╚════════════════════════════════════════════════╝
| Axis | Weight | Method | What it catches |
|---|---|---|---|
| Alignment | 30% | Programmatic (display-width) | Ragged borders, off-by-one padding |
| Features | 25% | Programmatic (regex/counting) | Missing boxes, arrows, branches |
| Conventions | 20% | Programmatic (char detection) | Wrong border level, missing padding |
| Judge Quality | 25% | LLM (Sonnet, 1-10 per criterion) | Clarity, layout, visual appeal |
display_width() counts each Unicode character's visual width (CJK/emoji = 2, box-drawing = 1). Every line between a box's opening and closing corners must have the same display width. Each misaligned line deducts 0.15 from the score.ASCII Diagram Gym · learning/gyms/ascii_diagrams/ · Mocked iteration demo · 2026-03-21