Generated 2026-03-11 22:38:44
Gyms are gen→eval→learn loops that measure and improve system components. Each gym runs models against a corpus, judges the output, and extracts findings.
See the report format spec for report structure conventions.
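The gen→eval→learn loop can be sketched as follows. This is a minimal illustration, not the actual gym runner; the `Finding` shape, `run_gym` signature, and the score threshold are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    example_id: str
    score: float
    output: str

def run_gym(corpus, generate, judge, threshold=0.5):
    """Run a model over a corpus, judge each output, and extract findings.

    corpus:    iterable of dicts with at least an "id" key (assumed shape)
    generate:  callable(example) -> model output (gen step)
    judge:     callable(example, output) -> score in [0, 1] (eval step)
    Returns low-scoring examples as findings (learn step).
    """
    findings = []
    for example in corpus:
        output = generate(example)       # gen: model produces output
        score = judge(example, output)   # eval: judge scores it
        if score < threshold:            # learn: keep failures for review
            findings.append(Finding(example["id"], score, output))
    return findings
```

A gym then differs from its siblings mainly in which corpus, generator, and judge it plugs into this loop.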
| Gym | Status | Description | Last Run | Report |
|---|---|---|---|---|
| apply | Runnable | /recall retrieval quality (NDCG over labeled applications) | — | none |
| badge | Runnable | Badge text quality from session replay | — | none |
| claim-extraction | Runnable | Claim extraction completeness and accuracy | 2026-03-11 | assertions_20260311_221808 |
| claims-eval | Runnable | Claim detection precision and evidence linking | — | none |
| code-cleanup | Design only | Codebase hygiene suggestions | — | none |
| extraction | Runnable | HTML extraction + LLM cleaning quality | 2026-03-11 | 20260311_203712_test |
| fetchability | Runnable | Proxy/fetch method selection and success rate | — | none |
| iterm2 | Test scripts | iTerm2 terminal control verification | — | none |
| judge-calibration | Runnable | Judge monotonicity, discrimination, cross-model diffusion | — | none |
| llm-task | Runnable | Generic prompt × model comparison (session condensation) | — | none |
| problem-solving | Config only | Strategy selection for reasoning tasks | — | none |
| rhetorical-roles | Runnable | Rhetorical role segmentation accuracy | — | none |
| style | Runnable | Writing style evaluation quality | — | none |
13 gyms registered