Gyms Overview

Generated 2026-03-11 22:38:44

Gyms are gen→eval→learn loops that measure and improve system components. Each gym runs models against a corpus, judges the output, and extracts findings.
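As a rough illustration of the loop described above, here is a minimal sketch. It is not the actual gym harness: `GymResult`, `run_gym`, `extract_findings`, and the toy models and judge are hypothetical names chosen for this example, and "findings" is reduced to a mean score per model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class GymResult:
    model: str
    item: str
    output: str
    score: float


def run_gym(
    models: Dict[str, Callable[[str], str]],
    corpus: List[str],
    judge: Callable[[str, str], float],
) -> List[GymResult]:
    """gen + eval: run every model on every corpus item and judge the output."""
    results = []
    for name, generate in models.items():
        for item in corpus:
            output = generate(item)
            results.append(GymResult(name, item, output, judge(item, output)))
    return results


def extract_findings(results: List[GymResult]) -> List[Tuple[str, float]]:
    """learn: reduce judged results to a mean score per model, best first."""
    scores: Dict[str, List[float]] = {}
    for r in results:
        scores.setdefault(r.model, []).append(r.score)
    means = {m: sum(s) / len(s) for m, s in scores.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


# Toy stand-ins (assumptions): two "models" and a judge that rewards uppercase output.
models = {"upper": str.upper, "echo": lambda s: s}
corpus = ["abc", "Def"]
judge = lambda item, output: 1.0 if output.isupper() else 0.0

findings = extract_findings(run_gym(models, corpus, judge))
print(findings)
```

A real gym would presumably replace the toy judge with an LLM or metric-based judge and write the findings into a report.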

See report format spec for report structure conventions.

Gym                Status        Description                                                 Last Run    Report
apply              Runnable      /recall retrieval quality (NDCG over labeled applications)  none
badge              Runnable      Badge text quality from session replay                      none
claim-extraction   Runnable      Claim extraction completeness and accuracy                  2026-03-11  assertions_20260311_221808
claims-eval        Runnable      Claim detection precision and evidence linking              none
code-cleanup       Design only   Codebase hygiene suggestions (design only)                  none
extraction         Runnable      HTML extraction + LLM cleaning quality                      2026-03-11  20260311_203712_test
fetchability       Runnable      Proxy/fetch method selection and success rate               none
iterm2             Test scripts  iTerm2 terminal control verification                        none
judge-calibration  Runnable      Judge monotonicity, discrimination, cross-model diffusion   none
llm-task           Runnable      Generic prompt x model comparison (session condensation)    none
problem-solving    Config only   Strategy selection for reasoning tasks                      none
rhetorical-roles   Runnable      Rhetorical role segmentation accuracy                       none
style              Runnable      Writing style evaluation quality                            none

13 gyms registered
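The apply gym's description mentions NDCG over labeled applications. For readers unfamiliar with the metric, the following is a minimal sketch of one common formulation (linear gain, log2 rank discount). Which NDCG variant the gym actually uses is not stated here, so treat the function names and the gain choice as assumptions.

```python
import math
from typing import List, Optional


def dcg(relevances: List[float]) -> float:
    """Discounted cumulative gain: each label is divided by log2(rank + 1), ranks from 1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(relevances: List[float], k: Optional[int] = None) -> float:
    """NDCG of a ranked list of relevance labels, optionally truncated at k.

    `relevances` holds the labels in the order the retriever returned the items.
    Returns 0.0 when no item has positive relevance.
    """
    ranked = relevances if k is None else relevances[:k]
    ideal = sorted(relevances, reverse=True)
    ideal = ideal if k is None else ideal[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# A perfectly ordered result list scores 1.0; reversing it lowers the score.
print(ndcg([3, 2, 1]))
print(ndcg([1, 2, 3]))
```

Some implementations use an exponential gain (2^rel - 1) instead of the raw label, which weights highly relevant items more heavily.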