Generated 2026-03-11 22:38:44
Gyms are gen→eval→learn loops that measure and improve system components. Each gym runs models against a corpus, judges the output, and extracts findings.
See the report format spec for report structure conventions.
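The gen→eval→learn loop can be sketched as follows. This is a minimal illustration, not the actual gym runner; the `Finding` shape, `run_gym` signature, and the score threshold are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    example_id: str
    score: float
    output: str

def run_gym(corpus, generate, judge, threshold=0.5):
    """Run a model over a corpus, judge each output, and extract findings.

    corpus:    iterable of dicts with at least an "id" key (assumed shape)
    generate:  callable(example) -> model output (gen step)
    judge:     callable(example, output) -> score in [0, 1] (eval step)
    Returns low-scoring examples as findings (learn step).
    """
    findings = []
    for example in corpus:
        output = generate(example)       # gen: model produces output
        score = judge(example, output)   # eval: judge scores it
        if score < threshold:            # learn: keep failures for review
            findings.append(Finding(example["id"], score, output))
    return findings
```

A gym then differs from its siblings mainly in which corpus, generator, and judge it plugs into this loop.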
| Gym | Status | Description | Last Run | Report |
|---|---|---|---|---|
| apply | Runnable | /recall retrieval quality (NDCG over labeled applications) | — | none |
| badge | Runnable | Badge text quality from session replay | — | none |
| claim-extraction | Runnable | Claim extraction completeness and accuracy | 2026-03-11 | assertions_20260311_221808 |
| claims-eval | Runnable | Claim detection precision and evidence linking | — | none |
| code-cleanup | Design only | Codebase hygiene suggestions | — | none |
| extraction | Runnable | HTML extraction + LLM cleaning quality | 2026-03-11 | 20260311_203712_test |
| fetchability | Runnable | Proxy/fetch method selection and success rate | — | none |
| iterm2 | Test scripts | iTerm2 terminal control verification | — | none |
| judge-calibration | Runnable | Judge monotonicity, discrimination, cross-model diffusion | — | none |
| llm-task | Runnable | Generic prompt × model comparison (session condensation) | — | none |
| problem-solving | Config only | Strategy selection for reasoning tasks | — | none |
| rhetorical-roles | Runnable | Rhetorical role segmentation accuracy | — | none |
| style | Runnable | Writing style evaluation quality | — | none |
13 gyms registered