# Autonomous Iterative Development — Design

**Date**: 2026-03-03
**Status**: Approved, parallel exploration phase

## Goal

Build a system that autonomously iterates toward ambitious goals — trying strategy variants, evaluating results, learning, and trying again — with human review as steering, not a bottleneck. The system should explore speculatively, generate options, and maximize output per unit of human attention.

## Design Principle

**Explore ahead of human guidance.** The biggest bottleneck is human attention. The system should:
- Generate hypotheses and test them before being asked
- Produce ranked results and reports for efficient human review
- Start with more supervision and work toward full autonomy
- Err on the side of speculative implementation over waiting

## Architecture

### Two loops, different cadences

| Loop | Cadence | Does what | Example |
|------|---------|-----------|---------|
| **Inner (agent)** | Minutes | Generate variants → run → evaluate → log → next | Try new feature combo → backtest → "Spearman dropped" → revert |
| **Outer (orchestrator)** | Hours/daily | Review trajectory → judge direction → redirect | "Short-term signal exhausted — pivot to CEO quality correlation" |

### Where code lives

```
helm/autodo/orchestrator.py    ← common orchestration pattern (extract AFTER domains prove it)

lib/vision/                    ← screenshot capture + LLM visual analysis
  analyze.py                   ← screenshot + rubric → structured eval
  compare.py                   ← before/after regression detection
  capture.py                   ← unify appctl/it2/playwright capture

lib/gen/                       ← variety generation primitives
  variety.py                   ← intelligent prompt-driven variation
  calibrate_temp_variety.py    ← temperature diversity calibration

finance/eval/                  ← trading strat autonomous iteration
  iterate.py                   ← try strategy variants → backtest → evaluate → log → next
  evaluator.py                 ← Spearman r, Q5-Q1 spread, win rate, permutation test

intel/eval/                    ← CEO/exec assessment iteration
  iterate.py                   ← discover CEOs → score → correlate with returns → refine
  evaluator.py                 ← TFTF↔returns correlation, persistence metrics

doctor/ (existing)             ← visual/UX iteration (uses lib/vision/)
  expect.py                    ← refactored to consume lib/vision/
```

### Common iteration loop shape

Every domain follows the same pattern:

```python
while budget_remaining:
    variants = generate(current_best, variety_strategy)  # lib/gen/
    results = [run(v) for v in variants]                 # domain-specific
    scores = [evaluate(r) for r in results]              # domain evaluator
    log(variants, results, scores)                       # learning extraction
    current_best = select_best(scores)                   # or explore further
    report()                                             # human-reviewable artifact
```

### Outputs per domain

- `{domain}/eval/data/iterations/` — raw results per run
- `{domain}/eval/reports/` — HTML reports (.share'd at static.localhost)
- `{domain}/eval/LOGBOOK.md` — what was tried, what worked, what failed
- Learnings → `learning.db` as instances linked to principles

## Priority Domains

### 1. Trading Strats Eval (`finance/eval/`)

**What exists**: VIC returns pipeline (25K theses, walk-forward V2), earnings call alignment (Tesla POC), 25 strategies from Bear notes, predict_robust.py with proper embargo.

**Proven signal**: 30d Spearman r=0.163 (p<0.001), Q5 win rate 69%, Q5-Q1 median spread +8.9%.

**What to iterate on**:
- Feature combinations (fundamentals, embeddings, metadata, sector filters)
- Model architectures (Ridge, GBM, stacking, NN)
- Horizon selection (30d signal is real, 365d needs more data)
- Strategy-specific filters (short-only, small-cap-only, sector rotation)
- Earnings features integration (when available)

**Eval function**: Well-defined — Spearman correlation, Q5-Q1 spread, win rate, permutation test survival.
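The metrics named above are standard enough to sketch without scipy; a self-contained pure-Python version of what `evaluator.py` might compute — function names are illustrative:

```python
import random


def _ranks(xs: list[float]) -> list[float]:
    """Average ranks, 1-based, ties shared."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of tied positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(pred: list[float], actual: list[float]) -> float:
    """Spearman rank correlation (Pearson correlation on ranks)."""
    rp, ra = _ranks(pred), _ranks(actual)
    n = len(rp)
    mp, ma = sum(rp) / n, sum(ra) / n
    cov = sum((a - mp) * (b - ma) for a, b in zip(rp, ra))
    sp = sum((a - mp) ** 2 for a in rp) ** 0.5
    sa = sum((b - ma) ** 2 for b in ra) ** 0.5
    return cov / (sp * sa)


def q5_q1_spread(pred: list[float], actual: list[float]) -> float:
    """Median return of the top prediction quintile minus the bottom quintile."""
    order = sorted(range(len(pred)), key=lambda i: pred[i])
    q = len(pred) // 5
    lo = sorted(actual[i] for i in order[:q])
    hi = sorted(actual[i] for i in order[-q:])
    return hi[len(hi) // 2] - lo[len(lo) // 2]


def q5_win_rate(pred: list[float], actual: list[float]) -> float:
    """Fraction of top-quintile picks with a positive return."""
    order = sorted(range(len(pred)), key=lambda i: pred[i])
    top = [actual[i] for i in order[-(len(pred) // 5):]]
    return sum(1 for r in top if r > 0) / len(top)


def permutation_pvalue(pred: list[float], actual: list[float],
                       n_perm: int = 1000, seed: int = 0) -> float:
    """Share of shuffled predictions whose Spearman matches or beats the observed one."""
    rng = random.Random(seed)
    observed = spearman(pred, actual)
    shuffled, hits = list(pred), 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if spearman(shuffled, actual) >= observed:
            hits += 1
    return hits / n_perm
```

"Permutation test survival" then means `permutation_pvalue(...) < 0.05` (or whatever threshold the logbook settles on) holding up across walk-forward folds.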

**Key files**: `finance/vic_analysis/returns.py`, `predict_robust.py`, `finance/STRATEGIES.md`

### 2. CEO/Exec Assessment (`intel/eval/`)

**What exists**: Founder deep dive assessment (10 semi companies, production-ready), TFTF scoring (6 dimensions), people discovery pipeline (10+ APIs), SEC executive extraction (planned).

**Missing link**: the TFTF/founder scores ↔ actual stock returns correlation. The framework is roughly 80% there.

**What to iterate on**:
- Scoring rubric weights (which of 6 TFTF dimensions predicts returns?)
- Discovery source quality (which APIs find the most useful CEO signal?)
- Assessment prompt variations (what questions extract the most predictive info?)
- Persistence analysis (does CEO quality predict *sustained* outperformance?)
- Cross-reference with VIC theses (do VIC ideas with great CEOs outperform?)

**Eval function**: TFTF score correlation with 1y/3y returns, persistence of outperformance quintile, information ratio.
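"Persistence" can be made concrete as quintile stickiness: of the names in the top score quintile at period t, what fraction land in the top return quintile at t+1 (a no-signal baseline is about 0.2). A sketch — the function names and the ticker-keyed dict inputs are assumptions:

```python
def top_quintile(values: dict[str, float]) -> set[str]:
    """Names in the top 20% by value (at least one)."""
    ranked = sorted(values, key=values.get, reverse=True)
    return set(ranked[: max(1, len(ranked) // 5)])


def persistence(scores_t: dict[str, float], returns_t1: dict[str, float]) -> float:
    """Fraction of the top score quintile at t that stays in the top
    return quintile at t+1; assumes both dicts cover the same universe."""
    top_scores = top_quintile(scores_t)
    if not top_scores:
        return 0.0
    return len(top_scores & top_quintile(returns_t1)) / len(top_scores)
```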

**Key files**: `intel/companies/semi/assess.py`, `intel/people/README.md`, `projects/skillz/domains/companies/README.md`

### 3. Visual/UX Assessment (`lib/vision/`)

**What exists**: appctl screenshot, it2 screenshot, Playwright screenshots, doctor/expect.py (LLM verification), expectations skill.

**What to build**: Reusable visual analysis primitives that doctor, gyms, and autonomous iteration all consume.

**What to iterate on**:
- Rubric design (what makes a good layout assessment prompt?)
- Model comparison (which LLM is best at visual eval?)
- Regression detection sensitivity (how much change is "regression" vs "improvement"?)
- Multi-viewport testing (desktop/tablet/mobile screenshots, compare consistency)

**Eval function**: LLM visual assessment score against rubric, before/after comparison, regression detection rate.
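One model-agnostic piece of the rubric side is the response contract: the LLM is asked for JSON scores against named criteria, and a strict parser rejects malformed replies so the loop retries instead of logging garbage. A sketch — the criteria list and JSON shape are assumptions, not the settled rubric:

```python
import json
from dataclasses import dataclass, field


@dataclass
class RubricResult:
    scores: dict[str, int]  # criterion -> 1..5
    overall: float
    issues: list[str] = field(default_factory=list)


CRITERIA = ["layout", "typography", "contrast", "spacing"]  # assumed rubric


def parse_visual_eval(raw: str) -> RubricResult:
    """Validate the model's JSON reply; raise on anything missing or
    out of range so the caller can retry rather than record a bad score."""
    data = json.loads(raw)
    scores: dict[str, int] = {}
    for c in CRITERIA:
        s = int(data["scores"][c])
        if not 1 <= s <= 5:
            raise ValueError(f"{c} score out of range: {s}")
        scores[c] = s
    overall = sum(scores.values()) / len(scores)
    return RubricResult(scores, overall, list(data.get("issues", [])))
```

Keeping the contract separate from the model call also makes the "which LLM is best at visual eval?" comparison cheap: only the caller changes.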

**Key files**: `doctor/expect.py`, expectations skill

## Approach: Parallel Exploration

All 3 domains build their iteration loops simultaneously. Each starts as a standalone script, proves the pattern works for its domain, then we extract the common orchestration into `helm/autodo/orchestrator.py`.

### Phase 1: Domain-specific loops (parallel, now)
- Fork 3 sessions, one per domain
- Each builds `{domain}/eval/iterate.py` + `evaluator.py`
- Each runs at least one iteration cycle end-to-end
- Each produces a LOGBOOK.md and initial report

### Phase 2: Extract common pattern
- Compare the 3 iterate.py files
- Extract shared structure into `helm/autodo/orchestrator.py`
- Domains keep their evaluators, orchestrator handles the loop

### Phase 3: Wire to helm/autodo
- Orchestrator can be triggered by autodo engine
- Results feed back to learning.db
- Human review via existing review.md pattern
- Gradually reduce supervision frequency

## Budget & Constraints

- Each domain fork should stay under $10 for initial exploration
- Use grok-fast for bulk scoring, opus/gemini-pro for quality evaluation
- Git commits as checkpoints (reversible)
- Reports at static.localhost for easy review
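The per-fork cap and the `budget_remaining` condition in the loop sketch can be one small object; a minimal sketch, with the caveat that real spend figures would have to come from provider usage reporting rather than local estimates:

```python
class Budget:
    """Track estimated spend and stop the loop at a hard dollar cap."""

    def __init__(self, cap_usd: float = 10.0):
        self.cap = cap_usd
        self.spent = 0.0

    def charge(self, usd: float) -> None:
        """Record the estimated cost of one model call or backtest."""
        self.spent += usd

    @property
    def remaining(self) -> bool:
        return self.spent < self.cap
```

The inner loop then becomes `while budget.remaining:` with a `budget.charge(...)` after each model call.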
