```yaml
# Defaults: status=open, needs=autonomous, effort=M
items:
  - id: image-token-benchmarking
    title: "Add --include-images option to benchmark vision models and measure tokens per image"
    meta: {tags: [feature, research], scope: [lib/llm/benchmarks/]}
  - id: quality-benchmarks
    title: "Add quality benchmarks — code gen accuracy, reasoning, instruction following, rivus-specific evals"
    meta: {tags: [feature, research], effort: L, needs: research, scope: [lib/llm/benchmarks/]}
  - id: prompt-caching-deep-dive
    title: "Benchmark prompt caching across providers — trigger thresholds, hit rates, cost savings"
    meta: {tags: [research, data], scope: [lib/llm/benchmarks/]}
  - id: benchmark-config-refactor
    title: "Split benchmark config into separate config.py (PROMPT_SIZES, SKIP_MODELS, MODEL_CATEGORIES)"
    meta: {tags: [cleanup], effort: S, scope: [lib/llm/benchmarks/]}
  - id: additional-metrics
    title: "Add TBT, P50/P95/P99 latencies, cost per request, concurrent request testing"
    meta: {tags: [feature, data], scope: [lib/llm/benchmarks/]}
  - id: benchmark-scheduling
    title: "Add cron-friendly mode for regular benchmarking with latency regression alerts"
    meta: {tags: [infra], scope: [lib/llm/benchmarks/]}
```

# LLM Benchmark TODOs

## Future Features

### Image Token Benchmarking
- Add `--include-images` option to test vision models
- Measure tokens per image at different resolutions
- Typical: ~85 base + ~170 tokens per 512x512 tile under OpenAI-style tiling; other providers account differently (e.g. Anthropic estimates roughly (width x height)/750)
- Test: GPT-5, Claude, Gemini vision capabilities
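
A minimal sketch of what `--include-images` could measure, assuming the OpenAI Python SDK and reading `usage.prompt_tokens` back from the response; the model name and local test image are placeholders:

```python
# Sketch: per-resolution prompt-token cost of an image, read back from the
# API's usage field. Model name and sample image are assumptions.
import base64
import io

from openai import OpenAI
from PIL import Image

client = OpenAI()

def prompt_tokens_for_image(img: Image.Image) -> int:
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    data_url = "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: the vision model under test
        max_tokens=1,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    )
    return resp.usage.prompt_tokens

base = Image.open("sample.png")  # placeholder test image
for size in (256, 512, 1024, 2048):
    print(size, prompt_tokens_for_image(base.resize((size, size))))
```

Subtracting a text-only baseline call isolates the per-image cost from the fixed prompt overhead.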

### Quality Benchmarks (not just speed)
- Code generation accuracy (pass@k on standard problem sets such as HumanEval; estimator sketch after this list)
- Reasoning accuracy (math, logic puzzles)
- Instruction following (format compliance)
- Task-specific evals for rivus use cases:
  - Content extraction quality
  - SVG generation
  - Code review accuracy
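
For pass@k, the standard unbiased estimator (Chen et al., 2021) is small enough to inline:

```python
# Unbiased pass@k estimator: n samples generated per problem, c of which
# pass the tests. Average this over all problems for the final score.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # fewer than k failing samples: every size-k draw passes
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(200, 37, 10))  # e.g. 200 samples, 37 passing, pass@10
```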

### Prompt Caching Deep Dive
- Test with 4096+ token prompts to trigger caching
- Measure cache hit rates over time
- Compare caching across providers:
  - Anthropic: explicit `cache_control` breakpoints; min ~1024 tokens (Sonnet/Opus), ~2048 (Haiku)
  - OpenAI: automatic above ~1024 tokens; Responses API `previous_response_id` chaining keeps the prefix stable
  - Gemini: implicit caching on newer models, explicit caching via the `cachedContents` API
  - xAI: automatic (~99% hit rate)
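
A minimal probe on the Anthropic side (the other providers expose analogous usage fields); the model name and filler prompt are assumptions:

```python
# Sketch: make the same long-prefix call twice and watch usage flip from
# cache write to cache read. Model name and prompt text are placeholders.
import anthropic

client = anthropic.Anthropic()
LONG_PREFIX = "The quick brown fox jumps over the lazy dog. " * 400

def cached_call() -> tuple[int, int]:
    resp = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model
        max_tokens=16,
        system=[{
            "type": "text",
            "text": LONG_PREFIX,
            "cache_control": {"type": "ephemeral"},  # mark prefix cacheable
        }],
        messages=[{"role": "user", "content": "Say ok."}],
    )
    return (resp.usage.cache_creation_input_tokens,
            resp.usage.cache_read_input_tokens)

print("first call  (expect write):", cached_call())
print("second call (expect read): ", cached_call())
```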

### Config Refactor
- Split config into separate `config.py`:
  - PROMPT_SIZES
  - TEMP_1_ONLY_MODELS
  - SKIP_MODELS
  - MODEL_CATEGORIES
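
Roughly what the split could look like; every value below is a placeholder, not the current in-tree configuration:

```python
# lib/llm/benchmarks/config.py (sketch; placeholder values throughout)
PROMPT_SIZES = [128, 1024, 4096]        # input-token buckets to benchmark

TEMP_1_ONLY_MODELS = {"o3-mini"}        # models that only accept temperature=1

SKIP_MODELS = {"legacy-model-x"}        # excluded from runs entirely

MODEL_CATEGORIES = {
    "fast": ["gemini-2.0-flash"],
    "frontier": ["claude-3-5-sonnet-latest"],
}
```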

### Additional Metrics
- Time Between Tokens (TBT) - variance of inter-token gaps, i.e. streaming smoothness
- P50/P95/P99 latencies (tail behavior, not just a single average)
- Cost per request tracking
- Concurrent request testing (throughput under load)
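
Both TBT and the percentile summaries fall out of per-token timestamps recorded during streaming; a stdlib-only sketch with made-up timestamps:

```python
# Sketch: TBT and tail latencies from per-chunk timestamps (time.monotonic()
# recorded as each streamed token arrives). The sample data is invented.
import statistics

def inter_token_gaps(token_times: list[float]) -> list[float]:
    return [b - a for a, b in zip(token_times, token_times[1:])]

def summarize(samples: list[float]) -> dict[str, float]:
    qs = statistics.quantiles(samples, n=100)  # cut points P1..P99
    return {"mean": statistics.fmean(samples),
            "p50": qs[49], "p95": qs[94], "p99": qs[98]}

gaps = inter_token_gaps([0.00, 0.12, 0.19, 0.31, 0.38, 0.55])
print(summarize(gaps * 20))  # repeated only to give the tail enough samples
```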

### Scheduling
- Add cron-friendly mode for regular benchmarking
- Alert on latency regressions
- Integration with monitoring (Prometheus metrics?)
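
One shape this could take is a regression gate that cron runs after each benchmark pass; the file layout, field names, and 20% threshold are all assumptions:

```python
# Sketch: compare the latest results file against a stored baseline and exit
# nonzero on regression so cron/CI can alert. The JSON schema is hypothetical.
import json
import sys
from pathlib import Path

THRESHOLD = 1.20  # flag models whose p95 latency grew by more than 20%

def check(baseline_path: str, current_path: str) -> int:
    baseline = json.loads(Path(baseline_path).read_text())
    current = json.loads(Path(current_path).read_text())
    failures = [
        f"{model}: {baseline[model]['p95_latency_s']:.2f}s -> "
        f"{stats['p95_latency_s']:.2f}s"
        for model, stats in current.items()
        if model in baseline
        and stats["p95_latency_s"] > baseline[model]["p95_latency_s"] * THRESHOLD
    ]
    for line in failures:
        print("REGRESSION", line)
    return 1 if failures else 0  # nonzero exit is the alert signal

if __name__ == "__main__":
    sys.exit(check("baseline.json", "latest.json"))
```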
