/---\
                   /     \___ .
                  /           \  .
                 /
                |
                |   w r a p t u n e
               |
              |
             /
           /
         /
------__/
  

lib/tune — Every Hardcoded Decision Is an Experiment You're Not Running

A lightweight Python library that makes any decision point in your code observable, and optionally self-optimizing, with zero hot-path overhead.

pip install wraptune

February 2026

The problem: invisible, stale decisions

Every codebase is full of decisions that were made once, hardcoded, and never revisited.

Not just the big, deliberate ones. The small, implicit ones. Which HTTP client library to use for a particular fetch. What retry count and backoff strategy. What cache TTL. Which parsing approach. What similarity threshold. What batch size. What concurrency limit. What report template. What content-length minimum to bother processing. Which LLM model for a particular task.

These aren't decisions that got careful analysis. Someone tried a thing, it worked, and they wrote timeout=30 or model="gpt-4o" or max_retries=3 into the code. Months pass. Dependencies update. Provider characteristics shift. Network conditions change. The hardcoded choice quietly becomes suboptimal, and nobody notices because nobody is measuring.

Even teams that benchmark rigorously have this problem. A benchmark is a snapshot. Production is a stream. The fetch strategy that was fastest when you tested it might be the slowest under today's traffic patterns. The retry policy that seemed conservative might be masking a downstream issue. The model that scored highest on your eval set last quarter might be the most expensive option today.

The core issue isn't that developers make bad decisions. It's that the decisions are invisible. There's no record of what was chosen, why, how it performed, or whether an alternative would have been better. Every function call that involves a choice is a natural experiment, and we're throwing away the data.

The decisions hiding in your code

Open any backend codebase and look for hardcoded values. You'll find dozens of decision points where alternatives could be tested:

Category                 What's hardcoded             What you could be testing
Fetch strategy           httpx.get(url)               httpx vs. browser vs. proxy vs. cached
Retry policy             max_retries=3, backoff=1.5   Different retry counts, linear vs. exponential, jitter
Cache TTL                ttl=3600                     1h vs. 4h vs. 24h — measure cache hit rates and staleness
Report layout            One template                 A/B test layouts, section ordering, detail levels
Thresholds               similarity > 0.85            0.80 vs. 0.85 vs. 0.90 — precision/recall tradeoffs
Batch sizes              batch_size=100               50 vs. 100 vs. 500 — throughput vs. memory vs. latency
Concurrency              max_concurrent=5             Different limits for different providers or times of day
LLM model                model="gpt-4o"               haiku vs. flash vs. gpt-mini — cost, latency, quality
Parsing                  BeautifulSoup(html)          BS4 vs. lxml vs. regex vs. LLM extraction
Prompt template          One system prompt            Variants with different structure, examples, tone
Model + prompt pairing   model="gpt-4o", prompt=v1    Which LLM model with what prompt for a particular task

Each of these is a choice that affects performance, cost, reliability, or quality. Each one was picked based on intuition or a quick test. Each one could be wrong today even if it was right when it was written.

The insight: observation as a primitive

Every hardcoded value is a choice — one option among alternatives that were never tried. model="gpt-4o" is a choice among models. timeout=30 is a choice among timeout policies. httpx.get(url) is a choice among fetch strategies. The first step to improving any of these is knowing how the current choice performs.

What if every decision point in your code was observable by default? Not as a heavyweight tracing system, not as a separate analytics platform, but as a single decorator that wraps any function and records what happened — which choice was made, how long it took, whether it succeeded, and any domain-specific metrics you care about?

The key constraints:

- Zero hot-path overhead: choosing and metric extraction happen in memory; storage writes are fire-and-forget.
- Transparent by default: without a chooser, the decorator changes nothing about how the function behaves.
- A kill switch: one environment variable disables everything instantly.
- Domain-agnostic: any async function, any set of choices, with choices as plain string labels.

This gives you two independent capabilities: Observer (always runs, extracts metrics from the result) and Chooser (optionally selects among alternatives). They're separate concerns. You can use either or both.

Two primitives

The library provides two primitives. Both work with any async function and any set of choices. Choices are just strings — labels for the alternatives.

@tune — transparent decorator

Wraps any async function. Records timing, outcome, and optional custom metrics on every call. If you give it choices and a chooser, it also selects among alternatives and feeds results back to the chooser.

race() — parallel comparison

Fires N choices in parallel. Returns the first valid result. Logs all results — winners and losers — to the observation store. Supports pin mode (shadow testing) where a pinned choice always wins but alternatives race in the background for data collection.

Both primitives have a kill switch: set TUNE_DISABLED=1 in the environment, or call set_enabled(False) in code. The decorator becomes a passthrough. race() runs only the pinned or first choice with no parallel work and no logging. Zero behavior change, zero overhead, instant rollback.

The design

The library has four modules:

Module        Purpose
core.py       @tune decorator, Observation dataclass, observers, choosers, kill switch
race.py       Generic race() — parallel fire, first-valid-wins, shadow/pin mode
store.py      Non-blocking SQLite WAL storage
__init__.py   Public API surface

How a call flows through the decorator

caller
  |
  v
@tune                                # auto-named: "module.fn_name"
  |
  |-- 1. Chooser.choose()            [in-memory, sub-microsecond]
  |      inject chosen value into kwargs
  |
  |-- 2. await fn(*args, **kwargs)   [the actual work]
  |
  |-- 3. Observers extract metrics   [inline, no I/O]
  |      CostObserver, custom...
  |
  |-- 4. Chooser.update(obs)         [in-memory dict write]
  |
  |-- 5. asyncio.create_task(        [fire-and-forget]
  |        store.write(obs))
  |
  v
return result (before write completes)

Steps 1, 3, and 4 are nanoseconds. Step 5 is non-blocking. The only thing that adds latency to the caller is a few dictionary operations and a time.perf_counter() pair. In practice, this is unmeasurable.

The Observation record

Every call produces an Observation dataclass. It captures timing and outcome universally, plus optional domain-specific metrics via observers:

@dataclass
class Observation:
    experiment: str            # "fetch_method", "badge_model", "retry_policy"
    choice: str | None         # "httpx" (what was chosen)
    choices: list[str] | None  # ["httpx", "browser", "proxy"]
    timing_ms: float           # wall-clock time
    meta: dict                 # observers put everything here (cost, tokens, model, ...)
    error: str | None          # first 200 chars if exception
    created_at: float          # unix timestamp

The record is domain-agnostic. CostObserver populates meta with LLM-specific fields (cost_usd, tokens_in, model) when present. For non-LLM use cases, meta holds whatever your custom observers extract. The core record needs only a name, a choice label, and a timer.

Records go to a SQLite WAL database via a thread pool executor. WAL mode means reads never block writes, and the fire-and-forget pattern means writes never block the caller. The database is a side channel, not a dependency.

Show me the code

1. Observe a fetch strategy

You have a function that fetches URLs. Sometimes httpx is fastest, sometimes the browser approach works better, sometimes a proxy is needed. Right now you've hardcoded one approach. Start by observing it:

from lib.tune import tune

@tune
async def fetch_url(url, method="httpx"):
    match method:
        case "httpx":   return await httpx_fetch(url)
        case "browser": return await browser_fetch(url)
        case "proxy":   return await proxy_fetch(url)

# Usage is identical. Callers see no difference.
html = await fetch_url("https://example.com")

Every call now records timing, success/failure, and which method was used. No logic change. No overhead. You now have production data on how your fetch strategy performs.

2. Observe an LLM call

Same pattern, different domain. Add CostObserver to capture LLM-specific metrics:

from lib.tune import tune, CostObserver

@tune(observers=[CostObserver()])
async def generate_badge(prompt, model="haiku"):
    return await call_llm(prompt, model=model)

Now you get cost, token counts, model name, and timing on every call. CostObserver is just an observer plugin — it duck-types the result object and extracts whatever metadata it finds. For non-LLM functions, leave it off and you still get timing and error tracking.

3. Let a bandit choose

You have three models that could generate badges. Instead of hardcoding one, let a Thompson sampling bandit learn which is best:

from lib.tune import tune, CostObserver, BanditChooser

@tune(
    choice_key="model", choices=["haiku", "grok-fast", "flash"],
    chooser=BanditChooser(),
    observers=[CostObserver()])
async def generate_badge(prompt, model="haiku"):
    return await call_llm(prompt, model=model)

choice_key names the kwarg to inject. choices lists the alternatives. The chooser picks one and injects it. The choices are opaque strings — they could be model names, strategy labels, template IDs, anything. The library doesn't interpret them.

This works identically for non-LLM decisions:

@tune(
    choice_key="method", choices=["httpx", "browser", "proxy"],
    chooser=BanditChooser())
async def fetch_url(url, method="httpx"):
    ...

The BanditChooser maintains Beta distributions per arm in memory. On first setup, it loads historical success/failure counts from the observation database. After that, choose() is a random.betavariate() call per arm — sub-microsecond, no I/O.
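The Thompson sampling core is small enough to sketch in full. This uses dict-based records for brevity and omits the warm-start load from the observation database:

```python
import random
from collections import defaultdict

class BanditChooser:
    """Thompson sampling over Bernoulli rewards: one Beta(wins, losses) per arm."""

    def __init__(self, reward_fn=None):
        self.reward_fn = reward_fn or (lambda obs: obs["error"] is None)
        self.wins = defaultdict(lambda: 1.0)    # Beta alpha, uniform prior
        self.losses = defaultdict(lambda: 1.0)  # Beta beta, uniform prior

    def choose(self, choices):
        # draw one sample from each arm's posterior; the biggest draw wins
        return max(choices,
                   key=lambda arm: random.betavariate(self.wins[arm], self.losses[arm]))

    def update(self, obs):
        arm = obs.get("choice")
        if arm is None:
            return
        if self.reward_fn(obs):
            self.wins[arm] += 1
        else:
            self.losses[arm] += 1
```

Because low-count arms have wide posteriors, they still get sampled occasionally — that's the exploration; as evidence accumulates, the best arm's draws dominate.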

4. Custom reward functions

The default reward is binary: did it error or not? But you can define richer signals. A reward function takes an Observation and returns a boolean:

# Reward = fast AND successful
def fast_success(obs):
    return obs.error is None and obs.timing_ms < 500

# For fetch strategies: reward fast, penalize slow or failed
@tune(
    choice_key="method", choices=["httpx", "browser", "proxy"],
    chooser=BanditChooser(reward_fn=fast_success))
async def fetch_url(url, method="httpx"):
    ...

# For LLM calls: reward fast AND cheap AND no error
def fast_and_cheap(obs):
    return (
        obs.error is None
        and obs.timing_ms < 500
        and obs.meta.get("cost_usd", 0) < 0.01
    )

The bandit optimizes for whatever you define as success. A fetch method that's fast but unreliable won't dominate. An LLM model that's cheap but slow won't either. The bandit finds the arm that best satisfies your definition of reward.

5. Custom observers

Observers are simple: inspect the result, return a dictionary of metrics. Here's one that tracks whether the response contains valid JSON:

import json
class JsonValidObserver:
    def on_result(self, result, timing_ms, kwargs):
        if result is None:
            return {}
        try:
            json.loads(str(result))
            return {"json_valid": True}
        except json.JSONDecodeError:
            return {"json_valid": False}

And one that measures content quality for fetch results:

class ContentQualityObserver:
    def on_result(self, result, timing_ms, kwargs):
        if not result:
            return {}
        text = str(result)
        return {
            "content_length": len(text),
            "has_content": len(text) > 500,
            "is_error_page": "403" in text or "Access Denied" in text,
        }

Observers run inline (no I/O, no blocking). Their output lands in meta in the observation record. Now you can query content quality per fetch method, JSON validity rates per model, or any custom metric — per time window, per caller.

Racing: parallel comparison for any decision

Every bandit algorithm has a cold start problem. You need to explore bad arms to learn they're bad, and that costs you. Thompson sampling handles this elegantly, but the exploration cost is still real.

Racing eliminates it.

race() fires the same request at multiple choices in parallel and takes the first valid response. The losers keep running in the background for the log. This means every call produces data on every arm: the caller gets the fastest valid answer, there is no exploration cost, and the cold start problem disappears.

Racing fetch strategies

from lib.tune import race

# Race three fetch methods — fastest valid response wins
result, winner, ms = await race(
    "fetch_method",
    choices=["httpx", "browser", "proxy"],
    action=lambda method: fetch_url(url, method=method),
    validator=lambda r: len(r) > 500,  # reject error pages
)

Racing LLM models

# Same primitive, different choices
result, winner, ms = await race(
    "summarizer",
    choices=["haiku", "flash", "grok-fast", "gpt-mini"],
    action=lambda model: call_llm(prompt, model=model),
    validator=lambda r: len(str(r)) > 50,
)

Notice that race() is identical in both cases. It doesn't know what the choices represent. It fires the action for each choice string, collects results, picks a winner, and logs everything.
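First-valid-wins is a few lines of asyncio. Here's a sketch without pin mode or loser logging (which the real race() handles); the signature mirrors the examples above but is not the library's actual code:

```python
import asyncio
import time

async def race(experiment, choices, action, validator=lambda r: True):
    """Fire every choice at once; return (result, winner, ms) for the first valid one."""
    t0 = time.perf_counter()
    tasks = {asyncio.create_task(action(c)): c for c in choices}
    pending = set(tasks)
    result, winner = None, None
    while pending and winner is None:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            try:
                r = task.result()
            except Exception:
                continue  # a failed arm just loses; the real library logs it
            if validator(r):
                result, winner = r, tasks[task]
                break
    # losers keep running in the background so their results can be logged
    return result, winner, (time.perf_counter() - t0) * 1000
```

Note that an arm finishing first doesn't mean it wins — the validator gets a veto, which is how fast error pages lose to slower real content.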

Shadow mode: zero-risk data collection

You have a production system that works. You want to know if an alternative would be better, but you don't want to change behavior while you find out. That's what pin mode is for.

# Pin = always return httpx's result (current production behavior).
# But race browser and proxy in background for data.
result, winner, ms = await race(
    "fetch_method",
    choices=["httpx", "browser", "proxy"],
    action=lambda method: fetch_url(url, method=method),
    pin="httpx",  # winner is always httpx
)

With pin="httpx", the system always returns httpx's result. Same behavior as before. But browser and proxy run in the background, and their timing, success/failure, and content quality are all logged. After a week of data, you can look at the numbers and decide whether to switch — with evidence, not intuition.

If the pinned choice fails, race() automatically falls back to the fastest valid alternative. You get shadow testing and automatic failover for free.

The shadow deployment pattern. Deploy with a pin. Collect data. Analyze. Remove the pin when you're confident. At no point did you change production behavior before having evidence. This works for any kind of decision — fetch strategies, LLM models, parsing approaches, retry policies.

The kill switch

Two ways to disable everything:

# Environment variable
TUNE_DISABLED=1 python my_app.py

# Runtime toggle
from lib.tune import set_enabled
set_enabled(False)

When disabled:

- @tune becomes a passthrough — no choosing, no observers, no writes.
- race() runs only the pinned or first choice, with no parallel work and no logging.

This is your safety valve. If anything goes wrong with observation or racing in production, one environment variable turns it all off. No code change, no redeploy.

The accumulation pattern

The observation layer is the always-on foundation. It doesn't decide anything. It records what happened. This is valuable on its own — most teams don't even have this for their implicit decisions.

observe             accumulate            analyze               optimize
   |                    |                     |                     |
@tune               SQLite WAL            SQL queries           Chooser plugins
race()              (fire-and-forget)         |                     |
   |                    |                 group by choice       BanditChooser
every call          all calls,            error rates, p50s     RandomChooser
records             zero overhead         per caller, trends    custom reward
timing,             to the caller         meta breakdowns       functions
outcome

The layers are independent. You can run pure observation for weeks, look at the data, and then add a chooser. You can swap a BanditChooser for a RandomChooser without touching the observed function. You can write a new chooser and plug it in.

A manual A/B test tells you what's better now. Continuous observation tells you what's better over time, by caller, under what conditions, and whether the answer has changed since last month. The data accumulates. It becomes a competitive advantage that grows with every call.

Architecture choices worth explaining

Why SQLite, not Postgres or a time-series DB?

SQLite WAL mode gives us concurrent reads and writes with zero deployment overhead. The database is a single file. There's no server to run, no connection pool to manage, no credentials to rotate. For a library that's meant to be dropped into any project, this matters. The observation store is a local side channel, not a service dependency.

Why thread-local connections?

The async write path uses loop.run_in_executor(), which runs in a thread pool. SQLite connections aren't thread-safe, so each thread gets its own connection via threading.local(). This is a well-tested pattern for SQLite in async Python.

Why in-memory state for choosers?

The chooser's choose() method is on the hot path. It runs before every function call. If it touched the database, it would add milliseconds of latency. Instead, choosers load history from the DB once on setup, then maintain state as simple dictionaries. A Thompson sampling choose() call is a random.betavariate() per arm. That's nanoseconds.

Why fire-and-forget writes?

Observations are valuable but not critical. If a write fails (full disk, locked database, no event loop), the function's result is unaffected. The decorator catches the exception silently. This is the right tradeoff: observation should never degrade the thing being observed.

Why are choices opaque strings?

The library doesn't interpret choice values. "httpx", "haiku", "layout_v2", "retry_3x_exp" — they're all just labels. This means the same primitives work for any kind of decision without domain-specific code. The library records which label was chosen and how it performed. What the label means is your concern, not the library's.

What this is NOT

This is not an API gateway or router (TensorZero, LiteLLM). It doesn't route requests or manage API keys. It decorates your existing functions.

This is not an observability platform (Langfuse, Datadog, Honeycomb). There's no dashboard, no hosted service, no vendor. It's a SQLite file on your machine.

This is not an A/B testing framework (LaunchDarkly, Optimizely). It doesn't manage feature flags or user segments. It operates at the function level, not the product level.

This is not a prompt testing framework (Promptfoo, Braintrust). It doesn't run offline evals. It observes production calls.

It's a lightweight Python library with a SQLite backing store that makes any decision point observable and optionally adaptive. The only dependency is loguru for debug logging (trivially swappable). It lives in your codebase, not in a cloud.

Those tools are all useful. Some of them are excellent. But they solve different problems. This library solves the lowest-level problem: making the implicit decisions in your code visible and recordable, with the option to close the loop and make them adaptive.

What you get from this

After running with @tune for a week, you can answer questions like:

# Query the observation store directly (it's just SQLite)
import os, sqlite3
# sqlite3 doesn't expand "~", so do it explicitly
conn = sqlite3.connect(os.path.expanduser("~/.coord/tune.db"))

# What experiments are running?
for row in conn.execute(
    "SELECT experiment, COUNT(*) FROM observations GROUP BY experiment"
):
    print(f"{row[0]}: {row[1]} calls")

# How's the fetch method experiment doing?
for row in conn.execute("""
    SELECT choice, AVG(timing_ms), COUNT(*),
           SUM(CASE WHEN error IS NULL THEN 1.0 ELSE 0 END) / COUNT(*)
    FROM observations WHERE experiment = 'fetch_method'
    GROUP BY choice"""):
    print(f"  {row[0]:12} {row[1]:6.0f}ms  {row[3]:.0%}")

# Same shape for badge_model, with cost pulled out of meta
# (assuming meta is stored as a JSON text column)
for row in conn.execute("""
    SELECT choice, AVG(timing_ms), COUNT(*),
           SUM(CASE WHEN error IS NULL THEN 1.0 ELSE 0 END) / COUNT(*),
           AVG(json_extract(meta, '$.cost_usd'))
    FROM observations WHERE experiment = 'badge_model'
    GROUP BY choice"""):
    print(f"  {row[0]:12} {row[1]:6.0f}ms  {row[3]:.0%}  ${row[4]:.4f}")

Output:

fetch_method: 4291 calls
  httpx          142ms  97%
  browser        483ms  99%
  proxy          201ms  91%

badge_model: 1847 calls
  haiku          289ms  99%  $0.0003
  flash          194ms  98%  $0.0001
  grok-fast      312ms  97%  $0.0004

Those tables tell you httpx is the fastest fetch method but proxy has reliability issues, while flash is the best LLM option right now. Not because you benchmarked once, but because your production system measured thousands of calls across all alternatives over the last week. If the numbers shift next week, you'll see it.

Getting started

Three steps. Five minutes.

Step 1. Pick a decision point in your codebase that matters. A function that runs often, uses a hardcoded strategy, and where you've wondered whether an alternative might be better.

Step 2. Add @tune. No chooser yet — just watch.

@tune
async def my_function(input, strategy="default"):
    ...

Step 3. After a day of production data, look at the numbers. Then decide if you want to add a chooser, add more alternatives, define a custom reward function, or set up a race with shadow mode.

The observation layer pays for itself immediately. You get production metrics on every call. The optimization layer is there when you want it.