wraptune
A lightweight Python library that makes any decision point in your code observable, and optionally self-optimizing, with zero hot-path overhead.
Every codebase is full of decisions that were made once, hardcoded, and never revisited.
Not just the big, deliberate ones. The small, implicit ones. Which HTTP client library to use for a particular fetch. What retry count and backoff strategy. What cache TTL. Which parsing approach. What similarity threshold. What batch size. What concurrency limit. What report template. What content-length minimum to bother processing. Which LLM model for a particular task.
These aren't decisions that got careful analysis. Someone tried a thing, it
worked, and they wrote timeout=30 or model="gpt-4o"
or max_retries=3 into the code. Months pass. Dependencies update.
Provider characteristics shift. Network conditions change. The hardcoded choice
quietly becomes suboptimal, and nobody notices because nobody is measuring.
Even teams that benchmark rigorously have this problem. A benchmark is a snapshot. Production is a stream. The fetch strategy that was fastest when you tested it might be the slowest under today's traffic patterns. The retry policy that seemed conservative might be masking a downstream issue. The model that scored highest on your eval set last quarter might be the most expensive option today.
The core issue isn't that developers make bad decisions. It's that the decisions are invisible. There's no record of what was chosen, why, how it performed, or whether an alternative would have been better. Every function call that involves a choice is a natural experiment, and we're throwing away the data.
Open any backend codebase and look for hardcoded values. You'll find dozens of decision points whose alternatives were never tested:
| Category | What's hardcoded | What you could be testing |
|---|---|---|
| Fetch strategy | httpx.get(url) | httpx vs. browser vs. proxy vs. cached |
| Retry policy | max_retries=3, backoff=1.5 | Different retry counts, linear vs. exponential, jitter |
| Cache TTL | ttl=3600 | 1h vs. 4h vs. 24h — measure cache hit rates and staleness |
| Report layout | One template | A/B test layouts, section ordering, detail levels |
| Thresholds | similarity > 0.85 | 0.80 vs. 0.85 vs. 0.90 — precision/recall tradeoffs |
| Batch sizes | batch_size=100 | 50 vs. 100 vs. 500 — throughput vs. memory vs. latency |
| Concurrency | max_concurrent=5 | Different limits for different providers or times of day |
| LLM model | model="gpt-4o" | haiku vs. flash vs. gpt-mini — cost, latency, quality |
| Parsing | BeautifulSoup(html) | BS4 vs. lxml vs. regex vs. LLM extraction |
| Prompt template | One system prompt | Variants with different structure, examples, tone |
| Model + prompt pairing | model="gpt-4o", prompt=v1 | Which LLM model with what prompt for a particular task |
Each of these is a choice that affects performance, cost, reliability, or quality. Each one was picked based on intuition or a quick test. Each one could be wrong today even if it was right when it was written.
Every hardcoded value is a choice — one option among
alternatives that were never tried. model="gpt-4o" is a choice
among models. timeout=30 is a choice among timeout policies.
httpx.get(url) is a choice among fetch strategies. The first step
to improving any of these is knowing how the current choice performs.
What if every decision point in your code was observable by default? Not as a heavyweight tracing system, not as a separate analytics platform, but as a single decorator that wraps any function and records what happened — which choice was made, how long it took, whether it succeeded, and any domain-specific metrics you care about?
The key constraints:

- Non-blocking: observations are recorded via asyncio.create_task — fire-and-forget. The decorated function returns before the observation hits the database.
- Transparent: @tune shouldn't change any caller's behavior.
- Domain-agnostic: the library doesn't know that "httpx" is a fetch strategy, "haiku" is an LLM model, or "layout_v2" is a report template. It records what was chosen, how it performed, and moves on.

This gives you two independent capabilities: Observer (always runs, extracts metrics from the result) and Chooser (optionally selects among alternatives). They're separate concerns. You can use either or both.
The library provides two primitives. Both work with any async function and any set of choices. Choices are just strings — labels for the alternatives.
@tune — transparent decorator
Wraps any async function. Records timing, outcome, and optional custom metrics
on every call. If you give it choices and a chooser,
it also selects among alternatives and feeds results back to the chooser.
race() — parallel comparison
Fires N choices in parallel. Returns the first valid result. Logs all results — winners and losers — to the observation store. Supports pin mode (shadow testing) where a pinned choice always wins but alternatives race in the background for data collection.
Both primitives have a kill switch: set TUNE_DISABLED=1 in the
environment, or lib.tune.core.enabled = False in code. The
decorator becomes a passthrough. race() runs only the pinned or
first choice with no parallel work and no logging. Zero behavior change, zero
overhead, instant rollback.
The library has four modules:
| Module | Purpose |
|---|---|
| core.py | @tune decorator, Observation dataclass, observers, choosers, kill switch |
| race.py | Generic race() — parallel fire, first-valid-wins, shadow/pin mode |
| store.py | Non-blocking SQLite WAL storage |
| `__init__.py` | Public API surface |
Choosing, running the observers, and building the record are in-memory dictionary work measured in nanoseconds; the database write is fire-and-forget. The only latency the caller sees is a few dictionary operations and a time.perf_counter() pair. In practice, this is unmeasurable.
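That call path can be sketched as a simplified decorator. This is an illustration of the behavior described above, not the library's actual code; store_write here is a stand-in for the real non-blocking store:

```python
import asyncio
import functools
import time

enabled = True  # module-level kill switch, mirroring the described toggle

async def store_write(record):
    pass  # stand-in: the real store persists to SQLite off the event loop

def tune_sketch(fn):
    @functools.wraps(fn)
    async def wrapper(*args, **kwargs):
        if not enabled:  # kill switch: pure passthrough
            return await fn(*args, **kwargs)
        t0 = time.perf_counter()
        error = None
        try:
            return await fn(*args, **kwargs)
        except Exception as exc:
            error = str(exc)[:200]  # record the failure, then re-raise
            raise
        finally:
            record = {
                "experiment": fn.__name__,
                "timing_ms": (time.perf_counter() - t0) * 1000,
                "error": error,
            }
            # Fire-and-forget: scheduled, not awaited, so the caller
            # gets its result before the observation is persisted.
            asyncio.create_task(store_write(record))
    return wrapper
```

The caller pays only for the timer and the dictionary; the write happens after the result is already on its way back.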
Every call produces an Observation dataclass. It captures timing
and outcome universally, plus optional domain-specific metrics via observers:
```python
from dataclasses import dataclass

@dataclass
class Observation:
    experiment: str            # "fetch_method", "badge_model", "retry_policy"
    choice: str | None         # "httpx" (what was chosen)
    choices: list[str] | None  # ["httpx", "browser", "proxy"]
    timing_ms: float           # wall-clock time
    meta: dict                 # observers put everything here (cost, tokens, model, ...)
    error: str | None          # first 200 chars if exception
    created_at: float          # unix timestamp
```
The record is domain-agnostic. CostObserver populates
meta with LLM-specific fields (cost_usd,
tokens_in, model) when present. For non-LLM use
cases, meta holds whatever your custom observers extract. The core
record needs only a name, a choice label, and a timer.
Records go to a SQLite WAL database via a thread pool executor. WAL mode means reads never block writes, and the fire-and-forget pattern means writes never block the caller. The database is a side channel, not a dependency.
You have a function that fetches URLs. Sometimes httpx is fastest, sometimes the browser approach works better, sometimes a proxy is needed. Right now you've hardcoded one approach. Start by observing it:
```python
from lib.tune import tune

@tune
async def fetch_url(url, method="httpx"):
    match method:
        case "httpx":
            return await httpx_fetch(url)
        case "browser":
            return await browser_fetch(url)
        case "proxy":
            return await proxy_fetch(url)

# Usage is identical. Callers see no difference.
html = await fetch_url("https://example.com")
```
Every call now records timing, success/failure, and which method was used. No logic change. No overhead. You now have production data on how your fetch strategy performs.
Same pattern, different domain. Add CostObserver to capture
LLM-specific metrics:
```python
from lib.tune import tune, CostObserver

@tune(observers=[CostObserver()])
async def generate_badge(prompt, model="haiku"):
    return await call_llm(prompt, model=model)
```
Now you get cost, token counts, model name, and timing on every call.
CostObserver is just an observer plugin — it duck-types the
result object and extracts whatever metadata it finds. For non-LLM functions,
leave it off and you still get timing and error tracking.
You have three models that could generate badges. Instead of hardcoding one, let a Thompson sampling bandit learn which is best:
```python
from lib.tune import tune, CostObserver, BanditChooser

@tune(
    choice_key="model",
    choices=["haiku", "grok-fast", "flash"],
    chooser=BanditChooser(),
    observers=[CostObserver()],
)
async def generate_badge(prompt, model="haiku"):
    return await call_llm(prompt, model=model)
```
choice_key names the kwarg to inject. choices lists
the alternatives. The chooser picks one and injects it. The choices are opaque
strings — they could be model names, strategy labels, template IDs,
anything. The library doesn't interpret them.
This works identically for non-LLM decisions:
```python
@tune(
    choice_key="method",
    choices=["httpx", "browser", "proxy"],
    chooser=BanditChooser(),
)
async def fetch_url(url, method="httpx"):
    ...
```
The BanditChooser maintains Beta distributions per arm in memory.
On first setup, it loads historical success/failure counts from the observation
database. After that, choose() is a random.betavariate()
call per arm — sub-microsecond, no I/O.
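A minimal version of that mechanic, as a sketch (this is not the library's BanditChooser, and the real one also seeds its counts from the observation database):

```python
import random

class ThompsonBanditSketch:
    """Thompson sampling over Beta posteriors, one per arm."""

    def __init__(self, arms):
        # [alpha, beta] = [successes + 1, failures + 1] per arm
        self.counts = {arm: [1, 1] for arm in arms}

    def choose(self):
        # One random draw from each arm's Beta posterior; highest draw wins.
        samples = {
            arm: random.betavariate(a, b) for arm, (a, b) in self.counts.items()
        }
        return max(samples, key=samples.get)

    def update(self, arm, reward: bool):
        # Success bumps alpha, failure bumps beta.
        self.counts[arm][0 if reward else 1] += 1
```

An arm that keeps succeeding concentrates its posterior near 1 and gets picked almost every time; arms with little data keep wide posteriors and still receive occasional exploratory draws.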
The default reward is binary: did it error or not? But you can define richer
signals. A reward function takes an Observation and returns a boolean:
```python
# Reward = fast AND successful
def fast_success(obs):
    return obs.error is None and obs.timing_ms < 500

# For fetch strategies: reward fast, penalize slow or failed
@tune(
    choice_key="method",
    choices=["httpx", "browser", "proxy"],
    chooser=BanditChooser(reward_fn=fast_success),
)
async def fetch_url(url, method="httpx"):
    ...

# For LLM calls: reward fast AND cheap AND no error
def fast_and_cheap(obs):
    return (
        obs.error is None
        and obs.timing_ms < 500
        and obs.meta.get("cost_usd", 0) < 0.01
    )
```
The bandit optimizes for whatever you define as success. A fetch method that's fast but unreliable won't dominate. An LLM model that's cheap but slow won't either. The bandit finds the arm that best satisfies your definition of reward.
Observers are simple: inspect the result, return a dictionary of metrics. Here's one that tracks whether the response contains valid JSON:
```python
import json

class JsonValidObserver:
    def on_result(self, result, timing_ms, kwargs):
        if result is None:
            return {}
        try:
            json.loads(str(result))
            return {"json_valid": True}
        except json.JSONDecodeError:
            return {"json_valid": False}
```
And one that measures content quality for fetch results:
```python
class ContentQualityObserver:
    def on_result(self, result, timing_ms, kwargs):
        if not result:
            return {}
        text = str(result)
        return {
            "content_length": len(text),
            "has_content": len(text) > 500,
            "is_error_page": "403" in text or "Access Denied" in text,
        }
```
Observers run inline (no I/O, no blocking). Their output lands in
meta in the observation record. Now you can query content
quality per fetch method, JSON validity rates per model, or any custom metric
— per time window, per caller.
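For example, assuming meta is serialized as a JSON text column in the observations table (the schema and the json_valid key are assumptions for illustration), a per-choice rollup of a custom metric takes a few lines:

```python
import json
import sqlite3

def json_valid_rate_by_choice(db_path, experiment):
    """Fraction of calls whose observer reported json_valid, per choice.
    Assumes meta is stored as a JSON text column (hypothetical schema)."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT choice, meta FROM observations WHERE experiment = ?",
        (experiment,),
    ).fetchall()
    stats = {}
    for choice, meta in rows:
        ok = json.loads(meta or "{}").get("json_valid", False)
        total, valid = stats.get(choice, (0, 0))
        stats[choice] = (total + 1, valid + ok)
    return {c: valid / total for c, (total, valid) in stats.items()}
```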
Every bandit algorithm has a cold start problem. You need to explore bad arms to learn they're bad, and that costs you. Thompson sampling handles this elegantly, but the exploration cost is still real.
Racing eliminates it.
race() fires the same request at multiple choices in parallel and takes the first valid response. The losers keep running in the background so their results still land in the log. You pay only the latency of the fastest valid choice, and you collect data on every alternative:
```python
from lib.tune import race

# Race three fetch methods — fastest valid response wins
result, winner, ms = await race(
    "fetch_method",
    choices=["httpx", "browser", "proxy"],
    action=lambda method: fetch_url(url, method=method),
    validator=lambda r: len(r) > 500,  # reject error pages
)
```
```python
# Same primitive, different choices
result, winner, ms = await race(
    "summarizer",
    choices=["haiku", "flash", "grok-fast", "gpt-mini"],
    action=lambda model: call_llm(prompt, model=model),
    validator=lambda r: len(str(r)) > 50,
)
```
Notice that race() is identical in both cases. It doesn't know
what the choices represent. It fires the action for each choice string, collects
results, picks a winner, and logs everything.
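First-valid-wins is straightforward to sketch with asyncio.wait. This is a simplification, not the library's race(): the real primitive also logs every result and lets losers finish rather than cancelling them:

```python
import asyncio

async def race_sketch(choices, action, validator=lambda r: True):
    # Map each running task back to its choice label.
    tasks = {asyncio.ensure_future(action(c)): c for c in choices}
    pending = set(tasks)
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                # First completed result that passes the validator wins.
                if task.exception() is None and validator(task.result()):
                    return task.result(), tasks[task]
        raise RuntimeError("no choice produced a valid result")
    finally:
        for task in pending:
            task.cancel()  # sketch only: real race() lets losers run for the log
```

Because invalid results are skipped rather than returned, a fast error page loses to a slower but valid response.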
You have a production system that works. You want to know if an alternative
would be better, but you don't want to change behavior while you find out.
That's what pin mode is for.
```python
# Pin = always return httpx's result (current production behavior).
# But race browser and proxy in background for data.
result, winner, ms = await race(
    "fetch_method",
    choices=["httpx", "browser", "proxy"],
    action=lambda method: fetch_url(url, method=method),
    pin="httpx",  # winner is always httpx
)
```
With pin="httpx", the system always returns httpx's result. Same
behavior as before. But browser and proxy run in the background, and their
timing, success/failure, and content quality are all logged. After a week
of data, you can look at the numbers and decide whether to switch —
with evidence, not intuition.
If the pinned choice fails, race() automatically falls back to the fastest valid alternative. You get shadow testing and automatic failover for free.
The shadow deployment pattern. Deploy with a pin. Collect data. Analyze. Remove the pin when you're confident. At no point did you change production behavior before having evidence. This works for any kind of decision — fetch strategies, LLM models, parsing approaches, retry policies.
Two ways to disable everything:
```bash
# Environment variable
TUNE_DISABLED=1 python my_app.py
```

```python
# Runtime toggle
from lib.tune import set_enabled
set_enabled(False)
```
When disabled:
- @tune becomes a zero-overhead passthrough. The decorated function runs exactly as if the decorator weren't there.
- race() runs only the pinned choice (or the first choice if no pin). No parallel work. No logging. Minimal overhead.

This is your safety valve. If anything goes wrong with observation or racing in production, one environment variable turns it all off. No code change, no redeploy.
The observation layer is the always-on foundation. It doesn't decide anything. It records what happened. This is valuable on its own — most teams don't even have this for their implicit decisions.
The layers are independent. You can run pure observation for weeks, look at the
data, and then add a chooser. You can swap a BanditChooser for a
RandomChooser without touching the observed function. You can
write a new chooser and plug it in.
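The chooser contract is small enough to sketch. The signatures below are assumptions inferred from the behavior described here, not the library's exact API:

```python
import random

class RandomChooserSketch:
    """Uniform exploration: handy for collecting unbiased data
    before committing to a bandit. (Assumed interface.)"""

    def choose(self, choices):
        return random.choice(choices)

    def update(self, choice, reward: bool):
        pass  # a random chooser ignores feedback


class EpsilonGreedySketch:
    """A plug-in replacement: exploit the best-known arm,
    explore with probability eps."""

    def __init__(self, eps=0.1):
        self.eps = eps
        self.stats = {}  # choice -> [successes, trials]

    def choose(self, choices):
        for c in choices:
            self.stats.setdefault(c, [0, 0])
        if random.random() < self.eps:
            return random.choice(choices)
        # Laplace-smoothed success rate; untried arms score 1.0,
        # so new alternatives get tried at least once.
        return max(
            choices,
            key=lambda c: (self.stats[c][0] + 1) / (self.stats[c][1] + 1),
        )

    def update(self, choice, reward: bool):
        s = self.stats.setdefault(choice, [0, 0])
        s[0] += bool(reward)
        s[1] += 1
```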
A manual A/B test tells you what's better now. Continuous observation tells you what's better over time, by caller, under what conditions, and whether the answer has changed since last month. The data accumulates. It becomes a competitive advantage that grows with every call.
SQLite WAL mode gives us concurrent reads and writes with zero deployment overhead. The database is a single file. There's no server to run, no connection pool to manage, no credentials to rotate. For a library that's meant to be dropped into any project, this matters. The observation store is a local side channel, not a service dependency.
The async write path uses loop.run_in_executor(), which runs in a
thread pool. SQLite connections aren't thread-safe, so each thread gets its own
connection via threading.local(). This is a well-tested pattern for
SQLite in async Python.
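That pattern fits in a few lines. A sketch under those assumptions (not the library's store.py; it assumes the observations table already exists):

```python
import asyncio
import sqlite3
import threading

_local = threading.local()

def _conn(path):
    # SQLite connections aren't thread-safe, so each executor
    # thread lazily opens and caches its own connection.
    if not hasattr(_local, "conn"):
        _local.conn = sqlite3.connect(path)
        _local.conn.execute("PRAGMA journal_mode=WAL")
    return _local.conn

def _write(path, row):
    conn = _conn(path)
    conn.execute(
        "INSERT INTO observations (experiment, choice, timing_ms) "
        "VALUES (?, ?, ?)",
        row,
    )
    conn.commit()

async def save(path, row):
    # The blocking sqlite3 call runs in the default thread pool,
    # so the event loop (and the caller) never waits on disk I/O.
    await asyncio.get_running_loop().run_in_executor(None, _write, path, row)
```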
The chooser's choose() method is on the hot path. It runs before
every function call. If it touched the database, it would add milliseconds of
latency. Instead, choosers load history from the DB once on setup, then maintain
state as simple dictionaries. A Thompson sampling choose() call is
a random.betavariate() per arm. That's nanoseconds.
Observations are valuable but not critical. If a write fails (full disk, locked database, no event loop), the function's result is unaffected. The decorator catches the exception silently. This is the right tradeoff: observation should never degrade the thing being observed.
The library doesn't interpret choice values. "httpx",
"haiku", "layout_v2", "retry_3x_exp"
— they're all just labels. This means the same primitives work for any
kind of decision without domain-specific code. The library records which label
was chosen and how it performed. What the label means is your concern,
not the library's.
This is not an API gateway or router (TensorZero, LiteLLM). It doesn't route requests or manage API keys. It decorates your existing functions.
This is not an observability platform (Langfuse, Datadog, Honeycomb). There's no dashboard, no hosted service, no vendor. It's a SQLite file on your machine.
This is not an A/B testing framework (LaunchDarkly, Optimizely). It doesn't manage feature flags or user segments. It operates at the function level, not the product level.
This is not a prompt testing framework (Promptfoo, Braintrust). It doesn't run offline evals. It observes production calls.
It's a lightweight Python library with a SQLite backing store that makes any decision point observable and optionally adaptive. The only dependency is loguru for debug logging (trivially swappable). It lives in your codebase, not in a cloud.
Those tools are all useful. Some of them are excellent. But they solve different problems. This library solves the lowest-level problem: making the implicit decisions in your code visible and recordable, with the option to close the loop and make them adaptive.
After running with @tune for a week, you can answer questions
like:
```python
# Query the observation store directly (it's just SQLite)
import sqlite3
from pathlib import Path

conn = sqlite3.connect(Path("~/.coord/tune.db").expanduser())

# What experiments are running?
for row in conn.execute(
    "SELECT experiment, COUNT(*) FROM observations GROUP BY experiment"
):
    print(f"{row[0]}: {row[1]} calls")

# How's the fetch method experiment doing?
for row in conn.execute("""
    SELECT choice, AVG(timing_ms), COUNT(*),
           SUM(CASE WHEN error IS NULL THEN 1.0 ELSE 0 END) / COUNT(*)
    FROM observations
    WHERE experiment = 'fetch_method'
    GROUP BY choice"""):
    print(f"  {row[0]:12} {row[1]:6.0f}ms {row[3]:.0%}")
```
Output:
```
fetch_method: 4291 calls
  httpx        142ms  97%
  browser      483ms  99%
  proxy        201ms  91%
badge_model: 1847 calls
  haiku        289ms  99%  $0.0003
  flash        194ms  98%  $0.0001
  grok-fast    312ms  97%  $0.0004
```
Those tables tell you httpx is the fastest fetch method but proxy has reliability issues, while flash is the best LLM option right now. Not because you benchmarked once, but because your production system measured thousands of calls across all alternatives over the last week. If the numbers shift next week, you'll see it.
Three steps. Five minutes.
Step 1. Pick a decision point in your codebase that matters. A function that runs often, uses a hardcoded strategy, and where you've wondered whether an alternative might be better.
Step 2. Add @tune. No chooser yet —
just watch.
```python
@tune
async def my_function(input, strategy="default"):
    ...
```
Step 3. After a day of production data, look at the numbers. Then decide if you want to add a chooser, add more alternatives, define a custom reward function, or set up a race with shadow mode.
The observation layer pays for itself immediately. You get production metrics on every call. The optimization layer is there when you want it.