Theme: production
One-liner: A CLI that diffs two LLM execution traces (requests, responses, tool calls, token counts) and highlights where behavior or cost diverged. Useful for regression testing after prompt changes, model swaps, or provider migrations.
Teams tweak prompts, swap model versions, or migrate providers (OpenAI → Anthropic, GPT-4 → GPT-4o) and have no automated way to verify the system still behaves the same. Manual spot-checks miss edge cases. Existing observability tools show individual traces but don't diff them. You merge the prompt refactor, deploy, and hope.
Instrument your LLM calls to emit JSON traces (one file per execution: prompts, tool calls, responses, token counts, latencies). Run your test suite twice, once on the baseline and once on the candidate change, and collect two directories of trace files. trace-diff baseline/ candidate/ pairs up matching test cases, compares them structurally, and outputs a human-readable diff: which prompts changed, which tool calls were added, removed, or reordered, which responses semantically diverged (measured by embedding similarity), and which runs got cheaper or more expensive. The command exits with code 1 if semantic drift exceeds a threshold.
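The pairing and exit-code steps described above can be sketched roughly as follows. This is a minimal sketch, not a settled design: it assumes test cases are matched by file name (the note does not say how matching works), and the function names and the 0.85 default threshold are hypothetical.

```python
import json
from pathlib import Path


def load_traces(directory):
    """Map test-case name (file stem) to the parsed JSON trace in a directory."""
    return {
        path.stem: json.loads(path.read_text())
        for path in sorted(Path(directory).glob("*.json"))
    }


def pair_traces(baseline, candidate):
    """Pair traces that share a name; also list cases found on one side only.

    Matching by name is an assumption made for this sketch.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    pairs = {name: (baseline[name], candidate[name]) for name in shared}
    only_baseline = sorted(baseline.keys() - candidate.keys())
    only_candidate = sorted(candidate.keys() - baseline.keys())
    return pairs, only_baseline, only_candidate


def exit_code(similarities, threshold=0.85):
    """Return 1 if any paired case scores below the threshold, else 0.

    The 0.85 default is a placeholder, not a recommended value.
    """
    return 1 if any(score < threshold for score in similarities.values()) else 0
```

In a real run, `pair_traces(load_traces("baseline/"), load_traces("candidate/"))` would feed the structural and semantic comparisons, and cases present on only one side would probably deserve their own line in the report.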
Structured outputs and tool use are stable enough that you can actually diff them programmatically. Model version churn (4o, 4.5, Sonnet 3.7, Haiku 4.0) means teams are doing A/B testing in production whether they admit it or not. OpenTelemetry has normalized trace export, so the instrumentation part is largely solved. Someone just needs to build the diff layer.
Terminal output showing a side-by-side diff of two agent runs: baseline used three tool calls, candidate used two (one was redundant); response text 94% similar but candidate saved $0.003; one test case flagged for manual review because embedding cosine similarity dropped to 0.81.
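The tool-call part of that demo could be computed with `difflib.SequenceMatcher` from the standard library. The sketch below assumes each trace has already been reduced to a list of tool names; the tool names, function names, and cost figures are illustrative, not taken from a real run.

```python
from difflib import SequenceMatcher


def diff_tool_calls(baseline, candidate):
    """Return (tag, baseline_slice, candidate_slice) for each changed region.

    Tags come from SequenceMatcher: 'delete', 'insert', or 'replace'.
    """
    matcher = SequenceMatcher(a=baseline, b=candidate, autojunk=False)
    return [
        (tag, baseline[i1:i2], candidate[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def cost_delta(baseline_cost, candidate_cost):
    """Candidate cost minus baseline cost in dollars; negative means cheaper."""
    return round(candidate_cost - baseline_cost, 6)


# Hypothetical runs: the baseline repeats a search call the candidate drops.
changes = diff_tool_calls(
    ["search", "search", "summarize"],
    ["search", "summarize"],
)
```

Here `changes` contains a single `delete` entry for the redundant `search` call. Comparing names only ignores argument differences, which a fuller diff would probably need to surface as well.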
Semantic similarity thresholds are arbitrary and domain-specific—what counts as "same behavior" for a customer support bot vs. a code generator is different. You'll need heuristics and the user will need to tune them, which kills the "just run it" appeal. Also, if the trace format isn't standardized, every team has to write their own instrumentation adapter.
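One way to make the tuning burden explicit is a per-domain threshold table. The sketch below is only an illustration: the domain names and threshold values are hypothetical placeholders, and the cosine function is a plain-Python stand-in for whatever the embedding library provides.

```python
import math

# Hypothetical per-domain thresholds; placeholders to tune, not recommendations.
THRESHOLDS = {"support_bot": 0.80, "code_generator": 0.95}


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def has_drifted(similarity, domain, thresholds=THRESHOLDS, default=0.85):
    """True when similarity falls below the domain's threshold (or the default)."""
    return similarity < thresholds.get(domain, default)
```

With these placeholder values, a similarity of 0.81 passes for `support_bot` but counts as drift for `code_generator`, which is exactly the domain dependence described above.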
Python, click for CLI, difflib + sentence-transformers for semantic diff, Pydantic for trace schema validation, support for OpenTelemetry JSON export as input format.
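A possible shape for the trace schema is sketched below. To stay dependency-free it uses standard-library dataclasses as a stand-in for the Pydantic models named in the stack, and every field name is an assumption, since the note does not define the trace format.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    arguments: dict = field(default_factory=dict)


@dataclass
class Trace:
    # All field names here are hypothetical, chosen for illustration.
    test_case: str
    prompt: str
    response: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    tool_calls: list = field(default_factory=list)

    def __post_init__(self):
        if self.input_tokens < 0 or self.output_tokens < 0:
            raise ValueError("token counts must be non-negative")
        # Accept raw dicts from JSON and convert them to ToolCall objects.
        self.tool_calls = [
            tc if isinstance(tc, ToolCall) else ToolCall(**tc)
            for tc in self.tool_calls
        ]


def is_valid(raw):
    """Return True if a raw dict parses into a Trace, False otherwise."""
    try:
        Trace(**raw)
    except (TypeError, ValueError):
        return False
    return True
```

A Pydantic model would add type coercion and richer error messages on top of this. An OpenTelemetry JSON export would need a separate adapter that maps spans onto these fields.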
Spawned from auto-brainstorm on 2026-04-22. Theme: production.