trace-diff

Theme: production

One-liner: A CLI that diffs two LLM execution traces (requests, responses, tool calls, token counts) and highlights where behavior or cost diverged. Useful for regression testing after prompt changes, model swaps, or provider migrations.

Problem

Teams tweak prompts, swap model versions, or migrate providers (OpenAI → Anthropic, GPT-4 → GPT-4o) and have no automated way to verify the system still behaves the same. Manual spot-checks miss edge cases. Existing observability tools show individual traces but don't diff them. You merge the prompt refactor, deploy, and hope.

The sketch

Instrument your LLM calls to emit JSON traces, one file per execution, containing prompts, tool calls, responses, token counts, and latencies. Run your test suite twice, once on the baseline and once on the candidate change, and collect two directories of trace files.

Running trace-diff baseline/ candidate/ pairs up matching test cases, compares them structurally, and outputs a human-readable diff: which prompts changed, which tool calls were added, removed, or reordered, which responses semantically diverged (via embedding similarity), and which runs got cheaper or more expensive. The CLI exits with code 1 if semantic drift exceeds a threshold.
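The pairing and tool-call comparison described above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it assumes test cases are matched by identical filename in the two directories, and that each trace file has a top-level "tool_calls" list whose items carry a "name" key. Both conventions, and the function names pair_traces, load_tool_names, and diff_tool_calls, are hypothetical.

```python
import difflib
import json
from pathlib import Path


def pair_traces(baseline_dir, candidate_dir):
    """Pair trace files that share a filename (assumed matching rule)."""
    base = {p.name: p for p in Path(baseline_dir).glob("*.json")}
    cand = {p.name: p for p in Path(candidate_dir).glob("*.json")}
    paired = [(base[n], cand[n]) for n in sorted(base.keys() & cand.keys())]
    only_baseline = sorted(base.keys() - cand.keys())
    only_candidate = sorted(cand.keys() - base.keys())
    return paired, only_baseline, only_candidate


def load_tool_names(path):
    """Read the ordered tool-call names from one trace file (assumed schema)."""
    trace = json.loads(Path(path).read_text())
    return [call["name"] for call in trace.get("tool_calls", [])]


def diff_tool_calls(baseline_calls, candidate_calls):
    """List every non-equal opcode between two tool-call name sequences."""
    matcher = difflib.SequenceMatcher(
        a=baseline_calls, b=candidate_calls, autojunk=False
    )
    return [
        (tag, baseline_calls[i1:i2], candidate_calls[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Baseline made three calls, candidate dropped the redundant last one.
    for change in diff_tool_calls(["search", "fetch", "search"], ["search", "fetch"]):
        print(change)
```

Comparing only tool names keeps the sketch short; a fuller version would probably also compare call arguments, since a call with the same name but different arguments is a behavior change too.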

Why now

Structured outputs and tool-use are stable enough that you can actually diff them programmatically. Model version churn (4o, 4.5, Sonnet 3.7, Haiku 4.0) means teams are doing A/B testing in production whether they admit it or not. OpenTelemetry has normalized trace export, so the instrumentation part is solved—someone just needs to build the diff layer.

Demo surface

Terminal output showing a side-by-side diff of two agent runs: baseline used three tool calls, candidate used two (one was redundant); response text 94% similar but candidate saved $0.003; one test case flagged for manual review because embedding cosine dropped to 0.81.
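The "flagged for manual review" decision in that demo comes down to a cosine similarity check against a threshold. A rough sketch, assuming response embeddings have already been computed elsewhere (for example by a sentence-embedding model) and are passed in as plain lists of floats. The 0.85 default threshold and the names cosine_similarity, flag_for_review, and exit_code are illustrative assumptions, not settings the project defines.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero embedding as "no similarity"
    return dot / (norm_u * norm_v)


def flag_for_review(similarities, threshold=0.85):
    """Return test-case ids whose similarity fell below the threshold."""
    return sorted(case for case, sim in similarities.items() if sim < threshold)


def exit_code(similarities, threshold=0.85):
    """1 if any test case drifted below the threshold, otherwise 0."""
    return 1 if flag_for_review(similarities, threshold) else 0


if __name__ == "__main__":
    # Hypothetical per-test similarities mirroring the demo numbers.
    scores = {"refund_flow": 0.94, "escalation": 0.81}
    print(flag_for_review(scores))
    print(exit_code(scores))
```

Keeping the threshold as a parameter matters here, since a value that suits one domain is likely to be too strict or too loose for another.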

Risks / honest take

Semantic similarity thresholds are arbitrary and domain-specific: "same behavior" means something different for a customer support bot than for a code generator. You'll need heuristics, and the user will need to tune them, which kills the "just run it" appeal. Also, if the trace format isn't standardized, every team has to write its own instrumentation adapter.

Stack guess

Python, click for CLI, difflib + sentence-transformers for semantic diff, Pydantic for trace schema validation, support for OpenTelemetry JSON export as input format.
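To make the trace schema idea concrete, here is one possible shape for a trace record plus a cost comparison. It uses standard-library dataclasses as a stand-in for Pydantic so the sketch has no dependencies. Every field name, the parse_trace and cost_usd helpers, and the per-million-token prices in the demo are assumptions for illustration only, not a format the project has settled on and not real provider pricing.

```python
from dataclasses import dataclass

# Keys the assumed trace format requires in every JSON file.
REQUIRED_KEYS = ("test_id", "prompt", "response", "input_tokens", "output_tokens")


@dataclass
class Trace:
    test_id: str
    prompt: str
    response: str
    input_tokens: int
    output_tokens: int


def parse_trace(raw):
    """Validate a decoded JSON object and build a Trace from it."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ValueError(f"trace is missing keys: {missing}")
    return Trace(
        test_id=str(raw["test_id"]),
        prompt=str(raw["prompt"]),
        response=str(raw["response"]),
        input_tokens=int(raw["input_tokens"]),
        output_tokens=int(raw["output_tokens"]),
    )


def cost_usd(trace, input_price_per_mtok, output_price_per_mtok):
    """Cost of one run given caller-supplied prices per million tokens."""
    return (
        trace.input_tokens * input_price_per_mtok
        + trace.output_tokens * output_price_per_mtok
    ) / 1_000_000


if __name__ == "__main__":
    base = parse_trace({"test_id": "t1", "prompt": "p", "response": "r",
                        "input_tokens": 1200, "output_tokens": 400})
    cand = parse_trace({"test_id": "t1", "prompt": "p", "response": "r",
                        "input_tokens": 900, "output_tokens": 350})
    # Placeholder prices; real prices depend on provider and model.
    delta = cost_usd(cand, 3.0, 15.0) - cost_usd(base, 3.0, 15.0)
    print(f"cost delta: {delta:+.6f} USD")
```

Prices are passed in by the caller rather than hard-coded because baseline and candidate may run on different models or providers, each with its own rates.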

Spawned from auto-brainstorm on 2026-04-22. Theme: production.
