101 changes: 101 additions & 0 deletions .agents/skills/sdk-benchmarking/SKILL.md
@@ -0,0 +1,101 @@
---
name: sdk-benchmarking
description: Run, compare, and extend Braintrust Python SDK pyperf benchmarks. Use when touching hot-path code in `py/src/braintrust/` such as serialization, deep-copy, span creation, or logging; when adding or updating files under `py/benchmarks/`; or when you need baseline/branch performance measurements with `cd py && make bench` and `make bench-compare`.
---

# SDK Benchmarking

Use this skill for benchmark work in the Braintrust Python SDK repository.

Benchmark support already exists in `py/benchmarks/`. Once you have identified the relevant benchmark surface, use the current repo workflow rather than digging through commit history.

## Read First

Always read:

- `AGENTS.md`
- `CONTRIBUTING.md`
- `py/Makefile`
- `py/benchmarks/__main__.py`
- `py/benchmarks/_utils.py`
- `py/benchmarks/benches/__init__.py`

Read when relevant:

- `py/benchmarks/benches/bench_bt_json.py` for the module pattern
- `py/benchmarks/fixtures.py` for shared payload builders
- `py/setup.py` when benchmarking the optional `orjson` fast path
- `references/benchmark-patterns.md` in this skill for command and module templates

## Workflow

1. Identify the hot path or API surface that changed.
2. Find the nearest existing benchmark module under `py/benchmarks/benches/`.
3. Run the narrowest useful benchmark first.
4. Add or update a `bench_*.py` module only if the current suite does not cover the changed path.
5. Reuse or extend `py/benchmarks/fixtures.py` for realistic shared payloads instead of inlining bulky test data.
6. Save before/after results and compare them when the task is about regression detection or improvement claims.

## Commands

Run benchmarks from `py/`:

```bash
cd py
make bench
make bench BENCH_ARGS="--fast"
make bench BENCH_ARGS="-o /tmp/before.json"
make bench BENCH_ARGS="-o /tmp/after.json"
make bench-compare BENCH_BASE=/tmp/before.json BENCH_NEW=/tmp/after.json
python -m benchmarks.benches.bench_bt_json
```

Use `python -m benchmarks --help` for extra `pyperf` flags.

If the benchmark should measure the optional `orjson` path, install the performance extra first:

```bash
cd py
python -m uv pip install -e '.[performance]'
```

## Adding Benchmarks

Put new modules in `py/benchmarks/benches/` and name them `bench_<name>.py`.

Every benchmark module must:

- expose `main(runner: pyperf.Runner | None = None) -> None`
- create its own `pyperf.Runner()` only when `runner` is `None`
- call `disable_pyperf_psutil()` before creating that runner
- register benchmarks with stable, descriptive names via `runner.bench_func(...)`
- remain executable directly with `python -m benchmarks.benches.bench_<name>`

Do not add manual registration. `python -m benchmarks` auto-discovers every `bench_*.py` module in `py/benchmarks/benches/`.
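
A minimal sketch of that contract (the `bench_example` module name and `serialize_small` target are illustrative; see `references/benchmark-patterns.md` in this skill for the full template with the `sys.path` shim):

```python
import pyperf

from benchmarks._utils import disable_pyperf_psutil


def serialize_small(payload):
    # Illustrative stand-in for the hot function under test.
    return str(payload)


def main(runner: pyperf.Runner | None = None) -> None:
    if runner is None:
        # Standalone execution: create our own runner. Under `python -m benchmarks`,
        # a shared runner is passed in instead.
        disable_pyperf_psutil()
        runner = pyperf.Runner()

    runner.bench_func("example.serialize_small[dict]", serialize_small, {"a": 1})


if __name__ == "__main__":
    main()
```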

## Fixtures

Keep reusable payload builders and synthetic objects in `py/benchmarks/fixtures.py`.

Prefer fixture helpers when:

- several benchmark cases share similar payloads
- the inputs are large enough to distract from the benchmark itself
- you need variants such as small, medium, large, circular, or non-string-key cases

Keep fixture builders deterministic and focused on representative data shapes.
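
As an illustration only (the name `build_nested_payload` is hypothetical, not an existing fixture), a deterministic builder might look like:

```python
def build_nested_payload(depth: int = 3, width: int = 4) -> dict:
    """Hypothetical builder: a deterministic nested dict shaped like real log payloads."""
    if depth == 0:
        return {f"leaf_{i}": i for i in range(width)}
    return {f"node_{depth}_{i}": build_nested_payload(depth - 1, width) for i in range(width)}
```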

## Validation

- Run the narrowest affected benchmark first.
- Use `BENCH_ARGS="--fast"` for quick local sanity checks while iterating.
- Save JSON outputs and use `make bench-compare` for baseline versus branch comparisons.
- If you changed code paths that also have correctness tests, run the smallest relevant test target in addition to the benchmark.

## Pitfalls

- Measuring import/setup overhead instead of the hot function under test (see the sketch after this list).
- Inlining ad hoc payload construction in each benchmark instead of reusing fixtures.
- Forgetting the standalone `main()` pattern, which breaks both auto-discovery and direct `python -m` execution.
- Claiming performance changes from a single unsaved local run instead of comparing saved results.
- Benchmarking the `orjson` fast path without explicitly installing `.[performance]`.
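
A sketch of the first pitfall and its fix (all names here are illustrative):

```python
import pyperf


def serialize(payload):
    # Illustrative stand-in for the hot function being measured.
    return str(payload)


def build_large_payload():
    return {"items": list(range(10_000))}


runner = pyperf.Runner()

# Pitfall: payload construction is timed together with the hot function.
runner.bench_func("serialize[inline-setup]", lambda: serialize(build_large_payload()))

# Better: build the payload once so only `serialize` is timed.
payload = build_large_payload()
runner.bench_func("serialize[prebuilt]", serialize, payload)
```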
4 changes: 4 additions & 0 deletions .agents/skills/sdk-benchmarking/agents/openai.yaml
@@ -0,0 +1,4 @@
interface:
  display_name: "SDK Benchmarking"
  short_description: "Run and extend Braintrust SDK benchmarks"
  default_prompt: "Use $sdk-benchmarking to run, compare, or add Braintrust Python SDK benchmarks."
93 changes: 93 additions & 0 deletions .agents/skills/sdk-benchmarking/references/benchmark-patterns.md
@@ -0,0 +1,93 @@
# Benchmark Patterns

Use this reference when adding or updating SDK benchmarks.

## Command Cheatsheet

```bash
cd py

# Run everything
make bench

# Faster local iteration
make bench BENCH_ARGS="--fast"

# Save results for comparison
make bench BENCH_ARGS="-o /tmp/before.json"
make bench BENCH_ARGS="-o /tmp/after.json"
make bench-compare BENCH_BASE=/tmp/before.json BENCH_NEW=/tmp/after.json

# Run one module directly
python -m benchmarks.benches.bench_bt_json

# Inspect all forwarded pyperf flags
python -m benchmarks --help
```

## Module Skeleton

```python
import pathlib
import sys

import pyperf


if __package__ in (None, ""):
    sys.path.insert(0, str(pathlib.Path(__file__).resolve().parents[2]))

from benchmarks._utils import disable_pyperf_psutil


def target(value):
    return value


def main(runner: pyperf.Runner | None = None) -> None:
    if runner is None:
        disable_pyperf_psutil()
        runner = pyperf.Runner()

    runner.bench_func("example.target[case-name]", target, "value")


if __name__ == "__main__":
    main()
```

Follow the existing `py/benchmarks/benches/bench_bt_json.py` pattern when importing repo code. The `sys.path` adjustment keeps direct module execution working from inside `py/`.

## Fixture Guidance

Put reusable builders in `py/benchmarks/fixtures.py` when:

- several benchmark cases need the same payload shape
- the payload should model realistic nested SDK inputs
- the benchmark should cover edge cases such as circular references or non-string keys

Current fixture patterns already cover:

- small, medium, and large nested payloads
- circular structures
- non-string dictionary keys
- dataclass-like and pydantic-like values

Extend those helpers before creating one-off payload factories in a new benchmark module.
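
For example, a circular-structure case could reuse a shared builder like this (the helper name is illustrative, not an existing fixture):

```python
def build_circular_payload() -> dict:
    """Hypothetical builder: a small payload that references itself."""
    payload = {"id": "span-1", "metadata": {"tags": ["a", "b"]}}
    payload["self"] = payload  # circular reference back to the root
    return payload
```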

## Comparison Workflow

For branch-to-branch comparisons:

```bash
cd py
git checkout main
make bench BENCH_ARGS="-o /tmp/main.json"

git checkout my-branch
make bench BENCH_ARGS="-o /tmp/branch.json"

make bench-compare BENCH_BASE=/tmp/main.json BENCH_NEW=/tmp/branch.json
```

Use `--rigorous` only when you need lower-noise final numbers; use `--fast` while iterating.
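
For example, a final low-noise run saved for comparison might look like:

```bash
cd py
make bench BENCH_ARGS="--rigorous -o /tmp/final.json"
```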
20 changes: 20 additions & 0 deletions AGENTS.md
@@ -12,6 +12,7 @@ Guide for contributing to the Braintrust Python SDK repository.
## Repo Map

- `py/`: main Python package, tests, examples, nox sessions, release build
- `py/benchmarks/`: pyperf performance benchmarks
- `integrations/`: separate integration packages
- `internal/golden/`: compatibility and golden projects
- `docs/`: supporting docs
@@ -122,6 +123,25 @@ BRAINTRUST_CLAUDE_AGENT_SDK_RECORD_MODE=all nox -s "test_claude_agent_sdk(latest

Only re-record HTTP or subprocess cassettes when the behavior change is intentional. If in doubt, ask the user.

## Benchmarks

Run `cd py && make bench` when touching hot-path code (serialization, deep-copy, span creation, logging). Not required for every change.

Benchmarks use pyperf. All `bench_*.py` files in `py/benchmarks/benches/` are auto-discovered — no registration needed.

Key commands:

```bash
cd py
make bench # run all benchmarks
make bench BENCH_ARGS="--fast" # quick sanity check
make bench BENCH_ARGS="-o /tmp/before.json" # save baseline before a change
make bench BENCH_ARGS="-o /tmp/after.json" # save after a change
make bench-compare BENCH_BASE=/tmp/before.json BENCH_NEW=/tmp/after.json
```

New benchmark files go in `py/benchmarks/benches/bench_<name>.py`. Each must expose `main(runner: pyperf.Runner | None = None)`. Shared payload builders go in `py/benchmarks/fixtures.py`. See existing `bench_bt_json.py` for the pattern.

## Build Notes

Build from `py/`:
81 changes: 81 additions & 0 deletions CONTRIBUTING.md
@@ -147,6 +147,87 @@ Common ones include:

The `memory_logger` fixture from `braintrust.test_helpers` is useful for asserting on logged spans without a real Braintrust backend.

## Benchmarks

The SDK includes local performance benchmarks powered by [pyperf](https://pyperf.readthedocs.io/), located in `py/benchmarks/`. These cover hot paths like serialization and deep-copy routines.

### Running benchmarks

```bash
cd py

# Run all benchmarks
make bench

# Quick sanity check (fewer iterations)
make bench BENCH_ARGS="--fast"

# Save results for later comparison
make bench BENCH_ARGS="-o /tmp/results.json"

# Run a single benchmark module directly
python -m benchmarks.benches.bench_bt_json
```

To benchmark with the optional `orjson` fast path installed:

```bash
cd py
python -m uv pip install -e '.[performance]'
make bench
```

### Comparing across branches

```bash
cd py

git checkout main
make bench BENCH_ARGS="-o /tmp/main.json"

git checkout my-branch
make bench BENCH_ARGS="-o /tmp/branch.json"

make bench-compare BENCH_BASE=/tmp/main.json BENCH_NEW=/tmp/branch.json
```

### Useful pyperf flags

| Flag | Purpose |
| --------------- | ------------------------------------------------- |
| `--fast` | Fewer iterations — good for a quick sanity check |
| `--rigorous` | More iterations — reduces noise for final numbers |
| `-o FILE` | Write results to a JSON file for later comparison |
| `--append FILE` | Append to an existing results file |

Run `python -m benchmarks --help` for the full list.
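
For instance, `--append` can accumulate several runs into one results file:

```bash
cd py
make bench BENCH_ARGS="--append /tmp/results.json"
```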

### Adding a new benchmark

Drop a new `bench_<name>.py` file into `py/benchmarks/benches/`. It will be picked up automatically — no registration required.

Your module needs to expose a `main()` function that accepts an optional `pyperf.Runner`:

```python
import pyperf

from benchmarks._utils import disable_pyperf_psutil


def main(runner: pyperf.Runner | None = None) -> None:
    if runner is None:
        disable_pyperf_psutil()
        runner = pyperf.Runner()

    # my_func is the code path under test; my_arg is its input payload.
    runner.bench_func("my_benchmark", my_func, my_arg)


if __name__ == "__main__":
    main()
```

If your benchmark needs reusable test data, add builder functions to `py/benchmarks/fixtures.py`.
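
For example (illustrative only; `build_chat_payload` is not an existing helper):

```python
# py/benchmarks/fixtures.py (illustrative addition)
def build_chat_payload(n_messages: int = 50) -> dict:
    """Deterministic, chat-shaped payload that several benchmark cases can share."""
    return {
        "messages": [
            {"role": "user" if i % 2 == 0 else "assistant", "content": f"msg-{i}"}
            for i in range(n_messages)
        ]
    }
```

A benchmark module would then import it with `from benchmarks.fixtures import build_chat_payload` and pass the result to `runner.bench_func(...)`.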

## CI

GitHub Actions workflows live in `.github/workflows/`.
30 changes: 19 additions & 11 deletions Makefile
@@ -1,6 +1,6 @@
SHELL := /bin/bash

.PHONY: help develop install-dev install-deps fixup test test-core test-wheel lint pylint nox
.PHONY: help develop install-dev install-deps fixup test test-core test-wheel lint pylint nox bench bench-compare

develop: install-dev
mise exec -- pre-commit install
@@ -30,19 +30,27 @@ lint:
pylint:
mise exec -- $(MAKE) -C py pylint

bench:
mise exec -- $(MAKE) -C py bench BENCH_ARGS="$(BENCH_ARGS)"

bench-compare:
mise exec -- $(MAKE) -C py bench-compare BENCH_BASE="$(BENCH_BASE)" BENCH_NEW="$(BENCH_NEW)"

nox: test

help:
@echo "Available targets:"
@echo " develop - Install tools with mise, install py/ deps, and install pre-commit hooks"
@echo " fixup - Run pre-commit hooks across the repo"
@echo " install-deps - Install Python SDK dependencies via py/Makefile"
@echo " install-dev - Install pinned tools and create/update the repo env via mise"
@echo " lint - Run pre-commit hooks plus Python SDK pylint via py/Makefile"
@echo " pylint - Run Python SDK pylint only via py/Makefile"
@echo " nox - Alias for test"
@echo " test - Run the Python SDK nox matrix via py/Makefile"
@echo " test-core - Run Python SDK core tests via py/Makefile"
@echo " test-wheel - Run Python SDK wheel sanity tests via py/Makefile (requires a built wheel)"
@echo " bench - Run benchmarks via py/Makefile (pass extra flags via BENCH_ARGS=)"
@echo " bench-compare - Compare two benchmark results via py/Makefile (BENCH_BASE=... BENCH_NEW=...)"
@echo " develop - Install tools with mise, install py/ deps, and install pre-commit hooks"
@echo " fixup - Run pre-commit hooks across the repo"
@echo " install-deps - Install Python SDK dependencies via py/Makefile"
@echo " install-dev - Install pinned tools and create/update the repo env via mise"
@echo " lint - Run pre-commit hooks plus Python SDK pylint via py/Makefile"
@echo " pylint - Run Python SDK pylint only via py/Makefile"
@echo " nox - Alias for test"
@echo " test - Run the Python SDK nox matrix via py/Makefile"
@echo " test-core - Run Python SDK core tests via py/Makefile"
@echo " test-wheel - Run Python SDK wheel sanity tests via py/Makefile (requires a built wheel)"

.DEFAULT_GOAL := help