101 changes: 101 additions & 0 deletions .agents/skills/sdk-benchmarking/SKILL.md
@@ -0,0 +1,101 @@
---
name: sdk-benchmarking
description: Run, compare, and extend Braintrust Python SDK pyperf benchmarks. Use when touching hot-path code in `py/src/braintrust/` such as serialization, deep-copy, span creation, or logging; when adding or updating files under `py/benchmarks/`; or when you need baseline/branch performance measurements with `cd py && make bench` and `make bench-compare`.
---

# SDK Benchmarking

Use this skill for benchmark work in the Braintrust Python SDK repository.

Benchmark support already exists in `py/benchmarks/`. Once you have identified the relevant benchmark surface, use the current repo workflow rather than digging through commit history.

## Read First

Always read:

- `AGENTS.md`
- `CONTRIBUTING.md`
- `py/Makefile`
- `py/benchmarks/__main__.py`
- `py/benchmarks/_utils.py`
- `py/benchmarks/benches/__init__.py`

Read when relevant:

- `py/benchmarks/benches/bench_bt_json.py` for the module pattern
- `py/benchmarks/fixtures.py` for shared payload builders
- `py/setup.py` when benchmarking the optional `orjson` fast path
- `references/benchmark-patterns.md` in this skill for command and module templates

## Workflow

1. Identify the hot path or API surface that changed.
2. Find the nearest existing benchmark module under `py/benchmarks/benches/`.
3. Run the narrowest useful benchmark first.
4. Add or update a `bench_*.py` module only if the current suite does not cover the changed path.
5. Reuse or extend `py/benchmarks/fixtures.py` for realistic shared payloads instead of inlining bulky test data.
6. Save before/after results and compare them when the task is about regression detection or improvement claims.

## Commands

Run benchmarks from `py/`:

```bash
cd py
make bench
make bench BENCH_ARGS="--fast"
make bench BENCH_ARGS="-o /tmp/before.json"
make bench BENCH_ARGS="-o /tmp/after.json"
make bench-compare BENCH_BASE=/tmp/before.json BENCH_NEW=/tmp/after.json
python -m benchmarks.benches.bench_bt_json
```

Use `python -m benchmarks --help` for extra `pyperf` flags.

If the benchmark should measure the optional `orjson` path, install the performance extra first:

```bash
cd py
python -m uv pip install -e '.[performance]'
```

## Adding Benchmarks

Put new modules in `py/benchmarks/benches/` and name them `bench_<name>.py`.

Every benchmark module must:

- expose `main(runner: pyperf.Runner | None = None) -> None`
- create its own `pyperf.Runner()` only when `runner` is `None`
- call `disable_pyperf_psutil()` before creating that runner
- register benchmarks with stable, descriptive names via `runner.bench_func(...)`
- remain executable directly with `python -m benchmarks.benches.bench_<name>`

Do not add manual registration. `python -m benchmarks` auto-discovers every `bench_*.py` module in `py/benchmarks/benches/`.
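
A minimal sketch of that contract (the `bench_example` module name and `serialize_small` target are illustrative; see `references/benchmark-patterns.md` in this skill for the full template with the `sys.path` shim):

```python
import pyperf

from benchmarks._utils import disable_pyperf_psutil


def serialize_small(payload):
    # Illustrative stand-in for the hot function under test.
    return str(payload)


def main(runner: pyperf.Runner | None = None) -> None:
    if runner is None:
        # Standalone execution: create our own runner. Under `python -m benchmarks`,
        # a shared runner is passed in instead.
        disable_pyperf_psutil()
        runner = pyperf.Runner()

    runner.bench_func("example.serialize_small[dict]", serialize_small, {"a": 1})


if __name__ == "__main__":
    main()
```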

## Fixtures

Keep reusable payload builders and synthetic objects in `py/benchmarks/fixtures.py`.

Prefer fixture helpers when:

- several benchmark cases share similar payloads
- the inputs are large enough to distract from the benchmark itself
- you need variants such as small, medium, large, circular, or non-string-key cases

Keep fixture builders deterministic and focused on representative data shapes.
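
As an illustration only (the name `build_nested_payload` is hypothetical, not an existing fixture), a deterministic builder might look like:

```python
def build_nested_payload(depth: int = 3, width: int = 4) -> dict:
    """Hypothetical builder: a deterministic nested dict shaped like real log payloads."""
    if depth == 0:
        return {f"leaf_{i}": i for i in range(width)}
    return {f"node_{depth}_{i}": build_nested_payload(depth - 1, width) for i in range(width)}
```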

## Validation

- Run the narrowest affected benchmark first.
- Use `BENCH_ARGS="--fast"` for quick local sanity checks while iterating.
- Save JSON outputs and use `make bench-compare` for baseline versus branch comparisons.
- If you changed code paths that also have correctness tests, run the smallest relevant test target in addition to the benchmark.

## Pitfalls

- Measuring import/setup overhead instead of the hot function under test (see the sketch after this list).
- Inlining ad hoc payload construction in each benchmark instead of reusing fixtures.
- Forgetting the standalone `main()` pattern, which breaks both auto-discovery and direct `python -m` execution.
- Claiming performance changes from a single unsaved local run instead of comparing saved results.
- Benchmarking the `orjson` fast path without explicitly installing `.[performance]`.
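
A sketch of the first pitfall and its fix (all names here are illustrative):

```python
import pyperf


def serialize(payload):
    # Illustrative stand-in for the hot function being measured.
    return str(payload)


def build_large_payload():
    return {"items": list(range(10_000))}


runner = pyperf.Runner()

# Pitfall: payload construction is timed together with the hot function.
runner.bench_func("serialize[inline-setup]", lambda: serialize(build_large_payload()))

# Better: build the payload once so only `serialize` is timed.
payload = build_large_payload()
runner.bench_func("serialize[prebuilt]", serialize, payload)
```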
4 changes: 4 additions & 0 deletions .agents/skills/sdk-benchmarking/agents/openai.yaml
@@ -0,0 +1,4 @@
interface:
  display_name: "SDK Benchmarking"
  short_description: "Run and extend Braintrust SDK benchmarks"
  default_prompt: "Use $sdk-benchmarking to run, compare, or add Braintrust Python SDK benchmarks."
93 changes: 93 additions & 0 deletions .agents/skills/sdk-benchmarking/references/benchmark-patterns.md
@@ -0,0 +1,93 @@
# Benchmark Patterns

Use this reference when adding or updating SDK benchmarks.

## Command Cheatsheet

```bash
cd py

# Run everything
make bench

# Faster local iteration
make bench BENCH_ARGS="--fast"

# Save results for comparison
make bench BENCH_ARGS="-o /tmp/before.json"
make bench BENCH_ARGS="-o /tmp/after.json"
make bench-compare BENCH_BASE=/tmp/before.json BENCH_NEW=/tmp/after.json

# Run one module directly
python -m benchmarks.benches.bench_bt_json

# Inspect all forwarded pyperf flags
python -m benchmarks --help
```

## Module Skeleton

```python
import pathlib
import sys

import pyperf


if __package__ in (None, ""):
    sys.path.insert(0, str(pathlib.Path(__file__).resolve().parents[2]))

from benchmarks._utils import disable_pyperf_psutil


def target(value):
    return value


def main(runner: pyperf.Runner | None = None) -> None:
    if runner is None:
        disable_pyperf_psutil()
        runner = pyperf.Runner()

    runner.bench_func("example.target[case-name]", target, "value")


if __name__ == "__main__":
    main()
```

Follow the existing `py/benchmarks/benches/bench_bt_json.py` pattern when importing repo code. The `sys.path` adjustment keeps direct module execution working from inside `py/`.

## Fixture Guidance

Put reusable builders in `py/benchmarks/fixtures.py` when:

- several benchmark cases need the same payload shape
- the payload should model realistic nested SDK inputs
- the benchmark should cover edge cases such as circular references or non-string keys

Current fixture patterns already cover:

- small, medium, and large nested payloads
- circular structures
- non-string dictionary keys
- dataclass-like and pydantic-like values

Extend those helpers before creating one-off payload factories in a new benchmark module.
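
For example, a circular-structure case could reuse a shared builder like this (the helper name is illustrative, not an existing fixture):

```python
def build_circular_payload() -> dict:
    """Hypothetical builder: a small payload that references itself."""
    payload = {"id": "span-1", "metadata": {"tags": ["a", "b"]}}
    payload["self"] = payload  # circular reference back to the root
    return payload
```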

## Comparison Workflow

For branch-to-branch comparisons:

```bash
cd py
git checkout main
make bench BENCH_ARGS="-o /tmp/main.json"

git checkout my-branch
make bench BENCH_ARGS="-o /tmp/branch.json"

make bench-compare BENCH_BASE=/tmp/main.json BENCH_NEW=/tmp/branch.json
```

Use `--rigorous` only when you need lower-noise final numbers; use `--fast` while iterating.
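
For example, a final low-noise run saved for comparison might look like:

```bash
cd py
make bench BENCH_ARGS="--rigorous -o /tmp/final.json"
```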
20 changes: 20 additions & 0 deletions AGENTS.md
@@ -12,6 +12,7 @@ Guide for contributing to the Braintrust Python SDK repository.
## Repo Map

- `py/`: main Python package, tests, examples, nox sessions, release build
- `py/benchmarks/`: pyperf performance benchmarks
- `integrations/`: separate integration packages
- `internal/golden/`: compatibility and golden projects
- `docs/`: supporting docs
@@ -122,6 +123,25 @@ BRAINTRUST_CLAUDE_AGENT_SDK_RECORD_MODE=all nox -s "test_claude_agent_sdk(latest

Only re-record HTTP or subprocess cassettes when the behavior change is intentional. If in doubt, ask the user.

## Benchmarks

Run `cd py && make bench` when touching hot-path code (serialization, deep-copy, span creation, logging). Not required for every change.

Benchmarks use pyperf. All `bench_*.py` files in `py/benchmarks/benches/` are auto-discovered — no registration needed.

Key commands:

```bash
cd py
make bench # run all benchmarks
make bench BENCH_ARGS="--fast" # quick sanity check
make bench BENCH_ARGS="-o /tmp/before.json" # save baseline before a change
make bench BENCH_ARGS="-o /tmp/after.json" # save after a change
make bench-compare BENCH_BASE=/tmp/before.json BENCH_NEW=/tmp/after.json
```

New benchmark files go in `py/benchmarks/benches/bench_<name>.py`. Each must expose `main(runner: pyperf.Runner | None = None)`. Shared payload builders go in `py/benchmarks/fixtures.py`. See existing `bench_bt_json.py` for the pattern.

## Build Notes

Build from `py/`:
81 changes: 81 additions & 0 deletions CONTRIBUTING.md
@@ -147,6 +147,87 @@ Common ones include:

The `memory_logger` fixture from `braintrust.test_helpers` is useful for asserting on logged spans without a real Braintrust backend.

## Benchmarks

The SDK includes local performance benchmarks powered by [pyperf](https://pyperf.readthedocs.io/), located in `py/benchmarks/`. These cover hot paths like serialization and deep-copy routines.

### Running benchmarks

```bash
cd py

# Run all benchmarks
make bench

# Quick sanity check (fewer iterations)
make bench BENCH_ARGS="--fast"

# Save results for later comparison
make bench BENCH_ARGS="-o /tmp/results.json"

# Run a single benchmark module directly
python -m benchmarks.benches.bench_bt_json
```

To benchmark with the optional `orjson` fast path installed:

```bash
cd py
python -m uv pip install -e '.[performance]'
make bench
```

### Comparing across branches

```bash
cd py

git checkout main
make bench BENCH_ARGS="-o /tmp/main.json"

git checkout my-branch
make bench BENCH_ARGS="-o /tmp/branch.json"

make bench-compare BENCH_BASE=/tmp/main.json BENCH_NEW=/tmp/branch.json
```

### Useful pyperf flags

| Flag | Purpose |
| --------------- | ------------------------------------------------- |
| `--fast` | Fewer iterations — good for a quick sanity check |
| `--rigorous` | More iterations — reduces noise for final numbers |
| `-o FILE` | Write results to a JSON file for later comparison |
| `--append FILE` | Append to an existing results file |

Run `python -m benchmarks --help` for the full list.
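
For instance, `--append` can accumulate several runs into one results file:

```bash
cd py
make bench BENCH_ARGS="--append /tmp/results.json"
```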

### Adding a new benchmark

Drop a new `bench_<name>.py` file into `py/benchmarks/benches/`. It will be picked up automatically — no registration required.

Your module needs to expose a `main()` function that accepts an optional `pyperf.Runner`:

```python
import pyperf

from benchmarks._utils import disable_pyperf_psutil


def main(runner: pyperf.Runner | None = None) -> None:
    if runner is None:
        disable_pyperf_psutil()
        runner = pyperf.Runner()

    # my_func is the code path under test; my_arg is its input payload.
    runner.bench_func("my_benchmark", my_func, my_arg)


if __name__ == "__main__":
    main()
```

If your benchmark needs reusable test data, add builder functions to `py/benchmarks/fixtures.py`.
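
For example (illustrative only; `build_chat_payload` is not an existing helper):

```python
# py/benchmarks/fixtures.py (illustrative addition)
def build_chat_payload(n_messages: int = 50) -> dict:
    """Deterministic, chat-shaped payload that several benchmark cases can share."""
    return {
        "messages": [
            {"role": "user" if i % 2 == 0 else "assistant", "content": f"msg-{i}"}
            for i in range(n_messages)
        ]
    }
```

A benchmark module would then import it with `from benchmarks.fixtures import build_chat_payload` and pass the result to `runner.bench_func(...)`.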

## CI

GitHub Actions workflows live in `.github/workflows/`.
30 changes: 19 additions & 11 deletions Makefile
@@ -1,6 +1,6 @@
SHELL := /bin/bash

.PHONY: help develop install-dev install-deps fixup test test-core test-wheel lint pylint nox
.PHONY: help develop install-dev install-deps fixup test test-core test-wheel lint pylint nox bench bench-compare

develop: install-dev
mise exec -- pre-commit install
@@ -30,19 +30,27 @@ lint:
pylint:
mise exec -- $(MAKE) -C py pylint

bench:
mise exec -- $(MAKE) -C py bench BENCH_ARGS="$(BENCH_ARGS)"

bench-compare:
mise exec -- $(MAKE) -C py bench-compare BENCH_BASE="$(BENCH_BASE)" BENCH_NEW="$(BENCH_NEW)"

nox: test

help:
@echo "Available targets:"
@echo " develop - Install tools with mise, install py/ deps, and install pre-commit hooks"
@echo " fixup - Run pre-commit hooks across the repo"
@echo " install-deps - Install Python SDK dependencies via py/Makefile"
@echo " install-dev - Install pinned tools and create/update the repo env via mise"
@echo " lint - Run pre-commit hooks plus Python SDK pylint via py/Makefile"
@echo " pylint - Run Python SDK pylint only via py/Makefile"
@echo " nox - Alias for test"
@echo " test - Run the Python SDK nox matrix via py/Makefile"
@echo " test-core - Run Python SDK core tests via py/Makefile"
@echo " test-wheel - Run Python SDK wheel sanity tests via py/Makefile (requires a built wheel)"
@echo " bench - Run benchmarks via py/Makefile (pass extra flags via BENCH_ARGS=)"
@echo " bench-compare - Compare two benchmark results via py/Makefile (BENCH_BASE=... BENCH_NEW=...)"
@echo " develop - Install tools with mise, install py/ deps, and install pre-commit hooks"
@echo " fixup - Run pre-commit hooks across the repo"
@echo " install-deps - Install Python SDK dependencies via py/Makefile"
@echo " install-dev - Install pinned tools and create/update the repo env via mise"
@echo " lint - Run pre-commit hooks plus Python SDK pylint via py/Makefile"
@echo " pylint - Run Python SDK pylint only via py/Makefile"
@echo " nox - Alias for test"
@echo " test - Run the Python SDK nox matrix via py/Makefile"
@echo " test-core - Run Python SDK core tests via py/Makefile"
@echo " test-wheel - Run Python SDK wheel sanity tests via py/Makefile (requires a built wheel)"

.DEFAULT_GOAL := help