Merged
55 changes: 31 additions & 24 deletions README.md
@@ -1,40 +1,46 @@
# pytest-codingagents

**Combatting cargo cult programming in Agent Instructions, Skills, and Custom Agents for GitHub Copilot and other coding agents since 2026.**
**Test-driven prompt engineering for GitHub Copilot.**

Everyone's copying instruction files from blog posts, pasting "you are a senior engineer" into agent configs, and adding skills they found on Reddit. But does any of it actually work? Are your instructions making your coding agent better — or just longer? Is that skill helping, or is the agent ignoring it entirely?
Everyone copies instruction files from blog posts, adds "you are a senior engineer" to agent configs, and includes skills found on Reddit. But does any of it work? Are your instructions making your agent better — or just longer?

**You don't know, because you're not testing it.**

pytest-codingagents gives you **A/B testing for coding agent configurations**. Run two configs against the same task, assert the difference, and let AI analysis tell you which one wins — and why.
pytest-codingagents gives you a complete **test→optimize→test loop** for GitHub Copilot configurations:

1. **Write a test** — define what the agent *should* do
2. **Run it** — see it fail (or pass)
3. **Optimize** — call `optimize_instruction()` to get a concrete suggestion
4. **A/B confirm** — use `ab_run` to prove the change actually helps
5. **Ship it** — you now have evidence, not vibes

Currently supports **GitHub Copilot** via [copilot-sdk](https://www.npmjs.com/package/github-copilot-sdk). More agents (Claude Code, etc.) coming soon.

```python
from pytest_codingagents import CopilotAgent

async def test_fastapi_instruction_steers_framework(copilot_run, tmp_path):
"""Does 'always use FastAPI' actually change what the agent produces?"""
# Config A: generic instructions
baseline = CopilotAgent(
instructions="You are a Python developer.",
working_directory=str(tmp_path / "a"),
)
# Config B: framework mandate
with_fastapi = CopilotAgent(
instructions="You are a Python developer. ALWAYS use FastAPI for web APIs.",
working_directory=str(tmp_path / "b"),
from pytest_codingagents import CopilotAgent, optimize_instruction
import pytest


async def test_docstring_instruction_works(ab_run):
"""Prove the docstring instruction actually changes output, and get a fix if it doesn't."""
baseline = CopilotAgent(instructions="Write Python code.")
treatment = CopilotAgent(
instructions="Write Python code. Add Google-style docstrings to every function."
)
(tmp_path / "a").mkdir()
(tmp_path / "b").mkdir()

task = 'Create a web API with a GET /health endpoint returning {"status": "ok"}.'
result_a = await copilot_run(baseline, task)
result_b = await copilot_run(with_fastapi, task)
b, t = await ab_run(baseline, treatment, "Create math.py with add(a, b) and subtract(a, b).")

assert b.success and t.success

if '"""' not in t.file("math.py"):
suggestion = await optimize_instruction(
treatment.instructions or "",
t,
"Agent should add docstrings to every function.",
)
pytest.fail(f"Docstring instruction was ignored.\n\n{suggestion}")

assert result_a.success and result_b.success
code_b = "\n".join(f.read_text() for f in (tmp_path / "b").rglob("*.py"))
assert "fastapi" in code_b.lower(), "FastAPI instruction was ignored — the config has no effect"
assert '"""' not in b.file("math.py"), "Baseline should not have docstrings"
```

## Install
@@ -50,6 +56,7 @@ Authenticate via `GITHUB_TOKEN` env var (CI) or `gh auth status` (local).
| Capability | What it proves | Guide |
|---|---|---|
| **A/B comparison** | Config B actually produces different (and better) output than Config A | [Getting Started](https://sbroenne.github.io/pytest-codingagents/getting-started/) |
| **Instruction optimization** | Turn a failing test into a ready-to-use instruction fix | [Optimize Instructions](https://sbroenne.github.io/pytest-codingagents/how-to/optimize/) |
| **Instructions** | Your custom instructions change agent behavior — not just vibes | [Getting Started](https://sbroenne.github.io/pytest-codingagents/getting-started/) |
| **Skills** | That domain knowledge file is helping, not being ignored | [Skill Testing](https://sbroenne.github.io/pytest-codingagents/how-to/skills/) |
| **Models** | Which model works best for your use case and budget | [Model Comparison](https://sbroenne.github.io/pytest-codingagents/getting-started/model-comparison/) |
29 changes: 25 additions & 4 deletions docs/how-to/ab-testing.md
@@ -4,9 +4,32 @@ The core use case of pytest-codingagents is **A/B testing**: run the same task w

This stops cargo cult configuration — copying instructions and skills from blog posts without knowing if they work.

## The Pattern
## The `ab_run` Fixture

Every A/B test follows the same structure:
The `ab_run` fixture is the fastest way to write an A/B test. It handles directory isolation, sequential execution, and pytest-aitest reporting automatically:

```python
from pytest_codingagents import CopilotAgent


async def test_docstring_instruction(ab_run):
baseline = CopilotAgent(instructions="Write Python code.")
treatment = CopilotAgent(
instructions="Write Python code. Add Google-style docstrings to every function."
)

b, t = await ab_run(baseline, treatment, "Create math.py with add(a, b) and subtract(a, b).")

assert b.success and t.success
assert '"""' not in b.file("math.py"), "Baseline should not have docstrings"
assert '"""' in t.file("math.py"), "Treatment: docstring instruction was ignored"
```

`ab_run` automatically creates `baseline/` and `treatment/` subdirectories under `tmp_path`, overrides `working_directory` on each agent (so they never share a workspace), and runs them sequentially.

## The Manual Pattern

For full control — custom paths, conditional logic, more than two configs — use `copilot_run` directly:

```python
from pytest_codingagents import CopilotAgent
@@ -41,8 +64,6 @@ async def test_config_a_vs_config_b(copilot_run, tmp_path):

**The key rule**: assert something that is present in Config B *because of the change* and absent (or different) in Config A.
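
As a minimal illustration of the rule, with canned strings standing in for real agent output:

```python
# Canned outputs standing in for what the two agents might write.
code_a = "def health():\n    return {'status': 'ok'}\n"      # baseline: no framework
code_b = "from fastapi import FastAPI\napp = FastAPI()\n"    # treatment: FastAPI mandate

# The differential pair: absent without the instruction, present with it.
assert "fastapi" not in code_a.lower()
assert "fastapi" in code_b.lower()
```

A single positive assertion on `code_b` could pass even if the baseline also happened to use FastAPI; the pair is what shows the instruction caused the difference.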

---

## Testing Instructions

Does adding a documentation mandate actually change the code written?
1 change: 1 addition & 0 deletions docs/how-to/index.md
@@ -3,6 +3,7 @@
Practical guides for common tasks.

- [A/B Testing](ab-testing.md) — Prove that your config changes actually make a difference
- [Optimize Instructions](optimize.md) — Use AI to turn test failures into actionable instruction improvements
- [Assertions](assertions.md) — File helpers and semantic assertions with `llm_assert`
- [Load from Copilot Config](copilot-config.md) — Build a `CopilotAgent` from your real `.github/` config files
- [Skill Testing](skills.md) — Measure the impact of domain knowledge
124 changes: 124 additions & 0 deletions docs/how-to/optimize.md
@@ -0,0 +1,124 @@
# Optimizing Instructions with AI

`optimize_instruction()` closes the test→optimize→test loop.

When a test fails — the agent ignored an instruction or produced unexpected output — call `optimize_instruction()` to get a concrete, LLM-generated suggestion for improving the instruction. Drop the suggestion into `pytest.fail()` so the test failure message includes a ready-to-use fix.

## The Loop

```
write test → run → fail → optimize → update instruction → run → pass
```

This is **test-driven prompt engineering**: your tests define the standard; the optimizer helps you reach it.
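
The loop can be sketched as plain control flow. `run_agent`, `optimize`, and `passes` below are toy stand-ins for `copilot_run`, `optimize_instruction`, and your test's assertion, not library code:

```python
def run_agent(instruction: str) -> str:
    """Toy agent: only emits docstrings when the instruction demands them."""
    if "docstring" in instruction.lower():
        return 'def add(a, b):\n    """Add two numbers."""\n    return a + b\n'
    return "def add(a, b):\n    return a + b\n"


def optimize(instruction: str) -> str:
    """Toy optimizer: appends the missing mandate."""
    return instruction + " Add Google-style docstrings to every function."


def passes(code: str) -> bool:
    """The test's standard."""
    return '"""' in code


instruction = "Write Python code."
for attempt in range(3):                 # run the test
    code = run_agent(instruction)
    if passes(code):                     # pass: ship it
        break
    instruction = optimize(instruction)  # fail: optimize and re-run
```

In practice each iteration is a real agent run plus an optimizer round trip; the shape of the loop is the same.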

## Basic Usage

```python
import pytest
from pytest_codingagents import CopilotAgent, optimize_instruction


async def test_docstring_instruction(copilot_run, tmp_path):
agent = CopilotAgent(
instructions="Write Python code.",
working_directory=str(tmp_path),
)

result = await copilot_run(agent, "Create math.py with add(a, b) and subtract(a, b).")

if '"""' not in result.file("math.py"):
suggestion = await optimize_instruction(
agent.instructions or "",
result,
"Agent should add Google-style docstrings to every function.",
)
pytest.fail(f"No docstrings found.\n\n{suggestion}")
```

The failure message will look like:

```
FAILED test_math.py::test_docstring_instruction

No docstrings found.

💡 Suggested instruction:

Write Python code. Add Google-style docstrings to every function.
The docstring should describe what the function does, its parameters (Args:),
and its return value (Returns:).

Changes: Added explicit docstring format mandate with Args/Returns sections.
Reasoning: The original instruction did not mention documentation. The agent
produced code without docstrings because there was no requirement to add them.
```

## With A/B Testing

Pair `optimize_instruction()` with `ab_run` to test the fix before committing:

```python
import pytest
from pytest_codingagents import CopilotAgent, optimize_instruction


async def test_docstring_instruction_iterates(ab_run, tmp_path):
baseline = CopilotAgent(instructions="Write Python code.")
treatment = CopilotAgent(
instructions="Write Python code. Add Google-style docstrings to every function."
)

b, t = await ab_run(baseline, treatment, "Create math.py with add(a, b).")

assert b.success and t.success

if '"""' not in t.file("math.py"):
suggestion = await optimize_instruction(
treatment.instructions or "",
t,
"Treatment agent should add docstrings — treatment instruction did not work.",
)
pytest.fail(f"Treatment still has no docstrings.\n\n{suggestion}")

# Confirm baseline does NOT have docstrings (differential assertion)
assert '"""' not in b.file("math.py"), "Baseline unexpectedly has docstrings"
```

## API Reference

::: pytest_codingagents.copilot.optimizer.optimize_instruction

---

::: pytest_codingagents.copilot.optimizer.InstructionSuggestion

## Choosing a Model

`optimize_instruction()` defaults to `openai:gpt-4o-mini` — cheap, fast, and precise enough for instruction analysis.

Override with the `model` keyword argument:

```python
suggestion = await optimize_instruction(
agent.instructions or "",
result,
"Agent should use type hints.",
model="anthropic:claude-3-haiku-20240307",
)
```

Any [LiteLLM-compatible](https://docs.litellm.ai/docs/providers) model string works.

## The Criterion

Write the `criterion` as a plain-English statement of what the agent *should* have done:

| Situation | Good criterion |
|-----------|----------------|
| Missing docstrings | `"Agent should add Google-style docstrings to every function."` |
| Wrong framework | `"Agent should use FastAPI, not Flask."` |
| Missing type hints | `"All function signatures must include type annotations."` |
| No error handling | `"All I/O operations must be wrapped in try/except."` |

The more specific the criterion, the more actionable the suggestion.
57 changes: 32 additions & 25 deletions docs/index.md
@@ -1,40 +1,46 @@
# pytest-codingagents

**Combatting cargo cult programming in Agent Instructions, Skills, and Custom Agents for GitHub Copilot and other coding agents since 2026.**
**Test-driven prompt engineering for GitHub Copilot.**

Everyone's copying instruction files from blog posts, pasting "you are a senior engineer" into agent configs, and adding skills they found on Reddit. But does any of it actually work? Are your instructions making your coding agent better — or just longer? Is that skill helping, or is the agent ignoring it entirely?
Everyone copies instruction files from blog posts, adds "you are a senior engineer" to agent configs, and includes skills found on Reddit. But does any of it work? Are your instructions making your agent better — or just longer?

**You don't know, because you're not testing it.**

pytest-codingagents gives you **A/B testing for coding agent configurations**. Run two configs against the same task, assert the difference, and let AI analysis tell you which one wins — and why.
pytest-codingagents gives you a complete **test→optimize→test loop** for GitHub Copilot configurations:

1. **Write a test** — define what the agent *should* do
2. **Run it** — see it fail (or pass)
3. **Optimize** — call `optimize_instruction()` to get a concrete suggestion
4. **A/B confirm** — use `ab_run` to prove the change actually helps
5. **Ship it** — you now have evidence, not vibes

Currently supports **GitHub Copilot** via [copilot-sdk](https://www.npmjs.com/package/github-copilot-sdk). More agents (Claude Code, etc.) coming soon.

```python
from pytest_codingagents import CopilotAgent

async def test_fastapi_instruction_steers_framework(copilot_run, tmp_path):
"""Does 'always use FastAPI' actually change what the agent produces?"""
# Config A: generic instructions
baseline = CopilotAgent(
instructions="You are a Python developer.",
working_directory=str(tmp_path / "a"),
)
# Config B: framework mandate
with_fastapi = CopilotAgent(
instructions="You are a Python developer. ALWAYS use FastAPI for web APIs.",
working_directory=str(tmp_path / "b"),
from pytest_codingagents import CopilotAgent, optimize_instruction
import pytest


async def test_docstring_instruction_works(ab_run):
"""Prove the docstring instruction actually changes output, and get a fix if it doesn't."""
baseline = CopilotAgent(instructions="Write Python code.")
treatment = CopilotAgent(
instructions="Write Python code. Add Google-style docstrings to every function."
)
(tmp_path / "a").mkdir()
(tmp_path / "b").mkdir()

task = 'Create a web API with a GET /health endpoint returning {"status": "ok"}.'
result_a = await copilot_run(baseline, task)
result_b = await copilot_run(with_fastapi, task)
b, t = await ab_run(baseline, treatment, "Create math.py with add(a, b) and subtract(a, b).")

assert b.success and t.success

if '"""' not in t.file("math.py"):
suggestion = await optimize_instruction(
treatment.instructions or "",
t,
"Agent should add docstrings to every function.",
)
pytest.fail(f"Docstring instruction was ignored.\n\n{suggestion}")

assert result_a.success and result_b.success
code_b = "\n".join(f.read_text() for f in (tmp_path / "b").rglob("*.py"))
assert "fastapi" in code_b.lower(), "FastAPI instruction was ignored — the config has no effect"
assert '"""' not in b.file("math.py"), "Baseline should not have docstrings"
```

## Install
@@ -50,6 +56,7 @@ Authenticate via `GITHUB_TOKEN` env var (CI) or `gh auth status` (local).
| Capability | What it proves | Guide |
|---|---|---|
| **A/B comparison** | Config B actually produces different (and better) output than Config A | [A/B Testing](how-to/ab-testing.md) |
| **Instruction optimization** | Turn a failing test into a ready-to-use instruction fix | [Optimize Instructions](how-to/optimize.md) |
| **Instructions** | Your custom instructions change agent behavior — not just vibes | [Getting Started](getting-started/index.md) |
| **Skills** | That domain knowledge file is helping, not being ignored | [Skill Testing](how-to/skills.md) |
| **Models** | Which model works best for your use case and budget | [Model Comparison](getting-started/model-comparison.md) |
@@ -78,6 +85,6 @@ uv run pytest tests/ --aitest-html=report.html --aitest-summary-model=azure/gpt-
Full docs at **[sbroenne.github.io/pytest-codingagents](https://sbroenne.github.io/pytest-codingagents/)** — API reference, how-to guides, and demo reports.

- [Getting Started](getting-started/index.md) — Install and write your first test
- [How-To Guides](how-to/index.md) — Skills, MCP servers, CLI tools, and more
- [How-To Guides](how-to/index.md) — A/B testing, instruction optimization, skills, MCP, and more
- [Demo Reports](demo/index.md) — See real HTML reports with AI analysis
- [API Reference](reference/api.md) — Full API documentation
8 changes: 8 additions & 0 deletions docs/reference/api.md
@@ -7,3 +7,11 @@
::: pytest_codingagents.CopilotResult
options:
show_source: false

::: pytest_codingagents.optimize_instruction
options:
show_source: false

::: pytest_codingagents.InstructionSuggestion
options:
show_source: false
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -77,6 +77,7 @@ nav:
- How-To Guides:
- Overview: how-to/index.md
- A/B Testing: how-to/ab-testing.md
- Optimize Instructions: how-to/optimize.md
- Assertions: how-to/assertions.md
- Load from Copilot Config: how-to/copilot-config.md
- Skill Testing: how-to/skills.md
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

[project]
name = "pytest-codingagents"
version = "0.1.2"
version = "0.2.0"
description = "Pytest plugin for testing real coding agents via their SDK"
readme = "README.md"
license = { text = "MIT" }
@@ -31,6 +31,7 @@ dependencies = [
"pytest-aitest>=0.5.6",
"azure-identity>=1.25.2",
"pyyaml>=6.0",
"pydantic-ai>=1.0",
]

[project.optional-dependencies]
3 changes: 3 additions & 0 deletions src/pytest_codingagents/__init__.py
@@ -4,11 +4,14 @@

from pytest_codingagents.copilot.agent import CopilotAgent
from pytest_codingagents.copilot.agents import load_custom_agent, load_custom_agents
from pytest_codingagents.copilot.optimizer import InstructionSuggestion, optimize_instruction
from pytest_codingagents.copilot.result import CopilotResult

__all__ = [
"CopilotAgent",
"CopilotResult",
"InstructionSuggestion",
"load_custom_agent",
"load_custom_agents",
"optimize_instruction",
]