Merged
55 changes: 31 additions & 24 deletions README.md
@@ -1,40 +1,46 @@
# pytest-codingagents

**Combatting cargo cult programming in Agent Instructions, Skills, and Custom Agents for GitHub Copilot and other coding agents since 2026.**
**Test-driven prompt engineering for GitHub Copilot.**

Everyone's copying instruction files from blog posts, pasting "you are a senior engineer" into agent configs, and adding skills they found on Reddit. But does any of it actually work? Are your instructions making your coding agent better — or just longer? Is that skill helping, or is the agent ignoring it entirely?
Everyone copies instruction files from blog posts, adds "you are a senior engineer" to agent configs, and includes skills found on Reddit. But does any of it work? Are your instructions making your agent better — or just longer?

**You don't know, because you're not testing it.**

pytest-codingagents gives you **A/B testing for coding agent configurations**. Run two configs against the same task, assert the difference, and let AI analysis tell you which one wins — and why.
pytest-codingagents gives you a complete **test→optimize→test loop** for GitHub Copilot configurations:

1. **Write a test** — define what the agent *should* do
2. **Run it** — see it fail (or pass)
3. **Optimize** — call `optimize_instruction()` to get a concrete suggestion
4. **A/B confirm** — use `ab_run` to prove the change actually helps
5. **Ship it** — you now have evidence, not vibes

Currently supports **GitHub Copilot** via [copilot-sdk](https://www.npmjs.com/package/github-copilot-sdk). More agents (Claude Code, etc.) coming soon.

```python
from pytest_codingagents import CopilotAgent

async def test_fastapi_instruction_steers_framework(copilot_run, tmp_path):
"""Does 'always use FastAPI' actually change what the agent produces?"""
# Config A: generic instructions
baseline = CopilotAgent(
instructions="You are a Python developer.",
working_directory=str(tmp_path / "a"),
)
# Config B: framework mandate
with_fastapi = CopilotAgent(
instructions="You are a Python developer. ALWAYS use FastAPI for web APIs.",
working_directory=str(tmp_path / "b"),
from pytest_codingagents import CopilotAgent, optimize_instruction
import pytest


async def test_docstring_instruction_works(ab_run):
"""Prove the docstring instruction actually changes output, and get a fix if it doesn't."""
baseline = CopilotAgent(instructions="Write Python code.")
treatment = CopilotAgent(
instructions="Write Python code. Add Google-style docstrings to every function."
)
(tmp_path / "a").mkdir()
(tmp_path / "b").mkdir()

task = 'Create a web API with a GET /health endpoint returning {"status": "ok"}.'
result_a = await copilot_run(baseline, task)
result_b = await copilot_run(with_fastapi, task)
b, t = await ab_run(baseline, treatment, "Create math.py with add(a, b) and subtract(a, b).")

assert b.success and t.success

if '"""' not in t.file("math.py"):
suggestion = await optimize_instruction(
treatment.instructions or "",
t,
"Agent should add docstrings to every function.",
)
pytest.fail(f"Docstring instruction was ignored.\n\n{suggestion}")

assert result_a.success and result_b.success
code_b = "\n".join(f.read_text() for f in (tmp_path / "b").rglob("*.py"))
assert "fastapi" in code_b.lower(), "FastAPI instruction was ignored — the config has no effect"
assert '"""' not in b.file("math.py"), "Baseline should not have docstrings"
```

## Install
@@ -50,6 +56,7 @@ Authenticate via `GITHUB_TOKEN` env var (CI) or `gh auth status` (local).
| Capability | What it proves | Guide |
|---|---|---|
| **A/B comparison** | Config B actually produces different (and better) output than Config A | [Getting Started](https://sbroenne.github.io/pytest-codingagents/getting-started/) |
| **Instruction optimization** | Turn a failing test into a ready-to-use instruction fix | [Optimize Instructions](https://sbroenne.github.io/pytest-codingagents/how-to/optimize/) |
| **Instructions** | Your custom instructions change agent behavior — not just vibes | [Getting Started](https://sbroenne.github.io/pytest-codingagents/getting-started/) |
| **Skills** | That domain knowledge file is helping, not being ignored | [Skill Testing](https://sbroenne.github.io/pytest-codingagents/how-to/skills/) |
| **Models** | Which model works best for your use case and budget | [Model Comparison](https://sbroenne.github.io/pytest-codingagents/getting-started/model-comparison/) |
29 changes: 25 additions & 4 deletions docs/how-to/ab-testing.md
@@ -4,9 +4,32 @@ The core use case of pytest-codingagents is **A/B testing**: run the same task w

This stops cargo cult configuration — copying instructions and skills from blog posts without knowing if they work.

## The Pattern
## The `ab_run` Fixture

Every A/B test follows the same structure:
The `ab_run` fixture is the fastest way to write an A/B test. It handles directory isolation, sequential execution, and pytest-aitest reporting automatically:

```python
from pytest_codingagents import CopilotAgent


async def test_docstring_instruction(ab_run):
baseline = CopilotAgent(instructions="Write Python code.")
treatment = CopilotAgent(
instructions="Write Python code. Add Google-style docstrings to every function."
)

b, t = await ab_run(baseline, treatment, "Create math.py with add(a, b) and subtract(a, b).")

assert b.success and t.success
assert '"""' not in b.file("math.py"), "Baseline should not have docstrings"
assert '"""' in t.file("math.py"), "Treatment: docstring instruction was ignored"
```

`ab_run` automatically creates `baseline/` and `treatment/` subdirectories under `tmp_path`, overrides `working_directory` on each agent (so they never share a workspace), and runs them sequentially.

## The Manual Pattern

For full control — custom paths, conditional logic, more than two configs — use `copilot_run` directly:

```python
from pytest_codingagents import CopilotAgent
@@ -41,8 +64,6 @@ async def test_config_a_vs_config_b(copilot_run, tmp_path):

**The key rule**: assert something that is present in Config B *because of the change* and absent (or different) in Config A.
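
As a minimal illustration of the rule, with canned strings standing in for real agent output:

```python
# Canned outputs standing in for what the two agents might write.
code_a = "def health():\n    return {'status': 'ok'}\n"      # baseline: no framework
code_b = "from fastapi import FastAPI\napp = FastAPI()\n"    # treatment: FastAPI mandate

# The differential pair: absent without the instruction, present with it.
assert "fastapi" not in code_a.lower()
assert "fastapi" in code_b.lower()
```

A single positive assertion on `code_b` could pass even if the baseline also happened to use FastAPI; the pair is what shows the instruction caused the difference.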

---

## Testing Instructions

Does adding a documentation mandate actually change the code written?
1 change: 1 addition & 0 deletions docs/how-to/index.md
@@ -3,6 +3,7 @@
Practical guides for common tasks.

- [A/B Testing](ab-testing.md) — Prove that your config changes actually make a difference
- [Optimize Instructions](optimize.md) — Use AI to turn test failures into actionable instruction improvements
- [Assertions](assertions.md) — File helpers and semantic assertions with `llm_assert`
- [Load from Copilot Config](copilot-config.md) — Build a `CopilotAgent` from your real `.github/` config files
- [Skill Testing](skills.md) — Measure the impact of domain knowledge
124 changes: 124 additions & 0 deletions docs/how-to/optimize.md
@@ -0,0 +1,124 @@
# Optimizing Instructions with AI

`optimize_instruction()` closes the test→optimize→test loop.

When a test fails — the agent ignored an instruction or produced unexpected output — call `optimize_instruction()` to get a concrete, LLM-generated suggestion for improving the instruction. Drop the suggestion into `pytest.fail()` so the test failure message includes a ready-to-use fix.

## The Loop

```
write test → run → fail → optimize → update instruction → run → pass
```

This is **test-driven prompt engineering**: your tests define the standard; the optimizer helps you reach it.
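
The loop can be sketched as plain control flow. `run_agent`, `optimize`, and `passes` below are toy stand-ins for `copilot_run`, `optimize_instruction`, and your test's assertion, not library code:

```python
def run_agent(instruction: str) -> str:
    """Toy agent: only emits docstrings when the instruction demands them."""
    if "docstring" in instruction.lower():
        return 'def add(a, b):\n    """Add two numbers."""\n    return a + b\n'
    return "def add(a, b):\n    return a + b\n"


def optimize(instruction: str) -> str:
    """Toy optimizer: appends the missing mandate."""
    return instruction + " Add Google-style docstrings to every function."


def passes(code: str) -> bool:
    """The test's standard."""
    return '"""' in code


instruction = "Write Python code."
for attempt in range(3):                 # run the test
    code = run_agent(instruction)
    if passes(code):                     # pass: ship it
        break
    instruction = optimize(instruction)  # fail: optimize and re-run
```

In practice each iteration is a real agent run plus an optimizer round trip; the shape of the loop is the same.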

## Basic Usage

```python
import pytest
from pytest_codingagents import CopilotAgent, optimize_instruction


async def test_docstring_instruction(copilot_run, tmp_path):
agent = CopilotAgent(
instructions="Write Python code.",
working_directory=str(tmp_path),
)

result = await copilot_run(agent, "Create math.py with add(a, b) and subtract(a, b).")

if '"""' not in result.file("math.py"):
suggestion = await optimize_instruction(
agent.instructions or "",
result,
"Agent should add Google-style docstrings to every function.",
)
pytest.fail(f"No docstrings found.\n\n{suggestion}")
```

The failure message will look like:

```
FAILED test_math.py::test_docstring_instruction

No docstrings found.

💡 Suggested instruction:

Write Python code. Add Google-style docstrings to every function.
The docstring should describe what the function does, its parameters (Args:),
and its return value (Returns:).

Changes: Added explicit docstring format mandate with Args/Returns sections.
Reasoning: The original instruction did not mention documentation. The agent
produced code without docstrings because there was no requirement to add them.
```

## With A/B Testing

Pair `optimize_instruction()` with `ab_run` to test the fix before committing:

```python
import pytest
from pytest_codingagents import CopilotAgent, optimize_instruction


async def test_docstring_instruction_iterates(ab_run, tmp_path):
baseline = CopilotAgent(instructions="Write Python code.")
treatment = CopilotAgent(
instructions="Write Python code. Add Google-style docstrings to every function."
)

b, t = await ab_run(baseline, treatment, "Create math.py with add(a, b).")

assert b.success and t.success

if '"""' not in t.file("math.py"):
suggestion = await optimize_instruction(
treatment.instructions or "",
t,
"Treatment agent should add docstrings — treatment instruction did not work.",
)
pytest.fail(f"Treatment still has no docstrings.\n\n{suggestion}")

# Confirm baseline does NOT have docstrings (differential assertion)
assert '"""' not in b.file("math.py"), "Baseline unexpectedly has docstrings"
```

## API Reference

::: pytest_codingagents.copilot.optimizer.optimize_instruction

---

::: pytest_codingagents.copilot.optimizer.InstructionSuggestion

## Choosing a Model

`optimize_instruction()` defaults to `openai:gpt-4o-mini` — cheap, fast, and precise enough for instruction analysis.

Override with the `model` keyword argument:

```python
suggestion = await optimize_instruction(
agent.instructions or "",
result,
"Agent should use type hints.",
model="anthropic:claude-3-haiku-20240307",
)
```

Any [LiteLLM-compatible](https://docs.litellm.ai/docs/providers) model string works.

## The Criterion

Write the `criterion` as a plain-English statement of what the agent *should* have done:

| Situation | Good criterion |
|-----------|----------------|
| Missing docstrings | `"Agent should add Google-style docstrings to every function."` |
| Wrong framework | `"Agent should use FastAPI, not Flask."` |
| Missing type hints | `"All function signatures must include type annotations."` |
| No error handling | `"All I/O operations must be wrapped in try/except."` |

The more specific the criterion, the more actionable the suggestion.
57 changes: 32 additions & 25 deletions docs/index.md
@@ -1,40 +1,46 @@
# pytest-codingagents

**Combatting cargo cult programming in Agent Instructions, Skills, and Custom Agents for GitHub Copilot and other coding agents since 2026.**
**Test-driven prompt engineering for GitHub Copilot.**

Everyone's copying instruction files from blog posts, pasting "you are a senior engineer" into agent configs, and adding skills they found on Reddit. But does any of it actually work? Are your instructions making your coding agent better — or just longer? Is that skill helping, or is the agent ignoring it entirely?
Everyone copies instruction files from blog posts, adds "you are a senior engineer" to agent configs, and includes skills found on Reddit. But does any of it work? Are your instructions making your agent better — or just longer?

**You don't know, because you're not testing it.**

pytest-codingagents gives you **A/B testing for coding agent configurations**. Run two configs against the same task, assert the difference, and let AI analysis tell you which one wins — and why.
pytest-codingagents gives you a complete **test→optimize→test loop** for GitHub Copilot configurations:

1. **Write a test** — define what the agent *should* do
2. **Run it** — see it fail (or pass)
3. **Optimize** — call `optimize_instruction()` to get a concrete suggestion
4. **A/B confirm** — use `ab_run` to prove the change actually helps
5. **Ship it** — you now have evidence, not vibes

Currently supports **GitHub Copilot** via [copilot-sdk](https://www.npmjs.com/package/github-copilot-sdk). More agents (Claude Code, etc.) coming soon.

```python
from pytest_codingagents import CopilotAgent

async def test_fastapi_instruction_steers_framework(copilot_run, tmp_path):
"""Does 'always use FastAPI' actually change what the agent produces?"""
# Config A: generic instructions
baseline = CopilotAgent(
instructions="You are a Python developer.",
working_directory=str(tmp_path / "a"),
)
# Config B: framework mandate
with_fastapi = CopilotAgent(
instructions="You are a Python developer. ALWAYS use FastAPI for web APIs.",
working_directory=str(tmp_path / "b"),
from pytest_codingagents import CopilotAgent, optimize_instruction
import pytest


async def test_docstring_instruction_works(ab_run):
"""Prove the docstring instruction actually changes output, and get a fix if it doesn't."""
baseline = CopilotAgent(instructions="Write Python code.")
treatment = CopilotAgent(
instructions="Write Python code. Add Google-style docstrings to every function."
)
(tmp_path / "a").mkdir()
(tmp_path / "b").mkdir()

task = 'Create a web API with a GET /health endpoint returning {"status": "ok"}.'
result_a = await copilot_run(baseline, task)
result_b = await copilot_run(with_fastapi, task)
b, t = await ab_run(baseline, treatment, "Create math.py with add(a, b) and subtract(a, b).")

assert b.success and t.success

if '"""' not in t.file("math.py"):
suggestion = await optimize_instruction(
treatment.instructions or "",
t,
"Agent should add docstrings to every function.",
)
pytest.fail(f"Docstring instruction was ignored.\n\n{suggestion}")

assert result_a.success and result_b.success
code_b = "\n".join(f.read_text() for f in (tmp_path / "b").rglob("*.py"))
assert "fastapi" in code_b.lower(), "FastAPI instruction was ignored — the config has no effect"
assert '"""' not in b.file("math.py"), "Baseline should not have docstrings"
```

## Install
@@ -50,6 +56,7 @@ Authenticate via `GITHUB_TOKEN` env var (CI) or `gh auth status` (local).
| Capability | What it proves | Guide |
|---|---|---|
| **A/B comparison** | Config B actually produces different (and better) output than Config A | [A/B Testing](how-to/ab-testing.md) |
| **Instruction optimization** | Turn a failing test into a ready-to-use instruction fix | [Optimize Instructions](how-to/optimize.md) |
| **Instructions** | Your custom instructions change agent behavior — not just vibes | [Getting Started](getting-started/index.md) |
| **Skills** | That domain knowledge file is helping, not being ignored | [Skill Testing](how-to/skills.md) |
| **Models** | Which model works best for your use case and budget | [Model Comparison](getting-started/model-comparison.md) |
@@ -78,6 +85,6 @@ uv run pytest tests/ --aitest-html=report.html --aitest-summary-model=azure/gpt-
Full docs at **[sbroenne.github.io/pytest-codingagents](https://sbroenne.github.io/pytest-codingagents/)** — API reference, how-to guides, and demo reports.

- [Getting Started](getting-started/index.md) — Install and write your first test
- [How-To Guides](how-to/index.md) — Skills, MCP servers, CLI tools, and more
- [How-To Guides](how-to/index.md) — A/B testing, instruction optimization, skills, MCP, and more
- [Demo Reports](demo/index.md) — See real HTML reports with AI analysis
- [API Reference](reference/api.md) — Full API documentation
8 changes: 8 additions & 0 deletions docs/reference/api.md
@@ -7,3 +7,11 @@
::: pytest_codingagents.CopilotResult
options:
show_source: false

::: pytest_codingagents.optimize_instruction
options:
show_source: false

::: pytest_codingagents.InstructionSuggestion
options:
show_source: false
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -77,6 +77,7 @@ nav:
- How-To Guides:
- Overview: how-to/index.md
- A/B Testing: how-to/ab-testing.md
- Optimize Instructions: how-to/optimize.md
- Assertions: how-to/assertions.md
- Load from Copilot Config: how-to/copilot-config.md
- Skill Testing: how-to/skills.md
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

[project]
name = "pytest-codingagents"
version = "0.1.2"
version = "0.2.0"
description = "Pytest plugin for testing real coding agents via their SDK"
readme = "README.md"
license = { text = "MIT" }
@@ -31,6 +31,7 @@ dependencies = [
"pytest-aitest>=0.5.6",
"azure-identity>=1.25.2",
"pyyaml>=6.0",
"pydantic-ai>=1.0",
]

[project.optional-dependencies]
3 changes: 3 additions & 0 deletions src/pytest_codingagents/__init__.py
@@ -4,11 +4,14 @@

from pytest_codingagents.copilot.agent import CopilotAgent
from pytest_codingagents.copilot.agents import load_custom_agent, load_custom_agents
from pytest_codingagents.copilot.optimizer import InstructionSuggestion, optimize_instruction
from pytest_codingagents.copilot.result import CopilotResult

__all__ = [
"CopilotAgent",
"CopilotResult",
"InstructionSuggestion",
"load_custom_agent",
"load_custom_agents",
"optimize_instruction",
]