diff --git a/README.md b/README.md index 4e2b559..fdff77f 100644 --- a/README.md +++ b/README.md @@ -1,40 +1,46 @@ # pytest-codingagents -**Combatting cargo cult programming in Agent Instructions, Skills, and Custom Agents for GitHub Copilot and other coding agents since 2026.** +**Test-driven prompt engineering for GitHub Copilot.** -Everyone's copying instruction files from blog posts, pasting "you are a senior engineer" into agent configs, and adding skills they found on Reddit. But does any of it actually work? Are your instructions making your coding agent better — or just longer? Is that skill helping, or is the agent ignoring it entirely? +Everyone copies instruction files from blog posts, adds "you are a senior engineer" to agent configs, and includes skills found on Reddit. But does any of it work? Are your instructions making your agent better — or just longer? **You don't know, because you're not testing it.** -pytest-codingagents gives you **A/B testing for coding agent configurations**. Run two configs against the same task, assert the difference, and let AI analysis tell you which one wins — and why. +pytest-codingagents gives you a complete **test→optimize→test loop** for GitHub Copilot configurations: + +1. **Write a test** — define what the agent *should* do +2. **Run it** — see it fail (or pass) +3. **Optimize** — call `optimize_instruction()` to get a concrete suggestion +4. **A/B confirm** — use `ab_run` to prove the change actually helps +5. **Ship it** — you now have evidence, not vibes Currently supports **GitHub Copilot** via [copilot-sdk](https://www.npmjs.com/package/github-copilot-sdk). More agents (Claude Code, etc.) coming soon. 
```python -from pytest_codingagents import CopilotAgent - -async def test_fastapi_instruction_steers_framework(copilot_run, tmp_path): - """Does 'always use FastAPI' actually change what the agent produces?""" - # Config A: generic instructions - baseline = CopilotAgent( - instructions="You are a Python developer.", - working_directory=str(tmp_path / "a"), - ) - # Config B: framework mandate - with_fastapi = CopilotAgent( - instructions="You are a Python developer. ALWAYS use FastAPI for web APIs.", - working_directory=str(tmp_path / "b"), +from pytest_codingagents import CopilotAgent, optimize_instruction +import pytest + + +async def test_docstring_instruction_works(ab_run): + """Prove the docstring instruction actually changes output, and get a fix if it doesn't.""" + baseline = CopilotAgent(instructions="Write Python code.") + treatment = CopilotAgent( + instructions="Write Python code. Add Google-style docstrings to every function." ) - (tmp_path / "a").mkdir() - (tmp_path / "b").mkdir() - task = 'Create a web API with a GET /health endpoint returning {"status": "ok"}.' - result_a = await copilot_run(baseline, task) - result_b = await copilot_run(with_fastapi, task) + b, t = await ab_run(baseline, treatment, "Create math.py with add(a, b) and subtract(a, b).") + + assert b.success and t.success + + if '"""' not in t.file("math.py"): + suggestion = await optimize_instruction( + treatment.instructions or "", + t, + "Agent should add docstrings to every function.", + ) + pytest.fail(f"Docstring instruction was ignored.\n\n{suggestion}") - assert result_a.success and result_b.success - code_b = "\n".join(f.read_text() for f in (tmp_path / "b").rglob("*.py")) - assert "fastapi" in code_b.lower(), "FastAPI instruction was ignored — the config has no effect" + assert '"""' not in b.file("math.py"), "Baseline should not have docstrings" ``` ## Install @@ -50,6 +56,7 @@ Authenticate via `GITHUB_TOKEN` env var (CI) or `gh auth status` (local). 
| Capability | What it proves | Guide | |---|---|---| | **A/B comparison** | Config B actually produces different (and better) output than Config A | [Getting Started](https://sbroenne.github.io/pytest-codingagents/getting-started/) | +| **Instruction optimization** | Turn a failing test into a ready-to-use instruction fix | [Optimize Instructions](https://sbroenne.github.io/pytest-codingagents/how-to/optimize/) | | **Instructions** | Your custom instructions change agent behavior — not just vibes | [Getting Started](https://sbroenne.github.io/pytest-codingagents/getting-started/) | | **Skills** | That domain knowledge file is helping, not being ignored | [Skill Testing](https://sbroenne.github.io/pytest-codingagents/how-to/skills/) | | **Models** | Which model works best for your use case and budget | [Model Comparison](https://sbroenne.github.io/pytest-codingagents/getting-started/model-comparison/) | diff --git a/docs/how-to/ab-testing.md b/docs/how-to/ab-testing.md index cb9567a..a6ab6b9 100644 --- a/docs/how-to/ab-testing.md +++ b/docs/how-to/ab-testing.md @@ -4,9 +4,32 @@ The core use case of pytest-codingagents is **A/B testing**: run the same task w This stops cargo cult configuration — copying instructions and skills from blog posts without knowing if they work. -## The Pattern +## The `ab_run` Fixture -Every A/B test follows the same structure: +The `ab_run` fixture is the fastest way to write an A/B test. It handles directory isolation, sequential execution, and aitest reporting automatically: + +```python +from pytest_codingagents import CopilotAgent + + +async def test_docstring_instruction(ab_run): + baseline = CopilotAgent(instructions="Write Python code.") + treatment = CopilotAgent( + instructions="Write Python code. Add Google-style docstrings to every function." 
+ ) + + b, t = await ab_run(baseline, treatment, "Create math.py with add(a, b) and subtract(a, b).") + + assert b.success and t.success + assert '"""' not in b.file("math.py"), "Baseline should not have docstrings" + assert '"""' in t.file("math.py"), "Treatment: docstring instruction was ignored" +``` + +`ab_run` automatically creates `baseline/` and `treatment/` subdirectories under `tmp_path`, overrides `working_directory` on each agent (so they never share a workspace), and runs them sequentially. + +## The Manual Pattern + +For full control — custom paths, conditional logic, more than two configs — use `copilot_run` directly: ```python from pytest_codingagents import CopilotAgent @@ -41,8 +64,6 @@ async def test_config_a_vs_config_b(copilot_run, tmp_path): **The key rule**: assert something that is present in Config B *because of the change* and absent (or different) in Config A. ---- - ## Testing Instructions Does adding a documentation mandate actually change the code written? diff --git a/docs/how-to/index.md b/docs/how-to/index.md index 4f2cf57..b1840ea 100644 --- a/docs/how-to/index.md +++ b/docs/how-to/index.md @@ -3,6 +3,7 @@ Practical guides for common tasks. - [A/B Testing](ab-testing.md) — Prove that your config changes actually make a difference +- [Optimize Instructions](optimize.md) — Use AI to turn test failures into actionable instruction improvements - [Assertions](assertions.md) — File helpers and semantic assertions with `llm_assert` - [Load from Copilot Config](copilot-config.md) — Build a `CopilotAgent` from your real `.github/` config files - [Skill Testing](skills.md) — Measure the impact of domain knowledge diff --git a/docs/how-to/optimize.md b/docs/how-to/optimize.md new file mode 100644 index 0000000..964c9c2 --- /dev/null +++ b/docs/how-to/optimize.md @@ -0,0 +1,124 @@ +# Optimizing Instructions with AI + +`optimize_instruction()` closes the test→optimize→test loop. 
+ +When a test fails — the agent ignored an instruction or produced unexpected output — call `optimize_instruction()` to get a concrete, LLM-generated suggestion for improving the instruction. Drop the suggestion into `pytest.fail()` so the test failure message includes a ready-to-use fix. + +## The Loop + +``` +write test → run → fail → optimize → update instruction → run → pass +``` + +This is **test-driven prompt engineering**: your tests define the standard; the optimizer helps you reach it. + +## Basic Usage + +```python +import pytest +from pytest_codingagents import CopilotAgent, optimize_instruction + + +async def test_docstring_instruction(copilot_run, tmp_path): + agent = CopilotAgent( + instructions="Write Python code.", + working_directory=str(tmp_path), + ) + + result = await copilot_run(agent, "Create math.py with add(a, b) and subtract(a, b).") + + if '"""' not in result.file("math.py"): + suggestion = await optimize_instruction( + agent.instructions or "", + result, + "Agent should add Google-style docstrings to every function.", + ) + pytest.fail(f"No docstrings found.\n\n{suggestion}") +``` + +The failure message will look like: + +``` +FAILED test_math.py::test_docstring_instruction + +No docstrings found. + +💡 Suggested instruction: + + Write Python code. Add Google-style docstrings to every function. + The docstring should describe what the function does, its parameters (Args:), + and its return value (Returns:). + + Changes: Added explicit docstring format mandate with Args/Returns sections. + Reasoning: The original instruction did not mention documentation. The agent + produced code without docstrings because there was no requirement to add them. 
+```
+
+## With A/B Testing
+
+Pair `optimize_instruction()` with `ab_run` to test the fix before committing:
+
+```python
+import pytest
+from pytest_codingagents import CopilotAgent, optimize_instruction
+
+
+async def test_docstring_instruction_iterates(ab_run):
+    baseline = CopilotAgent(instructions="Write Python code.")
+    treatment = CopilotAgent(
+        instructions="Write Python code. Add Google-style docstrings to every function."
+    )
+
+    b, t = await ab_run(baseline, treatment, "Create math.py with add(a, b).")
+
+    assert b.success and t.success
+
+    if '"""' not in t.file("math.py"):
+        suggestion = await optimize_instruction(
+            treatment.instructions or "",
+            t,
+            "Agent should add Google-style docstrings to every function.",
+        )
+        pytest.fail(f"Treatment still has no docstrings.\n\n{suggestion}")
+
+    # Confirm the baseline does NOT have docstrings (differential assertion)
+    assert '"""' not in b.file("math.py"), "Baseline unexpectedly has docstrings"
+```
+
+## API Reference
+
+::: pytest_codingagents.copilot.optimizer.optimize_instruction
+
+---
+
+::: pytest_codingagents.copilot.optimizer.InstructionSuggestion
+
+## Choosing a Model
+
+`optimize_instruction()` defaults to `openai:gpt-4o-mini` — cheap, fast, and precise enough for instruction analysis.
+
+Override with the `model` keyword argument:
+
+```python
+suggestion = await optimize_instruction(
+    agent.instructions or "",
+    result,
+    "Agent should use type hints.",
+    model="anthropic:claude-3-haiku-20240307",
+)
+```
+
+Any [pydantic-ai model string](https://ai.pydantic.dev/models/) works (e.g. `provider:model-name`).
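
## Stricter Docstring Checks

The examples in this guide use a simple substring check (`'"""' in t.file("math.py")`). When your criterion is "every function must have a docstring", a small stdlib helper can verify that directly with `ast`. The helper below (`functions_missing_docstrings`) is not part of pytest-codingagents; it is a sketch you can adapt in your own tests:

```python
import ast


def functions_missing_docstrings(source: str) -> list[str]:
    """Return the names of functions in *source* that have no docstring."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if ast.get_docstring(node) is None:
                missing.append(node.name)
    return missing


code = '''
def add(a, b):
    """Add two numbers."""
    return a + b


def subtract(a, b):
    return a - b
'''

print(functions_missing_docstrings(code))  # ['subtract']
```

In a test, `assert not functions_missing_docstrings(t.file("math.py"))` lists the exact offenders on failure, which also makes the `criterion` you pass to `optimize_instruction()` easier to state precisely.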
+ +## The Criterion + +Write the `criterion` as a plain-English statement of what the agent *should* have done: + +| Situation | Good criterion | +|-----------|----------------| +| Missing docstrings | `"Agent should add Google-style docstrings to every function."` | +| Wrong framework | `"Agent should use FastAPI, not Flask."` | +| Missing type hints | `"All function signatures must include type annotations."` | +| No error handling | `"All I/O operations must be wrapped in try/except."` | + +The more specific the criterion, the more actionable the suggestion. diff --git a/docs/index.md b/docs/index.md index 2725b9e..5f7286e 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,40 +1,46 @@ # pytest-codingagents -**Combatting cargo cult programming in Agent Instructions, Skills, and Custom Agents for GitHub Copilot and other coding agents since 2026.** +**Test-driven prompt engineering for GitHub Copilot.** -Everyone's copying instruction files from blog posts, pasting "you are a senior engineer" into agent configs, and adding skills they found on Reddit. But does any of it actually work? Are your instructions making your coding agent better — or just longer? Is that skill helping, or is the agent ignoring it entirely? +Everyone copies instruction files from blog posts, adds "you are a senior engineer" to agent configs, and includes skills found on Reddit. But does any of it work? Are your instructions making your agent better — or just longer? **You don't know, because you're not testing it.** -pytest-codingagents gives you **A/B testing for coding agent configurations**. Run two configs against the same task, assert the difference, and let AI analysis tell you which one wins — and why. +pytest-codingagents gives you a complete **test→optimize→test loop** for GitHub Copilot configurations: + +1. **Write a test** — define what the agent *should* do +2. **Run it** — see it fail (or pass) +3. 
**Optimize** — call `optimize_instruction()` to get a concrete suggestion +4. **A/B confirm** — use `ab_run` to prove the change actually helps +5. **Ship it** — you now have evidence, not vibes Currently supports **GitHub Copilot** via [copilot-sdk](https://www.npmjs.com/package/github-copilot-sdk). More agents (Claude Code, etc.) coming soon. ```python -from pytest_codingagents import CopilotAgent - -async def test_fastapi_instruction_steers_framework(copilot_run, tmp_path): - """Does 'always use FastAPI' actually change what the agent produces?""" - # Config A: generic instructions - baseline = CopilotAgent( - instructions="You are a Python developer.", - working_directory=str(tmp_path / "a"), - ) - # Config B: framework mandate - with_fastapi = CopilotAgent( - instructions="You are a Python developer. ALWAYS use FastAPI for web APIs.", - working_directory=str(tmp_path / "b"), +from pytest_codingagents import CopilotAgent, optimize_instruction +import pytest + + +async def test_docstring_instruction_works(ab_run): + """Prove the docstring instruction actually changes output, and get a fix if it doesn't.""" + baseline = CopilotAgent(instructions="Write Python code.") + treatment = CopilotAgent( + instructions="Write Python code. Add Google-style docstrings to every function." ) - (tmp_path / "a").mkdir() - (tmp_path / "b").mkdir() - task = 'Create a web API with a GET /health endpoint returning {"status": "ok"}.' 
- result_a = await copilot_run(baseline, task) - result_b = await copilot_run(with_fastapi, task) + b, t = await ab_run(baseline, treatment, "Create math.py with add(a, b) and subtract(a, b).") + + assert b.success and t.success + + if '"""' not in t.file("math.py"): + suggestion = await optimize_instruction( + treatment.instructions or "", + t, + "Agent should add docstrings to every function.", + ) + pytest.fail(f"Docstring instruction was ignored.\n\n{suggestion}") - assert result_a.success and result_b.success - code_b = "\n".join(f.read_text() for f in (tmp_path / "b").rglob("*.py")) - assert "fastapi" in code_b.lower(), "FastAPI instruction was ignored — the config has no effect" + assert '"""' not in b.file("math.py"), "Baseline should not have docstrings" ``` ## Install @@ -50,6 +56,7 @@ Authenticate via `GITHUB_TOKEN` env var (CI) or `gh auth status` (local). | Capability | What it proves | Guide | |---|---|---| | **A/B comparison** | Config B actually produces different (and better) output than Config A | [A/B Testing](how-to/ab-testing.md) | +| **Instruction optimization** | Turn a failing test into a ready-to-use instruction fix | [Optimize Instructions](how-to/optimize.md) | | **Instructions** | Your custom instructions change agent behavior — not just vibes | [Getting Started](getting-started/index.md) | | **Skills** | That domain knowledge file is helping, not being ignored | [Skill Testing](how-to/skills.md) | | **Models** | Which model works best for your use case and budget | [Model Comparison](getting-started/model-comparison.md) | @@ -78,6 +85,6 @@ uv run pytest tests/ --aitest-html=report.html --aitest-summary-model=azure/gpt- Full docs at **[sbroenne.github.io/pytest-codingagents](https://sbroenne.github.io/pytest-codingagents/)** — API reference, how-to guides, and demo reports. 
- [Getting Started](getting-started/index.md) — Install and write your first test -- [How-To Guides](how-to/index.md) — Skills, MCP servers, CLI tools, and more +- [How-To Guides](how-to/index.md) — A/B testing, instruction optimization, skills, MCP, and more - [Demo Reports](demo/index.md) — See real HTML reports with AI analysis - [API Reference](reference/api.md) — Full API documentation diff --git a/docs/reference/api.md b/docs/reference/api.md index a116478..8f16a53 100644 --- a/docs/reference/api.md +++ b/docs/reference/api.md @@ -7,3 +7,11 @@ ::: pytest_codingagents.CopilotResult options: show_source: false + +::: pytest_codingagents.optimize_instruction + options: + show_source: false + +::: pytest_codingagents.InstructionSuggestion + options: + show_source: false diff --git a/mkdocs.yml b/mkdocs.yml index cc55caf..2a7a478 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -77,6 +77,7 @@ nav: - How-To Guides: - Overview: how-to/index.md - A/B Testing: how-to/ab-testing.md + - Optimize Instructions: how-to/optimize.md - Assertions: how-to/assertions.md - Load from Copilot Config: how-to/copilot-config.md - Skill Testing: how-to/skills.md diff --git a/pyproject.toml b/pyproject.toml index 9c85e07..d7f4de9 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "hatchling.build" [project] name = "pytest-codingagents" -version = "0.1.2" +version = "0.2.0" description = "Pytest plugin for testing real coding agents via their SDK" readme = "README.md" license = { text = "MIT" } @@ -31,6 +31,7 @@ dependencies = [ "pytest-aitest>=0.5.6", "azure-identity>=1.25.2", "pyyaml>=6.0", + "pydantic-ai>=1.0", ] [project.optional-dependencies] diff --git a/src/pytest_codingagents/__init__.py b/src/pytest_codingagents/__init__.py index e07cd94..2d93271 100644 --- a/src/pytest_codingagents/__init__.py +++ b/src/pytest_codingagents/__init__.py @@ -4,11 +4,14 @@ from pytest_codingagents.copilot.agent import CopilotAgent from pytest_codingagents.copilot.agents 
import load_custom_agent, load_custom_agents +from pytest_codingagents.copilot.optimizer import InstructionSuggestion, optimize_instruction from pytest_codingagents.copilot.result import CopilotResult __all__ = [ "CopilotAgent", "CopilotResult", + "InstructionSuggestion", "load_custom_agent", "load_custom_agents", + "optimize_instruction", ] diff --git a/src/pytest_codingagents/copilot/fixtures.py b/src/pytest_codingagents/copilot/fixtures.py index 3449041..31045d5 100644 --- a/src/pytest_codingagents/copilot/fixtures.py +++ b/src/pytest_codingagents/copilot/fixtures.py @@ -2,10 +2,15 @@ Provides the ``copilot_run`` fixture that executes prompts against Copilot and stashes results for pytest-aitest reporting (if installed). + +Also provides ``ab_run``, a higher-level fixture for A/B testing two agent +configurations against the same task in isolated directories. """ from __future__ import annotations +import dataclasses +from pathlib import Path from typing import TYPE_CHECKING, Any import pytest @@ -131,3 +136,66 @@ def _stash_for_aitest( as the ``item`` parameter. """ stash_on_item(request.node, agent, result) # type: ignore[arg-type] + + +@pytest.fixture +def ab_run( + request: pytest.FixtureRequest, + tmp_path: Path, +) -> Callable[..., Coroutine[Any, Any, tuple[CopilotResult, CopilotResult]]]: + """Run two agents against the same task in isolated directories. + + Creates ``baseline/`` and ``treatment/`` subdirectories under + ``tmp_path``, overrides ``working_directory`` on each agent so they + never share a workspace, then runs them sequentially and stashes the + treatment result for pytest-aitest reporting. + + Example:: + + async def test_docstring_instruction(ab_run): + baseline = CopilotAgent(instructions="Write Python code.") + treatment = CopilotAgent( + instructions="Write Python code. Add Google-style docstrings to every function." 
+ ) + + b, t = await ab_run(baseline, treatment, "Create math.py with add(a, b).") + + assert b.success and t.success + assert '\"\"\"' not in b.file("math.py"), "Baseline should not have docstrings" + assert '\"\"\"' in t.file("math.py"), "Treatment should add docstrings" + + Args: + baseline: Control ``CopilotAgent`` (the existing / unchanged config). + treatment: Treatment ``CopilotAgent`` (the change you are testing). + task: Prompt to give both agents. + + Returns: + ``(baseline_result, treatment_result)`` tuple. + """ + + async def _run( + baseline: CopilotAgent, + treatment: CopilotAgent, + task: str, + ) -> tuple[CopilotResult, CopilotResult]: + baseline_dir = tmp_path / "baseline" + treatment_dir = tmp_path / "treatment" + baseline_dir.mkdir(exist_ok=True) + treatment_dir.mkdir(exist_ok=True) + + # Override working directories to guarantee isolation. + # CopilotAgent is frozen — dataclasses.replace() creates a new instance. + baseline = dataclasses.replace(baseline, working_directory=str(baseline_dir)) + treatment = dataclasses.replace(treatment, working_directory=str(treatment_dir)) + + # Run sequentially — agents may write to disk, install packages, etc. + baseline_result = await run_copilot(baseline, task) + treatment_result = await run_copilot(treatment, task) + + # Stash treatment result for pytest-aitest reporting. + # Treatment is the config being evaluated; its result is what matters. + stash_on_item(request.node, treatment, treatment_result) + + return baseline_result, treatment_result + + return _run diff --git a/src/pytest_codingagents/copilot/optimizer.py b/src/pytest_codingagents/copilot/optimizer.py new file mode 100644 index 0000000..1d2fb99 --- /dev/null +++ b/src/pytest_codingagents/copilot/optimizer.py @@ -0,0 +1,153 @@ +"""Instruction optimizer for test-driven prompt engineering. 
+ +Provides :func:`optimize_instruction`, which uses an LLM to analyze the gap +between a current agent instruction and the observed behavior, and suggests a +concrete improvement. + +Requires ``pydantic-ai``: + + uv add pydantic-ai +""" + +from __future__ import annotations + +from dataclasses import dataclass +from typing import TYPE_CHECKING + +from pydantic import BaseModel + +if TYPE_CHECKING: + from pytest_codingagents.copilot.result import CopilotResult + +__all__ = ["InstructionSuggestion", "optimize_instruction"] + + +@dataclass +class InstructionSuggestion: + """A suggested improvement to a Copilot agent instruction. + + Returned by :func:`optimize_instruction`. Designed to drop into + ``pytest.fail()`` so the failure message includes an actionable fix. + + Attributes: + instruction: The improved instruction text to use instead. + reasoning: Explanation of why this change would close the gap. + changes: Short description of what was changed (one sentence). + + Example:: + + suggestion = await optimize_instruction( + agent.instructions, + result, + "Agent should add docstrings to all functions.", + ) + pytest.fail(f"No docstrings found.\\n\\n{suggestion}") + """ + + instruction: str + reasoning: str + changes: str + + def __str__(self) -> str: + return ( + f"💡 Suggested instruction:\n\n" + f" {self.instruction}\n\n" + f" Changes: {self.changes}\n" + f" Reasoning: {self.reasoning}" + ) + + +class _OptimizationOutput(BaseModel): + """Structured output schema for the optimizer LLM call.""" + + instruction: str + reasoning: str + changes: str + + +async def optimize_instruction( + current_instruction: str, + result: CopilotResult, + criterion: str, + *, + model: str = "openai:gpt-4o-mini", +) -> InstructionSuggestion: + """Analyze a result and suggest an improved instruction. + + Uses pydantic-ai structured output to analyze the gap between a + current instruction and the agent's observed behavior, returning a + concrete, actionable improvement. 
+
+    Designed to drop into ``pytest.fail()`` so the failure message
+    contains a ready-to-use fix:
+
+    Example::
+
+        result = await copilot_run(agent, task)
+        if '\"\"\"' not in result.file("main.py"):
+            suggestion = await optimize_instruction(
+                agent.instructions or "",
+                result,
+                "Agent should add docstrings to all functions.",
+            )
+            pytest.fail(f"No docstrings found.\\n\\n{suggestion}")
+
+    Args:
+        current_instruction: The agent's current instruction text.
+        result: The ``CopilotResult`` from the (failed) run.
+        criterion: What the agent *should* have done — the test expectation
+            in plain English (e.g. ``"Always write docstrings"``).
+        model: pydantic-ai model string (e.g. ``"openai:gpt-4o-mini"``
+            or ``"anthropic:claude-3-haiku-20240307"``).
+
+    Returns:
+        An :class:`InstructionSuggestion` with the improved instruction.
+
+    Raises:
+        ImportError: If pydantic-ai is not installed.
+    """
+    try:
+        from pydantic_ai import Agent as PydanticAgent
+    except ImportError as exc:
+        msg = (
+            "pydantic-ai is required for optimize_instruction(). "
+            "Install it with: uv add pydantic-ai"
+        )
+        raise ImportError(msg) from exc
+
+    final_output = result.final_response or "(no response)"
+    tool_calls = ", ".join(sorted(result.tool_names_called)) or "none"
+
+    prompt = f"""You are helping improve a GitHub Copilot agent instruction.
+
+## Current instruction
+{current_instruction or "(no instruction)"}
+
+## What actually happened
+The agent produced:
+{final_output[:1500]}
+
+Tools called: {tool_calls}
+Run succeeded: {result.success}
+
+## Expected criterion
+The agent SHOULD have satisfied this criterion:
+{criterion}
+
+Analyze the gap between the instruction and the observed behavior.
+Suggest a specific, concise, directive improvement to the instruction
+that would make the agent satisfy the criterion.
+Keep the instruction under 200 words. 
Do not add unrelated rules.""" + + optimizer_agent = PydanticAgent(model, output_type=_OptimizationOutput) + run_result = await optimizer_agent.run(prompt) + output = run_result.output + + return InstructionSuggestion( + instruction=output.instruction, + reasoning=output.reasoning, + changes=output.changes, + ) diff --git a/src/pytest_codingagents/plugin.py b/src/pytest_codingagents/plugin.py index 4c60719..52ba3e2 100644 --- a/src/pytest_codingagents/plugin.py +++ b/src/pytest_codingagents/plugin.py @@ -12,13 +12,13 @@ import pytest -# Re-export the fixture so pytest discovers it via the plugin entry point. -from pytest_codingagents.copilot.fixtures import copilot_run +# Re-export fixtures so pytest discovers them via the plugin entry point. +from pytest_codingagents.copilot.fixtures import ab_run, copilot_run if TYPE_CHECKING: from _pytest.nodes import Item -__all__ = ["copilot_run"] +__all__ = ["ab_run", "copilot_run"] _ANALYSIS_PROMPT_PATH = Path(__file__).parent / "prompts" / "coding_agent_analysis.md" diff --git a/tests/unit/test_fixtures.py b/tests/unit/test_fixtures.py new file mode 100644 index 0000000..aeff803 --- /dev/null +++ b/tests/unit/test_fixtures.py @@ -0,0 +1,159 @@ +"""Unit tests for the ab_run fixture.""" + +from __future__ import annotations + +from unittest.mock import AsyncMock, patch + +import pytest + +from pytest_codingagents.copilot.agent import CopilotAgent +from pytest_codingagents.copilot.result import CopilotResult + + +def _make_result(success: bool = True) -> CopilotResult: + return CopilotResult(success=success) + + +class TestAbRunFixture: + """Tests for the ab_run fixture.""" + + @pytest.fixture + def baseline_agent(self) -> CopilotAgent: + return CopilotAgent(name="baseline", instructions="Write plain Python.") + + @pytest.fixture + def treatment_agent(self) -> CopilotAgent: + return CopilotAgent(name="treatment", instructions="Write Python with docstrings.") + + async def test_returns_tuple_of_two_results( + self, ab_run, 
baseline_agent, treatment_agent, tmp_path + ): + """ab_run returns a 2-tuple of CopilotResult.""" + baseline_result = _make_result(success=True) + treatment_result = _make_result(success=True) + + with patch( + "pytest_codingagents.copilot.fixtures.run_copilot", + new=AsyncMock(side_effect=[baseline_result, treatment_result]), + ): + b, t = await ab_run(baseline_agent, treatment_agent, "Write math.py") + + assert isinstance(b, CopilotResult) + assert isinstance(t, CopilotResult) + assert b is baseline_result + assert t is treatment_result + + async def test_creates_isolated_directories( + self, ab_run, baseline_agent, treatment_agent, tmp_path + ): + """ab_run creates baseline/ and treatment/ under tmp_path.""" + captured_agents = [] + + async def _capture(agent, task): + captured_agents.append(agent) + return _make_result() + + with patch("pytest_codingagents.copilot.fixtures.run_copilot", side_effect=_capture): + await ab_run(baseline_agent, treatment_agent, "some task") + + baseline_dir = tmp_path / "baseline" + treatment_dir = tmp_path / "treatment" + assert baseline_dir.exists() + assert treatment_dir.exists() + + async def test_overrides_working_directory_on_both_agents(self, ab_run, tmp_path): + """ab_run overrides working_directory regardless of original value.""" + original_baseline = CopilotAgent(name="b", working_directory="/original/b") + original_treatment = CopilotAgent(name="t", working_directory="/original/t") + + captured: list[CopilotAgent] = [] + + async def _capture(agent, task): + captured.append(agent) + return _make_result() + + with patch("pytest_codingagents.copilot.fixtures.run_copilot", side_effect=_capture): + await ab_run(original_baseline, original_treatment, "task") + + assert captured[0].working_directory == str(tmp_path / "baseline") + assert captured[1].working_directory == str(tmp_path / "treatment") + + async def test_agents_without_working_directory_get_isolated_dirs(self, ab_run, tmp_path): + """Agents with no working_directory 
still get isolated dirs."""
+        baseline = CopilotAgent(name="b")
+        treatment = CopilotAgent(name="t")
+
+        captured: list[CopilotAgent] = []
+
+        async def _capture(agent, task):
+            captured.append(agent)
+            return _make_result()
+
+        with patch("pytest_codingagents.copilot.fixtures.run_copilot", side_effect=_capture):
+            await ab_run(baseline, treatment, "task")
+
+        assert captured[0].working_directory == str(tmp_path / "baseline")
+        assert captured[1].working_directory == str(tmp_path / "treatment")
+
+    async def test_runs_baseline_before_treatment(self, ab_run, tmp_path):
+        """ab_run runs baseline first, then treatment (sequential)."""
+        call_order: list[str] = []
+
+        async def _capture(agent, task):
+            call_order.append(agent.name)
+            return _make_result()
+
+        baseline = CopilotAgent(name="baseline")
+        treatment = CopilotAgent(name="treatment")
+
+        with patch("pytest_codingagents.copilot.fixtures.run_copilot", side_effect=_capture):
+            await ab_run(baseline, treatment, "task")
+
+        assert call_order == ["baseline", "treatment"]
+
+    async def test_does_not_mutate_original_agents(self, ab_run, tmp_path):
+        """ab_run does not mutate the original CopilotAgent objects."""
+        original_baseline = CopilotAgent(name="b", working_directory=None)
+        original_treatment = CopilotAgent(name="t", working_directory=None)
+
+        with patch(
+            "pytest_codingagents.copilot.fixtures.run_copilot",
+            new=AsyncMock(return_value=_make_result()),
+        ):
+            await ab_run(original_baseline, original_treatment, "task")
+
+        # Frozen dataclasses cannot be mutated — but verify the originals are unchanged
+        assert original_baseline.working_directory is None
+        assert original_treatment.working_directory is None
+
+    async def test_stashes_treatment_result_for_aitest(self, ab_run, request, tmp_path):
+        """ab_run stashes treatment result on the test node for aitest."""
+        treatment_result = _make_result(success=True)
+        treatment = CopilotAgent(name="treatment")
+
+        with (
+            patch(
+                "pytest_codingagents.copilot.fixtures.run_copilot",
+                new=AsyncMock(side_effect=[_make_result(), treatment_result]),
+            ),
+            patch("pytest_codingagents.copilot.fixtures.stash_on_item") as mock_stash,
+        ):
+            await ab_run(CopilotAgent(name="baseline"), treatment, "task")
+
+        # stash_on_item called once with treatment result
+        mock_stash.assert_called_once()
+        _, _, stashed_result = mock_stash.call_args[0]
+        assert stashed_result is treatment_result
+
+    async def test_passes_task_to_both_agents(self, ab_run, tmp_path):
+        """ab_run passes the same task string to both agents."""
+        captured_tasks: list[str] = []
+
+        async def _capture(agent, task):
+            captured_tasks.append(task)
+            return _make_result()
+
+        with patch("pytest_codingagents.copilot.fixtures.run_copilot", side_effect=_capture):
+            await ab_run(CopilotAgent(), CopilotAgent(), "my specific task")
+
+        assert captured_tasks == ["my specific task", "my specific task"]
diff --git a/tests/unit/test_optimizer.py b/tests/unit/test_optimizer.py
new file mode 100644
index 0000000..81e797c
--- /dev/null
+++ b/tests/unit/test_optimizer.py
@@ -0,0 +1,235 @@
+"""Unit tests for optimize_instruction() and InstructionSuggestion."""
+
+from __future__ import annotations
+
+import sys
+from unittest.mock import AsyncMock, MagicMock
+
+import pytest
+
+from pytest_codingagents.copilot.optimizer import InstructionSuggestion, optimize_instruction
+from pytest_codingagents.copilot.result import CopilotResult, ToolCall, Turn
+
+
+def _make_result(
+    *,
+    success: bool = True,
+    final_response: str = "Here is the code.",
+    tools: list[str] | None = None,
+) -> CopilotResult:
+    tool_calls = [ToolCall(name=t, arguments={}) for t in (tools or [])]
+    return CopilotResult(
+        success=success,
+        turns=[
+            Turn(role="assistant", content=final_response, tool_calls=tool_calls),
+        ],
+    )
+
+
+def _make_agent_mock(instruction: str, reasoning: str, changes: str) -> MagicMock:
+    """Build a pydantic-ai Agent mock that returns a structured suggestion."""
+    output = MagicMock()
+    output.instruction = instruction
+    output.reasoning = reasoning
+    output.changes = changes
+
+    run_result = MagicMock()
+    run_result.output = output
+
+    agent_instance = MagicMock()
+    agent_instance.run = AsyncMock(return_value=run_result)
+
+    agent_class = MagicMock(return_value=agent_instance)
+    return agent_class
+
+
+class TestInstructionSuggestion:
+    """Tests for the InstructionSuggestion dataclass."""
+
+    def test_str_contains_instruction(self):
+        s = InstructionSuggestion(
+            instruction="Always add docstrings.",
+            reasoning="The original instruction omits documentation requirements.",
+            changes="Added docstring mandate.",
+        )
+        assert "Always add docstrings." in str(s)
+
+    def test_str_contains_reasoning(self):
+        s = InstructionSuggestion(
+            instruction="inst",
+            reasoning="because reasons",
+            changes="changed x",
+        )
+        assert "because reasons" in str(s)
+
+    def test_str_contains_changes(self):
+        s = InstructionSuggestion(
+            instruction="inst",
+            reasoning="reason",
+            changes="Added docstring mandate.",
+        )
+        assert "Added docstring mandate." in str(s)
+
+    def test_fields_accessible(self):
+        s = InstructionSuggestion(
+            instruction="inst",
+            reasoning="reason",
+            changes="changes",
+        )
+        assert s.instruction == "inst"
+        assert s.reasoning == "reason"
+        assert s.changes == "changes"
+
+
+class TestOptimizeInstruction:
+    """Tests for optimize_instruction()."""
+
+    async def test_returns_instruction_suggestion(self):
+        """optimize_instruction returns an InstructionSuggestion."""
+        agent_class = _make_agent_mock(
+            instruction="Always add Google-style docstrings.",
+            reasoning="The original instruction omits documentation.",
+            changes="Added docstring mandate.",
+        )
+
+        # patch pydantic_ai.Agent in the module where it's imported
+        sys.modules["pydantic_ai"].Agent = agent_class  # type: ignore[attr-defined]
+
+        result = await optimize_instruction(
+            "Write Python code.",
+            _make_result(),
+            "Agent should add docstrings.",
+        )
+
+        assert isinstance(result, InstructionSuggestion)
+        assert result.instruction == "Always add Google-style docstrings."
+        assert result.reasoning == "The original instruction omits documentation."
+        assert result.changes == "Added docstring mandate."
+
+    async def test_uses_default_model(self):
+        """optimize_instruction defaults to openai:gpt-4o-mini."""
+        agent_class = _make_agent_mock("inst", "reason", "changes")
+        sys.modules["pydantic_ai"].Agent = agent_class  # type: ignore[attr-defined]
+
+        await optimize_instruction("inst", _make_result(), "criterion")
+
+        agent_class.assert_called_once()
+        assert agent_class.call_args[0][0] == "openai:gpt-4o-mini"
+
+    async def test_accepts_custom_model(self):
+        """optimize_instruction accepts a custom model string."""
+        agent_class = _make_agent_mock("inst", "reason", "changes")
+        sys.modules["pydantic_ai"].Agent = agent_class  # type: ignore[attr-defined]
+
+        await optimize_instruction(
+            "inst",
+            _make_result(),
+            "criterion",
+            model="anthropic:claude-3-haiku-20240307",
+        )
+
+        assert agent_class.call_args[0][0] == "anthropic:claude-3-haiku-20240307"
+
+    async def test_includes_criterion_in_prompt(self):
+        """The LLM prompt includes the criterion text."""
+        agent_class = _make_agent_mock("improved", "reason", "change")
+        agent_instance = agent_class.return_value
+        sys.modules["pydantic_ai"].Agent = agent_class  # type: ignore[attr-defined]
+
+        await optimize_instruction(
+            "Write code.",
+            _make_result(),
+            "Agent must use type hints on all functions.",
+        )
+
+        prompt = agent_instance.run.call_args[0][0]
+        assert "type hints" in prompt
+
+    async def test_includes_current_instruction_in_prompt(self):
+        """The LLM prompt contains the current instruction."""
+        agent_class = _make_agent_mock("inst", "reason", "changes")
+        agent_instance = agent_class.return_value
+        sys.modules["pydantic_ai"].Agent = agent_class  # type: ignore[attr-defined]
+
+        await optimize_instruction(
+            "Always use FastAPI for web APIs.",
+            _make_result(),
+            "criterion",
+        )
+
+        prompt = agent_instance.run.call_args[0][0]
+        assert "FastAPI" in prompt
+
+    async def test_includes_agent_output_in_prompt(self):
+        """The LLM prompt contains the agent's final response."""
+        agent_class = _make_agent_mock("inst", "reason", "changes")
+        agent_instance = agent_class.return_value
+        sys.modules["pydantic_ai"].Agent = agent_class  # type: ignore[attr-defined]
+
+        result = _make_result(final_response="def add(a, b): return a + b")
+        await optimize_instruction("inst", result, "criterion")
+
+        prompt = agent_instance.run.call_args[0][0]
+        assert "def add" in prompt
+
+    async def test_handles_no_final_response(self):
+        """optimize_instruction handles results with no turns gracefully."""
+        agent_class = _make_agent_mock("inst", "reason", "changes")
+        sys.modules["pydantic_ai"].Agent = agent_class  # type: ignore[attr-defined]
+
+        empty_result = CopilotResult(success=False, turns=[])
+        result = await optimize_instruction("inst", empty_result, "criterion")
+
+        assert isinstance(result, InstructionSuggestion)
+
+    async def test_handles_empty_instruction(self):
+        """optimize_instruction handles empty current instruction."""
+        agent_class = _make_agent_mock("new inst", "reason", "changes")
+        sys.modules["pydantic_ai"].Agent = agent_class  # type: ignore[attr-defined]
+
+        result = await optimize_instruction("", _make_result(), "criterion")
+        assert isinstance(result, InstructionSuggestion)
+
+    async def test_includes_tool_calls_in_prompt(self):
+        """The LLM prompt includes tool call information."""
+        agent_class = _make_agent_mock("inst", "reason", "changes")
+        agent_instance = agent_class.return_value
+        sys.modules["pydantic_ai"].Agent = agent_class  # type: ignore[attr-defined]
+
+        result = _make_result(tools=["create_file", "read_file"])
+        await optimize_instruction("inst", result, "criterion")
+
+        prompt = agent_instance.run.call_args[0][0]
+        assert "create_file" in prompt
+
+
+class TestOptimizeInstructionImportError:
+    """Test ImportError when pydantic-ai is not installed."""
+
+    async def test_raises_import_error_when_pydantic_ai_missing(self):
+        """optimize_instruction raises ImportError if pydantic-ai not installed."""
+        saved = sys.modules.get("pydantic_ai")
+        try:
+            sys.modules["pydantic_ai"] = None  # type: ignore
+
+            with pytest.raises(ImportError, match="pydantic-ai"):
+                await optimize_instruction("inst", _make_result(), "criterion")
+        finally:
+            if saved is not None:
+                sys.modules["pydantic_ai"] = saved
+            else:
+                del sys.modules["pydantic_ai"]
+
+    async def test_import_error_includes_install_hint(self):
+        """ImportError message includes the uv add install hint."""
+        saved = sys.modules.get("pydantic_ai")
+        try:
+            sys.modules["pydantic_ai"] = None  # type: ignore
+
+            with pytest.raises(ImportError, match="uv add pydantic-ai"):
+                await optimize_instruction("inst", _make_result(), "criterion")
+        finally:
+            if saved is not None:
+                sys.modules["pydantic_ai"] = saved
+            else:
+                del sys.modules["pydantic_ai"]