diff --git a/README.md b/README.md index 4e2b559..fdff77f 100644 --- a/README.md +++ b/README.md @@ -1,40 +1,46 @@ # pytest-codingagents -**Combatting cargo cult programming in Agent Instructions, Skills, and Custom Agents for GitHub Copilot and other coding agents since 2026.** +**Test-driven prompt engineering for GitHub Copilot.** -Everyone's copying instruction files from blog posts, pasting "you are a senior engineer" into agent configs, and adding skills they found on Reddit. But does any of it actually work? Are your instructions making your coding agent better — or just longer? Is that skill helping, or is the agent ignoring it entirely? +Everyone copies instruction files from blog posts, adds "you are a senior engineer" to agent configs, and includes skills found on Reddit. But does any of it work? Are your instructions making your agent better — or just longer? **You don't know, because you're not testing it.** -pytest-codingagents gives you **A/B testing for coding agent configurations**. Run two configs against the same task, assert the difference, and let AI analysis tell you which one wins — and why. +pytest-codingagents gives you a complete **test→optimize→test loop** for GitHub Copilot configurations: + +1. **Write a test** — define what the agent *should* do +2. **Run it** — see it fail (or pass) +3. **Optimize** — call `optimize_instruction()` to get a concrete suggestion +4. **A/B confirm** — use `ab_run` to prove the change actually helps +5. **Ship it** — you now have evidence, not vibes Currently supports **GitHub Copilot** via [copilot-sdk](https://www.npmjs.com/package/github-copilot-sdk). More agents (Claude Code, etc.) coming soon. 
```python -from pytest_codingagents import CopilotAgent - -async def test_fastapi_instruction_steers_framework(copilot_run, tmp_path): - """Does 'always use FastAPI' actually change what the agent produces?""" - # Config A: generic instructions - baseline = CopilotAgent( - instructions="You are a Python developer.", - working_directory=str(tmp_path / "a"), - ) - # Config B: framework mandate - with_fastapi = CopilotAgent( - instructions="You are a Python developer. ALWAYS use FastAPI for web APIs.", - working_directory=str(tmp_path / "b"), +from pytest_codingagents import CopilotAgent, optimize_instruction +import pytest + + +async def test_docstring_instruction_works(ab_run): + """Prove the docstring instruction actually changes output, and get a fix if it doesn't.""" + baseline = CopilotAgent(instructions="Write Python code.") + treatment = CopilotAgent( + instructions="Write Python code. Add Google-style docstrings to every function." ) - (tmp_path / "a").mkdir() - (tmp_path / "b").mkdir() - task = 'Create a web API with a GET /health endpoint returning {"status": "ok"}.' - result_a = await copilot_run(baseline, task) - result_b = await copilot_run(with_fastapi, task) + b, t = await ab_run(baseline, treatment, "Create math.py with add(a, b) and subtract(a, b).") + + assert b.success and t.success + + if '"""' not in t.file("math.py"): + suggestion = await optimize_instruction( + treatment.instructions or "", + t, + "Agent should add docstrings to every function.", + ) + pytest.fail(f"Docstring instruction was ignored.\n\n{suggestion}") - assert result_a.success and result_b.success - code_b = "\n".join(f.read_text() for f in (tmp_path / "b").rglob("*.py")) - assert "fastapi" in code_b.lower(), "FastAPI instruction was ignored — the config has no effect" + assert '"""' not in b.file("math.py"), "Baseline should not have docstrings" ``` ## Install @@ -50,6 +56,7 @@ Authenticate via `GITHUB_TOKEN` env var (CI) or `gh auth status` (local). 
| Capability | What it proves | Guide | |---|---|---| | **A/B comparison** | Config B actually produces different (and better) output than Config A | [Getting Started](https://sbroenne.github.io/pytest-codingagents/getting-started/) | +| **Instruction optimization** | Turn a failing test into a ready-to-use instruction fix | [Optimize Instructions](https://sbroenne.github.io/pytest-codingagents/how-to/optimize/) | | **Instructions** | Your custom instructions change agent behavior — not just vibes | [Getting Started](https://sbroenne.github.io/pytest-codingagents/getting-started/) | | **Skills** | That domain knowledge file is helping, not being ignored | [Skill Testing](https://sbroenne.github.io/pytest-codingagents/how-to/skills/) | | **Models** | Which model works best for your use case and budget | [Model Comparison](https://sbroenne.github.io/pytest-codingagents/getting-started/model-comparison/) | diff --git a/docs/how-to/ab-testing.md b/docs/how-to/ab-testing.md index cb9567a..a6ab6b9 100644 --- a/docs/how-to/ab-testing.md +++ b/docs/how-to/ab-testing.md @@ -4,9 +4,32 @@ The core use case of pytest-codingagents is **A/B testing**: run the same task w This stops cargo cult configuration — copying instructions and skills from blog posts without knowing if they work. -## The Pattern +## The `ab_run` Fixture -Every A/B test follows the same structure: +The `ab_run` fixture is the fastest way to write an A/B test. It handles directory isolation, sequential execution, and aitest reporting automatically: + +```python +from pytest_codingagents import CopilotAgent + + +async def test_docstring_instruction(ab_run): + baseline = CopilotAgent(instructions="Write Python code.") + treatment = CopilotAgent( + instructions="Write Python code. Add Google-style docstrings to every function." 
+ ) + + b, t = await ab_run(baseline, treatment, "Create math.py with add(a, b) and subtract(a, b).") + + assert b.success and t.success + assert '"""' not in b.file("math.py"), "Baseline should not have docstrings" + assert '"""' in t.file("math.py"), "Treatment: docstring instruction was ignored" +``` + +`ab_run` automatically creates `baseline/` and `treatment/` subdirectories under `tmp_path`, overrides `working_directory` on each agent (so they never share a workspace), and runs them sequentially. + +## The Manual Pattern + +For full control — custom paths, conditional logic, more than two configs — use `copilot_run` directly: ```python from pytest_codingagents import CopilotAgent @@ -41,8 +64,6 @@ async def test_config_a_vs_config_b(copilot_run, tmp_path): **The key rule**: assert something that is present in Config B *because of the change* and absent (or different) in Config A. ---- - ## Testing Instructions Does adding a documentation mandate actually change the code written? diff --git a/docs/how-to/index.md b/docs/how-to/index.md index 4f2cf57..b1840ea 100644 --- a/docs/how-to/index.md +++ b/docs/how-to/index.md @@ -3,6 +3,7 @@ Practical guides for common tasks. - [A/B Testing](ab-testing.md) — Prove that your config changes actually make a difference +- [Optimize Instructions](optimize.md) — Use AI to turn test failures into actionable instruction improvements - [Assertions](assertions.md) — File helpers and semantic assertions with `llm_assert` - [Load from Copilot Config](copilot-config.md) — Build a `CopilotAgent` from your real `.github/` config files - [Skill Testing](skills.md) — Measure the impact of domain knowledge diff --git a/docs/how-to/optimize.md b/docs/how-to/optimize.md new file mode 100644 index 0000000..964c9c2 --- /dev/null +++ b/docs/how-to/optimize.md @@ -0,0 +1,124 @@ +# Optimizing Instructions with AI + +`optimize_instruction()` closes the test→optimize→test loop. 
+ +When a test fails — the agent ignored an instruction or produced unexpected output — call `optimize_instruction()` to get a concrete, LLM-generated suggestion for improving the instruction. Drop the suggestion into `pytest.fail()` so the test failure message includes a ready-to-use fix. + +## The Loop + +``` +write test → run → fail → optimize → update instruction → run → pass +``` + +This is **test-driven prompt engineering**: your tests define the standard; the optimizer helps you reach it. + +## Basic Usage + +```python +import pytest +from pytest_codingagents import CopilotAgent, optimize_instruction + + +async def test_docstring_instruction(copilot_run, tmp_path): + agent = CopilotAgent( + instructions="Write Python code.", + working_directory=str(tmp_path), + ) + + result = await copilot_run(agent, "Create math.py with add(a, b) and subtract(a, b).") + + if '"""' not in result.file("math.py"): + suggestion = await optimize_instruction( + agent.instructions or "", + result, + "Agent should add Google-style docstrings to every function.", + ) + pytest.fail(f"No docstrings found.\n\n{suggestion}") +``` + +The failure message will look like: + +``` +FAILED test_math.py::test_docstring_instruction + +No docstrings found. + +💡 Suggested instruction: + + Write Python code. Add Google-style docstrings to every function. + The docstring should describe what the function does, its parameters (Args:), + and its return value (Returns:). + + Changes: Added explicit docstring format mandate with Args/Returns sections. + Reasoning: The original instruction did not mention documentation. The agent + produced code without docstrings because there was no requirement to add them. 
+```
+
+## With A/B Testing
+
+Pair `optimize_instruction()` with `ab_run` to test the fix before committing:
+
+```python
+import pytest
+from pytest_codingagents import CopilotAgent, optimize_instruction
+
+
+async def test_docstring_instruction_iterates(ab_run):
+    baseline = CopilotAgent(instructions="Write Python code.")
+    treatment = CopilotAgent(
+        instructions="Write Python code. Add Google-style docstrings to every function."
+    )
+
+    b, t = await ab_run(baseline, treatment, "Create math.py with add(a, b).")
+
+    assert b.success and t.success
+
+    if '"""' not in t.file("math.py"):
+        suggestion = await optimize_instruction(
+            treatment.instructions or "",
+            t,
+            "Agent should add Google-style docstrings to every function.",
+        )
+        pytest.fail(f"Treatment still has no docstrings.\n\n{suggestion}")
+
+    # Confirm the baseline does NOT have docstrings (differential assertion)
+    assert '"""' not in b.file("math.py"), "Baseline unexpectedly has docstrings"
+```
+
+## API Reference
+
+::: pytest_codingagents.copilot.optimizer.optimize_instruction
+
+---
+
+::: pytest_codingagents.copilot.optimizer.InstructionSuggestion
+
+## Choosing a Model
+
+`optimize_instruction()` defaults to `openai:gpt-4o-mini` — cheap, fast, and precise enough for instruction analysis.
+
+Override with the `model` keyword argument:
+
+```python
+suggestion = await optimize_instruction(
+    agent.instructions or "",
+    result,
+    "Agent should use type hints.",
+    model="anthropic:claude-3-haiku-20240307",
+)
+```
+
+Any [pydantic-ai model string](https://ai.pydantic.dev/models/) works (e.g. `provider:model-name`).
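
## Stricter Docstring Checks

The examples in this guide use a simple substring check (`'"""' in t.file("math.py")`). When your criterion is "every function must have a docstring", a small stdlib helper can verify that directly with `ast`. The helper below (`functions_missing_docstrings`) is not part of pytest-codingagents; it is a sketch you can adapt in your own tests:

```python
import ast


def functions_missing_docstrings(source: str) -> list[str]:
    """Return the names of functions in *source* that have no docstring."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if ast.get_docstring(node) is None:
                missing.append(node.name)
    return missing


code = '''
def add(a, b):
    """Add two numbers."""
    return a + b


def subtract(a, b):
    return a - b
'''

print(functions_missing_docstrings(code))  # ['subtract']
```

In a test, `assert not functions_missing_docstrings(t.file("math.py"))` lists the exact offenders on failure, which also makes the `criterion` you pass to `optimize_instruction()` easier to state precisely.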
+ +## The Criterion + +Write the `criterion` as a plain-English statement of what the agent *should* have done: + +| Situation | Good criterion | +|-----------|----------------| +| Missing docstrings | `"Agent should add Google-style docstrings to every function."` | +| Wrong framework | `"Agent should use FastAPI, not Flask."` | +| Missing type hints | `"All function signatures must include type annotations."` | +| No error handling | `"All I/O operations must be wrapped in try/except."` | + +The more specific the criterion, the more actionable the suggestion. diff --git a/docs/index.md b/docs/index.md index 2725b9e..5f7286e 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,40 +1,46 @@ # pytest-codingagents -**Combatting cargo cult programming in Agent Instructions, Skills, and Custom Agents for GitHub Copilot and other coding agents since 2026.** +**Test-driven prompt engineering for GitHub Copilot.** -Everyone's copying instruction files from blog posts, pasting "you are a senior engineer" into agent configs, and adding skills they found on Reddit. But does any of it actually work? Are your instructions making your coding agent better — or just longer? Is that skill helping, or is the agent ignoring it entirely? +Everyone copies instruction files from blog posts, adds "you are a senior engineer" to agent configs, and includes skills found on Reddit. But does any of it work? Are your instructions making your agent better — or just longer? **You don't know, because you're not testing it.** -pytest-codingagents gives you **A/B testing for coding agent configurations**. Run two configs against the same task, assert the difference, and let AI analysis tell you which one wins — and why. +pytest-codingagents gives you a complete **test→optimize→test loop** for GitHub Copilot configurations: + +1. **Write a test** — define what the agent *should* do +2. **Run it** — see it fail (or pass) +3. 
**Optimize** — call `optimize_instruction()` to get a concrete suggestion +4. **A/B confirm** — use `ab_run` to prove the change actually helps +5. **Ship it** — you now have evidence, not vibes Currently supports **GitHub Copilot** via [copilot-sdk](https://www.npmjs.com/package/github-copilot-sdk). More agents (Claude Code, etc.) coming soon. ```python -from pytest_codingagents import CopilotAgent - -async def test_fastapi_instruction_steers_framework(copilot_run, tmp_path): - """Does 'always use FastAPI' actually change what the agent produces?""" - # Config A: generic instructions - baseline = CopilotAgent( - instructions="You are a Python developer.", - working_directory=str(tmp_path / "a"), - ) - # Config B: framework mandate - with_fastapi = CopilotAgent( - instructions="You are a Python developer. ALWAYS use FastAPI for web APIs.", - working_directory=str(tmp_path / "b"), +from pytest_codingagents import CopilotAgent, optimize_instruction +import pytest + + +async def test_docstring_instruction_works(ab_run): + """Prove the docstring instruction actually changes output, and get a fix if it doesn't.""" + baseline = CopilotAgent(instructions="Write Python code.") + treatment = CopilotAgent( + instructions="Write Python code. Add Google-style docstrings to every function." ) - (tmp_path / "a").mkdir() - (tmp_path / "b").mkdir() - task = 'Create a web API with a GET /health endpoint returning {"status": "ok"}.' 
- result_a = await copilot_run(baseline, task) - result_b = await copilot_run(with_fastapi, task) + b, t = await ab_run(baseline, treatment, "Create math.py with add(a, b) and subtract(a, b).") + + assert b.success and t.success + + if '"""' not in t.file("math.py"): + suggestion = await optimize_instruction( + treatment.instructions or "", + t, + "Agent should add docstrings to every function.", + ) + pytest.fail(f"Docstring instruction was ignored.\n\n{suggestion}") - assert result_a.success and result_b.success - code_b = "\n".join(f.read_text() for f in (tmp_path / "b").rglob("*.py")) - assert "fastapi" in code_b.lower(), "FastAPI instruction was ignored — the config has no effect" + assert '"""' not in b.file("math.py"), "Baseline should not have docstrings" ``` ## Install @@ -50,6 +56,7 @@ Authenticate via `GITHUB_TOKEN` env var (CI) or `gh auth status` (local). | Capability | What it proves | Guide | |---|---|---| | **A/B comparison** | Config B actually produces different (and better) output than Config A | [A/B Testing](how-to/ab-testing.md) | +| **Instruction optimization** | Turn a failing test into a ready-to-use instruction fix | [Optimize Instructions](how-to/optimize.md) | | **Instructions** | Your custom instructions change agent behavior — not just vibes | [Getting Started](getting-started/index.md) | | **Skills** | That domain knowledge file is helping, not being ignored | [Skill Testing](how-to/skills.md) | | **Models** | Which model works best for your use case and budget | [Model Comparison](getting-started/model-comparison.md) | @@ -78,6 +85,6 @@ uv run pytest tests/ --aitest-html=report.html --aitest-summary-model=azure/gpt- Full docs at **[sbroenne.github.io/pytest-codingagents](https://sbroenne.github.io/pytest-codingagents/)** — API reference, how-to guides, and demo reports. 
- [Getting Started](getting-started/index.md) — Install and write your first test -- [How-To Guides](how-to/index.md) — Skills, MCP servers, CLI tools, and more +- [How-To Guides](how-to/index.md) — A/B testing, instruction optimization, skills, MCP, and more - [Demo Reports](demo/index.md) — See real HTML reports with AI analysis - [API Reference](reference/api.md) — Full API documentation diff --git a/docs/reference/api.md b/docs/reference/api.md index a116478..8f16a53 100644 --- a/docs/reference/api.md +++ b/docs/reference/api.md @@ -7,3 +7,11 @@ ::: pytest_codingagents.CopilotResult options: show_source: false + +::: pytest_codingagents.optimize_instruction + options: + show_source: false + +::: pytest_codingagents.InstructionSuggestion + options: + show_source: false diff --git a/mkdocs.yml b/mkdocs.yml index cc55caf..2a7a478 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -77,6 +77,7 @@ nav: - How-To Guides: - Overview: how-to/index.md - A/B Testing: how-to/ab-testing.md + - Optimize Instructions: how-to/optimize.md - Assertions: how-to/assertions.md - Load from Copilot Config: how-to/copilot-config.md - Skill Testing: how-to/skills.md diff --git a/pyproject.toml b/pyproject.toml index 9c85e07..d7f4de9 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "hatchling.build" [project] name = "pytest-codingagents" -version = "0.1.2" +version = "0.2.0" description = "Pytest plugin for testing real coding agents via their SDK" readme = "README.md" license = { text = "MIT" } @@ -31,6 +31,7 @@ dependencies = [ "pytest-aitest>=0.5.6", "azure-identity>=1.25.2", "pyyaml>=6.0", + "pydantic-ai>=1.0", ] [project.optional-dependencies] diff --git a/src/pytest_codingagents/__init__.py b/src/pytest_codingagents/__init__.py index e07cd94..2d93271 100644 --- a/src/pytest_codingagents/__init__.py +++ b/src/pytest_codingagents/__init__.py @@ -4,11 +4,14 @@ from pytest_codingagents.copilot.agent import CopilotAgent from pytest_codingagents.copilot.agents 
import load_custom_agent, load_custom_agents +from pytest_codingagents.copilot.optimizer import InstructionSuggestion, optimize_instruction from pytest_codingagents.copilot.result import CopilotResult __all__ = [ "CopilotAgent", "CopilotResult", + "InstructionSuggestion", "load_custom_agent", "load_custom_agents", + "optimize_instruction", ] diff --git a/src/pytest_codingagents/copilot/fixtures.py b/src/pytest_codingagents/copilot/fixtures.py index 3449041..31045d5 100644 --- a/src/pytest_codingagents/copilot/fixtures.py +++ b/src/pytest_codingagents/copilot/fixtures.py @@ -2,10 +2,15 @@ Provides the ``copilot_run`` fixture that executes prompts against Copilot and stashes results for pytest-aitest reporting (if installed). + +Also provides ``ab_run``, a higher-level fixture for A/B testing two agent +configurations against the same task in isolated directories. """ from __future__ import annotations +import dataclasses +from pathlib import Path from typing import TYPE_CHECKING, Any import pytest @@ -131,3 +136,66 @@ def _stash_for_aitest( as the ``item`` parameter. """ stash_on_item(request.node, agent, result) # type: ignore[arg-type] + + +@pytest.fixture +def ab_run( + request: pytest.FixtureRequest, + tmp_path: Path, +) -> Callable[..., Coroutine[Any, Any, tuple[CopilotResult, CopilotResult]]]: + """Run two agents against the same task in isolated directories. + + Creates ``baseline/`` and ``treatment/`` subdirectories under + ``tmp_path``, overrides ``working_directory`` on each agent so they + never share a workspace, then runs them sequentially and stashes the + treatment result for pytest-aitest reporting. + + Example:: + + async def test_docstring_instruction(ab_run): + baseline = CopilotAgent(instructions="Write Python code.") + treatment = CopilotAgent( + instructions="Write Python code. Add Google-style docstrings to every function." 
+ ) + + b, t = await ab_run(baseline, treatment, "Create math.py with add(a, b).") + + assert b.success and t.success + assert '\"\"\"' not in b.file("math.py"), "Baseline should not have docstrings" + assert '\"\"\"' in t.file("math.py"), "Treatment should add docstrings" + + Args: + baseline: Control ``CopilotAgent`` (the existing / unchanged config). + treatment: Treatment ``CopilotAgent`` (the change you are testing). + task: Prompt to give both agents. + + Returns: + ``(baseline_result, treatment_result)`` tuple. + """ + + async def _run( + baseline: CopilotAgent, + treatment: CopilotAgent, + task: str, + ) -> tuple[CopilotResult, CopilotResult]: + baseline_dir = tmp_path / "baseline" + treatment_dir = tmp_path / "treatment" + baseline_dir.mkdir(exist_ok=True) + treatment_dir.mkdir(exist_ok=True) + + # Override working directories to guarantee isolation. + # CopilotAgent is frozen — dataclasses.replace() creates a new instance. + baseline = dataclasses.replace(baseline, working_directory=str(baseline_dir)) + treatment = dataclasses.replace(treatment, working_directory=str(treatment_dir)) + + # Run sequentially — agents may write to disk, install packages, etc. + baseline_result = await run_copilot(baseline, task) + treatment_result = await run_copilot(treatment, task) + + # Stash treatment result for pytest-aitest reporting. + # Treatment is the config being evaluated; its result is what matters. + stash_on_item(request.node, treatment, treatment_result) + + return baseline_result, treatment_result + + return _run diff --git a/src/pytest_codingagents/copilot/optimizer.py b/src/pytest_codingagents/copilot/optimizer.py new file mode 100644 index 0000000..1d2fb99 --- /dev/null +++ b/src/pytest_codingagents/copilot/optimizer.py @@ -0,0 +1,153 @@ +"""Instruction optimizer for test-driven prompt engineering. 
+ +Provides :func:`optimize_instruction`, which uses an LLM to analyze the gap +between a current agent instruction and the observed behavior, and suggests a +concrete improvement. + +Requires ``pydantic-ai``: + + uv add pydantic-ai +""" + +from __future__ import annotations + +from dataclasses import dataclass +from typing import TYPE_CHECKING + +from pydantic import BaseModel + +if TYPE_CHECKING: + from pytest_codingagents.copilot.result import CopilotResult + +__all__ = ["InstructionSuggestion", "optimize_instruction"] + + +@dataclass +class InstructionSuggestion: + """A suggested improvement to a Copilot agent instruction. + + Returned by :func:`optimize_instruction`. Designed to drop into + ``pytest.fail()`` so the failure message includes an actionable fix. + + Attributes: + instruction: The improved instruction text to use instead. + reasoning: Explanation of why this change would close the gap. + changes: Short description of what was changed (one sentence). + + Example:: + + suggestion = await optimize_instruction( + agent.instructions, + result, + "Agent should add docstrings to all functions.", + ) + pytest.fail(f"No docstrings found.\\n\\n{suggestion}") + """ + + instruction: str + reasoning: str + changes: str + + def __str__(self) -> str: + return ( + f"💡 Suggested instruction:\n\n" + f" {self.instruction}\n\n" + f" Changes: {self.changes}\n" + f" Reasoning: {self.reasoning}" + ) + + +class _OptimizationOutput(BaseModel): + """Structured output schema for the optimizer LLM call.""" + + instruction: str + reasoning: str + changes: str + + +async def optimize_instruction( + current_instruction: str, + result: CopilotResult, + criterion: str, + *, + model: str = "openai:gpt-4o-mini", +) -> InstructionSuggestion: + """Analyze a result and suggest an improved instruction. + + Uses pydantic-ai structured output to analyze the gap between a + current instruction and the agent's observed behavior, returning a + concrete, actionable improvement. 
+
+    Designed to drop into ``pytest.fail()`` so the failure message
+    contains a ready-to-use fix:
+
+    Example::
+
+        result = await copilot_run(agent, task)
+        if '\"\"\"' not in result.file("main.py"):
+            suggestion = await optimize_instruction(
+                agent.instructions or "",
+                result,
+                "Agent should add docstrings to all functions.",
+            )
+            pytest.fail(f"No docstrings found.\\n\\n{suggestion}")
+
+    Args:
+        current_instruction: The agent's current instruction text.
+        result: The ``CopilotResult`` from the (failed) run.
+        criterion: What the agent *should* have done — the test expectation
+            in plain English (e.g. ``"Always write docstrings"``).
+        model: pydantic-ai model string (e.g. ``"openai:gpt-4o-mini"``
+            or ``"anthropic:claude-3-haiku-20240307"``).
+
+    Returns:
+        An :class:`InstructionSuggestion` with the improved instruction.
+
+    Raises:
+        ImportError: If pydantic-ai is not installed.
+    """
+    try:
+        from pydantic_ai import Agent as PydanticAgent
+    except ImportError as exc:
+        msg = (
+            "pydantic-ai is required for optimize_instruction(). "
+            "Install it with: uv add pydantic-ai"
+        )
+        raise ImportError(msg) from exc
+
+    final_output = result.final_response or "(no response)"
+    tool_calls = ", ".join(sorted(result.tool_names_called)) or "none"
+
+    prompt = f"""You are helping improve a GitHub Copilot agent instruction.
+
+## Current instruction
+{current_instruction or "(no instruction)"}
+
+## What actually happened
+The agent produced:
+{final_output[:1500]}
+
+Tools called: {tool_calls}
+Run succeeded: {result.success}
+
+## Expected criterion
+The agent SHOULD have satisfied this criterion:
+{criterion}
+
+Analyze the gap between the instruction and the observed behavior.
+Suggest a specific, concise, directive improvement to the instruction
+that would make the agent satisfy the criterion.
+Keep the instruction under 200 words. 
Do not add unrelated rules.""" + + optimizer_agent = PydanticAgent(model, output_type=_OptimizationOutput) + run_result = await optimizer_agent.run(prompt) + output = run_result.output + + return InstructionSuggestion( + instruction=output.instruction, + reasoning=output.reasoning, + changes=output.changes, + ) diff --git a/src/pytest_codingagents/plugin.py b/src/pytest_codingagents/plugin.py index 4c60719..52ba3e2 100644 --- a/src/pytest_codingagents/plugin.py +++ b/src/pytest_codingagents/plugin.py @@ -12,13 +12,13 @@ import pytest -# Re-export the fixture so pytest discovers it via the plugin entry point. -from pytest_codingagents.copilot.fixtures import copilot_run +# Re-export fixtures so pytest discovers them via the plugin entry point. +from pytest_codingagents.copilot.fixtures import ab_run, copilot_run if TYPE_CHECKING: from _pytest.nodes import Item -__all__ = ["copilot_run"] +__all__ = ["ab_run", "copilot_run"] _ANALYSIS_PROMPT_PATH = Path(__file__).parent / "prompts" / "coding_agent_analysis.md" diff --git a/tests/unit/test_fixtures.py b/tests/unit/test_fixtures.py new file mode 100644 index 0000000..aeff803 --- /dev/null +++ b/tests/unit/test_fixtures.py @@ -0,0 +1,159 @@ +"""Unit tests for the ab_run fixture.""" + +from __future__ import annotations + +from unittest.mock import AsyncMock, patch + +import pytest + +from pytest_codingagents.copilot.agent import CopilotAgent +from pytest_codingagents.copilot.result import CopilotResult + + +def _make_result(success: bool = True) -> CopilotResult: + return CopilotResult(success=success) + + +class TestAbRunFixture: + """Tests for the ab_run fixture.""" + + @pytest.fixture + def baseline_agent(self) -> CopilotAgent: + return CopilotAgent(name="baseline", instructions="Write plain Python.") + + @pytest.fixture + def treatment_agent(self) -> CopilotAgent: + return CopilotAgent(name="treatment", instructions="Write Python with docstrings.") + + async def test_returns_tuple_of_two_results( + self, ab_run, 
baseline_agent, treatment_agent, tmp_path + ): + """ab_run returns a 2-tuple of CopilotResult.""" + baseline_result = _make_result(success=True) + treatment_result = _make_result(success=True) + + with patch( + "pytest_codingagents.copilot.fixtures.run_copilot", + new=AsyncMock(side_effect=[baseline_result, treatment_result]), + ): + b, t = await ab_run(baseline_agent, treatment_agent, "Write math.py") + + assert isinstance(b, CopilotResult) + assert isinstance(t, CopilotResult) + assert b is baseline_result + assert t is treatment_result + + async def test_creates_isolated_directories( + self, ab_run, baseline_agent, treatment_agent, tmp_path + ): + """ab_run creates baseline/ and treatment/ under tmp_path.""" + captured_agents = [] + + async def _capture(agent, task): + captured_agents.append(agent) + return _make_result() + + with patch("pytest_codingagents.copilot.fixtures.run_copilot", side_effect=_capture): + await ab_run(baseline_agent, treatment_agent, "some task") + + baseline_dir = tmp_path / "baseline" + treatment_dir = tmp_path / "treatment" + assert baseline_dir.exists() + assert treatment_dir.exists() + + async def test_overrides_working_directory_on_both_agents(self, ab_run, tmp_path): + """ab_run overrides working_directory regardless of original value.""" + original_baseline = CopilotAgent(name="b", working_directory="/original/b") + original_treatment = CopilotAgent(name="t", working_directory="/original/t") + + captured: list[CopilotAgent] = [] + + async def _capture(agent, task): + captured.append(agent) + return _make_result() + + with patch("pytest_codingagents.copilot.fixtures.run_copilot", side_effect=_capture): + await ab_run(original_baseline, original_treatment, "task") + + assert captured[0].working_directory == str(tmp_path / "baseline") + assert captured[1].working_directory == str(tmp_path / "treatment") + + async def test_agents_without_working_directory_get_isolated_dirs(self, ab_run, tmp_path): + """Agents with no working_directory 
still get isolated dirs."""
+        baseline = CopilotAgent(name="b")
+        treatment = CopilotAgent(name="t")
+
+        captured: list[CopilotAgent] = []
+
+        async def _capture(agent, task):
+            captured.append(agent)
+            return _make_result()
+
+        with patch("pytest_codingagents.copilot.fixtures.run_copilot", side_effect=_capture):
+            await ab_run(baseline, treatment, "task")
+
+        assert captured[0].working_directory == str(tmp_path / "baseline")
+        assert captured[1].working_directory == str(tmp_path / "treatment")
+
+    async def test_runs_baseline_before_treatment(self, ab_run, tmp_path):
+        """ab_run runs baseline first, then treatment (sequential)."""
+        call_order: list[str] = []
+
+        async def _capture(agent, task):
+            call_order.append(agent.name)
+            return _make_result()
+
+        baseline = CopilotAgent(name="baseline")
+        treatment = CopilotAgent(name="treatment")
+
+        with patch("pytest_codingagents.copilot.fixtures.run_copilot", side_effect=_capture):
+            await ab_run(baseline, treatment, "task")
+
+        assert call_order == ["baseline", "treatment"]
+
+    async def test_does_not_mutate_original_agents(self, ab_run, tmp_path):
+        """ab_run does not mutate the original CopilotAgent objects."""
+        original_baseline = CopilotAgent(name="b", working_directory=None)
+        original_treatment = CopilotAgent(name="t", working_directory=None)
+
+        with patch(
+            "pytest_codingagents.copilot.fixtures.run_copilot",
+            new=AsyncMock(return_value=_make_result()),
+        ):
+            await ab_run(original_baseline, original_treatment, "task")
+
+        # Frozen dataclasses cannot be mutated — but verify the originals are unchanged
+        assert original_baseline.working_directory is None
+        assert original_treatment.working_directory is None
+
+    async def test_stashes_treatment_result_for_aitest(self, ab_run, request, tmp_path):
+        """ab_run stashes treatment result on the test node for aitest."""
+        treatment_result = _make_result(success=True)
+        treatment = CopilotAgent(name="treatment")
+
+        with (
+            patch(
+                "pytest_codingagents.copilot.fixtures.run_copilot",
+                new=AsyncMock(side_effect=[_make_result(), treatment_result]),
+            ),
+            patch("pytest_codingagents.copilot.fixtures.stash_on_item") as mock_stash,
+        ):
+            await ab_run(CopilotAgent(name="baseline"), treatment, "task")
+
+        # stash_on_item called once with treatment result
+        mock_stash.assert_called_once()
+        _, _, stashed_result = mock_stash.call_args[0]
+        assert stashed_result is treatment_result
+
+    async def test_passes_task_to_both_agents(self, ab_run, tmp_path):
+        """ab_run passes the same task string to both agents."""
+        captured_tasks: list[str] = []
+
+        async def _capture(agent, task):
+            captured_tasks.append(task)
+            return _make_result()
+
+        with patch("pytest_codingagents.copilot.fixtures.run_copilot", side_effect=_capture):
+            await ab_run(CopilotAgent(), CopilotAgent(), "my specific task")
+
+        assert captured_tasks == ["my specific task", "my specific task"]
diff --git a/tests/unit/test_optimizer.py b/tests/unit/test_optimizer.py
new file mode 100644
index 0000000..81e797c
--- /dev/null
+++ b/tests/unit/test_optimizer.py
@@ -0,0 +1,235 @@
+"""Unit tests for optimize_instruction() and InstructionSuggestion."""
+
+from __future__ import annotations
+
+import sys
+from unittest.mock import AsyncMock, MagicMock
+
+import pytest
+
+from pytest_codingagents.copilot.optimizer import InstructionSuggestion, optimize_instruction
+from pytest_codingagents.copilot.result import CopilotResult, ToolCall, Turn
+
+
+def _make_result(
+    *,
+    success: bool = True,
+    final_response: str = "Here is the code.",
+    tools: list[str] | None = None,
+) -> CopilotResult:
+    tool_calls = [ToolCall(name=t, arguments={}) for t in (tools or [])]
+    return CopilotResult(
+        success=success,
+        turns=[
+            Turn(role="assistant", content=final_response, tool_calls=tool_calls),
+        ],
+    )
+
+
+def _make_agent_mock(instruction: str, reasoning: str, changes: str) -> MagicMock:
+    """Build a pydantic-ai Agent mock that returns a structured suggestion."""
+    output = MagicMock()
+    output.instruction = instruction
+    output.reasoning = reasoning
+    output.changes = changes
+
+    run_result = MagicMock()
+    run_result.output = output
+
+    agent_instance = MagicMock()
+    agent_instance.run = AsyncMock(return_value=run_result)
+
+    agent_class = MagicMock(return_value=agent_instance)
+    return agent_class
+
+
+class TestInstructionSuggestion:
+    """Tests for the InstructionSuggestion dataclass."""
+
+    def test_str_contains_instruction(self):
+        s = InstructionSuggestion(
+            instruction="Always add docstrings.",
+            reasoning="The original instruction omits documentation requirements.",
+            changes="Added docstring mandate.",
+        )
+        assert "Always add docstrings." in str(s)
+
+    def test_str_contains_reasoning(self):
+        s = InstructionSuggestion(
+            instruction="inst",
+            reasoning="because reasons",
+            changes="changed x",
+        )
+        assert "because reasons" in str(s)
+
+    def test_str_contains_changes(self):
+        s = InstructionSuggestion(
+            instruction="inst",
+            reasoning="reason",
+            changes="Added docstring mandate.",
+        )
+        assert "Added docstring mandate." in str(s)
+
+    def test_fields_accessible(self):
+        s = InstructionSuggestion(
+            instruction="inst",
+            reasoning="reason",
+            changes="changes",
+        )
+        assert s.instruction == "inst"
+        assert s.reasoning == "reason"
+        assert s.changes == "changes"
+
+
+class TestOptimizeInstruction:
+    """Tests for optimize_instruction()."""
+
+    async def test_returns_instruction_suggestion(self):
+        """optimize_instruction returns an InstructionSuggestion."""
+        agent_class = _make_agent_mock(
+            instruction="Always add Google-style docstrings.",
+            reasoning="The original instruction omits documentation.",
+            changes="Added docstring mandate.",
+        )
+
+        # patch pydantic_ai.Agent in the module where it's imported
+        sys.modules["pydantic_ai"].Agent = agent_class  # type: ignore[attr-defined]
+
+        result = await optimize_instruction(
+            "Write Python code.",
+            _make_result(),
+            "Agent should add docstrings.",
+        )
+
+        assert isinstance(result, InstructionSuggestion)
+        assert result.instruction == "Always add Google-style docstrings."
+        assert result.reasoning == "The original instruction omits documentation."
+        assert result.changes == "Added docstring mandate."
+
+    async def test_uses_default_model(self):
+        """optimize_instruction defaults to openai:gpt-4o-mini."""
+        agent_class = _make_agent_mock("inst", "reason", "changes")
+        sys.modules["pydantic_ai"].Agent = agent_class  # type: ignore[attr-defined]
+
+        await optimize_instruction("inst", _make_result(), "criterion")
+
+        agent_class.assert_called_once()
+        assert agent_class.call_args[0][0] == "openai:gpt-4o-mini"
+
+    async def test_accepts_custom_model(self):
+        """optimize_instruction accepts a custom model string."""
+        agent_class = _make_agent_mock("inst", "reason", "changes")
+        sys.modules["pydantic_ai"].Agent = agent_class  # type: ignore[attr-defined]
+
+        await optimize_instruction(
+            "inst",
+            _make_result(),
+            "criterion",
+            model="anthropic:claude-3-haiku-20240307",
+        )
+
+        assert agent_class.call_args[0][0] == "anthropic:claude-3-haiku-20240307"
+
+    async def test_includes_criterion_in_prompt(self):
+        """The LLM prompt includes the criterion text."""
+        agent_class = _make_agent_mock("improved", "reason", "change")
+        agent_instance = agent_class.return_value
+        sys.modules["pydantic_ai"].Agent = agent_class  # type: ignore[attr-defined]
+
+        await optimize_instruction(
+            "Write code.",
+            _make_result(),
+            "Agent must use type hints on all functions.",
+        )
+
+        prompt = agent_instance.run.call_args[0][0]
+        assert "type hints" in prompt
+
+    async def test_includes_current_instruction_in_prompt(self):
+        """The LLM prompt contains the current instruction."""
+        agent_class = _make_agent_mock("inst", "reason", "changes")
+        agent_instance = agent_class.return_value
+        sys.modules["pydantic_ai"].Agent = agent_class  # type: ignore[attr-defined]
+
+        await optimize_instruction(
+            "Always use FastAPI for web APIs.",
+            _make_result(),
+            "criterion",
+        )
+
+        prompt = agent_instance.run.call_args[0][0]
+        assert "FastAPI" in prompt
+
+    async def test_includes_agent_output_in_prompt(self):
+        """The LLM prompt contains the agent's final response."""
+        agent_class = _make_agent_mock("inst", "reason", "changes")
+        agent_instance = agent_class.return_value
+        sys.modules["pydantic_ai"].Agent = agent_class  # type: ignore[attr-defined]
+
+        result = _make_result(final_response="def add(a, b): return a + b")
+        await optimize_instruction("inst", result, "criterion")
+
+        prompt = agent_instance.run.call_args[0][0]
+        assert "def add" in prompt
+
+    async def test_handles_no_final_response(self):
+        """optimize_instruction handles results with no turns gracefully."""
+        agent_class = _make_agent_mock("inst", "reason", "changes")
+        sys.modules["pydantic_ai"].Agent = agent_class  # type: ignore[attr-defined]
+
+        empty_result = CopilotResult(success=False, turns=[])
+        result = await optimize_instruction("inst", empty_result, "criterion")
+
+        assert isinstance(result, InstructionSuggestion)
+
+    async def test_handles_empty_instruction(self):
+        """optimize_instruction handles empty current instruction."""
+        agent_class = _make_agent_mock("new inst", "reason", "changes")
+        sys.modules["pydantic_ai"].Agent = agent_class  # type: ignore[attr-defined]
+
+        result = await optimize_instruction("", _make_result(), "criterion")
+        assert isinstance(result, InstructionSuggestion)
+
+    async def test_includes_tool_calls_in_prompt(self):
+        """The LLM prompt includes tool call information."""
+        agent_class = _make_agent_mock("inst", "reason", "changes")
+        agent_instance = agent_class.return_value
+        sys.modules["pydantic_ai"].Agent = agent_class  # type: ignore[attr-defined]
+
+        result = _make_result(tools=["create_file", "read_file"])
+        await optimize_instruction("inst", result, "criterion")
+
+        prompt = agent_instance.run.call_args[0][0]
+        assert "create_file" in prompt
+
+
+class TestOptimizeInstructionImportError:
+    """Test ImportError when pydantic-ai is not installed."""
+
+    async def test_raises_import_error_when_pydantic_ai_missing(self):
+        """optimize_instruction raises ImportError if pydantic-ai not installed."""
+        saved = sys.modules.get("pydantic_ai")
+        try:
+            sys.modules["pydantic_ai"] = None  # type: ignore
+
+            with pytest.raises(ImportError, match="pydantic-ai"):
+                await optimize_instruction("inst", _make_result(), "criterion")
+        finally:
+            if saved is not None:
+                sys.modules["pydantic_ai"] = saved
+            else:
+                del sys.modules["pydantic_ai"]
+
+    async def test_import_error_includes_install_hint(self):
+        """ImportError message includes the uv add install hint."""
+        saved = sys.modules.get("pydantic_ai")
+        try:
+            sys.modules["pydantic_ai"] = None  # type: ignore
+
+            with pytest.raises(ImportError, match="uv add pydantic-ai"):
+                await optimize_instruction("inst", _make_result(), "criterion")
+        finally:
+            if saved is not None:
+                sys.modules["pydantic_ai"] = saved
+            else:
+                del sys.modules["pydantic_ai"]