9 changes: 9 additions & 0 deletions .claude/settings.local.json
@@ -0,0 +1,9 @@
{
  "permissions": {
    "allow": [
      "Bash(git merge-base:*)",
      "Bash(git log:*)",
      "Bash(git diff:*)"
    ]
  }
}
195 changes: 195 additions & 0 deletions .claude/skills/pr-description/SKILL.md
@@ -0,0 +1,195 @@
---
name: pr-description
description: Generate a pull request description from branch commits and diffs. Use this skill whenever the user asks to write, create, generate, or draft a PR description, pull request summary, PR body, or merge request description. Also trigger when the user says things like "describe this PR", "what should the PR say", "write up these changes", or "summarize this branch for a PR". Even casual phrases like "PR desc" or "write the PR" should trigger this skill.
---

# PR Description Generator

Generate a structured pull request description by analyzing all commits on the current branch compared to the base branch.

## Gathering context

1. Determine the base branch. If the user provides one as an argument, use that. Otherwise default to `main`.
2. Get the merge base: `git merge-base <base-branch> HEAD`
3. Get all commits since the merge base with full messages: `git log --reverse --format="### %s%n%n%b" <merge-base>..HEAD`
4. Get the full diff: `git diff <merge-base>..HEAD`
5. Get a summary of the changed files with per-file change counts: `git diff --stat <merge-base>..HEAD`

Read through the commits and diff carefully. Understand not just *what* changed but *why*—commit messages often explain motivation, and the diff reveals the actual behavioral changes.

## Writing the description

The description has four sections: a summary paragraph, a categorized summary, a dependencies breakdown, and an actions breakdown.

### The summary paragraph

Write 1-3 natural paragraphs at the top (no heading) that explain the overall motivation and purpose of the changes. This is the most important part—it should tell someone *why* this PR exists, not just catalog what it touches.

Lead with specific, concrete names—action names, parameter names, the actual things that changed. Don't abstract them into vague categories like "new looping capabilities" when you can say "Adds `do_while_item` action." The reader should know exactly what's in this PR from the first sentence.

Think about it from the perspective of a reviewer or someone reading the changelog: what problem was being solved? What was broken or missing before? If there are breaking changes, call them out here so they're impossible to miss.

Don't start with "This PR..."—just lead with the substance. For example: "Adds `do_while_item` action and fixes several issues with Docker sandbox environments..." or "The caching layer was silently dropping entries when..."

### The categorized summary

```
## Summary

### Enhancements
- Description of each enhancement, being specific about parameter names and behaviors added.

### Bug Fixes
- **Breaking: description here** (for fixes that change existing behavior)
- Description of each fix
```

Guidelines:
- Only include `### Enhancements` or `### Bug Fixes` if there are items for that category. Skip empty categories entirely.
- Do not include Documentation or Refactoring sections. Only include Enhancements and Bug Fixes. If a refactor changes observable behavior, list it as an enhancement.
- Each bullet should be concise but specific—mention parameter names, function names, and concrete behavioral changes rather than vague descriptions.
- For new actions or parameters with non-obvious behavior, explain the semantics. For example: "Unlike `while_item`, the condition is evaluated after the actions, so `iteration` is 1 on the first condition check."
- Use parenthetical format for secondary identifiers: `--max-tokens` CLI option (`DF_MAX_TOKENS` for env) rather than `--max-tokens` / `DF_MAX_TOKENS`.
- For bug fixes, describe outcomes rather than implementation details. Say "Both the full and log displays can now be cleanly exited" rather than "by raising `KeyboardInterrupt` from the signal handler."
- Explain *why* something matters, not just what it does. Say "making it easier to see which parameters were actually used" rather than "for easier debugging."
- Bold any breaking changes with the `**Breaking:**` prefix. A change is breaking if it alters existing behavior that consumers may depend on.

### The actions breakdown

This section only covers actions (the building blocks of pipelines). Other changes like CLI, infrastructure, or internal refactoring belong in the Summary section bullets, not here.

```
## Actions

### Updated

#### Item

`action_name`
: Description of what changed.

#### Dataset

`another_action`
: Description of what changed.

### Added

#### Item

`new_action`
: What it does and why it was added.

### Removed

#### Item

`removed_action`
: Why it was removed.
```

Guidelines:
- Only include actions—the functions defined in `actions/item/` and `actions/dataset/`. Do not include CLI changes, Docker infrastructure, utility modules, or other non-action code.
- Group actions under `#### Item` or `#### Dataset` sub-headings based on whether they live in `actions/item/` or `actions/dataset/`.
- Use the definition list format: backtick-wrapped name on its own line, then `: ` followed by the description on the next line.
- Only include `### Updated`, `### Added`, or `### Removed` sections that have content. Skip empty ones.
- Skip the entire `## Actions` section if no actions were changed.

### The dependencies breakdown

Place this section between the Summary and Actions sections. Check the diff for changes to dependency files (e.g., `pyproject.toml`, `uv.lock`, `package.json`, `requirements.txt`, `Cargo.toml`, `go.mod`) and list any dependency changes.

```
## Dependencies

### New
- `dependency-name` v0.0.0

### Upgraded
- `dependency-name` v0.0.0 -> v1.0.0

### Removed
- `dependency-name`
```

Guidelines:
- Only include `### New`, `### Upgraded`, or `### Removed` subsections that have content.
- Skip the entire `## Dependencies` section if there are no dependency changes.
- Include the version for New and Upgraded dependencies. For Upgraded, show the old and new version with `->`.
- For Removed dependencies, just the name is sufficient.

## Output

Output the full PR description as markdown. Do not wrap it in a code fence—just output the raw markdown text so the user can copy it directly.

## Example

Here's an example of a well-written PR description to calibrate tone and structure:

---

Adds `do_while_item` action and fixes several issues with Docker sandbox environments, particularly around Python/pip availability and repository setup reliability. The Codex agent Dockerfile was missing Python entirely, setup scripts were silently swallowing errors, and `~/.local/bin` wasn't on PATH—all of which caused repo setup failures that were difficult to diagnose.

This release also introduces pytest plugin support, allowing pipelines to inject custom test behavior without modifying the tests themselves.

Additionally, the `--max-tokens` CLI option enables longer model responses, and Ctrl-C now works correctly when using the `log` display mode.

## Summary

### Enhancements
- Added `--max-tokens` CLI option (`DF_MAX_TOKENS` for env) to override the default token limit (8096), enabling longer model responses.
- Added `do_while_item` action, which executes actions at least once and then repeatedly while a condition is true. Unlike `while_item`, the condition is evaluated after the actions, so `iteration` is 1 on the first condition check.
- Refactored `while_item` to align its iteration-counting behavior with `do_while_item`.
- Introduced `test_plugins_dir` parameter to `run_unit_tests` and `run_swe_agent`, allowing pytest plugins to be mounted and auto-loaded in sandbox containers.
- Pipeline parameters are now logged at startup, making it easier to see which parameters were actually used.
- Active pytest plugins and environment variables are logged when set.

### Bug Fixes
- Fixed `setup-repo.sh` swallowing errors by running setup scripts with `bash -e`, so failures propagate instead of being silently ignored.
- Fixed Codex agent Dockerfile missing Python and pip entirely, which caused repo setup scripts to fail.
- Fixed permissions issue in Codex Dockerfile where `setup-repo.sh` copy required root.
- Added `~/.local/bin` to PATH in sandbox and agent Dockerfiles to quiet pip warnings and ensure installed executables are findable.
- Set `PIP_BREAK_SYSTEM_PACKAGES=1` in Codex Dockerfile to allow pip installs without a virtual environment.
- Set `HOMEBREW_NO_AUTO_UPDATE=1` in Codex Dockerfile to reduce noise and improve build speed.
- Fixed Ctrl-C not working when `--display` is `log`. Both the full and log displays can now be cleanly exited.
- Ensured `updated_at` is set on first run so it is always present in item metadata.

## Dependencies

### Upgraded
- `openai` v1.68.2 -> v1.72.0

## Actions

### Added

#### Item

`do_while_item`
: Executes actions at least once, then continues while a condition is true. Complements the existing `while_item` action.

### Updated

#### Item

`while_item`
: Aligned iteration counter to increment before the loop body, matching `do_while_item` semantics.

`run_swe_agent`
: Added `test_plugins_dir` parameter for mounting pytest plugins into the agent container.

`run_unit_tests`
: Added `test_plugins_dir` parameter for mounting pytest plugins into the sandbox container.

`set_item_metadata`
: Set `updated_at` on first run so it always exists in metadata.

---

Notice how:
- The opening paragraph names `do_while_item` directly instead of saying "new looping capabilities"
- Enhancements explain behavioral semantics ("Unlike `while_item`, the condition is evaluated after the actions...")
- The `while_item` refactor is listed as an enhancement because it changes observable behavior
- Bug fixes describe outcomes ("Both the full and log displays can now be cleanly exited") not implementation ("by raising `KeyboardInterrupt`")
- Secondary identifiers use parenthetical format: `--max-tokens` CLI option (`DF_MAX_TOKENS` for env)
- No spaces around em dashes
25 changes: 14 additions & 11 deletions README.md
@@ -8,21 +8,24 @@ in the dataset.

For details on which actions are supported, see the [actions](docs/actions.md) documentation.

## Setup
## Installation

1. Clone the repository
2. Create a virtual environment:
1. Install the package:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install dataset-foundry
```
3. Install the package:
```bash
pip install -e .
```
4. Create a `.env` file in the project root with your OpenAI API key:

2. Create a `.env` file in the project root:
```
OPENAI_API_KEY=your_api_key_here
# Provider API keys
OPENAI_API_KEY=
ANTHROPIC_API_KEY=

# Command-line defaults
DF_MODEL=anthropic/claude-sonnet-4-20250514

# Keep full display open after finishing so you can browse the results
# DF_NO_EXIT=true
```

### Default Settings
16 changes: 15 additions & 1 deletion docs/actions.md
@@ -72,6 +72,20 @@ Executes a pipeline of steps on an item. Does not execute the setup or teardown
**Parameters:**
- `pipeline` (Union[Callable,Key,ItemPipeline,str]): The pipeline defining step to execute

### `do_while_item`
Executes actions at least once and then repeatedly while a condition is true.

**Parameters:**
- `actions` (list): Actions to execute while condition is true
- `condition` (str): The condition to evaluate
- `max_iterations` (int): Maximum number of iterations (default: 10)

For both `while_item` and `do_while_item`, the condition receives an `iteration` variable that
represents the number of completed iterations at the time the condition is evaluated. For
`while_item`, `iteration` is 0 on the first evaluation because the body has not yet run. For
`do_while_item`, the body executes once before the first evaluation, so `iteration` is 1 on the
first condition check.
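
For illustration, a minimal sketch (the `log_progress` action is hypothetical; the condition only references the built-in `iteration` variable):

```python
loop = do_while_item(
    actions=[log_progress],
    condition="iteration < 3",  # checked after the body, so the body runs three times
    max_iterations=10,
)
```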

### `generate_item`
Generates a data item using a language model, storing the result as either the entire data for a
data item, or, if specified, as data underneath the property specified by `output_key`.
@@ -181,4 +195,4 @@ Executes actions repeatedly while a condition is true.
**Parameters:**
- `condition` (str): The condition to evaluate
- `actions` (list): Actions to execute while condition is true
- `max_iterations` (int): Maximum number of iterations (default: 10)
50 changes: 50 additions & 0 deletions src/dataset_foundry/actions/item/do_while_item.py
@@ -0,0 +1,50 @@
import logging

from ...core.context import Context
from ...core.dataset_item import DatasetItem
from ...types.item_action import ItemAction
from ...utils.eval.item_eval import item_eval

logger = logging.getLogger(__name__)


def do_while_item(actions: list, condition: str, max_iterations: int = 10) -> ItemAction:
    """
    Creates an action that executes a list of actions at least once and then continues executing
    them while a given condition is met.

    The `iteration` variable passed into the condition represents the number of completed iterations
    at the time the condition is evaluated. For `do_while_item`, the actions execute once before the
    first evaluation, so `iteration` is 1 on the first condition check.

    Args:
        actions (list): A list of actions to execute while the condition is true.
        condition (str): A string representing the condition to evaluate.
        max_iterations (int): The maximum number of iterations to execute. Defaults to 10.

    Returns:
        function: A function that takes a DatasetItem and Context and executes the
            actions at least once and then while the condition is true.
    """

    async def do_while_item_action(item: DatasetItem, context: Context):
        iterations = 0

        # TODO: Think about whether we want to bind `**item.data` here to make things simpler. I
        # think other item actions are doing this [fastfedora 3.Mar.2025]
        while True:
            iterations += 1
            logger.debug(
                f"Executing do-while loop iteration {iterations} for condition '{condition}'."
            )
            for action in actions:
                await action(item, context)

            if not item_eval(condition, item, context, {"iteration": iterations}):
                break

            if iterations >= max_iterations:
                logger.warning(f"Reached maximum of {max_iterations} iterations for '{condition}'.")
                break

    return do_while_item_action
7 changes: 7 additions & 0 deletions src/dataset_foundry/actions/item/run_swe_agent.py
@@ -20,6 +20,7 @@ def run_swe_agent(
output_dir: Union[Callable, Key, str] = Key("context.output_dir"),
agent: Union[Callable, Key, str] = Key("context.swe_agent.type"),
repo_path: Optional[Union[Callable, Key, str]] = None,
test_plugins_dir: Optional[Union[Callable, Key, str]] = None,
timeout: Union[Callable, Key, int] = 3600, # 1 hour default
max_retries: Union[Callable, Key, int] = 3,
output_key: Union[Callable, Key, str] = "agent_result",
@@ -36,6 +37,7 @@ def run_swe_agent(
output_dir: Directory where agent output should be saved
agent: Name of the agent to run (e.g., "codex", "claude-code")
repo_path: Optional path to pre-existing repository
test_plugins_dir: Optional directory containing pytest plugins to mount in the container
timeout: Maximum execution time in seconds
max_retries: Maximum number of retry attempts
output_key: Key to store the agent result in item data
@@ -51,6 +53,7 @@ async def run_swe_agent_action(item: DatasetItem, context: Context):
resolved_output_dir = resolve_item_value(output_dir, item, context, required_as="output_dir")
resolved_agent = resolve_item_value(agent, item, context, required_as="agent")
resolved_repo_path = resolve_item_value(repo_path, item, context) if repo_path else None
resolved_test_plugins_dir = resolve_item_value(test_plugins_dir, item, context)
resolved_timeout = resolve_item_value(timeout, item, context)
resolved_max_retries = resolve_item_value(max_retries, item, context)
resolved_output_key = resolve_item_value(output_key, item, context)
@@ -69,6 +72,9 @@ async def run_swe_agent_action(item: DatasetItem, context: Context):
**item.data
})

if isinstance(resolved_test_plugins_dir, str):
resolved_test_plugins_dir = Path(resolved_test_plugins_dir)

output_path = Path(resolved_output_dir)
output_path.mkdir(parents=True, exist_ok=True)

@@ -93,6 +99,7 @@ async def run_swe_agent_action(item: DatasetItem, context: Context):
result = await agent_runner.run(
inputs=agent_inputs,
output_dir=output_path,
test_plugins_dir=resolved_test_plugins_dir,
timeout=resolved_timeout,
attempt=attempt + 1,
stream_logs=resolved_stream_logs
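For reference, a hedged usage sketch of the new `test_plugins_dir` parameter (the agent name and directory path below are illustrative):

```python
step = run_swe_agent(
    agent="claude-code",
    test_plugins_dir="tests/plugins",  # mounted into the agent container for pytest plugins to load
    timeout=1800,
)
```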
6 changes: 6 additions & 0 deletions src/dataset_foundry/actions/item/run_unit_tests.py
@@ -27,6 +27,7 @@ def run_unit_tests(
property: Union[Callable,Key,str] = "test_result",
setup_property: Union[Callable,Key,str] = "setup_result",
sandbox: Optional[Union[Callable,Key,str]] = None,
test_plugins_dir: Optional[Union[Callable,Key,str,Path]] = None,
stream_logs: Union[Callable,Key,bool] = False,
timeout: Union[Callable,Key,int] = 300,
) -> ItemAction:
@@ -36,6 +37,7 @@ async def run_unit_tests_action(item: DatasetItem, context: Context):
resolved_property = resolve_item_value(property, item, context, required_as="property")
resolved_setup_property = resolve_item_value(setup_property, item, context)
resolved_sandbox = resolve_item_value(sandbox, item, context)
resolved_test_plugins_dir = resolve_item_value(test_plugins_dir, item, context)
resolved_stream_logs = resolve_item_value(stream_logs, item, context)
resolved_timeout = resolve_item_value(timeout, item, context)

@@ -45,12 +47,16 @@ async def run_unit_tests_action(item: DatasetItem, context: Context):
else:
raise ValueError("Sandbox must be a string name of a sandbox")

if isinstance(resolved_test_plugins_dir, str):
resolved_test_plugins_dir = Path(resolved_test_plugins_dir)

command = [f"python -m pytest -v '{resolved_filename}'"]

logger.info(f"Running tests in sandbox with command: {' '.join(command)}")
sandbox_result = await sandbox_manager.run(
target_file=resolved_filename,
workspace_dir=resolved_dir,
test_plugins_dir=resolved_test_plugins_dir,
command=command,
timeout=resolved_timeout,
stream_logs=resolved_stream_logs
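Similarly, a hedged sketch for `run_unit_tests` (the sandbox name and plugins path are illustrative; other parameters keep their defaults):

```python
step = run_unit_tests(
    sandbox="python",
    test_plugins_dir="tests/plugins",  # string paths are coerced to Path before mounting
    timeout=300,
)
```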