
Adding data gen using llm #26

Open
akhatua2 wants to merge 3 commits into main from data-gen

Conversation

akhatua2 (Collaborator) commented on Feb 2, 2026

Feature: Automated Benchmark Data Generation Pipeline

Description

This PR introduces an automated pipeline for generating new benchmark features using AI agents. The pipeline creates features that intentionally conflict with existing features during git merge, enabling evaluation of AI agents' ability to resolve merge conflicts.

Features

  • Agent-based feature generation - Uses mini_swe_agent to implement new features in sandboxed environments
  • Automatic conflict detection - Validates that generated features create real git merge conflicts with existing features
  • Test validation - Runs only the newly added tests to verify feature correctness
  • Rich metadata collection - Captures feature descriptions, conflict info with titles, test output, and agent trajectories

Usage

# Generate a new conflicting feature for a task
python -m cooperbench.generation \
    --task dataset/huggingface_datasets_task/task7309 \
    --backend modal \
    --model gemini-2.5-flash \
    --output generated/my_run \
    --debug

# Just preview the prompt (no agent run)
python -m cooperbench.generation --task dataset/huggingface_datasets_task/task7309 --prompt-only

# Validate existing patches
python -m cooperbench.generation --task dataset/huggingface_datasets_task/task7309 \
    --validate feature.patch tests.patch

Pipeline Flow

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  Build Prompt   │ ──▶ │  Run Agent in    │ ──▶ │  Extract Patch  │
│  from existing  │     │  Modal/Docker    │     │  & Split into   │
│  features       │     │  Sandbox         │     │  feature/tests  │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                          │
                                                          ▼
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  Save Results   │ ◀── │  Check Conflicts │ ◀── │  Run New Tests  │
│  feature.patch  │     │  with Existing   │     │  in Sandbox     │
│  tests.patch    │     │  Features        │     │                 │
│  feature.md     │     │                  │     │                 │
└─────────────────┘     └──────────────────┘     └─────────────────┘

Key Implementation Details

1. Reliable Feature Description Extraction

The agent writes its feature description to .feature_description.md inside the repo, so the description is captured in the git diff. This avoids brittle parsing of the agent's conversation log.

# Prompt instructs agent to create:
cat << 'FEATURE_EOF' > .feature_description.md
**Title**: [Feature title]
**Description**: [Detailed description]
**API Changes**: [New functions, parameters]
**Implementation Details**: [Key algorithms, logic]
**Files Modified**: [List of files]
FEATURE_EOF
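
On the read side, the generator can recover this description directly from the patch. A minimal sketch of the idea (the function name is illustrative, not the actual generator.py API):

def extract_feature_description(patch: str) -> str | None:
    """Recover the contents of .feature_description.md from a unified diff.

    Assumes the patch adds the file, so every line of the description
    appears as a "+" line in that file's hunks.
    """
    in_target = False
    body: list[str] = []
    for line in patch.splitlines():
        if line.startswith("diff --git"):
            in_target = ".feature_description.md" in line
        elif in_target and line.startswith("+") and not line.startswith("+++"):
            body.append(line[1:])  # drop the leading "+"
    return "\n".join(body) if body else None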

2. Selective Test Execution

Only runs newly added test functions, avoiding false failures from pre-existing test issues:

def _extract_new_test_functions(patch: str) -> list[str]:
    """Extract test function names from patch in pytest format."""
    # Returns: ["tests/io/test_parquet.py::test_parquet_row_groups_selection"]
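
The PR shows only the signature; a plausible sketch of the approach, assuming new tests appear as added def test_... lines under +++ b/ file headers in the diff (not necessarily the actual validator.py implementation):

import re

def _extract_new_test_functions(patch: str) -> list[str]:
    """Extract newly added test function names from a patch, as pytest
    node IDs like tests/io/test_parquet.py::test_parquet_row_groups_selection."""
    tests: list[str] = []
    current_file: str | None = None
    for line in patch.splitlines():
        if line.startswith("+++ b/"):  # file that the hunks below apply to
            current_file = line[len("+++ b/"):]
        else:
            m = re.match(r"\+\s*def (test_\w+)\(", line)
            if m and current_file and "test" in current_file:
                tests.append(f"{current_file}::{m.group(1)}")
    return tests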

3. Git Merge Conflict Detection

Creates real git branches and attempts merges to detect actual conflicts:

# Branch A: existing feature
git checkout -b __existing_{fid}
git apply feature{fid}.patch
git commit -m "existing feature"

# Branch B: new feature
git checkout -b __new_{fid}
git apply new_feature.patch
git commit -m "new feature"

# Try merge - conflicts detected if merge fails
git merge --no-commit __existing_{fid}
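
From Python, the same check can be driven via subprocess. A hedged sketch, assuming the branches have been set up as above and the new-feature branch is checked out (helper names are illustrative):

import subprocess

def _git(repo: str, *args: str) -> subprocess.CompletedProcess:
    return subprocess.run(["git", *args], cwd=repo, capture_output=True, text=True)

def has_conflict(repo: str, fid: int) -> bool:
    """True if merging __existing_{fid} into the new-feature branch fails."""
    merged = _git(repo, "merge", "--no-commit", "--no-ff", f"__existing_{fid}")
    conflicted = merged.returncode != 0
    _git(repo, "merge", "--abort")  # reset state before checking the next feature
    return conflicted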

4. Rich Conflict Information

Returns feature titles alongside conflict IDs for better reporting:

{
  "conflicts": [1, 2],
  "conflicts_info": [
    {"id": 1, "title": "Faster Parquet Streaming + Filters"},
    {"id": 2, "title": "Support for Sorting During Streaming"}
  ]
}
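
The titles can be read from each existing feature's description file. A sketch, assuming the '**Title**: ...' format from the generation prompt and a hypothetical feature{fid}/feature.md layout inside the task directory:

import re
from pathlib import Path

def feature_title(feature_md: Path) -> str | None:
    """Parse the feature title from a feature.md in the '**Title**: ...' format."""
    m = re.search(r"\*\*Title\*\*:\s*(.+)", feature_md.read_text())
    return m.group(1).strip() if m else None

# Hypothetical paths; adjust to the actual dataset layout.
task_dir = Path("dataset/huggingface_datasets_task/task7309")
conflicts = [1, 2]
conflicts_info = [
    {"id": fid, "title": feature_title(task_dir / f"feature{fid}" / "feature.md")}
    for fid in conflicts
]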

Files Modified

File                                        Changes
src/cooperbench/generation/prompt.py        Enhanced prompt with detailed feature description format, test file targeting
src/cooperbench/generation/generator.py     Added .feature_description.md extraction, conflicts_info support
src/cooperbench/generation/validator.py     Selective test execution, feature title extraction for conflicts
src/cooperbench/generation/splitter.py      Fixed patch newline handling for git compatibility
src/cooperbench/generation/__main__.py      CLI improvements, output directory naming

Output Structure

generated/
└── row_group_selection_for_parquet_datasets_f701d/
    ├── feature.patch      # Implementation changes
    ├── tests.patch        # Test changes
    ├── feature.md         # Detailed feature description
    ├── result.json        # Full results with conflicts, test output
    ├── trajectory_*.json  # Agent conversation history
    └── trajectory_*.txt   # Human-readable trajectory

Result Schema

{
  "success": true,
  "feature_md": "**Title**: Row Group Selection...",
  "feature_patch": "diff --git ...",
  "tests_patch": "diff --git ...",
  "conflicts": [1, 2],
  "conflicts_info": [
    {
      "id": 1,
      "title": "Faster Parquet Streaming + Filters",
      "conflict_diff": "--- file.py ---\n<<<<<<< HEAD\n...\n=======\n...\n>>>>>>> __existing_1"
    },
    {
      "id": 2,
      "title": "Support for Sorting During Streaming",
      "conflict_diff": "--- file.py ---\n<<<<<<< HEAD\n...\n=======\n...\n>>>>>>> __existing_2"
    }
  ],
  "errors": [],
  "agent_cost": 0.047,
  "agent_steps": 11,
  "duration_seconds": 72.3,
  "tests_passed": true,
  "tests_output": "===== 1 passed in 0.30s =====",
  "validation_run": true
}
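
Results in this shape are straightforward to filter downstream. For example, keeping only validated runs that conflict with at least two existing features:

import json
from pathlib import Path

kept = []
for path in Path("generated").glob("*/result.json"):
    r = json.loads(path.read_text())
    if r["success"] and r["tests_passed"] and len(r["conflicts"]) >= 2:
        kept.append((path.parent.name, [c["title"] for c in r["conflicts_info"]]))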

Testing

Validated with huggingface_datasets_task/task7309:

  • Agent successfully generates conflicting features
  • Feature descriptions properly extracted from patch
  • Only new tests are run (avoids pre-existing failures)
  • Conflicts detected with both existing features
  • Feature titles included in conflict info

ProKil (Member) commented on Feb 2, 2026

Would this produce features that are too similar to the original feature limiting the diversity of training tasks?

akhatua2 (Collaborator, Author) commented on Feb 2, 2026

Yeah, that is a possibility. Potential directions:

(1) We could adjust the prompt so that a stronger model (e.g. Gemini 3 Pro) first proposes feature ideas that promote diversity, conflict-ability, and compatibility, and then feed those ideas into this flow.

(2) Another direction is to automate the setup of new repositories and generate tasks on those.

