
Adding data gen using llm #26

Open
akhatua2 wants to merge 3 commits into main from data-gen

Conversation

akhatua2 (Collaborator) commented on Feb 2, 2026

Feature: Automated Benchmark Data Generation Pipeline

Description

This PR introduces an automated pipeline for generating new benchmark features using AI agents. The pipeline creates features that intentionally conflict with existing features during git merge, enabling evaluation of AI agents' ability to resolve merge conflicts.

Features

  • Agent-based feature generation - Uses mini_swe_agent to implement new features in sandboxed environments
  • Automatic conflict detection - Validates that generated features create real git merge conflicts with existing features
  • Test validation - Runs only the newly added tests to verify feature correctness
  • Rich metadata collection - Captures feature descriptions, conflict info with titles, test output, and agent trajectories

Usage

# Generate a new conflicting feature for a task
python -m cooperbench.generation \
    --task dataset/huggingface_datasets_task/task7309 \
    --backend modal \
    --model gemini-2.5-flash \
    --output generated/my_run \
    --debug

# Just preview the prompt (no agent run)
python -m cooperbench.generation --task dataset/huggingface_datasets_task/task7309 --prompt-only

# Validate existing patches
python -m cooperbench.generation --task dataset/huggingface_datasets_task/task7309 \
    --validate feature.patch tests.patch

Pipeline Flow

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  Build Prompt   │ ──▶ │  Run Agent in    │ ──▶ │  Extract Patch  │
│  from existing  │     │  Modal/Docker    │     │  & Split into   │
│  features       │     │  Sandbox         │     │  feature/tests  │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                          │
                                                          ▼
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  Save Results   │ ◀── │  Check Conflicts │ ◀── │  Run New Tests  │
│  feature.patch  │     │  with Existing   │     │  in Sandbox     │
│  tests.patch    │     │  Features        │     │                 │
│  feature.md     │     │                  │     │                 │
└─────────────────┘     └──────────────────┘     └─────────────────┘

Key Implementation Details

1. Reliable Feature Description Extraction

The agent writes its feature description to .feature_description.md inside the repo, so the description is captured in the git diff. This avoids brittle parsing of the agent's conversation log.

# Prompt instructs agent to create:
cat << 'FEATURE_EOF' > .feature_description.md
**Title**: [Feature title]
**Description**: [Detailed description]
**API Changes**: [New functions, parameters]
**Implementation Details**: [Key algorithms, logic]
**Files Modified**: [List of files]
FEATURE_EOF
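
On the read side, the generator can recover this description directly from the patch. A minimal sketch of the idea (the function name is illustrative, not the actual generator.py API):

def extract_feature_description(patch: str) -> str | None:
    """Recover the contents of .feature_description.md from a unified diff.

    Assumes the patch adds the file, so every line of the description
    appears as a "+" line in that file's hunks.
    """
    in_target = False
    body: list[str] = []
    for line in patch.splitlines():
        if line.startswith("diff --git"):
            in_target = ".feature_description.md" in line
        elif in_target and line.startswith("+") and not line.startswith("+++"):
            body.append(line[1:])  # drop the leading "+"
    return "\n".join(body) if body else None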

2. Selective Test Execution

Only runs newly added test functions, avoiding false failures from pre-existing test issues:

def _extract_new_test_functions(patch: str) -> list[str]:
    """Extract test function names from patch in pytest format."""
    # Returns: ["tests/io/test_parquet.py::test_parquet_row_groups_selection"]
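
The PR shows only the signature; a plausible sketch of the approach, assuming new tests appear as added def test_... lines under +++ b/ file headers in the diff (not necessarily the actual validator.py implementation):

import re

def _extract_new_test_functions(patch: str) -> list[str]:
    """Extract newly added test function names from a patch, as pytest
    node IDs like tests/io/test_parquet.py::test_parquet_row_groups_selection."""
    tests: list[str] = []
    current_file: str | None = None
    for line in patch.splitlines():
        if line.startswith("+++ b/"):  # file that the hunks below apply to
            current_file = line[len("+++ b/"):]
        else:
            m = re.match(r"\+\s*def (test_\w+)\(", line)
            if m and current_file and "test" in current_file:
                tests.append(f"{current_file}::{m.group(1)}")
    return tests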

3. Git Merge Conflict Detection

Creates real git branches and attempts merges to detect actual conflicts:

# Branch A: existing feature
git checkout -b __existing_{fid}
git apply feature{fid}.patch
git commit -m "existing feature"

# Branch B: new feature
git checkout -b __new_{fid}
git apply new_feature.patch
git commit -m "new feature"

# Try merge - conflicts detected if merge fails
git merge --no-commit __existing_{fid}
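
From Python, the same check can be driven via subprocess. A hedged sketch, assuming the branches have been set up as above and the new-feature branch is checked out (helper names are illustrative):

import subprocess

def _git(repo: str, *args: str) -> subprocess.CompletedProcess:
    return subprocess.run(["git", *args], cwd=repo, capture_output=True, text=True)

def has_conflict(repo: str, fid: int) -> bool:
    """True if merging __existing_{fid} into the new-feature branch fails."""
    merged = _git(repo, "merge", "--no-commit", "--no-ff", f"__existing_{fid}")
    conflicted = merged.returncode != 0
    _git(repo, "merge", "--abort")  # reset state before checking the next feature
    return conflicted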

4. Rich Conflict Information

Returns feature titles alongside conflict IDs for better reporting:

{
  "conflicts": [1, 2],
  "conflicts_info": [
    {"id": 1, "title": "Faster Parquet Streaming + Filters"},
    {"id": 2, "title": "Support for Sorting During Streaming"}
  ]
}
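
The titles can be read from each existing feature's description file. A sketch, assuming the '**Title**: ...' format from the generation prompt and a hypothetical feature{fid}/feature.md layout inside the task directory:

import re
from pathlib import Path

def feature_title(feature_md: Path) -> str | None:
    """Parse the feature title from a feature.md in the '**Title**: ...' format."""
    m = re.search(r"\*\*Title\*\*:\s*(.+)", feature_md.read_text())
    return m.group(1).strip() if m else None

# Hypothetical paths; adjust to the actual dataset layout.
task_dir = Path("dataset/huggingface_datasets_task/task7309")
conflicts = [1, 2]
conflicts_info = [
    {"id": fid, "title": feature_title(task_dir / f"feature{fid}" / "feature.md")}
    for fid in conflicts
]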

Files Modified

File                                        Changes
src/cooperbench/generation/prompt.py        Enhanced prompt with detailed feature description format, test file targeting
src/cooperbench/generation/generator.py     Added .feature_description.md extraction, conflicts_info support
src/cooperbench/generation/validator.py     Selective test execution, feature title extraction for conflicts
src/cooperbench/generation/splitter.py      Fixed patch newline handling for git compatibility
src/cooperbench/generation/__main__.py      CLI improvements, output directory naming

Output Structure

generated/
└── row_group_selection_for_parquet_datasets_f701d/
    ├── feature.patch      # Implementation changes
    ├── tests.patch        # Test changes
    ├── feature.md         # Detailed feature description
    ├── result.json        # Full results with conflicts, test output
    ├── trajectory_*.json  # Agent conversation history
    └── trajectory_*.txt   # Human-readable trajectory

Result Schema

{
  "success": true,
  "feature_md": "**Title**: Row Group Selection...",
  "feature_patch": "diff --git ...",
  "tests_patch": "diff --git ...",
  "conflicts": [1, 2],
  "conflicts_info": [
    {
      "id": 1,
      "title": "Faster Parquet Streaming + Filters",
      "conflict_diff": "--- file.py ---\n<<<<<<< HEAD\n...\n=======\n...\n>>>>>>> __existing_1"
    },
    {
      "id": 2,
      "title": "Support for Sorting During Streaming",
      "conflict_diff": "--- file.py ---\n<<<<<<< HEAD\n...\n=======\n...\n>>>>>>> __existing_2"
    }
  ],
  "errors": [],
  "agent_cost": 0.047,
  "agent_steps": 11,
  "duration_seconds": 72.3,
  "tests_passed": true,
  "tests_output": "===== 1 passed in 0.30s =====",
  "validation_run": true
}
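
Results in this shape are straightforward to filter downstream. For example, keeping only validated runs that conflict with at least two existing features:

import json
from pathlib import Path

kept = []
for path in Path("generated").glob("*/result.json"):
    r = json.loads(path.read_text())
    if r["success"] and r["tests_passed"] and len(r["conflicts"]) >= 2:
        kept.append((path.parent.name, [c["title"] for c in r["conflicts_info"]]))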

Testing

Validated with huggingface_datasets_task/task7309:

  • Agent successfully generates conflicting features
  • Feature descriptions properly extracted from patch
  • Only new tests are run (avoids pre-existing failures)
  • Conflicts detected with both existing features
  • Feature titles included in conflict info

ProKil (Member) commented on Feb 2, 2026

Would this produce features that are too similar to the original feature limiting the diversity of training tasks?

akhatua2 (Collaborator, Author) commented on Feb 2, 2026

Yeah, that is a possibility. Potential directions:

(1) We could adjust the prompt so that a stronger model (e.g. Gemini 3 Pro) first proposes feature ideas that promote diversity, conflict-ability, and compatibility, and then feed those ideas into this flow.

(2) Another direction is to automate the setup of new repositories and generate tasks on those.

