SWE-Python-AI: A SWE-Bench-like benchmark for AI-related tasks in Python
Version 1.0
Date: 5 Nov 2024
Duc-Manh Tran
SWE-Bench requires PyTest to be executable for each of its task instances (datapoints), which means that we need to specify the installation steps for every version of each repository.
Example:
For a task instance:
{
"repo": "keras-team/keras",
"pull_number": 20410,
"instance_id": "keras-team__keras-20410",
"issue_numbers": ["19740"],
"base_commit": "0c2bdff313f7533f0d7e6670a906102cc2fb046d",
"patch": "a very long string",
"test_patch": "a very long string",
"hints_text": "a very long string",
"created_at": "2024-10-25T16:23:31Z"
}
For this instance, we need to identify the version of the repository associated with the base_commit, as well as the FAIL-TO-PASS and PASS-TO-PASS test cases.
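One possible way to look up the version for a base_commit, assuming the repository tags its releases in git and a local clone is available, is to ask `git describe --tags` for the nearest tag. The sketch below is illustrative only: the helper names `parse_describe` and `nearest_tag` are hypothetical and are not part of SWE-Bench. Note that `git describe` reports the most recent tag reachable from the commit, which may be the release before the commit rather than the release that first ships it.

```python
import re
import subprocess

# `git describe --tags` prints either a bare tag ("v3.6.0") or
# "<tag>-<commits since tag>-g<abbreviated sha>" ("v3.6.0-12-g0c2bdff").
_DESCRIBE_RE = re.compile(r"^(?P<tag>.+?)(?:-(?P<distance>\d+)-g(?P<sha>[0-9a-f]+))?$")


def parse_describe(output: str) -> tuple[str, int]:
    """Split `git describe --tags` output into (nearest tag, commits since that tag)."""
    match = _DESCRIBE_RE.match(output.strip())
    if match is None:
        raise ValueError(f"unrecognised git describe output: {output!r}")
    distance = int(match.group("distance") or 0)
    return match.group("tag"), distance


def nearest_tag(repo_dir: str, commit: str) -> tuple[str, int]:
    """Hypothetical helper: run git in a local clone and parse the result."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "describe", "--tags", commit],
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. when no tag is reachable
    )
    return parse_describe(result.stdout)
```

With a local clone of keras-team/keras, a call such as `nearest_tag("keras", "0c2bdff313f7533f0d7e6670a906102cc2fb046d")` would return the nearest tag and the number of commits since it; the result still needs a manual check against the release history.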
Regarding versioning, for the repository keras-team/keras at version v3.6.0 (assuming that this is the version associated with commit 0c2bdff313f7533f0d7e6670a906102cc2fb046d), we need to identify the following information:
{
"python": "3.9",
"packages": "requirements.txt",
"install": "pip install -e .",
"pip_packages": ["pytest"],
"pytest_cmd": "pytest keras/src/applications"
}
For the FAIL-TO-PASS and PASS-TO-PASS test cases of this instance, we need to manually compare the PyTest outputs from before and after the patch is applied.
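The comparison could be partly automated. The sketch below assumes PyTest was run with `-rA`, so that the short test summary contains lines starting with `PASSED`, `FAILED`, or `ERROR`, and that one log was captured before the patch and one after. The function names `parse_summary` and `classify_tests` are hypothetical, and treating a test that is absent from the "before" log as failing is an assumption that should be checked against the benchmark's own rules.

```python
STATUSES = ("PASSED", "FAILED", "ERROR")


def parse_summary(log: str) -> dict[str, str]:
    """Map test id -> status from `pytest -rA` short summary lines."""
    results = {}
    for line in log.splitlines():
        parts = line.strip().split(" ", 1)
        if len(parts) == 2 and parts[0] in STATUSES:
            # Failure lines look like "FAILED path::test - message"; drop the message.
            test_id = parts[1].split(" - ", 1)[0]
            results[test_id] = parts[0]
    return results


def classify_tests(before_log: str, after_log: str) -> dict[str, list[str]]:
    """Group the tests that pass after the patch by their status before it."""
    before = parse_summary(before_log)
    after = parse_summary(after_log)
    fail_to_pass, pass_to_pass = [], []
    for test_id, status in after.items():
        if status != "PASSED":
            continue  # only tests that pass after the patch are of interest
        if before.get(test_id) == "PASSED":
            pass_to_pass.append(test_id)
        else:
            # Assumption: failing, erroring, or missing before the patch
            # all count as "fail" for FAIL-TO-PASS.
            fail_to_pass.append(test_id)
    return {
        "FAIL_TO_PASS": sorted(fail_to_pass),
        "PASS_TO_PASS": sorted(pass_to_pass),
    }
```

The output of such a script is a starting point for the manual review rather than a replacement for it, since flaky tests or unusual test ids may be misclassified.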
| Task ID | Task Description | Deadline | Notes |
|---|---|---|---|
| T1 | Versioning | [Due Date] | Run versioning scripts while looking for a better approach |
| T2 | Run PyTest | [Due Date] | Read each repository's documentation to specify the information for each of its versions, and create a script to validate that information |
| T3 | Compare PyTest's outputs | [Due Date] | Run PyTest with the information specified above on the task instances to identify FAIL-TO-PASS and PASS-TO-PASS test cases |
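As a starting point for the validation script mentioned in T2, a minimal sketch could check that a version specification has the fields shown in the keras example, with the expected types. The name `validate_spec` and the choice of required fields are assumptions based only on that example; a real script would also need to try the installation and run the PyTest command.

```python
# Fields and types taken from the keras v3.6.0 example specification.
REQUIRED_FIELDS = {
    "python": str,
    "packages": str,
    "install": str,
    "pip_packages": list,
    "pytest_cmd": str,
}


def validate_spec(spec: dict) -> list[str]:
    """Return a list of problems found in a version specification (empty if none)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in spec:
            problems.append(f"missing field: {field}")
        elif not isinstance(spec[field], expected_type):
            problems.append(
                f"{field} should be {expected_type.__name__}, "
                f"got {type(spec[field]).__name__}"
            )
    return problems


example_spec = {
    "python": "3.9",
    "packages": "requirements.txt",
    "install": "pip install -e .",
    "pip_packages": ["pytest"],
    "pytest_cmd": "pytest keras/src/applications",
}
print(validate_spec(example_spec))
```

Returning a list of problems instead of raising on the first one makes it easier to fix a specification in a single pass.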
List the specific outcomes or products expected from the project, along with due dates.
| Deliverable | Description | Due Date | Notes |
|---|---|---|---|
| Repository's Specification | Complete this file | [Date] | |
| F2P & P2P | Identify FAIL-TO-PASS and PASS-TO-PASS test cases for every task instance | [Date] | |
| Versioning | A better approach to finding the version of a given commit | [Date] | |
Provide an overview of the project timeline, including key milestones.
- Kick-off Meeting: [Date]
- Knowledge: Python, PyTest, basic Linux
- Software/Tools: Kaggle
- Weekly Meetings: Every Monday
- Status Reports: Due every Thursday