Commit 9451c54

Datasmith pipeline improvements (#9)
* Add prepare_formulacode_dataset.py for HF dataset enrichment and upload
* Add DATASET_CARD.md template for HF dataset card
* Add Step 7 (HF upload) to update_formulacode.py
* Add Step 7 docs, HF_TOKEN, and remove license section
* Remove AWS/ECR files
* Remove AWS/ECR references from source, config, and docs
1 parent c2d2280 commit 9451c54

23 files changed

Lines changed: 1072 additions & 4362 deletions

AGENTS.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
# Repository Guidelines

This guide helps contributors work effectively in the datasmith repository.

## Project Structure & Module Organization
- Source: `src/datasmith/` (core modules: `agents/`, `docker/`, `scrape/`, `benchmark/`, `detection/`, `execution/`, `collation/`, `core/`).
- Tests: `tests/`, containing pytest suites (e.g., `tests/test_docker_*`, `tests/agents/`).
- Assets/Docs: `static/`, `docs/`.
- Artifacts: `scratch/` (generated data), `dist/` (wheels). Do not commit contents.

## Build, Test, and Development Commands
- `make install`: create env with uv and install pre-commit.
- `make check`: lock check, ruff lint/format, mypy, deptry.
- `make test`: run pytest with coverage (XML for CI/Codecov).
- `make build`: build wheel into `dist/`.
- `uv run <cmd>`: run tools inside the env (e.g., `uv run pytest`).
- `uvx tox -q`: run the tox matrix (py39–py312) if tox is installed.
- Optional: `make backup` uses `tokens.env` for `BACKUP_DIR` rsync.
- To run commands using the same environment variables as the user, use `uv run <command>`.

## Coding Style & Naming Conventions
- Python 3.9–3.12. 4-space indentation, type hints required (mypy strict; see `pyproject.toml`).
- Lint/format via Ruff (line length 120). Run `make check` before pushing.
- Naming: modules/functions `snake_case`, classes `CamelCase`, constants `UPPER_SNAKE_CASE`.
- Prefer `logging` (see `src/datasmith/logging_config.py`) over prints.
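
As an illustration only, the conventions above could look like this in a module. Every name here (`MAX_STEPS`, `BuildPlan`, `summarize_plan`) is hypothetical and not taken from the datasmith source, and the sketch uses the standard `logging.getLogger` because the contents of `logging_config.py` are not shown in this commit.

```python
import logging
from dataclasses import dataclass

# Module-level logger; assumes logging is configured elsewhere in the project.
logger = logging.getLogger(__name__)

# Constants use UPPER_SNAKE_CASE.
MAX_STEPS = 3


@dataclass(frozen=True)
class BuildPlan:
    """Classes use CamelCase; every field carries a type hint."""

    repo_name: str
    steps: tuple[str, ...]


def summarize_plan(plan: BuildPlan) -> str:
    """Functions use snake_case and are fully annotated."""
    shown = plan.steps[:MAX_STEPS]
    # Report through logging rather than print().
    logger.info("Summarizing %d of %d steps for %s", len(shown), len(plan.steps), plan.repo_name)
    return f"{plan.repo_name}: " + ", ".join(shown)
```

Lazy `%`-style arguments in the `logger.info` call keep the message cheap when the INFO level is disabled.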

## Testing Guidelines
- Framework: pytest + pytest-cov. Place tests in `tests/` named `test_*.py`.
- Run locally: `make test` or `uv run pytest`.
- Coverage: Codecov target 90% (see `codecov.yaml`). Add tests for new code paths.
- Tests must be deterministic and offline; use fakes for network calls.
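
One way to satisfy the "deterministic and offline" rule is to pass the network client in as a parameter and hand the test a fake. The sketch below assumes a hypothetical `JsonClient` protocol and `fetch_star_count` helper; neither exists in datasmith as far as this commit shows, and the URL is used only as a dictionary key, never requested.

```python
from typing import Any, Protocol


class JsonClient(Protocol):
    """Hypothetical interface for anything that can fetch JSON from a URL."""

    def get_json(self, url: str) -> dict[str, Any]: ...


def fetch_star_count(client: JsonClient, repo: str) -> int:
    """Hypothetical code under test: reads one field from a repo payload."""
    payload = client.get_json(f"https://api.github.com/repos/{repo}")
    return int(payload["stargazers_count"])


class FakeClient:
    """In-memory stand-in that records requests and never touches the network."""

    def __init__(self, responses: dict[str, dict[str, Any]]) -> None:
        self.responses = responses
        self.requested: list[str] = []

    def get_json(self, url: str) -> dict[str, Any]:
        self.requested.append(url)
        return self.responses[url]


def test_fetch_star_count_uses_fake() -> None:
    url = "https://api.github.com/repos/pandas-dev/pandas"
    fake = FakeClient({url: {"stargazers_count": 3}})  # canned value, not real data
    assert fetch_star_count(fake, "pandas-dev/pandas") == 3
    assert fake.requested == [url]
```

Because the fake returns canned data, the test gives the same result on every run and on machines without network access.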

## Commit & Pull Request Guidelines
- History is informal; please use clear, present-tense summaries, optionally prefixing a subsystem tag: `docker: prune dangling layers`, `agents: improve build plan`.
- PRs must include: description, rationale, test coverage notes, and any docs updates. Link issues. For CLI/UX changes, include sample output or screenshots.
- Ensure `make check` and `make test` pass; CI should be green.

## Security & Configuration Tips
- Create `tokens.env` (ignored) for `GH_TOKEN`, `CODECOV_TOKEN`, `CACHE_LOCATION`, `BACKUP_DIR`. Never commit secrets.
- Docker tooling exists in `src/datasmith/docker/`; validate locally before pushing remote runs.

## Agent-Specific Instructions
- Keep changes small and focused; update/cover adjacent tests. Follow this guide for all files under the repo root.

DATASET_CARD.md

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
![banner](static/formula-code-datasmith.svg)


<p align="center">
<a href="https://formula-code.github.io/">
<img src="https://img.shields.io/badge/%F0%9F%8C%90%20Website-0A7A5E?style=for-the-badge" alt="FormulaCode Website">
</a>
<a href="https://example.com">
<img src="https://img.shields.io/badge/Paper-1F6FEB?style=for-the-badge&logo=arxiv&logoColor=white" alt="FormulaCode Paper">
</a>
<a href="https://formula-code.github.io/leaderboard/">
<img src="https://img.shields.io/badge/%F0%9F%93%88%20Leaderboard-EA580C?style=for-the-badge&logoColor=white" alt="FormulaCode Leaderboard">
</a>
</p>

[FormulaCode](https://formula-code.github.io/) is a *live* benchmark for evaluating the holistic ability of LLM agents to optimize codebases. FormulaCode consists of two parts: a [pipeline](https://github.com/formula-code/datasmith) to construct performance optimization tasks, and an [execution harness](https://github.com/formula-code/terminal-bench) that connects a language model to our terminal sandbox.

This dataset contains **{total_rows}** enriched performance optimization tasks derived from real open-source Python projects, spanning {num_months} months of merged PRs.

## Quick Start

```python
from datasets import load_dataset

# Load all tasks
ds = load_dataset("formulacode/formulacode-all")

# Load only verified tasks (human-validated)
ds = load_dataset("formulacode/formulacode-all", "verified")

# Load tasks from a specific month
ds = load_dataset("formulacode/formulacode-all", "2024-07")
```
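
The month configs are named after the PR merge month, so a config name can be derived from a task's `pr_merged_at` value. The sketch below assumes that `pr_merged_at` is an ISO 8601 string (for example with a trailing `Z`); the card only says it is a date, so treat the format as an assumption. The helper names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def month_config(pr_merged_at: str) -> str:
    """Map a merge timestamp to a `YYYY-MM` config name (ISO 8601 input assumed)."""
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
    # so normalize it to an explicit UTC offset first.
    merged = datetime.fromisoformat(pr_merged_at.replace("Z", "+00:00"))
    return merged.strftime("%Y-%m")


def group_task_ids_by_month(rows: list[dict[str, str]]) -> dict[str, list[str]]:
    """Bucket `task_id` values by the month config their row would fall into."""
    buckets: defaultdict[str, list[str]] = defaultdict(list)
    for row in rows:
        buckets[month_config(row["pr_merged_at"])].append(row["task_id"])
    return dict(buckets)
```

For instance, a row merged at `2024-07-15T12:34:56Z` would land in the `2024-07` bucket under these assumptions.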

## Configs

| Config | Description | Tasks |
|--------|-------------|-------|
| `default` | All tasks | {total_rows} |
{verified_row}
| `YYYY-MM` | Tasks by PR merge month ({num_months} months available) | varies |
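
The `{total_rows}`, `{num_months}`, and `{verified_row}` placeholders suggest the card is filled in before upload. How `prepare_formulacode_dataset.py` actually does this is not shown here; the following is a minimal sketch using `str.format`, with a hypothetical `render_card` helper, an abbreviated stand-in template, and an assumed wording for the verified row. It drops the `{verified_row}` line entirely when there are no verified tasks so the table is not split by an empty line.

```python
from typing import Optional

# Abbreviated stand-in for DATASET_CARD.md; the real template is longer.
CARD_TEMPLATE = """\
| Config | Description | Tasks |
|--------|-------------|-------|
| `default` | All tasks | {total_rows} |
{verified_row}
| `YYYY-MM` | Tasks by PR merge month ({num_months} months available) | varies |
"""


def render_card(template: str, total_rows: int, num_months: int, verified_rows: Optional[int] = None) -> str:
    """Fill the card placeholders; omit the verified row when there is none."""
    lines = []
    for line in template.splitlines():
        if line.strip() == "{verified_row}":
            if verified_rows:
                # Row wording is an assumption, not taken from the pipeline.
                lines.append(f"| `verified` | Human-validated tasks | {verified_rows} |")
            continue
        lines.append(line)
    # str.format treats any other literal braces as fields, so a real
    # template containing them would need "{{" and "}}" escapes.
    return "\n".join(lines).format(total_rows=total_rows, num_months=num_months)
```

Calling `render_card(CARD_TEMPLATE, total_rows=120, num_months=6)` with these made-up counts yields a three-row table with no `verified` entry.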

## Key Columns

| Column | Description |
|--------|-------------|
| `task_id` | Unique task identifier (e.g. `pandas-dev_pandas_1`) |
| `repo_name` | Source repository (e.g. `pandas-dev/pandas`) |
| `container_name` | Docker container reference (`<owner>-<repo>-<sha>:final`) |
| `image_name` | Full Docker Hub image reference |
| `difficulty` | Normalized difficulty: `easy`, `medium`, `hard` |
| `classification` | Optimization type (e.g. `use_better_algorithm`, `micro_optimizations`) |
| `patch` | Ground truth performance improvement patch |
| `final_md` | Task instructions in markdown |
| `pr_merged_at` | Date the PR was merged |
| `pr_merge_commit_sha` | Merge commit SHA |
| `pr_base_sha` | Base commit SHA |
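
The identifier formats in the table can be sketched as small helpers. These functions are hypothetical and only mirror the patterns shown above; the meaning of the trailing number in `task_id`, which SHA is used, and any case normalization applied by the pipeline are not specified in the card and are left as assumptions.

```python
VALID_DIFFICULTIES = frozenset({"easy", "medium", "hard"})


def make_task_id(repo_name: str, index: int) -> str:
    """Build an id shaped like `pandas-dev_pandas_1` from `owner/repo` and a counter."""
    owner, repo = repo_name.split("/", 1)
    return f"{owner}_{repo}_{index}"


def make_container_name(repo_name: str, sha: str) -> str:
    """Build a reference shaped like `<owner>-<repo>-<sha>:final`."""
    owner, repo = repo_name.split("/", 1)
    return f"{owner}-{repo}-{sha}:final"


def check_row(row: dict[str, str]) -> list[str]:
    """Return human-readable problems found in a row; an empty list means none."""
    problems = []
    if row.get("difficulty") not in VALID_DIFFICULTIES:
        problems.append(f"unexpected difficulty: {row.get('difficulty')!r}")
    if not row.get("container_name", "").endswith(":final"):
        problems.append("container_name does not end with ':final'")
    return problems
```

A check like `check_row` could be run over loaded rows as a quick sanity pass before relying on the `difficulty` or `container_name` columns.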
