diff --git a/AGENTS.md b/AGENTS.md deleted file mode 100644 index adf62c8..0000000 --- a/AGENTS.md +++ /dev/null @@ -1,43 +0,0 @@ -# Repository Guidelines - -This guide helps contributors work effectively in the datasmith repository. - -## Project Structure & Module Organization -- Source: `src/datasmith/` — core modules: `agents/`, `docker/`, `scrape/`, `benchmark/`, `detection/`, `execution/`, `collation/`, `core/`. -- Tests: `tests/` — pytest suites (e.g., `tests/test_docker_*`, `tests/agents/`). -- Assets/Docs: `static/`, `docs/`. -- Artifacts: `scratch/` (generated data), `dist/` (wheels). Do not commit contents. - -## Build, Test, and Development Commands -- `make install` — create env with uv and install pre-commit. -- `make check` — lock check, ruff lint/format, mypy, deptry. -- `make test` — run pytest with coverage (XML for CI/Codecov). -- `make build` — build wheel into `dist/`. -- `uv run ` — run tools inside the env (e.g., `uv run pytest`). -- `uvx tox -q` — run the tox matrix (py39–py312) if tox is installed. -- Optional: `make backup` uses `tokens.env` for `BACKUP_DIR` rsync. -- To run commands using the same environment variables as the user, use `uv run `. - -## Coding Style & Naming Conventions -- Python 3.9–3.12. 4‑space indentation, type hints required (mypy strict; see `pyproject.toml`). -- Lint/format via Ruff (line length 120). Run `make check` before pushing. -- Naming: modules/functions `snake_case`, classes `CamelCase`, constants `UPPER_SNAKE_CASE`. -- Prefer `logging` (see `src/datasmith/logging_config.py`) over prints. - -## Testing Guidelines -- Framework: pytest + pytest‑cov. Place tests in `tests/` named `test_*.py`. -- Run locally: `make test` or `uv run pytest`. -- Coverage: Codecov target 90% (see `codecov.yaml`). Add tests for new code paths. -- Tests must be deterministic and offline; use fakes for network calls. 
- -## Commit & Pull Request Guidelines -- History is informal; please use clear, present‑tense summaries, optionally prefixing a subsystem tag: `docker: prune dangling layers`, `agents: improve build plan`. -- PRs must include: description, rationale, test coverage notes, and any docs updates. Link issues. For CLI/UX changes, include sample output or screenshots. -- Ensure `make check` and `make test` pass; CI should be green. - -## Security & Configuration Tips -- Create `tokens.env` (ignored) for `GH_TOKEN`, `CODECOV_TOKEN`, `CACHE_LOCATION`, `BACKUP_DIR`. Never commit secrets. -- Docker tooling exists in `src/datasmith/docker/`; validate locally before pushing remote runs. - -## Agent‑Specific Instructions -- Keep changes small and focused; update/cover adjacent tests. Follow this guide for all files under the repo root. diff --git a/DATASET_CARD.md b/DATASET_CARD.md deleted file mode 100644 index d83cd6c..0000000 --- a/DATASET_CARD.md +++ /dev/null @@ -1,57 +0,0 @@ -![banner](static/formula-code-datasmith.svg) - - -

- - FormulaCode Website - - - FormulaCode Paper - - - FormulaCode Leaderboard - -

- -[FormulaCode](https://formula-code.github.io/) is a *live* benchmark for evaluating the holistic ability of LLM agents to optimize codebases. FormulaCode consists of two parts: a [pipeline](https://github.com/formula-code/datasmith) to construct performance optimization tasks, and an [execution harness](https://github.com/formula-code/terminal-bench) that connects a language model to our terminal sandbox. - -This dataset contains **{total_rows}** enriched performance optimization tasks derived from real open-source Python projects, spanning {num_months} months of merged PRs. - -## Quick Start - -```python -from datasets import load_dataset - -# Load all tasks -ds = load_dataset("formulacode/formulacode-all") - -# Load only verified tasks (human-validated) -ds = load_dataset("formulacode/formulacode-all", "verified") - -# Load tasks from a specific month -ds = load_dataset("formulacode/formulacode-all", "2024-07") -``` - -## Configs - -| Config | Description | Tasks | -|--------|-------------|-------| -| `default` | All tasks | {total_rows} | -{verified_row} -| `YYYY-MM` | Tasks by PR merge month ({num_months} months available) | varies | - -## Key Columns - -| Column | Description | -|--------|-------------| -| `task_id` | Unique task identifier (e.g. `pandas-dev_pandas_1`) | -| `repo_name` | Source repository (e.g. `pandas-dev/pandas`) | -| `container_name` | Docker container reference (`--:final`) | -| `image_name` | Full Docker Hub image reference | -| `difficulty` | Normalized difficulty: `easy`, `medium`, `hard` | -| `classification` | Optimization type (e.g. 
`use_better_algorithm`, `micro_optimizations`) | -| `patch` | Ground truth performance improvement patch | -| `final_md` | Task instructions in markdown | -| `pr_merged_at` | Date the PR was merged | -| `pr_merge_commit_sha` | Merge commit SHA | -| `pr_base_sha` | Base commit SHA | diff --git a/README.md b/README.md index f61f450..887492f 100644 --- a/README.md +++ b/README.md @@ -1,56 +1,28 @@ -![banner](static/formula-code-datasmith.svg) +# FormulaCode - DataSmith 🔨 +[![Release](https://img.shields.io/github/v/release/formula-code/datasmith)](https://img.shields.io/github/v/release/formula-code/datasmith) +[![Build status](https://img.shields.io/github/actions/workflow/status/formula-code/datasmith/main.yml?branch=main)](https://github.com/formula-code/datasmith/actions/workflows/main.yml?query=branch%3Amain) +[![codecov](https://codecov.io/gh/formula-code/datasmith/branch/main/graph/badge.svg)](https://codecov.io/gh/formula-code/datasmith) +[![Commit activity](https://img.shields.io/github/commit-activity/m/formula-code/datasmith)](https://img.shields.io/github/commit-activity/m/formula-code/datasmith) +[![License](https://img.shields.io/github/license/formula-code/datasmith)](https://img.shields.io/github/license/formula-code/datasmith) -

- - FormulaCode Website - - - FormulaCode Paper - - - FormulaCode Leaderboard - -

+This is a Python codebase for preparing and analyzing the Hugging Face dataset for **FormulaCode** (67 repositories; 964+ performance‑improving commits) and **FormulaCode-V** (?? repositories; 200 performance‑improving commits with manually verified pytest benchmarks).
-[FormulaCode](https://formula-code.github.io/) is a *live* benchmark for evaluating the holistic ability of LLM agents to optimize codebases. FormulaCode consists of two parts: a pipeline to construct performance optimization tasks, and an [execution harness](https://github.com/formula-code/terminal-bench) that connects a language model to our terminal sandbox. _This repository contains the task generation pipeline._
+![FormulaCode](static/Fig1.png)
-`datasmith` is a python package for automatically curating and managing [FormulaCode](https://formula-code.github.io/) tasks. After installation, `datasmith` is designed to run as a monthly CRON job that updates the FormulaCode dataset with new commits and repositories.
+FormulaCode is designed to benchmark the ability of large language models (LLMs) to optimize the performance of real‑world codebases. It *complements* existing benchmarks (e.g. SWE‑Bench) by using the same API and methodology.
-# Table of Contents
-# Installation
+## Key improvements
-## Step 1: Install `datasmith` and set up the environment
+1. **Human‑relative metric** – FormulaCode scores an optimizer relative to the speed‑up achieved by the human author of the original commit, preventing “memorize‑and‑saturate” tactics.
+2. **Finer‑grained feedback** – Performance measurements provide a dense reward signal that helps RL or evolutionary algorithms iterate more effectively than binary pass/fail unit tests.
+3. **Performance benchmarks vs. unit tests** – Unit tests protect against functional regressions but can be over‑fit; realistic workload benchmarks capture the critical performance hot‑paths developers actually care about.
+4. 
**Real‑world impact** – FormulaCode uses a library's _pre-defined_ performance benchmarking workloads. As such, if an LLM statistically outperforms the human baseline on a FormulaCode task, the resulting patch is often state‑of‑the‑art and can be upstreamed to the library. Of course, any such patch must first be manually verified and thoroughly validated.
-Install [uv](https://astral.sh/uv/) and set up the development environment:
-```bash
-# Install uv
-$ curl -LsSf https://astral.sh/uv/install.sh | sh
-# Clone the repository.
-$ git clone https://github.com/formula-code/datasmith.git
-$ cd datasmith
-# Installs pre-commit hooks and dev dependencies.
-$ make install
-# Resolve initial formatting issues.
-$ uv run pre-commit run -a
-$ make check
-# Run tests to verify installation.
-$ make test
-```
-
-## Step 2: Configure environment variables
-
-Make a `tokens.env` file with your environment variables. You'll need:
-
-* A [GitHub token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token) with `repo` and `read:org` permissions to read repositories and commit.
-* An OpenAI API key or local LLM endpoint (recommended) for synthesizing reproducible containers and for performance classification.
-* A DockerHub token with `write:packages` permissions if you want to publish images to DockerHub for easier sharing and distribution.
-
-<details>
-Click to reveal a sample tokens.env file. +## Installation +Make a `tokens.env` file with your GitHub and Codecov credentials. ```bash # Cache and backup locations CACHE_LOCATION=/home/???/formulacode/datasmith/scratch/artifacts/cache.db @@ -71,19 +43,27 @@ DOCKERHUB_NAMESPACE=formulacode # Required for dataset verification DOCKERHUB_USERNAME=myuser # Required for dataset verification DOCKERHUB_TOKEN=dckr_pat_xxxxx # Required for dataset verification -# For Hugging Face dataset uploads -HF_TOKEN=hf_xxxxx # Required for --upload-to-hf +# For ECR access (legacy/optional) +AWS_REGION=us-east-1 # Depends on the system. #DOCKER_USE_BUILDX=0 DOCKER_NETWORK_MODE=host ``` -
+Then, install [uv](https://astral.sh/uv/) and set up the development environment:
+```bash
+$ curl -LsSf https://astral.sh/uv/install.sh | sh
+# Installs pre-commit hooks and dev dependencies.
+$ make install
+# Resolve initial formatting issues.
+$ uv run pre-commit run -a
+$ make check
+# Run tests to verify installation.
+$ make test
+```
-
-## Step 3: Set up CRON job for monthly updates
-
-Set up a CRON job to run the `update_formulacode.py` script on the 25th of every month. This will keep the FormulaCode dataset up to date with new repositories and commits.
+Ensure your machine can run cron tasks. This will be necessary for updating FormulaCode every month.

```bash
$ crontab -l
0 0 * * 0 /usr/bin/docker image prune -f

# Run FormulaCode update script on the 25th day of every month at 2am
-0 2 25 * * flock -n /tmp/update_formulacode.lock /home/???/formulacode/datasmith/.venv/bin/python /home/???/formulacode/datasmith/scratch/scripts/update_formulacode.py --start-date "$(date -d '-1 month' +\%Y-\%m-01)" --end-date "$(date +\%Y-\%m-01)" >> /home/???/formulacode/datasmith/scratch/logs/update_formulacode_$(date +\%Y\%m\%d).log 2>&1
+0 2 25 * * cd /home/???/formulacode/datasmith && ./.venv/bin/python scratch/scripts/update_formulacode.py >> scratch/logs/update_formulacode_$(date +\%Y\%m\%d).log 2>&1
$ crontab -e
#
```

-# Package Organization
-## Module Structure
+## Data layout
+The general layout of the artifacts is as follows:
+```bash
+scratch/artifacts
+├── cache.db                    # See `CACHE_LOCATION` env var
+├── raw/                        # Raw downloads & lists produced by scripts
+│   ├── downloads/              # Per‑repo dashboard archives
+│   ├── online_dashboards.jsonl # Updated config for dashboard scraper
+│   ├── repos_discovered.csv    # Candidates from GitHub search
+│   └── repos_valid.csv
+└── processed/                  # Outputs of various processing scripts
+```
+
+
+## Dataset building
+
+### Installation
+
+You will need to download and install uv to set up Datasmith. 
The rest of the process is automated using `make` commands.
+
+```bash
+$ curl -LsSf https://astral.sh/uv/install.sh | sh
+# Install dev environment and pre-commit hooks
+$ make install
+# Resolve initial formatting issues.
+$ uv run pre-commit run -a
+$ make check
+```
+
+For querying GitHub and Codecov, we need to set up a few environment variables. You can do this by creating a `tokens.env` file in the root of the repository with the following content.
+
+```bash
+$ cat tokens.env
+GH_TOKEN=github_pat_???
+COVERALLS_TOKEN=XdK???
+CODECOV_TOKEN=54c6???
+CACHE_LOCATION=/home/???/formulacode/datasmith/scratch/artifacts/cache.db
+BACKUP_DIR=/home/???/formulacode/backup/
+```
+
+
+## FormulaCode
+
+
+FormulaCode is a dataset of 101 repositories (4M+ PRs) with an automated pipeline to scrape, filter, benchmark, and analyze performance-improving commits in open-source repositories that use Airspeed Velocity (asv) for benchmarking. To update FormulaCode, simply run:
```
-src/datasmith/
-├── agents/ # LLM agent orchestration for build tasks and context synthesis
-├── benchmark/ # Benchmark collection from ASV (Airspeed Velocity) configs
-├── collation/ # Benchmark result collation and aggregation
-├── core/ # Shared infra: API clients, caching, Git integration, data models
-├── detection/ # Performance breakpoint and regression detection
-├── docker/ # Docker image building, validation, DockerHub publishing
-├── execution/ # Commit collection, filtering, and performance analysis
-├── notebooks/ # Jupyter notebook utilities and context-registry updates
-└── scrape/ # Web scraping, report building, LLM-based classification
+$ python scratch/scripts/update_formulacode.py --start-date 2025-10-01 --end-date 2025-11-01
```
-Pipeline scripts live in `scratch/scripts/`. The monthly orchestrator is `update_formulacode.py`; individual steps can also be run standalone.
+The next sections describe each step of the pipeline in detail.
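For reference, the date window that the monthly cron job passes to `update_formulacode.py` spans the first day of the previous month (inclusive) through the first day of the current month (exclusive). An illustrative Python sketch of that computation (not code from this repository):

```python
from datetime import date, timedelta

def monthly_window(today: date) -> tuple[str, str]:
    """[start, end) window: first of last month through first of this month."""
    end = today.replace(day=1)                        # first day of the current month
    start = (end - timedelta(days=1)).replace(day=1)  # step into the previous month, snap to its 1st
    return start.isoformat(), end.isoformat()

print(monthly_window(date(2025, 11, 25)))  # ('2025-10-01', '2025-11-01')
```

Run on the 25th of November 2025, this yields exactly the `--start-date 2025-10-01 --end-date 2025-11-01` pair shown above.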
-## Pipeline Overview +### FormulaCode update pipeline (bird's-eye view) -The diagram below summarizes how `scratch/scripts/update_formulacode.py` orchestrates the monthly update pipeline. +The diagram below summarizes how `scratch/scripts/update_formulacode.py` orchestrates the monthly update pipeline and how each downstream script delegates to internal modules and services. ```mermaid sequenceDiagram participant U as update_formulacode.py participant ENV as env setup - participant C1 as collect_and_filter_commits.py - participant C2 as prepare_commits_for_building_reports.py - participant C3 as collect_perf_commits.py - participant C4 as synthesize_contexts.py - participant C5 as build_and_publish_to_dockerhub.py - participant C6 as merge_perfonly_commits_master.py - participant C7 as prepare_formulacode_dataset.py + participant C1 as collect_commits.py + participant C2 as collect_and_filter_commits.py + participant C3 as prepare_commits_for_building_reports.py + participant C4 as collect_perf_commits.py + participant C5 as synthesize_contexts.py + participant C6 as build_and_publish_to_ecr.py participant CSV as repos_valid.csv participant GH as github or offline store participant TMP as temp repo dir @@ -166,162 +179,144 @@ sequenceDiagram participant SQL as sqlite cache participant CR as context registry participant DOCK as docker build - participant DH as dockerhub - participant HF as hugging face + participant ECR as aws ecr %% Orchestrator setup (grey) rect rgb(230,230,230) - Note over U,ENV: Prerequisite: collect_commits.py / scrape_repositories.py (one-time) U->>ENV: step 0 setup environment and logging ENV->>U: environment ready end - %% Step 1 collect_and_filter_commits.py (green) - rect rgb(210,245,220) - U->>C1: step 1 collect and filter commits + %% Step 1 collect_commits.py (blue) + rect rgb(210,225,255) + U->>C1: step 1 collect commits C1->>CSV: read repos_valid.csv - C1->>TMP: clone repo into temp dir - C1->>MI: collect merge shas and commit info - 
C1->>FS: write merge_commits_filtered parquet + C1->>GH: find perf commits for each repo + C1->>GH: find tagged releases + C1->>FS: write commits jsonl + end + + %% Step 2 collect_and_filter_commits.py (green) + rect rgb(210,245,220) + U->>C2: step 2 collect and filter commits + C2->>CSV: read repos_valid.csv + C2->>TMP: clone repo into temp dir + C2->>MI: collect merge shas and commit info + C2->>FS: write merge_commits_filtered parquet end - %% Step 2 prepare_commits_for_building_reports.py (yellow) + %% Step 3 prepare_commits_for_building_reports.py (yellow) rect rgb(255,250,210) - U->>C2: step 2 prepare commits for reports - C2->>FS: read merge_commits_filtered parquet - C2->>C2: tokenize patches and crude perf filter - C2->>C2: analyze commits in threads - C2->>DOCK: make tasks with container names - C2->>FS: optional get patch from diff url - C2->>FS: write parquet with patch + U->>C3: step 3 prepare commits for reports + C3->>FS: read merge_commits_filtered parquet + C3->>C3: tokenize patches and crude perf filter + C3->>C3: analyze commits in threads + C3->>DOCK: make tasks with container names + C3->>FS: optional get patch from diff url + C3->>FS: write parquet with patch end - %% Step 3 collect_perf_commits.py (red-ish) + %% Step 4 collect_perf_commits.py (red-ish) rect rgb(255,225,220) - U->>C3: step 3 classify performance commits - C3->>FS: read prepared parquet - C3->>C3: report builder per row - C3->>SQL: cache completion in sqlite - C3->>LLM: call llm backends - LLM->>C3: performance classification - C3->>FS: write raw parquet - C3->>FS: write perf only parquet + U->>C4: step 4 classify performance commits + C4->>FS: read prepared parquet + C4->>C4: report builder per row + C4->>SQL: cache completion in sqlite + C4->>LLM: call llm backends + LLM->>C4: performance classification + C4->>FS: write raw parquet + C4->>FS: write perf only parquet end - %% Step 4 synthesize_contexts.py (purple) + %% Step 5 synthesize_contexts.py (purple) rect 
rgb(235,220,255) - U->>C4: step 4 synthesize contexts - C4->>C4: configure agent backends - C4->>FS: load perf only parquet - C4->>CR: load context registry and update - CR->>C4: context registry ready - C4->>DOCK: build base image - DOCK->>C4: base image built - C4->>C4: prepare task list per repo and commit - C4->>C4: agent build and validate in threads - C4->>FS: write results jsonl and all files by image json - C4->>CR: update context_registry json + U->>C5: step 5 synthesize contexts + C5->>C5: configure agent backends + C5->>FS: load perf only parquet + C5->>CR: load context registry and update + CR->>C5: context registry ready + C5->>DOCK: build base image + DOCK->>C5: base image built + C5->>C5: prepare task list per repo and commit + C5->>C5: agent build and validate in threads + C5->>FS: write results jsonl and all files by image json + C5->>CR: update context_registry json end - %% Step 5 build_and_publish_to_dockerhub.py (teal) + %% Step 6 build_and_publish_to_ecr.py (teal) rect rgb(210,245,245) - U->>C5: step 5 build and publish - C5->>FS: read perf only parquet - C5->>CR: load context registry - C5->>DOCK: build images - DOCK->>C5: built images - C5->>DH: publish images to dockerhub + U->>C6: step 6 build and publish + C6->>FS: read perf only parquet + C6->>CR: load context registry and update + C6->>DOCK: build base image + DOCK->>C6: base image built + C6->>C6: prepare task list for recent package images + C6->>ECR: optional filter tasks not on ecr + ECR->>C6: list of existing images + C6->>DOCK: build using docker validator + DOCK->>C6: built images + C6->>ECR: publish images to aws ecr end +``` - %% Step 6 merge_perfonly_commits_master.py (blue) - rect rgb(210,225,255) - U->>C6: step 6 merge into master parquet - C6->>FS: read new perfonly parquet - C6->>FS: read master parquet - C6->>C6: deduplicate and merge - C6->>FS: write updated master parquet - end +### 1. 
Scrape GitHub for asv-compatible repositories

We start by collecting all repositories that use Airspeed Velocity (asv) for benchmarking. This can be done in one of two ways:
-
Prerequisites (one-time setup)

Scrape GitHub for ASV-compatible repositories using `collect_commits.py` or `scrape_repositories.py`. These are **not** part of the monthly orchestrator — run them once to build `repos_valid.csv`.

+1. Google BigQuery: Google maintains a public dataset of GitHub repositories that can be queried using SQL.
+2. GitHub Search API: We use the GitHub Search API to find all repositories that have an `asv.conf.json` file in their root directory. This is a more comprehensive search that can find repositories that are not indexed by Google BigQuery. _This version is implemented here._
+
+To run the script, you need a GitHub token with `repo` and `read:org` permissions. You can create one by following the instructions [here](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token).
+
+Run:
```bash
$ python scratch/scripts/collect_commits.py \
-    --repos scratch/artifacts/pipeflush/repos_valid.csv \
+    --dashboards scratch/artifacts/pipeflush/repos_valid.csv \
     --outfile scratch/artifacts/pipeflush/commits_all.jsonl \
     --max-pages 50
+# Writes scratch/artifacts/processed/repos_discovered.csv and scratch/artifacts/processed/repos_valid.csv
```
-The output `repos_valid.csv` contains repositories that aren't forks/reuploads, have at least a minimum number of stars, and pass other sanity checks (~700 repositories).
+The `scratch/artifacts/processed/repos_valid.csv` file contains the subset of repositories that aren't forks or re-uploads, have at least {min-stars} stars, and pass other sanity checks. We found ~700 filtered repositories for this dataset.
-
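To make the Search API approach concrete, here is a minimal, stdlib-only sketch of the kind of query involved. This is an illustration, not the actual `collect_commits.py` implementation; the helper names are hypothetical, and the extraction step is kept as a pure function so it can be exercised without network access:

```python
import json
import os
from urllib.request import Request, urlopen

def asv_repo_names(payload: dict) -> list[str]:
    """Extract unique repository full names from a code-search response payload."""
    return sorted({item["repository"]["full_name"] for item in payload.get("items", [])})

def search_asv_repos(page: int = 1) -> list[str]:
    # GitHub code search for asv.conf.json files at the repository root
    url = (
        "https://api.github.com/search/code"
        f"?q=filename%3Aasv.conf.json+path%3A%2F&per_page=100&page={page}"
    )
    req = Request(url, headers={
        "Authorization": f"Bearer {os.environ['GH_TOKEN']}",
        "Accept": "application/vnd.github+json",
    })
    with urlopen(req, timeout=30) as resp:
        return asv_repo_names(json.load(resp))
```

The real script additionally paginates (up to `--max-pages`), handles rate limits, and writes the discovered and validated repository lists to CSV.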
### Step 1: Collect and filter commits — `collect_and_filter_commits.py`
-
-Given the list of repositories, find merged PRs and filter out commits that only modified benchmarking files, were documentation-only, or could not be installed.
+### 2. Collect relevant commits for all repositories
+
+Given the list of repositories, we find the subset of commits that have already been closed and merged into the main branch, and then filter out commits that primarily modified the benchmarking files (e.g. `asv.conf.json`), were not relevant to the benchmarks (e.g. documentation changes), or could not be installed (e.g. running `uv pip install -e .` fails).

```bash
$ python scratch/scripts/collect_and_filter_commits.py \
    --filtered-benchmarks-pth scratch/artifacts/pipeflush/repos_valid.csv \
    --output-pth scratch/artifacts/pipeflush/merge_commits_filtered.parquet \
-    --threads 8 \
-    --procs 32 \
-    --since 2025-10-01 \
-    --until 2025-11-01
-```
-
-### Step 2: Prepare commits for report building — `prepare_commits_for_building_reports.py`
-
-Tokenize patches, apply a crude performance filter, and optionally fetch full patch text from the GitHub diff API.
-
-```bash
+    --threads 8 \
+    --procs 32
+
$ python scratch/scripts/prepare_commits_for_building_reports.py \
    --input scratch/artifacts/pipeflush/merge_commits_filtered.parquet \
    --output scratch/artifacts/pipeflush/merge_commits_filtered_with_patch.parquet \
    --max-workers 200 \
    --filter-repos \
    --fetch-patches
-```
-
-### Step 3: Classify performance commits — `collect_perf_commits.py`
-
-Build a structured report for each commit, call LLM backends for performance classification, and write a filtered parquet of performance-only commits.
-```bash $ python scratch/scripts/collect_perf_commits.py \ - --commits scratch/artifacts/pipeflush/merge_commits_filtered_with_patch.parquet \ - --outfile scratch/artifacts/pipeflush/perfonly_commits \ + --commits scratch/artifacts/processed/merge_commits_filtered_with_patch.parquet \ + --outfile scratch/artifacts/processed/perfonly_commits_with_patch.parquet \ --max-workers -1 -# Produces perfonly_commits.raw.parquet and perfonly_commits.parquet ``` -### Step 4: Synthesize Docker contexts — `synthesize_contexts.py` +### 3. Build contexts for all commits -Each context is a (repo, commit) pair with an associated `build_env.sh` script. Contexts that fail to build are filtered out. Common failure reasons: +Each context is a (repo, commit) pair with an associated build_env.sh script to install dependencies. Some reasons a context might fail to build (and get filtered out): 1. Commit couldn't be checked out -2. Commit didn't have an `asv.conf.json` file -3. ASV environment could not be built -4. A quick `asv run` did not succeed +2. Commit didn't have an asv.conf.json file +3. We could not build the asv environment for the commit. +4. We could not run a quick asv run to ensure that the benchmarks run. ```bash $ python scratch/scripts/synthesize_contexts.py \ - --commits scratch/artifacts/pipeflush/perfonly_commits.parquet \ + --commits scratch/artifacts/pipeflush/commits_perfonly.parquet \ --output-dir scratch/artifacts/pipeflush/results_synthesis/ \ --context-registry scratch/artifacts/pipeflush/context_registry.json \ --max-workers 32 \ @@ -332,44 +327,97 @@ $ python scratch/scripts/synthesize_contexts.py \ --push-to-dockerhub ``` -### Step 5: Build and publish images — `build_and_publish_to_dockerhub.py` +### 4. Upload to AWS ECR -Rebuild Docker images and publish them to DockerHub. Credentials are read from environment variables configured in `tokens.env` (see Installation). +We rebuild the Docker images from scratch and then upload them to AWS ECR for later use. 
```bash -$ python scratch/scripts/build_and_publish_to_dockerhub.py \ - --commits scratch/artifacts/pipeflush/perfonly_commits.parquet \ +$ python scratch/scripts/build_and_publish_to_ecr.py \ + --commits scratch/artifacts/processed/perfonly_commits_with_patch_final.parquet \ --context-registry scratch/artifacts/pipeflush/context_registry.json \ - --namespace formulacode \ --max-workers 5 \ --skip-existing ``` -### Step 6: Merge into master parquet — `merge_perfonly_commits_master.py` +### 4b. Alternative: Upload to DockerHub + +As an alternative to AWS ECR, you can publish Docker images to DockerHub for easier public sharing and distribution. + +#### Setup + +First, generate a DockerHub access token: +1. Visit https://hub.docker.com/settings/security +2. Click "New Access Token" +3. Give it a descriptive name (e.g., "datasmith-publisher") +4. Copy the token -Deduplicate and append this month's performance commits into the cumulative master parquet. +Then configure your environment: ```bash -$ python scratch/scripts/merge_perfonly_commits_master.py \ - --new-perfonly scratch/artifacts/pipeflush/perfonly_commits.parquet \ - --master scratch/artifacts/pipeflush/perfonly_commits_master.parquet +# Add to your tokens.env file +export DOCKERHUB_NAMESPACE=formulacode # Your DockerHub username or organization +export DOCKERHUB_USERNAME=myusername # Your DockerHub username +export DOCKERHUB_TOKEN=dckr_pat_xxxxx # Access token from above ``` -### Step 7: Enrich and upload to Hugging Face — `prepare_formulacode_dataset.py` +Or use Docker login: +```bash +$ docker login docker.io +# Enter your username and token when prompted +``` -Derive container names, resolve Docker Hub images, normalize difficulty/classification, generate task IDs, and upload the enriched dataset to Hugging Face with monthly configs. Only key columns (task metadata, patch, instructions) are uploaded. 
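For orientation, the two `--repository-mode` layouts map a (repo, commit) pair to an image reference roughly as sketched below. The helper name is hypothetical; the tag shapes follow the examples given under Options:

```python
def image_ref(namespace: str, owner: str, repo: str, sha: str, mode: str = "single") -> str:
    """Hypothetical helper: encode a (repo, commit) pair as a DockerHub reference."""
    if mode == "single":
        # one shared repository; project and commit are encoded in the tag
        return f"{namespace}/all:{owner}-{repo}-{sha}--final"
    # "mirror": one DockerHub repository per project
    return f"{namespace}/{owner}-{repo}:{sha}-final"

print(image_ref("formulacode", "pandas-dev", "pandas", "abc123"))
# formulacode/all:pandas-dev-pandas-abc123--final
```

Single mode keeps everything under one repository (simpler permissions, one rate-limit bucket); mirror mode gives each project its own repository at the cost of creating many repositories.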
+#### Usage ```bash -$ python scratch/scripts/prepare_formulacode_dataset.py \ - --input scratch/artifacts/pipeflush/perfonly_commits_master.parquet \ - --output scratch/artifacts/pipeflush/perfonly_enriched.parquet \ - --dockerhub-repository formulacode/all \ - --upload-to-hf formulacode/formulacode-all \ - --hf-verified-filter /path/to/valid_tasks.json +$ python scratch/scripts/build_and_publish_to_dockerhub.py \ + --commits scratch/artifacts/processed/perfonly_commits_with_patch_final.parquet \ + --context-registry scratch/artifacts/pipeflush/context_registry.json \ + --namespace formulacode \ + --max-workers 5 \ + --skip-existing ``` -> Requires `HF_TOKEN` in `tokens.env`. The upload creates `default`, `verified`, and per-month (`YYYY-MM`) configs on Hugging Face. +#### Options + +- `--namespace`: **Required.** DockerHub namespace (your username or organization) +- `--username`: DockerHub username (or use `DOCKERHUB_USERNAME` env var) +- `--password`: DockerHub token (or use `DOCKERHUB_TOKEN` env var) +- `--repository-mode`: `single` (default) or `mirror` + - `single`: All images in one repository with encoded tags (e.g., `formulacode/all:owner-repo-sha--final`) + - `mirror`: Each project gets its own repository (e.g., `formulacode/owner-repo:sha-final`) +- `--single-repo`: Repository name for single mode (default: `all`) +- `--skip-existing`: Skip images that already exist on DockerHub +- `--max-workers`: Number of parallel build/push operations (default: 8) + +#### Important Notes + +- **Public Visibility**: New DockerHub repositories default to PUBLIC. Change to private manually on DockerHub if needed. +- **Rate Limits**: DockerHub free tier has rate limits. The script handles this with exponential backoff, but consider a paid plan for high-volume publishing. +- **Organization Repositories**: For organization namespaces, repositories may need to be manually created on DockerHub before first push. 
+- **Push Concurrency**: Default push concurrency is lower for DockerHub (8) vs ECR (12) to avoid rate limiting. + +#### Environment Variables + +```bash +# DockerHub-specific (add to tokens.env) +DOCKERHUB_NAMESPACE=formulacode # Required +DOCKERHUB_USERNAME=myuser # Required +DOCKERHUB_TOKEN=dckr_pat_xxxxx # Required +DOCKERHUB_RATE_LIMIT_WAIT=60 # Optional: seconds to wait on rate limit (default: 60) +DOCKERHUB_SINGLE_REPO=all # Optional: repo name for single mode (default: all) + +# Build settings (shared with ECR) +BUILD_CONCURRENCY=24 # Max parallel builds +PUSH_CONCURRENCY=8 # Max parallel pushes (lower for DockerHub) +DOCKER_USE_BUILDX=0 # Use Docker BuildKit +DOCKER_NETWORK_MODE=host # Build network mode +``` + +### 5. Evaluate all commits + +This is done in FormulaCode's fork of the terminal-bench evaluation framework. + -### Evaluation +## License -Evaluation is done in FormulaCode's fork of the [terminal-bench](https://github.com/formula-code/terminal-bench) evaluation framework. +This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details. 
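The exponential backoff used for DockerHub rate limits can be sketched as follows. This is illustrative: `RateLimited` is a hypothetical stand-in for an HTTP 429 response, and the real script's retry logic may differ; the default `base_wait` mirrors `DOCKERHUB_RATE_LIMIT_WAIT=60`:

```python
import time

class RateLimited(Exception):
    """Hypothetical stand-in for a DockerHub rate-limit (HTTP 429) error."""

def push_with_backoff(push, max_attempts: int = 5, base_wait: float = 60.0):
    """Retry `push` with waits of base_wait, 2*base_wait, 4*base_wait, ..."""
    for attempt in range(max_attempts):
        try:
            return push()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the rate-limit error
            time.sleep(base_wait * (2 ** attempt))
```

With `max_attempts=5` and `base_wait=60`, a persistently throttled push waits 60 s, 120 s, 240 s, and 480 s before giving up.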
diff --git a/pyproject.toml b/pyproject.toml index b3b8647..34f0450 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -16,6 +16,8 @@ classifiers = [ ] dependencies = [ "asv", + "boto3>=1.7.84", + "botocore>=1.10.84", "docker", "dspy>=2.6.27", "gitpython", @@ -156,3 +158,14 @@ disable_error_code = ["no-any-unimported", "import-untyped"] [[tool.mypy.overrides]] module = "datasmith.execution.utils" disallow_untyped_defs = false + +[[tool.mypy.overrides]] +module = "boto3.*" +ignore_missing_imports = true + +[[tool.mypy.overrides]] +module = "botocore.*" +ignore_missing_imports = true + +[tool.poetry.dev-dependencies] +types-boto3 = "*" diff --git a/scratch/notebooks/aws_benchmarking_prepare_batches.ipynb b/scratch/notebooks/aws_benchmarking_prepare_batches.ipynb new file mode 100644 index 0000000..98de1e2 --- /dev/null +++ b/scratch/notebooks/aws_benchmarking_prepare_batches.ipynb @@ -0,0 +1,320 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "id": "cd247486", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/mnt/sdd1/atharvas/formulacode/datasmith\n", + "2025-09-20\n" + ] + } + ], + "source": [ + "%cd /mnt/sdd1/atharvas/formulacode/datasmith/\n", + "import datetime\n", + "from pathlib import Path\n", + "\n", + "import dotenv\n", + "\n", + "dotenv.load_dotenv(\"tokens.env\")\n", + "\n", + "curr_date: str = datetime.datetime.now().isoformat().split(\"T\")[0]\n", + "\n", + "print(curr_date)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5cef9306", + "metadata": {}, + "outputs": [], + "source": [ + "import logging\n", + "\n", + "import docker\n", + "from tqdm.auto import tqdm\n", + "\n", + "logger = logging.getLogger(__name__)\n", + "\n", + "all_images = [\n", + " line.split()[10]\n", + " for line in Path(\"scratch/scripts/parallel_validate_containers.log\").read_text().splitlines()\n", + " if \"buildx build\" in line\n", + "]\n", + "images_to_remove = [\n", + " 
line.split()[3]\n", + " for line in Path(\"scratch/scripts/parallel_validate_containers.log\").read_text().splitlines()\n", + " if \"failed to run\" in line\n", + "]\n", + "images_to_keep = list(set(all_images) - set(images_to_remove))\n", + "print(f\"Keeping {len(images_to_keep)} images, removing {len(images_to_remove)} images\")\n", + "# Increase timeout significantly for large image operations\n", + "client = docker.from_env(timeout=3600) # 1 hour timeout\n", + "for img in tqdm(images_to_remove):\n", + " try:\n", + " client.images.remove(img, force=True)\n", + " except Exception as e:\n", + " logger.debug(f\"Failed to remove image {img}: {e}\")\n", + " continue" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "3bd3fd96", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "381" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(images_to_keep)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "e81690f4", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Found 80 images locally\n" + ] + } + ], + "source": [ + "from docker.models.images import Image\n", + "\n", + "\n", + "def get_image(image_name: str) -> Image | None:\n", + " client = docker.from_env(timeout=3600) # 1 hour timeout\n", + " try:\n", + " return client.images.get(image_name)\n", + " except Exception:\n", + " return None\n", + "\n", + "\n", + "images = {img_name.split(\":\")[0]: get_image(img_name) for img_name in images_to_keep}\n", + "images = {img_name: img for img_name, img in images.items() if img is not None}\n", + "print(f\"Found {len(images)} images locally\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a54cd339", + "metadata": {}, + "outputs": [], + "source": "import contextlib\nimport json\nimport os\nimport re\nimport subprocess\nfrom concurrent.futures import ThreadPoolExecutor, 
as_completed\nfrom pathlib import Path\n\nimport boto3\nfrom boto3.s3.transfer import TransferConfig\nfrom botocore.config import Config as BotoConfig\n\n\ndef _safe_tar_name(image_name: str) -> str:\n \"\"\"\n Make a filesystem- and S3-friendly tar name from the image name.\n e.g. 'registry:5000/ns/app:1.2.3' -> 'registry_5000_ns_app_1.2.3.tar.gz'\n \"\"\"\n base = re.sub(r\"[^A-Za-z0-9._-]+\", \"_\", image_name.replace(\"/\", \"_\").replace(\":\", \"_\"))\n return f\"{base}.tar.gz\"\n\n\nclass TarballCreationError(Exception):\n \"\"\"Error during tarball creation\"\"\"\n\n\ndef _create_tarball_subprocess(image_name: str, tar_gz_path: str, max_retries: int = 3) -> None:\n \"\"\"\n Create a tarball using subprocess calls to docker save and gzip.\n This is much more memory efficient and faster than Python-based streaming.\n \"\"\"\n # Create the directory if it doesn't exist\n os.makedirs(os.path.dirname(tar_gz_path), exist_ok=True)\n\n temp_tar_path = tar_gz_path.replace(\".tar.gz\", \".tar\")\n\n for attempt in range(max_retries):\n try:\n print(f\"[+] Attempt {attempt + 1}/{max_retries}: Creating tarball for {image_name}\")\n\n # Use docker save to create tar file\n docker_cmd = [\"docker\", \"save\", image_name, \"-o\", temp_tar_path]\n subprocess.run(docker_cmd, capture_output=True, text=True, check=True) # noqa: S603\n\n print(f\"[+] Compressing {temp_tar_path} to {tar_gz_path}\")\n # Use gzip to compress the tar file\n gzip_cmd = [\"gzip\", \"-c\", temp_tar_path]\n with open(tar_gz_path, \"wb\") as f_out:\n subprocess.run(gzip_cmd, stdout=f_out, check=True) # noqa: S603\n\n # Remove the temporary tar file\n os.remove(temp_tar_path)\n print(f\"[\u2713] Successfully created tarball for {image_name}\")\n except subprocess.CalledProcessError as e:\n print(f\"[!] Attempt {attempt + 1} failed for {image_name}: {e}\")\n if e.stderr:\n print(f\"[!] Error output: {e.stderr}\")\n except Exception as e:\n print(f\"[!] 
Attempt {attempt + 1} failed for {image_name}: {e}\")\n else:\n return\n\n # Clean up any partial files\n for path in [temp_tar_path, tar_gz_path]:\n if os.path.exists(path):\n with contextlib.suppress(Exception):\n os.remove(path)\n\n if attempt < max_retries - 1:\n print(f\"[+] Retrying {image_name} in 5 seconds...\")\n import time\n\n time.sleep(5)\n\n msg = f\"Failed to create tarball for {image_name}\"\n raise TarballCreationError(msg)\n\n\ndef _upload_single_file(\n s3_client, local_path: str, bucket: str, key: str, extra_args: dict, tx_config: TransferConfig\n) -> str:\n \"\"\"Upload a single file to S3.\"\"\"\n try:\n print(f\"[+] Uploading {local_path} \u2192 s3://{bucket}/{key}\")\n s3_client.upload_file(local_path, bucket, key, ExtraArgs=extra_args, Config=tx_config)\n uri = f\"s3://{bucket}/{key}\"\n print(f\"[\u2713] Uploaded: {uri}\")\n except Exception as e:\n print(f\"[!] Error uploading {local_path}: {e}\")\n raise\n else:\n return uri\n\n\ndef upload_images_to_s3(\n images: dict[str, \"Image\"],\n bucket: str,\n prefix: str = \"\",\n region: str | None = None,\n sse: str | None = None, # e.g. 
\"AES256\" or \"aws:kms\"\n tarball_dir: str = \"scratch/artifacts/tarballs\",\n force: bool = False,\n max_workers_tarball: int = 4,\n max_workers_upload: int = 8,\n) -> dict[str, str]:\n \"\"\"\n For each {name: Image} entry, produce a .tar.gz and upload to s3://bucket/prefix/.tar.gz\n Also saves the tarballs locally to tarball_dir.\n If force is enabled, the local tarball and uploaded tarball are overwritten.\n\n Args:\n max_workers_tarball: Number of parallel workers for tarball creation\n max_workers_upload: Number of parallel workers for S3 uploads\n \"\"\"\n # Prepare S3 client and config (only needed when upload code is enabled)\n # session = boto3.session.Session(region_name=region)\n # s3_client = session.client(\"s3\", config=BotoConfig(retries={\"max_attempts\": 10}))\n # tx_config = TransferConfig(multipart_threshold=8 * 1024 * 1024, multipart_chunksize=8 * 1024 * 1024)\n\n extra_args = {\"ContentType\": \"application/gzip\"}\n if sse:\n extra_args[\"ServerSideEncryption\"] = sse\n\n # Stage 1: Create all tarballs in parallel\n print(f\"[+] Stage 1: Creating {len(images)} tarballs with {max_workers_tarball} workers\")\n tarball_tasks = []\n\n for name, _img in images.items():\n tar_gz_name = _safe_tar_name(name)\n local_tar_gz_path = os.path.join(tarball_dir, tar_gz_name)\n\n if os.path.exists(local_tar_gz_path) and not force:\n print(f\"[+] Skipping {name} as it already exists locally\")\n continue\n\n tarball_tasks.append((name, local_tar_gz_path))\n\n # Create tarballs in parallel\n with ThreadPoolExecutor(max_workers=max_workers_tarball) as executor:\n future_to_name = {executor.submit(_create_tarball_subprocess, name, path): name for name, path in tarball_tasks}\n\n for future in as_completed(future_to_name):\n name = future_to_name[future]\n try:\n future.result()\n except Exception as e:\n print(f\"[!] 
Failed to create tarball for {name}: {e}\")\n # Remove from tasks if it failed\n tarball_tasks = [(n, p) for n, p in tarball_tasks if n != name]\n\n # Stage 2: Upload all tarballs to S3 in parallel\n print(f\"[+] Stage 2: Uploading {len(tarball_tasks)} tarballs with {max_workers_upload} workers\")\n mapping: dict[str, str] = {}\n\n upload_tasks = []\n for name, local_tar_gz_path in tarball_tasks:\n if os.path.exists(local_tar_gz_path):\n tar_gz_name = _safe_tar_name(name)\n key = f\"{prefix.strip('/')}/{tar_gz_name}\" if prefix else tar_gz_name\n upload_tasks.append((name, local_tar_gz_path, key))\n\n # # Upload to S3 in parallel\n # with ThreadPoolExecutor(max_workers=max_workers_upload) as executor:\n # future_to_name = {\n # executor.submit(_upload_single_file, s3_client, local_path, bucket, key, extra_args, tx_config): name\n # for name, local_path, key in upload_tasks\n # }\n\n # for future in as_completed(future_to_name):\n # name = future_to_name[future]\n # try:\n # uri = future.result()\n # mapping[name] = uri\n # except Exception as e:\n # print(f\"[!] 
Failed to upload {name}: {e}\")\n # # Clean up the local file if upload failed\n # local_path = next((path for n, path, _ in upload_tasks if n == name), None)\n # if local_path and os.path.exists(local_path):\n # with contextlib.suppress(Exception):\n # os.remove(local_path)\n\n return mapping\n\n\n# Filter out None values from images dict\nvalid_images = {name: img for name, img in images.items() if img is not None}\n\nmapping = upload_images_to_s3(\n valid_images,\n bucket=os.environ[\"AWS_S3_BUCKET\"],\n prefix=\"docker-tarballs/\",\n region=os.environ.get(\"AWS_REGION\", None),\n max_workers_tarball=30, # Adjust based on your system\n max_workers_upload=10, # Adjust based on your network bandwidth\n)\n\nPath(\"scratch/artifacts/aws_docker_image_s3_mapping.json\").write_text(json.dumps(mapping, indent=2))" + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c38d214d", + "metadata": {}, + "outputs": [], + "source": [ + "import gzip\n", + "import json\n", + "import os\n", + "import shutil\n", + "\n", + "from boto3.s3.transfer import TransferConfig\n", + "\n", + "\n", + "def _safe_tar_name(image_name: str) -> str:\n", + " \"\"\"\n", + " Make a filesystem- and S3-friendly tar name from the image name.\n", + " e.g. 
'registry:5000/ns/app:1.2.3' -> 'registry_5000_ns_app_1.2.3.tar.gz'\n", + " \"\"\"\n", + " base = re.sub(r\"[^A-Za-z0-9._-]+\", \"_\", image_name.replace(\"/\", \"_\").replace(\":\", \"_\"))\n", + " return f\"{base}.tar.gz\"\n", + "\n", + "\n", + "def _save_image_to_tar_gz(img, tar_gz_path: str, max_retries: int = 3) -> None:\n", + " \"\"\"\n", + " Stream-save a docker image to tar.gz without loading it all into memory.\n", + " Includes retry logic for timeout issues.\n", + " \"\"\"\n", + " # Create the directory if it doesn't exist\n", + " os.makedirs(os.path.dirname(tar_gz_path), exist_ok=True)\n", + "\n", + " temp_tar_path = tar_gz_path.replace(\".tar.gz\", \".tar\")\n", + "\n", + " for attempt in range(max_retries):\n", + " try:\n", + " print(f\"[+] Attempt {attempt + 1}/{max_retries}: Saving image to {temp_tar_path}\")\n", + "\n", + " # image.save(named=True) yields a generator of bytes\n", + " stream = img.save(named=True)\n", + "\n", + " # Write to a temporary tar file first, then compress\n", + " with open(temp_tar_path, \"wb\") as f:\n", + " for chunk in stream:\n", + " f.write(chunk)\n", + "\n", + " print(f\"[+] Compressing {temp_tar_path} to {tar_gz_path}\")\n", + " # Compress the tar file to tar.gz\n", + " with open(temp_tar_path, \"rb\") as f_in, gzip.open(tar_gz_path, \"wb\") as f_out:\n", + " shutil.copyfileobj(f_in, f_out)\n", + "\n", + " # Remove the temporary tar file\n", + " os.remove(temp_tar_path)\n", + " print(\"[\u2713] Successfully saved and compressed image\")\n", + " except Exception as e:\n", + " print(f\"[!] 
Attempt {attempt + 1} failed: {e}\")\n", + " # Clean up any partial files\n", + " for path in [temp_tar_path, tar_gz_path]:\n", + " if os.path.exists(path):\n", + " with contextlib.suppress(Exception):\n", + " os.remove(path)\n", + "\n", + " if attempt == max_retries - 1:\n", + " raise\n", + " else:\n", + " print(\"[+] Retrying in 5 seconds...\")\n", + " import time\n", + "\n", + " time.sleep(5)\n", + " else:\n", + " return\n", + "\n", + "\n", + "def upload_images_to_s3(\n", + " images: dict[str, \"Image\"],\n", + " bucket: str,\n", + " prefix: str = \"\",\n", + " region: str | None = None,\n", + " sse: str | None = None, # e.g. \"AES256\" or \"aws:kms\"\n", + " tarball_dir: str = \"scratch/artifacts/tarballs\",\n", + " force: bool = False,\n", + ") -> dict[str, str]:\n", + " \"\"\"\n", + " For each {name: Image} entry, produce a .tar.gz and upload to s3://bucket/prefix/.tar.gz\n", + " Also saves the tarballs locally to tarball_dir.\n", + " If force is enabled, the local tarball and uploaded tarball are overwritten.\n", + " \"\"\"\n", + " session = boto3.session.Session(region_name=region)\n", + " s3 = session.client(\"s3\", config=BotoConfig(retries={\"max_attempts\": 10}))\n", + " tx_config = TransferConfig(multipart_threshold=8 * 1024 * 1024, multipart_chunksize=8 * 1024 * 1024)\n", + "\n", + " mapping: dict[str, str] = {}\n", + "\n", + " for name, img in images.items():\n", + " tar_gz_name = _safe_tar_name(name)\n", + " key = f\"{prefix.strip('/')}/{tar_gz_name}\" if prefix else tar_gz_name\n", + "\n", + " # Local path for the tarball\n", + " local_tar_gz_path = os.path.join(tarball_dir, tar_gz_name)\n", + " if os.path.exists(local_tar_gz_path) and not force:\n", + " print(f\"[+] Skipping {name} as it already exists locally\")\n", + " continue\n", + "\n", + " try:\n", + " print(f\"[+] Saving image '{name}' \u2192 {local_tar_gz_path}\")\n", + " _save_image_to_tar_gz(img, local_tar_gz_path)\n", + "\n", + " extra = {\"ContentType\": \"application/gzip\"}\n", + " 
if sse:\n", + "                extra[\"ServerSideEncryption\"] = sse\n", + "\n", + "            print(f\"[+] Uploading {local_tar_gz_path} \u2192 s3://{bucket}/{key}\")\n", + "            s3.upload_file(local_tar_gz_path, bucket, key, ExtraArgs=extra, Config=tx_config)\n", + "\n", + "            uri = f\"s3://{bucket}/{key}\"\n", + "            mapping[name] = uri\n", + "            print(f\"[\u2713] Uploaded: {uri}\")\n", + "        except Exception as e:\n", + "            print(f\"[!] Error processing {name}: {e}\")\n", + "            # Clean up the local file if it was created but upload failed\n", + "            if os.path.exists(local_tar_gz_path):\n", + "                with contextlib.suppress(Exception):\n", + "                    os.remove(local_tar_gz_path)\n", + "\n", + "    return mapping\n", + "\n", + "\n", + "mapping = upload_images_to_s3(\n", + "    images,\n", + "    bucket=os.environ[\"AWS_S3_BUCKET\"],\n", + "    prefix=\"docker-tarballs/\",\n", + "    region=os.environ.get(\"AWS_REGION\", None),\n", + ")\n", + "\n", + "Path(\"scratch/artifacts/aws_docker_image_s3_mapping.json\").write_text(json.dumps(mapping, indent=2))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d94f9955", + "metadata": {}, + "outputs": [], + "source": [ + "# #!/usr/bin/env bash\n", + "# set -euo pipefail\n", + "\n", + "# S3_URI=\"s3://my-artifacts-bucket/bundles/docker-images-2025-09-19.tar.gz\"\n", + "# WORKDIR=\"${WORKDIR:-/tmp/docker-image-bundle}\"\n", + "\n", + "# mkdir -p \"$WORKDIR\"\n", + "# cd \"$WORKDIR\"\n", + "\n", + "# echo \"Downloading bundle...\"\n", + "# aws s3 cp \"$S3_URI\" ./bundle.tar.gz\n", + "\n", + "# echo \"Extracting bundle...\"\n", + "# tar -xzf bundle.tar.gz\n", + "\n", + "# echo \"Loading images into Docker...\"\n", + "# shopt -s nullglob\n", + "# for img_tar in *.tar; do\n", + "# echo \"Loading $img_tar ...\"\n", + "# docker load -i \"$img_tar\"\n", + "# done\n", + "\n", + "# echo \"Done. 
Loaded images:\"\n", + "# docker images\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/scratch/notebooks/migrate_ecr_to_dockerhub.ipynb b/scratch/notebooks/migrate_ecr_to_dockerhub.ipynb new file mode 100644 index 0000000..e454ed5 --- /dev/null +++ b/scratch/notebooks/migrate_ecr_to_dockerhub.ipynb @@ -0,0 +1,610 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Migrate Containers from ECR to DockerHub\n", + "\n", + "This notebook:\n", + "1. Lists all containers on ECR\n", + "2. Filters ECR images by push date (only includes images pushed after a specified date)\n", + "3. Lists all containers on DockerHub\n", + "4. Finds containers that are on ECR but not on DockerHub\n", + "5. Pulls those containers from ECR\n", + "6. 
Pushes them to DockerHub" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup and Imports" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/mnt/sdd1/atharvas/formulacode/datasmith\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:49:04 INFO botocore.credentials: Found credentials in shared credentials file: ~/.aws/credentials\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "AWS Account ID: 204464138089\n", + "ECR Registry: 204464138089.dkr.ecr.us-east-1.amazonaws.com\n", + "ECR Repository: formulacode/all\n", + "DockerHub Namespace: formulacode\n", + "DockerHub Repository: all\n", + "Date Filter: Only images pushed after 2025-11-07 01:00:00+00:00\n" + ] + } + ], + "source": [ + "%cd /mnt/sdd1/atharvas/formulacode/datasmith/\n", + "import os\n", + "import base64\n", + "from datetime import datetime, timezone\n", + "import docker\n", + "import boto3\n", + "from botocore.exceptions import ClientError\n", + "from datasmith.docker.dockerhub import _list_dockerhub_tags_single_repo, _get_dockerhub_credentials\n", + "from datasmith.docker.ecr import _list_ecr_tags_single_repo\n", + "from datasmith.logging_config import configure_logging\n", + "\n", + "logger = configure_logging()\n", + "\n", + "# Configuration\n", + "AWS_REGION = os.environ.get(\"AWS_REGION\", \"us-east-1\")\n", + "ECR_REPO = \"formulacode/all\"\n", + "DOCKERHUB_NAMESPACE = \"formulacode\" # Change to your DockerHub namespace\n", + "DOCKERHUB_REPO = \"all\"\n", + "\n", + "# Date filter - only migrate images pushed after this date\n", + "MIN_PUSHED_AT_UTC = datetime(2025, 11, 7, 1, 0, 0, tzinfo=timezone.utc)\n", + "\n", + "# Get DockerHub credentials\n", + "DOCKERHUB_USERNAME = os.environ.get(\"DOCKERHUB_USERNAME\")\n", + "DOCKERHUB_PASSWORD = os.environ.get(\"DOCKERHUB_TOKEN\") or 
os.environ.get(\"DOCKERHUB_PASSWORD\")\n", + "\n", + "# Get Docker client\n", + "docker_client = docker.from_env()\n", + "\n", + "# Get AWS account ID for ECR\n", + "session = boto3.session.Session(region_name=AWS_REGION)\n", + "sts = session.client(\"sts\")\n", + "account_id = sts.get_caller_identity()[\"Account\"]\n", + "ECR_REGISTRY = f\"{account_id}.dkr.ecr.{AWS_REGION}.amazonaws.com\"\n", + "\n", + "print(f\"AWS Account ID: {account_id}\")\n", + "print(f\"ECR Registry: {ECR_REGISTRY}\")\n", + "print(f\"ECR Repository: {ECR_REPO}\")\n", + "print(f\"DockerHub Namespace: {DOCKERHUB_NAMESPACE}\")\n", + "print(f\"DockerHub Repository: {DOCKERHUB_REPO}\")\n", + "print(f\"Date Filter: Only images pushed after {MIN_PUSHED_AT_UTC}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## List All Containers on ECR" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:49:05 INFO botocore.credentials: Found credentials in shared credentials file: ~/.aws/credentials\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Listing all images on ECR...\n", + "\n", + "Found 1746 images on ECR\n", + "\n", + "First 10 ECR images:\n", + " - 204464138089.dkr.ecr.us-east-1.amazonaws.com/formulacode/all:activitysim-activitysim-2f9afa0ad632d9dd3d98730c631daebb9f765b1a--final\n", + " - 204464138089.dkr.ecr.us-east-1.amazonaws.com/formulacode/all:allencellmodeling-aicsimageio-738ce92dcc4440563b77557bcc1fdf6b0a82738f--final\n", + " - 204464138089.dkr.ecr.us-east-1.amazonaws.com/formulacode/all:allencellmodeling-aicsimageio-738ce92dcc4440563b77557bcc1fdf6b0a82738f--run\n", + " - 204464138089.dkr.ecr.us-east-1.amazonaws.com/formulacode/all:allencellmodeling-aicsimageio-c49a613dc54381d11237240ba36f0ef54603a7d6--final\n", + " - 
204464138089.dkr.ecr.us-east-1.amazonaws.com/formulacode/all:arviz-devs-arviz-821c126ed9a41777518be256f075efd859b8e62b--final\n", + " - 204464138089.dkr.ecr.us-east-1.amazonaws.com/formulacode/all:arviz-devs-arviz-cfbfbeb4b274a4990803c179936567d524f5e694--final\n", + " - 204464138089.dkr.ecr.us-east-1.amazonaws.com/formulacode/all:arviz-devs-arviz-cfbfbeb4b274a4990803c179936567d524f5e694--run\n", + " - 204464138089.dkr.ecr.us-east-1.amazonaws.com/formulacode/all:arviz-devs-arviz-e6b5e2bbdcd721cb621e7171964d84e9d48d592c--final\n", + " - 204464138089.dkr.ecr.us-east-1.amazonaws.com/formulacode/all:asdf-format-asdf-97a9fe49e44945ac554b5374ded165405fe878ee--final\n", + " - 204464138089.dkr.ecr.us-east-1.amazonaws.com/formulacode/all:asdf-format-asdf-bf1843ef8bcbe1d89623e26659ff408362f62fc7--final\n" + ] + } + ], + "source": [ + "print(\"Listing all images on ECR...\")\n", + "ecr_tags = _list_ecr_tags_single_repo(region=AWS_REGION, repo_name=ECR_REPO)\n", + "\n", + "print(f\"\\nFound {len(ecr_tags)} images on ECR\")\n", + "print(f\"\\nFirst 10 ECR images:\")\n", + "for tag in sorted(ecr_tags)[:10]:\n", + " print(f\" - {ECR_REGISTRY}/{ECR_REPO}:{tag}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:49:12 INFO botocore.credentials: Found credentials in shared credentials file: ~/.aws/credentials\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Filtering 1746 ECR images by push date (after 2025-11-07 01:00:00+00:00)...\n", + "\n", + "Filtered from 1746 to 953 images\n", + "Removed 793 images pushed before 2025-11-07 01:00:00+00:00\n", + "\n", + "First 10 filtered ECR images:\n", + " - 204464138089.dkr.ecr.us-east-1.amazonaws.com/formulacode/all:allencellmodeling-aicsimageio-738ce92dcc4440563b77557bcc1fdf6b0a82738f--final\n", + " - 
204464138089.dkr.ecr.us-east-1.amazonaws.com/formulacode/all:allencellmodeling-aicsimageio-c49a613dc54381d11237240ba36f0ef54603a7d6--final\n", + " - 204464138089.dkr.ecr.us-east-1.amazonaws.com/formulacode/all:arviz-devs-arviz-821c126ed9a41777518be256f075efd859b8e62b--final\n", + " - 204464138089.dkr.ecr.us-east-1.amazonaws.com/formulacode/all:arviz-devs-arviz-cfbfbeb4b274a4990803c179936567d524f5e694--final\n", + " - 204464138089.dkr.ecr.us-east-1.amazonaws.com/formulacode/all:arviz-devs-arviz-e6b5e2bbdcd721cb621e7171964d84e9d48d592c--final\n", + " - 204464138089.dkr.ecr.us-east-1.amazonaws.com/formulacode/all:bjodah-chempy-2d148283c48a84d9414765fbabc806042654b9c2--final\n", + " - 204464138089.dkr.ecr.us-east-1.amazonaws.com/formulacode/all:danielgtaylor-python-betterproto-bd7de203e16e949666b2844b3dec1eb7c4ed523c--final\n", + " - 204464138089.dkr.ecr.us-east-1.amazonaws.com/formulacode/all:dasdae-dascore-6f098be09791468f989715a5bcfb651a0a44db3e--final\n", + " - 204464138089.dkr.ecr.us-east-1.amazonaws.com/formulacode/all:dasdae-dascore-7139676bd5c82db6456c0939022a5b14275b0dfb--final\n", + " - 204464138089.dkr.ecr.us-east-1.amazonaws.com/formulacode/all:dasdae-dascore-9ade640f5154e37b2378b9fdbf342c653fa8497d--final\n" + ] + } + ], + "source": [ + "def filter_ecr_tags_by_push_date(\n", + " tags: set[str],\n", + " *,\n", + " region: str,\n", + " repo_name: str,\n", + " cutoff: datetime,\n", + ") -> set[str]:\n", + " \"\"\"\n", + " Filter ECR tags to only include those pushed after the cutoff date.\n", + " \n", + " Args:\n", + " tags: Set of ECR image tags to filter\n", + " region: AWS region\n", + " repo_name: ECR repository name\n", + " cutoff: Only include images pushed after this datetime (must be timezone-aware)\n", + " \n", + " Returns:\n", + " Set of tags that were pushed after the cutoff date\n", + " \"\"\"\n", + " if not tags:\n", + " return set()\n", + " \n", + " session = boto3.session.Session(region_name=region)\n", + " ecr_client = 
session.client(\"ecr\")\n", + " \n", + " kept: set[str] = set()\n", + " \n", + " # Query ECR in chunks of 100 tags (API limit)\n", + " CHUNK_SIZE = 100\n", + " tag_list = sorted(tags)\n", + " \n", + " for i in range(0, len(tag_list), CHUNK_SIZE):\n", + " chunk = tag_list[i : i + CHUNK_SIZE]\n", + " image_ids = [{\"imageTag\": tag} for tag in chunk]\n", + " \n", + " try:\n", + " resp = ecr_client.describe_images(\n", + " repositoryName=repo_name,\n", + " imageIds=image_ids\n", + " )\n", + " except ClientError as ce:\n", + " code = ce.response.get(\"Error\", {}).get(\"Code\")\n", + " if code in {\"RepositoryNotFoundException\", \"ImageNotFoundException\"}:\n", + " # Skip this chunk if repo/images don't exist\n", + " logger.warning(f\"Repository or images not found for chunk {i}-{i+len(chunk)}\")\n", + " continue\n", + " # Re-raise unexpected errors\n", + " raise\n", + " \n", + " # Process image details\n", + " for detail in resp.get(\"imageDetails\", []):\n", + " pushed_at = detail.get(\"imagePushedAt\") # boto3 returns timezone-aware datetime\n", + " image_tags = detail.get(\"imageTags\", [])\n", + " \n", + " if not pushed_at or not image_tags:\n", + " continue\n", + " \n", + " # If this image was pushed after the cutoff, keep all its tags\n", + " if pushed_at > cutoff:\n", + " for tag in image_tags:\n", + " if tag in tags: # Only add if it was in our original set\n", + " kept.add(tag)\n", + " \n", + " return kept\n", + "\n", + "\n", + "print(f\"Filtering {len(ecr_tags)} ECR images by push date (after {MIN_PUSHED_AT_UTC})...\")\n", + "filtered_ecr_tags = filter_ecr_tags_by_push_date(\n", + " ecr_tags,\n", + " region=AWS_REGION,\n", + " repo_name=ECR_REPO,\n", + " cutoff=MIN_PUSHED_AT_UTC\n", + ")\n", + "\n", + "print(f\"\\nFiltered from {len(ecr_tags)} to {len(filtered_ecr_tags)} images\")\n", + "print(f\"Removed {len(ecr_tags) - len(filtered_ecr_tags)} images pushed before {MIN_PUSHED_AT_UTC}\")\n", + "print(f\"\\nFirst 10 filtered ECR images:\")\n", + "for tag in 
sorted(filtered_ecr_tags)[:10]:\n", + "    print(f\"  - {ECR_REGISTRY}/{ECR_REPO}:{tag}\")\n", + "\n", + "# Update ecr_tags to use the filtered set\n", + "ecr_tags = filtered_ecr_tags" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## List All Containers on DockerHub" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Listing all images on DockerHub...\n", + "\n", + "Found 793 images on DockerHub\n", + "\n", + "First 10 DockerHub images:\n", + "  - docker.io/formulacode/all:allencellmodeling-aicsimageio-738ce92dcc4440563b77557bcc1fdf6b0a82738f--final\n", + "  - docker.io/formulacode/all:allencellmodeling-aicsimageio-c49a613dc54381d11237240ba36f0ef54603a7d6--final\n", + "  - docker.io/formulacode/all:arviz-devs-arviz-821c126ed9a41777518be256f075efd859b8e62b--final\n", + "  - docker.io/formulacode/all:arviz-devs-arviz-cfbfbeb4b274a4990803c179936567d524f5e694--final\n", + "  - docker.io/formulacode/all:arviz-devs-arviz-e6b5e2bbdcd721cb621e7171964d84e9d48d592c--final\n", + "  - docker.io/formulacode/all:bjodah-chempy-2d148283c48a84d9414765fbabc806042654b9c2--final\n", + "  - docker.io/formulacode/all:danielgtaylor-python-betterproto-bd7de203e16e949666b2844b3dec1eb7c4ed523c--final\n", + "  - docker.io/formulacode/all:dasdae-dascore-6f098be09791468f989715a5bcfb651a0a44db3e--final\n", + "  - docker.io/formulacode/all:dasdae-dascore-7139676bd5c82db6456c0939022a5b14275b0dfb--final\n", + "  - docker.io/formulacode/all:dasdae-dascore-9ade640f5154e37b2378b9fdbf342c653fa8497d--final\n" + ] + } + ], + "source": [ + "if not DOCKERHUB_USERNAME or not DOCKERHUB_PASSWORD:\n", + "    raise ValueError(\n", + "        \"DockerHub credentials required. 
Set DOCKERHUB_USERNAME and DOCKERHUB_TOKEN environment variables.\\n\"\n", + " \"Generate tokens at: https://hub.docker.com/settings/security\"\n", + " )\n", + "\n", + "print(\"Listing all images on DockerHub...\")\n", + "dockerhub_tags = _list_dockerhub_tags_single_repo(\n", + " namespace=DOCKERHUB_NAMESPACE,\n", + " repo_name=DOCKERHUB_REPO,\n", + " username=DOCKERHUB_USERNAME,\n", + " password=DOCKERHUB_PASSWORD\n", + ")\n", + "\n", + "print(f\"\\nFound {len(dockerhub_tags)} images on DockerHub\")\n", + "print(f\"\\nFirst 10 DockerHub images:\")\n", + "for tag in sorted(dockerhub_tags)[:10]:\n", + " print(f\" - docker.io/{DOCKERHUB_NAMESPACE}/{DOCKERHUB_REPO}:{tag}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Find Containers on ECR but Not on DockerHub" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Found 161 images on ECR that are NOT on DockerHub\n", + "\n", + "First 20 missing images:\n", + " - deepchecks-deepchecks-9d1261c40a3685ece57f0fefc72a36ed129fc10e--final\n", + " - devitocodes-devito-77ffae75d3498c2d576d8aa42452d03d30c10535--final\n", + " - devitocodes-devito-c7c5277ced30d737475b278610b8471a309bb8df--final\n", + " - dipy-dipy-03976f506cca9bc0b38fa4c5ea264d780f000c24--final\n", + " - dipy-dipy-0f32296de973299b21352370e1288633778d0949--final\n", + " - dipy-dipy-b585dd6a502c27617804fdf6d73929573e7059f9--final\n", + " - kedro-org-kedro-38bb1b248a49e9687eb60831508d30c018eea565--final\n", + " - kedro-org-kedro-4ba6b50352454bf73fa6c1a62f9ae5a60b595aff--final\n", + " - kedro-org-kedro-70734ce00ee46b58c85b4cf04afbe89d32c06758--final\n", + " - lmfit-lmfit-py-03daaf8ac13158920745bceece36f28b5d599af9--final\n", + " - lmfit-lmfit-py-03f1b9b9660ebd00177f6cb17d12121cdcccd1a5--final\n", + " - lmfit-lmfit-py-1a009ed4fee493ed4fb5dedc7691fa01507de90e--final\n", + " - 
lmfit-lmfit-py-589c029fff571397ea14d9dbe6ae2283c6f10780--final\n", + " - microsoft-qcodes-6e3c05df1839d7a49dff967c45f392f426e03ad5--final\n", + " - microsoft-qcodes-92bb5d1dabe2732cfe9c823d3451cb43244cf59f--final\n", + " - microsoft-qcodes-ab0c5de9aa89363d66cfbd856b1d2e90a5c5a189--final\n", + " - napari-napari-1546f18ecc13364d5415623a9c11ed760ff043e2--final\n", + " - networkx-networkx-36e8a1ee85ca0ab4195a486451ca7d72153e2e00--final\n", + " - networkx-networkx-37fd64ced70b896d92eb4940f2d6a78bfc3ad328--final\n", + " - networkx-networkx-c8e48981844bf7f910688620166fc865b91fb5be--final\n", + "\n", + "Total images to migrate: 161\n" + ] + } + ], + "source": [ + "# Find images that exist on ECR but not on DockerHub\n", + "missing_tags = ecr_tags - dockerhub_tags\n", + "\n", + "print(f\"\\nFound {len(missing_tags)} images on ECR that are NOT on DockerHub\")\n", + "print(f\"\\nFirst 20 missing images:\")\n", + "for tag in sorted(missing_tags)[:20]:\n", + " print(f\" - {tag}\")\n", + "\n", + "# Store sorted list for migration\n", + "tags_to_migrate = sorted(missing_tags)\n", + "print(f\"\\nTotal images to migrate: {len(tags_to_migrate)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Authenticate with ECR and DockerHub" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Logging in to ECR...\n", + "✓ Logged in to ECR: 204464138089.dkr.ecr.us-east-1.amazonaws.com\n", + "\n", + "Logging in to DockerHub...\n", + "✓ Logged in to DockerHub as formulacode\n" + ] + } + ], + "source": [ + "# Login to ECR\n", + "print(\"Logging in to ECR...\")\n", + "ecr_client = session.client(\"ecr\")\n", + "auth_response = ecr_client.get_authorization_token()\n", + "auth_data = auth_response[\"authorizationData\"][0]\n", + "ecr_username, ecr_password = base64.b64decode(auth_data[\"authorizationToken\"]).decode().split(\":\", 1)\n", + "ecr_endpoint = 
auth_data[\"proxyEndpoint\"].replace(\"https://\", \"\")\n", + "\n", + "docker_client.login(\n", + " username=ecr_username,\n", + " password=ecr_password,\n", + " registry=ecr_endpoint\n", + ")\n", + "print(f\"✓ Logged in to ECR: {ecr_endpoint}\")\n", + "\n", + "# Login to DockerHub\n", + "print(\"\\nLogging in to DockerHub...\")\n", + "docker_client.login(\n", + " username=DOCKERHUB_USERNAME,\n", + " password=DOCKERHUB_PASSWORD,\n", + " registry=\"docker.io\"\n", + ")\n", + "print(f\"✓ Logged in to DockerHub as {DOCKERHUB_USERNAME}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pull Containers from ECR\n", + "\n", + "This cell pulls all missing containers from ECR. This may take a long time depending on the number and size of images." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from tqdm.notebook import tqdm\n", + "\n", + "pulled_images = {}\n", + "pull_errors = {}\n", + "\n", + "print(f\"Pulling {len(tags_to_migrate)} images from ECR...\\n\")\n", + "\n", + "for tag in tqdm(tags_to_migrate, desc=\"Pulling from ECR\"):\n", + " ecr_image_ref = f\"{ECR_REGISTRY}/{ECR_REPO}:{tag}\"\n", + " try:\n", + " print(f\"Pulling {ecr_image_ref}...\")\n", + " image = docker_client.images.pull(ecr_image_ref)\n", + " pulled_images[tag] = image\n", + " print(f\" ✓ Pulled {ecr_image_ref}\")\n", + " except Exception as e:\n", + " pull_errors[tag] = str(e)\n", + " print(f\" ✖ Failed to pull {ecr_image_ref}: {e}\")\n", + "\n", + "print(f\"\\n✓ Successfully pulled {len(pulled_images)} images\")\n", + "if pull_errors:\n", + " print(f\"✖ Failed to pull {len(pull_errors)} images\")\n", + " print(\"\\nPull errors:\")\n", + " for tag, error in list(pull_errors.items())[:10]:\n", + " print(f\" - {tag}: {error}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Push Containers to DockerHub\n", + "\n", + "This cell pushes all pulled containers to DockerHub. 
This may take a long time.\n",
+    "\n",
+    "**Note:** Rate limiting may occur with DockerHub. The cell includes retry logic."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import time\n",
+    "\n",
+    "pushed_images = {}\n",
+    "push_errors = {}\n",
+    "\n",
+    "print(f\"Pushing {len(pulled_images)} images to DockerHub...\\n\")\n",
+    "\n",
+    "for tag, image in tqdm(pulled_images.items(), desc=\"Pushing to DockerHub\"):\n",
+    "    dockerhub_image_ref = f\"{DOCKERHUB_NAMESPACE}/{DOCKERHUB_REPO}:{tag}\"\n",
+    "    try:\n",
+    "        # Tag the image for DockerHub\n",
+    "        image.tag(f\"{DOCKERHUB_NAMESPACE}/{DOCKERHUB_REPO}\", tag=tag)\n",
+    "\n",
+    "        # Push with retry on DockerHub rate limiting\n",
+    "        max_retries = 3\n",
+    "        for attempt in range(max_retries):\n",
+    "            try:\n",
+    "                print(f\"Pushing {dockerhub_image_ref}... (attempt {attempt + 1}/{max_retries})\")\n",
+    "                for line in docker_client.images.push(\n",
+    "                    f\"{DOCKERHUB_NAMESPACE}/{DOCKERHUB_REPO}\", tag=tag, stream=True, decode=True\n",
+    "                ):\n",
+    "                    if \"error\" in line:\n",
+    "                        raise RuntimeError(line.get(\"error\", \"Unknown error\"))\n",
+    "                pushed_images[tag] = dockerhub_image_ref\n",
+    "                print(f\"  ✓ Pushed {dockerhub_image_ref}\")\n",
+    "                break\n",
+    "            except Exception as e:\n",
+    "                error_msg = str(e).lower()\n",
+    "                if any(tok in error_msg for tok in (\"rate limit\", \"429\", \"too many requests\")) and attempt < max_retries - 1:\n",
+    "                    wait_time = 60 * (2 ** attempt)\n",
+    "                    print(f\"  ⚠ Rate limit hit, waiting {wait_time}s before retry...\")\n",
+    "                    time.sleep(wait_time)\n",
+    "                    continue\n",
+    "                raise\n",
+    "    except Exception as e:\n",
+    "        push_errors[tag] = str(e)\n",
+    "        print(f\"  ✖ Failed to push {dockerhub_image_ref}: {e}\")\n",
+    "\n",
+    "print(f\"\\n✓ Successfully pushed {len(pushed_images)} images\")\n",
+    "if push_errors:\n",
+    "    print(f\"✖ Failed to push {len(push_errors)} images\")\n",
+    "    print(\"\\nPush errors:\")\n",
+    "    for tag, error in list(push_errors.items())[:10]:\n",
+    "        print(f\"  - {tag}: {error}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Summary"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"=\" * 80)\n",
+    "print(\"MIGRATION SUMMARY\")\n",
+    "print(\"=\" * 80)\n",
+    "print(f\"Date Filter: Images pushed after {MIN_PUSHED_AT_UTC}\")\n",
+    "print(f\"Total images on ECR (filtered): {len(ecr_tags)}\")\n",
+    "print(f\"Total images on DockerHub (before): {len(dockerhub_tags)}\")\n",
+    "print(f\"Images needing migration: {len(tags_to_migrate)}\")\n",
+    "print(f\"Images successfully pulled: {len(pulled_images)}\")\n",
+    "print(f\"Images successfully pushed: {len(pushed_images)}\")\n",
+    "print(f\"Images failed to pull: {len(pull_errors)}\")\n",
+    "print(f\"Images failed to push: {len(push_errors)}\")\n",
+    "print(\"=\" * 80)\n",
+    "\n",
+    "if pushed_images:\n",
+    "    print(\"\\nSuccessfully migrated images:\")\n",
+    "    for tag in sorted(pushed_images.keys())[:20]:\n",
+    "        print(f\"  ✓ {tag}\")\n",
+    "    if len(pushed_images) > 20:\n",
+    "        print(f\"  ... and {len(pushed_images) - 20} more\")\n",
+    "\n",
+    "if pull_errors or push_errors:\n",
+    "    print(\"\\n⚠ Some images failed to migrate. Review the errors above.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Optional: Clean Up Local Images\n",
+    "\n",
+    "Uncomment and run this cell to remove the pulled images from local Docker storage to free up space."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# print(\"Cleaning up local images...\")\n", + "# for tag, image in pulled_images.items():\n", + "# try:\n", + "# docker_client.images.remove(image.id, force=True)\n", + "# print(f\" Removed {tag}\")\n", + "# except Exception as e:\n", + "# print(f\" Failed to remove {tag}: {e}\")\n", + "# print(\"✓ Cleanup complete\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/scratch/notebooks/migrate_ecr_to_dockerhub.py b/scratch/notebooks/migrate_ecr_to_dockerhub.py new file mode 100644 index 0000000..4aa6846 --- /dev/null +++ b/scratch/notebooks/migrate_ecr_to_dockerhub.py @@ -0,0 +1,437 @@ +# %% [markdown] +# # Migrate Containers from ECR to DockerHub +# +# This notebook: +# 1. Lists all containers on ECR +# 2. Filters ECR images by push date (only includes images pushed after a specified date) +# 3. Lists all containers on DockerHub +# 4. Finds containers that are on ECR but not on DockerHub +# 5. Pulls those containers from ECR +# 6. 
Pushes them to DockerHub + +# %% [markdown] +# ## Setup and Imports + +# %% +# %cd /mnt/sdd1/atharvas/formulacode/datasmith/ +import os +import base64 +from datetime import datetime, timezone +import docker +import boto3 +from botocore.exceptions import ClientError +from datasmith.docker.dockerhub import _list_dockerhub_tags_single_repo, _get_dockerhub_credentials +from datasmith.docker.ecr import _list_ecr_tags_single_repo +from datasmith.logging_config import configure_logging + +logger = configure_logging() + +# Configuration +AWS_REGION = os.environ.get("AWS_REGION", "us-east-1") +ECR_REPO = "formulacode/all" +DOCKERHUB_NAMESPACE = "formulacode" # Change to your DockerHub namespace +DOCKERHUB_REPO = "all" + +# Date filter - only migrate images pushed after this date +MIN_PUSHED_AT_UTC = datetime(2025, 11, 7, 1, 0, 0, tzinfo=timezone.utc) + +# Get DockerHub credentials +DOCKERHUB_USERNAME = os.environ.get("DOCKERHUB_USERNAME") +DOCKERHUB_PASSWORD = os.environ.get("DOCKERHUB_TOKEN") or os.environ.get("DOCKERHUB_PASSWORD") + +# Get Docker client +docker_client = docker.from_env() + +# Get AWS account ID for ECR +session = boto3.session.Session(region_name=AWS_REGION) +sts = session.client("sts") +account_id = sts.get_caller_identity()["Account"] +ECR_REGISTRY = f"{account_id}.dkr.ecr.{AWS_REGION}.amazonaws.com" + +print(f"AWS Account ID: {account_id}") +print(f"ECR Registry: {ECR_REGISTRY}") +print(f"ECR Repository: {ECR_REPO}") +print(f"DockerHub Namespace: {DOCKERHUB_NAMESPACE}") +print(f"DockerHub Repository: {DOCKERHUB_REPO}") +print(f"Date Filter: Only images pushed after {MIN_PUSHED_AT_UTC}") + +# %% [markdown] +# ## List All Containers on ECR + +# %% +print("Listing all images on ECR...") +ecr_tags = _list_ecr_tags_single_repo(region=AWS_REGION, repo_name=ECR_REPO) + +print(f"\nFound {len(ecr_tags)} images on ECR") +print(f"\nFirst 10 ECR images:") +for tag in sorted(ecr_tags)[:10]: + print(f" - {ECR_REGISTRY}/{ECR_REPO}:{tag}") + +# %% +def 
filter_ecr_tags_by_push_date( + tags: set[str], + *, + region: str, + repo_name: str, + cutoff: datetime, +) -> set[str]: + """ + Filter ECR tags to only include those pushed after the cutoff date. + + Args: + tags: Set of ECR image tags to filter + region: AWS region + repo_name: ECR repository name + cutoff: Only include images pushed after this datetime (must be timezone-aware) + + Returns: + Set of tags that were pushed after the cutoff date + """ + if not tags: + return set() + + session = boto3.session.Session(region_name=region) + ecr_client = session.client("ecr") + + kept: set[str] = set() + + # Query ECR in chunks of 100 tags (API limit) + CHUNK_SIZE = 100 + tag_list = sorted(tags) + + for i in range(0, len(tag_list), CHUNK_SIZE): + chunk = tag_list[i : i + CHUNK_SIZE] + image_ids = [{"imageTag": tag} for tag in chunk] + + try: + resp = ecr_client.describe_images( + repositoryName=repo_name, + imageIds=image_ids + ) + except ClientError as ce: + code = ce.response.get("Error", {}).get("Code") + if code in {"RepositoryNotFoundException", "ImageNotFoundException"}: + # Skip this chunk if repo/images don't exist + logger.warning(f"Repository or images not found for chunk {i}-{i+len(chunk)}") + continue + # Re-raise unexpected errors + raise + + # Process image details + for detail in resp.get("imageDetails", []): + pushed_at = detail.get("imagePushedAt") # boto3 returns timezone-aware datetime + image_tags = detail.get("imageTags", []) + + if not pushed_at or not image_tags: + continue + + # If this image was pushed after the cutoff, keep all its tags + if pushed_at > cutoff: + for tag in image_tags: + if tag in tags: # Only add if it was in our original set + kept.add(tag) + + return kept + + +print(f"Filtering {len(ecr_tags)} ECR images by push date (after {MIN_PUSHED_AT_UTC})...") +filtered_ecr_tags = filter_ecr_tags_by_push_date( + ecr_tags, + region=AWS_REGION, + repo_name=ECR_REPO, + cutoff=MIN_PUSHED_AT_UTC +) + +print(f"\nFiltered from 
{len(ecr_tags)} to {len(filtered_ecr_tags)} images")
+print(f"Removed {len(ecr_tags) - len(filtered_ecr_tags)} images pushed before {MIN_PUSHED_AT_UTC}")
+print("\nFirst 10 filtered ECR images:")
+for tag in sorted(filtered_ecr_tags)[:10]:
+    print(f"  - {ECR_REGISTRY}/{ECR_REPO}:{tag}")
+
+# Update ecr_tags to use the filtered set
+ecr_tags = filtered_ecr_tags
+
+# %% [markdown]
+# ## List All Containers on DockerHub
+
+# %%
+if not DOCKERHUB_USERNAME or not DOCKERHUB_PASSWORD:
+    raise ValueError(
+        "DockerHub credentials required. Set DOCKERHUB_USERNAME and DOCKERHUB_TOKEN environment variables.\n"
+        "Generate tokens at: https://hub.docker.com/settings/security"
+    )
+
+print("Listing all images on DockerHub...")
+dockerhub_tags = _list_dockerhub_tags_single_repo(
+    namespace=DOCKERHUB_NAMESPACE,
+    repo_name=DOCKERHUB_REPO,
+    username=DOCKERHUB_USERNAME,
+    password=DOCKERHUB_PASSWORD
+)
+
+print(f"\nFound {len(dockerhub_tags)} images on DockerHub")
+print("\nFirst 10 DockerHub images:")
+for tag in sorted(dockerhub_tags)[:10]:
+    print(f"  - docker.io/{DOCKERHUB_NAMESPACE}/{DOCKERHUB_REPO}:{tag}")
+
+# %% [markdown]
+# ## Find Containers on ECR but Not on DockerHub
+
+# %%
+# Find images that exist on ECR but not on DockerHub
+missing_tags = ecr_tags - dockerhub_tags
+
+print(f"\nFound {len(missing_tags)} images on ECR that are NOT on DockerHub")
+print("\nFirst 20 missing images:")
+for tag in sorted(missing_tags)[:20]:
+    print(f"  - {tag}")
+
+# Store sorted list for migration
+tags_to_migrate = sorted(missing_tags)
+print(f"\nTotal images to migrate: {len(tags_to_migrate)}")
+
+# %% [markdown]
+# ## Authenticate with ECR and DockerHub
+
+# %%
+def login_to_ecr(docker_client, session, region):
+    """Login to ECR and return auth credentials.
Can be called to refresh expired tokens.""" + ecr_client = session.client("ecr") + auth_response = ecr_client.get_authorization_token() + auth_data = auth_response["authorizationData"][0] + ecr_username, ecr_password = base64.b64decode(auth_data["authorizationToken"]).decode().split(":", 1) + ecr_endpoint = auth_data["proxyEndpoint"].replace("https://", "") + + docker_client.login( + username=ecr_username, + password=ecr_password, + registry=ecr_endpoint + ) + + return { + "username": ecr_username, + "password": ecr_password, + "endpoint": ecr_endpoint + } + +def looks_like_ecr_auth_error(error_msg): + """Check if error message indicates ECR token expiration.""" + if not error_msg: + return False + lower = str(error_msg).lower() + return any( + token in lower + for token in [ + "authorization token has expired", + "no basic auth credentials", + "authorization failed", + "access denied", + "requested access to the resource is denied", + "pull access denied", + ] + ) + +# Login to ECR +print("Logging in to ECR...") +ecr_auth = login_to_ecr(docker_client, session, AWS_REGION) +print(f"✓ Logged in to ECR: {ecr_auth['endpoint']}") + +# Login to DockerHub +print("\nLogging in to DockerHub...") +docker_client.login( + username=DOCKERHUB_USERNAME, + password=DOCKERHUB_PASSWORD, + registry="docker.io" +) +print(f"✓ Logged in to DockerHub as {DOCKERHUB_USERNAME}") + +# %% [markdown] +# ## Pull Containers from ECR +# +# This cell pulls all missing containers from ECR with automatic token refresh on expiration. +# This may take a long time depending on the number and size of images. 
+ +# %% +from tqdm.auto import tqdm + +pulled_images = {} +pull_errors = {} + +print(f"Pulling {len(tags_to_migrate)} images from ECR...\n") + +for tag in tqdm(tags_to_migrate, desc="Pulling from ECR"): + ecr_image_ref = f"{ECR_REGISTRY}/{ECR_REPO}:{tag}" + + # Try pulling with auth retry on token expiration + max_retries = 2 + success = False + + for attempt in range(max_retries): + try: + if attempt == 0: + print(f"Pulling {ecr_image_ref}...") + else: + print(f" Retrying {ecr_image_ref} (attempt {attempt + 1}/{max_retries})...") + + # Pull with explicit auth config + image = docker_client.images.pull( + ecr_image_ref, + auth_config={ + "username": ecr_auth["username"], + "password": ecr_auth["password"] + } + ) + pulled_images[tag] = image + print(f" ✓ Pulled {ecr_image_ref}") + success = True + break + + except Exception as e: + error_msg = str(e) + + # If it's an auth error and we haven't exhausted retries, refresh token and retry + if looks_like_ecr_auth_error(error_msg) and attempt < max_retries - 1: + print(f" ⚠ ECR token expired for {ecr_image_ref}, refreshing and retrying...") + try: + ecr_auth = login_to_ecr(docker_client, session, AWS_REGION) + print(f" ✓ Refreshed ECR token") + except Exception as auth_error: + pull_errors[tag] = f"Failed to refresh ECR token: {auth_error}" + print(f" ✖ Failed to refresh ECR token: {auth_error}") + break + else: + # Non-auth error or exhausted retries + if not success: + pull_errors[tag] = error_msg + print(f" ✖ Failed to pull {ecr_image_ref}: {e}") + break + +print(f"\n✓ Successfully pulled {len(pulled_images)} images") +if pull_errors: + print(f"✖ Failed to pull {len(pull_errors)} images") + print("\nPull errors:") + for tag, error in list(pull_errors.items())[:10]: + print(f" - {tag}: {error}") + +# %% [markdown] +# ## Push Containers to DockerHub +# +# This cell pushes all pulled containers to DockerHub. This may take a long time. +# +# **Note:** Rate limiting may occur with DockerHub. The cell includes retry logic. 
+ +# %% +import time +from tqdm.auto import tqdm + +pushed_images = {} +push_errors = {} + +print(f"Pushing {len(pulled_images)} images to DockerHub...\n") + +for tag, image in tqdm(pulled_images.items(), desc="Pushing to DockerHub"): + ecr_image_ref = f"{ECR_REGISTRY}/{ECR_REPO}:{tag}" + dockerhub_image_ref = f"{DOCKERHUB_NAMESPACE}/{DOCKERHUB_REPO}:{tag}" + + try: + # Tag the image for DockerHub + image.tag(f"{DOCKERHUB_NAMESPACE}/{DOCKERHUB_REPO}", tag=tag) + + # Push to DockerHub with retry logic + max_retries = 3 + for attempt in range(max_retries): + try: + print(f"Pushing {dockerhub_image_ref}... (attempt {attempt + 1}/{max_retries})") + + # Push and wait for completion + push_result = docker_client.images.push( + f"{DOCKERHUB_NAMESPACE}/{DOCKERHUB_REPO}", + tag=tag, + stream=True, + decode=True + ) + + # Check for errors in stream + success = False + for line in push_result: + if "error" in line: + raise RuntimeError(line.get("error", "Unknown error")) + if "aux" in line and "Digest" in line.get("aux", {}): + success = True + digest = line["aux"]["Digest"] + print(f" ✓ Pushed {dockerhub_image_ref} ({digest})") + pushed_images[tag] = dockerhub_image_ref + break + + if success: + break + + except Exception as e: + error_msg = str(e).lower() + if "rate limit" in error_msg or "429" in error_msg or "too many requests" in error_msg: + if attempt < max_retries - 1: + wait_time = 60 * (2 ** attempt) + print(f" ⚠ Rate limit hit, waiting {wait_time}s before retry...") + time.sleep(wait_time) + continue + raise + + except Exception as e: + push_errors[tag] = str(e) + print(f" ✖ Failed to push {dockerhub_image_ref}: {e}") + +print(f"\n✓ Successfully pushed {len(pushed_images)} images") +if push_errors: + print(f"✖ Failed to push {len(push_errors)} images") + print("\nPush errors:") + for tag, error in list(push_errors.items())[:10]: + print(f" - {tag}: {error}") + +# %% [markdown] +# ## Summary + +# %% +print("=" * 80) +print("MIGRATION SUMMARY") +print("=" * 80) 
+print(f"Date Filter: Images pushed after {MIN_PUSHED_AT_UTC}") +print(f"Total images on ECR (filtered): {len(ecr_tags)}") +print(f"Total images on DockerHub (before): {len(dockerhub_tags)}") +print(f"Images needing migration: {len(tags_to_migrate)}") +print(f"Images successfully pulled: {len(pulled_images)}") +print(f"Images successfully pushed: {len(pushed_images)}") +print(f"Images failed to pull: {len(pull_errors)}") +print(f"Images failed to push: {len(push_errors)}") +print("=" * 80) + +if pushed_images: + print("\nSuccessfully migrated images:") + for tag in sorted(pushed_images.keys())[:20]: + print(f" ✓ {tag}") + if len(pushed_images) > 20: + print(f" ... and {len(pushed_images) - 20} more") + +if pull_errors or push_errors: + print("\n⚠ Some images failed to migrate. Review the errors above.") + +# %% [markdown] +# ## Optional: Clean Up Local Images +# +# Uncomment and run this cell to remove the pulled images from local Docker storage to free up space. + +# %% +# print("Cleaning up local images...") +# for tag, image in pulled_images.items(): +# try: +# docker_client.images.remove(image.id, force=True) +# print(f" Removed {tag}") +# except Exception as e: +# print(f" Failed to remove {tag}: {e}") +# print("✓ Cleanup complete") + + diff --git a/scratch/notebooks/test_aws.ipynb b/scratch/notebooks/test_aws.ipynb new file mode 100644 index 0000000..d7337e4 --- /dev/null +++ b/scratch/notebooks/test_aws.ipynb @@ -0,0 +1,242 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "id": "4c686f30", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/home/asehgal/formulacode/datasmith\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "00:39:55 WARNING simple_useragent.core: Falling back to historic user agent.\n" + ] + } + ], + "source": [ + "%cd /home/asehgal/formulacode/datasmith\n", + "import datetime\n", + "import json\n", + "from pathlib import Path\n", + "\n", + 
"import docker\n", + "\n", + "from datasmith.docker.context import ContextRegistry, DockerContext, Task\n", + "from datasmith.notebooks.utils import merge_registries, update_cr\n", + "\n", + "curr_date: str = datetime.datetime.now().isoformat()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9a3dcbec", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'1'" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import os\n", + "\n", + "os.environ[\"DOCKER_S3_CACHE_READ\"], os.environ[\"DOCKER_S3_CACHE_WRITE\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0a66ccc9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "scratch/artifacts/processed/downloads/merged_context_registry_2025-09-13T20:23:35.769030.json : 1234 entries\n", + "scratch/artifacts/processed/downloads/merged_context_registry_2025-09-14T23:44:17.752697_cleaned.json : 521 entries\n", + "scratch/artifacts/processed/downloads/merged_context_registry_2025-09-12T10:16:06.549234.json : 1198 entries\n", + "scratch/artifacts/processed/downloads/merged_context_registry_2025-09-13T20:27:54.426920.json : 516 entries\n", + "scratch/artifacts/processed/downloads/merged_context_registry_2025-09-14T00:00:25.026766.json : 613 entries\n", + "scratch/artifacts/processed/downloads/merged_context_registry_2025-09-09T14:32:37.382974.json : 1156 entries\n", + "scratch/artifacts/processed/downloads/merged_context_registry_2025-09-12T10:50:49.251932.json : 1198 entries\n", + "scratch/artifacts/processed/downloads/merged_context_registry_2025-09-12T21:32:05.644755.json : 1234 entries\n", + "scratch/artifacts/processed/downloads/merged_context_registry_2025-09-14T23:44:17.752697_cleaned2.json : 519 entries\n", + "scratch/artifacts/processed/downloads/merged_context_registry_2025-09-09T01:32:38.179134.json : 1198 entries\n", + 
"scratch/artifacts/processed/downloads/merged_context_registry_2025-09-13T12:24:05.680135.json : 1234 entries\n", + "scratch/artifacts/processed/downloads/merged_context_registry_2025-09-13T20:40:24.271278.json : 628 entries\n", + "scratch/artifacts/processed/downloads/merged_context_registry_2025-09-13T23:38:29.017709.json : 613 entries\n", + "scratch/artifacts/processed/downloads/merged_context_registry_2025-09-13T20:19:13.431602.json : 1234 entries\n", + "scratch/artifacts/processed/downloads/merged_context_registry_2025-09-13T19:04:14.475606.json : 1234 entries\n", + "scratch/artifacts/processed/downloads/merged_context_registry_2025-09-16T01:17:55.372491.json : 621 entries\n", + "scratch/artifacts/processed/downloads/merged_context_registry_2025-09-12T03:42:49.965377.json : 1198 entries\n", + "scratch/artifacts/processed/downloads/merged_context_registry_2025-09-09T00:38:42.145419.json : 1016 entries\n", + "scratch/artifacts/processed/downloads/merged_context_registry_2025-09-14T23:44:17.752697.json : 457 entries\n", + "scratch/artifacts/processed/downloads/pandas2/context_registry.json : 43 entries\n", + "scratch/artifacts/processed/downloads/scipy/context_registry.json : 206 entries\n", + "scratch/artifacts/processed/downloads/scikit-image/context_registry.json : 28 entries\n", + "scratch/artifacts/processed/downloads/astropy/context_registry.json : 55 entries\n", + "scratch/artifacts/processed/downloads/numpy/context_registry.json : 67 entries\n", + "scratch/artifacts/processed/downloads/pandas/context_registry.json : 528 entries\n", + "scratch/artifacts/processed/downloads/sklearn/context_registry.json : 38 entries\n", + "scratch/artifacts/processed/downloads/dask/context_registry.json : 137 entries\n", + "scratch/artifacts/processed/downloads/distributed/context_registry.json : 102 entries\n" + ] + } + ], + "source": [ + "results_pth = Path(\"scratch/artifacts/processed/\")\n", + "\n", + "registries = results_pth.rglob(\"**/*context_registry*.json\")\n", 
+ "merged_json = merge_registries(list(registries))\n", + "registry = update_cr(ContextRegistry.deserialize(payload=json.dumps(merged_json)))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b75aa144", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "00:39:59 INFO botocore.credentials: Found credentials in shared credentials file: ~/.aws/credentials\n", + "00:39:59 INFO datasmith.docker.context: Docker image 'pandas-dev-pandas-8b313d36aff97d28250bd6e8dad02c6da055f6f1:pkg' not found locally. Building with buildx.\n", + "00:39:59 INFO datasmith.docker.context: Using S3 cache for read: type=s3,bucket=datasmith-docker-bucket,region=us-east-1,prefix=docker-cache/layers/82cb7c0f8c8578e1\n", + "00:39:59 INFO datasmith.docker.context: Using S3 cache for write: type=s3,bucket=datasmith-docker-bucket,region=us-east-1,prefix=docker-cache/layers/82cb7c0f8c8578e1,mode=max\n", + "00:39:59 INFO datasmith.docker.context: $ /usr/bin/docker buildx build --load --progress=plain -t pandas-dev-pandas-8b313d36aff97d28250bd6e8dad02c6da055f6f1:pkg --build-arg REPO_URL=https://github.com/pandas-dev/pandas.git --build-arg COMMIT_SHA=8b313d36aff97d28250bd6e8dad02c6da055f6f1 --build-arg ENV_PAYLOAD={\"constraints\": [], \"to_install\": [\"cython~=3.0.5\", \"fsspec==2024.2.0\", \"SQLAlchemy==2.0.27\", \"psycopg2-binary==2.9.9\", \"pyyaml\", \"pytest-cython\", \"meson[ninja]==1.2.1\", \"gitdb\", \"gitpython\", \"markdown\", \"numexpr==2.9.0\", \"types-PyMySQL\", \"natsort\", \"xlrd==2.0.1\", \"html5lib==1.1\", \"nbconvert==7.16.1\", \"pydata-sphinx-theme==0.14\", \"lxml==5.1.0\", \"types-PyYAML\", \"PyQt5==5.15.10\", \"beautifulsoup4==4.12.3\", \"jinja2==3.1.3\", \"nbsphinx\", \"moto\", \"nbformat\", \"pytest==8.0.2\", \"numpydoc\", \"odfpy==1.4.1\", \"qtpy==2.4.1\", \"zstandard==0.22.0\", \"tzdata==2024.1\", \"xlsxwriter==3.2.0\", \"py\", \"adbc-driver-postgresql==0.10.0\", \"types-python-dateutil\", \"dask\", 
\"flake8==6.1.0\", \"s3fs==2024.2.0\", \"google-auth\", \"pip\", \"types-pytz\", \"xarray==2024.2.0\", \"sphinx-copybutton\", \"requests\", \"pytest-cov\", \"ipykernel\", \"ipython\", \"scipy==1.12.0\", \"adbc-driver-sqlite==0.10.0\", \"python-dateutil==2.8.2\", \"pyreadstat==1.2.6\", \"pyarrow==11.0.0\", \"numpy<2\", \"matplotlib==3.8.3\", \"feedparser\", \"asv==0.6.3\", \"gcsfs==2024.2.0\", \"openpyxl==3.1.2\", \"pyxlsb==1.0.10\", \"pytz==2024.1\", \"hypothesis==6.98.15\", \"sphinx-design\", \"bottleneck==1.3.8\", \"python-calamine==0.2.0\", \"pytz\", \"pygments\", \"fastparquet==2024.2.0\", \"mypy==1.8.0\", \"tokenize-rt\", \"psycopg2==2.9.9\", \"python-dateutil\", \"pytest-xdist==3.5.0\", \"pandoc\", \"seaborn\", \"meson-python==0.13.1\", \"sphinx\", \"blosc\", \"numba==0.59.0\", \"coverage\", \"numpy==1.26.4\", \"pre-commit==3.6.2\", \"types-setuptools\", \"flask\", \"tabulate==0.9.0\", \"ipywidgets\", \"pymysql==1.1.0\", \"versioneer[toml]\", \"tables==3.9.2\", \"notebook==7.1.1\"], \"banned\": [\"pytest-qt\"]} --build-arg BUILDKIT_INLINE_CACHE=1 --label notebook=true --target pkg --network host --cache-from type=s3,bucket=datasmith-docker-bucket,region=us-east-1,prefix=docker-cache/layers/82cb7c0f8c8578e1 --cache-to type=s3,bucket=datasmith-docker-bucket,region=us-east-1,prefix=docker-cache/layers/82cb7c0f8c8578e1,mode=max -\n", + "/home/asehgal/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/lib/python3.12/subprocess.py:1010: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used\n", + " self.stdin = io.open(p2cwrite, 'wb', bufsize)\n", + "/home/asehgal/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/lib/python3.12/subprocess.py:1016: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used\n", + " self.stdout = io.open(c2pread, 'rb', bufsize)\n", + "00:58:26 INFO datasmith.docker.context: buildx build completed successfully 
for 'pandas-dev-pandas-8b313d36aff97d28250bd6e8dad02c6da055f6f1:pkg' in 1107.2 sec.\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "from datasmith.docker.s3_cache_manager import S3CacheConfig, S3DockerCacheManager\n", + "\n", + "task = Task(\n", + " owner=\"pandas-dev\",\n", + " repo=\"pandas\",\n", + " sha=\"8b313d36aff97d28250bd6e8dad02c6da055f6f1\",\n", + " commit_date=1720632658.0,\n", + " env_payload='{\"constraints\": [], \"to_install\": [\"cython~=3.0.5\", \"fsspec==2024.2.0\", \"SQLAlchemy==2.0.27\", \"psycopg2-binary==2.9.9\", \"pyyaml\", \"pytest-cython\", \"meson[ninja]==1.2.1\", \"gitdb\", \"gitpython\", \"markdown\", \"numexpr==2.9.0\", \"types-PyMySQL\", \"natsort\", \"xlrd==2.0.1\", \"html5lib==1.1\", \"nbconvert==7.16.1\", \"pydata-sphinx-theme==0.14\", \"lxml==5.1.0\", \"types-PyYAML\", \"PyQt5==5.15.10\", \"beautifulsoup4==4.12.3\", \"jinja2==3.1.3\", \"nbsphinx\", \"moto\", \"nbformat\", \"pytest==8.0.2\", \"numpydoc\", \"odfpy==1.4.1\", \"qtpy==2.4.1\", \"zstandard==0.22.0\", \"tzdata==2024.1\", \"xlsxwriter==3.2.0\", \"py\", \"adbc-driver-postgresql==0.10.0\", \"types-python-dateutil\", \"dask\", \"flake8==6.1.0\", \"s3fs==2024.2.0\", \"google-auth\", \"pip\", \"types-pytz\", \"xarray==2024.2.0\", \"sphinx-copybutton\", \"requests\", \"pytest-cov\", \"ipykernel\", \"ipython\", \"scipy==1.12.0\", \"adbc-driver-sqlite==0.10.0\", \"python-dateutil==2.8.2\", \"pyreadstat==1.2.6\", \"pyarrow==11.0.0\", \"numpy<2\", \"matplotlib==3.8.3\", \"feedparser\", \"asv==0.6.3\", \"gcsfs==2024.2.0\", \"openpyxl==3.1.2\", \"pyxlsb==1.0.10\", \"pytz==2024.1\", \"hypothesis==6.98.15\", \"sphinx-design\", \"bottleneck==1.3.8\", \"python-calamine==0.2.0\", \"pytz\", \"pygments\", \"fastparquet==2024.2.0\", \"mypy==1.8.0\", \"tokenize-rt\", \"psycopg2==2.9.9\", \"python-dateutil\", \"pytest-xdist==3.5.0\", \"pandoc\", \"seaborn\", \"meson-python==0.13.1\", \"sphinx\", \"blosc\", \"numba==0.59.0\", \"coverage\", \"numpy==1.26.4\", \"pre-commit==3.6.2\", 
\"types-setuptools\", \"flask\", \"tabulate==0.9.0\", \"ipywidgets\", \"pymysql==1.1.0\", \"versioneer[toml]\", \"tables==3.9.2\", \"notebook==7.1.1\"], \"banned\": [\"pytest-qt\"]}',\n", + " tag=\"pkg\",\n", + ")\n", + "ctx: DockerContext = registry.get(task)\n", + "s3_config = S3CacheConfig(\n", + " bucket=os.environ[\"AWS_S3_BUCKET_DOCKER\"],\n", + " prefix=\"docker-cache\",\n", + " region=os.environ[\"AWS_REGION\"],\n", + " max_cache_age_days=30,\n", + " max_cache_size_gb=100,\n", + " compression=True,\n", + ")\n", + "\n", + "s3_cache_manager = S3DockerCacheManager(s3_config)\n", + "\n", + "res = ctx.build_container_streaming(\n", + " client=docker.from_env(),\n", + " image_name=task.with_tag(\"pkg\").get_image_name(),\n", + " build_args={\n", + " \"REPO_URL\": f\"https://github.com/{task.owner}/{task.repo}.git\",\n", + " \"COMMIT_SHA\": task.sha,\n", + " \"ENV_PAYLOAD\": task.env_payload,\n", + " },\n", + " run_labels={\"notebook\": \"true\"},\n", + " s3_cache_config=s3_cache_manager,\n", + " force=False,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "7cdd3e5b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "True\n", + "[profile] Running ASV baseline on 8b313d36aff97d28250bd6e8dad02c6da055f6f1\n", + "\u00b7 No information stored about machine 'a9464fce6a38'. I know about nothing.\n", + " \n", + "\n", + "\u00b7 Discovering benchmarks\n", + "\u00b7 Running 1144 total benchmarks (1 commits * 1 environments * 1144 benchmarks)\n", + "[ 0.00%] \u00b7 For pandas commit 8b313d36
:\n", + "[ 0.00%] \u00b7\u00b7 Building for existing-py_opt_conda_envs_asv_3.10_bin_python\n", + "[ 0.00%] \u00b7\u00b7 Benchmarking existing-py_opt_conda_envs_asv_3.10_bin_python\n", + "[ 0.04%] \u00b7\u00b7\u00b7 Running (algorithms.Duplicated.time_duplicated--)....\n", + "[ 0.26%] \u00b7\u00b7\u00b7 Running (algorithms.SortIntegerArray.time_argsort--).[profile] Running ASV baseline on 8b313d36aff97d28250bd6e8dad02c6da055f6f1\n", + "\u00b7 No information stored about machine 'a9464fce6a38'. I know about nothing.\n", + " \n", + "\n", + "\u00b7 Discovering benchmarks\n", + "\u00b7 Running 1144 total benchmarks (1 commits * 1 environments * 1144 benchmarks)\n", + "[ 0.00%] \u00b7 For pandas commit 8b313d36
:\n", + "[ 0.00%] \u00b7\u00b7 Building for existing-py_opt_conda_envs_asv_3.10_bin_python\n", + "[ 0.00%] \u00b7\u00b7 Benchmarking existing-py_opt_conda_envs_asv_3.10_bin_python\n", + "[ 0.04%] \u00b7\u00b7\u00b7 Running (algorithms.Duplicated.time_duplicated--)....\n", + "[ 0.26%] \u00b7\u00b7\u00b7 Running (algorithms.SortIntegerArray.time_argsort--).\n" + ] + } + ], + "source": [ + "from datasmith.agents.context_synthesis import _run_quick_profile\n", + "\n", + "rc, logs = _run_quick_profile(\n", + " client=docker.from_env(),\n", + " image_name=task.with_tag(\"pkg\").get_image_name(),\n", + " run_labels={\"notebook\": \"true\"},\n", + " timeout=50,\n", + ")\n", + "print(rc)\n", + "print(logs.replace(\"\\\\n\", \"\\n\"))" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/scratch/scripts/build_and_publish_to_dockerhub.py b/scratch/scripts/build_and_publish_to_dockerhub.py index ffd64d5..82ce4c2 100644 --- a/scratch/scripts/build_and_publish_to_dockerhub.py +++ b/scratch/scripts/build_and_publish_to_dockerhub.py @@ -1,4 +1,7 @@ -"""Build ASV Docker images for commits and publish to DockerHub.""" +"""Build ASV Docker images for commits and publish to DockerHub. + +This script mirrors build_and_publish_to_ecr.py but publishes to DockerHub instead of AWS ECR. 
+""" from __future__ import annotations @@ -27,12 +30,12 @@ # Concurrency settings # Note: Lower push concurrency for DockerHub to avoid rate limiting _BUILD_CONCURRENCY = int(os.getenv("BUILD_CONCURRENCY", "24")) -_PUSH_CONCURRENCY = int(os.getenv("PUSH_CONCURRENCY", "8")) +_PUSH_CONCURRENCY = int(os.getenv("PUSH_CONCURRENCY", "8")) # Lower than ECR (12) _build_sem = threading.Semaphore(_BUILD_CONCURRENCY) _push_sem = threading.Semaphore(_PUSH_CONCURRENCY) _cr_lock = threading.Lock() # protect ContextRegistry mutations -logger = configure_logging(level=10, stream=open(Path(__file__).with_suffix(".log"), "a")) # noqa: SIM115 +logger = configure_logging(level=10, stream=open(Path(__file__).with_suffix(".log"), "w")) # noqa: SIM115 def parse_args() -> argparse.Namespace: diff --git a/scratch/scripts/build_and_publish_to_ecr.py b/scratch/scripts/build_and_publish_to_ecr.py new file mode 100644 index 0000000..0c604c2 --- /dev/null +++ b/scratch/scripts/build_and_publish_to_ecr.py @@ -0,0 +1,312 @@ +from __future__ import annotations + +import argparse +import contextlib +import datetime +import os +import threading +import uuid +from concurrent.futures import ThreadPoolExecutor, as_completed +from pathlib import Path + +import asv +import boto3 +import pandas as pd +from botocore.exceptions import ClientError + +from datasmith.agents.build import _do_build +from datasmith.core.models import Task +from datasmith.docker.context import ContextRegistry, DockerContext, build_base_image +from datasmith.docker.orchestrator import gen_run_labels, get_docker_client +from datasmith.docker.validation import DockerValidator, ValidationConfig +from datasmith.execution.resolution.task_utils import resolve_task +from datasmith.logging_config import configure_logging +from datasmith.notebooks.utils import update_cr + +_BUILD_CONCURRENCY = int(os.getenv("BUILD_CONCURRENCY", "24")) +_PUSH_CONCURRENCY = int(os.getenv("PUSH_CONCURRENCY", "12")) +_build_sem = 
threading.Semaphore(_BUILD_CONCURRENCY) +_push_sem = threading.Semaphore(_PUSH_CONCURRENCY) +_cr_lock = threading.Lock()  # protect ContextRegistry mutations + +logger = configure_logging(level=10, stream=open(Path(__file__).with_suffix(".log"), "w"))  # noqa: SIM115 + + +def parse_args() -> argparse.Namespace: +    parser = argparse.ArgumentParser( +        prog="build_and_publish_to_ecr", +        description="Build ASV Docker images for commits and publish to AWS ECR.", +    ) +    parser.add_argument( +        "--commits", +        type=Path, +        help="Path to a JSONL file containing commit information. Either --dashboard or --commits must be provided.", +    ) +    parser.add_argument( +        "--docker-dir", +        type=Path, +        default=Path("src/datasmith/docker"), +        help="Directory containing the Dockerfile and other necessary files for building the ASV image.", +    ) +    parser.add_argument("--max-workers", type=int, default=8, help="Max parallel builds/runs.") +    parser.add_argument( +        "--context-registry", +        type=Path, +        required=True, +        help="Path to the context registry JSON file.", +    ) +    parser.add_argument( +        "--skip-existing", +        action="store_true", +        help="Skip pushing images that already exist on ECR.", +    ) +    return parser.parse_args() + + +def process_inputs(args: argparse.Namespace) -> dict[tuple[str, str], list[tuple[str, float, str]]]: +    commits = ( +        pd.read_json(args.commits, lines=True) if args.commits.suffix == ".jsonl" else pd.read_parquet(args.commits) +    ) +    all_states = {} +    for _, row in commits.iterrows(): +        repo_name = row["repo_name"] +        # sha = row["sha"] +        sha = row["pr_base"]["sha"] +        has_asv = row.get("has_asv", True) +        if not has_asv: +            logger.debug("Skipping %s commit %s as it does not have ASV benchmarks.", repo_name, sha) +            continue +        owner, repo = repo_name.split("/") +        commit_date_unix: float = ( +            0.0 if row.get("date", None) is None else datetime.datetime.fromisoformat(row["date"]).timestamp() +        ) +        env_payload = row.get("env_payload", "") +        if (owner, repo) not in all_states: +
all_states[(owner, repo)] = [(sha, commit_date_unix, env_payload)] +        else: +            all_states[(owner, repo)].append((sha, commit_date_unix, env_payload)) +    return all_states + + +def within_3_months(unix_time: float) -> bool: +    three_months_ago = datetime.datetime.now() - datetime.timedelta(days=90) +    return datetime.datetime.fromtimestamp(unix_time) >= three_months_ago + + +def prepare_tasks( +    all_states: dict[tuple[str, str], list[tuple[str, float, str]]], +    context_registry: ContextRegistry, +) -> list[Task]: +    all_tasks: list[Task] = [] +    for (owner, repo), tup in all_states.items(): +        tasks = list({ +            Task(owner, repo, sha, commit_date=date, env_payload=env_payload) for sha, date, env_payload in sorted(tup) +        }) +        tasks = [ +            t +            for t in tasks +            if (t.with_tag("pkg") in context_registry) +            and (within_3_months(context_registry.get(t.with_tag("pkg")).created_unix)) +        ] +        all_tasks.extend(tasks) +    return all_tasks + + +def _encode_ecr_tag_from_local(local_ref: str) -> str: +    """Encode a local image reference into the tag used for single-repo ECR publishing. + +    Mirrors datasmith.docker.ecr._encode_tag_from_local used when repository_mode="single": +    - local_ref like "repo[:tag]" becomes "repo--tag" (slashes in either side become "__"). +    - If the result exceeds 128 chars, add an 8-char hash suffix (not expected here). +    """ +    import hashlib + +    if ":" in local_ref and "/" not in local_ref.split(":", 1)[1]: +        repo, tag = local_ref.rsplit(":", 1) +    else: +        repo, tag = local_ref, "latest" +    base = repo.replace("/", "__") +    tag_enc = tag.replace("/", "__").replace(":", "--") +    composed = f"{base}--{tag_enc}" +    if len(composed) <= 128: +        return composed +    h = hashlib.sha256(composed.encode()).hexdigest()[:8] +    trimmed = composed[-(128 - 10) :] +    return f"{trimmed}--{h}" + + +def _list_ecr_tags_single_repo(*, region: str, repo_name: str) -> set[str]: +    """Return the set of existing image tags for an ECR repository.
+ + Safe: returns empty set on missing repo or auth issues. Logs warnings instead of raising. + """ + tags: set[str] = set() + try: + session = boto3.session.Session(region_name=region) # pyright: ignore[reportAttributeAccessIssue] + ecr = session.client("ecr") + token: str | None = None + while True: + kwargs = {"repositoryName": repo_name, "maxResults": 1000} + if token: + kwargs["nextToken"] = token + try: + resp = ecr.list_images(**kwargs) + except ClientError as ce: # pragma: no cover - network dependent + code = ce.response.get("Error", {}).get("Code") + if code == "RepositoryNotFoundException": + logger.info("ECR repository %s not found; assuming no existing images.", repo_name) + return set() + logger.warning("Failed to list ECR images for %s: %s", repo_name, ce) + return set() + for img in resp.get("imageIds", []): + t = img.get("imageTag") + if t: + tags.add(str(t)) + token = resp.get("nextToken") + if not token: + break + except Exception as exc: # pragma: no cover - network dependent + logger.warning("Could not query ECR for existing tags (region=%s, repo=%s): %s", region, repo_name, exc) + return set() + return tags + + +def filter_tasks_not_on_ecr( + tasks: list[Task], *, region: str, repository_mode: str = "single", single_repo: str = "formulacode/all" +) -> list[Task]: + """Filter out tasks whose target image already exists on ECR. + + Currently supports repository_mode="single" (default used by Context.build_and_publish_to_ecr). 
+ """ + if repository_mode != "single": + # Fallback: if we don't know how tags are computed, don't filter + logger.warning("ECR pre-filter only supports repository_mode='single'; skipping filter.") + return tasks + + existing_tags = _list_ecr_tags_single_repo(region=region, repo_name=single_repo) + if not existing_tags: + return tasks + + filtered: list[Task] = [] + skipped = 0 + for t in tasks: + local_ref = t.with_tag("final").get_image_name() # e.g., owner-repo-sha:final + enc_tag = _encode_ecr_tag_from_local(local_ref) # e.g., owner-repo-sha--final + if enc_tag in existing_tags: + skipped += 1 + logger.info("Skipping %s (already on ECR as %s:%s)", local_ref, single_repo, enc_tag) + continue + filtered.append(t) + if skipped: + logger.info("Filtered out %d/%d tasks already on ECR", skipped, len(tasks)) + return filtered + + +def main(args: argparse.Namespace) -> None: + # Size the Docker HTTP connection pool to our concurrency to avoid + # adapter/pool starvation when many threads issue Docker API calls. 
+ client = get_docker_client(max_concurrency=8) + all_states = process_inputs(args) + context_registry_pth = args.context_registry + context_registry = ( + ContextRegistry.load_from_file(path=context_registry_pth) + if context_registry_pth.exists() + else ContextRegistry() + ) + context_registry = update_cr(context_registry) + + machine_defaults: dict[str, str] = asv.machine.Machine.get_defaults() # pyright: ignore[reportAttributeAccessIssue] + machine_defaults = { + k: str(v.replace(" ", "_").replace("'", "").replace('"', "")) for k, v in machine_defaults.items() + } + validator = DockerValidator( + client=client, + context_registry=context_registry, + machine_defaults=machine_defaults, + config=ValidationConfig( + output_dir=Path("scratch/docker_validation"), build_timeout=3600, run_timeout=3600, tail_chars=4000 + ), + ) + + logger.info("Building base image...") + base_tag = build_base_image(client, DockerContext()) + logger.debug("%s", base_tag) + # os.environ["DOCKER_CACHE_FROM"] = base_tag + + # Prepare tasks + tasks = prepare_tasks(all_states, context_registry) + # Filter out tasks already present on ECR before building + if args.skip_existing: + aws_region = os.environ.get("AWS_REGION", "us-east-1") + tasks = filter_tasks_not_on_ecr(tasks, region=aws_region) + logger.info("main: Starting work on %d tasks[%d workers]", len(tasks), args.max_workers) + + def build_and_publish_task(task: Task) -> tuple[dict, dict]: + # Create a fresh client per thread. Small pool is fine; build is mostly daemon-side work. 
+ client = get_docker_client(max_concurrency=8) + try: + _task_analysis, task = resolve_task(task) + run_labels = gen_run_labels(task, runid=uuid.uuid4().hex) + ctx = context_registry.get(task.with_tag("pkg")) + + with _build_sem: + partial_build_res = _do_build(validator, task.with_tag("run"), ctx, run_labels) + + if not partial_build_res.ok: + # Only one thread should edit/save the registry at a time + if ( + "docker_build_pkg" in partial_build_res.stderr_tail + or "docker_build_env" in partial_build_res.stderr_tail + ): + with _cr_lock: + context_registry.pop(task.with_tag("pkg")) + context_registry.save_to_file(context_registry_pth) + return partial_build_res.__dict__, {} + + with _push_sem: + build_res, push_results = ctx.build_and_publish_to_ecr( + client=client, + region=os.environ.get("AWS_REGION", "us-east-1"), + task=task.with_tag("final").with_benchmarks(partial_build_res.benchmarks), + timeout_s=3600, + skip_existing=args.skip_existing, + force=True, + ) + + # for tag in ("final", "run"): + # try: + # client.containers.get(task.with_tag(tag).get_container_name()).remove(force=True) + # except Exception: + # logger.exception("Error removing container: %s", task.with_tag(tag).get_container_name()) + + return build_res.__dict__, push_results + finally: + with contextlib.suppress(Exception): + client.api.close() + + results: list[dict] = [] + if args.max_workers < 1: + for t in tasks: + build_res, push_res = build_and_publish_task(t) + all_res = {**build_res, **push_res} + results.append(all_res) + logger.info("Completed: %s", all_res) + else: + with ThreadPoolExecutor(max_workers=args.max_workers) as ex: + futures = [ + ex.submit( + build_and_publish_task, + task=t, + ) + for t in tasks + ] + for fut in as_completed(futures): + build_res, push_res = fut.result() + all_res = {**build_res, **push_res} + results.append(all_res) + logger.info("Completed: %s", all_res) + + +if __name__ == "__main__": + args = parse_args() + main(args) diff --git 
a/scratch/scripts/prepare_formulacode_dataset.py b/scratch/scripts/prepare_formulacode_dataset.py deleted file mode 100644 index c818a7e..0000000 --- a/scratch/scripts/prepare_formulacode_dataset.py +++ /dev/null @@ -1,667 +0,0 @@ -"""Enrich the FormulaCode master parquet and optionally upload to Hugging Face. - -Derives container_name, queries Docker Hub for available images, normalizes -difficulty/classification, generates task IDs, filters, and sorts — so that -downstream consumers (terminal-bench, HF datasets) get a ready-to-use dataset. -""" - -from __future__ import annotations - -import argparse -import json -import os -import time -from pathlib import Path - -import pandas as pd -import requests - -from datasmith.logging_config import configure_logging - -logger = configure_logging() - - -# --------------------------------------------------------------------------- -# CLI -# --------------------------------------------------------------------------- - - -def parse_args(argv: list[str] | None = None) -> argparse.Namespace: - p = argparse.ArgumentParser( - description="Enrich a FormulaCode commits parquet for terminal-bench.", - ) - p.add_argument("--input", type=Path, required=True, help="Raw parquet file.") - p.add_argument("--output", type=Path, required=True, help="Output enriched parquet.") - p.add_argument( - "--dockerhub-repository", - default="formulacode/all", - help="Docker Hub repository (namespace/repo).", - ) - p.add_argument( - "--filter-by", - type=Path, - default=None, - help="JSON file with (repo_name, sha) keys to keep.", - ) - p.add_argument( - "--limit-per-repo", - type=int, - default=-1, - help="Max tasks per repo family (-1 = no limit).", - ) - p.add_argument( - "--upload-to-hf", - type=str, - default=None, - metavar="REPO_ID", - help=( - "Upload the output parquet to a Hugging Face dataset repo " - "(e.g. 'formulacode/formulacode-all'). " - "Requires HF_TOKEN in tokens.env or environment. " - "Uploads monthly configs (e.g. 
'2024.01') plus a 'default' " - "config containing all data." - ), - ) - p.add_argument( - "--hf-verified-filter", - type=Path, - default=None, - help=( - "Path to a JSON filter file (same format as --filter-by). " - "When combined with --upload-to-hf, adds a 'verified' config." - ), - ) - p.add_argument( - "--hf-commit-message", - type=str, - default=None, - help="Commit message for the HF upload (default: auto-generated).", - ) - p.add_argument( - "--hf-date-column", - type=str, - default="pr_merged_at", - help="Column to derive monthly splits from (default: pr_merged_at).", - ) - p.add_argument( - "--dry-run", - action="store_true", - help="Print summary without writing output.", - ) - return p.parse_args(argv) - - -# --------------------------------------------------------------------------- -# Load / validate -# --------------------------------------------------------------------------- - - -def load_and_validate(path: Path) -> pd.DataFrame: - if path.suffix == ".csv": - df = pd.read_csv(path) - elif path.suffix == ".parquet": - df = pd.read_parquet(path) - else: - raise ValueError(f"Unsupported format: {path.suffix}") - - required = {"repo_name", "pr_base_sha", "pr_merge_commit_sha"} - missing = required - set(df.columns) - if missing: - raise ValueError(f"Missing required columns: {missing}") - return df - - -# --------------------------------------------------------------------------- -# Derive container_name (ported from dataset.py:38-43) -# --------------------------------------------------------------------------- - - -def derive_container_name(df: pd.DataFrame) -> pd.DataFrame: - df = df.copy() - df["container_name"] = (df["repo_name"].str.replace("/", "-") + "-" + df["pr_base_sha"] + ":final").str.lower() - return df - - -# --------------------------------------------------------------------------- -# Normalize columns -# --------------------------------------------------------------------------- - -_DIFFICULTY_LABELS: tuple[str, ...] 
= ("easy", "medium", "hard", "unknown") - - -def _levenshtein_distance(source: str, target: str) -> int: - if source == target: - return 0 - if not source: - return len(target) - if not target: - return len(source) - - previous = list(range(len(target) + 1)) - for i, s_char in enumerate(source, start=1): - current = [i] - for j, t_char in enumerate(target, start=1): - insert_cost = current[j - 1] + 1 - delete_cost = previous[j] + 1 - replace_cost = previous[j - 1] + (s_char != t_char) - current.append(min(insert_cost, delete_cost, replace_cost)) - previous = current - return previous[-1] - - -def normalize_difficulty(raw: str | None) -> str: - """Normalize to one of easy/medium/hard/unknown with fuzzy matching.""" - value = (raw or "").strip() - # Strip enum-style prefix like ``DifficultyLevel.`` - value = value.split(".")[-1].strip().lower() - if not value: - return "unknown" - if value in _DIFFICULTY_LABELS: - return value - - best_label = _DIFFICULTY_LABELS[0] - best_distance = _levenshtein_distance(value, best_label) - for label in _DIFFICULTY_LABELS[1:]: - distance = _levenshtein_distance(value, label) - if distance < best_distance: - best_distance = distance - best_label = label - return best_label - - -def normalize_classification(raw: str | None) -> str: - """Normalize classification: strip OptimizationType. 
prefix, lowercase.""" - value = (raw or "").strip() - value = value.split(".")[-1].strip().lower() - if not value: - return "uncategorized" - # Keep snake_case as-is - return value - - -def normalize_columns(df: pd.DataFrame) -> pd.DataFrame: - df = df.copy() - if "difficulty" in df.columns: - df["difficulty"] = df["difficulty"].apply(normalize_difficulty) - else: - df["difficulty"] = "unknown" - - if "classification" in df.columns: - df["classification"] = df["classification"].apply(normalize_classification) - else: - df["classification"] = "uncategorized" - return df - - -# --------------------------------------------------------------------------- -# Sort by date (ported from dataset.py:46-49) -# --------------------------------------------------------------------------- - - -def sort_by_date(df: pd.DataFrame) -> pd.DataFrame: - df = df.copy() - if "date" not in df.columns and "pr_merged_at" in df.columns: - df["date"] = df["pr_merged_at"] - if "date" in df.columns: - df["date"] = pd.to_datetime(df["date"], errors="coerce") - df = df.sort_values(by="date", ascending=True) - return df - - -# --------------------------------------------------------------------------- -# Filter-by keys (ported from dataset.py:52-63) -# --------------------------------------------------------------------------- - - -def load_filter_keys(filter_path: Path) -> list[tuple[str, str]]: - filter_dict = json.loads(filter_path.read_text()) - return list(map(eval, filter_dict.keys())) - - -def apply_filter_by(df: pd.DataFrame, valid_keys: list[tuple[str, str]]) -> pd.DataFrame: - mask = df[["repo_name", "pr_merge_commit_sha"]].apply(tuple, axis=1).isin(valid_keys) - return df[mask] - - -# --------------------------------------------------------------------------- -# Docker Hub image resolution (ported from dockerhub.py) -# --------------------------------------------------------------------------- - - -def _fetch_dockerhub_api(url: str, timeout: int = 10, max_retries: int = 3) -> dict: - 
last_error: Exception | None = None - for attempt in range(max_retries): - try: - response = requests.get(url, timeout=timeout) - if response.status_code == 429: - retry_after = int(response.headers.get("Retry-After", 60)) - if attempt < max_retries - 1: - logger.warning("Rate limited. Waiting %ds before retry...", retry_after) - time.sleep(retry_after) - continue - msg = f"Docker Hub rate limit exceeded. Retry after {retry_after}s" - raise RuntimeError(msg) # noqa: TRY301 - if response.status_code == 404: - msg = f"Docker Hub repository not found: {url}" - raise ValueError(msg) # noqa: TRY301 - response.raise_for_status() - return response.json() - except requests.exceptions.RequestException as exc: - last_error = exc - if attempt < max_retries - 1: - wait_time = (2**attempt) * 0.5 - logger.warning("Network error (attempt %d/%d): %s", attempt + 1, max_retries, exc) - time.sleep(wait_time) - continue - except ValueError: - raise - except Exception as exc: - raise RuntimeError(f"Failed to fetch Docker Hub API: {exc}") from exc - - raise RuntimeError(f"Failed to fetch Docker Hub API after {max_retries} attempts: {last_error}") - - -def _get_available_images_from_dockerhub( - repository: str, - page_size: int = 100, -) -> set[str]: - if "/" not in repository: - raise ValueError(f"Repository must be 'namespace/repo', got: {repository}") - - page_size = min(page_size, 100) - available: set[str] = set() - page = 1 - - while True: - url = f"https://hub.docker.com/v2/repositories/{repository}/tags/?page_size={page_size}&page={page}" - try: - data = _fetch_dockerhub_api(url) - except ValueError: - logger.exception("Repository '%s' not found on Docker Hub", repository) - raise - - results = data.get("results", []) - if not results: - break - - for item in results: - tag_name = item.get("name") - if tag_name: - available.add(f"{repository}:{tag_name}") - - if not data.get("next"): - break - page += 1 - time.sleep(0.1) - - logger.info("Found %d tags in %s", len(available), 
repository) - return available - - -def _dockerhub_ref_to_container_name(image_ref: str, repository: str) -> str | None: - prefix = f"{repository}:" - if not image_ref.startswith(prefix): - return None - tag = image_ref[len(prefix) :] - if "--" not in tag: - return None - base_name, variant = tag.rsplit("--", 1) - if not base_name or not variant: - return None - return f"{base_name}:{variant}" - - -def resolve_dockerhub_images(repository: str) -> dict[str, str]: - """Return {container_name: full_dockerhub_ref} for all tags in *repository*.""" - available = _get_available_images_from_dockerhub(repository) - image_map: dict[str, str] = {} - for img in available: - container_name = _dockerhub_ref_to_container_name(img, repository) - if container_name: - image_map[container_name] = img - return image_map - - -# --------------------------------------------------------------------------- -# Apply image filter (ported from dataset.py:66-72) -# --------------------------------------------------------------------------- - - -def apply_image_filter(df: pd.DataFrame, image_map: dict[str, str]) -> pd.DataFrame: - df = df[df["container_name"].isin(set(image_map.keys()))].copy() - df["image_name"] = df["container_name"].map(image_map) - return df - - -# --------------------------------------------------------------------------- -# Limit per repo (ported from dataset.py:75-87) -# --------------------------------------------------------------------------- - - -def limit_per_repo(df: pd.DataFrame, limit: int) -> pd.DataFrame: - if limit <= 0: - return df - repo_names = df["container_name"].replace(":pkg", "").replace(":final", "").str.split("-").str[:-1].str.join("-") - return df.groupby(repo_names).head(limit) - - -# --------------------------------------------------------------------------- -# Regenerate task IDs (ported from dataset.py:90-104) -# --------------------------------------------------------------------------- - - -def regenerate_task_ids(df: pd.DataFrame) -> 
pd.DataFrame: - df = df.copy() - counts: dict[str, int] = {} - for i, row in df.iterrows(): - splits = row["container_name"].replace(":pkg", "").replace(":final", "").split("-") - ow_splits = splits[:-1] - owner = "-".join(ow_splits[: len(ow_splits) // 2]) - repo = "-".join(ow_splits[len(ow_splits) // 2 :]) - base_id = f"{owner}_{repo}" - cnt = counts.get(base_id, 0) + 1 - counts[base_id] = cnt - df.at[i, "task_id"] = f"{base_id}_{cnt}" - return df - - -# --------------------------------------------------------------------------- -# Hugging Face upload -# --------------------------------------------------------------------------- - -_DATASMITH_ROOT = Path(__file__).resolve().parents[2] -_DATASET_CARD_TEMPLATE = _DATASMITH_ROOT / "DATASET_CARD.md" -_BANNER_SVG = _DATASMITH_ROOT / "static" / "formula-code-datasmith.svg" - -# Columns to include when uploading to Hugging Face. -_HF_COLUMNS: list[str] = [ - "task_id", - "repo_name", - "container_name", - "image_name", - "difficulty", - "classification", - "patch", - "final_md", - "pr_merged_at", - "pr_merge_commit_sha", - "pr_base_sha", -] - - -def _build_dataset_card( - configs: list[str], - *, - total_rows: int, - verified_rows: int | None = None, - default_config: str = "default", -) -> str: - """Build a HF dataset card README.md with YAML front matter and body.""" - parquet_name = "train-00000-of-00001.parquet" - - # --- YAML front matter --- - lines = ["---"] - lines.append("configs:") - for cfg in configs: - lines.append(f' - config_name: "{cfg}"') - lines.append(" data_files:") - lines.append(" - split: train") - lines.append(f' path: "{cfg}/{parquet_name}"') - lines.append(f'default_config_name: "{default_config}"') - lines.append("task_categories:") - lines.append(" - text-generation") - lines.append("tags:") - lines.append(" - code") - lines.append(" - performance-optimization") - lines.append(" - benchmark") - lines.append("language:") - lines.append(" - en") - lines.append("size_categories:") - 
lines.append(" - 1K str: - """Upload a DataFrame to a HF dataset repo with monthly configs. - - Creates one config per YYYY-MM, a ``default`` config with all rows, - and any additional configs passed via *extra_configs*. A README.md - with explicit YAML metadata is generated so that HF correctly - recognises every config. - - Returns the commit URL. - """ - import tempfile - - import pyarrow as pa - import pyarrow.parquet as pq - from huggingface_hub import CommitOperationAdd, CommitOperationDelete, HfApi - - token = os.environ.get("HF_TOKEN") - if not token: - raise RuntimeError("HF_TOKEN not set. Add it to tokens.env or export it in the environment.") - - api = HfApi(token=token) - api.create_repo(repo_id, repo_type="dataset", exist_ok=True) - - if commit_message is None: - commit_message = "Update dataset configs via prepare_formulacode_dataset" - - # Only keep key columns (intersected with what's actually present). - keep = [c for c in _HF_COLUMNS if c in df.columns] - df = df[keep].copy() - if extra_configs: - extra_configs = {name: edf[[c for c in keep if c in edf.columns]].copy() for name, edf in extra_configs.items()} - logger.info("Uploading %d columns: %s", len(keep), ", ".join(keep)) - - dates = pd.to_datetime(df[date_column], errors="coerce") - df = df.copy() - df["_month"] = dates.dt.to_period("M").astype(str) - - # Use the full dataframe's schema so every monthly parquet has - # consistent column types (avoids null-type columns in small months). 
- full_table = pa.Table.from_pandas(df.drop(columns=["_month"]), preserve_index=False) - schema = full_table.schema - - parquet_name = "train-00000-of-00001.parquet" - operations: list[CommitOperationAdd | CommitOperationDelete] = [] - config_names: list[str] = [] - - # Delete all existing parquet files to avoid stale configs - try: - existing_files = api.list_repo_files(repo_id, repo_type="dataset") - for f in existing_files: - if f.endswith(".parquet"): - operations.append(CommitOperationDelete(path_in_repo=f)) - except Exception: # noqa: S110 - pass # repo may be empty - - with tempfile.TemporaryDirectory() as tmp: - tmp_path = Path(tmp) - - # Monthly configs - for month, group in sorted(df.groupby("_month")): - month_dir = tmp_path / month - month_dir.mkdir() - out = month_dir / parquet_name - table = pa.Table.from_pandas(group.drop(columns=["_month"]), preserve_index=False).cast(schema) - pq.write_table(table, out) - operations.append( - CommitOperationAdd( - path_in_repo=f"{month}/{parquet_name}", - path_or_fileobj=str(out), - ) - ) - config_names.append(month) - logger.info(" config %s: %d rows", month, len(group)) - - # Default config (all data) - default_dir = tmp_path / "default" - default_dir.mkdir() - out = default_dir / parquet_name - pq.write_table(full_table, out) - operations.append( - CommitOperationAdd( - path_in_repo=f"default/{parquet_name}", - path_or_fileobj=str(out), - ) - ) - config_names.append("default") - logger.info(" config default: %d rows (all)", len(df)) - - # Extra configs (e.g. 
"verified") - if extra_configs: - for name, extra_df in extra_configs.items(): - cfg_dir = tmp_path / name - cfg_dir.mkdir() - out = cfg_dir / parquet_name - table = pa.Table.from_pandas(extra_df, preserve_index=False).cast(schema) - pq.write_table(table, out) - operations.append( - CommitOperationAdd( - path_in_repo=f"{name}/{parquet_name}", - path_or_fileobj=str(out), - ) - ) - config_names.append(name) - logger.info(" config %s: %d rows", name, len(extra_df)) - - # README dataset card - verified_rows = len(extra_configs["verified"]) if extra_configs and "verified" in extra_configs else None - readme = _build_dataset_card( - config_names, - total_rows=len(df), - verified_rows=verified_rows, - ) - readme_path = tmp_path / "README.md" - readme_path.write_text(readme) - operations.append( - CommitOperationAdd( - path_in_repo="README.md", - path_or_fileobj=str(readme_path), - ) - ) - - # Banner SVG (referenced by DATASET_CARD.md) - if _BANNER_SVG.exists(): - operations.append( - CommitOperationAdd( - path_in_repo="static/formula-code-datasmith.svg", - path_or_fileobj=str(_BANNER_SVG), - ) - ) - - info = api.create_commit( - repo_id=repo_id, - repo_type="dataset", - operations=operations, - commit_message=commit_message, - ) - - return info.commit_url - - -# --------------------------------------------------------------------------- -# Main -# --------------------------------------------------------------------------- - - -def main() -> None: - args = parse_args() - - logger.info("Loading %s ...", args.input) - df = load_and_validate(args.input) - logger.info("Loaded %d rows", len(df)) - - df = derive_container_name(df) - df = normalize_columns(df) - df = sort_by_date(df) - - if args.filter_by: - keys = load_filter_keys(args.filter_by) - logger.info("Filtering by %d keys from %s", len(keys), args.filter_by) - df = apply_filter_by(df, keys) - logger.info("After filter-by: %d rows", len(df)) - - logger.info("Resolving Docker Hub images from %s ...", 
args.dockerhub_repository) - image_map = resolve_dockerhub_images(args.dockerhub_repository) - logger.info("Found %d container_name -> image mappings", len(image_map)) - - before = len(df) - df = apply_image_filter(df, image_map) - logger.info( - "Image filter: kept %d / %d rows (%d removed)", - len(df), - before, - before - len(df), - ) - - if args.limit_per_repo > 0: - df = limit_per_repo(df, args.limit_per_repo) - logger.info("After limit-per-repo(%d): %d rows", args.limit_per_repo, len(df)) - - df = regenerate_task_ids(df) - - logger.info("Final dataset: %d rows", len(df)) - logger.info( - "Columns: %s", - ", ".join( - c for c in ["container_name", "image_name", "task_id", "difficulty", "classification"] if c in df.columns - ), - ) - - if args.dry_run: - logger.info("[DRY RUN] Would write to %s", args.output) - print(df[["task_id", "container_name", "image_name", "difficulty", "classification"]].head(20).to_string()) - return - - args.output.parent.mkdir(parents=True, exist_ok=True) - df.to_parquet(args.output, index=False) - logger.info("Wrote enriched parquet to %s", args.output) - - if args.upload_to_hf: - extra_configs: dict[str, pd.DataFrame] | None = None - if args.hf_verified_filter and args.hf_verified_filter.exists(): - vkeys = load_filter_keys(args.hf_verified_filter) - df_verified = apply_filter_by(df, vkeys) - logger.info("Verified config: %d rows from %s", len(df_verified), args.hf_verified_filter) - extra_configs = {"verified": df_verified} - - logger.info("Uploading to HF repo %s (monthly configs) ...", args.upload_to_hf) - url = upload_to_huggingface( - df, - args.upload_to_hf, - date_column=args.hf_date_column, - commit_message=args.hf_commit_message, - extra_configs=extra_configs, - ) - logger.info("Upload complete: %s", url) - - -if __name__ == "__main__": - main() diff --git a/scratch/scripts/synthesize_contexts.py b/scratch/scripts/synthesize_contexts.py index 9671dec..a99ea59 100644 --- a/scratch/scripts/synthesize_contexts.py +++ 
b/scratch/scripts/synthesize_contexts.py @@ -89,6 +89,11 @@ def parse_args() -> argparse.Namespace: required=True, help="Path to the context registry JSON file.", ) + parser.add_argument( + "--push-to-ecr", + action="store_true", + help="Whether to push built images to AWS ECR.", + ) parser.add_argument( "--push-to-dockerhub", action="store_true", diff --git a/scratch/scripts/update_formulacode.py b/scratch/scripts/update_formulacode.py index 8f1c812..ac4ecee 100755 --- a/scratch/scripts/update_formulacode.py +++ b/scratch/scripts/update_formulacode.py @@ -3,13 +3,12 @@ Orchestration script for updating FormulaCode dataset. This script runs the full pipeline for a given date range: -1. Collect and filter commits from repositories -2. Prepare commits with patches -3. Classify performance commits -4. Synthesize Docker contexts -5. Build and publish to DockerHub -6. Merge perfonly commits into the master parquet -7. Enrich dataset and upload to Hugging Face (optional) +1. Collect commits from repositories +2. Filter commits +3. Prepare commits with patches +4. Classify performance commits +5. Synthesize Docker contexts +6. Build and publish to DockerHub Usage: python scratch/scripts/update_formulacode.py \ @@ -18,7 +17,6 @@ [--dockerhub-namespace DOCKERHUB_NAMESPACE] \ [--skip-existing] \ [--max-workers 32] \ - [--upload-to-hf] \ [--dry-run] """ @@ -104,24 +102,6 @@ def parse_args() -> argparse.Namespace: default=os.environ.get("DOCKERHUB_NAMESPACE"), help="DockerHub namespace (username or org). 
Defaults to DOCKERHUB_NAMESPACE env var.", ) - p.add_argument( - "--upload-to-hf", - action="store_true", - default=False, - help="Enrich and upload dataset to Hugging Face after merging.", - ) - p.add_argument( - "--hf-repo", - type=str, - default="formulacode/formulacode-all", - help="Hugging Face dataset repo ID.", - ) - p.add_argument( - "--valid-tasks", - type=Path, - default=None, - help="Path to valid_tasks.json for the 'verified' HF config.", - ) return p.parse_args() @@ -156,7 +136,7 @@ def run_command(cmd: list[str], dry_run: bool = False, step_name: str = "") -> i return result.returncode -def main(args: argparse.Namespace) -> int:  # noqa: C901 +def main(args: argparse.Namespace) -> int: """Run the FormulaCode update pipeline.""" # Validate inputs validate_date(args.start_date, "start") @@ -266,25 +246,23 @@ def main(args: argparse.Namespace) -> int:  # noqa: C901 if exit_code != 0: return exit_code - # Step 5: Build and publish to DockerHub + # Step 5: Build and publish to ECR logger.info("=" * 60) - logger.info("Step 5: Building and publishing to DockerHub") + logger.info("Step 5: Building and publishing to ECR") logger.info("=" * 60) cmd = [ python, - str(SCRIPTS_DIR / "build_and_publish_to_dockerhub.py"), + str(SCRIPTS_DIR / "build_and_publish_to_ecr.py"), "--commits", str(perfonly_parquet), "--context-registry", str(args.context_registry), "--max-workers", - str(args.max_workers), - "--namespace", - args.dockerhub_namespace, + "5",  # ECR has rate limits, keep this low ] if args.skip_existing: cmd.append("--skip-existing") - exit_code = run_command(cmd, args.dry_run, "build_and_publish_to_dockerhub") + exit_code = run_command(cmd, args.dry_run, "build_and_publish_to_ecr") if exit_code != 0: return exit_code @@ -304,33 +283,6 @@ def main(args: argparse.Namespace) -> int:  # noqa: C901 if exit_code != 0: return exit_code - # Step 7: Enrich dataset and upload to Hugging Face (optional) - if args.upload_to_hf: - logger.info("=" * 60) -
logger.info("Step 7: Enriching dataset and uploading to Hugging Face") - logger.info("=" * 60) - - enriched_parquet = args.output_dir / "perfonly_enriched.parquet" - dockerhub_repo = f"{args.dockerhub_namespace}/all" - - cmd = [ - python, - str(SCRIPTS_DIR / "prepare_formulacode_dataset.py"), - "--input", - str(perfonly_master_parquet), - "--output", - str(enriched_parquet), - "--dockerhub-repository", - dockerhub_repo, - "--upload-to-hf", - args.hf_repo, - ] - if args.valid_tasks and args.valid_tasks.exists(): - cmd.extend(["--hf-verified-filter", str(args.valid_tasks)]) - exit_code = run_command(cmd, args.dry_run, "prepare_formulacode_dataset") - if exit_code != 0: - return exit_code - logger.info("=" * 60) logger.info("FormulaCode update completed successfully!") logger.info("Date range: %s to %s", args.start_date, args.end_date) diff --git a/src/datasmith/agents/build.py b/src/datasmith/agents/build.py index a96f63f..4c306f3 100644 --- a/src/datasmith/agents/build.py +++ b/src/datasmith/agents/build.py @@ -155,7 +155,7 @@ def _handle_success( client: docker.DockerClient, build_result: BuildResult, ) -> Path: - """Register context, optionally publish to DockerHub, and write final pickle. + """Register context, optionally publish to ECR, and write final pickle. Returns the path to the final pickle. 
""" @@ -165,8 +165,18 @@ def _handle_success( with contextlib.suppress(Exception): context_registry.save_to_file(path=args.context_registry) - # Optionally publish to DockerHub - if getattr(args, "push_to_dockerhub", False): + # Optionally publish to ECR + if getattr(args, "push_to_ecr", False): + logger.info("agent_build_and_validate: pushed %s to ECR", task.with_tag("final").get_image_name()) + ctx.build_and_publish_to_ecr( + client=client, + task=task.with_tag("final").with_benchmarks(build_result.benchmarks), + region=os.environ.get("AWS_DEFAULT_REGION", "us-east-1"), + force=True, + skip_existing=False, + timeout_s=args.build_timeout, + ) + elif getattr(args, "push_to_dockerhub", False): logger.info("agent_build_and_validate: pushed %s to Docker Hub", task.with_tag("final").get_image_name()) ctx.build_and_publish_to_dockerhub( client=client, diff --git a/src/datasmith/docker/aws_batch_executor.py b/src/datasmith/docker/aws_batch_executor.py new file mode 100644 index 0000000..496057f --- /dev/null +++ b/src/datasmith/docker/aws_batch_executor.py @@ -0,0 +1,955 @@ +# from __future__ import annotations + +# import base64 +# import dataclasses +# import json +# import random +# import time +# from collections.abc import Mapping, Sequence +# from concurrent.futures import ThreadPoolExecutor, as_completed +# from dataclasses import dataclass +# from pathlib import Path +# from typing import Any + +# import boto3 + +# from datasmith.core.models import BuildResult, Task +# from datasmith.docker.context import DockerContext +# from datasmith.docker.s3_cache_manager import S3CacheConfig, S3DockerCacheManager +# from datasmith.logging_config import get_logger + +# logger = get_logger(__name__) + + +# @dataclass +# class AwsBatchConfig: +# """Extended AWS configuration for batch build and benchmark execution.""" + +# region: str +# s3_bucket: str +# s3_prefix: str = "datasmith-batch-execution" +# subnet_id: str = "" # required +# security_group_ids: Sequence[str] = () # 
required +# iam_instance_profile_name: str = "" # required +# ami_id: str = "" # AL2023 (Docker available) or a custom AMI with docker preinstalled +# instance_type: str = "c6i.xlarge" # Larger instance for benchmarks +# key_name: str | None = None +# spot_max_price: str | None = None +# tags: Mapping[str, str] = dataclasses.field(default_factory=dict) +# root_volume_gb: int = 100 +# gp3_iops: int | None = None +# gp3_throughput: int | None = None +# stream_logs: bool = True # Enable real-time log streaming via SSM +# log_output_dir: str = "output/batch_logs" # Directory to store streamed logs + +# # Batch tuning +# max_tasks_per_instance: int = 100 # Number of tasks (build + benchmark) per instance +# batch_timeout_s: int = 2 * 60 * 60 # 2 hours max time per batch +# poll_interval_s: int = 30 # Poll every 30 seconds +# max_batch_retries: int = 1 # relaunch if batch times out + +# # Benchmark specific +# num_cores_per_task: int = 4 # CPU cores per benchmark task +# asv_args: str = "--append-samples -a rounds=2 -a repeat=2 --python=same" + +# # Docker layer caching +# enable_s3_cache: bool = True # Enable S3-based Docker layer caching +# cache_bucket: str = "" # S3 bucket for Docker layer cache (required if enable_s3_cache=True) +# cache_prefix: str = "docker-cache" # S3 prefix for cache objects +# cache_region: str = "" # S3 region for cache (defaults to same as region) +# max_cache_age_days: int = 30 # Clean up cache layers older than this +# max_cache_size_gb: int = 400 # Maximum total cache size + +# # Buildx configuration +# use_buildx: bool = True # Use docker buildx for advanced caching and multi-platform builds +# buildx_builder_name: str = "aws-builder" # Name for the buildx builder instance + + +# @dataclass +# class BatchTask: +# """Represents a single task (build + benchmark) in a batch.""" + +# task: Task +# context: DockerContext +# machine_args: dict[str, str] +# asv_args: str +# num_cores: int +# task_id: str # Unique identifier for this task in the 
batch + + +# @dataclass +# class BatchResult: +# """Result of a batch execution containing build and benchmark results.""" + +# task_id: str +# build_result: BuildResult +# benchmark_exit_code: int +# benchmark_files: dict[str, str] +# benchmark_logs: str +# duration_s: float + + +# class AWSBatchExecutor: +# """ +# Executes Docker builds and ASV benchmarks in batches on AWS EC2 instances. + +# This class provides a scalable alternative to local execution by: +# 1. Batching tasks into groups of ~100 per EC2 instance +# 2. Building Docker images and running benchmarks on each instance +# 3. Collecting results back via S3 +# 4. Managing instance lifecycle (launch, monitor, terminate) +# """ + +# def __init__(self, cfg: AwsBatchConfig): +# self.cfg = cfg +# self.s3 = boto3.client("s3", region_name=cfg.region) +# self.ec2 = boto3.client("ec2", region_name=cfg.region) + +# # Initialize S3 cache manager if enabled +# self.cache_manager = None +# if cfg.enable_s3_cache: +# if not cfg.cache_bucket: +# raise ValueError("cache_bucket must be specified when enable_s3_cache=True") + +# cache_region = cfg.cache_region or cfg.region +# cache_config = S3CacheConfig( +# bucket=cfg.cache_bucket, +# prefix=cfg.cache_prefix, +# region=cache_region, +# max_cache_age_days=cfg.max_cache_age_days, +# max_cache_size_gb=cfg.max_cache_size_gb, +# ) +# self.cache_manager = S3DockerCacheManager(cache_config) +# logger.info( +# "Initialized S3 Docker cache manager with bucket=%s, prefix=%s", cfg.cache_bucket, cfg.cache_prefix +# ) + +# def execute_batch( +# self, +# tasks: Sequence[tuple[Task, DockerContext]], +# machine_args: dict[str, str], +# asv_args: str, +# *, +# run_id: str | None = None, +# ) -> list[BatchResult]: +# """ +# Execute a batch of tasks (build + benchmark) on AWS EC2 instances. 
+ +# Args: +# tasks: List of (Task, DockerContext) pairs to execute +# machine_args: ASV machine configuration +# asv_args: ASV command line arguments +# run_id: Optional run identifier for tracking + +# Returns: +# List of BatchResult objects containing build and benchmark results +# """ +# if not run_id: +# run_id = f"batch-{int(time.time())}-{random.randint(1000, 9999)}" # noqa: S311 + +# logger.info("Starting batch execution for %d tasks with run_id=%s", len(tasks), run_id) + +# # Prepare batch tasks +# batch_tasks = [] +# for i, (task, context) in enumerate(tasks): +# batch_task = BatchTask( +# task=task, +# context=context, +# machine_args=machine_args, +# asv_args=asv_args, +# num_cores=self.cfg.num_cores_per_task, +# task_id=f"{run_id}-task-{i:03d}", +# ) +# batch_tasks.append(batch_task) + +# # Split into batches of max_tasks_per_instance +# batches = [] +# for i in range(0, len(batch_tasks), self.cfg.max_tasks_per_instance): +# batch = batch_tasks[i : i + self.cfg.max_tasks_per_instance] +# batches.append(batch) + +# logger.info("Split %d tasks into %d batches", len(tasks), len(batches)) + +# # Execute batches in parallel using ThreadPoolExecutor +# all_results = [] +# with ThreadPoolExecutor(max_workers=min(len(batches), 10)) as executor: +# future_to_batch = { +# executor.submit(self._execute_single_batch, batch, run_id, batch_idx): (batch_idx, batch) +# for batch_idx, batch in enumerate(batches) +# } + +# # Collect results as they complete +# for future in as_completed(future_to_batch): +# batch_idx, batch = future_to_batch[future] +# try: +# batch_results = future.result() +# all_results.extend(batch_results) +# logger.info("Completed batch %d/%d with %d tasks", batch_idx + 1, len(batches), len(batch)) +# except Exception as e: +# logger.exception("Batch %d failed", batch_idx + 1) +# # Create failure results for this batch +# failure_results = [ +# BatchResult( +# task_id=task.task_id, +# build_result=BuildResult( +# ok=False, +# image_name="", +# 
image_id=None, +# rc=1, +# duration_s=0.0, +# stderr_tail=str(e), +# stdout_tail="", +# ), +# benchmark_exit_code=1, +# benchmark_files={}, +# benchmark_logs=str(e), +# duration_s=0.0, +# ) +# for task in batch +# ] +# all_results.extend(failure_results) + +# logger.info("Completed batch execution: %d results", len(all_results)) +# return all_results + +# def _execute_single_batch( +# self, +# batch: Sequence[BatchTask], +# run_id: str, +# batch_idx: int, +# ) -> list[BatchResult]: +# """Execute a single batch of tasks on one EC2 instance.""" + +# # Upload batch data to S3 +# batch_data_key = self._upload_batch_data(batch, run_id, batch_idx) + +# # Launch EC2 instance +# instance_id = self._launch_batch_instance(batch, run_id, batch_idx, batch_data_key) + +# try: +# # Wait for results while streaming logs +# results = self._wait_for_batch_results(batch, run_id, batch_idx, instance_id) +# return results +# finally: +# # Clean up instance +# self._terminate_instance(instance_id) + +# def _upload_batch_data( +# self, +# batch: Sequence[BatchTask], +# run_id: str, +# batch_idx: int, +# ) -> str: +# """Upload batch configuration and contexts to S3.""" + +# # Prepare batch data +# batch_data: dict[str, Any] = { +# "run_id": run_id, +# "batch_idx": batch_idx, +# "config": { +# "num_cores_per_task": self.cfg.num_cores_per_task, +# "asv_args": self.cfg.asv_args, +# "batch_timeout_s": self.cfg.batch_timeout_s, +# }, +# "tasks": [], +# } + +# # Serialize each task's context and metadata +# for batch_task in batch: +# task_data = { +# "task_id": batch_task.task_id, +# "task": { +# "owner": batch_task.task.owner, +# "repo": batch_task.task.repo, +# "sha": batch_task.task.sha, +# "commit_date": batch_task.task.commit_date, +# "tag": batch_task.task.tag, +# }, +# "context": batch_task.context.to_dict(), +# "machine_args": batch_task.machine_args, +# "asv_args": batch_task.asv_args, +# "num_cores": batch_task.num_cores, +# } +# batch_data["tasks"].append(task_data) # pyright: 
ignore[reportListAppend] + +# # Upload to S3 +# batch_data_key = f"{self.cfg.s3_prefix}/batches/{run_id}/batch-{batch_idx:03d}/batch-data.json" +# batch_data_json = json.dumps(batch_data, indent=2) + +# self.s3.put_object( +# Bucket=self.cfg.s3_bucket, +# Key=batch_data_key, +# Body=batch_data_json.encode("utf-8"), +# ContentType="application/json", +# ) + +# logger.info("Uploaded batch data to s3://%s/%s", self.cfg.s3_bucket, batch_data_key) +# return batch_data_key + +# def _launch_batch_instance( +# self, +# batch: Sequence[BatchTask], +# run_id: str, +# batch_idx: int, +# batch_data_key: str, +# ) -> str: +# """Launch an EC2 instance to execute the batch.""" + +# user_data = self._generate_user_data(batch_data_key) + +# spec = { +# "ImageId": self.cfg.ami_id, +# "InstanceType": self.cfg.instance_type, +# "SubnetId": self.cfg.subnet_id, +# "SecurityGroupIds": list(self.cfg.security_group_ids), +# "IamInstanceProfile": {"Name": self.cfg.iam_instance_profile_name}, +# "UserData": base64.b64encode(user_data.encode("utf-8")).decode("ascii"), +# "TagSpecifications": [ +# { +# "ResourceType": "instance", +# "Tags": [ +# {"Key": "Name", "Value": f"ds-batch-exec-{run_id}-b{batch_idx:03d}"}, +# {"Key": "RunId", "Value": run_id}, +# {"Key": "BatchIdx", "Value": str(batch_idx)}, +# *({"Key": k, "Value": v} for k, v in self.cfg.tags.items()), +# ], +# } +# ], +# "BlockDeviceMappings": [ +# { +# "DeviceName": "/dev/xvda", +# "Ebs": { +# "VolumeSize": self.cfg.root_volume_gb, +# "VolumeType": "gp3", +# "DeleteOnTermination": True, +# **({"Iops": self.cfg.gp3_iops} if self.cfg.gp3_iops else {}), +# **({"Throughput": self.cfg.gp3_throughput} if self.cfg.gp3_throughput else {}), +# }, +# } +# ], +# } + +# # Use Spot only when a max price is explicitly provided; otherwise On-Demand +# if self.cfg.spot_max_price: +# spec["InstanceMarketOptions"] = { +# "MarketType": "spot", +# "SpotOptions": { +# "InstanceInterruptionBehavior": "terminate", +# "MaxPrice": self.cfg.spot_max_price, 
+# }, +# } + +# if self.cfg.key_name: +# spec["KeyName"] = self.cfg.key_name + +# resp = self.ec2.run_instances(MinCount=1, MaxCount=1, **spec) +# instance_id: str = resp["Instances"][0]["InstanceId"] + +# logger.info("Launched instance %s for batch %d", instance_id, batch_idx) +# return instance_id + +# def _generate_user_data(self, batch_data_key: str) -> str: +# """Generate user data script for EC2 instance (Amazon Linux 2 compatible).""" +# script = """#!/bin/bash +# set -eox pipefail + +# # Log user-data to a file for debugging +# exec > >(tee -a /var/log/user-data.log) 2>&1 + +# retry() { +# local n=0 +# local max=5 +# local delay=5 +# while true; do +# "$@" && break || { +# n=$((n+1)) +# if [ $n -ge $max ]; then +# echo "Command failed after $n attempts: $*" +# return 1 +# fi +# echo "Retry $n/$max for: $*"; sleep $delay; +# } +# done +# } + +# echo "==> Installing prerequisites (Amazon Linux 2)" +# retry yum update -y +# retry yum install -y jq git python3 python3-pip awscli bc +# retry amazon-linux-extras install -y docker + +# echo "==> Enabling and starting Docker" +# systemctl enable docker || true +# systemctl start docker || true + +# # Wait for Docker to be ready +# for i in $(seq 1 10); do +# if docker info >/dev/null 2>&1; then +# break +# fi +# echo "Waiting for Docker daemon... 
($i/10)" +# sleep 2 +# done + +# echo "==> Installing Python packages" +# retry pip3 install --upgrade boto3 docker + +# echo "==> Preparing workspace" +# mkdir -p /opt/datasmith +# cd /opt/datasmith + +# echo "==> Downloading batch data" +# retry aws s3 cp "s3://{s3_bucket}/{batch_data_key}" batch-data.json + +# echo "==> Parsing batch data" +# BATCH_DATA=$(cat batch-data.json) +# RUN_ID=$(echo "$BATCH_DATA" | jq -r '.run_id') +# BATCH_IDX=$(echo "$BATCH_DATA" | jq -r '.batch_idx') +# NUM_CORES=$(echo "$BATCH_DATA" | jq -r '.config.num_cores_per_task') +# ASV_ARGS=$(echo "$BATCH_DATA" | jq -r '.config.asv_args') +# BATCH_TIMEOUT=$(echo "$BATCH_DATA" | jq -r '.config.batch_timeout_s') + +# echo "Starting batch execution: run_id=$RUN_ID, batch_idx=$BATCH_IDX" + +# TASK_COUNT=$(echo "$BATCH_DATA" | jq '.tasks | length') +# for i in $(seq 0 $((TASK_COUNT - 1))); do +# echo "Processing task $((i + 1))/$TASK_COUNT" + +# TASK_DATA=$(echo "$BATCH_DATA" | jq ".tasks[$i]") +# TASK_ID=$(echo "$TASK_DATA" | jq -r '.task_id') +# OWNER=$(echo "$TASK_DATA" | jq -r '.task.owner' | tr '[:upper:]' '[:lower:]') +# REPO=$(echo "$TASK_DATA" | jq -r '.task.repo' | tr '[:upper:]' '[:lower:]') +# SHA=$(echo "$TASK_DATA" | jq -r '.task.sha' | tr '[:upper:]' '[:lower:]') +# TAG=$(echo "$TASK_DATA" | jq -r '.task.tag' | tr '[:upper:]' '[:lower:]') +# ENV_PAYLOAD=$(echo "$TASK_DATA" | jq -r '.task.env_payload // ""') + +# mkdir -p "task-$TASK_ID" +# cd "task-$TASK_ID" + +# echo "==> Writing Docker context files" +# echo "$TASK_DATA" | jq -r '.context.dockerfile_data' > Dockerfile +# echo "$TASK_DATA" | jq -r '.context.entrypoint_data' > entrypoint.sh +# echo "$TASK_DATA" | jq -r '.context.building_data' > docker_build_pkg.sh +# echo "$TASK_DATA" | jq -r '.context.env_building_data' > docker_build_env.sh +# echo "$TASK_DATA" | jq -r '.context.base_building_data' > docker_build_base.sh +# echo "$TASK_DATA" | jq -r '.context.run_building_data' > docker_build_run.sh +# echo "$TASK_DATA" | jq -r 
'.context.profile_data' > profile.sh +# echo "$TASK_DATA" | jq -r '.context.run_tests_data' > run_tests.sh +# chmod +x entrypoint.sh docker_build_*.sh profile.sh run_tests.sh || true + +# echo "==> Building Docker image for $OWNER/$REPO@$SHA" +# REPO_URL="https://github.com/$OWNER/$REPO.git" +# IMAGE_NAME="$OWNER-$REPO-$SHA:$TAG" + +# # Set up Docker BuildKit for advanced caching +# export DOCKER_BUILDKIT=1 + +# # Capture build timing and logs +# BUILD_START=$(date +%s.%N) +# BUILD_LOG_FILE="build.log" + +# # Check if we should use buildx +# USE_BUILDX="{use_buildx}" +# BUILDER_NAME="{buildx_builder_name}" + +# if [ "$USE_BUILDX" = "true" ]; then +# echo "Setting up docker buildx builder: $BUILDER_NAME" + +# # Create buildx builder if it doesn't exist +# if ! docker buildx ls | grep -q "$BUILDER_NAME"; then +# echo "Creating buildx builder: $BUILDER_NAME" +# docker buildx create --name "$BUILDER_NAME" --use --driver docker-container || { +# echo "Failed to create buildx builder, falling back to default" +# docker buildx use default +# } +# else +# echo "Using existing buildx builder: $BUILDER_NAME" +# docker buildx use "$BUILDER_NAME" +# fi + +# # Build Docker command with buildx and S3 cache support +# DOCKER_BUILD_CMD="timeout $BATCH_TIMEOUT docker buildx build --load --progress=plain -t $IMAGE_NAME . 
--build-arg REPO_URL=$REPO_URL --build-arg COMMIT_SHA=$SHA --build-arg ENV_PAYLOAD=\"$ENV_PAYLOAD\" --target $TAG" + +# # Add S3 cache arguments if cache is enabled +# if [ "{enable_s3_cache}" = "true" ]; then +# CACHE_BUCKET="{cache_bucket}" +# CACHE_PREFIX="{cache_prefix}" +# CACHE_REGION="{cache_region}" + +# # Generate cache mount configuration for buildx +# CACHE_FROM="type=s3,bucket=$CACHE_BUCKET,region=$CACHE_REGION,prefix=$CACHE_PREFIX/layers/$OWNER-$REPO-$SHA" +# CACHE_TO="type=s3,bucket=$CACHE_BUCKET,region=$CACHE_REGION,prefix=$CACHE_PREFIX/layers/$OWNER-$REPO-$SHA,mode=max" + +# DOCKER_BUILD_CMD="$DOCKER_BUILD_CMD --cache-from $CACHE_FROM --cache-to $CACHE_TO" + +# echo "Using buildx with S3 cache: bucket=$CACHE_BUCKET, prefix=$CACHE_PREFIX" +# else +# echo "Using buildx without S3 cache" +# fi +# else +# echo "Using standard docker build" + +# # Build Docker command with S3 cache support (legacy) +# DOCKER_BUILD_CMD="timeout $BATCH_TIMEOUT docker build -t $IMAGE_NAME . --build-arg REPO_URL=$REPO_URL --build-arg COMMIT_SHA=$SHA --build-arg ENV_PAYLOAD=\"$ENV_PAYLOAD\" --target $TAG" + +# # Add S3 cache arguments if cache is enabled +# if [ "{enable_s3_cache}" = "true" ]; then +# CACHE_BUCKET="{cache_bucket}" +# CACHE_PREFIX="{cache_prefix}" +# CACHE_REGION="{cache_region}" + +# # Generate cache mount configuration +# CACHE_MOUNT="type=s3,bucket=$CACHE_BUCKET,region=$CACHE_REGION,prefix=$CACHE_PREFIX/layers/$OWNER-$REPO-$SHA" + +# DOCKER_BUILD_CMD="$DOCKER_BUILD_CMD --cache-from $CACHE_MOUNT --cache-to $CACHE_MOUNT,mode=max" + +# echo "Using S3 cache: bucket=$CACHE_BUCKET, prefix=$CACHE_PREFIX" +# fi +# fi + +# if $DOCKER_BUILD_CMD > "$BUILD_LOG_FILE" 2>&1; then +# BUILD_RC=0 +# BUILD_DURATION=$(echo "$(date +%s.%N) - $BUILD_START" | bc -l) +# BUILD_STDOUT_TAIL=$(tail -c 1000 "$BUILD_LOG_FILE" | base64 | tr -d '\n') +# BUILD_STDERR_TAIL="" +# echo "Build completed successfully in $BUILD_DURATION seconds" +# else +# BUILD_RC=$? 
+# BUILD_DURATION=$(echo "$(date +%s.%N) - $BUILD_START" | bc -l) +# BUILD_STDOUT_TAIL=$(tail -c 1000 "$BUILD_LOG_FILE" | base64 | tr -d '\n') +# BUILD_STDERR_TAIL=$(tail -c 1000 "$BUILD_LOG_FILE" | base64 | tr -d '\n') +# echo "Build failed with exit code $BUILD_RC after $BUILD_DURATION seconds" + +# # Create failure result.json with jq (compatible with older jq versions) +# BUILD_LOG_B64=$(base64 -w 0 "$BUILD_LOG_FILE" 2>/dev/null || base64 "$BUILD_LOG_FILE" | tr -d '\n') +# jq -n \ +# --arg task_id "$TASK_ID" \ +# --arg build_rc "$BUILD_RC" \ +# --arg build_duration "$BUILD_DURATION" \ +# --arg stderr_tail "$BUILD_STDERR_TAIL" \ +# --arg stdout_tail "$BUILD_STDOUT_TAIL" \ +# --arg build_log "$BUILD_LOG_B64" ' +# { +# task_id: $task_id, +# build_result: { +# ok: false, +# image_name: "", +# image_id: null, +# rc: ($build_rc|tonumber), +# duration_s: ($build_duration|tonumber), +# stderr_tail: $stderr_tail, +# stdout_tail: $stdout_tail +# }, +# benchmark_exit_code: 1, +# benchmark_files: { +# "build.log": $build_log +# }, +# benchmark_logs: ("Build failed with exit code " + $build_rc), +# duration_s: ($build_duration|tonumber) +# }' > result.json + +# retry aws s3 cp result.json "s3://{s3_bucket}/{s3_prefix}/results/$RUN_ID/batch-$BATCH_IDX/$TASK_ID/result.json" || true +# cd .. +# continue +# fi + +# echo "==> Running benchmark for $TASK_ID" +# CONTAINER_NAME="benchmark-$TASK_ID" +# mkdir -p output + +# # Capture benchmark timing and logs +# BENCHMARK_START=$(date +%s.%N) +# # We must not use --rm if we want to access container logs after the run, +# # because --rm deletes the container immediately after exit. +# if timeout "$BATCH_TIMEOUT" docker run \ +# --name "$CONTAINER_NAME" \ +# --cpus="$NUM_CORES" \ +# -v "$(pwd)/output:/output" \ +# --entrypoint /profile.sh \ +# "$IMAGE_NAME" \ +# /output/profile "" > benchmark.log 2>&1; then +# BENCHMARK_EXIT_CODE=0 +# echo "Benchmark completed successfully" +# else +# BENCHMARK_EXIT_CODE=$? 
+# echo "Benchmark failed with exit code $BENCHMARK_EXIT_CODE" +# fi + +# BENCHMARK_DURATION=$(echo "$(date +%s.%N) - $BENCHMARK_START" | bc -l) + +# # Save container logs alongside outputs +# docker logs "$CONTAINER_NAME" > output/container.log 2>&1 || true + +# # Clean up the container after logs are saved +# docker rm "$CONTAINER_NAME" > /dev/null 2>&1 || true +# echo "==> Collecting results for $TASK_ID (exit=$BENCHMARK_EXIT_CODE)" + +# # Calculate total duration +# TOTAL_DURATION=$(echo "$BUILD_DURATION + $BENCHMARK_DURATION" | bc -l) + +# # Build success result.json incrementally with jq (no sed / huge args) +# jq -n \ +# --arg task_id "$TASK_ID" \ +# --arg image_name "$IMAGE_NAME" \ +# --arg build_rc "$BUILD_RC" \ +# --arg build_duration "$BUILD_DURATION" \ +# --arg stderr_tail "$BUILD_STDERR_TAIL" \ +# --arg stdout_tail "$BUILD_STDOUT_TAIL" \ +# --arg bench_exit "$BENCHMARK_EXIT_CODE" \ +# --arg total_duration "$TOTAL_DURATION" \ +# ' +# { +# task_id: $task_id, +# build_result: { +# ok: true, +# image_name: $image_name, +# image_id: null, +# rc: ($build_rc|tonumber), +# duration_s: ($build_duration|tonumber), +# stderr_tail: $stderr_tail, +# stdout_tail: $stdout_tail +# }, +# benchmark_exit_code: ($bench_exit|tonumber), +# benchmark_files: {}, +# benchmark_logs: "", +# duration_s: ($total_duration|tonumber) +# }' > result.json + +# # Add build.log (base64) if present +# if [ -f "$BUILD_LOG_FILE" ]; then +# BUILD_LOG_B64=$(base64 -w 0 "$BUILD_LOG_FILE" 2>/dev/null || base64 "$BUILD_LOG_FILE" | tr -d '\n') +# jq --arg build_log "$BUILD_LOG_B64" \ +# '.benchmark_files += {"build.log": $build_log}' \ +# result.json > result.tmp && mv result.tmp result.json +# fi + +# # Add benchmark.log (base64) and set benchmark_logs to its base64 content +# if [ -f "benchmark.log" ]; then +# BENCHMARK_LOG_B64=$(base64 -w 0 "benchmark.log" 2>/dev/null || base64 "benchmark.log" | tr -d '\n') +# jq --arg benchmark_log "$BENCHMARK_LOG_B64" \ +# '.benchmark_files += 
{"benchmark.log": $benchmark_log} | .benchmark_logs = $benchmark_log' \ +# result.json > result.tmp && mv result.tmp result.json +# fi + +# # Add any files from output/ (base64) +# if [ -d "output" ]; then +# for file in output/*; do +# if [ -f "$file" ]; then +# filename=$(basename "$file") +# FILE_B64=$(base64 -w 0 "$file" 2>/dev/null || base64 "$file" | tr -d '\n') +# jq --arg content "$FILE_B64" \ +# --arg name "$filename" \ +# '.benchmark_files += {($name): $content}' \ +# result.json > result.tmp && mv result.tmp result.json +# fi +# done +# fi + +# # Upload +# retry aws s3 cp result.json "s3://{s3_bucket}/{s3_prefix}/results/$RUN_ID/batch-$BATCH_IDX/$TASK_ID/result.json" || true + +# docker rmi "$IMAGE_NAME" || true +# cd .. +# done + +# echo "Batch execution completed for run_id=$RUN_ID, batch_idx=$BATCH_IDX" +# # Allow S3 eventual consistency to settle +# sleep 10 +# shutdown -h now || poweroff || halt +# """ + +# # Replace placeholders in the script +# return ( +# script.replace("{s3_bucket}", self.cfg.s3_bucket) +# .replace("{s3_prefix}", self.cfg.s3_prefix) +# .replace("{batch_data_key}", batch_data_key) +# .replace("{enable_s3_cache}", str(self.cfg.enable_s3_cache).lower()) +# .replace("{cache_bucket}", self.cfg.cache_bucket or "") +# .replace("{cache_prefix}", self.cfg.cache_prefix) +# .replace("{cache_region}", self.cfg.cache_region or self.cfg.region) +# .replace("{use_buildx}", str(self.cfg.use_buildx).lower()) +# .replace("{buildx_builder_name}", self.cfg.buildx_builder_name) +# ) + +# def _stream_user_data_logs(self, instance_id: str, batch_idx: int, run_id: str, last_position: int = 0) -> int: +# """Stream user-data logs from EC2 instance to output directory and return the last position read.""" +# try: +# # Use AWS Systems Manager to execute commands on the instance +# ssm = boto3.client("ssm", region_name=self.cfg.region) + +# s3_log_key = f"{self.cfg.s3_prefix}/logs/{run_id}/batch-{batch_idx:03d}-{instance_id}.log" + +# # Command to copy the 
log file to S3 to bypass the 24KB SSM output limit +# # This overwrites the same file each time, keeping it up to date +# command = f"aws s3 cp /var/log/user-data.log s3://{self.cfg.s3_bucket}/{s3_log_key} 2>/dev/null || echo 'Log file not found or S3 upload failed'" + +# response = ssm.send_command( +# InstanceIds=[instance_id], +# DocumentName="AWS-RunShellScript", +# Parameters={"commands": [command]}, +# TimeoutSeconds=60, # Increased timeout for S3 upload +# ) + +# command_id = response["Command"]["CommandId"] +# time.sleep(3) # Give it time to complete + +# # Wait for command to complete +# tries = 15 # Increased tries for S3 upload +# output = None +# while tries > 0: +# try: +# output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instance_id) +# if output["Status"] not in ["Pending", "InProgress"]: +# break +# except Exception: # noqa: S110 +# pass +# time.sleep(2) +# tries -= 1 + +# if output and output["Status"] == "Success": +# # Download the log file from S3 +# try: +# s3_response = self.s3.get_object(Bucket=self.cfg.s3_bucket, Key=s3_log_key) +# full_log_content = s3_response["Body"].read().decode("utf-8") + +# # Create output directory for logs +# log_dir = Path(self.cfg.log_output_dir) / run_id +# log_dir.mkdir(parents=True, exist_ok=True) + +# # Write logs to batch-specific file +# log_file = log_dir / f"batch-{batch_idx:03d}-{instance_id}.log" + +# # Always rewrite the entire file with the current content +# with open(log_file, "w", encoding="utf-8") as f: +# f.write(full_log_content) + +# # Count total lines +# total_lines = len([line for line in full_log_content.split("\n") if line.strip()]) +# if total_lines > 0: +# logger.info( +# "Updated log file with %d total lines from %s to %s (S3: s3://%s/%s)", +# total_lines, +# instance_id, +# log_file, +# self.cfg.s3_bucket, +# s3_log_key, +# ) + +# # Return the length of the full content as the new position +# return len(full_log_content.encode("utf-8")) + +# except Exception as e: +# 
logger.debug("Error downloading log file from S3 for %s: %s", instance_id, e) +# return last_position +# else: +# logger.debug( +# "SSM command failed for %s: %s", +# instance_id, +# output.get("Status", "Unknown") if output else "No output", +# ) +# return last_position + +# except Exception as e: +# # SSM agent may not be installed or IAM permissions may be missing +# # This is expected and we should continue without log streaming +# logger.debug("Error streaming user-data logs for %s: %s", instance_id, e) +# return last_position + +# def _create_batch_summary_log(self, run_id: str, batch_idx: int, instance_id: str) -> Path: +# """Create a summary log file for the batch execution.""" +# log_dir = Path(self.cfg.log_output_dir) / run_id +# log_dir.mkdir(parents=True, exist_ok=True) + +# summary_file = log_dir / f"batch-{batch_idx:03d}-summary.log" + +# # Create the S3 log key for reference +# s3_log_key = f"{self.cfg.s3_prefix}/logs/{run_id}/batch-{batch_idx:03d}-{instance_id}.log" + +# with open(summary_file, "w", encoding="utf-8") as f: +# f.write("Batch Execution Summary\n") +# f.write("======================\n") +# f.write(f"Run ID: {run_id}\n") +# f.write(f"Batch Index: {batch_idx}\n") +# f.write(f"Instance ID: {instance_id}\n") +# f.write(f"Started: {time.strftime('%Y-%m-%d %H:%M:%S UTC', time.gmtime())}\n") +# f.write(f"Log Directory: {log_dir}\n") +# f.write(f"Instance Log: batch-{batch_idx:03d}-{instance_id}.log\n") +# f.write(f"S3 Log File: s3://{self.cfg.s3_bucket}/{s3_log_key}\n") +# f.write("\n") +# f.write(f"Real-time logs are being streamed to: batch-{batch_idx:03d}-{instance_id}.log\n") +# f.write(f"Persistent S3 log file: s3://{self.cfg.s3_bucket}/{s3_log_key}\n") +# f.write(f"Use 'tail -f {summary_file}' to monitor this summary\n") +# f.write(f"Use 'tail -f {log_dir}/batch-{batch_idx:03d}-{instance_id}.log' to see instance logs\n") +# f.write(f"Use 'aws s3 cp s3://{self.cfg.s3_bucket}/{s3_log_key} -' to view S3 log directly\n") +# f.write("\n") + 
+#     return summary_file
+
+# def _wait_for_batch_results(  # noqa: C901
+#     self,
+#     batch: Sequence[BatchTask],
+#     run_id: str,
+#     batch_idx: int,
+#     instance_id: str,
+# ) -> list[BatchResult]:
+#     """Wait for batch results to be uploaded to S3 while streaming user-data logs."""
+
+#     deadline = time.time() + self.cfg.batch_timeout_s
+#     results: dict[str, BatchResult] = {}
+#     log_position = 0
+
+#     # Create summary log file
+#     summary_file = self._create_batch_summary_log(run_id, batch_idx, instance_id)
+#     logger.info(
+#         "Waiting for %d results from batch %d (streaming logs from %s to %s)",
+#         len(batch),
+#         batch_idx,
+#         instance_id,
+#         summary_file.parent,
+#     )
+
+#     while time.time() < deadline and len(results) < len(batch):
+#         # Stream user-data logs during polling if enabled (do this first)
+#         if self.cfg.stream_logs:
+#             log_position = self._stream_user_data_logs(instance_id, batch_idx, run_id, log_position)
+
+#         for batch_task in batch:
+#             if batch_task.task_id in results:
+#                 continue
+
+#             result_key = (
+#                 f"{self.cfg.s3_prefix}/results/{run_id}/batch-{batch_idx:03d}/{batch_task.task_id}/result.json"
+#             )
+
+#             try:
+#                 obj = self.s3.get_object(Bucket=self.cfg.s3_bucket, Key=result_key)
+#                 result_data = json.loads(obj["Body"].read().decode("utf-8"))
+
+#                 # Parse benchmark files (base64 encoded)
+#                 benchmark_files = {}
+#                 for filename, content_b64 in result_data.get("benchmark_files", {}).items():
+#                     try:
+#                         content = base64.b64decode(content_b64).decode("utf-8")
+#                         benchmark_files[filename] = content
+#                     except Exception:
+#                         benchmark_files[filename] = "<decode error>"
+
+#                 result = BatchResult(
+#                     task_id=batch_task.task_id,
+#                     build_result=BuildResult(
+#                         ok=result_data["build_result"]["ok"],
+#                         image_name=result_data["build_result"]["image_name"],
+#                         image_id=result_data["build_result"]["image_id"],
+#                         rc=result_data["build_result"]["rc"],
+#                         duration_s=result_data["build_result"]["duration_s"],
+#                         stderr_tail=result_data["build_result"]["stderr_tail"],
+#                         stdout_tail=result_data["build_result"]["stdout_tail"],
+#                     ),
+#                     benchmark_exit_code=result_data["benchmark_exit_code"],
+#                     benchmark_files=benchmark_files,
+#                     benchmark_logs=result_data["benchmark_logs"],
+#                     duration_s=result_data["duration_s"],
+#                 )
+
+#                 results[batch_task.task_id] = result
+#                 logger.info("Received result for task %s", batch_task.task_id)
+
+#                 # Update summary log
+#                 with open(summary_file, "a", encoding="utf-8") as f:
+#                     f.write(
+#                         f"[{time.strftime('%H:%M:%S')}] Completed task {batch_task.task_id} "
+#                         f"(build: {'✓' if result.build_result.ok else '✗'}, "
+#                         f"benchmark: {'✓' if result.benchmark_exit_code == 0 else '✗'})\n"
+#                     )
+
+#             except self.s3.exceptions.NoSuchKey:
+#                 continue
+#             except Exception as e:
+#                 logger.warning("Error fetching result for %s: %s", batch_task.task_id, e)
+#                 continue
+
+#         if len(results) < len(batch):
+#             time.sleep(self.cfg.poll_interval_s)
+
+#             if not self._is_instance_running(instance_id):
+#                 logger.warning("Instance %s has terminated, stopping batch result polling", instance_id)
+#                 break
+
+#     # Determine the reason for incomplete results
+#     instance_terminated = not self._is_instance_running(instance_id)
+
+#     if len(results) < len(batch):
+#         if instance_terminated:
+#             logger.warning(
+#                 "Instance terminated while waiting for batch results: got %d/%d", len(results), len(batch)
+#             )
+#         else:
+#             logger.warning("Timeout waiting for batch results: got %d/%d", len(results), len(batch))
+
+#     # Write final summary
+#     with open(summary_file, "a", encoding="utf-8") as f:
+#         f.write("\n")
+#         f.write(f"Batch completed: {time.strftime('%Y-%m-%d %H:%M:%S UTC', time.gmtime())}\n")
+#         f.write(f"Results: {len(results)}/{len(batch)} tasks completed\n")
+#         if len(results) < len(batch):
+#             if instance_terminated:
+#                 f.write(f"Status: INSTANCE_TERMINATED - {len(batch) - len(results)} tasks did not complete\n")
+#             else:
+#                 f.write(f"Status: TIMEOUT - {len(batch) - len(results)} tasks did not complete\n")
+#         else:
+#             f.write("Status: SUCCESS - All tasks completed\n")
+#         f.write(f"Log files available in: {summary_file.parent}\n")
+
+#     return list(results.values())
+
+# def _is_instance_running(self, instance_id: str) -> bool:
+#     """Check if an EC2 instance is still running."""
+#     try:
+#         response = self.ec2.describe_instances(InstanceIds=[instance_id])
+#         if not response["Reservations"]:
+#             logger.warning("No reservations found for instance %s", instance_id)
+#             return False
+
+#         instance = response["Reservations"][0]["Instances"][0]
+#         state = instance["State"]["Name"]
+
+#         # Instance is considered running if it's in 'running' state.
+#         # Other states like 'stopping', 'stopped', 'terminating', 'terminated' mean it's not running.
+#         is_running = state == "running"
+
+#         if not is_running:
+#             logger.info("Instance %s is in state '%s' (not running)", instance_id, state)
+#             return False
+#         return True
+#     except Exception as e:
+#         logger.warning("Error checking instance state for %s: %s", instance_id, e)
+#         # If we can't check the state, assume it's still running to avoid false positives
+#         return True
+
+# def _terminate_instance(self, instance_id: str) -> None:
+#     """Terminate an EC2 instance."""
+#     try:
+#         self.ec2.terminate_instances(InstanceIds=[instance_id])
+#         logger.info("Terminated instance %s", instance_id)
+#     except Exception as e:
+#         logger.warning("Error terminating instance %s: %s", instance_id, e)
+
+# def cleanup_cache(self) -> dict[str, int]:
+#     """Clean up old Docker layer cache entries."""
+#     if not self.cache_manager:
+#         logger.warning("S3 cache not enabled, skipping cleanup")
+#         return {}
+
+#     logger.info("Starting Docker layer cache cleanup")
+#     stats = self.cache_manager.cleanup_old_cache()
+#     logger.info("Cache cleanup completed: %s", stats)
+#     return stats
+
+# def get_cache_stats(self) -> dict[str, Any]:
+#     """Get Docker layer cache statistics."""
+#     if not self.cache_manager:
+#         return {"enabled": False}
+
+#     stats = self.cache_manager.get_cache_stats()
+#     stats["enabled"] = True
+#     return stats
diff --git a/src/datasmith/docker/context.py b/src/datasmith/docker/context.py
index 19e2a22..d3db77b 100644
--- a/src/datasmith/docker/context.py
+++ b/src/datasmith/docker/context.py
@@ -21,6 +21,8 @@
 from datasmith.core.models.build import BuildResult
 from datasmith.core.models.task import Task
 from datasmith.docker.dockerhub import publish_images_to_dockerhub
+from datasmith.docker.ecr import publish_images_to_ecr
+from datasmith.docker.s3_cache_manager import S3DockerCacheManager
 from datasmith.execution.utils import _get_commit_info
 from datasmith.logging_config import get_logger
@@ -537,6 +539,7 @@ def build_container_streaming(  # noqa: C901
         timeout_s: float = float("inf"),
         tail_chars: int = 4000,
         pull: bool = False,
+        s3_cache_config: S3DockerCacheManager | dict[str, str] | None = None,
         use_buildx: bool | None = None,
     ) -> BuildResult:
         """
@@ -552,7 +555,17 @@
         Args:
             use_buildx: If True, use docker buildx; if False, use SDK; if None, auto-detect
         """
-        s3_cache_config = None
+        if isinstance(s3_cache_config, S3DockerCacheManager):
+            s3_cache_config = s3_cache_config.get_cache_mount_config(
+                dockerfile_content=self.dockerfile_data,
+                build_args=build_args,
+            )
+        elif (s3_cache_config is None) and (os.environ.get("AWS_S3_BUCKET_DOCKER")):
+            s3_cache_config = {
+                "bucket": os.environ["AWS_S3_BUCKET_DOCKER"],
+                "region": os.environ.get("AWS_REGION", "us-east-1"),
+                "prefix": os.environ.get("AWS_S3_BUCKET_DOCKER_PREFIX", "docker-cache"),
+            }
 
         # Determine whether to use buildx
         if use_buildx is None:
@@ -781,6 +794,98 @@
         except Exception:
             logger.exception("Unexpected error deleting image '%s' after build.", image_name)
 
+    def build_and_publish_to_ecr(
+        self,
+        client: docker.DockerClient,
+        task: Task,
+        region: str,
+        *,
+        repository_mode: str = "single",  # "single" or "mirror"
+        single_repo: str = "formulacode/all",
+        ecr_repo_prefix: str | None = None,
+        skip_existing: bool = True,
+        parallelism: int = 1,
+        force: bool = False,
+        run_labels: dict[str, str] | None = None,
+        timeout_s: float = 15 * 60,
+        tail_chars: int = 10_000,
+        pull: bool = False,
+        use_buildx: bool | None = None,
+        boto3_session: Any = None,
+    ) -> tuple[BuildResult, dict[str, str]]:
+        """
+        Build the Docker image for ``task`` and publish it to AWS ECR.
+
+        Returns (BuildResult, {local_ref: ecr_ref}). If the build fails, the push step
+        is skipped and the mapping is empty.
+        """
+        if task.sha is None and task.tag in {"pkg", "run"}:
+            raise ValueError("Task.sha must be set for building package/run images")
+
+        image_name = task.get_image_name()
+        repo_url = f"https://www.github.com/{task.owner}/{task.repo}"
+        build_args: dict[str, str] = {"REPO_URL": repo_url}
+        if task.sha is not None:
+            build_args["COMMIT_SHA"] = task.sha
+        if getattr(task, "env_payload", ""):
+            build_args["ENV_PAYLOAD"] = task.env_payload
+        if getattr(task, "python_version", ""):
+            build_args["PY_VERSION"] = task.python_version
+        if getattr(task, "benchmarks", ""):
+            build_args["BENCHMARKS"] = task.benchmarks
+
+        if run_labels is None:
+            run_labels = {
+                "datasmith.task": f"{task.owner}/{task.repo}",
+                "datasmith.sha": task.sha or "unknown",
+                "datasmith.run": "publish",
+            }
+
+        logger.info("Building image %s for ECR publish", image_name)
+        build_res = self.build_container_streaming(
+            client=client,
+            image_name=image_name,
+            build_args=build_args,
+            run_labels=run_labels,
+            probe=False,
+            force=force,
+            delete_img=False,
+            timeout_s=timeout_s,
+            tail_chars=tail_chars,
+            pull=pull,
+            s3_cache_config=None,
+            use_buildx=use_buildx,
+        )
+
+        if not build_res.ok:
+            logger.error(
+                "Build failed for %s (rc=%s); skipping ECR publish.",
+                image_name,
+                build_res.rc,
+            )
+            return build_res, {}
+
+        logger.info("Build succeeded for %s; publishing to ECR (region=%s)", image_name, region)
+        push_results = publish_images_to_ecr(
+            local_refs=[image_name],
+            region=region,
+            repository_mode=repository_mode,
+            single_repo=single_repo,
+            ecr_repo_prefix=ecr_repo_prefix,
+            skip_existing=skip_existing,
+            verbose=True,
+            parallelism=parallelism,
+            boto3_session=boto3_session,
+            docker_client=client,
+        )
+
+        if image_name in push_results:
+            logger.info("Published %s to %s", image_name, push_results[image_name])
+        else:
+            logger.warning("ECR publish did not return mapping for %s", image_name)
+
+        return build_res, push_results
+
     def build_and_publish_to_dockerhub(
         self,
         client: docker.DockerClient,
diff --git a/src/datasmith/docker/dockerhub.py b/src/datasmith/docker/dockerhub.py
index ff6ffbc..329d84c 100644
--- a/src/datasmith/docker/dockerhub.py
+++ b/src/datasmith/docker/dockerhub.py
@@ -1,7 +1,7 @@
 """DockerHub publishing module for Docker images.
 
-This module provides functionality to publish Docker images to DockerHub.
-It supports:
+This module provides functionality to publish Docker images to DockerHub,
+parallel to the ECR publishing module. It supports:
 - Single repository mode (all images in one repo with encoded tags)
 - Mirror mode (each local repo maps to a DockerHub repo)
 - Skip existing images via Registry API v2
@@ -46,7 +46,7 @@ def publish_images_to_dockerhub(  # noqa: C901
     """
     Publish local Docker images to DockerHub and return {local_ref: dockerhub_ref}.
-    This function:
+    This function mirrors the ECR publishing functionality but uses DockerHub:
     - Authenticates via username/password (no token refresh needed)
     - Uses Docker Registry HTTP API v2 for tag listing
     - Handles rate limiting with exponential backoff
@@ -60,7 +60,7 @@
         dockerhub_repo_prefix: Prefix for mirror mode repos
         skip_existing: Skip pushing images that already exist on DockerHub
         verbose: Enable detailed logging
-        parallelism: Number of concurrent push operations (default: 4)
+        parallelism: Number of concurrent push operations (default: 4, lower than ECR due to rate limits)
         docker_client: Optional Docker client (creates one if None)
         username: DockerHub username (or from DOCKERHUB_USERNAME env)
         password: DockerHub token/password (or from DOCKERHUB_TOKEN env)
@@ -299,7 +299,8 @@
                 with lock:
                     if verbose:
                         logger.warning(
-                            f"⚠ HTTP 429 (rate limit) pushing {dockerhub_ref}; waiting {wait_time}s before retry..."
+                            f"⚠ HTTP 429 (rate limit) pushing {dockerhub_ref}; "
+                            f"waiting {wait_time}s before retry..."
                         )
                 time.sleep(wait_time)
             else:
@@ -338,7 +339,9 @@
     return results
 
 
-def _get_dockerhub_credentials(username: str | None = None, password: str | None = None) -> tuple[str, str]:
+def _get_dockerhub_credentials(
+    username: str | None = None, password: str | None = None
+) -> tuple[str, str]:
     """
     Get DockerHub credentials from multiple sources.
 
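Both the HTTP 429 path patched above and the new ECR push loop retry with exponential backoff. As a minimal standalone sketch of that delay schedule (the helper name is mine; `ecr.py`'s push loop sleeps `(backoff_base**attempt) * 0.7` between its three attempts, with `backoff_base = 1.5`):

```python
def backoff_delays(retries: int = 3, base: float = 1.5, scale: float = 0.7) -> list[float]:
    # Delay slept after attempt k; no sleep follows the final attempt,
    # so there are retries - 1 delays in total.
    return [scale * base**k for k in range(retries - 1)]

print(backoff_delays())  # two waits between three push attempts
```

The geometric growth keeps early retries cheap while spacing out later ones, which is what the rate-limited DockerHub path wants as well.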
@@ -437,7 +440,9 @@ def _list_existing_tags(namespace: str, repo_name: str, username: str, password:
     if token_resp.status_code == 404:
         return set()
     if token_resp.status_code != 200:
-        logger.warning(f"Failed to get auth token for {namespace}/{repo_name}: HTTP {token_resp.status_code}")
+        logger.warning(
+            f"Failed to get auth token for {namespace}/{repo_name}: HTTP {token_resp.status_code}"
+        )
         return set()
 
     token_data = token_resp.json()
@@ -457,7 +462,9 @@
         return set()
 
     if tags_resp.status_code != 200:
-        logger.warning(f"Failed to list tags for {namespace}/{repo_name}: HTTP {tags_resp.status_code}")
+        logger.warning(
+            f"Failed to list tags for {namespace}/{repo_name}: HTTP {tags_resp.status_code}"
+        )
         return set()
 
     tags_data = tags_resp.json()
@@ -476,7 +483,7 @@ def _encode_dockerhub_tag_from_local(local_ref: str) -> str:
     """
     Encode a local image reference into the tag used for single-repo DockerHub publishing.
 
-    Encoding logic:
+    Mirrors the ECR encoding logic:
     - local_ref like "repo[:tag]" becomes "repo--tag" (slashes in either side become "__").
     - If the result exceeds 128 chars, add an 8-char hash suffix.
 
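The encoding described in the docstring above is easy to verify in isolation. A hedged sketch of the same scheme (a standalone re-implementation for illustration, not the module's own function):

```python
import hashlib

def encode_tag(local_ref: str) -> str:
    # Split "repo[:tag]" — the part after ":" counts as a tag only if it
    # contains no "/" (otherwise it is a registry port, not a tag).
    if ":" in local_ref and "/" not in local_ref.split(":", 1)[1]:
        repo, tag = local_ref.rsplit(":", 1)
    else:
        repo, tag = local_ref, "latest"
    composed = f"{repo.replace('/', '__')}--{tag.replace('/', '__').replace(':', '--')}"
    if len(composed) <= 128:
        return composed
    # Over the 128-char Docker tag limit: keep the tail and append an
    # 8-char digest so truncated names cannot collide.
    h = hashlib.sha256(composed.encode()).hexdigest()[:8]
    return f"{composed[-(128 - 10):]}--{h}"

print(encode_tag("owner/repo-sha:final"))  # owner__repo-sha--final
```

The "--" separator is unambiguous because "__" replaces slashes first, so a tag can always be decoded back to its local reference when it was not truncated.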
@@ -561,7 +568,9 @@ def filter_tasks_not_on_dockerhub(
         enc_tag = _encode_dockerhub_tag_from_local(local_ref)  # e.g., owner-repo-sha--final
         if enc_tag in existing_tags:
             skipped += 1
-            logger.info("Skipping %s (already on DockerHub as %s/%s:%s)", local_ref, namespace, single_repo, enc_tag)
+            logger.info(
+                "Skipping %s (already on DockerHub as %s/%s:%s)", local_ref, namespace, single_repo, enc_tag
+            )
             continue
         filtered.append(t)
     if skipped:
diff --git a/src/datasmith/docker/ecr.py b/src/datasmith/docker/ecr.py
new file mode 100644
index 0000000..29d59a7
--- /dev/null
+++ b/src/datasmith/docker/ecr.py
@@ -0,0 +1,422 @@
+import base64
+import hashlib
+import re
+import threading
+import time
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from typing import Any, Optional
+
+import boto3
+import docker
+from botocore.exceptions import ClientError
+from docker.errors import APIError
+
+from datasmith.core.models import Task
+from datasmith.logging_config import configure_logging
+
+logger = configure_logging()
+
+
+def publish_images_to_ecr(  # noqa: C901
+    local_refs: list[str],
+    region: str,
+    *,
+    repository_mode: str = "single",  # "single" or "mirror"
+    single_repo: str = "formulacode/all",  # used when repository_mode="single"
+    ecr_repo_prefix: str | None = None,  # used when repository_mode="mirror"
+    skip_existing: bool = True,
+    verbose: bool = True,
+    parallelism: int = 4,  # >1 to push multiple images concurrently
+    boto3_session: Any = None,
+    docker_client: Any = None,
+) -> dict[str, str]:
+    """
+    Publish local Docker images to ECR and return {local_ref: ecr_ref}.
+
+    This version is a drop-in replacement that:
+    - Supplies auth_config to docker push so threads don't depend on shared login state.
+    - Detects auth errors in the push stream (e.g., "no basic auth credentials") and re-logins.
+    - Retains digest-based success detection and retries with backoff.
+    """
+
+    if not local_refs:
+        return {}
+
+    # --- helpers ---
+    def _split_image(image_ref: str) -> tuple[str, str]:
+        # "repo[:tag]" where the part after ":" must not contain "/"
+        if ":" in image_ref and "/" not in image_ref.split(":", 1)[1]:
+            repo, tag = image_ref.rsplit(":", 1)
+        else:
+            repo, tag = image_ref, "latest"
+        return repo, tag
+
+    def _sanitize_repo_name(name: str) -> str:
+        # ECR repo: [a-z0-9._/-]
+        sanitized = re.sub(r"[^a-z0-9._/-]", "-", name.lower())
+        sanitized = re.sub(r"-+", "-", sanitized).strip("-/.") or "repo"
+        if ecr_repo_prefix and repository_mode == "mirror":
+            pref = ecr_repo_prefix.strip("/")
+            sanitized = f"{pref}/{sanitized}"
+        return sanitized
+
+    def _encode_tag_from_local(local_ref: str) -> str:
+        """
+        Docker tag regex: [A-Za-z0-9_][A-Za-z0-9._-]{0,127}
+        Encode "/" -> "__", ":" -> "--" to keep info in tag.
+        Add an 8-char hash suffix if we must truncate to avoid collisions.
+        """
+        repo, tag = _split_image(local_ref)
+        base = repo.replace("/", "__")
+        tag_enc = tag.replace("/", "__").replace(":", "--")
+        composed = f"{base}--{tag_enc}"
+        if len(composed) <= 128:
+            return composed
+        h = hashlib.sha256(composed.encode()).hexdigest()[:8]
+        trimmed = composed[-(128 - 10) :]
+        return f"{trimmed}--{h}"
+
+    def _ensure_repo(ecr_client: Any, repo_name: str) -> None:
+        try:
+            ecr_client.describe_repositories(repositoryNames=[repo_name])
+        except ClientError as ce:
+            code = ce.response["Error"].get("Code")
+            if code == "RepositoryNotFoundException":
+                try:
+                    ecr_client.create_repository(repositoryName=repo_name)
+                except ClientError as ce2:
+                    if ce2.response["Error"].get("Code") != "RepositoryAlreadyExistsException":
+                        raise
+            else:
+                raise
+
+    def _list_existing_tags(ecr_client: Any, repo_name: str) -> set[str]:
+        tags: set[str] = set()
+        token = None
+        while True:
+            kwargs = {"repositoryName": repo_name, "maxResults": 1000}
+            if token:
+                kwargs["nextToken"] = token
+            try:
+                resp = ecr_client.list_images(**kwargs)
+            except ClientError as ce:
+                if ce.response["Error"].get("Code") == "RepositoryNotFoundException":
+                    return set()
+                raise
+            for img in resp.get("imageIds", []):
+                t = img.get("imageTag")
+                if t:
+                    tags.add(t)
+            token = resp.get("nextToken")
+            if not token:
+                break
+        return tags
+
+    # --- setup AWS + Docker ---
+    session = boto3_session or boto3.session.Session(region_name=region)
+    sts = session.client("sts")
+    account_id = sts.get_caller_identity()["Account"]
+    ecr = session.client("ecr")
+    dk = docker_client or docker.from_env()
+    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
+
+    # ECR login (function so we can refresh on 401 or stream-auth errors)
+    proxy_endpoint_cache: dict[str, str | None] = {"endpoint": None, "username": None, "password": None}
+
+    def _login_to_ecr() -> None:
+        auth = ecr.get_authorization_token()
+        auth_data = auth["authorizationData"][0]
+        proxy_endpoint = auth_data["proxyEndpoint"].replace("https://", "")
+        username, password = base64.b64decode(auth_data["authorizationToken"]).decode().split(":", 1)
+        dk.login(username=username, password=password, registry=proxy_endpoint)
+        proxy_endpoint_cache.update({"endpoint": proxy_endpoint, "username": username, "password": password})
+
+    _login_to_ecr()
+
+    # Determine target repo/tag per local
+    unique_refs = sorted({r for r in local_refs if r})
+    plan: list[tuple[str, str, str]] = []  # (local_ref, repo_name, tag)
+    repos_needed: set[str] = set()
+
+    if repository_mode == "single":
+        repo_name = _sanitize_repo_name(single_repo)
+        _ensure_repo(ecr, repo_name)
+        repos_needed.add(repo_name)
+        for lr in unique_refs:
+            plan.append((lr, repo_name, _encode_tag_from_local(lr)))
+    elif repository_mode == "mirror":
+        for lr in unique_refs:
+            repo, tag = _split_image(lr)
+            repo_name = _sanitize_repo_name(repo)
+            _ensure_repo(ecr, repo_name)
+            repos_needed.add(repo_name)
+            plan.append((lr, repo_name, tag))
+    else:
+        raise ValueError('repository_mode must be "single" or "mirror"')
+
+    # Cache existing tags per repo
+    existing_tags_cache: dict[str, set[str]] = {}
+    if skip_existing:
+        for rn in repos_needed:
+            existing_tags_cache[rn] = _list_existing_tags(ecr, rn)
+
+    lock = threading.Lock()
+    results: dict[str, str] = {}
+    failures: dict[str, str] = {}
+
+    if verbose:
+        logger.debug(
+            f"Publishing {len(plan)} image(s) to {registry} using mode={repository_mode}, parallelism={parallelism}"
+        )
+
+    def _looks_like_auth_error(msg: Optional[str]) -> bool:
+        if not msg:
+            return False
+        m = str(msg).lower()
+        return (
+            "no basic auth credentials" in m
+            or "authorization status: 401" in m
+            or "authorization failed" in m
+            or "denied: requested access to the resource is denied" in m
+            or "access denied" in m
+        )
+
+    def _push_stream_ok_and_digest(lines_iter: Any) -> tuple[bool, Optional[str], Optional[str]]:
+        """
+        Inspect the push stream:
+        - return (ok, digest, error_message)
+        - ok=True iff we saw a final 'aux' dict with 'Digest'
+        """
+        ok = False
+        digest = None
+        last_error = None
+        for line in lines_iter:
+            if not isinstance(line, dict):
+                continue
+            if "error" in line:
+                last_error = str(line.get("error"))
+            if "errorDetail" in line:
+                detail = line.get("errorDetail")
+                last_error = str(detail.get("message") if isinstance(detail, dict) else detail)
+            aux = line.get("aux")
+            if isinstance(aux, dict):
+                d = aux.get("Digest") or aux.get("digest")
+                if d:
+                    ok = True
+                    digest = str(d)
+            st = line.get("status")
+            if st and verbose and any(tok in st for tok in ("Pushed", "Digest", "already exists", "mounted")):
+                with lock:
+                    logger.debug(st)
+        return ok, digest, last_error
+
+    def _push_one(local_ref: str, repo_name: str, tag: str) -> None:  # noqa: C901
+        ecr_ref = f"{registry}/{repo_name}:{tag}"
+
+        # Skip if already present (from cache)
+        if skip_existing:
+            tags = existing_tags_cache.get(repo_name)
+            if tags is not None and tag in tags:
+                if verbose:
+                    with lock:
+                        logger.debug(f"[DONE] {ecr_ref} already exists — skipping")
+                with lock:
+                    results[local_ref] = ecr_ref
+                return
+
+        # Ensure the image exists locally and tag it for the target repo
+        try:
+            img = dk.images.get(local_ref)
+        except Exception as e:
+            with lock:
+                failures[local_ref] = f"local image not found: {e}"
+                logger.debug(f"✖ {local_ref}: not found locally ({e})")
+            return
+
+        # Tag (idempotent)
+        try:
+            img.tag(f"{registry}/{repo_name}", tag=tag)
+        except Exception as e:
+            with lock:
+                failures[local_ref] = f"failed to tag: {e}"
+                logger.debug(f"✖ failed to tag {local_ref} -> {ecr_ref}: {e}")
+            return
+
+        if verbose:
+            with lock:
+                logger.debug(f"Pushing {ecr_ref} ...")
+
+        # Per-thread low-level client for stable streaming pushes
+        def _make_api_client() -> Any:
+            return docker.from_env(timeout=1800).api
+
+        def _raise_push_failed(err: Optional[str]) -> None:
+            raise RuntimeError(f"push did not complete successfully: {err or 'no digest observed'}")
+
+        max_retries = 3
+        backoff_base = 1.5
+
+        for attempt in range(max_retries):
+            api = _make_api_client()
+            try:
+                stream = api.push(
+                    f"{registry}/{repo_name}",
+                    tag=tag,
+                    stream=True,
+                    decode=True,
+                    auth_config={
+                        "username": proxy_endpoint_cache["username"],
+                        "password": proxy_endpoint_cache["password"],
+                    },
+                )
+                ok, digest, err = _push_stream_ok_and_digest(stream)
+                if ok:
+                    with lock:
+                        if skip_existing:
+                            existing_tags_cache.setdefault(repo_name, set()).add(tag)
+                        results[local_ref] = ecr_ref
+                        if verbose and digest:
+                            logger.debug(f"[DONE] pushed {ecr_ref} ({digest})")
+                    return
+
+                # Stream completed without success — if it looks like auth, refresh and retry
+                if _looks_like_auth_error(err) and attempt < max_retries - 1:
+                    with lock:
+                        if verbose:
+                            logger.debug(f"⚠ auth issue pushing {ecr_ref}; refreshing ECR token and retrying...")
+                    _login_to_ecr()
+                else:
+                    _raise_push_failed(err)
+
+            except APIError as e:
+                # Refresh ECR login on HTTP 401 and retry
+                code = getattr(getattr(e, "response", None), "status_code", None)
+                if code == 401 and attempt < max_retries - 1:
+                    with lock:
+                        if verbose:
+                            logger.debug(f"⚠ 401 unauthorized pushing {ecr_ref}; refreshing ECR token and retrying...")
+                    _login_to_ecr()
+                else:
+                    if attempt >= max_retries - 1:
+                        with lock:
+                            failures[local_ref] = f"Docker APIError: {e}"
+                            logger.debug(f"✖ failed to push {ecr_ref}: {e}")
+                        return
+            except Exception as e:
+                if attempt >= max_retries - 1:
+                    with lock:
+                        failures[local_ref] = str(e)
+                        logger.debug(f"✖ failed to push {ecr_ref}: {e}")
+                    return
+            finally:
+                if attempt < max_retries - 1:
+                    time.sleep((backoff_base**attempt) * 0.7)
+
+    # Parallel pushes
+    parallelism = max(1, int(parallelism))
+    if parallelism == 1:
+        for lr, rn, tg in plan:
+            _push_one(lr, rn, tg)
+    else:
+        with ThreadPoolExecutor(max_workers=parallelism) as ex:
+            futs = [ex.submit(_push_one, lr, rn, tg) for lr, rn, tg in plan]
+            for _ in as_completed(futs):
+                pass
+
+    if verbose and failures:
+        with lock:
+            logger.debug(f"Completed with {len(results)} success(es) and {len(failures)} failure(s).")
+            for k, v in failures.items():
+                logger.debug(f"  • {k}: {v}")
+
+    return results
+
+
+def _encode_ecr_tag_from_local(local_ref: str) -> str:
+    """Encode a local image reference into the tag used for single-repo ECR publishing.
+
+    Mirrors datasmith.docker.ecr._encode_tag_from_local used when repository_mode="single":
+    - local_ref like "repo[:tag]" becomes "repo--tag" (slashes in either side become "__").
+    - If the result exceeds 128 chars, add an 8-char hash suffix (not expected here).
+    """
+    import hashlib
+
+    if ":" in local_ref and "/" not in local_ref.split(":", 1)[1]:
+        repo, tag = local_ref.rsplit(":", 1)
+    else:
+        repo, tag = local_ref, "latest"
+    base = repo.replace("/", "__")
+    tag_enc = tag.replace("/", "__").replace(":", "--")
+    composed = f"{base}--{tag_enc}"
+    if len(composed) <= 128:
+        return composed
+    h = hashlib.sha256(composed.encode()).hexdigest()[:8]
+    trimmed = composed[-(128 - 10) :]
+    return f"{trimmed}--{h}"
+
+
+def _list_ecr_tags_single_repo(*, region: str, repo_name: str) -> set[str]:
+    """Return the set of existing image tags for an ECR repository.
+
+    Safe: returns empty set on missing repo or auth issues. Logs warnings instead of raising.
+    """
+    tags: set[str] = set()
+    try:
+        session = boto3.session.Session(region_name=region)  # pyright: ignore[reportAttributeAccessIssue]
+        ecr = session.client("ecr")
+        token: str | None = None
+        while True:
+            kwargs = {"repositoryName": repo_name, "maxResults": 1000}
+            if token:
+                kwargs["nextToken"] = token
+            try:
+                resp = ecr.list_images(**kwargs)
+            except ClientError as ce:  # pragma: no cover - network dependent
+                code = ce.response.get("Error", {}).get("Code")
+                if code == "RepositoryNotFoundException":
+                    logger.info("ECR repository %s not found; assuming no existing images.", repo_name)
+                    return set()
+                logger.warning("Failed to list ECR images for %s: %s", repo_name, ce)
+                return set()
+            for img in resp.get("imageIds", []):
+                t = img.get("imageTag")
+                if t:
+                    tags.add(str(t))
+            token = resp.get("nextToken")
+            if not token:
+                break
+    except Exception as exc:  # pragma: no cover - network dependent
+        logger.warning("Could not query ECR for existing tags (region=%s, repo=%s): %s", region, repo_name, exc)
+        return set()
+    return tags
+
+
+def filter_tasks_not_on_ecr(
+    tasks: list[Task], *, region: str, repository_mode: str = "single", single_repo: str = "formulacode/all"
+) -> list[Task]:
+    """Filter out tasks whose target image already exists on ECR.
+
+    Currently supports repository_mode="single" (default used by Context.build_and_publish_to_ecr).
+    """
+    if repository_mode != "single":
+        # Fallback: if we don't know how tags are computed, don't filter
+        logger.warning("ECR pre-filter only supports repository_mode='single'; skipping filter.")
+        return tasks
+
+    existing_tags = _list_ecr_tags_single_repo(region=region, repo_name=single_repo)
+    if not existing_tags:
+        return tasks
+
+    filtered: list[Task] = []
+    skipped = 0
+    for t in tasks:
+        local_ref = t.with_tag("final").get_image_name()  # e.g., owner-repo-sha:final
+        enc_tag = _encode_ecr_tag_from_local(local_ref)  # e.g., owner-repo-sha--final
+        if enc_tag in existing_tags:
+            skipped += 1
+            logger.info("Skipping %s (already on ECR as %s:%s)", local_ref, single_repo, enc_tag)
+            continue
+        filtered.append(t)
+    if skipped:
+        logger.info("Filtered out %d/%d tasks already on ECR", skipped, len(tasks))
+    return filtered
diff --git a/src/datasmith/docker/orchestrator.py b/src/datasmith/docker/orchestrator.py
index 5d6e521..73220e7 100644
--- a/src/datasmith/docker/orchestrator.py
+++ b/src/datasmith/docker/orchestrator.py
@@ -1,21 +1,24 @@
 from __future__ import annotations
 
 import asyncio
+import contextlib
 import hashlib
 import io
 import json
 import os
 import sys
 import tarfile
+import time
 from collections.abc import Sequence
 from pathlib import Path
 from typing import Any
 
 import docker
-from docker.errors import DockerException
+from docker.errors import APIError, DockerException, ImageNotFound, NotFound
 from docker.models.containers import Container
+from requests.exceptions import ReadTimeout
 
-from datasmith.docker.context import BuildResult, DockerContext, Task
+from datasmith.docker.context import BuildResult, DockerContext, Task, _new_api_client
 from datasmith.docker.disk_management import docker_data_root, guard_loop
 from datasmith.logging_config import get_logger
@@ -93,7 +96,6 @@
     )
     return build_res
-
 def log_container_output(container: Container, archive: str = "/output") -> dict[str, str]:
     stream, _stat = container.get_archive(archive)
     # 3) Load tar stream into memory and walk files
@@ -246,22 +248,30 @@ async def batch_orchestrate(
     output_dir: Path,
     client: docker.DockerClient | None,
     *,
+    use_aws_batch: bool = False,
+    aws_batch_config: dict[str, Any] | None = None,
     guard_min_free_gb: float = float(os.getenv("DATASMITH_MIN_FREE_GB", "1200")),
     guard_interval_s: int = int(os.getenv("DATASMITH_GUARD_INTERVAL_S", "120")),
     guard_hard_fail: bool = bool(int(os.getenv("DATASMITH_GUARD_HARD_FAIL", "0"))),
     guard_data_root: str | None = None,
 ) -> dict[Task, dict[str, str]]:
     """
-    Orchestrate benchmark execution locally.
+    Orchestrate benchmark execution with optional AWS batch processing.
+
+    This function provides a unified interface that can either:
+    1. Run benchmarks locally using the existing orchestrate function
+    2. Run benchmarks on AWS EC2 instances in batches for scalability
 
     Args:
         contexts: List of (Task, DockerContext) pairs to execute
         asv_args: ASV command line arguments
         machine_args: ASV machine configuration
-        max_concurrency: Maximum number of concurrent local tasks
+        max_concurrency: Maximum number of concurrent local tasks (ignored for AWS)
         n_cores: Number of CPU cores per task
         output_dir: Directory to store results
-        client: Docker client
+        client: Docker client (ignored for AWS)
+        use_aws_batch: If True, use AWS batch execution instead of local
+        aws_batch_config: Configuration for AWS batch execution
         guard_min_free_gb: Minimum free disk space for local execution
         guard_interval_s: Disk space check interval for local execution
         guard_hard_fail: Whether to fail hard on low disk space
@@ -270,18 +280,99 @@
     Returns:
         Dictionary mapping Task to benchmark result files
     """
-    if client is None:
-        client = get_docker_client(max_concurrency)
-    return await orchestrate(
-        contexts=contexts,
+    if not use_aws_batch:
+        # Use existing local orchestration
+        if client is None:
+            client = get_docker_client(max_concurrency)
+        return await orchestrate(
+            contexts=contexts,
+            asv_args=asv_args,
+            machine_args=machine_args,
+            max_concurrency=max_concurrency,
+            n_cores=n_cores,
+            output_dir=output_dir,
+            client=client,
+            guard_min_free_gb=guard_min_free_gb,
+            guard_interval_s=guard_interval_s,
+            guard_hard_fail=guard_hard_fail,
+            guard_data_root=guard_data_root,
+        )
+
+    # AWS batch execution
+    if not aws_batch_config:
+        raise ValueError("aws_batch_config is required when use_aws_batch=True")
+
+    from datasmith.docker.aws_batch_executor import AwsBatchConfig, AWSBatchExecutor
+
+    # Create AWS batch config
+    aws_cfg = AwsBatchConfig(
+        region=aws_batch_config["region"],
+        s3_bucket=aws_batch_config["s3_bucket"],
+        s3_prefix=aws_batch_config.get("s3_prefix", "datasmith-batch-execution"),
+        subnet_id=aws_batch_config["subnet_id"],
+        security_group_ids=aws_batch_config["security_group_ids"],
+        iam_instance_profile_name=aws_batch_config["iam_instance_profile_name"],
+        ami_id=aws_batch_config["ami_id"],
+        instance_type=aws_batch_config.get("instance_type", "c6i.xlarge"),
+        key_name=aws_batch_config.get("key_name"),
+        spot_max_price=aws_batch_config.get("spot_max_price"),
+        tags=aws_batch_config.get("tags", {}),
+        stream_logs=aws_batch_config.get("stream_logs", True),
+        log_output_dir=aws_batch_config.get("log_output_dir", "output/batch_logs"),
+        max_tasks_per_instance=aws_batch_config.get("max_tasks_per_instance", 100),
+        batch_timeout_s=aws_batch_config.get("batch_timeout_s", 2 * 60 * 60),
+        poll_interval_s=aws_batch_config.get("poll_interval_s", 30),
+        max_batch_retries=aws_batch_config.get("max_batch_retries", 1),
+        num_cores_per_task=n_cores,
         asv_args=asv_args,
+    )
+
+    # Create batch executor
+    batch_executor = AWSBatchExecutor(aws_cfg)
+
+    # Execute batch
+    run_id = os.environ.get("DATASMITH_RUN_ID") or _compute_deterministic_run_id(
+        contexts, asv_args=asv_args, machine_args=machine_args, n_cores=n_cores
+    )
+    batch_results = batch_executor.execute_batch(
+        tasks=contexts,
         machine_args=machine_args,
-        max_concurrency=max_concurrency,
-        n_cores=n_cores,
-        output_dir=output_dir,
-        client=client,
-        guard_min_free_gb=guard_min_free_gb,
-        guard_interval_s=guard_interval_s,
-        guard_hard_fail=guard_hard_fail,
-        guard_data_root=guard_data_root,
+        asv_args=asv_args,
+        run_id=run_id,
     )
+
+    # Convert batch results to the expected format
+    files_by_image = {}
+    for batch_result in batch_results:
+        # Find the corresponding task
+        task = None
+        for t, _ in contexts:
+            assert t.sha is not None  # noqa: S101
+            if f"{run_id}-task-" in batch_result.task_id and t.sha in batch_result.task_id:
+                task = t
+                break
+
+        if task is None:
+            logger.warning("Could not find task for batch result %s", batch_result.task_id)
+            continue
+
+        # Store benchmark files
+        files_by_image[task] = batch_result.benchmark_files
+
+        # Save individual result to output directory
+        result_dir = output_dir / "results" / task.get_container_name()
+        result_dir.mkdir(parents=True, exist_ok=True)
+
+        for filename, content in batch_result.benchmark_files.items():
+            file_path = result_dir / filename
+            file_path.write_text(content)
+
+        # Save logs
+        log_file = output_dir / "logs" / f"{task.get_container_name()}.log"
+        log_file.parent.mkdir(parents=True, exist_ok=True)
+        log_file.write_text(batch_result.benchmark_logs)
+
+        logger.info("Saved results for %s: %d files", task.get_container_name(), len(batch_result.benchmark_files))
+
+    logger.info("AWS batch execution completed: %d successful results", len(files_by_image))
+    return files_by_image
diff --git a/src/datasmith/docker/registry_config.py b/src/datasmith/docker/registry_config.py
index b0b3abd..3a48555 100644
--- a/src/datasmith/docker/registry_config.py
+++ b/src/datasmith/docker/registry_config.py
@@ -1,18 +1,19 @@
-"""Registry configuration abstraction for container registries.
+"""Registry configuration abstraction for multiple container registries.
 
-This module provides shared abstractions for working with DockerHub
-(and potentially other container registries in the future) in a unified way.
+This module provides shared abstractions for working with different container
+registries (ECR, DockerHub, GCR, GHCR) in a unified way.
 """
 
 import os
 from dataclasses import dataclass, field
 from enum import Enum
-from typing import Any
+from typing import Any, Optional
 
 
 class RegistryType(Enum):
     """Supported container registry types."""
 
+    ECR = "ecr"
     DOCKERHUB = "dockerhub"
     GCR = "gcr"  # Google Container Registry (future)
     GHCR = "ghcr"  # GitHub Container Registry (future)
@@ -25,7 +26,7 @@ class RegistryConfig:
     This dataclass encapsulates all configuration needed to publish images
     to a specific registry type. It supports:
     - Common fields used by all registries
-    - Registry-specific fields (DockerHub namespace, etc.)
+    - Registry-specific fields (ECR region, DockerHub namespace, etc.)
     - Loading from environment variables
     - Validation of required fields per registry type
     """
@@ -39,6 +40,10 @@
     parallelism: int = 4  # Number of concurrent push operations
     verbose: bool = True  # Enable detailed logging
 
+    # ECR-specific fields
+    region: str | None = None  # AWS region (e.g., "us-east-1")
+    ecr_repo_prefix: str | None = None  # Prefix for mirror mode
+
     # DockerHub-specific fields
     namespace: str | None = None  # DockerHub namespace (user or org)
     username: str | None = None  # DockerHub username
@@ -63,6 +68,12 @@ def from_env(cls, registry_type: RegistryType) -> "RegistryConfig":
 
         Environment variables by registry type:
 
+        ECR:
+        - AWS_REGION (default: "us-east-1")
+        - ECR_REPO_PREFIX (optional)
+        - ECR_REPOSITORY_MODE (default: "single")
+        - ECR_SINGLE_REPO (default: "all")
+
         DockerHub:
         - DOCKERHUB_NAMESPACE (required)
         - DOCKERHUB_USERNAME (required)
@@ -90,13 +101,27 @@
         parallelism = int(os.environ.get("REGISTRY_PARALLELISM", "4"))
         verbose = os.environ.get("REGISTRY_VERBOSE", "true").lower() in ("true", "1", "yes")
 
-        if registry_type == RegistryType.DOCKERHUB:
+        if registry_type == RegistryType.ECR:
+            return cls(
+                registry_type=registry_type,
+                region=os.environ.get("AWS_REGION", "us-east-1"),
+                ecr_repo_prefix=os.environ.get("ECR_REPO_PREFIX"),
+                repository_mode=os.environ.get("ECR_REPOSITORY_MODE", "single"),
+                single_repo=os.environ.get("ECR_SINGLE_REPO", "all"),
+                skip_existing=skip_existing,
+                parallelism=parallelism,
+                verbose=verbose,
+            )
+
+        elif registry_type == RegistryType.DOCKERHUB:
             namespace = os.environ.get("DOCKERHUB_NAMESPACE")
             username = os.environ.get("DOCKERHUB_USERNAME")
             password = os.environ.get("DOCKERHUB_TOKEN") or os.environ.get("DOCKERHUB_PASSWORD")
 
             if not namespace:
-                raise ValueError("DOCKERHUB_NAMESPACE environment variable is required for DockerHub registry")
+                raise ValueError(
+                    "DOCKERHUB_NAMESPACE environment variable is required for DockerHub registry"
+                )
 
             return cls(
                 registry_type=registry_type,
@@ -129,7 +154,11 @@
         Raises:
             ValueError: If required fields are missing
         """
-        if self.registry_type == RegistryType.DOCKERHUB:
+        if self.registry_type == RegistryType.ECR:
+            if not self.region:
+                raise ValueError("region is required for ECR registry")
+
+        elif self.registry_type == RegistryType.DOCKERHUB:
             if not self.namespace:
                 raise ValueError("namespace is required for DockerHub registry")
             # Note: username/password validation happens in dockerhub.py
@@ -155,12 +184,16 @@
         Get the registry URL for this configuration.
Returns: - Registry URL (e.g., "docker.io") + Registry URL (e.g., "docker.io", "{account}.dkr.ecr.{region}.amazonaws.com") Raises: NotImplementedError: For registry types that need account-specific URLs """ - if self.registry_type == RegistryType.DOCKERHUB: + if self.registry_type == RegistryType.ECR: + # ECR URL requires account ID, which is fetched at runtime + raise NotImplementedError("ECR registry URL requires account ID from AWS STS") + + elif self.registry_type == RegistryType.DOCKERHUB: return "docker.io" elif self.registry_type == RegistryType.GCR: @@ -186,6 +219,8 @@ def to_dict(self) -> dict[str, Any]: "skip_existing": self.skip_existing, "parallelism": self.parallelism, "verbose": self.verbose, + "region": self.region, + "ecr_repo_prefix": self.ecr_repo_prefix, "namespace": self.namespace, "username": self.username, "password": "***" if self.password else None, # Redact password diff --git a/src/datasmith/docker/s3_cache_manager.py b/src/datasmith/docker/s3_cache_manager.py new file mode 100644 index 0000000..be59f83 --- /dev/null +++ b/src/datasmith/docker/s3_cache_manager.py @@ -0,0 +1,335 @@ +from __future__ import annotations + +import hashlib +import json +import os +import time +from dataclasses import dataclass +from typing import Any + +import boto3 +from botocore.exceptions import ClientError + +from datasmith.logging_config import get_logger + +logger = get_logger(__name__) + + +@dataclass +class S3CacheConfig: + """Configuration for S3-based Docker layer caching.""" + + bucket: str = os.environ.get("AWS_S3_BUCKET_DOCKER", "") + prefix: str = "docker-cache" + region: str = os.environ.get("AWS_REGION", "us-east-1") + max_cache_age_days: int = 30 # Clean up cache layers older than this + max_cache_size_gb: int = 100 # Maximum total cache size + compression: bool = True # Use gzip compression for cache metadata + cache_ttl_hours: int = 720 # How long to keep cache entries (30 days) + + +class S3DockerCacheManager: + """ + Manages Docker layer 
caching using S3 as a backend. + + This class provides: + 1. Layer push/pull operations to/from S3 + 2. Cache metadata management + 3. Cache cleanup and size management + 4. Integration with Docker BuildKit cache mounts + """ + + def __init__(self, config: S3CacheConfig): + self.config = config + self.s3 = boto3.client("s3", region_name=config.region) + self._ensure_bucket_exists() + + def _ensure_bucket_exists(self) -> None: + """Ensure the S3 bucket exists, create if it doesn't.""" + try: + self.s3.head_bucket(Bucket=self.config.bucket) + logger.debug("S3 cache bucket %s exists", self.config.bucket) + except ClientError as e: + error_code = e.response["Error"]["Code"] + if error_code == "404": + logger.info("Creating S3 cache bucket %s", self.config.bucket) + try: + if self.config.region == "us-east-1": + # us-east-1 doesn't need LocationConstraint + self.s3.create_bucket(Bucket=self.config.bucket) + else: + self.s3.create_bucket( + Bucket=self.config.bucket, + CreateBucketConfiguration={"LocationConstraint": self.config.region}, + ) + logger.info("Created S3 cache bucket %s", self.config.bucket) + except ClientError: + logger.exception("Failed to create S3 bucket %s", self.config.bucket) + raise + else: + logger.exception("Error checking S3 bucket %s", self.config.bucket) + raise + + def _get_cache_key(self, layer_id: str, cache_type: str = "layer") -> str: + """Generate a cache key for a layer or metadata.""" + return f"{self.config.prefix}/{cache_type}/{layer_id}" + + def _get_metadata_key(self, build_context_hash: str) -> str: + """Get the metadata key for a build context.""" + return self._get_cache_key(f"metadata/{build_context_hash}", "metadata") + + def _hash_build_context(self, dockerfile_content: str, build_args: dict[str, str]) -> str: + """Generate a hash for the build context to use as cache key.""" + # Include dockerfile content and build args in hash + content = f"{dockerfile_content}:{json.dumps(build_args, sort_keys=True)}" + return 
hashlib.sha256(content.encode()).hexdigest()[:16] + + def get_cache_mount_config(self, dockerfile_content: str, build_args: dict[str, str]) -> dict[str, str]: + """ + Get Docker BuildKit cache mount configuration for S3. + + Returns a dict with cache mount options that can be used with + docker build --cache-from and --cache-to. + """ + context_hash = self._hash_build_context(dockerfile_content, build_args) + + # S3 cache mount configuration + cache_config = { + "type": "s3", + "bucket": self.config.bucket, + "region": self.config.region, + "prefix": f"{self.config.prefix}/layers/{context_hash}", + } + + return cache_config + + def get_docker_build_args(self, dockerfile_content: str, build_args: dict[str, str]) -> list[str]: + """ + Get Docker build arguments for cache-from and cache-to. + + Returns a list of command line arguments to add to docker build. + """ + cache_config = self.get_cache_mount_config(dockerfile_content, build_args) + + # Build cache mount string + cache_mount_str = ( + f"type=s3,bucket={cache_config['bucket']},region={cache_config['region']},prefix={cache_config['prefix']}" + ) + + return ["--cache-from", cache_mount_str, "--cache-to", f"{cache_mount_str},mode=max"] + + def store_build_metadata( + self, + dockerfile_content: str, + build_args: dict[str, str], + image_layers: list[str], + build_duration: float, + cache_hits: int = 0, + ) -> None: + """Store build metadata for cache optimization.""" + context_hash = self._hash_build_context(dockerfile_content, build_args) + metadata_key = self._get_metadata_key(context_hash) + + metadata = { + "context_hash": context_hash, + "dockerfile_hash": hashlib.sha256(dockerfile_content.encode()).hexdigest(), + "build_args": build_args, + "image_layers": image_layers, + "build_duration": build_duration, + "timestamp": time.time(), + "cache_hits": cache_hits, + } + + try: + content_str = json.dumps(metadata, indent=2) + if self.config.compression: + import gzip + + content = gzip.compress(content_str.encode()) + content_type = "application/gzip" + else: + content = content_str.encode() + content_type = "application/json" + + self.s3.put_object( + Bucket=self.config.bucket, + Key=metadata_key, + Body=content, + ContentType=content_type, + Metadata={ + "context-hash": context_hash, + "timestamp": str(int(time.time())), + }, + ) + logger.debug("Stored build metadata for context %s", context_hash) + except Exception: + logger.warning("Failed to store build metadata") + + def get_build_metadata(self, dockerfile_content: str, build_args: dict[str, str]) -> dict[str, Any] | None: + """Retrieve build metadata for cache optimization.""" + context_hash = self._hash_build_context(dockerfile_content, build_args) + metadata_key = self._get_metadata_key(context_hash) + + try: + response = self.s3.get_object(Bucket=self.config.bucket, Key=metadata_key) + content = response["Body"].read() + + # Handle compression + if response.get("ContentType") == "application/gzip": + import gzip + + content = gzip.decompress(content) + + metadata: dict[str, Any] = json.loads(content.decode()) + + # Update the hit count and persist it; re-storing without cache_hits would reset the counter to 0 + metadata["cache_hits"] = metadata.get("cache_hits", 0) + 1 + self.store_build_metadata( + dockerfile_content, + build_args, + metadata.get("image_layers", []), + metadata.get("build_duration", 0.0), + cache_hits=metadata["cache_hits"], + ) + + logger.debug("Retrieved build metadata for context %s (hit #%d)", context_hash, metadata["cache_hits"]) + except ClientError as e: + if e.response["Error"]["Code"] == "NoSuchKey": + logger.debug("No cached metadata found for context %s", context_hash) + return None + else: + logger.warning("Error retrieving build metadata") + return None + except Exception: + logger.warning("Error parsing build metadata") + return None + else: + return metadata + + def cleanup_old_cache(self) -> dict[str, int]: # noqa: C901 + """ + Clean up old cache entries based on age and size limits. + + Returns a dict with cleanup statistics.
+ """ + stats = {"deleted_metadata": 0, "deleted_layers": 0, "freed_bytes": 0} + cutoff_time = time.time() - (self.config.max_cache_age_days * 24 * 3600) + + try: + # List all cache objects + paginator = self.s3.get_paginator("list_objects_v2") + pages = paginator.paginate(Bucket=self.config.bucket, Prefix=self.config.prefix) + + objects_to_delete: list[dict[str, str]] = [] + all_objects: list[tuple[float, str, int]] = [] + total_size = 0 + + for page in pages: + for obj in page.get("Contents", []): + key = obj["Key"] + size = obj["Size"] + last_modified = obj["LastModified"].timestamp() + + total_size += size + all_objects.append((last_modified, key, size)) + + # Check if object is too old + if last_modified < cutoff_time: + objects_to_delete.append({"Key": key}) + stats["freed_bytes"] += size + + if "metadata/" in key: + stats["deleted_metadata"] += 1 + else: + stats["deleted_layers"] += 1 + + # Also enforce the size limit on whatever survives the age check + size_limit_bytes = self.config.max_cache_size_gb * 1024 * 1024 * 1024 + remaining_size = total_size - stats["freed_bytes"] + if remaining_size > size_limit_bytes: + logger.info( + "Cache size %d GB exceeds limit %d GB, cleaning up", + remaining_size // (1024**3), + self.config.max_cache_size_gb, + ) + + # Evict oldest-first (LRU-style) until the remaining cache fits the limit + scheduled = {d["Key"] for d in objects_to_delete} + for _last_modified, key, size in sorted(all_objects): + if remaining_size <= size_limit_bytes: + break + if key in scheduled: + continue + objects_to_delete.append({"Key": key}) + remaining_size -= size + stats["freed_bytes"] += size + if "metadata/" in key: + stats["deleted_metadata"] += 1 + else: + stats["deleted_layers"] += 1 + + # Delete objects in batches + if objects_to_delete: + for i in range(0, len(objects_to_delete), 1000): # S3 batch delete limit + batch = objects_to_delete[i : i + 1000] + self.s3.delete_objects(Bucket=self.config.bucket, Delete={"Objects": batch}) + logger.info("Deleted %d cache objects", len(batch)) + + logger.info("Cache cleanup completed: %s", stats) + return stats # noqa: TRY300 + + except Exception: + logger.exception("Error during cache cleanup") + return stats + + def get_cache_stats(self) -> dict[str, Any]: + """Get cache statistics and usage information.""" + stats: dict[str, Any] = { + "total_objects": 0, +
"total_size_bytes": 0, + "metadata_objects": 0, + "layer_objects": 0, + "oldest_object": None, + "newest_object": None, + } + + try: + paginator = self.s3.get_paginator("list_objects_v2") + pages = paginator.paginate(Bucket=self.config.bucket, Prefix=self.config.prefix) + + for page in pages: + for obj in page.get("Contents", []): + key = obj["Key"] + size = obj["Size"] + last_modified = obj["LastModified"] + + stats["total_objects"] += 1 + stats["total_size_bytes"] += size + + if "metadata/" in key: + stats["metadata_objects"] += 1 + else: + stats["layer_objects"] += 1 + + if stats["oldest_object"] is None or last_modified < stats["oldest_object"]: + stats["oldest_object"] = last_modified + if stats["newest_object"] is None or last_modified > stats["newest_object"]: + stats["newest_object"] = last_modified + + # Convert to human-readable sizes (total_size_bytes is always an int, so no None check is needed) + stats["total_size_gb"] = stats["total_size_bytes"] / (1024**3) + stats["total_size_mb"] = stats["total_size_bytes"] / (1024**2) + + except Exception: + logger.exception("Error getting cache stats") + + return stats + + def invalidate_cache(self, dockerfile_content: str, build_args: dict[str, str]) -> None: + """Invalidate cache for a specific build context.""" + context_hash = self._hash_build_context(dockerfile_content, build_args) + + try: + # Delete metadata + metadata_key = self._get_metadata_key(context_hash) + self.s3.delete_object(Bucket=self.config.bucket, Key=metadata_key) + + # Delete associated layers + layer_prefix = f"{self.config.prefix}/layers/{context_hash}/" + paginator = self.s3.get_paginator("list_objects_v2") + pages = paginator.paginate(Bucket=self.config.bucket, Prefix=layer_prefix) + + objects_to_delete = [{"Key": obj["Key"]} for page in pages for obj in page.get("Contents", [])] + + if objects_to_delete: + self.s3.delete_objects(Bucket=self.config.bucket, Delete={"Objects": objects_to_delete}) + + logger.info("Invalidated
cache for context %s (%d objects)", context_hash, len(objects_to_delete) + 1) + + except Exception: + logger.warning("Error invalidating cache") diff --git a/tests/test_docker_dockerhub.py b/tests/test_docker_dockerhub.py index edfbfe4..8bc6ef0 100644 --- a/tests/test_docker_dockerhub.py +++ b/tests/test_docker_dockerhub.py @@ -17,7 +17,7 @@ class TestDockerHubTagEncoding: - """Test tag encoding for DockerHub.""" + """Test tag encoding for DockerHub (same logic as ECR).""" def test_simple_tag_encoding(self): """Test basic tag encoding without special characters.""" @@ -85,7 +85,9 @@ def test_credentials_parameter_overrides_env(self): def test_credentials_dockerhub_password_fallback(self): """Test fallback to DOCKERHUB_PASSWORD env var.""" - with patch.dict(os.environ, {"DOCKERHUB_USERNAME": "user", "DOCKERHUB_PASSWORD": "pass"}, clear=True): + with patch.dict( + os.environ, {"DOCKERHUB_USERNAME": "user", "DOCKERHUB_PASSWORD": "pass"}, clear=True + ): username, password = _get_dockerhub_credentials() assert username == "user" assert password == "pass" @@ -255,7 +257,9 @@ def test_filter_all_tasks_exist(self, mock_list_tags): "owner2-repo2-def456--final", } - filtered = filter_tasks_not_on_dockerhub(tasks, namespace="testns", username="user", password="pass") + filtered = filter_tasks_not_on_dockerhub( + tasks, namespace="testns", username="user", password="pass" + ) assert len(filtered) == 0 @@ -270,7 +274,9 @@ def test_filter_no_tasks_exist(self, mock_list_tags): # Mock empty tag list mock_list_tags.return_value = set() - filtered = filter_tasks_not_on_dockerhub(tasks, namespace="testns", username="user", password="pass") + filtered = filter_tasks_not_on_dockerhub( + tasks, namespace="testns", username="user", password="pass" + ) assert len(filtered) == 2 @@ -286,7 +292,9 @@ def
test_filter_some_tasks_exist(self, mock_list_tags): # Mock that only first task exists mock_list_tags.return_value = {"owner1-repo1-abc123--final"} - filtered = filter_tasks_not_on_dockerhub(tasks, namespace="testns", username="user", password="pass") + filtered = filter_tasks_not_on_dockerhub( + tasks, namespace="testns", username="user", password="pass" + ) assert len(filtered) == 2 assert filtered[0].owner == "owner2" diff --git a/uv.lock b/uv.lock index ca3b002..a20949d 100644 --- a/uv.lock +++ b/uv.lock @@ -323,6 +323,78 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/04/eb/f4151e0c7377a6e08a38108609ba5cede57986802757848688aeedd1b9e8/beautifulsoup4-4.13.5-py3-none-any.whl", hash = "sha256:642085eaa22233aceadff9c69651bc51e8bf3f874fb6d7104ece2beb24b47c4a", size = 105113, upload-time = "2025-08-24T14:06:14.884Z" }, ] +[[package]] +name = "boto3" +version = "1.7.84" +source = { registry = "https://pypi.org/simple" } +resolution-markers = [ + "python_full_version < '3.10'", +] +dependencies = [ + { name = "botocore", version = "1.10.84", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.10'" }, + { name = "jmespath", version = "0.10.0", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.10'" }, + { name = "s3transfer", version = "0.1.13", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.10'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/2f/2b/7010a5189859eec725c36081b1d1c8e721000ebdf81a1682ec6b64e1c373/boto3-1.7.84.tar.gz", hash = "sha256:64496f2c814e454e26c024df86bd08fb4643770d0e2b7a8fd70055fc6683eb9d", size = 93151, upload-time = "2018-08-23T23:58:39.6Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/ac/6e/faf7c6c3ae59641c75023fb5dcc8a02c33752ac8ccadf9931e8d8364f2fe/boto3-1.7.84-py2.py3-none-any.whl", hash = "sha256:0ed4b107c3b4550547aaec3c9bb17df068ff92d1f6f4781205800e2cb8a66de5", size = 128502, 
upload-time = "2018-08-23T23:58:41.734Z" }, +] + +[[package]] +name = "boto3" +version = "1.40.35" +source = { registry = "https://pypi.org/simple" } +resolution-markers = [ + "python_full_version >= '3.12'", + "python_full_version == '3.11.*'", + "python_full_version == '3.10.*'", +] +dependencies = [ + { name = "botocore", version = "1.40.35", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.10'" }, + { name = "jmespath", version = "1.0.1", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.10'" }, + { name = "s3transfer", version = "0.14.0", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.10'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/08/d0/9082261eb9afbb88896fa2ce018fa10750f32572ab356f13f659761bc5b5/boto3-1.40.35.tar.gz", hash = "sha256:d718df3591c829bcca4c498abb7b09d64d1eecc4e5a2b6cef14b476501211b8a", size = 111563, upload-time = "2025-09-19T19:41:07.704Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/db/26/08d814db09dc46eab747c7ebe1d4af5b5158b68e1d7de82ecc71d419eab3/boto3-1.40.35-py3-none-any.whl", hash = "sha256:f4c1b01dd61e7733b453bca38b004ce030e26ee36e7a3d4a9e45a730b67bc38d", size = 139346, upload-time = "2025-09-19T19:41:05.929Z" }, +] + +[[package]] +name = "botocore" +version = "1.10.84" +source = { registry = "https://pypi.org/simple" } +resolution-markers = [ + "python_full_version < '3.10'", +] +dependencies = [ + { name = "docutils", marker = "python_full_version < '3.10'" }, + { name = "jmespath", version = "0.10.0", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.10'" }, + { name = "python-dateutil", marker = "python_full_version < '3.10'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/67/01/43759329a6f7036aa739e86d446b908fa207222e224e537cd3d66fdb4c29/botocore-1.10.84.tar.gz", hash = 
"sha256:d3e4b5a2c903ea30d19d41ea2f65d0e51dce54f4f4c4dfd6ecd7b04f240844a8", size = 4612644, upload-time = "2018-08-23T23:58:44.898Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/01/b7/cb08cd1af2bb0d0dfb393101a93b6ab6fb80f109ab7b37f2f34386c11351/botocore-1.10.84-py2.py3-none-any.whl", hash = "sha256:380852e1adb9ba4ba9ff096af61f88a6888197b86e580e1bd786f04ebe6f9c0c", size = 4478913, upload-time = "2018-08-23T23:58:47.736Z" }, +] + +[[package]] +name = "botocore" +version = "1.40.35" +source = { registry = "https://pypi.org/simple" } +resolution-markers = [ + "python_full_version >= '3.12'", + "python_full_version == '3.11.*'", + "python_full_version == '3.10.*'", +] +dependencies = [ + { name = "jmespath", version = "1.0.1", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.10'" }, + { name = "python-dateutil", marker = "python_full_version >= '3.10'" }, + { name = "urllib3", marker = "python_full_version >= '3.10'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/da/6f/37f40da07f3cdde367f620874f76b828714409caf8466def65aede6bdf59/botocore-1.40.35.tar.gz", hash = "sha256:67e062752ff579c8cc25f30f9c3a84c72d692516a41a9ee1cf17735767ca78be", size = 14350022, upload-time = "2025-09-19T19:40:56.781Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/42/f4/9942dfb01a8a849daac34b15d5b7ca994c52ef131db2fa3f6e6995f61e0a/botocore-1.40.35-py3-none-any.whl", hash = "sha256:c545de2cbbce161f54ca589fbb677bae14cdbfac7d5f1a27f6a620cb057c26f4", size = 14020774, upload-time = "2025-09-19T19:40:53.498Z" }, +] + [[package]] name = "build" version = "1.3.0" @@ -997,6 +1069,10 @@ version = "0.0.1" source = { editable = "." 
} dependencies = [ { name = "asv" }, + { name = "boto3", version = "1.7.84", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.10'" }, + { name = "boto3", version = "1.40.35", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.10'" }, + { name = "botocore", version = "1.10.84", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.10'" }, + { name = "botocore", version = "1.40.35", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.10'" }, { name = "docker" }, { name = "dspy", version = "2.6.27", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.10'" }, { name = "dspy", version = "3.0.3", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.10'" }, @@ -1047,6 +1123,8 @@ dev = [ [package.metadata] requires-dist = [ { name = "asv" }, + { name = "boto3", specifier = ">=1.7.84" }, + { name = "botocore", specifier = ">=1.10.84" }, { name = "docker" }, { name = "dspy", specifier = ">=2.6.27" }, { name = "gitpython" }, @@ -1210,6 +1288,15 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/e3/26/57c6fb270950d476074c087527a558ccb6f4436657314bfb6cdf484114c4/docker-7.1.0-py3-none-any.whl", hash = "sha256:c96b93b7f0a746f9e77d325bcfb87422a3d8bd4f03136ae8a85b37f1898d5fc0", size = 147774, upload-time = "2024-05-23T11:13:55.01Z" }, ] +[[package]] +name = "docutils" +version = "0.22.1" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/e4/47/d869000fb74438584858acc628a364b277fc012695f0dfd513cb10f99768/docutils-0.22.1.tar.gz", hash = "sha256:d2fb50923a313532b6d41a77776d24cb459a594be9b7e4afa1fbcb5bda1893e6", size = 2291655, upload-time = "2025-09-17T17:58:45.409Z" } +wheels = [ + { url = 
"https://files.pythonhosted.org/packages/8c/dc/1948b90c5d9dbfa4d1fd3991013a042ba3ac62ebd3afdcb3fac08366e755/docutils-0.22.1-py3-none-any.whl", hash = "sha256:806e896f256a17466426544038f30cb860a99f5d4af640e36c284bfcb1824512", size = 638455, upload-time = "2025-09-17T17:58:42.498Z" }, +] + [[package]] name = "dspy" version = "2.6.27" @@ -1985,6 +2072,32 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/70/f3/ce100253c80063a7b8b406e1d1562657fd4b9b4e1b562db40e68645342fb/jiter-0.11.0-graalpy311-graalpy242_311_native-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:902b43386c04739229076bd1c4c69de5d115553d982ab442a8ae82947c72ede7", size = 336380, upload-time = "2025-09-15T09:20:36.867Z" }, ] +[[package]] +name = "jmespath" +version = "0.10.0" +source = { registry = "https://pypi.org/simple" } +resolution-markers = [ + "python_full_version < '3.10'", +] +sdist = { url = "https://files.pythonhosted.org/packages/3c/56/3f325b1eef9791759784aa5046a8f6a1aff8f7c898a2e34506771d3b99d8/jmespath-0.10.0.tar.gz", hash = "sha256:b85d0567b8666149a93172712e68920734333c0ce7e89b78b3e987f71e5ed4f9", size = 21607, upload-time = "2020-05-12T22:03:47.267Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/07/cb/5f001272b6faeb23c1c9e0acc04d48eaaf5c862c17709d20e3469c6e0139/jmespath-0.10.0-py2.py3-none-any.whl", hash = "sha256:cdf6525904cc597730141d61b36f2e4b8ecc257c420fa2f4549bac2c2d0cb72f", size = 24489, upload-time = "2020-05-12T22:03:45.643Z" }, +] + +[[package]] +name = "jmespath" +version = "1.0.1" +source = { registry = "https://pypi.org/simple" } +resolution-markers = [ + "python_full_version >= '3.12'", + "python_full_version == '3.11.*'", + "python_full_version == '3.10.*'", +] +sdist = { url = "https://files.pythonhosted.org/packages/00/2a/e867e8531cf3e36b41201936b7fa7ba7b5702dbef42922193f05c8976cd6/jmespath-1.0.1.tar.gz", hash = "sha256:90261b206d6defd58fdd5e85f478bf633a2901798906be2ad389150c5c60edbe", size = 25843, upload-time = 
"2022-06-17T18:00:12.224Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/31/b4/b9b800c45527aadd64d5b442f9b932b00648617eb5d63d2c7a6587b7cafc/jmespath-1.0.1-py3-none-any.whl", hash = "sha256:02e2e4cc71b5bcab88332eebf907519190dd9e6e82107fa7f83b1003a6252980", size = 20256, upload-time = "2022-06-17T18:00:10.251Z" }, +] + [[package]] name = "joblib" version = "1.5.2" @@ -4482,6 +4595,38 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/3a/ae/3f6f42038bb992d42b23a8b54790593aa04ef297195901d6534e671fd255/ruptures-1.1.10-cp39-cp39-win_amd64.whl", hash = "sha256:dff3801467f0ab7c257305fea1f676b55f16fc8c8e8bd8aafac366da52afa578", size = 476481, upload-time = "2025-09-10T09:48:01.762Z" }, ] +[[package]] +name = "s3transfer" +version = "0.1.13" +source = { registry = "https://pypi.org/simple" } +resolution-markers = [ + "python_full_version < '3.10'", +] +dependencies = [ + { name = "botocore", version = "1.10.84", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.10'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/9a/66/c6a5ae4dbbaf253bd662921b805e4972451a6d214d0dc9fb3300cb642320/s3transfer-0.1.13.tar.gz", hash = "sha256:90dc18e028989c609146e241ea153250be451e05ecc0c2832565231dacdf59c1", size = 103335, upload-time = "2018-02-15T00:25:02.494Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/d7/14/2a0004d487464d120c9fb85313a75cd3d71a7506955be458eebfe19a6b1d/s3transfer-0.1.13-py2.py3-none-any.whl", hash = "sha256:c7a9ec356982d5e9ab2d4b46391a7d6a950e2b04c472419f5fdec70cc0ada72f", size = 59642, upload-time = "2018-02-15T00:25:05.113Z" }, +] + +[[package]] +name = "s3transfer" +version = "0.14.0" +source = { registry = "https://pypi.org/simple" } +resolution-markers = [ + "python_full_version >= '3.12'", + "python_full_version == '3.11.*'", + "python_full_version == '3.10.*'", +] +dependencies = [ + { name = "botocore", version = "1.40.35", source = { registry = 
"https://pypi.org/simple" }, marker = "python_full_version >= '3.10'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/62/74/8d69dcb7a9efe8baa2046891735e5dfe433ad558ae23d9e3c14c633d1d58/s3transfer-0.14.0.tar.gz", hash = "sha256:eff12264e7c8b4985074ccce27a3b38a485bb7f7422cc8046fee9be4983e4125", size = 151547, upload-time = "2025-09-09T19:23:31.089Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/48/f0/ae7ca09223a81a1d890b2557186ea015f6e0502e9b8cb8e1813f1d8cfa4e/s3transfer-0.14.0-py3-none-any.whl", hash = "sha256:ea3b790c7077558ed1f02a3072fb3cb992bbbd253392f4b6e9e8976941c7d456", size = 85712, upload-time = "2025-09-09T19:23:30.041Z" }, +] + [[package]] name = "scipy" version = "1.13.1"
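The build-context hashing that keys the S3 cache in `s3_cache_manager.py` above is simple enough to sketch standalone. This hypothetical helper mirrors `_hash_build_context`: sha256 over the Dockerfile text plus the build args serialized with sorted keys, truncated to 16 hex characters for use as an S3 key prefix:

```python
import hashlib
import json


def hash_build_context(dockerfile_content: str, build_args: dict[str, str]) -> str:
    # Dockerfile text and build args feed one sha256; sort_keys makes the
    # JSON serialization (and hence the cache key) independent of dict order.
    content = f"{dockerfile_content}:{json.dumps(build_args, sort_keys=True)}"
    return hashlib.sha256(content.encode()).hexdigest()[:16]


key_a = hash_build_context("FROM python:3.12\n", {"B": "2", "A": "1"})
key_b = hash_build_context("FROM python:3.12\n", {"A": "1", "B": "2"})
assert key_a == key_b  # build-arg order does not change the cache key
assert key_a != hash_build_context("FROM python:3.13\n", {"A": "1", "B": "2"})
assert len(key_a) == 16
```

Any edit to the Dockerfile or to a build arg therefore lands under a fresh `{prefix}/layers/{hash}` namespace, which is what lets `invalidate_cache` delete a build's layers by prefix alone.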