Note: This is a preview feature currently under active development.
bktec can collect historical git commit metadata from your repository and upload it to Buildkite for training test selection models. This data helps test selection identify which tests are relevant to your code changes.
The backfill commands are available under bktec tools and are hidden from bktec --help by default. Setting BKTEC_PREVIEW_SELECTION to a truthy value (1, true, yes, or on) makes them visible in help output. The commands can always be invoked directly regardless of this setting.
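As a rough illustration of the truthy values listed above, a wrapper script could check the variable as follows. The helper name is_truthy is hypothetical, and this page does not say whether bktec compares case-insensitively, so the sketch matches the four lowercase values exactly.

```shell
# Hypothetical helper: succeeds when the value is one of the documented truthy strings.
is_truthy() {
  case "$1" in
    1|true|yes|on) return 0 ;;
    *) return 1 ;;
  esac
}

if is_truthy "${BKTEC_PREVIEW_SELECTION:-}"; then
  echo "backfill commands are listed in bktec --help"
else
  echo "backfill commands are hidden from bktec --help but can still be run"
fi
```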
- A git repository checkout (full clone recommended for best results)
- A Buildkite API access token with read_suites and write_suites scopes
- Optional: BKTEC_PREVIEW_SELECTION set to a truthy value to see the commands in bktec --help
Collects historical git commit metadata and uploads it to Buildkite. This is the main command for the backfill workflow.
The command performs the following steps:
- Verifies the API token has the required scopes (read_suites and write_suites)
- Fetches the list of commit SHAs from the Buildkite API for your suite
- Detects the default branch of your repository
- Filters out commits that don't exist locally, and fetches missing commits from the remote
- Collects commit metadata (author, committer, message, parent SHAs) in bulk
- Collects diffs concurrently for each commit against its fork-point on the default branch
- Packages everything as a compressed tarball (commit-metadata.jsonl + metadata.json)
- Uploads the tarball to Buildkite via presigned S3
Basic usage (flags):

    bktec tools backfill-commit-metadata \
      --access-token "bkua_..." \
      --organization-slug "my-org" \
      --suite-slug "my-suite"

Or using environment variables:

    export BUILDKITE_TEST_ENGINE_API_ACCESS_TOKEN="bkua_..."
    export BUILDKITE_ORGANIZATION_SLUG="my-org"
    export BUILDKITE_TEST_ENGINE_SUITE_SLUG="my-suite"
    bktec tools backfill-commit-metadata

Write to a local file instead of uploading:

    bktec tools backfill-commit-metadata --output commit-metadata.tar.gz

Skip full diffs:

    bktec tools backfill-commit-metadata --skip-diffs

Customize the lookback window and concurrency:

    bktec tools backfill-commit-metadata --days 30 --concurrency 5

Upload a previously generated tarball:

    bktec tools backfill-commit-metadata \
      --upload commit-metadata.tar.gz \
      --suite-slug "my-suite"

This is useful when you want to generate and upload in separate steps or when retrying a failed upload. If the command fails during upload, it retains the generated tarball locally and prints its path. You can then retry with --upload without re-running the entire metadata collection.
--suite-slug is required even with --upload because the presigned upload endpoint is suite-scoped: the server partitions tarballs by suite for the training pipeline. Use the same suite slug that produced the tarball.
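The generate-then-upload flow described above can be sketched as a small wrapper script. It uses only the --output, --upload, and --suite-slug flags from this page; the function name backfill_then_upload, the BKTEC override variable, and the retry count are hypothetical, and the sketch assumes the access token and organization slug are supplied through environment variables.

```shell
# Hypothetical wrapper: generate the tarball once, then retry only the upload.
# BKTEC may be overridden to point at a specific binary (an assumption of this sketch).
BKTEC="${BKTEC:-bktec}"

backfill_then_upload() {
  tarball="$1"
  suite="$2"
  attempts="${3:-3}"

  # Step 1: collect metadata into a local file without uploading.
  "$BKTEC" tools backfill-commit-metadata --output "$tarball" --suite-slug "$suite" || return 1

  # Step 2: upload the existing tarball, retrying without re-collecting.
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$BKTEC" tools backfill-commit-metadata --upload "$tarball" --suite-slug "$suite"; then
      return 0
    fi
    echo "upload attempt $i failed" >&2
    i=$((i + 1))
  done
  return 1
}
```

A call such as `backfill_then_upload commit-metadata.tar.gz my-suite 3` would collect once and attempt the upload up to three times, reusing the same suite slug for both steps.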
| Environment Variable | Flag | Default | Description |
|---|---|---|---|
| BUILDKITE_TEST_ENGINE_API_ACCESS_TOKEN | --access-token | | Buildkite API access token (required) |
| BUILDKITE_TEST_ENGINE_SUITE_SLUG | --suite-slug | | Test Engine suite slug (required for backfill) |
| BUILDKITE_ORGANIZATION_SLUG | --organization-slug | | Buildkite organization slug (required) |
| BUILDKITE_TEST_ENGINE_BASE_URL | --base-url | https://api.buildkite.com | Buildkite API base URL |
| BUILDKITE_TEST_ENGINE_SKIP_DIFFS | --skip-diffs | false | Omit full git diffs from the export |
| BUILDKITE_TEST_ENGINE_BACKFILL_DAYS | --days | 90 | Number of days of commit history to export (1-90) |
| BUILDKITE_TEST_ENGINE_REMOTE or BUILDKITE_TEST_ENGINE_BACKFILL_REMOTE | --remote | origin | Git remote name for fetching and branch detection |
| BUILDKITE_TEST_ENGINE_BACKFILL_CONCURRENCY | --concurrency | 10 | Number of concurrent git operations for diff collection |
| BUILDKITE_TEST_ENGINE_DEBUG_ENABLED | --debug | false | Enable debug output |
When using --upload, --access-token, --organization-slug, and --suite-slug are required. Collection-specific flags (--days, --concurrency, --remote, --skip-diffs) are not needed because the command only uploads an existing tarball.
The backfill-commit-metadata command requires both read_suites (to fetch the commit list) and write_suites (to upload the tarball) scopes. If you use --output to write locally without uploading, only read_suites is required; a missing write_suites scope is downgraded to a warning.
When using --upload, only write_suites is required.
Token scopes are verified before any git work begins, so you get a fast failure if the token is misconfigured.
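The scope rules above can be summarized as a lookup by mode. This is only an illustrative sketch: the helper name required_scopes and the mode labels (default, output, upload) are hypothetical and are not part of bktec.

```shell
# Hypothetical helper: print the scopes each mode needs, per the rules above.
# Modes: "default" (collect and upload), "output" (--output), "upload" (--upload).
required_scopes() {
  case "$1" in
    default) echo "read_suites write_suites" ;;
    output)  echo "read_suites" ;;   # a missing write_suites scope is only a warning
    upload)  echo "write_suites" ;;
    *)       echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

required_scopes output
```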
You can run the backfill as a Buildkite pipeline step:

    steps:
      - label: ":git: Backfill commit metadata"
        command: bktec tools backfill-commit-metadata
        env:
          BKTEC_PREVIEW_SELECTION: "true"
          BUILDKITE_TEST_ENGINE_SUITE_SLUG: "my-suite"

To run the backfill for multiple suites in the same repository:

    steps:
      - label: ":git: Backfill commit metadata ({{matrix}})"
        command: bktec tools backfill-commit-metadata
        matrix:
          - "my-rspec-suite"
          - "my-jest-suite"
        env:
          BKTEC_PREVIEW_SELECTION: "true"
          BUILDKITE_TEST_ENGINE_SUITE_SLUG: "{{matrix}}"

The command fetches the list of commit SHAs from the Buildkite API. The server returns commits that appear in your suite's test execution history for the specified number of days. This means the backfill only processes commits that Buildkite has seen in test runs.
The default branch is detected using a fallback chain: <remote>/HEAD, then <remote>/main, then <remote>/master. The --remote flag controls which remote is used (default origin).
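The fallback chain above could be reproduced with plain git commands roughly as follows. This is a sketch of the same idea, not bktec's actual implementation; the function name detect_default_branch is hypothetical.

```shell
# Sketch: resolve the default branch via <remote>/HEAD, then <remote>/main, then <remote>/master.
detect_default_branch() {
  remote="${1:-origin}"

  # <remote>/HEAD is a symbolic ref when the remote's default branch is known locally.
  if head_ref=$(git symbolic-ref --quiet "refs/remotes/$remote/HEAD" 2>/dev/null); then
    echo "${head_ref#refs/remotes/}"
    return 0
  fi

  # Otherwise fall back to the two conventional branch names, in order.
  for name in main master; do
    if git rev-parse --verify --quiet "refs/remotes/$remote/$name" >/dev/null 2>&1; then
      echo "$remote/$name"
      return 0
    fi
  done

  echo "could not detect default branch for remote $remote" >&2
  return 1
}
```

Running `detect_default_branch origin` inside a checkout prints a ref such as origin/main, or fails if none of the three candidates exists.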
For each commit, the command determines the appropriate base commit to diff against using three strategies:
- git merge-base --fork-point (uses reflog data)
- Mainline parent fallback (for commits directly on the default branch)
- Plain git merge-base (for unmerged branches)
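One way the three strategies above might be chained with git is sketched below. The ordering of checks, the conditions for falling through, and the function name find_diff_base are assumptions made for illustration; bktec's real logic may differ.

```shell
# Sketch: pick a diff base for a commit, trying the three strategies in order.
# $1 = default branch ref (for example origin/main), $2 = full commit SHA.
find_diff_base() {
  default_ref="$1"
  commit="$2"

  # 1. Fork point, which consults the reflog of the default branch ref.
  base=$(git merge-base --fork-point "$default_ref" "$commit" 2>/dev/null) || base=""
  if [ -n "$base" ] && [ "$base" != "$commit" ]; then
    echo "$base"
    return 0
  fi

  # 2. Commit already on the default branch: diff against its first parent.
  if git merge-base --is-ancestor "$commit" "$default_ref" 2>/dev/null; then
    git rev-parse --verify --quiet "$commit^1"
    return
  fi

  # 3. Unmerged branch: plain merge base with the default branch.
  git merge-base "$default_ref" "$commit"
}
```

The resulting base could then be passed to something like `git diff <base> <commit>` to produce the per-commit diff.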
Some commits from the API list may not exist in the local checkout (for example, from force-pushed branches or shallow clones). The command attempts to fetch missing commits from the remote. Commits that can't be fetched are skipped with a warning.
For best results, run from a full clone rather than a shallow clone.
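The missing-commit handling described above can be approximated with standard git commands. This is a hedged sketch: the function name ensure_commit is hypothetical, and fetching a commit by SHA only succeeds when the remote server permits it.

```shell
# Sketch: make sure a commit exists locally, fetching it from the remote if needed.
ensure_commit() {
  sha="$1"
  remote="${2:-origin}"

  # cat-file -e exits 0 only if the object exists; ^{commit} requires it to be a commit.
  if git cat-file -e "${sha}^{commit}" 2>/dev/null; then
    return 0
  fi

  # Try to fetch the single commit, then check again (assumes the server allows SHA fetches).
  if git fetch --quiet "$remote" "$sha" 2>/dev/null &&
     git cat-file -e "${sha}^{commit}" 2>/dev/null; then
    return 0
  fi

  echo "warning: skipping $sha (not available locally or on $remote)" >&2
  return 1
}
```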
The tarball wraps its contents inside a directory named backfill-<org>-<suite>-<timestamp> (for example, backfill-my-org-my-suite-20260402T100000.000Z/). The directory contains two files:
- commit-metadata.jsonl: one JSON object per line, with fields including commit_sha, parent_shas, author_name, author_email, author_date, committer_name, committer_email, committer_date, message, files_changed, diff_stat, git_diff, and git_diff_raw
- metadata.json: archive metadata including the tool version, generation timestamp, commit count, configuration options used, and the date range of commits in the archive
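To sanity-check a tarball written with --output before uploading it, the layout above suggests a quick inspection like the following. The helper name count_commits is hypothetical; the sketch relies only on the archive containing a single wrapper directory with commit-metadata.jsonl inside it, one line per commit.

```shell
# Sketch: count the commits in a backfill tarball (one JSONL line per commit).
count_commits() {
  tarball="$1"

  # The wrapper directory name varies, so look up the member path first.
  member=$(tar -tzf "$tarball" | grep '/commit-metadata\.jsonl$' | head -n 1)
  [ -n "$member" ] || { echo "commit-metadata.jsonl not found in $tarball" >&2; return 1; }

  # -O writes the member to stdout instead of extracting it to disk.
  tar -xzOf "$tarball" "$member" | wc -l | tr -d ' '
}
```

For example, `count_commits commit-metadata.tar.gz` prints the number of commit records, which can be compared against the commit count recorded in metadata.json.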