
[ML] Add AI-powered build failure analysis to CI pipelines #2909

Open
edsavage wants to merge 21 commits into elastic:main from edsavage:ci-build-failure-analyzer

Conversation

@edsavage
Contributor

Summary

  • Adds a new Buildkite pipeline step that automatically analyses failed builds using the Anthropic Claude API
  • When a build fails, the step fetches logs from failed steps and posts a structured diagnosis (root cause, classification, suggested fix, confidence) as a Buildkite annotation
  • The step is soft-fail and only runs when the build is actually failing (if: "build.state == 'failed' || build.state == 'failing'")
  • Wired into all three pipelines: PR builds, nightly snapshot builds, and nightly debug builds
  • Claude API key stored in Vault at secret/ci/elastic-ml-cpp/anthropic/claude
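
As an illustration of the structured diagnosis described above, here is a minimal sketch of turning a model reply into the four fields. It assumes the model is asked to answer in JSON with these exact keys; the key names, the `Diagnosis` class, and `parse_diagnosis` are hypothetical and not taken from `dev-tools/analyze_build_failure.py`.

```python
import json
from dataclasses import dataclass

# Hypothetical field names; the real script may use a different schema.
REQUIRED_FIELDS = ("root_cause", "classification", "suggested_fix", "confidence")


@dataclass
class Diagnosis:
    root_cause: str
    classification: str
    suggested_fix: str
    confidence: str


def parse_diagnosis(reply_text: str) -> Diagnosis:
    """Parse a JSON reply into a Diagnosis, filling missing fields with 'unknown'."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        # A JSON list or scalar is not a usable diagnosis.
        data = {}
    return Diagnosis(**{name: str(data.get(name, "unknown")) for name in REQUIRED_FIELDS})
```

Falling back to `unknown` rather than raising keeps the step in line with its soft-fail intent: a malformed reply degrades the annotation instead of breaking the build.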

New files

  • dev-tools/analyze_build_failure.py — core analysis script
  • .buildkite/pipelines/analyze_build_failure.yml.sh — pipeline step definition

Test plan

Made with Cursor

@prodsecmachine

prodsecmachine commented Feb 20, 2026

Snyk checks have passed. No issues have been found so far.

| Status | Scan Engine | Critical | High | Medium | Low | Total (0) |
|---|---|---|---|---|---|---|
| | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |
| | Licenses | 0 | 0 | 0 | 0 | 0 issues |


Contributor

@valeriy42 valeriy42 left a comment


Will this create a GitHub comment on the PR with fix suggestions?

@edsavage
Contributor Author

> Will this create a GitHub comment on the PR with fix suggestions?

No, just as an annotation on the Buildkite build. I guess it could be a GH comment as well though, which would notify the user and have better visibility if changes are suggested.

@edsavage
Contributor Author

edsavage commented Feb 27, 2026

Pushed a new commit that adds GitHub PR comments for build failure analysis. When the build is a PR build, the analysis is now posted as a comment directly on the PR (in addition to the Buildkite annotation and optional Slack notification).

Key details:

  • Uses BUILDKITE_PULL_REQUEST env var (set automatically by Buildkite) to detect PR builds
  • Posts via GitHub API using a token from Vault (secret/ci/elastic-ml-cpp/github/pr_comment_token)
  • Uses an HTML comment marker (<!-- build-failure-analysis -->) to find and update existing comments on rebuild/retry, avoiding duplicates
  • Non-PR builds (nightly snapshots, etc.) continue to use Buildkite annotations only
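
The marker-based deduplication above could look roughly like this. It is a sketch that works on a list of comment dicts with `id` and `body` keys (the shape returned by GitHub's issue comments endpoint); the function names and the `("create" | "update", id, body)` return convention are hypothetical, and no HTTP calls are made.

```python
MARKER = "<!-- build-failure-analysis -->"


def find_marked_comment(comments, marker=MARKER):
    """Return the id of the first comment whose body contains the marker, else None."""
    for comment in comments:
        if marker in (comment.get("body") or ""):
            return comment["id"]
    return None


def plan_comment_action(comments, analysis_markdown, marker=MARKER):
    """Decide whether to create a new comment or update the existing marked one."""
    body = f"{marker}\n{analysis_markdown}"
    existing_id = find_marked_comment(comments, marker)
    if existing_id is None:
        return ("create", None, body)
    return ("update", existing_id, body)
```

Because the marker is an HTML comment, it is invisible in the rendered PR comment but still present in the raw body that the API returns.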

This addresses @valeriy42's feedback about improving visibility for PR authors.

Contributor

@valeriy42 valeriy42 left a comment


Thank you for adding GH comment functionality. I think it makes a lot of sense to reduce friction so this information is visible to the developer.

My last concern is about burning API tokens. Is it possible to trigger this function on demand via a GitHub comment?

I think we have the following user story:
As a developer, I want to ask the CI system why it failed and what needs to be done to fix it.

@edsavage edsavage force-pushed the ci-build-failure-analyzer branch from 55d3ad2 to 19a3caf Compare March 19, 2026 01:48
@edsavage
Contributor Author

User Experience: Examining a Failed Build

After this PR merges, here's what a developer sees when a PR build fails:

1. Immediate feedback — GitHub commit statuses (existing)

Red/green marks appear on the PR for each Buildkite step (e.g. "Build on Linux x86_64 RelWithDebInfo — failed"). Each is a clickable link to the Buildkite step. This is the existing publish_commit_status_per_step behaviour.

2. Native PR comment from @elasticmachine (new)

Once the build completes, a PR comment appears listing what failed:

💔 Build Failed

  • Buildkite Build (link)

Failed CI Steps

  • Build on Linux x86_64 RelWithDebInfo (link)
  • Test on Linux x86_64 RelWithDebInfo (link)

History

This is automatic (no opt-in), and includes build history across commits on the PR, flagging flaky builds. Enabled by ELASTIC_PR_COMMENTS_ENABLED in catalog-info.yaml.

3. AI analysis (opt-in for PR builds)

If the failure isn't obvious, the developer comments on the PR:

buildkite analyze

This triggers the analyze_build_failure step which fetches logs from failed steps, sends them to Claude for diagnosis, and posts a Buildkite annotation at the top of the build page.
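
For the final part of that step, posting the annotation, a minimal sketch is to build a `buildkite-agent annotate` command line. The `--style` and `--context` flags are standard agent options; the context name `build-failure-analysis` and the helper name are assumptions, and the command is only constructed here, not executed.

```python
VALID_STYLES = ("success", "info", "warning", "error")


def annotate_command(body, style="error", context="build-failure-analysis"):
    """Build the argv for `buildkite-agent annotate` (context name is hypothetical)."""
    if style not in VALID_STYLES:
        raise ValueError(f"unsupported annotation style: {style!r}")
    # Reusing the same context replaces the earlier annotation rather than adding a new one.
    return ["buildkite-agent", "annotate", body, "--style", style, "--context", context]
```

Inside a Buildkite job the resulting list could be passed to `subprocess.run(..., check=True)`.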

4. AI analysis PR comment from @github-actions[bot] (new)

A GitHub Actions workflow picks up the analyze step's commit status and posts a PR comment:

🔍 Build Failure Analysis

Root Cause

The CMultiFileDataAdderTest test failed due to a temp file collision...

Classification

test failure

Suggested Fix

Use process ID for unique temp file names...

Confidence

high — The error message clearly indicates...

This comment is updated in-place on subsequent analyses (not duplicated). No personal access token or GitHub App is required — the workflow uses the built-in GITHUB_TOKEN.
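
The comment layout shown above can be produced with a small renderer. This is a sketch only: the heading levels and the `render_analysis` name are assumptions, and the section titles are copied from the example comment.

```python
def render_analysis(root_cause, classification, suggested_fix, confidence):
    """Render the four diagnosis fields as a Markdown comment body."""
    sections = [
        ("Root Cause", root_cause),
        ("Classification", classification),
        ("Suggested Fix", suggested_fix),
        ("Confidence", confidence),
    ]
    lines = ["## 🔍 Build Failure Analysis", ""]
    for title, text in sections:
        # One sub-heading per field, followed by its text and a blank line.
        lines.extend([f"### {title}", "", text.strip(), ""])
    return "\n".join(lines).rstrip() + "\n"
```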

5. Slack + email (nightly/snapshot builds)

For non-PR builds, the analysis is also posted to #machine-learn-build via Slack. Email notifications to build-machine-learning@elastic.co continue as before.


Summary

| When | Where | What | Source |
|---|---|---|---|
| Per-step | PR checks | Red/green status with Buildkite links | Buildkite (existing) |
| Build complete | PR comment | Failed steps + build history | @elasticmachine (new) |
| After buildkite analyze | Buildkite build page | AI diagnosis annotation | Buildkite annotation |
| After buildkite analyze | PR comment | AI diagnosis with root cause + fix | @github-actions[bot] (new) |

The developer can stop at any layer depending on how obvious the failure is. Most of the time steps 1–2 are enough; the AI analysis is there for non-obvious cases.

@edsavage edsavage requested a review from valeriy42 March 19, 2026 02:30
edsavage and others added 10 commits March 20, 2026 13:08
When a Buildkite build fails, a new soft-fail step fetches the failed
step logs and sends them to Claude for diagnosis.  The analysis
(root cause, classification, suggested fix, confidence) is posted as
a Buildkite annotation directly on the build page.

The step uses an `if` guard so it only runs when the build is
failing, and the Claude API key is retrieved from Vault at runtime.

Co-authored-by: Cursor <cursoragent@cursor.com>
When SLACK_WEBHOOK_URL is set, posts a compact summary of each failed
step's AI diagnosis to #machine-learn-build.  The message includes the
classification emoji, root cause, and a link back to the build page.

The webhook URL is retrieved from Vault at runtime; if absent, the
Slack step is silently skipped and only the Buildkite annotation is
posted.

Co-authored-by: Cursor <cursoragent@cursor.com>
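
A rough sketch of the optional Slack notification described in this commit message: build a compact message and skip silently when `SLACK_WEBHOOK_URL` is absent. The emoji mapping and function names are hypothetical; the `{"text": ...}` payload and `<url|label>` link syntax are the standard Slack incoming-webhook format. Nothing is sent over the network here.

```python
import json

# Hypothetical classification-to-emoji mapping; the real script's mapping is not shown in this PR page.
CLASSIFICATION_EMOJI = {"test failure": ":test_tube:", "code bug": ":bug:"}


def build_slack_payload(step_name, classification, root_cause, build_url):
    """Build a compact one-step summary for a Slack incoming webhook."""
    emoji = CLASSIFICATION_EMOJI.get(classification, ":warning:")
    text = f"{emoji} *{step_name}* ({classification}): {root_cause}\n<{build_url}|View build>"
    return {"text": text}


def slack_request(env, payload):
    """Return (url, encoded JSON body), or None when no webhook URL is configured."""
    url = env.get("SLACK_WEBHOOK_URL")
    if not url:
        return None  # Slack is optional; only the Buildkite annotation is posted.
    return url, json.dumps(payload).encode("utf-8")
```

The returned pair could be handed to `urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})`.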
When the build is a PR build (BUILDKITE_PULL_REQUEST is set), post the
Claude analysis as a comment on the GitHub PR in addition to the
Buildkite annotation and Slack notification.

Uses an HTML comment marker to find and update existing comments on
rebuild/retry, avoiding duplicate comments on the same PR.

Addresses review feedback from valeriy42 requesting better visibility
of failure analysis for PR authors.

Made-with: Cursor
Allows overriding the PR number from the command line, useful for
local testing of the GitHub comment feature without being in a
Buildkite PR build environment.

Tested end-to-end against build elastic#2232 (Bayesian test timeout),
posting to a throwaway PR. Both initial post and update-in-place
(deduplication) verified working.

Made-with: Cursor
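
The command-line override described in this commit message might be wired up as follows. The `--pr-number` flag name and `resolve_pr_number` helper are hypothetical; the fallback relies on Buildkite setting `BUILDKITE_PULL_REQUEST` to the PR number, or to the string `false` for non-PR builds.

```python
import argparse


def resolve_pr_number(argv, env):
    """Prefer an explicit --pr-number (hypothetical flag), else BUILDKITE_PULL_REQUEST."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--pr-number", type=int, default=None)
    args, _unknown = parser.parse_known_args(argv)
    if args.pr_number is not None:
        return args.pr_number
    raw = env.get("BUILDKITE_PULL_REQUEST", "false")
    # Non-PR builds carry the literal string "false", which is not a number.
    return int(raw) if raw.isdigit() else None
```

For local testing this would be called as `resolve_pr_number(sys.argv[1:], os.environ)`.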
Failure analysis now only runs on PR builds when triggered by a
`buildkite analyze` comment, avoiding unnecessary API token usage.
Nightly and debug pipelines retain automatic analysis on failure.

Made-with: Cursor
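
To illustrate the comment trigger, here is a sketch of matching a `buildkite analyze` PR comment. The actual trigger regex in the pipeline configuration is not shown in this conversation, so the pattern below is an assumption: it accepts the phrase on a line of its own, case-insensitively.

```python
import re

# Hypothetical pattern; the pipeline's real trigger regex may differ.
ANALYZE_TRIGGER = re.compile(r"^\s*buildkite\s+analyze\s*$", re.IGNORECASE)


def is_analyze_request(comment_body):
    """Return True if any line of the comment is exactly 'buildkite analyze'."""
    return any(ANALYZE_TRIGGER.match(line) for line in comment_body.splitlines())
```

Requiring the phrase to stand alone on a line avoids triggering on comments that merely mention it in a sentence.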
Enable the ELASTIC_PR_COMMENTS_ENABLED feature on the PR builds
pipeline so that elasticmachine posts a summary comment listing
failed steps and build history directly on the GitHub PR.

Made-with: Cursor
Replace direct GitHub API calls from the Buildkite analyze step with
a GitHub Actions workflow that uses the built-in GITHUB_TOKEN. The
Buildkite step now saves the analysis as build metadata, and a
GitHub Actions workflow triggered by the commit status event fetches
it and posts/updates the PR comment. This eliminates the need for a
personal access token or GitHub App for PR comments.

Made-with: Cursor
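
The hand-off described in this commit message could be sketched in two halves: the Buildkite step stores the analysis with `buildkite-agent meta-data set`, and the workflow later reads it from the build's `meta_data` object as returned by the Buildkite REST API. The metadata key and helper names are hypothetical, and no commands or requests are actually issued.

```python
# Hypothetical metadata key; the real key used by the step is not shown here.
META_KEY = "build-failure-analysis"


def metadata_set_command(analysis_markdown, key=META_KEY):
    """Build the argv that stores the analysis as Buildkite build metadata."""
    return ["buildkite-agent", "meta-data", "set", key, analysis_markdown]


def analysis_from_build(build, key=META_KEY):
    """Extract the stored analysis from a build JSON object, or None if absent."""
    return (build.get("meta_data") or {}).get(key)
```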
The test confirmed Vault is reachable from GitHub Actions runners
and JWT auth paths exist. Actual OIDC login needs to be verified
with the infra team.

Made-with: Cursor
Apply the same fix as PR elastic#3003 to the analyze_build_failure step:
compute which build step keys will exist based on the platform config
and pass them as ML_BUILD_STEP_KEYS for the shell script to use in
its depends_on section.  This prevents "Step dependencies not found"
errors when not all platforms are built.

Made-with: Cursor
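
A sketch of the idea in this commit message: derive the step keys from the platforms that will actually be built and export them for the shell script. The key naming scheme, the platform identifiers, and the comma separator are all assumptions; only the `ML_BUILD_STEP_KEYS` variable name comes from the commit message.

```python
def build_step_keys(platforms):
    """Map platform identifiers to step keys (naming scheme is hypothetical)."""
    return [f"build_test_{platform.replace('-', '_')}" for platform in platforms]


def step_keys_env(platforms):
    """Return the environment entry consumed by the pipeline shell script."""
    # Assumes a comma-separated list; the real script may expect another format.
    return {"ML_BUILD_STEP_KEYS": ",".join(build_step_keys(platforms))}
```

Only keys for configured platforms end up in `depends_on`, which is what avoids the "Step dependencies not found" error.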
@edsavage edsavage force-pushed the ci-build-failure-analyzer branch from 16df0df to 4cc72f3 Compare March 20, 2026 00:11
@edsavage
Contributor Author

buildkite analyze

The analyze_build_failure step already guards itself with
  if: "build.state == 'failed' || build.state == 'failing'"
so it is automatically skipped for passing builds.  Making it
always-on (rather than requiring a special "buildkite analyze"
comment trigger) ensures it is available whenever a build fails
without needing to be requested in advance.

Remove the run_analyze config flag and the "analyze" action from
the PR comment trigger regex since they are no longer needed.

Made-with: Cursor
Introduce a compile error to test the build failure analysis step.
This commit will be reverted immediately after verifying the step.

Made-with: Cursor
Remove the Buildkite `if` condition from analyze_build_failure.yml.sh.
Buildkite evaluates `if` on dynamically uploaded steps at upload time
(not at step execution time), so the condition always saw
build.state == 'running' and the step was never created.

The Python script already checks the build state via the Buildkite
API and exits early if the build passed, so the YAML-level `if` is
unnecessary.

Also reverts the deliberate compile error in CBuildInfo.cc that was
used to test the failure analysis flow.

Made-with: Cursor
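
The script-level guard that replaces the YAML `if` could be as small as the following sketch. It assumes the build has already been fetched from the Buildkite API as a dict with a `state` field; the function names are hypothetical.

```python
FAILED_STATES = frozenset({"failed", "failing"})


def should_analyze(build):
    """Return True only when the build is failed or failing."""
    return build.get("state") in FAILED_STATES


def main(build):
    """Exit early (status 0) when there is nothing to analyze."""
    if not should_analyze(build):
        print(f"Build state is {build.get('state')!r}; nothing to analyze.")
        return 0
    print("Build is failing; analysis would run here.")
    return 0
```

Checking at run time sidesteps the upload-time evaluation problem, because the state is read when the step actually executes.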
Use python:3 instead of python:3-slim for the analyze_build_failure
step. The slim image lacks curl and git which the Buildkite agent
hooks require.

Also reverts the deliberate compile error.

Made-with: Cursor
The "Analyze build failure" step ran successfully on Build elastic#2385,
correctly identifying the deliberate #error as a code bug with high
confidence. Reverting to restore normal builds.

Made-with: Cursor
Instead of always including the analysis step or requiring a full
rebuild, "buildkite analyze" now triggers a lightweight pipeline that
finds the most recent failed build for the branch via the Buildkite
API and analyzes it retroactively — no recompilation needed.

Also improves log extraction: instead of blindly taking the last 30K
chars (which often misses the actual error), the script now scans for
error patterns and extracts matching lines with surrounding context.

Made-with: Cursor
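
The improved log extraction might look roughly like this: scan for error patterns, keep each match with a few lines of context, and fall back to the tail of the log when nothing matches. The patterns, the context width, and the `...` gap marker are assumptions, not the script's actual values.

```python
import re

# Hypothetical patterns; the real script's list is not shown in this conversation.
ERROR_PATTERNS = [
    re.compile(r"\berror\b", re.IGNORECASE),
    re.compile(r"\bFAILED\b"),
    re.compile(r"\bfatal\b", re.IGNORECASE),
]


def extract_error_context(log, context=3, max_chars=30000):
    """Return matching lines with surrounding context, or the log tail if no match."""
    lines = log.splitlines()
    keep = set()
    for index, line in enumerate(lines):
        if any(pattern.search(line) for pattern in ERROR_PATTERNS):
            keep.update(range(max(0, index - context), min(len(lines), index + context + 1)))
    if not keep:
        return log[-max_chars:]  # Old behaviour: just the last max_chars characters.
    out, previous = [], None
    for index in sorted(keep):
        if previous is not None and index != previous + 1:
            out.append("...")  # Mark a gap between non-adjacent excerpts.
        out.append(lines[index])
        previous = index
    return "\n".join(out)[:max_chars]
```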
@edsavage
Contributor Author

buildkite analyze

Replace BOOST_ERROR/BOOST_FAIL patterns (source-code macro names that
don't appear in logs) with a pattern matching the actual Boost.Test
summary output: "*** N failure(s) detected in test suite".

Made-with: Cursor
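
A sketch of a pattern for the Boost.Test summary line mentioned in this commit message. The exact regex used by the script is not shown; this one is an assumption that tolerates both the `N failure(s) detected in` wording quoted above and the `is`/`are detected in` variants that some Boost versions print.

```python
import re

# Hypothetical pattern for the Boost.Test summary line, e.g.
#   *** 2 failures are detected in the test module "example"
BOOST_SUMMARY = re.compile(r"\*\*\*\s+(\d+)\s+failures?\s+(?:(?:is|are)\s+)?detected in")


def count_boost_failures(log):
    """Sum the failure counts reported by Boost.Test summary lines in a log."""
    return sum(int(match.group(1)) for match in BOOST_SUMMARY.finditer(log))
```

Source-level macro names such as `BOOST_ERROR` do not match, which is the point of the change: only text that actually appears in test output is searched for.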
The analysis step correctly identified the Boost.Test failure on all
platforms. Reverting to restore normal test behaviour.

Made-with: Cursor
3 participants