Add L2 CI workflow with simulated tests and HUD callbacks #4

Merged: atalman merged 3 commits into main from add-l2-callbacks on Jun 2, 2026

Collaborator

@subinz1 subinz1 commented Jun 1, 2026

Summary

  • Adds a new oot-l2-ci.yml workflow (the existing crcr-dispatch-receiver.yml is untouched)
  • Receives repository_dispatch events (pull_request type only; push is not handled)
  • Sends in_progress callback at job start, completed callback at job end
  • Generates randomized per-run test data:
    • 100–500 total tests
    • 0–11% failure rate, 0–4% skip rate
    • 15–60s simulated execution duration
  • Reports test-results JSON and artifact-url in the completed callback
  • Every dispatch produces different numbers → varied data flows through DynamoDB → ClickHouse → HUD
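
The per-run randomization listed above can be sketched roughly as follows. This is a minimal Python illustration, not the workflow's actual shell step: the function name simulate_run, the dictionary keys, and the choice of uniform draws are assumptions made for the example. Only the numeric ranges come from the summary.

```python
import json
import random


def simulate_run(rng: random.Random) -> dict:
    """Draw one run's simulated numbers within the ranges from the summary."""
    total = rng.randint(100, 500)        # 100-500 total tests
    fail_rate = rng.uniform(0.0, 0.11)   # 0-11% failure rate
    skip_rate = rng.uniform(0.0, 0.04)   # 0-4% skip rate
    failed = int(total * fail_rate)      # round down so the rate cap holds
    skipped = int(total * skip_rate)
    return {
        # Key names are hypothetical; the real test-results schema may differ.
        "total": total,
        "passed": total - failed - skipped,
        "failed": failed,
        "skipped": skipped,
        "duration_seconds": rng.randint(15, 60),  # 15-60s simulated duration
    }


if __name__ == "__main__":
    # An unseeded generator gives different numbers on every dispatch.
    print(json.dumps(simulate_run(random.Random()), indent=2))
```

Passing in the generator keeps the sketch easy to check: a seeded random.Random reproduces a run, while an unseeded one gives the varied data the dashboard is meant to display.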

What it validates

  1. OIDC token minting via id-token: write
  2. cross-repo-ci-relay-callback composite action (in_progress + completed)
  3. End-to-end data flow: GHA → Lambda → DynamoDB → ClickHouse replicator → HUD API
  4. Dashboard displays varied pass rates, durations, and failure counts

Test plan

  • Trigger a dispatch to pytorch/crcr-test (via a PR or push to pytorch/pytorch)
  • Verify the OOT L2 workflow runs and completes
  • Check DynamoDB torchci-oot-workflow-job table for the new record
  • Check ClickHouse default.oot_workflow_job for replicated data
  • Verify https://hud.pytorch.org/oot shows the new entry

@meta-cla meta-cla Bot added the cla signed label Jun 1, 2026
@subinz1 subinz1 marked this pull request as draft June 1, 2026 04:52
@subinz1 subinz1 force-pushed the add-l2-callbacks branch from 41617f4 to f62bbb4 June 1, 2026 05:42
@subinz1 subinz1 changed the title from "Add L2 callbacks to report CI results to PyTorch HUD" to "Add L2 CI workflow with simulated tests and HUD callbacks" Jun 1, 2026
@subinz1 subinz1 requested review from groenenboomj and jewelkm89 June 1, 2026 05:43
Add a new oot-l2-ci.yml workflow (separate from the existing L1
dispatch receiver) that demonstrates the full L2 pipeline:

- Listens only to pull_request dispatches (not push)
- Sends in_progress callback at job start via the composite action
- Checks out both this repo and pytorch/pytorch at the PR head SHA
- Generates randomized test counts (100-500 total, 0-11% fail rate,
  0-4% skip rate) and execution duration (15-60s) per run
- Simulates build and test execution with realistic log output
- Sends completed callback with conclusion and test-results payload
- Reports artifact URL pointing to the workflow run

Every dispatch produces different numbers, so repeated runs create
varied data in the OOT HUD dashboard.

The existing crcr-dispatch-receiver.yml (L1) is left unchanged.
@subinz1 subinz1 force-pushed the add-l2-callbacks branch from f62bbb4 to 8f4a42d June 1, 2026 05:47
@subinz1 subinz1 marked this pull request as ready for review June 1, 2026 08:36
Comment thread .github/workflows/oot-l2-ci.yml Outdated
if: always()
uses: pytorch/test-infra/.github/actions/cross-repo-ci-relay-callback@main
with:
  callback-url: ${{ secrets.CRCR_CALLBACK_URL }}
Collaborator

@subinz1 Can we make sure CRCR_CALLBACK_URL is set?

Collaborator Author

@subinz1 subinz1 Jun 1, 2026

Yes — CRCR_CALLBACK_URL is not currently set as a repository secret on pytorch/crcr-test. I verified via gh secret list --repo pytorch/crcr-test (returns empty).

This secret should be the Lambda Function URL for the crcr-callback-prod Lambda, deployed via the Terraform config in pytorch/ci-infra#614 (aws_lambda_function_url.callback). The value can be obtained by either:

  1. Running terraform output callback_function_url in the crcr/ workspace, or
  2. Looking it up in the AWS Console → Lambda → crcr-callback-prod → Function URL

@atalman or @can-gaa-hou — could one of you set this secret on pytorch/crcr-test? This is a prerequisite for the L2 callback action to work.

Collaborator

I don't have access to the AWS Lambda, but I think @fffrog or @zxiiro do. Also, deploying the Result/Callback Lambda requires hud_api_url and hud_bot_key. I suspect these two secrets haven't been set yet, so the Lambda isn't ready. Can @zxiiro confirm? Thanks!

Contributor

Collaborator

Thank you very much @atalman!
I see the deployment completed successfully. Could you please set the CRCR callback URL as a secret, and also set hud_api_url and hud_bot_key, per @can-gaa-hou's comment?

Collaborator

Where do those secrets need to go? Are they GitHub Secrets or AWS Secrets Manager secrets?

I see these secrets in AWS_SECRETS_MANAGER

image

However, I'll note that HUD_BOT_KEY is currently empty, so whoever added it didn't set a value.

Collaborator Author

@subinz1 subinz1 Jun 1, 2026

Hello @zxiiro,

  • CRCR_CALLBACK_URL goes in GitHub repo secrets on pytorch/crcr-test — this is the Lambda Function URL from the deployment
  • HUD_API_URL and HUD_BOT_KEY go in AWS Secrets Manager — these are used by the Lambda itself to forward results to the HUD
  • HUD_BOT_KEY needs to be set to the OOT_RELAY_TOKEN value from the HUD's Vercel environment, and @atalman can provide that, since we don't have access to Vercel

CC @groenenboomj @jewelkm89

Collaborator

@can-gaa-hou can-gaa-hou Jun 2, 2026

> Where do those secrets need to go? Are they GitHub Secrets or AWS Secrets Manager secrets?
> I see these secrets in AWS_SECRETS_MANAGER

Sorry for the late reply. CRCR_CALLBACK_URL is generated automatically after the Lambda is deployed, and it will be set as the default value (see pytorch/test-infra#8133). hud_api_url and hud_bot_key need to be set as GitHub secrets in the pytorch/ci-infra repo so that they are propagated automatically to the correct place in AWS.

cc @subinz1 @zxiiro @atalman

Collaborator

I've opened a PR against the test-infra repo to set a default for callback-url in the CRCR action, as a convenience for downstream repos. Downstream repos shouldn't have to fill in callback-url themselves.
pytorch/test-infra#8133

cc @zxiiro @atalman

sleep $(( (RANDOM % 5) + 2 ))
echo "Build complete."

- name: Simulate test execution
Collaborator

Do we also want to test unsupported returns here?

Collaborator Author

Good call — added in e440f5b. The test step now rolls a d20 at the start:

  • Roll 0 (~5%): reports cancelled (skips test execution entirely)
  • Roll 1 (~5%): reports timed_out (runs the sleep, then reports timeout)
  • Roll 2–19 (~90%): normal pass/fail based on randomized test counts

This exercises the HUD's rendering for all conclusion states (success, failure, cancelled, timed_out).
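
The roll-to-conclusion mapping above can be sketched roughly as follows. This is a Python illustration rather than the workflow's actual shell step: the function name pick_conclusion and the rule that any failed test yields "failure" are assumptions made for the example. The roll ranges and conclusion names come from the comment.

```python
import random


def pick_conclusion(roll: int, failed_tests: int) -> str:
    """Map a roll in 0..19 to a conclusion, following the split described above."""
    if not 0 <= roll <= 19:
        raise ValueError("roll must be in 0..19")
    if roll == 0:
        return "cancelled"  # ~5%: test execution is skipped entirely
    if roll == 1:
        return "timed_out"  # ~5%: the sleep runs, then a timeout is reported
    # Rolls 2-19 (~90%): normal pass/fail from the randomized test counts.
    # Assumption: a single failed test is enough to report "failure".
    return "failure" if failed_tests > 0 else "success"


if __name__ == "__main__":
    roll = random.randrange(20)  # uniform over 0..19, one value per run
    print(roll, pick_conclusion(roll, failed_tests=3))
```

Keeping the roll as an explicit argument makes each of the four conclusion states easy to exercise deterministically, without waiting for a 1-in-20 draw.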

~5% of runs now simulate a 'cancelled' conclusion and ~5% simulate
'timed_out', exercising the HUD's rendering for all supported states.
The remaining ~90% of runs use the normal pass/fail logic.
on:
  repository_dispatch:
    types:
      - pull_request
Collaborator

Do we also need to test push type here?

Collaborator Author

For this test workflow, we scoped it to pull_request only since the HUD currently displays results at the PR level. Should we add push type testing here or in a separate workflow?

Collaborator

If the HUD only displays PR-level results, testing pull_request only is fine.

pytorch/test-infra#8133 sets the production Lambda URL as the default
for callback-url in the composite action, so downstream repos no
longer need to pass it explicitly or configure CRCR_CALLBACK_URL as
a repository secret.
Collaborator Author

subinz1 commented Jun 2, 2026

Thanks @KarhouTam — this aligns with RFC-0050's design goal that downstream repos shouldn't need to manage relay integration details like callback URLs. Updated our test workflow in pytorch/crcr-test#4 (commit ee810f9) to drop the explicit callback-url and rely on this default.

@atalman atalman merged commit 69b0f00 into main Jun 2, 2026
1 check passed
8 participants