Add L2 CI workflow with simulated tests and HUD callbacks #4
Add a new oot-l2-ci.yml workflow (separate from the existing L1 dispatch receiver) that demonstrates the full L2 pipeline:
- Listens only to pull_request dispatches (not push)
- Sends in_progress callback at job start via the composite action
- Checks out both this repo and pytorch/pytorch at the PR head SHA
- Generates randomized test counts (100-500 total, 0-11% fail rate, 0-4% skip rate) and execution duration (15-60s) per run
- Simulates build and test execution with realistic log output
- Sends completed callback with conclusion and test-results payload
- Reports artifact URL pointing to the workflow run

Every dispatch produces different numbers, so repeated runs create varied data in the OOT HUD dashboard. The existing crcr-dispatch-receiver.yml (L1) is left unchanged.
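As a rough illustration of the randomized numbers described above, here is a minimal bash sketch. It is not the code from oot-l2-ci.yml: the function names (rand_between, gen_test_counts, conclusion_for) and the output layout are assumptions made for this example, and only the ranges come from the PR description.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and output format are hypothetical,
# not taken from the actual workflow.

# rand_between LO HI: print an integer in [LO, HI] using the shell's $RANDOM.
rand_between() {
  local lo=$1 hi=$2
  echo $(( lo + RANDOM % (hi - lo + 1) ))
}

# gen_test_counts: print "total passed failed skipped duration_seconds".
# Ranges follow the PR description: 100-500 tests, 0-11% failures,
# 0-4% skips, 15-60s duration.
gen_test_counts() {
  local total fail_pct skip_pct failed skipped passed duration
  total=$(rand_between 100 500)
  fail_pct=$(rand_between 0 11)
  skip_pct=$(rand_between 0 4)
  failed=$(( total * fail_pct / 100 ))    # integer division rounds down
  skipped=$(( total * skip_pct / 100 ))
  passed=$(( total - failed - skipped ))
  duration=$(rand_between 15 60)
  echo "$total $passed $failed $skipped $duration"
}

# conclusion_for FAILED: "failure" if any test failed, otherwise "success".
conclusion_for() {
  if [ "$1" -gt 0 ]; then echo failure; else echo success; fi
}
```

A step could then call `read -r total passed failed skipped duration <<< "$(gen_test_counts)"` and pass `$(conclusion_for "$failed")` to the completed callback.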
| if: always() | ||
| uses: pytorch/test-infra/.github/actions/cross-repo-ci-relay-callback@main | ||
| with: | ||
| callback-url: ${{ secrets.CRCR_CALLBACK_URL }} |
@subinz1 Can we make sure this CRCR_CALLBACK_URL is set?
Yes — CRCR_CALLBACK_URL is not currently set as a repository secret on pytorch/crcr-test. I verified via gh secret list --repo pytorch/crcr-test (returns empty).
This secret should be the Lambda Function URL for the crcr-callback-prod Lambda, deployed via the Terraform config in pytorch/ci-infra#614 (aws_lambda_function_url.callback). The value can be obtained by either:
- Running terraform output callback_function_url in the crcr/ workspace, or
- Looking it up in the AWS Console → Lambda → crcr-callback-prod → Function URL
@atalman or @can-gaa-hou — could one of you set this secret on pytorch/crcr-test? This is a prerequisite for the L2 callback action to work.
I don't have access to the AWS Lambda, but I think @fffrog or @zxiiro does. Furthermore, deploying the Result/Callback Lambda needs hud_api_url and hud_bot_key. I guess these two secrets haven't been set yet, so the Lambda isn't ready yet? Can @zxiiro confirm that? Thanks!
@subinz1 and @can-gaa-hou, I started the deployment of the Lambda: https://github.com/pytorch/ci-infra/actions/runs/26762481905
Thank you very much @atalman!
I see the deployment completed successfully. Could you please set the CRCR callback URL as a secret, and also set hud_api_url and hud_bot_key, as per @can-gaa-hou's comment?
Hello @zxiiro,
- CRCR_CALLBACK_URL goes in GitHub repo secrets on pytorch/crcr-test — this is the Lambda Function URL from the deployment
- HUD_API_URL and HUD_BOT_KEY go in AWS Secrets Manager — these are used by the Lambda itself to forward results to the HUD
- HUD_BOT_KEY needs to be set to the OOT_RELAY_TOKEN value from the HUD's Vercel environment, and @atalman can provide that, since we don't have access to Vercel
Where do those secrets need to go? Are they GitHub Secrets or AWS Secrets Manager secrets?
I see these secrets in AWS_SECRETS_MANAGER
Sorry for the late reply. CRCR_CALLBACK_URL is generated automatically after the Lambda is deployed, and it will be set as the default value (see pytorch/test-infra#8133). hud_api_url and hud_bot_key need to be set as GitHub secrets in the pytorch/ci-infra repo so that they are automatically propagated to the correct place in AWS.
I've opened a PR against the test-infra repo that sets a default for callback-url in the CRCR action, for the convenience of downstream repos. Downstream repos shouldn't have to fill in callback-url themselves.
pytorch/test-infra#8133
| sleep $(( (RANDOM % 5) + 2 )) | ||
| echo "Build complete." | ||
| - name: Simulate test execution |
Do we also want to test unsupported returns here?
Good call, added in e440f5b. The test step now rolls a d20 at the start:
- Roll 0 (~5%): reports cancelled (skips test execution entirely)
- Roll 1 (~5%): reports timed_out (runs the sleep, then reports timeout)
- Roll 2–19 (~90%): normal pass/fail based on randomized test counts
This exercises the HUD's rendering for all conclusion states (success, failure, cancelled, timed_out).
~5% of runs now simulate a 'cancelled' conclusion and ~5% simulate 'timed_out', exercising the HUD's rendering for all supported states. The remaining ~90% of runs use the normal pass/fail logic.
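The roll-to-conclusion mapping described above could look roughly like the following bash sketch. This is an illustration rather than the code from commit e440f5b: the function names roll_d20 and conclusion_for_roll are hypothetical, and the failed-test count is assumed to come from an earlier step.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not taken from the workflow.

# roll_d20: print an integer in [0, 19].
roll_d20() {
  echo $(( RANDOM % 20 ))
}

# conclusion_for_roll ROLL FAILED: map a roll and a failed-test count
# to one of the four conclusion states.
conclusion_for_roll() {
  local roll=$1 failed=$2
  case "$roll" in
    0) echo cancelled ;;   # ~5%: test execution is skipped entirely
    1) echo timed_out ;;   # ~5%: tests "run", then a timeout is reported
    *) if [ "$failed" -gt 0 ]; then echo failure; else echo success; fi ;;
  esac
}
```

For example, `conclusion_for_roll "$(roll_d20)" "$failed"` yields the value to send in the completed callback.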
| on: | ||
| repository_dispatch: | ||
| types: | ||
| - pull_request |
Do we also need to test push type here?
For this test workflow, we scoped it to pull_request only since the HUD currently displays results at the PR level. Should we add push type testing here or in a separate workflow?
If the HUD only displays results at the PR level, testing pull_request only is fine.
pytorch/test-infra#8133 sets the production Lambda URL as the default for callback-url in the composite action, so downstream repos no longer need to pass it explicitly or configure CRCR_CALLBACK_URL as a repository secret.
Thanks @KarhouTam, this aligns with RFC-0050's design goal that downstream repos shouldn't need to manage relay integration details like callback URLs. Updated our test workflow in pytorch/crcr-test#4 (commit ee810f9) to drop the explicit callback-url and rely on this default.

Summary
- oot-l2-ci.yml workflow (the existing crcr-dispatch-receiver.yml is untouched)
- repository_dispatch events (same pull_request/push types)
- in_progress callback at job start, completed callback at job end
- test-results JSON and artifact-url in the completed callback

What it validates
- id-token: write
- cross-repo-ci-relay-callback composite action (in_progress + completed)

Test plan
- pytorch/crcr-test (via a PR or push to pytorch/pytorch)
- OOT L2 workflow runs and completes
- torchci-oot-workflow-job table for the new record
- default.oot_workflow_job for replicated data