Add L2 CI workflow with simulated tests and HUD callbacks by subinz1 · Pull Request #4 · pytorch/crcr-test

subinz1 · 2026-06-01T04:46:03Z

Summary

- Adds a new oot-l2-ci.yml workflow (the existing crcr-dispatch-receiver.yml is untouched)
- Receives repository_dispatch events (pull_request type only)
- Sends in_progress callback at job start, completed callback at job end
- Generates randomized per-run test data:
  - 100–500 total tests
  - 0–11% failure rate, 0–4% skip rate
  - 15–60s simulated execution duration
- Reports test-results JSON and artifact-url in the completed callback
- Every dispatch produces different numbers → varied data flows through DynamoDB → ClickHouse → HUD
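
The randomized per-run data described above can be sketched roughly as follows. This is an illustrative Python stand-in, not the workflow's actual shell step; the function name `generate_test_results` and the dictionary keys are hypothetical, and only the numeric ranges come from the summary.

```python
import random


def generate_test_results(rng: random.Random) -> dict:
    """Draw one run's simulated test counts and duration (hypothetical field names)."""
    total = rng.randint(100, 500)        # 100-500 total tests
    fail_rate = rng.uniform(0.0, 0.11)   # 0-11% failure rate
    skip_rate = rng.uniform(0.0, 0.04)   # 0-4% skip rate

    # Truncate so the counts never exceed the stated rates.
    failed = int(total * fail_rate)
    skipped = int(total * skip_rate)
    passed = total - failed - skipped

    return {
        "total": total,
        "passed": passed,
        "failed": failed,
        "skipped": skipped,
        "duration_s": rng.randint(15, 60),  # 15-60s simulated execution
    }


if __name__ == "__main__":
    # An unseeded generator gives different numbers on every run.
    print(generate_test_results(random.Random()))
```

Passing the generator in explicitly makes the sketch reproducible with a fixed seed, while an unseeded `random.Random()` mimics the "every dispatch produces different numbers" behavior.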

What it validates

- OIDC token minting via id-token: write
- cross-repo-ci-relay-callback composite action (in_progress + completed)
- End-to-end data flow: GHA → Lambda → DynamoDB → ClickHouse replicator → HUD API
- Dashboard displays varied pass rates, durations, and failure counts

Test plan

- Trigger a dispatch to pytorch/crcr-test (via a PR or push to pytorch/pytorch)
- Verify the OOT L2 workflow runs and completes
- Check DynamoDB torchci-oot-workflow-job table for the new record
- Check ClickHouse default.oot_workflow_job for replicated data
- Verify https://hud.pytorch.org/oot shows the new entry

Add a new oot-l2-ci.yml workflow (separate from the existing L1 dispatch receiver) that demonstrates the full L2 pipeline:
- Listens only to pull_request dispatches (not push)
- Sends in_progress callback at job start via the composite action
- Checks out both this repo and pytorch/pytorch at the PR head SHA
- Generates randomized test counts (100-500 total, 0-11% fail rate, 0-4% skip rate) and execution duration (15-60s) per run
- Simulates build and test execution with realistic log output
- Sends completed callback with conclusion and test-results payload
- Reports artifact URL pointing to the workflow run

Every dispatch produces different numbers, so repeated runs create varied data in the OOT HUD dashboard. The existing crcr-dispatch-receiver.yml (L1) is left unchanged.

jewelkm89 · 2026-06-01T09:54:14Z

+        if: always()
+        uses: pytorch/test-infra/.github/actions/cross-repo-ci-relay-callback@main
+        with:
+          callback-url: ${{ secrets.CRCR_CALLBACK_URL }}


@subinz1
Can we make sure this CRCR_CALLBACK_URL is set?

Yes — CRCR_CALLBACK_URL is not currently set as a repository secret on pytorch/crcr-test. I verified via gh secret list --repo pytorch/crcr-test (returns empty).

This secret should be the Lambda Function URL for the crcr-callback-prod Lambda, deployed via the Terraform config in pytorch/ci-infra#614 (aws_lambda_function_url.callback). The value can be obtained by either:

- Running terraform output callback_function_url in the crcr/ workspace, or
- Looking it up in the AWS Console → Lambda → crcr-callback-prod → Function URL

@atalman or @can-gaa-hou — could one of you set this secret on pytorch/crcr-test? This is a prerequisite for the L2 callback action to work.

I don't have access to the AWS Lambda, but I think @fffrog or @zxiiro do. Furthermore, deploying the Result/Callback Lambda requires hud_api_url and hud_bot_key. I guess these two secrets haven't been set yet, so the Lambda isn't ready yet? Can @zxiiro confirm that? Thanks!

@subinz1 and @can-gaa-hou I started the deployment of the Lambda: https://github.com/pytorch/ci-infra/actions/runs/26762481905

Thank you very much @atalman!
I see the deployment completed successfully. Could you please set the CRCR callback URL as a secret, and also set hud_api_url and hud_bot_key, per @can-gaa-hou's comment?

Where do those secrets need to go? Are they GitHub Secrets or AWS Secrets Manager secrets?

I see these secrets in AWS_SECRETS_MANAGER

However I'll note that HUD_BOT_KEY is currently empty so whoever added it didn't set it to any value.

Hello @zxiiro,

- CRCR_CALLBACK_URL goes in GitHub repo secrets on pytorch/crcr-test. This is the Lambda Function URL from the deployment.
- HUD_API_URL and HUD_BOT_KEY go in AWS Secrets Manager. These are used by the Lambda itself to forward results to the HUD.
- HUD_BOT_KEY needs to be set to the OOT_RELAY_TOKEN value from the HUD's Vercel environment. @atalman can provide that, since we don't have access to Vercel.

CC @groenenboomj @jewelkm89

> Where do those secrets need to go? Are they GitHub Secrets or AWS Secrets Manager secrets?
> I see these secrets in AWS_SECRETS_MANAGER

Sorry for the late reply. CRCR_CALLBACK_URL is generated automatically after the Lambda is deployed, and it will be set as the default value (see pytorch/test-infra#8133). hud_api_url and hud_bot_key need to be set as GitHub secrets in the pytorch/ci-infra repo so that they are automatically propagated to the correct place in AWS.

cc @subinz1 @zxiiro @atalman

I've opened a PR against the test-infra repo to set a default for callback-url in the CRCR action, as a convenience for downstream repos. Downstream repos shouldn't need to fill in callback-url themselves.
pytorch/test-infra#8133

cc @zxiiro @atalman

groenenboomj · 2026-06-01T15:30:26Z

+          sleep $(( (RANDOM % 5) + 2 ))
+          echo "Build complete."
+
+      - name: Simulate test execution


Do we also want to test unsupported returns here?

Good call — added in e440f5b. The test step now rolls a d20 at the start:

- Roll 0 (~5%): reports cancelled (skips test execution entirely)
- Roll 1 (~5%): reports timed_out (runs the sleep, then reports timeout)
- Roll 2–19 (~90%): normal pass/fail based on randomized test counts

This exercises the HUD's rendering for all conclusion states (success, failure, cancelled, timed_out).
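
The roll-to-conclusion mapping above could look something like the following Python sketch. It is not the workflow's actual shell code: the names `roll_d20` and `pick_conclusion` are hypothetical, and treating any non-zero failed count as a `failure` conclusion is an assumption about what "normal pass/fail" means here.

```python
import random


def roll_d20(rng: random.Random) -> int:
    """Return a roll in 0..19, each value with probability 1/20 (5%)."""
    return rng.randrange(20)


def pick_conclusion(roll: int, failed: int) -> str:
    """Map a d20 roll and a failed-test count to a conclusion string."""
    if not 0 <= roll <= 19:
        raise ValueError(f"roll out of range: {roll}")
    if roll == 0:
        return "cancelled"   # ~5%: test execution is skipped entirely
    if roll == 1:
        return "timed_out"   # ~5%: the sleep still runs, then a timeout is reported
    # ~90%: assumed rule, any failed test makes the run a failure
    return "failure" if failed > 0 else "success"


if __name__ == "__main__":
    rng = random.Random()
    print(pick_conclusion(roll_d20(rng), failed=rng.randint(0, 5)))
```

Keeping the roll separate from the mapping makes each of the four conclusion states easy to force in a test, rather than waiting for a 1-in-20 event.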

can-gaa-hou · 2026-06-02T01:50:51Z

+on:
+  repository_dispatch:
+    types:
+      - pull_request


Do we also need to test push type here?

For this test workflow, we scoped it to pull_request only since the HUD currently displays results at the PR level. Should we add push type testing here or in a separate workflow?

If the HUD only displays results at the PR level, testing pull_request only is fine.

subinz1 · 2026-06-02T16:03:12Z

Thanks @KarhouTam — this aligns with RFC-0050's design goal that downstream repos shouldn't need to manage relay integration details like callback URLs. Updated our test workflow in pytorch/crcr-test#4 (commit ee810f9) to drop the explicit callback-url and rely on this default.

meta-cla Bot added the cla signed label Jun 1, 2026

subinz1 marked this pull request as draft June 1, 2026 04:52

subinz1 force-pushed the add-l2-callbacks branch from 41617f4 to f62bbb4 Compare June 1, 2026 05:42

subinz1 changed the title ~~Add L2 callbacks to report CI results to PyTorch HUD~~ Add L2 CI workflow with simulated tests and HUD callbacks Jun 1, 2026

subinz1 requested review from groenenboomj and jewelkm89 June 1, 2026 05:43

subinz1 force-pushed the add-l2-callbacks branch from f62bbb4 to 8f4a42d Compare June 1, 2026 05:47

subinz1 marked this pull request as ready for review June 1, 2026 08:36

jewelkm89 reviewed Jun 1, 2026

jewelkm89 requested review from KarhouTam, atalman and can-gaa-hou June 1, 2026 09:59

groenenboomj reviewed Jun 1, 2026

Add randomized edge-case conclusions (cancelled, timed_out)

e440f5b

~5% of runs now simulate a 'cancelled' conclusion and ~5% simulate 'timed_out', exercising the HUD's rendering for all supported states. The remaining ~90% of runs use the normal pass/fail logic.

can-gaa-hou reviewed Jun 2, 2026

Remove explicit callback-url, use action default

ee810f9

pytorch/test-infra#8133 sets the production Lambda URL as the default for callback-url in the composite action, so downstream repos no longer need to pass it explicitly or configure CRCR_CALLBACK_URL as a repository secret.

malfet approved these changes Jun 2, 2026

jewelkm89 approved these changes Jun 2, 2026

atalman approved these changes Jun 2, 2026

atalman merged commit 69b0f00 into main Jun 2, 2026
1 check passed
