
test(e2e): add GPU workload image artifacts#1484

Open
elezar wants to merge 4 commits into main from feat/1476-gpu-workload-images/elezar

Conversation


@elezar elezar commented May 20, 2026

Member

🏗️ build-from-issue-agent

Summary

Define local GPU workload image artifacts for smoke-pass, smoke-fail, and cuda-basic validation. The build task supports Docker or Podman through the existing container-engine helper, tags images with the OpenShell source revision plus an external-input fingerprint, and writes latest local refs plus a workload manifest for downstream tooling.

Related Issue

Closes #1476

Changes

  • e2e/gpu/images/smoke-pass: adds a positive marker-only workload image.
  • e2e/gpu/images/smoke-fail: adds a stable negative-path workload image.
  • e2e/gpu/images/cuda-basic: builds CUDA samples deviceQuery and vectorAdd from NVIDIA/cuda-samples v12.8, copies the binaries into the OpenShell community base image, and runs both validations.
  • tasks/scripts/e2e-gpu-build-images.sh: builds only the supported workload image set, supports subset selection, records source and external build inputs, labels images, writes latest.env, and emits workloads.yaml.
  • tasks/test.toml: adds mise run e2e:workloads:build.
  • e2e/gpu/README.md: documents the workload image contract, build task, direct validation flow, current mise run e2e:gpu behavior, and the fact that the generated manifest is not consumed by the current Rust GPU target yet.
  • .gitignore: ignores generated GPU workload build metadata.
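
The tagging scheme described above (source revision plus a fingerprint of external inputs) could look roughly like the following. This is a minimal sketch, not the contents of tasks/scripts/e2e-gpu-build-images.sh: the function names, the `<revision>-<fingerprint>` layout, the 8-character truncation, and the example input values are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: derive an image tag from a source revision and a
# fingerprint of external build inputs. All names here are illustrative.
set -euo pipefail

# Hash the inputs, one per line, and keep the first 8 hex characters.
input_fingerprint() {
  printf '%s\n' "$@" | sha256sum | cut -c1-8
}

# Compose "<revision>-<fingerprint>" (assumed layout).
workload_tag() {
  local revision="$1"
  shift
  printf '%s-%s\n' "${revision}" "$(input_fingerprint "$@")"
}

# Placeholder inputs, not the values used by the PR.
workload_tag "785872b4" "example.invalid/cuda:devel" "example.invalid/base:latest" "v12.8"
```

Because the fingerprint depends only on the input strings, the same revision built against a different base image or samples ref yields a different tag, which is the traceability property the PR describes.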

Review Follow-up

  • Removed README text that described a manifest-driven Rust runner that is not part of this PR.
  • Restricted image discovery to the supported image directories instead of discovering arbitrary Dockerfiles with incomplete metadata support.
  • Added traceability for mutable external CUDA/base inputs through tag fingerprinting, labels, env output, and manifest fields.
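
Restricting the build to a fixed workload set, and failing with the supported list on anything else, can be sketched as below. This is an illustration under assumptions rather than the script's actual code: the array and function names are hypothetical, and only the three workload names come from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: accept only workloads from a fixed supported set.
set -euo pipefail

SUPPORTED_WORKLOADS=(smoke-pass smoke-fail cuda-basic)

# Return 0 if the name is supported; otherwise print the supported list
# to stderr and return 1.
require_supported_workload() {
  local requested="$1" name
  for name in "${SUPPORTED_WORKLOADS[@]}"; do
    if [[ "${name}" == "${requested}" ]]; then
      return 0
    fi
  done
  echo "unsupported workload '${requested}'; supported: ${SUPPORTED_WORKLOADS[*]}" >&2
  return 1
}

require_supported_workload "cuda-basic" && echo "cuda-basic is supported"
```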

Testing

  • bash -n tasks/scripts/e2e-gpu-build-images.sh
  • git diff --check
  • Unsupported workload selection fails with the supported list
  • mise run pre-commit
  • Previous local GPU image/direct Docker/CDI validation is recorded in the PR comments
  • Previous local mise run e2e:gpu validation is recorded in the PR comments

Tests added:

  • Unit: N/A - this PR adds image artifacts and task wiring.
  • Integration: N/A.
  • E2E: Adds GPU workload image artifacts intended for e2e validation; direct Docker/CDI validation and the existing Docker GPU e2e lane were run locally and recorded in PR comments.

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)

Documentation updated:

  • e2e/gpu/README.md: workload contract, build task, generated manifest, current GPU e2e target, and direct validation.
  • e2e/gpu/images/*/README.md: per-image purpose, build, and direct-run instructions.


elezar commented May 20, 2026

Member Author

🏗️ build-from-issue-agent

E2E Test Attestation

Local E2E tests passed. CI does not currently run this GPU host validation, so this comment records the local run.

| Field | Value |
| --- | --- |
| Commit | 785872b41279978e5d95c721b6785c776287702d |
| Command | mise run e2e:gpu |
| Gateway mode | Docker |
| Result | PASS |

Test Summary

7 passed; 0 failed; 0 ignored; finished in 5.92s

Tests Executed

  • gpu_device_selection::gpu_invalid_device_request_fails - PASSED
  • gpu_device_selection::parse_cdi_gpu_device_ids_ignores_unexpected_nested_devices - PASSED
  • gpu_device_selection::parse_cdi_gpu_device_ids_reads_discovered_devices - PASSED
  • gpu_device_selection::parse_cdi_gpu_device_ids_reads_lowercase_host_discovered_devices - PASSED
  • gpu_device_selection::gpu_request_without_device_matches_plain_all_gpu_container - PASSED
  • gpu_device_selection::gpu_all_device_request_matches_plain_all_gpu_container - PASSED
  • gpu_device_selection::gpu_request_for_each_discovered_device_matches_plain_container - PASSED

Direct Image Validation

  • mise run e2e:gpu:images:build built localhost/openshell/gpu-workload-smoke-pass:785872b4, localhost/openshell/gpu-workload-smoke-fail:785872b4, and localhost/openshell/gpu-workload-cuda-basic:785872b4.
  • docker run --rm localhost/openshell/gpu-workload-smoke-pass:785872b4 passed.
  • docker run --rm localhost/openshell/gpu-workload-smoke-fail:785872b4 exited 42 with the expected failure marker.
  • docker run --rm --device nvidia.com/gpu=all localhost/openshell/gpu-workload-cuda-basic:785872b4 passed on the local NVIDIA L4 host.
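
The direct validation above relies on exit codes: zero for the passing images and 42 for smoke-fail. A small helper that asserts an expected exit status might look like this. It is a sketch only; `expect_exit` is a hypothetical name, and the demonstration uses `sh -c 'exit 42'` as a stand-in so it runs without Docker or a GPU.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: run a command and compare its exit status with an
# expected value. The helper name is illustrative.
set -euo pipefail

expect_exit() {
  local expected="$1" actual=0
  shift
  # Capture the status without tripping `set -e`.
  "$@" || actual=$?
  if [[ "${actual}" -ne "${expected}" ]]; then
    echo "expected exit ${expected}, got ${actual}: $*" >&2
    return 1
  fi
}

# Stand-in for the container run: a shell that exits 42.
expect_exit 42 sh -c 'exit 42' && echo "negative path behaved as expected"
```

With the images built locally, `expect_exit 42 docker run --rm localhost/openshell/gpu-workload-smoke-fail:785872b4` would mirror the smoke-fail check recorded above.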


elezar commented Jun 1, 2026

Member Author

At the time this was posted, I was considering moving the image definitions out of this repo to reduce churn. That is no longer the active path for this PR. Review and merge readiness should be evaluated against the changes in this PR directly.

@elezar elezar closed this Jun 1, 2026
@elezar elezar deleted the feat/1476-gpu-workload-images/elezar branch June 1, 2026 21:19
@elezar elezar restored the feat/1476-gpu-workload-images/elezar branch June 3, 2026 11:01
@elezar elezar reopened this Jun 3, 2026
@elezar elezar force-pushed the feat/1476-gpu-workload-images/elezar branch from 7c6557a to 5cc2d92 Compare June 3, 2026 13:28
@elezar elezar force-pushed the feat/1476-gpu-workload-images/elezar branch from 5cc2d92 to efe4d25 Compare June 4, 2026 12:56
@elezar elezar added the gator:in-review (Gator is reviewing or awaiting PR review feedback) and test:e2e-gpu (Requires GPU end-to-end coverage) labels Jun 10, 2026

elezar commented Jun 10, 2026

Member Author

gator-agent

PR Review Status

Validation: this is maintainer-authored, project-valid GPU E2E infrastructure work for #1476, adding local GPU workload image artifacts and build task wiring.
Head SHA: efe4d25d9c3743c9caf26680ae7f3e2e11e24f14

Review findings:

  • Warning: e2e/gpu/README.md documents a manifest-driven Rust E2E runner that reads OPENSHELL_E2E_WORKLOAD_MANIFEST, defaults to .build/workloads.yaml, skips when missing, and enforces expect, but the PR does not add that runner and the existing GPU test target is gpu_device_selection, not gpu. Please either remove that section until the runner lands or add the parser/test target in this PR.
  • Warning: tasks/scripts/e2e-gpu-build-images.sh discovers arbitrary Dockerfiles under e2e/gpu/images, but metadata generation only supports the three hard-coded image names. Please either restrict discovery to the supported list or add per-image metadata for expectation/env-var generation.
  • Warning: the CUDA workload image uses mutable image/tag inputs and a GitHub tag tarball without checksum verification, while the local output tag is only the OpenShell source SHA. If these become CI or published validation artifacts, please pin external inputs by digest/commit and checksum, or include those refs in the generated manifest/tag.
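
The checksum verification requested in the last finding could be sketched as follows. This is not part of the PR: `verify_sha256` is a hypothetical helper, and the demonstration checks an empty temporary file (whose SHA-256 is well known) instead of the CUDA samples archive, whose checksum is not given in this conversation.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: verify a downloaded file against a pinned SHA-256.
set -euo pipefail

verify_sha256() {
  local file="$1" expected="$2" actual
  actual="$(sha256sum "${file}" | cut -d' ' -f1)"
  if [[ "${actual}" != "${expected}" ]]; then
    echo "checksum mismatch for ${file}: expected ${expected}, got ${actual}" >&2
    return 1
  fi
}

# Demonstration on an empty temporary file.
tmp="$(mktemp)"
verify_sha256 "${tmp}" "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" \
  && echo "checksum ok"
rm -f "${tmp}"
```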

Docs: this is E2E infrastructure documentation, not a direct Fern UX/API change.
E2E: test:e2e-gpu is being applied because this adds GPU validation workload artifacts.

Next state: gator:in-review

@github-actions


Label test:e2e-gpu applied for efe4d25. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute GPU E2E after building the required supervisor image once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

elezar added 2 commits June 10, 2026 22:27
Closes #1476

Define local GPU workload image sources for smoke-pass, smoke-fail, and cuda-basic validation, plus a mise build task that tags images with the source revision and records the latest local image refs.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the feat/1476-gpu-workload-images/elezar branch from efe4d25 to de40d64 Compare June 10, 2026 20:45
elezar added 2 commits June 10, 2026 22:49
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>

elezar commented Jun 10, 2026

Member Author

gator-agent

Re-check After Author Update

I re-evaluated latest head de40d6411b3b0129840111578343db04059d9ff7 after the new commit test(e2e): address gpu workload review feedback.

Disposition: resolved.

Remaining items:

  • No blocking review items remain. The README no longer claims the checked-in Rust GPU E2E target consumes the generated workload manifest, the build script is now restricted to the supported image set, and the manifest/build metadata records the external input refs for local debugging.
  • Nonblocking residual risk: the CUDA workload still uses mutable external image/tag inputs and downloads the CUDA samples archive without checksum verification. This is acceptable for the current local-only artifact flow; if these images become CI gates or published validation artifacts, require digest-pinned base/build images and a commit plus checksum for CUDA samples.
  • OpenShell / Branch Checks and OpenShell / GPU E2E are pending for the current head.

Next state: gator:watch-pipeline

@elezar elezar removed the gator:in-review (Gator is reviewing or awaiting PR review feedback) label Jun 10, 2026
@elezar elezar added the gator:watch-pipeline (Gator is monitoring PR CI/CD status) label Jun 10, 2026
@elezar elezar force-pushed the feat/1476-gpu-workload-images/elezar branch from de40d64 to 8426fac Compare June 10, 2026 20:54

elezar commented Jun 10, 2026

Member Author

Updated PR #1484 on top of current main.

Gator follow-up in 8426fac5:

  • README no longer documents a manifest-driven Rust runner that is not in this PR. It now points at the current mise run e2e:gpu / gpu_device_selection path and states that the generated manifest is not consumed yet.
  • tasks/scripts/e2e-gpu-build-images.sh now restricts image discovery to the supported workload set: smoke-pass, smoke-fail, and cuda-basic.
  • The build script now records external CUDA/base image inputs via an input fingerprint, image labels, latest.env, and workloads.yaml, so local tags and generated metadata reflect mutable input choices.

Also rewrote the branch to add the missing DCO sign-off on the existing README-lint commit. Validation on the updated branch: bash -n tasks/scripts/e2e-gpu-build-images.sh, git diff --check, unsupported workload selection negative path, and mise run pre-commit.


elezar commented Jun 10, 2026

Member Author

gator-agent

Maintainer Approval Needed

Gator validation and PR monitoring are complete for latest head 8426fac56a61753b58dcd829c243e81febd6bd27.

Validation: maintainer-authored GPU E2E infrastructure work for #1476, adding local GPU workload image artifacts and build task wiring.
Review: I re-checked after elezar's 2026-06-10 update; the earlier review feedback is resolved. No blocking review items remain. Residual note: if these workload images become CI gates or published artifacts, require pinned external inputs and checksum verification.
Docs: the GPU E2E workload documentation was updated in e2e/gpu/README.md and per-image READMEs; this is not a Fern UX/API documentation gate.
Checks: OpenShell / Branch Checks, OpenShell / Helm Lint, DCO, and OpenShell / GPU E2E are green for the current head.
E2E: test:e2e-gpu is applied and the GPU E2E required gate is green.

Human maintainer approval or merge decision is now required.

@elezar elezar added the gator:approval-needed (Gator completed review; maintainer approval needed) label and removed the gator:watch-pipeline (Gator is monitoring PR CI/CD status) label Jun 10, 2026

Labels

  • gator:approval-needed: Gator completed review; maintainer approval needed
  • test:e2e-gpu: Requires GPU end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

test(e2e): define GPU validation image artifacts

1 participant