GPU: Initial refactoring commit #51

Draft
aditigaur4 wants to merge 5 commits into main from gpu_refactor

@aditigaur4
Collaborator

  • Set policy only for XIDs and remove policy as a top level healthcheck to avoid duplicating results.
  • Consolidate background healthcheck results to avoid duplication.
  • Control a golden list of deduplicated errors that identify unique issues in background healthchecks.
  • Use field watches for other issues.
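
For reviewers, the consolidation idea in the list above could be sketched roughly as follows. This is only an illustration, not the code in this PR: `GOLDEN_ERRORS`, `Finding`, `consolidate`, and the error keys are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical golden list: each key is assumed to identify one unique issue.
GOLDEN_ERRORS = frozenset({"xid_79", "ecc_dbe", "nvlink_down"})


@dataclass(frozen=True)
class Finding:
    gpu_index: int   # GPU the finding was reported against
    error_key: str   # normalized error identifier (hypothetical format)
    source: str      # which check reported it, e.g. "policy" or "background"


def consolidate(findings):
    """Collapse findings so each (GPU, golden error) pair appears once.

    Findings whose error_key is not in the golden list are dropped here;
    in this sketch they are assumed to be handled elsewhere.
    """
    merged = {}
    for f in findings:
        if f.error_key not in GOLDEN_ERRORS:
            continue
        merged.setdefault((f.gpu_index, f.error_key), set()).add(f.source)
    # Sort the sources so the output is deterministic.
    return {key: sorted(sources) for key, sources in merged.items()}
```

With this shape, the same error reported by two different checks on the same GPU yields a single entry that lists both sources, instead of two separate results.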

@aditigaur4 aditigaur4 marked this pull request as draft April 17, 2026 00:47
@aditigaur4 aditigaur4 force-pushed the gpu_refactor branch 3 times, most recently from d55af03 to d313f91 Compare April 21, 2026 23:21
@aditigaur4 aditigaur4 changed the title Initial refactoring commit GPU: Initial refactoring commit Apr 29, 2026
aditigaur4 added 5 commits May 1, 2026 18:25
- Set policy only for XIDs and remove policy as a top level healthcheck
to avoid duplicating results.
- Consolidate background healthcheck results to avoid duplication.
- Control a golden list of deduplicated errors that identify unique
issues in background healthchecks.
- Use field watches for other issues.
This allows running epilog checks only on GPUs that have
been allocated to the job.

Also allows configurability of the diagnostic tests.

Makes GPU memory allocation a prolog check.