Ensure CUDA devices are usable before training #585

Open
rob-maron wants to merge 1 commit into main from cuda-smoke-test

Conversation

@rob-maron
Collaborator

@rob-maron rob-maron commented Feb 24, 2026

Closes #550

Adds a CUDA smoke test that verifies the requested devices are functional before training begins.

The test is designed to trigger cuDNN runtime compilation to catch issues early. Credit to @arilotter for the core test logic.
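
A rough sketch of what such a pre-training check could look like follows. It is not the code from this PR: `torch_conv_probe` and `smoke_test_devices` are hypothetical names, the sketch assumes PyTorch, and a small convolution is used only because convolutions go through cuDNN on CUDA devices. Whether this particular op triggers cuDNN runtime compilation depends on the cuDNN version and the kernel it selects.

```python
from typing import Callable, Dict, Iterable


def torch_conv_probe(device_index: int) -> None:
    """Run a tiny conv forward/backward pass on one CUDA device.

    Assumes PyTorch is installed; it is imported lazily so the harness
    below has no hard dependency on it.
    """
    import torch

    device = torch.device(f"cuda:{device_index}")
    x = torch.randn(2, 3, 16, 16, device=device, requires_grad=True)
    conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device)
    conv(x).sum().backward()
    # CUDA kernels run asynchronously; synchronizing surfaces errors here.
    torch.cuda.synchronize(device)


def smoke_test_devices(
    device_ids: Iterable[int],
    probe: Callable[[int], None] = torch_conv_probe,
) -> Dict[int, str]:
    """Run `probe` on every requested device and collect failures.

    Returns a mapping of device index to error message. An empty
    mapping means every device passed.
    """
    failures: Dict[int, str] = {}
    for idx in device_ids:
        try:
            probe(idx)
        except Exception as exc:  # report any failure instead of crashing
            failures[idx] = f"{type(exc).__name__}: {exc}"
    return failures
```

Calling `smoke_test_devices(range(torch.cuda.device_count()))` before the training loop and aborting when the result is non-empty would surface a broken device at startup rather than mid-run.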

One open question for discussion is whether this gates out older cards. I think it should run fine even on smaller Turing cards, but it might not on anything older.

@rob-maron rob-maron changed the title from "Ensure cuda devices are usable before training" to "Ensure CUDA devices are usable before training" Feb 24, 2026
Contributor

@ethernet8023 ethernet8023 left a comment

This LGTM. I'm not worried about RTX 20xx and older devices for now; I imagine we have a hundred other compatibility issues with them. We can return to this later if we decide to start supporting that hardware.

@rob-maron
Collaborator Author

Did some testing on a few different cards. As expected, it fails on T4s (Turing) but succeeds on H100s, A100s, and L4s.
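
For reference, the cards above differ in CUDA compute capability, which is one way to log what kind of device a failure came from. The sketch below is illustrative only: the PR runs an actual smoke test and does not gate on compute capability, and `ASSUMED_MIN_CAPABILITY` is a hypothetical cutoff inferred from these four results, not a documented requirement.

```python
from typing import Dict, Tuple

Capability = Tuple[int, int]

# Compute capabilities (major, minor) of the cards tested above.
KNOWN_CARDS: Dict[str, Capability] = {
    "T4": (7, 5),    # Turing
    "A100": (8, 0),  # Ampere
    "L4": (8, 9),    # Ada Lovelace
    "H100": (9, 0),  # Hopper
}

# Hypothetical cutoff inferred from the results above; not from the PR.
ASSUMED_MIN_CAPABILITY: Capability = (8, 0)


def likely_supported(
    capability: Capability,
    minimum: Capability = ASSUMED_MIN_CAPABILITY,
) -> bool:
    """Return True if `capability` is at or above the assumed cutoff."""
    # Tuples compare element by element, so (8, 9) >= (8, 0) is True.
    return tuple(capability) >= tuple(minimum)
```

With PyTorch, `torch.cuda.get_device_capability(i)` returns the same `(major, minor)` pair for device `i`, so the helper could be fed real values when reporting a smoke-test failure.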

@rob-maron rob-maron force-pushed the cuda-smoke-test branch 2 times, most recently from 8d424fc to 0363f80 Compare March 10, 2026 15:49
@rob-maron rob-maron closed this Mar 11, 2026
@rob-maron rob-maron reopened this Mar 11, 2026

Development

Successfully merging this pull request may close these issues.

client should attempt cuda operations early to cause early failure if there's a problem

2 participants