[ROCm7.0]: Error handling logic change #860

@iupaikov-amd

Problem Description

Since ROCm 7.x, the error behavior has been changed to match its CUDA counterpart. This is a follow-up issue for PR #859.

It would be great if someone on the Triton team with more repository knowledge than me could go through the code that handles HIP calls and see where this change could also lead to a broken state. A potential candidate is driver.c in third_party/amd. I didn't make changes there because I'm not sure when it is used at all. It would probably be better to wrap all HIP calls in a wrapper and handle discarding the error there.
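To illustrate the wrapper idea, here is a minimal sketch. It assumes the ROCm 7.x change means a failed call leaves its error recorded until it is read with `hipGetLastError()`, as in CUDA. The `mock_*` names and `call_discard_error` are hypothetical stand-ins so that the sketch compiles without ROCm; in real code they would map to `hipError_t`, `hipSuccess`, and `hipGetLastError()`.

```c
#include <assert.h>

/* Hypothetical stand-ins for hipError_t / hipSuccess / an error code. */
typedef int mock_error_t;
enum { MOCK_SUCCESS = 0, MOCK_ERROR_INVALID_VALUE = 1 };

/* Last-error slot. Assumption: a later successful call does not clear it. */
static mock_error_t g_last_error = MOCK_SUCCESS;

/* Stand-in for hipGetLastError(): returns the recorded error and resets it. */
static mock_error_t mock_get_last_error(void) {
    mock_error_t err = g_last_error;
    g_last_error = MOCK_SUCCESS;
    return err;
}

/* Hypothetical runtime call: fails (and records the failure) for arg < 0. */
static mock_error_t mock_runtime_call(int arg) {
    mock_error_t err = (arg < 0) ? MOCK_ERROR_INVALID_VALUE : MOCK_SUCCESS;
    if (err != MOCK_SUCCESS) {
        g_last_error = err;
    }
    return err;
}

/* Wrapper for calls whose failure is tolerated by the caller.
 * On failure it reads the last error once so that the recorded error
 * is discarded and cannot leak into a later, unrelated error check.
 * The original return code is still handed back to the caller. */
static mock_error_t call_discard_error(mock_error_t err) {
    if (err != MOCK_SUCCESS) {
        (void)mock_get_last_error();
    }
    return err;
}

/* Usage: probe something that may fail, then run a later check. */
static void demo(void) {
    mock_error_t probe = call_discard_error(mock_runtime_call(-1));
    assert(probe == MOCK_ERROR_INVALID_VALUE);  /* caller still sees it */
    assert(mock_get_last_error() == MOCK_SUCCESS);  /* nothing left behind */
}
```

With every HIP call routed through one wrapper like this, the discard logic lives in a single place instead of being repeated at each call site.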

Jira that led to this: https://ontrack-internal.amd.com/browse/SWDEV-546704
More info on error changes: https://ontrack-internal.amd.com/browse/SWDEV-438790

Operating System

n/a

CPU

n/a

GPU

n/a

ROCm Version

7.x

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
