✨[Feature] Support IAttention based quantization for MHA #4167

@narendasan

Description

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

The IAttention APIs introduce new code paths for handling quantization scales in attention, which should unlock further performance. We need to enable using these APIs from ModelOpt-generated checkpoints. I suspect the implementation will be similar to how we handle Convolution and its layer-specific quantization.

Describe alternatives you've considered

Additional context
