Is your feature request related to a problem? Please describe.
Describe the solution you'd like
The IAttention APIs introduce new code paths for some quantized scales, which should unlock further performance. We need to enable these APIs for ModelOpt-generated checkpoints. I suspect the implementation will be similar to how we handle Convolution and its layer-specific quantization.
Describe alternatives you've considered
Additional context