
Fix all four scaling multipliers for Granite #48

Open

sdeeptan-aws wants to merge 1 commit into aws-neuron:main from sdeeptan-aws:granite

Conversation

@sdeeptan-aws
Contributor

Description

Updates the Granite-3.1-8b-instruct contrib model to correctly apply all four Granite-specific scaling multipliers. The original implementation applied only residual_multiplier correctly: attention_multiplier was stored but never used, embedding_multiplier was applied to the weights (breaking tied embeddings), logits_scaling was missing entirely, and manual QKV key renaming in state dict conversion conflicted with preshard_hook. The fixes override prep_qkv_tensors to pre-scale Q against the kernel's built-in 1/sqrt(head_dim), apply embedding_multiplier in the forward pass via get_model_output, add ScaledColumnParallelLinear for logits scaling, and remove the manual key renaming. The result is a 100% token match (64/64 tokens).

Model Information

Model Name: Granite-3.1-8b-instruct
Model Architecture: Decoder-only transformer (GQA 32Q/8KV, 32 layers, hidden_size=4096, custom scaling multipliers)
Purpose: Text generation / instruction following

Checklist

Required Components

  • Accuracy Test (test/integration/test_model.py)
    • Token match accuracy validation
    • Performance metrics (TTFT, throughput)
    • Test can compile and run the model on Neuron
  • README.md with the following sections:
    • Usage Example: Clear code example showing how to use the model
    • Compatibility Matrix: Table showing tested Neuron SDK versions and instance types
    • Example Checkpoints: Links to compatible model checkpoints
    • Testing Instructions: Command to run the test suite for the model
  • Source Code (src/)
    • Modeling code following NxD Inference patterns

Optional Components

  • Unit Tests (CPU or Neuron-based)

Folder Structure

/contrib/models/granite-3.1-8b-instruct/
  README.md
  /src
    modeling_granite.py
  /test
    /integration
      test_model.py

Testing

The model was compiled and tested with TP=2, batch_size=1, seq_len=128, bfloat16. All four Granite-specific scaling multipliers were validated:

  1. Attention multiplier fix: NeuronX attention kernels apply 1/sqrt(head_dim) (≈0.0884 for head_dim=128) internally, but Granite uses attention_multiplier (0.0078125 = 1/head_dim). prep_qkv_tensors is overridden to pre-scale Q by attention_multiplier * sqrt(head_dim) so the kernel's built-in scaling produces the correct result (first sketch after this list).
  2. Embedding multiplier fix: The original applied embedding_multiplier (12.0) to embed_tokens.weight during state dict conversion. With tie_word_embeddings=True this also scales lm_head.weight, producing incorrect logits. Fix: apply the multiplier in the forward pass via a get_model_output override (second sketch after this list).
  3. Logits scaling fix: Added ScaledColumnParallelLinear, which divides its output by logits_scaling (16.0), and used it for lm_head in place of the standard ColumnParallelLinear; the original had no logits scaling at all (third sketch after this list).
  4. State dict conversion fix: Removed the manual QKV/o_proj key renaming, which conflicted with the preshard_hook in GroupQueryAttention_QKV and GroupQueryAttention_O; those modules already handle the renaming automatically.
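
To make the attention fix concrete, here is a minimal sketch of the Q pre-scaling. The class and method names (NeuronAttentionBase, prep_qkv_tensors) come from the PR description, but the import path, constructor arguments, and the assumption that the hook returns (Q, K, V) are all assumptions, not the actual implementation:

```python
import math

# Base-class import path is an assumption; NxD Inference's actual module
# layout may differ.
from neuronx_distributed_inference.modules.attention.attention_base import NeuronAttentionBase


class GraniteAttention(NeuronAttentionBase):
    """Pre-scales Q so the kernel's built-in 1/sqrt(head_dim) scaling
    reproduces Granite's attention_multiplier."""

    def __init__(self, config, **kwargs):
        super().__init__(config, **kwargs)
        head_dim = config.hidden_size // config.num_attention_heads  # 4096 // 32 = 128
        # The kernel computes softmax(Q K^T / sqrt(d)); Granite wants
        # softmax(Q K^T * m). Scaling Q by c = m * sqrt(d) gives
        # (cQ) K^T / sqrt(d) = m * Q K^T, as required.
        # For m = 1/128 and d = 128, c = 1/sqrt(128) ~= 0.0884.
        self.q_correction = config.attention_multiplier * math.sqrt(head_dim)

    def prep_qkv_tensors(self, *args, **kwargs):
        # Signature and return shape (Q, K, V) are assumed here.
        q, k, v = super().prep_qkv_tensors(*args, **kwargs)
        return q * self.q_correction, k, v
```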
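The embedding fix moves the 12.0 multiplier from the stored weight into the activation path. The PR does this inside a get_model_output override; the self-contained wrapper below is a plain-PyTorch illustration of the same principle, not the actual code:

```python
import torch
import torch.nn as nn


class ScaledEmbedding(nn.Module):
    """Applies embedding_multiplier to activations, never to the weight.
    With tie_word_embeddings=True the embedding weight also backs lm_head,
    so scaling the weight would multiply the logits by 12.0 as well."""

    def __init__(self, embed_tokens: nn.Embedding, multiplier: float):
        super().__init__()
        self.embed_tokens = embed_tokens
        self.multiplier = multiplier  # embedding_multiplier = 12.0 for Granite

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.embed_tokens(input_ids) * self.multiplier
```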
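For the logits fix, a sketch of the ScaledColumnParallelLinear subclass, assuming neuronx_distributed's ColumnParallelLinear (import path assumed) and a forward that may return an (output, bias) tuple when bias add is skipped:

```python
# Import path is an assumption based on neuronx_distributed conventions.
from neuronx_distributed.parallel_layers import ColumnParallelLinear


class ScaledColumnParallelLinear(ColumnParallelLinear):
    """ColumnParallelLinear whose output is divided by a scalar, used for
    lm_head so logits come out as (hidden @ W^T) / logits_scaling."""

    def __init__(self, *args, output_scale: float = 1.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.output_scale = output_scale  # logits_scaling = 16.0 for Granite

    def forward(self, input_):
        output = super().forward(input_)
        # Some parallel linear layers return (output, bias) when
        # skip_bias_add is enabled; handle both shapes defensively.
        if isinstance(output, tuple):
            out, bias = output
            return out / self.output_scale, bias
        return output / self.output_scale
```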

Test Results:

| Test           | Status  | Result                           |
|----------------|---------|----------------------------------|
| Smoke Test     | ✅ PASS | Model loads successfully         |
| Token Matching | ✅ PASS | 100% match (64/64 tokens)        |
| TTFT (P50)     | ✅ PASS | ~20ms (threshold: 100ms)         |
| Throughput     | ✅ PASS | ~100 tok/s (threshold: 10 tok/s) |

Compatibility

Tested with:

  • Instance Type(s): Trn1
  • Configuration: TP=2, batch_size=1, seq_len=128, bfloat16

Additional Information

  • Audit all multipliers: Granite defines four custom scaling factors in config.json. The original code stored all four but only residual_multiplier was applied correctly. Always verify each multiplier is actually used in the code path.
  • Tied weights + weight scaling = broken: Never apply multipliers to weights shared between embedding and lm_head. Apply in the forward pass instead.
  • Know what the kernel does: NeuronX attention kernels apply 1/sqrt(head_dim) internally. To use a different scaling, pre-scale Q with a correction factor: correction = attention_multiplier * sqrt(head_dim).
  • Don't fight preshard_hook: The framework's preshard hooks handle QKV (q_proj → qkv_proj.q_proj) and o_proj (o_proj → o_proj.o_proj) key renaming. Manual renaming in convert_hf_to_neuron_state_dict causes double-nesting.
  • ScaledColumnParallelLinear pattern: When you need to apply a scalar to the output of a parallel linear layer, subclass it rather than modifying the model's forward pass.

Related Issues

N/A

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions (a sketch of typical registration follows this list)
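
For orientation, a minimal sketch of what out-of-tree model registration typically looks like with vLLM's public ModelRegistry API. The class name and its import path below are hypothetical placeholders; the README.md in this PR is the authoritative source for the actual steps:

```python
# Hypothetical registration snippet. NeuronGraniteForCausalLM and its import
# path are placeholders; follow this model's README.md for the real steps.
from vllm import ModelRegistry

from modeling_granite import NeuronGraniteForCausalLM  # hypothetical class name

# Map the HF architecture string from config.json to the Neuron implementation.
ModelRegistry.register_model("GraniteForCausalLM", NeuronGraniteForCausalLM)
```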

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

