
Fix all four scaling multipliers for Granite #48

Open

sdeeptan-aws wants to merge 1 commit into aws-neuron:main from sdeeptan-aws:granite

Conversation

@sdeeptan-aws
Contributor

Description

Updates the Granite-3.1-8b-instruct contrib model to correctly apply all four Granite-specific scaling multipliers. The original implementation applied only residual_multiplier correctly: attention_multiplier was stored but never used, embedding_multiplier was applied to the weights (breaking tied embeddings), logits_scaling was missing entirely, and manual QKV key renaming in state dict conversion conflicted with preshard_hook. The fixes override prep_qkv_tensors to pre-scale Q against the kernel's built-in 1/sqrt(head_dim), apply embedding_multiplier in the forward pass via get_model_output, add ScaledColumnParallelLinear for logits scaling, and remove the manual key renaming. The result is a 100% token match (64/64 tokens).

Model Information

Model Name: Granite-3.1-8b-instruct
Model Architecture: Decoder-only transformer (GQA 32Q/8KV, 32 layers, hidden_size=4096, custom scaling multipliers)
Purpose: Text generation / instruction following

Checklist

Required Components

  • Accuracy Test (test/integration/test_model.py)
    • Token match accuracy validation
    • Performance metrics (TTFT, throughput)
    • Test can compile and run the model on Neuron
  • README.md with the following sections:
    • Usage Example: Clear code example showing how to use the model
    • Compatibility Matrix: Table showing tested Neuron SDK versions and instance types
    • Example Checkpoints: Links to compatible model checkpoints
    • Testing Instructions: Command to run the test suite for the model
  • Source Code (src/)
    • Modeling code following NxD Inference patterns

Optional Components

  • Unit Tests (CPU or Neuron-based)

Folder Structure

/contrib/models/granite-3.1-8b-instruct/
  README.md
  /src
    modeling_granite.py
  /test
    /integration
      test_model.py

Testing

The model was compiled and tested with TP=2, batch_size=1, seq_len=128, bfloat16. All four Granite-specific scaling multipliers were validated:

  1. Attention multiplier fix: NeuronX attention kernels apply 1/sqrt(head_dim) (≈0.0884 for head_dim=128) internally, but Granite uses attention_multiplier (0.0078125 = 1/head_dim). prep_qkv_tensors is overridden to pre-scale Q by attention_multiplier * sqrt(head_dim) so the kernel's built-in scaling produces the correct result (first sketch after this list).
  2. Embedding multiplier fix: The original applied embedding_multiplier (12.0) to embed_tokens.weight during state dict conversion. With tie_word_embeddings=True this also scales lm_head.weight, producing incorrect logits. Fix: apply the multiplier in the forward pass via a get_model_output override (second sketch after this list).
  3. Logits scaling fix: Added ScaledColumnParallelLinear, which divides its output by logits_scaling (16.0), and used it for lm_head in place of the standard ColumnParallelLinear; the original had no logits scaling at all (third sketch after this list).
  4. State dict conversion fix: Removed the manual QKV/o_proj key renaming, which conflicted with the preshard_hook in GroupQueryAttention_QKV and GroupQueryAttention_O; those modules already handle the renaming automatically.
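
To make the attention fix concrete, here is a minimal sketch of the Q pre-scaling. The class and method names (NeuronAttentionBase, prep_qkv_tensors) come from the PR description, but the import path, constructor arguments, and the assumption that the hook returns (Q, K, V) are all assumptions, not the actual implementation:

```python
import math

# Base-class import path is an assumption; NxD Inference's actual module
# layout may differ.
from neuronx_distributed_inference.modules.attention.attention_base import NeuronAttentionBase


class GraniteAttention(NeuronAttentionBase):
    """Pre-scales Q so the kernel's built-in 1/sqrt(head_dim) scaling
    reproduces Granite's attention_multiplier."""

    def __init__(self, config, **kwargs):
        super().__init__(config, **kwargs)
        head_dim = config.hidden_size // config.num_attention_heads  # 4096 // 32 = 128
        # The kernel computes softmax(Q K^T / sqrt(d)); Granite wants
        # softmax(Q K^T * m). Scaling Q by c = m * sqrt(d) gives
        # (cQ) K^T / sqrt(d) = m * Q K^T, as required.
        # For m = 1/128 and d = 128, c = 1/sqrt(128) ~= 0.0884.
        self.q_correction = config.attention_multiplier * math.sqrt(head_dim)

    def prep_qkv_tensors(self, *args, **kwargs):
        # Signature and return shape (Q, K, V) are assumed here.
        q, k, v = super().prep_qkv_tensors(*args, **kwargs)
        return q * self.q_correction, k, v
```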
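The embedding fix moves the 12.0 multiplier from the stored weight into the activation path. The PR does this inside a get_model_output override; the self-contained wrapper below is a plain-PyTorch illustration of the same principle, not the actual code:

```python
import torch
import torch.nn as nn


class ScaledEmbedding(nn.Module):
    """Applies embedding_multiplier to activations, never to the weight.
    With tie_word_embeddings=True the embedding weight also backs lm_head,
    so scaling the weight would multiply the logits by 12.0 as well."""

    def __init__(self, embed_tokens: nn.Embedding, multiplier: float):
        super().__init__()
        self.embed_tokens = embed_tokens
        self.multiplier = multiplier  # embedding_multiplier = 12.0 for Granite

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.embed_tokens(input_ids) * self.multiplier
```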
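For the logits fix, a sketch of the ScaledColumnParallelLinear subclass, assuming neuronx_distributed's ColumnParallelLinear (import path assumed) and a forward that may return an (output, bias) tuple when bias add is skipped:

```python
# Import path is an assumption based on neuronx_distributed conventions.
from neuronx_distributed.parallel_layers import ColumnParallelLinear


class ScaledColumnParallelLinear(ColumnParallelLinear):
    """ColumnParallelLinear whose output is divided by a scalar, used for
    lm_head so logits come out as (hidden @ W^T) / logits_scaling."""

    def __init__(self, *args, output_scale: float = 1.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.output_scale = output_scale  # logits_scaling = 16.0 for Granite

    def forward(self, input_):
        output = super().forward(input_)
        # Some parallel linear layers return (output, bias) when
        # skip_bias_add is enabled; handle both shapes defensively.
        if isinstance(output, tuple):
            out, bias = output
            return out / self.output_scale, bias
        return output / self.output_scale
```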

Test Results:

| Test           | Status  | Result                           |
|----------------|---------|----------------------------------|
| Smoke Test     | ✅ PASS | Model loads successfully         |
| Token Matching | ✅ PASS | 100% match (64/64 tokens)        |
| TTFT (P50)     | ✅ PASS | ~20ms (threshold: 100ms)         |
| Throughput     | ✅ PASS | ~100 tok/s (threshold: 10 tok/s) |

Compatibility

Tested with:

  • Instance Type(s): Trn1
  • Configuration: TP=2, batch_size=1, seq_len=128, bfloat16

Additional Information

  • Audit all multipliers: Granite defines four custom scaling factors in config.json. The original code stored all four but only residual_multiplier was applied correctly. Always verify each multiplier is actually used in the code path.
  • Tied weights + weight scaling = broken: Never apply multipliers to weights shared between embedding and lm_head. Apply in the forward pass instead.
  • Know what the kernel does: NeuronX attention kernels apply 1/sqrt(head_dim) internally. To use a different scaling, pre-scale Q with a correction factor: correction = attention_multiplier * sqrt(head_dim).
  • Don't fight preshard_hook: The framework's preshard hooks handle QKV (q_proj → qkv_proj.q_proj) and o_proj (o_proj → o_proj.o_proj) key renaming. Manual renaming in convert_hf_to_neuron_state_dict causes double-nesting.
  • ScaledColumnParallelLinear pattern: When you need to apply a scalar to the output of a parallel linear layer, subclass it rather than modifying the model's forward pass.

Related Issues

N/A

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions (a sketch of typical registration follows this list)
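
For orientation, a minimal sketch of what out-of-tree model registration typically looks like with vLLM's public ModelRegistry API. The class name and its import path below are hypothetical placeholders; the README.md in this PR is the authoritative source for the actual steps:

```python
# Hypothetical registration snippet. NeuronGraniteForCausalLM and its import
# path are placeholders; follow this model's README.md for the real steps.
from vllm import ModelRegistry

from modeling_granite import NeuronGraniteForCausalLM  # hypothetical class name

# Map the HF architecture string from config.json to the Neuron implementation.
ModelRegistry.register_model("GraniteForCausalLM", NeuronGraniteForCausalLM)
```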

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

