Fix interleaved RoPE and partial rotary factor for GLM-4 by sdeeptan-aws · Pull Request #39 · aws-neuron/neuronx-distributed-inference

sdeeptan-aws · 2026-02-18T16:09:52Z

Description

Updated GLM-4-9B-Chat-HF contrib model with correct interleaved RoPE implementation, partial rotary factor handling, fused gate_up_proj splitting, and updated README with architecture details and validation results. Key discovery: the model uses model_type="glm" (not glm4), which means HuggingFace loads GlmForCausalLM — a different architecture with 2 RMSNorm layers per decoder (not 4), interleaved RoPE rotation (even/odd indices, not split-half), and partial_rotary_factor=0.5. Validation achieves 90.62% token match (29/32 tokens before BF16 divergence).

Model Information

Model Name: GLM-4-9B-Chat-HF
Model Architecture: Decoder-only transformer (GLM with interleaved RoPE, GQA 32Q/2KV)
Purpose: Text generation / Chat

Checklist

Required Components

Accuracy Test (test/integration/test_model.py)
- Validates model accuracy with token matching
- Test can compile and run the model on Neuron
README.md with the following sections:
- Usage Example: Clear code example showing how to use the model
- Compatibility Matrix: Table showing tested Neuron SDK versions and instance types
- Example Checkpoints: Links to compatible model checkpoints
- Testing Instructions: Command to run the test suite for the model
Source Code (src/)
- Modeling code following NxD Inference patterns

Optional Components

Unit Tests (CPU or Neuron-based)

Folder Structure

/contrib/models/glm-4-9b-chat-hf/
    README.md
    /src
        modeling_glm4.py
    /test
        /integration
            test_model.py

Testing

Model was compiled and tested with TP=2, batch_size=1, seq_len=128, bfloat16. Three key architectural features were validated against HuggingFace reference:

Interleaved RoPE: GLM uses x[..., 0::2] / x[..., 1::2] rotation with repeat_interleave for cos/sin — different from LLaMA's split-half pattern
Partial rotary factor=0.5: Only 64 of 128 head_dim dimensions receive RoPE
Fused gate_up_proj: Checkpoint weight correctly split into separate gate_proj and up_proj

Test Results:

Test	Status	Result
Smoke Test	✅ PASS	Model loads successfully
Token Matching (generic prompt)	⚠️ LOW	53% match
Token Matching (specific prompt)	✅ GOOD	90.62% match (29/32 tokens)

Late divergence at token 29+ is expected BF16 vs FP32 numerical precision accumulation, not an implementation error.

Compatibility

Tested with:

Instance Type(s): Trn1
Configuration: TP=2, batch_size=1, seq_len=128, bfloat16

Additional Information

model_type="glm" not "glm4": HuggingFace loads GlmForCausalLM which has 2 RMSNorm layers per decoder, not 4. Using the wrong model type would load a completely different architecture
Interleaved RoPE pattern: GLM rotates even/odd indices and uses repeat_interleave for cos/sin expansion, unlike LLaMA's split-half approach with cat
Partial rotary factor=0.5: Only the first half of head dimensions (64/128) get RoPE; the rest pass through unchanged
Fused gate_up_proj: The HF checkpoint stores a single fused [2*intermediate_size, hidden_size] weight that must be split at intermediate_size into gate_proj and up_proj
GQA with 32 Q heads and 2 KV heads: Grouped query attention with attention bias enabled
Generic prompt sensitivity: Open-ended prompts show ~53% accuracy due to high entropy; deterministic prompts like "The capital of France is" give 90.62%

Related Issues

N/A

vLLM Integration

This model/feature is intended for use with vLLM
Documentation includes vLLM registration instructions

By submitting this PR, I confirm that:

I have read and followed the contributing guidelines
This is a community contribution and may have limited testing compared to officially-supported models
The code follows best practices and is well-documented
All required components listed above are included

Fix interleaved RoPE and partial rotary factor for GLM-4

03f5455

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix interleaved RoPE and partial rotary factor for GLM-4#39

Fix interleaved RoPE and partial rotary factor for GLM-4#39
sdeeptan-aws wants to merge 1 commit intoaws-neuron:mainfrom
sdeeptan-aws:glm

sdeeptan-aws commented Feb 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

Conversation

sdeeptan-aws commented Feb 18, 2026

Description

Model Information

Checklist

Required Components

Optional Components

Folder Structure

Testing

Compatibility

Additional Information

Related Issues

vLLM Integration

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments