Skip to content

Add NoPE layer support and tied embeddings for SmolLM3-3B#46

Open
sdeeptan-aws wants to merge 1 commit intoaws-neuron:mainfrom
sdeeptan-aws:smollm3
Open

Add NoPE layer support and tied embeddings for SmolLM3-3B#46
sdeeptan-aws wants to merge 1 commit intoaws-neuron:mainfrom
sdeeptan-aws:smollm3

Conversation

@sdeeptan-aws
Copy link
Contributor

Description

Updated SmolLM3-3B contrib model with NoPE (No Position Embedding) layer support where every 4th layer skips RoPE entirely, tied embeddings handling via update_state_dict_for_tied_weights, and GQA with 4 KV heads. The model's unique feature is the no_rope_layers config array that controls per-layer RoPE application — layers at indices 3, 7, 11, 15, ... receive no positional encoding. Math prompts like "The square root of 144 is" achieve 100% token match; open-ended prompts diverge due to BF16 precision.

Model Information

Model Name: SmolLM3-3B
Model Architecture: Decoder-only transformer (NoPE layers, GQA 16Q/4KV, tied embeddings, 36 layers, hidden_size=2048)
Purpose: Text generation

Checklist

Required Components

  • Accuracy Test (test/integration/test_model.py)
    • Multi-prompt integration test validating token match accuracy
    • Uses deterministic math prompts for reliable validation
    • Test can compile and run the model on Neuron
  • README.md with the following sections:
    • Usage Example: Clear code example showing how to use the model
    • Compatibility Matrix: Table showing tested Neuron SDK versions and instance types
    • Example Checkpoints: Links to compatible model checkpoints
    • Testing Instructions: Command to run the test suite for the model
  • Source Code (src/)
    • Modeling code following NxD Inference patterns

Optional Components

  • Unit Tests (CPU or Neuron-based)

Folder Structure

/contrib/models/SmolLM3-3B/
  README.md
  /src
    modeling_smollm3_3b.py
  /test
    /integration
      test_model.py

Testing

Model was compiled and tested with TP=2, batch_size=1, seq_len=128, bfloat16. Three key architectural features validated:

  1. NoPE layers: no_rope_layers config array with 0/1 values — every 4th layer (indices 3, 7, 11, 15, ...) has 0 meaning no RoPE applied. Attention class checks no_rope_layers[layer_idx] and passes rotary_emb=None to NeuronAttentionBase for NoPE layers.
  2. Tied embeddings: tie_word_embeddings=true — handled via update_state_dict_for_tied_weights which copies embed_tokens.weight to lm_head.weight in state dict, not by manual weight assignment in __init__.
  3. GQA with 4 KV heads: 16 Q heads, 4 KV heads (head_dim=128). TP degree must be compatible with KV head count.

Test Results:

Test Status Result
Smoke Test ✅ PASS Model loads successfully
Token Matching ✅ PASS 100% match (math prompt)

Multi-Prompt Accuracy:

Prompt Match Rate Notes
"The square root of 144 is" 100% Deterministic math
"Water boils at" 96.88% Diverges after ~16 tokens
"The chemical formula for water is" 96.88% Diverges after ~1 token
"The capital of France is" 37.5% Diverges after ~12 tokens
"def fibonacci(n):" 34.38% Diverges after ~11 tokens
"1+1=" 6.25% Diverges after ~1 token

Lower-scoring prompts reflect BF16 style divergence — both HF and Neuron produce correct outputs but differ in phrasing.

Compatibility

Tested with:

  • Instance Type(s): Trn1
  • Configuration: TP=2, batch_size=1, seq_len=128, bfloat16

Additional Information

  • NoPE layers are unique to SmolLM3: Check no_rope_layers in config — a per-layer array where 0 means skip RoPE entirely. This is different from partial RoPE (which applies to a fraction of head dimensions).
  • Conditional RoPE in attention: NeuronSmolLM3Attention checks config.no_rope_layers[layer_idx] and creates rotary_emb=None for NoPE layers, passing it to NeuronAttentionBase.
  • Tied embeddings via state dict: Don't assign lm_head.weight = embed_tokens.weight in __init__. Use update_state_dict_for_tied_weights to clone the weight in the state dict.
  • Math prompts for validation: "The square root of 144 is" gives 100% accuracy because the answer is deterministic. Open-ended prompts diverge due to close logits under BF16.
  • GQA sharding: TP degree must divide evenly into KV heads (4), or use CONVERT_TO_MHA for incompatible TP degrees.

Related Issues

N/A

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant

Comments