
Update GPT-2 with Conv1D transposition and vocab padding#42

Open
sdeeptan-aws wants to merge 1 commit into aws-neuron:main from sdeeptan-aws:gpt2

Conversation

@sdeeptan-aws
Contributor

Description

Updated the GPT-2 contrib model with Conv1D weight transposition, vocab-size padding for the non-power-of-2 vocabulary (50257), tied-embeddings handling, and learned absolute position embeddings (not RoPE). The model has several architectural features that differ from modern LLMs: Conv1D layers store weights transposed relative to standard Linear layers, the vocab size isn't divisible by common TP degrees and therefore requires pad=True, and the embeddings are tied between the input and lm_head. Validation achieves 100% token match, and all LightEval benchmarks pass within ±2% of the HuggingFace reference.

Model Information

Model Name: GPT-2
Model Architecture: Decoder-only transformer (GPT-2 with Conv1D, learned position embeddings, tied embeddings)
Purpose: Text generation

Checklist

Required Components

  • Accuracy Test (test/integration/test_model.py)
    • Validates model accuracy with token matching
    • LightEval benchmark comparison against HF reference
    • Test can compile and run the model on Neuron
  • README.md with the following sections:
    • Usage Example: Clear code example showing how to use the model
    • Compatibility Matrix: Table showing tested Neuron SDK versions and instance types
    • Example Checkpoints: Links to compatible model checkpoints
    • Testing Instructions: Command to run the test suite for the model
  • Source Code (src/)
    • Modeling code following NxD Inference patterns

Optional Components

  • Unit Tests (CPU or Neuron-based)

Folder Structure

/contrib/models/gpt2/
    README.md
    /src
        modeling_gpt2.py
    /test
        /integration
            test_model.py

Testing

The model was compiled and tested with TP=2, batch_size=1, seq_len=128, bfloat16. Four key architectural features were validated:

  1. Conv1D weight transposition: All projection weights transposed during state dict conversion (Conv1D stores [in, out], Linear expects [out, in])
  2. Vocab size padding: 50257 is not divisible by TP degrees — pad=True on embedding and lm_head
  3. Tied embeddings: Handled via update_state_dict_for_tied_weights, not manual weight tying in __init__
  4. Learned position embeddings: Absolute position embedding table (not RoPE)
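Features 1 and 2 above can be sketched roughly as follows. This is an illustrative sketch, not the actual modeling code: the helper names (`convert_gpt2_state_dict`, `padded_vocab_size`), the `TP_DEGREE` constant, and the suffix list are assumptions for the example.

```python
import torch

TP_DEGREE = 2       # assumption: matches the tested TP=2 configuration
VOCAB_SIZE = 50257  # GPT-2 vocabulary size (not divisible by 2, 4, or 8)

def convert_gpt2_state_dict(state_dict):
    """Transpose Conv1D weights to the [out, in] layout that nn.Linear
    (and sharded linear layers) expect; HF's Conv1D stores [in, out]."""
    converted = {}
    conv1d_suffixes = ("c_attn.weight", "c_proj.weight", "c_fc.weight")
    for name, tensor in state_dict.items():
        if name.endswith(conv1d_suffixes):
            converted[name] = tensor.t().contiguous()
        else:
            converted[name] = tensor
    return converted

def padded_vocab_size(vocab_size, tp_degree):
    """Round the vocab up to a multiple of the TP degree,
    the effect pad=True has on the embedding and lm_head shards."""
    return ((vocab_size + tp_degree - 1) // tp_degree) * tp_degree
```

For example, `padded_vocab_size(50257, 2)` gives 50258, so each of the two shards holds 25129 rows instead of splitting a row unevenly.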

Test Results:

| Test | Status | Result |
|------|--------|--------|
| Smoke Test | ✅ PASS | Model loads successfully |
| Token Matching | ✅ PASS | 100% match |

LightEval Benchmark Results (full evaluation, all samples):

| Task | Neuron (BF16) | HF (FP32) | Delta | Status |
|------|---------------|-----------|-------|--------|
| arc:challenge | 0.1937 | 0.1903 | +0.003 | ✅ PASS |
| arc:easy | 0.4398 | 0.4381 | +0.002 | ✅ PASS |
| hellaswag (em) | 0.0066 | 0.0050 | +0.002 | ✅ PASS |
| truthfulqa_mc1 | 0.2375 | 0.2277 | +0.010 | ✅ PASS |
| truthfulqa_mc2 | 0.4252 | 0.4069 | +0.018 | ✅ PASS |
| winogrande | 0.4862 | 0.4838 | +0.002 | ✅ PASS |

All benchmarks within ±2% of HF reference. Largest delta: truthfulqa_mc2 at +1.8%.

Compatibility

Tested with:

  • Instance Type(s): Trn1
  • Configuration: TP=2, batch_size=1, seq_len=128, bfloat16

Additional Information

  • Conv1D ≠ Linear: GPT-2 family uses Conv1D layers which store weights transposed compared to nn.Linear. All projection weights (c_attn, c_proj, c_fc, mlp.c_proj) must be transposed during state dict conversion
  • Vocab padding required: 50257 % 2 = 1, 50257 % 4 = 1, 50257 % 8 = 1 — pad=True is critical for embedding and lm_head to avoid crashes or incorrect results from uneven sharding
  • Don't manually tie weights: The NeuronX framework handles tied embeddings via state dict (update_state_dict_for_tied_weights), not by assigning lm_head.weight = embed_tokens.weight in __init__
  • Learned position embeddings: Uses a separate wpe embedding table, not RoPE — simpler but requires correct weight loading
  • LayerNorm with bias: Standard nn.LayerNorm, not RMSNorm
  • Fused QKV: Single c_attn projection split into Q, K, V (concatenated layout, not interleaved like GPTNeoX)
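The fused-QKV layout in the last bullet can be illustrated as below. This is a minimal sketch assuming GPT-2 small's hidden size of 768 and a weight already transposed to [out, in]; the function name is hypothetical.

```python
import torch

HIDDEN = 768  # assumption: GPT-2 small hidden size

def split_fused_qkv(c_attn_weight):
    """Split a fused [3*hidden, hidden] c_attn weight into Q, K, V.
    GPT-2 concatenates whole blocks [Q | K | V] along the output dim,
    unlike GPTNeoX, which interleaves per-head Q/K/V slices."""
    q, k, v = torch.split(c_attn_weight, HIDDEN, dim=0)
    return q, k, v
```

Because the blocks are contiguous, a plain `torch.split` recovers them; an interleaved layout would instead require a reshape over the head dimension before slicing.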

Related Issues

N/A

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

