Add LongRoPE and fix state dict conversion for Phi-3.5-mini-instruct #44
sdeeptan-aws wants to merge 1 commit into aws-neuron:main from
Description
Updated the Phi-3.5-mini-instruct contrib model with a LongRoPE implementation (learned position-dependent scaling factors for the 128k context), splitting of the fused QKV and gate_up projections during state dict conversion, and a corrected `o_proj` mapping that avoids double-prefixing by the preshard_hook. The model uses `Phi3LongRoPEScaledRotaryEmbedding` with separate `short_factor` (seq ≤ 4096) and `long_factor` (longer sequences) arrays instead of standard RoPE. Initial accuracy was 17.19% with repetition loops; after the fixes, the model achieves 100% token match.

Model Information
Model Name: Phi-3.5-mini-instruct
Model Architecture: Decoder-only transformer (Phi-3 with LongRoPE, MHA, 32 heads, 32 layers, hidden_size=3072)
Purpose: Text generation / instruction following
Checklist
Required Components
- test/integration/test_model.py
- src/

Optional Components
Folder Structure
Testing
Model was compiled and tested with TP=2, batch_size=1, seq_len=128, bfloat16. Three key fixes validated:
1. LongRoPE rotary embedding: `Phi3LongRoPEScaledRotaryEmbedding` with learned `short_factor` and `long_factor` arrays. Standard RoPE uses fixed inverse frequencies; LongRoPE rescales them with position-dependent scaling factors and applies a `sqrt(1 + log(scale)/log(orig_max_pos))` correction to the cos/sin values (a minimal sketch follows this list).
2. Fused projection splitting: `qkv_proj.weight` is split into `qkv_proj.q_proj`, `qkv_proj.k_proj`, and `qkv_proj.v_proj`; `gate_up_proj.weight` is split into `gate_proj` and `up_proj`.
3. `o_proj` mapping: previously mapped to `o_proj.o_proj` in the conversion, but `GroupQueryAttention_O.preshard_hook` also adds the prefix, resulting in the broken key `o_proj.o_proj.o_proj`. Fix: leave `o_proj` as-is and let the preshard_hook handle it.
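For reference, here is a minimal sketch of the LongRoPE cos/sin computation described in fix 1. The function name, argument layout, and defaults are illustrative rather than the PR's actual code; it assumes the learned factor arrays from `config.json` are passed in and that `position_ids` is a 1-D tensor.

```python
import math
import torch

def longrope_cos_sin(position_ids, head_dim, rope_theta,
                     short_factor, long_factor,
                     original_max_pos=4096, max_pos=131072):
    """Sketch of LongRoPE: standard RoPE frequencies rescaled by learned,
    per-dimension factors, plus a magnitude correction on cos/sin."""
    seq_len = int(position_ids.max()) + 1
    # Pick the learned factors: short_factor for sequences within the
    # original 4096-token context, long_factor beyond it.
    factors = long_factor if seq_len > original_max_pos else short_factor
    ext_factors = torch.tensor(factors, dtype=torch.float32)

    # Standard RoPE inverse frequencies, rescaled by the learned factors
    # (48 values per array for head_dim = 96).
    exponents = torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim
    inv_freq = 1.0 / (ext_factors * rope_theta ** exponents)

    freqs = torch.outer(position_ids.float(), inv_freq)
    emb = torch.cat((freqs, freqs), dim=-1)

    # sqrt(1 + log(scale)/log(orig_max_pos)) correction to cos/sin.
    scale = max_pos / original_max_pos
    scaling = 1.0 if scale <= 1.0 else math.sqrt(
        1.0 + math.log(scale) / math.log(original_max_pos))
    return emb.cos() * scaling, emb.sin() * scaling
```

Assuming a 131072-token maximum context over an original 4096-token context, `scale = 32` and the cos/sin correction works out to `sqrt(1 + ln(32)/ln(4096)) ≈ 1.19`.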
Test Results:

Multi-Prompt Accuracy:
Compatibility
Tested with:
Additional Information
- `rope_scaling.type` in the config indicates LongRoPE, which uses learned per-dimension scaling factors stored in `config.json`, not fixed interpolation.
- `short_factor` (48 values, for seq ≤ 4096) and `long_factor` (48 values, for longer sequences) are trained parameters, not computed at runtime.
- `GroupQueryAttention_O.preshard_hook` expects `o_proj.weight` and maps it to `o_proj.o_proj.weight`. Adding `o_proj.o_proj` in your conversion causes triple-nesting.
- Phi-3 uses a fused `qkv_proj` layer concatenating Q, K, and V; it must be split during state dict conversion with correct head-dimension slicing (see the sketch after this list).
- Phi-3 uses a fused `gate_up_proj` layer concatenating the gate and up projections; split it at the `intermediate_size` boundary.
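A minimal sketch of the key splitting described above (the function name is hypothetical and the exact target key names depend on the Neuron module layout; it assumes the standard HF config fields `hidden_size`, `num_attention_heads`, `num_key_value_heads`, and `intermediate_size`):

```python
import torch

def split_fused_phi3_weights(state_dict, config):
    """Sketch: split fused qkv_proj / gate_up_proj weights and leave o_proj
    untouched so GroupQueryAttention_O.preshard_hook can add its own prefix."""
    head_dim = config.hidden_size // config.num_attention_heads  # 3072 / 32 = 96
    q_size = config.num_attention_heads * head_dim
    kv_size = config.num_key_value_heads * head_dim
    inter = config.intermediate_size

    converted = {}
    for key, tensor in state_dict.items():
        if key.endswith("self_attn.qkv_proj.weight"):
            # Split the fused QKV weight along the output dimension.
            prefix = key[: -len("qkv_proj.weight")]
            q, k, v = torch.split(tensor, [q_size, kv_size, kv_size], dim=0)
            converted[prefix + "qkv_proj.q_proj.weight"] = q
            converted[prefix + "qkv_proj.k_proj.weight"] = k
            converted[prefix + "qkv_proj.v_proj.weight"] = v
        elif key.endswith("mlp.gate_up_proj.weight"):
            # Split gate/up at the intermediate_size boundary.
            prefix = key[: -len("gate_up_proj.weight")]
            gate, up = torch.split(tensor, [inter, inter], dim=0)
            converted[prefix + "gate_proj.weight"] = gate
            converted[prefix + "up_proj.weight"] = up
        else:
            # o_proj (and everything else) passes through unchanged;
            # the preshard_hook maps o_proj.weight -> o_proj.o_proj.weight.
            converted[key] = tensor
    return converted
```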
Related Issues

N/A
vLLM Integration
By submitting this PR, I confirm that: