This repository contains my implementation of the core components of DeepSeek's models: Mixture of Experts (MoE), Multi-Head Latent Attention (MLA), and Rotary Positional Encoding (RoPE). I had a great time reading through Build a DeepSeek Model (From Scratch), and implementing these core components solidified my understanding of MLA KV caching and the MoE architecture, both of which help reduce inference latency.
You would probably need to attend a lecture or read a book to understand the nitty-gritty details of the full DeepSeek architecture, but this repository is useful for understanding the noteworthy points listed below. The architecture diagram below (Figure 1) provides a visual overview of how these components fit together.
Figure 1: DeepSeek R1 Architecture with Mixture-of-Experts (MoE) and Multi-Head Latent Attention components.
MoE replaces the traditional dense feed-forward network with a set of smaller, sparse expert networks managed by a router. This lets the model develop different domain experts for different problem areas and reduces inference time, since only a subset of experts is activated for each token in a forward pass. DeepSeek also introduces shared experts that capture general knowledge, combating the knowledge redundancy and knowledge hybridity problems that are prevalent in traditional MoE networks.
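For intuition, here is a minimal PyTorch sketch of a sparse MoE layer with a linear router, top-k expert selection, and a shared expert. The class names, dimensions, and routing details (`SparseMoE`, `Expert`, `top_k=2`, etc.) are illustrative assumptions, not the exact implementation in this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A small feed-forward expert network."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.SiLU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x):
        return self.net(x)

class SparseMoE(nn.Module):
    """Sparse MoE layer sketch: a router picks the top-k experts per token,
    and a shared expert always contributes general knowledge."""
    def __init__(self, d_model=64, d_hidden=256, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_experts))
        self.shared_expert = Expert(d_model, d_hidden)
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                                  # x: (batch, seq, d_model)
        scores = self.router(x)                            # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)               # normalize over chosen experts
        routed = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                    # tokens sent to expert e in slot k
                if mask.any():
                    routed[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return self.shared_expert(x) + routed              # shared expert sees every token

x = torch.randn(2, 10, 64)
print(SparseMoE()(x).shape)                                # torch.Size([2, 10, 64])
```

Note that only `top_k` of the routed experts run per token, which is what keeps the forward pass sparse even as the total expert count grows.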
MLA replaces the traditional multi-head attention block in the Transformer architecture. Its main advantage is that it compresses the Key and Value matrices in the attention computation into a low-rank latent, so only that latent needs to be cached during inference. This greatly reduces the memory required for the KV cache with little to no loss in performance.
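A simplified sketch of the latent-compression idea, assuming a toy module (it omits DeepSeek's decoupled RoPE path and other details, and all names and dimensions here are illustrative): keys and values are reconstructed from a small latent vector, so only that latent has to be cached between decoding steps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedMLA(nn.Module):
    """Sketch of Multi-Head Latent Attention: keys/values are rebuilt from a small
    shared latent vector, so only the latent needs to be cached at inference time."""
    def __init__(self, d_model=64, n_heads=4, d_latent=16):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)         # compress to latent (cached)
        self.k_up = nn.Linear(d_latent, d_model)            # reconstruct keys from latent
        self.v_up = nn.Linear(d_latent, d_model)            # reconstruct values from latent
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, latent_cache=None):                # x: (batch, seq, d_model)
        b, s, _ = x.shape
        latent = self.kv_down(x)                            # (batch, seq, d_latent)
        if latent_cache is not None:                        # append to previously cached latents
            latent = torch.cat([latent_cache, latent], dim=1)
        q = self.q_proj(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v)      # (batch, heads, seq, d_head)
        out = attn.transpose(1, 2).reshape(b, s, -1)
        return self.out_proj(out), latent                   # latent doubles as the new cache

x = torch.randn(2, 10, 64)
y, cache = SimplifiedMLA()(x)
print(y.shape, cache.shape)   # torch.Size([2, 10, 64]) torch.Size([2, 10, 16])
```

The memory saving comes from caching the `d_latent`-sized vector per token instead of full per-head keys and values (16 floats per token here versus 128 for separate K and V).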
Instead of applying positional encoding once at the start of the Transformer, DeepSeek applies RoPE inside the MLA block. Unlike traditional additive positional encodings, which change the magnitude of the input vectors and slightly corrupt them, RoPE applies only a rotation, which preserves each vector's magnitude.
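Below is a minimal sketch of RoPE applied to a (seq, dim) tensor. The `rope` function name and the base frequency of 10000 follow the common convention and are assumptions for illustration, not necessarily this repository's exact code; the final check confirms that the rotation leaves each token vector's norm unchanged.

```python
import torch

def rope(x, base=10000.0):
    """Apply rotary positional encoding to x of shape (seq, d); d must be even.
    Each pair of dimensions is rotated by a position- and frequency-dependent angle."""
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)          # (seq, 1)
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)      # (d/2,)
    angles = pos * freqs                                                    # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                                     # split into 2D pairs
    rotated = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return rotated.flatten(-2)                                              # interleave back

x = torch.randn(10, 64)
y = rope(x)
# Rotation preserves the norm of each token vector:
print(torch.allclose(x.norm(dim=-1), y.norm(dim=-1), atol=1e-5))            # True
```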
