deepseek-mini

Motivation

This repository contains my implementation of the core components of DeepSeek's novel models: Mixture of Experts (MoE), Multi-head Latent Attention (MLA), and Rotary Positional Encoding (RoPE). I had fun reading through Build a DeepSeek Model (From Scratch), and implementing these core components solidified my understanding of MLA KV-caching and the MoE architecture, both of which help reduce inference cost and latency.

You would probably need to attend a lecture or read a book to understand the nitty-gritty details of the full DeepSeek architecture, but this repository is useful for understanding the noteworthy points listed below. The architecture diagram (Figure 1) provides a visual overview of how these components fit together.

Figure 1: DeepSeek R1 Architecture with Mixture-of-Experts (MoE) and Multi-Head Latent Attention components.

1. Mixture of Experts

MoE replaces the traditional dense feed-forward network with many smaller, sparse expert networks managed by a router. This lets the model develop different domain experts across different problem sets and decreases inference cost, since only a subset of experts is activated for each token in a forward pass. DeepSeek also introduces shared expert networks that capture general knowledge, combating the knowledge redundancy and knowledge hybridity problems that are prevalent in traditional MoE networks.
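The routing idea above can be sketched in a few lines. This is a minimal numpy illustration, not the repository's actual code: the router, expert, and shared-expert weights are random stand-ins for learned parameters, and each expert is a single linear map rather than a full feed-forward block.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

x = rng.normal(size=(d_model,))                # one token's hidden state

# Router: a linear layer produces one logit per expert.
W_router = rng.normal(size=(n_experts, d_model))
logits = W_router @ x

# Keep only the top-k experts and turn their logits into gate weights.
top = np.argsort(logits)[-top_k:]
gates = np.exp(logits[top] - logits[top].max())
gates /= gates.sum()

# Routed experts (sparse, only top-k run) plus one always-on shared expert.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
W_shared = rng.normal(size=(d_model, d_model))

out = W_shared @ x                             # shared expert: general knowledge
for g, i in zip(gates, top):                   # sparse experts: weighted sum
    out += g * (experts[i] @ x)

print(out.shape)
```

The key property is that only `top_k` of the `n_experts` routed networks do any work per token, so total parameters can grow without growing per-token compute.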

2. Multi-head Latent Attention

MLA replaces the traditional multi-head attention block in the transformer architecture. Its main advantage is that it compresses the key and value matrices into a shared low-rank latent, so only the latent needs to be cached during inference. This greatly reduces the KV-cache memory footprint with little to no loss in performance.
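The compression can be sketched as a down-projection into a small latent (which is what gets cached) followed by up-projections back to keys and values at attention time. A minimal numpy sketch with made-up dimensions and random weights standing in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, seq_len = 64, 8, 16

h = rng.normal(size=(seq_len, d_model))        # per-token hidden states

# Down-project each token into a small shared latent; cache only this.
W_down = rng.normal(size=(d_model, d_latent))
latent = h @ W_down                            # (seq_len, d_latent)

# At attention time, up-project the cached latent back to K and V.
W_uk = rng.normal(size=(d_latent, d_model))
W_uv = rng.normal(size=(d_latent, d_model))
K = latent @ W_uk
V = latent @ W_uv

# Cache size comparison: latent entries vs. full K + V entries.
print(latent.size, K.size + V.size)            # 128 vs 2048: 16x smaller cache
```

Caching the latent instead of full K and V shrinks the cache by a factor of `2 * d_model / d_latent` (16x here), which is where the inference memory savings come from.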

3. Rotary Positional Encoding

Instead of applying traditional positional encoding at the input of the Transformer, DeepSeek applies RoPE inside MLA. Unlike traditional additive positional encoding, which changes the magnitude of the input vector and slightly corrupts it, RoPE applies only a rotation, which preserves the vector's magnitude.
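A minimal sketch of RoPE for a single vector: consecutive dimension pairs (2i, 2i+1) are rotated by an angle proportional to the token's position, with per-pair frequencies following the standard `base^(-2i/d)` schedule. Since each pair undergoes a pure 2x2 rotation, the vector's norm is unchanged.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate dimension pairs (2i, 2i+1) of x by pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(d // 2) * 2.0 / d)  # per-pair frequencies
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]                       # even/odd dims pair up
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                 # 2x2 rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

x = np.random.default_rng(0).normal(size=(8,))
y = rope(x, pos=5)
print(np.allclose(np.linalg.norm(x), np.linalg.norm(y)))  # True: magnitude preserved
```

Note that at position 0 the rotation angles are all zero, so `rope(x, 0)` returns `x` unchanged; the relative angle between two positions is what encodes their distance.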

References

  1. Build a DeepSeek Model (From Scratch)
  2. Open-Source DeepSeek code release on Hugging-Face
  3. DeepSeek R1 vs V3 Architecture Comparison

About

Implementation of DeepSeek’s major novel components for learning and experimentation purposes
