This repository contains my implementation of the core components of DeepSeek's models: Mixture of Experts (MoE), Multi-Head Latent Attention (MLA), and Rotary Positional Encoding (RoPE). I had a great time reading through Build a DeepSeek Model (From Scratch), and implementing these core components solidified my understanding of MLA KV caching and the MoE architecture, both of which help reduce inference latency.
You would probably need to attend a lecture or read a book to understand the nitty-gritty details of the full DeepSeek architecture, but this repository is useful for understanding the noteworthy points listed below. The architecture diagram below (Figure 1) provides a visual overview of how these components fit together.
Figure 1: DeepSeek R1 Architecture with Mixture-of-Experts (MoE) and Multi-Head Latent Attention components.
MoE replaces the traditional dense feed-forward network with a set of smaller, sparse expert networks managed by a router. This lets the model develop different domain experts for different problem areas and reduces inference time, since only a subset of experts is activated for each token in a forward pass. DeepSeek also introduces shared experts that capture general knowledge, combating the knowledge redundancy and knowledge hybridity problems that are prevalent in traditional MoE networks.
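For intuition, here is a minimal PyTorch sketch of a sparse MoE layer with a linear router, top-k expert selection, and a shared expert. The class names, dimensions, and routing details (`SparseMoE`, `Expert`, `top_k=2`, etc.) are illustrative assumptions, not the exact implementation in this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A small feed-forward expert network."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.SiLU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x):
        return self.net(x)

class SparseMoE(nn.Module):
    """Sparse MoE layer sketch: a router picks the top-k experts per token,
    and a shared expert always contributes general knowledge."""
    def __init__(self, d_model=64, d_hidden=256, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_experts))
        self.shared_expert = Expert(d_model, d_hidden)
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                                  # x: (batch, seq, d_model)
        scores = self.router(x)                            # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)               # normalize over chosen experts
        routed = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                    # tokens sent to expert e in slot k
                if mask.any():
                    routed[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return self.shared_expert(x) + routed              # shared expert sees every token

x = torch.randn(2, 10, 64)
print(SparseMoE()(x).shape)                                # torch.Size([2, 10, 64])
```

Note that only `top_k` of the routed experts run per token, which is what keeps the forward pass sparse even as the total expert count grows.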
MLA replaces the traditional multi-head attention block in the Transformer architecture. Its main advantage is that it compresses the Key and Value matrices in the attention computation into a low-rank latent, so only that latent needs to be cached during inference. This greatly reduces the memory required for the KV cache with little to no loss in performance.
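A simplified sketch of the latent-compression idea, assuming a toy module (it omits DeepSeek's decoupled RoPE path and other details, and all names and dimensions here are illustrative): keys and values are reconstructed from a small latent vector, so only that latent has to be cached between decoding steps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedMLA(nn.Module):
    """Sketch of Multi-Head Latent Attention: keys/values are rebuilt from a small
    shared latent vector, so only the latent needs to be cached at inference time."""
    def __init__(self, d_model=64, n_heads=4, d_latent=16):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)         # compress to latent (cached)
        self.k_up = nn.Linear(d_latent, d_model)            # reconstruct keys from latent
        self.v_up = nn.Linear(d_latent, d_model)            # reconstruct values from latent
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, latent_cache=None):                # x: (batch, seq, d_model)
        b, s, _ = x.shape
        latent = self.kv_down(x)                            # (batch, seq, d_latent)
        if latent_cache is not None:                        # append to previously cached latents
            latent = torch.cat([latent_cache, latent], dim=1)
        q = self.q_proj(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v)      # (batch, heads, seq, d_head)
        out = attn.transpose(1, 2).reshape(b, s, -1)
        return self.out_proj(out), latent                   # latent doubles as the new cache

x = torch.randn(2, 10, 64)
y, cache = SimplifiedMLA()(x)
print(y.shape, cache.shape)   # torch.Size([2, 10, 64]) torch.Size([2, 10, 16])
```

The memory saving comes from caching the `d_latent`-sized vector per token instead of full per-head keys and values (16 floats per token here versus 128 for separate K and V).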
Instead of applying positional encoding once at the start of the Transformer, DeepSeek applies RoPE inside the MLA block. Unlike traditional additive positional encodings, which change the magnitude of the input vectors and slightly corrupt them, RoPE applies only a rotation, which preserves each vector's magnitude.
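Below is a minimal sketch of RoPE applied to a (seq, dim) tensor. The `rope` function name and the base frequency of 10000 follow the common convention and are assumptions for illustration, not necessarily this repository's exact code; the final check confirms that the rotation leaves each token vector's norm unchanged.

```python
import torch

def rope(x, base=10000.0):
    """Apply rotary positional encoding to x of shape (seq, d); d must be even.
    Each pair of dimensions is rotated by a position- and frequency-dependent angle."""
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)          # (seq, 1)
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)      # (d/2,)
    angles = pos * freqs                                                    # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                                     # split into 2D pairs
    rotated = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return rotated.flatten(-2)                                              # interleave back

x = torch.randn(10, 64)
y = rope(x)
# Rotation preserves the norm of each token vector:
print(torch.allclose(x.norm(dim=-1), y.norm(dim=-1), atol=1e-5))            # True
```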
