LoRA is a parameter-efficient fine-tuning technique for large pre-trained models. Rather than updating all weights during fine-tuning, LoRA injects small trainable matrices alongside frozen weights, reducing trainable parameters by orders of magnitude with little to no loss in performance.
Traditional fine-tuning of a model like GPT-3 (175B parameters) requires updating every weight in every matrix. This is a computationally enormous task. In Formal Algorithms for Transformers (Phuong & Hutter, 2022), the learnable weight matrices in a Transformer include:
W_e, W_p, W_q, W_k, W_v, W_o, W_1, W_2, W_u, ...
Each of these is a large matrix. In GPT-3, the attention matrices W_q, W_k, W_v, W_o are each of shape (12,288 × 12,288). That's ~150 million parameters each.
A few classical approaches existed before LoRA, including:
- Full fine-tuning: update all 175B parameters. Expensive.
- Partial fine-tuning: freeze most layers, update a subset. Less expressive.
LoRA introduces a third, far more efficient approach.
When multiplying two matrices, the output takes the outer dimensions:
(m × r) × (r × n) = (m × n)
┌─────┐   ┌─────────────────────┐   ┌─────────────────────┐
│     │ × │        r × n        │ = │                     │
│m × r│   └─────────────────────┘   │        m × n        │
│     │                             │                     │
└─────┘                             └─────────────────────┘
Example:
     A             B             Output
[1000 × 4] × [4 × 1000] = [1000 × 1000]
| Matrix | Parameters |
|---|---|
| (1000 × 1000) | 1,000,000 |
| (1000 × 4) + (4 × 1000) | 8,000 |
That's a 125× reduction in parameters. Shrink the rank to r=2 and you get a 250× reduction.
For GPT-3's attention matrices (shape 12,288 × 12,288) with rank r = 4:
| Method | Parameters per matrix |
|---|---|
| Full fine-tuning | 12,288 × 12,288 ≈ 151,000,000 |
| LoRA (r = 4) | 2 × (12,288 × 4) = 98,304 |
Optimization improvement: ~1,536× per attention matrix.
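The parameter counts above are easy to verify. A minimal sketch (the function names are illustrative, not from the paper):

```python
def full_params(d, k):
    """Parameters updated when fully fine-tuning a dense (d x k) weight."""
    return d * k

def lora_params(d, k, r):
    """Parameters in the LoRA factors B (d x r) and A (r x k)."""
    return d * r + r * k

# Toy example from the text: a (1000 x 1000) weight with rank r = 4.
print(full_params(1000, 1000))      # 1000000
print(lora_params(1000, 1000, 4))   # 8000  -> 125x reduction

# GPT-3 attention matrix: (12,288 x 12,288) with rank r = 4.
d = k = 12_288
print(full_params(d, k))            # 150994944
print(lora_params(d, k, 4))         # 98304 -> exactly 1536x reduction
```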
Instead of directly modifying a weight matrix W₀, LoRA learns a low-rank change to it:

$$\Delta W = B \cdot A$$

where:

- A has shape (r × k): projects down from input dimension k to rank r
- B has shape (d × r): projects back up to output dimension d
- r ≪ min(d, k) (common choices: r = 4 or r = 8)
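The shape bookkeeping can be checked in a few lines of NumPy. This is a sketch with illustrative dimensions, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4  # illustrative sizes; r << min(d, k)

A = rng.normal(scale=0.01, size=(r, k))  # down-projection, Gaussian init
B = np.zeros((d, r))                     # up-projection, zero init

delta_W = B @ A                          # (d x r) @ (r x k) -> (d x k)
print(delta_W.shape)                     # (64, 48), same shape as W0
print(np.allclose(delta_W, 0))           # True: B = 0 makes the update vanish
```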
For the square attention matrices in GPT-3, d = k = d_model. The full adapted weight used at inference is:

$$W' = W_0 + B \cdot A$$

If we define the output as h, the equation for inference becomes:

$$h = W' x = W_0 x + B A x$$

The trainable matrices are additionally scaled by α/r, where α is a constant (typically set equal to the first r tried and left untuned). The modified forward pass is:

$$h = W_0 x + \frac{\alpha}{r} B A x$$

This scaling reduces the need to retune hyperparameters when varying r.
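The scaled forward pass h = W₀x + (α/r)·BAx can be sketched directly, again with illustrative NumPy dimensions rather than the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r, alpha = 64, 48, 4, 4

W0 = rng.normal(size=(d, k))             # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, k))  # trainable, Gaussian init
B = np.zeros((d, r))                     # trainable, zero init
x = rng.normal(size=(k,))

# B @ (A @ x) avoids ever materializing the (d x k) product B @ A.
h = W0 @ x + (alpha / r) * (B @ (A @ x))
print(np.allclose(h, W0 @ x))            # True at init, since B = 0
```

Note the parenthesization: applying A to x first keeps the intermediate at size r instead of forming the full (d × k) update.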
Original weight matrix W₀ (d × k):
┌─────────────────────┐
│ │ (frozen, never updated)
│ W₀ │
│ │
└─────────────────────┘
LoRA decomposition ΔW = B × A:
B: (d × r) A: (r × k)
┌─────┐   ┌─────────────────────┐
│     │ × │          A          │
│  B  │   └─────────────────────┘
│     │
└─────┘
= ΔW: (d × k) ← same shape as W₀
Combined: W' = W₀ + BA
Because W₀ + BA is the same shape as W₀, there is no additional inference cost after training.
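This merge property is easy to confirm numerically. A sketch with pretend-trained factors (illustrative values, not real checkpoints):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, r = 64, 48, 4
alpha = r  # alpha/r = 1 for simplicity

W0 = rng.normal(size=(d, k))
A = rng.normal(size=(r, k))   # pretend these were trained
B = rng.normal(size=(d, r))
x = rng.normal(size=(k,))

W_merged = W0 + (alpha / r) * (B @ A)     # same (d x k) shape as W0

h_adapter = W0 @ x + (alpha / r) * (B @ (A @ x))
h_merged = W_merged @ x
print(np.allclose(h_adapter, h_merged))   # True: one matmul at inference
```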
- Freeze: all pretrained weights W₀ are locked and never updated.
- Inject: two small trainable matrices A (shape r × k) and B (shape d × r) are added alongside W₀.
- Initialize: A is drawn from a Gaussian distribution, and B is initialized to zero. This means at the start of training:

  $$\Delta W = B \cdot A = \mathbf{0} \cdot A = \mathbf{0}$$

  The model begins training behaving exactly like the pretrained model.
- Train: only A and B receive gradient updates. W₀ is untouched.
- Merge: after training, compute the final weight:

  $$W' = W_0 + B \cdot A$$
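The recipe above can be condensed into a small class. This is a minimal NumPy sketch (class and method names are illustrative, not a reference implementation):

```python
import numpy as np

class LoRALinear:
    def __init__(self, W0, r, alpha=None, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        d, k = W0.shape
        self.W0 = W0                                  # freeze: never updated
        self.A = rng.normal(scale=0.01, size=(r, k))  # inject + Gaussian init
        self.B = np.zeros((d, r))                     # inject + zero init
        self.scale = (alpha if alpha is not None else r) / r

    def forward(self, x):
        # train: only A and B would receive gradients
        return self.W0 @ x + self.scale * (self.B @ (self.A @ x))

    def merge(self):
        # merge: fold the low-rank update into a single (d x k) weight
        return self.W0 + self.scale * (self.B @ self.A)

layer = LoRALinear(np.ones((8, 8)), r=2, rng=np.random.default_rng(0))
x = np.ones(8)
print(np.allclose(layer.forward(x), layer.W0 @ x))  # True: starts as pretrained
```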
The authors argue that when fine-tuning a pretrained model on a new task, the necessary changes to the weights live in a low-dimensional subspace. GPT-3 already encodes English grammar, world knowledge, and reasoning. Adapting it to write legal summaries only requires nudging a small slice of its behavior.
This is the same intuition behind Principal Component Analysis (PCA): most of the useful variance in high-dimensional data can be captured in a surprisingly small number of dimensions. LoRA applies this idea directly to weight updates.
The attention module contains four projection matrices: W_q, W_k, W_v, and W_o. The authors experimentally investigated which combinations to adapt under a fixed parameter budget (Table 5 of the paper). The key finding was that adapting W_q and W_v together yields the best overall performance. Spreading the parameter budget across more matrix types (at a lower rank each) outperformed concentrating all parameters into a single matrix type. In the paper's experiments, LoRA is applied primarily to W_q and W_v for this reason.
The authors benchmarked LoRA against full fine-tuning and other parameter-efficient methods. LoRA matches or beats full fine-tuning on downstream tasks with hundreds to thousands of times fewer trainable parameters.
Limitations:
- More complex domain shifts may require a higher rank r or adapting more layers.
- The optimal choice of r and which layers to adapt is partly empirical; r = 4 and r = 8 are commonly used in industry.
During training, standard backpropagation computes a loss L and propagates gradients backward through the network. In full fine-tuning, with output h = Wx, the gradient with respect to W is:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial h} \cdot x^\top$$

This is a (d × k) matrix, which is expensive to store and update.
In LoRA, since W₀ is frozen, gradients only flow through A and B. Let the forward pass output be the unscaled version from above:

$$h = W_0 x + B A x$$

where A is (r × k) and B is (d × r). The gradient with respect to the output is ∂L/∂h. By the chain rule:

$$\frac{\partial L}{\partial A} = B^\top \cdot \frac{\partial L}{\partial h} \cdot x^\top$$

- B^⊤ has shape (r × d), collapsing the (d × 1) upstream gradient into an (r × 1) vector in the low-rank space.
- The outer product with x^⊤ (1 × k) gives the result shape (r × k), matching A.

$$\frac{\partial L}{\partial B} = \frac{\partial L}{\partial h} \cdot (Ax)^\top$$

- ∂L/∂h has shape (d × 1).
- The outer product with (Ax)^⊤ (1 × r) gives the result shape (d × r), matching B.
At step 0, since B = 0:

$$\frac{\partial L}{\partial A} = B^\top \cdot \frac{\partial L}{\partial h} \cdot x^\top = \mathbf{0}$$
A receives no gradient update at the very first step. Meanwhile, B's gradient ∂L/∂B = ∂L/∂h · (Ax)^T is nonzero since A is initialized from a Gaussian. This means B begins accumulating updates first, after which A's gradients become nonzero and both matrices train together. Crucially, this initialization ensures ΔW = BA = 0 at the start, so the model begins training behaving exactly like the pretrained model.
Using standard gradient descent with learning rate η:

$$A \leftarrow A - \eta \frac{\partial L}{\partial A}, \qquad B \leftarrow B - \eta \frac{\partial L}{\partial B}$$
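One manual update step makes the step-0 behavior concrete. A NumPy sketch using a squared-error loss for illustration (the loss choice and dimensions are assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, r, eta = 6, 5, 2, 0.1

W0 = rng.normal(size=(d, k))
A = rng.normal(scale=0.1, size=(r, k))
B = np.zeros((d, r))
x = rng.normal(size=(k,))
target = rng.normal(size=(d,))

h = W0 @ x + B @ (A @ x)          # unscaled forward pass
dL_dh = h - target                # gradient of 0.5 * ||h - target||^2

grad_A = B.T @ np.outer(dL_dh, x)  # (r x d) @ (d x k) -> (r x k)
grad_B = np.outer(dL_dh, A @ x)    # (d x 1) outer (1 x r) -> (d x r)

print(np.allclose(grad_A, 0))     # True at step 0, since B = 0
print(np.any(grad_B != 0))        # True: B starts learning first

A -= eta * grad_A                 # standard gradient descent updates
B -= eta * grad_B
```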
In practice, LoRA is compatible with any optimizer (Adam, AdamW, etc.), and the low dimensionality of A and B means the optimizer state itself is also dramatically smaller, a secondary but meaningful efficiency gain.
Based on Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (2021).