I was curious how LSTMs actually work. So I built one from scratch using only NumPy, then compared it against PyTorch.
When I finally trained my custom LSTM and compared it to PyTorch models, something unexpected happened.
My model was much smoother.
At first I thought this meant it was worse. But the plot told a different story.
- My LSTM captured the long-term trend extremely well
- PyTorch models followed short-term oscillations more aggressively
- My model was stable, calm, and conservative
- PyTorch models were reactive and noisy
This forced me to confront a deeper idea:
Models don't just learn from data. They learn from the biases encoded in their implementation.
My small learning rate, summed gradients, momentum, and scalar input all pushed the model toward long-term structure over short-term noise.
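To make the momentum part of that bias concrete, here is a minimal sketch of a classic momentum update applied to a noisy gradient stream. The helper name `momentum_step`, the coefficient `beta=0.9`, the learning rate, and the synthetic gradients are all illustrative assumptions, not the values or code used in the custom LSTM.

```python
import numpy as np

def momentum_step(param, grad, velocity, lr=0.01, beta=0.9):
    """One classic momentum update (hypothetical helper for illustration)."""
    velocity = beta * velocity - lr * grad  # keep most of the old motion, add a small new push
    return param + velocity, velocity

rng = np.random.default_rng(0)
# Synthetic gradients: a constant "trend" of 1.0 plus zero-mean noise
grads = 1.0 + rng.normal(0.0, 1.0, size=200)

param, velocity = 0.0, 0.0
velocities = []
for g in grads:
    param, velocity = momentum_step(param, g, velocity)
    velocities.append(velocity)
velocities = np.array(velocities)

raw_steps = -0.01 * grads  # the per-step update plain SGD would take

# Noise relative to the average step size, with and without momentum
rel_raw = np.std(raw_steps) / abs(np.mean(raw_steps))
rel_momentum = np.std(velocities[50:]) / abs(np.mean(velocities[50:]))
print(f"plain SGD: {rel_raw:.2f}, momentum: {rel_momentum:.2f}")
```

Under these assumptions the momentum updates are much less noisy relative to their average size, which is one way an optimizer can favor slow trends over step-to-step fluctuations.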
Writing an LSTM from scratch taught me things no framework tutorial ever did:
- Why gradients explode before they explode
- Why momentum is stored motion, not magic
- Why stability often means high bias, not correctness
- Why most deep learning bugs are shape and flow bugs, not math bugs
- Why frameworks feel "easy" only because they hide thousands of careful design decisions
The most interesting part wasn't that the model worked.
It was realizing that:
- Learning rate is not just speed
- Smoothness is not just underfitting
- Correctness and stability are different goals
- Abstraction hides tradeoffs, not complexity
| Model | MSE |
|---|---|
| PyTorch CNN | 0.000900 |
| PyTorch RNN | 0.003673 |
| PyTorch LSTM | 0.004645 |
| Custom LSTM | 0.006117 |
┌─────────────────────────────────────────────────────────────┐
│                          LSTM Cell                          │
│                                                             │
│  xt ──┬──► [Forget Gate] ──► ft                             │
│       ├──► [Input Gate]  ──► it                             │
│       ├──► [C-tilde]     ──► c̃t ◄── The secret sauce        │
│       └──► [Output Gate] ──► ot                             │
│                                                             │
│  Cell State:   ct = ft ⊙ ct-1 + it ⊙ c̃t                     │
│  Hidden State: ht = ot ⊙ tanh(ct)                           │
└─────────────────────────────────────────────────────────────┘
Everyone talks about the forget gate. But the real elegance is in c̃t — the candidate cell state.
It's a proposal. C-tilde says: "Here's what I think we should remember." But it doesn't write directly to memory — it passes through the input gate first.
The network learns two things separately:
- What information to extract (c-tilde)
- How much of it to keep (input gate)
Tanh keeps the candidate bounded in (-1, 1), centered at zero, with smooth gradients. This lets c-tilde say "add this" or "subtract this" instead of only accumulating.
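The split between "what to extract" and "how much to keep" can be sketched as a single forward step in NumPy. This is a minimal illustration, not the project's actual code: the stacked weight layout, the names `lstm_step`, `W`, and `b`, and the sizes are assumptions. It follows the standard formulation, in which each gate also sees the previous hidden state ht-1 (the diagram shows only xt for brevity).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM forward step.

    Assumed layout for this sketch: W has shape (4 * hidden, hidden + n_in)
    and b has shape (4 * hidden,), with the four blocks ordered
    forget, input, candidate, output.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0 * hidden:1 * hidden])      # forget gate: how much old memory to keep
    i_t = sigmoid(z[1 * hidden:2 * hidden])      # input gate: how much of the proposal to write
    c_tilde = np.tanh(z[2 * hidden:3 * hidden])  # candidate: the signed proposal
    o_t = sigmoid(z[3 * hidden:4 * hidden])      # output gate: how much memory to expose
    c_t = f_t * c_prev + i_t * c_tilde           # cell state update
    h_t = o_t * np.tanh(c_t)                     # hidden state
    return h_t, c_t, (f_t, i_t, c_tilde, o_t)

rng = np.random.default_rng(0)
hidden, n_in = 4, 1  # scalar input
W = rng.normal(0.0, 0.1, size=(4 * hidden, hidden + n_in))
b = np.zeros(4 * hidden)

h0, c0 = np.zeros(hidden), np.zeros(hidden)
h1, c1, (f_t, i_t, c_tilde, o_t) = lstm_step(np.array([0.5]), h0, c0, W, b)
print(c_tilde)  # signed values strictly between -1 and 1
print(i_t)      # values strictly between 0 and 1
```

Because the candidate is signed while the input gate is a pure scale factor, the product `i_t * c_tilde` can push each memory cell up or down by a controlled amount.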
Bug #1: Initializing gradients with random values instead of zeros. The model was learning from noise.
Bug #2: Computing gradients perfectly... then never applying them. The optimizer step was missing entirely.
Both bugs produced models that "worked" — they ran, produced outputs, even showed decreasing loss. But they weren't learning.
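Vanilla RNNs fail because backpropagated gradients are multiplied by the same recurrent weight matrix at every time step. If that matrix's largest eigenvalue magnitude is below 1, gradients shrink exponentially with sequence length (vanishing). If it is above 1, they can grow exponentially (exploding).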
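A small numerical sketch of that effect, assuming a linear recurrence with no activation function. The helper `gradient_norm_after`, the matrix size, and the step count are hypothetical choices for illustration.

```python
import numpy as np

def gradient_norm_after(steps, spectral_radius, size=8, seed=0):
    """Norm of a gradient vector after `steps` multiplications by W^T.

    W is a random matrix rescaled so that its largest eigenvalue magnitude
    equals `spectral_radius`. Activation derivatives are ignored (assumption).
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(size, size))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    g = np.ones(size)
    for _ in range(steps):
        g = W.T @ g  # one step of backpropagation through time
    return float(np.linalg.norm(g))

shrinking = gradient_norm_after(200, 0.9)
growing = gradient_norm_after(200, 1.1)
print(f"radius 0.9: {shrinking:.3e}   radius 1.1: {growing:.3e}")
```

The sketch drops the activation derivative; in a tanh RNN that derivative is at most 1, so it can only shrink the gradient further.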
LSTMs fix this with the cell state highway. Along the direct path from ct to ct-1, the gradient is scaled elementwise by the forget gate ft, which the network learns to control. When ft ≈ 1, gradients pass through almost unchanged, so the network can learn to keep the highway open.
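As a back-of-the-envelope check, the gradient along the direct cell-state path over many steps is the elementwise product of the forget gates at those steps. The sketch below ignores the indirect paths through the hidden state, and the gate values 0.99 and 0.5 are made-up examples.

```python
import numpy as np

def cell_state_gradient(forget_gates):
    """Direct-path gradient of c_T with respect to c_0.

    `forget_gates` has shape (steps, hidden); the result is the elementwise
    product over steps. Indirect paths through h_t are ignored (assumption).
    """
    return np.prod(forget_gates, axis=0)

steps = 100
open_highway = cell_state_gradient(np.full((steps, 1), 0.99))
closed_highway = cell_state_gradient(np.full((steps, 1), 0.5))
print(open_highway, closed_highway)
```

With gates near 1 a usable fraction of the gradient survives 100 steps, while gates at 0.5 erase it almost completely, so "keeping the highway open" is something the network has to learn per cell.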
pip install numpy torch plotly pandas
python testlstm.py

The best way to understand a neural network is to build one that doesn't work, then figure out why.
