Why LSTM? The Problem with RNNs
⚠️ RNN's Memory Problem
Standard RNNs struggle with long-term dependencies. Information from early in a sequence fades away before it can be used.
Example sentence:
"I grew up in France... [100 words later] ...so I speak fluent ___"
RNN forgets "France" by the time it needs to predict "French"!
💡 LSTM's Solution
LSTM introduces a Cell State — a highway that carries information across many time steps with minimal loss. Special gates control what to remember and forget.
[Chart: Memory Retention Comparison, RNN vs. LSTM over time steps]
🔑 Key Insight: LSTM can remember information for hundreds of time steps!
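One way to see this difference is on a toy "recall the first token" task, which mimics the "France ... fluent ___" example. Below is a minimal sketch assuming PyTorch; the task setup, model sizes, and number of training steps are illustrative choices, not from the original material.

```python
import torch
import torch.nn as nn

SEQ_LEN, N_CLASSES, HIDDEN = 100, 5, 32

def make_batch(batch_size=64):
    # The first token carries the label; everything after it is irrelevant filler.
    labels = torch.randint(0, N_CLASSES, (batch_size,))
    x = torch.randn(batch_size, SEQ_LEN, N_CLASSES)
    x[:, 0, :] = nn.functional.one_hot(labels, N_CLASSES).float()
    return x, labels

class Recaller(nn.Module):
    def __init__(self, cell):
        super().__init__()
        rnn_cls = nn.LSTM if cell == "lstm" else nn.RNN
        self.rnn = rnn_cls(N_CLASSES, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, N_CLASSES)

    def forward(self, x):
        out, _ = self.rnn(x)          # out: (batch, seq, hidden)
        return self.head(out[:, -1])  # predict from the last time step only

def train(cell, steps=500):
    model = Recaller(cell)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        x, y = make_batch()
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    x, y = make_batch(256)
    acc = (model(x).argmax(dim=-1) == y).float().mean().item()
    print(f"{cell}: recall accuracy = {acc:.2f}")

train("rnn")   # the plain RNN tends to forget the first token over 100 steps
train("lstm")  # the LSTM can usually carry it across the whole gap
```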
LSTM Architecture Overview
The LSTM Cell
Two Memory States:
Cell State (C)
The "highway" for long-term memory. Information flows with minimal changes, protected by gates.
Hidden State (h)
The "working memory" — a filtered version of cell state, used for output and next step.
Three Gates:
Forget
Input
Output
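As a concrete illustration of the two memory states, here is a minimal PyTorch sketch (the dimensions are made up) showing that an LSTM layer returns both a hidden state and a cell state, while a plain RNN returns only a hidden state.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
x = torch.randn(4, 50, 10)   # batch of 4 sequences, 50 time steps, 10 features

output, (h_n, c_n) = lstm(x)
print(output.shape)  # (4, 50, 20) -> hidden state h at every time step
print(h_n.shape)     # (1, 4, 20)  -> final hidden state (working memory)
print(c_n.shape)     # (1, 4, 20)  -> final cell state (long-term memory)

# A plain RNN, by contrast, tracks only a hidden state:
rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
output, h_n = rnn(x)
```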
The Three Gates - Deep Dive
Gates are like smart filters with values between 0 (block everything) and 1 (let everything through):
Forget Gate
What to throw away
f(t) = σ(Wf·[h(t-1), x(t)] + bf)
Outputs 0-1 for each cell state value
💭 Example: New subject in text? Forget old subject's gender.
Input Gate
What new info to store
i(t) = σ(Wi·[h(t-1), x(t)] + bi)
c̃(t) = tanh(Wc·[h(t-1), x(t)] + bc)
💭 Example: New subject? Store new gender info.
Output Gate
What to output
o(t) = σ(Wo·[h(t-1), x(t)] + bo)
h(t) = o(t) * tanh(C(t))
💭 Example: Verb coming next? Output singular/plural info.
🧮 Cell State Update Formula
C(t) = f(t) * C(t-1) + i(t) * c̃(t)
(old memory × forget factor + new memory × input factor)
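As a tiny worked illustration of this update, the NumPy snippet below plugs in made-up gate values; they are not outputs of any trained network.

```python
import numpy as np

C_prev  = np.array([ 0.8, -0.5,  0.3])   # old cell state (long-term memory)
f       = np.array([ 0.9,  0.1,  1.0])   # forget gate: keep 90%, keep 10%, keep all
i       = np.array([ 0.0,  0.8,  0.2])   # input gate: how much new info to admit
c_tilde = np.array([ 0.4,  0.7, -0.6])   # candidate values proposed at this step

C = f * C_prev + i * c_tilde
print(C)  # [0.72, 0.51, 0.18] -- element-wise: old memory kept + new memory added
```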
Data Flow - Step by Step Animation
Click "Play Animation" to see how data flows through each gate of the LSTM cell.
f = σ(Wf·[h,x])
i = σ(Wi·[h,x])
c̃ = tanh(Wc·[h,x])
o = σ(Wo·[h,x])
C = f * C_prev + i * c̃
h = o * tanh(C)
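Putting the whole flow together, here is a from-scratch sketch of a single LSTM step in NumPy, following the gate equations above; the weight shapes, random initialization, and toy dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, W, b):
    """One LSTM time step. W and b hold the forget/input/candidate/output parameters."""
    concat = np.concatenate([h_prev, x])          # [h(t-1), x(t)]

    f = sigmoid(W["f"] @ concat + b["f"])         # forget gate
    i = sigmoid(W["i"] @ concat + b["i"])         # input gate
    c_tilde = np.tanh(W["c"] @ concat + b["c"])   # candidate values
    o = sigmoid(W["o"] @ concat + b["o"])         # output gate

    C = f * C_prev + i * c_tilde                  # cell state update
    h = o * np.tanh(C)                            # hidden state update
    return h, C

# Toy dimensions: 3 input features, 4 hidden units.
n_in, n_hid = 3, 4
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for k in "fico"}
b = {k: np.zeros(n_hid) for k in "fico"}

h, C = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):              # run 5 time steps
    h, C = lstm_step(x, h, C, W, b)
print(h, C)
```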
RNN vs LSTM Comparison
| | Standard RNN | LSTM |
|---|---|---|
| Architecture | Simple recurrence | Gated memory cells |
| Memory states | Single hidden state | Cell state + hidden state |
| Control | Simple tanh activation | 3 gates control the flow |
| Memory span | Short-term memory only | Long-term memory |
| Gradients | Vanishing gradients | Solves vanishing gradients |
| Training speed | Faster to train | Slower to train |
| Parameters | Fewer parameters | More parameters (4x RNN) |
| Best for | Short sequences, simple patterns | Long sequences, complex dependencies |
Parameter Count Comparison
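A quick way to check the 4x factor is to count the parameters of equally sized recurrent layers in PyTorch (the layer sizes below are arbitrary): an LSTM layer holds four weight/bias sets (forget, input, candidate, output) where a plain RNN holds one.

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

rnn  = nn.RNN(input_size=128, hidden_size=256)
lstm = nn.LSTM(input_size=128, hidden_size=256)

print(n_params(rnn))                    # 98,816
print(n_params(lstm))                   # 395,264
print(n_params(lstm) / n_params(rnn))   # 4.0
```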
📝 LSTM Summary
Forget Gate
Decides what to remove from memory
Input Gate
Decides what new info to store
Candidate
Creates new candidate values
Output Gate
Decides what to output
🎯 Key Takeaway: LSTM's gates allow it to selectively remember important information and forget irrelevant details, solving the vanishing gradient problem!