🧠 Understanding LSTM

Long Short-Term Memory - Complete Flow

1. Why LSTM? The Problem with RNN

⚠️ RNN's Memory Problem

Standard RNNs struggle with long-term dependencies. Information from early in a sequence fades away before it can be used.

Example sentence:

"I grew up in France... [100 words later] ...so I speak fluent ___"

RNN forgets "France" by the time it needs to predict "French"!

💡 LSTM's Solution

LSTM introduces a Cell State — a highway that carries information across many time steps with minimal loss. Special gates control what to remember and forget.

Memory Retention Comparison: an RNN's memory of the input at t=0 has faded by around t=20, while an LSTM preserves it across the same span.

🔑 Key Insight: LSTM can remember information for hundreds of time steps!

2. LSTM Architecture Overview

The LSTM Cell

[Diagram: the LSTM cell at time step t. The cell state runs across the top as a long-term memory highway; h(t-1) and x(t) enter the cell and feed the forget (f), input (i), candidate, and output (o) gates, whose × and + operations update the cell state and produce h(t).]

Two Memory States:

Cell State (C): the "highway" for long-term memory. Information flows with minimal changes, protected by gates.

Hidden State (h): the "working memory", a filtered version of the cell state, used for the output and the next step.

Three Gates:

f (Forget), i (Input), o (Output)
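
To see the two states in practice, here is a minimal PyTorch sketch with arbitrary sizes: torch.nn.LSTM returns the per-step hidden states plus the final hidden state and cell state as a pair.

```python
import torch
import torch.nn as nn

# A single-layer LSTM with arbitrary sizes, just to inspect the shapes.
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(1, 20, 8)        # (batch, seq_len, input_size)

output, (h_n, c_n) = lstm(x)     # output = h(t) at every time step
print(output.shape)              # torch.Size([1, 20, 16])
print(h_n.shape, c_n.shape)      # final hidden and cell state: torch.Size([1, 1, 16]) each
```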

3. The Three Gates - Deep Dive

Gates are like smart filters with values between 0 (block everything) and 1 (let everything through):

Forget Gate (f): what to throw away

f(t) = σ(Wf·[h(t-1), x(t)] + bf)

Outputs 0-1 for each cell state value: 0 = completely forget, 1 = completely keep.

💭 Example: New subject in the text? Forget the old subject's gender.

Input Gate (i): what new info to store

i(t) = σ(Wi·[h(t-1), x(t)] + bi)

c̃(t) = tanh(Wc·[h(t-1), x(t)] + bc)

i = how much to update; c̃ = the candidate values.

💭 Example: New subject? Store the new gender info.

Output Gate (o): what to output

o(t) = σ(Wo·[h(t-1), x(t)] + bo)

h(t) = o(t) * tanh(C(t))

o = filter on the cell state; h = the final output.

💭 Example: Verb coming next? Output singular/plural info.

🧮 Cell State Update Formula

C(t) = f(t) × C(t-1) + i(t) × c̃(t)

Old memory × forget factor + New memory × input factor
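
Putting the three gate formulas and the cell state update together, here is a minimal NumPy sketch of a single LSTM step; the function name lstm_step and the toy sizes are just for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    """One LSTM time step, following the gate formulas above."""
    z = np.concatenate([h_prev, x_t])   # [h(t-1), x(t)]
    f = sigmoid(Wf @ z + bf)            # forget gate: what to keep of C(t-1)
    i = sigmoid(Wi @ z + bi)            # input gate: how much new info to add
    c_tilde = np.tanh(Wc @ z + bc)      # candidate values
    C = f * C_prev + i * c_tilde        # C(t) = f(t) x C(t-1) + i(t) x c~(t)
    o = sigmoid(Wo @ z + bo)            # output gate: what to expose
    h = o * np.tanh(C)                  # h(t) = o(t) * tanh(C(t))
    return h, C

# Toy sizes: input size 3, hidden size 5, random weights.
rng = np.random.default_rng(0)
D, H = 3, 5
Wf, Wi, Wc, Wo = (rng.standard_normal((H, H + D)) * 0.1 for _ in range(4))
bf = bi = bc = bo = np.zeros(H)
h, C = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H),
                 Wf, Wi, Wc, Wo, bf, bi, bc, bo)
print(h.shape, C.shape)                 # (5,) (5,)
```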

4. Data Flow - Step by Step

[Diagram: data flow through the LSTM cell in five steps. (1) h(t-1) and x(t) are concatenated; (2) the forget gate f (σ) scales C(t-1); (3) the input gate i (σ) and the candidate (tanh) add new information; (4) the + node produces C(t); (5) the output gate o (σ) and a final tanh produce h(t) and y(t), passed to the next step.]


f = σ(Wf·[h,x])

i = σ(Wi·[h,x])

c̃ = tanh(Wc·[h,x])

o = σ(Wo·[h,x])
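
The same flow written as a loop over a whole sequence, as a minimal NumPy sketch. The four gate weight matrices are stacked into a single W only to keep the sketch short; the sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
D, H, T = 4, 8, 20                       # input size, hidden size, sequence length

# Weights for the four gates (f, i, candidate, o) stacked row-wise for brevity.
W = rng.standard_normal((4 * H, H + D)) * 0.1
b = np.zeros(4 * H)

h, C = np.zeros(H), np.zeros(H)          # initial hidden and cell state
xs = rng.standard_normal((T, D))         # a toy input sequence

for t in range(T):
    z = W @ np.concatenate([h, xs[t]]) + b
    f, i = sigmoid(z[:H]), sigmoid(z[H:2*H])
    c_tilde, o = np.tanh(z[2*H:3*H]), sigmoid(z[3*H:])
    C = f * C + i * c_tilde              # cell state: the long-term memory highway
    h = o * np.tanh(C)                   # hidden state passed to the next step

print(h.shape, C.shape)                  # (8,) (8,)
```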

5. RNN vs LSTM Comparison

🔄 Standard RNN

Simple recurrence

Single hidden state

Simple tanh activation

Short-term memory only

Vanishing gradients

Faster to train

Fewer parameters

Best for: Short sequences, simple patterns

🧠 LSTM

Gated memory cells

Cell state + Hidden state

3 gates control flow

Long-term memory

Solves vanishing gradients

Slower to train

More parameters (4x RNN)

Best for: Long sequences, complex dependencies

Parameter Count Comparison

RNN ~10K params
LSTM ~40K params (4x)
GRU (alternative) ~30K params (3x)
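
If PyTorch is available, the ratios are easy to verify directly; the sizes below are arbitrary, and the exact counts depend on the input and hidden dimensions, not on the numbers in the chart above.

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

I, H = 64, 128                           # arbitrary input and hidden sizes
rnn, lstm, gru = nn.RNN(I, H), nn.LSTM(I, H), nn.GRU(I, H)

print(n_params(rnn))                     # 24832  (H*I + H*H + 2*H)
print(n_params(lstm) / n_params(rnn))    # 4.0 -- f, i, candidate, o weight sets
print(n_params(gru) / n_params(rnn))     # 3.0 -- three weight sets
```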

📝 LSTM Summary

Forget Gate (f): decides what to remove from memory

Input Gate (i): decides what new info to store

Candidate (c̃): creates new candidate values

Output Gate (o): decides what to output

🎯 Key Takeaway: LSTM's gates allow it to selectively remember important information and forget irrelevant details, solving the vanishing gradient problem!

✨ Language Modeling
🎵 Music Generation
📈 Time Series
🗣️ Speech Recognition
🌐 Machine Translation