Why LSTM? The Problem with RNNs
⚠️ RNN's Memory Problem
Standard RNNs struggle with long-term dependencies. Information from early in a sequence fades away before it can be used.
Example sentence:
"I grew up in France... [100 words later] ...so I speak fluent ___"
RNN forgets "France" by the time it needs to predict "French"!
💡 LSTM's Solution
LSTM introduces a Cell State — a highway that carries information across many time steps with minimal loss. Special gates control what to remember and forget.
[Chart: Memory Retention Comparison, RNN vs. LSTM over time steps]
🔑 Key Insight: LSTM can remember information for hundreds of time steps!
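One way to see this difference is on a toy "recall the first token" task, which mimics the "France ... fluent ___" example. Below is a minimal sketch assuming PyTorch; the task setup, model sizes, and number of training steps are illustrative choices, not from the original material.

```python
import torch
import torch.nn as nn

SEQ_LEN, N_CLASSES, HIDDEN = 100, 5, 32

def make_batch(batch_size=64):
    # The first token carries the label; everything after it is irrelevant filler.
    labels = torch.randint(0, N_CLASSES, (batch_size,))
    x = torch.randn(batch_size, SEQ_LEN, N_CLASSES)
    x[:, 0, :] = nn.functional.one_hot(labels, N_CLASSES).float()
    return x, labels

class Recaller(nn.Module):
    def __init__(self, cell):
        super().__init__()
        rnn_cls = nn.LSTM if cell == "lstm" else nn.RNN
        self.rnn = rnn_cls(N_CLASSES, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, N_CLASSES)

    def forward(self, x):
        out, _ = self.rnn(x)          # out: (batch, seq, hidden)
        return self.head(out[:, -1])  # predict from the last time step only

def train(cell, steps=500):
    model = Recaller(cell)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        x, y = make_batch()
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    x, y = make_batch(256)
    acc = (model(x).argmax(dim=-1) == y).float().mean().item()
    print(f"{cell}: recall accuracy = {acc:.2f}")

train("rnn")   # the plain RNN tends to forget the first token over 100 steps
train("lstm")  # the LSTM can usually carry it across the whole gap
```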
LSTM Architecture Overview
The LSTM Cell
Two Memory States:
Cell State (C)
The "highway" for long-term memory. Information flows with minimal changes, protected by gates.
Hidden State (h)
The "working memory" — a filtered version of cell state, used for output and next step.
Three Gates:
Forget
Input
Output
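As a concrete illustration of the two memory states, here is a minimal PyTorch sketch (the dimensions are made up) showing that an LSTM layer returns both a hidden state and a cell state, while a plain RNN returns only a hidden state.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
x = torch.randn(4, 50, 10)   # batch of 4 sequences, 50 time steps, 10 features

output, (h_n, c_n) = lstm(x)
print(output.shape)  # (4, 50, 20) -> hidden state h at every time step
print(h_n.shape)     # (1, 4, 20)  -> final hidden state (working memory)
print(c_n.shape)     # (1, 4, 20)  -> final cell state (long-term memory)

# A plain RNN, by contrast, tracks only a hidden state:
rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
output, h_n = rnn(x)
```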
The Three Gates - Deep Dive
Gates are like smart filters with values between 0 (block everything) and 1 (let everything through):
Forget Gate
What to throw away
f(t) = σ(Wf·[h(t-1), x(t)] + bf)
Outputs 0-1 for each cell state value
💭 Example: New subject in text? Forget old subject's gender.
Input Gate
What new info to store
i(t) = σ(Wi·[h(t-1), x(t)] + bi)
c̃(t) = tanh(Wc·[h(t-1), x(t)] + bc)
💭 Example: New subject? Store new gender info.
Output Gate
What to output
o(t) = σ(Wo·[h(t-1), x(t)] + bo)
h(t) = o(t) * tanh(C(t))
💭 Example: Verb coming next? Output singular/plural info.
🧮 Cell State Update Formula
C(t) = f(t) * C(t-1) + i(t) * c̃(t)
(old memory × forget factor + new memory × input factor)
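As a tiny worked illustration of this update, the NumPy snippet below plugs in made-up gate values; they are not outputs of any trained network.

```python
import numpy as np

C_prev  = np.array([ 0.8, -0.5,  0.3])   # old cell state (long-term memory)
f       = np.array([ 0.9,  0.1,  1.0])   # forget gate: keep 90%, keep 10%, keep all
i       = np.array([ 0.0,  0.8,  0.2])   # input gate: how much new info to admit
c_tilde = np.array([ 0.4,  0.7, -0.6])   # candidate values proposed at this step

C = f * C_prev + i * c_tilde
print(C)  # [0.72, 0.51, 0.18] -- element-wise: old memory kept + new memory added
```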
Data Flow - Step by Step Animation
Click "Play Animation" to see how data flows through each gate of the LSTM cell.
f = σ(Wf·[h,x])
i = σ(Wi·[h,x])
c̃ = tanh(Wc·[h,x])
o = σ(Wo·[h,x])
C = f * C_prev + i * c̃
h = o * tanh(C)
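Putting the whole flow together, here is a from-scratch sketch of a single LSTM step in NumPy, following the gate equations above; the weight shapes, random initialization, and toy dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, W, b):
    """One LSTM time step. W and b hold the forget/input/candidate/output parameters."""
    concat = np.concatenate([h_prev, x])          # [h(t-1), x(t)]

    f = sigmoid(W["f"] @ concat + b["f"])         # forget gate
    i = sigmoid(W["i"] @ concat + b["i"])         # input gate
    c_tilde = np.tanh(W["c"] @ concat + b["c"])   # candidate values
    o = sigmoid(W["o"] @ concat + b["o"])         # output gate

    C = f * C_prev + i * c_tilde                  # cell state update
    h = o * np.tanh(C)                            # hidden state update
    return h, C

# Toy dimensions: 3 input features, 4 hidden units.
n_in, n_hid = 3, 4
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for k in "fico"}
b = {k: np.zeros(n_hid) for k in "fico"}

h, C = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):              # run 5 time steps
    h, C = lstm_step(x, h, C, W, b)
print(h, C)
```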
RNN vs LSTM Comparison
| | Standard RNN | LSTM |
|---|---|---|
| Architecture | Simple recurrence | Gated memory cells |
| Memory states | Single hidden state | Cell state + hidden state |
| Control | Simple tanh activation | 3 gates control the flow |
| Memory span | Short-term memory only | Long-term memory |
| Gradients | Vanishing gradients | Solves vanishing gradients |
| Training speed | Faster to train | Slower to train |
| Parameters | Fewer parameters | More parameters (4x RNN) |
| Best for | Short sequences, simple patterns | Long sequences, complex dependencies |
Parameter Count Comparison
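A quick way to check the 4x factor is to count the parameters of equally sized recurrent layers in PyTorch (the layer sizes below are arbitrary): an LSTM layer holds four weight/bias sets (forget, input, candidate, output) where a plain RNN holds one.

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

rnn  = nn.RNN(input_size=128, hidden_size=256)
lstm = nn.LSTM(input_size=128, hidden_size=256)

print(n_params(rnn))                    # 98,816
print(n_params(lstm))                   # 395,264
print(n_params(lstm) / n_params(rnn))   # 4.0
```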
📝 LSTM Summary
Forget Gate
Decides what to remove from memory
Input Gate
Decides what new info to store
Candidate
Creates new candidate values
Output Gate
Decides what to output
🎯 Key Takeaway: LSTM's gates allow it to selectively remember important information and forget irrelevant details, solving the vanishing gradient problem!