I Use This When...
I need a model for ordered data such as text, audio, logs, or time series, and the order itself matters. RNNs and LSTMs are the classic way to carry state through a sequence step by step.
History
RNN: Rumelhart, Hinton & Williams (1986). LSTM: Hochreiter & Schmidhuber (1997). GRU: Cho et al. (2014). Recurrent models dominated sequence tasks until Transformers (Vaswani et al., 2017) replaced them for many tasks.
Why It Exists
The "why" chain is:
- MLPs assume fixed-size inputs and no ordering.
- Sequence tasks need memory of what came before.
- A hidden state can carry context forward one token or timestep at a time.
- Plain RNNs struggle to remember far-away information.
- LSTM and GRU add gates to control memory explicitly.
These models exist because ordered data needs recurrence and memory, not just static feature processing.
How It Works
Visual Intuition
Imagine reading a sentence word by word.
- at each step, the model sees the new token
- it updates an internal memory state
- that state carries forward what it thinks still matters
LSTM adds gates so the model can decide what to forget, what to keep, and what to expose.
[Figure: the sequence-memory era]
Step by Step
- Start with an initial hidden state
- Read one token or timestep at a time
- Combine the current input with the previous hidden state
- Produce a new hidden state
- For LSTM/GRU, use gates to control memory flow
- Repeat through the sequence and train with backpropagation through time
The recurrent state is the key mechanism: information can survive across many steps if the architecture keeps it alive.
Code
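# concept sketch (NumPy; sizes and random weights are illustrative assumptions)
import numpy as np
rng = np.random.default_rng(0)
W_h, W_x, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4)
h = np.zeros(4)                           # initial hidden state
for x_t in rng.normal(size=(5, 3)):       # 5 timesteps, 3 features each
    h = np.tanh(W_h @ h + W_x @ x_t + b)  # rnn_cell: new state from input + old state
output = h                                # stand-in for decoder(h)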
The Math Inside
Vanilla RNN update:
h_t = tanh(W_h h_{t-1} + W_x x_t + b)
- x_t: current input
- h_{t-1}: previous hidden state
- h_t: new hidden state
Problem:
- gradients must pass through many repeated steps
- repeated multiplication can make them vanish or explode
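The effect of repeated multiplication can be seen in a small numeric sketch. It assumes a linear recurrence h_t = W_h h_{t-1} (the tanh derivative is left out, and including it could only shrink gradients further), so the gradient of the last state with respect to the first is W_h multiplied by itself once per timestep. The function name, matrix size, and scale factors are illustrative assumptions, not part of any trained model.

```python
import numpy as np

def gradient_norm_through_time(scale, steps=50, size=8, seed=0):
    """Norm of d h_T / d h_0 for the linear recurrence h_t = W_h h_{t-1}.

    W_h is a random orthogonal matrix times `scale`, so every singular
    value equals `scale` (an illustrative assumption, not a trained model).
    """
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(size, size)))  # orthogonal matrix
    W_h = scale * q
    jacobian = np.eye(size)
    for _ in range(steps):
        jacobian = W_h @ jacobian  # chain rule: one more factor per timestep
    return np.linalg.norm(jacobian, ord=2)  # largest singular value

vanishing = gradient_norm_through_time(scale=0.9)
exploding = gradient_norm_through_time(scale=1.1)
print(f"scale 0.9 -> {vanishing:.2e}, scale 1.1 -> {exploding:.2e}")
```

With these settings the two norms are 0.9^50 (about 0.005) and 1.1^50 (about 117): a small change in the size of W_h decides whether the signal from 50 steps back fades away or blows up.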
LSTM addresses this with gated memory:
- forget gate f_t
- input gate i_t
- output gate o_t
- cell state c_t
The core idea is not the exact formula set, but controlled memory flow. GRU keeps the same spirit with fewer gates.
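To make "controlled memory flow" concrete, here is a minimal NumPy sketch of one LSTM step using the commonly cited gate equations. The function name `lstm_step`, the stacked weight layout, the sizes, and the hand-set biases are all assumptions chosen for illustration; real libraries may order and store the weights differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*H, H+D), b has shape (4*H,).

    Rows of W are stacked as [forget, input, output, candidate]; this
    layout is an assumption of the sketch, not a library convention.
    """
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0:H])           # forget gate: how much of c_prev to keep
    i_t = sigmoid(z[H:2 * H])       # input gate: how much new content to write
    o_t = sigmoid(z[2 * H:3 * H])   # output gate: how much of the cell to expose
    g_t = np.tanh(z[3 * H:4 * H])   # candidate content
    c_t = f_t * c_prev + i_t * g_t  # cell state update
    h_t = o_t * np.tanh(c_t)        # hidden state
    return h_t, c_t

D, H = 3, 4  # input size and hidden size (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)
b[0:H] = 10.0       # forget gate saturated near 1: keep old memory
b[H:2 * H] = -10.0  # input gate saturated near 0: write almost nothing

h, c = np.zeros(H), np.ones(H)
for x_t in rng.normal(size=(20, D)):
    h, c = lstm_step(x_t, h, c, W, b)
print(c)  # stays close to the initial cell state of ones
```

The biases are set by hand only to show the mechanism: with the forget gate held open and the input gate held shut, the cell state passes through 20 steps almost unchanged. In training, the network learns when to open and close these gates.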
Math Prerequisites
- Chain Rule - why gradients pass repeatedly through time
- MLP & Backprop - recurrent nets are trained with the same gradient logic
- Transformer - the architecture that replaced recurrence for many tasks
Related
- MLP & Backprop — Base architecture
- Transformer — What replaced RNNs
- BERT — Transformer for understanding
- Chain Rule — Why gradients vanish