Recurrent Neural Networks (RNNs) — Complete Guide with Examples

Deep Learning Series | A comprehensive guide to RNNs with code examples.


Table of Contents

  1. What is an RNN?
  2. How RNNs Work
  3. Types of RNNs
  4. LSTM & GRU
  5. Code Example: NumPy from Scratch
  6. Code Example: PyTorch LSTM
  7. Code Example: Keras GRU
  8. Code Example: Sentiment Analysis
  9. Common Problems & Solutions
  10. Real-World Applications
  11. RNN vs CNN vs FNN
  12. Summary

1. What is a Recurrent Neural Network?

A Recurrent Neural Network (RNN) is a type of artificial neural network designed to work with sequential or time-series data. Unlike feedforward networks that process inputs independently, RNNs have a memory mechanism — they retain information from previous steps and use it to influence future outputs.

Think of reading a sentence: understanding the word “bank” depends on whether you read “river bank” or “bank account.” RNNs solve this by maintaining a hidden state that carries context through each time step.

Key Intuition: The defining feature of an RNN is that its hidden state is fed back as an input to the next step. This creates a loop — a form of short-term memory that allows the network to process sequences of arbitrary length.

2. How RNNs Work

At each time step t, an RNN takes two inputs: the current input x_t and the previous hidden state h_(t-1). It produces a new hidden state h_t and optionally an output y_t.

Unrolled RNN Diagram

         y0           y1           y2           y3
         |            |            |            |
    +---------+  +---------+  +---------+  +---------+
    |   h0    |->|   h1    |->|   h2    |->|   h3    |
    +---------+  +---------+  +---------+  +---------+
         ^            ^            ^            ^
         |            |            |            |
         x0           x1           x2           x3

  Every time step shares the same weights W_h, W_x, W_y (defined below).
  The horizontal arrows carry the hidden state forward; they are the recurrent
  connection (the network's memory), shown here unrolled over four steps.

Core Equations

  // Hidden state update
  h_t = tanh(W_h * h_(t-1) + W_x * x_t + b_h)

  // Output calculation
  y_t = W_y * h_t + b_y

  Where:
    W_h = weight matrix for hidden state
    W_x = weight matrix for input
    W_y = weight matrix for output
    b_h, b_y = bias vectors
    tanh = hyperbolic tangent activation function
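
To make the shapes concrete, here is a single time step of these two equations in NumPy (the dimensions below are arbitrary toy values; a full multi-step implementation follows in Section 5):

import numpy as np

# Toy dimensions (hypothetical): 3 input features, 4 hidden units, 2 outputs
input_size, hidden_size, output_size = 3, 4, 2

rng = np.random.default_rng(0)
W_h = rng.standard_normal((hidden_size, hidden_size)) * 0.01   # hidden-to-hidden
W_x = rng.standard_normal((hidden_size, input_size)) * 0.01    # input-to-hidden
W_y = rng.standard_normal((output_size, hidden_size)) * 0.01   # hidden-to-output
b_h = np.zeros((hidden_size, 1))
b_y = np.zeros((output_size, 1))

h_prev = np.zeros((hidden_size, 1))            # h_(t-1), starts at zero
x_t    = rng.standard_normal((input_size, 1))  # current input x_t

h_t = np.tanh(W_h @ h_prev + W_x @ x_t + b_h)  # hidden state update
y_t = W_y @ h_t + b_y                          # output calculation

print(h_t.shape, y_t.shape)   # (4, 1) (2, 1)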

3. Types of RNN Architectures

RNNs come in several configurations depending on the relationship between inputs and outputs:

  Architecture          | Description                                                      | Example Use Case
  ----------------------+------------------------------------------------------------------+--------------------------
  One-to-One            | Single input → single output (a standard neural network)        | Image classification
  One-to-Many           | Single input → sequence output                                   | Image captioning
  Many-to-One           | Sequence input → single output                                   | Sentiment analysis
  Many-to-Many (sync)   | Sequence input → sequence output (same length)                   | POS tagging
  Many-to-Many (async)  | Encoder-decoder; sequence in → sequence out (different length)   | Machine translation
  Bidirectional         | Processes the sequence forward AND backward                      | Named entity recognition
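
To see how two of these configurations differ in code, here is a rough PyTorch sketch (the layer sizes and the to_label head are illustrative, not taken from any model in this guide): a many-to-one model reads only the final time step, while a synchronized many-to-many model keeps every time step's output.

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=16, batch_first=True)
to_label = nn.Linear(16, 3)          # hypothetical 3-class prediction head

x = torch.randn(4, 7, 10)            # (batch=4, seq_len=7, features=10)
output, h_n = rnn(x)                 # output: (4, 7, 16)

# Many-to-one (e.g., sentiment analysis): one prediction per sequence
seq_logits = to_label(output[:, -1, :])   # (4, 3)

# Many-to-many, synchronized (e.g., POS tagging): one prediction per time step
step_logits = to_label(output)            # (4, 7, 3)

print(seq_logits.shape, step_logits.shape)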

4. LSTM & GRU

Vanilla RNNs struggle with long sequences due to the vanishing gradient problem. Two architectures solve this with learned gating mechanisms:

LSTM — Long Short-Term Memory

LSTMs introduce a cell state (long-term memory) alongside the hidden state (short-term memory). Three learned gates control information flow:

  // Forget gate - what to erase from cell state
  f_t = sigmoid(W_f * [h_(t-1), x_t] + b_f)

  // Input gate - what new info to store
  i_t = sigmoid(W_i * [h_(t-1), x_t] + b_i)
  C_tilde = tanh(W_C * [h_(t-1), x_t] + b_C)

  // Cell state update
  C_t = f_t * C_(t-1) + i_t * C_tilde

  // Output gate
  o_t = sigmoid(W_o * [h_(t-1), x_t] + b_o)
  h_t = o_t * tanh(C_t)
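
As a rough sketch (not a drop-in implementation), one LSTM time step can be written directly from these equations in NumPy; the weights below are random placeholders just to show the shapes and the information flow:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM time step, mirroring the gate equations above."""
    concat = np.vstack([h_prev, x_t])        # [h_(t-1), x_t]

    f_t     = sigmoid(W_f @ concat + b_f)    # forget gate
    i_t     = sigmoid(W_i @ concat + b_i)    # input gate
    C_tilde = np.tanh(W_C @ concat + b_C)    # candidate cell state

    C_t = f_t * C_prev + i_t * C_tilde       # cell state update
    o_t = sigmoid(W_o @ concat + b_o)        # output gate
    h_t = o_t * np.tanh(C_t)                 # new hidden state
    return h_t, C_t

# Toy dimensions: 5 hidden units, 3 input features
hidden, inp = 5, 3
rng = np.random.default_rng(1)
W = lambda: rng.standard_normal((hidden, hidden + inp)) * 0.1
b = lambda: np.zeros((hidden, 1))

h, C = np.zeros((hidden, 1)), np.zeros((hidden, 1))
x = rng.standard_normal((inp, 1))
h, C = lstm_step(x, h, C, W(), W(), W(), W(), b(), b(), b(), b())
print(h.shape, C.shape)   # (5, 1) (5, 1)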

GRU — Gated Recurrent Unit

GRUs merge the forget and input gates into a single update gate and combine the cell and hidden states, so they have fewer parameters yet often perform comparably to LSTMs:

  // Reset gate - how much past to forget
  r_t = sigmoid(W_r * [h_(t-1), x_t])

  // Update gate - how much past to keep
  z_t = sigmoid(W_z * [h_(t-1), x_t])

  // Candidate hidden state
  h_tilde = tanh(W * [r_t * h_(t-1), x_t])

  // Final hidden state
  h_t = (1 - z_t) * h_(t-1) + z_t * h_tilde
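
The same exercise for a GRU step, again with placeholder weights and following the equations above (bias terms omitted, as in the equations):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU time step, mirroring the gate equations above."""
    concat = np.vstack([h_prev, x_t])              # [h_(t-1), x_t]

    r_t = sigmoid(W_r @ concat)                    # reset gate
    z_t = sigmoid(W_z @ concat)                    # update gate

    concat_reset = np.vstack([r_t * h_prev, x_t])  # [r_t * h_(t-1), x_t]
    h_tilde = np.tanh(W_h @ concat_reset)          # candidate hidden state

    return (1 - z_t) * h_prev + z_t * h_tilde      # final hidden state

hidden, inp = 5, 3
rng = np.random.default_rng(2)
W = lambda: rng.standard_normal((hidden, hidden + inp)) * 0.1

h = np.zeros((hidden, 1))
x = rng.standard_normal((inp, 1))
h = gru_step(x, h, W(), W(), W())
print(h.shape)   # (5, 1)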

5. Code Example: Simple RNN from Scratch (NumPy)

A minimal RNN built from scratch using only NumPy to understand the core mechanics:

import numpy as np

# --- Minimal RNN from scratch ---

class SimpleRNN:
    def __init__(self, input_size, hidden_size, output_size):
        # Small random weight initialization (scaled Gaussian)
        self.W_xh = np.random.randn(hidden_size, input_size) * 0.01
        self.W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
        self.W_hy = np.random.randn(output_size, hidden_size) * 0.01
        self.b_h  = np.zeros((hidden_size, 1))
        self.b_y  = np.zeros((output_size, 1))

    def forward(self, inputs):
        """
        inputs: list of one-hot encoded vectors (each shape: input_size x 1)
        Returns outputs and hidden states at each time step
        """
        h = np.zeros((self.W_hh.shape[0], 1))  # initial hidden state
        outputs, hidden_states = [], [h]

        for x in inputs:
            # Hidden state: h_t = tanh(W_xh*x + W_hh*h_prev + b_h)
            h = np.tanh(self.W_xh @ x + self.W_hh @ h + self.b_h)

            # Output: y_t = W_hy*h_t + b_y
            y = self.W_hy @ h + self.b_y

            outputs.append(y)
            hidden_states.append(h)

        return outputs, hidden_states


# --- Example: Process a 3-word sequence ---
input_size  = 4   # vocabulary size
hidden_size = 8   # hidden units
output_size = 4   # output size

rnn = SimpleRNN(input_size, hidden_size, output_size)

# Simulate 3 one-hot vectors (3 time steps)
sequence = [
    np.array([[1], [0], [0], [0]]),   # word "hello"
    np.array([[0], [1], [0], [0]]),   # word "world"
    np.array([[0], [0], [1], [0]]),   # word "rnn"
]

outputs, hidden_states = rnn.forward(sequence)
print(f"Outputs per step: {[o.shape for o in outputs]}")
print(f"Hidden states:    {len(hidden_states)} of shape {hidden_states[0].shape}")

6. Code Example: LSTM Classifier (PyTorch)

Using PyTorch’s built-in LSTM module for text classification:

import torch
import torch.nn as nn

# --- Bidirectional LSTM Text Classifier ---

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, n_layers=2):
        super(LSTMClassifier, self).__init__()

        # Embedding: token indices -> dense vectors
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)

        # Bidirectional LSTM with dropout
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=n_layers,
            batch_first=True,
            bidirectional=True,
            dropout=0.3
        )

        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)  # *2 for bidirectional

    def forward(self, x):
        embedded = self.dropout(self.embedding(x))
        output, (hidden, cell) = self.lstm(embedded)

        # Concatenate final forward + backward hidden states
        hidden = torch.cat((hidden[-2], hidden[-1]), dim=1)
        hidden = self.dropout(hidden)
        return self.fc(hidden)


# --- Setup ---
model = LSTMClassifier(
    vocab_size=10000,
    embed_dim=128,
    hidden_dim=256,
    num_classes=2       # positive / negative
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(batch_x, batch_y):
    model.train()
    optimizer.zero_grad()
    predictions = model(batch_x)
    loss = criterion(predictions, batch_y)
    loss.backward()
    # Clip gradients to prevent exploding gradients
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
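
A quick smoke test of train_step with a random dummy batch (the shapes are illustrative; real batches would come from a tokenized dataset):

# Hypothetical dummy batch: 16 reviews, 50 token indices each, binary labels
batch_x = torch.randint(low=1, high=10000, size=(16, 50))
batch_y = torch.randint(low=0, high=2, size=(16,))

loss = train_step(batch_x, batch_y)
print(f"Dummy batch loss: {loss:.4f}")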

7. Code Example: GRU Time-Series Predictor (Keras)

Using Keras/TensorFlow to build a stacked GRU model for time-series forecasting:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# --- Stacked GRU for Time-Series Prediction ---

def build_gru_model(seq_len, n_features, n_units=64):
    model = keras.Sequential([
        layers.Input(shape=(seq_len, n_features)),

        # First GRU layer - return sequences for stacking
        layers.GRU(n_units, return_sequences=True, dropout=0.2),

        # Second GRU layer
        layers.GRU(n_units // 2, return_sequences=False),

        # Dense output
        layers.Dense(32, activation='relu'),
        layers.Dropout(0.2),
        layers.Dense(1)     # regression output
    ])
    return model


model = build_gru_model(seq_len=30, n_features=5)
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='mse',
    metrics=['mae']
)

model.summary()
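
# --- (Sketch) Building X_train / y_train with a sliding window ---
# The fit call below assumes X_train of shape (samples, 30, 5) and y_train of
# shape (samples,). This is one hypothetical way to produce them; `raw` here is
# synthetic data standing in for a real multivariate series.
import numpy as np

def make_windows(series, seq_len):
    """Slice a (timesteps, features) array into windows; predict feature 0 one step ahead."""
    X, y = [], []
    for i in range(len(series) - seq_len):
        X.append(series[i:i + seq_len])
        y.append(series[i + seq_len, 0])
    return np.array(X), np.array(y)

raw = np.random.randn(1000, 5).cumsum(axis=0)        # synthetic 5-feature series
X_train, y_train = make_windows(raw, seq_len=30)     # (970, 30, 5), (970,)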

# Training with callbacks
history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    callbacks=[
        keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
        keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=3)
    ]
)

8. Code Example: Sentiment Analysis on IMDB (Keras)

Complete end-to-end sentiment analysis using a Bidirectional LSTM on the IMDB dataset:

from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras import layers, models

# --- Sentiment Analysis on IMDB Reviews ---

MAX_FEATURES = 10000   # vocabulary size
MAX_LEN      = 200     # max review length
EMBED_DIM    = 64

# Load pre-tokenized dataset
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=MAX_FEATURES)

# Pad sequences to the same length
X_train = sequence.pad_sequences(X_train, maxlen=MAX_LEN)
X_test  = sequence.pad_sequences(X_test,  maxlen=MAX_LEN)

# Build Bidirectional LSTM model
model = models.Sequential([
    layers.Embedding(MAX_FEATURES, EMBED_DIM, mask_zero=True),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')   # binary: positive/negative
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Train
model.fit(X_train, y_train, epochs=5, batch_size=64, validation_split=0.2)

# Evaluate - typically achieves ~89-91% accuracy
loss, acc = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {acc:.2%}")


# --- Predict on a new review ---
import numpy as np

word_index = imdb.get_word_index()

def encode_review(text):
    tokens = text.lower().split()
    # IMDB indices are offset by 3 (0 = padding, 1 = start, 2 = out-of-vocabulary);
    # unknown or out-of-range words map to the OOV index 2
    encoded = [
        word_index[w] + 3 if w in word_index and word_index[w] + 3 < MAX_FEATURES else 2
        for w in tokens
    ]
    return sequence.pad_sequences([encoded], maxlen=MAX_LEN)

review = "This movie was absolutely fantastic and I loved every minute of it"
pred = model.predict(encode_review(review))[0][0]
print(f"Sentiment: {'Positive' if pred > 0.5 else 'Negative'} ({pred:.2%})")

9. Common Problems & Solutions

Problem 1: Vanishing Gradient

During backpropagation through time (BPTT), the gradient is multiplied by the recurrent weights (and activation derivatives) once per time step. When these factors are smaller than 1 in magnitude, the gradient shrinks exponentially toward zero — the network forgets distant past information and learning stalls for long sequences.

Solutions: Use LSTM or GRU gates • Use gradient clipping • Use residual connections
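
A small experiment makes the effect visible. The sketch below (an illustration, not part of the models above) pushes a 200-step sequence through a vanilla RNN, backpropagates from the final output only, and compares the gradient magnitude reaching the first and last time steps:

import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=1, hidden_size=32, batch_first=True)

x = torch.randn(1, 200, 1, requires_grad=True)   # one sequence, 200 time steps
output, _ = rnn(x)
output[:, -1, :].sum().backward()                # backprop from the last step only

grad = x.grad.abs().mean(dim=2).squeeze()        # mean |gradient| per time step
print(f"gradient at step 0:   {grad[0].item():.2e}")
print(f"gradient at step 199: {grad[-1].item():.2e}")
# The earliest steps typically receive gradients many orders of magnitude smaller.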

Problem 2: Exploding Gradient

The opposite of vanishing — when these factors are greater than 1 in magnitude, gradients grow exponentially during BPTT, causing NaN values and unstable training. Common with deep or long RNNs.

Solutions: Gradient clipping (clip_grad_norm_) • Careful weight initialization
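
Gradient clipping appeared in the PyTorch example above via clip_grad_norm_; in Keras the same idea is the clipnorm (or clipvalue) argument on the optimizer:

from tensorflow import keras

# Rescale each gradient tensor whenever its norm exceeds 1.0
optimizer = keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

# Alternative: clip each gradient component to the range [-0.5, 0.5]
# optimizer = keras.optimizers.Adam(learning_rate=1e-3, clipvalue=0.5)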

Problem 3: Long-Term Dependencies

Even with LSTM, very long sequences (500+ steps) can be hard to model. The network may fail to connect distant but relevant parts of a sequence (e.g., beginning of a paragraph to the end).

Solutions: Attention mechanisms • Transformer architecture

Problem 4: Slow Sequential Training

RNNs process time steps one by one, so computation across the sequence cannot be parallelized the way it can in CNNs or Transformers. This makes training on very long sequences slow even on modern GPU hardware, since much of the hardware's parallelism goes unused.

Solutions: Truncated BPTT • Move to Transformer-based models
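
Truncated BPTT caps how far gradients flow back by processing a long sequence in chunks and detaching the hidden state at each chunk boundary. A hedged PyTorch sketch of the pattern (the model, data, and chunk size are placeholders):

import torch
import torch.nn as nn

rnn  = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)
optimizer = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()

long_x = torch.randn(4, 1000, 8)     # synthetic batch of very long sequences
long_y = torch.randn(4, 1000, 1)
chunk  = 100                         # truncation length

state = None
for start in range(0, long_x.size(1), chunk):
    x = long_x[:, start:start + chunk]
    y = long_y[:, start:start + chunk]

    optimizer.zero_grad()
    out, state = rnn(x, state)
    loss = loss_fn(head(out), y)
    loss.backward()
    optimizer.step()

    # Detach so gradients never flow across the chunk boundary
    state = tuple(s.detach() for s in state)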


10. Real-World Applications

  • Speech Recognition — Converting audio waveforms to text (e.g., Siri, Google Assistant)
  • Machine Translation — Translating between languages (e.g., early Google Translate)
  • Sentiment Analysis — Classifying reviews as positive or negative
  • Text Generation — Generating new text in a learned style
  • Stock Price Forecasting — Predicting future prices from historical time series
  • Music Generation — Composing melodies note by note
  • Medical Diagnosis — Analyzing EHR sequences and patient timelines
  • Chatbots & NLP — Understanding conversational context
  • Image Captioning — Generating text descriptions from images
  • Anomaly Detection — Spotting irregularities in logs or sensor data

11. RNN vs CNN vs Feedforward Neural Network

  Feature            | RNN / LSTM / GRU                  | CNN                           | Feedforward (FNN / MLP)
  -------------------+-----------------------------------+-------------------------------+----------------------------------
  Best For           | Sequential / time-series data     | Spatial data (images, video)  | Tabular / fixed-size data
  Memory             | Hidden state (short & long term)  | None between samples          | None between samples
  Input Size         | Variable-length sequences         | Fixed spatial dimensions      | Fixed vector size
  Parallelizable     | Limited (sequential steps)        | Highly parallelizable         | Fully parallelizable
  Training Speed     | Slow for long sequences           | Fast with GPU                 | Very fast
  Key Strength       | Temporal patterns, NLP            | Feature hierarchies, vision   | Simple classification/regression
  Popular Variants   | LSTM, GRU, BiRNN, Seq2Seq         | ResNet, VGG, EfficientNet     | MLP, AutoEncoder
  Handles Sequences? | Yes (natively)                    | Partially (with 1D conv)      | No (fixed input only)

12. Summary

  1. RNNs have memory. The hidden state carries information from previous time steps, enabling sequential data processing impossible for standard feedforward networks.
  2. LSTM & GRU solve vanishing gradients. Gating mechanisms selectively remember or forget information, enabling networks to learn both short and long-range dependencies.
  3. Many architecture variants exist. One-to-many, many-to-one, many-to-many, and bidirectional configurations cover virtually every sequential modeling scenario.
  4. Transformers are the modern successor. For most NLP tasks today, Transformer-based models (BERT, GPT) outperform RNNs due to better parallelism and attention mechanisms. However, RNNs remain valuable for streaming data and resource-constrained environments.
  5. Key hyperparameters to tune: hidden size, number of layers, dropout rate, learning rate, sequence length, and batch size.

