Lesson 8: Advanced Deep Learning: Attention and Transformers
Introduction
Welcome to Lesson 8! Today, we're diving into some of the most exciting advancements in deep learning: attention mechanisms, the Transformer architecture, and BERT. These technologies have revolutionized natural language processing and power many of the AI language models you've likely heard about, such as GPT-3.
Don't worry if these terms sound complex - we'll break them down with simple analogies and hands-on examples. By the end of this lesson, you'll understand how these models work and even create a simple project using them!
1. Attention Mechanisms: Focusing on What Matters
Imagine you're at a party with many conversations happening simultaneously. When you focus on one particular conversation, you're using "attention" to filter out irrelevant information and focus on what's important. Attention mechanisms in deep learning work similarly.
In the context of natural language processing, attention allows a model to focus on different parts of the input sequence when producing each part of the output sequence. This is particularly useful for tasks like translation, where different words in the output may depend on different parts of the input.
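To make this concrete, here's a minimal sketch of the core computation, scaled dot-product attention, on a toy example. It assumes random vectors in place of learned word representations. The name simple_attention and the sizes (5 input positions, 3 output positions, 16 dimensions) are illustrative choices, not part of any library:

```python
import math
import torch


def simple_attention(query, key, value):
    """Single-head scaled dot-product attention.

    query: (q_len, d), key: (k_len, d), value: (k_len, d_v)
    Returns the attended output (q_len, d_v) and the weights (q_len, k_len).
    """
    d = query.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d)
    scores = query @ key.T / math.sqrt(d)
    # Each row becomes a probability distribution over the input positions
    weights = torch.softmax(scores, dim=-1)
    # The output is a weighted average of the value vectors
    return weights @ value, weights


torch.manual_seed(0)
encoder_states = torch.randn(5, 16)  # 5 input positions (e.g. source words)
decoder_states = torch.randn(3, 16)  # 3 output positions being produced

output, weights = simple_attention(decoder_states, encoder_states, encoder_states)
print(output.shape)         # torch.Size([3, 16])
print(weights.sum(dim=-1))  # each row sums to 1 (up to floating-point rounding)
```

Each row of weights says how much one output position draws on each input position, which is the "focus" described above.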
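Here's a simple implementation of multi-head self-attention using PyTorch. The embedding is split across several heads, and each head computes its own attention weights: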
import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super().__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        assert self.head_dim * heads == embed_size, "Embed size needs to be divisible by heads"

        # Linear projections, applied to each head's slice of the embedding
        self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)

    def forward(self, values, keys, query, mask):
        N = query.shape[0]  # batch size
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]

        # Split embedding into self.heads pieces
        values = values.reshape(N, value_len, self.heads, self.head_dim)
        keys = keys.reshape(N, key_len, self.heads, self.head_dim)
        queries = query.reshape(N, query_len, self.heads, self.head_dim)

        values = self.values(values)
        keys = self.keys(keys)
        queries = self.queries(queries)

        # Scaled dot-product attention
        # energy has shape (N, heads, query_len, key_len)
        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])
        if mask is not None:
            # Masked positions get a very negative score, so softmax gives them ~0 weight
            energy = energy.masked_fill(mask == 0, float("-1e20"))

        # Scale by sqrt(d_k), the per-head dimension, as in "Attention Is All You Need"
        attention = torch.softmax(energy / (self.head_dim ** (1 / 2)), dim=3)

        # Weighted sum of the values, then concatenate the heads
        out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).reshape(
            N, query_len, self.heads * self.head_dim
        )
        out = self.fc_out(out)
        return out


# Example usage
embed_size = 256
heads = 8
attention = SelfAttention(embed_size, heads)
x = torch.randn((32, 10, embed_size))  # (batch_size, seq_len, embed_size)
output = attention(x, x, x, mask=None)  # self-attention: values, keys and queries are all x
print(output.shape)  # torch.Size([32, 10, 256])
This code defines a self-attention mechanism that can be used as a building block in more complex models like Transformers.
Interactive Visualization: Attention Weights
Let's visualize how attention works. Given a sentence, the heatmap shows simulated attention weights between its words:
In this heatmap, darker colors indicate higher attention weights. You can see which words the model "attends to" when processing each word in the sentence.
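If you'd like to reproduce something similar yourself, here's a small sketch that prints a text version of such a heatmap. The weights are simulated: random vectors stand in for learned embeddings, so the numbers show the structure of an attention matrix rather than anything a trained model has learned. The function name simulated_attention and the example sentence are made up for illustration:

```python
import torch


def simulated_attention(sentence, embed_size=16, seed=0):
    """Return the tokens and a (len, len) matrix of simulated self-attention weights."""
    tokens = sentence.split()
    generator = torch.Generator().manual_seed(seed)
    # Random vectors stand in for learned embeddings (assumption for illustration)
    embeddings = torch.randn(len(tokens), embed_size, generator=generator)
    scores = embeddings @ embeddings.T / embed_size ** 0.5
    return tokens, torch.softmax(scores, dim=-1)


tokens, weights = simulated_attention("the cat sat on the mat")

# Text "heatmap": row = word being processed, column = word attended to
print(" " * 6 + "".join(f"{t:>6}" for t in tokens))
for token, row in zip(tokens, weights):
    print(f"{token:>6}" + "".join(f"{w:6.2f}" for w in row.tolist()))
```

Every row sums to 1, because each word distributes its attention across all the words in the sentence. With matplotlib installed, the same matrix could be passed to plt.imshow to draw a colored heatmap.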
2. Transformer Architecture: The Power of Self-Attention
The Transformer architecture, introduced in the 2017 paper "Attention Is All You Need", is like a super-powered language processing factory. Instead of reading a sequence one step at a time, it uses stacked layers of self-attention to process all positions of the input in parallel. This makes training far more efficient on modern hardware than with sequential models like RNNs, and it also makes it easier for the model to relate words that are far apart.
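Transformers consist of an encoder (which processes the input) and a decoder (which generates the output). Each is a stack of layers that combine self-attention with a feedforward neural network. Decoder layers additionally attend to the encoder's output, and their self-attention is masked so that each position can only see earlier positions.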
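Before building one by hand, it can help to see the encoder/decoder split using PyTorch's built-in nn.Transformer module. The sketch below is only an illustration: the sizes are deliberately tiny, and random tensors stand in for sequences that have already been embedded.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny sizes so the example runs quickly on CPU (illustrative values)
model = nn.Transformer(
    d_model=32,
    nhead=4,
    num_encoder_layers=2,
    num_decoder_layers=2,
    dim_feedforward=64,
    dropout=0.0,
    batch_first=True,  # tensors are (batch, seq_len, d_model)
)

src = torch.randn(2, 9, 32)  # already-embedded input sequence
tgt = torch.randn(2, 7, 32)  # already-embedded output sequence so far

# Causal mask: decoder position i may not look at positions after i
tgt_len = tgt.shape[1]
causal_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)

memory = model.encoder(src)                             # encoder processes the input
out = model.decoder(tgt, memory, tgt_mask=causal_mask)  # decoder attends to it
print(memory.shape)  # torch.Size([2, 9, 32])
print(out.shape)     # torch.Size([2, 7, 32])
```

The encoder output (called memory here) has one vector per input position, and the decoder produces one vector per output position while reading from that memory.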
Here's a simplified implementation of a Transformer model:
import torch
import torch.nn as nn

# Reuses the SelfAttention class defined in Section 1


class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super().__init__()
        self.attention = SelfAttention(embed_size, heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)

        # Position-wise feedforward network
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask):
        attention = self.attention(value, key, query, mask)
        # Residual connection followed by layer normalization
        x = self.dropout(self.norm1(attention + query))
        forward = self.feed_forward(x)
        out = self.dropout(self.norm2(forward + x))
        return out
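# NOTE: this lesson does not show the Encoder and Decoder used by Transformer below.
# The three classes that follow are a minimal sketch (an assumption, not the
# lesson's original code) written to match the constructor calls in Transformer.
class Encoder(nn.Module):
    def __init__(self, src_vocab_size, embed_size, num_layers, heads, device,
                 forward_expansion, dropout, max_length):
        super().__init__()
        self.word_embedding = nn.Embedding(src_vocab_size, embed_size)
        self.position_embedding = nn.Embedding(max_length, embed_size)
        self.layers = nn.ModuleList(
            [TransformerBlock(embed_size, heads, dropout, forward_expansion)
             for _ in range(num_layers)]
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        N, seq_length = x.shape
        positions = torch.arange(seq_length, device=x.device).expand(N, seq_length)
        out = self.dropout(self.word_embedding(x) + self.position_embedding(positions))
        for layer in self.layers:
            out = layer(out, out, out, mask)  # self-attention: value = key = query
        return out


class DecoderBlock(nn.Module):
    def __init__(self, embed_size, heads, forward_expansion, dropout):
        super().__init__()
        self.attention = SelfAttention(embed_size, heads)
        self.norm = nn.LayerNorm(embed_size)
        self.transformer_block = TransformerBlock(embed_size, heads, dropout, forward_expansion)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, value, key, src_mask, trg_mask):
        # Masked self-attention over the target, then attention over the encoder output
        attention = self.attention(x, x, x, trg_mask)
        query = self.dropout(self.norm(attention + x))
        return self.transformer_block(value, key, query, src_mask)


class Decoder(nn.Module):
    def __init__(self, trg_vocab_size, embed_size, num_layers, heads,
                 forward_expansion, dropout, device, max_length):
        super().__init__()
        self.word_embedding = nn.Embedding(trg_vocab_size, embed_size)
        self.position_embedding = nn.Embedding(max_length, embed_size)
        self.layers = nn.ModuleList(
            [DecoderBlock(embed_size, heads, forward_expansion, dropout)
             for _ in range(num_layers)]
        )
        self.fc_out = nn.Linear(embed_size, trg_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, src_mask, trg_mask):
        N, seq_length = x.shape
        positions = torch.arange(seq_length, device=x.device).expand(N, seq_length)
        x = self.dropout(self.word_embedding(x) + self.position_embedding(positions))
        for layer in self.layers:
            x = layer(x, enc_out, enc_out, src_mask, trg_mask)
        return self.fc_out(x)  # (N, trg_len, trg_vocab_size)


class Transformer(nn.Module):
    def __init__(
        self,
        src_vocab_size,
        trg_vocab_size,
        src_pad_idx,
        trg_pad_idx,
        embed_size=256,
        num_layers=6,
        forward_expansion=4,
        heads=8,
        dropout=0,
        device="cuda",
        max_length=100,
    ):
        super().__init__()
        self.encoder = Encoder(
            src_vocab_size, embed_size, num_layers, heads,
            device, forward_expansion, dropout, max_length,
        )
        self.decoder = Decoder(
            trg_vocab_size, embed_size, num_layers, heads,
            forward_expansion, dropout, device, max_length,
        )
        self.src_pad_idx = src_pad_idx
        self.trg_pad_idx = trg_pad_idx
        self.device = device

    def make_src_mask(self, src):
        # True for real tokens, False for padding; shape (N, 1, 1, src_len)
        src_mask = (src != self.src_pad_idx).unsqueeze(1).unsqueeze(2)
        return src_mask.to(self.device)

    def make_trg_mask(self, trg):
        # Lower-triangular mask so each position only sees itself and earlier positions
        N, trg_len = trg.shape
        trg_mask = torch.tril(torch.ones((trg_len, trg_len))).expand(
            N, 1, trg_len, trg_len
        )
        return trg_mask.to(self.device)

    def forward(self, src, trg):
        src_mask = self.make_src_mask(src)
        trg_mask = self.make_trg_mask(trg)
        enc_src = self.encoder(src, src_mask)
        out = self.decoder(trg, enc_src, src_mask, trg_mask)
        return out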
# Example usage
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # fall back to CPU without a GPU
src_vocab_size = 10000
trg_vocab_size = 10000
src_pad_idx = 0
trg_pad_idx = 0
transformer = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx, device=device).to(device)
x = torch.tensor([[1, 5, 6, 4, 3, 9, 5, 2, 0], [1, 8, 7, 3, 4, 5, 6, 7, 2]]).to(device)
trg = torch.tensor([[1, 7, 4, 3, 5, 9, 2, 0], [1, 5, 6, 2, 4, 7, 6, 2]]).to(device)
out = transformer(x, trg[:, :-1])  # feed the target without its last token
print(out.shape)  # torch.Size([2, 7, 10000])
This code defines a basic Transformer model that can be used for tasks like translation or text generation.
Interactive Demo: Transformer Output
Let's simulate a simple Transformer model. Given some input text, the demo shows a (very simplified) Transformer output. This is only a simulation; real Transformer outputs are much more complex and context-aware.
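To get a feel for how a Transformer actually produces output, here's a hedged sketch of greedy decoding: the model generates one token at a time and feeds each new token back in. Everything here is a toy assumption. TinySeq2Seq is a hypothetical model built on PyTorch's nn.Transformer, the vocabulary has only 20 made-up token ids, the start and end token ids (1 and 2) are arbitrary, and the weights are untrained, so the generated tokens are meaningless.

```python
import torch
import torch.nn as nn


class TinySeq2Seq(nn.Module):
    """Hypothetical toy model: embeddings + nn.Transformer + output projection."""

    def __init__(self, vocab_size=20, d_model=32, max_len=16):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        self.core = nn.Transformer(
            d_model=d_model, nhead=4, num_encoder_layers=1, num_decoder_layers=1,
            dim_feedforward=64, dropout=0.0, batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def embed(self, ids):
        positions = torch.arange(ids.shape[1], device=ids.device)
        return self.tok(ids) + self.pos(positions)

    def forward(self, src, tgt):
        t = tgt.shape[1]
        # Causal mask so each target position only sees earlier positions
        causal = torch.triu(
            torch.full((t, t), float("-inf"), device=tgt.device), diagonal=1
        )
        hidden = self.core(self.embed(src), self.embed(tgt), tgt_mask=causal)
        return self.out(hidden)  # (batch, tgt_len, vocab_size)


@torch.no_grad()
def greedy_decode(model, src, bos_id=1, eos_id=2, max_new_tokens=8):
    """Generate one token at a time, always picking the highest-scoring token."""
    model.eval()
    tgt = torch.full((src.shape[0], 1), bos_id, dtype=torch.long, device=src.device)
    for _ in range(max_new_tokens):
        logits = model(src, tgt)  # re-run the model on the prefix so far
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        tgt = torch.cat([tgt, next_id], dim=1)
        if (next_id == eos_id).all():
            break
    return tgt


torch.manual_seed(0)
model = TinySeq2Seq()
src = torch.tensor([[1, 5, 6, 4, 3, 9, 5, 2]])
print(greedy_decode(model, src))  # untrained weights, so the tokens are arbitrary
```

A trained model would use the same loop. The only difference is that its weights would make the highest-scoring token a sensible continuation.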
3. BERT: Bidirectional Transformers for Language Understanding
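BERT (Bidirectional Encoder Representations from Transformers) is like a language comprehension expert. It's pre-trained on a large corpus of unlabeled text (the original model used English Wikipedia and BooksCorpus) and can then be fine-tuned for various language tasks with relatively little labeled data.
The key innovation of BERT is that it's bidirectional. This means it looks at the entire context of a word (both to its left and to its right) at the same time when processing it, rather than reading strictly left to right. It's like filling in a blank in a sentence by using the words on both sides as clues, which is in fact how BERT is pre-trained: it learns to predict words that have been masked out.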
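The difference between bidirectional and left-to-right processing comes down to which positions the attention is allowed to see. Here's a small sketch under toy assumptions (random vectors in place of real token embeddings, and a made-up example sentence) that compares the two for the word "bank":

```python
import torch

torch.manual_seed(0)
tokens = ["the", "bank", "of", "the", "river"]
x = torch.randn(len(tokens), 8)  # random stand-ins for token vectors
scores = x @ x.T / 8 ** 0.5

# Bidirectional (BERT-style): every token may attend to every other token
bidirectional = torch.softmax(scores, dim=-1)

# Left-to-right (causal): hide the positions to the right before the softmax
future = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
causal = torch.softmax(scores.masked_fill(future, float("-inf")), dim=-1)

i = tokens.index("bank")
print("bidirectional:", [round(w, 2) for w in bidirectional[i].tolist()])
print("causal:       ", [round(w, 2) for w in causal[i].tolist()])
# In the causal row, the weights for "of", "the", "river" are exactly 0
```

In the bidirectional row, "bank" can draw on "river" to settle its meaning. In the causal row it cannot, because "river" comes later in the sentence.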
Here's how you can use a pre-trained BERT model:
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Load pre-trained model (weights) and switch to evaluation mode (disables dropout)
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# Input text
text = "Hello, how are you?"

# Tokenize input; add_special_tokens=True adds [CLS] at the start and [SEP] at the end
input_ids = torch.tensor([tokenizer.encode(text, add_special_tokens=True)])

# Run the model without tracking gradients (inference only)
with torch.no_grad():
    outputs = model(input_ids)

# Get the embeddings from the last layer
last_hidden_states = outputs[0]  # The last hidden state is the first element of the output
print(last_hidden_states.shape)
# Output should be: torch.Size([1, sequence_length, 768])
# where sequence_length is the length of the input sequence including special tokens
This code loads a pre-trained BERT model and uses it to generate embeddings for a given input text.
Interactive Visualization: BERT Embeddings
Let's visualize BERT embeddings. We'll show the first dimension of the embedding for each word in your input:
Each point represents a word, and its y-position represents the value of the first dimension of its BERT embedding. In the bert-base-uncased model used above, each embedding has 768 dimensions, so this plot shows only a tiny slice of what the model represents!
Practical Applications
These advanced deep learning techniques have numerous real-world applications:
- Machine Translation: Transformers have significantly improved the quality of language translation.
- Text Summarization: Models can generate concise summaries of long documents.
- Sentiment Analysis: A fine-tuned BERT model can classify the sentiment of text, including cases where the meaning depends on context, such as negation.
- Question Answering: Models can understand questions and find answers in large bodies of text.
- Text Generation: GPT models, which use the Transformer architecture, can generate human-like text.
Challenge: Fine-tune BERT for Sentiment Analysis
Now it's your turn! Try to fine-tune a pre-trained BERT model for sentiment analysis. Here are the steps:
- Load a pre-trained BERT model using the Hugging Face transformers library
- Prepare a dataset of labeled sentences (positive/negative sentiment)
- Tokenize the sentences and create input tensors
- Add a classification layer on top of the BERT model
- Fine-tune the model on your dataset
- Evaluate the model's performance on a test set
This challenge will help you apply what you've learned and gain hands-on experience with state-of-the-art NLP models.
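If you're not sure where to start, the steps above can be sketched roughly as follows. This is a starting point under loud assumptions, not a full solution. To keep it runnable offline, it uses a tiny, randomly initialised BERT built from BertConfig instead of the pre-trained weights, and made-up token ids instead of real tokenized sentences. The helper name fine_tune is hypothetical. BertForSequenceClassification is the Hugging Face class that places a classification layer on top of BERT.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification


def fine_tune(model, input_ids, attention_mask, labels, steps=5, lr=1e-3):
    """Run a few gradient steps on one batch and return the loss at each step."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    losses = []
    for _ in range(steps):
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        outputs.loss.backward()  # classification loss computed by the model
        optimizer.step()
        losses.append(outputs.loss.item())
    return losses


torch.manual_seed(0)

# Tiny random BERT so the sketch runs without downloading anything (assumption).
# For the real challenge, swap in:
#   BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
config = BertConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=1,
    num_attention_heads=2, intermediate_size=64, num_labels=2,
)
model = BertForSequenceClassification(config)

# Made-up token ids standing in for tokenized sentences; 0 is padding here.
# With real text, these would come from BertTokenizer.
input_ids = torch.tensor([[2, 15, 27, 3, 0], [2, 41, 8, 19, 3],
                          [2, 77, 3, 0, 0], [2, 9, 63, 52, 3]])
attention_mask = (input_ids != 0).long()  # 1 for real tokens, 0 for padding
labels = torch.tensor([1, 0, 1, 0])       # 1 = positive, 0 = negative

losses = fine_tune(model, input_ids, attention_mask, labels)

# Evaluation: switch off dropout and gradients, then take the highest-scoring class
model.eval()
with torch.no_grad():
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
print(logits.argmax(dim=-1))  # predicted class per sentence
```

For a real run you would loop over many batches of a labeled dataset and evaluate on sentences the model has not seen during training. A much smaller learning rate is typical when starting from pre-trained weights (values around 2e-5 are commonly used).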