Lesson 5: Natural Language Processing (NLP) and Sequence Models
Introduction
Welcome to Lesson 5! Today, we're diving into the fascinating world of Natural Language Processing (NLP) and Sequence Models. By the end of this lesson, you'll understand how computers can process and generate human language, and you'll create your own sentiment analysis model!
We'll cover three main topics: text preprocessing and word embeddings, Recurrent Neural Networks (RNNs), and sentiment analysis. Don't worry if these terms sound complex - we'll break them down with simple analogies and hands-on examples.
1. Text Preprocessing and Word Embeddings
Before we can analyze text, we need to prepare it for our models. This process is called text preprocessing. It's like translating human language into a form that computers can understand more easily.
Here are the main steps in text preprocessing:
- Tokenization: Breaking text into individual words or subwords
- Removing stop words and punctuation: Getting rid of very common words that don't carry much meaning
- Lemmatization or Stemming: Reducing words to their base or root form
Here's a simple example of text preprocessing using NLTK:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

def preprocess_text(text):
    # Tokenization
    tokens = word_tokenize(text.lower())
    # Remove stopwords and punctuation
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token.isalnum() and token not in stop_words]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return tokens

text = "The quick brown fox jumps over the lazy dog."
preprocessed_text = preprocess_text(text)
print(preprocessed_text)  # ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']
After preprocessing, we convert words into numbers or vectors. This is where word embeddings come in. Word embeddings are like giving each word a set of coordinates in a multi-dimensional space. Words with similar meanings end up close to each other in this space.
Interactive Visualization: Word Embeddings
(Interactive visualization: each point represents a word plotted in 2D space. In reality, word embeddings often have hundreds of dimensions; we show just two here for visualization purposes.)
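If you'd like to explore the geometry yourself, here's a minimal sketch using hand-picked 2D vectors. The values are made up purely for illustration; real embeddings are learned from data:

import numpy as np

# Hand-picked 2D vectors for illustration only -- real embeddings are
# learned from large text corpora and have hundreds of dimensions.
embeddings = {
    "king":  np.array([0.90, 0.80]),
    "queen": np.array([0.88, 0.82]),
    "apple": np.array([0.10, 0.30]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # noticeably lower

The similarity between "king" and "queen" comes out much higher than between "king" and "apple", which is exactly the "similar words end up close together" property described above.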
2. Recurrent Neural Networks (RNNs) and LSTMs
When dealing with sequences of words, we need a special type of neural network that can handle sequential data. This is where Recurrent Neural Networks (RNNs) come in.
Imagine you're reading a book. As you read each word, you keep a running summary in your head, updating it with each new word. RNNs work similarly, processing words one by one and updating their "memory" at each step.
Long Short-Term Memory (LSTM) networks are a special type of RNN that can remember important information for long periods and forget irrelevant details. They're like having a notepad while reading, where you can write down important points and erase things that turn out to be unimportant.
Here's a simple implementation of an RNN using PyTorch:
import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleRNN, self).__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # hidden is the final hidden state, shape (num_layers, batch, hidden_size)
        _, hidden = self.rnn(x)
        output = self.fc(hidden.squeeze(0))
        return output

# Example usage
input_size = 10   # Size of each input vector
hidden_size = 20
output_size = 2   # Binary classification (positive/negative)

model = SimpleRNN(input_size, hidden_size, output_size)
sample_input = torch.randn(1, 5, input_size)  # Batch size 1, sequence length 5
output = model(sample_input)
print(output)
This code defines a simple RNN that can be used for tasks like sentiment analysis or text generation.
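Swapping in an LSTM is nearly a one-line change. The sketch below mirrors SimpleRNN; the main difference is that nn.LSTM returns a (hidden, cell) pair, where the cell state plays the role of the "notepad" from our analogy:

import torch
import torch.nn as nn

class SimpleLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # nn.LSTM maintains a cell state alongside the hidden state
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # hidden: final hidden state; cell: the LSTM's long-term "notepad"
        _, (hidden, cell) = self.lstm(x)
        return self.fc(hidden.squeeze(0))

model = SimpleLSTM(input_size=10, hidden_size=20, output_size=2)
print(model(torch.randn(1, 5, 10)))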
3. Sentiment Analysis and Text Generation
Sentiment analysis is like teaching a computer to understand emotions in text. It's widely used in social media monitoring, customer feedback analysis, and market research.
Here's a simple sentiment analysis model using PyTorch and torchtext:
import torch
import torch.nn as nn
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# Assume we have a list of (text, label) pairs
train_data = [
    ("This movie is great!", 1),
    ("I hated this film.", 0),
    ("An absolute masterpiece!", 1),
    ("What a waste of time.", 0)
]

# Build vocabulary
tokenizer = get_tokenizer('basic_english')

def yield_tokens(data_iter):
    for text, _ in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_data), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

# Text classification model
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        # EmbeddingBag averages the embeddings of all tokens in each sequence
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)

    def forward(self, text, offsets):
        # offsets marks where each sequence starts in the flat token tensor
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

# Create model instance
vocab_size = len(vocab)
embed_dim = 32
num_class = 2
model = TextClassifier(vocab_size, embed_dim, num_class)

# Example prediction (the model is untrained here, so this output is
# essentially random until you run a training loop like the one below)
text = "This movie is amazing"
with torch.no_grad():
    text_encoded = torch.tensor([vocab[token] for token in tokenizer(text)])
    offsets = torch.tensor([0])  # a single sequence starting at position 0
    output = model(text_encoded, offsets)
    predicted_class = output.argmax(1).item()
print(f"Predicted sentiment: {'Positive' if predicted_class == 1 else 'Negative'}")
This model averages a sentence's word embeddings with nn.EmbeddingBag and passes the result through a linear layer to classify the text as positive or negative. Because we haven't trained it yet, its prediction above is essentially a coin flip.
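Here's a minimal training sketch to fix that, assuming the train_data, tokenizer, vocab, and model objects from the snippet above (the epoch count and learning rate are illustrative, not tuned):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
# Plain SGD (no momentum) works with the sparse gradients from EmbeddingBag
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for epoch in range(10):
    for text, label in train_data:
        tokens = torch.tensor([vocab[t] for t in tokenizer(text)])
        offsets = torch.tensor([0])  # one sequence per batch
        optimizer.zero_grad()
        output = model(tokens, offsets)
        loss = criterion(output, torch.tensor([label]))
        loss.backward()
        optimizer.step()

With only four training sentences this will overfit immediately; the challenge at the end of the lesson points you to real datasets.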
Text generation, on the other hand, is like teaching a computer to write. It can complete sentences, generate stories, or even write code! Advanced models like GPT-3 use transformer architectures for this, which we'll cover in a later lesson.
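To make that concrete before we get to transformers, here's a toy character-level sketch of the core loop: predict a distribution over the next character, sample from it, and feed the sample back in. The vocabulary and layer sizes here are made up, and the model is untrained, so it produces gibberish until you train it on real text:

import torch
import torch.nn as nn

chars = "abcdefghijklmnopqrstuvwxyz "
char_to_idx = {c: i for i, c in enumerate(chars)}

class CharLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=16, hidden_size=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, state=None):
        out, state = self.lstm(self.embed(x), state)
        return self.fc(out), state

model = CharLSTM(len(chars))

# Generate 20 characters, one at a time, starting from "t"
idx = torch.tensor([[char_to_idx["t"]]])  # shape (batch=1, seq=1)
state, generated = None, "t"
with torch.no_grad():
    for _ in range(20):
        logits, state = model(idx, state)               # carry the hidden state forward
        probs = torch.softmax(logits[0, -1], dim=-1)    # distribution over next char
        idx = torch.multinomial(probs, 1).unsqueeze(0)  # sample one character
        generated += chars[idx.item()]
print(generated)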
Interactive Demo: Sentiment Analysis
(Interactive widget: enter a sentence and see whether it's classified as positive or negative, along with a sentiment score.)
Interactive Demo: Text Generation
(Interactive widget: click the button to generate a simple movie review.)
Challenge: Build Your Own Sentiment Analyzer
Now it's your turn! Try to build a more advanced sentiment analyzer. Here are some ideas:
- Use a pre-trained word embedding model like Word2Vec or GloVe (see the loading sketch after this list)
- Implement an LSTM instead of a simple RNN
- Try multi-class classification (e.g., very negative, negative, neutral, positive, very positive)
- Use a dataset of real movie reviews or tweets for training
- Implement a simple text generation model using an RNN or LSTM
This challenge will help you apply what you've learned and gain hands-on experience with NLP tasks.
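As a starting point for the first idea, torchtext ships pre-trained GloVe vectors. A small loading sketch (note: the vector files are a large download on first use, and 'glove.6B' is just one of several available sets):

import torch
from torchtext.vocab import GloVe

# Download (on first run) and load 100-dimensional GloVe vectors
glove = GloVe(name='6B', dim=100)

# Look up vectors for a tokenized sentence; unknown words map to zero vectors
tokens = ["this", "movie", "is", "great"]
vectors = glove.get_vecs_by_tokens(tokens)
print(vectors.shape)  # torch.Size([4, 100])

From there, you could copy these vectors into an nn.Embedding layer's weights and either freeze them or fine-tune them inside your classifier.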