Lesson 4: Computer Vision with Convolutional Neural Networks (CNNs)

Introduction

Welcome to Lesson 4! Today, we're diving into the exciting world of Computer Vision using Convolutional Neural Networks (CNNs). By the end of this lesson, you'll understand how computers 'see' images and be able to create your own image classification model!

We'll cover three main topics: image classification, transfer learning, and object detection. Don't worry if these terms sound complicated - we'll break them down with simple analogies and hands-on examples.

1. Image Classification with CNNs

Imagine you're trying to find Waldo in a crowded picture. You don't look at every tiny detail all at once. Instead, you scan the image, looking for key features like Waldo's striped shirt or glasses. CNNs work similarly, scanning images for important features to classify them.

Let's break down how a CNN processes an image:

Convolutional layers scan the image for features (like edges, textures, shapes)
Pooling layers reduce the size of the feature maps, keeping the most important information
Fully connected layers at the end make the final classification decision
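
To make the first two steps concrete, here is a minimal pure-Python sketch of what one convolution filter and one 2x2 max-pooling step compute. The 6x6 toy image, the hand-picked vertical-edge kernel, and the helper names conv2d and max_pool2d are illustrative assumptions, not PyTorch functions. A real CNN learns its kernels during training instead of having them written by hand.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    Like deep learning libraries, this does not flip the kernel,
    so strictly speaking it is a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


# Toy grayscale image: dark (0) on the left, bright (1) on the right
image = [[0, 0, 0, 0, 1, 1] for _ in range(6)]

# Hand-picked kernel that responds to dark-to-bright vertical edges
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d(image, kernel)   # 4x4, large values where the edge is
pooled = max_pool2d(feature_map)      # 2x2, the edge response survives

print(feature_map)
print(pooled)
```

The feature map is zero over the flat dark region and positive where the brightness changes. Pooling halves the width and height while keeping that strong response.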

Here's a simple CNN for classifying images from the CIFAR10 dataset:


import torch
import torch.nn as nn
import torchvision.transforms as transforms
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader

# Define a simple CNN
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        # Two 2x2 poolings shrink the 32x32 input to 8x8, with 32 channels
        self.fc1 = nn.Linear(32 * 8 * 8, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = x.view(-1, 32 * 8 * 8)
        x = self.fc1(x)
        return x

# Load and preprocess the CIFAR10 dataset
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

trainset = CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = DataLoader(trainset, batch_size=4, shuffle=True)

# Create the model and define loss function and optimizer
model = SimpleCNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(5):  # 5 epochs for demonstration
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if i % 2000 == 1999:
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}')
            running_loss = 0.0

print('Finished Training')

This code defines a basic CNN with two convolutional layers, a max-pooling step after each one, and a single fully connected layer. It then trains the model on the CIFAR10 training split. The full dataset contains 60,000 32x32 color images in 10 classes (such as automobile, ship, and bird): 50,000 for training and 10,000 for testing.
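
If you're wondering where the 32 * 8 * 8 in the fully connected layer comes from, it helps to track the image size layer by layer. The sketch below uses the standard output-size formula, floor((size + 2 * padding - kernel_size) / stride) + 1. The helper name conv_output_size is our own, not a PyTorch function.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Width (or height) after a convolution or pooling layer."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 32                                     # CIFAR10 images are 32x32
size = conv_output_size(size, 3, padding=1)   # conv1: 32 -> 32
size = conv_output_size(size, 2, stride=2)    # pool:  32 -> 16
size = conv_output_size(size, 3, padding=1)   # conv2: 16 -> 16
size = conv_output_size(size, 2, stride=2)    # pool:  16 -> 8

channels = 32                                 # output channels of conv2
flattened = channels * size * size            # inputs to the final layer

print(size, flattened)
```

With padding=1 and a 3x3 kernel, the convolutions keep the size unchanged, so only the two pooling steps shrink the image.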

Interactive Visualization: CNN Layers

Let's look at how the convolutional and pooling layers transform an image. Plotting the output of each layer as a heatmap shows how each layer extracts different features from the image. Brighter areas indicate stronger activation for that feature.
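
One simple way to turn a layer's output into a heatmap is to rescale its values to the 0 to 255 brightness range. The sketch below is a minimal pure-Python version under that assumption. The function name to_heatmap and the tiny 2x2 activation grid are made up for illustration. In practice the values would come from the model itself, for example by capturing a layer's output with a forward hook.

```python
def to_heatmap(feature_map):
    """Rescale activations to integers in 0..255 (brighter = stronger)."""
    flat = [value for row in feature_map for value in row]
    lowest, highest = min(flat), max(flat)
    span = highest - lowest
    if span == 0:
        # A constant feature map has no contrast, so draw it as all black
        return [[0 for _ in row] for row in feature_map]
    return [
        [round(255 * (value - lowest) / span) for value in row]
        for row in feature_map
    ]


# Made-up activations for one 2x2 feature map
activations = [
    [-1.0, 0.0],
    [2.0, 3.0],
]

heatmap = to_heatmap(activations)
print(heatmap)
```

The weakest activation becomes black (0) and the strongest becomes white (255), which is why the brightest areas mark where the feature was found.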

2. Transfer Learning: Standing on the Shoulders of Giants

Transfer learning is like learning a new language when you already know a similar one. It's easier because you can transfer some of your existing knowledge. In machine learning, we can take a model that was pre-trained on a large dataset and fine-tune it for our specific task.

Here's an example of using a pre-trained ResNet model for our CIFAR10 classification task:


import torch
import torchvision.models as models
import torchvision.transforms as transforms
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader

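# Load a ResNet18 model pre-trained on ImageNet
# (torchvision >= 0.13; older versions use pretrained=True instead)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)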

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer
num_ftrs = model.fc.in_features
model.fc = torch.nn.Linear(num_ftrs, 10)  # 10 classes in CIFAR10

# Load and preprocess the CIFAR10 dataset
transform = transforms.Compose([
    transforms.Resize(224),  # ResNet was pre-trained on 224x224 images
    transforms.ToTensor(),
    # Normalize with the ImageNet mean and std used for the pre-trained weights
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
])

trainset = CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = DataLoader(trainset, batch_size=4, shuffle=True)

# Define loss function and optimizer
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)

# Training loop
for epoch in range(5):  # 5 epochs for demonstration
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if i % 2000 == 1999:
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}')
            running_loss = 0.0

print('Finished Training')

In this code, we load a ResNet18 model pre-trained on ImageNet, freeze its layers (so the pre-trained weights are not modified), and train only a new final layer for our 10-class task. This lets us reuse the powerful features the model learned from a much larger dataset.
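
To get a feel for how little is actually trained here, we can count the parameters in the final layer by hand. The sketch below assumes the final layer of torchvision's ResNet18 takes 512 input features and originally has 1000 outputs (one per ImageNet class). The helper name linear_param_count is our own.

```python
def linear_param_count(in_features, out_features):
    """A fully connected layer has one weight per input-output pair plus one bias per output."""
    return in_features * out_features + out_features


in_features = 512   # assumed value of model.fc.in_features for ResNet18

original_head = linear_param_count(in_features, 1000)  # ImageNet: 1000 classes
new_head = linear_param_count(in_features, 10)         # CIFAR10: 10 classes

print(f'Original final layer: {original_head:,} parameters')
print(f'New final layer:      {new_head:,} parameters')
```

Only the new layer's parameters receive gradient updates. On a real model you could check this with sum(p.numel() for p in model.parameters() if p.requires_grad).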

3. Object Detection: Finding Waldo for Real

Object detection takes image classification a step further. Instead of just saying "This image contains a cat," object detection says "There's a cat in the bottom left corner, and a dog in the top right." It's like finding multiple 'Waldos' in a single image!

Here's a simple example using a pre-trained Faster R-CNN model for object detection:


import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn,
    FasterRCNN_ResNet50_FPN_Weights,
)
from torchvision.transforms import functional as F
from PIL import Image, ImageDraw

# Load a Faster R-CNN model pre-trained on COCO
# (torchvision >= 0.13; older versions use pretrained=True instead)
weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights)
model.eval()
categories = weights.meta['categories']  # class names, indexed by label

# Load an image as RGB and convert it to a tensor with values in [0, 1]
image = Image.open('path_to_your_image.jpg').convert('RGB')
img_tensor = F.to_tensor(image)

# Perform object detection (the model takes a list of image tensors)
with torch.no_grad():
    prediction = model([img_tensor])

# Draw bounding boxes on the image
draw = ImageDraw.Draw(image)
for box, label, score in zip(prediction[0]['boxes'], prediction[0]['labels'], prediction[0]['scores']):
    if score > 0.5:  # Only show detections with confidence > 50%
        box = box.tolist()  # [x1, y1, x2, y2] in pixels
        draw.rectangle(box, outline='red', width=3)
        draw.text((box[0], box[1]), f'{categories[label.item()]}: {score.item():.2f}', fill='red')

# Display or save the image with bounding boxes
image.show()  # or image.save('output.jpg')

This code loads a pre-trained Faster R-CNN model, performs object detection on an image, and draws bounding boxes around the detected objects. It's a powerful technique used in many real-world applications, from self-driving cars to security systems.
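
The filtering step is easier to see without a model or an image file. Here is a minimal pure-Python sketch that applies the same 50% confidence threshold to a made-up prediction. The boxes, scores, the small names dictionary, and the helper filter_detections are all illustrative assumptions. A real prediction holds tensors rather than plain lists, and the label numbers depend on the dataset the model was trained on.

```python
def filter_detections(prediction, names, threshold=0.5):
    """Keep only detections whose confidence score is above the threshold."""
    kept = []
    for box, label, score in zip(
        prediction['boxes'], prediction['labels'], prediction['scores']
    ):
        if score > threshold:
            kept.append({
                'name': names.get(label, str(label)),
                'score': score,
                'box': box,   # [x1, y1, x2, y2] in pixels
            })
    return kept


# Made-up model output for a 240x240 image (y grows downward)
prediction = {
    'boxes': [[10, 120, 90, 200], [150, 20, 230, 110], [40, 40, 60, 60]],
    'labels': [17, 18, 1],
    'scores': [0.92, 0.81, 0.30],
}

# Illustrative label-to-name lookup, not a complete category list
names = {1: 'person', 17: 'cat', 18: 'dog'}

detections = filter_detections(prediction, names)
for detection in detections:
    print(f"{detection['name']}: {detection['score']:.2f} at {detection['box']}")
```

The low-confidence third detection is dropped, leaving a cat in the bottom left and a dog in the top right.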

Challenge: Build Your Own Image Classifier

Now it's your turn! Try to build an image classifier for a dataset of your choice. Here are some ideas:

Use the Cats vs Dogs dataset to build a binary classifier
Try classifying different types of flowers using the Flowers Recognition dataset
Build a classifier to recognize different breeds of dogs
Use transfer learning with a different pre-trained model (like VGG or Inception)
Implement data augmentation to improve your model's performance

This challenge will help you apply what you've learned and get hands-on experience with real-world image classification tasks.
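
If you try the data augmentation idea, it can help to see what an augmentation does to the pixels. The sketch below is a minimal pure-Python version of two common augmentations, a random horizontal flip and a random crop after zero-padding, applied to a tiny 4x4 image stored as nested lists. The function names and the toy image are our own. In a real project you would more likely add torchvision's transforms.RandomHorizontalFlip and transforms.RandomCrop to the transforms.Compose pipeline.

```python
import random


def random_horizontal_flip(image, rng, p=0.5):
    """Mirror the image left-to-right with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return [row[:] for row in image]


def random_crop(image, rng, size, padding=1):
    """Zero-pad the image on every side, then cut out a random size x size window."""
    width = len(image[0])
    blank_row = [0] * (width + 2 * padding)
    padded = (
        [blank_row[:] for _ in range(padding)]
        + [[0] * padding + row + [0] * padding for row in image]
        + [blank_row[:] for _ in range(padding)]
    )
    top = rng.randint(0, len(padded) - size)
    left = rng.randint(0, len(padded[0]) - size)
    return [row[left:left + size] for row in padded[top:top + size]]


rng = random.Random(0)   # fixed seed so the run is repeatable

image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

augmented = random_crop(random_horizontal_flip(image, rng), rng, size=4)
for row in augmented:
    print(row)
```

Each epoch the model sees a slightly shifted or mirrored version of every image, which makes it harder to memorize the training set.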

