Coursework 3: Multi-Layer Perceptron & Backpropagation

CS3317: Artificial Intelligence
Extend your CW2 logistic regression into a multi-layer perceptron by implementing backpropagation from scratch. Stack Linear and activation layers, propagate gradients through the chain rule, and add SGD with momentum.

Connection to CW2

The gradient update you wrote in CW2 was already the backward pass of a single linear layer:

dW = X.T @ grad_logits    # = Linear.backward() grad_W
db = grad_logits.sum(0)   # = Linear.backward() grad_b

In CW3 you generalize this to n layers chained together. Each layer's backward() receives the upstream gradient, computes its local parameter gradients, and passes the gradient with respect to its own input on to the previous layer — that is the chain rule, i.e., backpropagation.
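As a concrete starting point, here is a minimal sketch of what Linear.forward and Linear.backward can look like; the class layout and the He-style initialization are illustrative choices, not the required interface:

import numpy as np

class Linear:
    """Fully connected layer: out = X @ W + b, with X of shape (N, d_in)."""
    def __init__(self, d_in, d_out):
        # He-style initialization; the starter code may prescribe its own scheme
        self.W = np.random.randn(d_in, d_out) * np.sqrt(2.0 / d_in)
        self.b = np.zeros(d_out)

    def forward(self, X):
        self.X = X                      # cache the input for the backward pass
        return X @ self.W + self.b

    def backward(self, grad):
        # grad is dLoss/d(output) of shape (N, d_out), the upstream gradient
        self.grad_W = self.X.T @ grad   # the same expression as the CW2 update above
        self.grad_b = grad.sum(axis=0)
        return grad @ self.W.T          # dLoss/d(input), handed to the previous layer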

CW2                                 | CW3
------------------------------------|----------------------------------------------------
Single linear layer + softmax       | Multiple layers (Linear → Activation → … → Linear)
Gradient computed inside optimizer  | Gradient computed by each layer's backward()
Vanilla gradient descent            | SGD with momentum & weight decay
~84% test accuracy                  | ~88–90% test accuracy

Learning Objectives

  • Implement forward and backward passes for Linear, ReLU, and Sigmoid layers (see the activation sketch after this list)
  • Chain layer gradients together to form full backpropagation through an MLP
  • Implement SGD with momentum and understand its effect on convergence
  • Explore the impact of network depth, width, learning rate, and activation function on accuracy
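
For the activations, backward() is just the elementwise local derivative multiplied by the upstream gradient. A minimal sketch, with the caching strategy as an illustrative choice rather than the required one:

import numpy as np

class ReLU:
    def forward(self, X):
        self.mask = X > 0                  # remember where the input was positive
        return X * self.mask

    def backward(self, grad):
        return grad * self.mask            # gradient is zeroed wherever the input was negative

class Sigmoid:
    def forward(self, X):
        self.out = 1.0 / (1.0 + np.exp(-X))
        return self.out

    def backward(self, grad):
        # local derivative sigmoid(x) * (1 - sigmoid(x)) never exceeds 0.25,
        # which is one source of vanishing gradients in deep sigmoid stacks
        return grad * self.out * (1.0 - self.out)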

Downloads & Canvas Submission

  • Coursework handout (PDF)
  • Starter package
    • cw3_mlp/layers.py — implement Linear, ReLU, Sigmoid
    • cw3_mlp/losses.py — implement CrossEntropyLoss forward & backward
    • cw3_mlp/model.py — implement MLP (build layers, forward, backward)
    • cw3_mlp/optimizer.py — implement SGD with momentum
    • run.py, trainer.py, config.py — provided, do not modify
    • tests/test_cw3.py — numerical gradient checks for all layers

Submitting on Canvas: zip cw3_mlp/ (with outputs/ included) together with your report.pdf.

Tasks (What You Implement)

File          | What to implement
--------------|------------------------------------------------------------------
layers.py     | Linear.forward(X) — cache X, return XW + b
              | Linear.backward(grad) — grad_W, grad_b, return grad_input
              | ReLU.forward(X) / ReLU.backward(grad)
              | Sigmoid.forward(X) / Sigmoid.backward(grad)
losses.py     | CrossEntropyLoss.forward(logits, labels)
              | CrossEntropyLoss.backward()
model.py      | MLP.__init__(...) — build layer list
              | MLP.forward(X) — sequential forward
              | MLP.backward(grad) — reverse-order backward
optimizer.py  | SGD.__init__(...) — init velocities if momentum > 0
              | SGD.step() — update W, b with optional momentum & weight decay
              | SGD.zero_grad()

Do not modify run.py, config.py, or trainer.py.
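
To see how the pieces fit together, here is a minimal sketch of the sequential-forward / reverse-backward pattern MLP needs. It assumes the Linear and ReLU classes sketched above are in scope; the actual constructor signature is fixed by the starter code:

class MLP:
    def __init__(self, d_in, hidden_dims, d_out):
        dims = [d_in] + list(hidden_dims)
        self.layers = []
        for a, b in zip(dims[:-1], dims[1:]):
            self.layers += [Linear(a, b), ReLU()]    # Linear -> activation blocks
        self.layers.append(Linear(dims[-1], d_out))  # final logits layer, no activation

    def forward(self, X):
        for layer in self.layers:                    # first layer to last
            X = layer.forward(X)
        return X

    def backward(self, grad):
        for layer in reversed(self.layers):          # last layer to first: the chain rule
            grad = layer.backward(grad)
        return grad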

Setup

cd code
pip install -r requirements.txt
python setup_data.py        # only needed if not done in CW2
cd cw3_mlp

Running Instructions

Quick mode for debugging:

python run.py --quick

Train with default architecture ([128, 64], ReLU, 30 epochs):

python run.py

Custom architecture and hyperparameters:

python run.py --hidden_dims 256 128 --activation relu --learning_rate 0.01
python run.py --hidden_dims 256 128 64 --momentum 0.9

Hyperparameter sweeps (generates plots for your report):

python run.py --sweep hidden_dim      # width: [32, 64, 128, 256, 512]
python run.py --sweep num_layers      # depth: [1, 2, 3, 4]
python run.py --sweep learning_rate   # lr: [0.001, 0.005, 0.01, 0.05, 0.1]

Verify with numerical gradient checks:

cd code
python -m tests.test_cw3
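
The tests compare each analytic gradient against central finite differences. If you want to probe a single layer by hand, the core idea looks like the sketch below (a standalone illustration that reuses the Linear sketch from earlier, not the harness's actual code):

import numpy as np

def numeric_grad(f, x, eps=1e-5):
    """Central-difference estimate of df/dx for a scalar-valued f()."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"], op_flags=["readwrite"])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + eps; f_plus = f()
        x[idx] = old - eps; f_minus = f()
        x[idx] = old                                 # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return grad

# Check Linear.backward's grad_W: use loss = sum(outputs),
# so the upstream gradient is a matrix of ones.
layer = Linear(4, 3)                                 # Linear class from the earlier sketch
X = np.random.randn(5, 4)
layer.backward(np.ones_like(layer.forward(X)))       # analytic: fills layer.grad_W
numeric = numeric_grad(lambda: layer.forward(X).sum(), layer.W)
print(abs(numeric - layer.grad_W).max())             # should be tiny, e.g. < 1e-8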

What to Observe

After training, check the outputs in outputs/. You should see:

  • Accuracy improvement over CW2: an MLP with two hidden layers (784→128→64→10) achieves ~88–90%, vs ~84% for logistic regression
  • Depth sweep: networks with 2–3 hidden layers outperform a single hidden layer; very deep networks (4+ layers) may not improve further with vanilla SGD
  • Width sweep: wider layers improve accuracy up to a point, with diminishing returns beyond ~256 units on FashionMNIST
  • Momentum: a coefficient ≥ 0.9 accelerates convergence — the loss decreases faster in early epochs (see the sketch after this list)
  • ReLU vs Sigmoid: ReLU typically converges faster and reaches higher accuracy; Sigmoid can suffer from vanishing gradients in deeper networks
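
The speed-up comes from the velocity buffer. A minimal sketch of one common (heavy-ball) form of the update, with weight decay folded into the gradient; the starter code's SGD may arrange the details differently:

import numpy as np

def sgd_momentum_step(param, grad, velocity, lr=0.01, momentum=0.9, weight_decay=0.0):
    """One heavy-ball update; velocity persists across steps (initialize to zeros)."""
    grad = grad + weight_decay * param              # optional weight-decay (L2) term
    velocity[:] = momentum * velocity - lr * grad   # keep a fraction of the previous step
    param += velocity                               # consistent gradients build up speed

# Usage per parameter, e.g. for a weight matrix W with gradient grad_W:
#   v_W = np.zeros_like(W)             # once, in SGD.__init__
#   sgd_momentum_step(W, grad_W, v_W)  # every SGD.step()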

Think About the Differences

Guiding questions for your report

  1. Backpropagation: trace the gradient of the loss through a 2-layer MLP step by step. What does each layer's backward() compute?
  2. Activation functions: what happens to the ReLU gradient when the pre-activation value is negative? Why does this not occur with Sigmoid, and why does Sigmoid still suffer from vanishing gradients?
  3. Momentum: sketch the velocity update equation. Why does momentum help when gradients point consistently in the same direction?
  4. Depth vs Width: given a fixed parameter budget, is it better to go wider or deeper? What do your sweep results show?

Submission Checklist

  • layers.py — Linear, ReLU, Sigmoid forward & backward implemented
  • losses.py — CrossEntropyLoss forward & backward implemented
  • model.py — MLP init, forward, backward implemented
  • optimizer.py — SGD with momentum implemented
  • Gradient checks pass: python -m tests.test_cw3
  • outputs/ — contains sweep plots and training summaries
  • report.pdf — using the provided template

Grading Rubric (100 points)

Component                                                                    | Points
-----------------------------------------------------------------------------|-------
Layer implementations (layers.py) — correctness verified by gradient check   | 40
Loss function (losses.py) — forward & backward                               | 10
MLP model (model.py) — init, forward, backward                               | 20
SGD optimizer (optimizer.py) — step with momentum                            | 15
Gradient checks pass (python -m tests.test_cw3)                              | 5
Report — hyperparameter exploration, analysis, comparison with CW2           | 10
Bonus: weight decay in SGD (capped at 100 total)                             | +10
Total                                                                        | 100

Academic Integrity & Notes

  • You must use NumPy only — no PyTorch, no autograd.
  • Discussing high-level ideas is allowed, but your code must be your own.
  • Do not share or copy implementations.