Coursework 3: Multi-Layer Perceptron & Backpropagation

CS3317: Artificial Intelligence
Extend your CW2 logistic regression into a multi-layer perceptron by implementing backpropagation from scratch. Stack Linear and activation layers, propagate gradients through the chain rule, and add SGD with momentum.

Connection to CW2

The gradient update you wrote in CW2 was already the backward pass of a single linear layer:

dW = X.T @ grad_logits    # = Linear.backward() grad_W
db = grad_logits.sum(0)   # = Linear.backward() grad_b

In CW3 you generalize this to n layers chained together. Each layer's backward() receives the upstream gradient, computes its local parameter gradients, and passes the gradient with respect to its own input on to the previous layer — that is the chain rule, i.e., backpropagation.
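As a concrete starting point, here is a minimal sketch of what Linear.forward and Linear.backward can look like; the class layout and the He-style initialization are illustrative choices, not the required interface:

import numpy as np

class Linear:
    """Fully connected layer: out = X @ W + b, with X of shape (N, d_in)."""
    def __init__(self, d_in, d_out):
        # He-style initialization; the starter code may prescribe its own scheme
        self.W = np.random.randn(d_in, d_out) * np.sqrt(2.0 / d_in)
        self.b = np.zeros(d_out)

    def forward(self, X):
        self.X = X                      # cache the input for the backward pass
        return X @ self.W + self.b

    def backward(self, grad):
        # grad is dLoss/d(output) of shape (N, d_out), the upstream gradient
        self.grad_W = self.X.T @ grad   # the same expression as the CW2 update above
        self.grad_b = grad.sum(axis=0)
        return grad @ self.W.T          # dLoss/d(input), handed to the previous layer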

CW2                                 | CW3
------------------------------------|----------------------------------------------------
Single linear layer + softmax       | Multiple layers (Linear → Activation → … → Linear)
Gradient computed inside optimizer  | Gradient computed by each layer's backward()
Vanilla gradient descent            | SGD with momentum & weight decay
~84% test accuracy                  | ~88–90% test accuracy

Learning Objectives

  • Implement forward and backward passes for Linear, ReLU, and Sigmoid layers (see the activation sketch after this list)
  • Chain layer gradients together to form full backpropagation through an MLP
  • Implement SGD with momentum and understand its effect on convergence
  • Explore the impact of network depth, width, learning rate, and activation function on accuracy
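
For the activations, backward() is just the elementwise local derivative multiplied by the upstream gradient. A minimal sketch, with the caching strategy as an illustrative choice rather than the required one:

import numpy as np

class ReLU:
    def forward(self, X):
        self.mask = X > 0                  # remember where the input was positive
        return X * self.mask

    def backward(self, grad):
        return grad * self.mask            # gradient is zeroed wherever the input was negative

class Sigmoid:
    def forward(self, X):
        self.out = 1.0 / (1.0 + np.exp(-X))
        return self.out

    def backward(self, grad):
        # local derivative sigmoid(x) * (1 - sigmoid(x)) never exceeds 0.25,
        # which is one source of vanishing gradients in deep sigmoid stacks
        return grad * self.out * (1.0 - self.out)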

Downloads & Canvas Submission

  • Coursework handout (PDF)
  • Starter package
    • cw3_mlp/layers.py — implement Linear, ReLU, Sigmoid
    • cw3_mlp/losses.py — implement CrossEntropyLoss forward & backward
    • cw3_mlp/model.py — implement MLP (build layers, forward, backward)
    • cw3_mlp/optimizer.py — implement SGD with momentum
    • run.py, trainer.py, config.py — provided, do not modify
    • tests/test_cw3.py — numerical gradient checks for all layers

Submitting on Canvas: zip cw3_mlp/ (with outputs/ included) together with your report.pdf.

Tasks (What You Implement)

File          | What to implement
--------------|------------------------------------------------------------------
layers.py     | Linear.forward(X) — cache X, return XW + b
              | Linear.backward(grad) — grad_W, grad_b, return grad_input
              | ReLU.forward(X) / ReLU.backward(grad)
              | Sigmoid.forward(X) / Sigmoid.backward(grad)
losses.py     | CrossEntropyLoss.forward(logits, labels)
              | CrossEntropyLoss.backward()
model.py      | MLP.__init__(...) — build layer list
              | MLP.forward(X) — sequential forward
              | MLP.backward(grad) — reverse-order backward
optimizer.py  | SGD.__init__(...) — init velocities if momentum > 0
              | SGD.step() — update W, b with optional momentum & weight decay
              | SGD.zero_grad()

Do not modify run.py, config.py, or trainer.py.
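
To see how the pieces fit together, here is a minimal sketch of the sequential-forward / reverse-backward pattern MLP needs. It assumes the Linear and ReLU classes sketched above are in scope; the actual constructor signature is fixed by the starter code:

class MLP:
    def __init__(self, d_in, hidden_dims, d_out):
        dims = [d_in] + list(hidden_dims)
        self.layers = []
        for a, b in zip(dims[:-1], dims[1:]):
            self.layers += [Linear(a, b), ReLU()]    # Linear -> activation blocks
        self.layers.append(Linear(dims[-1], d_out))  # final logits layer, no activation

    def forward(self, X):
        for layer in self.layers:                    # first layer to last
            X = layer.forward(X)
        return X

    def backward(self, grad):
        for layer in reversed(self.layers):          # last layer to first: the chain rule
            grad = layer.backward(grad)
        return grad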

Setup

cd code
pip install -r requirements.txt
python setup_data.py        # only needed if not done in CW2
cd cw3_mlp

Running Instructions

Quick mode for debugging:

python run.py --quick

Train with default architecture ([128, 64], ReLU, 30 epochs):

python run.py

Custom architecture and hyperparameters:

python run.py --hidden_dims 256 128 --activation relu --learning_rate 0.01
python run.py --hidden_dims 256 128 64 --momentum 0.9

Hyperparameter sweeps (generates plots for your report):

python run.py --sweep hidden_dim      # width: [32, 64, 128, 256, 512]
python run.py --sweep num_layers      # depth: [1, 2, 3, 4]
python run.py --sweep learning_rate   # lr: [0.001, 0.005, 0.01, 0.05, 0.1]

Verify with numerical gradient checks:

cd code
python -m tests.test_cw3
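
The tests compare each analytic gradient against central finite differences. If you want to probe a single layer by hand, the core idea looks like the sketch below (a standalone illustration that reuses the Linear sketch from earlier, not the harness's actual code):

import numpy as np

def numeric_grad(f, x, eps=1e-5):
    """Central-difference estimate of df/dx for a scalar-valued f()."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"], op_flags=["readwrite"])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + eps; f_plus = f()
        x[idx] = old - eps; f_minus = f()
        x[idx] = old                                 # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return grad

# Check Linear.backward's grad_W: use loss = sum(outputs),
# so the upstream gradient is a matrix of ones.
layer = Linear(4, 3)                                 # Linear class from the earlier sketch
X = np.random.randn(5, 4)
layer.backward(np.ones_like(layer.forward(X)))       # analytic: fills layer.grad_W
numeric = numeric_grad(lambda: layer.forward(X).sum(), layer.W)
print(abs(numeric - layer.grad_W).max())             # should be tiny, e.g. < 1e-8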

What to Observe

After training, check the outputs in outputs/. You should see:

  • Accuracy improvement over CW2: an MLP with two hidden layers (784→128→64→10) achieves ~88–90%, vs ~84% for logistic regression
  • Depth sweep: networks with 2–3 hidden layers outperform a single hidden layer; very deep networks (4+ layers) may not improve further with vanilla SGD
  • Width sweep: wider layers improve accuracy up to a point, with diminishing returns beyond ~256 units on FashionMNIST
  • Momentum: a coefficient ≥ 0.9 accelerates convergence — the loss decreases faster in early epochs (see the sketch after this list)
  • ReLU vs Sigmoid: ReLU typically converges faster and reaches higher accuracy; Sigmoid can suffer from vanishing gradients in deeper networks
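
The speed-up comes from the velocity buffer. A minimal sketch of one common (heavy-ball) form of the update, with weight decay folded into the gradient; the starter code's SGD may arrange the details differently:

import numpy as np

def sgd_momentum_step(param, grad, velocity, lr=0.01, momentum=0.9, weight_decay=0.0):
    """One heavy-ball update; velocity persists across steps (initialize to zeros)."""
    grad = grad + weight_decay * param              # optional weight-decay (L2) term
    velocity[:] = momentum * velocity - lr * grad   # keep a fraction of the previous step
    param += velocity                               # consistent gradients build up speed

# Usage per parameter, e.g. for a weight matrix W with gradient grad_W:
#   v_W = np.zeros_like(W)             # once, in SGD.__init__
#   sgd_momentum_step(W, grad_W, v_W)  # every SGD.step()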

Think About the Differences

Guiding questions for your report

  1. Backpropagation: trace the gradient of the loss through a 2-layer MLP step by step. What does each layer's backward() compute?
  2. Activation functions: what happens to the ReLU gradient when the pre-activation value is negative? Why does this not occur with Sigmoid, and why does Sigmoid still suffer from vanishing gradients?
  3. Momentum: sketch the velocity update equation. Why does momentum help when gradients point consistently in the same direction?
  4. Depth vs Width: given a fixed parameter budget, is it better to go wider or deeper? What do your sweep results show?

Submission Checklist

  • layers.py — Linear, ReLU, Sigmoid forward & backward implemented
  • losses.py — CrossEntropyLoss forward & backward implemented
  • model.py — MLP init, forward, backward implemented
  • optimizer.py — SGD with momentum implemented
  • Gradient checks pass: python -m tests.test_cw3
  • outputs/ — contains sweep plots and training summaries
  • report.pdf — using the provided template

Grading Rubric (100 points)

Component                                                                    | Points
-----------------------------------------------------------------------------|-------
Layer implementations (layers.py) — correctness verified by gradient check   | 40
Loss function (losses.py) — forward & backward                               | 10
MLP model (model.py) — init, forward, backward                               | 20
SGD optimizer (optimizer.py) — step with momentum                            | 15
Gradient checks pass (python -m tests.test_cw3)                              | 5
Report — hyperparameter exploration, analysis, comparison with CW2           | 10
Bonus: weight decay in SGD (capped at 100 total)                             | +10
Total                                                                        | 100

Academic Integrity & Notes

  • You must use NumPy only — no PyTorch, no autograd.
  • Discussing high-level ideas is allowed, but your code must be your own.
  • Do not share or copy implementations.