The gradient update you wrote in CW2 was already the backward pass of a single linear layer:
dW = X.T @ grad_logits # = Linear.backward() grad_W
db = grad_logits.sum(0) # = Linear.backward() grad_b
In CW3 you generalize this to n layers chained together. Each layer's backward() receives the upstream gradient, computes its local parameter gradients, and passes the gradient with respect to its input on to the previous layer; that is the chain rule, i.e., backpropagation.
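To make the correspondence concrete, here is a minimal NumPy sketch of what Linear will look like in CW3. This is a sketch assuming the interface listed under layers.py below; the random initialization is only illustrative:

```python
import numpy as np

class Linear:
    def __init__(self, in_dim, out_dim):
        self.W = np.random.randn(in_dim, out_dim) * 0.01  # illustrative init
        self.b = np.zeros(out_dim)

    def forward(self, X):
        self.X = X                  # cache input for the backward pass
        return X @ self.W + self.b  # XW + b

    def backward(self, grad):
        # local parameter gradients -- exactly the CW2 formulas above
        self.grad_W = self.X.T @ grad
        self.grad_b = grad.sum(0)
        # gradient w.r.t. the input, handed to the previous layer
        return grad @ self.W.T
```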
| CW2 | CW3 |
|---|---|
| Single linear layer + softmax | Multiple layers (Linear → Activation → … → Linear) |
| Gradient computed inside optimizer | Gradient computed by each layer's backward() |
| Vanilla gradient descent | SGD with momentum & weight decay |
| ~84% test accuracy | ~88–90% test accuracy |
- `cw3_mlp/layers.py` — implement Linear, ReLU, Sigmoid
- `cw3_mlp/losses.py` — implement CrossEntropyLoss forward & backward
- `cw3_mlp/model.py` — implement MLP (build layers, forward, backward)
- `cw3_mlp/optimizer.py` — implement SGD with momentum
- `run.py`, `trainer.py`, `config.py` — provided, do not modify
- `tests/test_cw3.py` — numerical gradient checks for all layers
Submitting on Canvas: zip `cw3_mlp/` (with `outputs/` included) together with your `report.pdf`.
| File | What to implement |
|---|---|
| `layers.py` | `Linear.forward(X)` — cache `X`, return `XW + b`<br>`Linear.backward(grad)` — compute `grad_W`, `grad_b`, return `grad_input`<br>`ReLU.forward(X)` / `ReLU.backward(grad)`<br>`Sigmoid.forward(X)` / `Sigmoid.backward(grad)` |
| `losses.py` | `CrossEntropyLoss.forward(logits, labels)`<br>`CrossEntropyLoss.backward()` |
| `model.py` | `MLP.__init__(...)` — build layer list<br>`MLP.forward(X)` — sequential forward<br>`MLP.backward(grad)` — reverse-order backward |
| `optimizer.py` | `SGD.__init__(...)` — init velocities if momentum > 0<br>`SGD.step()` — update `W`, `b` with optional momentum & weight decay<br>`SGD.zero_grad()` |
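For orientation, here are minimal NumPy sketches of the remaining interfaces, in the same style as the Linear sketch above. Treat them as illustrations of the shapes involved, not as the starter code's exact signatures:

```python
import numpy as np

class ReLU:
    def forward(self, X):
        self.mask = X > 0                    # remember where inputs were positive
        return X * self.mask

    def backward(self, grad):
        return grad * self.mask              # zero gradient where the input was <= 0

class Sigmoid:
    def forward(self, X):
        self.out = 1.0 / (1.0 + np.exp(-X))  # cache output: sigma' = sigma * (1 - sigma)
        return self.out

    def backward(self, grad):
        return grad * self.out * (1.0 - self.out)

class CrossEntropyLoss:
    def forward(self, logits, labels):
        shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        exp = np.exp(shifted)
        self.probs = exp / exp.sum(axis=1, keepdims=True)     # softmax
        self.labels = labels
        n = logits.shape[0]
        return -np.log(self.probs[np.arange(n), labels]).mean()

    def backward(self):
        n = self.probs.shape[0]
        grad = self.probs.copy()
        grad[np.arange(n), self.labels] -= 1.0  # softmax minus one-hot
        return grad / n                          # gradient of the *mean* loss

class MLP:
    def __init__(self, layers):
        # the starter code builds this list from hidden_dims; here it is passed in
        self.layers = layers

    def forward(self, X):
        for layer in self.layers:            # sequential forward
            X = layer.forward(X)
        return X

    def backward(self, grad):
        for layer in reversed(self.layers):  # reverse-order backward: the chain rule
            grad = layer.backward(grad)
        return grad
```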
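And a sketch of the optimizer, assuming Linear layers that store `W`, `b`, `grad_W`, `grad_b` as above. The `v = momentum * v - lr * grad` convention shown here is one common formulation; the weight-decay term is the optional bonus:

```python
import numpy as np

class SGD:
    def __init__(self, layers, lr=0.01, momentum=0.0, weight_decay=0.0):
        # keep only the layers that have parameters (Linear); activations have none
        self.layers = [l for l in layers if hasattr(l, "W")]
        self.lr, self.momentum, self.weight_decay = lr, momentum, weight_decay
        if momentum > 0:
            # one zero-initialized velocity buffer per parameter
            self.vel = [{"W": np.zeros_like(l.W), "b": np.zeros_like(l.b)}
                        for l in self.layers]

    def step(self):
        for i, l in enumerate(self.layers):
            grad_W, grad_b = l.grad_W, l.grad_b
            if self.weight_decay > 0:
                grad_W = grad_W + self.weight_decay * l.W  # bonus: L2 weight decay
            if self.momentum > 0:
                self.vel[i]["W"] = self.momentum * self.vel[i]["W"] - self.lr * grad_W
                self.vel[i]["b"] = self.momentum * self.vel[i]["b"] - self.lr * grad_b
                l.W += self.vel[i]["W"]
                l.b += self.vel[i]["b"]
            else:                                          # vanilla gradient descent
                l.W -= self.lr * grad_W
                l.b -= self.lr * grad_b

    def zero_grad(self):
        for l in self.layers:
            l.grad_W = np.zeros_like(l.W)
            l.grad_b = np.zeros_like(l.b)
```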
Do not modify `run.py`, `config.py`, or `trainer.py`.
cd code
pip install -r requirements.txt
python setup_data.py # only needed if not done in CW2
cd cw3_mlp
Quick mode for debugging:
python run.py --quick
Train with default architecture ([128, 64], ReLU, 30 epochs):
python run.py
Custom architecture and hyperparameters:
python run.py --hidden_dims 256 128 --activation relu --learning_rate 0.01
python run.py --hidden_dims 256 128 64 --momentum 0.9
Hyperparameter sweeps (generates plots for your report):
python run.py --sweep hidden_dim # width: [32, 64, 128, 256, 512]
python run.py --sweep num_layers # depth: [1, 2, 3, 4]
python run.py --sweep learning_rate # lr: [0.001, 0.005, 0.01, 0.05, 0.1]
Verify with numerical gradient checks:
cd code
python -m tests.test_cw3
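The tests are provided, but knowing what they do helps with debugging: a numerical gradient check compares your analytic backward() against central finite differences. A minimal sketch of the idea (this is not the actual test code):

```python
import numpy as np

def numerical_grad(f, x, eps=1e-5):
    """Central differences: df/dx_i ~ (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps)."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"], op_flags=["readwrite"])
    while not it.finished:
        idx = it.multi_index
        orig = x[idx]
        x[idx] = orig + eps
        f_plus = f()                 # f reads x, so perturbing x changes f()
        x[idx] = orig - eps
        f_minus = f()
        x[idx] = orig                # restore the entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return grad

# sanity check on f(x) = sum(x**2), whose analytic gradient is 2x
x = np.random.randn(3, 4)
assert np.allclose(numerical_grad(lambda: np.sum(x ** 2), x), 2 * x, atol=1e-6)
```

If a check fails, comparing the numerical and analytic gradients element by element usually points straight at the buggy backward().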
After training, check the outputs in `outputs/` (sweep plots and training summaries). Before submitting, confirm:

- `layers.py` — Linear, ReLU, Sigmoid forward & backward implemented
- `losses.py` — CrossEntropyLoss forward & backward implemented
- `model.py` — MLP init, forward, backward implemented
- `optimizer.py` — SGD with momentum implemented
- `python -m tests.test_cw3` — all gradient checks pass
- `outputs/` — contains sweep plots and training summaries
- `report.pdf` — using the provided template

| Component | Points |
|---|---|
| Layer implementations (`layers.py`) — correctness verified by gradient check | 40 |
| Loss function (`losses.py`) — forward & backward | 10 |
| MLP model (`model.py`) — init, forward, backward | 20 |
| SGD optimizer (`optimizer.py`) — step with momentum | 15 |
| Gradient checks pass (`python -m tests.test_cw3`) | 5 |
| Report — hyperparameter exploration, analysis, comparison with CW2 | 10 |
| Bonus: weight decay in SGD (capped at 100 total) | +10 |
| Total | 100 |