Homework #6

EE 541: Fall 2025

Assignment Details

Assigned: 13 October
Due: Sunday, 19 October at 23:59

BrightSpace Assignment: Homework 6


Problem 1: Backprop by Hand

Consider an MLP with three input nodes, two hidden layers, and three outputs. The hidden layers use the ReLU activation function and the output layer uses softmax. The weights and biases for this MLP are: \[ \mathbf{W}^{(1)} = \begin{bmatrix} 1 & -2 & 1\\ 3 & 4 & -2 \end{bmatrix}, \qquad \mathbf{b}^{(1)} = \begin{bmatrix} 1 \\ -2 \end{bmatrix} \]

\[ \mathbf{W}^{(2)} = \begin{bmatrix} 1 & -2 \\ 3 & 4 \end{bmatrix}, \qquad \mathbf{b}^{(2)} = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \]

\[ \mathbf{W}^{(3)} = \begin{bmatrix} 2 & 2 \\ 3 & -3 \\ 2 & 1 \end{bmatrix}, \qquad \mathbf{b}^{(3)} = \begin{bmatrix} 0 \\ -4 \\ -2 \end{bmatrix} \]

  1. Feedforward Computation: Perform the feedforward calculation for the input vector \(\mathbf{x} = [~ +1 ~ -1~ +1 ~]^\textrm{T}\). Fill in the following table. Follow the notation used in the slides, i.e. \({\bf s}^{(l)}\) is the linear activation, \({\bf a}^{(l)} = \underline{h}( {\bf s}^{(l)} )\), and \(\dot{\bf a}^{(l)} = \dot{ \underline{h} }( {\bf s}^{(l)} )\).

    \(l\)                        |  1  |  2  |  3
    \(\mathbf{s}^{(l)}\)         |     |     |
    \(\mathbf{a}^{(l)}\)         |     |     |
    \(\dot{\mathbf{a}}^{(l)}\)   |     |     |  (not needed)
  2. Backpropagation Computation: Apply one step of standard SGD backpropagation for this input, assuming a multi-category cross-entropy loss and the one-hot target \({\bf y} = [~0~~0~~1~]^\textrm{T}\). Follow the notation used in the slides, i.e. \(\boldsymbol{\delta}^{(l)} = \nabla_{\mathbf{s}^{(l)}} C\). Enter the delta values in the table below and provide the updated weights and biases assuming a learning rate \(\eta = 0.5\).

    \(l\)                        |  1  |  2  |  3
    \(\delta^{(l)}\)             |     |     |
    \(\mathbf{W}^{(l)}\)         |     |     |
    \(\mathbf{b}^{(l)}\)         |     |     |
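
If you want to verify your hand computation, a short numpy sketch such as the one below (optional, not part of the required submission; the names and structure are just one possibility) reproduces the forward pass and the output-layer delta.

    import numpy as np

    # Network parameters from the problem statement
    W1 = np.array([[1., -2., 1.], [3., 4., -2.]]);  b1 = np.array([1., -2.])
    W2 = np.array([[1., -2.], [3., 4.]]);           b2 = np.array([1., 0.])
    W3 = np.array([[2., 2.], [3., -3.], [2., 1.]]); b3 = np.array([0., -4., -2.])

    x = np.array([1., -1., 1.])   # input vector
    y = np.array([0., 0., 1.])    # one-hot target
    relu = lambda s: np.maximum(s, 0.0)

    # Feedforward pass
    s1 = W1 @ x + b1;  a1 = relu(s1)
    s2 = W2 @ a1 + b2; a2 = relu(s2)
    s3 = W3 @ a2 + b3
    a3 = np.exp(s3 - s3.max()); a3 /= a3.sum()   # softmax output

    # Output-layer error for softmax + cross-entropy (cf. Problem 2)
    delta3 = a3 - y
    print(s1, a1, s2, a2, s3, a3, delta3, sep="\n")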

Problem 2: Backprop Initialization for Multiclass Classification

The softmax function \(\textbf{h}(\cdot)\) takes an \(M\)-dimensional input vector \(\textbf{s}\) and outputs an \(M\)-dimensional output vector \(\textbf{a}\) as \[ \textbf{a} = \textbf{h}(\textbf{s}) = \frac{1}{ \displaystyle \sum_{m=1}^{M} e^{s_m} } \begin{bmatrix} e^{s_1}\\ e^{s_2}\\ \vdots\\ e^{s_{M}} \end{bmatrix} \] and the multiclass cross-entropy cost is given by \[ C = -\sum_{i=1}^{M}{y_i\ln{a_i}} \] where \(\textbf{y}\) is a one-hot encoded vector of ground-truth labels. Define the error (vector) of the output layer as \[ \boldsymbol{\delta} = \nabla_{\textbf{s}}C = \dot{\textbf{A} } \nabla_{\textbf{a}}C \] where \(\dot{\textbf{A}}\) is the matrix of derivatives of the softmax, given as \[ \dot{\textbf{A} } = \frac{d {\bf h}({\bf s})} {d {\bf s} } = \begin{bmatrix} \dfrac{\partial a_1}{\partial s_1} & \cdots & \dfrac{\partial a_1}{\partial s_{M}} \\[1em] \vdots & \ddots & \vdots \\[1em] \dfrac{\partial a_{M}}{\partial s_1} & \cdots & \dfrac{\partial a_{M}}{\partial s_{M}} \end{bmatrix}. \] (numerator convention with the left-handed chain rule). Show that \(\boldsymbol{\delta}=\textbf{a}-\textbf{y}\) if \(\textbf{y}\) is one-hot.
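
Before working through the algebra, you may find it helpful to check the claim numerically. The sketch below (optional; the test values are arbitrary) compares \(\dot{\textbf{A}}\nabla_{\textbf{a}}C\) against \(\textbf{a}-\textbf{y}\) at a random point.

    import numpy as np

    rng = np.random.default_rng(0)
    M = 5
    s = rng.normal(size=M)
    y = np.eye(M)[2]                        # one-hot ground-truth vector

    a = np.exp(s - s.max()); a /= a.sum()   # softmax output
    A_dot = np.diag(a) - np.outer(a, a)     # Jacobian: [A_dot]_{ij} = da_i/ds_j
    grad_a = -y / a                         # gradient of C = -sum_i y_i ln a_i

    delta = A_dot @ grad_a
    print(np.allclose(delta, a - y))        # expect True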

Problem 3: Logistic regression

Requirements

Use only Python standard library modules, numpy, and matplotlib for this problem.

The MNIST dataset of handwritten digits is one of the earliest and most widely used benchmark datasets for machine learning classifiers. Each data point contains 784 input features, the pixel values of a \(28 \times 28\) image, and belongs to one of 10 output classes, represented by the digits 0-9.

This problem continues your logistic regression experiments from the previous Homework.

Part (a): Logistic “2” detector

Completed in the previous homework; no submission required.

Part (b): Softmax classification: gradient descent (GD)

In this part you will use softmax to perform multi-class classification instead of distinct “one against all” detectors. The target vector is \[ [\mathbf{Y}]_l = \begin{cases} 1 & \mathbf{x} \textrm{ is an ``$l$''} \\ 0 & \textrm{else}. \end{cases} \] for \(l = 0, \ldots, K-1\). You can alternatively consider a scalar output \(Y\) equal to the value in \(\{0, 1, \ldots, K-1 \}\) corresponding to the class of input \(\mathbf{x}\). Construct a logistic classifier that uses \(K\) separate linear weight vectors \(\mathbf{w}_0, \mathbf{w}_1, \ldots, \mathbf{w}_{K-1}\). Compute the estimated probability for each class given input \(\mathbf{x}\) and select the class with the largest score among your \(K\) predictors: \[ P[Y = l | \mathbf{x}, \mathbf{w}] = \frac{\exp(\mathbf{w}_l^T \mathbf{x})}{\sum_{i=0}^{K-1} \exp(\mathbf{w}_{i}^T \mathbf{x})} \]

\[ \hat{Y} = \arg \max_l P[Y = l | \mathbf{x}, \mathbf{w}]. \] Note that the probabilities sum to 1. Use log-loss and optimize with batch gradient descent. The negative log-likelihood on an \(N\)-sample training set is \[ L(\mathbf{w}) = - \frac{1}{N} \sum_{i=1}^N \log P[Y = y^{(i)} | \mathbf{x}^{(i)}, \mathbf{w}] \] where the sum is over the \(N\) points in the training set.
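
A vectorized softmax and log-loss in numpy might look like the sketch below; the array layout (one example per row of X, one class per row of W) and the names are assumptions, not requirements.

    import numpy as np

    def softmax_probs(W, b, X):
        """Class probabilities: W is K x d, b has length K, X is N x d."""
        S = X @ W.T + b                       # N x K scores w_l^T x + b_l
        S -= S.max(axis=1, keepdims=True)     # subtract row max for stability
        P = np.exp(S)
        return P / P.sum(axis=1, keepdims=True)

    def log_loss(P, y):
        """Mean negative log-likelihood; y holds integer labels 0..K-1."""
        N = y.shape[0]
        return -np.mean(np.log(P[np.arange(N), y] + 1e-12))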

Submit answers to the following.

  1. Use your result from Problem 2 to compute the gradient of the softmax log-loss. Write the gradient in terms of conditional probabilities, the vector \(\mathbf{x}\), and indicator functions (i.e. do not write the expression in terms of exponentials). You will need this gradient in subsequent parts of this problem.

  2. Implement batch gradient descent (a possible update step is sketched after this list). What learning rate did you use?

  3. Plot the log-loss (i.e. learning curve) of the training set and test set on the same figure as a function of iteration number. On a separate figure, plot the accuracy of your model on the training set and test set against iteration number.

  4. Compute the final loss and final accuracy for both your training set and test set.
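
One plausible batch gradient-descent step, using the gradient from item 1 in matrix form together with the softmax_probs helper from the earlier sketch, is shown below; Y_onehot and eta are placeholder names, and the learning rate is yours to choose.

    # One batch gradient-descent step; Y_onehot is N x K with a 1 in column y_i.
    def gd_step(W, b, X, Y_onehot, eta):
        N = X.shape[0]
        P = softmax_probs(W, b, X)        # N x K predicted probabilities
        G = P - Y_onehot                  # per-sample error term (see Problem 2)
        W -= eta * (G.T @ X) / N          # average gradient w.r.t. the weights
        b -= eta * G.mean(axis=0)         # average gradient w.r.t. the biases
        return W, b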

Part (c): Softmax classification: stochastic gradient descent

In this part you will use stochastic gradient descent (SGD) in place of the (deterministic) batch gradient descent above. Test your SGD implementation using single-point updates and a mini-batch size of 100. You may need to adjust the learning rate to improve performance: you can modify the rate by hand, follow a decay schedule, or use a single fixed learning rate. You should obtain a final predictor comparable to that of the previous part.

Submit answers to the following.

  1. Implement SGD with a mini-batch size of 1 (i.e. compute the gradient and update the weights after each sample); a minimal mini-batch loop is sketched after this list. Record the log-loss and accuracy of the training set and test set every 5,000 samples. Plot the sampled log-loss and accuracy values on the same (respective) figures against the batch number. Your plots should start at iteration 0 (i.e. include the initial log-loss and accuracy). Your curves should show performance comparable to batch gradient descent. How many iterations did it take to achieve performance comparable to batch gradient descent? How does this number depend on the learning rate (or the learning-rate decay schedule if you use a non-constant learning rate)?

  2. Compare the total computational complexity to reach a comparable accuracy on your training set with that of batch gradient descent. Note that each iteration of batch gradient descent costs an extra factor of \(N\) operations, where \(N\) is the number of data points.

  3. Implement SGD with a mini-batch size of 100 (i.e. compute the gradient and update the weights with the accumulated average after every 100 samples). Record the log-loss and accuracies as above (every 5,000 samples, not every 5,000 batches) and create similar plots. Your curves should show performance comparable to batch gradient descent. How many iterations did it take to achieve performance comparable to batch gradient descent? How does this number depend on the learning rate (or the learning-rate decay schedule if you use a non-constant learning rate)?

  4. Compare the computational complexity required to reach comparable performance for the 100-sample mini-batch algorithm, the single-point mini-batch algorithm, and batch gradient descent.
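
A minimal mini-batch loop covering both experiments (batch size 1 and 100) might be organized as below. It reuses softmax_probs, log_loss, and gd_step from the earlier sketches, records only the training log-loss, and the shuffling, names, and recording logic are assumptions; extend the recording step to also log test-set loss and accuracy as the questions above require.

    import numpy as np

    def run_sgd(W, b, X, Y_onehot, eta, batch_size, n_epochs, record_every=5000):
        """Mini-batch SGD; applies gd_step to each mini-batch."""
        N = X.shape[0]
        history = []                                  # (samples_seen, train loss)
        rng = np.random.default_rng(0)
        samples_seen = 0
        for _ in range(n_epochs):
            order = rng.permutation(N)                # reshuffle each epoch
            for start in range(0, N, batch_size):
                if samples_seen % record_every == 0:  # includes sample 0
                    P = softmax_probs(W, b, X)
                    history.append((samples_seen, log_loss(P, Y_onehot.argmax(1))))
                idx = order[start:start + batch_size]
                W, b = gd_step(W, b, X[idx], Y_onehot[idx], eta)
                samples_seen += len(idx)
        return W, b, history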

Submit your trained weights to Gradescope. Save your weights and bias to an hdf5 file. Use keys W and b for the weights and bias, respectively. W should be a \(10 \times 784\) numpy array and b should be a length-10 numpy array with shape (10,). The code to save the weights is the same as in part (a), substituting W for w.
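
If you no longer have the saving snippet from part (a), writing the file with h5py (assuming it is available in your environment; the filename below is just an example) looks like this:

    import h5py

    # W: 10 x 784 numpy array, b: numpy array with shape (10,)
    with h5py.File("softmax_weights.hdf5", "w") as f:
        f.create_dataset("W", data=W)
        f.create_dataset("b", data=b)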

Note: you will not be scored on your model's overall accuracy, but a low score may indicate errors in training or poor optimization.


Submission Instructions

Problems 1-2: Analytical problems

Write your solutions to these homework problems on separate sheets of paper. Show all work and box answers where appropriate. Do not guess.

Problem 3: Logistic Regression

Submit answers to questions, learning curves, accuracy plots, and final loss and accuracy values to BrightSpace. Submit your Python code and trained-weights hdf5 file (W: 10×784 matrix, b: vector with shape (10,)) to Gradescope. A suitably annotated Jupyter notebook with inline analysis is sufficient.

Gradescope: Problem 3