Homework #6 – Getting Started Guide

1 Backpropagation by Hand

1.1 Network Structure Indexing

Backpropagation requires careful accounting of derivatives through the computational graph. For a network with \(L\) layers, the key variables follow this indexing convention:

\[W^{(l)}_{i,j} = \text{Weight from unit $j$ in layer $l-1$ to unit $i$ in layer $l$}\] \[b^{(l)}_{i} = \text{Bias for unit $i$ in layer $l$}\] \[s^{(l)}_{i} = \text{Linear activation (pre-activation) for unit $i$ in layer $l$}\] \[a^{(l)}_{i} = \text{Activation value for unit $i$ in layer $l$}\] \[\delta^{(l)}_{i} = \text{Error term for unit $i$ in layer $l$}\]

For the input layer, \(a^{(0)} = x\).

Note

Notation: \(l\) denotes the layer index, with \(l = L\) being the output layer.

1.2 Forward Pass Computation

For a simple 2-layer network with one hidden layer:

\[s^{(1)}_i = \sum_j W^{(1)}_{i,j} x_j + b^{(1)}_i\] \[a^{(1)}_i = \text{ReLU}(s^{(1)}_i) = \max(0, s^{(1)}_i)\] \[s^{(2)}_i = \sum_j W^{(2)}_{i,j} a^{(1)}_j + b^{(2)}_i\] \[a^{(2)}_i = \text{softmax}(s^{(2)})_i = \frac{e^{s^{(2)}_i}}{\sum_k e^{s^{(2)}_k}}\]

Example calculation of softmax:

Code
import numpy as np

s = np.array([2.1, -3.0, 1.5])
exp_s = np.exp(s)
softmax = exp_s / np.sum(exp_s)

print(f"Pre-activation values: {s}")
print(f"Exponentials: {exp_s.round(3)}")
print(f"Sum of exponentials: {np.sum(exp_s).round(3)}")
print(f"Softmax probabilities: {softmax.round(3)}")
Pre-activation values: [ 2.1 -3.   1.5]
Exponentials: [8.166 0.05  4.482]
Sum of exponentials: 12.698
Softmax probabilities: [0.643 0.004 0.353]
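
To see how these equations chain together, here is a minimal sketch of the full forward pass for a small 2-layer network. The input, weights, and (zero) biases are made-up values for illustration only.

Code
# Made-up example: 3 inputs, 3 hidden units, 3 output classes
x = np.array([0.5, -0.3, 0.2])
W1 = np.array([[0.3, 0.2, 0.1],
               [0.4, 0.1, 0.2],
               [0.1, 0.5, 0.3]])
b1 = np.zeros(3)
W2 = np.array([[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6],
               [0.7, 0.8, 0.9]])
b2 = np.zeros(3)

# Layer 1: linear pre-activation, then ReLU
s1 = W1 @ x + b1
a1 = np.maximum(0, s1)

# Layer 2: linear pre-activation, then softmax
s2 = W2 @ a1 + b2
a2 = np.exp(s2) / np.sum(np.exp(s2))

print("Hidden activations (a^(1)):", a1.round(3))
print("Output probabilities (a^(2)):", a2.round(3))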

1.3 Cross-Entropy Loss

For multiclass classification with softmax outputs and one-hot targets:

\[C = -\sum_i y_i \ln a^{(L)}_i\]

Given softmax output \(a^{(L)} = [0.64, 0.004, 0.35]\) and target \(y = [0, 0, 1]\):

Code
a_L = np.array([0.64, 0.004, 0.35])
y = np.array([0, 0, 1])
cross_entropy = -np.sum(y * np.log(a_L))

print(f"Cross-entropy loss: {cross_entropy.round(3)}")
Cross-entropy loss: 1.05

1.4 Backpropagation Equations

1.4.1 Output Layer Error

For softmax with cross-entropy loss, the output layer error simplifies to:

\[\delta^{(L)}_i = a^{(L)}_i - y_i\]

This can be derived by differentiating the cross-entropy loss with respect to the pre-activation values.

Code
delta_L = a_L - y
print(f"Output layer error (δ^(L)): {delta_L}")
Output layer error (δ^(L)): [ 0.64   0.004 -0.65 ]

1.4.2 Hidden Layer Error

For the hidden layers, the error is:

\[\delta^{(l)}_i = \left(\sum_k W^{(l+1)}_{k,i} \delta^{(l+1)}_k\right) \cdot h'(s^{(l)}_i)\]

With ReLU activation, the derivative \(h'(s)\) equals 1 where \(s > 0\) and 0 otherwise.

Code
# Example parameters
s_1 = np.array([1.2, -0.5, 0.8])  # Pre-activations of hidden layer
W_2 = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9]
])  # Weights from hidden to output layer
delta_2 = np.array([0.64, 0.004, -0.65])  # Output layer errors

# Backpropagate to hidden layer
relu_derivative = (s_1 > 0).astype(float)
incoming_gradient = W_2.T @ delta_2
delta_1 = incoming_gradient * relu_derivative

print("Backpropagation calculation for hidden layer:")
print("-" * 50)
print(f"Pre-activations (s¹):      [{s_1[0]:.1f}, {s_1[1]:.1f}, {s_1[2]:.1f}]")
print(f"ReLU derivative:           [{relu_derivative[0]:.0f}, {relu_derivative[1]:.0f}, {relu_derivative[2]:.0f}]")
print(f"Incoming gradient (W²ᵀδ²): [{incoming_gradient[0]:.3f}, {incoming_gradient[1]:.3f}, {incoming_gradient[2]:.3f}]")
print(f"Hidden layer error (δ¹):   [{delta_1[0]:.3f}, {delta_1[1]:.3f}, {delta_1[2]:.3f}]")
print("-" * 50)
Backpropagation calculation for hidden layer:
--------------------------------------------------
Pre-activations (s¹):      [1.2, -0.5, 0.8]
ReLU derivative:           [1, 0, 1]
Incoming gradient (W²ᵀδ²): [-0.389, -0.390, -0.391]
Hidden layer error (δ¹):   [-0.389, -0.000, -0.391]
--------------------------------------------------

Note that the second component \(\delta^{(1)}_2 = 0\) because \(s^{(1)}_2 = -0.5 < 0\), so the ReLU derivative is 0 for that unit.

1.5 Computing Weight Gradients

The gradient of the cost with respect to each weight is:

\[\frac{\partial C}{\partial W^{(l)}_{i,j}} = \delta^{(l)}_i \cdot a^{(l-1)}_j\]

Code
# Example parameters
a_0 = np.array([0.5, -0.3, 0.2])  # Input activations
delta_1 = np.array([-0.389, 0, -0.391])  # Hidden layer errors

# Calculate weight gradients for first layer
dW1 = np.outer(delta_1, a_0)

print("Weight gradients (∂C/∂W^(1)):")
print(dW1.round(3))
Weight gradients (∂C/∂W^(1)):
[[-0.194  0.117 -0.078]
 [ 0.    -0.     0.   ]
 [-0.196  0.117 -0.078]]

1.6 Computing Bias Gradients

The gradient of the cost with respect to each bias is simply the error term:

\[\frac{\partial C}{\partial b^{(l)}_i} = \delta^{(l)}_i\]

Code
# Bias gradients for first layer
db1 = delta_1

print("Bias gradients (∂C/∂b^(1)):", db1.round(3))
Bias gradients (∂C/∂b^(1)): [-0.389  0.    -0.391]

1.7 Weight and Bias Updates

To update weights and biases with learning rate \(\eta\):

\[W^{(l)}_{i,j} \leftarrow W^{(l)}_{i,j} - \eta \cdot \frac{\partial C}{\partial W^{(l)}_{i,j}}\] \[b^{(l)}_i \leftarrow b^{(l)}_i - \eta \cdot \frac{\partial C}{\partial b^{(l)}_i}\]

Code
# Original weights and learning rate
W1_original = np.array([
    [0.3, 0.2, 0.1],
    [0.4, 0.1, 0.2],
    [0.1, 0.5, 0.3]
])
learning_rate = 0.5

# Update weights
W1_updated = W1_original - learning_rate * dW1

print("Original weights (W^(1)):")
print(W1_original)
print("\nUpdated weights:")
print(W1_updated.round(3))
Original weights (W^(1)):
[[0.3 0.2 0.1]
 [0.4 0.1 0.2]
 [0.1 0.5 0.3]]

Updated weights:
[[0.397 0.142 0.139]
 [0.4   0.1   0.2  ]
 [0.198 0.441 0.339]]

2 Backprop Initialization for Multiclass Classification

Important

The combination of softmax and cross-entropy loss leads to a significant mathematical simplification in the backpropagation algorithm. This relationship eliminates the need for explicitly computing the Jacobian matrix during gradient calculation, resulting in a computationally efficient formula for the output layer error.

2.1 Softmax and Cross-Entropy Relationship

The softmax function maps an M-dimensional input vector to a probability distribution:

\[a = h(s) = \frac{1}{\sum_{m=1}^{M} e^{s_m}} \begin{bmatrix} e^{s_1} \\ e^{s_2} \\ \vdots \\ e^{s_M} \end{bmatrix}\]

The multiclass cross-entropy loss measures the discrepancy between predicted and true distributions:

\[C = -\sum_{i=1}^{M} y_i \ln a_i\]

For a one-hot encoded target vector \(y\) (where only one element equals 1 and all others are 0), this simplifies to:

\[C = -\ln a_t\]

where \(t\) is the index of the true class.

Code
# Example calculation
s = np.array([2.1, -1.0, 0.5])
softmax_output = np.exp(s) / np.sum(np.exp(s))
y_true = np.array([0, 1, 0])  # One-hot encoding

# Cross-entropy loss
cross_entropy = -np.sum(y_true * np.log(softmax_output))

print("Softmax output:", softmax_output.round(3))
print("Cross-entropy loss:", cross_entropy.round(3))
Softmax output: [0.802 0.036 0.162]
Cross-entropy loss: 3.321

2.2 Jacobian of Softmax Function

The Jacobian matrix of softmax gives the sensitivity of each output to changes in each input:

\[\dot{A} = \frac{dh(s)}{ds} = \begin{bmatrix} \frac{\partial a_1}{\partial s_1} & \cdots & \frac{\partial a_1}{\partial s_M} \\ \vdots & \ddots & \vdots \\ \frac{\partial a_M}{\partial s_1} & \cdots & \frac{\partial a_M}{\partial s_M} \end{bmatrix}\]

For softmax, the derivatives exhibit a distinctive pattern:

\[\frac{\partial a_i}{\partial s_j} = \begin{cases} a_i(1-a_i) & \text{if } i = j \\ -a_i a_j & \text{if } i \neq j \end{cases}\]

This structure emerges from the normalization constraint of the softmax function. The diagonal elements (\(i = j\)) represent how each output changes with respect to its corresponding input, while off-diagonal elements (\(i \neq j\)) capture the indirect effects due to the normalization.

The form bears resemblance to the derivative of the sigmoid function \(\sigma'(x) = \sigma(x)(1-\sigma(x))\), reflecting the connection between softmax (multinomial) and sigmoid (binomial) models.
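
Equivalently, the entire Jacobian can be written as \(\dot{A} = \text{diag}(a) - a a^T\). Below is a one-line NumPy sketch of this identity (the function name is ours); a loop-based version appears in the next subsection.

Code
def softmax_jacobian_vectorized(a):
    """Softmax Jacobian via diag(a) - a a^T, equivalent to the element-wise formula."""
    return np.diag(a) - np.outer(a, a)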

2.3 The Chain Rule for Vector Functions

The error vector for backpropagation is defined as:

\[\delta = \nabla_s C = \dot{A} \nabla_a C\]

For cross-entropy loss, the gradient with respect to activations is:

\[\nabla_a C = \begin{bmatrix} -\frac{y_1}{a_1} \\ -\frac{y_2}{a_2} \\ \vdots \\ -\frac{y_M}{a_M} \end{bmatrix}\]

Code
def softmax_jacobian(a):
    """Compute Jacobian matrix for softmax function"""
    n = len(a)
    J = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                J[i,j] = a[i] * (1 - a[i])
            else:
                J[i,j] = -a[i] * a[j]
    return J

# Example
a = softmax_output
J = softmax_jacobian(a)

print("Jacobian matrix of softmax:")
print(np.round(J, 3))
Jacobian matrix of softmax:
[[ 0.159 -0.029 -0.13 ]
 [-0.029  0.035 -0.006]
 [-0.13  -0.006  0.136]]

2.4 Analysis of Matrix-Vector Product

Expanding the matrix-vector product for component \(i\):

\[\delta_i = \sum_j \frac{\partial a_j}{\partial s_i} \cdot \left(-\frac{y_j}{a_j}\right)\]

This can be expressed as:

\[\delta_i = \frac{\partial a_i}{\partial s_i} \cdot \left(-\frac{y_i}{a_i}\right) + \sum_{j \neq i} \frac{\partial a_j}{\partial s_i} \cdot \left(-\frac{y_j}{a_j}\right)\]

Substituting the softmax derivatives:

\[\delta_i = a_i(1-a_i) \cdot \left(-\frac{y_i}{a_i}\right) + \sum_{j \neq i} (-a_i a_j) \cdot \left(-\frac{y_j}{a_j}\right)\]

Using the one-hot property \(\sum_{j} y_j = 1\) (so that \(\sum_{j \neq i} y_j = 1 - y_i\)), this simplifies considerably:

\[\delta_i = -y_i(1-a_i) + a_i \sum_{j \neq i} y_j = -y_i + a_i y_i + a_i(1 - y_i) = a_i - y_i\]

which recovers the compact output-layer error \(\delta^{(L)} = a^{(L)} - y\) used in Section 1.4.1.
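
As a numerical sanity check, the sketch below evaluates the matrix-vector product directly and compares it with \(a - y\); it reuses `softmax_output`, `y_true`, and the `softmax_jacobian` helper defined earlier in this section.

Code
# Gradient of the cross-entropy loss with respect to the activations
grad_a = -y_true / softmax_output

# Error via the full Jacobian (the softmax Jacobian is symmetric)
delta_via_jacobian = softmax_jacobian(softmax_output) @ grad_a

# Simplified closed form
delta_simplified = softmax_output - y_true

print("Jacobian route:  ", delta_via_jacobian.round(6))
print("Simplified form: ", delta_simplified.round(6))
print("Match:", np.allclose(delta_via_jacobian, delta_simplified))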

2.5 Denominator vs Numerator Convention

Tip

In the scalar (1-D) case the order of the factors in the chain rule does not matter, but for vector functions it does: the matrix shapes must align for the multiplication to be defined.

Note

This course uses the denominator differentiation convention with the left-handed chain rule, as specified in the problem statement.

In vector calculus, two common conventions exist for organizing partial derivatives in matrices:

Denominator convention (used in this course) defines the Jacobian where element \((i,j)\) is:

\[\left[\frac{dh(s)}{ds}\right]_{i,j} = \frac{\partial h_i(s)}{\partial s_j}\]

For a vector function \(h: \mathbb{R}^n \rightarrow \mathbb{R}^m\), this creates an \(m \times n\) matrix:

\[\frac{dh(s)}{ds} = \begin{bmatrix} \frac{\partial h_1}{\partial s_1} & \frac{\partial h_1}{\partial s_2} & \cdots & \frac{\partial h_1}{\partial s_n} \\ \frac{\partial h_2}{\partial s_1} & \frac{\partial h_2}{\partial s_2} & \cdots & \frac{\partial h_2}{\partial s_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial h_m}{\partial s_1} & \frac{\partial h_m}{\partial s_2} & \cdots & \frac{\partial h_m}{\partial s_n} \end{bmatrix}\]

Numerator convention, by contrast, organizes derivatives where element \((i,j)\) is:

\[\left[\frac{dh(s)}{ds}\right]_{i,j} = \frac{\partial h_j(s)}{\partial s_i}\]

Producing an \(n \times m\) matrix:

\[\frac{dh(s)}{ds} = \begin{bmatrix} \frac{\partial h_1}{\partial s_1} & \frac{\partial h_2}{\partial s_1} & \cdots & \frac{\partial h_m}{\partial s_1} \\ \frac{\partial h_1}{\partial s_2} & \frac{\partial h_2}{\partial s_2} & \cdots & \frac{\partial h_m}{\partial s_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial h_1}{\partial s_n} & \frac{\partial h_2}{\partial s_n} & \cdots & \frac{\partial h_m}{\partial s_n} \end{bmatrix}\]

With the left-handed chain rule, for functions \(f(g(x))\), derivatives are multiplied from left to right:

\[\frac{df}{dx} = \frac{df}{dg} \frac{dg}{dx}\]

The choice of convention affects the arrangement of derivatives in matrices, which is critical when evaluating expressions like \(\dot{A}\nabla_a C\). However, with proper application, both conventions lead to the same mathematical results.

3 Logistic Regression for Multi-class Classification

3.1 From Binary to Multi-class Classification

Multi-class logistic regression with softmax extends binary classification approaches to problems with multiple output classes. Instead of predicting a single probability as in binary logistic regression, softmax produces a probability distribution across \(K\) classes.

For a \(K\)-class problem, the model defines \(K\) score functions:

\[ s_k(x) = w_k^T x, \quad k \in \{0,1,\ldots,K-1\} \]

The softmax transformation converts these scores to a proper probability distribution:

\[ P(Y=k|x) = \frac{e^{s_k(x)}}{\sum_{i=0}^{K-1} e^{s_i(x)}} \]

The decision boundary between any pair of classes \(i\) and \(j\) is the hyperplane where \(s_i(x) = s_j(x)\); together, these pairwise boundaries partition the feature space into \(K\) decision regions.

Figure 1: Multi-class decision boundaries with weight vectors in 2D feature space. Each region represents the area where one class has the highest probability. The arrows indicate the direction of each class’s weight vector.

The geometry of the softmax classifier reveals how it partitions the feature space into regions corresponding to each class. The weight vectors point in the directions that maximize each class’s score, with decision boundaries occurring where adjacent classes have equal scores.
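
The sketch below shows this scoring-and-prediction pipeline on made-up weights (one row of `W` per class) plus a small bias vector; the numbers are illustrative only.

Code
# Hypothetical model: K = 3 classes, d = 2 features
W = np.array([[ 1.0,  0.5],
              [-0.5,  1.0],
              [-0.5, -1.5]])   # one weight vector w_k per row
b = np.array([0.0, 0.1, -0.1])

x = np.array([0.8, -0.2])

# Scores s_k(x) = w_k^T x + b_k, then softmax over classes
scores = W @ x + b
probs = np.exp(scores) / np.sum(np.exp(scores))
predicted_class = np.argmax(probs)

print("Scores:", scores.round(3))
print("Class probabilities:", probs.round(3))
print("Predicted class:", predicted_class)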

3.1.1 Numerical Stability Considerations

Warning

The softmax function can encounter numerical instability due to the exponential operation. When input values have large magnitudes, exponentials can overflow floating-point representation limits.

The standard numerical stabilization technique subtracts the maximum score from all scores before exponentiation:

\[ P(Y=k|x) = \frac{e^{s_k - \max_j s_j}}{\sum_{i} e^{s_i - \max_j s_j}} \]

This transformation leaves the probabilities unchanged, because subtracting a constant \(c\) from every score multiplies both the numerator and the denominator by the same factor \(e^{-c}\):

\[ \frac{e^{s_k}}{e^{s_1} + \ldots + e^{s_K}} = \frac{e^{s_k - c}}{e^{s_1-c} + \ldots + e^{s_K-c}} \]

This technique is critical for high-dimensional problems like MNIST, where the linear combination of 784 features can produce scores with large magnitudes.
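
A minimal sketch of the stabilized computation (the function name is ours):

Code
def stable_softmax(s):
    """Numerically stable softmax: shift scores by their maximum before exponentiating."""
    shifted = s - np.max(s)
    exp_s = np.exp(shifted)
    return exp_s / np.sum(exp_s)

# Without the shift, np.exp(1000.0) overflows to inf; with it, the result is well defined
s_large = np.array([1000.0, 995.0, 990.0])
print(stable_softmax(s_large).round(4))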

3.2 Cross-Entropy Loss and Gradient

For multi-class classification, the cross-entropy loss measures the divergence between the predicted probability distribution and the one-hot target distribution:

\[ L(w) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=0}^{K-1} y_k^{(i)} \log(P(Y=k|x^{(i)})) \]

When \(y\) is one-hot encoded, this simplifies to:

\[ L(w) = -\frac{1}{N}\sum_{i=1}^{N} \log(P(Y=y^{(i)}|x^{(i)})) \]

Note

The gradient computation leverages the result derived earlier:

\[\nabla_{s^{(i)}} L = a^{(i)} - y^{(i)}\]

This remarkably clean form is a direct consequence of combining softmax activation with cross-entropy loss.

To derive the gradients with respect to the weights, we apply the chain rule:

\[ \frac{\partial L}{\partial w_{k,j}} = \frac{1}{N}\sum_{i=1}^{N} (a_k^{(i)} - y_k^{(i)})x_j^{(i)} \]

In vector form, the gradient with respect to the weight vector \(w_k\) is:

\[ \nabla_{w_k} L = \frac{1}{N}\sum_{i=1}^{N} (a_k^{(i)} - y_k^{(i)})\, x^{(i)} \]

This gradient expression is used directly in the optimization algorithms to update the weight vectors.
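
In vectorized form, with the rows of `X` as samples and `Y` as one-hot targets, all class gradients can be computed with a single matrix product. The helper below is a sketch under those assumptions (its name and signature are ours, not part of the assignment):

Code
def softmax_gradients(X, Y, W, b):
    """Cross-entropy gradients for softmax regression.

    X: (N, D) inputs, Y: (N, K) one-hot targets,
    W: (K, D) weights, b: (K,) biases.
    """
    scores = X @ W.T + b                            # (N, K)
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)               # softmax probabilities, (N, K)

    N = X.shape[0]
    dW = (A - Y).T @ X / N                          # (K, D): averaged (a - y) x^T
    db = (A - Y).mean(axis=0)                       # (K,)
    return dW, db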

3.2.1 Optimization Landscape and Gradient Descent

The loss function \(L(w)\) creates a high-dimensional surface in the parameter space. For multi-class logistic regression with \(K\) classes and \(d\) features, this space has \(K \times d\) dimensions plus \(K\) bias terms.

Figure 2: Synthetic 2D visualization of a smooth loss surface with a global minimum and its gradient directions.

The negative gradient vectors (shown in red) always point in the direction of steepest descent on the loss surface. Gradient descent optimization follows these vectors to approach the global minimum.

3.3 Batch vs. Stochastic Gradient Descent Trajectories

Batch Gradient Descent (BGD) and Stochastic Gradient Descent (SGD) exhibit fundamentally different convergence behaviors due to the nature of their gradient computations.

Figure 3: Simulated convergence paths for Batch Gradient Descent (BGD) and Stochastic Gradient Descent (SGD). BGD follows a straight, smooth path while SGD fluctuates due to gradient noise.

Key differences between optimization methods:

  1. Gradient Estimation:
    • BGD: Computes the exact gradient over the entire dataset
    • SGD: Estimates the gradient using a single sample or mini-batch
  2. Convergence Characteristics:
    • BGD: Smooth, deterministic path
    • SGD: Noisy, stochastic path that may overshoot but can escape poor local minima
  3. Computational Efficiency:
    • BGD: \(O(N)\) per update
    • SGD: \(O(1)\) per update (pure) or \(O(B)\) for mini-batch

3.3.1 Mini-Batch SGD and Computational Considerations

Mini-batch SGD represents a compromise between batch and stochastic methods. Key computational properties:

| Method     | Update Rule                              | Computation/Update | Memory    | Updates/Epoch         | Variance |
|------------|------------------------------------------|--------------------|-----------|-----------------------|----------|
| Batch GD   | \(W \leftarrow W - \eta \nabla L(W)\)    | \(O(ND)\)          | \(O(ND)\) | 1                     | None     |
| SGD        | \(W \leftarrow W - \eta \nabla L_i(W)\)  | \(O(D)\)           | \(O(D)\)  | \(N\)                 | High     |
| Mini-batch | \(W \leftarrow W - \eta \nabla L_B(W)\)  | \(O(BD)\)          | \(O(BD)\) | \(\lceil N/B \rceil\) | Medium   |

where \(N\) = dataset size, \(D\) = dimensionality, \(B\) = mini-batch size, and \(\eta\) = learning rate.

The variance of the mini-batch gradient estimate decreases with batch size:

\[ \text{Var}(\nabla L_B) \approx \frac{\sigma^2}{B} \]

where \(\sigma^2\) is the variance of the per-sample gradients.

This explains why larger mini-batches allow for more stable convergence and often support larger learning rates.
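
Putting the pieces together, a minimal mini-batch SGD training loop might look like the sketch below. It assumes the `softmax_gradients` helper sketched in Section 3.2 and training arrays `X_train` (inputs) and `Y_train` (one-hot labels); the hyperparameter defaults are illustrative.

Code
def train_minibatch_sgd(X_train, Y_train, n_classes, batch_size=100,
                        learning_rate=0.1, n_epochs=10):
    N, D = X_train.shape
    W = np.random.randn(n_classes, D) * 0.01   # small random initialization
    b = np.zeros(n_classes)

    for epoch in range(n_epochs):
        indices = np.random.permutation(N)     # reshuffle every epoch
        for start in range(0, N, batch_size):
            batch = indices[start:start + batch_size]
            dW, db = softmax_gradients(X_train[batch], Y_train[batch], W, b)
            W -= learning_rate * dW
            b -= learning_rate * db
    return W, b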

3.3.2 Computational Complexity and Memory Requirements

The different optimization approaches have distinct computational profiles:

| Method         | Time Complexity                     | Memory Requirements       | Total Computation to ε-Accuracy   |
|----------------|-------------------------------------|---------------------------|-----------------------------------|
| BGD            | \(\mathcal{O}(NKD)\) per iteration  | \(\mathcal{O}(ND + KD)\)  | \(\mathcal{O}(NKD/\varepsilon)\)  |
| Mini-batch SGD | \(\mathcal{O}(BKD)\) per iteration  | \(\mathcal{O}(BD + KD)\)  | \(\mathcal{O}(KD/\varepsilon^2)\) |
| Pure SGD       | \(\mathcal{O}(KD)\) per iteration   | \(\mathcal{O}(D + KD)\)   | \(\mathcal{O}(KD/\varepsilon^2)\) |

where \(N\) is the dataset size, \(K\) is the number of classes, \(D\) is the input dimensionality, \(B\) is the mini-batch size, and \(\varepsilon\) is the target accuracy.

For MNIST with \(N=60,000\), \(D=784\), and \(K=10\), these differences are substantial:

+------------------------+--------------------------+------------+---------------------+
| Method                 | Computation per Update   | Memory     |   Updates per Epoch |
+========================+==========================+============+=====================+
| Batch GD               | 470,400,000              | 47,047,840 |                   1 |
+------------------------+--------------------------+------------+---------------------+
| Mini-batch SGD (B=100) | 784,000                  | 86,240     |                 600 |
+------------------------+--------------------------+------------+---------------------+
| Pure SGD (B=1)         | 7,840                    | 8,624      |               60000 |
+------------------------+--------------------------+------------+---------------------+

The efficiency advantage of stochastic methods typically increases with dataset size, making them particularly valuable for large datasets like MNIST.
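
The entries in the table above follow directly from the complexity expressions \(\mathcal{O}(BKD)\) (computation per update) and \(\mathcal{O}(BD + KD)\) (memory); a quick sketch of the arithmetic, counting scalar operations and stored array entries:

Code
N, D, K = 60_000, 784, 10

for method, B in [("Batch GD", N), ("Mini-batch SGD (B=100)", 100), ("Pure SGD (B=1)", 1)]:
    computation = B * K * D      # score computations per update
    memory = B * D + K * D       # one batch of inputs plus the weight matrix
    updates = N // B             # updates per epoch
    print(f"{method:24s} computation={computation:>11,} memory={memory:>11,} updates/epoch={updates:>6,}")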

3.3.3 Mini-Batch Selection and Optimization

Selecting the appropriate mini-batch size involves a balance between computational efficiency and statistical accuracy:

The tradeoffs between different batch sizes can be summarized in the following table:

| Batch Size     | Gradient Variance | Hardware Utilization | Memory Requirements | Updates per Epoch | Appropriate Use Cases               |
|----------------|-------------------|----------------------|---------------------|-------------------|-------------------------------------|
| B=1 (Pure SGD) | Highest           | Lowest               | Minimal             | Maximum (60,000)  | Online learning, memory constraints |
| B=10           | High              | Low                  | Very low            | 6,000             | Quick exploration, limited hardware |
| B=100          | Medium            | Good                 | Moderate            | 600               | Balanced approach for most cases    |
| B=1000         | Low               | Excellent            | High                | 60                | GPU optimization, stable gradients  |
| B=60000 (BGD)  | Lowest            | Excellent            | Highest             | 1                 | Convex problems, exact gradients    |

When selecting a mini-batch size for MNIST, several factors come into play:

  1. Statistical Efficiency: Larger batches provide more accurate gradient estimates with lower variance. The variance of the gradient estimate decreases as \(\mathcal{O}(1/B)\) with batch size \(B\).

  2. Computational Efficiency: Larger batches enable better hardware utilization through parallelization, particularly on GPUs or specialized accelerators.

  3. Memory Requirements: Batch size directly impacts memory usage, with larger batches requiring proportionally more memory.

  4. Update Frequency: Smaller batches allow for more frequent parameter updates per epoch, which can accelerate early learning.

Typical recommendations for MNIST:

  • Pure SGD (B=1): Provides maximum update frequency but high variance and poor hardware utilization
  • Small mini-batches (B=10-32): Good for initial exploration or limited hardware
  • Medium mini-batches (B=64-256): Balanced approach with good hardware utilization
  • Large mini-batches (B=500+): Better for later stages of training or when using specialized hardware

3.3.4 Comparative Efficiency of Optimization Methods

The theoretical and practical efficiency of different optimization methods can be quantified and compared.

For a proper comparison, it’s crucial to consider both algorithmic and computational efficiency: while SGD may require more iterations to reach the same accuracy as BGD, each iteration requires significantly less computation. This is the fundamental trade-off between the two approaches.

The comparative advantages become more pronounced with increasing dataset size:

  1. Theoretical Convergence Rates:
    • For strongly convex functions, BGD achieves \(\mathcal{O}(1/t)\) convergence rate
    • SGD achieves \(\mathcal{O}(1/\sqrt{t})\) convergence rate
  2. Computational Considerations:
    • BGD: One parameter update requires \(\mathcal{O}(ND)\) operations
    • SGD: One parameter update requires \(\mathcal{O}(D)\) operations
  3. Real-world Implications:
    • For large datasets, SGD often reaches acceptable accuracy with orders of magnitude less computation
    • BGD’s deterministic trajectory may converge to slightly better final solutions

3.4 Learning Rate Dynamics

The learning rate determines the size of parameter updates and dramatically affects convergence behavior.

Learning rate selection involves several critical considerations:

  1. Too small: Convergence is unnecessarily slow, potentially requiring excessive iterations.
  2. Too large: Causes oscillatory behavior or divergence when updates overshoot the minimum.
  3. Optimal range: Typically scales with dataset characteristics and model architecture.

Results on Polyak-Ruppert averaging suggest that the learning rate for SGD should decay roughly as \(\eta \propto \frac{1}{\sqrt{t}}\), where \(t\) is the iteration number. This theoretical result motivates common decay schedules:

\[\eta_t = \frac{\eta_0}{\sqrt{1 + \alpha t}}\]

where \(\eta_0\) is the initial rate and \(\alpha\) controls decay speed.
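
A short sketch of this schedule (parameter values are illustrative):

Code
def decayed_learning_rate(t, eta_0=0.1, alpha=0.01):
    """Learning rate eta_t = eta_0 / sqrt(1 + alpha * t) at iteration t."""
    return eta_0 / np.sqrt(1 + alpha * t)

for t in [0, 100, 1000, 10000]:
    print(f"iteration {t:>6d}: eta = {decayed_learning_rate(t):.4f}")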

For MNIST with softmax classification, appropriate learning rates typically fall within:

  • Batch GD: 0.1 - 1.0
  • Mini-batch SGD (batch size 100): 0.01 - 0.2
  • Pure SGD (batch size 1): 0.001 - 0.05

3.4.1 Convergence Analysis

Evaluating optimization performance requires tracking multiple metrics throughout training.

For MNIST with softmax regression, the convergence dynamics typically exhibit:

  1. Training vs. Test Gap: The difference between training and test performance provides insight into generalization. For softmax regression on MNIST, this gap is often small due to the model’s limited capacity.

  2. Epoch-to-Sample Conversion: Comparing batch and stochastic methods requires careful attention to units. One epoch of BGD corresponds to N/B updates in mini-batch SGD, where N is the dataset size and B is the batch size.

  3. Convergence Rate Analysis: Theoretically, SGD and BGD exhibit different convergence rates:

    • BGD: \(\mathcal{O}(1/t)\) convergence rate with appropriate learning rate
    • SGD: \(\mathcal{O}(1/\sqrt{t})\) convergence rate, slower in theory but more updates per epoch
  4. Computational Efficiency Metric: A comprehensive comparison incorporates both the convergence rate and computational cost:

    \[\text{Efficiency} = \frac{\text{Optimization Progress}}{\text{Computational Cost}}\]

    For MNIST with 60,000 training samples, BGD requires 60,000 times more computation per update than pure SGD. This often makes SGD more efficient despite its noisier trajectory.

3.4.2 Convergence Theorems and Guarantees

The theoretical foundations of optimization methods provide insights into their behavior:

  1. Convexity and Convergence: For convex functions (like softmax regression), gradient descent with appropriate learning rate is guaranteed to converge to the global minimum.

  2. SGD Convergence Theorem: For strongly convex functions, SGD with decreasing learning rate \(\eta_t = \frac{\eta_0}{1+\lambda t}\) converges in expectation to the optimal solution:

    \[\mathbb{E}[\|w_t - w^*\|^2] \leq \frac{C}{t}\]

    where \(C\) is a constant depending on the initial distance and gradient variance.

  3. Mini-batch Effect: Theoretically, using a mini-batch of size \(B\) reduces the gradient variance by a factor of \(\frac{1}{B}\), which can allow for proportionally larger learning rates.

These theoretical guarantees provide confidence in the convergence of the optimization methods used for softmax classification on MNIST.

3.5 Weight Visualization and Model Interpretation

Visualizing the learned weights provides insight into the features the model has identified for each class.

The weight visualization reveals that:

  1. Each weight vector forms a template-like representation of the corresponding digit
  2. Positive weights (warmer colors) indicate features that increase the score for that class
  3. Negative weights (cooler colors) represent features that decrease the score
  4. The templates highlight the discriminative features rather than reconstructing the digits

This visualization demonstrates that multi-class logistic regression learns a set of linear filters, each optimized to detect one class against all others.
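
A minimal matplotlib sketch of such a visualization, assuming a trained weight matrix `W` of shape (10, 784):

Code
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for digit, ax in enumerate(axes.ravel()):
    # Each row of W is a 784-dimensional template; reshape to 28x28 for display
    ax.imshow(W[digit].reshape(28, 28), cmap="RdBu_r")
    ax.set_title(str(digit))
    ax.axis("off")
plt.tight_layout()
plt.show()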

3.6 Practical Implementation

For effective implementation of softmax regression on MNIST, consider the following practical strategies:

  1. Normalization: Scale pixel values to the [0,1] range by dividing by 255.

  2. Weight Initialization: Initialize weights with small random values to break symmetry:

    W = np.random.randn(n_classes, n_features) * 0.01
    b = np.zeros(n_classes)
  3. Learning Rate Schedule: Implement a decaying learning rate to balance initial progress and final convergence:

    def learning_rate(epoch, initial_rate=0.1, decay=0.95):
        return initial_rate * (decay ** epoch)
  4. Shuffling: Randomize sample order for SGD to prevent cyclical patterns:

    indices = np.random.permutation(len(X_train))
    X_shuffled = X_train[indices]
    y_shuffled = y_train[indices]
  5. Evaluation Strategy: Track multiple metrics on both training and validation sets:

    • Loss (cross-entropy)
    • Accuracy (percentage of correct predictions)
    • Confusion matrix (to identify class-specific performance)
  6. Early Stopping: Monitor validation performance to prevent overfitting and unnecessary computation.

These strategies help achieve efficient and effective training while avoiding common pitfalls.

3.6.1 Model Persistence and Evaluation

Proper model persistence enables later analysis and reuse:

# Example format for model persistence
model_data = {
    'W': W,                # Weight matrix (10 × 784)
    'b': b,                # Bias vector (10,)
    'n_features': 784,     # Number of input features
    'n_classes': 10,       # Number of output classes
    'hyperparameters': {   # Training configuration
        'learning_rate': learning_rate,
        'batch_size': batch_size,
        'n_iterations': n_iterations
    },
    'performance': {       # Evaluation metrics
        'train_accuracy': train_accuracy,
        'test_accuracy': test_accuracy,
        'train_loss': train_loss,
        'test_loss': test_loss
    }
}

The HDF5 format specified in the assignment provides an efficient container for these model parameters, preserving numerical precision and array structures.
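
A sketch of writing these arrays with `h5py` is shown below; the file name and dataset layout here are illustrative, so follow the assignment's specification for the required schema.

Code
import h5py

# Hypothetical file name and layout; adapt to the assignment's required format
with h5py.File("softmax_model.h5", "w") as f:
    f.create_dataset("W", data=W)          # weight matrix (10 × 784)
    f.create_dataset("b", data=b)          # bias vector (10,)
    f.attrs["n_features"] = 784
    f.attrs["n_classes"] = 10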

For comprehensive evaluation, compute the confusion matrix:

The confusion matrix reveals systematic patterns in model errors, providing insight into similar digit pairs that the model struggles to differentiate. This information can guide potential model improvements or feature engineering efforts.
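
A minimal NumPy sketch for computing the confusion matrix, assuming integer label arrays `y_true_labels` and `y_pred_labels` (names are ours):

Code
def confusion_matrix(y_true_labels, y_pred_labels, n_classes=10):
    """Entry (i, j) counts samples whose true class is i and predicted class is j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true_labels, y_pred_labels):
        cm[t, p] += 1
    return cm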