EE 541 - Unit 6
Fall 2025
Model: \(p = \sigma(z)\) where \(z = \mathbf{w}^T\mathbf{x} + b\)
Sigmoid: \(\sigma(z) = \frac{1}{1 + e^{-z}}\)
Loss (binary cross-entropy): \[\mathcal{L} = -[y\log p + (1-y)\log(1-p)]\]
Derivative of sigmoid: \[\begin{align} \sigma'(z) &= \frac{d}{dz}\left(\frac{1}{1+e^{-z}}\right)\\ &= \frac{e^{-z}}{(1+e^{-z})^2}\\ &= \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}}\\ &= \sigma(z)(1-\sigma(z)) \end{align}\]
Loss gradient w.r.t. \(z\): \[\begin{align} \frac{\partial \mathcal{L}}{\partial z} &= \frac{\partial \mathcal{L}}{\partial p} \cdot \frac{\partial p}{\partial z}\\ &= \left[-\frac{y}{p} + \frac{1-y}{1-p}\right] \cdot \sigma'(z)\\ &= \left[-\frac{y}{p} + \frac{1-y}{1-p}\right] \cdot p(1-p)\\ &= -(y - p) \end{align}\]
Weight gradient: \[\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \frac{\partial \mathcal{L}}{\partial z} \cdot \frac{\partial z}{\partial \mathbf{w}} = (p - y)\mathbf{x}\]
Bias gradient: \[\frac{\partial \mathcal{L}}{\partial b} = \frac{\partial \mathcal{L}}{\partial z} \cdot \frac{\partial z}{\partial b} = p - y\]
Setup: \(\mathbf{w} = [0.5, -0.3]^T\), \(b = 0\), \(\mathbf{x} = [2, 1]^T\), target \(y = 1\)
Forward pass:
\[z = 0.5(2) + (-0.3)(1) + 0 = 0.7\]
\[p = \sigma(0.7) = \frac{1}{1+e^{-0.7}} = 0.668\]
Loss: \[\begin{align} \mathcal{L} &= -[1 \cdot \log(0.668) + 0 \cdot \log(0.332)]\\ &= -\log(0.668)\\ &= 0.404 \end{align}\]
Gradient computation:
\[\frac{\partial \mathcal{L}}{\partial z} = 0.668 - 1 = -0.332\]
\[\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = -0.332 \begin{bmatrix} 2 \\ 1 \end{bmatrix} = \begin{bmatrix} -0.664 \\ -0.332 \end{bmatrix}\]
\[\frac{\partial \mathcal{L}}{\partial b} = -0.332\]
Numerical check (finite difference with \(\epsilon = 10^{-5}\)):
For \(w_1\): \[\frac{\mathcal{L}(w_1 + \epsilon) - \mathcal{L}(w_1 - \epsilon)}{2\epsilon} = -0.664\]
Matches analytical gradient
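A short NumPy sketch of this check, assuming the setup above (\(\mathbf{w} = [0.5, -0.3]^T\), \(b = 0\), \(\mathbf{x} = [2, 1]^T\), \(y = 1\)); variable names are illustrative:

import numpy as np

w = np.array([0.5, -0.3])
b = 0.0
x = np.array([2.0, 1.0])
y = 1.0

def loss(w, b):
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))       # sigmoid of z = w^T x + b
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
grad_w = (p - y) * x                              # analytical gradient (p - y) x
eps = 1e-5
fd = (loss(w + np.array([eps, 0.0]), b) - loss(w - np.array([eps, 0.0]), b)) / (2 * eps)
print(grad_w[0], fd)                              # both approximately -0.664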
Two-layer network:
\[\mathbf{x} \xrightarrow{W^{(1)}} \mathbf{z}^{(1)} \xrightarrow{\sigma} \mathbf{a}^{(1)} \xrightarrow{W^{(2)}} \mathbf{z}^{(2)} \xrightarrow{\sigma} \hat{y}\]
Forward pass equations:
Layer 1: \[\mathbf{z}^{(1)} = W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}\] \[\mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)})\]
Layer 2: \[\mathbf{z}^{(2)} = W^{(2)}\mathbf{a}^{(1)} + \mathbf{b}^{(2)}\] \[\hat{y} = \sigma(\mathbf{z}^{(2)})\]
Loss: \[\mathcal{L} = -[y\log \hat{y} + (1-y)\log(1-\hat{y})]\]
Question: How to compute \(\frac{\partial \mathcal{L}}{\partial W^{(1)}}\)?
Loss depends on \(W^{(1)}\) through path: \[\mathcal{L} \leftarrow \hat{y} \leftarrow \mathbf{z}^{(2)} \leftarrow \mathbf{a}^{(1)} \leftarrow \mathbf{z}^{(1)} \leftarrow W^{(1)}\]
Need systematic method for arbitrary depth
Symbolic differentiation:
Expand \(\mathcal{L}\) as function of inputs, differentiate
Example with 3 layers: \[\mathcal{L}(W^{(1)}, W^{(2)}, W^{(3)}) = f^{(3)}(W^{(3)}f^{(2)}(W^{(2)}f^{(1)}(W^{(1)}\mathbf{x})))\]
Applying chain rule symbolically: \[\frac{\partial \mathcal{L}}{\partial W^{(1)}} = \frac{\partial f^{(3)}}{\partial f^{(2)}} \cdot \frac{\partial f^{(2)}}{\partial f^{(1)}} \cdot \frac{\partial f^{(1)}}{\partial W^{(1)}}\]
Problem: Expression size grows exponentially
Numerical differentiation:
For each parameter \(w_{ij}\): \[\frac{\partial \mathcal{L}}{\partial w_{ij}} \approx \frac{\mathcal{L}(w_{ij} + \epsilon) - \mathcal{L}(w_{ij} - \epsilon)}{2\epsilon}\]
Requires 2 forward passes per parameter
Example network: - Layer 1: 784 → 256 (200,704 parameters) - Layer 2: 256 → 128 (32,768 parameters) - Layer 3: 128 → 10 (1,280 parameters)
Total parameters: \(n = 234{,}752\)
Cost per gradient evaluation: - Numerical: \(2n\) forward passes = 469,504 - Backpropagation: 1 forward + 1 backward ≈ 2 forward passes
Speedup: \(\frac{2n}{2} = n = 234{,}752\times\)
Modern networks: - ResNet-50: 25M parameters - GPT-3: 175B parameters - Numerical differentiation completely infeasible
Must reuse intermediate computations
Both approaches fail: - Symbolic: Exponential expression growth - Numerical: Linear cost in parameters - Need: Constant cost independent of parameter count
Univariate chain rule:
For \(y = f(g(x))\): \[\frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}\]
Example: \(y = \sigma(wx + b)\) with \(w=2, b=1, x=0.5\)
Forward: \[z = 2(0.5) + 1 = 2\] \[y = \sigma(2) = 0.881\]
Derivative: \[\frac{dy}{dx} = \sigma'(2) \cdot 2 = 0.881(1-0.881) \cdot 2 = 0.210\]
Multivariate chain rule:
For \(f: \mathbb{R}^m \to \mathbb{R}\) and \(\mathbf{g}: \mathbb{R}^n \to \mathbb{R}^m\):
\[\frac{\partial f}{\partial x_i} = \sum_{j=1}^m \frac{\partial f}{\partial y_j} \frac{\partial y_j}{\partial x_i}\]
where \(\mathbf{y} = \mathbf{g}(\mathbf{x})\)
Matrix form:
\[\nabla_{\mathbf{x}} f = \left(\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right)^T \nabla_{\mathbf{y}} f\]
where \(\frac{\partial \mathbf{y}}{\partial \mathbf{x}} \in \mathbb{R}^{m \times n}\) is the Jacobian
Concrete example:
\(f(\mathbf{y}) = \mathbf{y}^T\mathbf{y}\), \(\mathbf{y} = W\mathbf{x}\) where
\[W = \begin{bmatrix} 0.5 & 0.3 \\ -0.2 & 0.4 \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}\]
Forward: \[\mathbf{y} = \begin{bmatrix} 0.5 + 0.6 \\ -0.2 + 0.8 \end{bmatrix} = \begin{bmatrix} 1.1 \\ 0.6 \end{bmatrix}\] \[f = 1.1^2 + 0.6^2 = 1.57\]
Gradient: \[\nabla_{\mathbf{y}} f = 2\mathbf{y} = \begin{bmatrix} 2.2 \\ 1.2 \end{bmatrix}\] \[\nabla_{\mathbf{x}} f = W^T \nabla_{\mathbf{y}} f = \begin{bmatrix} 0.5 & -0.2 \\ 0.3 & 0.4 \end{bmatrix} \begin{bmatrix} 2.2 \\ 1.2 \end{bmatrix} = \begin{bmatrix} 0.86 \\ 1.14 \end{bmatrix}\]
Jacobian transpose propagates gradients backward
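A quick NumPy check of this example (a sketch; nothing here beyond the numbers above):

import numpy as np

W = np.array([[0.5, 0.3], [-0.2, 0.4]])
x = np.array([1.0, 2.0])

y = W @ x                      # forward: y = Wx = [1.1, 0.6]
f = y @ y                      # f = y^T y = 1.57
grad_y = 2 * y                 # grad_y f = 2y
grad_x = W.T @ grad_y          # Jacobian transpose: grad_x f = W^T grad_y f
print(grad_x)                  # approximately [0.86, 1.14]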
Forward computation:
Starting from \(\mathbf{a}^{(0)} = \mathbf{x}\):
For \(l = 1, 2, \ldots, L\): \[\begin{align} \mathbf{z}^{(l)} &= W^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\\ \mathbf{a}^{(l)} &= \sigma^{(l)}(\mathbf{z}^{(l)}) \end{align}\]
Finally: \(\mathcal{L} = \mathcal{L}(\mathbf{a}^{(L)}, \mathbf{y})\)
Must store all \(\mathbf{a}^{(l)}\) and \(\mathbf{z}^{(l)}\) for backward pass
Storage requirement: \(O(\sum_{l=1}^L n_l)\) for all activations
Backward computation:
Starting from \(\frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(L)}}\):
For \(l = L, L-1, \ldots, 1\): \[\begin{align} \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}} &= \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(l)}} \odot \sigma'^{(l)}(\mathbf{z}^{(l)})\\ \frac{\partial \mathcal{L}}{\partial W^{(l)}} &= \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}} (\mathbf{a}^{(l-1)})^T\\ \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(l-1)}} &= (W^{(l)})^T \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}} \end{align}\]
Same graph, opposite direction
Complexity of backward ≈ Complexity of forward
Each layer implements two operations:
forward(input)
: 1. Compute output from input 2. Cache input and intermediate values 3. Return output
backward(grad_output)
: 1. Receive gradient w.r.t. output 2. Compute gradient w.r.t. input 3. Compute gradient w.r.t. parameters 4. Return gradient w.r.t. input
Example: Linear layer
import numpy as np

class Linear:
    def __init__(self, W, b):
        self.W = W                  # weight matrix, shape (n_out, n_in)
        self.b = b                  # bias, shape (n_out, 1)

    def forward(self, a_prev):
        self.a_prev = a_prev        # cache input for backward
        self.z = self.W @ a_prev + self.b
        return self.z

    def backward(self, grad_z):
        self.grad_W = grad_z @ self.a_prev.T   # dL/dW = grad_z a_prev^T
        self.grad_b = grad_z                   # dL/db = grad_z
        grad_a_prev = self.W.T @ grad_z        # dL/da_prev = W^T grad_z
        return grad_a_prev
Example: Sigmoid layer
class Sigmoid:
    def forward(self, z):
        self.a = 1 / (1 + np.exp(-z))
        return self.a

    def backward(self, grad_a):
        grad_z = grad_a * self.a * (1 - self.a)
        return grad_z
Each module only needs its own inputs/outputs
No module needs global network structure
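A possible way to chain the two modules above for one forward/backward pass. This is a sketch assuming the Linear and Sigmoid classes as written; the shapes and the downstream gradient are illustrative:

import numpy as np

np.random.seed(0)
lin = Linear(W=np.random.randn(3, 2) * 0.1, b=np.zeros((3, 1)))
sig = Sigmoid()

x = np.array([[1.0], [2.0]])                  # column-vector input
a = sig.forward(lin.forward(x))               # forward through both modules
grad_a = a - np.ones_like(a)                  # stand-in gradient from a downstream loss
grad_x = lin.backward(sig.backward(grad_a))   # backward in reverse order
print(lin.grad_W.shape, grad_x.shape)         # (3, 2) and (2, 1)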
Forward: \(\mathbf{x} \xrightarrow{W^{(1)}} \mathbf{h} \xrightarrow{W^{(2)}} \hat{\mathbf{y}} \to \mathcal{L}\)
Store intermediate activations
Backward: \(\frac{\partial \mathcal{L}}{\partial W^{(1)}} \xleftarrow{} \frac{\partial \mathcal{L}}{\partial W^{(2)}} \xleftarrow{} \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}}\)
Update (\(\forall l = 1, \ldots, L\)): \[W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial W^{(l)}}\]
\[\mathbf{b}^{(l)} \leftarrow \mathbf{b}^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}}\]
Chain rule applied systematically from output to input
Network with \(L\) layers:
Layer index: \(l \in \{1, 2, \ldots, L\}\)
Pre-activation: \[\mathbf{z}^{(l)} = W^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\]
Post-activation: \[\mathbf{a}^{(l)} = \sigma^{(l)}(\mathbf{z}^{(l)})\]
Dimensions: - \(W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}\) - \(\mathbf{a}^{(l)}, \mathbf{z}^{(l)}, \mathbf{b}^{(l)} \in \mathbb{R}^{n_l}\)
Boundary conditions:
Input: \(\mathbf{a}^{(0)} = \mathbf{x} \in \mathbb{R}^{n_0}\)
Output: \(\mathbf{a}^{(L)} = \hat{\mathbf{y}} \in \mathbb{R}^{n_L}\)
Loss: \(\mathcal{L} = \mathcal{L}(\hat{\mathbf{y}}, \mathbf{y})\)
Objective:
Compute \(\frac{\partial \mathcal{L}}{\partial W^{(l)}}\) and \(\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}}\) for all \(l = 1, \ldots, L\)
Efficiently: Cost \(O(\text{forward pass})\), not \(O(n \cdot \text{forward pass})\)
Architecture: Two-layer scalar network
\[x \xrightarrow{w_1} s_1 \xrightarrow{\sigma} a_1 \xrightarrow{w_2} s_2 \xrightarrow{\sigma} \hat{y}\]
One scalar value per layer
Parameters: - \(w_1 = 0.5\), \(b_1 = 0\) - \(w_2 = -1\), \(b_2 = 0.5\)
Data: - Input: \(x = 2\) - Target: \(y = 1\)
Goal: Hand-compute all gradients to see pattern
Forward values computed left to right
Layer 1:
\[s_1 = w_1 x + b_1 = 0.5(2) + 0 = 1\]
\[a_1 = \sigma(1) = \frac{1}{1+e^{-1}} = 0.731\]
Layer 2:
\[s_2 = w_2 a_1 + b_2 = -1(0.731) + 0.5 = -0.231\]
\[\hat{y} = \sigma(-0.231) = 0.442\]
Loss (binary cross-entropy, \(y=1\)):
\[\mathcal{L} = -\log(\hat{y}) = -\log(0.442) = 0.816\]
Sigmoid evaluation at specific pre-activation values
Gradient w.r.t. \(\hat{y}\):
\[\frac{\partial \mathcal{L}}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} = -\frac{1}{0.442} = -2.262\]
Through sigmoid (\(\sigma'(z) = \sigma(z)(1-\sigma(z))\)):
\[\frac{\partial \hat{y}}{\partial s_2} = 0.442(1-0.442) = 0.247\]
Chain rule:
\[\frac{\partial \mathcal{L}}{\partial s_2} = -2.262 \times 0.247 = -0.558\]
Weight gradient:
\[\frac{\partial \mathcal{L}}{\partial w_2} = \frac{\partial \mathcal{L}}{\partial s_2} \times a_1 = -0.558 \times 0.731 = -0.408\]
Gradient flows back through weight:
\[\frac{\partial \mathcal{L}}{\partial a_1} = \frac{\partial \mathcal{L}}{\partial s_2} \times w_2 = -0.558 \times (-1) = 0.558\]
Through sigmoid:
\[\frac{\partial a_1}{\partial s_1} = a_1(1-a_1) = 0.731(0.269) = 0.197\]
\[\frac{\partial \mathcal{L}}{\partial s_1} = 0.558 \times 0.197 = 0.110\]
Weight gradient:
\[\frac{\partial \mathcal{L}}{\partial w_1} = 0.110 \times x = 0.110 \times 2 = 0.220\]
All gradients: - \(\frac{\partial \mathcal{L}}{\partial w_2} = -0.408\), \(\frac{\partial \mathcal{L}}{\partial b_2} = -0.558\) - \(\frac{\partial \mathcal{L}}{\partial w_1} = 0.220\), \(\frac{\partial \mathcal{L}}{\partial b_1} = 0.110\)
Gradients propagate backward, multiplying at each stage
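The hand computation above can be reproduced in a few lines of NumPy (a sketch; printed values should match the gradients above up to rounding):

import numpy as np

sig = lambda z: 1 / (1 + np.exp(-z))

w1, b1, w2, b2 = 0.5, 0.0, -1.0, 0.5
x, y = 2.0, 1.0

# forward
s1 = w1 * x + b1; a1 = sig(s1)           # 1.0, 0.731
s2 = w2 * a1 + b2; yhat = sig(s2)        # -0.231, 0.442

# backward (deltas = dL/ds)
d2 = yhat - y                            # -0.558
d1 = d2 * w2 * a1 * (1 - a1)             # 0.110
print(d2 * a1, d2, d1 * x, d1)           # dL/dw2, dL/db2, dL/dw1, dL/db1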
Problem 1: Initialize gradient w.r.t. pre-activation
How to compute \(\frac{\partial \mathcal{L}}{\partial s^{(L)}}\) at output layer?
Scalar example: \(\frac{\partial \mathcal{L}}{\partial s_2} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \sigma'(s_2) = -2.262(0.442)(0.558) = -0.558\)
Problem 2: Propagate gradient backward one layer
Given \(\frac{\partial \mathcal{L}}{\partial s^{(l+1)}}\), compute \(\frac{\partial \mathcal{L}}{\partial s^{(l)}}\)
Scalar example: \(\frac{\partial \mathcal{L}}{\partial s_1} = \left(\frac{\partial \mathcal{L}}{\partial s_2} \cdot w_2\right) \cdot \sigma'(s_1) = (-0.558)(-1)(0.197) = 0.110\)
Problem 3: Convert to parameter gradients
Given \(\frac{\partial \mathcal{L}}{\partial s^{(l)}}\), compute \(\frac{\partial \mathcal{L}}{\partial w_l}\) and \(\frac{\partial \mathcal{L}}{\partial b_l}\)
Scalar example: \(\frac{\partial \mathcal{L}}{\partial w_2} = \frac{\partial \mathcal{L}}{\partial s_2} \cdot a_1 = -0.558(0.731) = -0.408\)
Define \(\delta^{(l)} \equiv \frac{\partial \mathcal{L}}{\partial s^{(l)}}\)
Problem 1 gives \(\delta^{(L)}\): \[\delta^{(L)} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \sigma'(s^{(L)})\]
Problem 2 gives recursion: \[\delta^{(l)} = \delta^{(l+1)} \cdot w_{l+1} \cdot \sigma'(s^{(l)})\]
Problem 3 gives gradients: \[\frac{\partial \mathcal{L}}{\partial w_l} = \delta^{(l)} \cdot a^{(l-1)}\] \[\frac{\partial \mathcal{L}}{\partial b_l} = \delta^{(l)}\]
Forward pass computes values, backward pass computes gradients using chain rule
Define: \(\delta^{(l)} = \frac{\partial \mathcal{L}}{\partial s^{(l)}}\) (gradient w.r.t. pre-activation)
From our computations: - \(\delta^{(2)} = -0.558\) - \(\delta^{(1)} = 0.110\)
Recursion:
\[\delta^{(1)} = \frac{\partial \mathcal{L}}{\partial a_1} \cdot \frac{\partial a_1}{\partial s_1} = (\delta^{(2)} \cdot w_2) \cdot \sigma'(s_1)\]
General backward recursion:
\[\delta^{(l)} = \delta^{(l+1)} \cdot w_{l+1} \cdot \sigma'(s^{(l)})\]
Weight gradients:
\[\frac{\partial \mathcal{L}}{\partial w_l} = \delta^{(l)} \cdot a^{(l-1)}\] \[\frac{\partial \mathcal{L}}{\partial b_l} = \delta^{(l)}\]
Architecture: Two hidden units
\[x \to \begin{cases} h_1 = \sigma(w_1 x + b_1) \\ h_2 = \sigma(w_2 x + b_2) \end{cases} \to y = w_3 h_1 + w_4 h_2\]
Parameters: \(w_1 = 0.5, w_2 = -0.3, w_3 = 1, w_4 = 1\), all biases zero
Data: \(x = 1\), target \(t = 0.5\)
Forward:
\(h_1 = \sigma(0.5) = 0.622\)
\(h_2 = \sigma(-0.3) = 0.426\)
\(y = 0.622 + 0.426 = 1.048\)
Loss: \(\mathcal{L} = \frac{1}{2}(1.048 - 0.5)^2 = 0.150\)
Backward: \(\frac{\partial \mathcal{L}}{\partial y} = 1.048 - 0.5 = 0.548\)
Multiple gradient paths converge at \(x\)
Multivariate chain rule:
Two paths from \(\mathcal{L}\) to \(x\):
\[\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial h_1}\frac{\partial h_1}{\partial x} + \frac{\partial \mathcal{L}}{\partial h_2}\frac{\partial h_2}{\partial x}\]
Path 1 contribution:
\[\frac{\partial \mathcal{L}}{\partial h_1} = \frac{\partial \mathcal{L}}{\partial y} \cdot w_3 = 0.548 \cdot 1 = 0.548\]
\[\frac{\partial h_1}{\partial x} = \sigma'(w_1 x) \cdot w_1 = 0.622(0.378) \cdot 0.5 = 0.118\]
\[\text{Contribution}_1 = 0.548 \cdot 0.118 = 0.065\]
Path 2 contribution:
\[\frac{\partial \mathcal{L}}{\partial h_2} = 0.548 \cdot 1 = 0.548\]
\[\frac{\partial h_2}{\partial x} = 0.426(0.574) \cdot (-0.3) = -0.073\]
\[\text{Contribution}_2 = 0.548 \cdot (-0.073) = -0.040\]
Total gradient:
\[\frac{\partial \mathcal{L}}{\partial x} = 0.065 + (-0.040) = 0.025\]
Analytical gradient matches numerical verification
When multiple paths converge, gradients add
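A finite-difference sketch of the numerical verification mentioned above, assuming the parameters and target given earlier:

import numpy as np

sig = lambda z: 1 / (1 + np.exp(-z))
w1, w2, w3, w4, t = 0.5, -0.3, 1.0, 1.0, 0.5

def loss(x):
    h1, h2 = sig(w1 * x), sig(w2 * x)        # two parallel hidden units
    y = w3 * h1 + w4 * h2
    return 0.5 * (y - t) ** 2

eps = 1e-5
print((loss(1.0 + eps) - loss(1.0 - eps)) / (2 * eps))   # approximately 0.025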
Loss and activation functions:
Loss (binary cross-entropy): \[\mathcal{L} = -[y\log\hat{y} + (1-y)\log(1-\hat{y})]\]
Sigmoid activation: \[\sigma(s) = \frac{1}{1+e^{-s}}\]
Derivatives:
Loss derivative: \[\frac{\partial \mathcal{L}}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\]
Sigmoid derivative: \[\sigma'(s) = \sigma(s)(1-\sigma(s))\]
Backpropagation formulas:
Define: \(\delta^{(l)} = \frac{\partial \mathcal{L}}{\partial s^{(l)}}\)
Output layer: \[\delta^{(L)} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \sigma'(s^{(L)})\]
Hidden layers: \[\delta^{(l)} = \delta^{(l+1)} \cdot w_{l+1} \cdot \sigma'(s^{(l)})\]
Parameter updates: \[w \leftarrow w - \eta \delta \cdot a^{(l-1)}\] \[b \leftarrow b - \eta \delta\]
Each gradient: upstream \(\times\) local derivative
Scalar to vector: Binary classification (1 output) → Multi-class classification (K outputs)
Sigmoid: \(\hat{y} = \sigma(s) \in (0,1)\) (probability of class 1)
Softmax: \(\hat{\mathbf{y}} = \text{softmax}(\mathbf{s}) \in \mathbb{R}^K\), \(\sum_i \hat{y}_i = 1\) (probability distribution over K classes)
Dimensions (example network 3 → 4 → 2): \(\mathbf{x} \in \mathbb{R}^3\), \(\mathbf{a}^{(1)} \in \mathbb{R}^4\), \(\hat{\mathbf{y}} \in \mathbb{R}^2\)
Parameters:
Layer 1: - \(W^{(1)} \in \mathbb{R}^{4 \times 3}\): 12 weights - \(\mathbf{b}^{(1)} \in \mathbb{R}^4\): 4 biases
Layer 2: - \(W^{(2)} \in \mathbb{R}^{2 \times 4}\): 8 weights - \(\mathbf{b}^{(2)} \in \mathbb{R}^2\): 2 biases
Total: 26 parameters
Forward pass:
\[\mathbf{s}^{(1)} = W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}, \quad \mathbf{a}^{(1)} = \sigma(\mathbf{s}^{(1)})\] \[\mathbf{s}^{(2)} = W^{(2)}\mathbf{a}^{(1)} + \mathbf{b}^{(2)}, \quad \mathbf{a}^{(2)} = \text{softmax}(\mathbf{s}^{(2)})\]
Fully connected: every unit in one layer connects to every unit in next layer
Compact notation: \[\mathbf{z}^{(1)} = W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}\]
Expanded element-wise (3 → 4 example): \[\begin{bmatrix} z_1^{(1)}\\ z_2^{(1)}\\ z_3^{(1)}\\ z_4^{(1)} \end{bmatrix} = \begin{bmatrix} w_{11}^{(1)} & w_{12}^{(1)} & w_{13}^{(1)}\\ w_{21}^{(1)} & w_{22}^{(1)} & w_{23}^{(1)}\\ w_{31}^{(1)} & w_{32}^{(1)} & w_{33}^{(1)}\\ w_{41}^{(1)} & w_{42}^{(1)} & w_{43}^{(1)} \end{bmatrix} \begin{bmatrix} x_1\\ x_2\\ x_3 \end{bmatrix} + \begin{bmatrix} b_1^{(1)}\\ b_2^{(1)}\\ b_3^{(1)}\\ b_4^{(1)} \end{bmatrix}\]
Shows which weights affect which outputs
Element \(z_i^{(1)}\) depends on:
All inputs: \(x_1, x_2, x_3\)
One row of \(W^{(1)}\): \(w_{i1}^{(1)}, w_{i2}^{(1)}, w_{i3}^{(1)}\)
One bias: \(b_i^{(1)}\)
\[z_i^{(1)} = \sum_{j=1}^3 w_{ij}^{(1)} x_j + b_i^{(1)}\]
Gradient \(\frac{\partial \mathcal{L}}{\partial w_{ij}^{(1)}}\) affects only \(z_i^{(1)}\)
Rest of backprop: how changes in \(z_i^{(1)}\) affect loss \(\mathcal{L}\)
Layer 1:
Pre-activation: \[\mathbf{s}^{(1)} = W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}\]
Dimensions: \[[4 \times 1] = [4 \times 3][3 \times 1] + [4 \times 1]\]
Activation (element-wise): \[\mathbf{a}^{(1)} = \sigma(\mathbf{s}^{(1)})\] \[[4 \times 1] = \sigma([4 \times 1])\]
Layer 2:
Pre-activation: \[\mathbf{s}^{(2)} = W^{(2)}\mathbf{a}^{(1)} + \mathbf{b}^{(2)}\]
Dimensions: \[[2 \times 1] = [2 \times 4][4 \times 1] + [2 \times 1]\]
Softmax (probability distribution): \[\mathbf{a}^{(2)} = \text{softmax}(\mathbf{s}^{(2)})\]
where \([\text{softmax}(\mathbf{s})]_i = \frac{e^{s_i}}{\sum_j e^{s_j}}\)
Note: Each output depends on all inputs, i.e., not element-wise activation
Example computation:
\[W^{(1)} = \begin{bmatrix} 0.1 & 0.2 & -0.1 \\ 0.3 & -0.2 & 0.1 \\ -0.1 & 0.1 & 0.2 \\ 0.2 & 0.3 & -0.3 \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix} 1.0 \\ 0.5 \\ -0.5 \end{bmatrix}\]
\[\mathbf{s}^{(1)} = \begin{bmatrix} 0.1(1.0) + 0.2(0.5) - 0.1(-0.5) \\ 0.3(1.0) - 0.2(0.5) + 0.1(-0.5) \\ -0.1(1.0) + 0.1(0.5) + 0.2(-0.5) \\ 0.2(1.0) + 0.3(0.5) - 0.3(-0.5) \end{bmatrix} + \begin{bmatrix} 0.1 \\ -0.1 \\ 0.0 \\ 0.2 \end{bmatrix}\]
\[= \begin{bmatrix} 0.25 \\ 0.05 \\ -0.05 \\ 0.55 \end{bmatrix}\]
\[\mathbf{a}^{(1)} = \begin{bmatrix} \sigma(0.25) \\ \sigma(0.05) \\ \sigma(-0.05) \\ \sigma(0.55) \end{bmatrix} = \begin{bmatrix} 0.562 \\ 0.513 \\ 0.488 \\ 0.634 \end{bmatrix}\]
Dimension compatibility essential for matrix multiplication
Problem 2 (scalar): Propagate \(\delta\) backward one layer
\[\delta^{(l)} = \delta^{(l+1)} \cdot w_{l+1} \cdot \sigma'(s^{(l)})\]
Three scalar multiplications
Problem 2 (vector): Propagate \(\boldsymbol{\delta}\) backward
\[\boldsymbol{\delta}^{(l)} = \, ???\]
Components: - \(\boldsymbol{\delta}^{(l+1)} \in \mathbb{R}^{n_{l+1}}\) (vector) - \(W^{(l+1)} \in \mathbb{R}^{n_{l+1} \times n_l}\) (matrix) - \(\sigma'(\mathbf{s}^{(l)}) \in \mathbb{R}^{n_l}\) (vector)
How do these combine?
Problem 3 (scalar): Weight gradient
\[\frac{\partial \mathcal{L}}{\partial w_l} = \delta^{(l)} \cdot a^{(l-1)}\]
Problem 3 (vector): Weight gradient matrix
\[\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \, ???\]
Dimensions: \(\mathbb{R}^{n_l \times n_{l-1}}\) required
Scalar → Vector translation:
Multiplication (\(\cdot\)) → Matrix multiply or element-wise (\(\odot\))?
Example (3 → 4 → 2 network):
Output layer: \(\boldsymbol{\delta}^{(2)} \in \mathbb{R}^2\) computed
Hidden layer: \(\boldsymbol{\delta}^{(1)} \in \mathbb{R}^4\) = ?
Inputs available: - \(\boldsymbol{\delta}^{(2)} = [d_1^{(2)}, d_2^{(2)}]^T\) (2 × 1) - \(W^{(2)} \in \mathbb{R}^{2 \times 4}\) - \(\sigma'(\mathbf{s}^{(1)}) \in \mathbb{R}^4\) (4 × 1)
Need output: 4 × 1 vector
Matrix dimensions suggest: \((W^{(2)})^T \boldsymbol{\delta}^{(2)}\) gives 4 × 1
But how does \(\sigma'\) combine? Element-wise or matrix?
Need systematic framework for vector derivatives
Jacobian matrix: All partial derivatives
For \(\mathbf{y} = f(\mathbf{x})\) where \(\mathbf{y} \in \mathbb{R}^m\), \(\mathbf{x} \in \mathbb{R}^n\):
\[J = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n}\]
Linear layer:
\[\mathbf{s} = W\mathbf{a} + \mathbf{b}\]
\[\frac{\partial \mathbf{s}}{\partial \mathbf{a}} = W\]
Element-wise activation:
\[a_i = \sigma(s_i)\]
\[\frac{\partial \mathbf{a}}{\partial \mathbf{s}} = \text{diag}(\sigma'(s_1), \ldots, \sigma'(s_n))\]
Diagonal because \(\frac{\partial a_i}{\partial s_j} = 0\) for \(i \neq j\)
Jacobian structure reflects dependencies
Scalar chain rule:
For \(\mathcal{L}(f(g(x)))\): \[\frac{d\mathcal{L}}{dx} = \frac{d\mathcal{L}}{df} \cdot \frac{df}{dg} \cdot \frac{dg}{dx}\]
Vector chain rule:
For \(\mathcal{L}(\mathbf{a}(\mathbf{s}(\mathbf{x})))\) where \(\mathcal{L}: \mathbb{R}^n \to \mathbb{R}\):
\[\nabla_{\mathbf{x}} \mathcal{L} = \left(\frac{\partial \mathbf{s}}{\partial \mathbf{x}}\right)^T \left(\frac{\partial \mathbf{a}}{\partial \mathbf{s}}\right)^T \nabla_{\mathbf{a}} \mathcal{L}\]
Transpose appears from chain rule!
For our network:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(l-1)}} = \left(\frac{\partial \mathbf{s}^{(l)}}{\partial \mathbf{a}^{(l-1)}}\right)^T \frac{\partial \mathcal{L}}{\partial \mathbf{s}^{(l)}}\]
\[= (W^{(l)})^T \boldsymbol{\delta}^{(l)}\]
where \(\boldsymbol{\delta}^{(l)} = \frac{\partial \mathcal{L}}{\partial \mathbf{s}^{(l)}}\)
Dimension verification:
Forward (layer \(l\)): \[\mathbf{s}^{(l)} = W^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\] \[[n_l \times 1] = [n_l \times n_{l-1}][n_{l-1} \times 1] + [n_l \times 1]\]
Backward: \[\frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(l-1)}} = (W^{(l)})^T \boldsymbol{\delta}^{(l)}\] \[[n_{l-1} \times 1] = [n_{l-1} \times n_l][n_l \times 1]\]
Example: Layer 2 → Layer 1
\(W^{(2)} \in \mathbb{R}^{2 \times 4}\), \(\boldsymbol{\delta}^{(2)} \in \mathbb{R}^{2}\)
\[\frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(1)}} = (W^{(2)})^T \boldsymbol{\delta}^{(2)}\] \[[4 \times 1] = [4 \times 2][2 \times 1]\]
Transpose ensures dimension compatibility
Denominator convention (used here):
For \(\mathbf{y} = f(\mathbf{x})\) where \(\mathbf{x} \in \mathbb{R}^n, \mathbf{y} \in \mathbb{R}^m\):
Jacobian matrix: \[\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n}\\ \vdots & \ddots & \vdots\\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n}\]
Chain rule (left-handed):
For \(\mathcal{L} = g(f(\mathbf{x}))\): \[\nabla_\mathbf{x} \mathcal{L} = \left(\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right)^T \nabla_\mathbf{y} \mathcal{L}\]
Transpose appears naturally
Numerator convention (alternative):
Jacobian: \[\frac{\partial \mathbf{y}}{\partial \mathbf{x}^T} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1}\\ \vdots & \ddots & \vdots\\ \frac{\partial y_1}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{n \times m}\]
Chain rule (right-handed): \[\nabla_\mathbf{x}^T \mathcal{L} = \nabla_\mathbf{y}^T \mathcal{L} \cdot \frac{\partial \mathbf{y}}{\partial \mathbf{x}^T}\]
Both conventions are bookkeeping for chain rule
See: Matrix Calculus (Wikipedia)
Denominator convention more common in optimization
Hadamard product (element-wise multiply): \(\odot\)
\[\mathbf{a} \odot \mathbf{b} = \begin{bmatrix} a_1 b_1 \\ a_2 b_2 \\ \vdots \\ a_n b_n \end{bmatrix}\]
Example:
\[\begin{bmatrix} 2 \\ 3 \\ -1 \end{bmatrix} \odot \begin{bmatrix} 0.5 \\ 0.2 \\ 0.8 \end{bmatrix} = \begin{bmatrix} 1.0 \\ 0.6 \\ -0.8 \end{bmatrix}\]
Connection to diagonal matrix:
\[\mathbf{a} \odot \mathbf{b} = \text{diag}(\mathbf{a}) \mathbf{b} = \text{diag}(\mathbf{b}) \mathbf{a}\]
where \(\text{diag}(\mathbf{a})\) puts vector on diagonal:
\[\text{diag}([a_1, a_2, a_3]^T) = \begin{bmatrix} a_1 & 0 & 0 \\ 0 & a_2 & 0 \\ 0 & 0 & a_3 \end{bmatrix}\]
Used in backprop when applying activation derivatives element-wise
Define: \(\boldsymbol{\delta}^{(l)} = \frac{\partial \mathcal{L}}{\partial \mathbf{s}^{(l)}} \in \mathbb{R}^{n_l}\)
Vector of error signals at layer \(l\)
Output layer (softmax + cross-entropy):
For \(\mathbf{y}\) one-hot and \(\mathcal{L} = -\log(a_k^{(L)})\) where \(y_k = 1\):
\[\boldsymbol{\delta}^{(L)} = \mathbf{a}^{(L)} - \mathbf{y}\]
Hidden layers:
\[\boldsymbol{\delta}^{(l)} = (W^{(l+1)})^T \boldsymbol{\delta}^{(l+1)} \odot \sigma'(\mathbf{s}^{(l)})\]
where \(\odot\) is Hadamard (element-wise) product
Convenient since element-wise activation has diagonal Jacobian (zero off-diagonal)
Dimension check:
\[[n_l \times 1] = [n_l \times n_{l+1}][n_{l+1} \times 1] \odot [n_l \times 1]\]
Matrix-vector multiply, then element-wise multiply
Example: 2-layer network
Output delta (\(n_2 = 2\)): \[\boldsymbol{\delta}^{(2)} = \begin{bmatrix} 0.509 \\ -0.509 \end{bmatrix}\]
(from \(\mathbf{a}^{(2)} - \mathbf{y}\) with \(\mathbf{y} = [0, 1]^T\))
Hidden delta (\(n_1 = 4\)):
\[(W^{(2)})^T \boldsymbol{\delta}^{(2)} = \begin{bmatrix} 0.5 & -0.2 \\ -0.3 & 0.4 \\ 0.2 & -0.1 \\ 0.1 & 0.3 \end{bmatrix} \begin{bmatrix} 0.509 \\ -0.509 \end{bmatrix} = \begin{bmatrix} 0.356 \\ -0.356 \\ 0.153 \\ -0.102 \end{bmatrix}\]
\[\sigma'(\mathbf{s}^{(1)}) = \mathbf{a}^{(1)} \odot (1 - \mathbf{a}^{(1)}) = \begin{bmatrix} 0.246 \\ 0.250 \\ 0.250 \\ 0.232 \end{bmatrix}\]
\[\boldsymbol{\delta}^{(1)} = \begin{bmatrix} 0.356 \\ -0.356 \\ 0.153 \\ -0.102 \end{bmatrix} \odot \begin{bmatrix} 0.246 \\ 0.250 \\ 0.250 \\ 0.232 \end{bmatrix} = \begin{bmatrix} 0.088 \\ -0.089 \\ 0.038 \\ -0.024 \end{bmatrix}\]
Delta propagates backward, weight gradients computed via outer products
Scalar to matrix derivation:
Layer computation: \(s_i^{(l)} = \sum_{j=1}^{n_{l-1}} w_{ij}^{(l)} a_j^{(l-1)} + b_i^{(l)}\)
Chain rule: \[\frac{\partial \mathcal{L}}{\partial w_{ij}^{(l)}} = \frac{\partial \mathcal{L}}{\partial s_i^{(l)}} \cdot \frac{\partial s_i^{(l)}}{\partial w_{ij}^{(l)}} = \delta_i^{(l)} \cdot a_j^{(l-1)}\]
Matrix form (all elements): \[\left[\frac{\partial \mathcal{L}}{\partial W^{(l)}}\right]_{ij} = \delta_i^{(l)} \cdot a_j^{(l-1)}\]
Recognize outer product:
\[\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{a}^{(l-1)})^T\]
Dimensions: \([n_l \times 1][1 \times n_{l-1}] = [n_l \times n_{l-1}]\) ✓
Bias gradient:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}\]
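A minimal sketch of one backward step in vector form, assuming a sigmoid activation at the hidden layer; the function name is illustrative:

import numpy as np

def backward_step(delta_next, W_next, s, a_prev):
    """One hidden-layer backward step (sigmoid activation assumed)."""
    a = 1 / (1 + np.exp(-s))                        # recompute sigma(s)
    delta = (W_next.T @ delta_next) * a * (1 - a)   # (W^T delta) Hadamard sigma'(s)
    grad_W = np.outer(delta, a_prev)                # delta a_prev^T
    grad_b = delta
    return delta, grad_W, grad_b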
Setup:
x = np.array([1.0, 0.5, -0.5])
y = np.array([0, 1]) # one-hot
W1 = np.array([[0.1, 0.2, -0.1],
[0.3, -0.2, 0.1],
[-0.1, 0.1, 0.2],
[0.2, 0.3, -0.3]])
b1 = np.array([0.1, -0.1, 0.0, 0.2])
W2 = np.array([[0.5, -0.3, 0.2, 0.1],
[-0.2, 0.4, -0.1, 0.3]])
b2 = np.array([0.0, 0.1])
Forward:
\[\mathbf{s}^{(1)} = \begin{bmatrix} 0.25 \\ 0.05 \\ -0.05 \\ 0.55 \end{bmatrix}, \quad \mathbf{a}^{(1)} = \begin{bmatrix} 0.562 \\ 0.513 \\ 0.488 \\ 0.634 \end{bmatrix}\]
\[\mathbf{s}^{(2)} = \begin{bmatrix} 0.421 \\ 0.384 \end{bmatrix}, \quad \mathbf{a}^{(2)} = \begin{bmatrix} 0.509 \\ 0.491 \end{bmatrix}\]
Loss: \(\mathcal{L} = -\log(0.491) = 0.711\)
Backward:
\[\boldsymbol{\delta}^{(2)} = \begin{bmatrix} 0.509 \\ -0.509 \end{bmatrix}\]
\[\boldsymbol{\delta}^{(1)} = (W^{(2)})^T \boldsymbol{\delta}^{(2)} \odot \sigma'(\mathbf{s}^{(1)})\] \[= \begin{bmatrix} 0.356 \\ -0.356 \\ 0.153 \\ -0.102 \end{bmatrix} \odot \begin{bmatrix} 0.246 \\ 0.250 \\ 0.250 \\ 0.232 \end{bmatrix} = \begin{bmatrix} 0.088 \\ -0.089 \\ 0.038 \\ -0.024 \end{bmatrix}\]
Gradients:
\[\nabla W^{(2)} = \begin{bmatrix} 0.509 \\ -0.509 \end{bmatrix} \begin{bmatrix} 0.562 & 0.513 & 0.488 & 0.634 \end{bmatrix}\] \[= \begin{bmatrix} 0.286 & 0.261 & 0.248 & 0.323 \\ -0.286 & -0.261 & -0.248 & -0.323 \end{bmatrix}\]
\[\nabla W^{(1)} = \begin{bmatrix} 0.088 \\ -0.089 \\ 0.038 \\ -0.024 \end{bmatrix} \begin{bmatrix} 1.0 & 0.5 & -0.5 \end{bmatrix}\] \[= \begin{bmatrix} 0.088 & 0.044 & -0.044 \\ -0.089 & -0.045 & 0.045 \\ 0.038 & 0.019 & -0.019 \\ -0.024 & -0.012 & 0.012 \end{bmatrix}\]
Finite difference check:
For parameter \(w_{ij}\): \[\frac{\partial \mathcal{L}}{\partial w_{ij}} \approx \frac{\mathcal{L}(w_{ij} + \epsilon) - \mathcal{L}(w_{ij} - \epsilon)}{2\epsilon}\]
Check \(W^{(2)}_{11}\) (row 1, column 1):
Analytical: \(\nabla W^{(2)}_{11} = 0.286\)
Numerical with \(\epsilon = 10^{-5}\):
Original \(W^{(2)}_{11} = 0.5\)
\(\mathcal{L}(W^{(2)}_{11} = 0.50001) = 0.711003\)
\(\mathcal{L}(W^{(2)}_{11} = 0.49999) = 0.710997\)
\[\frac{0.711003 - 0.710997}{0.00002} = 0.286\]
Analytical and numerical match
Analytical and numerical gradients agree to within the accuracy of the finite-difference approximation
Single sample: \(\mathbf{x} \in \mathbb{R}^d\)
\[\mathbf{s} = W\mathbf{x} + \mathbf{b}\]
Batch of \(m\) samples: \(X \in \mathbb{R}^{m \times d}\)
Samples as rows: \[X = \begin{bmatrix} — \mathbf{x}_1^T — \\ — \mathbf{x}_2^T — \\ \vdots \\ — \mathbf{x}_m^T — \end{bmatrix}\]
Forward pass (layer \(l\)):
\[S^{(l)} = A^{(l-1)} (W^{(l)})^T + \mathbf{1}_m (\mathbf{b}^{(l)})^T\]
where \(A^{(l-1)} \in \mathbb{R}^{m \times n_{l-1}}\)
Bias broadcast: add \(\mathbf{b}^{(l)}\) to each row
Activation: Element-wise on matrix
\[A^{(l)} = \sigma(S^{(l)})\]
Dimension check:
\[S^{(l)}: [m \times n_l] = [m \times n_{l-1}][n_{l-1} \times n_l] + [m \times n_l]\]
Backward pass:
Delta for batch: \(\Delta^{(l)} \in \mathbb{R}^{m \times n_l}\)
Each row is delta for one sample
Weight gradient (average over batch):
\[\nabla W^{(l)} = \frac{1}{m} (\Delta^{(l)})^T A^{(l-1)}\]
\[[n_l \times n_{l-1}] = \frac{1}{m} [n_l \times m][m \times n_{l-1}]\]
Bias gradient:
\[\nabla \mathbf{b}^{(l)} = \frac{1}{m} (\Delta^{(l)})^T \mathbf{1}_m\]
\[[n_l \times 1] = \frac{1}{m} [n_l \times m][m \times 1]\]
Sum deltas across batch, average
For each sample \(i = 1, \ldots, m\) in batch:
Forward pass: \[\mathbf{a}_i^{(l)} = \sigma(W^{(l)}\mathbf{a}_i^{(l-1)} + \mathbf{b}^{(l)})\]
Backward pass: \[\boldsymbol{\delta}_i^{(l)} = (W^{(l+1)})^T \boldsymbol{\delta}_i^{(l+1)} \odot \sigma'(\mathbf{s}_i^{(l)})\]
Accumulate gradients: \[\nabla W^{(l)} = \frac{1}{m} \sum_{i=1}^m \boldsymbol{\delta}_i^{(l)} (\mathbf{a}_i^{(l-1)})^T\]
\[\nabla \mathbf{b}^{(l)} = \frac{1}{m} \sum_{i=1}^m \boldsymbol{\delta}_i^{(l)}\]
Single SGD update after processing batch:
\[W^{(l)} \leftarrow W^{(l)} - \eta \nabla W^{(l)}\] \[\mathbf{b}^{(l)} \leftarrow \mathbf{b}^{(l)} - \eta \nabla \mathbf{b}^{(l)}\]
Matrix form (all samples stacked):
\[\nabla W^{(l)} = \frac{1}{m} \Delta^{(l)T} A^{(l-1)}\]
where \(\Delta^{(l)} \in \mathbb{R}^{m \times n_l}\) stacks all \(\boldsymbol{\delta}_i^{(l)}\) as rows
Same result, different notation
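A small sketch of the batched gradient formulas, with random placeholder data standing in for actual activations and deltas:

import numpy as np

m, n_in, n_out = 32, 4, 2
A_prev = np.random.randn(m, n_in)        # activations from previous layer (rows = samples)
Delta = np.random.randn(m, n_out)        # deltas for this layer (rows = samples)

grad_W = Delta.T @ A_prev / m            # (n_out, n_in): average over the batch
grad_b = Delta.sum(axis=0) / m           # (n_out,): sum deltas across samples, then average
print(grad_W.shape, grad_b.shape)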
Batch of 3 samples:
\[X = \begin{bmatrix} 1.0 & 0.5 & -0.5 \\ 0.8 & -0.3 & 0.4 \\ -0.6 & 0.7 & 0.2 \end{bmatrix} \in \mathbb{R}^{3 \times 3}\]
\[Y = \begin{bmatrix} 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \in \mathbb{R}^{3 \times 2}\]
Forward (Layer 1, \(W^{(1)} \in \mathbb{R}^{4 \times 3}\)):
\[S^{(1)} = X (W^{(1)})^T + \mathbf{1}_3 (\mathbf{b}^{(1)})^T\]
\[[3 \times 4] = [3 \times 3][3 \times 4] + [3 \times 4]\]
\[S^{(1)} = \begin{bmatrix} 0.25 & 0.05 & -0.05 & 0.55 \\ 0.23 & 0.14 & -0.02 & 0.41 \\ -0.11 & -0.07 & 0.15 & 0.08 \end{bmatrix}\]
\[A^{(1)} = \sigma(S^{(1)})\]
Backward:
Output delta (3 samples × 2 classes): \[\Delta^{(2)} = A^{(2)} - Y = \begin{bmatrix} 0.509 & -0.509 \\ -0.492 & 0.492 \\ 0.503 & -0.503 \end{bmatrix}\]
Weight gradient:
\[\nabla W^{(2)} = \frac{1}{3} (\Delta^{(2)})^T A^{(1)}\]
\[[2 \times 4] = \frac{1}{3} [2 \times 3][3 \times 4]\]
Average gradient across 3 samples
Why batch: - GPU matrix multiply: \(O(mn_ln_{l-1})\) for batch - 3 separate forwards: \(3 \times O(n_ln_{l-1})\) - Same asymptotic cost, but batching uses optimized BLAS - Memory access patterns favor contiguous batch data
Input: Network \(\{W^{(l)}, \mathbf{b}^{(l)}\}_{l=1}^L\), data \((\mathbf{x}, \mathbf{y})\)
Output: Gradients \(\{\nabla W^{(l)}, \nabla \mathbf{b}^{(l)}\}_{l=1}^L\)
Forward pass:
a^(0) ← x
for l = 1 to L:
    s^(l) ← W^(l) a^(l-1) + b^(l)
    a^(l) ← σ^(l)(s^(l))
    store: a^(l-1), s^(l), a^(l)
Backward pass:
δ^(L) ← ∇_{a^(L)} L(a^(L), y) ⊙ σ'(s^(L))
for l = L-1 down to 1:
    δ^(l) ← (W^(l+1))^T δ^(l+1) ⊙ σ'(s^(l))
Gradient computation:
for l = 1 to L:
    ∇W^(l) ← δ^(l) (a^(l-1))^T
    ∇b^(l) ← δ^(l)
Single forward pass + single backward pass computes all gradients
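A compact NumPy sketch of the pseudocode above, assuming sigmoid activations at every layer (an assumption for concreteness); the loss enters only through the supplied gradient of \(\mathcal{L}\) w.r.t. \(\mathbf{a}^{(L)}\). Names are illustrative:

import numpy as np

def sigmoid(s):
    return 1 / (1 + np.exp(-s))

def backprop(Ws, bs, x, grad_aL_fn):
    """Forward + backward for an all-sigmoid network.
    Ws, bs: lists of weight matrices / bias vectors.
    grad_aL_fn: maps final activation a^(L) to dL/da^(L)."""
    a, cache = x, []
    for W, b in zip(Ws, bs):                  # forward pass, caching (a_prev, s)
        s = W @ a + b
        cache.append((a, s))
        a = sigmoid(s)
    grads = [None] * len(Ws)
    delta = grad_aL_fn(a) * a * (1 - a)       # delta^(L) = dL/da^(L) * sigma'(s^(L))
    for l in reversed(range(len(Ws))):        # backward pass
        a_prev, s = cache[l]
        grads[l] = (np.outer(delta, a_prev), delta)   # (dL/dW^(l), dL/db^(l))
        if l > 0:
            a_hidden = sigmoid(cache[l - 1][1])
            delta = (Ws[l].T @ delta) * a_hidden * (1 - a_hidden)
    return grads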
Historical context:
Rumelhart, Hinton, Williams (1986) paper: “Learning representations by back-propagating errors”
Algorithm existed earlier (Werbos 1974, others) but 1986 paper demonstrated effectiveness for neural networks
Name origin:
Gradients propagate backward through network:
Direction: Output \(\to\) Input (opposite of data flow)
Data flow:
Forward: \(\mathbf{x} \to \mathbf{a}^{(1)} \to \mathbf{a}^{(2)} \to \cdots \to \mathbf{a}^{(L)} \to \mathcal{L}\)
Backward: \(\nabla_{\mathbf{x}} \mathcal{L} \leftarrow \boldsymbol{\delta}^{(1)} \leftarrow \boldsymbol{\delta}^{(2)} \leftarrow \cdots \leftarrow \boldsymbol{\delta}^{(L)} \leftarrow 1\)
Gradients propagate in reverse order (output \(\to\) input)
Common misconception: “Backpropagation of errors”
Correct: Backpropagation of gradients
Delta values \(\boldsymbol{\delta}^{(l)}\) are not errors—they are partial derivatives \(\frac{\partial \mathcal{L}}{\partial \mathbf{s}^{(l)}}\)
Forward pass (single sample):
Layer \(l\): Matrix multiply \(W^{(l)}\mathbf{a}^{(l-1)}\)
Cost: \(O(n_l \cdot n_{l-1})\) multiplications
Activation: \(O(n_l)\) (element-wise)
Total forward: \(\sum_{l=1}^L O(n_l \cdot n_{l-1})\)
Backward pass:
Delta propagation: \((W^{(l+1)})^T \boldsymbol{\delta}^{(l+1)}\)
Cost: \(O(n_l \cdot n_{l+1})\) (same as forward)
Weight gradient: \(\boldsymbol{\delta}^{(l)} (\mathbf{a}^{(l-1)})^T\)
Cost: \(O(n_l \cdot n_{l-1})\) (outer product)
Total backward: \(\sum_{l=1}^L O(n_l \cdot n_{l-1})\)
Result: Backward cost = Forward cost
Including both: \(\approx\) 2× forward pass
Example: 3-layer network
\(784 \to 256 \to 128 \to 10\)
Forward operations: - Layer 1: \(784 \times 256 = 200{,}704\) - Layer 2: \(256 \times 128 = 32{,}768\) - Layer 3: \(128 \times 10 = 1{,}280\) - Total: \(234{,}752\) multiplications
Backward operations: - Delta 3→2: \(128 \times 10 = 1{,}280\) - Delta 2→1: \(256 \times 128 = 32{,}768\) - Delta 1→0: \(784 \times 256 = 200{,}704\) - Gradients: Same counts as forward - Total: \(2 \times 234{,}752 = 469{,}504\)
Total: \(3 \times\) forward pass operations
Compare to numerical differentiation (central differences): \(2 \times 234{,}752 = 469{,}504\) forward passes
Speedup: \(\frac{469{,}504}{3} \approx 156{,}500\times\)
Parameters: \(\sum_{l=1}^L n_l \cdot n_{l-1}\)
Must store \(W^{(l)}\) and \(\mathbf{b}^{(l)}\) for all layers
Gradients: Same size as parameters
Store \(\nabla W^{(l)}\) and \(\nabla \mathbf{b}^{(l)}\)
Activations: \(\sum_{l=1}^L n_l \cdot m\)
Store \(\mathbf{a}^{(l)}\) and \(\mathbf{s}^{(l)}\) for batch size \(m\)
Needed for backward pass
Memory scaling:
Parameters: Fixed (independent of batch size)
Activations: Linear in batch size
Activations dominate for large batches
Example: ResNet-50, batch 32, ImageNet (224×224 input)
Parameters: - \(W^{(l)}\), \(\mathbf{b}^{(l)}\): 25M parameters - Storage: \(25 \times 10^6 \times 4\) bytes = 100 MB
Gradients: - \(\nabla W^{(l)}\), \(\nabla \mathbf{b}^{(l)}\): 25M values - Storage: 100 MB
Activations (batch 32):
Activation memory varies by layer (spatial dimensions decrease, channels increase):
Forward activations: ~200 MB, Backward gradients: ~200 MB
Total memory (batch 32): 100 + 100 + 400 = 600 MB
Scaling with batch size: - Batch 64: Activations ≈ 800 MB, total ≈ 1000 MB - Batch 128: Activations ≈ 1600 MB, total ≈ 1800 MB - Batch 256: Activations ≈ 3200 MB, total ≈ 3400 MB
For large batches, activations dominate memory usage
Standard backpropagation:
Store all activations \(\mathbf{a}^{(l)}\), \(\mathbf{z}^{(l)}\) during forward
Memory: \(O(L \cdot n \cdot m)\) for \(L\) layers, hidden size \(n\), batch size \(m\)
Example: ResNet-50, batch 32, ImageNet (224×224 input)
Activation memory varies by layer (spatial dimensions decrease, channels increase):
Total forward activations (all 50 layers): ~200 MB
Backward activation gradients (same size): ~200 MB
Combined activation memory: ~400 MB
Parameters: 25M × 4 bytes = 100 MB
Gradients: 100 MB
Total: ~600 MB (activations dominate)
Gradient checkpointing: store activations only at a subset of layers (checkpoints) during the forward pass; recompute the missing intermediate activations during the backward pass
Checkpointing analysis:
For \(L\) layers with \(\sqrt{L}\) checkpoints:
Memory: \(O(\sqrt{L} \cdot n \cdot m)\)
Computation: Each of \(\sqrt{L}\) segments recomputed once during backward
Cost: \(1\) forward \(+ \sqrt{L}\) recompute \(+ 1\) backward
Example: 100 layers
Standard: 100 layers stored, cost = 1F + 1B
Checkpointing: 10 checkpoints stored, 10 segments recomputed
Recomputation: 10 forward passes (one per segment)
Cost: 1F (initial) + 10F (recompute) + 1B = 11F + 1B
Memory reduction: \(\frac{100}{10} = 10\times\)
Time increase: \(\frac{11F + 1B}{1F + 1B} \approx \frac{11 + 2}{1 + 2} = \frac{13}{3} \approx 4.3\times\)
Used when: - Very deep networks (\(L > 100\)) - Large batch sizes - Limited GPU memory - Trade: 4× longer training for 10× less memory
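A rough sketch of the checkpointing idea, assuming a toy all-sigmoid layer; function and variable names are illustrative, not a production implementation:

import numpy as np

def forward_layer(W, a):                  # one sigmoid layer (illustrative)
    return 1 / (1 + np.exp(-(W @ a)))

def checkpointed_forward(Ws, x, k):
    """Store activations only every k layers; recompute the rest on demand."""
    ckpts, a = {0: x}, x
    for l, W in enumerate(Ws, start=1):
        a = forward_layer(W, a)
        if l % k == 0:
            ckpts[l] = a                  # keep roughly sqrt(L)-spaced checkpoints
    return a, ckpts

def recompute_segment(Ws, ckpts, l, k):
    """Recompute activations from the nearest stored checkpoint up to layer l."""
    start = (l // k) * k
    a = ckpts[start]
    acts = []
    for j in range(start, l):
        a = forward_layer(Ws[j], a)
        acts.append(a)
    return acts                           # activations needed to backprop through this segment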
Algorithm unchanged—only \(\sigma'(\mathbf{s}^{(l)})\) differs per layer
Sigmoid: \(\sigma(s) = \frac{1}{1+e^{-s}}\)
\[\sigma'(s) = \sigma(s)(1-\sigma(s))\]
Can compute from stored activation: \(a(1-a)\)
Tanh: \(\sigma(s) = \tanh(s)\)
\[\tanh'(s) = 1 - \tanh^2(s)\]
Also from activation: \(1 - a^2\)
ReLU: \(\sigma(s) = \max(0, s)\)
\[\text{ReLU}'(s) = \begin{cases} 1 & s > 0 \\ 0 & s \leq 0 \end{cases}\]
Indicator function (discontinuous at 0)
Leaky ReLU: \(\sigma(s) = \max(\alpha s, s)\) with \(\alpha = 0.01\)
\[\text{LReLU}'(s) = \begin{cases} 1 & s > 0 \\ \alpha & s \leq 0 \end{cases}\]
ReLU advantages: - Sparse gradients (50% zeros) - No vanishing gradient for active units (\(s > 0\)) - Cheap to compute
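The derivative rules above, as small helper functions (a sketch; note which ones use the cached activation and which need the pre-activation):

import numpy as np

def sigmoid_grad(a):            # from cached activation a = sigma(s)
    return a * (1 - a)

def tanh_grad(a):               # from cached activation a = tanh(s)
    return 1 - a ** 2

def relu_grad(s):               # needs the pre-activation s, not a
    return (s > 0).astype(float)

def leaky_relu_grad(s, alpha=0.01):
    return np.where(s > 0, 1.0, alpha)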
Element-wise activations (sigmoid, tanh, ReLU):
\[a_i^{(l)} = h(s_i^{(l)})\]
Each output depends only on corresponding input
Jacobian is diagonal: \[\frac{\partial a_i^{(l)}}{\partial s_j^{(l)}} = \begin{cases} h'(s_i^{(l)}) & i = j \\ 0 & i \neq j \end{cases}\]
Delta update uses Hadamard product: \[\boldsymbol{\delta}^{(l)} = (W^{(l+1)})^T \boldsymbol{\delta}^{(l+1)} \odot \sigma'(\mathbf{s}^{(l)})\]
Softmax (non-element-wise):
\[a_i = \frac{e^{s_i}}{\sum_j e^{s_j}}\]
Each output depends on all inputs
Jacobian is full matrix: \[\frac{\partial a_i}{\partial s_j} = \begin{cases} a_i(1 - a_i) & i = j \\ -a_i a_j & i \neq j \end{cases}\]
Delta initialization requires a Jacobian-transpose-vector product: \[\boldsymbol{\delta}^{(L)} = \left(\frac{\partial \mathbf{a}^{(L)}}{\partial \mathbf{s}^{(L)}}\right)^T \nabla_{\mathbf{a}^{(L)}} \mathcal{L}\]
Special case (softmax + cross-entropy): \[\boldsymbol{\delta}^{(L)} = \mathbf{a}^{(L)} - \mathbf{y}\]
Jacobian multiplication simplifies to subtraction
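A quick numerical check of this simplification, using the logits from the earlier 3 → 4 → 2 example (a sketch):

import numpy as np

s = np.array([0.421, 0.384])                 # logits s^(L) from the example above
y = np.array([0.0, 1.0])                     # one-hot target

a = np.exp(s) / np.exp(s).sum()              # softmax
J = np.diag(a) - np.outer(a, a)              # full softmax Jacobian da_i/ds_j
grad_a = -y / a                              # dL/da for cross-entropy
print(J @ grad_a, a - y)                     # both approximately [0.509, -0.509]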
Standard:
Objective: \[\min_W \mathcal{L} = \min_W \frac{1}{m}\sum_{i=1}^m \ell(\hat{\mathbf{y}}^{(i)}, \mathbf{y}^{(i)})\]
Gradient: \[\nabla W^{(l)} = \boldsymbol{\delta}^{(l)}(\mathbf{a}^{(l-1)})^T\]
Regularized:
Objective: \[\min_W \mathcal{L}_{\text{reg}} = \frac{1}{m}\sum_{i=1}^m \ell(\hat{\mathbf{y}}^{(i)}, \mathbf{y}^{(i)}) + \lambda R(W)\]
Gradient: \[\nabla W^{(l)} = \boldsymbol{\delta}^{(l)}(\mathbf{a}^{(l-1)})^T + \lambda \nabla_W R(W^{(l)})\]
Common regularizers: \(R_{L2}(W) = \|W\|_2^2\), \(R_{L1}(W) = \|W\|_1\)
What remains unchanged: the forward pass, the backward \(\boldsymbol{\delta}\) recursion, and the bias gradients
Why: Regularization term \(R(W)\) depends only on weights, not activations or biases
L2 regularization (weight decay):
\[\mathcal{L}_{\text{total}} = \mathcal{L} + \lambda \sum_l \|W^{(l)}\|_F^2\]
Frobenius norm: \(\|W\|_F^2 = \sum_{i,j} w_{ij}^2\)
Gradient: \[\frac{\partial \mathcal{L}_{\text{total}}}{\partial W^{(l)}} = \frac{\partial \mathcal{L}}{\partial W^{(l)}} + 2\lambda W^{(l)}\]
Modified update: \[W^{(l)} \leftarrow W^{(l)} - \eta\left(\boldsymbol{\delta}^{(l)}(\mathbf{a}^{(l-1)})^T + 2\lambda W^{(l)}\right)\]
Equivalent to weight decay: \(W^{(l)} \leftarrow (1 - 2\eta\lambda)W^{(l)} - \eta \boldsymbol{\delta}^{(l)}(\mathbf{a}^{(l-1)})^T\)
L1 regularization:
\[\mathcal{L}_{\text{total}} = \mathcal{L} + \lambda \sum_l \|W^{(l)}\|_1\]
L1 norm: \(\|W\|_1 = \sum_{i,j} |w_{ij}|\)
Subgradient: \[\frac{\partial \mathcal{L}_{\text{total}}}{\partial w_{ij}} = \frac{\partial \mathcal{L}}{\partial w_{ij}} + \lambda \text{sign}(w_{ij})\]
Modified update: \[w_{ij} \leftarrow w_{ij} - \eta\left(\frac{\partial \mathcal{L}}{\partial w_{ij}} + \lambda \text{sign}(w_{ij})\right)\]
Promotes sparsity: weights shrink toward zero
Bias terms typically not regularized
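A sketch of how the regularization term modifies a weight gradient (function name illustrative; biases are left untouched, as noted above):

import numpy as np

def regularized_grad(grad_W, W, lam, kind="l2"):
    """Add the regularization term to a data-loss weight gradient."""
    if kind == "l2":
        return grad_W + 2 * lam * W          # d/dW of lam * ||W||_F^2
    return grad_W + lam * np.sign(W)         # subgradient of lam * ||W||_1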
Architecture: MNIST digit classifier
\[784 \to 256 \to 128 \to 64 \to 32 \to 10\]
Input: 28×28 grayscale images = 784 pixels
Output: 10 class probabilities
Parameters per layer:
Weights: \(784 \cdot 256 + 256 \cdot 128 + 128 \cdot 64 + 64 \cdot 32 + 32 \cdot 10 = 244{,}032\)
Biases: \(256 + 128 + 64 + 32 + 10 = 490\)
Total: \(244{,}032\) weights + \(490\) biases = \(244{,}522\) parameters
Backprop computes all 244,522 gradients in \(\approx\) 2 forward passes
Delta magnitude at layer \(l\):
\[\|\boldsymbol{\delta}^{(l)}\| \approx \|\boldsymbol{\delta}^{(L)}\| \prod_{k=l+1}^L \|W^{(k)}\| \|\sigma'(\mathbf{s}^{(k)})\|\]
Rough approximation ignoring vector structure
Sigmoid network:
Maximum derivative: \(\max_s \sigma'(s) = 0.25\) at \(s=0\)
Typical values: \(\sigma'(s) \approx 0.2\) in practice
After 10 layers: \((0.2)^{10} \approx 10^{-7}\)
Gradients vanish exponentially with depth
Early layers receive near-zero gradients, fail to train
ReLU network:
Derivative: \(\text{ReLU}'(s) \in \{0, 1\}\)
Active units (\(s > 0\)): gradient = 1 (no shrinking)
Dead units (\(s \leq 0\)): gradient = 0 (propagation stops)
Better gradient flow if units stay active
Sigmoid/tanh: Exponential decay
ReLU: Better but still degrades
Nodes: Operations (MatMul, Add, Sigmoid, etc.)
Edges: Tensors flowing between operations
Example: 2-layer network decomposed
Forward: \(\mathbf{x} \to W^{(1)}\mathbf{x} \to + \mathbf{b}^{(1)} \to \sigma \to W^{(2)} \to + \mathbf{b}^{(2)} \to \sigma \to \mathcal{L}\)
Each arrow represents data dependency
Directed: Information flows one direction (forward)
Acyclic: No loops (feedforward network)
Topological order: Can order nodes so all inputs come before outputs
Forward pass: Traverse in topological order
Backward pass: Traverse in reverse topological order
Benefits: - Modular: Each operation self-contained - Composable: Build complex functions from primitives - Automatic: Can differentiate any graph mechanically
Each node: Implements forward and backward
class Operation:
    def forward(self, inputs):
        # Compute output from inputs
        # Cache inputs for backward
        self.cache = inputs
        return self.compute(inputs)

    def backward(self, grad_output):
        # Compute gradient w.r.t. inputs
        # Use cached values
        return self.local_gradient(grad_output, self.cache)
No global knowledge required
Node only needs: - How to compute output from inputs (forward) - How to compute input gradients from output gradient (backward)
Chain rule automatic:
Graph framework connects local gradients
Node A → Node B (A's output feeds B): \(\frac{\partial \mathcal{L}}{\partial \text{output}_A} = \left(\frac{\partial \text{output}_B}{\partial \text{output}_A}\right)^T \frac{\partial \mathcal{L}}{\partial \text{output}_B}\)
Framework handles composition
Forward: \(\mathbf{y} = A\mathbf{x}\)
Input: Matrix \(A \in \mathbb{R}^{m \times n}\), vector \(\mathbf{x} \in \mathbb{R}^n\)
Output: \(\mathbf{y} \in \mathbb{R}^m\)
Backward: Given \(\frac{\partial \mathcal{L}}{\partial \mathbf{y}}\), compute gradients w.r.t. inputs
Gradient w.r.t. \(\mathbf{x}\): \[\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = A^T \frac{\partial \mathcal{L}}{\partial \mathbf{y}}\]
Dimension: \([n \times 1] = [n \times m][m \times 1]\)
Gradient w.r.t. \(A\): \[\frac{\partial \mathcal{L}}{\partial A} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \mathbf{x}^T\]
Dimension: \([m \times n] = [m \times 1][1 \times n]\)
Implementation:
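A possible MatMul node implementing these rules (a sketch; it caches both inputs during the forward pass):

import numpy as np

class MatMul:
    def forward(self, A, x):
        self.A, self.x = A, x               # cache inputs for backward
        return A @ x

    def backward(self, grad_y):
        grad_x = self.A.T @ grad_y          # dL/dx = A^T dL/dy
        grad_A = np.outer(grad_y, self.x)   # dL/dA = dL/dy x^T
        return grad_A, grad_x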
Numerical example:
\[A = \begin{bmatrix} 0.5 & 0.3 \\ -0.2 & 0.4 \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}\]
Forward: \[\mathbf{y} = \begin{bmatrix} 0.5(1) + 0.3(2) \\ -0.2(1) + 0.4(2) \end{bmatrix} = \begin{bmatrix} 1.1 \\ 0.6 \end{bmatrix}\]
Backward (assume \(\frac{\partial \mathcal{L}}{\partial \mathbf{y}} = \begin{bmatrix} 0.8 \\ -0.5 \end{bmatrix}\)):
\[\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \begin{bmatrix} 0.5 & -0.2 \\ 0.3 & 0.4 \end{bmatrix} \begin{bmatrix} 0.8 \\ -0.5 \end{bmatrix} = \begin{bmatrix} 0.5 \\ 0.04 \end{bmatrix}\]
\[\frac{\partial \mathcal{L}}{\partial A} = \begin{bmatrix} 0.8 \\ -0.5 \end{bmatrix} \begin{bmatrix} 1 & 2 \end{bmatrix} = \begin{bmatrix} 0.8 & 1.6 \\ -0.5 & -1.0 \end{bmatrix}\]
Forward: \(z = x + y\)
Inputs: Scalars, vectors, or matrices (same shape)
Output: Same shape as inputs
Backward: Given \(\frac{\partial \mathcal{L}}{\partial z}\)
Addition distributes gradient to both inputs:
\[\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial z}\] \[\frac{\partial \mathcal{L}}{\partial y} = \frac{\partial \mathcal{L}}{\partial z}\]
Gradient copies to each input (no modification)
Broadcast addition: \(\mathbf{z} = \mathbf{x} + b\) where \(b\) scalar
Forward: Add \(b\) to each element of \(\mathbf{x}\)
Backward: \[\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}}\] \[\frac{\partial \mathcal{L}}{\partial b} = \sum_i \frac{\partial \mathcal{L}}{\partial z_i}\]
Gradient w.r.t. scalar sums gradients from all elements
Gradient distribution rule: - Fork (1→many): Copy gradient - Join (many→1): Sum gradients
Forward: \(\mathbf{a} = \sigma(\mathbf{s})\)
Apply function element-wise: \(a_i = \sigma(s_i)\)
Backward: Given \(\frac{\partial \mathcal{L}}{\partial \mathbf{a}}\)
\[\frac{\partial \mathcal{L}}{\partial \mathbf{s}} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}} \odot \sigma'(\mathbf{s})\]
Hadamard product (element-wise multiply)
Each \(\frac{\partial \mathcal{L}}{\partial s_i} = \frac{\partial \mathcal{L}}{\partial a_i} \cdot \sigma'(s_i)\)
Different activations:
Sigmoid: \(\sigma'(s) = \sigma(s)(1-\sigma(s))\) (can compute from \(a\))
ReLU: \(\sigma'(s) = \mathbb{1}_{s > 0}\) (needs \(s\), not just \(a\))
Implementation:
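A possible element-wise activation node, using ReLU to show why the pre-activation must be cached (a sketch; a sigmoid node could cache the output instead and use \(a(1-a)\)):

import numpy as np

class ReLU:
    def forward(self, s):
        self.s = s                           # cache pre-activation (needed for the s > 0 mask)
        return np.maximum(0, s)

    def backward(self, grad_a):
        return grad_a * (self.s > 0)         # Hadamard product with ReLU'(s)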
Fork (one input, multiple outputs):
Single value used by multiple operations
Example: \(x \to f(x)\) and \(x \to g(x)\)
Backward: Gradients from all paths sum
\[\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial f} \frac{\partial f}{\partial x} + \frac{\partial \mathcal{L}}{\partial g} \frac{\partial g}{\partial x}\]
Each path contributes independently
Join (multiple inputs, one output):
Multiple values combine into single operation
Example: \(z = f(x, y)\)
Backward: Each input gets its gradient
\[\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial z} \frac{\partial z}{\partial x}\] \[\frac{\partial \mathcal{L}}{\partial y} = \frac{\partial \mathcal{L}}{\partial z} \frac{\partial z}{\partial y}\]
Separate gradients for each input
Skip connections, attention mechanisms: Multiple paths converge
Static graph: Structure fixed before execution
Build graph once, execute many times
TensorFlow 1.x, compiled frameworks
Optimization possible (graph fusion, memory planning)
Cannot change based on data
Dynamic graph: Structure built during forward pass
Graph structure can depend on data
PyTorch, TensorFlow 2.x (eager mode)
Example: Variable-length sequences
RNN processing sentence: Graph depth = sequence length
Different inputs \(\to\) different graphs
Backward pass: Follow actual graph from forward
Only differentiate paths that were executed
Benefits: - Control flow (if/while in forward) - Data-dependent architecture - Easier debugging (standard Python)
Cost: Less optimization opportunity
PyTorch example:
import torch
# Forward pass (builds graph)
x = torch.tensor([1.0, 2.0], requires_grad=True)
W = torch.tensor([[0.5, 0.3], [0.2, -0.4]], requires_grad=True)
y = W @ x
loss = y.sum()
# Backward pass (automatic)
loss.backward()
# Gradients computed
print(x.grad) # ∂L/∂x
print(W.grad) # ∂L/∂W
Framework tracks all operations on tensors
Builds computation graph automatically
Backward computes all gradients
User never writes backward: Framework does it
Only need to define forward computation
How it works:
Each operation on tensor: 1. Computes forward result 2. Registers itself in computation graph 3. Stores backward function
# Conceptual implementation
class Tensor:
    def __init__(self, data, grad_fn=None):
        self.data = data
        self.grad = None
        self.grad_fn = grad_fn   # Backward function of the op that produced this tensor

    def backward(self):
        self.grad = 1.0          # Start: dL/dL = 1
        # Traverse graph backward (reverse topological order)
        for op in reversed(graph):
            op.backward()
Modern frameworks: - PyTorch: Dynamic graph, eager execution - JAX: Functional, JIT compilation - TensorFlow 2.x: Eager by default
All use reverse-mode autodiff (backpropagation)
Forward pass defines computation
Backward pass automatic
Gradients computed correctly by framework
Operator fusion:
Multiple operations combined into single kernel
Example: Linear + ReLU \(\to\) Fused Linear-ReLU
Reduces memory traffic, faster execution
Memory planning:
Reuse memory for intermediate tensors
Tensor \(A\) computed at step 5, last used at step 8
Reuse memory after step 8 for new tensors
Gradient checkpointing:
Trade computation for memory
Don’t store all activations
Recompute during backward as needed
Example: Fused operations
Separate:
z = W @ x # Write to memory
a = sigmoid(z) # Read z, write a
Fused:
a = fused_linear_sigmoid(W, x)
# Compute z element-wise, apply sigmoid
# Never write z to global memory
10-30% speedup typical
Fusion more beneficial for complex operations (less memory bandwidth bound)
import numpy as np
class TwoLayerNet:
    def __init__(self, input_dim, hidden_dim, output_dim):
        # He-style initialization: scale by sqrt(2 / fan_in)
        self.W1 = np.random.randn(hidden_dim, input_dim) * np.sqrt(2 / input_dim)
        self.b1 = np.zeros(hidden_dim)
        self.W2 = np.random.randn(output_dim, hidden_dim) * np.sqrt(2 / hidden_dim)
        self.b2 = np.zeros(output_dim)
        # Store forward-pass values for the backward pass
        self.cache = {}
Example: 784 → 256 → 10 network
def forward(self, X):
    # X: (batch_size, input_dim)
    self.cache['A0'] = X
    # Layer 1
    self.cache['S1'] = X @ self.W1.T + self.b1
    self.cache['A1'] = sigmoid(self.cache['S1'])
    # Layer 2
    self.cache['S2'] = self.cache['A1'] @ self.W2.T + self.b2
    self.cache['A2'] = softmax(self.cache['S2'])
    return self.cache['A2']
def sigmoid(s):
    return 1 / (1 + np.exp(-s))

def softmax(s):
    # Subtract max for numerical stability
    s_shift = s - np.max(s, axis=1, keepdims=True)
    exp_s = np.exp(s_shift)
    return exp_s / np.sum(exp_s, axis=1, keepdims=True)
def backward(self, y_true):
    m = y_true.shape[0]
    # Output layer (softmax + cross-entropy)
    delta2 = (self.cache['A2'] - y_true) / m
    self.dW2 = delta2.T @ self.cache['A1']
    self.db2 = np.sum(delta2, axis=0)
    # Hidden layer
    A1 = self.cache['A1']
    delta1 = (delta2 @ self.W2) * A1 * (1 - A1)
    self.dW1 = delta1.T @ self.cache['A0']
    self.db1 = np.sum(delta1, axis=0)
def update_parameters(self, learning_rate=0.1):
    self.W2 -= learning_rate * self.dW2
    self.b2 -= learning_rate * self.db2
    self.W1 -= learning_rate * self.dW1
    self.b1 -= learning_rate * self.db1
Complete: 30 lines for forward + backward + update
All 203,530 gradients computed in one backward pass
# Generate synthetic data
np.random.seed(42)
n_samples = 1000
n_features = 20
n_classes = 5

# Random features
X = np.random.randn(n_samples, n_features)

# Random labels (one-hot)
labels = np.random.randint(0, n_classes, n_samples)
y = np.zeros((n_samples, n_classes))
y[np.arange(n_samples), labels] = 1

# Initialize network
net = TwoLayerNet(input_dim=n_features, hidden_dim=50, output_dim=n_classes)

# Train
losses = []
for epoch in range(500):
    # Forward
    predictions = net.forward(X)
    loss = -np.mean(np.sum(y * np.log(predictions + 1e-10), axis=1))
    losses.append(loss)
    # Backward
    net.backward(y)
    # Update
    net.update_parameters(learning_rate=0.1)
Loss: \(1.609 \to 1.456\) (random labels, cannot fit perfectly)
Finite difference approximation:
\[\frac{\partial \mathcal{L}}{\partial w} \approx \frac{\mathcal{L}(w + \epsilon) - \mathcal{L}(w - \epsilon)}{2\epsilon}\]
Centered difference: \(O(\epsilon^2)\) error
One-sided: \(O(\epsilon)\) error
Implementation:
def numerical_gradient(f, x, eps=1e-5):
    """Compute gradient via finite differences"""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old_value = x[idx]
        x[idx] = old_value + eps
        f_plus = f(x)
        x[idx] = old_value - eps
        f_minus = f(x)
        grad[idx] = (f_plus - f_minus) / (2 * eps)
        x[idx] = old_value  # Restore
        it.iternext()
    return grad
Choice of \(\epsilon\):
Too large: \(f(w \pm \epsilon) \approx f(w) \pm \epsilon f'(w) + \frac{\epsilon^2}{2}f''(w)\)
Error: \(O(\epsilon^2)\) from Taylor expansion
Too small: Numerical cancellation
\(f(w + \epsilon) - f(w - \epsilon)\) loses precision
Typical values: \(\epsilon = 10^{-5}\) to \(10^{-7}\)
def gradient_check(net, X, y, eps=1e-5):
    # Compute analytical gradient
    net.forward(X)
    net.backward(y)
    analytic_dW1 = net.dW1.copy()

    # Compute numerical gradient
    def loss_fn(W):
        net.W1 = W.reshape(net.W1.shape)
        pred = net.forward(X)
        return -np.mean(np.sum(y * np.log(pred + 1e-10), axis=1))

    W1_flat = net.W1.flatten()
    numeric_dW1 = numerical_gradient(loss_fn, W1_flat, eps).reshape(net.W1.shape)

    # Compare (scale-invariant relative error)
    diff = np.linalg.norm(analytic_dW1 - numeric_dW1)
    scale = np.linalg.norm(analytic_dW1) + np.linalg.norm(numeric_dW1)
    relative_error = diff / scale
    return relative_error
Relative error metric: \(\frac{\|\nabla_{\text{analytic}} - \nabla_{\text{numeric}}\|}{\|\nabla_{\text{analytic}}\| + \|\nabla_{\text{numeric}}\|}\)
Scale-invariant comparison
All parameters: relative error < \(10^{-8}\)
Implementation correct
Detailed verification: \(W^{(1)}_{11}\) (row 1, column 1 of \(W^{(1)}\))
Analytical gradient (from backprop):
Result: \(\frac{\partial \mathcal{L}}{\partial W^{(1)}_{11}} = -0.00234\)
Numerical gradient (finite difference):
eps = 1e-5
old_val = net.W1[0, 0]
# Perturb +epsilon
net.W1[0, 0] = old_val + eps
loss_plus = cross_entropy(net.forward(X), y)
# Perturb -epsilon
net.W1[0, 0] = old_val - eps
loss_minus = cross_entropy(net.forward(X), y)
# Restore and compute gradient
net.W1[0, 0] = old_val
numerical = (loss_plus - loss_minus) / (2*eps)
Result: \(-0.00234\)
Analytical and numerical gradients match to \(10^{-8}\)
Buggy implementation: Missing sigmoid derivative
def backward_buggy(self, y_true):
    m = y_true.shape[0]
    delta2 = (self.cache['A2'] - y_true) / m
    self.dW2 = delta2.T @ self.cache['A1']
    self.db2 = np.sum(delta2, axis=0)
    # BUG: Missing activation derivative
    delta1 = (delta2 @ self.W2)  # Wrong!
    # Should be:
    # delta1 = (delta2 @ self.W2) * A1 * (1 - A1)
    self.dW1 = delta1.T @ self.cache['A0']
    self.db1 = np.sum(delta1, axis=0)
Gradient check result:
error = gradient_check(buggy_net, X, y)
print(f"Relative error: {error:.2e}")
# Output: Relative error: 7.34e-01
Error \(\sim 0.7\): Implementation wrong
Correct implementation: Error < \(10^{-7}\)
Gradient check immediately detects missing derivative
Network: 784 → 256 → 128 → 10 (MNIST classifier)
Forward pass (single sample):
Layer 1: \(256 \times 784 = 200{,}704\) operations
Layer 2: \(128 \times 256 = 32{,}768\) operations
Layer 3: \(10 \times 128 = 1{,}280\) operations
Total: \(234{,}752\) operations
Backward pass:
Delta propagation: Same matrix operations (transposed)
Gradient computation: Outer products, same cost
Total: \(\approx 2 \times\) forward pass = \(469{,}504\) operations
Complete training step: \(\approx 704{,}256\) operations
Batch size \(m = 32\): \(32 \times 704{,}256 = 22{,}536{,}192\) operations
Modern GPUs: ~10 TFLOPS \(\to\) 0.002 ms per batch
Backward cost ≈ 2× forward, not \(n\)× for \(n\) parameters
Parameters (independent of batch size):
784→256→128→10 network: \(235{,}146\) parameters × 4 bytes ≈ 940 KB
Gradients: Same size as parameters ≈ 940 KB
Activations (scales with batch \(m\)):
Per sample: \(256 + 128 + 10 = 394\) values
Store pre- and post-activation: \(2 \times 394 = 788\) values
Batch \(m = 32\): \(788 \times 32 = 25{,}216\) values × 4 bytes = 101 KB
Total memory (batch 32):
Parameters + Gradients + Activations = 940 + 940 + 101 = 1,981 KB ≈ 2 MB
Scaling: GPT-2 (1.5B parameters, batch 32, seq 1024)
Parameters: 1.5B × 4 = 6 GB
Activations: ~50 GB (dominates)
For this small network, activations exceed parameter memory only once the batch size passes ≈ 298; for large models they dominate at much smaller batches
Finite differences: Perturb each parameter individually
For network with \(n\) parameters:
\[\frac{\partial \mathcal{L}}{\partial w_i} \approx \frac{\mathcal{L}(w_i + \epsilon) - \mathcal{L}(w_i - \epsilon)}{2\epsilon}\]
Requires 2 forward passes per parameter
Total: \(2n\) forward passes
Example: 784→256→128→10 network
Parameters: \(n = 235{,}146\)
Forward passes: \(2 \times 235{,}146 = 470{,}292\)
Backpropagation: 1 forward + 1 backward pass
Backward cost ≈ 2× forward
Total: Equivalent to 3 forward passes
Speedup: \(\frac{470{,}292}{3} = 156{,}764\times\)
For 1M parameter network: 1M× speedup
Speedup linear in parameter count
Forward mode: Propagate derivatives forward with computation
For \(n\) inputs, 1 output: \(n\) passes required (one per input)
Efficient when \(n_{\text{inputs}} \ll n_{\text{outputs}}\)
Reverse mode (backpropagation): Propagate derivatives backward
For \(n\) inputs, 1 output: 1 pass required
Efficient when \(n_{\text{inputs}} \gg n_{\text{outputs}}\)
Neural networks:
Inputs: Million parameters
Outputs: Single scalar loss
Reverse mode optimal: 1 pass vs 1M passes
Example: Function \(f: \mathbb{R}^n \to \mathbb{R}\)
Forward mode: \(O(n)\) passes to compute \(\nabla f\)
Reverse mode: \(O(1)\) passes to compute \(\nabla f\)
Counter-example: \(f: \mathbb{R} \to \mathbb{R}^m\)
Forward mode: \(O(1)\) passes
Reverse mode: \(O(m)\) passes
Reverse mode (backprop) optimal for supervised learning
Experiment: Train 10-layer sigmoid network
Monitor \(\|\nabla W^{(l)}\|\) at each layer during training
Network: 100→100→…→100→10 (10 layers, 100 units each)
Data: MNIST (784→100 via projection)
Training: 1000 iterations, batch size 64, learning rate 0.01
Observation:
Early layers (1-3): Gradient norm \(\sim 10^{-6}\)
Late layers (8-10): Gradient norm \(\sim 10^{-2}\)
Ratio: \(10^4\) difference across network
Early layers barely update
Cause:
\[\|\nabla W^{(1)}\| \propto \prod_{l=2}^{10} \|W^{(l)}\| \|\sigma'(\mathbf{s}^{(l)})\|\]
Sigmoid: \(\max |\sigma'(s)| = 0.25\)
Product of 9 terms: \((0.25)^9 \approx 4 \times 10^{-6}\)
Gradients vanish exponentially with depth for sigmoid
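A rough simulation of this effect with random weights (a sketch only; the exact numbers will differ from the measured values reported in these notes):

import numpy as np

rng = np.random.default_rng(0)
depth, width = 10, 100
delta = np.ones(width)

for _ in range(depth - 1):                   # propagate a delta back through random sigmoid layers
    W = rng.normal(0, 1 / np.sqrt(width), (width, width))
    s = rng.normal(0, 1, width)              # stand-in pre-activations
    a = 1 / (1 + np.exp(-s))
    delta = (W.T @ delta) * a * (1 - a)      # sigmoid derivative factor <= 0.25 per layer

print(np.linalg.norm(delta))                 # shrinks by several orders of magnitude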
Experiment: Same 10-layer network with different activations
Compare gradient flow: - Sigmoid: \(\sigma'(s) = \sigma(s)(1-\sigma(s))\), max = 0.25 - Tanh: \(\tanh'(s) = 1 - \tanh^2(s)\), max = 1.0 - ReLU: \(\text{ReLU}'(s) = \mathbb{1}_{s>0}\), value = 1 when active
Measurement: \(\|\nabla W^{(1)}\|\) after 100 training steps
Results: - Sigmoid: \(3.2 \times 10^{-7}\) (vanished) - Tanh: \(1.4 \times 10^{-4}\) (small but usable) - ReLU: \(2.1 \times 10^{-2}\) (healthy)
Training loss (after 1000 steps): - Sigmoid: 2.15 (barely improved from 2.30) - Tanh: 1.82 (modest improvement) - ReLU: 0.45 (converged)
ReLU preserves gradients through depth, making deep networks trainable
ReLU: Best gradient preservation through depth
Experiment: Train 784→256→128→10 network on MNIST
Hardware: NVIDIA V100 GPU (16 GB)
Batch sizes: 32, 64, 128, 256
Measurements (milliseconds per training step):
| Batch | Forward (ms) | Backward (ms) | Total (ms) | Throughput (samples/ms) |
|---|---|---|---|---|
| 32 | 0.12 | 0.24 | 0.36 | 89 |
| 64 | 0.18 | 0.35 | 0.53 | 121 |
| 128 | 0.31 | 0.58 | 0.89 | 144 |
| 256 | 0.58 | 1.12 | 1.70 | 151 |
Backward consistently \(\approx\) 2× forward time
Throughput saturates around batch 256 (GPU utilization limit)
Memory (batch 256):
Parameters: 0.9 MB
Gradients: 0.9 MB
Activations: 0.4 MB
Total: 2.2 MB (tiny for modern GPU)
Backward time ≈ 2× forward (matches theory)
LMS adaptive filter (linear):
\[\hat{y}_t = \mathbf{w}_t^T \mathbf{x}_t\]
\[\mathbf{w}_{t+1} = \mathbf{w}_t + \mu (y_t - \hat{y}_t) \mathbf{x}_t\]
Gradient: \(\frac{\partial \hat{y}_t}{\partial \mathbf{w}} = \mathbf{x}_t\) (trivial for linear model)
MLP with online backpropagation:
\[\hat{y}_t = f(\mathbf{x}_t; \Theta_t)\]
\[\Theta_{t+1} = \Theta_t - \eta \nabla_\Theta \mathcal{L}(f(\mathbf{x}_t; \Theta_t), y_t)\]
Gradient: \(\frac{\partial f}{\partial \Theta}\) requires backpropagation (nontrivial for deep, nonlinear model)
Same conceptual algorithm: (1) predict \(\hat{y}_t\) from \(\mathbf{x}_t\), (2) observe the target and form the error/loss, (3) compute the gradient of the loss w.r.t. the parameters, (4) update the parameters
What changes: Step 3 complexity
LMS: Gradient is input \(\mathbf{x}_t\) (immediate)
MLP: Gradient computed via chain rule through layers (backpropagation)
Backpropagation = systematic gradient computation for nonlinear, multilayer adaptive filtering
Enables same stochastic gradient descent framework, but for deep architectures
See EE 483 (adaptive algorithms), EE 503 (stochastic optimization)
Standard batch training: Stationary \(P(Y|X)\)
Accumulate gradients over dataset: \[\Theta \leftarrow \Theta - \eta \frac{1}{m}\sum_{i=1}^m \nabla_\Theta \mathcal{L}(f(\mathbf{x}_i; \Theta), y_i)\]
Converges to minimize \(\mathbb{E}[\mathcal{L}]\) over fixed distribution
Assumption: Data distribution doesn’t change
Online adaptation: Time-varying \(P_t(Y|X)\)
Update after each sample (no accumulation): \[\Theta_{t+1} = \Theta_t - \eta \nabla_\Theta \mathcal{L}(f(\mathbf{x}_t; \Theta_t), y_t)\]
Tracks changing distribution if learning rate chosen properly
Learning rate controls tracking speed vs stability tradeoff
Example deployment scenario:
Train on historical data (batch mode)
Deploy in changing environment (online adaptation)
Strategy: Freeze early layers, adapt final layer only
\[\Theta_{\text{final}}^{(t+1)} = \Theta_{\text{final}}^{(t)} - \eta \nabla_{\Theta_{\text{final}}} \mathcal{L}_t\]
Why this works:
Early layers: General feature extraction (stable across environments)
Final layer: Task-specific mapping (may drift with distribution shift)
Fewer parameters to adapt → faster convergence, less overfitting to noise
Backpropagation enables: Computing \(\nabla_{\Theta_{\text{final}}} \mathcal{L}_t\) efficiently even when final layer is deep in network
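A sketch of one online update that adapts only the final (softmax) layer, using the same delta formulas as before; the function name and shapes are illustrative:

import numpy as np

def online_final_layer_update(W2, b2, a1, y, eta):
    """One online step adapting only the output layer (softmax + cross-entropy).
    a1: frozen hidden features for the current sample; y: one-hot target."""
    s2 = W2 @ a1 + b2
    a2 = np.exp(s2 - s2.max())
    a2 /= a2.sum()                               # softmax probabilities
    delta2 = a2 - y                              # dL/ds2
    W2 -= eta * np.outer(delta2, a1)             # gradient = delta2 a1^T
    b2 -= eta * delta2
    return W2, b2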