Deep Neural Network Learning Process
EE 541 - Unit 1
Fall 2025
“Learning is any process by which a system improves performance from experience.”
“A computer program is said to learn from experience E with respect to task T and performance measure P, if its performance at T, as measured by P, improves with experience E.”
\[\text{Learning} = \{\mathcal{T}, \mathcal{P}, \mathcal{E}\}\]
“All models are wrong, but some are useful”
“Since all models are wrong the scientist cannot obtain a ‘correct’ one by excessive elaboration”
Seek economical descriptions of phenomena
Where do you stop?
Worrying Selectively
It is inappropriate to be concerned about mice when there are tigers abroad
Two-Moons Dataset
Tests whether a model can learn curved decision boundaries. Two interleaving half-circles that cannot be separated by any straight line.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
mlp = MLPClassifier(
hidden_layer_sizes=(10, 10),
max_iter=1000,
random_state=42
)
mlp.fit(X_train, y_train)
print(f"Training accuracy: {mlp.score(X_train, y_train):.3f}")
print(f"Test accuracy: {mlp.score(X_test, y_test):.3f}")
Training accuracy: 0.979
Test accuracy: 0.950
Find \(h^* \in \mathcal{H}\) that minimizes:
\[\mathbb{E}_{(\mathbf{x},y) \sim P}[\mathcal{L}(h(\mathbf{x}), y)]\]
But we only have access to:
\[\frac{1}{N}\sum_{i=1}^N \mathcal{L}(h(\mathbf{x}_i), y_i)\]
The Generalization Gap
Minimize error on unseen data
using only observed samples
This gap is the generalization problem
Representation Determines Learnability
The choice of representation can make learning tractable or impossible. Deep learning learns representations automatically.
[Figure: example binary patterns labeled GOOD, GOOD, BAD, and one unlabeled pattern ?]
Can we classify the unknown pattern?
x = 0111111011100100000010000001011111111101111001110 (49 bits)
Label: “GOOD”
A hypothesis class can succeed or fail based on the choice of representation.
The same pattern can be written as a 49-dimensional binary vector:
\[\mathbf{x} = (0,1,1,1,1,1,1,0,\ldots,0,0,1,1,1,0)^\top \in \{0,1\}^{49}\]
\[y \in \{-1, +1\}\]
\[\hat{y} = \text{sign}(\boldsymbol{\theta} \cdot \mathbf{x}) = \text{sign}(\theta_1 x_1 + \cdots + \theta_{49} x_{49})\]
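A quick numpy sketch of this classifier (the weights here are random placeholders, not learned values):

import numpy as np

# The 49-bit pattern from above, flattened to a vector
bits = "0111111011100100000010000" + "001011111111101111001110"
x = np.array([int(b) for b in bits])           # shape (49,)

rng = np.random.default_rng(0)
theta = rng.standard_normal(49)                # placeholder weights
y_hat = np.sign(theta @ x)                     # predicted label in {-1, +1}
print(y_hat)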
Definition: Linear Function
A function \(f: \mathbb{R}^d \rightarrow \mathbb{R}\) is linear if \(f(\mathbf{x}) = \boldsymbol{\theta}^\top \mathbf{x} + b\) for some \(\boldsymbol{\theta} \in \mathbb{R}^d\) and \(b \in \mathbb{R}\).
The decision boundary \(\{\mathbf{x} : f(\mathbf{x}) = 0\}\) is a hyperplane.
Compare two hypothesis classes:
\[\mathcal{H}_{\text{linear}}: h(\mathbf{x}) = \text{sign}(\mathbf{w}^T\mathbf{x} + b)\] \[\mathcal{H}_{\text{neural}}: h(\mathbf{x}) = h_2(\mathbf{W}_2 \cdot h_1(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2)\]
\[y = h\left(\sum_{i=1}^n w_i x_i + b\right) = h(\mathbf{w}^T\mathbf{x} + b)\]
where \(h\) is an activation function (e.g., sigmoid, tanh, or ReLU)
\[\mathbf{w}^* = \arg\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2\]
Solution: \[\mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
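A numpy sketch of the closed-form solution on synthetic data (in practice np.linalg.lstsq is numerically preferable to forming \(\mathbf{X}^T\mathbf{X}\) explicitly):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))              # 100 samples, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(100)

# Normal equations: solve (X^T X) w = X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)                                   # close to w_true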
\[\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla_{\mathbf{w}}\mathcal{L}(\mathbf{w}_t)\]
Iterative Optimization Principle
Gradient descent navigates the loss landscape by repeatedly stepping in the direction of steepest descent. For convex problems with a suitably small step size, it converges to the global minimum. For neural networks, we settle for local minima that generalize well.
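A sketch of the iteration on the least-squares objective above (reusing X and y from the previous snippet; the step size is hand-picked to be stable):

w = np.zeros(3)
eta = 0.1
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)      # mean-squared-error gradient
    w -= eta * grad
print(w)                                        # approaches the closed-form w_star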
\[\text{MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\]
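A Monte-Carlo sketch of this decomposition for a deliberately biased estimator (a shrunken sample mean; the setup is illustrative, and there is no irreducible-error term when estimating a fixed parameter):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 2.0, 1.0, 10, 100_000
samples = mu + sigma * rng.standard_normal((trials, n))
estimates = 0.8 * samples.mean(axis=1)         # shrinkage introduces bias

bias2 = (estimates.mean() - mu) ** 2
variance = estimates.var()
mse = ((estimates - mu) ** 2).mean()
print(bias2 + variance, mse)                    # the two sides agree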
“Data is the new oil”
But like oil, it must be refined to have value
The Data Value Equation: \[\text{Model Performance} = f(\text{Data Quality}, \text{Data Quantity})\]
The Power of Data Representation
The right representation can transform an impossible problem into a trivial one
As \(d \to \infty\):
d= 1: 0.050000 in outer shell
d= 2: 0.097500 in outer shell
d= 3: 0.142625 in outer shell
d= 10: 0.401263 in outer shell
d= 100: 0.994079 in outer shell
d=1000: 1.000000 in outer shell
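These numbers follow from a one-line volume argument: shrinking each dimension by 5% leaves \(0.95^d\) of the volume, so the outer shell holds \(1 - 0.95^d\). A sketch reproducing the printout:

for d in [1, 2, 3, 10, 100, 1000]:
    print(f"d={d:4d}: {1 - 0.95**d:.6f} in outer shell")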
Training on augmented data: \[\min_\theta \sum_{i=1}^N \sum_{j=1}^M \mathcal{L}(f_\theta(T_j(x_i)), y_i)\]
where \(T_j\) are augmentation transforms
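A minimal numpy sketch of the inner loop of this objective; the transforms here (identity, horizontal flip, additive noise) are illustrative stand-ins for the \(T_j\):

import numpy as np

def augment(x, rng):
    """Return a list of transformed views T_j(x) of one input."""
    return [
        x,                                      # identity
        x[:, ::-1],                             # horizontal flip
        x + 0.1 * rng.standard_normal(x.shape)  # additive noise
    ]

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
views = augment(image, rng)
print(len(views), views[1].shape)               # 3 views, each the same shape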
Modern methods combine paradigms: GPT-4 uses unsupervised pre-training on text, supervised fine-tuning on tasks, and reinforcement learning from human feedback (RLHF).
Given: \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N\)
Learn: \(f: \mathcal{X} \to \mathcal{Y}\)
Minimize: \(\mathcal{L}(f(\mathbf{x}), y)\)
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Generate synthetic data
np.random.seed(42)
X = np.random.randn(1000, 10)
w_true = np.random.randn(10)
y = (X @ w_true + np.random.randn(1000)*0.1 > 0).astype(int)
# Standard supervised pipeline
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
Train accuracy: 0.990
Test accuracy: 0.995
Given: \(\mathcal{D} = \{\mathbf{x}_i\}_{i=1}^N\)
Find: Hidden patterns, structure, representations
Finding Structure in Data
Transform unsupervised → supervised by creating pretext tasks
# Example: Simple masked prediction
import numpy as np

def create_masked_task(sequence, mask_prob=0.15):
"""Create self-supervised task from sequence"""
masked = sequence.copy()
labels = np.full_like(sequence, -1)
mask_indices = np.random.random(len(sequence)) < mask_prob
masked[mask_indices] = 0 # [MASK] token
labels[mask_indices] = sequence[mask_indices]
return masked, labels
# Example sequence
sequence = np.array([1, 4, 2, 8, 3, 7, 5, 9])
masked_input, targets = create_masked_task(sequence)
print(f"Original: {sequence}")
print(f"Masked: {masked_input}")
print(f"Targets: {targets}")
Original: [1 4 2 8 3 7 5 9]
Masked: [1 0 0 0 3 0 5 9]
Targets: [-1 4 2 8 -1 7 -1 -1]
Foundation Models and Self-Supervision
Self-supervised learning powers modern foundation models like GPT and BERT
Components: agent, environment, states, actions, rewards
Objective: Maximize expected cumulative reward \[J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]\]
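A sketch of the discounted sum inside this expectation, evaluated on one made-up reward trajectory:

gamma = 0.99
rewards = [1.0, 0.0, 0.5, 2.0]                  # illustrative trajectory
G = sum(gamma**t * r for t, r in enumerate(rewards))
print(G)                                         # discounted cumulative reward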
Hybrid Learning Approaches
Modern approaches often combine paradigms for better performance
Multilayer Perceptron (MLP): Fully connected feedforward network
Definition: Deep Neural Network
A neural network with more than one hidden layer. Depth enables hierarchical feature learning: early layers learn simple features, deeper layers learn complex abstractions.
At neuron \(i\) in layer \(l\):
\[a_i^{(l)} = h\left(\left[\mathbf{w}_i^{(l)}\right]^\top \mathbf{a}^{(l-1)} + b_i^{(l)}\right)\]
In vector form, for the entire layer:
\[\mathbf{a}^{(l)} = h\left(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\right)\]
A feedforward network with a single hidden layer, a nonlinear activation, and enough hidden units:
Can approximate any continuous function on a compact subset of \(\mathbb{R}^n\) to arbitrary accuracy
import numpy as np

class Layer:
    def __init__(self, n_in, n_out, activation, activation_derivative):
        # Assumed initializer (not in the original excerpt):
        # small random weights and zero biases
        self.W = np.random.randn(n_in, n_out) * 0.01
        self.b = np.zeros(n_out)
        self.activation = activation
        self.activation_derivative = activation_derivative
def forward(self, x):
# Store for backward pass
self.x = x
# Linear transformation
self.z = np.dot(x, self.W) + self.b
# Apply activation
self.a = self.activation(self.z)
return self.a
def backward(self, grad_output):
# Chain rule through activation
grad_z = grad_output * \
self.activation_derivative(self.z)
# Parameter gradients
self.grad_W = np.dot(self.x.T, grad_z)
self.grad_b = np.sum(grad_z, axis=0)
# Input gradient for previous layer
grad_input = np.dot(grad_z, self.W.T)
return grad_input
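A finite-difference check is a standard way to validate this backward pass (a sketch using the Layer class above with tanh; it relies on the assumed __init__ shown earlier):

layer = Layer(4, 3, np.tanh, lambda z: 1 - np.tanh(z)**2)
x = np.random.randn(2, 4)

out = layer.forward(x)
grad_in = layer.backward(np.ones_like(out))     # gradient of sum(outputs)

eps = 1e-6
x_pert = x.copy()
x_pert[0, 0] += eps
numeric = (layer.forward(x_pert).sum() - layer.forward(x).sum()) / eps
print(grad_in[0, 0], numeric)                   # should agree to ~1e-5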
In \(d\) dimensions with \(n\) parameters:
import numpy as np

def sgd(w, grad, lr=0.01):
return w - lr * grad
def sgd_momentum(w, grad, velocity, lr=0.01, beta=0.9):
velocity = beta * velocity + lr * grad
return w - velocity, velocity
def adam(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad**2
m_hat = m / (1 - beta1**t)
v_hat = v / (1 - beta2**t)
return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
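A usage sketch on the 1-D quadratic loss \(L(w) = w^2\) (gradient \(2w\)); the learning rate is raised from the default so Adam makes visible progress in 100 steps:

w, m, v = 5.0, 0.0, 0.0
for t in range(1, 101):
    grad = 2 * w                                # gradient of w**2
    w, m, v = adam(w, grad, m, v, t, lr=0.1)
print(w)                                        # near 0 (oscillates within ~lr of the minimum)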
“Dense networks contain sparse subnetworks that can train to comparable accuracy from the same initialization”
\[\text{Parameters: } 100M \to 10M\] \[\text{Performance: } 95\% \to 94.5\%\]
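A sketch of the simplest related operation, magnitude pruning: keep the largest 10% of weights by absolute value and zero the rest (illustrative only; the full lottery-ticket procedure also rewinds to the original initialization and retrains):

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 100))

keep = 0.10                                     # keep 10% of weights
threshold = np.quantile(np.abs(W), 1 - keep)
mask = np.abs(W) >= threshold
print(f"Surviving weights: {mask.mean():.2%}")  # ~10%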
WARNING: Data Contamination
Never touch test data until final evaluation - validation data guides all decisions
Why does SGD find generalizing solutions?
Networks can memorize random labels perfectly,
yet SGD finds patterns when labels are real
Why does overparameterization help?
10x more parameters than samples should overfit,
but often improves test accuracy
What is the role of depth?
Shallow wide networks have same capacity,
but deep networks generalize better
How do transformers generalize?
No convolutions, no recurrence,
yet state-of-the-art on vision and language
import numpy as np
class Conv2D:
def __init__(self, in_channels, out_channels, kernel_size=3):
self.in_channels = in_channels
self.out_channels = out_channels
self.kernel_size = kernel_size
# Initialize filters
self.filters = np.random.randn(
out_channels, in_channels, kernel_size, kernel_size
) * 0.1
self.bias = np.zeros(out_channels)
def forward(self, x):
batch, in_c, height, width = x.shape
out_h = height - self.kernel_size + 1
out_w = width - self.kernel_size + 1
output = np.zeros((batch, self.out_channels, out_h, out_w))
# Convolution operation
for b in range(batch):
for oc in range(self.out_channels):
for h in range(out_h):
for w in range(out_w):
# Extract patch
patch = x[b, :, h:h+self.kernel_size, w:w+self.kernel_size]
# Convolve with filter
output[b, oc, h, w] = np.sum(patch * self.filters[oc]) + self.bias[oc]
return output
class MaxPool2D:
def __init__(self, pool_size=2):
self.pool_size = pool_size
def forward(self, x):
batch, channels, height, width = x.shape
out_h = height // self.pool_size
out_w = width // self.pool_size
output = np.zeros((batch, channels, out_h, out_w))
for h in range(out_h):
for w in range(out_w):
h_start = h * self.pool_size
w_start = w * self.pool_size
pool_region = x[:, :, h_start:h_start+self.pool_size,
w_start:w_start+self.pool_size]
output[:, :, h, w] = np.max(pool_region, axis=(2, 3))
return output
# Example usage
x = np.random.randn(1, 3, 32, 32) # Batch=1, RGB, 32x32
conv = Conv2D(3, 16, kernel_size=3)
pool = MaxPool2D(pool_size=2)
x = conv.forward(x)
print(f"After conv: {x.shape}") # (1, 16, 30, 30)
x = np.maximum(0, x) # ReLU
x = pool.forward(x)
print(f"After pool: {x.shape}") # (1, 16, 15, 15)
After conv: (1, 16, 30, 30)
After pool: (1, 16, 15, 15)
\[\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \mathcal{L}(f_\theta(x_i), y_i)\]
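A sketch of this update on synthetic linear-regression data (batch size, learning rate, and epoch count are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
w_true = rng.standard_normal(5)
y = X @ w_true

theta, eta, batch = np.zeros(5), 0.1, 32
for epoch in range(5):
    perm = rng.permutation(len(X))              # reshuffle every epoch
    for start in range(0, len(X), batch):
        idx = perm[start:start + batch]
        grad = 2 * X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)
        theta -= eta * grad
print(np.round(theta - w_true, 4))              # error shrinks toward zero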
1. Become one with the data: Look at your data. Plot it. Understand its distribution, outliers, patterns.
2. Set up an end-to-end pipeline: Get a simple model training before adding complexity.
3. Overfit a single batch: If you can't overfit 10 examples, something is broken.
4. Verify loss at initialization: Check loss matches its expected value (e.g., \(\log(n_{classes})\) for classification).
5. Add complexity gradually: Start simple, add one thing at a time.
# The debugging progression (train_step, X, y assumed defined elsewhere)
def debug_training():
# Step 1: Overfit one example
single_x = X[0:1]
single_y = y[0:1]
for _ in range(100):
loss = train_step(single_x, single_y)
assert loss < 0.01, "Can't overfit single"
# Step 2: Overfit small batch
batch_x = X[0:10]
batch_y = y[0:10]
for _ in range(500):
loss = train_step(batch_x, batch_y)
assert loss < 0.1, "Can't overfit batch"
# Step 3: Check with real data
# Only now move to full dataset
return "Ready for full training"
# Tracking experiments
experiment_config = {
'model': 'resnet18',
'dataset': 'cifar10',
'batch_size': 128,
'lr': 0.1,
'epochs': 100,
'seed': 42,
'timestamp': '2025-01-15-14:30'
}
# Always set seeds for reproducibility
def set_all_seeds(seed=42):
np.random.seed(seed)
# torch.manual_seed(seed)
# torch.cuda.manual_seed_all(seed)
# random.seed(seed)
# Log everything (get_current_lr assumed defined elsewhere)
import time

def log_metrics(epoch, train_loss, val_loss, val_acc):
metrics = {
'epoch': epoch,
'train_loss': train_loss,
'val_loss': val_loss,
'val_acc': val_acc,
'lr': get_current_lr(),
'timestamp': time.time()
}
# Write to file, tensorboard, wandb, etc.
return metrics
# System Python - Don't do this
pip install torch
# Error: requires numpy>=1.19
pip install numpy==1.20
# Breaks: opencv requires numpy==1.18
Dependency conflicts are inevitable
Each project needs its own isolated environment.
# Download Miniconda (minimal) or Anaconda (full)
# miniconda.anaconda.com
# After installation, verify:
conda --version
conda info
# Update conda itself
conda update -n base conda
Binary package management
Cross-platform
Channel system:
  conda-forge: Community packages
  pytorch: Official PyTorch builds
  nvidia: CUDA toolkit
Environment files: environment.yml for reproducibility

# Activate your environment first
conda activate ee541
# Core scientific stack
conda install numpy scipy matplotlib pandas
# Jupyter for notebooks
conda install jupyter ipykernel
# Register kernel for Jupyter
python -m ipykernel install --user --name ee541 --display-name "Python (ee541)"
# PyTorch - SELECT BASED ON YOUR SYSTEM
# CPU only
conda install pytorch torchvision torchaudio cpuonly -c pytorch
# CUDA 11.8 (NVIDIA GPU)
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
# Mac M1/M2/M3 (Metal Performance Shaders)
conda install pytorch torchvision torchaudio -c pytorch
# Additional ML tools
conda install scikit-learn
conda install -c conda-forge tensorboard
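To capture the environment file mentioned above, conda can export the active environment and recreate it elsewhere:

# Snapshot the active environment to a file
conda env export > environment.yml
# Recreate it on another machine
conda env create -f environment.yml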
GPU Configuration Check
nvidia-smi
# From terminal with environment active
(ee541) $ jupyter notebook
# Opens browser at localhost:8888
# Navigate to your work directory
Shift+Enter: Run cell, move to next
Ctrl+Enter: Run cell, stay
Esc: Command mode
Enter: Edit mode
A/B: Insert cell above/below

Version Control
import sys
import platform
print(f"Python: {sys.version}")
print(f"Platform: {platform.platform()}")
packages = {
'numpy': None,
'torch': None,
'torchvision': None,
'matplotlib': None,
'jupyter': None,
'sklearn': 'scikit-learn'
}
for import_name, pip_name in packages.items():
try:
module = __import__(import_name)
version = getattr(module, '__version__', 'installed')
print(f"✓ {import_name}: {version}")
# Special check for PyTorch GPU
if import_name == 'torch':
import torch
if torch.cuda.is_available():
print(f" GPU: CUDA ({torch.version.cuda})")
print(f" Device: {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
print(f" GPU: MPS (Mac)")
else:
print(f" GPU: Not available")
except ImportError:
package = pip_name or import_name
print(f"✗ {import_name}: Not installed")
print(f" Install with: conda install {package}")
Python: 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:49:36) [Clang 16.0.6 ]
Platform: macOS-15.6.1-arm64-arm-64bit
✓ numpy: 1.26.4
✓ torch: 2.5.1
GPU: MPS (Mac)
✓ torchvision: 0.20.1
✓ matplotlib: 3.10.6
✓ jupyter: installed
✓ sklearn: 1.6.1
Task: Classify clothing items into 10 categories
Architecture: Simple 2-layer network
Training: 4 epochs, Adam optimizer
Files:
Minimal.ipynb: Core training loop
Visualize-tensorboard.ipynb: Monitoring
T-shirt
Trouser
Pullover
Dress
Coat
Sandal
Shirt
Sneaker
Bag
Boot
# Minimal.ipynb - Key components
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader

# 1. Data Loading
train_loader = DataLoader(train_set, batch_size=100, shuffle=True)
test_loader = DataLoader(test_set, batch_size=100, shuffle=False)
# 2. Model Definition
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.hidden = nn.Linear(784, 128)
self.output = nn.Linear(128, 10)
def forward(self, x):
x = F.relu(self.hidden(x))
return self.output(x)
# 3. Training Loop
for epoch in range(num_epochs):
for images, labels in train_loader:
# Forward pass
outputs = model(images.view(-1, 784))
loss = loss_func(outputs, labels)
# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
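A matching evaluation pass would look like this (a sketch; model and test_loader as defined above):

# 4. Evaluation (sketch)
correct, total = 0, 0
with torch.no_grad():
    for images, labels in test_loader:
        outputs = model(images.view(-1, 784))
        preds = outputs.argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f"Test accuracy: {correct / total:.3f}")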
Training Performance Benchmarks
Complete training in ~5 minutes on CPU, <1 minute on GPU
Observations:
Embedding Insight: Classes form distinct clusters in feature space
Common Confusions: Shirt ↔ T-shirt, Pullover ↔ Coat
Python and NumPy for Neural Networks
Next week: Array operations and automatic differentiation