
EE 541 - Unit 1
Spring 2026
“Learning is any process by which a system improves performance from experience.”
\[\text{Learning System} = (\mathcal{T}, \mathcal{P}, \mathcal{E})\]
Learning occurs when: \[\mathcal{P}_{\text{after}}(\mathcal{T}, \mathcal{E}) > \mathcal{P}_{\text{before}}(\mathcal{T})\]



“All models are wrong, but some are useful”
“Since all models are wrong the scientist cannot obtain a ‘correct’ one by excessive elaboration”
Box’s warning: More parameters ≠ better science
Question: Is 0.2% → 0.1% worth 25M parameters?
Worrying Selectively
“It is inappropriate to be concerned about mice when there are tigers abroad.”




Two-Moons Dataset
Tests whether a model can learn curved decision boundaries. Two interleaving half-circles that cannot be separated by any straight line.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
mlp = MLPClassifier(
    hidden_layer_sizes=(10, 10),
    max_iter=1000,
    random_state=42
)
mlp.fit(X_train, y_train)
print(f"Training accuracy: {mlp.score(X_train, y_train):.3f}")
print(f"Test accuracy: {mlp.score(X_test, y_test):.3f}")
Training accuracy: 0.979
Test accuracy: 0.950

Find \(h^* \in \mathcal{H}\) that minimizes:
\[\mathbb{E}_{(\mathbf{x},y) \sim P}[\mathcal{L}(h(\mathbf{x}), y)]\]
But we only have access to:
\[\frac{1}{N}\sum_{i=1}^N \mathcal{L}(h(\mathbf{x}_i), y_i)\]
Generalization Gap
Minimize error on unseen data using only observed samples
This gap defines machine learning



Representation Determines Learnability
The choice of representation can make learning tractable or impossible. Deep learning learns representations automatically.




[Figure: example patterns labeled GOOD, GOOD, BAD, and an unlabeled pattern marked ?]
Can we classify the unknown pattern?

x = 0111111011100100000010000001011111111101111001110  (49 bits)
Label: “GOOD”
The same pattern can be represented as:
A hypothesis class can succeed or fail based on the choice of representation.
\[\mathbf{x} = (x_1, x_2, \ldots, x_{49}) \in \{0,1\}^{49}\]
\[y \in \{-1, +1\}\]
\[\hat{y} = \text{sign}(\mathbf{w} \cdot \mathbf{x}) = \text{sign}(w_1 x_1 + \cdots + w_{49} x_{49})\]
Definition: Linear Function
A function \(f: \mathbb{R}^d \rightarrow \mathbb{R}\) is linear if \(f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b\) for some \(\mathbf{w} \in \mathbb{R}^d\) and \(b \in \mathbb{R}\). The decision boundary \(\{\mathbf{x} : f(\mathbf{x}) = 0\}\) is a hyperplane.
where:

\[\mathcal{H}_{\text{linear}}: h(\mathbf{x}) = \text{sign}(\mathbf{w}^T\mathbf{x} + b)\] \[\mathcal{H}_{\text{neural}}: h(\mathbf{x}) = h_2(\mathbf{W}_2 \cdot h_1(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2)\]
\[y = h\left(\sum_{i=1}^n w_i x_i + b\right) = h(\mathbf{w}^T\mathbf{x} + b)\]
where \(h\) is an activation function:


Why Nonlinearity Matters
Without activation functions, stacking layers is pointless: \(\mathbf{W}_2(\mathbf{W}_1 \mathbf{x}) = (\mathbf{W}_2\mathbf{W}_1)\mathbf{x} = \mathbf{W}\mathbf{x}\), a single linear map.
Later topic: Gradient flow and vanishing gradients during backpropagation
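A quick numerical check of this collapse (a minimal numpy sketch; the layer shapes here are arbitrary):
import numpy as np

# Two "layers" with no activation collapse to one linear map
rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 3))   # first layer: 3 -> 5
W2 = rng.standard_normal((2, 5))   # second layer: 5 -> 2
x = rng.standard_normal(3)

two_layer = W2 @ (W1 @ x)          # stacked linear layers
collapsed = (W2 @ W1) @ x          # single equivalent layer W = W2 W1
print(np.allclose(two_layer, collapsed))  # True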
\[\mathbf{w}^* = \arg\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2\]
Solution: \[\mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
\[\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla_{\mathbf{w}}\mathcal{L}(\mathbf{w}_t)\]

Iterative Optimization Principle
Gradient descent navigates the loss landscape by repeatedly moving in the direction of steepest descent. For convex problems, this guarantees convergence to the global minimum. For neural networks, we settle for local minima that generalize well.
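To make the two routes above concrete, here is a minimal sketch on synthetic data (dimensions, step size, and iteration count are illustrative): the closed-form solution and gradient descent reach essentially the same weights.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(200)

# Closed form: w* = (X^T X)^{-1} X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the mean squared error
w = np.zeros(3)
eta = 0.1
for _ in range(500):
    grad = -(2 / len(y)) * X.T @ (y - X @ w)  # gradient of mean squared error
    w = w - eta * grad

print(w_closed)
print(w)  # converges toward the closed-form solution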
\[\text{MSE} = \text{Bias}^2 + \text{Variance} + \sigma^2\]
Bias: Error from wrong model assumptions
Variance: Error from sensitivity to training data
Irreducible error (\(\sigma^2\)): Noise inherent in data

Tradeoff: Complex models reduce bias but increase variance

Increasing complexity: bias decreases, variance increases
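A small illustration of the tradeoff (synthetic sine data and polynomial degrees chosen for illustration, not from the lecture): low-degree fits underfit (high bias), high-degree fits chase noise (high variance), which shows up as a growing gap between train and test error.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def f(x):
    return np.sin(2 * np.pi * x)

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 30)
x_test = rng.uniform(0, 1, 200)
y_train = f(x_train) + 0.2 * rng.standard_normal(x_train.size)
y_test = f(x_test) + 0.2 * rng.standard_normal(x_test.size)

for degree in [1, 3, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train[:, None], y_train)
    tr = mean_squared_error(y_train, model.predict(x_train[:, None]))
    te = mean_squared_error(y_test, model.predict(x_test[:, None]))
    print(f"degree {degree:2d}: train MSE {tr:.3f}, test MSE {te:.3f}")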
Learning = Function Approximation
Representation Matters
Generalization is the Goal
“Data is the new oil”
But like oil, it must be refined to have value
\[\text{Model Performance} = f(\text{Data Quality}, \text{Data Quantity})\]
Note: Specific percentages vary by application, but quality improvement consistently outperforms quantity alone.


Representation Determines Learnability
Concentric circles: linearly inseparable in Cartesian coordinates, but trivially separable by radius in polar coordinates. Deep learning automates this search for effective representations.
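A minimal sketch of this idea with scikit-learn's make_circles (the dataset parameters are illustrative): a linear classifier fails on raw Cartesian coordinates but separates the classes once the radius is supplied as a feature.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

X, y = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)

# Raw Cartesian coordinates: no separating line exists
clf_xy = LogisticRegression().fit(X, y)
print(f"Cartesian features: {clf_xy.score(X, y):.2f}")

# Polar-style feature: radius from the origin
r = np.linalg.norm(X, axis=1, keepdims=True)
clf_r = LogisticRegression().fit(r, y)
print(f"Radius feature:     {clf_r.score(r, y):.2f}")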
As \(d \to \infty\), nearly all of the volume lies in a thin outer shell:
d= 1: 0.050000 in outer shell
d= 2: 0.097500 in outer shell
d= 3: 0.142625 in outer shell
d= 10: 0.401263 in outer shell
d= 100: 0.994079 in outer shell
d=1000: 1.000000 in outer shell
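The printed fractions are just \(1 - 0.95^d\), the share of volume within 5% of the boundary; a one-line check:
import numpy as np

for d in [1, 2, 3, 10, 100, 1000]:
    # Fraction of volume in the outer 5% shell: 1 - 0.95^d
    print(f"d={d:4d}: {1 - 0.95**d:.6f} in outer shell")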



Setup (2014-2017):
What went wrong:
The data:
Historical hires: 85% male, 15% female
Model learned: male-coded patterns = higher rating
System scrapped in 2018.

Why this matters:
The model did exactly what it was trained to do - replicate patterns in historical data.
The problem: historical data reflected real-world bias.
Clean data ≠ unbiased data
Training on augmented data: \[\min_\theta \sum_{i=1}^N \sum_{j=1}^M \mathcal{L}(f_\theta(T_j(x_i)), y_i)\]
where \(T_j\) are augmentation transforms
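A minimal PyTorch sketch of this objective (the model, loss, and transform list below are placeholders for illustration): each example contributes one loss term per augmentation \(T_j\).
import torch

def augmented_loss(model, loss_fn, transforms, x_batch, y_batch):
    """Sum the loss over every augmentation T_j of every example in the batch."""
    total = 0.0
    for T in transforms:  # T_j: augmentation transforms
        total = total + loss_fn(model(T(x_batch)), y_batch)
    return total

# Hypothetical usage with simple tensor-level augmentations
transforms = [
    lambda x: x,                                # identity
    lambda x: torch.flip(x, dims=[-1]),         # flip along the feature axis
    lambda x: x + 0.05 * torch.randn_like(x),   # additive Gaussian noise
]
model = torch.nn.Linear(8, 2)
loss_fn = torch.nn.CrossEntropyLoss()
x_batch = torch.randn(16, 8)
y_batch = torch.randint(0, 2, (16,))
print(augmented_loss(model, loss_fn, transforms, x_batch, y_batch))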


Modern methods combine paradigms: GPT-4 uses unsupervised pre-training on text, supervised fine-tuning on tasks, and reinforcement learning from human feedback (RLHF).
Given: \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N\)
Learn: \(f: \mathcal{X} \to \mathcal{Y}\)
Minimize: \(\mathcal{L}(f(\mathbf{x}), y)\)
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Generate synthetic data
np.random.seed(42)
X = np.random.randn(1000, 10)
w_true = np.random.randn(10)
y = (X @ w_true + np.random.randn(1000)*0.1 > 0).astype(int)
# Standard supervised pipeline
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
Train accuracy: 0.990
Test accuracy: 0.995



Given: \(\mathcal{D} = \{\mathbf{x}_i\}_{i=1}^N\)
Find: Hidden patterns, structure, representations


Finding Structure in Data
Transform unsupervised → supervised by creating pretext tasks
# Example: Simple masked prediction
def create_masked_task(sequence, mask_prob=0.15):
    """Create self-supervised task from sequence"""
    masked = sequence.copy()
    labels = np.full_like(sequence, -1)
    mask_indices = np.random.random(len(sequence)) < mask_prob
    masked[mask_indices] = 0  # [MASK] token
    labels[mask_indices] = sequence[mask_indices]
    return masked, labels

# Example sequence
sequence = np.array([1, 4, 2, 8, 3, 7, 5, 9])
masked_input, targets = create_masked_task(sequence)
print(f"Original: {sequence}")
print(f"Masked: {masked_input}")
print(f"Targets: {targets}")
Original: [1 4 2 8 3 7 5 9]
Masked: [1 0 0 0 3 0 5 9]
Targets: [-1 4 2 8 -1 7 -1 -1]
Foundation Models and Self-Supervision
Self-supervised learning powers modern foundation models like GPT and BERT
Components:
Objective: Maximize expected cumulative reward \[J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]\]
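A small sketch of the quantity inside the expectation, the discounted return for one finite episode (the reward values and \(\gamma\) are made up for illustration):
# Discounted return G = sum_t gamma^t r_t for a single episode
gamma = 0.9
rewards = [1.0, 0.0, 0.0, 5.0]  # hypothetical episode rewards r_0..r_3

G = sum(gamma**t * r for t, r in enumerate(rewards))
print(f"Discounted return: {G:.3f}")  # 1 + 0 + 0 + 0.9^3 * 5 = 4.645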



Label efficiency on CIFAR-10 (target: 90% accuracy):
Transfer learning with self-supervised pretraining: 50× reduction in labeled data

Hybrid Learning Approaches
Modern approaches often combine paradigms for better performance

Multilayer Perceptron (MLP): Fully connected feedforward network
Defn: Deep Neural Network
A neural network with more than one hidden layer. Depth enables hierarchical feature learning: early layers learn simple features, deeper layers learn complex abstractions.
At neuron \(i\) in layer \(l\):
\[a_i^{(l)} = h\left(\left[\mathbf{w}_i^{(l)}\right]^\top \mathbf{a}^{(l-1)} + b_i^{(l)}\right)\]
where:

\[\mathbf{a}^{(l)} = h\left(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\right)\]
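In code, the layer equation is one matrix-vector product plus a nonlinearity per layer. A minimal numpy sketch with arbitrary layer sizes and ReLU as \(h\):
import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(0)
# Layer shapes: 4 inputs -> 5 hidden -> 3 outputs
W1, b1 = rng.standard_normal((5, 4)), np.zeros(5)
W2, b2 = rng.standard_normal((3, 5)), np.zeros(3)

a0 = rng.standard_normal(4)   # input a^(0) = x
a1 = relu(W1 @ a0 + b1)       # a^(1) = h(W^(1) a^(0) + b^(1))
a2 = relu(W2 @ a1 + b2)       # a^(2) = h(W^(2) a^(1) + b^(2))
print(a1.shape, a2.shape)     # (5,) (3,)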
A feedforward network with a single hidden layer, enough hidden units, and a suitable nonlinear activation
can approximate any continuous function on a compact subset of \(\mathbb{R}^n\) to arbitrary accuracy
Critical word: CAN
The theorem guarantees such networks exist. Finding them through training is different.
Preview
Detailed treatment later: approximation theory, width vs depth, practical training implications

Universal approximation guarantees this works, but:
Why depth matters: Practical networks need efficient representations

class Layer:
    def forward(self, x):
        # Store for backward pass
        self.x = x
        # Linear transformation
        self.z = np.dot(x, self.W) + self.b
        # Apply activation
        self.a = self.activation(self.z)
        return self.a

    def backward(self, grad_output):
        # Chain rule through activation
        grad_z = grad_output * self.activation_derivative(self.z)
        # Parameter gradients
        self.grad_W = np.dot(self.x.T, grad_z)
        self.grad_b = np.sum(grad_z, axis=0)
        # Input gradient for previous layer
        grad_input = np.dot(grad_z, self.W.T)
        return grad_input


In \(d\) dimensions with \(n\) parameters:
def sgd(w, grad, lr=0.01):
    return w - lr * grad

def sgd_momentum(w, grad, velocity, lr=0.01, beta=0.9):
    velocity = beta * velocity + lr * grad
    return w - velocity, velocity

def adam(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
“Dense networks contain sparse subnetworks that can train to comparable accuracy from the same initialization”
\[\text{Parameters: } 100M \to 10M\] \[\text{Performance: } 95\% \to 94.5\%\]
Why this matters:
Storage:
Speed:
Training:

Detailed treatment: Network pruning and efficient architectures
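A minimal sketch of the simplest pruning scheme, global magnitude pruning (the sparsity level and weight matrix below are illustrative): zero out the smallest-magnitude weights and keep a binary mask.
import numpy as np

def magnitude_prune(W, sparsity=0.9):
    """Zero out the smallest-magnitude entries of W, keeping roughly (1 - sparsity) of them."""
    k = int(sparsity * W.size)
    threshold = np.sort(np.abs(W), axis=None)[k]  # k-th smallest magnitude
    mask = np.abs(W) >= threshold
    return W * mask, mask

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64))
W_pruned, mask = magnitude_prune(W, sparsity=0.9)
print(f"Nonzero weights: {mask.sum()} / {W.size}")  # roughly 10% remain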





WARNING: Data Contamination
Never touch test data until final evaluation - validation data guides all decisions
Why does SGD find generalizing solutions?
Networks can memorize random labels perfectly,
yet SGD finds patterns when labels are real
Why does overparameterization help?
10× more parameters than samples should overfit,
but often improves test accuracy
What is the role of depth?
Shallow, wide networks have the same capacity,
but deep networks generalize better
How do transformers generalize?
No convolutions, no recurrence,
yet state-of-the-art on vision and language

Note: No single theory fully explains deep learning generalization. Active research area.
Setup (2018-2021):
Deployment (2021):
What happened:
Training data came from stable market. Deployment happened during rapid market shift. Model kept predicting pre-COVID prices.

The problem:
All your validation tools assume the future looks like the past. When the world changes, models trained on historical data fail.
Train/val/test all from 2018-2020: Model learns pre-COVID patterns
Deploy in 2021: COVID changed everything
Result: Model is wrong, but doesn't know it's wrong
This is not a rare edge case - markets shift, user behavior changes, new products emerge. Distribution shift is common.
Empirical validation:
Defensive engineering:
Course approach:
Many fundamental questions remain open research problems.


import numpy as np

class Conv2D:
    def __init__(self, in_channels, out_channels, kernel_size=3):
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        # Initialize filters
        self.filters = np.random.randn(
            out_channels, in_channels, kernel_size, kernel_size
        ) * 0.1
        self.bias = np.zeros(out_channels)

    def forward(self, x):
        batch, in_c, height, width = x.shape
        out_h = height - self.kernel_size + 1
        out_w = width - self.kernel_size + 1
        output = np.zeros((batch, self.out_channels, out_h, out_w))
        # Convolution operation
        for b in range(batch):
            for oc in range(self.out_channels):
                for h in range(out_h):
                    for w in range(out_w):
                        # Extract patch
                        patch = x[b, :, h:h+self.kernel_size, w:w+self.kernel_size]
                        # Convolve with filter
                        output[b, oc, h, w] = np.sum(patch * self.filters[oc]) + self.bias[oc]
        return output

class MaxPool2D:
    def __init__(self, pool_size=2):
        self.pool_size = pool_size

    def forward(self, x):
        batch, channels, height, width = x.shape
        out_h = height // self.pool_size
        out_w = width // self.pool_size
        output = np.zeros((batch, channels, out_h, out_w))
        for h in range(out_h):
            for w in range(out_w):
                h_start = h * self.pool_size
                w_start = w * self.pool_size
                pool_region = x[:, :, h_start:h_start+self.pool_size,
                                w_start:w_start+self.pool_size]
                output[:, :, h, w] = np.max(pool_region, axis=(2, 3))
        return output

# Example usage
x = np.random.randn(1, 3, 32, 32)  # Batch=1, RGB, 32x32
conv = Conv2D(3, 16, kernel_size=3)
pool = MaxPool2D(pool_size=2)
x = conv.forward(x)
print(f"After conv: {x.shape}")  # (1, 16, 30, 30)
x = np.maximum(0, x)  # ReLU
x = pool.forward(x)
print(f"After pool: {x.shape}")  # (1, 16, 15, 15)
After conv: (1, 16, 30, 30)
After pool: (1, 16, 15, 15)
ResNet-50 (2015):
MobileNetV2 (2018):
EfficientNet-B0 (2019):
Architecture design matters: EfficientNet achieves better accuracy than ResNet-50 with far fewer parameters.
\[\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \mathcal{L}(f_\theta(x_i), y_i)\]
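A minimal sketch of this update rule on synthetic linear-regression data (the model, batch size, and learning rate are illustrative): sample a minibatch \(B\), average its gradient, and take one step.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
w_true = rng.standard_normal(5)
y = X @ w_true + 0.1 * rng.standard_normal(1000)

theta = np.zeros(5)
eta, batch_size = 0.1, 32
for step in range(200):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # sample minibatch B
    Xb, yb = X[idx], y[idx]
    # Average gradient of the squared-error loss over the minibatch
    grad = -(2 / batch_size) * Xb.T @ (yb - Xb @ theta)
    theta = theta - eta * grad

print(np.round(theta - w_true, 3))  # close to zero after training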
1. Become one with the data: Look at your data. Plot it. Understand its distribution, outliers, patterns.
2. Set up an end-to-end pipeline: Get a simple model training before adding complexity.
3. Overfit a single batch: If you can’t overfit 10 examples, something is broken.
4. Verify loss at initialization: Check that the loss matches its expected value, e.g., \(\log(n_{\text{classes}})\) for classification (a quick check is sketched after the debugging code below).
5. Add complexity gradually: Start simple, add one thing at a time.
# The debugging progression
def debug_training():
    # Step 1: Overfit one example
    single_x = X[0:1]
    single_y = y[0:1]
    for _ in range(100):
        loss = train_step(single_x, single_y)
    assert loss < 0.01, "Can't overfit single"

    # Step 2: Overfit small batch
    batch_x = X[0:10]
    batch_y = y[0:10]
    for _ in range(500):
        loss = train_step(batch_x, batch_y)
    assert loss < 0.1, "Can't overfit batch"

    # Step 3: Check with real data
    # Only now move to full dataset
    return "Ready for full training"
Setup:
Deployment at Hospital B:
What went wrong:
Hospital A used one X-ray machine model with specific image characteristics. Hospital B used different equipment. Model learned machine artifacts, not disease patterns.
Example artifacts learned:

Shortcut learning: Standard debugging looked fine:
Problem only appeared on different hospital equipment. Models exploit spurious correlations (disease + specific machine) as shortcuts instead of learning actual medical patterns.

What this means for this course:
CPU (your laptop):
GPU (Colab/Kaggle free tier):
Multi-GPU (cloud):
Course approach: CPU is viable for most work. GPU accelerates but isn’t required.
# Tracking experiments
experiment_config = {
    'model': 'resnet18',
    'dataset': 'cifar10',
    'batch_size': 128,
    'lr': 0.1,
    'epochs': 100,
    'seed': 42,
    'timestamp': '2025-01-15-14:30'
}

# Always set seeds for reproducibility
def set_all_seeds(seed=42):
    np.random.seed(seed)
    # torch.manual_seed(seed)
    # torch.cuda.manual_seed_all(seed)
    # random.seed(seed)

# Log everything
def log_metrics(epoch, train_loss, val_loss, val_acc):
    metrics = {
        'epoch': epoch,
        'train_loss': train_loss,
        'val_loss': val_loss,
        'val_acc': val_acc,
        'lr': get_current_lr(),
        'timestamp': time.time()
    }
    # Write to file, tensorboard, wandb, etc.
    return metrics



What these scales cost:
GPT-2 (1.5B params, 2019):
GPT-3 (175B params, 2020):
PaLM (540B params, 2022):
Scale is not just about bigger numbers - it’s about fundamentally different resource requirements.

# System Python - Don't do this
pip install torch
# Error: requires numpy>=1.19
pip install numpy==1.20
# Breaks: opencv requires numpy==1.18

Conflicts are inevitable. Each project needs:

# Download Miniconda (minimal) or Anaconda (full)
# miniconda.anaconda.com
# After installation, verify:
conda --version
conda info
# Update conda itself
conda update -n base conda

Binary package management
Cross-platform
Channel system:
conda-forge: Community packages
pytorch: Official PyTorch builds
nvidia: CUDA toolkit
Environment files: environment.yml for reproducibility

# Activate your environment first
conda activate ee541
# Core scientific stack
conda install numpy scipy matplotlib pandas
# Jupyter for notebooks
conda install jupyter ipykernel
# Register kernel for Jupyter
python -m ipykernel install --user --name ee541 --display-name "Python (ee541)"
# PyTorch - SELECT BASED ON YOUR SYSTEM
# CPU only
conda install pytorch torchvision torchaudio cpuonly -c pytorch
# CUDA 11.8 (NVIDIA GPU)
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
# Mac M1/M2/M3 (Metal Performance Shaders)
conda install pytorch torchvision torchaudio -c pytorch
# Additional ML tools
conda install scikit-learn
conda install -c conda-forge tensorboard

GPU Configuration Check
nvidia-smi

# From terminal with environment active
(ee541) $ jupyter notebook
# Opens browser at localhost:8888
# Navigate to your work directory

Shift+Enter: Run cell, move to next
Ctrl+Enter: Run cell, stay
Esc: Command mode
Enter: Edit mode
A / B: Insert cell above/below

Version Control
import sys
import platform

print(f"Python: {sys.version}")
print(f"Platform: {platform.platform()}")

packages = {
    'numpy': None,
    'torch': None,
    'torchvision': None,
    'matplotlib': None,
    'jupyter': None,
    'sklearn': 'scikit-learn'
}

for import_name, pip_name in packages.items():
    try:
        module = __import__(import_name)
        version = getattr(module, '__version__', 'installed')
        print(f"✓ {import_name}: {version}")
        # Special check for PyTorch GPU
        if import_name == 'torch':
            import torch
            if torch.cuda.is_available():
                print(f"  GPU: CUDA ({torch.version.cuda})")
                print(f"  Device: {torch.cuda.get_device_name(0)}")
            elif torch.backends.mps.is_available():
                print(f"  GPU: MPS (Mac)")
            else:
                print(f"  GPU: Not available")
    except ImportError:
        package = pip_name or import_name
        print(f"✗ {import_name}: Not installed")
        print(f"  Install with: conda install {package}")

Python: 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:49:36) [Clang 16.0.6 ]
Platform: macOS-26.2-arm64-arm-64bit
✓ numpy: 1.26.4
✓ torch: 2.5.1
GPU: MPS (Mac)
✓ torchvision: 0.20.1
✓ matplotlib: 3.10.6
✓ jupyter: installed
✓ sklearn: 1.6.1
Task: Classify clothing items into 10 categories
Architecture: Simple 2-layer network
Training: 4 epochs, Adam optimizer
Files:
1-fashion-mnist.ipynb: Dataset exploration
2-minimal-pytorch.ipynb: Core training loop
3-feature-visualization.ipynb: TensorBoard monitoring
T-shirt

Trouser

Pullover

Dress

Coat

Sandal

Shirt

Sneaker

Bag

Boot
# Minimal.ipynb - Key components

# 1. Data Loading
train_loader = DataLoader(train_set, batch_size=100, shuffle=True)
test_loader = DataLoader(test_set, batch_size=100, shuffle=False)

# 2. Model Definition
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.hidden = nn.Linear(784, 128)
        self.output = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.hidden(x))
        return self.output(x)

# 3. Training Loop
for epoch in range(num_epochs):
    for images, labels in train_loader:
        # Forward pass
        outputs = model(images.view(-1, 784))
        loss = loss_func(outputs, labels)
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Training Performance Benchmarks
Complete training in ~5 minutes on CPU, <1 minute on GPU

Observations:

Embedding Insight: Classes form distinct clusters in feature space


Common Confusions: Shirt ↔︎ T-shirt, Pullover ↔︎ Coat
Python and NumPy for Neural Networks
Next week: Array operations and automatic differentiation