Computational Deep Learning

EE 541 - Unit 1

Dr. Brandon Franzke

Spring 2025

Deep Learning in Action

Neural Network Learning Visualization

<p style="font-size: 14px; color: #666;">
    Click play to watch the neural network learn to classify the data. 
    Use the controls to pause, step through frames, or replay.
</p>
<input id="_anim_slider7192d3442ea948dda1e1d96c2f20dd23" type="range" class="anim-slider"
       name="points" min="0" max="1" step="1" value="0"
       oninput="anim7192d3442ea948dda1e1d96c2f20dd23.set_frame(parseInt(this.value));">
<div class="anim-buttons">
  <button title="Decrease speed" aria-label="Decrease speed" onclick="anim7192d3442ea948dda1e1d96c2f20dd23.slower()">
      <i class="fa fa-minus"></i></button>
  <button title="First frame" aria-label="First frame" onclick="anim7192d3442ea948dda1e1d96c2f20dd23.first_frame()">
    <i class="fa fa-fast-backward"></i></button>
  <button title="Previous frame" aria-label="Previous frame" onclick="anim7192d3442ea948dda1e1d96c2f20dd23.previous_frame()">
      <i class="fa fa-step-backward"></i></button>
  <button title="Play backwards" aria-label="Play backwards" onclick="anim7192d3442ea948dda1e1d96c2f20dd23.reverse_animation()">
      <i class="fa fa-play fa-flip-horizontal"></i></button>
  <button title="Pause" aria-label="Pause" onclick="anim7192d3442ea948dda1e1d96c2f20dd23.pause_animation()">
      <i class="fa fa-pause"></i></button>
  <button title="Play" aria-label="Play" onclick="anim7192d3442ea948dda1e1d96c2f20dd23.play_animation()">
      <i class="fa fa-play"></i></button>
  <button title="Next frame" aria-label="Next frame" onclick="anim7192d3442ea948dda1e1d96c2f20dd23.next_frame()">
      <i class="fa fa-step-forward"></i></button>
  <button title="Last frame" aria-label="Last frame" onclick="anim7192d3442ea948dda1e1d96c2f20dd23.last_frame()">
      <i class="fa fa-fast-forward"></i></button>
  <button title="Increase speed" aria-label="Increase speed" onclick="anim7192d3442ea948dda1e1d96c2f20dd23.faster()">
      <i class="fa fa-plus"></i></button>
</div>
<form title="Repetition mode" aria-label="Repetition mode" action="#n" name="_anim_loop_select7192d3442ea948dda1e1d96c2f20dd23"
      class="anim-state">
  <input type="radio" name="state" value="once" id="_anim_radio1_7192d3442ea948dda1e1d96c2f20dd23"
         checked>
  <label for="_anim_radio1_7192d3442ea948dda1e1d96c2f20dd23">Once</label>
  <input type="radio" name="state" value="loop" id="_anim_radio2_7192d3442ea948dda1e1d96c2f20dd23"
         >
  <label for="_anim_radio2_7192d3442ea948dda1e1d96c2f20dd23">Loop</label>
  <input type="radio" name="state" value="reflect" id="_anim_radio3_7192d3442ea948dda1e1d96c2f20dd23"
         >
  <label for="_anim_radio3_7192d3442ea948dda1e1d96c2f20dd23">Reflect</label>
</form>

What is Machine Learning?

What is Machine Learning?

Machine Learning: A Formal Framework

Herbert Simon (1983)

“Learning is any process by which a system improves performance from experience.”

Tom Mitchell (1997)

“A computer program is said to learn from experience E with respect to task T and performance measure P, if its performance at T, as measured by P, improves with experience E.”

Our Framework

\[\text{Learning} = \{\mathcal{T}, \mathcal{P}, \mathcal{E}\}\]

Example: Email Spam Filter

  • Task (\(\mathcal{T}\)): Classify emails as spam/not spam
  • Performance (\(\mathcal{P}\)): % correctly classified
  • Experience (\(\mathcal{E}\)): Database of labeled emails

Example: Self-Driving Car

  • \(\mathcal{T}\): Navigate roads safely
  • \(\mathcal{P}\): Miles without intervention
  • \(\mathcal{E}\): Hours of human driving data

Generalization is the Goal of Machine Learning

  • Do not care about performance on the dataset we have
  • Do care about performance on similar data that has no labels
  • Accuracy/Generalization trade-off (bias-variance trade):
    • Optimizing accuracy to the extreme reduces capability to generalize

The Paradigm Shift: From Rules to Learning

The Fundamental Shift

Classical: Theory-Driven

Modern: Data-Driven

Course Philosophy: Models and Parsimony

George Box (1976)

“All models are wrong, but some are useful”

“Since all models are wrong the scientist cannot obtain a ‘correct’ one by excessive elaboration”

Occam’s Razor in ML

Seek economical descriptions of phenomena

Example: MNIST Classification

  • Nearest neighbor: 3% error, \(\mathcal{O}(n)\) inference
  • Linear classifier: 8% error, \(\mathcal{O}(d)\) inference
  • 2-layer network: 2% error, 50K parameters
  • ConvNet (LeNet-5): 0.8% error, 60K parameters
  • ResNet-50: 0.2% error, 25M parameters

Where do you stop?

Worrying Selectively

It is inappropriate to be concerned about mice when there are tigers abroad

  • Start simple
  • Add complexity purposefully
  • Validate empirically

Detection, Estimation, Regression

Course Progression: Building Complexity

Outline

  1. Learning Fundamentals
    • Function approximation and hypothesis classes
    • Bias-variance tradeoff
  2. Data and Learning Paradigms
    • Supervised, unsupervised, reinforcement
    • Self-supervised methods
  3. Neural Architecture
    • From perceptron to deep networks
    • Universal approximation
    • Loss landscapes and gradient descent
    • Why networks generalize
  4. Implementation
    • Python/PyTorch setup
    • Fashion-MNIST: Two-layer network

Linear Classification: Success and Failure

Two-Moons Dataset

Tests whether a model can learn curved decision boundaries. Two interleaving half-circles that cannot be separated by any straight line.

Nonlinear Classification: Neural Network Solution

Code
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

mlp = MLPClassifier(
    hidden_layer_sizes=(10, 10), 
    max_iter=1000, 
    random_state=42
)
mlp.fit(X_train, y_train)

print(f"Training accuracy: {mlp.score(X_train, y_train):.3f}")
print(f"Test accuracy: {mlp.score(X_test, y_test):.3f}")
Training accuracy: 0.979
Test accuracy: 0.950

Learning Fundamentals

The Learning Problem

Given

  • Training data: \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N\)
  • Hypothesis class: \(\mathcal{H}\)
  • Loss function: \(\mathcal{L}\)

Goal

Find \(h^* \in \mathcal{H}\) that minimizes:

\[\mathbb{E}_{(\mathbf{x},y) \sim P}[\mathcal{L}(h(\mathbf{x}), y)]\]

But we only have access to:

\[\frac{1}{N}\sum_{i=1}^N \mathcal{L}(h(\mathbf{x}_i), y_i)\]

The Generalization Gap

Minimize error on unseen data
using only observed samples

This gap is the generalization problem

Example Task: “2s” Detector

Example: Digit Recognition

Input Space

  • Raw pixels: \(\mathbf{x} \in \{0,255\}^{784}\)
  • Normalized: \(\mathbf{x} \in [0,1]^{784}\)
  • Binary: \(\mathbf{x} \in \{0,1\}^{784}\)

Output Space

  • Classification: \(y \in \{0,1,...,9\}\)
  • One-hot: \(\mathbf{y} \in \{0,1\}^{10}\)
  • Probability: \(\mathbf{y} \in [0,1]^{10}\)

Representation Matters

Representation Determines Learnability

The choice of representation can make learning tractable or impossible. Deep learning learns representations automatically.

Example: The Data Domain

GOOD

GOOD

BAD

?

The choice of how to represent input is very important

Can we classify the unknown pattern?

Many Ways to Represent the Same Data

Binary Representation

x = 0111111011100100000010000
    001011111111101111001110

Label: “GOOD”

Key Insight

The same pattern can be represented as:

  • Raw pixels
  • Binary vectors (length d = 49)
  • Feature vectors
  • Learned representations

A hypothesis class can succeed or fail based on the choice of representation.

Data Representation and Hypothesis Class

Representation: Binary vectors, length \(d = 49\)

\[\mathbf{x} = \begin{bmatrix} 0111111011100100000010000 \\ 001011111111101111001110 \end{bmatrix}\]

\[y \in \{-1, +1\}\]

Hypothesize mapping data to label using linear classifier:

\[\hat{y} = \text{sign}(\boldsymbol{\theta} \cdot \mathbf{x}) = \text{sign}(\theta_1 x_1 + \cdots + \theta_{49} x_{49})\]

Definition: Linear Function

A function \(f: \mathbb{R}^d \rightarrow \mathbb{R}\) is linear if \(f(\mathbf{x}) = \boldsymbol{\theta}^\top \mathbf{x} + b\) for some \(\boldsymbol{\theta} \in \mathbb{R}^d\) and \(b \in \mathbb{R}\).
The decision boundary \(\{\mathbf{x} : f(\mathbf{x}) = 0\}\) is a hyperplane.

where:

  • \(\boldsymbol{\theta}\): Parameters to learn (weight vector)
  • \(\hat{y}\): Predicted label (sign of linear combination)

Hypothesis Classes: Linear vs Nonlinear

\[\mathcal{H}_{\text{linear}}: h(\mathbf{x}) = \text{sign}(\mathbf{w}^T\mathbf{x} + b)\] \[\mathcal{H}_{\text{neural}}: h(\mathbf{x}) = h_2(\mathbf{W}_2 \cdot h_1(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2)\]

The Perceptron: Simplest Neural Unit

Mathematical Model

\[y = h\left(\sum_{i=1}^n w_i x_i + b\right) = h(\mathbf{w}^T\mathbf{x} + b)\]

where \(h\) is an activation function:

  • Step: \(h(z) = \begin{cases} 1 & z \geq 0 \\ 0 & z < 0 \end{cases}\)
  • Sigmoid: \(h(z) = \frac{1}{1 + e^{-z}}\)
  • ReLU: \(h(z) = \max(0, z)\)

Two Learning Paradigms

Explicit (Closed-form)

\[\mathbf{w}^* = \arg\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2\]

Solution: \[\mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]

  • One-shot computation
  • Requires matrix inversion
  • Memory intensive for large data

Iterative (Gradient-based)

\[\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla_{\mathbf{w}}\mathcal{L}(\mathbf{w}_t)\]

def sgd_step(w, x, y, learning_rate=0.01):
    prediction = np.dot(w, x)
    error = prediction - y
    gradient = error * x
    w_new = w - learning_rate * gradient
    return w_new
  • Sequential updates
  • Scales to large datasets
  • Foundation of deep learning

Gradient Descent Visualization

Iterative Optimization Principle

Gradient descent navigates the loss landscape by repeatedly moving in the direction of steepest descent. For convex problems, this guarantees convergence to the global minimum. For neural networks, we settle for local minima that generalize well.

The Bias-Variance Tradeoff

\[\text{MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\]

EE 541 Core Principles

Theoretical Framework

  1. Learning = Function Approximation
    • From data to predictions
    • Hypothesis class defines possibilities
  2. Representation Matters
    • Same data, different encodings
    • Deep learning learns representations
  3. Generalization is the Goal
    • Not memorization
    • Balance complexity with data

Implementation Strategy

  1. Start Simple
    • Linear models as baselines
    • Add complexity purposefully
  2. Iterate and Validate
    • Gradient descent scales
    • Monitor train vs test error
  3. EE 541 Progression
    • MMSE → Regression → Neural Nets
    • Theory + PyTorch implementation

The Data Story

Data: The Critical Resource

Clive Humby (2006)

“Data is the new oil”

But like oil, it must be refined to have value

The Data Value Equation: \[\text{Model Performance} = f(\text{Data Quality}, \text{Data Quantity})\]

Processing Multiplies Value

  • Raw data: 10% usable
  • Cleaned data: 40% usable
  • Curated data: 90% usable

Representation: The Hidden Multiplier

The Power of Data Representation

The right representation can transform an impossible problem into a trivial one

The Curse of Dimensionality

Mathematical Reality

As \(d \to \infty\):

  • All points become equidistant
  • Volume concentrates at surface
  • Gaussian looks like uniform
  • Nearest neighbors aren’t “near”
Code
import numpy as np

def volume_ratio(d, epsilon=0.95):
    """Fraction of hypercube volume in outer shell"""
    return 1 - epsilon**d

dimensions = [1, 2, 3, 10, 100, 1000]
for d in dimensions:
    ratio = volume_ratio(d)
    print(f"d={d:4}: {ratio:.6f} in outer shell")
d=   1: 0.050000 in outer shell
d=   2: 0.097500 in outer shell
d=   3: 0.142625 in outer shell
d=  10: 0.401263 in outer shell
d= 100: 0.994079 in outer shell
d=1000: 1.000000 in outer shell

Label Noise Degrades Performance More Than Limited Data

Data Augmentation: Creating Synthetic Data

Standard Augmentations

  • Geometric: Rotation, flip, crop, scale
  • Photometric: Brightness, contrast, color
  • Noise: Gaussian, dropout, cutout
  • Advanced: Mixup, CutMix, AutoAugment

Mathematical View

Training on augmented data: \[\min_\theta \sum_{i=1}^N \sum_{j=1}^M \mathcal{L}(f_\theta(T_j(x_i)), y_i)\]

where \(T_j\) are augmentation transforms

Core Learning Paradigms

Three Paradigms: Supervised, Unsupervised, Reinforcement

Modern methods combine paradigms: GPT-4 uses unsupervised pre-training on text, supervised fine-tuning on tasks, and reinforcement learning from human feedback (RLHF).

Supervised Learning: The Foundation

Problem Formulation

Given: \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N\)

Learn: \(f: \mathcal{X} \to \mathcal{Y}\)

Minimize: \(\mathcal{L}(f(\mathbf{x}), y)\)

Core Tasks

  • Classification: \(y \in \{1, ..., C\}\)
  • Regression: \(y \in \mathbb{R}^d\)
  • Structured Prediction: \(y \in \mathcal{Y}_{\text{complex}}\)

Modern Applications

  • Medical diagnosis from images
  • Speech recognition
  • Machine translation
  • Time series forecasting
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Generate synthetic data
np.random.seed(42)
X = np.random.randn(1000, 10)
w_true = np.random.randn(10)
y = (X @ w_true + np.random.randn(1000)*0.1 > 0).astype(int)

# Standard supervised pipeline
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
Train accuracy: 0.990
Test accuracy: 0.995

Supervised Learning: Example Task

Linear Regression Task

Training Phase

Training Phase

Inference Phase

Inference Phase

Unsupervised Learning: Finding Structure

No Labels, Just Data

Given: \(\mathcal{D} = \{\mathbf{x}_i\}_{i=1}^N\)

Find: Hidden patterns, structure, representations

Key Methods

  • Clustering: K-means, DBSCAN, hierarchical
  • Dimensionality Reduction: PCA, t-SNE, UMAP
  • Density Estimation: GMM, KDE
  • Representation Learning: Autoencoders

Unsupervised Learning System

Unsupervised Learning: Finding Hidden Structure

Finding Structure in Data

Self-Supervised Learning: The Revolution

Creating Supervision from Data

Transform unsupervised → supervised by creating pretext tasks

Key Innovations

  • Language Models: Predict next word (GPT)
  • Masked Modeling: Predict masked parts (BERT)
  • Contrastive Learning: Similar/different pairs (SimCLR)

Why It Works

  • Unlimited labeled data (self-generated)
  • Learns general representations
  • Transfer learning to downstream tasks
Code
# Example: Simple masked prediction
def create_masked_task(sequence, mask_prob=0.15):
    """Create self-supervised task from sequence"""
    masked = sequence.copy()
    labels = np.full_like(sequence, -1)
    
    mask_indices = np.random.random(len(sequence)) < mask_prob
    masked[mask_indices] = 0  # [MASK] token
    labels[mask_indices] = sequence[mask_indices]
    
    return masked, labels

# Example sequence
sequence = np.array([1, 4, 2, 8, 3, 7, 5, 9])
masked_input, targets = create_masked_task(sequence)

print(f"Original: {sequence}")
print(f"Masked:   {masked_input}")
print(f"Targets:  {targets}")
Original: [1 4 2 8 3 7 5 9]
Masked:   [1 0 0 0 3 0 5 9]
Targets:  [-1  4  2  8 -1  7 -1 -1]

Foundation Models and Self-Supervision

Self-supervised learning powers modern foundation models like GPT and BERT

Reinforcement Learning: Learning by Doing

Sequential Decision Making

Components:

  • State space: \(\mathcal{S}\)
  • Action space: \(\mathcal{A}\)
  • Reward function: \(R(s, a)\)
  • Policy: \(\pi(a|s)\)

Objective: Maximize expected cumulative reward \[J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]\]

Applications

  • Game playing (Chess, Go, StarCraft)
  • Robotics control
  • Resource allocation
  • Trading strategies

Comparing Paradigms: Same Problem, Three Approaches

Modern Methods Combine Paradigms

Semi-Supervised Learning

  • Use small labeled + large unlabeled data
  • Pseudo-labeling, consistency regularization
  • Example: FixMatch, MixMatch

Multi-Task Learning

  • Learn multiple related tasks simultaneously
  • Shared representations
  • Example: BERT for multiple NLP tasks

Meta-Learning

  • Learn to learn
  • Few-shot adaptation
  • Example: MAML, Prototypical Networks

Transfer Learning Pipeline

Hybrid Learning Approaches

Modern approaches often combine paradigms for better performance

Neural Architecture

Neural Networks Form a Rich Hypothesis Class

Multilayer Perceptron (MLP): Fully connected feedforward network

Architecture Components

  • Input layer: Raw features \(\mathbf{x} \in \mathbb{R}^d\)
  • Hidden layers: Learned representations
  • Output layer: Task-specific predictions
  • Connections: All-to-all between layers

Why “Rich” Hypothesis Class?

  • Each neuron: Nonlinear transformation
  • Composition: Exponential expressivity
  • Universal approximation capability

Defn: Deep Neural Network

A neural network with more than one hidden layer. Depth enables hierarchical feature learning: early layers learn simple features, deeper layers learn complex abstractions.

Single Neuron: Mathematical Detail

Forward Computation

At neuron \(i\) in layer \(l\):

\[a_i^{(l)} = h\left(\left[\mathbf{w}_i^{(l)}\right]^\top \mathbf{a}^{(l-1)} + b_i^{(l)}\right)\]

where:

  • \(\mathbf{a}^{(l-1)}\): Previous layer activations
  • \(\mathbf{w}_i^{(l)}\): Weight vector for neuron \(i\)
  • \(b_i^{(l)}\): Bias term
  • \(h(\cdot)\): Activation function

Matrix Form (Entire Layer)

\[\mathbf{a}^{(l)} = h\left(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\right)\]

  • Parallelizes computation
  • Enables GPU acceleration

Universal Approximation Theorem

Cybenko (1989), Hornik et al. (1989)

A feedforward network with:

  • Single hidden layer
  • Finite number of neurons
  • Non-polynomial activation

Can approximate any continuous function on compact subset of \(\mathbb{R}^n\) to arbitrary accuracy

Implications

  • Width vs Depth trade-off
  • Exponential width may be needed
  • Deep networks are more efficient

Activation Functions

Forward and Backward Pass

Implementation

class Layer:
    def forward(self, x):
        # Store for backward pass
        self.x = x
        # Linear transformation
        self.z = np.dot(x, self.W) + self.b
        # Apply activation
        self.a = self.activation(self.z)
        return self.a
    
    def backward(self, grad_output):
        # Chain rule through activation
        grad_z = grad_output * \
                 self.activation_derivative(self.z)
        
        # Parameter gradients
        self.grad_W = np.dot(self.x.T, grad_z)
        self.grad_b = np.sum(grad_z, axis=0)
        
        # Input gradient for previous layer
        grad_input = np.dot(grad_z, self.W.T)
        return grad_input

Computational Graph

Network Capacity and Depth

The Optimization Landscape

Loss Surfaces in High Dimensions

Mathematical Reality

In \(d\) dimensions with \(n\) parameters:

  • Critical points: \(\mathcal{O}(e^n)\)
  • Most are saddle points, not local minima
  • Minima often connected by low-loss paths

Empirical Observations

  • Loss landscapes are surprisingly well-behaved
  • Wide networks have smoother landscapes
  • Overparameterization helps optimization
  • Mode connectivity phenomenon

Stochastic Gradient Descent and Variants

Code
def sgd(w, grad, lr=0.01):
    return w - lr * grad

def sgd_momentum(w, grad, velocity, lr=0.01, beta=0.9):
    velocity = beta * velocity + lr * grad
    return w - velocity, velocity

def adam(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

The Lottery Ticket Hypothesis

Frankle & Carbin (2019)

“Dense networks contain sparse subnetworks that can train to comparable accuracy from the same initialization”

Implications

  • Networks are vastly overparameterized
  • Winning tickets exist at initialization
  • Pruning can maintain performance
  • Structure matters more than we thought

Practical Impact

\[\text{Parameters: } 100M \to 10M\] \[\text{Performance: } 95\% \to 94.5\%\]

Why Neural Networks Train: Modern Understanding

Generalization: The Central Mystery

The Fundamental Puzzle

Inductive Biases: Architecture as Prior

The Validation Game

WARNING: Data Contamination

Never touch test data until final evaluation - validation data guides all decisions

Generalization Remains Partially Unexplained

What We Don’t Understand

Why does SGD find generalizing solutions?
Networks can memorize random labels perfectly,
yet SGD finds patterns when labels are real

Why does overparameterization help?
10x more parameters than samples should overfit,
but often improves test accuracy

What is the role of depth?
Shallow wide networks have same capacity,
but deep networks generalize better

How do transformers generalize?
No convolutions, no recurrence,
yet state-of-the-art on vision and language

Modern Architectures in Action

Convolutional Networks: Exploiting Spatial Structure

Beyond CNNs: A Glimpse of Modern Architectures

Building a Simple CNN

import numpy as np

class Conv2D:
    def __init__(self, in_channels, out_channels, kernel_size=3):
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        
        # Initialize filters
        self.filters = np.random.randn(
            out_channels, in_channels, kernel_size, kernel_size
        ) * 0.1
        self.bias = np.zeros(out_channels)
    
    def forward(self, x):
        batch, in_c, height, width = x.shape
        out_h = height - self.kernel_size + 1
        out_w = width - self.kernel_size + 1
        
        output = np.zeros((batch, self.out_channels, out_h, out_w))
        
        # Convolution operation
        for b in range(batch):
            for oc in range(self.out_channels):
                for h in range(out_h):
                    for w in range(out_w):
                        # Extract patch
                        patch = x[b, :, h:h+self.kernel_size, w:w+self.kernel_size]
                        # Convolve with filter
                        output[b, oc, h, w] = np.sum(patch * self.filters[oc]) + self.bias[oc]
        
        return output

class MaxPool2D:
    def __init__(self, pool_size=2):
        self.pool_size = pool_size
    
    def forward(self, x):
        batch, channels, height, width = x.shape
        out_h = height // self.pool_size
        out_w = width // self.pool_size
        
        output = np.zeros((batch, channels, out_h, out_w))
        
        for h in range(out_h):
            for w in range(out_w):
                h_start = h * self.pool_size
                w_start = w * self.pool_size
                pool_region = x[:, :, h_start:h_start+self.pool_size, 
                              w_start:w_start+self.pool_size]
                output[:, :, h, w] = np.max(pool_region, axis=(2, 3))
        
        return output

# Example usage
x = np.random.randn(1, 3, 32, 32)  # Batch=1, RGB, 32x32
conv = Conv2D(3, 16, kernel_size=3)
pool = MaxPool2D(pool_size=2)

x = conv.forward(x)
print(f"After conv: {x.shape}")  # (1, 16, 30, 30)
x = np.maximum(0, x)  # ReLU
x = pool.forward(x)
print(f"After pool: {x.shape}")  # (1, 16, 15, 15)
After conv: (1, 16, 30, 30)
After pool: (1, 16, 15, 15)

What Makes Architectures Succeed?

The Practice of Deep Learning

A Recipe for Training Neural Networks

The Training Loop

\[\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \mathcal{L}(f_\theta(x_i), y_i)\]

Karpathy’s Principles

1. Become one with the data Look at your data. Plot it. Understand its distribution, outliers, patterns.

2. Set up end-to-end pipeline Get a simple model training before complexity.

3. Overfit a single batch If you can’t overfit 10 examples, something is broken.

4. Verify loss at initialization Check loss matches expected value (e.g., \(\log(n_{classes})\) for classification).

5. Add complexity gradually Start simple, add one thing at a time.

# The debugging progression
def debug_training():
    # Step 1: Overfit one example
    single_x = X[0:1]
    single_y = y[0:1]
    for _ in range(100):
        loss = train_step(single_x, single_y)
    assert loss < 0.01, "Can't overfit single"
    
    # Step 2: Overfit small batch  
    batch_x = X[0:10]
    batch_y = y[0:10]
    for _ in range(500):
        loss = train_step(batch_x, batch_y)
    assert loss < 0.1, "Can't overfit batch"
    
    # Step 3: Check with real data
    # Only now move to full dataset
    return "Ready for full training"

When Things Go Wrong: Debugging Strategies

Computational Realities

Systematic Experimentation

Code
# Tracking experiments
experiment_config = {
    'model': 'resnet18',
    'dataset': 'cifar10',
    'batch_size': 128,
    'lr': 0.1,
    'epochs': 100,
    'seed': 42,
    'timestamp': '2025-01-15-14:30'
}

# Always set seeds for reproducibility
def set_all_seeds(seed=42):
    np.random.seed(seed)
    # torch.manual_seed(seed)
    # torch.cuda.manual_seed_all(seed)
    # random.seed(seed)
    
# Log everything
def log_metrics(epoch, train_loss, val_loss, val_acc):
    metrics = {
        'epoch': epoch,
        'train_loss': train_loss,
        'val_loss': val_loss,
        'val_acc': val_acc,
        'lr': get_current_lr(),
        'timestamp': time.time()
    }
    # Write to file, tensorboard, wandb, etc.
    return metrics

The Importance of Baselines

Current Frontiers & Course Roadmap

Course Architecture: Two Perspectives

Scale: The Emergence Phenomenon

Model Compression and Acceleration

Python Environment Setup

Environment Management: Why It Matters

The Problem

# System Python - Don't do this
python install torch
# Error: requires numpy>=1.19
pip install numpy==1.20
# Breaks: opencv requires numpy==1.18

Dependency conflicts are inevitable

Each project needs:

  • Specific Python version
  • Specific package versions
  • Isolated from system Python

The Solution: Virtual Environments

Conda: Package and Environment Management

Installation

# Download Miniconda (minimal) or Anaconda (full)
# miniconda.anaconda.com

# After installation, verify:
conda --version
conda info

# Update conda itself
conda update -n base conda

Create Course Environment

# Create environment with Python 3.11
conda create -n ee541 python=3.11

# Activate environment
conda activate ee541

# Your prompt changes:
(ee541) $ 

# Deactivate when done
conda deactivate

Why Conda for Deep Learning

Binary package management

  • Precompiled CUDA libraries
  • Optimized BLAS/LAPACK
  • No compilation required

Cross-platform

  • Same commands on Windows/Mac/Linux
  • Handles system dependencies

Channel system - conda-forge: Community packages - pytorch: Official PyTorch builds - nvidia: CUDA toolkit

Environment files

  • Share exact environment
  • environment.yml for reproducibility

Essential Package Installation

# Activate your environment first
conda activate ee541

# Core scientific stack
conda install numpy scipy matplotlib pandas

# Jupyter for notebooks
conda install jupyter ipykernel

# Register kernel for Jupyter
python -m ipykernel install --user --name ee541 --display-name "Python (ee541)"

# PyTorch - SELECT BASED ON YOUR SYSTEM
# CPU only
conda install pytorch torchvision torchaudio cpuonly -c pytorch

# CUDA 11.8 (NVIDIA GPU)
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# Mac M1/M2/M3 (Metal Performance Shaders)
conda install pytorch torchvision torchaudio -c pytorch

# Additional ML tools
conda install scikit-learn
conda install -c conda-forge tensorboard

GPU Configuration Check

  • NVIDIA: Check CUDA version with nvidia-smi
  • AMD: ROCm support limited, use CPU fallback
  • Mac (MPS): Automatic detection in PyTorch 1.12+
# Standard device selection
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Using device: {device}")

Jupyter Notebooks: Interactive Development

Starting Jupyter

# From terminal with environment active
(ee541) $ jupyter notebook

# Opens browser at localhost:8888
# Navigate to your work directory

Notebook Structure

  • Cells: Code or Markdown
  • Kernel: Python process executing code
  • State: Variables persist between cells

Key Shortcuts

  • Shift+Enter: Run cell, move to next
  • Ctrl+Enter: Run cell, stay
  • Esc: Command mode
  • Enter: Edit mode
  • A/B: Insert cell above/below

Jupyter Notebook Interface

Real-World Example: Trading Algorithm

Jupyter Project Example

Project Management

Version Control

git init
git add .
git commit -m "Initial commit"

# But exclude:
# .gitignore contents:
*.pyc
__pycache__/
.ipynb_checkpoints/
data/
*.pt
*.pth

Reproducibility

# Save environment
conda env export > environment.yml

# Recreate elsewhere
conda env create -f environment.yml

# Save pip requirements
pip freeze > requirements.txt

Verifying Your Deep Learning Environment

import sys
import platform

print(f"Python: {sys.version}")
print(f"Platform: {platform.platform()}")

packages = {
    'numpy': None,
    'torch': None,
    'torchvision': None,
    'matplotlib': None,
    'jupyter': None,
    'sklearn': 'scikit-learn'
}

for import_name, pip_name in packages.items():
    try:
        module = __import__(import_name)
        version = getattr(module, '__version__', 'installed')
        print(f"✓ {import_name}: {version}")
        
        # Special check for PyTorch GPU
        if import_name == 'torch':
            import torch
            if torch.cuda.is_available():
                print(f"  GPU: CUDA ({torch.version.cuda})")
                print(f"  Device: {torch.cuda.get_device_name(0)}")
            elif torch.backends.mps.is_available():
                print(f"  GPU: MPS (Mac)")
            else:
                print(f"  GPU: Not available")
                
    except ImportError:
        package = pip_name or import_name
        print(f"✗ {import_name}: Not installed")
        print(f"  Install with: conda install {package}")
Python: 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:49:36) [Clang 16.0.6 ]
Platform: macOS-15.6-arm64-arm-64bit
✓ numpy: 1.26.4
✓ torch: 2.5.1
  GPU: MPS (Mac)
✓ torchvision: 0.20.1
✓ matplotlib: 3.10.1
✓ jupyter: installed
✓ sklearn: 1.6.1

Environment Management Commands

Conda Essentials

# List environments
conda env list

# Create from file
conda env create -f environment.yml

# Clone environment
conda create --name ee541_backup --clone ee541

# Remove environment
conda env remove -n ee541

# Update all packages
conda update --all

# Clean cache (free space)
conda clean --all

Package Management

# Search for package
conda search pytorch

# Install specific version
conda install pytorch=2.0.1

# List installed packages
conda list

# Check for updates
conda update --dry-run --all

# Channel priority
conda config --add channels conda-forge
conda config --set channel_priority strict

PyTorch Demo

Fashion-MNIST: 87% Accuracy in Four Epochs

What We’re Building

Task: Classify clothing items into 10 categories

Architecture: Simple 2-layer network

  • Input: 28×28 grayscale images (784 pixels)
  • Hidden: 128 neurons with ReLU
  • Output: 10 classes (softmax)

Training: 4 epochs, Adam optimizer

Files:

  • Minimal.ipynb: Core training loop
  • Visualize-tensorboard.ipynb: Monitoring


T-shirt


Trouser


Pullover


Dress


Coat


Sandal


Shirt


Sneaker


Bag


Boot

Core Training Loop Structure

# Minimal.ipynb - Key components

# 1. Data Loading
train_loader = DataLoader(train_set, batch_size=100, shuffle=True)
test_loader = DataLoader(test_set, batch_size=100, shuffle=False)

# 2. Model Definition
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.hidden = nn.Linear(784, 128)
        self.output = nn.Linear(128, 10)
    
    def forward(self, x):
        x = F.relu(self.hidden(x))
        return self.output(x)

# 3. Training Loop
for epoch in range(num_epochs):
    for images, labels in train_loader:
        # Forward pass
        outputs = model(images.view(-1, 784))
        loss = loss_func(outputs, labels)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Training Performance Benchmarks

Complete training in ~5 minutes on CPU, <1 minute on GPU

Training Dynamics

Observations:

  • Rapid initial learning (first epoch)
  • Diminishing returns with more training
  • Test accuracy plateaus around 87% for this simple model

TensorBoard Visualization

Launch TensorBoard

# From terminal
tensorboard --logdir runs
# Navigate to http://localhost:6006

Available Visualizations

  • Scalars: Loss, accuracy over time
  • Images: Sample predictions
  • Graph: Network architecture
  • Embeddings: Feature space (PCA/t-SNE)
  • Histograms: Weight distributions

Embedding Insight: Classes form distinct clusters in feature space

Model Architecture Inspection

87% Accuracy Achieved with Simple Architecture

Implementation Results

  • Model: 2-layer MLP with 128 hidden units
  • Performance: 87% test accuracy after 10 epochs
  • Training time: ~2 minutes on CPU
  • Bottom line: Simple architectures work well for Fashion-MNIST

Not Addressed (Future Topics)

  • Overfitting analysis: No train/val split comparison
  • Hyperparameter tuning: Fixed learning rate, no grid search
  • Architecture search: Only tried one configuration
  • Regularization: No dropout, weight decay, or data augmentation
  • Advanced optimizers: Used basic SGD, not Adam/AdamW

Main Files

ee541-demo1/
├── Minimal.ipynb         # Core training
├── Visualize.ipynb       # TensorBoard
└── model.pth            # Saved weights

Common Confusions: Shirt ↔︎ T-shirt, Pullover ↔︎ Coat

Python and NumPy for Neural Networks

Next week: Array operations and automatic differentiation