EE 541 - Unit 4
Fall 2025
MMSE theory: \[\hat{Y}_{\text{MMSE}} = \mathbb{E}[Y|X]\]
Requires knowing \(p(y|x)\) to compute the conditional expectation
Linear MMSE: \[\hat{Y}_{\text{LMMSE}} = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}(X - \mathbb{E}[X]) + \mathbb{E}[Y]\]
Requires population moments: \(\mathbb{E}[X]\), \(\mathbb{E}[Y]\), Var\((X)\), Cov\((X,Y)\)
In practice:
Empirical approximation:
\[\mathbb{E}[h(X)] \approx \frac{1}{n}\sum_{i=1}^n h(x_i)\]
\[\text{Cov}(X,Y) \approx \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\]
Sample average approximates expectation:
\[R(f) = \mathbb{E}[\ell(Y, f(X))] \approx \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i))\]
Population moments (unknown): \[\mathbb{E}[X] = \int x \cdot p(x) \, dx\] \[\text{Var}(X) = \int (x - \mathbb{E}[X])^2 p(x) \, dx\]
Sample moments (computable): \[\bar{X} = \frac{1}{n} \sum_{i=1}^n x_i\] \[S_X^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{X})^2\]
Law of Large Numbers: \[\bar{X} \xrightarrow{a.s.} \mathbb{E}[X]\]
Central Limit Theorem: \[\sqrt{n}(\bar{X} - \mathbb{E}[X]) \xrightarrow{d} \mathcal{N}(0, \sigma^2)\]
Monte Carlo approximation: Integrals become sums over samples
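A minimal NumPy sketch of this idea (the distribution, sample size, and choice of \(h\) are arbitrary illustrative assumptions): sample moments and Monte Carlo averages estimate the corresponding population quantities.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)   # samples from N(2, 9)

# Sample moments estimate population moments (LLN)
print(x.mean())        # ~ 2.0, estimates E[X]
print(x.var())         # ~ 9.0, estimates Var(X)

# Monte Carlo: E[h(X)] ~ (1/n) sum h(x_i), here with h(x) = x^2
print(np.mean(x**2))   # ~ E[X^2] = Var(X) + E[X]^2 = 13
```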
Population risk (what we want to minimize): \[R(f) = \mathbb{E}_{(X,Y)}[\ell(Y, f(X))]\] \[= \int\int \ell(y, f(x)) \, p(x,y) \, dx \, dy\]
Cannot compute without knowing \(p(x,y)\)
Empirical risk (what we can compute): \[\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i))\]
ERM (Empirical Risk Minimization): \[\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f)\]
Convergence guarantee: Under regularity conditions, \[R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) \xrightarrow{P} 0\]
Sample minimizer converges to population minimizer as \(n \to \infty\)
Population LMMSE: \[a^* = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}, \quad b^* = \mathbb{E}[Y] - a^*\mathbb{E}[X]\]
Sample LMMSE: \[\hat{a} = \frac{\hat{K}_{XY}}{\hat{K}_X}, \quad \hat{b} = \bar{Y} - \hat{a}\bar{X}\]
where \(\hat{K}_{XY}\) is the sample covariance of \(X\) and \(Y\), \(\hat{K}_X\) is the sample variance of \(X\), and \(\bar{X}\), \(\bar{Y}\) are the sample means
This is linear regression!
Same formula, different source:
Consistency: \(\hat{a} \xrightarrow{P} a^*\), \(\hat{b} \xrightarrow{P} b^*\) as \(n \to \infty\)
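A hedged NumPy sketch of this population-vs-sample parallel (the synthetic data and variable names are illustrative assumptions): the empirical LMMSE coefficients built from sample moments coincide with the least-squares line fit to the same data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=n)   # true a* = 2, b* = 1

# Sample-moment (empirical LMMSE) coefficients
a_hat = np.cov(x, y, ddof=0)[0, 1] / x.var()        # Khat_XY / Khat_X
b_hat = y.mean() - a_hat * x.mean()

# Least-squares line fit returns the same coefficients
a_ls, b_ls = np.polyfit(x, y, deg=1)
print(a_hat, b_hat)    # ~ 2.0, ~ 1.0
print(a_ls, b_ls)      # matches a_hat, b_hat up to floating point
```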
Batch gradient descent on empirical risk: \[\hat{R}_n(w) = \frac{1}{n} \sum_{i=1}^n (y_i - w^T x_i)^2\]
Gradient uses all data: \[\nabla \hat{R}_n(w) = -\frac{2}{n} \sum_{i=1}^n (y_i - w^T x_i) x_i\]
Update rule: \[w_{t+1} = w_t - \mu \nabla \hat{R}_n(w_t)\]
Characteristics:
Convergence: Linear rate to minimum \[||w_t - w^*|| \leq (1 - \mu \lambda_{\min})^t ||w_0 - w^*||\] where \(\lambda_{\min}\) = smallest eigenvalue of \(\frac{1}{n}X^TX\)
Stochastic gradient on single sample: \[\ell_i(w) = (y_i - w^T x_i)^2\]
Gradient from one sample: \[\nabla \ell_i(w) = -2(y_i - w^T x_i) x_i\]
SGD update: \[w_{t+1} = w_t - \mu_t \nabla \ell_{i_t}(w_t)\]
where \(i_t\) is randomly selected sample
LMS algorithm = SGD for squared loss: \[w_{t+1} = w_t + \mu e_t x_t\]
Same update! Just different notation.
Characteristics:
Convergence requires decreasing step size: \[\sum_{t=1}^{\infty} \mu_t = \infty, \quad \sum_{t=1}^{\infty} \mu_t^2 < \infty\]
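A minimal sketch of the LMS/SGD update with a Robbins-Monro step-size schedule (the data, schedule constants, and iteration count are arbitrary choices, not a prescribed recipe):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5_000, 3
X = rng.normal(size=(n, p))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=n)

w = np.zeros(p)
for t in range(1, 50_001):
    i = rng.integers(n)               # draw one sample at random
    e = y[i] - w @ X[i]               # instantaneous error e_t
    mu_t = 1.0 / (10.0 + t)           # decreasing step: sum mu_t = inf, sum mu_t^2 < inf
    w = w + mu_t * e * X[i]           # LMS update (factor of 2 absorbed into the step size)

print(w)   # approaches w_true as the number of updates grows
```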
Mini-batch gradient on \(m\) samples: \[\hat{R}_m(w) = \frac{1}{m} \sum_{j \in \mathcal{B}_t} (y_j - w^T x_j)^2\]
where \(\mathcal{B}_t\) is batch of size \(m\)
Typical sizes: \(m = 32, 64, 128, 256\)
Gradient estimator: \[\nabla \hat{R}_m(w) = -\frac{2}{m} \sum_{j \in \mathcal{B}_t} (y_j - w^T x_j) x_j\]
Variance of gradient estimate: \[\text{Var}[\nabla \hat{R}_m] = \frac{1}{m} \text{Var}[\nabla \ell_i]\]
Reduces by factor of \(m\) compared to SGD
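A small experiment, under assumptions of my own choosing (synthetic data, gradient noise evaluated at the fixed point \(w = 0\)), illustrating that the mini-batch gradient's variance shrinks roughly like \(1/m\):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 10_000, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

w = np.zeros(p)   # evaluate gradient noise at an arbitrary point

def minibatch_grad(m):
    idx = rng.choice(n, size=m, replace=False)
    Xb, yb = X[idx], y[idx]
    return -2.0 / m * Xb.T @ (yb - Xb @ w)

for m in (1, 32, 256):
    grads = np.array([minibatch_grad(m) for _ in range(2_000)])
    print(m, grads.var(axis=0).mean())   # per-coordinate variance drops roughly as 1/m
```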
Why mini-batches?
Model structure: \[\hat{y} = w^T x + b\]
where \(w\) is the weight vector, \(x\) the feature vector, and \(b\) the bias (intercept)
Vectorized over dataset: \[\hat{\mathbf{y}} = \mathbf{X}w + b\mathbf{1}\]
Augmented notation (“absorb” bias): \[\tilde{x} = \begin{bmatrix} 1 \\ x \end{bmatrix}, \quad \tilde{w} = \begin{bmatrix} b \\ w \end{bmatrix}\]
Then: \(\hat{y} = \tilde{w}^T \tilde{x}\)
Empirical risk (squared loss): \[J(w) = \frac{1}{n} \sum_{i=1}^n (y_i - w^T x_i - b)^2\]
Matrix form: \[J(w) = \frac{1}{n} ||y - Xw||^2\]
Expanded: \[J(w) = \frac{1}{n}(y - Xw)^T(y - Xw)\] \[= \frac{1}{n}(y^Ty - 2y^TXw + w^TX^TXw)\]
This is a quadratic in \(w\): \[J(w) = \frac{1}{n}(w^T\mathbf{A}w - 2\mathbf{b}^Tw + c)\]
where \(\mathbf{A} = X^TX\), \(\mathbf{b} = X^Ty\), and \(c = y^Ty\)
Convexity: \(\nabla^2 J = \frac{2}{n}X^TX \succeq 0\)
Minimize: \(J(w) = \frac{1}{n}||y - Xw||^2\)
Take gradient: \[\nabla_w J = \frac{2}{n}X^T(Xw - y)\]
Set to zero: \[X^T(Xw - y) = 0\]
Normal equations: \[\boxed{X^TXw = X^Ty}\]
Key observations:
Hence the name: the error \(y - Xw\) is orthogonal (normal) to the column space of \(X\)
Solution (when \(X^TX\) invertible): \[w = (X^TX)^{-1}X^Ty\]
Moore-Penrose pseudoinverse: \[X^+ = (X^TX)^{-1}X^T\]
So: \(w = X^+y\)
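A brief NumPy sketch (synthetic data; all names are illustrative) showing that the normal equations, the pseudoinverse, and an SVD-based least-squares routine agree when \(X\) has full column rank:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + 0.1 * rng.normal(size=200)

w_normal = np.linalg.solve(X.T @ X, X.T @ y)       # normal equations
w_pinv   = np.linalg.pinv(X) @ y                   # Moore-Penrose pseudoinverse
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # SVD-based least squares

print(np.allclose(w_normal, w_pinv), np.allclose(w_normal, w_lstsq))   # True True
```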
Properties of \(X^+\):
When is \(X^TX\) invertible?
Rank deficient case: Use regularization or SVD
Normal equations: \(X^TXw = X^Ty\)
Divide by \(n\): \[\frac{1}{n}X^TXw = \frac{1}{n}X^Ty\]
This gives sample moments:
\[\hat{K}_X w = \hat{k}_{Xy}\]
where \(\hat{K}_X = \frac{1}{n}X^TX\) and \(\hat{k}_{Xy} = \frac{1}{n}X^Ty\)
Solution: \[w = \hat{K}_X^{-1} \hat{k}_{Xy}\]
This is empirical LMMSE!
Population: \(w^* = K_X^{-1} k_{Xy}\) Sample: \(\hat{w} = \hat{K}_X^{-1} \hat{k}_{Xy}\)
Consistency: \(\hat{w} \to w^*\) as \(n \to \infty\)
QR factorization: \(X = QR\)
where \(Q\) has orthonormal columns (\(Q^TQ = I\)) and \(R\) is upper triangular
Solve via QR: \[X^TXw = X^Ty\] \[(QR)^T(QR)w = (QR)^Ty\] \[R^TQ^TQRw = R^TQ^Ty\]
Since \(Q^TQ = I\): \[R^TRw = R^TQ^Ty\]
If \(R\) full rank: \(R^T\) invertible: \[Rw = Q^Ty\]
Back-substitution: Solve triangular system
Complexity: \(O(2np^2)\) but numerically stable
Advantage: Avoids computing \(X^TX\) explicitly (which squares the condition number)
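A minimal QR-based solve on illustrative data (np.linalg.solve is used on the triangular system here; a dedicated triangular solver would exploit the structure of \(R\)):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 6))
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=300)

Q, R = np.linalg.qr(X)              # reduced QR: Q is n x p with Q^T Q = I, R is p x p upper triangular
w_qr = np.linalg.solve(R, Q.T @ y)  # solve R w = Q^T y by back-substitution, never forming X^T X

print(np.allclose(w_qr, np.linalg.lstsq(X, y, rcond=None)[0]))   # True
```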
Singular Value Decomposition: \[X = U\Sigma V^T\]
where \(U\) and \(V\) have orthonormal columns and \(\Sigma\) is diagonal with singular values \(\sigma_1 \geq \sigma_2 \geq \cdots \geq 0\)
Solution via SVD: \[w = V\Sigma^+ U^T y\]
where \(\Sigma^+\) = pseudoinverse of \(\Sigma\): invert each nonzero singular value, leave zeros at zero
Truncated SVD (regularization): Keep only \(k\) largest singular values
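A sketch of the SVD and truncated-SVD solutions, assuming a nearly rank-deficient design of my own construction:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5))
X[:, 4] = X[:, 0] + 1e-8 * rng.normal(size=100)    # nearly dependent column -> tiny singular value
y = X[:, :4] @ np.array([1.0, -1.0, 0.5, 2.0]) + 0.1 * rng.normal(size=100)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Pseudoinverse solution: w = V Sigma^+ U^T y
w_svd = Vt.T @ ((U.T @ y) / s)

# Truncated SVD: keep only the k largest singular values (regularizes the tiny one away)
k = 4
w_trunc = Vt[:k].T @ ((U[:, :k].T @ y) / s[:k])

print(s)                                                 # last singular value is near zero
print(np.linalg.norm(w_svd), np.linalg.norm(w_trunc))    # truncation gives a far smaller w
```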
Key idea: Move downhill in direction of steepest descent
Gradient points uphill: \[\nabla J(w) = \frac{2}{n}X^T(Xw - y)\]
Direction of maximum increase of \(J\)
Descent direction: \(-\nabla J(w)\)
Update rule: \[w_{t+1} = w_t - \alpha \nabla J(w_t)\]
where \(\alpha\) = step size / learning rate
Intuition:
Convergence criterion: \[||\nabla J(w)|| < \epsilon\]
At optimum: \(\nabla J(w^*) = 0\) (critical point)
Definition: \(f\) is convex if for all \(\lambda \in [0,1]\): \[f(\lambda w_1 + (1-\lambda)w_2) \leq \lambda f(w_1) + (1-\lambda)f(w_2)\]
Squared loss is convex: \[J(w) = \frac{1}{n}||y - Xw||^2\]
Proof via Hessian: \[\nabla^2 J = \frac{2}{n}X^TX\]
Since \(X^TX \succeq 0\) (positive semidefinite): \[v^T(X^TX)v = ||Xv||^2 \geq 0 \quad \forall v\]
Therefore \(J\) is convex.
Strong convexity if \(X\) full rank: \[J(w) \geq J(w^*) + \frac{\mu}{2}||w - w^*||^2\] where \(\mu = \lambda_{\min}(X^TX)/n > 0\)
Critical parameter: step size \(\alpha\)
Too small: Slow convergence \[w_{t+1} \approx w_t\]
Too large: Overshoot, divergence \[J(w_{t+1}) > J(w_t)\]
Just right: Fastest convergence
Condition number effect: \[\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}\]
Theoretical bounds: For convergence need: \[0 < \alpha < \frac{2}{\lambda_{\max}(X^TX/n)}\]
“Optimal” step size: \[\alpha^* = \frac{2}{\lambda_{\min} + \lambda_{\max}}\]
Convergence rate: \[||w_t - w^*|| \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^t ||w_0 - w^*||\]
Loss surface shape determined by eigenvalues of \(X^TX\)
Principal axes: Eigenvectors of \(X^TX\) Curvature along axis: Eigenvalue
Gradient descent behavior:
Preconditioning: Transform to make all eigenvalues equal \[\tilde{X} = XP^{-1}\] where \(P\) chosen so \(\tilde{X}^T\tilde{X} = I\)
Use one sample at a time: \[w_{t+1} = w_t - \alpha_t \nabla \ell_{i_t}(w_t)\]
where \(i_t\) randomly selected
For squared loss: \[\nabla \ell_i(w) = -2(y_i - w^T x_i)x_i\]
Unbiased gradient estimate: \[\mathbb{E}[\nabla \ell_i(w)] = \nabla J(w)\]
Noise in gradient: \[\text{Var}[\nabla \ell_i] = \mathbb{E}[||\nabla \ell_i||^2] - ||\nabla J||^2\]
Benefits:
Fixed step size: Converges to neighborhood \[\mathbb{E}[||w_t - w^*||^2] \leq \alpha \sigma^2 + (1-2\alpha\mu)^t||w_0 - w^*||^2\]
where \(\sigma^2\) = gradient noise variance
Decreasing step size: Converges to exact solution
Required conditions (Robbins-Monro): \[\sum_{t=1}^{\infty} \alpha_t = \infty, \quad \sum_{t=1}^{\infty} \alpha_t^2 < \infty\]
Example: \(\alpha_t = \alpha_0/t\)
Convergence rates:
Method | Rate | Final Error |
---|---|---|
Batch GD | \(O(e^{-t})\) | 0 |
SGD (fixed \(\alpha\)) | \(O(e^{-t})\) | \(O(\alpha)\) |
SGD (decreasing) | \(O(1/t)\) | 0 |
Variance reduction techniques:
Mini-batch gradient: \[\nabla J_{\mathcal{B}} = \frac{1}{m} \sum_{i \in \mathcal{B}} \nabla \ell_i(w)\]
where \(|\mathcal{B}| = m\) = batch size
Variance reduction: \[\text{Var}[\nabla J_{\mathcal{B}}] = \frac{1}{m}\text{Var}[\nabla \ell_i]\]
Linear speedup in variance reduction!
Benefits:
Diminishing returns:
Epoch: One pass through dataset
So far: Minimized squared loss empirically \[J(w) = \frac{1}{n}\sum_{i=1}^n (y_i - w^T x_i)^2\]
But why this particular loss function?
Different losses give different solutions:
Need principled approach:
Maximum likelihood provides framework:
Assume data comes from a process: \[y = w^T x + \epsilon\]
where \(\epsilon\) is random noise
What we observe:
Key questions:
Common assumption: \(\epsilon \sim \mathcal{N}(0, \sigma^2)\)
Why Gaussian?
Model: \(y = w^T x + \epsilon\), where \(\epsilon \sim \mathcal{N}(0, \sigma^2)\)
This implies \(y|x\) is Gaussian: \[y|x \sim \mathcal{N}(w^T x, \sigma^2)\]
Probability of observing \(y\) given \(x\): \[p(y|x; w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - w^T x)^2}{2\sigma^2}\right)\]
For entire dataset \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n\):
Assuming independent samples: \[p(\mathcal{D}|w) = \prod_{i=1}^n p(y_i|x_i; w)\]
Maximum likelihood principle: Find \(w\) that makes observed data most probable: \[\hat{w}_{\text{ML}} = \arg\max_w p(\mathcal{D}|w)\]
Log-likelihood: \[\ell(w) = \log p(\mathcal{D}|w) = \sum_{i=1}^n \log p(y_i|x_i; w)\]
Substitute Gaussian pdf: \[\ell(w) = \sum_{i=1}^n \log \left(\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i - w^T x_i)^2}{2\sigma^2}}\right)\]
Simplify: \[\ell(w) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - w^T x_i)^2\]
To maximize \(\ell(w)\): the first term does not depend on \(w\), so maximizing the log-likelihood is equivalent to minimizing \(\sum_{i=1}^n (y_i - w^T x_i)^2\)
Result: MLE = Least squares!
\[\hat{w}_{\text{ML}} = \arg\min_w \sum_{i=1}^n(y_i - w^T x_i)^2\]
Squared loss emerges from Gaussian noise assumption
Maximize log-likelihood: \[\hat{w} = \arg\max_w \ell(w) = \arg\min_w ||y - Xw||^2\]
Take gradient of loss: \[\nabla_w \left[\frac{1}{2}||y - Xw||^2\right] = -X^T(y - Xw)\]
Set to zero (critical point): \[-X^T(y - Xw) = 0\] \[X^T Xw = X^T y\]
Normal equations (same as before!): \[\boxed{\hat{w}_{\text{ML}} = (X^T X)^{-1} X^T y}\]
So:
MLE gives us \(\hat{w}\), but:
Sources of uncertainty:
Statistical inference framework:
MLE is random! \[\hat{w} = (X^TX)^{-1}X^Ty = (X^TX)^{-1}X^T(Xw^* + \epsilon)\] \[= w^* + (X^TX)^{-1}X^T\epsilon\]
Distribution of \(\hat{w}\) depends on distribution of \(\epsilon\)
Fisher Information: Quantifies how much information data contains about parameters
Fisher information matrix: \[\mathbf{I}(w) = -\mathbb{E}\left[\frac{\partial^2 \ell(w)}{\partial w \partial w^T}\right]\]
For linear regression: \[\mathbf{I}(w) = \frac{1}{\sigma^2}X^T X\]
Measures “information” about \(w\) in data
Covariance of ML estimator: \[\text{Cov}(\hat{w}_{\text{ML}}) = \mathbf{I}^{-1} = \sigma^2(X^T X)^{-1}\]
Confidence regions: \[(\hat{w} - w^*)^T \mathbf{I} (\hat{w} - w^*) \sim \chi^2_p\]
Forms ellipsoid around estimate
Consistency: \[\hat{w}_n \xrightarrow{p} w^* \text{ as } n \to \infty\]
MLE converges to true value with enough data
Asymptotic normality: \[\sqrt{n}(\hat{w}_n - w^*) \xrightarrow{d} \mathcal{N}(0, \sigma^2 K^{-1})\]
where \(K = \lim_{n \to \infty} \frac{1}{n}X^T X\)
Efficiency:
Invariance: If \(\hat{w}\) is MLE of \(w\), then:
MLE limitation: Ignores prior knowledge
Bayesian approach: Include prior belief \[p(w|\mathcal{D}) \propto p(\mathcal{D}|w) \cdot p(w)\]
MAP estimation: \[\hat{w}_{\text{MAP}} = \arg\max_w p(w|\mathcal{D})\]
Log posterior: \[= \arg\max_w [\log p(\mathcal{D}|w) + \log p(w)]\] \[= \arg\max_w [\text{log-likelihood} + \text{log-prior}]\]
Example: Gaussian prior \(w \sim \mathcal{N}(0, \tau^2 I)\) \[\log p(w) = -\frac{1}{2\tau^2}||w||^2 + \text{const}\]
MAP objective: \[\hat{w}_{\text{MAP}} = \arg\min_w \left[||y - Xw||^2 + \frac{\sigma^2}{\tau^2}||w||^2\right]\]
This is Ridge regression with \(\lambda = \sigma^2/\tau^2\)
Why care about different noise models?
Real data often has outliers
Laplace noise model: \[p(\epsilon) = \frac{1}{2b}\exp\left(-\frac{|\epsilon|}{b}\right)\]
Heavy tails → robust to outliers
Likelihood with Laplace noise: \[p(y|x; w) = \frac{1}{2b}\exp\left(-\frac{|y - w^T x|}{b}\right)\]
Log-likelihood: \[\ell(w) = -n\log(2b) - \frac{1}{b}\sum_{i=1}^n |y_i - w^T x_i|\]
Maximizing \(\ell(w)\) equivalent to minimizing: \[\sum_{i=1}^n |y_i - w^T x_i|\]
L1 loss emerges from Laplace noise assumption!
Problem 1: Too many features (\(p > n\))
Problem 2: Multicollinearity
Problem 3: Overfitting
Solution approach: Add constraints \[J_{\text{reg}}(w) = ||y - Xw||^2 + \lambda R(w)\]
where \(R(w)\) penalizes complexity
This changes the outcome:
Adding more features:
Expected prediction error: \[\mathbb{E}[(y - \hat{y})^2] = \text{Bias}^2 + \text{Variance} + \sigma^2\]
For linear models with \(p\) features:
Overfitting occurs when:
Need way to tune complexity
Cross-validation estimates test error using training data
Modified objective: \[J_{\text{reg}}(w) = ||y - Xw||^2 + \lambda ||w||^2\]
Ridge regression: L2 penalty
New solution: \[\hat{w}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y\]
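A short sketch of the ridge closed form (the values of \(\lambda\) and the synthetic data are arbitrary; \(\lambda = \sigma^2/\tau^2\) under the MAP reading above):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 50, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    # Closed-form ridge solution: (X^T X + lam I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in (0.0, 1.0, 100.0):
    print(lam, np.linalg.norm(ridge(X, y, lam)))   # larger lambda shrinks w toward zero
```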
What this does:
Bayesian interpretation:
Connection to MLE:
How do we measure if a model is good?
So far: Squared error seemed natural for regression \[\text{Loss} = (y - \hat{y})^2\]
But why squared? Why not absolute? Cubic?
Information theory provides the answer:
Learning reduces uncertainty about data - information theory quantifies this
Intuition: Rare events are more “informative”
Information content of outcome \(x\): \[I(x) = -\log p(x)\]
Units:
Properties:
Why logarithm?
Example: Fair coin flip
Entropy = Expected information content \[H(X) = \mathbb{E}[I(X)] = -\mathbb{E}[\log p(X)]\] \[= -\sum_x p(x) \log p(x)\]
Interpretation:
Properties:
Examples:
Continuous case (differential entropy): \[H(X) = -\int p(x) \log p(x) \, dx\]
Note: Can be negative for continuous distributions
Question: What distribution should we assume given constraints?
Maximum entropy principle: Choose distribution with maximum entropy subject to known constraints
Why?
Example 1: Known mean and variance \[\max_{p} H(p) \text{ s.t. } \mathbb{E}[X] = \mu, \text{Var}(X) = \sigma^2\]
Solution: Gaussian \(\mathcal{N}(\mu, \sigma^2)\)
Example 2: Positive with known mean \[\max_{p} H(p) \text{ s.t. } X > 0, \mathbb{E}[X] = \lambda\]
Solution: Exponential with rate \(1/\lambda\)
This explains why:
Problem: How do we measure if two distributions are “different”?
Not just point-wise difference:
Natural idea: If true distribution is \(p\) but we design code for \(q\):
Extra bits needed: \[\text{Inefficiency} = \mathbb{E}_p[-\log q(X)] - \mathbb{E}_p[-\log p(X)]\] \[= H(p,q) - H(p)\]
This motivates KL divergence as the “coding penalty” for using wrong distribution
Cross-entropy between distributions \(p\) and \(q\): \[H(p, q) = -\mathbb{E}_p[\log q(X)] = -\int p(x) \log q(x) \, dx\]
Interpretation: Average bits to encode samples from \(p\) using code optimized for \(q\)
Properties:
For empirical distribution (data): \[p_{data} = \frac{1}{n}\sum_{i=1}^n \delta(x - x_i)\]
Cross-entropy becomes: \[H(p_{data}, q) = -\frac{1}{n}\sum_{i=1}^n \log q(x_i)\]
This is negative average log-likelihood!
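A small numerical illustration (Gaussian data and candidate models of my own choosing): the cross-entropy of the data against a model is exactly the negative average log-likelihood, and the ML fit minimizes it.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(loc=1.0, scale=2.0, size=5_000)   # samples from the "data" distribution

def gauss_logpdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def cross_entropy(mu, sigma):
    # H(p_data, q) = -(1/n) sum log q(x_i) == negative average log-likelihood
    return -np.mean(gauss_logpdf(x, mu, sigma))

print(cross_entropy(0.0, 1.0))             # wrong model: pays a large "coding" penalty
print(cross_entropy(1.0, 2.0))             # true parameters: close to H(p_data)
print(cross_entropy(x.mean(), x.std()))    # ML fit: smallest value among Gaussian models
```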
Kullback-Leibler (KL) divergence: \[D_{KL}(p||q) = \mathbb{E}_p\left[\log \frac{p(X)}{q(X)}\right]\] \[= \int p(x) \log \frac{p(x)}{q(x)} \, dx\]
Interpretation:
Properties:
Forward vs Reverse KL:
From previous slide’s definition: \[H(p, q) = H(p) + D_{KL}(p||q)\]
Connection to likelihood: \[D_{KL}(p_{data}||p_{model}) = \text{const} - \mathbb{E}_{data}[\log p_{model}]\]
Learning as distribution matching:
Want model \(q_\theta\) to match data distribution \(p_{data}\)
Minimize KL divergence: \[\min_\theta D_{KL}(p_{data} || q_\theta)\]
Expanding KL: \[D_{KL}(p_{data} || q_\theta) = H(p_{data}, q_\theta) - H(p_{data})\]
Since \(H(p_{data})\) doesn’t depend on \(\theta\): \[\min_\theta D_{KL}(p_{data} || q_\theta) \equiv \min_\theta H(p_{data}, q_\theta)\]
With empirical distribution (data as delta functions): \[p_{data} = \frac{1}{n}\sum_{i=1}^n \delta(x - x_i)\]
\[H(p_{data}, q_\theta) = -\frac{1}{n}\sum_{i=1}^n \log q_\theta(x_i)\]
This is negative average log-likelihood!
\[\min_\theta H(p_{data}, q_\theta) \equiv \max_\theta \frac{1}{n}\sum_{i=1}^n \log q_\theta(x_i)\]
MLE = Cross-entropy minimization = KL minimization
General principle: Loss = negative log-likelihood \[\text{Loss}(y, \hat{y}) = -\log p(y|\hat{y})\]
Different noise → different loss:
Gaussian noise: \(y = f(x) + \epsilon\), \(\epsilon \sim \mathcal{N}(0, \sigma^2)\) \[p(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - f(x))^2}{2\sigma^2}\right)\] \[-\log p(y|x) = \frac{(y - f(x))^2}{2\sigma^2} + \text{const}\] → Squared loss
Laplace noise: \(\epsilon \sim \text{Laplace}(0, b)\) \[p(y|x) = \frac{1}{2b} \exp\left(-\frac{|y - f(x)|}{b}\right)\] \[-\log p(y|x) = \frac{|y - f(x)|}{b} + \text{const}\] → Absolute loss
Categorical: \(y \in \{1, ..., K\}\), \(p(y=k|x) = \pi_k\) \[-\log p(y|x) = -\log \pi_y\] → Cross-entropy loss
Decomposing KL divergence: \[D_{KL}(p||q) = \int p(x) \log \frac{p(x)}{q(x)} dx\]
Expanding the logarithm: \[D_{KL}(p||q) = \int p(x) \log p(x) dx - \int p(x) \log q(x) dx\] \[= -H(p) + [-\mathbb{E}_p[\log q]]\] \[= H(p, q) - H(p)\]
Decomposition: KL = Cross-entropy - Entropy
Mutual information is KL of a special form: \[I(X;Y) = D_{KL}(p(x,y) || p(x)p(y))\]
Compares joint distribution to what it would be if independent
Alternative forms: \[I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)\] \[= H(X) + H(Y) - H(X,Y)\]
Mutual information between \(X\) and \(Y\): \[I(X; Y) = D_{KL}(p(x,y) || p(x)p(y))\] \[= \mathbb{E}_{X,Y}\left[\log \frac{p(X,Y)}{p(X)p(Y)}\right]\]
Equivalent forms: \[I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)\] \[= H(X) + H(Y) - H(X,Y)\]
Interpretation:
Properties:
Applications in ML:
Loss functions from information theory:
Regression (continuous \(Y\)):
Classification (discrete \(Y\)):
Generative models:
Information bottleneck principle: \[\min I(X; Z) - \beta I(Z; Y)\]
Compress \(X\) into \(Z\) while preserving information about \(Y\)
Representation learning: Learn features that maximize \(I(Z; Y)\) while minimizing \(I(Z; X\setminus Y)\)
Loss = Negative log-likelihood under noise model
Regression with Gaussian noise: \[y = f(x) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)\] \[p(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-f(x))^2}{2\sigma^2}\right)\] \[-\log p(y|x) = \frac{(y-f(x))^2}{2\sigma^2} + \text{const}\] \[\Rightarrow \text{Loss} = (y - f(x))^2\]
Regression with Laplace noise: \[\epsilon \sim \text{Laplace}(0, b)\] \[-\log p(y|x) = \frac{|y-f(x)|}{b} + \text{const}\] \[\Rightarrow \text{Loss} = |y - f(x)|\]
Classification (discrete \(y \in \{1, ..., K\}\)):
Model outputs probabilities: \(\pi_k(x) = P(y=k|x)\)
Categorical distribution: \[p(y|x) = \prod_{k=1}^K \pi_k(x)^{[y=k]}\]
Negative log-likelihood: \[-\log p(y|x) = -\sum_{k} [y=k] \log \pi_k\]
One-hot encoding: \(y_k = [y=k]\) \[\Rightarrow \text{Loss} = -\sum_{k} y_k \log \pi_k\]
This is categorical cross-entropy!
Binary case (\(K=2\)): \[\text{Loss} = -y\log \pi - (1-y)\log(1-\pi)\]
Bernoulli leads to binary cross-entropy
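A minimal implementation of binary cross-entropy (the function name and toy labels are illustrative; the clip guards against log(0)):

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Average negative log-likelihood of Bernoulli labels y under predicted probabilities p."""
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y      = np.array([1, 0, 1, 1, 0])
p_good = np.array([0.9, 0.1, 0.8, 0.7, 0.2])   # confident and mostly correct -> small loss
p_flat = np.array([0.5, 0.5, 0.5, 0.5, 0.5])   # uninformative -> loss = log 2 ~ 0.693
print(binary_cross_entropy(y, p_good), binary_cross_entropy(y, p_flat))
```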
Later: How to ensure \(\pi \in [0,1]\) → sigmoid/softmax
Binary classification problem:
Naive approach: Treat as regression
Seems reasonable?
But there are serious problems…
Regression outputs can be any real number
For points far from boundary:
Squared loss encourages extreme values:
Outliers dominate the loss:
Need: Output constrained to \([0, 1]\)
Classification is fundamentally different from regression
Regression: Minimize distance to target
Classification: Correct or incorrect
Information theory perspective:
Squared loss for binary data:
We’re using the wrong probabilistic model!
Linear decision boundary: \[w^Tx + b = 0\]
Defines hyperplane in \(\mathbb{R}^p\)
Geometric interpretation:
Distance to boundary: \[d(x) = \frac{|w^Tx + b|}{||w||}\]
Signed distance (positive = class 1 side): \[d_{\text{signed}}(x) = \frac{w^Tx + b}{||w||}\]
Decision rule: \[\hat{y} = \begin{cases} 1 & \text{if } w^Tx + b > 0 \\ 0 & \text{if } w^Tx + b \leq 0 \end{cases}\]
Not all problems are linearly separable
Classic example: XOR
No single line can separate the classes!
Linear separability requires: \[\exists w, b : \forall i, \quad y_i(w^Tx_i + b) > 0\]
When linear fails:
Solutions:
Practical implications:
Medical diagnosis: Disease probability needed
Credit scoring: Default probability
Multi-class: Even worse
What we need:
Solutions to our problems:
Sigmoid function: Maps \(\mathbb{R} \to [0,1]\) \[\sigma(z) = \frac{1}{1 + e^{-z}}\]
Log-odds (logit) transformation: \[\log\frac{P(y=1|x)}{P(y=0|x)} = w^Tx + b\]
Cross-entropy loss (from information theory): \[L = -y\log\hat{p} - (1-y)\log(1-\hat{p})\]
Maximum likelihood estimation:
Multi-class: Softmax function \[P(y=k|x) = \frac{e^{w_k^Tx}}{\sum_j e^{w_j^Tx}}\]
Ensures outputs sum to 1!
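Minimal sigmoid and softmax sketches (the max-subtraction is a standard numerical-stability trick; the example scores are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

print(sigmoid(np.array([-5.0, 0.0, 5.0])))      # squashes R into (0, 1)
scores = np.array([2.0, 1.0, 0.1])              # e.g. w_k^T x for K = 3 classes
print(softmax(scores), softmax(scores).sum())   # probabilities that sum to 1
```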
Linear in parameters, not features:
Original features: \(x \in \mathbb{R}^d\)
Transform to: \(\phi(x) \in \mathbb{R}^p\) where \(p > d\)
Model: \(y = w^T\phi(x) + b\)
Polynomial features: \[x \mapsto [1, x, x^2, x^3, ..., x^p]\]
Two variables: \[[x_1, x_2] \mapsto [1, x_1, x_2, x_1^2, x_1x_2, x_2^2, ...]\]
Still solving linear regression: \[\min_w ||\mathbf{y} - \Phi \mathbf{w}||^2\]
where \(\Phi_{ij} = \phi_j(x_i)\)
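A one-variable sketch of polynomial feature expansion followed by ordinary least squares (the degree, data, and target are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(-1, 1, size=200)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=200)   # nonlinear in x

d = 5
Phi = np.vander(x, N=d + 1, increasing=True)         # columns [1, x, x^2, ..., x^d]
w = np.linalg.lstsq(Phi, y, rcond=None)[0]           # still just linear least squares in w

y_hat = Phi @ w
print(np.mean((y - y_hat) ** 2))   # fits the sine curve far better than a straight line
```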
Complexity growth:
Curse of dimensionality!
Why normalize?
Standardization (z-score): \[x' = \frac{x - \mu}{\sigma}\]
Min-Max scaling: \[x' = \frac{x - x_{min}}{x_{max} - x_{min}}\]
Robust scaling: \[x' = \frac{x - \text{median}}{\text{IQR}}\]
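A quick NumPy sketch of the three scalings above, applied to a made-up feature with one outlier:

```python
import numpy as np

rng = np.random.default_rng(10)
x = np.concatenate([rng.normal(50, 10, size=100), [500.0]])   # one extreme outlier

z_score = (x - x.mean()) / x.std()                   # standardization
min_max = (x - x.min()) / (x.max() - x.min())        # min-max scaling
q1, med, q3 = np.percentile(x, [25, 50, 75])
robust  = (x - med) / (q3 - q1)                      # robust scaling (median / IQR)

# The outlier drags the mean and the range, but barely moves the median and IQR
print(z_score[-1], min_max[-1], robust[-1])
```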
When to use which:
Multicollinearity: Features are highly correlated
Problems:
Detection:
VIF for feature \(j\): \[\text{VIF}_j = \frac{1}{1 - R_j^2}\] where \(R_j^2\) = R² from regressing \(x_j\) on other features
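A direct implementation of this VIF definition (a sketch assuming centered columns so the intercept can be dropped; the data and names are illustrative):

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column (columns assumed centered)."""
    p = X.shape[1]
    out = np.zeros(p)
    for j in range(p):
        others = np.delete(X, j, axis=1)
        w = np.linalg.lstsq(others, X[:, j], rcond=None)[0]   # regress x_j on the rest
        resid = X[:, j] - others @ w
        r2 = 1.0 - resid.var() / X[:, j].var()                # R^2 of that regression
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(11)
x1 = rng.normal(size=500)
x2 = x1 + 0.01 * rng.normal(size=500)   # nearly collinear with x1
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])
X = X - X.mean(axis=0)
print(vif(X))   # huge VIF for the first two columns, ~1 for the third
```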
Solutions:
Adding \(\lambda I\) improves conditioning!
Residuals: \(e_i = y_i - \hat{y}_i\)
Assumptions to check:
Linearity: Residuals vs fitted values
Homoscedasticity: Constant variance
Normality: Q-Q plot
Independence: Autocorrelation
When linear regression fails:
NumPy implementation:
```python
import numpy as np

# Normal equations
def linear_regression_normal(X, y):
    # Add bias column
    X_b = np.c_[np.ones(len(X)), X]
    # Solve normal equations
    theta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
    return theta

# Gradient descent
def linear_regression_gd(X, y, lr=0.01, epochs=1000):
    n, p = X.shape
    X_b = np.c_[np.ones(n), X]
    theta = np.zeros(p + 1)
    for _ in range(epochs):
        gradients = 2 / n * X_b.T @ (X_b @ theta - y)
        theta = theta - lr * gradients
    return theta
```
Scikit-learn:
```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)
# Coefficients: model.coef_, model.intercept_
```
Scikit-learn advantages:
Regression metrics:
Mean Squared Error (MSE): \[\text{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2\]
Root MSE: \(\text{RMSE} = \sqrt{\text{MSE}}\) (same units as \(y\))
R² (Coefficient of Determination): \[R^2 = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}\]
Mean Absolute Error (MAE): \[\text{MAE} = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|\]
More robust to outliers than MSE
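A compact implementation of these regression metrics (toy numbers used purely for illustration):

```python
import numpy as np

def regression_metrics(y, y_hat):
    mse  = np.mean((y - y_hat) ** 2)
    rmse = np.sqrt(mse)
    mae  = np.mean(np.abs(y - y_hat))
    r2   = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}

y     = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.8, 5.3, 2.9, 6.4])
print(regression_metrics(y, y_hat))
```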
Classification metrics (linear as classifier):
Linear regression limitations:
Nonlinear relationships
Non-Gaussian noise
Discrete outputs
High dimensionality (\(p > n\))
Complex interactions
Temporal/spatial correlation
Signs of failure:
Remember: Linear models are foundation