Regression, Maximum Likelihood, and Information Theory

EE 541 - Unit 4

Dr. Brandon Franzke

Fall 2025

Outline

Regression Framework

Finite Sample Setting

  • Population moments → sample estimates
  • Empirical risk minimization
  • Batch vs stochastic processing
  • Convergence of sample averages

Linear Regression

  • Normal equations: \(\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}\)
  • Computational methods: QR, SVD, iterative
  • Gradient descent dynamics
  • Eigenvalue geometry and conditioning

Probabilistic Foundation

  • Gaussian noise → squared loss
  • Maximum likelihood = least squares
  • Fisher information and uncertainty
  • MAP estimation = regularization

Practical Considerations

  • Feature engineering and basis expansion
  • Normalization and preprocessing
  • Multicollinearity and ridge regression
  • Model diagnostics and residual analysis

Beyond Regression

Information Theory

  • Entropy as uncertainty measure
  • KL divergence between distributions
  • Cross-entropy = negative log-likelihood
  • Loss functions from noise models

Classification Challenges

  • Unbounded outputs problem
  • Wrong loss for discrete targets
  • Linear separability limitations
  • Need for probabilistic outputs

Model Selection

  • Bias-variance decomposition
  • Training vs generalization error
  • Regularization as constraint
  • When linear models fail

Next Steps

  • Logistic regression preview
  • From regression to neural networks
  • Nonlinear transformations
  • Deep learning connections

Empirical Risk Minimization

Where We Left Off

MMSE theory: \[\hat{Y}_{\text{MMSE}} = \mathbb{E}[Y|X]\]

Required knowing \(p(y|x)\) to compute conditional expectation

Linear MMSE: \[\hat{Y}_{\text{LMMSE}} = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}(X - \mathbb{E}[X]) + \mathbb{E}[Y]\]

Required population moments: \(\mathbb{E}[X]\), \(\mathbb{E}[Y]\), Var\((X)\), Cov\((X,Y)\)

In practice:

  • Don’t know distributions
  • Don’t know population moments
  • Only have finite samples: \((x_1, y_1), ..., (x_n, y_n)\)

Empirical approximation:

\[\mathbb{E}[h(X)] \approx \frac{1}{n}\sum_{i=1}^n h(x_i)\]

\[\text{Cov}(X,Y) \approx \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\]

Sample average approximates expectation:

\[R(f) = \mathbb{E}[\ell(Y, f(X))] \approx \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i))\]

From Expectations to Samples

Population moments (unknown): \[\mathbb{E}[X] = \int x \cdot p(x) \, dx\] \[\text{Var}(X) = \int (x - \mathbb{E}[X])^2 p(x) \, dx\]

Sample moments (computable): \[\bar{X} = \frac{1}{n} \sum_{i=1}^n x_i\] \[S_X^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{X})^2\]

Law of Large Numbers: \[\bar{X} \xrightarrow{a.s.} \mathbb{E}[X]\]

Central Limit Theorem: \[\sqrt{n}(\bar{X} - \mathbb{E}[X]) \xrightarrow{d} \mathcal{N}(0, \sigma^2)\]

Monte Carlo approximation: Integrals become sums over samples
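A minimal NumPy sketch of this idea, using a synthetic standard-normal sample (the data and seed here are illustrative, not from the course materials): the sample average of \(h(X) = X^2\) approaches \(\mathbb{E}[X^2] = 1\) as \(n\) grows.

import numpy as np

rng = np.random.default_rng(0)

# E[X^2] = 1 exactly for X ~ N(0,1); approximate it with a sample average
for n in [10, 1_000, 100_000]:
    x = rng.standard_normal(n)
    print(n, np.mean(x**2))   # approaches 1 as n grows (law of large numbers)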

Empirical Risk Minimization

Population risk (what we want to minimize): \[R(f) = \mathbb{E}_{(X,Y)}[\ell(Y, f(X))]\] \[= \int\int \ell(y, f(x)) \, p(x,y) \, dx \, dy\]

Cannot compute without knowing \(p(x,y)\)

Empirical risk (what we can compute): \[\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i))\]

Empirical Risk Minimization (ERM): \[\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f)\]
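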

Convergence guarantee: Under regularity conditions, \[R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) \xrightarrow{P} 0\]

Sample minimizer converges to population minimizer as \(n \to \infty\)

LMMSE with Sample Moments

Population LMMSE: \[a^* = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}, \quad b^* = \mathbb{E}[Y] - a^*\mathbb{E}[X]\]

Sample LMMSE: \[\hat{a} = \frac{\hat{K}_{XY}}{\hat{K}_X}, \quad \hat{b} = \bar{Y} - \hat{a}\bar{X}\]

where:

  • \(\bar{X} = \frac{1}{n}\sum x_i\), \(\bar{Y} = \frac{1}{n}\sum y_i\)
  • \(\hat{K}_X = \frac{1}{n}\sum (x_i - \bar{X})^2\)
  • \(\hat{K}_{XY} = \frac{1}{n}\sum (x_i - \bar{X})(y_i - \bar{Y})\)

This is linear regression!

Same formula, different source:

  • Theory: Population moments
  • Practice: Sample moments

Consistency: \(\hat{a} \xrightarrow{P} a^*\), \(\hat{b} \xrightarrow{P} b^*\) as \(n \to \infty\)
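A small sketch of the sample LMMSE formulas above on synthetic data (the true line \(y = 2x + 1\) plus noise is an assumption made for illustration):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=500)

# Sample moments (1/n normalization, matching the slides)
x_bar, y_bar = x.mean(), y.mean()
K_x = np.mean((x - x_bar) ** 2)             # sample variance
K_xy = np.mean((x - x_bar) * (y - y_bar))   # sample covariance

a_hat = K_xy / K_x
b_hat = y_bar - a_hat * x_bar
print(a_hat, b_hat)   # close to the true (2.0, 1.0) for large n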

Batch Processing: All Data at Once

Batch gradient descent on empirical risk: \[\hat{R}_n(w) = \frac{1}{n} \sum_{i=1}^n (y_i - w^T x_i)^2\]

Gradient uses all data: \[\nabla \hat{R}_n(w) = -\frac{2}{n} \sum_{i=1}^n (y_i - w^T x_i) x_i\]

Update rule: \[w_{t+1} = w_t - \mu \nabla \hat{R}_n(w_t)\]

Characteristics:

  • Each iteration sees entire dataset
  • Smooth, predictable convergence
  • Per-iteration cost: \(O(np)\)

Convergence: Linear rate to minimum \[||w_t - w^*|| \leq (1 - \mu \lambda_{\min})^t ||w_0 - w^*||\] where \(\lambda_{\min}\) = smallest eigenvalue of \(\frac{1}{n}X^TX\)
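A minimal sketch of batch gradient descent on the empirical squared loss (no bias term; the step size and iteration count are ad hoc choices for illustration):

import numpy as np

def batch_gd(X, y, mu=0.1, iters=500):
    """Batch gradient descent on (1/n)||y - Xw||^2."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        grad = -2.0 / n * X.T @ (y - X @ w)   # uses all n samples each iteration
        w -= mu * grad
    return w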

Stochastic Processing: One Sample at a Time

Stochastic gradient on single sample: \[\ell_i(w) = (y_i - w^T x_i)^2\]

Gradient from one sample: \[\nabla \ell_i(w) = -2(y_i - w^T x_i) x_i\]

SGD update: \[w_{t+1} = w_t - \mu_t \nabla \ell_{i_t}(w_t)\]

where \(i_t\) is randomly selected sample

LMS algorithm = SGD for squared loss: \[w_{t+1} = w_t + \mu e_t x_t\] where \(e_t = y_t - w_t^T x_t\) is the instantaneous error (the factor of 2 absorbed into \(\mu\))

Same update! Just different notation.

Characteristics:

  • Per-update cost: \(O(p)\)
  • Noisy gradient estimates
  • Can escape shallow local minima

Convergence requires decreasing step size: \[\sum_{t=1}^{\infty} \mu_t = \infty, \quad \sum_{t=1}^{\infty} \mu_t^2 < \infty\]
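A sketch of SGD in LMS form with a decreasing step size \(\mu_t = \mu_0 / t\) satisfying the Robbins-Monro conditions (the schedule and defaults here are illustrative assumptions):

import numpy as np

def sgd_lms(X, y, mu0=0.1, epochs=5, seed=0):
    """SGD (LMS form) with decreasing step size mu_t = mu0 / t."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):       # one randomly ordered pass = one epoch
            t += 1
            e = y[i] - w @ X[i]            # instantaneous error e_t
            w += (mu0 / t) * e * X[i]      # LMS update (factor of 2 absorbed in mu)
    return w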

Mini-batch Gradient Descent

Mini-batch gradient on \(m\) samples: \[\hat{R}_m(w) = \frac{1}{m} \sum_{j \in \mathcal{B}_t} (y_j - w^T x_j)^2\]

where \(\mathcal{B}_t\) is batch of size \(m\)

Typical sizes: \(m = 32, 64, 128, 256\)

Gradient estimator: \[\nabla \hat{R}_m(w) = -\frac{2}{m} \sum_{j \in \mathcal{B}_t} (y_j - w^T x_j) x_j\]

Variance of gradient estimate: \[\text{Var}[\nabla \hat{R}_m] = \frac{1}{m} \text{Var}[\nabla \ell_i]\]

Reduces by factor of \(m\) compared to SGD

Why mini-batches?

  • Variance reduction: gradient variance smaller by a factor of \(m\) (standard deviation smaller by \(\sqrt{m}\)) compared to single-sample SGD
  • Vectorization: Modern hardware (GPU/TPU) processes batches efficiently in parallel
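A minimal sketch of mini-batch gradient descent with shuffled batches of size \(m\) (defaults chosen only for illustration):

import numpy as np

def minibatch_gd(X, y, m=32, mu=0.05, epochs=10, seed=0):
    """Mini-batch SGD: n/m updates per epoch, each from m samples."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, m):
            B = idx[start:start + m]                            # current batch B_t
            grad = -2.0 / len(B) * X[B].T @ (y[B] - X[B] @ w)   # mini-batch gradient
            w -= mu * grad
    return w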

Linear Regression from Data

Linear Model for Regression

Model structure: \[\hat{y} = w^T x + b\]

where:

  • \(x \in \mathbb{R}^p\): feature vector
  • \(w \in \mathbb{R}^p\): weight vector
  • \(b \in \mathbb{R}\): bias/intercept
  • \(\hat{y} \in \mathbb{R}\): prediction

Vectorized over dataset: \[\hat{\mathbf{y}} = \mathbf{X}w + b\mathbf{1}\]

  • \(\mathbf{X} \in \mathbb{R}^{n \times p}\): data matrix
  • \(\mathbf{y} \in \mathbb{R}^n\): target vector
  • \(\mathbf{1} \in \mathbb{R}^n\): vector of ones

Augmented notation (“absorb” bias): \[\tilde{x} = \begin{bmatrix} 1 \\ x \end{bmatrix}, \quad \tilde{w} = \begin{bmatrix} b \\ w \end{bmatrix}\]

Then: \(\hat{y} = \tilde{w}^T \tilde{x}\)

Squared Loss on Finite Data

Empirical risk (squared loss): \[J(w) = \frac{1}{n} \sum_{i=1}^n (y_i - w^T x_i - b)^2\]

Matrix form: \[J(w) = \frac{1}{n} ||y - Xw||^2\]

Expanded: \[J(w) = \frac{1}{n}(y - Xw)^T(y - Xw)\] \[= \frac{1}{n}(y^Ty - 2y^TXw + w^TX^TXw)\]

This is a quadratic in \(w\): \[J(w) = \frac{1}{n}(w^T\mathbf{A}w - 2\mathbf{b}^Tw + c)\]

where:

  • \(\mathbf{A} = X^TX\) (always positive semidefinite)
  • \(\mathbf{b} = X^Ty\)
  • \(c = y^Ty\)

Convexity: \(\nabla^2 J = \frac{2}{n}X^TX \succeq 0\)

Setting Up the Normal Equations

Minimize: \(J(w) = \frac{1}{n}||y - Xw||^2\)

Take gradient: \[\nabla_w J = \frac{2}{n}X^T(Xw - y)\]

Set to zero: \[X^T(Xw - y) = 0\]

Normal equations: \[\boxed{X^TXw = X^Ty}\]

Key observations:

  1. \(X^TX \in \mathbb{R}^{p \times p}\): Gram matrix
  2. \(X^Ty \in \mathbb{R}^p\): cross-correlation
  3. System is \(p\) equations in \(p\) unknowns
  4. Unique solution if \(X^TX\) invertible

Hence the name: the error \(y - Xw\) is orthogonal (normal) to the column space of \(X\)

Solving the Normal Equations

Solution (when \(X^TX\) invertible): \[w = (X^TX)^{-1}X^Ty\]

Moore-Penrose pseudoinverse: \[X^+ = (X^TX)^{-1}X^T\]

So: \(w = X^+y\)

Properties of \(X^+\):

  1. \(X^+X = I_p\) (left inverse)
  2. \(XX^+ = P_X\) (projection onto col(\(X\)))
  3. Minimizes \(||w||\) if multiple solutions

When is \(X^TX\) invertible?

  • Need \(\text{rank}(X) = p\) (full column rank)
  • Requires \(n \geq p\) (more data than features)
  • Columns of \(X\) linearly independent

Rank deficient case: Use regularization or SVD

Connection to Sample Statistics

Normal equations: \(X^TXw = X^Ty\)

Divide by \(n\): \[\frac{1}{n}X^TXw = \frac{1}{n}X^Ty\]

This gives sample moments:

\[\hat{K}_X w = \hat{k}_{Xy}\]

where:

  • \(\hat{K}_X = \frac{1}{n}X^TX\): sample second-moment matrix (equals the sample covariance when the features are centered)
  • \(\hat{k}_{Xy} = \frac{1}{n}X^Ty\): sample cross-covariance

Solution: \[w = \hat{K}_X^{-1} \hat{k}_{Xy}\]

This is empirical LMMSE!

Population: \(w^* = K_X^{-1} k_{Xy}\)

Sample: \(\hat{w} = \hat{K}_X^{-1} \hat{k}_{Xy}\)

Consistency: \(\hat{w} \to w^*\) as \(n \to \infty\)

Direct Solution (Computation)

Method 1: Normal equations

w = np.linalg.solve(X.T @ X, X.T @ y)

Complexity:

  • Form \(X^TX\): \(O(np^2)\)
  • Form \(X^Ty\): \(O(np)\)
  • Solve system: \(O(p^3)\)
  • Total: \(O(np^2 + p^3)\)

Condition number: \[\kappa(X^TX) = \kappa(X)^2\] Forming \(X^TX\) squares the condition number of \(X\), doubling it on a log scale — roughly twice as many digits of precision are at risk.
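A quick numerical check of the squared conditioning, using a synthetic matrix with two nearly collinear columns (the construction is illustrative):

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
X[:, 4] = X[:, 3] + 1e-3 * rng.normal(size=200)   # nearly collinear columns

print(np.linalg.cond(X))          # kappa(X)
print(np.linalg.cond(X.T @ X))    # roughly kappa(X)**2: forming the Gram matrix squares it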

QR Decomposition Method

QR factorization: \(X = QR\)

where:

  • \(Q \in \mathbb{R}^{n \times p}\): orthogonal columns
  • \(R \in \mathbb{R}^{p \times p}\): upper triangular

Solve via QR: \[X^TXw = X^Ty\] \[(QR)^T(QR)w = (QR)^Ty\] \[R^TQ^TQRw = R^TQ^Ty\]

Since \(Q^TQ = I\): \[R^TRw = R^TQ^Ty\]

If \(R\) full rank: \(R^T\) invertible: \[Rw = Q^Ty\]

Back-substitution: Solve triangular system

Complexity: \(O(2np^2)\) but numerically stable

Advantage: Avoids forming \(X^TX\) explicitly, so the condition number is not squared
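A minimal sketch of the QR route above (a dedicated triangular solver would exploit the structure of \(R\); np.linalg.solve is used here only to keep the example short):

import numpy as np

def lstsq_qr(X, y):
    """Least squares via thin QR: solve R w = Q^T y."""
    Q, R = np.linalg.qr(X, mode='reduced')   # X = QR, Q: n x p, R: p x p upper triangular
    return np.linalg.solve(R, Q.T @ y)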

SVD for Robust Solutions

Singular Value Decomposition: \[X = U\Sigma V^T\]

where:

  • \(U \in \mathbb{R}^{n \times n}\): left singular vectors
  • \(\Sigma \in \mathbb{R}^{n \times p}\): diagonal (singular values)
  • \(V \in \mathbb{R}^{p \times p}\): right singular vectors

Solution via SVD: \[w = V\Sigma^+ U^T y\]

where \(\Sigma^+\) = pseudoinverse of \(\Sigma\):

  • If \(\sigma_i > \epsilon\): \(\sigma_i^+ = 1/\sigma_i\)
  • If \(\sigma_i \leq \epsilon\): \(\sigma_i^+ = 0\)

Truncated SVD (regularization): Keep only \(k\) largest singular values
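A sketch of the SVD solution with small singular values zeroed out, assuming a relative threshold (the cutoff eps is an illustrative choice):

import numpy as np

def lstsq_svd(X, y, eps=1e-10):
    """Least squares via the thin-SVD pseudoinverse, dropping tiny singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)    # X = U diag(s) V^T
    s_inv = np.where(s > eps * s.max(), 1.0 / s, 0.0)   # sigma_i^+ = 1/sigma_i or 0
    return Vt.T @ (s_inv * (U.T @ y))                   # w = V Sigma^+ U^T y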

Gradient Descent: Following the Slope

Key idea: Move downhill in direction of steepest descent

Gradient points uphill: \[\nabla J(w) = \frac{2}{n}X^T(Xw - y)\]

Direction of maximum increase of \(J\)

Descent direction: \(-\nabla J(w)\)

Update rule: \[w_{t+1} = w_t - \alpha \nabla J(w_t)\]

where \(\alpha\) = step size / learning rate

Intuition:

  • At each point, linearize the function
  • Take small step in direction that decreases \(J\) most
  • Repeat until convergence

Convergence criterion: \[||\nabla J(w)|| < \epsilon\]

At optimum: \(\nabla J(w^*) = 0\) (critical point)

Convexity of Squared Loss

Definition: \(f\) is convex if for all \(\lambda \in [0,1]\): \[f(\lambda w_1 + (1-\lambda)w_2) \leq \lambda f(w_1) + (1-\lambda)f(w_2)\]

Squared loss is convex: \[J(w) = \frac{1}{n}||y - Xw||^2\]

Proof via Hessian: \[\nabla^2 J = \frac{2}{n}X^TX\]

Since \(X^TX \succeq 0\) (positive semidefinite): \[v^T(X^TX)v = ||Xv||^2 \geq 0 \quad \forall v\]

Therefore \(J\) is convex.

  1. Any local minimum is global minimum
  2. If minimum exists, it’s unique (if \(X^TX \succ 0\))
  3. Gradient descent converges to global minimum
  4. No need for random restarts

Strong convexity if \(X\) full rank: \[J(w) \geq J(w^*) + \frac{\mu}{2}||w - w^*||^2\] where \(\mu = \lambda_{\min}(X^TX)/n > 0\)

Step Size and Convergence

Critical parameter: step size \(\alpha\)

Too small: Slow convergence \[w_{t+1} \approx w_t\]

Too large: Overshoot, divergence \[J(w_{t+1}) > J(w_t)\]

Just right: Fastest convergence

Condition number effect: \[\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}\]

  • Small \(\kappa\): Fast convergence
  • Large \(\kappa\): Slow, zigzagging

Theoretical bounds: For convergence need: \[0 < \alpha < \frac{2}{\lambda_{\max}(X^TX/n)}\]

“Optimal” step size: \[\alpha^* = \frac{2}{\lambda_{\min} + \lambda_{\max}}\]

Convergence rate: \[||w_t - w^*|| \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^t ||w_0 - w^*||\]

Eigenvalues and Geometry

Loss surface shape determined by eigenvalues of \(X^TX\)

Principal axes: eigenvectors of \(X^TX\)

Curvature along each axis: corresponding eigenvalue

Gradient descent behavior:

  1. Fast convergence along high curvature (large \(\lambda\))
  2. Slow along low curvature (small \(\lambda\))
  3. Zigzagging when \(\kappa\) large

Preconditioning: Transform to make all eigenvalues equal \[\tilde{X} = XP^{-1}\] where \(P\) chosen so \(\tilde{X}^T\tilde{X} = I\)

Stochastic Gradient Descent

Use one sample at a time: \[w_{t+1} = w_t - \alpha_t \nabla \ell_{i_t}(w_t)\]

where \(i_t\) randomly selected

For squared loss: \[\nabla \ell_i(w) = -2(y_i - w^T x_i)x_i\]

Unbiased gradient estimate: \[\mathbb{E}[\nabla \ell_i(w)] = \nabla J(w)\]

Noise in gradient: \[\text{Var}[\nabla \ell_i] = \mathbb{E}[||\nabla \ell_i||^2] - ||\nabla J||^2\]

Benefits:

  • Cheap iteration: \(O(p)\) vs \(O(np)\)
  • Online learning capability
  • Escape shallow local minima
  • Implicit regularization via noise

Explaining SGD Convergence

Fixed step size: Converges to neighborhood \[\mathbb{E}[||w_t - w^*||^2] \leq \alpha \sigma^2 + (1-2\alpha\mu)^t||w_0 - w^*||^2\]

where \(\sigma^2\) = gradient noise variance

Decreasing step size: Converges to exact solution

Required conditions (Robbins-Monro): \[\sum_{t=1}^{\infty} \alpha_t = \infty, \quad \sum_{t=1}^{\infty} \alpha_t^2 < \infty\]

Example: \(\alpha_t = \alpha_0/t\)

Convergence rates:

Method | Rate | Final Error
Batch GD | \(O(e^{-t})\) | 0
SGD (fixed \(\alpha\)) | \(O(e^{-t})\) | \(O(\alpha)\)
SGD (decreasing) | \(O(1/t)\) | 0

Variance reduction techniques:

  • Mini-batching
  • Momentum
  • Adaptive methods (Adam, RMSprop)

Mini-batching: Variance Reduction

Mini-batch gradient: \[\nabla J_{\mathcal{B}} = \frac{1}{m} \sum_{i \in \mathcal{B}} \nabla \ell_i(w)\]

where \(|\mathcal{B}| = m\) = batch size

Variance reduction: \[\text{Var}[\nabla J_{\mathcal{B}}] = \frac{1}{m}\text{Var}[\nabla \ell_i]\]

Linear speedup in variance reduction!

Benefits:

  1. Reduces gradient noise by \(\sqrt{m}\)
  2. Allows larger step sizes
  3. Better hardware utilization
  4. Parallelizable

Diminishing returns:

  • Computational cost: \(O(m)\)
  • Variance reduction: \(O(1/m)\)

Epoch: One pass through dataset

  • Batch: 1 update per epoch
  • SGD: \(n\) updates per epoch
  • Mini-batch: \(n/m\) updates per epoch

Maximum Likelihood Estimation

Why Squared Loss?

So far: Minimized squared loss empirically \[J(w) = \frac{1}{n}\sum_{i=1}^n (y_i - w^T x_i)^2\]

But why this particular loss function?

Different losses give different solutions:

  • Squared: \((y - \hat{y})^2\)
  • Absolute: \(|y - \hat{y}|\)
  • Huber: Combination of both

Need principled approach:

  1. What assumptions about data generation?
  2. What are we actually estimating?
  3. How does noise affect our choice?

Maximum likelihood provides framework:

  • Start with probabilistic model
  • Derive optimal estimator
  • Loss function emerges naturally

Modeling Data Generation

Assume data comes from a process: \[y = w^T x + \epsilon\]

where \(\epsilon\) is random noise

What we observe:

  • True relationship: \(w^T x\)
  • Plus noise: \(\epsilon\)

Key questions:

  1. What distribution for \(\epsilon\)?
  2. How does this affect estimation?

Common assumption: \(\epsilon \sim \mathcal{N}(0, \sigma^2)\)

Why Gaussian?

  • Central limit theorem: sum of many small effects
  • Maximum entropy for fixed variance
  • Mathematical tractability
  • Often reasonable in practice

Probabilistic View of Regression

Model: \(y = w^T x + \epsilon\), where \(\epsilon \sim \mathcal{N}(0, \sigma^2)\)

This implies \(y|x\) is Gaussian: \[y|x \sim \mathcal{N}(w^T x, \sigma^2)\]

Probability of observing \(y\) given \(x\): \[p(y|x; w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - w^T x)^2}{2\sigma^2}\right)\]

For entire dataset \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n\):

Assuming independent samples: \[p(\mathcal{D}|w) = \prod_{i=1}^n p(y_i|x_i; w)\]

Maximum likelihood principle: Find \(w\) that makes observed data most probable: \[\hat{w}_{\text{ML}} = \arg\max_w p(\mathcal{D}|w)\]

From Likelihood to Least Squares

Log-likelihood: \[\ell(w) = \log p(\mathcal{D}|w) = \sum_{i=1}^n \log p(y_i|x_i; w)\]

Substitute Gaussian pdf: \[\ell(w) = \sum_{i=1}^n \log \left(\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i - w^T x_i)^2}{2\sigma^2}}\right)\]

Simplify: \[\ell(w) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - w^T x_i)^2\]

To maximize \(\ell(w)\):

  • First term: constant w.r.t. \(w\)
  • Second term: minimize \(\sum_i(y_i - w^T x_i)^2\)

Result: MLE = Least squares!

\[\hat{w}_{\text{ML}} = \arg\min_w \sum_{i=1}^n(y_i - w^T x_i)^2\]

Squared loss emerges from Gaussian noise assumption
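A small numerical check of this equivalence, assuming SciPy is available and synthetic Gaussian-noise data (the true weights and noise level are illustrative): directly minimizing the negative log-likelihood recovers the least-squares solution.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.3 * rng.normal(size=100)

# Negative log-likelihood under Gaussian noise (constants and sigma^2 dropped)
nll = lambda w: np.sum((y - X @ w) ** 2)

w_ml = minimize(nll, x0=np.zeros(3)).x
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(w_ml)
print(w_ls)   # the two estimates agree up to optimizer tolerance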

MLE Gives Normal Equations

Maximize log-likelihood: \[\hat{w} = \arg\max_w \ell(w) = \arg\min_w ||y - Xw||^2\]

Take gradient of loss: \[\nabla_w \left[\frac{1}{2}||y - Xw||^2\right] = -X^T(y - Xw)\]

Set to zero (critical point): \[-X^T(y - Xw) = 0\] \[X^T Xw = X^T y\]

Normal equations (same as before!): \[\boxed{\hat{w}_{\text{ML}} = (X^T X)^{-1} X^T y}\]

In summary:

  • Least squares has probabilistic justification
  • Assumes Gaussian noise
  • Different noise → different estimator
  • MLE provides principled framework

Beyond Point Estimates: Uncertainty Matters

MLE gives us \(\hat{w}\), but:

  • How confident are we?
  • Could true \(w^*\) be far from \(\hat{w}\)?
  • Which parameters are well-determined?

Sources of uncertainty:

  1. Finite sample size
  2. Noise level \(\sigma^2\)
  3. Data distribution (design matrix \(X\))

Statistical inference framework:

  • Point estimate: \(\hat{w}_{\text{ML}}\) (where)
  • Uncertainty: \(\text{Cov}(\hat{w})\) (how sure)
  • Confidence regions: Where \(w^*\) likely lies

MLE is random! \[\hat{w} = (X^TX)^{-1}X^Ty = (X^TX)^{-1}X^T(Xw^* + \epsilon)\] \[= w^* + (X^TX)^{-1}X^T\epsilon\]

Distribution of \(\hat{w}\) depends on distribution of \(\epsilon\)

Fisher Information: Quantifies how much information data contains about parameters

Fisher Information and Uncertainty

Fisher information matrix: \[\mathbf{I}(w) = -\mathbb{E}\left[\frac{\partial^2 \ell(w)}{\partial w \partial w^T}\right]\]

For linear regression: \[\mathbf{I}(w) = \frac{1}{\sigma^2}X^T X\]

Measures “information” about \(w\) in data

Covariance of ML estimator: \[\text{Cov}(\hat{w}_{\text{ML}}) = \mathbf{I}^{-1} = \sigma^2(X^T X)^{-1}\]

  • Large \(X^T X\) → small variance (precise)
  • Small \(X^T X\) → large variance (uncertain)
  • Ill-conditioned \(X^T X\) → very uncertain

Confidence regions: \[(\hat{w} - w^*)^T \mathbf{I} (\hat{w} - w^*) \sim \chi^2_p\]

Forms ellipsoid around estimate
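A minimal sketch of the estimator covariance above; it assumes the noise variance sigma2 is known or has already been estimated (e.g., from residuals), and the function name is illustrative.

import numpy as np

def ml_covariance(X, sigma2):
    """Cov(w_ML) = sigma^2 (X^T X)^{-1}; the diagonal gives per-coefficient variances."""
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return cov, np.sqrt(np.diag(cov))   # (full covariance, standard errors)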

Properties of ML Estimators

Consistency: \[\hat{w}_n \xrightarrow{p} w^* \text{ as } n \to \infty\]

MLE converges to true value with enough data

Asymptotic normality: \[\sqrt{n}(\hat{w}_n - w^*) \xrightarrow{d} \mathcal{N}(0, \sigma^2 K^{-1})\]

where \(K = \lim_{n \to \infty} \frac{1}{n}X^T X\)

Efficiency:

  • Achieves minimum possible variance
  • Cramér-Rao bound: \(\text{Var}(\hat{w}) \geq \mathbf{I}^{-1}\)
  • MLE achieves this bound (asymptotically)

Invariance: If \(\hat{w}\) is MLE of \(w\), then:

  • \(g(\hat{w})\) is MLE of \(g(w)\)
  • Example: \(||\hat{w}||^2\) is MLE of \(||w||^2\)

Beyond MLE: Maximum A Posteriori (MAP)

MLE limitation: Ignores prior knowledge

Bayesian approach: Include prior belief \[p(w|\mathcal{D}) \propto p(\mathcal{D}|w) \cdot p(w)\]

MAP estimation: \[\hat{w}_{\text{MAP}} = \arg\max_w p(w|\mathcal{D})\]

Log posterior: \[= \arg\max_w [\log p(\mathcal{D}|w) + \log p(w)]\] \[= \arg\max_w [\text{log-likelihood} + \text{log-prior}]\]

Example: Gaussian prior \(w \sim \mathcal{N}(0, \tau^2 I)\) \[\log p(w) = -\frac{1}{2\tau^2}||w||^2 + \text{const}\]

MAP objective: \[\hat{w}_{\text{MAP}} = \arg\min_w \left[||y - Xw||^2 + \frac{\sigma^2}{\tau^2}||w||^2\right]\]

This is Ridge regression with \(\lambda = \sigma^2/\tau^2\)
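A minimal sketch of the resulting MAP/ridge estimate in closed form (function name illustrative; \(\lambda\) corresponds to \(\sigma^2/\tau^2\) above):

import numpy as np

def ridge(X, y, lam):
    """MAP estimate with Gaussian prior = ridge regression."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)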

Robust Regression: Laplace Noise → L1 Loss

Why care about different noise models?

Real data often has outliers

  • Sensor errors
  • Data entry mistakes
  • Rare events

Laplace noise model: \[p(\epsilon) = \frac{1}{2b}\exp\left(-\frac{|\epsilon|}{b}\right)\]

Heavy tails → robust to outliers

Likelihood with Laplace noise: \[p(y|x; w) = \frac{1}{2b}\exp\left(-\frac{|y - w^T x|}{b}\right)\]

Log-likelihood: \[\ell(w) = -n\log(2b) - \frac{1}{b}\sum_{i=1}^n |y_i - w^T x_i|\]

Maximizing \(\ell(w)\) equivalent to minimizing: \[\sum_{i=1}^n |y_i - w^T x_i|\]

L1 loss emerges from Laplace noise assumption!

When Linear Models Need Help

Problem 1: Too many features (\(p > n\))

  • \(X^T X\) not invertible
  • Infinite solutions
  • Which one to choose?

Problem 2: Multicollinearity

  • Features highly correlated
  • \(X^T X\) nearly singular
  • Unstable estimates, huge variance

Problem 3: Overfitting

  • Model too flexible
  • Memorizes training data
  • Poor generalization

Solution approach: Add constraints \[J_{\text{reg}}(w) = ||y - Xw||^2 + \lambda R(w)\]

where \(R(w)\) penalizes complexity

This changes the outcome:

  • Unique solution even if \(p > n\)
  • Stabilizes estimation
  • Controls model complexity

Model Selection: Bias-Variance Trade-off

Adding more features:

  • Reduces bias (better fit)
  • Increases variance (more parameters)

Expected prediction error: \[\mathbb{E}[(y - \hat{y})^2] = \text{Bias}^2 + \text{Variance} + \sigma^2\]

For linear models with \(p\) features:

  • Bias: Decreases as \(p\) increases
  • Variance: \(\propto p \sigma^2 / n\)

Overfitting occurs when:

  • Model too complex for amount of data
  • Variance dominates bias
  • Training error \(\ll\) test error

Need way to tune complexity

Cross-validation estimates test error using training data

Regularization: Adding Constraints

Modified objective: \[J_{\text{reg}}(w) = ||y - Xw||^2 + \lambda ||w||^2\]

Ridge regression: L2 penalty

New solution: \[\hat{w}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y\]

What this does:

  • Always invertible (even if \(p > n\))
  • Shrinks weights toward zero
  • Reduces variance at cost of bias

Bayesian interpretation:

  • Prior belief: \(w \sim \mathcal{N}(0, \tau^2 I)\)
  • Small weights more likely
  • \(\lambda = \sigma^2/\tau^2\)

Connection to MLE:

  • MLE: maximize likelihood only
  • MAP: maximize posterior \(\propto\) likelihood × prior
  • Ridge = MAP with Gaussian prior

Information Theory and Loss Functions

Why Information Theory in Machine Learning?

How do we measure if a model is good?

So far: Squared error seemed natural for regression \[\text{Loss} = (y - \hat{y})^2\]

But why squared? Why not absolute? Cubic?

Information theory provides the answer:

  • Loss functions emerge from probabilistic models
  • Different distributions → different optimal losses
  • Unifies regression and classification
  • Provides principled model comparison

Learning reduces uncertainty about data - information theory quantifies this

Information Content: Quantifying Surprise

Intuition: Rare events are more “informative”

Information content of outcome \(x\): \[I(x) = -\log p(x)\]

Units:

  • Base 2: bits (binary digits)
  • Base e: nats (natural units)
  • Base 10: dits (decimal digits)

Properties:

  1. \(I(x) \geq 0\) (information is non-negative)
  2. \(p(x) = 1 \Rightarrow I(x) = 0\) (certain events have no surprise)
  3. \(p(x) \to 0 \Rightarrow I(x) \to \infty\) (impossible events maximally surprising)

Why logarithm?

  • Additivity: Independent events \(A, B\): \[I(A \cap B) = I(A) + I(B)\] \[-\log p(A)p(B) = -\log p(A) - \log p(B)\]

Example: Fair coin flip

  • \(p(\text{heads}) = 0.5\)
  • \(I(\text{heads}) = -\log_2(0.5) = 1\) bit

Entropy: Average Information

Entropy = Expected information content \[H(X) = \mathbb{E}[I(X)] = -\mathbb{E}[\log p(X)]\] \[= -\sum_x p(x) \log p(x)\]

Interpretation:

  • Average uncertainty before observing \(X\)
  • Average surprise when observing \(X\)
  • Minimum bits needed to encode \(X\) (on average)

Properties:

  1. \(H(X) \geq 0\) (entropy is non-negative)
  2. \(H(X) = 0 \iff X\) is deterministic
  3. Maximum when uniform: \(H(X) \leq \log |X|\)

Examples:

  • Coin (fair): \(H = 1\) bit
  • Coin (biased, \(p=0.9\)): \(H = 0.47\) bits
  • Die (fair): \(H = \log_2 6 = 2.58\) bits

Continuous case (differential entropy): \[H(X) = -\int p(x) \log p(x) \, dx\]

Note: Can be negative for continuous distributions
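A small sketch of discrete entropy that reproduces the coin and die examples above (helper name and the 0 log 0 := 0 convention are the usual assumptions):

import numpy as np

def entropy(p, base=2):
    """H(p) = -sum p log p for a discrete distribution (0 log 0 := 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5]))     # 1.0 bit   (fair coin)
print(entropy([0.9, 0.1]))     # ~0.47 bits (biased coin)
print(entropy([1/6] * 6))      # ~2.58 bits (fair die)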

Maximum Entropy Principle

Question: What distribution should we assume given constraints?

Maximum entropy principle: Choose distribution with maximum entropy subject to known constraints

Why?

  • Least assumptions beyond constraints
  • Most “uncertain” = least biased
  • Unique and consistent

Example 1: Known mean and variance \[\max_{p} H(p) \text{ s.t. } \mathbb{E}[X] = \mu, \text{Var}(X) = \sigma^2\]

Solution: Gaussian \(\mathcal{N}(\mu, \sigma^2)\)

Example 2: Positive with known mean \[\max_{p} H(p) \text{ s.t. } X > 0, \mathbb{E}[X] = \lambda\]

Solution: Exponential with rate \(1/\lambda\)

This explains why:

  • Gaussian noise is common assumption
  • Exponential for waiting times
  • Uniform when “no information”

Comparing Distributions

Problem: How do we measure if two distributions are “different”?

Not just point-wise difference:

  • Distributions are functions, not numbers
  • Need to weight by importance (probability)
  • Want single scalar measure

Natural idea: If true distribution is \(p\) but we design code for \(q\):

  • Optimal code for \(q\): \(-\log q(x)\) bits
  • But \(x\) actually comes from \(p\)
  • Expected code length: \(\mathbb{E}_p[-\log q(X)]\)

Extra bits needed: \[\text{Inefficiency} = \mathbb{E}_p[-\log q(X)] - \mathbb{E}_p[-\log p(X)]\] \[= H(p,q) - H(p)\]

This motivates KL divergence as the “coding penalty” for using wrong distribution

Defining Cross-Entropy

Cross-entropy between distributions \(p\) and \(q\): \[H(p, q) = -\mathbb{E}_p[\log q(X)] = -\int p(x) \log q(x) \, dx\]

Interpretation: Average bits to encode samples from \(p\) using code optimized for \(q\)

Properties:

  • \(H(p, q) \geq H(p)\) (using wrong code is worse)
  • \(H(p, q) = H(p)\) iff \(p = q\) (optimal when correct)
  • Not symmetric: \(H(p, q) \neq H(q, p)\)

For empirical distribution (data): \[p_{data} = \frac{1}{n}\sum_{i=1}^n \delta(x - x_i)\]

Cross-entropy becomes: \[H(p_{data}, q) = -\frac{1}{n}\sum_{i=1}^n \log q(x_i)\]

This is negative average log-likelihood!

KL Divergence: Measuring Distribution Distance

Kullback-Leibler (KL) divergence: \[D_{KL}(p||q) = \mathbb{E}_p\left[\log \frac{p(X)}{q(X)}\right]\] \[= \int p(x) \log \frac{p(x)}{q(x)} \, dx\]

Interpretation:

  • Extra bits needed to encode \(p\) using code for \(q\)
  • Information lost when approximating \(p\) with \(q\)
  • Expected log-likelihood ratio

Properties:

  1. \(D_{KL}(p||q) \geq 0\) (non-negative)
  2. \(D_{KL}(p||q) = 0 \iff p = q\) a.e.
  3. Not symmetric: \(D_{KL}(p||q) \neq D_{KL}(q||p)\)
  4. Not a metric: No triangle inequality

Forward vs Reverse KL:

  • Forward \(D_{KL}(p||q)\): \(q\) must cover all of \(p\)
  • Reverse \(D_{KL}(q||p)\): \(q\) can be more focused

From previous slide’s definition: \[H(p, q) = H(p) + D_{KL}(p||q)\]

Connection to likelihood: \[D_{KL}(p_{data}||p_{model}) = \text{const} - \mathbb{E}_{data}[\log p_{model}]\]
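A minimal sketch of cross-entropy and KL divergence for discrete distributions, illustrating the decomposition \(D_{KL}(p\|q) = H(p,q) - H(p)\) and the asymmetry (the example distributions are arbitrary):

import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x), in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """D_KL(p || q) = H(p, q) - H(p), in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return cross_entropy(p, q) - cross_entropy(p, p)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, q), kl_divergence(q, p))   # both >= 0, not equal (asymmetric)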

Maximum Likelihood via Cross-Entropy Minimization

Learning as distribution matching:

Want model \(q_\theta\) to match data distribution \(p_{data}\)

Minimize KL divergence: \[\min_\theta D_{KL}(p_{data} || q_\theta)\]

Expanding KL: \[D_{KL}(p_{data} || q_\theta) = H(p_{data}, q_\theta) - H(p_{data})\]

Since \(H(p_{data})\) doesn’t depend on \(\theta\): \[\min_\theta D_{KL}(p_{data} || q_\theta) \equiv \min_\theta H(p_{data}, q_\theta)\]

With empirical distribution (data as delta functions): \[p_{data} = \frac{1}{n}\sum_{i=1}^n \delta(x - x_i)\]

\[H(p_{data}, q_\theta) = -\frac{1}{n}\sum_{i=1}^n \log q_\theta(x_i)\]

This is negative average log-likelihood!

\[\min_\theta H(p_{data}, q_\theta) \equiv \max_\theta \frac{1}{n}\sum_{i=1}^n \log q_\theta(x_i)\]

MLE = Cross-entropy minimization = KL minimization

From Noise Models to Loss Functions

General principle: Loss = negative log-likelihood \[\text{Loss}(y, \hat{y}) = -\log p(y|\hat{y})\]

Different noise → different loss:

Gaussian noise: \(y = f(x) + \epsilon\), \(\epsilon \sim \mathcal{N}(0, \sigma^2)\) \[p(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - f(x))^2}{2\sigma^2}\right)\] \[-\log p(y|x) = \frac{(y - f(x))^2}{2\sigma^2} + \text{const}\] \(\Rightarrow\) Squared loss

Laplace noise: \(\epsilon \sim \text{Laplace}(0, b)\) \[p(y|x) = \frac{1}{2b} \exp\left(-\frac{|y - f(x)|}{b}\right)\] \[-\log p(y|x) = \frac{|y - f(x)|}{b} + \text{const}\] \(\Rightarrow\) Absolute loss

Categorical: \(y \in \{1, ..., K\}\), \(p(y=k|x) = \pi_k\) \[-\log p(y|x) = -\log \pi_y\] \(\Rightarrow\) Cross-entropy loss

KL Divergence and Entropy Relationships

Decomposing KL divergence: \[D_{KL}(p||q) = \int p(x) \log \frac{p(x)}{q(x)} dx\]

Expanding the logarithm: \[D_{KL}(p||q) = \int p(x) \log p(x) dx - \int p(x) \log q(x) dx\] \[= -H(p) + [-\mathbb{E}_p[\log q]]\] \[= H(p, q) - H(p)\]

Decomposition: KL = Cross-entropy - Entropy

Mutual information is KL of a special form: \[I(X;Y) = D_{KL}(p(x,y) || p(x)p(y))\]

Compares joint distribution to what it would be if independent

  • If \(X \perp Y\): \(p(x,y) = p(x)p(y)\) \(\Rightarrow\) \(I(X;Y) = 0\)
  • If dependent: \(I(X;Y) > 0\) quantifies dependence

Alternative forms: \[I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)\] \[= H(X) + H(Y) - H(X,Y)\]

Mutual Information: Shared Information

Mutual information between \(X\) and \(Y\): \[I(X; Y) = D_{KL}(p(x,y) || p(x)p(y))\] \[= \mathbb{E}_{X,Y}\left[\log \frac{p(X,Y)}{p(X)p(Y)}\right]\]

Equivalent forms: \[I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)\] \[= H(X) + H(Y) - H(X,Y)\]

Interpretation:

  • Information gained about \(X\) by observing \(Y\)
  • Reduction in uncertainty of \(X\) given \(Y\)
  • Measure of dependence (0 iff independent)

Properties:

  1. \(I(X; Y) \geq 0\) (non-negative)
  2. \(I(X; Y) = 0 \iff X \perp Y\) (independence)
  3. \(I(X; Y) = I(Y; X)\) (symmetric)
  4. \(I(X; X) = H(X)\) (self-information)

Applications in ML:

  • Feature selection
  • Information bottleneck
  • Representation learning

Loss functions from information theory:

  1. Regression (continuous \(Y\)):

    • Gaussian noise → MSE
    • Laplace noise → MAE
    • Student-t noise → Robust loss
  2. Classification (discrete \(Y\)):

    • Binary: Cross-entropy
    • Multi-class: Categorical cross-entropy
    • Multi-label: Binary cross-entropy per label
  3. Generative models:

    • VAE: ELBO maximization
    • GAN: JS divergence minimization
    • Normalizing flows: Exact likelihood

Information bottleneck principle: \[\min I(X; Z) - \beta I(Z; Y)\]

Compress \(X\) into \(Z\) while preserving information about \(Y\)

Representation learning: Learn features that maximize \(I(Z; Y)\) while minimizing \(I(Z; X\setminus Y)\)

Loss Functions Emerge from Probabilistic Models

Loss = Negative log-likelihood under noise model

Regression with Gaussian noise: \[y = f(x) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)\] \[p(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-f(x))^2}{2\sigma^2}\right)\] \[-\log p(y|x) = \frac{(y-f(x))^2}{2\sigma^2} + \text{const}\] \[\Rightarrow \text{Loss} = (y - f(x))^2\]

Regression with Laplace noise: \[\epsilon \sim \text{Laplace}(0, b)\] \[-\log p(y|x) = \frac{|y-f(x)|}{b} + \text{const}\] \[\Rightarrow \text{Loss} = |y - f(x)|\]

Classification (discrete \(y \in \{1, ..., K\}\)):

Model outputs probabilities: \(\pi_k(x) = P(y=k|x)\)

Categorical distribution: \[p(y|x) = \prod_{k=1}^K \pi_k(x)^{[y=k]}\]

Negative log-likelihood: \[-\log p(y|x) = -\sum_{k} [y=k] \log \pi_k\]

One-hot encoding: \(y_k = [y=k]\) \[\Rightarrow \text{Loss} = -\sum_{k} y_k \log \pi_k\]

This is categorical cross-entropy!

Binary case (\(K=2\)): \[\text{Loss} = -y\log \pi - (1-y)\log(1-\pi)\]

Bernoulli leads to binary cross-entropy

Later: How to ensure \(\pi \in [0,1]\) → sigmoid/softmax
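A minimal sketch of the binary cross-entropy loss above, assuming the model already outputs probabilities in \([0,1]\) (the clipping constant eps is an illustrative guard against \(\log 0\)):

import numpy as np

def binary_cross_entropy(y, p_hat, eps=1e-12):
    """Mean of -y log p - (1-y) log(1-p) over the samples."""
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6])))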

Classification Setup and Limitations

Can We Use Regression for Classification?

Binary classification problem:

  • Input: \(x \in \mathbb{R}^p\)
  • Output: \(y \in \{0, 1\}\) or \(y \in \{-1, +1\}\)

Naive approach: Treat as regression

  1. Encode classes numerically
  2. Apply linear regression: \(\hat{y} = w^Tx + b\)
  3. Threshold the output:
    • If \(y \in \{0,1\}\): predict 1 if \(\hat{y} > 0.5\)
    • If \(y \in \{-1,+1\}\): predict sign(\(\hat{y}\))

Seems reasonable?

  • Classes have numeric labels
  • Can minimize squared error
  • Get a decision boundary

But there are serious problems…

Problem 1: Unbounded Outputs

Regression outputs can be any real number

For points far from boundary:

  • \(\hat{y} \gg 1\) or \(\hat{y} \ll 0\)
  • Cannot interpret as probability
  • \(P(y=1|x) = 2.7\)? Nonsense!

Squared loss encourages extreme values:

  • Point correctly classified at \(\hat{y} = 0.9\)
  • Loss = \((1 - 0.9)^2 = 0.01\)
  • Point correctly classified at \(\hat{y} = 5.0\)
  • Loss = \((1 - 5.0)^2 = 16\) (penalized for being too correct!)

Outliers dominate the loss:

  • Single point far from boundary
  • Contributes enormous squared error
  • Pulls decision boundary away from optimal

Need: Output constrained to \([0, 1]\)

Problem 2: Wrong Loss Function

Classification is fundamentally different from regression

Regression: Minimize distance to target

  • Error magnitude matters
  • 0.1 error better than 0.2 error

Classification: Correct or incorrect

  • Only threshold crossing matters
  • Confidence beyond threshold less important

Information theory perspective:

  • Binary outcome: \(y \sim \text{Bernoulli}(p)\)
  • Optimal loss: \(-y\log p - (1-y)\log(1-p)\)
  • Not squared error!

Squared loss for binary data:

  • Assumes Gaussian noise
  • But \(y \in \{0,1\}\) cannot be Gaussian
  • Violates fundamental assumption

We’re using the wrong probabilistic model!

Decision Boundaries

Linear decision boundary: \[w^Tx + b = 0\]

Defines hyperplane in \(\mathbb{R}^p\)

Geometric interpretation:

  • \(w\): normal vector to hyperplane
  • \(|b|/||w||\): distance from origin
  • Points satisfy: \(w^Tx + b > 0\) (positive side)
  • Points satisfy: \(w^Tx + b < 0\) (negative side)

Distance to boundary: \[d(x) = \frac{|w^Tx + b|}{||w||}\]

Signed distance (positive = class 1 side): \[d_{\text{signed}}(x) = \frac{w^Tx + b}{||w||}\]

Decision rule: \[\hat{y} = \begin{cases} 1 & \text{if } w^Tx + b > 0 \\ 0 & \text{if } w^Tx + b \leq 0 \end{cases}\]

Fundamental Limitation: Linear Separability

Not all problems are linearly separable

Classic example: XOR

  • \((0,0) \to 0\)
  • \((0,1) \to 1\)
  • \((1,0) \to 1\)
  • \((1,1) \to 0\)

No single line can separate the classes!

Linear separability requires: \[\exists w, b : \forall i, \quad y_i(w^Tx_i + b) > 0\]

When linear fails:

  • Non-convex class regions
  • Classes with holes
  • Nested or interleaved patterns
  • Manifold structures

Solutions:

  1. Feature transformation: \(\phi(x)\)
  2. Multiple linear pieces (neural networks)
  3. Kernel methods (implicit transformation)

Why These Problems Matter

Practical implications:

  1. Medical diagnosis: Disease probability needed

    • Regression gives \(\hat{y} = -0.3\) or \(\hat{y} = 1.7\)
    • Cannot interpret as risk score
    • Cannot make calibrated decisions
  2. Credit scoring: Default probability

    • Need \(P(\text{default}) \in [0,1]\)
    • Regression unbounded
    • Cannot set risk-based thresholds
  3. Multi-class: Even worse

    • One-vs-all regression
    • Outputs don’t sum to 1
    • No probabilistic interpretation

What we need:

  • Output: true probability \(P(y=1|x) \in [0,1]\)
  • Loss: appropriate for Bernoulli/categorical
  • Method: respects classification structure

What’s Next: Logistic Regression

Solutions to our problems:

  1. Sigmoid function: Maps \(\mathbb{R} \to [0,1]\) \[\sigma(z) = \frac{1}{1 + e^{-z}}\]

  2. Log-odds (logit) transformation: \[\log\frac{P(y=1|x)}{P(y=0|x)} = w^Tx + b\]

  3. Cross-entropy loss (from information theory): \[L = -y\log\hat{p} - (1-y)\log(1-\hat{p})\]

  4. Maximum likelihood estimation:

    • Proper probabilistic framework
    • Bernoulli distribution for binary outcomes

Multi-class: Softmax function \[P(y=k|x) = \frac{e^{w_k^Tx}}{\sum_j e^{w_j^Tx}}\]

Ensures outputs sum to 1!
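A small sketch of the softmax computation, with the standard max-shift for numerical stability (an implementation detail, not stated on the slide):

import numpy as np

def softmax(z):
    """Softmax over the last axis; subtracting the max leaves the result unchanged."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

print(softmax(np.array([2.0, 1.0, 0.1])))   # nonnegative, sums to 1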

Implementation and Practical Considerations

Feature Engineering: Expanding Linear Models

Linear in parameters, not features:

Original features: \(x \in \mathbb{R}^d\)

Transform to: \(\phi(x) \in \mathbb{R}^p\) where \(p > d\)

Model: \(y = w^T\phi(x) + b\)

Polynomial features: \[x \mapsto [1, x, x^2, x^3, ..., x^p]\]

Two variables: \[[x_1, x_2] \mapsto [1, x_1, x_2, x_1^2, x_1x_2, x_2^2, ...]\]

Still solving linear regression: \[\min_w ||\mathbf{y} - \Phi \mathbf{w}||^2\]

where \(\Phi_{ij} = \phi_j(x_i)\)

Complexity growth:

  • Degree 2: \(O(d^2)\) features
  • Degree 3: \(O(d^3)\) features
  • Degree \(p\): \(O(d^p)\) features

Curse of dimensionality!
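A short sketch of polynomial basis expansion using scikit-learn's PolynomialFeatures (the tiny two-sample matrix is purely illustrative); regression on \(\Phi\) is still linear in the parameters:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])      # two samples, d = 2 features

Phi = PolynomialFeatures(degree=2).fit_transform(X)
# Columns: [1, x1, x2, x1^2, x1*x2, x2^2]
print(Phi)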

Normalization and Preprocessing

Why normalize?

  • Features on different scales
  • Gradient descent convergence
  • Numerical stability

Standardization (z-score): \[x' = \frac{x - \mu}{\sigma}\]

  • Zero mean, unit variance
  • Preserves outliers
  • Assumes roughly Gaussian

Min-Max scaling: \[x' = \frac{x - x_{min}}{x_{max} - x_{min}}\]

  • Maps to [0, 1]
  • Sensitive to outliers
  • Bounded output

Robust scaling: \[x' = \frac{x - \text{median}}{\text{IQR}}\]

  • Uses median and interquartile range
  • Robust to outliers

When to use which:

  • Standardization: Most cases, especially with Gaussian assumptions
  • Min-Max: Neural networks, bounded inputs needed
  • Robust: Data with outliers
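Minimal sketches of the three scalings above (column-wise, function names illustrative). In practice the statistics should be computed on the training set only and reused on validation/test data.

import numpy as np

def standardize(X):
    """Z-score each column: zero mean, unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max(X):
    """Scale each column to [0, 1]."""
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def robust_scale(X):
    """Center by the median, scale by the interquartile range."""
    q1, med, q3 = np.percentile(X, [25, 50, 75], axis=0)
    return (X - med) / (q3 - q1)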

Multicollinearity and Regularization

Multicollinearity: Features are highly correlated

Problems:

  • \((X^TX)\) nearly singular
  • Unstable parameter estimates
  • Large parameter variance
  • Poor generalization

Detection:

  • Correlation matrix
  • Variance Inflation Factor (VIF)
  • Condition number of \(X^TX\)

VIF for feature \(j\): \[\text{VIF}_j = \frac{1}{1 - R_j^2}\] where \(R_j^2\) = R² from regressing \(x_j\) on other features

Solutions:

  1. Remove correlated features
  2. Principal Component Analysis (PCA)
  3. Regularization (L2/Ridge): \[\min_w ||y - Xw||^2 + \lambda ||w||^2\] Solution: \(w = (X^TX + \lambda I)^{-1}X^Ty\)

Adding \(\lambda I\) improves conditioning!

Residual Analysis and Model Diagnostics

Residuals: \(e_i = y_i - \hat{y}_i\)

Assumptions to check:

  1. Linearity: Residuals vs fitted values

    • Should show no pattern
    • Curvature suggests nonlinearity
  2. Homoscedasticity: Constant variance

    • Residual spread shouldn’t change
    • Funnel shape = heteroscedasticity
  3. Normality: Q-Q plot

    • Points should follow diagonal
    • Heavy tails or skewness problematic
  4. Independence: Autocorrelation

    • No pattern in residual sequence
    • Durbin-Watson test

When linear regression fails:

  • Nonlinear relationships → Polynomial/nonlinear models
  • Non-constant variance → Transform response or weighted regression
  • Non-normal errors → Robust regression or different loss
  • Correlated errors → Time series models

Implementation: NumPy vs Scikit-learn

NumPy implementation:

import numpy as np

# Normal equations
def linear_regression_normal(X, y):
    # Add bias column
    X_b = np.c_[np.ones(len(X)), X]
    # Solve the (p+1) x (p+1) system X_b^T X_b theta = X_b^T y
    theta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
    return theta

# Gradient descent
def linear_regression_gd(X, y, lr=0.01, epochs=1000):
    n, p = X.shape
    X_b = np.c_[np.ones(n), X]
    theta = np.zeros(p + 1)

    for _ in range(epochs):
        # Full-batch gradient of (1/n)||y - X_b theta||^2
        gradients = 2/n * X_b.T @ (X_b @ theta - y)
        theta = theta - lr * gradients
    return theta

Scikit-learn:

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
# Coefficients: model.coef_, model.intercept_

Scikit-learn advantages:

  • Handles edge cases
  • Built-in preprocessing
  • Cross-validation support
  • Multiple solver options

Performance Metrics

Regression metrics:

Mean Squared Error (MSE): \[\text{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2\]

Root MSE: \(\text{RMSE} = \sqrt{\text{MSE}}\) (same units as \(y\))

R² (Coefficient of Determination): \[R^2 = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}\]

  • \(R^2 = 1\): Perfect fit
  • \(R^2 = 0\): No better than mean
  • \(R^2 < 0\): Worse than mean (possible on test set)

Mean Absolute Error (MAE): \[\text{MAE} = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|\]

More robust to outliers than MSE
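A minimal sketch computing the regression metrics defined above in one pass (function name illustrative):

import numpy as np

def regression_metrics(y, y_hat):
    """MSE, RMSE, MAE, and R^2 as defined above."""
    err = y - y_hat
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)
    return {"MSE": mse, "RMSE": np.sqrt(mse), "MAE": mae, "R2": r2}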

Classification metrics (linear as classifier):

  • Accuracy: Fraction correct
  • Confusion matrix
  • But linear regression poor for classification!

When Linear Models Fail

Linear regression limitations:

  1. Nonlinear relationships

    • Solution: Polynomial features, kernels, neural networks
  2. Non-Gaussian noise

    • Solution: Different loss functions (MAE, Huber)
  3. Discrete outputs

    • Solution: Logistic regression, classification models
  4. High dimensionality (\(p > n\))

    • Solution: Regularization, dimensionality reduction
  5. Complex interactions

    • Solution: Tree models, neural networks
  6. Temporal/spatial correlation

    • Solution: Time series models, spatial models

Signs of failure:

  • Poor R² on test set
  • Patterned residuals
  • Prediction outside reasonable range
  • High variance in parameters

Remember: Linear models are foundation

  • Many advanced methods build on them
  • Often good baseline
  • Interpretable