MMSE Estimation and Prediction

EE 541 - Unit 3

Dr. Brandon Franzke

Fall 2025

Outline

Estimation Theory

Partial Observability

  • Partial observability in systems
  • Loss functions and MSE justification
  • Predictor classes: constant to unrestricted
  • Signal-noise decomposition

Conditional Expectation

  • Joint to conditional distributions
  • Computing \(\mathbb{E}[Y|X]\) for discrete/continuous
  • Tower property and variance decomposition
  • Properties: pulling out known terms

MMSE Theory

  • Pointwise optimization strategy
  • Proof: \(\mathbb{E}[Y|X]\) minimizes MSE
  • Minimum achievable: \(\mathbb{E}[\text{Var}(Y|X)]\)
  • Computational challenges

Linear Methods & Extensions

Linear MMSE

  • Quadratic optimization in \((a,b)\)
  • Optimal: \(a^* = \text{Cov}(X,Y)/\text{Var}(X)\)
  • Orthogonality principle
  • Performance: MSE = \(\text{Var}(Y)(1-\rho^2)\)

Algorithms & Implementation

  • LMS (Widrow-Hoff) updates
  • Convergence: \(0 < \mu < 2/\mathbb{E}[X^2]\)
  • Numerical stability considerations
  • Sample-based computation

Special Cases

  • Gaussian: LMMSE = MMSE
  • When linear is optimal
  • Vector extensions and normal equations
  • Connection to ML/regression

Estimation Under Uncertainty

Partial Observability and Uncertainty

Partial observability is fundamental

Real systems operate with incomplete information:

  • Temperature sensor reads 72°F ± 2°F
  • Radar measures distance with noise
  • Stock price reflects unknown future earnings
  • MRI image contains artifacts

Objective: Given measurements (\(X\)), infer unknowns (\(Y\))

Uncertainty is inherent, not a nuisance

  • Measurement noise
  • Model uncertainty
  • Finite samples
  • Quantization effects

Need algorithms for optimal inference under uncertainty

Math from Intuition

Problem:

  • \(Y\): quantity of interest (unknown)
  • \(X\): observation (known)
  • \((X, Y)\): jointly distributed random variables

Predictor function: \(g: \mathbb{R} \to \mathbb{R}\) \[\hat{Y} = g(X)\]

Example: Room temperature estimation

  • \(Y\): true temperature
  • \(X\): sensor reading = \(Y + N\)
  • \(N\): measurement noise
  • \(g(X) =\) ? Best guess for \(Y\) given reading \(X\)

Among all possible functions \(g\), which one is “best”?

Need:

  1. What we mean by “best” (loss function)
  2. How to average over randomness (expectation)

Defining “Best”: Loss Functions

Loss function \(\ell(y, \hat{y})\) measures prediction quality

Common choices:

  • Squared \((y - \hat{y})^2\): differentiable, convex, emphasizes large errors
  • Absolute \(|y - \hat{y}|\): robust to outliers, non-differentiable at 0
  • 0-1 \(\mathbb{1}[y \neq \hat{y}]\): classification, discontinuous
  • Huber \(\begin{cases} \frac{1}{2}(y-\hat{y})^2 & |y-\hat{y}| \leq \delta \\ \delta|y-\hat{y}| - \frac{\delta^2}{2} & \text{otherwise} \end{cases}\): robust and differentiable

Risk: Expected loss over joint distribution \[R(g) = \mathbb{E}[\ell(Y, g(X))] = \int\int \ell(y, g(x)) p(x,y) \, dx \, dy\]

Goal: Find \(g^* = \arg\min_g R(g)\)

Why Squared Error?

Mathematical advantages:

  1. Differentiable: \[\frac{d}{d\hat{y}}(y - \hat{y})^2 = -2(y - \hat{y})\] Smooth objective

  2. Unique minimum: Strictly convex → no local minima

  3. Closed-form solutions: Often analytically tractable

  4. Decomposition: \[\text{MSE} = \text{Bias}^2 + \text{Variance}\]

Statistical connection:

Under Gaussian noise \(N \sim \mathcal{N}(0, \sigma^2)\): \[Y = \mu + N\]

Maximum likelihood estimation: \[\hat{\mu}_{\text{MLE}} = \arg\max_\mu p(y|\mu) = \arg\min_\mu (y - \mu)^2\]

MSE ↔ Gaussian MLE

Mean Squared Error (MSE)

Definition: \[\text{MSE}(g) = \mathbb{E}[(Y - g(X))^2]\]

Expanding the expectation: \[\text{MSE}(g) = \int\int (y - g(x))^2 p(x,y) \, dx \, dy\]

Empirical MSE (from data): \[\widehat{\text{MSE}} = \frac{1}{n} \sum_{i=1}^n (y_i - g(x_i))^2\]

Properties:

  • Always non-negative: MSE ≥ 0
  • Zero iff perfect prediction: MSE = 0 ⟺ P(Y = g(X)) = 1
  • Units: [MSE] = [Y]²

Other forms:

  • RMSE = √MSE (same units as Y)
  • Normalized MSE = MSE/Var(Y)
  • \(R^2 = 1 - \text{MSE}/\text{Var}(Y)\) (fraction of variance explained)
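A minimal numerical sketch (assuming NumPy; the data model and array names are illustrative) of computing these summary metrics from samples:

```python
import numpy as np

def mse_metrics(y, y_hat):
    """Empirical MSE, RMSE, and fraction of variance explained."""
    err = y - y_hat
    mse = np.mean(err**2)            # units of [Y]^2
    rmse = np.sqrt(mse)              # same units as Y
    r2 = 1.0 - mse / np.var(y)       # 1 - MSE/Var(Y)
    return mse, rmse, r2

# toy example: noisy linear data and a candidate predictor
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(scale=0.5, size=1000)
print(mse_metrics(y, 2.0 * x))       # MSE ~ 0.25, R^2 ~ 0.94
```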

Classes of Predictors

Hierarchy of function classes:

  1. Constant: \(\hat{Y} = c\)
    • Ignores observation X
    • Optimal: \(c^* = \mathbb{E}[Y]\)
    • MSE = Var(Y)
  2. Linear (affine): \(\hat{Y} = aX + b\)
    • Simple, interpretable
    • Only two parameters to optimize
    • Many relationships are approximately linear
  3. Polynomial: \(\hat{Y} = \sum_{k=0}^p a_k X^k\)
    • More flexible
    • Risk of overfitting
  4. Unrestricted: \(\hat{Y} = g(X)\)
    • Any measurable function
    • Maximum flexibility
    • Requires knowing \(p(y|x)\)

Flexibility vs Tractability

Linear Estimators: Definition

Definition: A linear estimator has the form \[\hat{Y} = aX + b\] where \(a, b \in \mathbb{R}\) are constants.

Note: “Linear” means linear in \(X\), not linear in parameters.

  • \(\hat{Y} = 3X + 2\) is linear (in \(X\))
  • \(\hat{Y} = \beta X^2\) is NOT linear (quadratic in \(X\))
  • Despite \(\beta\) appearing linearly!

Parameters:

  • \(a\): slope/gain/sensitivity
  • \(b\): offset/bias/intercept

Special cases:

  • \(b = 0\): Proportional estimator \(\hat{Y} = aX\)
  • \(a = 0\): Constant estimator \(\hat{Y} = b\)
  • \(a = 1, b = 0\): Direct use \(\hat{Y} = X\)

Property: Superposition If \(\hat{Y}_1 = aX_1 + b\) and \(\hat{Y}_2 = aX_2 + b\), then: \[\alpha\hat{Y}_1 + \beta\hat{Y}_2 = a(\alpha X_1 + \beta X_2) + (\alpha + \beta)b\]

Matrix form (preview): \[\hat{\mathbf{y}} = \mathbf{A}\mathbf{x} + \mathbf{b}\]

Why These Statistical Quantities Matter

Mean E[Y]: Baseline predictor

  • Best constant estimate
  • What to predict without observing \(X\)
  • MSE = Var(Y) when ignoring observations

Variance Var(Y): Spread/uncertainty

  • Total variation to explain
  • Upper bound on MSE (constant predictor)
  • Normalizes performance: R² = 1 - MSE/Var(Y)

Covariance Cov(X,Y): Linear relationship strength

  • Determines optimal linear slope
  • Zero → X doesn’t help linearly
  • Sign → direction of relationship

Correlation ρ: Normalized association

  • Scale-free measure ∈ [-1, 1]
  • |ρ| = 1 → perfect linear prediction possible
  • ρ² = fraction of variance explained linearly

These moments completely determine the optimal linear estimator

Moments of Random Variables

First moments: \[\mathbb{E}[Y] = \int y \, p(y) \, dy\]

Second moments: \[\mathbb{E}[Y^2] = \int y^2 \, p(y) \, dy\]

Variance: \[\text{Var}(Y) = \mathbb{E}[(Y - \mathbb{E}[Y])^2] = \mathbb{E}[Y^2] - (\mathbb{E}[Y])^2\]

Covariance: \[\text{Cov}(X,Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]\] \[= \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]\]

Correlation coefficient: \[\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}} \in [-1, 1]\]

  • \(|\rho| = 1\): Perfect linear relationship
  • \(\rho = 0\): Uncorrelated (not necessarily independent!)
  • Sign indicates direction

Signal + Noise Framework

Additive noise model: \[X = Y + N\]

  • \(Y\): true signal (what we want)
  • \(N\): noise (corruption)
  • \(X\): observation (what we get)
  • Independence: \(Y \perp N\)

Signal-to-Noise Ratio (SNR): \[\text{SNR} = \frac{\text{Power}(Y)}{\text{Power}(N)} = \frac{\mathbb{E}[Y^2]}{\mathbb{E}[N^2]}\]

For zero-mean: \[\text{SNR} = \frac{\sigma_Y^2}{\sigma_N^2}\]

In dB: \(\text{SNR}_{\text{dB}} = 10\log_{10}(\text{SNR})\)

Fundamental limit: Even optimal estimation cannot remove all of the noise. For zero-mean \(Y \perp N\), the best linear estimator achieves \[\text{MSE}_{\text{LMMSE}} = \frac{\sigma_Y^2 \sigma_N^2}{\sigma_Y^2 + \sigma_N^2} = \frac{\sigma_Y^2}{1 + \text{SNR}}\] and for jointly Gaussian \(Y\) and \(N\) no estimator can do better.
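This limit can be checked by simulation. A hedged sketch, assuming NumPy and Gaussian signal and noise with illustrative variances:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_y, sigma_n = 1.0, 0.5          # signal and noise standard deviations
n = 200_000
y = rng.normal(0, sigma_y, n)        # zero-mean signal
x = y + rng.normal(0, sigma_n, n)    # observation X = Y + N

snr = sigma_y**2 / sigma_n**2
a = np.cov(x, y)[0, 1] / np.var(x)   # empirical LMMSE gain (b* = 0 here)
mse_hat = np.mean((y - a * x)**2)

print("SNR (dB):", 10 * np.log10(snr))
print("empirical MSE:", mse_hat)
print("theory sigma_Y^2/(1+SNR):", sigma_y**2 / (1 + snr))
```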

Conditional Expectation

Conditioning Reduces Uncertainty

Joint distribution \((X, Y) \sim p(x,y)\) contains all information.

Bayes’ rule: \[p(y|x) = \frac{p(x,y)}{p(x)} = \frac{p(x,y)}{\int p(x,y') dy'}\]

Once we observe \(X = x\), the distribution of \(Y\) changes from \(p(y)\) to \(p(y|x)\).

Example: Temperature sensor

  • Prior: \(Y \sim \mathcal{N}(70, 10^2)\) (before measurement)
  • Observe: \(X = 75\) (sensor reading)
  • Posterior: \(Y|X=75 \sim \mathcal{N}(73, 4^2)\) (after measurement)

Conditioning incorporates information, reducing spread.

Conditioning reduces uncertainty

Discrete case: \[P(Y = y|X = x) = \frac{P(X = x, Y = y)}{P(X = x)}\]

The Conditional Expectation

Definition: The conditional expectation of \(Y\) given \(X = x\) is:

Discrete case: \[\mathbb{E}[Y|X=x] = \sum_y y \cdot P(Y=y|X=x)\]

Continuous case: \[\mathbb{E}[Y|X=x] = \int_{-\infty}^{\infty} y \cdot p(y|x) \, dy\]

Facts:

  1. \(\mathbb{E}[Y|X=x]\) is a number (for fixed \(x\))
    • expectation when \(X\) equals \(x\)
  2. \(\mathbb{E}[Y|X]\) is a random variable (function of \(X\))
  3. \(g(x) = \mathbb{E}[Y|X=x]\) is a deterministic function

Example: \(Y = X + N\), \(N \sim \mathcal{N}(0,1)\), independent \(X\) and \(N\) \[\mathbb{E}[Y|X=x] = \mathbb{E}[x + N] = x\]

Computing E[Y|X]: Discrete Example

Example: Binary communication channel

  • \(X \in \{0, 1\}\): transmitted bit
  • \(Y \in \{0, 1\}\): received bit
  • Channel flips bit with probability \(p = 0.1\)

Joint distribution: \[P(X=0, Y=0) = 0.45, \quad P(X=1, Y=0) = 0.05\] \[P(X=0, Y=1) = 0.05, \quad P(X=1, Y=1) = 0.45\]

Conditional distributions: \[P(X=0|Y=0) = \frac{0.45}{0.5} = 0.9\] \[P(X=1|Y=0) = \frac{0.05}{0.5} = 0.1\]

Conditional expectations: \[\mathbb{E}[X|Y=0] = 0 \cdot 0.9 + 1 \cdot 0.1 = 0.1\] \[\mathbb{E}[X|Y=1] = 0 \cdot 0.1 + 1 \cdot 0.9 = 0.9\]

MSE for different predictors:

  • \(\hat{X} = \mathbb{E}[X|Y]\): MSE = 0.09
  • \(\hat{X} = Y\) (identity): MSE = 0.10
  • \(\hat{X} = 0.5\) (constant): MSE = 0.25

Result: \(\mathbb{E}[X|Y]\) achieves minimum MSE
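A short sketch (assuming NumPy; the joint pmf is the one above) that reproduces these conditional expectations and MSE values:

```python
import numpy as np

# joint pmf P(X=i, Y=j) from the binary channel example
P = np.array([[0.45, 0.05],      # X=0 row: (Y=0, Y=1)
              [0.05, 0.45]])     # X=1 row

P_Y = P.sum(axis=0)                          # marginal of Y: [0.5, 0.5]
E_X_given_Y = (np.array([0, 1]) @ P) / P_Y   # E[X|Y=y] for y = 0, 1
print(E_X_given_Y)                           # [0.1, 0.9]

# MSE of three predictors, averaging over the joint pmf
x_vals = np.array([0.0, 1.0])
def mse(pred):      # pred[j] = estimate of X when Y = j
    return sum(P[i, j] * (x_vals[i] - pred[j])**2
               for i in range(2) for j in range(2))

print(mse(E_X_given_Y))      # 0.09  (conditional mean)
print(mse([0.0, 1.0]))       # 0.10  (identity)
print(mse([0.5, 0.5]))       # 0.25  (constant)
```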

Computing E[Y|X]: Gaussian Example

Joint Gaussian: \((X, Y) \sim \mathcal{N}(\mu, K)\)

Parameters: \[\mu = \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \quad K = \begin{bmatrix} \sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2 \end{bmatrix}\]

Result (derive later): \[\mathbb{E}[Y|X=x] = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X)\]

Properties:

  1. Linear in \(x\) (always!)
  2. Slope = \(\rho\sigma_Y/\sigma_X\)
  3. Passes through \((\mu_X, \mu_Y)\)
  4. \(|\rho| = 1 \Rightarrow\) perfect prediction

Conditional variance (constant!): \[\text{Var}(Y|X=x) = \sigma_Y^2(1 - \rho^2)\]

Independent of \(x\) - uncertainty same everywhere.

Special cases:

  • \(\rho = 0\): \(\mathbb{E}[Y|X=x] = \mu_Y\) (no information)
  • \(|\rho| = 1\): \(\text{Var}(Y|X) = 0\) (perfect prediction)
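A Monte Carlo check of the conditional-mean formula is straightforward. The sketch below assumes NumPy; the parameter values (means, standard deviations, \(\rho\)) are illustrative choices:

```python
import numpy as np

mu_x, mu_y = 70.0, 70.0
sigma_x, sigma_y, rho = 3.0, 2.0, 0.8
K = np.array([[sigma_x**2, rho * sigma_x * sigma_y],
              [rho * sigma_x * sigma_y, sigma_y**2]])

rng = np.random.default_rng(2)
xy = rng.multivariate_normal([mu_x, mu_y], K, size=500_000)
x, y = xy[:, 0], xy[:, 1]

# compare E[Y | X near x0] with the closed-form conditional mean
x0 = 72.0
mask = np.abs(x - x0) < 0.1
print(y[mask].mean())                                   # Monte Carlo
print(mu_y + rho * sigma_y / sigma_x * (x0 - mu_x))     # formula
print(y[mask].var(), sigma_y**2 * (1 - rho**2))         # conditional variance
```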

General Case: Need p(y|x)

Computing \(\mathbb{E}[Y|X=x]\) requires \(p(y|x)\)

From Bayes: \[p(y|x) = \frac{p(x,y)}{p(x)} = \frac{p(x|y)p(y)}{\int p(x|y')p(y')dy'}\]

When tractable:

  1. Known parametric families (Gaussian, exponential)
  2. Discrete with small state space
  3. Special structures (independence, Markov)

When intractable (often no closed form):

  1. Complex nonlinear relationships
  2. High dimensions
  3. Unknown distributions

Example: Nonlinear sensor \[Y = \text{true value}, \quad X = Y + Y^2 \cdot N\] where \(N \sim \mathcal{N}(0, 0.1)\)

Computing \(\mathbb{E}[Y|X=x]\) requires:

  1. Joint density \(p(x,y)\)
  2. Marginal \(p(x) = \int p(x,y)dy\)
  3. Conditional \(p(y|x) = p(x,y)/p(x)\)
  4. Integration \(\mathbb{E}[Y|X=x] = \int y p(y|x) dy\)

Law of Total Expectation (Tower Property)

Law of Total Expectation: \[\mathbb{E}[\mathbb{E}[Y|X]] = \mathbb{E}[Y]\]

Proof: \[\mathbb{E}[\mathbb{E}[Y|X]] = \mathbb{E}[g(X)]\] where \(g(x) = \mathbb{E}[Y|X=x]\)

\[= \int g(x) p(x) dx = \int \mathbb{E}[Y|X=x] p(x) dx\]

\[= \int \left(\int y \, p(y|x) dy\right) p(x) dx\]

\[= \int\int y \, p(y|x) p(x) dy \, dx\]

\[= \int y \left(\int p(y|x)p(x) dx\right) dy\]

\[= \int y \, p(y) dy = \mathbb{E}[Y]\]

Intuition: Average of conditional averages = unconditional average

Example: Class grades

  • Section A: mean 85, 30 students
  • Section B: mean 75, 20 students
  • Overall mean = \(\frac{30 \cdot 85 + 20 \cdot 75}{50} = 81\)

Pulling Out Known Terms

Property: For any function \(h(X)\): \[\mathbb{E}[h(X)Y | X] = h(X)\mathbb{E}[Y|X]\]

Proof: Fix \(X = x\). Then \(h(X) = h(x)\) is a constant: \[\mathbb{E}[h(X)Y | X=x] = \mathbb{E}[h(x)Y | X=x]\] \[= h(x)\mathbb{E}[Y | X=x]\]

Special cases:

  1. \(\mathbb{E}[X \cdot Y | X] = X \cdot \mathbb{E}[Y|X]\)
  2. \(\mathbb{E}[f(X) + Y | X] = f(X) + \mathbb{E}[Y|X]\)
  3. \(\mathbb{E}[X^2Y | X] = X^2\mathbb{E}[Y|X]\)

Example: Signal processing \[Z = X \cdot Y + X^2\] \[\mathbb{E}[Z|X] = X\mathbb{E}[Y|X] + X^2\]

Only need \(\mathbb{E}[Y|X]\), not \(\mathbb{E}[XY|X]\)!

BUT: Cannot pull out functions of \(Y\): \[\mathbb{E}[Y^2|X] \neq Y \cdot \mathbb{E}[Y|X]\]

Conditional Variance

Definition: \[\text{Var}(Y|X) = \mathbb{E}[(Y - \mathbb{E}[Y|X])^2 | X]\]

Computing formula: \[\text{Var}(Y|X) = \mathbb{E}[Y^2|X] - (\mathbb{E}[Y|X])^2\]

Measures remaining uncertainty after observing \(X\)

Law of Total Variance: \[\text{Var}(Y) = \mathbb{E}[\text{Var}(Y|X)] + \text{Var}(\mathbb{E}[Y|X])\]

  • First term: Average conditional variance (irreducible)
  • Second term: Variance of conditional means (explained)

Interpretation: \[\text{Total uncertainty} = \text{Unexplained} + \text{Explained by } X\]

Minimum MSE: \[\text{MSE}_{\text{min}} = \mathbb{E}[(Y - \mathbb{E}[Y|X])^2] = \mathbb{E}[\text{Var}(Y|X)]\]

Cannot do better than conditional variance!
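The total-variance decomposition and the MSE floor can be verified numerically. A minimal sketch assuming NumPy and an illustrative model \(Y = 2X + N\):

```python
import numpy as np

# Y = 2X + N with X ~ N(0,1), N ~ N(0,1):  Var(Y) = 4 + 1 = 5
rng = np.random.default_rng(3)
n = 1_000_000
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)

e_y_given_x = 2 * x          # E[Y|X] = 2X  (known analytically here)
var_y_given_x = 1.0          # Var(Y|X) = Var(N) = 1, constant in x

print(np.var(y))                              # ~5   total variance
print(var_y_given_x + np.var(e_y_given_x))    # ~5   E[Var(Y|X)] + Var(E[Y|X])
print(np.mean((y - e_y_given_x)**2))          # ~1   MSE_min = E[Var(Y|X)]
```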

E[Y|X] is Optimal

Result (proved next):

Among all functions \(g: \mathbb{R} \to \mathbb{R}\): \[\mathbb{E}[Y|X] = \arg\min_{g} \mathbb{E}[(Y - g(X))^2]\]

Achieving minimum MSE: \[\text{MSE}_{\text{min}} = \mathbb{E}[\text{Var}(Y|X)]\]

i.e., cannot beat the average conditional variance.

Why conditional expectation?

For fixed \(X = x\), minimizing \(\mathbb{E}[(Y - c)^2|X=x]\) over constants \(c\) yields: \[c^* = \mathbb{E}[Y|X=x]\]

Do this for each \(x\) → optimal function is \(g(x) = \mathbb{E}[Y|X=x]\)

Computational reality:

  • Need \(p(y|x)\) to compute \(\mathbb{E}[Y|X=x] = \int y p(y|x) dy\)
  • Gaussian: Closed form (linear!)
  • General: Often intractable

Constraints simplify the problem.

MMSE Optimality

Optimizing Over Function Space

Goal: Find the best predictor of \(Y\) given \(X\)

Function space: \[\mathcal{G} = \{g: \mathbb{R} \to \mathbb{R} \text{ measurable}\}\]

Optimization problem: \[\min_{g \in \mathcal{G}} \text{MSE}(g) = \min_{g \in \mathcal{G}} \mathbb{E}[(Y - g(X))^2]\]

Infinite-dimensional optimization

  • Search over all possible functions
  • No parametric form assumed
  • Need functional analysis tools

Strategy: Decompose the problem by conditioning on \(X\)

  1. Fix \(X = x\), solve pointwise optimization
  2. Combine solutions for all \(x\)
  3. Verify global optimality

Strategy 1: Pointwise Optimization

Decomposition: \[\text{MSE}(g) = \mathbb{E}[(Y - g(X))^2] = \mathbb{E}[\mathbb{E}[(Y - g(X))^2|X]]\]

Why this helps:

  • Outside expectation: Average over \(X\)
  • Inside expectation: For each fixed \(X = x\)

When \(X = x\) is fixed:

  • \(g(X) = g(x)\) is just a number
  • \(Y|X=x\) has distribution \(p(y|x)\)
  • We minimize: \(\mathbb{E}[(Y - g(x))^2|X=x]\)

This transforms the problem:

  • From: Find optimal function \(g\)
  • To: For each \(x\), find optimal constant \(g(x)\)

Solution approach:

  1. Fix \(x\), find best constant predictor
  2. This gives us \(g^*(x)\) for each \(x\)
  3. The function \(g^*\) is our MMSE estimator

Deriving the MMSE Estimator

Step 1: Express MSE using conditional expectation \[\text{MSE}(g) = \mathbb{E}[(Y - g(X))^2]\] \[= \mathbb{E}[\mathbb{E}[(Y - g(X))^2|X]]\]

Step 2: For fixed \(X = x\), \(g(X) = g(x)\) is a constant \[\mathbb{E}[(Y - g(x))^2|X=x] = \mathbb{E}[(Y - a)^2|X=x]\] where \(a = g(x)\)

Step 3: Minimize over \(a\) \[\frac{d}{da}\mathbb{E}[(Y - a)^2|X=x] = 0\] \[-2\mathbb{E}[(Y - a)|X=x] = 0\] \[\mathbb{E}[Y|X=x] - a = 0\]

Solution: \(a^* = \mathbb{E}[Y|X=x]\)

Step 4: Since this holds for all \(x\): \[g^*(x) = \mathbb{E}[Y|X=x]\]

Therefore: \[\boxed{\hat{Y}_{\text{MMSE}} = \mathbb{E}[Y|X]}\]

Orthogonality Principle

Orthogonality principle: Optimal error orthogonal to any function of \(X\): \[\mathbb{E}[(Y - \mathbb{E}[Y|X])h(X)] = 0\] for any measurable \(h\).

  1. Uniqueness: Only one function satisfies this → unique MMSE

  2. Optimality test: If \(\mathbb{E}[\epsilon \cdot h(X)] \neq 0\), can reduce MSE by adjusting estimate in direction of \(h(X)\)

  3. Computational foundation:

    • Linear case → normal equations
    • Adaptive filters → orthogonalize error
    • Kalman filter → innovation orthogonality
  4. Geometric meaning: MMSE = projection in Hilbert space

Proof: By the tower property and pulling out \(h(X)\): \[\mathbb{E}[(Y - \mathbb{E}[Y|X])h(X)] = \mathbb{E}\big[h(X)\,\mathbb{E}[(Y - \mathbb{E}[Y|X])\,|\,X]\big] = \mathbb{E}[h(X) \cdot 0] = 0\]

Cannot extract more information from \(X\)

Why Is This Global?

We solved pointwise: For each \(x\), found \(g^*(x) = \mathbb{E}[Y|X=x]\)

Question: Is this globally optimal over all functions?

Two concerns:

  1. We optimized separately at each \(x\)
  2. Maybe coupling across different \(x\) values helps?

Key fact: MSE decomposes additively \[\text{MSE}(g) = \int \mathbb{E}[(Y - g(x))^2|X=x] p(x) dx\]

Each term \(\mathbb{E}[(Y - g(x))^2|X=x]\) depends only on \(g(x)\), not on \(g(x')\) for \(x' \neq x\).

Therefore:

  • Minimizing each term separately minimizes the sum
  • No coupling can improve the solution
  • Pointwise optimization = Global optimization

Idea: Prediction at \(x\) doesn’t help prediction at \(x'\)

Verifying Global Optimality

Claim: No function can beat \(\mathbb{E}[Y|X]\)

Proof: For any function \(g\): \[\text{MSE}(g) = \mathbb{E}[(Y - g(X))^2]\]

Add and subtract \(\mathbb{E}[Y|X]\): \[= \mathbb{E}[(Y - \mathbb{E}[Y|X] + \mathbb{E}[Y|X] - g(X))^2]\]

Expand the square: \[= \mathbb{E}[(Y - \mathbb{E}[Y|X])^2] + \mathbb{E}[(\mathbb{E}[Y|X] - g(X))^2]\] \[+ 2\mathbb{E}[(Y - \mathbb{E}[Y|X])(\mathbb{E}[Y|X] - g(X))]\]

Key: The cross term is zero! \[\mathbb{E}[(Y - \mathbb{E}[Y|X])(\mathbb{E}[Y|X] - g(X))] = 0\]

Because \(\mathbb{E}[Y|X] - g(X)\) is a function of \(X\) and: \[\mathbb{E}[(Y - \mathbb{E}[Y|X])h(X)] = 0\] for any \(h(X)\) (orthogonality).

Therefore: \[\text{MSE}(g) = \text{MSE}_{\min} + \mathbb{E}[(\mathbb{E}[Y|X] - g(X))^2] \geq \text{MSE}_{\min}\]

Minimum Achievable MSE

Minimum achievable MSE: \[\text{MSE}_{\min} = \mathbb{E}[(Y - \mathbb{E}[Y|X])^2]\]

Result: \[\boxed{\text{MSE}_{\min} = \mathbb{E}[\text{Var}(Y|X)]}\]

Proof: \[\text{MSE}_{\min} = \mathbb{E}[(Y - \mathbb{E}[Y|X])^2]\] \[= \mathbb{E}[\mathbb{E}[(Y - \mathbb{E}[Y|X])^2|X]]\]

For fixed \(X = x\): \[\mathbb{E}[(Y - \mathbb{E}[Y|X=x])^2|X=x] = \text{Var}(Y|X=x)\]

Therefore: \[\text{MSE}_{\min} = \mathbb{E}[\text{Var}(Y|X)]\]

  • Cannot eliminate conditional variance
  • Irreducible uncertainty
  • Fundamental limit of prediction

Special cases:

  • \(X\) and \(Y\) independent: \(\text{MSE}_{\min} = \text{Var}(Y)\)
  • \(Y = f(X)\) deterministic: \(\text{MSE}_{\min} = 0\)

Impact of Outliers on MMSE

Squared error is sensitive to outliers

Single outlier can dominate MSE: \[\text{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2\]

If \(y_k\) is outlier: \((y_k - \hat{y}_k)^2 \gg (y_i - \hat{y}_i)^2\) for \(i \neq k\)

Effect on MMSE:

  • \(\mathbb{E}[Y|X]\) gets pulled toward outliers
  • Large increase in MSE
  • Solution may not represent majority of data

Example: Sensor failure

  • 99 readings: \(Y \approx X + \varepsilon\), \(\varepsilon \sim \mathcal{N}(0, 0.1^2)\)
  • 1 reading: \(Y = 100\) (sensor spike)

MMSE tries to accommodate the outlier, degrading performance for typical cases.

Alternative loss functions:

  • Absolute error: \(|Y - \hat{Y}|\) (median)
  • Huber loss: Quadratic for small errors, linear for large
  • Trimmed MSE: Ignore largest errors

Robust Alternatives to MMSE

When outliers are present:

  1. Median regression (LAD): \[\min_g \mathbb{E}[|Y - g(X)|]\] Solution: \(g^*(x) = \text{median}(Y|X=x)\)

  2. Huber regression: \[\min_g \mathbb{E}[\ell_\delta(Y - g(X))]\] where \(\ell_\delta(e) = \begin{cases} \frac{1}{2}e^2 & |e| \leq \delta \\ \delta|e| - \frac{\delta^2}{2} & |e| > \delta \end{cases}\)

  3. Trimmed mean: Remove top/bottom α% before computing \(\mathbb{E}[Y|X]\)
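A minimal sketch (assuming NumPy) of the Huber loss \(\ell_\delta\) from item 2; the argument delta corresponds to \(\delta\), the transition from quadratic to linear penalty:

```python
import numpy as np

def huber(e, delta=1.0):
    """Huber loss: quadratic for |e| <= delta, linear beyond."""
    small = np.abs(e) <= delta
    return np.where(small, 0.5 * e**2, delta * np.abs(e) - 0.5 * delta**2)

e = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(huber(e))          # [2.5, 0.125, 0., 0.125, 2.5]
```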

Example: Gaussian Case

Joint Gaussian: \((X, Y) \sim \mathcal{N}(\mu, K)\)

\[K = \begin{bmatrix} \sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2 \end{bmatrix}\]

MMSE estimator: \[\hat{Y}_{\text{MMSE}} = \mathbb{E}[Y|X] = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(X - \mu_X)\]

Key property: Linear in \(X\)!

Minimum MSE: \[\text{MSE}_{\min} = \mathbb{E}[\text{Var}(Y|X)] = \sigma_Y^2(1 - \rho^2)\]

Observations:

  1. \(\rho = 0\): MSE\(_{\min}\) = \(\sigma_Y^2\) (no improvement)
  2. \(|\rho| = 1\): MSE\(_{\min}\) = 0 (perfect prediction)
  3. Linear MMSE = Unrestricted MMSE

Why linear? Gaussian conditional \(Y|X\) has:

  • Mean linear in \(x\)
  • Variance independent of \(x\)

Example: Non-Gaussian Case

Uniform-Quadratic example:

  • \(X \sim \text{Uniform}(-2, 2)\)
  • \(Y = X^2 + N\), \(N \sim \mathcal{N}(0, 0.5^2)\)

MMSE estimator: \[\hat{Y}_{\text{MMSE}} = \mathbb{E}[Y|X] = X^2\]

NOTE: Nonlinear in \(X\)

Linear MMSE (suboptimal): \[\hat{Y}_{\text{linear}} = \mathbb{E}[Y] + \frac{\text{Cov}(X,Y)}{\text{Var}(X)}(X - \mathbb{E}[X])\]

Since \(\mathbb{E}[X] = 0\) and \(\text{Cov}(X,Y) = \mathbb{E}[X^3] = 0\): \[\hat{Y}_{\text{linear}} = \mathbb{E}[Y] = \mathbb{E}[X^2] = \frac{4}{3}\]

Performance gap:

  • MSE\(_{\text{MMSE}}\) = \(\sigma_N^2\) = 0.25
  • MSE\(_{\text{linear}}\) = Var(\(X^2\)) + \(\sigma_N^2\) = \(\tfrac{64}{45} + 0.25\) ≈ 1.67

Linear loses significant performance when true relationship is nonlinear!
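The gap can be confirmed by simulation. A sketch assuming NumPy, using the uniform-quadratic model above:

```python
import numpy as np

# uniform-quadratic example: X ~ U(-2, 2), Y = X^2 + N, N ~ N(0, 0.5^2)
rng = np.random.default_rng(4)
n = 1_000_000
x = rng.uniform(-2, 2, n)
y = x**2 + rng.normal(0, 0.5, n)

mmse_pred = x**2                                # E[Y|X] = X^2
a = np.cov(x, y)[0, 1] / np.var(x)              # ~0 (Cov(X, X^2) = 0)
b = y.mean() - a * x.mean()                     # ~E[X^2] = 4/3
lin_pred = a * x + b

print(np.mean((y - mmse_pred)**2))   # ~0.25  = sigma_N^2
print(np.mean((y - lin_pred)**2))    # ~1.67  = Var(X^2) + sigma_N^2
```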

Linear MMSE

Curse of Dimensionality

Curse of dimensionality:

To compute \(\mathbb{E}[Y|X]\), need \(p(y|x)\)

Density estimation convergence:

  • Rate: \(N^{-4/(d+4)}\) for \(d\)-dimensional data
  • \(d = 10\): Need \(N \approx 10^6\) samples
  • \(d = 100\): Need \(N \approx 10^{20}\) samples!

Why this matters:

  • Modern data is high-dimensional
  • Never have enough samples for nonparametric methods
  • Must accept suboptimality for tractability

Linear class advantages:

  • Only need second moments: \(O(d^2)\) parameters
  • Sample complexity: \(O(d^2)\) vs \(O(e^d)\)
  • Often captures dominant effects

Theory to Practice

MMSE requires \(p(y|x)\) to compute \(\mathbb{E}[Y|X]\)

Reality: We rarely know conditional distributions

What we often have:

  • Sample data: \((x_1, y_1), ..., (x_n, y_n)\)
  • First and second moments
  • Rough sense of relationship

Practical solution: Restrict to linear predictors \[\hat{Y} = aX + b\]

  1. Only need means, variances, covariance
  2. Optimal for Gaussian (prove in Part 5)
  3. Often good enough in practice

Restricting to Linear Predictors

Function class: \[\mathcal{L} = \{g(x) = ax + b : a, b \in \mathbb{R}\}\]

New optimization problem: \[\min_{a,b} \text{MSE}(a,b) = \min_{a,b} \mathbb{E}[(Y - aX - b)^2]\]

From infinite to finite:

  • Before: Search over all functions
  • Now: Search over 2 parameters

Expanding the MSE: \[\text{MSE}(a,b) = \mathbb{E}[Y^2 - 2Y(aX + b) + (aX + b)^2]\] \[= \mathbb{E}[Y^2] - 2a\mathbb{E}[XY] - 2b\mathbb{E}[Y]\] \[+ a^2\mathbb{E}[X^2] + 2ab\mathbb{E}[X] + b^2\]

MSE is quadratic in \((a, b)\)

  • Unique minimum (if Var\((X) > 0\))
  • Closed-form solution via derivatives
  • No iterative optimization needed

Deriving Optimal Coefficients: Bias

Step 1: Find optimal \(b\) for fixed \(a\)

Take derivative with respect to \(b\): \[\frac{\partial}{\partial b}\text{MSE}(a,b) = \frac{\partial}{\partial b}\mathbb{E}[(Y - aX - b)^2]\]

\[= -2\mathbb{E}[Y - aX - b] = 0\]

Solving for \(b\): \[\mathbb{E}[Y - aX - b] = 0\] \[\mathbb{E}[Y] - a\mathbb{E}[X] - b = 0\]

\[\boxed{b^* = \mathbb{E}[Y] - a\mathbb{E}[X]}\]

Interpretation:

  • Offset ensures zero mean error
  • Forces predictor through \((\mathbb{E}[X], \mathbb{E}[Y])\)
  • Adjusts for non-zero means

Substituting back: \[\hat{Y} = aX + \mathbb{E}[Y] - a\mathbb{E}[X] = \mathbb{E}[Y] + a(X - \mathbb{E}[X])\]

Deriving Optimal Coefficients: Slope

Step 2: Find optimal \(a\) using \(b^* = \mathbb{E}[Y] - a\mathbb{E}[X]\)

Substitute into MSE: \[\text{MSE}(a) = \mathbb{E}[(Y - \mathbb{E}[Y] - a(X - \mathbb{E}[X]))^2]\]

Define centered variables:

  • \(\tilde{Y} = Y - \mathbb{E}[Y]\)
  • \(\tilde{X} = X - \mathbb{E}[X]\)

Then: \(\text{MSE}(a) = \mathbb{E}[(\tilde{Y} - a\tilde{X})^2]\)

Take derivative: \[\frac{d}{da}\text{MSE}(a) = -2\mathbb{E}[\tilde{X}(\tilde{Y} - a\tilde{X})] = 0\]

\[\mathbb{E}[\tilde{X}\tilde{Y}] - a\mathbb{E}[\tilde{X}^2] = 0\]

Solution: \[\boxed{a^* = \frac{\mathbb{E}[\tilde{X}\tilde{Y}]}{\mathbb{E}[\tilde{X}^2]} = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}}\]

Note: \(\mathbb{E}[\tilde{X}\tilde{Y}] = \text{Cov}(X,Y)\) and \(\mathbb{E}[\tilde{X}^2] = \text{Var}(X)\)

LMS: Adaptive Linear MMSE

Problem: Don’t know statistics a priori

LMMSE needs: \(\text{Cov}(X,Y)\) and \(\text{Var}(X)\)

Solution: Learn from streaming data

Gradient descent on MSE: \[\frac{\partial}{\partial a} \mathbb{E}[(Y - aX - b)^2] = -2\mathbb{E}[(Y - aX - b)X]\]

Widrow-Hoff insight:

Replace expectation with instantaneous value: \[\mathbb{E}[(Y - aX - b)X] \approx (y_i - ax_i - b)x_i = e_i x_i\]

LMS update rule: \[a_{i+1} = a_i + \mu e_i x_i\] \[b_{i+1} = b_i + \mu e_i\]

where \(e_i = y_i - \hat{y}_i\) is prediction error

Convergence: \(a_i \to a^*_{\text{LMMSE}}\) as \(i \to \infty\)
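A minimal scalar LMS sketch (assuming NumPy; the synthetic stream and step size are illustrative):

```python
import numpy as np

def lms(x, y, mu=0.01):
    """Scalar LMS (Widrow-Hoff): learn a, b online from a stream (x_i, y_i)."""
    a, b = 0.0, 0.0
    for xi, yi in zip(x, y):
        e = yi - (a * xi + b)      # instantaneous prediction error
        a += mu * e * xi           # stochastic-gradient updates
        b += mu * e
    return a, b

# synthetic stream: y = 1.5 x + 2 + noise; LMS should approach a* = 1.5, b* = 2
rng = np.random.default_rng(5)
x = rng.normal(size=20_000)
y = 1.5 * x + 2.0 + rng.normal(scale=0.3, size=20_000)
print(lms(x, y, mu=0.01))          # ~ (1.5, 2.0)
```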

LMS Properties

Convergence condition: \[0 < \mu < \frac{2}{\mathbb{E}[X^2]}\]

Too large \(\mu\) → divergence

Steady-state error:

Misadjustment: \(M = \frac{\mu \mathbb{E}[X^2]}{2}\)

Trade-off:

  • Small \(\mu\): Slow convergence, low steady-state error
  • Large \(\mu\): Fast convergence, high steady-state error

Time constant (iterations to converge): \[\tau \approx \frac{1}{2\mu \mathbb{E}[X^2]}\]

Advantages:

  • No matrix operations
  • Adapts to changing statistics
  • Simple hardware implementation

Relation: LMS = stochastic gradient descent on the MSE objective

Linear MMSE Formula

Complete solution: \[\boxed{\hat{Y}_{\text{LMMSE}} = \mathbb{E}[Y] + \frac{\text{Cov}(X,Y)}{\text{Var}(X)}(X - \mathbb{E}[X])}\]

Alternative forms:

Using correlation \(\rho = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}\): \[\hat{Y}_{\text{LMMSE}} = \mathbb{E}[Y] + \rho\frac{\sigma_Y}{\sigma_X}(X - \mathbb{E}[X])\]

Standardized form: If \(\mathbb{E}[X] = \mathbb{E}[Y] = 0\) and \(\sigma_X = \sigma_Y = 1\): \[\hat{Y}_{\text{LMMSE}} = \rho X\]

Minimum MSE: \[\text{MSE}_{\text{LMMSE}} = \text{Var}(Y)(1 - \rho^2)\]

Properties:

  1. Only needs first and second moments
  2. Always exists (if Var\((X) > 0\))
  3. Unique solution
  4. Orthogonal errors: \(\mathbb{E}[(Y - \hat{Y})X] = 0\)

Connection to Correlation

Slope decomposition: \[a^* = \frac{\text{Cov}(X,Y)}{\text{Var}(X)} = \frac{\rho \sigma_X \sigma_Y}{\sigma_X^2} = \rho \frac{\sigma_Y}{\sigma_X}\]

Correlation determines predictability

Fraction of variance explained: \[R^2 = \frac{\text{Var}(Y) - \text{MSE}_{\text{LMMSE}}}{\text{Var}(Y)} = \rho^2\]

Correlation measures linear predictability, not general dependence

Linear vs Unrestricted MMSE

Fundamental inequality: \[\text{MSE}_{\text{LMMSE}} \geq \text{MSE}_{\text{MMSE}}\]

Always! Linear is restricted class.

Performance gap: \[\text{MSE}_{\text{LMMSE}} - \text{MSE}_{\text{MMSE}} = \mathbb{E}[(\mathbb{E}[Y|X] - \hat{Y}_{\text{LMMSE}})^2]\]

Measures nonlinearity of \(\mathbb{E}[Y|X]\).

When are they equal? \[\text{MSE}_{\text{LMMSE}} = \text{MSE}_{\text{MMSE}}\] if and only if \(\mathbb{E}[Y|X]\) is linear in \(X\).

For jointly Gaussian \((X,Y)\):

  • \(\mathbb{E}[Y|X]\) is linear
  • LMMSE = MMSE
  • No performance loss from restriction

Orthogonality in Linear MMSE

Orthogonality principle for LMMSE:

The error is orthogonal to the input: \[\mathbb{E}[(Y - \hat{Y}_{\text{LMMSE}})X] = 0\]

Proof: Substituting \(\hat{Y}_{\text{LMMSE}} = b^* + a^*X\): \[\mathbb{E}[(Y - b^* - a^*X)X] = \mathbb{E}[YX] - b^*\mathbb{E}[X] - a^*\mathbb{E}[X^2]\]

Using \(b^* = \mathbb{E}[Y] - a^*\mathbb{E}[X]\) and \(a^* = \text{Cov}(X,Y)/\text{Var}(X)\): \[= \mathbb{E}[YX] - \mathbb{E}[Y]\mathbb{E}[X] - a^*(\mathbb{E}[X^2] - \mathbb{E}[X]^2)\] \[= \text{Cov}(X,Y) - a^*\text{Var}(X) = 0\]

Geometric interpretation:

  • \(Y\) = signal in Hilbert space
  • \(\hat{Y}_{\text{LMMSE}}\) = projection onto linear span of \(\{1, X\}\)
  • Error perpendicular to subspace

Note: Only orthogonal to \(X\) and constants, not all functions of \(X\)!

Practical Computation

What we need:

  • \(\mathbb{E}[X], \mathbb{E}[Y]\)
  • \(\mathbb{E}[X^2], \mathbb{E}[Y^2], \mathbb{E}[XY]\)

Sample estimates from data \((x_1, y_1), ..., (x_n, y_n)\): \[\hat{\mu}_X = \frac{1}{n}\sum_{i=1}^n x_i, \quad \hat{\mu}_Y = \frac{1}{n}\sum_{i=1}^n y_i\]

\[\widehat{\text{Var}}(X) = \frac{1}{n-1}\sum_{i=1}^n (x_i - \hat{\mu}_X)^2\]

\[\widehat{\text{Cov}}(X,Y) = \frac{1}{n-1}\sum_{i=1}^n (x_i - \hat{\mu}_X)(y_i - \hat{\mu}_Y)\]

Then: \(\hat{a} = \widehat{\text{Cov}}(X,Y)/\widehat{\text{Var}}(X)\)
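A sketch of the sample-based computation (assuming NumPy; the synthetic data is illustrative):

```python
import numpy as np

def fit_lmmse(x, y):
    """Sample LMMSE coefficients: a = Cov(X,Y)/Var(X), b = mean(Y) - a*mean(X)."""
    a = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b = y.mean() - a * x.mean()
    return a, b

rng = np.random.default_rng(6)
x = rng.normal(size=5_000)
y = 0.7 * x + 1.0 + rng.normal(scale=0.2, size=5_000)
a_hat, b_hat = fit_lmmse(x, y)
print(a_hat, b_hat)                              # ~0.7, ~1.0
print(np.mean((y - (a_hat * x + b_hat))**2))     # ~0.04 = noise variance
```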

Gaussian Case

Joint Gaussian Distribution

Bivariate Gaussian \((X, Y) \sim \mathcal{N}(\mu, K)\)

Parameters: \[\mu = \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \quad K = \begin{bmatrix} \sigma_X^2 & k_{XY} \\ k_{XY} & \sigma_Y^2 \end{bmatrix}\]

where \(k_{XY} = \rho\sigma_X\sigma_Y\)

Joint density: \[p(x,y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[z_x^2 - 2\rho z_x z_y + z_y^2\right]\right)\]

where \(z_x = \frac{x-\mu_X}{\sigma_X}\) and \(z_y = \frac{y-\mu_Y}{\sigma_Y}\) are standardized variables

Fully determined by first and second moments

  • 5 parameters: \(\mu_X, \mu_Y, \sigma_X^2, \sigma_Y^2, \rho\)
  • All higher moments determined by these

Conditional Distribution Y|X

Result: \(Y|X=x\) is Gaussian

Conditional mean: \[\mathbb{E}[Y|X=x] = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X)\]

Conditional variance (constant!): \[\text{Var}(Y|X=x) = \sigma_Y^2(1 - \rho^2)\]

Therefore: \[Y|X=x \sim \mathcal{N}\left(\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X), \sigma_Y^2(1 - \rho^2)\right)\]

Observations:

  1. \(\mathbb{E}[Y|X=x]\) is linear in \(x\)
  2. \(\text{Var}(Y|X=x)\) independent of \(x\)
  3. Conditional distribution remains Gaussian

Regression line: \[y = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X)\] passes through \((\mu_X, \mu_Y)\) with slope \(\rho\sigma_Y/\sigma_X\)

Deriving the Conditional Mean

Method 1: Complete the square

Start with joint density, condition on \(X = x\): \[p(y|x) \propto \exp\left(-\frac{1}{2(1-\rho^2)\sigma_Y^2}\left(y - \mu_Y - \rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X)\right)^2\right)\]

Method 2: Linear projection

Find \(a, b\) minimizing \(\mathbb{E}[(Y - aX - b)^2]\):

  • \(a^* = \text{Cov}(X,Y)/\text{Var}(X) = \rho\sigma_Y/\sigma_X\)
  • \(b^* = \mathbb{E}[Y] - a^*\mathbb{E}[X] = \mu_Y - a^*\mu_X\)

Method 3: Schur complement

Using block matrix operations on \(K\): \[\mathbb{E}[Y|X=x] = \mu_Y + k_{YX}k_{XX}^{-1}(x - \mu_X)\] \[= \mu_Y + \rho\sigma_X\sigma_Y \cdot \frac{1}{\sigma_X^2}(x - \mu_X)\]

(all methods give same result)

MMSE = LMMSE for Gaussian

Fundamental result: For jointly Gaussian \((X, Y)\)

\[\boxed{\hat{Y}_{\text{MMSE}} = \hat{Y}_{\text{LMMSE}} = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(X - \mu_X)}\]

Why equal?

  • MMSE: \(\hat{Y}_{\text{MMSE}} = \mathbb{E}[Y|X]\)
  • For Gaussian: \(\mathbb{E}[Y|X]\) is linear
  • LMMSE: Best linear predictor

Minimum MSE: \[\text{MSE}_{\min} = \sigma_Y^2(1 - \rho^2)\]

Implications:

  1. No need to consider nonlinear predictors
  2. Second moments sufficient for optimal prediction
  3. Closed-form solution

Why Gaussian is Special

Maximum entropy property: Among all distributions with given mean and covariance, Gaussian has maximum entropy \[H(X,Y) = -\int\int p(x,y)\log p(x,y) \, dx \, dy\]

Central Limit Theorem: Sum of many independent effects → Gaussian \[\frac{1}{\sqrt{n}}\sum_{i=1}^n (X_i - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)\]

Preserved under linear operations: If \((X,Y)\) Gaussian and \(Z = aX + bY + c\):

  • \(Z\) is Gaussian
  • \((X,Z)\) jointly Gaussian
  • \((Y,Z)\) jointly Gaussian

Sufficient statistics: For Gaussian, sample mean and covariance are sufficient

  • No information lost by summarizing with \((\bar{X}, \bar{Y}, S_{XX}, S_{YY}, S_{XY})\)

Extension to Vectors (Preview)

Multivariate Gaussian: \(\mathbf{X} = [X_1, ..., X_n]^T\), \(\mathbf{Y} = [Y_1, ..., Y_m]^T\)

Joint distribution: \[\begin{bmatrix} \mathbf{X} \\ \mathbf{Y} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \boldsymbol{\mu}_X \\ \boldsymbol{\mu}_Y \end{bmatrix}, \begin{bmatrix} \mathbf{K}_{XX} & \mathbf{K}_{XY} \\ \mathbf{K}_{YX} & \mathbf{K}_{YY} \end{bmatrix}\right)\]

MMSE estimator (still linear!): \[\hat{\mathbf{Y}}_{\text{MMSE}} = \mathbb{E}[\mathbf{Y}|\mathbf{X}] = \boldsymbol{\mu}_Y + \mathbf{K}_{YX}\mathbf{K}_{XX}^{-1}(\mathbf{X} - \boldsymbol{\mu}_X)\]

Error covariance: \[\mathbf{K}_{\mathbf{Y}|\mathbf{X}} = \mathbf{K}_{YY} - \mathbf{K}_{YX}\mathbf{K}_{XX}^{-1}\mathbf{K}_{XY}\]

Computational considerations:

  • Matrix inversion: \(O(n^3)\)
  • Numerical stability
  • Regularization for singular \(\mathbf{K}_{XX}\)

Everything generalizes!

Connection to Machine Learning

From Theory to Data

Theory (Population)

Given joint distribution \(p(x,y)\): \[\text{MSE} = \mathbb{E}[(Y - \hat{Y})^2]\]

Linear MMSE solution: \[a^* = \frac{\text{Cov}(X,Y)}{\text{Var}(X)} = \frac{\mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]}{\mathbb{E}[X^2] - \mathbb{E}[X]^2}\]

Practice (Sample)

Given data \(\{(x_i, y_i)\}_{i=1}^n\): \[\text{MSE}_n = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2\]

Sample estimates: \[\hat{a} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}\]

Convergence: \(\hat{a} \to a^*\) as \(n \to \infty\) (law of large numbers)

This is linear regression

Empirical Risk Minimization

Population risk (unknown): \[R(a,b) = \mathbb{E}[(Y - aX - b)^2]\]

Empirical risk (computable): \[R_n(a,b) = \frac{1}{n}\sum_{i=1}^n (y_i - ax_i - b)^2\]

Optimization: \[\frac{\partial R_n}{\partial b} = -\frac{2}{n}\sum_{i=1}^n (y_i - ax_i - b) = 0\] \[\implies b = \bar{y} - a\bar{x}\]

\[\frac{\partial R_n}{\partial a} = -\frac{2}{n}\sum_{i=1}^n x_i(y_i - ax_i - b) = 0\] \[\implies a = \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}{\sum_i x_i^2 - n\bar{x}^2}\]

Sample optimization gives sample LMMSE

Beyond Linear: Features

Linear model limitations: \[\hat{y} = ax + b\]

Can only model linear relationships

Feature transformation: \[\hat{y} = \sum_{k=0}^{K} w_k \phi_k(x) = \mathbf{w}^T \boldsymbol{\phi}(x)\]

where \(\boldsymbol{\phi}(x) = [\phi_0(x), \phi_1(x), ..., \phi_K(x)]^T\)

Still linear MMSE in feature space:

  • Design matrix: \(\Phi_{ij} = \phi_j(x_i)\)
  • Solution: \(\mathbf{w}^* = (\Phi^T\Phi)^{-1}\Phi^T\mathbf{y}\)

Example: Polynomial features \[\phi_0(x) = 1, \quad \phi_1(x) = x, \quad \phi_2(x) = x^2\]

Now can fit parabolas while still using linear methods
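A sketch of polynomial-feature least squares (assuming NumPy; the data model reuses the earlier uniform-quadratic example):

```python
import numpy as np

# fit y ≈ w0 + w1 x + w2 x^2 with ordinary least squares on polynomial features
rng = np.random.default_rng(7)
x = rng.uniform(-2, 2, 500)
y = x**2 + rng.normal(0, 0.5, 500)

Phi = np.column_stack([np.ones_like(x), x, x**2])   # design matrix, Phi_ij = phi_j(x_i)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)                                            # ~[0, 0, 1]
print(np.mean((y - Phi @ w)**2))                    # ~0.25, near the MMSE floor
```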

Gradient Methods

Batch gradient descent: \[J(a,b) = \frac{1}{n}\sum_{i=1}^n (y_i - ax_i - b)^2\]

Gradients: \[\frac{\partial J}{\partial a} = -\frac{2}{n}\sum_{i=1}^n x_i(y_i - ax_i - b)\] \[\frac{\partial J}{\partial b} = -\frac{2}{n}\sum_{i=1}^n (y_i - ax_i - b)\]

Update: \[a_{t+1} = a_t - \eta \frac{\partial J}{\partial a}\] \[b_{t+1} = b_t - \eta \frac{\partial J}{\partial b}\]

Stochastic version (one sample): \[a_{t+1} = a_t + \eta x_i (y_i - a_t x_i - b_t)\] \[b_{t+1} = b_t + \eta (y_i - a_t x_i - b_t)\]

This is LMS algorithm (Widrow-Hoff)

Neural Network = Composition

Single neuron: \[z = \mathbf{w}^T\mathbf{x} + b\] \[h = \sigma(z)\]

where \(\sigma\) is activation (ReLU, sigmoid, etc.)

Without activation: Linear regression \[\hat{y} = \mathbf{w}^T\mathbf{x} + b\]

With activation: Nonlinear transform \[\hat{y} = \sigma(\mathbf{w}^T\mathbf{x} + b)\]

Deep network: Composition of layers \[\mathbf{h}^{(1)} = \sigma(W^{(1)}\mathbf{x} + \mathbf{b}^{(1)})\] \[\mathbf{h}^{(2)} = \sigma(W^{(2)}\mathbf{h}^{(1)} + \mathbf{b}^{(2)})\] \[\hat{y} = W^{(L)}\mathbf{h}^{(L-1)} + b^{(L)}\]

Still minimizing MSE: \(\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2\)

Backpropagation: Chain rule for gradients

Extension to Vectors

Second-Order Statistics for Random Vectors

Random vector: \(\mathbf{X} = [X_1, X_2, ..., X_n]^T\)

Mean vector: \[\boldsymbol{\mu} = \mathbb{E}[\mathbf{X}] = \begin{bmatrix} \mathbb{E}[X_1] \\ \mathbb{E}[X_2] \\ \vdots \\ \mathbb{E}[X_n] \end{bmatrix}\]

Covariance matrix: \[K_{\mathbf{XX}} = \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T]\]

Element structure: \[[K_{\mathbf{XX}}]_{ij} = \text{Cov}(X_i, X_j) = \mathbb{E}[(X_i - \mu_i)(X_j - \mu_j)]\]

Properties:

  • Symmetric: \(K_{\mathbf{XX}} = K_{\mathbf{XX}}^T\)
  • Positive semi-definite: \(\mathbf{a}^T K_{\mathbf{XX}} \mathbf{a} \geq 0\)
  • Diagonal elements = variances
  • Off-diagonal = covariances

Joint Gaussian Vectors

Joint Gaussian distribution:

\(\begin{bmatrix} \mathbf{X} \\ \mathbf{Y} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \boldsymbol{\mu}_X \\ \boldsymbol{\mu}_Y \end{bmatrix}, \begin{bmatrix} K_{XX} & K_{XY} \\ K_{YX} & K_{YY} \end{bmatrix}\right)\)

Probability density: \[p(\mathbf{x}, \mathbf{y}) = \frac{1}{(2\pi)^{(n+m)/2}|K|^{1/2}} \exp\left(-\frac{1}{2}\mathbf{z}^TK^{-1}\mathbf{z}\right)\]

where \(\mathbf{z} = \begin{bmatrix} \mathbf{x} - \boldsymbol{\mu}_X \\ \mathbf{y} - \boldsymbol{\mu}_Y \end{bmatrix}\)

Linear operations preserve Gaussianity

If \(\mathbf{Z} = A\mathbf{X} + \mathbf{b}\), then: \[\mathbf{Z} \sim \mathcal{N}(A\boldsymbol{\mu}_X + \mathbf{b}, AK_{XX}A^T)\]

Marginals are Gaussian: \[\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}_X, K_{XX})\]

Gaussian Conditional Distribution

Given joint Gaussian, conditional is also Gaussian:

\[\mathbf{X}|\mathbf{Y} = \mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}_{X|Y}, K_{X|Y})\]

Conditional mean (MMSE estimator): \[\boldsymbol{\mu}_{X|Y} = \boldsymbol{\mu}_X + K_{XY}K_{YY}^{-1}(\mathbf{y} - \boldsymbol{\mu}_Y)\]

Conditional covariance (error covariance): \[K_{X|Y} = K_{XX} - K_{XY}K_{YY}^{-1}K_{YX}\]

Observations:

  1. Conditional mean is linear in \(\mathbf{y}\)
  2. Conditional covariance doesn’t depend on \(\mathbf{y}\)
  3. For Gaussian: MMSE = Linear MMSE

Schur complement: \[K_{X|Y} = K_{XX} - K_{XY}K_{YY}^{-1}K_{YX}\] measures reduction in uncertainty

Multiple Predictors

Scalar case: \[\hat{Y} = aX + b\]

Given: \(X\) (single predictor), predict \(Y\)

Vector case: \[\hat{Y} = \mathbf{a}^T\mathbf{X} + b\]

Given: \(\mathbf{X} = [X_1, X_2, ..., X_p]^T\) (multiple predictors)

Example: Predict temperature \(Y\) from:

  • \(X_1\) = humidity sensor
  • \(X_2\) = pressure sensor
  • \(X_3\) = wind speed

Matrix notation: \[\hat{Y} = \mathbf{a}^T\mathbf{X} + b = \sum_{j=1}^p a_j X_j + b\]

where \(\mathbf{a} = [a_1, a_2, ..., a_p]^T\) are weights

Vector LMMSE Problem

Minimize MSE over linear predictors: \[\text{MSE}(\mathbf{a}, b) = \mathbb{E}[(Y - \mathbf{a}^T\mathbf{X} - b)^2]\]

Taking derivatives:

\[\frac{\partial \text{MSE}}{\partial b} = -2\mathbb{E}[Y - \mathbf{a}^T\mathbf{X} - b] = 0\] \[\implies b = \mathbb{E}[Y] - \mathbf{a}^T\mathbb{E}[\mathbf{X}]\]

\[\frac{\partial \text{MSE}}{\partial \mathbf{a}} = -2\mathbb{E}[\mathbf{X}(Y - \mathbf{a}^T\mathbf{X} - b)] = \mathbf{0}\]

Substituting \(b\) and simplifying: \[\left(\mathbb{E}[\mathbf{X}\mathbf{X}^T] - \mathbb{E}[\mathbf{X}]\mathbb{E}[\mathbf{X}]^T\right)\mathbf{a} = \mathbb{E}[\mathbf{X}Y] - \mathbb{E}[\mathbf{X}]\mathbb{E}[Y]\]

Define:

  • \(K_{\mathbf{XX}} = \mathbb{E}[\mathbf{X}\mathbf{X}^T] - \mathbb{E}[\mathbf{X}]\mathbb{E}[\mathbf{X}]^T\) (covariance matrix)
  • \(\mathbf{k}_{\mathbf{X}Y} = \mathbb{E}[\mathbf{X}Y] - \mathbb{E}[\mathbf{X}]\mathbb{E}[Y]\) (cross-covariance vector)

Vector LMMSE Solution

Normal equations: \[K_{\mathbf{XX}}\mathbf{a} = \mathbf{k}_{\mathbf{X}Y}\]

Solution (if \(K_{\mathbf{XX}}\) invertible): \[\mathbf{a}^* = K_{\mathbf{XX}}^{-1}\mathbf{k}_{\mathbf{X}Y}\] \[b^* = \mathbb{E}[Y] - (\mathbf{a}^*)^T\mathbb{E}[\mathbf{X}]\]

LMMSE predictor: \[\hat{Y}_{\text{LMMSE}} = \mathbb{E}[Y] + (\mathbf{a}^*)^T(\mathbf{X} - \mathbb{E}[\mathbf{X}])\]

Minimum MSE: \[\text{MSE}_{\min} = \text{Var}(Y) - \mathbf{k}_{\mathbf{X}Y}^T K_{\mathbf{XX}}^{-1} \mathbf{k}_{\mathbf{X}Y}\]

Describes:

  • Variance reduction = \(\mathbf{k}_{\mathbf{X}Y}^T K_{\mathbf{XX}}^{-1} \mathbf{k}_{\mathbf{X}Y}\)
  • Measures how much \(\mathbf{X}\) helps predict \(Y\)
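A sketch of the vector solution from sample moments (assuming NumPy; the weights and noise level are illustrative). Solving the normal equations directly is preferable to forming \(K_{\mathbf{XX}}^{-1}\):

```python
import numpy as np

# vector LMMSE from sample moments: solve K_XX a = k_XY (normal equations)
rng = np.random.default_rng(8)
n, p = 10_000, 3
X = rng.normal(size=(n, p))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 4.0 + rng.normal(scale=0.3, size=n)

K_XX = np.cov(X, rowvar=False)                      # p x p covariance matrix
k_XY = np.array([np.cov(X[:, j], y)[0, 1] for j in range(p)])

a = np.linalg.solve(K_XX, k_XY)                     # solve, don't invert
b = y.mean() - a @ X.mean(axis=0)
print(a, b)                                          # ~w_true, ~4.0
mse_min = np.var(y) - k_XY @ a                       # Var(Y) - k_XY^T K_XX^{-1} k_XY
print(mse_min)                                       # ~0.09 = noise variance
```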

Geometric Interpretation

Column space perspective:

Data matrix columns: \(\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_p\)

\[\mathcal{S} = \text{span}(\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_p)\]

LMMSE = Projection: \[\hat{\mathbf{y}} = \text{Proj}_{\mathcal{S}}(\mathbf{y})\]

Orthogonality: \[\mathbf{y} - \hat{\mathbf{y}} \perp \mathcal{S}\]

Equivalently: \(X^T(\mathbf{y} - X\mathbf{a}) = \mathbf{0}\)

Projection matrix: \[P = X(X^TX)^{-1}X^T\] \[\hat{\mathbf{y}} = P\mathbf{y}\]

Properties:

  • \(P^2 = P\) (idempotent)
  • \(P^T = P\) (symmetric)
  • \(\text{rank}(P) = p\) (dimension of \(\mathcal{S}\))

Concerns in Practice (Computation)

Direct inversion issues:

  1. Ill-conditioning: When predictors highly correlated

    • \(\text{cond}(K_{\mathbf{XX}}) = \frac{\lambda_{\max}}{\lambda_{\min}} \gg 1\)
    • Small changes → large parameter changes
  2. Rank deficiency: When \(p > n\) or perfect collinearity

    • \(K_{\mathbf{XX}}\) not invertible
    • Infinitely many solutions

Solutions:

  1. QR decomposition: \(X = QR\)

    • More stable than normal equations
    • \(\mathbf{a} = R^{-1}Q^T\mathbf{y}\)
  2. SVD: \(X = U\Sigma V^T\)

    • Handles rank-deficient cases
    • Pseudoinverse: \(X^+ = V\Sigma^+ U^T\)
  3. Regularization:

    • Ridge: Add \(\lambda I\) to \(K_{\mathbf{XX}}\)
    • Makes problem well-posed
    • Trades bias for stability
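A sketch of ridge regularization (option 3) on nearly collinear predictors; NumPy is assumed, and the collinearity level and \(\lambda\) are illustrative:

```python
import numpy as np

# ill-conditioned predictors: x2 is nearly a copy of x1
rng = np.random.default_rng(9)
n = 2_000
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)          # almost collinear
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=n)

K = np.cov(X, rowvar=False)
k = np.array([np.cov(X[:, j], y)[0, 1] for j in range(2)])
print(np.linalg.cond(K))                     # huge condition number

lam = 1e-3
a_ridge = np.linalg.solve(K + lam * np.eye(2), k)   # ridge: add lambda*I
print(a_ridge)                               # weight split ~[0.5, 0.5], stable
```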

Multivariate Output

Multiple responses: \[\mathbf{Y} = [Y_1, Y_2, ..., Y_m]^T\]

Linear model: \[\hat{\mathbf{Y}} = W^T\mathbf{X} + \mathbf{b}\]

where \(W \in \mathbb{R}^{p \times m}\) is weight matrix

MSE matrix: \[\mathbf{M} = \mathbb{E}[(\mathbf{Y} - \hat{\mathbf{Y}})(\mathbf{Y} - \hat{\mathbf{Y}})^T]\]

Minimize trace (total MSE): \[\text{tr}(\mathbf{M}) = \sum_{i=1}^m \mathbb{E}[(Y_i - \hat{Y}_i)^2]\]

Solution: \[W^* = K_{\mathbf{XX}}^{-1}K_{\mathbf{XY}}\]

where \(K_{\mathbf{XY}} = \mathbb{E}[\mathbf{X}\mathbf{Y}^T] - \mathbb{E}[\mathbf{X}]\mathbb{E}[\mathbf{Y}]^T\)

Key: Solve \(m\) separate LMMSE problems

Decorrelation via Eigendecomposition

Computational cost:

When \(K_{\mathbf{XX}}\) has off-diagonal terms:

  • Matrix inversion requires \(O(p^3)\) operations
  • All variables coupled in solution
  • Numerical instability for ill-conditioned matrices

Eigendecomposition approach:

Any symmetric positive semi-definite matrix: \[K_{\mathbf{XX}} = U\Lambda U^T\]

where:

  • \(U = [\mathbf{u}_1, \mathbf{u}_2, ..., \mathbf{u}_p]\) orthonormal
  • \(\Lambda = \text{diag}(\lambda_1, \lambda_2, ..., \lambda_p)\)
  • \(\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_p \geq 0\)

Eigenvectors are orthogonal \[\mathbf{u}_i^T \mathbf{u}_j = \delta_{ij}\]

This orthogonality simplifies coordinate transformation.

Karhunen-Loève Transform

Coordinate transformation (take \(\mathbf{X}\) zero-mean so that \(K_{\mathbf{XX}} = \mathbb{E}[\mathbf{X}\mathbf{X}^T]\)): \[\mathbf{Z} = U^T\mathbf{X}\]

Covariance in new coordinates: \[K_{\mathbf{ZZ}} = \mathbb{E}[\mathbf{Z}\mathbf{Z}^T]\] \[= \mathbb{E}[U^T\mathbf{X}\mathbf{X}^TU]\] \[= U^T\mathbb{E}[\mathbf{X}\mathbf{X}^T]U\] \[= U^TK_{\mathbf{XX}}U\] \[= U^T(U\Lambda U^T)U\] \[= \Lambda\]

Result: Diagonal covariance matrix

Components of \(\mathbf{Z}\) are uncorrelated: \[\mathbb{E}[Z_iZ_j] = \lambda_i\delta_{ij}\]

Inverse transform (reconstruction): \[\mathbf{X} = U\mathbf{Z}\]

Since \(U\) is orthonormal: \(U^TU = I\)

KL Example: 3×3 Matrix

Given covariance: \[\mathbf{K} = \begin{bmatrix} 6 & -4 & 0 \\ -4 & 6 & 0 \\ 0 & 0 & 3 \end{bmatrix}\]

Find eigenvalues: Solve \(\det(\mathbf{K} - \lambda\mathbf{I}) = 0\)

Block structure simplifies:

  • Upper 2×2 block: \(\lambda_{1,2} = 6 \pm 4\)
  • Lower 1×1 block: \(\lambda_3 = 3\)

Result: \(\lambda_1 = 10, \lambda_2 = 3, \lambda_3 = 2\)

Find eigenvectors: Solve \((\mathbf{K} - \lambda_i\mathbf{I})\mathbf{v} = 0\)

\[\mathbf{u}_1 = \frac{1}{\sqrt{2}}\begin{bmatrix}1\\-1\\0\end{bmatrix}, \quad \mathbf{u}_2 = \begin{bmatrix}0\\0\\1\end{bmatrix}, \quad \mathbf{u}_3 = \frac{1}{\sqrt{2}}\begin{bmatrix}1\\1\\0\end{bmatrix}\]

Result: \(\mathbf{K}_{zz} = \text{diag}(10, 3, 2)\)
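The same numbers fall out of a numerical eigendecomposition (assuming NumPy; note that eigh returns eigenvalues in ascending order):

```python
import numpy as np

K = np.array([[6., -4., 0.],
              [-4., 6., 0.],
              [0., 0., 3.]])

lam, U = np.linalg.eigh(K)          # eigh: symmetric eigendecomposition
print(lam)                          # [2., 3., 10.]  (ascending order)
print(np.round(U.T @ K @ U, 10))    # diagonal: K_zz = diag(2, 3, 10)
```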

Application to LMMSE

Original problem (coupled): \[\hat{\mathbf{X}} = K_{\mathbf{XY}} K_{\mathbf{YY}}^{-1} \mathbf{Y}\]

Requires inverting full matrix \(K_{\mathbf{YY}}\).

After KL transform:

  1. Transform: \(\mathbf{Z} = U^T\mathbf{X}\), \(\mathbf{W} = V^T\mathbf{Y}\)
  2. Covariances become diagonal
  3. Decouples LMMSE:

\[\hat{Z}_i = \frac{\text{Cov}(Z_i, W_i)}{\text{Var}(W_i)} W_i\]

Each component solved independently.

Computational advantage:

  • Original: \(O(p^3)\) for matrix inversion
  • KL approach: \(O(p^2)\) for eigendecomposition, then \(O(p)\) for solution

Numerical stability: Small eigenvalues → regularization: \[\hat{Z}_i = \frac{\text{Cov}(Z_i, W_i)}{\lambda_i + \epsilon} W_i\]

Dimensionality Reduction

Not all dimensions equally informative

Eigenvalues measure variance (information) per dimension.

Truncated representation: Keep only first \(k < p\) components: \[\mathbf{X} \approx \sum_{i=1}^k Z_i \mathbf{u}_i\]

Selection criterion: Retain fraction \(\alpha\) of total variance: \[\frac{\sum_{i=1}^k \lambda_i}{\sum_{i=1}^p \lambda_i} \geq \alpha\]

Common choice: \(\alpha = 0.95\) or \(0.99\)

Reconstruction error: \[\mathbb{E}[\|\mathbf{X} - \hat{\mathbf{X}}_k\|^2] = \sum_{i=k+1}^p \lambda_i\]

Error equals sum of discarded eigenvalues.
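A sketch of the variance-retention rule (assuming NumPy; the covariance here is a random PSD matrix rather than data):

```python
import numpy as np

# keep the smallest k whose leading eigenvalues retain >= alpha of total variance
rng = np.random.default_rng(10)
A = rng.normal(size=(6, 6))
K = A @ A.T                                   # a random PSD covariance
lam = np.sort(np.linalg.eigvalsh(K))[::-1]    # eigenvalues, descending

alpha = 0.95
frac = np.cumsum(lam) / lam.sum()
k = int(np.searchsorted(frac, alpha) + 1)
print(k, frac)                 # components kept and cumulative fractions
print(lam[k:].sum())           # expected reconstruction error E||X - X_k||^2
```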

Applications:

  • Principal Component Analysis (PCA)
  • Data compression
  • Denoising (discard noisy dimensions)