
EE 541 - Unit 3
Fall 2025
Partial observability is fundamental
Real systems operate with incomplete information:
Objective: Given measurements (\(X\)), infer unknowns (\(Y\))
Uncertainty is inherent, not a nuisance
Need algorithms for optimal inference under uncertainty

Problem:
Predictor function: \(g: \mathbb{R} \to \mathbb{R}\) \[\hat{Y} = g(X)\]
Example: Room temperature estimation
Among all possible functions \(g\), which one is “best”?
Need:

Loss function \(\ell(y, \hat{y})\) measures prediction quality
Common choices:
| Loss | Formula | Properties |
|---|---|---|
| Squared | \((y - \hat{y})^2\) | Differentiable, convex, emphasizes large errors |
| Absolute | \(\lvert y - \hat{y}\rvert\) | Robust to outliers, non-differentiable at 0 |
| 0-1 | \(\mathbb{1}[y \neq \hat{y}]\) | Classification, discontinuous |
| Huber | \(\begin{cases} \frac{1}{2}(y-\hat{y})^2 & \lvert y-\hat{y}\rvert \leq \delta \\ \delta\lvert y-\hat{y}\rvert - \frac{\delta^2}{2} & \text{otherwise} \end{cases}\) | Robust + differentiable |
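For concreteness, here is a minimal NumPy sketch of these four losses; the function names and the default `delta` are illustrative choices, not part of the notes.

```python
import numpy as np

def squared_loss(y, y_hat):
    """(y - y_hat)^2: differentiable, convex, emphasizes large errors."""
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    """|y - y_hat|: robust to outliers, non-differentiable at 0."""
    return np.abs(y - y_hat)

def zero_one_loss(y, y_hat):
    """1[y != y_hat]: classification loss, discontinuous."""
    return (y != y_hat).astype(float)

def huber_loss(y, y_hat, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond: robust and differentiable."""
    e = np.abs(y - y_hat)
    return np.where(e <= delta, 0.5 * e**2, delta * e - 0.5 * delta**2)

y, y_hat = np.array([1.0, 2.0, 10.0]), np.array([1.5, 2.0, 3.0])
print(squared_loss(y, y_hat))   # the outlier (10 vs 3) dominates
print(huber_loss(y, y_hat))     # its influence grows only linearly
```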
Risk: Expected loss over joint distribution \[R(g) = \mathbb{E}[\ell(Y, g(X))] = \int\int \ell(y, g(x)) p(x,y) \, dx \, dy\]
Goal: Find \(g^* = \arg\min_g R(g)\)

Mathematical advantages:
Differentiable: \[\frac{d}{d\hat{y}}(y - \hat{y})^2 = -2(y - \hat{y})\] Smooth objective
Unique minimum: Strictly convex → no local minima
Closed-form solutions: Often analytically tractable
Decomposition: \[\text{MSE} = \text{Bias}^2 + \text{Variance}\]
Statistical connection:
Under Gaussian noise \(N \sim \mathcal{N}(0, \sigma^2)\): \[Y = \mu + N\]
Maximum likelihood estimation: \[\hat{\mu}_{\text{MLE}} = \arg\max_\mu p(y|\mu) = \arg\min_\mu (y - \mu)^2\]
MSE ↔︎ Gaussian MLE

Definition: \[\text{MSE}(g) = \mathbb{E}[(Y - g(X))^2]\]
Expanding the expectation: \[\text{MSE}(g) = \int\int (y - g(x))^2 p(x,y) \, dx \, dy\]
Empirical MSE (from data): \[\widehat{\text{MSE}} = \frac{1}{n} \sum_{i=1}^n (y_i - g(x_i))^2\]
Properties:
Other forms:

Hierarchy of function classes:
Flexibility vs Tractability

Definition: A linear estimator has the form \[\hat{Y} = aX + b\] where \(a, b \in \mathbb{R}\) are constants.
Note: here “linear” means linear (affine) in the observation \(X\); contrast with regression terminology, where “linear” refers to linearity in the parameters.
Parameters:
Special cases:
Property (superposition): If \(\hat{Y}_1 = aX_1 + b\) and \(\hat{Y}_2 = aX_2 + b\), then: \[\alpha\hat{Y}_1 + \beta\hat{Y}_2 = a(\alpha X_1 + \beta X_2) + (\alpha + \beta)b\]
Matrix form (preview): \[\hat{\mathbf{y}} = \mathbf{A}\mathbf{x} + \mathbf{b}\]

Mean E[Y]: Baseline predictor
Variance Var(Y): Spread/uncertainty
Covariance Cov(X,Y): Linear relationship strength
Correlation ρ: Normalized association
These moments completely determine the optimal linear estimator

First moments: \[\mathbb{E}[Y] = \int y \, p(y) \, dy\]
Second moments: \[\mathbb{E}[Y^2] = \int y^2 \, p(y) \, dy\]
Variance: \[\text{Var}(Y) = \mathbb{E}[(Y - \mathbb{E}[Y])^2] = \mathbb{E}[Y^2] - (\mathbb{E}[Y])^2\]
Covariance: \[\text{Cov}(X,Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]\] \[= \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]\]
Correlation coefficient: \[\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}} \in [-1, 1]\]

Additive noise model: \[X = Y + N\]
Signal-to-Noise Ratio (SNR): \[\text{SNR} = \frac{\text{Power}(Y)}{\text{Power}(N)} = \frac{\mathbb{E}[Y^2]}{\mathbb{E}[N^2]}\]
For zero-mean: \[\text{SNR} = \frac{\sigma_Y^2}{\sigma_N^2}\]
In dB: \(\text{SNR}_{\text{dB}} = 10\log_{10}(\text{SNR})\)
Fundamental limit: even optimal estimation cannot remove all noise. For jointly Gaussian \(Y\) and \(N\) (and for the best linear estimator in general): \[\text{MSE}_{\text{min}} = \frac{\sigma_Y^2 \sigma_N^2}{\sigma_Y^2 + \sigma_N^2} = \frac{\sigma_Y^2}{1 + \text{SNR}}\]

Joint distribution \((X, Y) \sim p(x,y)\) contains all information.
Bayes’ rule: \[p(y|x) = \frac{p(x,y)}{p(x)} = \frac{p(x,y)}{\int p(x,y') dy'}\]
Once we observe \(X = x\), the distribution of \(Y\) changes from \(p(y)\) to \(p(y|x)\).
Example: Temperature sensor
Conditioning incorporates information, reducing spread.
Conditioning reduces uncertainty
Discrete case: \[P(Y = y|X = x) = \frac{P(X = x, Y = y)}{P(X = x)}\]

Definition: The conditional expectation of \(Y\) given \(X = x\) is:
Discrete case: \[\mathbb{E}[Y|X=x] = \sum_y y \cdot P(Y=y|X=x)\]
Continuous case: \[\mathbb{E}[Y|X=x] = \int_{-\infty}^{\infty} y \cdot p(y|x) \, dy\]
Facts:
Example: \(Y = X + N\), \(N \sim \mathcal{N}(0,1)\), independent \(X\) and \(N\) \[\mathbb{E}[Y|X=x] = \mathbb{E}[x + N] = x\]

Example: Binary communication channel
Joint distribution: \[P(X=0, Y=0) = 0.45, \quad P(X=1, Y=0) = 0.05\] \[P(X=0, Y=1) = 0.05, \quad P(X=1, Y=1) = 0.45\]
Conditional distributions: \[P(X=0|Y=0) = \frac{0.45}{0.5} = 0.9\] \[P(X=1|Y=0) = \frac{0.05}{0.5} = 0.1\]
Conditional expectations: \[\mathbb{E}[X|Y=0] = 0 \cdot 0.9 + 1 \cdot 0.1 = 0.1\] \[\mathbb{E}[X|Y=1] = 0 \cdot 0.1 + 1 \cdot 0.9 = 0.9\]
MSE for different predictors:
Result: \(\mathbb{E}[X|Y]\) achieves minimum MSE
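A short NumPy check of these numbers from the joint table; the array layout (rows index \(x\), columns index \(y\)) is my own convention.

```python
import numpy as np

# Joint pmf P(X=x, Y=y); rows index x in {0,1}, columns index y in {0,1}
P = np.array([[0.45, 0.05],
              [0.05, 0.45]])

P_Y = P.sum(axis=0)            # marginal of Y: [0.5, 0.5]
P_X_given_Y = P / P_Y          # column y holds P(X=x | Y=y)
x_vals = np.array([0.0, 1.0])
E_X_given_Y = (x_vals[:, None] * P_X_given_Y).sum(axis=0)
print(E_X_given_Y)             # [0.1, 0.9]

# MSE of the conditional-mean predictor vs. the constant predictor E[X] = 0.5
mse_mmse  = sum(P[x, y] * (x_vals[x] - E_X_given_Y[y]) ** 2
                for x in range(2) for y in range(2))
mse_const = sum(P[x, y] * (x_vals[x] - 0.5) ** 2
                for x in range(2) for y in range(2))
print(mse_mmse, mse_const)     # 0.09 vs 0.25
```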

Joint Gaussian: \((X, Y) \sim \mathcal{N}(\mu, K)\)
Parameters: \[\mu = \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \quad K = \begin{bmatrix} \sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2 \end{bmatrix}\]
Result (derive later): \[\mathbb{E}[Y|X=x] = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X)\]
Properties:
Conditional variance (constant!): \[\text{Var}(Y|X=x) = \sigma_Y^2(1 - \rho^2)\]
Independent of \(x\) - uncertainty same everywhere.
Special cases:

Computing \(\mathbb{E}[Y|X=x]\) requires \(p(y|x)\)
From Bayes: \[p(y|x) = \frac{p(x,y)}{p(x)} = \frac{p(x|y)p(y)}{\int p(x|y')p(y')dy'}\]
When tractable:
When intractable (often no closed form):
Example: Nonlinear sensor \[Y = \text{true value}, \quad X = Y + Y^2 \cdot N\] where \(N \sim \mathcal{N}(0, 0.1)\)
Computing \(\mathbb{E}[Y|X=x]\) requires:

Law of Total Expectation: \[\mathbb{E}[\mathbb{E}[Y|X]] = \mathbb{E}[Y]\]
Proof: \[\mathbb{E}[\mathbb{E}[Y|X]] = \mathbb{E}[g(X)]\] where \(g(x) = \mathbb{E}[Y|X=x]\)
\[= \int g(x) p(x) dx = \int \mathbb{E}[Y|X=x] p(x) dx\]
\[= \int \left(\int y \, p(y|x) dy\right) p(x) dx\]
\[= \int\int y \, p(y|x) p(x) dy \, dx\]
\[= \int y \left(\int p(y|x)p(x) dx\right) dy\]
\[= \int y \, p(y) dy = \mathbb{E}[Y]\]
Intuition: Average of conditional averages = unconditional average
Example: Class grades

Property: For any function \(h(X)\): \[\mathbb{E}[h(X)Y | X] = h(X)\mathbb{E}[Y|X]\]
Proof: Fix \(X = x\). Then \(h(X) = h(x)\) is a constant: \[\mathbb{E}[h(X)Y | X=x] = \mathbb{E}[h(x)Y | X=x]\] \[= h(x)\mathbb{E}[Y | X=x]\]
Special cases:
Example: Signal processing \[Z = X \cdot Y + X^2\] \[\mathbb{E}[Z|X] = X\mathbb{E}[Y|X] + X^2\]
Only need \(\mathbb{E}[Y|X]\), not \(\mathbb{E}[XY|X]\)!
BUT: Cannot pull out functions of \(Y\): \[\mathbb{E}[Y^2|X] \neq Y \cdot \mathbb{E}[Y|X]\]

Definition: \[\text{Var}(Y|X) = \mathbb{E}[(Y - \mathbb{E}[Y|X])^2 | X]\]
Computing formula: \[\text{Var}(Y|X) = \mathbb{E}[Y^2|X] - (\mathbb{E}[Y|X])^2\]
Measures remaining uncertainty after observing \(X\)
Law of Total Variance: \[\text{Var}(Y) = \mathbb{E}[\text{Var}(Y|X)] + \text{Var}(\mathbb{E}[Y|X])\]
Interpretation: \[\text{Total uncertainty} = \text{Unexplained} + \text{Explained by } X\]
Minimum MSE: \[\text{MSE}_{\text{min}} = \mathbb{E}[(Y - \mathbb{E}[Y|X])^2] = \mathbb{E}[\text{Var}(Y|X)]\]
Cannot do better than conditional variance!
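A quick Monte Carlo check of the law of total variance and of \(\text{MSE}_{\min} = \mathbb{E}[\text{Var}(Y|X)]\), using an assumed toy model \(Y = X + N\) with \(X \sim \mathcal{N}(0,1)\) and \(N \sim \mathcal{N}(0, 0.5^2)\) independent (so \(\mathbb{E}[Y|X] = X\) and \(\text{Var}(Y|X) = 0.25\)).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X = rng.normal(0.0, 1.0, n)
N = rng.normal(0.0, 0.5, n)
Y = X + N                          # E[Y|X] = X, Var(Y|X) = 0.25

var_total   = Y.var()              # Var(Y) ~ 1.25
explained   = X.var()              # Var(E[Y|X]) = Var(X) ~ 1.0
unexplained = (Y - X).var()        # E[Var(Y|X)] ~ 0.25 (constant here)

print(var_total, explained + unexplained)   # both ~ 1.25
print(np.mean((Y - X) ** 2))                # MSE of E[Y|X]: ~ 0.25
```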

Main result (prove next):
Among all functions \(g: \mathbb{R} \to \mathbb{R}\): \[\mathbb{E}[Y|X] = \arg\min_{g} \mathbb{E}[(Y - g(X))^2]\]
Achieving minimum MSE: \[\text{MSE}_{\text{min}} = \mathbb{E}[\text{Var}(Y|X)]\]
i.e., cannot beat the average conditional variance.
Why conditional expectation?
For fixed \(X = x\), minimizing \(\mathbb{E}[(Y - c)^2|X=x]\) over constants \(c\) yields: \[c^* = \mathbb{E}[Y|X=x]\]
Do this for each \(x\) → optimal function is \(g(x) = \mathbb{E}[Y|X=x]\)
Computational reality:
Constraints simplify the problem.

Goal: Find the best predictor of \(Y\) given \(X\)
Function space: \[\mathcal{G} = \{g: \mathbb{R} \to \mathbb{R} \text{ measurable}\}\]
Optimization problem: \[\min_{g \in \mathcal{G}} \text{MSE}(g) = \min_{g \in \mathcal{G}} \mathbb{E}[(Y - g(X))^2]\]
Infinite-dimensional optimization
Strategy: Decompose the problem by conditioning on \(X\)

Decomposition: \[\text{MSE}(g) = \mathbb{E}[(Y - g(X))^2] = \mathbb{E}[\mathbb{E}[(Y - g(X))^2|X]]\]
Why this helps:
When \(X = x\) is fixed:
This transforms the problem:
Solution approach:

Step 1: Express MSE using conditional expectation \[\text{MSE}(g) = \mathbb{E}[(Y - g(X))^2]\] \[= \mathbb{E}[\mathbb{E}[(Y - g(X))^2|X]]\]
Step 2: For fixed \(X = x\), \(g(X) = g(x)\) is a constant \[\mathbb{E}[(Y - g(x))^2|X=x] = \mathbb{E}[(Y - a)^2|X=x]\] where \(a = g(x)\)
Step 3: Minimize over \(a\) \[\frac{d}{da}\mathbb{E}[(Y - a)^2|X=x] = 0\] \[-2\mathbb{E}[(Y - a)|X=x] = 0\] \[\mathbb{E}[Y|X=x] - a = 0\]
Solution: \(a^* = \mathbb{E}[Y|X=x]\)
Step 4: Since this holds for all \(x\): \[g^*(x) = \mathbb{E}[Y|X=x]\]
Therefore: \[\boxed{\hat{Y}_{\text{MMSE}} = \mathbb{E}[Y|X]}\]

Orthogonality principle: Optimal error orthogonal to any function of \(X\): \[\mathbb{E}[(Y - \mathbb{E}[Y|X])h(X)] = 0\] for any measurable \(h\).
Uniqueness: Only one function satisfies this → unique MMSE
Optimality test: If \(\mathbb{E}[\epsilon \cdot h(X)] \neq 0\), can reduce MSE by adjusting estimate in direction of \(h(X)\)
Computational foundation:
Geometric meaning: MMSE = projection in Hilbert space
Proof: \(\mathbb{E}[(Y - \mathbb{E}[Y|X])\,|\,X] = 0\) by definition of conditional expectation; since \(h(X)\) is fixed given \(X\), the tower property gives \(\mathbb{E}[(Y - \mathbb{E}[Y|X])h(X)] = \mathbb{E}\big[h(X)\,\mathbb{E}[(Y - \mathbb{E}[Y|X])\,|\,X]\big] = 0\).
Cannot extract more information from \(X\)

We solved pointwise: For each \(x\), found \(g^*(x) = \mathbb{E}[Y|X=x]\)
Question: Is this globally optimal over all functions?
Two concerns:
Key fact: MSE decomposes additively \[\text{MSE}(g) = \int \mathbb{E}[(Y - g(x))^2|X=x] p(x) dx\]
Each term \(\mathbb{E}[(Y - g(x))^2|X=x]\) depends only on \(g(x)\), not on \(g(x')\) for \(x' \neq x\).
Therefore:
Idea: Prediction at \(x\) doesn’t help prediction at \(x'\)

Claim: No function can beat \(\mathbb{E}[Y|X]\)
Proof: For any function \(g\): \[\text{MSE}(g) = \mathbb{E}[(Y - g(X))^2]\]
Add and subtract \(\mathbb{E}[Y|X]\): \[= \mathbb{E}[(Y - \mathbb{E}[Y|X] + \mathbb{E}[Y|X] - g(X))^2]\]
Expand the square: \[= \mathbb{E}[(Y - \mathbb{E}[Y|X])^2] + \mathbb{E}[(\mathbb{E}[Y|X] - g(X))^2]\] \[+ 2\mathbb{E}[(Y - \mathbb{E}[Y|X])(\mathbb{E}[Y|X] - g(X))]\]
Key: The cross term is zero! \[\mathbb{E}[(Y - \mathbb{E}[Y|X])(\mathbb{E}[Y|X] - g(X))] = 0\]
Because \(\mathbb{E}[Y|X] - g(X)\) is a function of \(X\) and: \[\mathbb{E}[(Y - \mathbb{E}[Y|X])h(X)] = 0\] for any \(h(X)\) (orthogonality).
Therefore: \[\text{MSE}(g) = \text{MSE}_{\min} + \mathbb{E}[(\mathbb{E}[Y|X] - g(X))^2] \geq \text{MSE}_{\min}\]

Minimum achievable MSE: \[\text{MSE}_{\min} = \mathbb{E}[(Y - \mathbb{E}[Y|X])^2]\]
Result: \[\boxed{\text{MSE}_{\min} = \mathbb{E}[\text{Var}(Y|X)]}\]
Proof: \[\text{MSE}_{\min} = \mathbb{E}[(Y - \mathbb{E}[Y|X])^2]\] \[= \mathbb{E}[\mathbb{E}[(Y - \mathbb{E}[Y|X])^2|X]]\]
For fixed \(X = x\): \[\mathbb{E}[(Y - \mathbb{E}[Y|X=x])^2|X=x] = \text{Var}(Y|X=x)\]
Therefore: \[\text{MSE}_{\min} = \mathbb{E}[\text{Var}(Y|X)]\]
Special cases:

Squared error is sensitive to outliers
Single outlier can dominate MSE: \[\text{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2\]
If \(y_k\) is outlier: \((y_k - \hat{y}_k)^2 \gg (y_i - \hat{y}_i)^2\) for \(i \neq k\)
Effect on MMSE:
Example: Sensor failure
MMSE tries to accommodate the outlier, degrading performance for typical cases.
Alternative loss functions:

When outliers are present:
Median regression (LAD): \[\min_g \mathbb{E}[|Y - g(X)|]\] Solution: \(g^*(x) = \text{median}(Y|X=x)\)
Huber regression: \[\min_g \mathbb{E}[\ell_\delta(Y - g(X))]\] where \(\ell_\delta(e) = \begin{cases} \frac{1}{2}e^2 & |e| \leq \delta \\ \delta|e| - \frac{\delta^2}{2} & |e| > \delta \end{cases}\)
Trimmed mean: Remove top/bottom α% before computing \(\mathbb{E}[Y|X]\)

Joint Gaussian: \((X, Y) \sim \mathcal{N}(\mu, K)\)
\[K = \begin{bmatrix} \sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2 \end{bmatrix}\]
MMSE estimator: \[\hat{Y}_{\text{MMSE}} = \mathbb{E}[Y|X] = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(X - \mu_X)\]
Key property: Linear in \(X\)!
Minimum MSE: \[\text{MSE}_{\min} = \mathbb{E}[\text{Var}(Y|X)] = \sigma_Y^2(1 - \rho^2)\]
Observations:
Why linear? Gaussian conditional \(Y|X\) has:

Uniform-Quadratic example:
MMSE estimator: \[\hat{Y}_{\text{MMSE}} = \mathbb{E}[Y|X] = X^2\]
NOTE: Nonlinear in \(X\)
Linear MMSE (suboptimal): \[\hat{Y}_{\text{linear}} = \mathbb{E}[Y] + \frac{\text{Cov}(X,Y)}{\text{Var}(X)}(X - \mathbb{E}[X])\]
Since \(\mathbb{E}[X] = 0\) and \(\text{Cov}(X,Y) = \mathbb{E}[X^3] = 0\): \[\hat{Y}_{\text{linear}} = \mathbb{E}[Y] = \mathbb{E}[X^2] = \frac{4}{3}\]
Performance gap:
Linear loses significant performance when true relationship is nonlinear!
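The setup above is consistent with, e.g., \(X \sim \text{Uniform}[-2,2]\) and \(Y = X^2\) (so that \(\mathbb{E}[X^2] = 4/3\)); the sketch below uses that assumed model to compare the two estimators by simulation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
X = rng.uniform(-2.0, 2.0, n)      # assumed: X ~ Uniform[-2, 2], so E[X^2] = 4/3
Y = X ** 2                         # assumed noiseless quadratic relation: E[Y|X] = X^2

# MMSE predictor: E[Y|X] = X^2
mse_mmse = np.mean((Y - X ** 2) ** 2)        # 0 for this noiseless model

# Best linear predictor: slope Cov(X,Y)/Var(X) ~ 0, so it predicts E[Y] = 4/3
a = np.cov(X, Y)[0, 1] / X.var()
b = Y.mean() - a * X.mean()
mse_lin = np.mean((Y - (a * X + b)) ** 2)    # ~ Var(X^2) = 64/45 ~ 1.42

print(a, b, mse_mmse, mse_lin)
```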

Curse of dimensionality:
To compute \(\mathbb{E}[Y|X]\), need \(p(y|x)\)
Density estimation convergence:
Why this matters:
Linear class advantages:

MMSE requires \(p(y|x)\) to compute \(\mathbb{E}[Y|X]\)
Reality: We rarely know conditional distributions
What we often have:
Practical solution: Restrict to linear predictors \[\hat{Y} = aX + b\]

Function class: \[\mathcal{L} = \{g(x) = ax + b : a, b \in \mathbb{R}\}\]
New optimization problem: \[\min_{a,b} \text{MSE}(a,b) = \min_{a,b} \mathbb{E}[(Y - aX - b)^2]\]
From infinite to finite:
Expanding the MSE: \[\text{MSE}(a,b) = \mathbb{E}[Y^2 - 2Y(aX + b) + (aX + b)^2]\] \[= \mathbb{E}[Y^2] - 2a\mathbb{E}[XY] - 2b\mathbb{E}[Y]\] \[+ a^2\mathbb{E}[X^2] + 2ab\mathbb{E}[X] + b^2\]
MSE is quadratic in \((a, b)\)

Step 1: Find optimal \(b\) for fixed \(a\)
Take derivative with respect to \(b\): \[\frac{\partial}{\partial b}\text{MSE}(a,b) = \frac{\partial}{\partial b}\mathbb{E}[(Y - aX - b)^2]\]
\[= -2\mathbb{E}[Y - aX - b] = 0\]
Solving for \(b\): \[\mathbb{E}[Y - aX - b] = 0\] \[\mathbb{E}[Y] - a\mathbb{E}[X] - b = 0\]
\[\boxed{b^* = \mathbb{E}[Y] - a\mathbb{E}[X]}\]
Interpretation:
Substituting back: \[\hat{Y} = aX + \mathbb{E}[Y] - a\mathbb{E}[X] = \mathbb{E}[Y] + a(X - \mathbb{E}[X])\]

Step 2: Find optimal \(a\) using \(b^* = \mathbb{E}[Y] - a\mathbb{E}[X]\)
Substitute into MSE: \[\text{MSE}(a) = \mathbb{E}[(Y - \mathbb{E}[Y] - a(X - \mathbb{E}[X]))^2]\]
Define centered variables:
Then: \(\text{MSE}(a) = \mathbb{E}[(\tilde{Y} - a\tilde{X})^2]\)
Take derivative: \[\frac{d}{da}\text{MSE}(a) = -2\mathbb{E}[\tilde{X}(\tilde{Y} - a\tilde{X})] = 0\]
\[\mathbb{E}[\tilde{X}\tilde{Y}] - a\mathbb{E}[\tilde{X}^2] = 0\]
Solution: \[\boxed{a^* = \frac{\mathbb{E}[\tilde{X}\tilde{Y}]}{\mathbb{E}[\tilde{X}^2]} = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}}\]
Note: \(\mathbb{E}[\tilde{X}\tilde{Y}] = \text{Cov}(X,Y)\) and \(\mathbb{E}[\tilde{X}^2] = \text{Var}(X)\)

Problem: Don’t know statistics a priori
LMMSE needs: \(\text{Cov}(X,Y)\) and \(\text{Var}(X)\)
Solution: Learn from streaming data
Gradient descent on MSE: \[\frac{\partial}{\partial a} \mathbb{E}[(Y - aX - b)^2] = -2\mathbb{E}[(Y - aX - b)X]\]
Widrow-Hoff insight:
Replace expectation with instantaneous value: \[\mathbb{E}[(Y - aX - b)X] \approx (y_i - ax_i - b)x_i = e_i x_i\]
LMS update rule: \[a_{i+1} = a_i + \mu e_i x_i\] \[b_{i+1} = b_i + \mu e_i\]
where \(e_i = y_i - \hat{y}_i\) is prediction error
Convergence: \(a_i \to a^*_{\text{LMMSE}}\) as \(i \to \infty\)
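A minimal LMS sketch on synthetic streaming data; the data model, noise level, and step size \(\mu = 0.01\) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
a_true, b_true = 1.5, -0.5
mu = 0.01                           # step size; needs 0 < mu < 2 / E[X^2]

a, b = 0.0, 0.0
for _ in range(50_000):
    x = rng.normal(0.0, 1.0)                       # one streaming sample
    y = a_true * x + b_true + rng.normal(0.0, 0.1)
    e = y - (a * x + b)                            # instantaneous error e_i
    a += mu * e * x                                # Widrow-Hoff (LMS) updates
    b += mu * e

print(a, b)                         # drifts toward (1.5, -0.5)
```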

Convergence condition: \[0 < \mu < \frac{2}{\mathbb{E}[X^2]}\]
Too large \(\mu\) → divergence
Steady-state error:
Misadjustment: \(M = \frac{\mu \mathbb{E}[X^2]}{2}\)
Trade-off:
Time constant: \[\tau \approx \frac{1}{2\mu \mathbb{E}[X^2]}\] iterations to converge.
Advantages:
Relation: LMS = stochastic gradient descent for MSE

Complete solution: \[\boxed{\hat{Y}_{\text{LMMSE}} = \mathbb{E}[Y] + \frac{\text{Cov}(X,Y)}{\text{Var}(X)}(X - \mathbb{E}[X])}\]
Alternative forms:
Using correlation \(\rho = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}\): \[\hat{Y}_{\text{LMMSE}} = \mathbb{E}[Y] + \rho\frac{\sigma_Y}{\sigma_X}(X - \mathbb{E}[X])\]
Standardized form: If \(\mathbb{E}[X] = \mathbb{E}[Y] = 0\) and \(\sigma_X = \sigma_Y = 1\): \[\hat{Y}_{\text{LMMSE}} = \rho X\]
Minimum MSE: \[\text{MSE}_{\text{LMMSE}} = \text{Var}(Y)(1 - \rho^2)\]
Properties:

Slope decomposition: \[a^* = \frac{\text{Cov}(X,Y)}{\text{Var}(X)} = \frac{\rho \sigma_X \sigma_Y}{\sigma_X^2} = \rho \frac{\sigma_Y}{\sigma_X}\]
Correlation determines predictability
Fraction of variance explained: \[R^2 = \frac{\text{Var}(Y) - \text{MSE}_{\text{LMMSE}}}{\text{Var}(Y)} = \rho^2\]
Correlation measures linear predictability, not general dependence

Fundamental inequality: \[\text{MSE}_{\text{LMMSE}} \geq \text{MSE}_{\text{MMSE}}\]
Always! Linear is restricted class.
Performance gap: \[\text{MSE}_{\text{LMMSE}} - \text{MSE}_{\text{MMSE}} = \mathbb{E}[(\mathbb{E}[Y|X] - \hat{Y}_{\text{LMMSE}})^2]\]
Measures nonlinearity of \(\mathbb{E}[Y|X]\).
When are they equal? \[\text{MSE}_{\text{LMMSE}} = \text{MSE}_{\text{MMSE}}\] if and only if \(\mathbb{E}[Y|X]\) is linear in \(X\).
For jointly Gaussian \((X,Y)\):

Orthogonality principle for LMMSE:
The error is orthogonal to the input: \[\mathbb{E}[(Y - \hat{Y}_{\text{LMMSE}})X] = 0\]
Proof: Substituting \(\hat{Y}_{\text{LMMSE}} = b^* + a^*X\): \[\mathbb{E}[(Y - b^* - a^*X)X] = \mathbb{E}[YX] - b^*\mathbb{E}[X] - a^*\mathbb{E}[X^2]\]
Using \(b^* = \mathbb{E}[Y] - a^*\mathbb{E}[X]\) and \(a^* = \text{Cov}(X,Y)/\text{Var}(X)\): \[= \mathbb{E}[YX] - \mathbb{E}[Y]\mathbb{E}[X] - a^*(\mathbb{E}[X^2] - \mathbb{E}[X]^2)\] \[= \text{Cov}(X,Y) - a^*\text{Var}(X) = 0\]
Geometric interpretation:
Note: Only orthogonal to \(X\) and constants, not all functions of \(X\)!

What we need:
Sample estimates from data \((x_1, y_1), ..., (x_n, y_n)\): \[\hat{\mu}_X = \frac{1}{n}\sum_{i=1}^n x_i, \quad \hat{\mu}_Y = \frac{1}{n}\sum_{i=1}^n y_i\]
\[\widehat{\text{Var}}(X) = \frac{1}{n-1}\sum_{i=1}^n (x_i - \hat{\mu}_X)^2\]
\[\widehat{\text{Cov}}(X,Y) = \frac{1}{n-1}\sum_{i=1}^n (x_i - \hat{\mu}_X)(y_i - \hat{\mu}_Y)\]
Then: \(\hat{a} = \widehat{\text{Cov}}(X,Y)/\widehat{\text{Var}}(X)\)
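A sketch of this plug-in procedure on synthetic data (the data-generating line and noise level are assumed for the demo).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000
x = rng.normal(2.0, 1.5, n)
y = 0.8 * x + 1.0 + rng.normal(0.0, 0.5, n)   # assumed ground truth: a = 0.8, b = 1.0

mu_x, mu_y = x.mean(), y.mean()
var_x  = np.sum((x - mu_x) ** 2) / (n - 1)
cov_xy = np.sum((x - mu_x) * (y - mu_y)) / (n - 1)

a_hat = cov_xy / var_x                 # sample slope
b_hat = mu_y - a_hat * mu_x            # sample intercept
print(a_hat, b_hat)                    # close to (0.8, 1.0)
```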

Bivariate Gaussian \((X, Y) \sim \mathcal{N}(\mu, K)\)
Parameters: \[\mu = \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \quad K = \begin{bmatrix} \sigma_X^2 & k_{XY} \\ k_{XY} & \sigma_Y^2 \end{bmatrix}\]
where \(k_{XY} = \rho\sigma_X\sigma_Y\)
Joint density: \[p(x,y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[z_x^2 - 2\rho z_x z_y + z_y^2\right]\right)\]
where \(z_x = \frac{x-\mu_X}{\sigma_X}\) and \(z_y = \frac{y-\mu_Y}{\sigma_Y}\) are standardized variables
Fully determined by first and second moments

Result: \(Y|X=x\) is Gaussian
Conditional mean: \[\mathbb{E}[Y|X=x] = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X)\]
Conditional variance (constant!): \[\text{Var}(Y|X=x) = \sigma_Y^2(1 - \rho^2)\]
Therefore: \[Y|X=x \sim \mathcal{N}\left(\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X), \sigma_Y^2(1 - \rho^2)\right)\]
Observations:
Regression line: \[y = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X)\] passes through \((\mu_X, \mu_Y)\) with slope \(\rho\sigma_Y/\sigma_X\)
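A Monte Carlo check of the conditional mean and variance formulas; the parameter values \(\mu_X = 0\), \(\mu_Y = 1\), \(\sigma_X = 2\), \(\sigma_Y = 1\), \(\rho = 0.8\) and the slice width are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([0.0, 1.0])              # [mu_X, mu_Y]
sx, sy, rho = 2.0, 1.0, 0.8
K = np.array([[sx**2,      rho*sx*sy],
              [rho*sx*sy,  sy**2    ]])

XY = rng.multivariate_normal(mu, K, size=2_000_000)
X, Y = XY[:, 0], XY[:, 1]

# Approximate Y | X = x0 by taking a thin slice of samples around x0
x0 = 1.0
sl = np.abs(X - x0) < 0.02
print(Y[sl].mean(), mu[1] + rho * sy / sx * (x0 - mu[0]))   # both ~ 1.4
print(Y[sl].var(),  sy**2 * (1 - rho**2))                   # both ~ 0.36
```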

Method 1: Complete the square
Start with joint density, condition on \(X = x\): \[p(y|x) \propto \exp\left(-\frac{1}{2(1-\rho^2)\sigma_Y^2}\left(y - \mu_Y - \rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X)\right)^2\right)\]
Method 2: Linear projection
Find \(a, b\) minimizing \(\mathbb{E}[(Y - aX - b)^2]\):
Method 3: Schur complement
Using block matrix operations on \(K\): \[\mathbb{E}[Y|X=x] = \mu_Y + k_{YX}k_{XX}^{-1}(x - \mu_X)\] \[= \mu_Y + \rho\sigma_X\sigma_Y \cdot \frac{1}{\sigma_X^2}(x - \mu_X)\]
(all methods give same result)

Fundamental result: For jointly Gaussian \((X, Y)\)
\[\boxed{\hat{Y}_{\text{MMSE}} = \hat{Y}_{\text{LMMSE}} = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(X - \mu_X)}\]
Why equal?
Minimum MSE: \[\text{MSE}_{\min} = \sigma_Y^2(1 - \rho^2)\]
Implications:

Maximum entropy property: Among all distributions with given mean and covariance, Gaussian has maximum entropy \[H(X,Y) = -\int\int p(x,y)\log p(x,y) \, dx \, dy\]
Central Limit Theorem: Sum of many independent effects → Gaussian \[\frac{1}{\sqrt{n}}\sum_{i=1}^n (X_i - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)\]
Preserved under linear operations: If \((X,Y)\) Gaussian and \(Z = aX + bY + c\):
Sufficient statistics: For Gaussian, sample mean and covariance are sufficient

Multivariate Gaussian: \(\mathbf{X} = [X_1, ..., X_n]^T\), \(\mathbf{Y} = [Y_1, ..., Y_m]^T\)
Joint distribution: \[\begin{bmatrix} \mathbf{X} \\ \mathbf{Y} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \boldsymbol{\mu}_X \\ \boldsymbol{\mu}_Y \end{bmatrix}, \begin{bmatrix} \mathbf{K}_{XX} & \mathbf{K}_{XY} \\ \mathbf{K}_{YX} & \mathbf{K}_{YY} \end{bmatrix}\right)\]
MMSE estimator (still linear!): \[\hat{\mathbf{Y}}_{\text{MMSE}} = \mathbb{E}[\mathbf{Y}|\mathbf{X}] = \boldsymbol{\mu}_Y + \mathbf{K}_{YX}\mathbf{K}_{XX}^{-1}(\mathbf{X} - \boldsymbol{\mu}_X)\]
Error covariance: \[\mathbf{K}_{\mathbf{Y}|\mathbf{X}} = \mathbf{K}_{YY} - \mathbf{K}_{YX}\mathbf{K}_{XX}^{-1}\mathbf{K}_{XY}\]
Computational considerations:
Everything generalizes!

Theory (Population)
Given joint distribution \(p(x,y)\): \[\text{MSE} = \mathbb{E}[(Y - \hat{Y})^2]\]
Linear MMSE solution: \[a^* = \frac{\text{Cov}(X,Y)}{\text{Var}(X)} = \frac{\mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]}{\mathbb{E}[X^2] - \mathbb{E}[X]^2}\]
Practice (Sample)
Given data \(\{(x_i, y_i)\}_{i=1}^n\): \[\text{MSE}_n = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2\]
Sample estimates: \[\hat{a} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}\]
Convergence: \(\hat{a} \to a^*\) as \(n \to \infty\) (law of large numbers)
This is linear regression

Population risk (unknown): \[R(a,b) = \mathbb{E}[(Y - aX - b)^2]\]
Empirical risk (computable): \[R_n(a,b) = \frac{1}{n}\sum_{i=1}^n (y_i - ax_i - b)^2\]
Optimization: \[\frac{\partial R_n}{\partial b} = -\frac{2}{n}\sum_{i=1}^n (y_i - ax_i - b) = 0\] \[\implies b = \bar{y} - a\bar{x}\]
\[\frac{\partial R_n}{\partial a} = -\frac{2}{n}\sum_{i=1}^n x_i(y_i - ax_i - b) = 0\] \[\implies a = \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}{\sum_i x_i^2 - n\bar{x}^2}\]
Sample optimization gives sample LMMSE

Linear model limitations: \[\hat{y} = ax + b\]
Can only model linear relationships
Feature transformation: \[\hat{y} = \sum_{k=0}^{K} w_k \phi_k(x) = \mathbf{w}^T \boldsymbol{\phi}(x)\]
where \(\boldsymbol{\phi}(x) = [\phi_0(x), \phi_1(x), ..., \phi_K(x)]^T\)
Still linear MMSE in feature space:
Example: Polynomial features \[\phi_0(x) = 1, \quad \phi_1(x) = x, \quad \phi_2(x) = x^2\]
Now can fit parabolas while still using linear methods
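A sketch of least squares in a polynomial feature space; the quadratic target and noise level are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000
x = rng.uniform(-1.0, 1.0, n)
y = 2.0 - x + 3.0 * x**2 + rng.normal(0.0, 0.1, n)   # assumed quadratic target

# Feature map phi(x) = [1, x, x^2]; the model w^T phi(x) is linear in w
Phi = np.column_stack([np.ones_like(x), x, x**2])

# Least squares in feature space (solves the normal equations Phi^T Phi w = Phi^T y)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)                                             # ~ [2.0, -1.0, 3.0]
print(np.mean((y - Phi @ w) ** 2))                   # ~ noise variance 0.01
```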

Batch gradient descent: \[J(a,b) = \frac{1}{n}\sum_{i=1}^n (y_i - ax_i - b)^2\]
Gradients: \[\frac{\partial J}{\partial a} = -\frac{2}{n}\sum_{i=1}^n x_i(y_i - ax_i - b)\] \[\frac{\partial J}{\partial b} = -\frac{2}{n}\sum_{i=1}^n (y_i - ax_i - b)\]
Update: \[a_{t+1} = a_t - \eta \frac{\partial J}{\partial a}\] \[b_{t+1} = b_t - \eta \frac{\partial J}{\partial b}\]
Stochastic version (one sample): \[a_{t+1} = a_t + \eta x_i (y_i - a_t x_i - b_t)\] \[b_{t+1} = b_t + \eta (y_i - a_t x_i - b_t)\]
This is LMS algorithm (Widrow-Hoff)

Single neuron: \[z = \mathbf{w}^T\mathbf{x} + b\] \[h = \sigma(z)\]
where \(\sigma\) is activation (ReLU, sigmoid, etc.)
Without activation: Linear regression \[\hat{y} = \mathbf{w}^T\mathbf{x} + b\]
With activation: Nonlinear transform \[\hat{y} = \sigma(\mathbf{w}^T\mathbf{x} + b)\]
Deep network: Composition of layers \[\mathbf{h}^{(1)} = \sigma(W^{(1)}\mathbf{x} + \mathbf{b}^{(1)})\] \[\mathbf{h}^{(2)} = \sigma(W^{(2)}\mathbf{h}^{(1)} + \mathbf{b}^{(2)})\] \[\hat{y} = W^{(L)}\mathbf{h}^{(L-1)} + b^{(L)}\]
Still minimizing MSE: \(\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2\)
Backpropagation: Chain rule for gradients
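A bare-bones NumPy sketch of a one-hidden-layer network trained on MSE by backpropagation; the layer sizes, learning rate, and target function are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, h = 512, 1, 16
x = rng.uniform(-2.0, 2.0, (n, d))
y = np.sin(x)                                   # assumed target for the demo

W1 = rng.normal(0.0, 0.5, (d, h)); b1 = np.zeros(h)
W2 = rng.normal(0.0, 0.5, (h, 1)); b2 = np.zeros(1)
eta = 0.05

for _ in range(5_000):
    # Forward pass: h1 = ReLU(x W1 + b1), y_hat = h1 W2 + b2
    z1 = x @ W1 + b1
    h1 = np.maximum(z1, 0.0)
    y_hat = h1 @ W2 + b2

    # Backward pass: chain rule on the MSE (1/n) sum (y_hat - y)^2
    g   = 2.0 * (y_hat - y) / n                 # dL/dy_hat
    dW2 = h1.T @ g;        db2 = g.sum(axis=0)
    dz1 = (g @ W2.T) * (z1 > 0)                 # ReLU derivative
    dW1 = x.T @ dz1;       db1 = dz1.sum(axis=0)

    # Gradient descent step
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2

print(np.mean((y_hat - y) ** 2))                # training MSE after the loop
```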

Random vector: \(\mathbf{X} = [X_1, X_2, ..., X_n]^T\)
Mean vector: \[\boldsymbol{\mu} = \mathbb{E}[\mathbf{X}] = \begin{bmatrix} \mathbb{E}[X_1] \\ \mathbb{E}[X_2] \\ \vdots \\ \mathbb{E}[X_n] \end{bmatrix}\]
Covariance matrix: \[K_{\mathbf{XX}} = \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T]\]
Element structure: \[[K_{\mathbf{XX}}]_{ij} = \text{Cov}(X_i, X_j) = \mathbb{E}[(X_i - \mu_i)(X_j - \mu_j)]\]
Properties:

Joint Gaussian distribution:
\(\begin{bmatrix} \mathbf{X} \\ \mathbf{Y} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \boldsymbol{\mu}_X \\ \boldsymbol{\mu}_Y \end{bmatrix}, \begin{bmatrix} K_{XX} & K_{XY} \\ K_{YX} & K_{YY} \end{bmatrix}\right)\)
Probability density: \[p(\mathbf{x}, \mathbf{y}) = \frac{1}{(2\pi)^{(n+m)/2}|K|^{1/2}} \exp\left(-\frac{1}{2}\mathbf{z}^TK^{-1}\mathbf{z}\right)\]
where \(\mathbf{z} = \begin{bmatrix} \mathbf{x} - \boldsymbol{\mu}_X \\ \mathbf{y} - \boldsymbol{\mu}_Y \end{bmatrix}\)
Linear operations preserve Gaussianity
If \(\mathbf{Z} = A\mathbf{X} + \mathbf{b}\), then: \[\mathbf{Z} \sim \mathcal{N}(A\boldsymbol{\mu}_X + \mathbf{b}, AK_{XX}A^T)\]
Marginals are Gaussian: \[\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}_X, K_{XX})\]

Given joint Gaussian, conditional is also Gaussian:
\[\mathbf{X}|\mathbf{Y} = \mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}_{X|Y}, K_{X|Y})\]
Conditional mean (MMSE estimator): \[\boldsymbol{\mu}_{X|Y} = \boldsymbol{\mu}_X + K_{XY}K_{YY}^{-1}(\mathbf{y} - \boldsymbol{\mu}_Y)\]
Conditional covariance (error covariance): \[K_{X|Y} = K_{XX} - K_{XY}K_{YY}^{-1}K_{YX}\]
Observations:
Schur complement: \[K_{X|Y} = K_{XX} - K_{XY}K_{YY}^{-1}K_{YX}\] measures reduction in uncertainty
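A sketch of these block formulas in NumPy; the example means and covariance blocks are made up (but form a valid joint covariance), and `np.linalg.solve` is used instead of an explicit inverse.

```python
import numpy as np

# Assumed example: X in R^2 and Y in R^2 jointly Gaussian
mu_X = np.array([0.0, 1.0])
mu_Y = np.array([2.0, -1.0])
K_XX = np.array([[2.0, 0.3], [0.3, 1.0]])
K_YY = np.array([[1.5, 0.2], [0.2, 1.0]])
K_XY = np.array([[0.5, 0.1], [0.2, 0.4]])      # Cov(X, Y); K_YX = K_XY.T

y_obs = np.array([2.5, -0.5])                  # observed Y = y

# Conditional mean: mu_X + K_XY K_YY^{-1} (y - mu_Y)
mu_X_given_Y = mu_X + K_XY @ np.linalg.solve(K_YY, y_obs - mu_Y)

# Conditional (error) covariance: Schur complement K_XX - K_XY K_YY^{-1} K_YX
K_X_given_Y = K_XX - K_XY @ np.linalg.solve(K_YY, K_XY.T)

print(mu_X_given_Y)
print(K_X_given_Y)
```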

Scalar case: \[\hat{Y} = aX + b\]
Given: \(X\) (single predictor), predict \(Y\)
Vector case: \[\hat{Y} = \mathbf{a}^T\mathbf{X} + b\]
Given: \(\mathbf{X} = [X_1, X_2, ..., X_p]^T\) (multiple predictors)
Example: Predict temperature \(Y\) from:
Matrix notation: \[\hat{Y} = \mathbf{a}^T\mathbf{X} + b = \sum_{j=1}^p a_j X_j + b\]
where \(\mathbf{a} = [a_1, a_2, ..., a_p]^T\) are weights

Minimize MSE over linear predictors: \[\text{MSE}(\mathbf{a}, b) = \mathbb{E}[(Y - \mathbf{a}^T\mathbf{X} - b)^2]\]
Taking derivatives:
\[\frac{\partial \text{MSE}}{\partial b} = -2\mathbb{E}[Y - \mathbf{a}^T\mathbf{X} - b] = 0\] \[\implies b = \mathbb{E}[Y] - \mathbf{a}^T\mathbb{E}[\mathbf{X}]\]
\[\frac{\partial \text{MSE}}{\partial \mathbf{a}} = -2\mathbb{E}[\mathbf{X}(Y - \mathbf{a}^T\mathbf{X} - b)] = \mathbf{0}\]
Substituting \(b\) and simplifying: \[\left(\mathbb{E}[\mathbf{X}\mathbf{X}^T] - \mathbb{E}[\mathbf{X}]\mathbb{E}[\mathbf{X}]^T\right)\mathbf{a} = \mathbb{E}[\mathbf{X}Y] - \mathbb{E}[\mathbf{X}]\mathbb{E}[Y]\]
Define:

Normal equations: \[K_{\mathbf{XX}}\mathbf{a} = \mathbf{k}_{\mathbf{X}Y}\]
Solution (if \(K_{\mathbf{XX}}\) invertible): \[\mathbf{a}^* = K_{\mathbf{XX}}^{-1}\mathbf{k}_{\mathbf{X}Y}\] \[b^* = \mathbb{E}[Y] - (\mathbf{a}^*)^T\mathbb{E}[\mathbf{X}]\]
LMMSE predictor: \[\hat{Y}_{\text{LMMSE}} = \mathbb{E}[Y] + (\mathbf{a}^*)^T(\mathbf{X} - \mathbb{E}[\mathbf{X}])\]
Minimum MSE: \[\text{MSE}_{\min} = \text{Var}(Y) - \mathbf{k}_{\mathbf{X}Y}^T K_{\mathbf{XX}}^{-1} \mathbf{k}_{\mathbf{X}Y}\]
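A sketch of the vector LMMSE solution using sample second moments from synthetic data (the design and true coefficients are assumed; the \(1/n\) vs. \(1/(n-1)\) mismatch is negligible here).

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 5_000, 3
mix = np.array([[1.0, 0.2, 0.0],
                [0.0, 1.0, 0.5],
                [0.0, 0.0, 1.0]])
X = rng.normal(size=(n, p)) @ mix              # correlated predictors
a_true = np.array([1.0, -2.0, 0.5])
y = X @ a_true + 3.0 + rng.normal(0.0, 0.2, n) # assumed ground truth

K_XX = np.cov(X, rowvar=False)                 # p x p sample covariance
k_XY = ((X - X.mean(axis=0)) * (y - y.mean())[:, None]).mean(axis=0)

a_hat = np.linalg.solve(K_XX, k_XY)            # normal equations K_XX a = k_XY
b_hat = y.mean() - a_hat @ X.mean(axis=0)
print(a_hat, b_hat)                            # ~ a_true and 3.0
```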
Describes:

Column space perspective:
Data matrix columns: \(\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_p\)
\[\mathcal{S} = \text{span}(\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_p)\]
LMMSE = Projection: \[\hat{\mathbf{y}} = \text{Proj}_{\mathcal{S}}(\mathbf{y})\]
Orthogonality: \[\mathbf{y} - \hat{\mathbf{y}} \perp \mathcal{S}\]
Equivalently: \(X^T(\mathbf{y} - X\mathbf{a}) = \mathbf{0}\)
Projection matrix: \[P = X(X^TX)^{-1}X^T\] \[\hat{\mathbf{y}} = P\mathbf{y}\]
Properties:

Direct inversion issues:
Ill-conditioning: When predictors highly correlated
Rank deficiency: When \(p > n\) or perfect collinearity
Solutions:
QR decomposition: \(X = QR\)
SVD: \(X = U\Sigma V^T\)
Regularization:
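A sketch contrasting the direct normal-equation solve with SVD-based least squares and a ridge-regularized solve on a deliberately near-collinear design (all numbers below are contrived).

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)            # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 + rng.normal(0.0, 0.1, n)

print(np.linalg.cond(X.T @ X))                 # huge condition number

# Direct normal equations: fragile when X^T X is ill-conditioned
w_direct = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based least squares: handles rank deficiency gracefully
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge: (X^T X + lam I) w = X^T y trades a little bias for numerical stability
lam = 1e-3
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print(w_direct, w_lstsq, w_ridge)
```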

Multiple responses: \[\mathbf{Y} = [Y_1, Y_2, ..., Y_m]^T\]
Linear model: \[\hat{\mathbf{Y}} = W^T\mathbf{X} + \mathbf{b}\]
where \(W \in \mathbb{R}^{p \times m}\) is weight matrix
MSE matrix: \[\mathbf{M} = \mathbb{E}[(\mathbf{Y} - \hat{\mathbf{Y}})(\mathbf{Y} - \hat{\mathbf{Y}})^T]\]
Minimize trace (total MSE): \[\text{tr}(\mathbf{M}) = \sum_{i=1}^m \mathbb{E}[(Y_i - \hat{Y}_i)^2]\]
Solution: \[W^* = K_{\mathbf{XX}}^{-1}K_{\mathbf{XY}}\]
where \(K_{\mathbf{XY}} = \mathbb{E}[\mathbf{X}\mathbf{Y}^T] - \mathbb{E}[\mathbf{X}]\mathbb{E}[\mathbf{Y}]^T\)
Key: Solve \(m\) separate LMMSE problems

Computational cost:
When \(K_{\mathbf{XX}}\) has off-diagonal terms:
Eigendecomposition approach:
Any symmetric positive semi-definite matrix: \[K_{\mathbf{XX}} = U\Lambda U^T\]
where:
Eigenvectors are orthogonal \[\mathbf{u}_i^T \mathbf{u}_j = \delta_{ij}\]
This orthogonality simplifies coordinate transformation.

Coordinate transformation (take \(\mathbf{X}\) zero-mean, or replace it by \(\mathbf{X} - \boldsymbol{\mu}\)): \[\mathbf{Z} = U^T\mathbf{X}\]
Covariance in new coordinates: \[K_{\mathbf{ZZ}} = \mathbb{E}[\mathbf{Z}\mathbf{Z}^T]\] \[= \mathbb{E}[U^T\mathbf{X}\mathbf{X}^TU]\] \[= U^T\mathbb{E}[\mathbf{X}\mathbf{X}^T]U\] \[= U^TK_{\mathbf{XX}}U\] \[= U^T(U\Lambda U^T)U\] \[= \Lambda\]
Result: Diagonal covariance matrix
Components of \(\mathbf{Z}\) are uncorrelated: \[\mathbb{E}[Z_iZ_j] = \lambda_i\delta_{ij}\]
Inverse transform (reconstruction): \[\mathbf{X} = U\mathbf{Z}\]
Since \(U\) is orthonormal: \(U^TU = I\)

Given covariance: \[\mathbf{K} = \begin{bmatrix} 6 & -4 & 0 \\ -4 & 6 & 0 \\ 0 & 0 & 3 \end{bmatrix}\]
Find eigenvalues: Solve \(\det(\mathbf{K} - \lambda\mathbf{I}) = 0\)
Block structure simplifies:
Result: \(\lambda_1 = 10, \lambda_2 = 3, \lambda_3 = 2\)
Find eigenvectors: Solve \((\mathbf{K} - \lambda_i\mathbf{I})\mathbf{v} = 0\)
\[\mathbf{u}_1 = \frac{1}{\sqrt{2}}\begin{bmatrix}1\\-1\\0\end{bmatrix}, \quad \mathbf{u}_2 = \begin{bmatrix}0\\0\\1\end{bmatrix}, \quad \mathbf{u}_3 = \frac{1}{\sqrt{2}}\begin{bmatrix}1\\1\\0\end{bmatrix}\]
Result: \(\mathbf{K}_{zz} = \text{diag}(10, 3, 2)\)
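Checking this example with `np.linalg.eigh` (which returns eigenvalues in ascending order, so they appear as 2, 3, 10):

```python
import numpy as np

K = np.array([[ 6.0, -4.0, 0.0],
              [-4.0,  6.0, 0.0],
              [ 0.0,  0.0, 3.0]])

lam, U = np.linalg.eigh(K)          # eigenvalues ascending: [2, 3, 10]
print(lam)
print(np.round(U.T @ K @ U, 10))    # diagonalization: U^T K U = diag(2, 3, 10)
```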

Original problem (coupled): \[\hat{\mathbf{X}} = K_{\mathbf{XY}} K_{\mathbf{YY}}^{-1} \mathbf{Y}\]
Requires inverting full matrix \(K_{\mathbf{YY}}\).
After KL transform:
\[\hat{Z}_i = \frac{\text{Cov}(Z_i, W_i)}{\text{Var}(W_i)} W_i\]
Each component solved independently.
Computational advantage:
Numerical stability: Small eigenvalues → regularization: \[\hat{Z}_i = \frac{\text{Cov}(Z_i, W_i)}{\lambda_i + \epsilon} W_i\]

Not all dimensions equally informative
Eigenvalues measure variance (information) per dimension.
Truncated representation: Keep only first \(k < p\) components: \[\mathbf{X} \approx \sum_{i=1}^k Z_i \mathbf{u}_i\]
Selection criterion: Retain fraction \(\alpha\) of total variance: \[\frac{\sum_{i=1}^k \lambda_i}{\sum_{i=1}^p \lambda_i} \geq \alpha\]
Common choice: \(\alpha = 0.95\) or \(0.99\)
Reconstruction error: \[\mathbb{E}[\|\mathbf{X} - \hat{\mathbf{X}}_k\|^2] = \sum_{i=k+1}^p \lambda_i\]
Error equals sum of discarded eigenvalues.
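A sketch of choosing \(k\) to retain a fraction \(\alpha\) of the variance and checking the reconstruction error against the discarded eigenvalues; the random correlated data and \(\alpha = 0.95\) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 10_000, 8
A = rng.normal(size=(p, p)) * np.linspace(2.0, 0.1, p)   # unequal variances
X = rng.normal(size=(n, p)) @ A.T                         # correlated data
Xc = X - X.mean(axis=0)                                   # center

lam, U = np.linalg.eigh(np.cov(Xc, rowvar=False))
lam, U = lam[::-1], U[:, ::-1]                            # sort descending

alpha = 0.95
k = int(np.searchsorted(np.cumsum(lam) / lam.sum(), alpha)) + 1

Z = Xc @ U[:, :k]                   # first k KL coefficients
X_hat = Z @ U[:, :k].T              # rank-k reconstruction
err = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
print(k, err, lam[k:].sum())        # error ~ sum of discarded eigenvalues
```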
Applications:
