EE 541 - Unit 5
Fall 2025
Task: Given features \(\mathbf{x} \in \mathbb{R}^d\), predict class \(y \in \{0, 1\}\)
Classical approach: Find decision boundary \[\hat{y} = \begin{cases} 1 & \text{if } f(\mathbf{x}) > 0 \\ 0 & \text{otherwise} \end{cases}\]
Probabilistic approach: Estimate \(P(y=1|\mathbf{x})\) \[\hat{p} = P(y=1|\mathbf{x}) \in [0, 1]\] \[\hat{y} = \mathbb{1}[\hat{p} > \tau]\]
Main challenges:
The classification pipeline:
After obtaining probabilities, how do we decide?
Simple threshold: \[\hat{y} = \begin{cases} 1 & p > 0.5 \\ 0 & p \leq 0.5 \end{cases}\]
But this assumes all errors cost the same
Example: Credit card fraud
Optimal threshold depends on these costs!
Different thresholds lead to different decisions
States (observations): \(\Omega = \{\omega_1, \omega_2, \ldots\}\)
Actions: \(\mathcal{A} = \{a_1, a_2, \ldots\}\)
Loss function: \(L: \Omega \times \mathcal{A} \to \mathbb{R}\)
Decision rule: \(\delta: \mathcal{X} \to \mathcal{A}\)
Objective: Minimize expected loss \[R(\delta) = \mathbb{E}[L(Y, \delta(X))]\]
Bayes risk: Minimum achievable expected loss \[R^* = \inf_\delta R(\delta)\]
Bayes risk minimization:
Given observation \(\mathbf{x}\), choose action to minimize expected loss:
\[\delta^*(\mathbf{x}) = \arg\min_{a \in \mathcal{A}} \mathbb{E}[L(Y, a) | \mathbf{X} = \mathbf{x}]\]
Expanding the expectation: \[= \arg\min_{a} \sum_{k} L(k, a) \cdot P(Y=k|\mathbf{x})\]
For binary classification with \(y \in \{0, 1\}\): \[\delta^*(\mathbf{x}) = \arg\min_{a \in \{0,1\}} \left[ L(0,a)(1-p) + L(1,a)p \right]\]
where \(p = P(Y=1|\mathbf{x})\)
Optimal decisions depend on:
Bayes rule minimizes expected loss at each point
0-1 Loss (uniform error cost): \[L(y, \hat{y}) = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{if } y \neq \hat{y} \end{cases}\]
Expected loss for action \(a\): \[\mathbb{E}[L(Y, a) | \mathbf{x}] = \sum_{k \neq a} P(Y=k|\mathbf{x}) = 1 - P(Y=a|\mathbf{x})\]
Minimizing expected loss: \[\delta^*(\mathbf{x}) = \arg\min_a [1 - P(Y=a|\mathbf{x})] = \arg\max_a P(Y=a|\mathbf{x})\]
This is the MAP rule!
\[\boxed{\delta_{\text{MAP}}(\mathbf{x}) = \arg\max_k P(Y=k|\mathbf{x})}\]
MAP minimizes error probability when all errors cost the same
Each region: highest posterior probability wins
Binary Bayes decision rule:
Choose class 1 if “expected loss of choosing 1” < “expected loss of choosing 0”:
\[L_{01}P(Y=0|\mathbf{x}) < L_{10}P(Y=1|\mathbf{x})\]
where \(L_{ij}\) = loss of predicting \(j\) when true class is \(i\)
Using Bayes theorem: \[P(Y=k|\mathbf{x}) = \frac{p(\mathbf{x}|Y=k)P(Y=k)}{p(\mathbf{x})}\]
Rearranging: \[\frac{p(\mathbf{x}|Y=1)}{p(\mathbf{x}|Y=0)} > \frac{L_{01}P(Y=0)}{L_{10}P(Y=1)}\]
Likelihood Ratio Test: \[\boxed{\Lambda(\mathbf{x}) = \frac{p(\mathbf{x}|Y=1)}{p(\mathbf{x}|Y=0)} \underset{H_0}{\overset{H_1}{\gtrless}} \tau}\]
where threshold \(\tau = \frac{L_{01}P(Y=0)}{L_{10}P(Y=1)}\)
Log-likelihood ratio (more stable): \[\log \Lambda(\mathbf{x}) = \log p(\mathbf{x}|Y=1) - \log p(\mathbf{x}|Y=0)\]
Working in log-space is numerically stable; for Gaussian class-conditionals with shared covariance, \(\log \Lambda(\mathbf{x})\) is linear in \(\mathbf{x}\), so the decision boundary is a hyperplane
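To make the test concrete, here is a minimal sketch of the likelihood ratio test for two hypothetical 1D Gaussian class-conditionals with shared variance; the means, priors, and costs are invented for illustration.

```python
import numpy as np
from scipy.stats import norm

# Illustrative class-conditional densities p(x|Y=k), priors, and costs
mu0, mu1, sigma = 0.0, 2.0, 1.0
P0, P1 = 0.7, 0.3          # P(Y=0), P(Y=1)
L01, L10 = 1.0, 5.0        # L01: predict 1 when Y=0;  L10: predict 0 when Y=1

tau = (L01 * P0) / (L10 * P1)          # LRT threshold

def decide(x):
    # log Λ(x) = log p(x|Y=1) - log p(x|Y=0); linear in x for shared variance
    log_lr = norm.logpdf(x, loc=mu1, scale=sigma) - norm.logpdf(x, loc=mu0, scale=sigma)
    return int(log_lr > np.log(tau))   # 1 ↔ decide H1, 0 ↔ decide H0

for x in [-1.0, 0.5, 1.0, 2.5]:
    print(f"x = {x:4.1f} -> class {decide(x)}")
```

With shared variance the log-likelihood ratio is affine in \(x\), so the rule collapses to a single threshold on \(x\).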
For 0-1 loss (all errors equal): \[L(y, \hat{y}) = \mathbb{1}[y \neq \hat{y}] = \begin{cases} 0 & y = \hat{y} \\ 1 & y \neq \hat{y} \end{cases}\]
Expected loss for predicting class \(k\): \[\mathbb{E}[L(Y, k)|X=\mathbf{x}] = \sum_{j \neq k} P(Y=j|\mathbf{x})\] \[= 1 - P(Y=k|\mathbf{x})\]
Bayes optimal decision: \[\hat{y}^* = \arg\max_k P(Y=k|\mathbf{x})\]
Choose the most probable class.
Bayes error rate: \[\epsilon^* = \mathbb{E}[1 - \max_k P(Y=k|X)]\]
This is the irreducible error
Binary classification with costs:
\[L = \begin{bmatrix} 0 & c_{01} \\ c_{10} & 0 \end{bmatrix}\]
where:
Expected loss of predicting 1: \[R_1(\mathbf{x}) = c_{01} \cdot P(y=0|\mathbf{x})\]
Expected loss of predicting 0: \[R_0(\mathbf{x}) = c_{10} \cdot P(y=1|\mathbf{x})\]
Optimal decision: Predict 1 if \(R_1(\mathbf{x}) < R_0(\mathbf{x})\)
\[c_{01}(1-p) < c_{10} \cdot p\] \[p > \frac{c_{01}}{c_{01} + c_{10}}\]
Optimal threshold: \(\tau^* = \frac{c_{01}}{c_{01} + c_{10}}\)
Cancer screening scenario:
Realistic costs:
Loss matrix: \[L = \begin{bmatrix} 0 & 50,000 \\ 1,000,000 & 50,000 \end{bmatrix}\]
Note: We include treatment cost even for true positives
Optimal threshold: \[\tau^* = \frac{50,000}{50,000 + 950,000} = 0.05\]
Treat if \(P(\text{cancer}|\text{test}) > 0.05\)
Three-way decision:
Extended loss matrix: \[L = \begin{bmatrix} 0 & c_{01} & c_r \\ c_{10} & 0 & c_r \end{bmatrix}\]
where \(c_r\) = cost of rejection (expert review)
Decision regions:
Optimal thresholds (for symmetric costs): \[\tau_1 = \frac{c_r}{c_{10}}, \quad \tau_2 = 1 - \frac{c_r}{c_{01}}\]
Reject when model confidence is low
Reject option reduces error on uncertain samples
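As a sketch of the three-way rule, the following picks the action with the smallest expected loss for a given posterior \(p = P(y=1|\mathbf{x})\); the cost values are invented for illustration.

```python
def decide_with_reject(p, c01=10.0, c10=10.0, c_r=2.0):
    """Return 0, 1, or 'reject' by minimizing expected loss at posterior p."""
    risks = {
        0: c10 * p,          # expected loss of predicting 0
        1: c01 * (1 - p),    # expected loss of predicting 1
        "reject": c_r,       # flat cost of deferring to an expert
    }
    return min(risks, key=risks.get)

# With these costs: tau1 = c_r/c10 = 0.2 and tau2 = 1 - c_r/c01 = 0.8
for p in [0.05, 0.3, 0.5, 0.7, 0.95]:
    print(p, decide_with_reject(p))   # -> 0, reject, reject, reject, 1
```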
K-class cost matrix: \(L \in \mathbb{R}^{K \times K}\)
\[L_{ij} = \text{cost of predicting } j \text{ when true is } i\]
Expected loss for action \(a\): \[R_a(\mathbf{x}) = \sum_{k=1}^K L_{ka} \cdot P(y=k|\mathbf{x})\]
Optimal decision: \[\hat{y}^* = \arg\min_a R_a(\mathbf{x})\]
Example: 3-class with varying costs \[L = \begin{bmatrix} 0 & 5 & 10 \\ 20 & 0 & 5 \\ 30 & 15 & 0 \end{bmatrix}\]
Must compute expected loss for each action
Special case: If \(L_{ij} = 1 - \delta_{ij}\) (0-1 loss), reduces to argmax of posteriors
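A short sketch of the K-class rule using the 3-class cost matrix above; the posterior vector is made up for illustration.

```python
import numpy as np

# L[i, j] = cost of predicting j when the true class is i (example above)
L = np.array([[ 0,  5, 10],
              [20,  0,  5],
              [30, 15,  0]], dtype=float)

def bayes_decision(posterior, L):
    """posterior[k] = P(y=k|x); return the action minimizing expected loss."""
    risks = posterior @ L        # risks[a] = sum_k L[k, a] * P(y=k|x)
    return int(np.argmin(risks)), risks

posterior = np.array([0.2, 0.5, 0.3])   # hypothetical posterior for one x
action, risks = bayes_decision(posterior, L)
print(risks, "->", action)   # [19.  5.5  4.5] -> 2, even though class 1 has the highest posterior
```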
Hard decisions: \(\hat{y} \in \{0, 1, ..., K-1\}\)
Soft decisions: \(\mathbf{p} = [p_0, p_1, ..., p_{K-1}]\)
Soft decisions ideal for:
1. Discrimination (ranking ability):
2. Calibration (probability accuracy):
3. Proper scoring rules:
These penalize both discrimination and calibration errors
Note: Good discrimination ≠ Good calibration
Imbalanced data: 99% negative, 1% positive
Problem with accuracy:
Metrics for imbalanced data:
Precision: Of predicted positives, how many correct? \[\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}\]
Recall (Sensitivity): Of true positives, how many found? \[\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}\]
F1 Score: Harmonic mean \[F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]
Connection to costs: Optimize threshold based on \[\tau^* = \frac{c_{01} \cdot (1-\pi)}{c_{01}(1-\pi) + c_{10} \cdot \pi}\]
where \(\pi\) = class prior P(y=1)
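A small sketch computing the three metrics from confusion-matrix counts; the counts are invented for a 1%-positive dataset.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 10,000 samples, 100 true positives, 80 of them found
print(precision_recall_f1(tp=80, fp=40, fn=20))
# -> approx (0.667, 0.800, 0.727); accuracy would still be 99.4% despite missing 20% of positives
```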
Try modeling probability directly: \[p(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b\]
With \(w = 0.3\), \(b = 0.5\):
Problem: Linear outputs \(\in (-\infty, \infty)\) but probabilities need \(\in [0, 1]\)
Need transformation \(\mathbb{R} \to [0, 1]\) that:
Candidates:
Odds: Alternative to probability \[\text{Odds} = \frac{p}{1-p} = \frac{P(y=1)}{P(y=0)}\]
Examples:
Range: \([0, \infty)\)
Log-odds (logit): \[\text{logit}(p) = \log\left(\frac{p}{1-p}\right)\]
Range: \((-\infty, \infty)\)
Model log-odds linearly: \[\log\left(\frac{p(\mathbf{x})}{1-p(\mathbf{x})}\right) = \mathbf{w}^T\mathbf{x} + b\]
Solving for \(p(\mathbf{x})\):
Let \(z = \mathbf{w}^T\mathbf{x} + b\)
\[\frac{p}{1-p} = e^z\]
\[p = (1-p)e^z\]
\[p = e^z - pe^z\]
\[p(1 + e^z) = e^z\]
\[p = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}}\]
Gives the logistic (sigmoid) function
\[\boxed{p(\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + b)}}}\]
Inputs: \(\mathbf{x} = [x_1, x_2, \ldots, x_d]^T\)
Weights: \(\mathbf{w} = [w_1, w_2, \ldots, w_d]^T\)
Bias: \(b\)
Linear combination: \[z = \mathbf{w}^T\mathbf{x} + b = \sum_{j=1}^d w_j x_j + b\]
Activation function: \[p = \sigma(z) = \frac{1}{1 + e^{-z}}\]
Output: \(p = P[\text{class 1}] \in (0, 1)\)
Decision rule: Predict class 1 if \(p > 0.5\)
Start with: \(\sigma(z) = \frac{1}{1 + e^{-z}}\)
Apply chain rule: \[\frac{d\sigma}{dz} = \frac{d}{dz}\left[(1 + e^{-z})^{-1}\right]\]
\[= -1 \cdot (1 + e^{-z})^{-2} \cdot \frac{d}{dz}(1 + e^{-z})\]
\[= -1 \cdot (1 + e^{-z})^{-2} \cdot (-e^{-z})\]
\[= \frac{e^{-z}}{(1 + e^{-z})^2}\]
Rewriting: \[= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}\]
\[= \sigma(z) \cdot \frac{e^{-z}}{1 + e^{-z}}\]
Note that: \[\frac{e^{-z}}{1 + e^{-z}} = \frac{1/e^z}{1/e^z + 1} = \frac{1}{1 + e^z}\]
This equals \(\sigma(-z) = 1 - \sigma(z)\)
Therefore: \[\boxed{\sigma'(z) = \sigma(z)(1 - \sigma(z))}\]
Properties:
Computing \(p = \sigma(z)\) first gives derivative as \(p(1-p)\).
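A quick numerical check of the identity \(\sigma'(z) = \sigma(z)(1-\sigma(z))\) against a central finite difference; the test points are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])   # arbitrary test points
p = sigmoid(z)
analytic = p * (1 - p)                      # σ'(z) = σ(z)(1 - σ(z))

eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))   # ~1e-11: the two agree
```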
Other candidates that map \(\mathbb{R} \to [0,1]\):
Probit (cumulative normal): \[p = \Phi(z) = \int_{-\infty}^z \frac{1}{\sqrt{2\pi}}e^{-t^2/2}dt\]
Complementary log-log: \[p = 1 - e^{-e^z}\]
Logistic function advantages:
The sigmoid function \(\sigma(z) = \frac{1}{1 + e^{-z}}\):
Valid probabilities: \(\sigma(z) \in (0, 1)\) always
Smooth and differentiable: \[\sigma'(z) = \sigma(z)(1 - \sigma(z))\]
Linear decision boundary:
Interpretable coefficients:
We have the model: \[P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b)\]
Now need to learn parameters \(\mathbf{w}, b\)
Given training data: \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n\)
Objective: Maximum likelihood
Finding best parameters: Maximum Likelihood Estimation
Recall: With uniform prior, MAP is MLE: \[\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} P(\text{data}|\mathbf{w})\]
“Find parameters that make observed data most probable”
Why MLE?
For regression: Gaussian noise → squared loss
For classification:
Likelihood: \[P(\text{data}|\mathbf{w}) = \prod_{i=1}^n P(y_i|\mathbf{x}_i, \mathbf{w})\]
Suppose each observation \(y_i \in \{0, 1\}\) is Bernoulli (coin flip): \[P(y_i = 1 | \mathbf{x}_i) = p(\mathbf{x}_i)\] \[P(y_i = 0 | \mathbf{x}_i) = 1 - p(\mathbf{x}_i)\]
Therefore: \[P(y_i | \mathbf{x}_i) = p(\mathbf{x}_i)^{y_i} (1-p(\mathbf{x}_i))^{1-y_i}\]
Check:
Likelihood for all data: \[\mathcal{L}(\mathbf{w}) = \prod_{i=1}^n p_i^{y_i} (1-p_i)^{1-y_i}\]
where \(p_i = p(\mathbf{x}_i; \mathbf{w})\)
Requirement: Must have \(0 \leq p(\mathbf{x}) \leq 1\)
Logarithm converts products to sums: \[\log \mathcal{L} = \sum_{i=1}^n [y_i \log p_i + (1-y_i)\log(1-p_i)]\]
Why log?
Maximizing log-likelihood = Minimizing negative log-likelihood: \[\text{NLL} = -\sum_{i=1}^n [y_i \log p_i + (1-y_i)\log(1-p_i)]\]
This equals binary cross-entropy loss
Log-likelihood: \[\ell = \sum_{i=1}^n [y_i \log p_i + (1-y_i)\log(1-p_i)]\]
where \(p_i = \sigma(\mathbf{w}^T\mathbf{x}_i + b)\)
Taking the derivative (chain rule, with \(\frac{\partial \ell_i}{\partial p_i} = \frac{y_i}{p_i} - \frac{1-y_i}{1-p_i}\) and \(\frac{\partial p_i}{\partial z_i} = p_i(1-p_i)\)): \[\frac{\partial \ell_i}{\partial z_i} = y_i(1-p_i) - (1-y_i)p_i = y_i - p_i\]
Therefore: \[\frac{\partial \ell}{\partial \mathbf{w}} = \sum_{i=1}^n (y_i - p_i)\mathbf{x}_i\]
\[\frac{\partial \ell}{\partial b} = \sum_{i=1}^n (y_i - p_i)\]
Elegant result: Gradient is sum of (error × feature)
Misclassified points contribute most to gradient
Recall: Linear regression for classification
\[\hat{y} = \mathbf{w}^T\mathbf{x} + b\]
with targets \(y \in \{-1, +1\}\) and hard decision: \[\text{class} = \text{sign}(\hat{y})\]
Linear regression gives:
The problem:
What we need: Convert \(\mathbf{w}^T\mathbf{x} + b\) to probability \(p \in [0,1]\)
Need smooth function: \(\mathbb{R} \to [0, 1]\)
Model: \(p = \sigma(\mathbf{w}^T\mathbf{x} + b)\)
Objective: Maximize log-likelihood \[\ell = \sum_{i=1}^n [y_i \log p_i + (1-y_i)\log(1-p_i)]\]
Gradient: \[\nabla_\mathbf{w} \ell = \mathbf{X}^T(\mathbf{y} - \mathbf{p})\]
Update: Gradient ascent \[\mathbf{w} \leftarrow \mathbf{w} + \eta \nabla_\mathbf{w} \ell\]
Converge: When \(||\nabla \ell|| < \epsilon\)
Iterate until convergence (typically 100-1000 steps)
Problem: Classify emails as spam or legitimate
Features extracted from email:
Example vector \(\mathbf{x} \in \mathbb{R}^6\):
Email: "FREE OFFER! Click HERE!!!"
x = [2, 0.1, 2, 3, 1, 0.3]
- 2: "free"/"offer" keyword count
- 0.1: sender reputation (low)
- 2: ALL-CAPS words
- 3: exclamation marks
- 1: links
- 0.3: time-of-day feature (sent at 3am)
Goal: Learn \(P(\text{spam}|\mathbf{x})\)
Real spam filters use 1000s of features
Converting email → numbers:
1. Bag of words (simple counts):
vocab = ["free", "click", "meeting", "report"]
email = "free click click"
x = [1, 2, 0, 0] # Counts per vocabulary word
2. TF-IDF (weighted by rarity):
# Term Frequency × Inverse Document Frequency
tfidf(word) = count(word) × log(N/df(word))
# Rare words get higher weight
3. Domain-specific features:
Feature vector: \(\mathbf{x} \in \mathbb{R}^d\) where \(d\) = 100-10,000
High values in spam-indicative features → likely spam
Dataset: \(n\) = 10,000 labeled emails
# Each email becomes a feature vector
X = np.array([
    [3, 2, 1, ..., 0.1],   # spam
    [0, 0, 0, ..., 0.9],   # legitimate
    [2, 1, 3, ..., 0.2],   # spam
    ...
])  # Shape: (10000, 100)
y = np.array([1, 0, 1, ...])  # Labels
Difficult to balance:
Metric: Precision at 95% recall
Target performance:
For instance: Identify 9,500/10,000 spam with ≤6 legitimate emails blocked
Class distribution (typical):
Model: \[P(\text{spam}|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b)\]
Training (gradient ascent on log-likelihood):
from scipy.special import expit as sigmoid  # numerically stable sigmoid

# Initialize
n, d = X.shape              # e.g., (10000, 100)
learning_rate = 0.1
w = np.random.randn(d) * 0.01
b = 0.0

for epoch in range(100):
    # Compute predictions
    z = X @ w + b
    p = sigmoid(z)
    # Gradient step (ascent on the average log-likelihood)
    grad_w = X.T @ (y - p) / n
    grad_b = np.mean(y - p)
    w += learning_rate * grad_w
    b += learning_rate * grad_b
Learned weights reveal spam indicators:
Same data, logistic model:
With \(w = 1.5\), \(b = 0\):
Computing \(p(x) = \frac{1}{1 + e^{-(wx + b)}}\):
All probabilities valid!
Log-likelihood: \[\ell = \sum_{i=1}^5 [y_i \log p_i + (1-y_i)\log(1-p_i)]\]
Inference for new email:
def classify_email(email_text, w, b):
    # Extract features
    x = extract_features(email_text)
    # Compute probability
    z = np.dot(w, x) + b
    p_spam = sigmoid(z)
    # Decision with threshold
    if p_spam > 0.95:  # High threshold
        return "spam", p_spam
    else:
        return "legitimate", p_spam
Performance (typical):
Advantages:
Threshold tuning is possible because the model outputs calibrated probabilities
Linear model: \[z = \mathbf{w}^T\mathbf{x} + b \in \mathbb{R}\]
Sigmoid transform: \[p = \sigma(z) = \frac{1}{1 + e^{-z}} \in (0,1)\]
Logistic regression: \[P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b)\]
Optimize by maximizing log-likelihood: \[\max_{\mathbf{w},b} \sum_{i=1}^n [y_i \log p_i + (1-y_i)\log(1-p_i)]\]
Gradient: \[\nabla_\mathbf{w} \ell = \sum_{i=1}^n (y_i - p_i)\mathbf{x}_i\]
Bayes theorem for weights (i.e., parameters): \[P(\mathbf{w}|\mathcal{D}) = \frac{P(\mathcal{D}|\mathbf{w})P(\mathbf{w})}{P(\mathcal{D})}\]
Maximum a posteriori (MAP): \[\hat{\mathbf{w}}_{\text{MAP}} = \arg\max_{\mathbf{w}} P(\mathcal{D}|\mathbf{w})P(\mathbf{w})\]
With uniform prior \(P(\mathbf{w}) = \text{const}\): \[\hat{\mathbf{w}}_{\text{MLE}} = \arg\max_{\mathbf{w}} P(\mathcal{D}|\mathbf{w})\]
For classification data \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n\):
Assuming independent observations: \[P(\mathcal{D}|\mathbf{w}) = \prod_{i=1}^n P(y_i|\mathbf{x}_i, \mathbf{w})\]
With Bernoulli model: \[= \prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}\]
where \(p_i = \sigma(\mathbf{w}^T\mathbf{x}_i + b)\)
Log-likelihood (sums replace products): \[\ell(\mathbf{w}) = \sum_{i=1}^n [y_i \log p_i + (1-y_i)\log(1-p_i)]\]
Dataset: 4 points in 1D: \(x = [-2, -1, 1, 2]\), \(y = [0, 0, 1, 1]\)
Current Model: \(p(x) = \sigma(wx + b)\)
with \(w = 0.5\), \(b = 0\)
Compute predictions: \(p = \sigma(0.5x) = [0.27, 0.38, 0.62, 0.73]\)
Errors \(y - p = [-0.27, -0.38, 0.38, 0.27]\)
Log-likelihood: \[\ell = \sum_{i=1}^4 [y_i \log p_i + (1-y_i)\log(1-p_i)]\]
So:
Total: \(\ell = -1.58\)
Negative log-likelihood: \(\text{NLL} = 1.58\)
Minimize this loss function
Gradient formulas (from Section 1): \[\frac{\partial \ell}{\partial w} = \sum_{i=1}^n (y_i - p_i)x_i\] \[\frac{\partial \ell}{\partial b} = \sum_{i=1}^n (y_i - p_i)\]
For our example:
\(\frac{\partial \ell}{\partial w}\): \[\begin{align} &= (-0.27)(-2) + (-0.38)(-1)\\ &\quad + (0.38)(1) + (0.27)(2)\\ &= 0.54 + 0.38 + 0.38 + 0.54\\ &= 1.84 \end{align}\]
\(\frac{\partial \ell}{\partial b}\): \[= -0.27 - 0.38 + 0.38 + 0.27 = 0\]
Computing contributions:
| \(i\) | \(x_i\) | \(y_i\) | \(p_i\) | \((y_i - p_i)x_i\) |
|---|---|---|---|---|
| 1 | -2 | 0 | 0.27 | 0.54 |
| 2 | -1 | 0 | 0.38 | 0.38 |
| 3 | 1 | 1 | 0.62 | 0.38 |
| 4 | 2 | 1 | 0.73 | 0.54 |
Sum: \(\nabla_w \ell = 1.84\)
Gradient w.r.t. \(w\) is positive: increase \(w\). Gradient w.r.t. \(b\) is zero: \(b\) is locally optimal. Larger errors \(|y_i - p_i|\) contribute more to the gradient.
Update rule (gradient ascent on \(\ell\)): \[\mathbf{w}^{\text{new}} = \mathbf{w}^{\text{old}} + \eta \cdot \nabla \ell\]
Adding the gradient of \(\ell\) is equivalent to subtracting the gradient of the NLL, so each step decreases the NLL
With learning rate \(\eta = 0.5\): \(w \leftarrow 0.5 + 0.5(1.84) \approx 1.42\), \(b \leftarrow 0\)
New predictions: \(p \approx [0.06, 0.20, 0.80, 0.94]\)
NLL decreases from about 1.58 to about 0.55
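The whole worked example fits in a few lines of NumPy; this sketch reproduces the predictions, log-likelihood, gradient, and one ascent step (numbers match the slides up to rounding).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b, eta = 0.5, 0.0, 0.5

p = sigmoid(w * x + b)                                    # [0.27, 0.38, 0.62, 0.73]
nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))    # ~1.58
grad_w = np.sum((y - p) * x)                              # ~1.83
grad_b = np.sum(y - p)                                    # 0
w_new = w + eta * grad_w                                  # gradient ascent step -> ~1.42
p_new = sigmoid(w_new * x + b)
nll_new = -np.sum(y * np.log(p_new) + (1 - y) * np.log(1 - p_new))   # ~0.55
print(round(nll, 3), round(grad_w, 3), round(w_new, 3), round(nll_new, 3))
```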
Hessian = matrix of second derivatives
\[\mathbf{H} = \begin{bmatrix} \frac{\partial^2 \ell}{\partial w^2} & \frac{\partial^2 \ell}{\partial w \partial b} \\ \frac{\partial^2 \ell}{\partial b \partial w} & \frac{\partial^2 \ell}{\partial b^2} \end{bmatrix}\]
For logistic regression: \[H_{jk} = -\sum_{i=1}^n p_i(1-p_i) x_{ij} x_{ik}\]
For our 1D example with current \(p = [0.27, 0.38, 0.62, 0.73]\):
Weights: \(p_i(1-p_i)\)
\[H_{ww} = -\sum p_i(1-p_i)x_i^2\] \[= -(0.197 \cdot 4 + 0.236 \cdot 1 + 0.236 \cdot 1 + 0.197 \cdot 4)\] \[= -2.05\]
\[H_{wb} = H_{bw} = -\sum p_i(1-p_i)x_i\] \[= -(0.197 \cdot (-2) + 0.236 \cdot (-1)\] \[\quad + 0.236 \cdot 1 + 0.197 \cdot 2)\] \[= 0\]
\[H_{bb} = -\sum p_i(1-p_i)\] \[= -(0.197 + 0.236 + 0.236 + 0.197)\] \[= -0.866\]
Hessian matrix: \[\mathbf{H} = \begin{bmatrix} -2.05 & 0 \\ 0 & -0.866 \end{bmatrix}\]
For NLL minimization, consider \(-\mathbf{H}\): \[-\mathbf{H} = \begin{bmatrix} 2.05 & 0 \\ 0 & 0.866 \end{bmatrix} \succ 0\]
Positive definite → Convex
Convexity guarantees convergence to global optimum
Generalize to \(n\) samples, \(d\) features:
Data matrix: \(\mathbf{X} \in \mathbb{R}^{n \times d}\) \[\mathbf{X} = \begin{bmatrix} — \mathbf{x}_1^T — \\ — \mathbf{x}_2^T — \\ \vdots \\ — \mathbf{x}_n^T — \end{bmatrix}\]
Predictions: \(\mathbf{p} = \sigma(\mathbf{X}\mathbf{w} + b)\)
Gradient: \[\nabla_\mathbf{w} \ell = \mathbf{X}^T(\mathbf{y} - \mathbf{p})\]
This equals \(\sum_{i=1}^n (y_i - p_i)\mathbf{x}_i\)
Hessian: \[\mathbf{H} = -\mathbf{X}^T\mathbf{D}\mathbf{X}\]
where \(\mathbf{D} = \text{diag}(p_1(1-p_1), \ldots, p_n(1-p_n))\)
Example for our 4-point dataset: \[\mathbf{X} = \begin{bmatrix} -2 \\ -1 \\ 1 \\ 2 \end{bmatrix}, \quad \mathbf{D} = \begin{bmatrix} 0.197 & 0 & 0 & 0 \\ 0 & 0.236 & 0 & 0 \\ 0 & 0 & 0.236 & 0 \\ 0 & 0 & 0 & 0.197 \end{bmatrix}\]
\[\mathbf{H} = -[-2, -1, 1, 2] \mathbf{D} \begin{bmatrix} -2 \\ -1 \\ 1 \\ 2 \end{bmatrix} = -2.05\]
Theorem: NLL is strictly convex if \(\mathbf{X}\) has full column rank.
Proof sketch:
Hessian: \(\mathbf{H} = -\mathbf{X}^T\mathbf{D}\mathbf{X}\)
For any \(\mathbf{v} \neq \mathbf{0}\): \[\mathbf{v}^T(-\mathbf{H})\mathbf{v} = \mathbf{v}^T\mathbf{X}^T\mathbf{D}\mathbf{X}\mathbf{v}\]
Let \(\mathbf{u} = \mathbf{X}\mathbf{v}\): \[= \mathbf{u}^T\mathbf{D}\mathbf{u} = \sum_{i=1}^n p_i(1-p_i)u_i^2\]
Since \(0 < p_i < 1\): all \(p_i(1-p_i) > 0\)
Each term \(p_i(1-p_i)u_i^2 \geq 0\), so the sum is \(\geq 0\)
Equals 0 only if \(\mathbf{u} = \mathbf{0}\)
If \(\mathbf{X}\) full rank: \(\mathbf{X}\mathbf{v} = \mathbf{0} \Rightarrow \mathbf{v} = \mathbf{0}\)
Therefore: \(-\mathbf{H} \succ 0\) (positive definite)
Convexity implications:
| Property | Convex (Logistic) | Non-convex (Neural Net) |
|---|---|---|
| Local minima | None (only global) | Many |
| Optimization | Any method works | Can get stuck |
| Convergence | Guaranteed | No guarantee |
| Initial point | Doesn't matter | Critical |
| Solution | Unique (if full rank) | Multiple |
Gradient Descent:
Newton’s Method:
Trade-off:
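To make the comparison concrete, here is a minimal Newton's method (IRLS) sketch on the running 1D example, using the gradient and Hessian above; the bias is folded in as a constant column and the step count is arbitrary. Because this toy set is linearly separable, the unregularized weights keep growing as the NLL falls.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
X = np.column_stack([x, np.ones_like(x)])   # parameters theta = [w, b]
theta = np.array([0.5, 0.0])

for step in range(5):
    p = sigmoid(X @ theta)
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    g = X.T @ (y - p)                       # gradient of the log-likelihood
    D = np.diag(p * (1 - p))
    H = -(X.T @ D @ X)                      # Hessian of the log-likelihood
    theta = theta - np.linalg.solve(H, g)   # Newton step: theta - H^{-1} g
    print(step, round(nll, 4))
print(theta)   # w keeps growing on separable data; regularization would bound it
```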
Fisher Information measures parameter certainty: \[\mathcal{I} = \mathbb{E}[-\mathbf{H}] = \mathbb{E}[\mathbf{X}^T\mathbf{D}\mathbf{X}]\]
At convergence for our example:
Weights at optimum:
Small weights \(p_i(1-p_i)\) everywhere → confident predictions on every sample (such points contribute little Fisher information; samples near the boundary contribute the most)
Asymptotic distribution: \[\hat{\mathbf{w}} \sim \mathcal{N}(\mathbf{w}^*, \mathcal{I}^{-1}/n)\]
High Fisher information → Low variance → Confident estimates
One-vs-Rest approach:
Problem: Probabilities do not sum to 1
Example with 3 classes:
Sum = 1.7 ≠ 1
Not valid probabilities for the K classes
Generalize sigmoid to K dimensions:
\[\text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}\]
For binary case (K=2):
Softmax reduces to sigmoid when K=2
Valid probabilities: All outputs in (0,1), sum to 1 \[\sum_{k=1}^K \text{softmax}(\mathbf{z})_k = 1\]
Preserves order: If \(z_i > z_j\), then \(p_i > p_j\)
Translation invariant: \[\text{softmax}(\mathbf{z} + c) = \text{softmax}(\mathbf{z})\]
Proof: \(\frac{e^{z_k + c}}{\sum_j e^{z_j + c}} = \frac{e^c \cdot e^{z_k}}{e^c \cdot \sum_j e^{z_j}} = \frac{e^{z_k}}{\sum_j e^{z_j}}\)
Temperature scaling: \[\text{softmax}(\mathbf{z}/T)\]
The problem: Exponentials overflow/underflow
Solution: Apply translation invariance
Subtract maximum before exponent
\[\text{softmax}(\mathbf{z}) = \text{softmax}(\mathbf{z} - \max(\mathbf{z}))\]
z = [1000, 999, 998]
z_stable = z - max(z) = [0, -1, -2]
softmax(z_stable) = [0.665, 0.245, 0.090] # Valid
Computing log of sum of exponentials:
\[\log\left(\sum_k e^{z_k}\right)\]
Needed for:
Naive approach fails:
Corrected approach is stable: \[\log\left(\sum_k e^{z_k}\right) = m + \log\left(\sum_k e^{z_k - m}\right)\] where \(m = \max(\mathbf{z})\)
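A minimal log-sum-exp sketch using the shift by \(m = \max(\mathbf{z})\); scipy.special.logsumexp provides the same functionality.

```python
import numpy as np

def logsumexp(z):
    """Numerically stable log(sum(exp(z)))."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

z = np.array([1000.0, 999.0, 998.0])
# np.log(np.sum(np.exp(z))) overflows to inf; the shifted version is exact
print(logsumexp(z))              # ~1000.408
log_probs = z - logsumexp(z)     # stable log-softmax
print(np.exp(log_probs))         # [0.665, 0.245, 0.090]
```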
For K classes, we need:
Model: \[\mathbf{z} = \mathbf{W}^T\mathbf{x} + \mathbf{b}\] \[P(y=k|\mathbf{x}) = \text{softmax}(\mathbf{z})_k\]
Over-parameterized:
Each boundary is linear (hyperplane)
One-hot encoding for labels:
Cross-entropy loss: \[\mathcal{L} = -\sum_{k=1}^K y_k \log p_k\]
Since only one \(y_k = 1\): \[\mathcal{L} = -\log p_{\text{true class}}\]
For entire dataset: \[\text{NLL} = -\sum_{i=1}^n \log p_{i,y_i}\]
where \(p_{i,y_i}\) is predicted probability for true class of sample \(i\)
The gradient has the same form as binary case
For sample \(i\) with true class \(y_i\): \[\frac{\partial \ell_i}{\partial \mathbf{w}_k} = (y_{ik} - p_{ik})\mathbf{x}_i\]
where:
Derivation sketch:
Scalar form: \(\frac{\partial \ell}{\partial z_k} = y_k - p_k\) for the log-likelihood, equivalently \(\frac{\partial \mathcal{L}}{\partial z_k} = p_k - y_k\) for the cross-entropy loss
Matrix form (cross-entropy loss): \(\nabla_{\mathbf{W}} \mathcal{L} = \mathbf{X}^T(\mathbf{P} - \mathbf{Y})\)
Data: 3 points, 3 classes
Current parameters:
Computing predictions:
Point 1: \(\mathbf{x} = [1, 0]\)
The softmax has vector input and vector output, so its derivative is a Jacobian matrix:
3×3 Example with \(\mathbf{p} = [0.5, 0.3, 0.2]\):
\[\mathbf{J} = \begin{bmatrix} \frac{\partial p_1}{\partial z_1} & \frac{\partial p_1}{\partial z_2} & \frac{\partial p_1}{\partial z_3} \\ \frac{\partial p_2}{\partial z_1} & \frac{\partial p_2}{\partial z_2} & \frac{\partial p_2}{\partial z_3} \\ \frac{\partial p_3}{\partial z_1} & \frac{\partial p_3}{\partial z_2} & \frac{\partial p_3}{\partial z_3} \end{bmatrix}\]
General formula: \[\frac{\partial p_i}{\partial z_j} = \begin{cases} p_i(1 - p_i) & \text{if } i = j \\ -p_i p_j & \text{if } i \neq j \end{cases}\]
For this example: \[\mathbf{J} = \begin{bmatrix} 0.5(1-0.5) & -0.5(0.3) & -0.5(0.2) \\ -0.3(0.5) & 0.3(1-0.3) & -0.3(0.2) \\ -0.2(0.5) & -0.2(0.3) & 0.2(1-0.2) \end{bmatrix} = \begin{bmatrix} 0.25 & -0.15 & -0.10 \\ -0.15 & 0.21 & -0.06 \\ -0.10 & -0.06 & 0.16 \end{bmatrix}\]
Rows sum to 0 (probabilities sum to 1)
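The Jacobian formula can be checked numerically; this sketch builds \(\mathbf{J} = \mathrm{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T\) and compares it to central finite differences at a \(\mathbf{z}\) that produces \(\mathbf{p} = [0.5, 0.3, 0.2]\).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = np.array([0.5, 0.3, 0.2])
J_analytic = np.diag(p) - np.outer(p, p)   # J[i, j] = p_i (δ_ij - p_j)

z = np.log(p)          # softmax(log p) = p, so differentiate at this z
eps = 1e-6
J_numeric = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    J_numeric[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.round(J_analytic, 2))                    # matches the matrix above
print(np.max(np.abs(J_analytic - J_numeric)))     # ~1e-10
print(np.round(J_analytic.sum(axis=1), 12))       # rows sum to 0
```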
When K=2, softmax reduces to sigmoid:
Let class 0 be reference with \(z_0 = 0\): \[p_1 = \frac{e^{z_1}}{e^0 + e^{z_1}} = \frac{e^{z_1}}{1 + e^{z_1}} = \frac{1}{1 + e^{-z_1}} = \sigma(z_1)\]
\[p_0 = \frac{e^0}{e^0 + e^{z_1}} = \frac{1}{1 + e^{z_1}} = 1 - \sigma(z_1)\]
Softmax generalizes sigmoid to multiple classes
Perfect separation scenario:
# 2D data with polynomial features
X = [[-2, -1], [-1, -2], [1, 2], [2, 1]]
y = [0, 0, 1, 1]
# Add polynomial features
X_poly = [[x1, x2, x1**2, x2**2, x1 * x2]
          for x1, x2 in X]
With 5 features for 4 points:
The problem:
As ||w|| increases:
Example: 100 features, 80 samples
Without regularization:
n, d = 80, 100  # More features than samples
X = np.random.randn(n, d)
w_true = np.random.randn(d)               # illustrative ground-truth weights
noise = 0.5 * np.random.randn(n)          # illustrative label noise
y = (X @ w_true + noise > 0).astype(int)
What happens during optimization:
Maximum likelihood fails when:
The loss keeps decreasing while the weights grow without bound.
Modified objective: \[\mathcal{L}_{\text{ridge}} = \underbrace{-\ell(\mathbf{w})}_{\text{NLL}} + \underbrace{\lambda ||\mathbf{w}||^2}_{\text{penalty}}\]
Gradient with regularization: \[\nabla \mathcal{L} = -\mathbf{X}^T(\mathbf{y} - \mathbf{p}) + 2\lambda\mathbf{w}\]
Compare to unregularized: \[\nabla \ell = \mathbf{X}^T(\mathbf{y} - \mathbf{p})\]
Effect: Pull weights toward zero
Update rule: \[\mathbf{w} \leftarrow \mathbf{w} - \eta[-\nabla \ell + 2\lambda\mathbf{w}]\] \[= (1 - 2\eta\lambda)\mathbf{w} + \eta\nabla \ell\]
Weight decay: Shrink by factor \((1 - 2\eta\lambda)\)
L2 Regularization = Bias toward smaller weights.
Larger λ → Smaller weights → More stable model
L2 regularization = Gaussian prior
Prior on weights: \[P(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})\] \[\propto \exp\left(-\frac{||\mathbf{w}||^2}{2\sigma^2}\right)\]
MAP estimation: \[\hat{\mathbf{w}}_{\text{MAP}} = \arg\max_\mathbf{w} P(\mathcal{D}|\mathbf{w})P(\mathbf{w})\]
Taking logarithm: \[= \arg\max_\mathbf{w} \left[\log P(\mathcal{D}|\mathbf{w}) + \log P(\mathbf{w})\right]\] \[= \arg\max_\mathbf{w} \left[\ell(\mathbf{w}) - \frac{||\mathbf{w}||^2}{2\sigma^2}\right]\]
With \(\lambda = \frac{1}{2\sigma^2}\): \[= \arg\min_\mathbf{w} \left[-\ell(\mathbf{w}) + \lambda ||\mathbf{w}||^2\right]\]
This is the ridge (L2-regularized) objective for logistic regression
Smaller σ gives stronger regularization
L1 penalty: \[\mathcal{L}_{\text{lasso}} = -\ell(\mathbf{w}) + \lambda ||\mathbf{w}||_1\] \[= -\ell(\mathbf{w}) + \lambda \sum_{j=1}^d |w_j|\]
Subgradient (not differentiable at 0): \[\partial_{w_j} \mathcal{L} = -\frac{\partial \ell}{\partial w_j} + \lambda \cdot \text{sign}(w_j)\]
where \(\text{sign}(w) = \begin{cases} +1 & w > 0 \\ [-1, 1] & w = 0 \\ -1 & w < 0 \end{cases}\)
Induces exact sparsity: Many weights become exactly zero
Bayesian view: Laplace prior \[P(\mathbf{w}) \propto \exp(-\lambda ||\mathbf{w}||_1)\]
L1 hits corners → Some weights exactly 0
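A short scikit-learn sketch contrasting L2 and L1 fits on synthetic data where only the first three of twenty features matter; the data, C values, and solver choice are illustrative (sklearn's C plays the role of 1/λ).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]            # only 3 informative features
y = (X @ w_true + 0.3 * rng.normal(size=300) > 0).astype(int)

l2 = LogisticRegression(penalty="l2", C=0.5, max_iter=2000).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=0.5, solver="liblinear").fit(X, y)

print("nonzero L2 coefficients:", int(np.sum(np.abs(l2.coef_) > 1e-6)))  # typically all 20
print("nonzero L1 coefficients:", int(np.sum(np.abs(l1.coef_) > 1e-6)))  # typically only a few
```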
Combine L1 and L2: \[\mathcal{L}_{\text{elastic}} = -\ell(\mathbf{w}) + \lambda_1 ||\mathbf{w}||_1 + \lambda_2 ||\mathbf{w}||^2\]
Alternative parameterization: \[= -\ell(\mathbf{w}) + \lambda\left[\alpha ||\mathbf{w}||_1 + \frac{1-\alpha}{2}||\mathbf{w}||^2\right]\]
where:
Advantages:
Elastic Net: Selects groups of correlated features
k-fold cross-validation:
Split data into \(k\) folds
For each \(\lambda\):
Average validation error
Choose \(\lambda\) with lowest error
Common choices:
One standard error rule: choose the largest \(\lambda\) whose CV error is within one standard error of the minimum
A larger \(\lambda\) within 1 SE yields a simpler model with comparable performance
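A minimal k-fold sketch for choosing λ with scikit-learn; the synthetic data and λ grid are illustrative, and sklearn's C is used as 1/λ. The one-standard-error rule would then pick the largest λ whose mean validation NLL is within one reported standard error of the best.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(int)

lambdas = [0.001, 0.01, 0.1, 1.0, 10.0]
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for lam in lambdas:
    errs = []
    for train_idx, val_idx in kf.split(X):
        clf = LogisticRegression(penalty="l2", C=1.0 / lam, max_iter=1000)
        clf.fit(X[train_idx], y[train_idx])
        p_val = clf.predict_proba(X[val_idx])[:, 1]
        errs.append(log_loss(y[val_idx], p_val))
    print(f"lambda={lam}: val NLL = {np.mean(errs):.3f} +/- {np.std(errs):.3f}")
```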
import numpy as np
from scipy.special import expit as sigmoid
def logistic_regression_l2(X, y, lambda_reg=0.1, lr=0.01, max_iter=1000):
    """Logistic regression with L2 regularization."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0
    losses = []

    for iteration in range(max_iter):
        # Forward pass
        z = X @ w + b
        p = sigmoid(z)

        # Loss = NLL + penalty
        nll = -np.mean(y * np.log(p + 1e-10) + (1 - y) * np.log(1 - p + 1e-10))
        penalty = lambda_reg * np.sum(w**2)
        loss = nll + penalty
        losses.append(loss)

        # Gradients with regularization
        grad_w = -X.T @ (y - p) / n + 2 * lambda_reg * w
        grad_b = -np.mean(y - p)

        # Update parameters
        w -= lr * grad_w
        b -= lr * grad_b

        if iteration % 200 == 0:
            print(f"Iter {iteration}: Loss = {loss:.4f}, ||w|| = {np.linalg.norm(w):.4f}")

    return w, b, losses

# Generate synthetic data
np.random.seed(42)
n, d = 100, 10
X = np.random.randn(n, d)
w_true = np.random.randn(d) * 2
y = (X @ w_true + 0.5 * np.random.randn(n) > 0).astype(float)

# Compare different λ values
for lambda_val in [0, 0.1, 1.0]:
    print(f"\nλ = {lambda_val}:")
    w, b, losses = logistic_regression_l2(
        X, y, lambda_reg=lambda_val, lr=0.1, max_iter=400
    )
    print(f"Final ||w|| = {np.linalg.norm(w):.3f}")
λ = 0:
Iter 0: Loss = 0.6931, ||w|| = 0.0390
Iter 200: Loss = 0.2209, ||w|| = 2.6773
Final ||w|| = 3.670
λ = 0.1:
Iter 0: Loss = 0.6931, ||w|| = 0.0390
Iter 200: Loss = 0.5105, ||w|| = 0.9713
Final ||w|| = 0.972
λ = 1.0:
Iter 0: Loss = 0.6931, ||w|| = 0.0390
Iter 200: Loss = 0.6575, ||w|| = 0.1738
Final ||w|| = 0.174
XOR (Exclusive OR):
No linear boundary exists
Any line through the space:
Logistic regression fails because: \[P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b)\] enforces linear decision boundary
Solution: Add hidden layer
Architecture equivalence:
\[z = \mathbf{w}^T\mathbf{x} + b\] \[p = \sigma(z) = \frac{1}{1 + e^{-z}}\]
This is the basic building block of a neural network: a single neuron
Forward pass: \[\mathbf{x} \to \mathbf{w}^T\mathbf{x} + b \to \sigma(\cdot) \to p\]
Backward pass (gradient): \[\frac{\partial \ell}{\partial \mathbf{w}} = (y - p)\mathbf{x}\]
Two-layer network: \[\mathbf{h} = \sigma(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1)\] \[p = \sigma(\mathbf{w}_2^T\mathbf{h} + b_2)\]
where:
Equivalence
Same cross-entropy loss: \[\min_{\mathbf{W}_1, \mathbf{W}_2} -\sum_{i=1}^n [y_i \log p_i + (1-y_i)\log(1-p_i)]\]
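As a sketch of the "add a hidden layer" fix, the following trains a two-layer network on XOR with plain gradient descent and manual backpropagation; the hidden width (4), learning rate, seed, and iteration count are arbitrary choices, and because the loss is non-convex some seeds may need more iterations or a restart.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR: not linearly separable
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])
n = len(y)

rng = np.random.default_rng(0)
W1 = rng.normal(scale=1.0, size=(2, 4))   # input -> 4 hidden units
b1 = np.zeros(4)
w2 = rng.normal(scale=1.0, size=4)        # hidden -> output
b2 = 0.0
lr = 1.0

for it in range(5000):
    # Forward pass
    H = sigmoid(X @ W1 + b1)              # hidden activations, (n_samples, n_hidden)
    p = sigmoid(H @ w2 + b2)              # P(y=1|x)
    # Backward pass (cross-entropy loss)
    dz2 = (p - y) / n                     # dL/dz2
    grad_w2 = H.T @ dz2
    grad_b2 = dz2.sum()
    dH = np.outer(dz2, w2)                # dL/dH
    dz1 = dH * H * (1 - H)                # dL/dz1
    grad_W1 = X.T @ dz1
    grad_b1 = dz1.sum(axis=0)
    # Gradient descent update
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    w2 -= lr * grad_w2; b2 -= lr * grad_b2

print(np.round(p, 3))   # typically close to [0, 1, 1, 0] after training
```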
Zero initialization (fails for hidden layers):
Random normal:
Xavier/Glorot initialization (preserves variance):
He initialization (for ReLU):
Logistic regression: small random weights suffice
SGD: Basic gradient update \[\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla \ell\]
Momentum: Accumulate velocity \[\mathbf{v} \leftarrow \beta \mathbf{v} + \nabla \ell\] \[\mathbf{w} \leftarrow \mathbf{w} - \eta \mathbf{v}\]
Adam: Adaptive per-parameter rates (bias-correction of \(\mathbf{m}, \mathbf{v}\) omitted here) \[\mathbf{m} \leftarrow \beta_1 \mathbf{m} + (1-\beta_1)\nabla \ell\] \[\mathbf{v} \leftarrow \beta_2 \mathbf{v} + (1-\beta_2)(\nabla \ell)^2\] \[\mathbf{w} \leftarrow \mathbf{w} - \eta \frac{\mathbf{m}}{\sqrt{\mathbf{v}} + \epsilon}\]
Memory requirements:
Adam converges fastest for logistic regression: 2-10x fewer iterations than SGD
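A minimal sketch of the Adam update applied to the logistic-regression NLL gradient (with the bias-correction terms included, unlike the summary above); the synthetic data, learning rate, and iteration count are illustrative, and the bias term is omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adam_logistic(X, y, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, iters=500):
    n, d = X.shape
    w = np.zeros(d)
    m = np.zeros(d)         # first-moment estimate
    v = np.zeros(d)         # second-moment estimate
    for t in range(1, iters + 1):
        p = sigmoid(X @ w)
        g = -X.T @ (y - p) / n              # gradient of the NLL
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)          # bias correction
        v_hat = v / (1 - beta2**t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Quick check on synthetic data: learned weights roughly align with w_true
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
w_true = np.array([2.0, -1.0, 0.5, 0.0, 1.5])
y = (X @ w_true > 0).astype(float)
print(np.round(adam_logistic(X, y), 2))
```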
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

class LogisticRegression:
    def __init__(self, lr=0.01, reg=0.1, max_iter=1000):
        self.lr = lr
        self.reg = reg
        self.max_iter = max_iter

    def fit(self, X, y):
        n, d = X.shape
        self.w = np.random.randn(d) * 0.01
        self.b = 0

        for _ in range(self.max_iter):
            # Forward
            z = X @ self.w + self.b
            p = self._sigmoid(z)

            # Backward
            grad_w = -X.T @ (y - p) / n + self.reg * self.w
            grad_b = -np.mean(y - p)

            # Update
            self.w -= self.lr * grad_w
            self.b -= self.lr * grad_b

    def predict_proba(self, X):
        z = X @ self.w + self.b
        return self._sigmoid(z)

    def _sigmoid(self, z):
        return np.where(
            z >= 0,
            1 / (1 + np.exp(-z)),
            np.exp(z) / (1 + np.exp(z))
        )
Logistic regression = single-layer neural network
Single neuron with sigmoid activation:
Gradient computation: \[\nabla_\mathbf{w} \mathcal{L} = \mathbf{X}^T(\mathbf{p} - \mathbf{y})\]
Gradient formula extends to: