Homework #4

EE 541: Spring 2025

Assignment Details

Assigned: 29 September
Due: Sunday, 05 October at 23:59

BrightSpace Assignment: Homework 4
Getting Started Guide: View Guide


Problem 1: Clustering Algorithms

Unsupervised clustering algorithms are an efficient means to identify groups of related objects within large populations. Implement the following two clustering algorithms and apply them to the data in cluster.txt. The file contains one sample per line in the form: x, y, class. If necessary, use a regular expression to remove lines that are empty or that contain invalid data. You may safely ignore any line that fails the regular expression.
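A minimal loading sketch follows, assuming each valid line is comma-separated as x, y, class; the regular expression and the function name load_cluster_file are illustrative and should be adjusted to the actual file format.

```python
import re
import numpy as np

# Illustrative pattern: two (possibly signed) decimals followed by a class name.
LINE_RE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*,\s*(\w+)\s*$")

def load_cluster_file(path):
    points, labels = [], []
    with open(path) as fh:
        for line in fh:
            m = LINE_RE.match(line)
            if m is None:          # empty or malformed line -- safe to skip
                continue
            points.append([float(m.group(1)), float(m.group(2))])
            labels.append(m.group(3))
    return np.array(points), np.array(labels)
```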

Part (a): K-Means Clustering

Requirements

You may use any standard NumPy or SciPy packages or experiment with your own implementation.

Use K-means clustering with 3 clusters to label each \((x,y)\) pair as Head, Ear_Right, or Ear_Left.

Produce a scatter plot marking each \((x,y)\) pair as BLUE (class = Head), RED (class = Ear_Left), or GREEN (class = Ear_Right). Compare the K-means predicted labels to the true labels and generate a confusion matrix showing the respective accuracies.
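One possible starting point, assuming SciPy's kmeans2 (any standard routine is acceptable per the requirements) and the points and true_labels arrays from the loader sketch above; the majority-vote mapping from cluster index to class name is an illustrative heuristic.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans2

# points: (N, 2) float array, true_labels: (N,) class-name strings (from the loader above).
centroids, assign = kmeans2(points, 3, minit="++")

# Map each K-means cluster index to the true class name that dominates it (majority vote).
classes = np.array(["Head", "Ear_Left", "Ear_Right"])
cluster_to_class = {
    c: classes[np.argmax([(true_labels[assign == c] == name).sum() for name in classes])]
    for c in range(3)
}
pred_labels = np.array([cluster_to_class[c] for c in assign])

# Scatter plot: BLUE = Head, RED = Ear_Left, GREEN = Ear_Right.
colors = {"Head": "blue", "Ear_Left": "red", "Ear_Right": "green"}
for name in classes:
    pts = points[pred_labels == name]
    plt.scatter(pts[:, 0], pts[:, 1], s=5, c=colors[name], label=name)
plt.legend()
plt.show()

# Confusion matrix: rows = true class, columns = predicted class (counts).
# Normalize each row by its sum to report per-class accuracies.
confusion = np.array([[(pred_labels[true_labels == t] == p).sum() for p in classes]
                      for t in classes])
print(confusion)
```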

Part (b): Gaussian Mixture Models with EM

Requirements

DO NOT use an EM implementation from NumPy, SciPy, or any other package. You must implement and use the EM equations below.

Gaussian Mixture Models (GMM) are a common method to cluster data from multi-modal probability densities. Expectation maximization (EM) is an iterative procedure to compute (locally) optimal GMM parameters – GMM cluster means \(\mu_k\), covariances \(\Sigma_k\), and mixing weights \(w_k\). EM consists of two steps. The E[xpectation]-step uses the mixture parameters to update estimates of hidden variables. The true but unknown class is an example of a hidden variable. The M[aximization]-step then uses the new hidden variable estimates to update the mixture parameter estimates. This back-and-forth update provably increases the likelihood function, and the estimate eventually converges to a local likelihood maximum.

The GMM update equations follow:

E-Step: Use current mixture parameters estimates to calculate membership probabilities (a.k.a. the hidden variables) for each sample, \(\gamma_k(x_n)\):

\[ \gamma_{k}(x_n) = \frac{w_k f(x_n; \mu_k, \Sigma_k)}{\sum_{j=1}^{K} w_j f(x_n; \mu_j, \Sigma_j)} \]

M-Step: Use new \(\gamma_{k}(x_n)\) to update mixture parameter estimates:

\[ \mu_{k} = \frac{\sum_{n=1}^{N} \gamma_{k}(x_n) x_n}{\sum_{n=1}^{N} \gamma_{k}(x_n)} \]

\[ \Sigma_{k} = \frac{\sum_{n=1}^{N} \gamma_{k}(x_n) (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^{N} \gamma_{k}(x_n)} \]

\[ w_{k} = \frac{1}{N} \sum_{n=1}^{N} \gamma_{k}(x_n) \]

for \(k \in \{1, \ldots, K\}\) where \(K \in \mathbb{Z}^+\) is the (predefined) number of mixture components, \(x_n\) for \(n \in \{1, \ldots, N\}\) are the data samples, and \(f(x; \mu_k, \Sigma_k)\) is the \(d\)-dimensional jointly Gaussian pdf:

\[ f(x; \mu_k, \Sigma_k) = \frac{\exp\left(-\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)}{\sqrt{(2\pi)^d |\Sigma_k|}}. \]
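One possible NumPy realization of the update equations above (a sketch, not a reference implementation); X is the (N, d) data array, and the function names are illustrative:

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """d-dimensional Gaussian density f(x; mu, Sigma) evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu                                   # (N, d)
    Sinv = np.linalg.inv(Sigma)
    expo = -0.5 * np.einsum("nd,de,ne->n", diff, Sinv, diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(expo) / norm

def e_step(X, w, mus, Sigmas):
    """E-step: return gamma, an (N, K) array of membership probabilities."""
    K = len(w)
    gamma = np.column_stack([w[k] * gaussian_pdf(X, mus[k], Sigmas[k]) for k in range(K)])
    return gamma / gamma.sum(axis=1, keepdims=True)

def m_step(X, gamma):
    """M-step: update (w, mus, Sigmas) from the membership probabilities."""
    N, K = gamma.shape
    Nk = gamma.sum(axis=0)                          # effective count per component
    w = Nk / N
    mus = (gamma.T @ X) / Nk[:, None]
    Sigmas = []
    for k in range(K):
        diff = X - mus[k]
        Sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    return w, mus, np.array(Sigmas)
```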

  1. Implement Expectation Maximization and use it to estimate mixture parameters for a 3-component GMM for the cluster dataset.

  2. Initialize \(\gamma_k(x_n)\) for each sample using the K-means labels from part (a) as “one-hot” membership probabilities (i.e., initialize one of the probabilities to “1” and all others to “0”). Then compute initial \(\mu_k\) and \(\Sigma_k\) for each mixture component.

  3. Run EM until it has sufficiently converged. Use either the log-likelihood \[ \ell = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} w_k f(x_n; \mu_k, \Sigma_k) \right) \] as a convergence metric (see the driver sketch after this list) or monitor the class assignments until there are only small changes. Be aware that some points may “flip-flop” even when fully converged. Assign each datapoint to the mixture component with the largest membership probability. Produce a Blue-Red-Green scatter plot as in part (a) and generate a confusion matrix showing the respective classification accuracies.

  4. Generate figures showing the class assignments during the first four iterations.

  5. Comment on the difference between the clustering result in (a) and (b). Describe any obvious difference between the plots and indicate which performs better.
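A minimal driver sketch tying items 2 and 3 together, building on gaussian_pdf, e_step, and m_step from the earlier sketch; fit_gmm, kmeans_assign, and the tolerance value are illustrative names and settings, not required ones.

```python
import numpy as np

def log_likelihood(X, w, mus, Sigmas):
    mix = sum(w[k] * gaussian_pdf(X, mus[k], Sigmas[k]) for k in range(len(w)))
    return np.log(mix).sum()

def fit_gmm(X, kmeans_assign, K=3, tol=1e-6, max_iter=200):
    # One-hot initialization of gamma from the K-means cluster indices (part a).
    N = X.shape[0]
    gamma = np.zeros((N, K))
    gamma[np.arange(N), kmeans_assign] = 1.0

    w, mus, Sigmas = m_step(X, gamma)               # initial mu_k, Sigma_k, w_k
    prev_ll = -np.inf
    for it in range(max_iter):
        gamma = e_step(X, w, mus, Sigmas)
        w, mus, Sigmas = m_step(X, gamma)
        ll = log_likelihood(X, w, mus, Sigmas)
        if ll - prev_ll < tol:                      # converged (ll is non-decreasing)
            break
        prev_ll = ll
    return w, mus, Sigmas, gamma.argmax(axis=1)     # hard assignments = argmax_k gamma_k
```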

Problem 2: HDF5 Binary Sequences

The HDF5 format can store multiple data objects in a single file, each keyed by object name – e.g., you can store a NumPy float array called regressor and a NumPy integer array called labels in the same file. HDF5 also allows fast non-sequential access to objects without scanning the entire file. This means you can efficiently access objects and data such as x[idxs] with non-consecutive indexes, e.g., idxs = [2, 234, 512]. This random-access property is useful when extracting a random subset from a larger training database.
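A short h5py sketch illustrating keyed storage and non-consecutive indexing; the file and dataset names are illustrative.

```python
import numpy as np
import h5py

# Write two named objects to one file (file and dataset names are illustrative).
with h5py.File("example.hdf5", "w") as f:
    f.create_dataset("regressor", data=np.random.randn(1000, 8).astype(np.float32))
    f.create_dataset("labels", data=np.random.randint(0, 2, size=1000))

# Random access: read only selected rows; h5py list indexing requires increasing indexes.
idxs = [2, 234, 512]
with h5py.File("example.hdf5", "r") as f:
    x_subset = f["regressor"][idxs]   # shape (3, 8), fetched without reading the whole file
    y_subset = f["labels"][idxs]
```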

In this problem you will create an HDF5 file containing a NumPy array of binary random sequences that you generate yourself. Follow these steps:

  1. Run the provided template Python file – random_binary_collection.py. The script is set to DEBUG mode by default.

  2. Experiment with the assert statements to trap errors and understand what they check, e.g., by inspecting the shape attribute of the NumPy arrays.

  3. Set the DEBUG flag to False. Manually create 10 binary sequences, each of length 50. It is important that you do this by hand, i.e., do not use a coin, computer, or random number generator.

  4. Verify that your HDF5 file was written properly by checking that it can be read back (a read-back sketch follows this list).

  5. Submit your HDF5 file as directed.
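One way to sanity-check the written file, assuming the template stores the sequences as a single (10, 50) integer dataset; the file and dataset names below are placeholders – use whatever names random_binary_collection.py actually writes.

```python
import numpy as np
import h5py

# Read-back check; replace the placeholder names with those used by the template.
with h5py.File("random_binary.hdf5", "r") as f:
    data = f["binary_sequences"][...]

assert data.shape == (10, 50), f"unexpected shape {data.shape}"
assert np.isin(data, [0, 1]).all(), "entries must be 0 or 1"
print("read-back OK:", data.shape, data.dtype)
```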


Submission Instructions

Problem 1: Clustering Algorithms

Submit scatter plots, confusion matrices, and analysis to BrightSpace. Submit your Python code as an appendix to Gradescope. A suitably annotated Jupyter notebook with inline analysis is sufficient.

Gradescope: Problem 1

Problem 2: HDF5 Binary Sequences

Submit your HDF5 file containing the manually created binary sequences as directed in BrightSpace.

Gradescope: Problem 2