Caltech Logo

CMI Seminar: Mateo Diaz

Tuesday, October 12, 2021
4:00pm to 5:00pm
Add to Cal
Annenberg 105
Clustering a mixture of Gaussians with unknown covariance

Clustering is a fundamental data scientific task with broad application. This talk investigates a simple clustering problem with data from a mixture of Gaussians that share a common but unknown, and potentially ill-conditioned, covariance matrix. We start by considering Gaussian mixtures with two equally-sized components and derive a Max-Cut integer program based on maximum likelihood estimation. We show its solutions achieve the optimal misclassification rate when the number of samples grows linearly in the dimension, up to a logarithmic factor. However, solving the Max-cut problem appears to be computationally intractable. To overcome this, we develop an efficient spectral algorithm that attains the optimal rate but requires a quadratic sample size. Although this sample complexity is worse than that of the Max-cut problem, we conjecture that no polynomial-time method can perform better. Furthermore, we present numerical and theoretical evidence that supports the existence of a statistical-computational gap. Finally, we generalize the Max-Cut program to a k-means program that handles multi-component mixtures with possibly unequal weights and has similar guarantees.

For more information, please contact Linda Taddeo by phone at 6704 or by email at ltaddeo@caltech.edu.