Tuesday, November 13, 2012
12:00 pm
Annenberg 105

IST Lunch Bunch

A Unified Framework for Approximating and Clustering Data
Michael Langberg, Mathematics and Computer Science, The Open University of Israel

One of the great challenges in computational theory is the extraction
of patterns from massive and high-dimensional data sets, e.g.,
clustering. The field of geometric clustering is an extensively
studied one with a wide spectrum of "real world" applications.

A "coreset" for a given clustering optimization problem is a
"compressed" representation of the data at hand, in the sense that a
solution for the clustering problem with the (small) coreset as input
would yield an approximate solution to the problem with the original
(larger) input. Using traditional techniques, a coreset usually
implies provable linear time algorithms for the corresponding
clustering problem.

In this talk I will present a unified and abstract framework that
yields the efficient construction of succinct coresets for several
clustering problems. Representing the data by a set F of positive
functions over a ground set X, our framework forges a link between the
combinatorial complexity of the function family F at hand (measured in
terms of classical VC dimension) and the paradigm of coresets. Our
coresets are obtained via randomized sampling according to a
delicately designed sampling distribution.  Examples in which our
framework has been successful include the k-median, the k-line median,
projective clustering, linear regression, low rank approximation, and
subspace approximation problems.

This talk is based on joint works with Dan Feldman and Leonard Schulman.

Contact Sydney Garstang sydney@caltech.edu at 626-395-4555
Add this event to my calendar