Caltech

IST Lunch Bunch

Tuesday, April 16, 2019
12:00pm to 1:00pm
Annenberg 105
Automated Data Summarization for Scalability in Bayesian Machine Learning
Tamara Broderick, Assistant Professor in Electrical Engineering and Computer Science, Massachusetts Institute of Technology (MIT)

Many algorithms take prohibitively long to run on modern, large data
sets. But even in complex data sets, many data points may be at least
partially redundant for some task of interest. So one might instead
construct and use a weighted subset of the data (called a "coreset")
that is much smaller than the original data set. Typically, running
algorithms on a much smaller data set will take much less computing
time, but it remains to be understood whether the output can be widely
useful. (1) In particular, can running an analysis on a smaller
coreset yield answers close to those from running on the full data
set? (2) And can useful coresets be constructed automatically for new
analyses, with minimal extra work from the user? We answer in the
affirmative for a wide variety of problems in Bayesian machine
learning. We demonstrate how to construct "Bayesian coresets" as an
automatic, practical pre-processing step. We prove that our method
provides geometric decay in relevant approximation error as a function
of coreset size. Empirical analysis shows that our method reduces
approximation error by orders of magnitude relative to uniform random
subsampling of data. Though we focus on Bayesian applications here, we
also show that our construction can be applied in other domains.

For more information, please contact Diane Goodfellow by email at [email protected].