Caltech

High Energy Physics/CS Special Seminar

Tuesday, May 7, 2019
4:00pm to 5:00pm
Lauritsen 469
Title: Software Engineering for Data Science and Big Data Analytics
Miryung Kim, UCLA

Abstract: Demand for analyzing large-scale telemetry, machine, and quality data is rapidly increasing in the software industry, and data scientists are becoming increasingly common within software teams. We conducted a large-scale survey of 793 professional data scientists at Microsoft to understand their educational backgrounds, the problem topics they work on, their tool usage, and their activities.

To process massive quantities of data, data scientists leverage data-intensive scalable computing (DISC) systems in the cloud, such as Google's MapReduce, Hadoop, and Apache Spark. While DISC systems help address the scalability challenges of big data analytics, they also introduce new challenges in debugging. In this talk, I will first describe interactive, real-time debugging primitives that we designed for Apache Spark, a next-generation data-intensive scalable cloud computing platform. I will also briefly describe the data provenance and optimized incremental computation capabilities that we built within Apache Spark to support debugging effectively and efficiently. Then, I will describe automated debugging that combines insights from automated fault isolation in software engineering and data provenance in database systems to find a minimum set of failure-inducing inputs.

For more information, please contact Joosep Pata by phone at (626) 395-6677 or by email at [email protected].