Caltech

H.B. Keller Colloquium

Monday, November 9, 2020
2:00pm to 3:00pm
Online Event
Stochastic Gradient Descent: From Practice Back to Theory
Rachel Ward, W. A. "Tex" Moncrief Distinguished Professor in Computational Engineering and Sciences—Data Science and Associate Professor of Mathematics, Department of Mathematics, University of Texas at Austin

Stochastic Gradient Descent (SGD) is an increasingly popular optimization algorithm for a variety of large-scale learning problems, due to its computational efficiency and ease of implementation. In particular, SGD is the standard algorithm for training neural networks. In the neural network industry, certain "tricks" added on top of the basic SGD algorithm have been shown to improve convergence rate and generalization accuracy in practice, without theoretical foundations. In this talk, we focus on two such tricks: AdaGrad, an adaptive gradient method which automatically adjusts the learning rate schedule in SGD to reduce the need for hyperparameter tuning, and Weight Normalization, where SGD is implemented (essentially) with respect to polar rather than Cartesian coordinates. We reframe each of these tricks as a general-purpose modification to the standard SGD algorithm, and provide the first theoretical guarantees of convergence, robustness, and generalization performance in a general context.
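As background for the abstract, the sketch below illustrates the two modifications it describes, applied to a toy least-squares problem. The specific variants shown (a norm-based AdaGrad step size and the g·v/||v|| weight-normalization reparameterization), as well as the objective, learning rates, and iteration counts, are illustrative assumptions, not the speaker's exact formulation.

```python
import numpy as np

# Toy least-squares problem: minimize 0.5 * ||A w - y||^2 over w.
rng = np.random.default_rng(0)
A, y = rng.normal(size=(100, 5)), rng.normal(size=100)

def stochastic_grad(w):
    # Gradient of 0.5 * (a_i . w - y_i)^2 for one randomly sampled row i.
    i = rng.integers(len(y))
    return (A[i] @ w - y[i]) * A[i]

# AdaGrad-style step: a single adaptive learning rate driven by accumulated
# gradient norms, so the step-size schedule needs no manual tuning.
w, b2, eta = np.zeros(5), 1e-8, 0.5
for _ in range(5000):
    g = stochastic_grad(w)
    b2 += g @ g                  # accumulate squared gradient norms
    w -= eta / np.sqrt(b2) * g   # effective step size decays automatically

# Weight Normalization: reparameterize w = scale * v / ||v|| and run SGD on
# (scale, v), which is essentially SGD in polar (magnitude/direction) coordinates.
v, scale = rng.normal(size=5), 1.0
for _ in range(5000):
    norm_v = np.linalg.norm(v)
    w_eff = scale * v / norm_v
    grad_w = stochastic_grad(w_eff)
    grad_scale = grad_w @ v / norm_v                                  # chain rule w.r.t. magnitude
    grad_v = scale / norm_v * (grad_w - (grad_w @ v) * v / (v @ v))   # chain rule w.r.t. direction
    scale -= 0.05 * grad_scale
    v -= 0.05 * grad_v
```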

For more information, please contact Diana Bohler by phone at 626-232-6138 or by email at [email protected].