H.B. Keller Colloquium
Stochastic Gradient Descent (SGD) is an increasingly popular optimization algorithm for a variety of large-scale learning problems, due to its computational efficiency and ease of implementation. In particular, SGD is the standard algorithm for training neural networks. In the neural network community, certain "tricks" added on top of the basic SGD algorithm have been shown to improve convergence rate and generalization accuracy in practice, without theoretical foundations. In this talk, we focus on two such tricks: AdaGrad, an adaptive gradient method which automatically adjusts the learning rate schedule in SGD to reduce the need for hyperparameter tuning, and Weight Normalization, where SGD is implemented (essentially) with respect to polar rather than Cartesian coordinates. We reframe each of these tricks as a general-purpose modification to the standard SGD algorithm, and provide the first theoretical guarantees of convergence, robustness, and generalization performance in a general context.
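For concreteness, the sketch below illustrates the two updates described in the abstract: AdaGrad's per-coordinate step sizes, which shrink as squared gradients accumulate, and Weight Normalization's reparameterization of a weight vector into a magnitude and a direction (the "polar coordinates" view). This is a minimal NumPy illustration; the function names, default learning rate, and epsilon constant are assumptions for exposition, not details from the talk.

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.01, eps=1e-8):
    """One AdaGrad update: each coordinate's effective step size is
    lr / sqrt(sum of past squared gradients), reducing the need to
    hand-tune a learning-rate schedule."""
    accum = accum + grad ** 2                        # running sum of squared gradients
    w = w - lr * grad / (np.sqrt(accum) + eps)       # coordinate-wise scaled step
    return w, accum

def weightnorm_step(g, v, grad_w, lr=0.01):
    """One SGD step under the Weight Normalization reparameterization
    w = g * v / ||v||: the magnitude g and direction v are updated
    separately, which is (essentially) SGD in polar coordinates."""
    norm_v = np.linalg.norm(v)
    w_dir = v / norm_v
    grad_g = grad_w @ w_dir                              # gradient w.r.t. the magnitude g
    grad_v = (g / norm_v) * (grad_w - grad_g * w_dir)    # gradient w.r.t. the direction v
    return g - lr * grad_g, v - lr * grad_v
```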