Machine Learning & Scientific Computing Series

Tuesday, March 29, 2022
12:00pm to 1:00pm
Online Event
Examining Occam's Razor in Deep Neural Networks Using Kolmogorov Complexity
Ard A. Louis, Professor of Theoretical Physics, Department of Physics, University of Oxford

Classic arguments from statistical learning theory, often formulated in terms of the bias-variance tradeoff, suggest that models with high capacity should overfit, and therefore generalize poorly to unseen data. Deep neural networks (DNNs) appear to break this basic rule of statistics, because they perform best in the overparameterized regime. One way of formulating this conundrum is in terms of inductive bias: DNNs are highly expressive, and so can represent almost any function that fits a training data set. Why then are they biased towards functions that generalize well? This inductive bias must arise from an interplay between network architecture, training algorithms, and structure in the data.

To disentangle these three components, we apply a Bayesian picture, based on the functions expressed by a DNN, to supervised learning for some simple classification problems, including Boolean functions, MNIST and CIFAR10. We show that the DNN prior over functions is determined by the architecture, and is biased towards "simple" functions with low Kolmogorov complexity. This simplicity bias can be varied by exploiting a transition between ordered and chaotic regimes. The likelihood is calculated from the error spectrum of functions on data sets. Combining the prior and the likelihood to calculate the posterior accurately predicts the behavior of DNNs trained with stochastic gradient descent. This analysis suggests that overcoming the traditional bias-variance problem for models with high capacity requires an Occam's razor-like inductive bias towards simple functions, one powerful enough to overcome the exponential growth in the number of functions with complexity. When this picture is combined with structured data, it helps explain the big-picture question of why DNNs generalize in the overparameterized regime. It doesn't (yet) explain why some DNNs generalize better than others.
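
As a rough illustration of the prior over functions described above, the following sketch samples random weights for a small network on all 7-bit Boolean inputs and records which Boolean function (truth table) each sample expresses. It is not the speaker's code. The architecture (two ReLU hidden layers of width 32), the Gaussian weight and bias scales, the sample count, and all function names are assumptions chosen for illustration. Kolmogorov complexity is uncomputable, so the zlib-compressed length of the truth table is used here as a crude, computable stand-in.

```python
import zlib
from collections import Counter

import numpy as np


def all_boolean_inputs(n):
    """Return all 2**n points of {0,1}^n as a (2**n, n) float array."""
    idx = np.arange(2 ** n)
    return ((idx[:, None] >> np.arange(n)) & 1).astype(float)


def sample_function(x, rng, width=32, sigma_w=1.0, sigma_b=1.0):
    """Draw random weights for a 2-hidden-layer ReLU net (assumed architecture)
    and return the Boolean function it expresses as a string of 0s and 1s."""
    sizes = [x.shape[1], width, width, 1]
    h = x
    for layer, (fan_in, fan_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        w = rng.normal(0.0, sigma_w / np.sqrt(fan_in), size=(fan_in, fan_out))
        b = rng.normal(0.0, sigma_b, size=fan_out)
        h = h @ w + b
        if layer < len(sizes) - 2:  # ReLU on hidden layers only
            h = np.maximum(h, 0.0)
    return "".join("1" if v > 0 else "0" for v in h[:, 0])


def complexity_proxy(bits):
    """Compressed length in bytes: a crude proxy, not Kolmogorov complexity."""
    return len(zlib.compress(bits.encode("ascii"), 9))


def simplicity_bias_experiment(n=7, n_samples=2000, seed=0):
    """Compare functions sampled from random networks with uniformly random
    truth tables of the same length."""
    rng = np.random.default_rng(seed)
    x = all_boolean_inputs(n)

    # Empirical prior over functions: how often each truth table is sampled.
    counts = Counter(sample_function(x, rng) for _ in range(n_samples))
    net_mean = sum(complexity_proxy(f) * c for f, c in counts.items()) / n_samples

    # Baseline: truth tables drawn uniformly from all 2**(2**n) functions.
    uniform = [
        "".join(map(str, rng.integers(0, 2, size=2 ** n)))
        for _ in range(n_samples)
    ]
    uniform_mean = sum(complexity_proxy(f) for f in uniform) / n_samples

    return {
        "counts": counts,
        "net_mean_complexity": net_mean,
        "uniform_mean_complexity": uniform_mean,
    }


if __name__ == "__main__":
    result = simplicity_bias_experiment()
    top_function, top_count = result["counts"].most_common(1)[0]
    print("most frequent function:", top_function)
    print("times sampled:", top_count)
    print("mean proxy complexity (network):", result["net_mean_complexity"])
    print("mean proxy complexity (uniform):", result["uniform_mean_complexity"])
```

Under uniform sampling, any given 128-bit truth table would essentially never repeat in a few thousand draws, so repeated functions among the network samples indicate a non-uniform prior. Comparing the two mean compressed lengths gives a coarse view of whether that prior favors compressible functions. The sketch covers only the prior, not the likelihood or posterior mentioned in the abstract.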

https://caltech.zoom.us/rec/share/8KJQ21y5kikJqrHDqSwQH3Fl8ra7sx7uV8nh4x3lRUUtyEbWvD8eO_56Hboj7eaT.F2mMqEy0VbeLtlea Passcode: K3GN.5+i

For more information, please contact Diana Bohler by phone at 626-395-1768 or by email at [email protected].