Gradient Descent with Identity Initialization Efficiently Learns Positive Definite Linear Transformations by Deep Residual Networks
Residual networks are deep neural networks in which, roughly, subnetworks determine how a feature transformation should differ from the identity, rather than how it should differ from zero. After enabling the winning entry in the ILSVRC 2015 classification task, they have become established as a central idea in deep networks.
Hardt and Ma provided a theoretical analysis that shed light on residual networks. They showed that (a) any linear transformation with a positive determinant and a bounded condition number can be approximated by a "deep linear network", where, if there are enough layers, each layer is close to the identity, and (b) for networks that compose near-identity transformations this way, and if the loss is large, then the gradient is steep. These results are interesting because they suggest that, in many cases, this non-convex objective may be efficiently optimized through gradient descent if the layers stay close to the identity.
We continue this line of research, examining when gradient descent of a "deep linear network" with identity initialization converges to a target linear transformation. We provide polynomial bounds on the number of iterations for gradient descent to approximate a target linear mapping, in the case where an initial hypothesis that applies the identity transformation at each layer has loss bounded by a small enough constant. On the other hand, we show that gradient descent fails to converge for targets whose distance from the identity is a larger constant, and we show that some forms of regularization toward the identity in each layer do not help. If the target is symmetric positive definite, we show that an algorithm with identity initialization learns an approximation of the target using a number of updates polynomial in all of the natural parameters of the problem. In contrast, we show that if the target is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge. We analyze an algorithm for the case that the target is positive definite but may not be symmetric. This algorithm uses two regularizers: one that maintains the invariant that the hypothesis is positive definite, and another that "balances" the layers so that they have the same singular values.
This is joint work with Peter Bartlett and Dave Helmbold.