# A Bayesian Perspective on Generalization and Stochastic Gradient Descent

## Introduction

This work builds on Zhang et al.(2016), who showed deep neural networks can easily memorize randomly labeled training data, despite generalizing well on real labels of the same inputs. The authors consider two questions: how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? They show that the same phenomenon occurs in small linear models. These observations are explained by the Bayesian evidence, which penalizes sharp minima but is invariant to model parameterization. We also demonstrate that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy. We propose that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large. Interpreting stochastic gradient descent as a stochastic differential equation, we identify the “noise scale” [math] g \approx \epsilon N/B [/math] where [math]ε[/math] is the learning rate, [math]N[/math] the training set size and [math]B[/math] the batch size. Consequently the optimum batch size is proportional to both the learning rate and the size of the training set, [math]B_{opt} \propto \epsilon N[/math] . They verify these predictions empirically.

## Motivation

This paper shows Bayesian principles can explain many recent observations in the deep learning literature, while also discovering practical new insights. Zhang et al. (2016) trained deep convolutional networks on ImageNet and CIFAR10, achieving excellent accuracy on both training and test sets. They then took the same input images, but randomized the labels, and found that while their networks were now unable to generalize to the test set, they still memorized the training labels. They claimed these results contradict learning theory, although this claim is disputed (Kawaguchi et al., 2017; Dziugaite & Roy, 2017). Nonetheless, their results beg the question; if our models can assign arbitrary labels to the training set, why do they work so well in practice? Meanwhile, Keskar et al. (2016) observed that if we hold the learning rate fixed and increase the batch size, the test accuracy usually falls. This striking result shows improving our estimate of the full-batch gradient can harm performance. Goyal et al. (2017) observed a linear scaling rule between batch size and learning rate in a deep ResNet, while Hoffer et al. (2017) proposed a square root rule on theoretical grounds. Many authors have suggested “broad minima” whose curvature is small may generalize better than “sharp minima” whose curvature is large (Chaudhari et al., 2016; Hochreiter & Schmidhuber, 1997). Indeed, Dziugaite & Roy (2017) argued the results of Zhang et al. (2016) can be understood using “nonvacuous” PAC-Bayes generalization bounds which penalize sharp minima, while Keskar et al. (2016) showed stochastic gradient descent (SGD) finds wider minima as the batch size is reduced. However, Dinh et al. (2017) challenged this interpretation, by arguing that the curvature of a minimum can be arbitrarily increased by changing the model parameterization.

## Contribution

The main contributions of this paper are to show that:

- The results of Zhang et al. (2016) are not unique to deep learning; it is observed the same phenomenon in a small “over-parameterized” linear model. Overparameterization occurs when a model is able to effectively “remember” training data. This occurs when there are enough parameters that the system of equations ends up with an infinite number of possible solutions. One can see why this over-training would lead to poor results in test cases, as this “memorization” learns noise as opposed to the inherent structure of different classes. It is demonstrated that this phenomenon is straightforwardly understood by evaluating the Bayesian evidence in favor of each model, which penalizes sharp minima but is invariant to the model parameterization.
- SGD integrates a stochastic differential equation whose “noise scale” [math]g ≈ εN/B[/math], where

ε is the learning rate, [math]N[/math] training set size and [math]B[/math] batch size. Noise drives SGD away from sharp minima, and therefore there is an optimal batch size which maximizes the test set accuracy. This optimal batch size is **proportional to the learning rate and training set size**.

Zhang et al. (2016) showed high training competency of neural networks under informative labels, but drastic overfitting on improper labels. This implies weak generalizability even when a small proportion of labels are improper. The authors show that generalization is strongly correlated with the Bayesian evidence, a weighted combination of the depth of a minimum (the cost function) and its breadth (the Occam factor). Bayesians tend to make distributional assumptions on gradient updates by adding isotropic Gaussian noise. This paper builds upon these Bayesian principles by driving SGD away from sharp minima, and towards broad minima (the more broad, the better generalization due to less influence from small perturbations within input). The stochastic differential equation used as a component of gradient updates effectively serves as injected noise that improves a network's generalizability.

## Main Results

The weakly regularized model memorizes random labels, however, generalizes properly on informative labels. Besides, the predictions are overconfident. The authors also showed that the test accuracy peaks at an optimal batch size, if one holds the other SGD hyper-parameters constant. It is postulated that the optimum represents a tradeoff between depth and breadth in the Bayesian evidence. However it is the underlying scale of random fluctuations in the SGD dynamics which controls the tradeoff, not the batch size itself. Furthermore, this test accuracy peak shifts as the training set size rises. The authors observed that the best found batch size is proportional to the learning rate. This scaling rule allowed the authors to increase the learning rate by simultaneously increasing the batch size with no loss in test accuracy and no increase in computational cost, thus parallelism across multiple GPU's can be fully leveraged to easily decrease training time. The scaling rule could also be applied to production models by consequentially increasing the batch size as new training data is introduced.

## Conclusion

The paper showed that Mini-batch noise helps SGD to go away from sharp minima, and provided an evidence that there is an optimal optimum batch size for a maximum the test accuracy. Based on interpreting SGD as integrating stochastic differential equation, this batch size is proportional to the learning rate and the training set size. Moreover, the authors shown that Bopt ∝ 1/(1 − m), where m is the momentum coefficient. More analysis was done on the relation between the learning rate, effective learning rate, and batch size is presented in ICLR 2018, where the authors proved by experiments that all the benefits of decaying the learning rate are achieved by increasing the batch size in addition to reducing the number of parameter updates dramatically, and also were able use literature parameters without the need of any hyper parameter tuning (Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, Quoc V. Le).

## References

Chaudhari, Pratik, et al. "Entropy-sgd: Biasing gradient descent into wide valleys." arXiv preprint arXiv:1611.01838 (2016).

Dziugaite, Gintare Karolina, and Daniel M. Roy. "Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data." arXiv preprint arXiv:1703.11008 (2017).

Germain, Pascal, et al. "Pac-bayesian theory meets bayesian inference." Advances in Neural Information Processing Systems. 2016.

Goyal, Priya, et al. "Accurate, large minibatch SGD: training imagenet in 1 hour." arXiv preprint arXiv:1706.02677 (2017).

Gull, Stephen F. "Bayesian inductive inference and maximum entropy." Maximum-entropy and Bayesian methods in science and engineering. Springer, Dordrecht, 1988. 53-74.

Hoffer, Elad, Itay Hubara, and Daniel Soudry. "Train longer, generalize better: closing the generalization gap in large batch training of neural networks." Advances in Neural Information Processing Systems. 2017. Kass, Robert E., and Adrian E. Raftery. "Bayes factors." Journal of the american statistical association 90.430 (1995): 773-795.

Kawaguchi, Kenji, Leslie Pack Kaelbling, and Yoshua Bengio. "Generalization in deep learning." arXiv preprint arXiv:1710.05468 (2017).

Keskar, Nitish Shirish, et al. "On large-batch training for deep learning: Generalization gap and sharp minima." arXiv preprint arXiv:1609.04836 (2016).

MacKay, David JC. "A practical Bayesian framework for backpropagation networks." Neural computation 4.3 (1992): 448-472.

Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking generalization." arXiv preprint arXiv:1611.03530 (2016).