SLIDE 6: Conventional Wisdom challenged
(Against Wisdom 2) Small LR can generalize as well as large LR.
(Against Wisdom 3) Random walk/SDE view of SGD is way off. There is no evidence of mixing as traditionally understood, at least within normal training times. Loss ↓ as Hessian ↑ when x → 0.
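A minimal numerical sketch of that last point, assuming it refers to scale-invariant losses: for a BatchNorm network, L(cx) = L(x) for all c > 0, which forces ∇²L(cx) = ∇²L(x)/c², so the Hessian blows up as ‖x‖ → 0 while the loss value is unchanged. The toy loss and test point below are illustrative, not from the slide:

```python
# Sketch: for a scale-invariant loss L(c*x) = L(x), the Hessian scales as 1/c^2,
# so sharpness blows up as ||x|| -> 0 (e.g. BatchNorm nets shrunk by weight decay).
import numpy as np

def scale_invariant_loss(x):
    # Any function of the direction x/||x|| alone is scale-invariant (toy example).
    u = x / np.linalg.norm(x)
    return np.sin(3 * u[0]) + u[1] ** 2

def hessian(f, x, eps=1e-4):
    # Central finite-difference Hessian.
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i], np.eye(d)[j]
            H[i, j] = (f(x + eps*e_i + eps*e_j) - f(x + eps*e_i - eps*e_j)
                       - f(x - eps*e_i + eps*e_j) + f(x - eps*e_i - eps*e_j)) / (4 * eps**2)
    return H

x = np.array([1.0, 2.0, -0.5])
for c in [1.0, 0.5, 0.1]:
    sharpness = np.linalg.eigvalsh(hessian(scale_invariant_loss, c * x)).max()
    print(f"c={c:4}  loss={scale_invariant_loss(c * x):.4f}  sharpness={sharpness:.2f}")
# The loss value stays fixed while the top Hessian eigenvalue grows like 1/c^2.
```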
Stochastic Weight Averaging improves accuracy (Izmailov et al., 2018).
SWA's success rules out the picture of SGD as a diffusion process mixing to a unique global equilibrium: if the late iterates were already samples from a stationary distribution, averaging them would not systematically improve accuracy.
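As a concrete illustration, here is a minimal SWA sketch using PyTorch's torch.optim.swa_utils; the tiny model, synthetic data, and hyperparameters (swa_start, learning rates) are placeholders, not the paper's setup:

```python
# Sketch of Stochastic Weight Averaging (Izmailov et al., 2018) with PyTorch's
# built-in helpers. Model, data, and hyperparameters are toy stand-ins.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

model = nn.Sequential(nn.Linear(10, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Linear(32, 2))
data = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
train_loader = DataLoader(data, batch_size=32, shuffle=True)
criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
swa_model = AveragedModel(model)               # running average of the weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05)  # constant LR during the SWA phase
swa_start = 75                                 # epoch at which averaging begins

for epoch in range(100):
    for x, y in train_loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)     # average in the current SGD iterate
        swa_scheduler.step()

update_bn(train_loader, swa_model)  # recompute BatchNorm statistics for the averaged model
```

The averaged model typically tests better than the final SGD iterate, which is exactly what a mixed-equilibrium picture would not predict.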
(Against Wisdom 1): Full-batch gradient descent ≠ gradient flow.
[Figures: ResNet with BN trained by full-batch GD + WD on subsampled CIFAR-10 (left); ResNet with BN trained by SGD with different LR schedules on CIFAR-10 (right). Both panels: same setting, longer training.]
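A toy 1-D sketch of why the two differ (an illustration, not the slide's ResNet experiment): on a quadratic with sharpness λ, GD contracts by (1 − ηλ) per step while gradient flow decays like e^(−ηλt), so once ηλ > 2 GD oscillates and diverges even though the flow still converges:

```python
# Sketch: full-batch GD vs. its gradient-flow idealization on f(x) = 0.5 * lam * x**2.
# The lr*sharpness values are illustrative.
import numpy as np

def gd(x0, lr, lam, steps):
    x = x0
    for _ in range(steps):
        x = x - lr * lam * x                   # discrete GD step: multiply by (1 - lr*lam)
    return x

def gradient_flow(x0, lr, lam, steps):
    return x0 * np.exp(-lr * lam * steps)      # exact ODE solution at time t = lr*steps

for lr_lam in [0.1, 1.9, 2.1]:
    x_gd = gd(1.0, lr_lam, 1.0, 50)
    x_flow = gradient_flow(1.0, lr_lam, 1.0, 50)
    print(f"lr*sharpness={lr_lam}: GD -> {x_gd:.3g}, flow -> {x_flow:.3g}")
# Below lr*sharpness = 2 both converge; past 2, GD oscillates and diverges while
# gradient flow still decays monotonically -- the discrete and continuous dynamics differ.
```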