DEEP LEARNING WITH COTS HPC SYSTEMS
Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, Andrew Ng; Proceedings of the 30th International Conference on Machine Learning, PMLR 28(3):1337-1345, 2013.
MODELS NOW
Deep learning models keep getting larger and larger.
The large distributed machines needed to run certain networks will be out of reach for many researchers; Google's 1-billion-parameter network (Le et al., 2012), for example, was trained on roughly 1,000 machines with 16,000 CPU cores.
This paper instead uses a cluster of 16 commodity off-the-shelf (COTS) servers, each with 4 GPUs, linked by Infiniband and MPI to provide the high-speed communications needed to coordinate gradient computation.
The network follows Le et al. (2012): locally connected filtering, pooling, and local contrast normalization layers repeated three times. It is trained with mini-batch SGD with momentum.
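As a reminder, a minimal NumPy sketch of SGD with classical momentum (the step size and momentum values below are illustrative, not taken from the paper):

import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    # Classical momentum: keep a decaying running sum of past gradients
    # and step along it.
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Usage: one velocity buffer per parameter tensor, carried across minibatches.
w = np.zeros(4)
v = np.zeros_like(w)
for _ in range(3):
    g = np.ones(4)               # stand-in for a minibatch gradient
    w, v = sgd_momentum_step(w, g, v)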
GPUs
Each neuron is connected only to a small local receptive field of the input rather than to the entire image. Neurons that share an identical receptive field are grouped into blocks, so the block's response Y = WX can be calculated efficiently by allowing us to use dense matrix multiplication, with W holding the block's filters and X holding the shared input patch for each example. Larger blocks give better efficiency, but the step size between receptive fields then needs to be increased.
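A minimal NumPy sketch of this blocking trick (the block and patch sizes are illustrative, not the paper's):

import numpy as np

# A block of 32 neurons that share one 10x10x3 receptive field,
# evaluated on a minibatch of 96 images.
patch_dim, block_size, batch = 10 * 10 * 3, 32, 96

W = np.random.randn(block_size, patch_dim)   # one filter row per neuron in the block
X = np.random.randn(patch_dim, batch)        # the shared input patch, one column per example

# Because every neuron in the block reads the same patch, the whole block's
# responses Y = WX come from a single dense GEMM instead of many sparse products.
Y = W @ X                                    # shape (block_size, batch)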
The model is distributed across the GPUs: each GPU computes the responses of only the neurons assigned to it, and weights are stored on the GPU that owns their respective neurons. To do so, a GPU only needs to fetch the intersection of the other GPUs' output and its own receptive fields, so most data stays local.
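A toy 1-D sketch of working out which indices have to be exchanged (the function name and ranges are hypothetical):

def needed_slice(my_rf, other_output):
    # my_rf:        (start, end) of the input indices my neurons read
    # other_output: (start, end) of the indices another GPU produces
    # Both are half-open 1-D ranges; a real receptive field is 2-D, so the
    # same intersection would be taken along each spatial axis.
    start, end = max(my_rf[0], other_output[0]), min(my_rf[1], other_output[1])
    return (start, end) if start < end else None   # None: nothing to exchange

# Example: my neurons read inputs [40, 120); a peer GPU owns outputs [0, 64).
print(needed_slice((40, 120), (0, 64)))   # -> (40, 64)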
Benchmarks time the individual layers, averaged over repeated runs, covering the forward pass, the objective function, and the full backwards pass. The speedups reported in the paper are measured at low GPU counts, and the approach is expected to do better with larger systems.
ImageNet
With this cluster the authors train a 1-billion-parameter network on just 3 machines in a couple of days and scale to a network with over 11 billion parameters on 16 machines, reproducing the unsupervised feature-learning results of Le et al. (2012) with far less hardware.
LARGE BATCH OPTIMIZATION FOR DEEP LEARNING: TRAINING BERT IN 76 MINUTES
Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho-Jui Hsieh; ICLR 2020.
Large mini-batches let training run data parallel across many accelerators, but naively increasing the batch size degrades test performance.
The paper analyzes stochastic nonconvex problems like that of the first equation, sketched below. SGD applies a single step size to every layer, so even for a tuned learning rate, layers that are poorly conditioned (i.e., have a large Lipschitz constant) can cause slow convergence.
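A hedged reconstruction of that setup (notation is mine, following the paper's standard assumptions): the objective and the layer-wise smoothness condition are

\min_{x \in \mathbb{R}^d} f(x) = \mathbb{E}_{s \sim \mathbb{P}}\big[\ell(x, s)\big]
\|\nabla_i f(x) - \nabla_i f(y)\| \le L_i \|x - y\| \quad \text{for every layer } i

SGD's guarantee ends up governed by the largest of the L_i, which is what motivates the layer-wise adaptive methods below.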
LARS (You et al., 2017) addresses this by scaling each layer's update by the norm of that layer's parameters, an approach that previously pushed ResNet-50 training to very large batches without losing accuracy.
The analysis shows that layer-wise normalization discards the scale of the stochastic gradients at the cost of little bias. The resulting LARS bound improves on SGD when the gradient is denser than the curvature and stochasticity, while LAMB keeps the second moment used in ADAM for element-wise adaptivity and improves on SGD when the gradient is denser than the stochasticity. In both cases the bounds can be better than SGD, since they use the average Lipschitz constant rather than the maximum one.
Weight decay is applied in the decoupled style of ADAMW and tuned.
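Putting these pieces together, a per-layer NumPy sketch of the LAMB update (hyperparameter values are illustrative defaults, not the BERT settings):

import numpy as np

def lamb_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    # One LAMB update for a single layer's weights w given its gradient g.
    # m, v are the Adam moments for this layer; t is the 1-based step count.
    m = beta1 * m + (1 - beta1) * g                 # first moment (as in Adam)
    v = beta2 * v + (1 - beta2) * g * g             # second moment (as in Adam)
    m_hat = m / (1 - beta1 ** t)                    # bias correction
    v_hat = v / (1 - beta2 ** t)
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w   # ADAMW-style decay
    w_norm, u_norm = np.linalg.norm(w), np.linalg.norm(update)
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0  # layer-wise trust ratio
    return w - lr * trust * update, m, v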
The experiments train the BERT model on a TPUv3 pod. LAMB holds the target accuracy as the batch size grows through 16K and 32K, and optimization proceeds smoothly even at an extremely large batch size of 64K. Using a batch size of 64K for the short-sequence phase and 32K afterwards, BERT pre-training drops from days to about 76 minutes.
Further ablations adjust the setup to improve accuracy and performance. Normalized gradients have been shown to work better than standard gradients for some optimizers, but with LAMB a normalized gradient is roughly comparable with a standard gradient.