Deep Learning on HPC: Performance Factors and Lessons Learned



SLIDE 1

Deep Learning on HPC: Performance Factors and Lessons Learned

Weijia Xu, Scalable Computational Intelligence Group, Texas Advanced Computing Center, The University of Texas at Austin. BenchCouncil’19, Denver, CO


SLIDE 2

Outline

  • Motivating applications at TACC
  • Challenges of running deep learning on HPC
  • Scalability and accuracy
  • Scalability and I/O
  • Memory error impact
  • Conclusions and discussions


SLIDE 3

Motivating Applications

  • Traffic camera video analysis
  • In collaboration with the City of Austin
  • ~540 MB/hour of MPEG-4 video from one camera
  • ~100 GB for a typical study of a single camera
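A back-of-envelope check (my arithmetic, not on the slide) shows what those two figures imply about how much footage a typical study covers:

```python
# ~540 MB of MPEG-4 video per camera-hour implies a ~100 GB study
# covers roughly a week of continuous footage from one camera.
mb_per_hour = 540
study_gb = 100
hours = study_gb * 1000 / mb_per_hour  # ~185 camera-hours
days = hours / 24                      # ~7.7 days of continuous recording
```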

[1] “Deep learning methods to leverage traffic monitoring cameras for pedestrian data applications,” Weijia Xu, Natalia Ruiz, Ruizhu Huang, Joel Meyer, Jen Duthie, and John Clary, 26th ITS World Congress (Best Technical Paper). [2] “Detecting Pedestrian Crossing Events in Large Video Data from Traffic Monitoring Cameras,” Weijia Xu, Natalia Ruiz, Kelly Pierce, Ruizhu Huang, Joel Meyer, and Jen Duthie, to appear, IEEE BigData 2019.

SLIDE 4

Motivating Applications

  • There are over 400 CCTV IP cameras within the city limits of Austin
  • Mostly used only for manual monitoring
  • With deep learning, we can:
  • Learn more about traffic patterns
  • Understand how roads are used
  • Improve pedestrian safety
  • And encounter a lot of the unexpected…


SLIDE 5

Motivating Applications

  • Neural image resolution enhancement with a super-resolution generative adversarial network
  • In collaboration with the Salk Institute
  • ~600 GB neural image dataset
  • PyTorch + fastai; each run of the early version takes ~24 hours on 16 NVIDIA V100 GPUs

[bioRxiv’19] Fang, L., Monroe, F., Novak, S.W., Kirk, L., Schiavon, C.R., Seungyoon, B.Y., Zhang, T., Wu, M., Kastner, K., Kubota, Y., and Zhang, Z., 2019. Deep Learning-Based Point-Scanning Super-Resolution Imaging. bioRxiv, p.740548.

SLIDE 6

Motivating Applications

  • Face recognition
  • In collaboration with NASA JPL
  • ~100 GB of image data
  • TensorFlow + Horovod
  • Each run takes ~12 hours on 16 NVIDIA GTX 1080 Ti GPUs

[1] [DLS’19-1] Mattmann, Chris A., and Zhang, Z., “Deep Facial Recognition with TensorFlow,” The 3rd Deep Learning on Supercomputers Workshop, in conjunction with SC’19, Denver, CO. [2] Courtesy image from: https://megapixels.cc/datasets/msceleb/

SLIDE 7

More DL Applications at TACC

  • Deep learning is both compute intensive and data intensive.


SLIDE 8

Scale up vs. Scale out

  • Scale Up
  • Better and faster GPU cards / specialized hardware, e.g., TPUs
  • High acquisition cost to build a large cluster
  • Scale Out
  • Using more computing nodes
  • Consistent with traditional HPC operations
  • Specific challenges:
  • Accuracy vs. scalability
  • I/O issues


SLIDE 9

The Race of ResNet50

  • 90-epoch ResNet-50 training finished in 20 minutes on 2,048 KNL nodes with 74.9% top-1 accuracy
  • Against an 8-GPU baseline


[Figure: ResNet-50 ImageNet Training Acceleration from 2016~2019. Relative speedups of 1, 29, 28, 56, 116, 264, 466, and 791 across He et al. (Microsoft), Goyal et al. (Facebook), Codreanu et al. (SURF & Intel), You et al. (UC Berkeley & TACC), Preferred Networks, Tencent & HKBU, Sony Research, and Google (1,024 TPUv3).]

SLIDE 10

Accuracy vs Scalability

  • To yield high utilization at scale, we need to feed enough data (computation), which results in a large batch size
  • Validation (test) accuracy is sensitive to batch size
  • A large batch size can result in degraded validation accuracy

  • Layer-wise Adaptive Rate Scaling (LARS) algorithm [1]
  • Intuition: the learning rate should be adjusted according to the norm of the weights in each layer
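That intuition can be sketched in a few lines (a simplified illustration of the layer-wise trust ratio, not the authors' implementation; the trust_coef and weight_decay values are illustrative defaults):

```python
import math

def lars_local_lr(weights, grads, trust_coef=0.001, weight_decay=5e-4):
    """Layer-wise trust ratio: scale this layer's step by ||w|| / (||g|| + wd * ||w||)."""
    w_norm = math.sqrt(sum(w * w for w in weights))
    g_norm = math.sqrt(sum(g * g for g in grads))
    if w_norm == 0 or g_norm == 0:
        return 1.0  # fall back to the global rate when a norm is degenerate
    return trust_coef * w_norm / (g_norm + weight_decay * w_norm)

# A layer whose weights are large relative to its gradients takes bigger steps,
# so each layer advances at a pace matched to its own scale.
layer_lr = 0.1 * lars_local_lr([1.0] * 100, [0.01] * 100)  # global LR of 0.1
```

Each layer thus gets its own effective rate, which is what lets the global batch size grow without the most sensitive layers diverging.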

[1] You, Yang, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. “ImageNet Training in Minutes.” In Proceedings of the 47th International Conference on Parallel Processing, p. 1. ACM, 2018. (Best Paper)

SLIDE 11

Scalable Training Algorithm

  • Using a batch size of 32K while preserving validation accuracy


SLIDE 12

Scalable Training Algorithm

  • Using a batch size of 32K on Intel Xeon Phi 7250 (KNL) and Intel Xeon Platinum 8160 (SKX) nodes

You, Yang, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. “ImageNet Training in Minutes.” In Proceedings of the 47th International Conference on Parallel Processing, p. 1. ACM, 2018. (Best Paper)

SLIDE 13

Scalability vs Data I/O

  • ResNet-50 with ImageNet on 16 NVIDIA 1080 Ti GPUs, mini-batch of 64 per GPU


[Figure: training throughput (imgs/sec) vs. number of nodes (1, 4, 8, 16; 4 GPUs per node). Ideal scaling: 544, 2,176, 4,352, 8,704 imgs/sec; Lustre falls far below ideal beyond a single node.]

SLIDE 14

I/O on Lustre


Dataset        | # files      | # dirs | total_size | file_size
ImageNet       | 1.3 million  | 2002   | 140 GB     | KB-MB
Neural Image   | 0.6 million  | 6      | 500 GB     | MB
Reactor Status | 0.17 million | 1      | 65 GB      | KB

SLIDE 15

Deep Learning I/O

  • DL’s long-lasting, repeated, high-volume, and highly concurrent file access can easily saturate the metadata and data services of a traditional shared file system
  • ResNet-50 with Keras, TensorFlow, and Horovod on 16 nodes, each with 4 GPUs:
  • 128K stat() and readdir() operations with 64-way concurrent access
  • 117M stat(), open(), and close() operations with 256-way concurrent access
  • ~180M read() operations with the same concurrency
  • ~8 hour duration
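These counts are consistent with every training file being touched once per epoch. One plausible decomposition (my arithmetic; the epoch count is an assumption, not stated on the slide):

```python
# Decomposing the observed ~117M metadata operations: three calls
# (stat, open, close) per file per epoch over the full ImageNet dataset.
n_files = 1_300_000   # ImageNet training files (from the I/O table)
calls_per_file = 3    # stat() + open() + close()
epochs = 30           # assumed epoch count; not stated on the slide
total_calls = n_files * calls_per_file * epochs
```

Under those assumptions the product lands exactly on the observed order of magnitude, which is why the metadata server, not raw bandwidth, becomes the bottleneck.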

[DLS’19-2] Zhang, Z., Huang, L., Pauloski, J. G., and Foster, Ian T., “Aggregating Local Storage for Scalable Deep Learning I/O,” The 3rd Deep Learning on Supercomputers Workshop, in conjunction with SC’19, Denver, CO


SLIDE 16

FanStore Design

  • FanStore is a transient runtime file system that optimizes I/O for distributed DL training [1]
  • Data is partitioned (optionally compressed) and spread across local storage space
  • File access functions are intercepted and handled in user space
  • Remote file access is in the form of an MPI round-trip message
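A minimal sketch of the partitioning idea (my illustration, not FanStore's actual code): every rank computes the same deterministic file-to-rank mapping, so a read is either a local lookup or a round-trip request to the owning rank.

```python
import zlib

def owner_rank(path: str, n_ranks: int) -> int:
    """Deterministic mapping: every rank agrees on which rank holds each file."""
    return zlib.crc32(path.encode()) % n_ranks

def read_file(path, my_rank, n_ranks, local_store, request_remote):
    owner = owner_rank(path, n_ranks)
    if owner == my_rank:
        return local_store[path]        # served from node-local storage
    return request_remote(owner, path)  # stands in for the MPI round-trip

# Toy usage: partition 8 files across 4 ranks, then read them all from rank 0.
n_ranks = 4
files = {f"img_{i}.jpg": b"data" for i in range(8)}
stores = {r: {p: d for p, d in files.items() if owner_rank(p, n_ranks) == r}
          for r in range(n_ranks)}
fetch = lambda r, p: stores[r][p]       # simulate the remote request locally
assert all(read_file(p, 0, n_ranks, stores[0], fetch) == b"data" for p in files)
```

Because the mapping is pure and shared, no central metadata server is consulted on the read path, which is the property that removes the Lustre bottleneck.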

[1] Zhang, Z., Huang, L., Pauloski, J. G., and Foster, Ian T., “Aggregating Local Storage for Scalable Deep Learning I/O,” The 3rd Deep Learning on Supercomputers Workshop, in conjunction with SC’19, Denver, CO


SLIDE 17

ResNet-50 GPU Results


[Figure: training throughput (imgs/sec) vs. number of nodes (1, 4, 8, 16; 4 GPUs per node). Ideal: 544, 2,176, 4,352, 8,704 imgs/sec; FanStore: 544, 1,902, 4,050, 7,867 imgs/sec, close to ideal at every scale; Lustre lags far behind.]

  • 4 NVIDIA 1080 Ti GPUs per node, mini-batch of 64 per GPU

SLIDE 18

ResNet-50 CPU Results


[Figure: training throughput (imgs/sec) vs. number of nodes (1, 64, 128, 256, 512). Ideal: 32, 2,048, 4,096, 8,192, 16,384 imgs/sec; FanStore: 32, 1,968, 3,901, 7,710, 15,109 imgs/sec, about 92% of ideal at 512 nodes.]

  • Intel Xeon Platinum 8160 nodes on Stampede2, mini-batch size of 256 per node
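From the figure's data points, FanStore's scaling efficiency can be read off directly (my arithmetic, not stated on the slide):

```python
# Throughput (imgs/sec) at each node count, read off the CPU-results figure.
ideal    = {1: 32, 64: 2048, 128: 4096, 256: 8192, 512: 16384}
fanstore = {1: 32, 64: 1968, 128: 3901, 256: 7710, 512: 15109}
efficiency = {n: fanstore[n] / ideal[n] for n in ideal}
# FanStore sustains roughly 92% of ideal throughput even at 512 nodes.
```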

SLIDE 19

Memory Error in Deep Learning

  • The impact of memory errors on deep learning training is unclear, due to its stochastic nature and mathematical properties
  • Difficult for computing centers or individual researchers to make hardware procurement decisions
  • Difficult for users to estimate their confidence in training correctness on ECC-free processors
  • Potential performance gain from not using ECC
  • Goal: quantify the impact of memory errors on deep learning training and investigate alternative solutions for memory error detection

[Cluster’19] Zhang, Z., Huang, L., Huang, R., Xu, W., and Katz, D. S., “Quantifying the Impact of Memory Errors in Deep Learning,” IEEE Cluster 2019, Albuquerque, NM

SLIDE 20

Technical Approach

  • Focusing on impact from silent data corruption (SDC)
  • P(Failure) ≈ P(Failure, SDC) = P(Failure | SDC) × P(SDC)
  • To evaluate P(Failure | SDC)
  • Sampling in the experiment design space
  • Manually flipping the selected bit
  • Observing validation accuracy and training loss
  • Estimating P(Failure | SDC) via marginal probability
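Flipping a selected bit in a float32 value can be sketched with the standard library (an illustration of the injection idea, not the paper's tooling). It also shows why bit position matters so much:

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit (0 = lowest mantissa bit, 31 = sign) of a float32 value,
    simulating a single-bit silent data corruption."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", x))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped

print(flip_bit(1.0, 31))  # -1.0: a sign-bit flip only changes the sign
print(flip_bit(1.0, 30))  # inf: flipping a high exponent bit is catastrophic
print(flip_bit(1.0, 22))  # 1.5: a top mantissa-bit flip is a small perturbation
```

The bit positions tested in the experiments (31, 30, 29, 28, 27, 22) span exactly these three regimes: sign, exponent, and mantissa.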


SLIDE 21

Testing Applications


App      | SW          | Version | Nodes | Device     | Memory | Mem Usage | Run Time
ConvNet  | nvcaffe     | 0.16.5  | 1     | 2x 1080 Ti | 11 GB  | 0.45 GB   | 4.5 mins
LRCN     | caffe       | 1.0.0   | 1     | 1x 1080 Ti | 11 GB  | 3.9 GB    | 16 mins
ResNet50 | Intel-Caffe | 1.1.0   | 512   | KNL        | 96 GB  | 18.4 GB   | 8 mins

SLIDE 22

Example of ConvNet with Cifar10

  • ConvNet with Cifar10 dataset baseline
  • 50,000 training items / 10,000 validation items
  • 60,000 iterations / 120 epochs
  • Batch size: 100
  • Top-1 test accuracy acceptable range: 76.52% - 80.83%
  • Training loss acceptable range: 0.2594 - 0.4975


Parameter        | Value
Iteration        | 200, 10200, 20200, 30200, 40200, 50200, 60000
Phase            | forward, backward
Place            | data, model
Layers           | 1, 2, …, 15
Parameter Layers | 1, 2, …, 7
Data Position    | 0, mid, last
Bit Position     | 31, 30, 29, 28, 27, 22
Repetition       | 3

SLIDE 23

Key Observations

  • Training failure is independent of iteration number
  • Used part of the training process instead of complete runs
  • Errors on less significant bits lead to fewer training failures
  • Convolution layers have the most training failures, so we estimate the worst-case failure rate assuming every layer is a convolution layer
  • Training loss in the immediately following iteration is an effective signal for detecting catastrophic SDCs
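That last observation suggests a cheap online detector (a sketch of the idea only; the window size and threshold factor here are illustrative, not from the paper):

```python
import math

def loss_spike(loss_history, window=10, factor=3.0):
    """Flag a likely catastrophic SDC when the newest loss is NaN/inf
    or far above the average of the last `window` losses."""
    last = loss_history[-1]
    if math.isnan(last) or math.isinf(last):
        return True
    recent = loss_history[-window - 1:-1]
    return last > factor * (sum(recent) / len(recent))

healthy = [0.31, 0.30, 0.29, 0.30, 0.28, 0.29, 0.30, 0.28, 0.27, 0.28, 0.29]
corrupt = healthy[:-1] + [87.3]  # sudden jump right after an injected bit flip
```

Checking one scalar per iteration costs almost nothing compared with ECC, which is what makes loss-based detection attractive on ECC-free hardware.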


SLIDE 24

Memory Error Impact on DL Training


App      | P(SDC)    | P(F|SDC) | Scaling Factor | P(F)      | Expected Runs per Failure
ConvNet  | 3.07×10⁻⁶ | 1.76%    | 1              | 5.4×10⁻⁸  | 18.5 M
ResNet50 | 5.89×10⁻² | 1.22%    | 9              | 7.18×10⁻⁴ | 1,610
LRCN     | 5.19×10⁻³ | 0.61%    | 110            | 3.17×10⁻⁵ | 31,500

[Cluster’19] Zhang, Z., Huang, L., Huang, R., Xu, W., and Katz, D. S., “Quantifying the Impact of Memory Errors in Deep Learning,” IEEE Cluster 2019, Albuquerque, NM
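As an arithmetic check on the ConvNet row (my calculation, using the P(Failure) ≈ P(Failure | SDC) × P(SDC) decomposition from earlier; expected runs per failure is roughly the reciprocal of P(F)):

```python
# ConvNet row of the table: P(F) = P(SDC) * P(F | SDC),
# and the expected number of runs per failure is roughly 1 / P(F).
p_sdc = 3.07e-6
p_fail_given_sdc = 0.0176
p_fail = p_sdc * p_fail_given_sdc  # ~5.4e-8, matching the table
runs_per_failure = 1 / p_fail      # ~18.5 million runs per expected failure
```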

SLIDE 25

Conclusions

  • Scalable deep learning is required for large and future problems
  • Challenges for scaling out deep learning:
  • How to maintain accuracy while maximizing scalability?
  • Deep learning is also a big data problem with more computation, i.e., both I/O and compute intensive
  • Different hardware requirements


SLIDE 26

Discussions

  • Machine learning on HPC is a growing area in need of a dedicated benchmark
  • Science applications
  • Datasets
  • Models
  • Running on HPC
  • Accuracy and convergence
  • Communication and data bottlenecks
  • System cost, e.g., acquisition and operations
  • Overall system balance and reliability


SLIDE 27

THANKS

  • Distributed Deep Learning tutorial
  • 1:30 - 5:00 pm, Nov. 17, Room 201
  • Deep Learning on Supercomputers workshop
  • 9:00 am - 5:30 pm, Nov. 17, Room 502
  • Contact
  • Weijia Xu: xwj@tacc.utexas.edu
  • General data inquiry: data@tacc.utexas.edu
