SLIDE 1

A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

Yu (Emma) Wang, Gu-Yeon Wei, David Brooks Harvard University

3/3/2020
Contact: ywang03@g.harvard.edu

ParaDnn

github.com/Emma926/paradnn

SLIDE 2

Acknowledgement

Frank Chen, Glenn Holloway, Dan Janni, Peter Mattson, Lifeng Nai, David Patterson, Francesco Pontiggia, Parthasarathy Ranganathan, Vijay Reddi, Brennan Saeta, Zak Stone, Anitha Vijayakumar, Shibo Wang, Qiumin Xu, Doe Hyun Yoon, Cliff Young

SLIDE 3

Challenges with ML Benchmarking

  • Diversity of deep learning models: problem domains, models, and datasets
  • Pace of the field: state-of-the-art models evolve every few months
  • Varying evaluation metrics: accuracy, time to train, inference latency
  • Multi-disciplinary field: algorithms, systems, hardware, and ML software stacks

SLIDE 4

State of the art: MLPerf 0.6

Area           | Benchmark              | Dataset       | Model        | Reference Implementation
---------------|------------------------|---------------|--------------|-------------------------
Vision         | Image classification   | ImageNet      | ResNet-50    | TensorFlow
Vision         | Object detection       | COCO 2017     | Mask R-CNN   | PyTorch
Vision         | Object detection       | COCO 2017     | SSD-ResNet34 | PyTorch
Language/Audio | Translation            | WMT Eng-Germ  | Transformer  | TensorFlow
Language/Audio | Translation            | WMT Eng-Germ  | GNMT         | PyTorch
Commerce       | Recommendation         | MovieLens-20M | NCF          | PyTorch
Action         | Reinforcement learning | Go            | MiniGo       | TensorFlow


SLIDE 6

Our Methodology

ParaDnn


SLIDE 8

ParaDnn vs MLPerf

ParaDnn:

  • Avoid drawing conclusions based on several arbitrary models
  • Generate thousands of parameterized, end-to-end models
  • Prepare hardware designs for future models
  • Complement the use of existing real-world models, i.e. MLPerf

MLPerf:

  • Good for studying accuracy or convergence with real datasets
  • Represent the specific models some people care about

SLIDE 9

ParaDnn Canonical Models

  • Fully connected (FC): input → (# of layers) FC layers of (# of nodes) each → output
  • CNNs (residual, bottleneck): input → (# of residual/bottleneck blocks, with swept filter size) → 4 FC layers → output
  • RNNs (RNN, LSTM, GRU): input → (# of layers) of RNN/LSTM/GRU cells (swept cell size) → output
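For illustration, a minimal sketch of how such a parameterized sweep can be generated with Keras. The function name, dimensions, and sweep points are hypothetical, not the actual ParaDnn API (see github.com/Emma926/paradnn for the real implementation):

```python
# Sketch of a ParaDnn-style parameterized FC sweep (illustrative, not the real API).
import itertools
import tensorflow as tf

def make_fc_model(num_layers, num_nodes, input_dim=2000, output_dim=1000):
    """Build input -> num_layers fully connected layers of num_nodes each -> output."""
    layers = [tf.keras.layers.Dense(num_nodes, activation="relu",
                                    input_shape=(input_dim,))]
    layers += [tf.keras.layers.Dense(num_nodes, activation="relu")
               for _ in range(num_layers - 1)]
    layers.append(tf.keras.layers.Dense(output_dim))
    return tf.keras.Sequential(layers)

# Sweep depth and width on a log scale to cover many design points.
for num_layers, num_nodes in itertools.product([4, 16, 64], [32, 256, 2048]):
    model = make_fc_model(num_layers, num_nodes)
    print(num_layers, num_nodes, model.count_params())
```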

SLIDE 10

Models

SLIDE 11

Models

  • ParaDnn covers a larger range of model sizes than the real-world models
  • from ~10k to ~1 billion parameters (see the parameter-count sketch below)
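To make the size range concrete, a rough weight count for the FC family. Bias terms are ignored, and the input/output dimensions and sweep points are illustrative assumptions, not the exact ParaDnn bounds:

```python
def fc_param_count(num_layers, num_nodes, input_dim=2000, output_dim=1000):
    """Approximate weight count of input -> num_layers x num_nodes -> output."""
    return (input_dim * num_nodes                        # input projection
            + (num_layers - 1) * num_nodes * num_nodes   # hidden-to-hidden weights
            + num_nodes * output_dim)                    # output projection

print(fc_param_count(4, 64))     # 204,288: hundreds of thousands at the small end
print(fc_param_count(16, 8192))  # 1,031,208,960: about a billion at the large end
```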
SLIDE 12

Analysis Enabled by ParaDnn

  • Roofline analysis of TPU v2
  • Homogeneous Platform Comparison: TPU v2 vs v3
  • Heterogeneous Platform Comparison: TPU vs GPU
SLIDE 13-17

The Roofline Model

[Roofline chart, built up over five slides: attainable FLOP/s vs. arithmetic intensity, with a flat Peak FLOPS ceiling, a sloped Memory Bandwidth roof, a memory-intensive region under the sloped roof, and a compute-intensive region under the flat ceiling]
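The roofline model caps attainable throughput by both compute and memory: attainable FLOP/s = min(peak FLOP/s, arithmetic intensity × memory bandwidth). A minimal sketch; the per-core numbers are back-of-the-envelope figures derived from later slides (180 TFLOPS per board spread over 8 cores, and the 300 GB/s per-core bandwidth quoted on the hardware-platform slide), not official specifications:

```python
def roofline_flops(intensity, peak_flops, mem_bw):
    """Attainable FLOP/s for a kernel with the given arithmetic intensity (FLOPs/byte)."""
    return min(peak_flops, intensity * mem_bw)

# Illustrative TPU v2 per-core numbers (assumed): 180 TFLOPS/board over 8 cores
# gives ~22.5 TFLOPS per core, with 300 GB/s of memory bandwidth per core.
peak_flops = 22.5e12
mem_bw = 300e9

ridge = peak_flops / mem_bw                     # ~75 FLOPs/byte separates the two regimes
print(roofline_flops(10, peak_flops, mem_bw))   # memory-intensive: 3.0e12 FLOP/s
print(roofline_flops(200, peak_flops, mem_bw))  # compute-intensive: capped at 2.25e13 FLOP/s
```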

SLIDE 18

Transformer


SLIDE 19-21

FC Models

[FC models plotted on the TPU v2 roofline, with compute-bound and memory-bound points highlighted across the three slides]

ParaDnn sweeps a large range of models, from memory-bound to compute-bound.
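A rough arithmetic-intensity estimate for a single FC layer shows why the sweep crosses both regimes. The layer dimensions below are illustrative, and weights and activations are assumed to be 4-byte floats:

```python
def fc_intensity(batch, n_in, n_out, bytes_per_elem=4):
    """FLOPs per byte moved for one dense layer: y = x @ W."""
    flops = 2 * batch * n_in * n_out                           # multiply-accumulates
    bytes_moved = bytes_per_elem * (n_in * n_out               # weights
                                    + batch * (n_in + n_out))  # activations in/out
    return flops / bytes_moved

for batch in [16, 128, 1024]:
    print(batch, round(fc_intensity(batch, 4096, 4096), 1))
# 16 -> ~7.9, 128 -> ~60.2, 1024 -> ~341.3 FLOPs/byte: growing the batch size
# (or the layer width) pushes an FC model from memory-bound toward compute-bound.
```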

SLIDE 22

TPU v2 vs v3?


SLIDE 23-27

How to upgrade to TPU v3?

[Design-space build across five slides: starting from TPU v2, the candidate upgrades are TPU v3 (FLOPS↑), TPU v3 (Mem BW↑), and TPU v3 (FLOPS↑ + Mem BW↑); the final slide asks how much to scale each: TPU v2 → ?x FLOPS, ?x Mem BW → TPU v3]

SLIDE 28

Architecture of TPU v2 vs v3

Figure from https://cloud.google.com/tpu/docs/system-architecture

TPU v2: 180 TFLOPS / board; TPU v3: 420 TFLOPS / board
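The peak-compute ratio alone predicts the ceiling for compute-bound work on v3 versus v2. A quick check (the memory-bandwidth figure is not given on this slide, so only the FLOPS ratio is computed):

```python
tpu_v2_tflops = 180  # per board, from this slide
tpu_v3_tflops = 420  # per board, from this slide

print(tpu_v3_tflops / tpu_v2_tflops)  # ~2.33, matching the 2.3x compute-bound speedup shown next
```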

SLIDE 29

Google’s Choice of TPU v3

TPU v2 → TPU v3: 2.3x FLOPS, ?x Mem BW

SLIDE 30-33

TPU v3 vs v2: FC Operation Breakdown

[Per-operation speedup chart, annotated across the four slides:]

  • Compute-bound operations: 2.3x speedup
  • Memory-bound operations: 1.5x speedup
  • Memory-bound operations that benefit from the 2x memory capacity: 3x speedup

SLIDE 34

Google’s Choice of TPU v3

TPU v2 → TPU v3: 2.3x FLOPS, 1.5x Mem BW

SLIDE 35

TPU v3 vs v2: FC Operation Breakdown

ParaDnn provides a diverse set of operations and shows that different operations are sensitive to different system-component upgrades.

SLIDE 36

TPU vs GPU?

SLIDE 37-38

Hardware Platforms

[Platform-specification table; the annotation highlighted here: 300 GB/s of memory bandwidth per core]

SLIDE 39-40

FC and CNN

[Training dataflow diagrams. FC: a forward FC op over weights W and activations A, an FC-gradient op, and a weighted-sum op producing gradients G. CNN: the same structure with Conv and Conv-gradient ops in place of the FC ops.]

CNNs have fewer weights and larger Conv ops.
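To make "fewer weights, larger Conv ops" concrete, compare a rough intensity estimate for a 3x3 convolution against the FC layer sketched earlier. Dimensions are again illustrative, with 4-byte floats assumed:

```python
def conv_intensity(batch, h, w, c_in, c_out, k=3, bytes_per_elem=4):
    """FLOPs per byte moved for one k x k, stride-1, same-padded conv layer."""
    flops = 2 * batch * h * w * c_in * c_out * k * k
    bytes_moved = bytes_per_elem * (k * k * c_in * c_out               # weights (few)
                                    + batch * h * w * (c_in + c_out))  # activations
    return flops / bytes_moved

# 3x3 conv, 256 -> 256 channels, 28x28 feature map, batch 128:
print(round(conv_intensity(128, 28, 28, 256, 256), 1))
# ~569.5 FLOPs/byte, versus ~60 for a 4096x4096 FC layer at the same batch size:
# each small conv weight tensor is reused at every spatial position, so CNNs sit
# much deeper in the compute-bound region of the roofline than FC models.
```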


SLIDE 42-44

FC TPU/GPU Speedup colored with Batch Size

[Scatter plot of per-model TPU/GPU speedup, colored by batch size; speedups span roughly 0.35x to 9x, with the regions above and below 1x labeled "TPU is better" and "GPU is better"]

SLIDE 45

FC TPU/GPU Speedup colored with Node Size

[Same speedup scatter, colored by node count]

More nodes → more weights → more memory-bound

SLIDE 46

Hardware Platforms

[Platform table revisited; annotations: 300 GB/s per core, 1.44x]

SLIDE 47-49

CNN TPU/GPU Speedup colored with Batch Size

[Scatter plot of per-model TPU/GPU speedup for CNNs, colored by batch size]

  • Up to 6x speedup
  • The TPU architecture and software stack are highly optimized for CNNs
  • All models run faster on the TPU
  • Larger batch sizes lead to higher speedups

SLIDE 50

CNN TPU/GPU Speedup colored with Filters

  • Models with more filters have a higher lower bound on speedup

SLIDE 51

Conclusion

  • Parameterized methodology: ParaDnn + a set of analysis methods
  • Single platform analysis: TPU v2
  • Homogeneous platform comparison: TPU v2 vs v3
  • Heterogeneous platform comparison: TPU vs GPU
SLIDE 52

Limitations of this Work

  • Does not include:
    • Inference
    • Multi-node systems: multi-GPU or TPU pods
    • Accuracy, convergence
    • Cloud overhead
  • Tractability constraints:
    • The range of hyperparameters and datasets is limited
    • Small batch sizes (<16) and large batch sizes (>2k) are not studied
    • Synthetic datasets do not include data-infeed overhead
    • The TPU loop count is fixed at 100 iterations; larger values can slightly increase performance

SLIDE 53

Questions?

ParaDnn

Available: github.com/Emma926/paradnn