ML ↔ HPC: Optimizing Optimizers for Optimization (PowerPoint PPT Presentation)


SLIDE 1

TAL BEN-NUN

ML ↔ HPC: Optimizing Optimizers for Optimization

Workshop on the Convergence of ML & HPC @ ASPLOS 2020, Zoom

WITH CONTRIBUTIONS FROM DAN ALISTARH, NIKOLI DRYDEN, YOSUKE OYAMA, CEDRIC RENGGLI, AND OTHERS AT SPCL

Stochastic Performance

SLIDE 2

20 TB/night

Source: OpenAI

SLIDE 3

A brief intro to supervised deep learning

[Figure: one-hot true label (1.00, 0.00, …) vs. network output (0.54, 0.28, 0.02, 0.07, 0.33, 0.02) over classes Cat, Dog, Airplane, Truck, Horse, Banana]

Labeled samples $y \in Y \subset \mathcal{D}$; label domain $Z$; network $g(y)\colon Y \to Z$ with fixed structure and learned weights $x$; true label $m(y)$. Training solves

$x^* = \operatorname{argmin}_{x \in \mathbb{R}^d} \, \mathbb{E}_{y \sim \mathcal{D}}[\ell(x, y)]$

via layer-wise parameter updates, e.g., with the cross-entropy or squared loss:

$\ell_{ce}(x, y) = -\sum_j m(y)_j \cdot \log\left(\frac{e^{g(y)_j}}{\sum_l e^{g(y)_l}}\right)$

$\ell_{sq}(x, y) = \lVert g(y) - m(y) \rVert^2$
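To make the two losses concrete, here is a minimal NumPy sketch (the array values echo the slide's figure; the function names are mine, not from the talk):

import numpy as np

def cross_entropy_loss(logits, one_hot_label):
    # softmax cross-entropy: -sum_j m(y)_j * log(softmax(g(y))_j)
    shifted = logits - np.max(logits)              # for numerical stability
    log_softmax = shifted - np.log(np.sum(np.exp(shifted)))
    return -np.sum(one_hot_label * log_softmax)

def squared_loss(outputs, one_hot_label):
    # squared loss: ||g(y) - m(y)||^2
    return np.sum((outputs - one_hot_label) ** 2)

logits = np.array([0.54, 0.28, 0.02, 0.07, 0.33, 0.02])  # g(y), as on the slide
label = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])         # m(y): "Cat"
print(cross_entropy_loss(logits, label), squared_loss(logits, label))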

SLIDE 4

A brief intro to supervised deep learning (build of Slide 3, annotated with scale)

Datasets: ≥ TBs of random access. Models: 100 MiB-32 GiB and beyond. Parameters: 30k to billions.

SLIDE 5

Trends in deep learning: hardware and multi-node

The field is moving fast, trying everything imaginable. Survey results from 252 papers in the area of parallel deep learning. [Charts: hardware used; shared vs. distributed memory]

Deep learning is largely on distributed memory today!

  • T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019
SLIDE 6

Trends in distributed deep learning: node count and communication

The field is moving fast, trying everything imaginable. Survey results from 252 papers in the area of parallel deep learning. [Chart: communication mode]

Deep learning research is converging to MPI!

  • T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019

SLIDE 7

Computational Principles

Data

SLIDE 8

Computational Principles (build of Slide 7)

Data

SLIDE 9

Example: options for computing convolutional layers

Direct: slide the kernel over the input and accumulate dot products. [Figure: a 4×4 integer input convolved with a small kernel, producing the output values shown on the slide]

Indirect:
▪ im2col: reshape input patches into the columns of a matrix, compute the convolution as a single GEMM, then reshape back (col2im); see the sketch after the references.
▪ FFT: $\hat{x} = \mathcal{F}^{-1}(\mathcal{F}(x) \times \mathcal{F}(w))$; transform input and kernel, multiply element-wise in the frequency domain, transform back.
▪ Winograd: transform $n \times n$ input tiles and $s \times s$ kernel tiles into the Winograd domain (transform matrices $G(n, s)$), take element-wise products to obtain $n' \times n'$ tiles with $n' = n + s - 1$, then sum channel-wise.

  • K. Chellapilla et al.: High Performance Convolutional Neural Networks for Document Processing, Int'l Workshop on Frontiers in Handwriting Recognition 2006
  • M. Mathieu et al.: Fast Training of Convolutional Networks through FFTs, ICLR'14
  • A. Lavin and S. Gray: Fast Algorithms for Convolutional Neural Networks, CVPR'16

๐ท๐‘—๐‘œ ๐‘‚

Reshape im2col ๐ท๐‘—๐‘œ ๐ท๐‘๐‘ฃ๐‘ข โ‹… ๐ท๐‘—๐‘œ ๐ฟ๐‘ง ๐ฟ๐‘ฆ

๐‘‹ ๐ผ

๐ท๐‘๐‘ฃ๐‘ข ๐ท๐‘—๐‘œ โ‹… ๐ฟ๐‘ง โ‹… ๐ฟ๐‘ฆ

โ€ฆ

๐‘‚ โ‹… ๐ผโ€ฒ โ‹… ๐‘‹โ€ฒ ๐ท๐‘—๐‘œ โ‹… ๐ฟ๐‘ง โ‹… ๐ฟ๐‘ฆ

ร—

GEMM, col2im

๐‘‹โ€ฒ ๐ผโ€ฒ ๐ท๐‘๐‘ฃ๐‘ข

๐’™

๐‘ฎ(๐’, ๐’”) Winograd Domain

Channel-wise summation

+

๐ต๐‘ˆ โ‹… โ‹… ๐ต ๐ป โ‹… โ‹… ๐ป๐‘ˆ ๐ถ๐‘ˆ โ‹… โ‹… ๐ถ

Element-wise product ๐‘› ร— ๐‘› ๐‘  ร— ๐‘  ๐‘›โ€ฒ ร— ๐‘›โ€ฒ

๐‘›โ€ฒ = ๐‘› + ๐‘  โˆ’ 1
SLIDE 10

Operator Design

[Figure: a full convolution factored into smaller ones: * = * *]

Separable convolution: factor one convolution into cheaper components (e.g., depthwise followed by pointwise convolutions), reducing parameters and compute.

A.G. Howard et al.: "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," arXiv 2017. F.N. Iandola et al.: "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," ICLR 2017.
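To illustrate the factorization, a NumPy sketch of a depthwise-separable convolution (stride 1, no padding; shapes and names are my own assumptions, not code from the slide):

import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_weights):
    # x: (C, H, W); dw_kernels: (C, k, k), one spatial kernel per channel;
    # pw_weights: (C_out, C), a 1x1 "pointwise" mixing of channels
    C, H, W = x.shape
    k = dw_kernels.shape[1]
    out_h, out_w = H - k + 1, W - k + 1
    depthwise = np.empty((C, out_h, out_w))
    for c in range(C):                      # depthwise: per-channel spatial conv
        for i in range(out_h):
            for j in range(out_w):
                depthwise[c, i, j] = np.sum(x[c, i:i+k, j:j+k] * dw_kernels[c])
    # pointwise: 1x1 convolution = channel mixing at every spatial location
    return np.einsum('oc,chw->ohw', pw_weights, depthwise)

x = np.random.rand(3, 8, 8)
y = depthwise_separable_conv(x, np.random.rand(3, 3, 3), np.random.rand(16, 3))
print(y.shape)  # (16, 6, 6)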

SLIDE 11

Transformers โ€“ Multi-Head Attention

  • A. Vaswani et al. โ€œAttention Is All You Need,โ€ NeurIPS 2017.
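Since the slide shows only the figure, here is a minimal NumPy sketch of the scaled dot-product attention at the core of multi-head attention, following the cited paper (single head; names are mine):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

n, d = 4, 8
Q, K, V = (np.random.rand(n, d) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)

Multi-head attention runs h such attentions in parallel over learned linear projections of Q, K, and V, then concatenates and projects the results.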
SLIDE 12

DNN Compilers: TensorFlow XLA, Facebook Glow / TorchScript JIT

▪ Use techniques from compiler construction: DNN → Graph → IR → Transformations → HW Mapping

SLIDE 13

DNN Compilers: TensorFlow XLA, Facebook Glow / TorchScript JIT, TVM Stack, Intel nGraph

▪ Use techniques from compiler construction: DNN → Graph → IR → Transformations → HW Mapping
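As an example of the DNN → Graph → IR → HW Mapping flow, a tiny sketch that asks TensorFlow's XLA JIT to trace, lower, and compile a function (the function itself is a made-up example):

import tensorflow as tf

# jit_compile=True requests XLA: the traced graph is lowered to HLO IR,
# optimized (fusion, layout assignment), and compiled for the device.
@tf.function(jit_compile=True)
def fused_op(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)  # matmul/add/relu are fusion candidates

x = tf.random.normal([32, 128])
w = tf.random.normal([128, 64])
b = tf.zeros([64])
print(fused_op(x, w, b).shape)  # (32, 64)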

SLIDE 14

Partitioning Computation?

Data Parallelism

SLIDE 15

Minibatch Stochastic Gradient Descent (SGD)

[Figure: predicted probabilities (0.54, 0.28, 0.02, 0.07, 0.03, 0.04, 0.02) vs. one-hot true label over classes Cat, Dog, Airplane, Truck, Horse, Bicycle]

  • T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019
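A bare-bones sketch of the minibatch SGD loop itself (all names and the toy objective are mine; the gradient oracle is left abstract):

import numpy as np

def minibatch_sgd(x, grad_fn, dataset, lr=0.1, batch_size=64, steps=1000):
    # x: parameter vector; grad_fn(x, batch) -> stochastic gradient estimate
    rng = np.random.default_rng(0)
    for _ in range(steps):
        batch = dataset[rng.choice(len(dataset), batch_size)]  # sample a minibatch
        x = x - lr * grad_fn(x, batch)                         # descend
    return x

# toy example: fit the data mean; per-sample loss 0.5*||x - y||^2, gradient x - y
data = np.random.default_rng(1).normal(3.0, 1.0, size=(10000, 1))
x_star = minibatch_sgd(np.zeros(1), lambda x, b: (x - b).mean(axis=0), data)
print(x_star)  # converges to ~3.0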
SLIDE 16

Partitioning Computation?

Data Parallelism | Model Parallelism (Channel/Filter, Spatial, …) | Pipeline Parallelism

[Figure: pipeline schedule of layers 1-3 across Proc 1-3, with idle "bubble" phases between forward and backward passes]

SLIDE 17

Partitioning Computation?

Data Parallelism (see the allreduce sketch after this list)
▪ Simple and efficient solution, easy to implement
▪ Duplicate parameters at all processors
▪ Affects generalization

Model Parallelism (Channel/Filter, Spatial, …)
▪ Parameters/domain can be distributed across processors
▪ Good for: large inputs, wide networks
▪ Complex communication per layer; performance hinges on implementation

Pipeline Parallelism (layer-wise)
▪ Parameters can be distributed across processors
▪ Good for: deep models, few activations
▪ Sparse communication pattern (only pipeline stages)
▪ Consistent model introduces idle-time "bubble"

SLIDE 18

Hybrid parallelism

▪ Layers/parameters can be distributed across processors
▪ Can distribute the minibatch
▪ Often specific to layer types (e.g., distribute fully connected layers but handle convolutional layers data-parallel)
▪ Enables arbitrary combinations of data, model, and pipeline parallelism - very powerful!

  • A. Krizhevsky: One weird trick for parallelizing convolutional neural networks, arXiv 2014
  • J. Dean et al.: Large scale distributed deep networks, NIPS'12
  • T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019

SLIDE 19

Training is not just Training

▪ Imbalanced workload over time
▪ Nontrivial gradient aggregation
▪ Data/compute redistribution

  • K. Osawa et al.: "Second-order Optimization Method for Large Mini-batch: Training ResNet-50 on ImageNet in 35 Epochs", arXiv 2018
  • T. Karras et al.: "Progressive Growing of GANs for Improved Quality, Stability, and Variation", arXiv 2017

SLIDE 20

Hyperparameter and Architecture search

[Figures: Reinforcement Learning [1]; Evolutionary Algorithms [4]; Model-Based Optimization [5,6]]

▪ Meta-optimization of hyperparameters (e.g., momentum) and DNN architecture
▪ Using Reinforcement Learning [1] (explore/exploit different configurations)
▪ Genetic Algorithms with modified (specialized) mutations [2]
▪ Particle Swarm Optimization [3] and other meta-heuristics
▪ Multi-level optimization (a simple search-loop sketch follows the references)

[1] M. Jaderberg et al.: Population Based Training of Neural Networks, arXiv 2017
[2] E. Real et al.: Regularized Evolution for Image Classifier Architecture Search, arXiv 2018
[3] P. R. Lorenzo et al.: Hyper-parameter Selection in Deep Neural Networks Using Parallel Particle Swarm Optimization, GECCO'17
[4] H. Liu et al.: Hierarchical Representations for Efficient Architecture Search, ICLR'18
[5] R. Luo et al.: Neural Architecture Optimization, NeurIPS'18
[6] H. Liu et al.: DARTS: Differentiable Architecture Search, ICLR'19
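As referenced above, a deliberately simple random-search loop over a hyperparameter space; illustrative only, since the cited methods (PBT, regularized evolution, PSO, DARTS) are far more sophisticated. All names are mine:

import random

def random_search(train_and_eval, space, trials=20, seed=0):
    # sample configurations and keep the best validation score
    rng = random.Random(seed)
    best_cfg, best_score = None, float('-inf')
    for _ in range(trials):
        cfg = {k: rng.choice(v) for k, v in space.items()}  # sample a config
        score = train_and_eval(cfg)                          # one full training run
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

space = {"lr": [0.1, 0.01, 0.001], "momentum": [0.0, 0.9, 0.99], "layers": [2, 4, 8]}
# stand-in objective; in practice this trains a DNN and returns validation accuracy
mock_eval = lambda cfg: -abs(cfg["lr"] - 0.01) - abs(cfg["momentum"] - 0.9)
print(random_search(mock_eval, space))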

SLIDE 21

Hyperparameter and Architecture search (repeat of Slide 20)

SLIDE 22

Updating parameters in distributed data parallelism

Central: sharded parameter server, $x' = v(x, \Delta x)$; training agents push gradients $\Delta x$ and pull updated weights $x$.

Decentral: collective allreduce of $x$ among the training agents.

Cost models (latency $M$, bandwidth term $\delta n H$ for $n$ parameters, $Q$ agents, $t$ parameter-server shards; a small comparison script follows the references):

$T_{allreduce} = 2M \log_2 Q + 2\,\delta n H\,(Q - 1)/Q$

$T_{ps} = 2M + 2Q\,\delta n H / t$

  • Collective operations
  • Topologies
  • Neighborhood collectives
  • RMA?

Hierarchical Parameter Server

  • S. Gupta et al.: Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study, ICDM'16
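A small script to compare the two cost models above (the symbol values are placeholders I chose, not measurements from the talk):

from math import log2

def t_allreduce(M, dnH, Q):
    # 2*M*log2(Q) + 2*dnH*(Q-1)/Q : latency term + bandwidth term
    return 2 * M * log2(Q) + 2 * dnH * (Q - 1) / Q

def t_param_server(M, dnH, Q, t):
    # 2*M + 2*Q*dnH/t : push and pull against t shards
    return 2 * M + 2 * Q * dnH / t

M, dnH = 1e-6, 0.05  # assumed: 1 us latency, 50 ms to transfer all gradients
for Q in (4, 64, 1024):
    print(Q, t_allreduce(M, dnH, Q), t_param_server(M, dnH, Q, t=16))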
SLIDE 23

Parameter (and Model) consistency - centralized

▪ Trades off "statistical performance" for "hardware performance"
▪ Sharded parameter server, $x' = v(x, \Delta x)$: training agents push $\Delta x$ and pull $x$
▪ Parameter exchange frequency can be controlled, while still attaining convergence:

Synchronous | Stale-Synchronous / Bounded Asynchronous | Asynchronous

[Figure: execution timelines of agents 1…m against a parameter server: synchronous (all agents sync after each update $x^{(1)}, x^{(2)}, \ldots, x^{(U)}$), stale-synchronous (updates merged under a bounded maximum staleness), and asynchronous (agents push and pull without synchronization)]

Consistency spectrum: Synchronous SGD → Stale-Synchronous SGD → Asynchronous SGD (HOGWILD!) → Model Averaging (e.g., elastic) → Ensemble Learning (consistent → inconsistent)

  • J. Dean et al.: Large scale distributed deep networks, NIPS'12
  • F. Niu et al.: Hogwild: A lock-free approach to parallelizing stochastic gradient descent, NIPS'11
SLIDE 24

Parameter (and Model) consistency - decentralized

▪ Training agents perform a collective allreduce of $x$; no central server
▪ Parameter exchange frequency can be controlled, while still attaining convergence:
▪ Using Gossip Algorithms [Jin et al. 2016] or Partial Collectives [Li et al. 2020]

Synchronous | Stale-Synchronous / Bounded Asynchronous | Asynchronous

[Figure: execution timelines of agents 1…m: synchronous (global allreduce after every step), stale-synchronous (merge under a bounded maximum staleness), and asynchronous/gossip (agents r and k exchange and merge at their own pace)]

Consistency spectrum: Synchronous SGD → Stale-Synchronous SGD → Asynchronous SGD (HOGWILD!) → Model Averaging (e.g., elastic) → Ensemble Learning (consistent → inconsistent)

  • Peter H. Jin et al.: "How to scale distributed deep learning?", NIPS MLSystems 2016
  • Shigang Li et al.: "Taming unbalanced training workloads in deep learning with partial collective operations", PPoPP 2020

SLIDE 25

Parameter consistency in deep learning

Model Averaging (e.g., elastic), on the consistency spectrum between Asynchronous SGD (HOGWILD!) and Ensemble Learning.

Elastic Averaging SGD, using physical ("elastic") forces between different versions of $x$: each agent $j$ takes a gradient step and is pulled toward a center variable $\tilde{x}$, which in turn tracks the agents' average:

$x_{u+1,j} = x_{u,j} - \theta\,\Delta x_{u,j} - \beta\,(x_{u,j} - \tilde{x}_u)$

$\tilde{x}_{u+1} = (1 - \gamma)\,\tilde{x}_u + \frac{\gamma}{n} \sum_{j=1}^{n} x_{u,j}$

[Figure: per-agent timelines $x^{(1,j)}, \ldots, x^{(6,j)}$ periodically synchronizing with the elastic average]

  • S. Zhang et al.: Deep learning with Elastic Averaging SGD, NIPS'15
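A toy NumPy sketch of the EASGD update above, simulating n agents in one process (the quadratic objective and constants are my own assumptions):

import numpy as np

def easgd_round(x_agents, x_center, grads, theta=0.05, beta=0.1, gamma=0.5):
    # one synchronized EASGD round for all agents plus the center variable
    x_agents = [x - theta * g - beta * (x - x_center)   # gradient step + elastic pull
                for x, g in zip(x_agents, grads)]
    x_center = (1 - gamma) * x_center + gamma * np.mean(x_agents, axis=0)
    return x_agents, x_center

rng = np.random.default_rng(0)
x_agents = [rng.normal(size=4) for _ in range(8)]
x_center = np.zeros(4)
for _ in range(100):  # toy objective 0.5*||x||^2, so the gradient is x itself
    x_agents, x_center = easgd_round(x_agents, x_center, x_agents)
print(np.linalg.norm(x_center))  # converges toward 0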

SLIDE 26

Parameter consistency in deep learning

Ensemble Learning, the inconsistent extreme of the spectrum: train independent models and average their predictions.

[Figure: averaged ensemble prediction (0.54, 0.28, 0.02, 0.07, …) over classes Cat, Dog, Airplane, Truck, Horse, Bicycle]

  • T. G. Dietterich: Ensemble Methods in Machine Learning, MCS 2000
SLIDE 27

Communication optimizations

[Figure: sharded parameter server, $x' = v(x, \Delta x)$, with training agents pushing $\Delta x$ and pulling $x$]

▪ Different options for how to optimize updates
  ▪ Send $\Delta x$, receive $x$
  ▪ Send FC factors ($c_{m-1}$, $c_m$), compute $\Delta x$ on the parameter server; broadcast the factors to avoid receiving the full $x$
  ▪ Use lossy compression when sending, and accumulate the error locally!
▪ Quantization
  ▪ Quantize weight updates and potentially weights
  ▪ Main trick is stochastic rounding [1]; the expectation is more accurate. Enables low precision (half, quarter) to become standard
  ▪ TernGrad - ternary weights [2], 1-bit SGD [3], …
▪ Sparsification (see the top-k sketch after the references)
  ▪ Do not send small weight updates, or only send the top-k [4]; accumulate omitted gradients locally

[1] S. Gupta et al.: Deep Learning with Limited Numerical Precision, ICML'15
[2] F. Li and B. Liu: Ternary Weight Networks, arXiv 2016
[3] F. Seide et al.: 1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs, Interspeech 2014
[4] C. Renggli et al.: SparCML: High-Performance Sparse Communication for Machine Learning, arXiv 2018
(source: ai.intel.com)
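As referenced above, a minimal sketch of top-k gradient sparsification with local error accumulation (class and variable names are mine):

import numpy as np

class TopKCompressor:
    # keep only the k largest-magnitude entries; remember the rest locally
    def __init__(self, shape, k):
        self.error = np.zeros(shape)  # residual of everything not yet sent
        self.k = k

    def compress(self, grad):
        corrected = grad + self.error                 # error feedback
        idx = np.argpartition(np.abs(corrected), -self.k)[-self.k:]
        sparse = np.zeros_like(corrected)
        sparse[idx] = corrected[idx]                  # top-k survives
        self.error = corrected - sparse               # accumulate the omitted part
        return idx, corrected[idx]                    # (indices, values) to send

g = np.random.default_rng(0).normal(size=1000)
comp = TopKCompressor(g.shape, k=10)
idx, vals = comp.compress(g)
print(len(idx), np.linalg.norm(comp.error))  # 10 entries sent, rest retained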

SLIDE 28

SparCML - Quantized sparse allreduce for decentral updates

[Figure: sparse contributions $\Delta x_1$, $\Delta x_2$, $\Delta x_3$, $\Delta x_4$ combined pairwise (+)]

  • C. Renggli et al.: SparCML: High-Performance Sparse Communication for Machine Learning, SC 2019

Microsoft Speech production workload results: 2 weeks → 2 days! (six epochs, 60 million parameters)
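The core idea, reducing sparse (index, value) contributions instead of dense vectors, in a toy Python form (a stand-in using dictionaries; SparCML's actual MPI algorithm is more involved):

def sparse_sum(contribs):
    # combine per-agent sparse gradients {index: value} into one sparse result
    total = {}
    for contrib in contribs:            # in MPI this would be a reduction tree
        for i, v in contrib.items():
            total[i] = total.get(i, 0.0) + v
    return total

# four agents, each with a few non-zero gradient entries (toy data)
agents = [{3: 0.5, 17: -1.0}, {3: 0.1}, {42: 2.0}, {17: 0.4, 42: -0.5}]
print(sparse_sum(agents))  # {3: 0.6, 17: -0.6, 42: 1.5}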

SLIDE 29

20 TB/night

Source: OpenAI

SLIDE 30

Opportunities

[Figure: Neural Code Comprehension pipeline - Source Code (C/C++, FORTRAN, Python, Java, CUDA, OpenCL, …) → Frontend → SSA Representation (LLVM IR) → Learnable Representation → DNNs → High-Level (Downstream) Tasks: Malicious Code Detection (Anti-Virus), Guided Programming (IDE), Code Optimization (Compiler), Hardware Mapping]

SLIDE 31

Inspiration from Natural Language Processing (NLP)

The Naturalness of Software hypothesis [1]: "Software is a form of human communication; software corpora have similar statistical properties to natural language corpora; and these properties can be exploited to build better software engineering tools."

[1] Miltiadis Allamanis, Earl T. Barr, Premkumar T. Devanbu, and Charles A. Sutton: A survey of machine learning for big code and naturalness, CoRR abs/1709.06182, 2017

SLIDE 32

Natural Language vs. Programming Languages

"The domestic cat is a small, typically furry, carnivorous mammal. They are often called house cats when kept as indoor pets or simply cats when there is no need to distinguish them from other felids and felines. They are often valued by humans for companionship and for their ability to hunt. There are more than seventy cat breeds recognized by various cat registries."
Source: Wikipedia: Cat, https://en.wikipedia.org/wiki/Cat

int fibonacci(int n) {
  if ((n == 1) || (n == 0))
    return n;
  else
    return fibonacci(n - 1) + fibonacci(n - 2);
}

Natural language: read sequentially | Programming languages: distinct structural features

SLIDE 33

Natural Language vs. Programming Languages (build of Slide 32)

Adds: natural language uses local references | programming languages have long-range dependencies, e.g., a distant call site:

... int f = fibonacci(my_number); ...

SLIDE 34

Natural Language vs. Programming Languages (build of Slide 32)

Adds: natural-language words come from a set vocabulary | programming languages have a high rate of neologisms (fresh identifiers such as my_number, p, flower, fib, my_func)

SLIDE 35

Natural Language vs. Programming Languages (full comparison)

Natural Language | Programming Languages
Read sequentially | Distinct structural features
Local references | Long-range dependencies
Words come from a set vocabulary | High rate of neologisms
Semantically robust | Semantically brittle

[Example of brittleness: a one-token change turns the program's output from 0 1 1 2 3 5 8 13 21 34 … into 0 1 2 4 8 16 32 64 128 256 …]

SLIDE 36

Representations of code

Source Code | Abstract Syntax Tree (AST) | Assembly | Static Single Assignment (SSA)

[Figure: AST of fibonacci - an if node with branches "return n" and "return fib(n-1) + fib(n-2)"]

LLVM IR (SSA) of fibonacci:

define i32 @_Z9fibonaccii(i32 %n) #0 !dbg !4 {
  %1 = or i32 %n, 1, !dbg !16
  %2 = icmp eq i32 %1, 1, !dbg !16
  br i1 %2, label %9, label %3, !dbg !16
  %4 = add nsw i32 %n, -1, !dbg !18
  %5 = tail call i32 @_Z9fibonaccii(i32 %4), !dbg !19
  %6 = add nsw i32 %n, -2, !dbg !20
  %7 = tail call i32 @_Z9fibonaccii(i32 %6), !dbg !21
  %8 = add nsw i32 %7, %5, !dbg !22
  ret i32 %8, !dbg !23
  ret i32 %n, !dbg !23
}

SLIDE 37

Representations of code (build of Slide 36)

Learned representations over these forms:
▪ AST paths [Raychev et al. 2015], [Le et al. 2018]
▪ code2vec [Alon et al. 2018]
▪ DeepTune [Cummins et al. 2017]
▪ inst2vec (LLVM) [Ben-Nun et al. 2018]
▪ IR2Vec (LLVM) [Keerthy et al. 2019]
▪ CDFG (LLVM) [Brauckmann et al. 2020]
▪ ProGraML (LLVM, XLA) [WIP]

  • V. Raychev et al.: "Predicting Program Properties from 'Big Code'", POPL 2015
  • C. Cummins et al.: "End-to-end Deep Learning of Optimization Heuristics", PACT 2017
  • U. Alon et al.: "code2vec: Learning Distributed Representations of Code", POPL 2018
  • T. Ben-Nun et al.: "Neural Code Comprehension: A Learnable Representation of Code Semantics", NeurIPS 2018
  • Q. Le et al.: "Deep learning at the shallow end: Malware classification for non-domain experts", DFRWS 2018
  • V. Keerthy et al.: "IR2Vec: A Flow Analysis-based Scalable Infrastructure for Program Encodings", arXiv 2019
  • A. Brauckmann et al.: "Compiler-Based Graph Representations for Deep Learning Models of Code", CC 2020
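To give a flavor of IR-based representations such as inst2vec, a toy sketch that normalizes LLVM IR statements into a small vocabulary of tokens (my own simplification, not the published preprocessing pipeline):

import re

def normalize_ir_statement(stmt):
    # replace identifiers and literals so statements form a small vocabulary
    stmt = re.sub(r',?\s*!dbg !\d+', '', stmt)       # strip debug metadata
    stmt = re.sub(r'@[\w.]+', '@GLOBAL', stmt)       # abstract global names
    stmt = re.sub(r'%[\w.]+', '%ID', stmt)           # abstract SSA identifiers
    stmt = re.sub(r'(?<=\s)-?\d+', 'CONST', stmt)    # abstract integer literals
    return stmt.strip()

ir = [
    "%4 = add nsw i32 %n, -1, !dbg !18",
    "%5 = tail call i32 @_Z9fibonaccii(i32 %4), !dbg !19",
]
for s in ir:
    print(normalize_ir_statement(s))
# %ID = add nsw i32 %ID, CONST
# %ID = tail call i32 @GLOBAL(i32 %ID)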

SLIDE 38

Representations of code

[Figure: comparing the code2vec, inst2vec, and CDFG representations]

SLIDE 39

Self-Supervised Models of Code

SLIDE 40

ProGraML Overview

SLIDE 41

ProGraML Overview

SLIDE 42

ProGraML Overview

SLIDE 43

ProGraML on compiler tasks

SLIDE 44

Case Study: Algorithm Classification (104 Classes)

  • L. Mou et al., โ€œConvolutional neural networks over tree structures for programming language processingโ€, AAAI 2016.
SLIDE 45

Case Study: Heterogeneous Device Mapping

Tasks: CPU or GPU? GPU thread-block size.

SLIDE 46

Case Study: Branch Prediction

SLIDE 47

โ–ช A supercomputing problem - amenable to established tools and tricks from HPC โ–ช Concurrency is easy to attain, hard to program beyond data-parallelism โ–ช Main bottleneck in distributed is communication โ€“ reduction by using the robustness of SGD โ–ช Co-design is prevalent โ–ช Very different environment from traditional HPC

โ–ช Trade-off accuracy for performance!

โ–ช Main objective is generalization

โ–ช Performance-centric view in HPC can be harmful for accuracy

70

Summary โ€“ HPC โ†’ ML

https://www.arxiv.org/abs/1802.09941

SLIDE 48

Summary - ML → HPC

▪ Categorizing and understanding code is essential for various tasks
▪ Reasoning about code requires different tools than natural languages
▪ Using classical compiler construction, structure can be recovered
▪ Results are promising for various classes of downstream tasks

https://arxiv.org/abs/1806.07336