ML ↔ HPC: Optimizing Optimizers for Optimization (PowerPoint PPT Presentation)


SLIDE 1

TAL BEN-NUN

ML ↔ HPC: Optimizing Optimizers for Optimization

Workshop on the Convergence of ML & HPC @ ASPLOS 2020, Zoom

WITH CONTRIBUTIONS FROM DAN ALISTARH, NIKOLI DRYDEN, YOSUKE OYAMA, CEDRIC RENGGLI, AND OTHERS AT SPCL

Stochastic Performance

SLIDE 2

20 TB/night

Source: OpenAI

SLIDE 3

A brief intro to supervised deep learning

[Figure: one-hot true label (1.00, 0.00, …) vs. network output (0.54, 0.28, 0.02, 0.07, 0.33, 0.02) over classes Cat, Dog, Airplane, Truck, Horse, Banana]

Labeled samples $y \in Y \subset \mathcal{D}$; label domain $Z$; network $g(y)\colon Y \to Z$ with fixed structure and learned weights $x$; true label $m(y)$. Training solves

$x^* = \operatorname{argmin}_{x \in \mathbb{R}^d} \, \mathbb{E}_{y \sim \mathcal{D}}[\ell(x, y)]$

via layer-wise parameter updates, e.g., with the cross-entropy or squared loss:

$\ell_{ce}(x, y) = -\sum_j m(y)_j \cdot \log\left(\frac{e^{g(y)_j}}{\sum_l e^{g(y)_l}}\right)$

$\ell_{sq}(x, y) = \lVert g(y) - m(y) \rVert^2$
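To make the two losses concrete, here is a minimal NumPy sketch (the array values echo the slide's figure; the function names are mine, not from the talk):

import numpy as np

def cross_entropy_loss(logits, one_hot_label):
    # softmax cross-entropy: -sum_j m(y)_j * log(softmax(g(y))_j)
    shifted = logits - np.max(logits)              # for numerical stability
    log_softmax = shifted - np.log(np.sum(np.exp(shifted)))
    return -np.sum(one_hot_label * log_softmax)

def squared_loss(outputs, one_hot_label):
    # squared loss: ||g(y) - m(y)||^2
    return np.sum((outputs - one_hot_label) ** 2)

logits = np.array([0.54, 0.28, 0.02, 0.07, 0.33, 0.02])  # g(y), as on the slide
label = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])         # m(y): "Cat"
print(cross_entropy_loss(logits, label), squared_loss(logits, label))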

SLIDE 4

A brief intro to supervised deep learning (build of Slide 3, annotated with scale)

Datasets: ≥ TBs of random access. Models: 100 MiB-32 GiB and beyond. Parameters: 30k to billions.

SLIDE 5

Trends in deep learning: hardware and multi-node

The field is moving fast, trying everything imaginable. Survey results from 252 papers in the area of parallel deep learning. [Charts: hardware used; shared vs. distributed memory]

Deep learning is largely on distributed memory today!

  • T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019
SLIDE 6

Trends in distributed deep learning: node count and communication

The field is moving fast, trying everything imaginable. Survey results from 252 papers in the area of parallel deep learning. [Chart: communication mode]

Deep learning research is converging to MPI!

  • T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019

SLIDE 7

Computational Principles

Data

SLIDE 8

Computational Principles (build of Slide 7)

Data

SLIDE 9

Example: options for computing convolutional layers

Direct: slide the kernel over the input and accumulate dot products. [Figure: a 4×4 integer input convolved with a small kernel, producing the output values shown on the slide]

Indirect:
▪ im2col: reshape input patches into the columns of a matrix, compute the convolution as a single GEMM, then reshape back (col2im); see the sketch after the references.
▪ FFT: $\hat{x} = \mathcal{F}^{-1}(\mathcal{F}(x) \times \mathcal{F}(w))$; transform input and kernel, multiply element-wise in the frequency domain, transform back.
▪ Winograd: transform $n \times n$ input tiles and $s \times s$ kernel tiles into the Winograd domain (transform matrices $G(n, s)$), take element-wise products to obtain $n' \times n'$ tiles with $n' = n + s - 1$, then sum channel-wise.

  • K. Chellapilla et al.: High Performance Convolutional Neural Networks for Document Processing, Int'l Workshop on Frontiers in Handwriting Recognition 2006
  • M. Mathieu et al.: Fast Training of Convolutional Networks through FFTs, ICLR'14
  • A. Lavin and S. Gray: Fast Algorithms for Convolutional Neural Networks, CVPR'16

๐ท๐‘—๐‘œ ๐‘‚

Reshape im2col ๐ท๐‘—๐‘œ ๐ท๐‘๐‘ฃ๐‘ข โ‹… ๐ท๐‘—๐‘œ ๐ฟ๐‘ง ๐ฟ๐‘ฆ

๐‘‹ ๐ผ

๐ท๐‘๐‘ฃ๐‘ข ๐ท๐‘—๐‘œ โ‹… ๐ฟ๐‘ง โ‹… ๐ฟ๐‘ฆ

โ€ฆ

๐‘‚ โ‹… ๐ผโ€ฒ โ‹… ๐‘‹โ€ฒ ๐ท๐‘—๐‘œ โ‹… ๐ฟ๐‘ง โ‹… ๐ฟ๐‘ฆ

ร—

GEMM, col2im

๐‘‹โ€ฒ ๐ผโ€ฒ ๐ท๐‘๐‘ฃ๐‘ข

๐’™

๐‘ฎ(๐’, ๐’”) Winograd Domain

Channel-wise summation

+

๐ต๐‘ˆ โ‹… โ‹… ๐ต ๐ป โ‹… โ‹… ๐ป๐‘ˆ ๐ถ๐‘ˆ โ‹… โ‹… ๐ถ

Element-wise product ๐‘› ร— ๐‘› ๐‘  ร— ๐‘  ๐‘›โ€ฒ ร— ๐‘›โ€ฒ

๐‘›โ€ฒ = ๐‘› + ๐‘  โˆ’ 1
SLIDE 10

Operator Design

[Figure: a full convolution factored into smaller ones: * = * *]

Separable convolution: factor one convolution into cheaper components (e.g., depthwise followed by pointwise convolutions), reducing parameters and compute.

A.G. Howard et al.: "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," arXiv 2017. F.N. Iandola et al.: "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," ICLR 2017.
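To illustrate the factorization, a NumPy sketch of a depthwise-separable convolution (stride 1, no padding; shapes and names are my own assumptions, not code from the slide):

import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_weights):
    # x: (C, H, W); dw_kernels: (C, k, k), one spatial kernel per channel;
    # pw_weights: (C_out, C), a 1x1 "pointwise" mixing of channels
    C, H, W = x.shape
    k = dw_kernels.shape[1]
    out_h, out_w = H - k + 1, W - k + 1
    depthwise = np.empty((C, out_h, out_w))
    for c in range(C):                      # depthwise: per-channel spatial conv
        for i in range(out_h):
            for j in range(out_w):
                depthwise[c, i, j] = np.sum(x[c, i:i+k, j:j+k] * dw_kernels[c])
    # pointwise: 1x1 convolution = channel mixing at every spatial location
    return np.einsum('oc,chw->ohw', pw_weights, depthwise)

x = np.random.rand(3, 8, 8)
y = depthwise_separable_conv(x, np.random.rand(3, 3, 3), np.random.rand(16, 3))
print(y.shape)  # (16, 6, 6)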

SLIDE 11

Transformers โ€“ Multi-Head Attention

  • A. Vaswani et al. โ€œAttention Is All You Need,โ€ NeurIPS 2017.
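Since the slide shows only the figure, here is a minimal NumPy sketch of the scaled dot-product attention at the core of multi-head attention, following the cited paper (single head; names are mine):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

n, d = 4, 8
Q, K, V = (np.random.rand(n, d) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)

Multi-head attention runs h such attentions in parallel over learned linear projections of Q, K, and V, then concatenates and projects the results.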
SLIDE 12

DNN Compilers: TensorFlow XLA, Facebook Glow / TorchScript JIT

▪ Use techniques from compiler construction: DNN → Graph → IR → Transformations → HW Mapping

SLIDE 13

DNN Compilers: TensorFlow XLA, Facebook Glow / TorchScript JIT, TVM Stack, Intel nGraph

▪ Use techniques from compiler construction: DNN → Graph → IR → Transformations → HW Mapping
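As an example of the DNN → Graph → IR → HW Mapping flow, a tiny sketch that asks TensorFlow's XLA JIT to trace, lower, and compile a function (the function itself is a made-up example):

import tensorflow as tf

# jit_compile=True requests XLA: the traced graph is lowered to HLO IR,
# optimized (fusion, layout assignment), and compiled for the device.
@tf.function(jit_compile=True)
def fused_op(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)  # matmul/add/relu are fusion candidates

x = tf.random.normal([32, 128])
w = tf.random.normal([128, 64])
b = tf.zeros([64])
print(fused_op(x, w, b).shape)  # (32, 64)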

SLIDE 14

Partitioning Computation?

Data Parallelism

SLIDE 15

Minibatch Stochastic Gradient Descent (SGD)

[Figure: predicted probabilities (0.54, 0.28, 0.02, 0.07, 0.03, 0.04, 0.02) vs. one-hot true label over classes Cat, Dog, Airplane, Truck, Horse, Bicycle]

  • T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019
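A bare-bones sketch of the minibatch SGD loop itself (all names and the toy objective are mine; the gradient oracle is left abstract):

import numpy as np

def minibatch_sgd(x, grad_fn, dataset, lr=0.1, batch_size=64, steps=1000):
    # x: parameter vector; grad_fn(x, batch) -> stochastic gradient estimate
    rng = np.random.default_rng(0)
    for _ in range(steps):
        batch = dataset[rng.choice(len(dataset), batch_size)]  # sample a minibatch
        x = x - lr * grad_fn(x, batch)                         # descend
    return x

# toy example: fit the data mean; per-sample loss 0.5*||x - y||^2, gradient x - y
data = np.random.default_rng(1).normal(3.0, 1.0, size=(10000, 1))
x_star = minibatch_sgd(np.zeros(1), lambda x, b: (x - b).mean(axis=0), data)
print(x_star)  # converges to ~3.0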
SLIDE 16

Partitioning Computation?

Data Parallelism | Model Parallelism (Channel/Filter, Spatial, …) | Pipeline Parallelism

[Figure: pipeline schedule of layers 1-3 across Proc 1-3, with idle "bubble" phases between forward and backward passes]

SLIDE 17

Partitioning Computation?

Data Parallelism (see the allreduce sketch after this list)
▪ Simple and efficient solution, easy to implement
▪ Duplicate parameters at all processors
▪ Affects generalization

Model Parallelism (Channel/Filter, Spatial, …)
▪ Parameters/domain can be distributed across processors
▪ Good for: large inputs, wide networks
▪ Complex communication per layer; performance hinges on implementation

Pipeline Parallelism (layer-wise)
▪ Parameters can be distributed across processors
▪ Good for: deep models, few activations
▪ Sparse communication pattern (only pipeline stages)
▪ Consistent model introduces idle-time "bubble"

SLIDE 18

Hybrid parallelism

▪ Layers/parameters can be distributed across processors
▪ Can distribute the minibatch
▪ Often specific to layer types (e.g., distribute fully connected layers but handle convolutional layers data-parallel)
▪ Enables arbitrary combinations of data, model, and pipeline parallelism - very powerful!

  • A. Krizhevsky: One weird trick for parallelizing convolutional neural networks, arXiv 2014
  • J. Dean et al.: Large scale distributed deep networks, NIPS'12
  • T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019

SLIDE 19

Training is not just Training

▪ Imbalanced workload over time
▪ Nontrivial gradient aggregation
▪ Data/compute redistribution

  • K. Osawa et al.: "Second-order Optimization Method for Large Mini-batch: Training ResNet-50 on ImageNet in 35 Epochs", arXiv 2018
  • T. Karras et al.: "Progressive Growing of GANs for Improved Quality, Stability, and Variation", arXiv 2017

SLIDE 20

Hyperparameter and Architecture search

[Figures: Reinforcement Learning [1]; Evolutionary Algorithms [4]; Model-Based Optimization [5,6]]

▪ Meta-optimization of hyperparameters (e.g., momentum) and DNN architecture
▪ Using Reinforcement Learning [1] (explore/exploit different configurations)
▪ Genetic Algorithms with modified (specialized) mutations [2]
▪ Particle Swarm Optimization [3] and other meta-heuristics
▪ Multi-level optimization (a simple search-loop sketch follows the references)

[1] M. Jaderberg et al.: Population Based Training of Neural Networks, arXiv 2017
[2] E. Real et al.: Regularized Evolution for Image Classifier Architecture Search, arXiv 2018
[3] P. R. Lorenzo et al.: Hyper-parameter Selection in Deep Neural Networks Using Parallel Particle Swarm Optimization, GECCO'17
[4] H. Liu et al.: Hierarchical Representations for Efficient Architecture Search, ICLR'18
[5] R. Luo et al.: Neural Architecture Optimization, NeurIPS'18
[6] H. Liu et al.: DARTS: Differentiable Architecture Search, ICLR'19
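As referenced above, a deliberately simple random-search loop over a hyperparameter space; illustrative only, since the cited methods (PBT, regularized evolution, PSO, DARTS) are far more sophisticated. All names are mine:

import random

def random_search(train_and_eval, space, trials=20, seed=0):
    # sample configurations and keep the best validation score
    rng = random.Random(seed)
    best_cfg, best_score = None, float('-inf')
    for _ in range(trials):
        cfg = {k: rng.choice(v) for k, v in space.items()}  # sample a config
        score = train_and_eval(cfg)                          # one full training run
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

space = {"lr": [0.1, 0.01, 0.001], "momentum": [0.0, 0.9, 0.99], "layers": [2, 4, 8]}
# stand-in objective; in practice this trains a DNN and returns validation accuracy
mock_eval = lambda cfg: -abs(cfg["lr"] - 0.01) - abs(cfg["momentum"] - 0.9)
print(random_search(mock_eval, space))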

SLIDE 21

Hyperparameter and Architecture search (repeat of Slide 20)

SLIDE 22

Updating parameters in distributed data parallelism

Central: sharded parameter server, $x' = v(x, \Delta x)$; training agents push gradients $\Delta x$ and pull updated weights $x$.

Decentral: collective allreduce of $x$ among the training agents.

Cost models (latency $M$, bandwidth term $\delta n H$ for $n$ parameters, $Q$ agents, $t$ parameter-server shards; a small comparison script follows the references):

$T_{allreduce} = 2M \log_2 Q + 2\,\delta n H\,(Q - 1)/Q$

$T_{ps} = 2M + 2Q\,\delta n H / t$

  • Collective operations
  • Topologies
  • Neighborhood collectives
  • RMA?

Hierarchical Parameter Server

  • S. Gupta et al.: Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study, ICDM'16
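A small script to compare the two cost models above (the symbol values are placeholders I chose, not measurements from the talk):

from math import log2

def t_allreduce(M, dnH, Q):
    # 2*M*log2(Q) + 2*dnH*(Q-1)/Q : latency term + bandwidth term
    return 2 * M * log2(Q) + 2 * dnH * (Q - 1) / Q

def t_param_server(M, dnH, Q, t):
    # 2*M + 2*Q*dnH/t : push and pull against t shards
    return 2 * M + 2 * Q * dnH / t

M, dnH = 1e-6, 0.05  # assumed: 1 us latency, 50 ms to transfer all gradients
for Q in (4, 64, 1024):
    print(Q, t_allreduce(M, dnH, Q), t_param_server(M, dnH, Q, t=16))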
SLIDE 23

Parameter (and Model) consistency - centralized

▪ Trades off "statistical performance" for "hardware performance"
▪ Sharded parameter server, $x' = v(x, \Delta x)$: training agents push $\Delta x$ and pull $x$
▪ Parameter exchange frequency can be controlled, while still attaining convergence:

Synchronous | Stale-Synchronous / Bounded Asynchronous | Asynchronous

[Figure: execution timelines of agents 1…m against a parameter server: synchronous (all agents sync after each update $x^{(1)}, x^{(2)}, \ldots, x^{(U)}$), stale-synchronous (updates merged under a bounded maximum staleness), and asynchronous (agents push and pull without synchronization)]

Consistency spectrum: Synchronous SGD → Stale-Synchronous SGD → Asynchronous SGD (HOGWILD!) → Model Averaging (e.g., elastic) → Ensemble Learning (consistent → inconsistent)

  • J. Dean et al.: Large scale distributed deep networks, NIPS'12
  • F. Niu et al.: Hogwild: A lock-free approach to parallelizing stochastic gradient descent, NIPS'11
SLIDE 24

Parameter (and Model) consistency - decentralized

▪ Training agents perform a collective allreduce of $x$; no central server
▪ Parameter exchange frequency can be controlled, while still attaining convergence:
▪ Using Gossip Algorithms [Jin et al. 2016] or Partial Collectives [Li et al. 2020]

Synchronous | Stale-Synchronous / Bounded Asynchronous | Asynchronous

[Figure: execution timelines of agents 1…m: synchronous (global allreduce after every step), stale-synchronous (merge under a bounded maximum staleness), and asynchronous/gossip (agents r and k exchange and merge at their own pace)]

Consistency spectrum: Synchronous SGD → Stale-Synchronous SGD → Asynchronous SGD (HOGWILD!) → Model Averaging (e.g., elastic) → Ensemble Learning (consistent → inconsistent)

  • Peter H. Jin et al.: "How to scale distributed deep learning?", NIPS MLSystems 2016
  • Shigang Li et al.: "Taming unbalanced training workloads in deep learning with partial collective operations", PPoPP 2020

SLIDE 25

Parameter consistency in deep learning

Model Averaging (e.g., elastic), on the consistency spectrum between Asynchronous SGD (HOGWILD!) and Ensemble Learning.

Elastic Averaging SGD, using physical ("elastic") forces between different versions of $x$: each agent $j$ takes a gradient step and is pulled toward a center variable $\tilde{x}$, which in turn tracks the agents' average:

$x_{u+1,j} = x_{u,j} - \theta\,\Delta x_{u,j} - \beta\,(x_{u,j} - \tilde{x}_u)$

$\tilde{x}_{u+1} = (1 - \gamma)\,\tilde{x}_u + \frac{\gamma}{n} \sum_{j=1}^{n} x_{u,j}$

[Figure: per-agent timelines $x^{(1,j)}, \ldots, x^{(6,j)}$ periodically synchronizing with the elastic average]

  • S. Zhang et al.: Deep learning with Elastic Averaging SGD, NIPS'15
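A toy NumPy sketch of the EASGD update above, simulating n agents in one process (the quadratic objective and constants are my own assumptions):

import numpy as np

def easgd_round(x_agents, x_center, grads, theta=0.05, beta=0.1, gamma=0.5):
    # one synchronized EASGD round for all agents plus the center variable
    x_agents = [x - theta * g - beta * (x - x_center)   # gradient step + elastic pull
                for x, g in zip(x_agents, grads)]
    x_center = (1 - gamma) * x_center + gamma * np.mean(x_agents, axis=0)
    return x_agents, x_center

rng = np.random.default_rng(0)
x_agents = [rng.normal(size=4) for _ in range(8)]
x_center = np.zeros(4)
for _ in range(100):  # toy objective 0.5*||x||^2, so the gradient is x itself
    x_agents, x_center = easgd_round(x_agents, x_center, x_agents)
print(np.linalg.norm(x_center))  # converges toward 0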

SLIDE 26

Parameter consistency in deep learning

Ensemble Learning, the inconsistent extreme of the spectrum: train independent models and average their predictions.

[Figure: averaged ensemble prediction (0.54, 0.28, 0.02, 0.07, …) over classes Cat, Dog, Airplane, Truck, Horse, Bicycle]

  • T. G. Dietterich: Ensemble Methods in Machine Learning, MCS 2000
SLIDE 27

Communication optimizations

[Figure: sharded parameter server, $x' = v(x, \Delta x)$, with training agents pushing $\Delta x$ and pulling $x$]

▪ Different options for how to optimize updates
  ▪ Send $\Delta x$, receive $x$
  ▪ Send FC factors ($c_{m-1}$, $c_m$), compute $\Delta x$ on the parameter server; broadcast the factors to avoid receiving the full $x$
  ▪ Use lossy compression when sending, and accumulate the error locally!
▪ Quantization
  ▪ Quantize weight updates and potentially weights
  ▪ Main trick is stochastic rounding [1]; the expectation is more accurate. Enables low precision (half, quarter) to become standard
  ▪ TernGrad - ternary weights [2], 1-bit SGD [3], …
▪ Sparsification (see the top-k sketch after the references)
  ▪ Do not send small weight updates, or only send the top-k [4]; accumulate omitted gradients locally

[1] S. Gupta et al.: Deep Learning with Limited Numerical Precision, ICML'15
[2] F. Li and B. Liu: Ternary Weight Networks, arXiv 2016
[3] F. Seide et al.: 1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs, Interspeech 2014
[4] C. Renggli et al.: SparCML: High-Performance Sparse Communication for Machine Learning, arXiv 2018
(source: ai.intel.com)
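As referenced above, a minimal sketch of top-k gradient sparsification with local error accumulation (class and variable names are mine):

import numpy as np

class TopKCompressor:
    # keep only the k largest-magnitude entries; remember the rest locally
    def __init__(self, shape, k):
        self.error = np.zeros(shape)  # residual of everything not yet sent
        self.k = k

    def compress(self, grad):
        corrected = grad + self.error                 # error feedback
        idx = np.argpartition(np.abs(corrected), -self.k)[-self.k:]
        sparse = np.zeros_like(corrected)
        sparse[idx] = corrected[idx]                  # top-k survives
        self.error = corrected - sparse               # accumulate the omitted part
        return idx, corrected[idx]                    # (indices, values) to send

g = np.random.default_rng(0).normal(size=1000)
comp = TopKCompressor(g.shape, k=10)
idx, vals = comp.compress(g)
print(len(idx), np.linalg.norm(comp.error))  # 10 entries sent, rest retained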

SLIDE 28

SparCML - Quantized sparse allreduce for decentral updates

[Figure: sparse contributions $\Delta x_1$, $\Delta x_2$, $\Delta x_3$, $\Delta x_4$ combined pairwise (+)]

  • C. Renggli et al.: SparCML: High-Performance Sparse Communication for Machine Learning, SC 2019

Microsoft Speech production workload results: 2 weeks → 2 days! (six epochs, 60 million parameters)
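The core idea, reducing sparse (index, value) contributions instead of dense vectors, in a toy Python form (a stand-in using dictionaries; SparCML's actual MPI algorithm is more involved):

def sparse_sum(contribs):
    # combine per-agent sparse gradients {index: value} into one sparse result
    total = {}
    for contrib in contribs:            # in MPI this would be a reduction tree
        for i, v in contrib.items():
            total[i] = total.get(i, 0.0) + v
    return total

# four agents, each with a few non-zero gradient entries (toy data)
agents = [{3: 0.5, 17: -1.0}, {3: 0.1}, {42: 2.0}, {17: 0.4, 42: -0.5}]
print(sparse_sum(agents))  # {3: 0.6, 17: -0.6, 42: 1.5}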

SLIDE 29

20 TB/night

Source: OpenAI

SLIDE 30

Opportunities

[Figure: Neural Code Comprehension pipeline - Source Code (C/C++, FORTRAN, Python, Java, CUDA, OpenCL, …) → Frontend → SSA Representation (LLVM IR) → Learnable Representation → DNNs → High-Level (Downstream) Tasks: Malicious Code Detection (Anti-Virus), Guided Programming (IDE), Code Optimization (Compiler), Hardware Mapping]

SLIDE 31

Inspiration from Natural Language Processing (NLP)

The Naturalness of Software hypothesis [1]: "Software is a form of human communication; software corpora have similar statistical properties to natural language corpora; and these properties can be exploited to build better software engineering tools."

[1] Miltiadis Allamanis, Earl T. Barr, Premkumar T. Devanbu, and Charles A. Sutton: A survey of machine learning for big code and naturalness, CoRR abs/1709.06182, 2017

SLIDE 32

Natural Language vs. Programming Languages

"The domestic cat is a small, typically furry, carnivorous mammal. They are often called house cats when kept as indoor pets or simply cats when there is no need to distinguish them from other felids and felines. They are often valued by humans for companionship and for their ability to hunt. There are more than seventy cat breeds recognized by various cat registries."
Source: Wikipedia: Cat, https://en.wikipedia.org/wiki/Cat

int fibonacci(int n) {
  if ((n == 1) || (n == 0))
    return n;
  else
    return fibonacci(n - 1) + fibonacci(n - 2);
}

Natural language: read sequentially | Programming languages: distinct structural features

SLIDE 33

Natural Language vs. Programming Languages (build of Slide 32)

Adds: natural language uses local references | programming languages have long-range dependencies, e.g., a distant call site:

... int f = fibonacci(my_number); ...

SLIDE 34

Natural Language vs. Programming Languages (build of Slide 32)

Adds: natural-language words come from a set vocabulary | programming languages have a high rate of neologisms (fresh identifiers such as my_number, p, flower, fib, my_func)

SLIDE 35

Natural Language vs. Programming Languages (full comparison)

Natural Language | Programming Languages
Read sequentially | Distinct structural features
Local references | Long-range dependencies
Words come from a set vocabulary | High rate of neologisms
Semantically robust | Semantically brittle

[Example of brittleness: a one-token change turns the program's output from 0 1 1 2 3 5 8 13 21 34 … into 0 1 2 4 8 16 32 64 128 256 …]

SLIDE 36

Representations of code

Source Code | Abstract Syntax Tree (AST) | Assembly | Static Single Assignment (SSA)

[Figure: AST of fibonacci - an if node with branches "return n" and "return fib(n-1) + fib(n-2)"]

LLVM IR (SSA) of fibonacci:

define i32 @_Z9fibonaccii(i32 %n) #0 !dbg !4 {
  %1 = or i32 %n, 1, !dbg !16
  %2 = icmp eq i32 %1, 1, !dbg !16
  br i1 %2, label %9, label %3, !dbg !16
  %4 = add nsw i32 %n, -1, !dbg !18
  %5 = tail call i32 @_Z9fibonaccii(i32 %4), !dbg !19
  %6 = add nsw i32 %n, -2, !dbg !20
  %7 = tail call i32 @_Z9fibonaccii(i32 %6), !dbg !21
  %8 = add nsw i32 %7, %5, !dbg !22
  ret i32 %8, !dbg !23
  ret i32 %n, !dbg !23
}

SLIDE 37

Representations of code (build of Slide 36)

Learned representations over these forms:
▪ AST paths [Raychev et al. 2015], [Le et al. 2018]
▪ code2vec [Alon et al. 2018]
▪ DeepTune [Cummins et al. 2017]
▪ inst2vec (LLVM) [Ben-Nun et al. 2018]
▪ IR2Vec (LLVM) [Keerthy et al. 2019]
▪ CDFG (LLVM) [Brauckmann et al. 2020]
▪ ProGraML (LLVM, XLA) [WIP]

  • V. Raychev et al.: "Predicting Program Properties from 'Big Code'", POPL 2015
  • C. Cummins et al.: "End-to-end Deep Learning of Optimization Heuristics", PACT 2017
  • U. Alon et al.: "code2vec: Learning Distributed Representations of Code", POPL 2018
  • T. Ben-Nun et al.: "Neural Code Comprehension: A Learnable Representation of Code Semantics", NeurIPS 2018
  • Q. Le et al.: "Deep learning at the shallow end: Malware classification for non-domain experts", DFRWS 2018
  • V. Keerthy et al.: "IR2Vec: A Flow Analysis-based Scalable Infrastructure for Program Encodings", arXiv 2019
  • A. Brauckmann et al.: "Compiler-Based Graph Representations for Deep Learning Models of Code", CC 2020
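To give a flavor of IR-based representations such as inst2vec, a toy sketch that normalizes LLVM IR statements into a small vocabulary of tokens (my own simplification, not the published preprocessing pipeline):

import re

def normalize_ir_statement(stmt):
    # replace identifiers and literals so statements form a small vocabulary
    stmt = re.sub(r',?\s*!dbg !\d+', '', stmt)       # strip debug metadata
    stmt = re.sub(r'@[\w.]+', '@GLOBAL', stmt)       # abstract global names
    stmt = re.sub(r'%[\w.]+', '%ID', stmt)           # abstract SSA identifiers
    stmt = re.sub(r'(?<=\s)-?\d+', 'CONST', stmt)    # abstract integer literals
    return stmt.strip()

ir = [
    "%4 = add nsw i32 %n, -1, !dbg !18",
    "%5 = tail call i32 @_Z9fibonaccii(i32 %4), !dbg !19",
]
for s in ir:
    print(normalize_ir_statement(s))
# %ID = add nsw i32 %ID, CONST
# %ID = tail call i32 @GLOBAL(i32 %ID)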

SLIDE 38

Representations of code

[Figure: comparing the code2vec, inst2vec, and CDFG representations]

SLIDE 39

Self-Supervised Models of Code

SLIDE 40

ProGraML Overview

SLIDE 41

ProGraML Overview

SLIDE 42

ProGraML Overview

SLIDE 43

ProGraML on compiler tasks

SLIDE 44

Case Study: Algorithm Classification (104 Classes)

  • L. Mou et al., โ€œConvolutional neural networks over tree structures for programming language processingโ€, AAAI 2016.
SLIDE 45

Case Study: Heterogeneous Device Mapping

Tasks: CPU or GPU? GPU thread-block size.

SLIDE 46

Case Study: Branch Prediction

SLIDE 47

โ–ช A supercomputing problem - amenable to established tools and tricks from HPC โ–ช Concurrency is easy to attain, hard to program beyond data-parallelism โ–ช Main bottleneck in distributed is communication โ€“ reduction by using the robustness of SGD โ–ช Co-design is prevalent โ–ช Very different environment from traditional HPC

โ–ช Trade-off accuracy for performance!

โ–ช Main objective is generalization

โ–ช Performance-centric view in HPC can be harmful for accuracy

70

Summary โ€“ HPC โ†’ ML

https://www.arxiv.org/abs/1802.09941

SLIDE 48

Summary - ML → HPC

▪ Categorizing and understanding code is essential for various tasks
▪ Reasoning about code requires different tools than natural languages
▪ Using classical compiler construction, structure can be recovered
▪ Results are promising for various classes of downstream tasks

https://arxiv.org/abs/1806.07336