SLIDE 1

A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

Yu (Emma) Wang, Gu-Yeon Wei, David Brooks Harvard University

3/3/2020
Contact: ywang03@g.harvard.edu

ParaDnn

github.com/Emma926/paradnn

SLIDE 2

Acknowledgement

Frank Chen, Glenn Holloway, Dan Janni, Peter Mattson, Lifeng Nai, David Patterson, Francesco Pontiggia, Parthasarathy Ranganathan, Vijay Reddi, Brennan Saeta, Zak Stone, Anitha Vijayakumar, Shibo Wang, Qiumin Xu, Doe Hyun Yoon, Cliff Young

SLIDE 3

Challenges with ML Benchmarking

  • Diversity of deep learning models: problem domains, models, and datasets
  • Pace of the field: state-of-the-art models evolve every few months
  • Varying evaluation metrics: accuracy, time to train, inference latency
  • Multi-disciplinary field: algorithms, systems, hardware, and ML software stacks

SLIDE 4

State of the art: MLPerf 0.6

Area           | Benchmark              | Dataset       | Model        | Reference Implementation
---------------|------------------------|---------------|--------------|-------------------------
Vision         | Image classification   | ImageNet      | ResNet-50    | TensorFlow
Vision         | Object detection       | COCO 2017     | Mask R-CNN   | PyTorch
Vision         | Object detection       | COCO 2017     | SSD-ResNet34 | PyTorch
Language/Audio | Translation            | WMT Eng-Germ  | Transformer  | TensorFlow
Language/Audio | Translation            | WMT Eng-Germ  | GNMT         | PyTorch
Commerce       | Recommendation         | MovieLens-20M | NCF          | PyTorch
Action         | Reinforcement learning | Go            | MiniGo       | TensorFlow


SLIDE 6

Our Methodology

ParaDnn


SLIDE 8

ParaDnn vs MLPerf

ParaDnn:

  • Avoid drawing conclusions based on several arbitrary models
  • Generate thousands of parameterized, end-to-end models
  • Prepare hardware designs for future models
  • Complement the use of existing real-world models, i.e. MLPerf

MLPerf:

  • Good for studying accuracy or convergence with real datasets
  • Represent the specific models some people care about

SLIDE 9

ParaDnn Canonical Models

  • Fully connected (FC): input → (# of layers) FC layers of (# of nodes) each → output
  • CNNs (residual, bottleneck): input → (# of residual/bottleneck blocks, with swept filter size) → 4 FC layers → output
  • RNNs (RNN, LSTM, GRU): input → (# of layers) of RNN/LSTM/GRU cells (swept cell size) → output
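For illustration, a minimal sketch of how such a parameterized sweep can be generated with Keras. The function name, dimensions, and sweep points are hypothetical, not the actual ParaDnn API (see github.com/Emma926/paradnn for the real implementation):

```python
# Sketch of a ParaDnn-style parameterized FC sweep (illustrative, not the real API).
import itertools
import tensorflow as tf

def make_fc_model(num_layers, num_nodes, input_dim=2000, output_dim=1000):
    """Build input -> num_layers fully connected layers of num_nodes each -> output."""
    layers = [tf.keras.layers.Dense(num_nodes, activation="relu",
                                    input_shape=(input_dim,))]
    layers += [tf.keras.layers.Dense(num_nodes, activation="relu")
               for _ in range(num_layers - 1)]
    layers.append(tf.keras.layers.Dense(output_dim))
    return tf.keras.Sequential(layers)

# Sweep depth and width on a log scale to cover many design points.
for num_layers, num_nodes in itertools.product([4, 16, 64], [32, 256, 2048]):
    model = make_fc_model(num_layers, num_nodes)
    print(num_layers, num_nodes, model.count_params())
```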

SLIDE 10

Models

SLIDE 11

Models

  • ParaDnn covers a larger range of model sizes than the real-world models
  • from ~10k to ~1 billion parameters (see the parameter-count sketch below)
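To make the size range concrete, a rough weight count for the FC family. Bias terms are ignored, and the input/output dimensions and sweep points are illustrative assumptions, not the exact ParaDnn bounds:

```python
def fc_param_count(num_layers, num_nodes, input_dim=2000, output_dim=1000):
    """Approximate weight count of input -> num_layers x num_nodes -> output."""
    return (input_dim * num_nodes                        # input projection
            + (num_layers - 1) * num_nodes * num_nodes   # hidden-to-hidden weights
            + num_nodes * output_dim)                    # output projection

print(fc_param_count(4, 64))     # 204,288: hundreds of thousands at the small end
print(fc_param_count(16, 8192))  # 1,031,208,960: about a billion at the large end
```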
SLIDE 12

Analysis Enabled by ParaDnn

  • Roofline analysis of TPU v2
  • Homogeneous Platform Comparison: TPU v2 vs v3
  • Heterogeneous Platform Comparison: TPU vs GPU
SLIDE 13-17

The Roofline Model

[Roofline chart, built up over five slides: attainable FLOP/s vs. arithmetic intensity, with a flat Peak FLOPS ceiling, a sloped Memory Bandwidth roof, a memory-intensive region under the sloped roof, and a compute-intensive region under the flat ceiling]
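The roofline model caps attainable throughput by both compute and memory: attainable FLOP/s = min(peak FLOP/s, arithmetic intensity × memory bandwidth). A minimal sketch; the per-core numbers are back-of-the-envelope figures derived from later slides (180 TFLOPS per board spread over 8 cores, and the 300 GB/s per-core bandwidth quoted on the hardware-platform slide), not official specifications:

```python
def roofline_flops(intensity, peak_flops, mem_bw):
    """Attainable FLOP/s for a kernel with the given arithmetic intensity (FLOPs/byte)."""
    return min(peak_flops, intensity * mem_bw)

# Illustrative TPU v2 per-core numbers (assumed): 180 TFLOPS/board over 8 cores
# gives ~22.5 TFLOPS per core, with 300 GB/s of memory bandwidth per core.
peak_flops = 22.5e12
mem_bw = 300e9

ridge = peak_flops / mem_bw                     # ~75 FLOPs/byte separates the two regimes
print(roofline_flops(10, peak_flops, mem_bw))   # memory-intensive: 3.0e12 FLOP/s
print(roofline_flops(200, peak_flops, mem_bw))  # compute-intensive: capped at 2.25e13 FLOP/s
```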

SLIDE 18

Transformer


SLIDE 19-21

FC Models

[FC models plotted on the TPU v2 roofline, with compute-bound and memory-bound points highlighted across the three slides]

ParaDnn sweeps a large range of models, from memory-bound to compute-bound.
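A rough arithmetic-intensity estimate for a single FC layer shows why the sweep crosses both regimes. The layer dimensions below are illustrative, and weights and activations are assumed to be 4-byte floats:

```python
def fc_intensity(batch, n_in, n_out, bytes_per_elem=4):
    """FLOPs per byte moved for one dense layer: y = x @ W."""
    flops = 2 * batch * n_in * n_out                           # multiply-accumulates
    bytes_moved = bytes_per_elem * (n_in * n_out               # weights
                                    + batch * (n_in + n_out))  # activations in/out
    return flops / bytes_moved

for batch in [16, 128, 1024]:
    print(batch, round(fc_intensity(batch, 4096, 4096), 1))
# 16 -> ~7.9, 128 -> ~60.2, 1024 -> ~341.3 FLOPs/byte: growing the batch size
# (or the layer width) pushes an FC model from memory-bound toward compute-bound.
```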

SLIDE 22

TPU v2 vs v3?


SLIDE 23-27

How to upgrade to TPU v3?

[Design-space build across five slides: starting from TPU v2, the candidate upgrades are TPU v3 (FLOPS↑), TPU v3 (Mem BW↑), and TPU v3 (FLOPS↑ + Mem BW↑); the final slide asks how much to scale each: TPU v2 → ?x FLOPS, ?x Mem BW → TPU v3]

SLIDE 28

Architecture of TPU v2 vs v3

Figure from https://cloud.google.com/tpu/docs/system-architecture

TPU v2: 180 TFLOPS / board; TPU v3: 420 TFLOPS / board
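The peak-compute ratio alone predicts the ceiling for compute-bound work on v3 versus v2. A quick check (the memory-bandwidth figure is not given on this slide, so only the FLOPS ratio is computed):

```python
tpu_v2_tflops = 180  # per board, from this slide
tpu_v3_tflops = 420  # per board, from this slide

print(tpu_v3_tflops / tpu_v2_tflops)  # ~2.33, matching the 2.3x compute-bound speedup shown next
```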

SLIDE 29

Google’s Choice of TPU v3

TPU v2 → TPU v3: 2.3x FLOPS, ?x Mem BW

SLIDE 30-33

TPU v3 vs v2: FC Operation Breakdown

[Per-operation speedup chart, annotated across the four slides:]

  • Compute-bound operations: 2.3x speedup
  • Memory-bound operations: 1.5x speedup
  • Memory-bound operations that benefit from the 2x memory capacity: 3x speedup

SLIDE 34

Google’s Choice of TPU v3

TPU v2 → TPU v3: 2.3x FLOPS, 1.5x Mem BW

SLIDE 35

TPU v3 vs v2: FC Operation Breakdown

ParaDnn provides a diverse set of operations and shows that different operations are sensitive to different system-component upgrades.

SLIDE 36

TPU vs GPU?

SLIDE 37-38

Hardware Platforms

[Platform-specification table; the annotation highlighted here: 300 GB/s of memory bandwidth per core]

SLIDE 39-40

FC and CNN

[Training dataflow diagrams. FC: a forward FC op over weights W and activations A, an FC-gradient op, and a weighted-sum op producing gradients G. CNN: the same structure with Conv and Conv-gradient ops in place of the FC ops.]

CNNs have fewer weights and larger Conv ops.
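To make "fewer weights, larger Conv ops" concrete, compare a rough intensity estimate for a 3x3 convolution against the FC layer sketched earlier. Dimensions are again illustrative, with 4-byte floats assumed:

```python
def conv_intensity(batch, h, w, c_in, c_out, k=3, bytes_per_elem=4):
    """FLOPs per byte moved for one k x k, stride-1, same-padded conv layer."""
    flops = 2 * batch * h * w * c_in * c_out * k * k
    bytes_moved = bytes_per_elem * (k * k * c_in * c_out               # weights (few)
                                    + batch * h * w * (c_in + c_out))  # activations
    return flops / bytes_moved

# 3x3 conv, 256 -> 256 channels, 28x28 feature map, batch 128:
print(round(conv_intensity(128, 28, 28, 256, 256), 1))
# ~569.5 FLOPs/byte, versus ~60 for a 4096x4096 FC layer at the same batch size:
# each small conv weight tensor is reused at every spatial position, so CNNs sit
# much deeper in the compute-bound region of the roofline than FC models.
```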


SLIDE 42-44

FC TPU/GPU Speedup colored with Batch Size

[Scatter plot of per-model TPU/GPU speedup, colored by batch size; speedups span roughly 0.35x to 9x, with the regions above and below 1x labeled "TPU is better" and "GPU is better"]

SLIDE 45

FC TPU/GPU Speedup colored with Node Size

[Same speedup scatter, colored by node count]

More nodes → more weights → more memory-bound

SLIDE 46

Hardware Platforms

[Platform table revisited; annotations: 300 GB/s per core, 1.44x]

SLIDE 47-49

CNN TPU/GPU Speedup colored with Batch Size

[Scatter plot of per-model TPU/GPU speedup for CNNs, colored by batch size]

  • Up to 6x speedup
  • The TPU architecture and software stack are highly optimized for CNNs
  • All models run faster on the TPU
  • Larger batch sizes lead to higher speedups

SLIDE 50

CNN TPU/GPU Speedup colored with Filters

  • Models with more filters have a higher lower bound on speedup

SLIDE 51

Conclusion

  • Parameterized methodology: ParaDnn + a set of analysis methods
  • Single platform analysis: TPU v2
  • Homogeneous platform comparison: TPU v2 vs v3
  • Heterogeneous platform comparison: TPU vs GPU
SLIDE 52

Limitations of this Work

  • Does not include:
    • Inference
    • Multi-node systems: multi-GPU or TPU pods
    • Accuracy, convergence
    • Cloud overhead
  • Tractability constraints:
    • The range of hyperparameters and datasets is limited
    • Small batch sizes (<16) and large batch sizes (>2k) are not studied
    • Synthetic datasets do not include data-infeed overhead
    • The TPU loop count is fixed at 100 iterations; larger values can slightly increase performance

SLIDE 53

Questions?

ParaDnn

Available: github.com/Emma926/paradnn