

SLIDE 1

checkmateai.github.io

Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization

Paras Jain

Joint work with: Ajay Jain, Ani Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, Joseph Gonzalez

Checkmate

SLIDE 2

BigGAN (2018)

Image generation

Brock et al. 2019

VideoBERT (2019)

Video generation

Sun et al. 2019

GPT-2 (2019)

Text generation

Radford et al. 2019

SLIDE 3

[Chart: parameter counts (10⁶), axis 400–1600, for ResNet-50, FCN, DeepLab, BERT-Large, and GPT-2]

Emerging trend: Rapid growth in model size

Figure adapted from NVIDIA

SLIDE 4

[Chart: per-GPU RAM usage of AlexNet (2012), VGG19 (2014), Inception v3 (2015), ResNet-152 (2015), DenseNet-201 (2016), ResNeXt-101 (2016), FCN8s, RoR (2018), and GAN models (2018), plotted against the 0–15 GB GPU memory limit]

State-of-the-art models have hit a memory capacity wall.

Limited GPU memory is slowing progress in new deep learning models!

Cited memory as a limiting factor: Chen et al. 2016, Gomez et al. 2017, Pohlen et al. 2017, Liu et al. 2019, Dai et al. 2019, Child et al. 2019.

SLIDE 5

(Same chart and citations as slide 4.)

Problem: How do we efficiently train large models beyond memory limits?

SLIDE 6

[Chart: TOPS per GiB of DRAM capacity]

Compute is outstripping DRAM capacity growth.

SLIDE 7

Backprop is optimized for compute efficiency, not RAM usage

[Plot: RAM vs. compute, marking the compute-optimized backprop operating point]

SLIDE 8

[Plot: RAM vs. compute, marking both the compute-optimized and RAM-optimized backprop operating points]

Ideal: scalable algorithm for backprop that adapts to RAM constraints

SLIDE 9

This work: optimal space-time tradeoff for backpropagation

[Plot: the trade-off curve between compute-optimized and RAM-optimized backprop]

Checkmate explores the optimal trade-off: e.g., 5x larger inputs at 2x compute cost.

SLIDE 10

[Plot: the RAM-hungry policy sits at the compute-optimized end of the trade-off]

RAM-hungry backprop policy: keep all layers in RAM.

SLIDE 11

[Diagram: forward pass A → B → C → D → E → Loss (with Label), backward pass ∇E → ∇D → ∇C → ∇B → ∇A. RAM-over-time chart: A–E all resident while ∇E is computed.]

RAM-hungry backpropagation policy: keep all layers in RAM.

SLIDE 12

[Diagram: same network; the RAM-over-time chart now also holds ∇E and ∇D.]

RAM-hungry backpropagation policy: keep all layers in RAM.

SLIDE 13

[Diagram: same network; the RAM-over-time chart now holds ∇E, ∇D, ∇C.]

RAM-hungry backpropagation policy: keep all layers in RAM.

SLIDE 14

[Diagram: same network; every activation and gradient is resident at once — this is peak RAM.]

RAM-hungry backpropagation policy: keep all layers in RAM.

SLIDE 15

[Plot: the RAM-optimized policy sits at the low-RAM, high-compute end of the trade-off]

RAM-optimized backpropagation policy: recompute all layers as needed.

SLIDE 16

How can we use less memory? Free activations early and recompute them.

RAM-optimized backpropagation policy: recompute all layers.

[Diagram: forward pass A → B → C → D → E → Loss, backward pass ∇E → ∇D → ∇C → ∇B → ∇A; the RAM-over-time chart stays well below the no-recomputation peak.]
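The two policies on these slides can be sanity-checked with a toy model. A minimal sketch (hypothetical unit sizes and costs, not numbers from the talk) of peak RAM and extra compute for "keep all layers" vs. "recompute all layers" on a linear chain:

```python
# Toy comparison of the two backprop policies on an n-layer chain.
# Sizes and costs are illustrative assumptions, not measurements.

def peak_ram_keep_all(sizes):
    # Keep-all: every activation stays resident until its gradient is used,
    # so peak RAM is the sum of all activation sizes.
    return sum(sizes)

def peak_ram_recompute_all(sizes):
    # Recompute-all: at any moment we hold only a layer's input and its
    # output under construction, i.e. the largest adjacent pair.
    return max(sizes[i] + sizes[i + 1] for i in range(len(sizes) - 1))

def extra_compute_recompute_all(costs):
    # Rebuilding layer i's input replays layers 0..i-1, so the total extra
    # forward work grows quadratically in the chain length.
    return sum(sum(costs[:i]) for i in range(1, len(costs)))

sizes = [4, 4, 4, 4, 4]
costs = [1, 1, 1, 1, 1]
print(peak_ram_keep_all(sizes))            # 20 units of RAM
print(peak_ram_recompute_all(sizes))       # 8 units of RAM
print(extra_compute_recompute_all(costs))  # 10 extra layer evaluations
```

This is the trade-off the deck's RAM-vs-compute plots depict: recompute-all slashes peak RAM but pays an O(n²) compute bill on a chain.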

SLIDE 17

[Diagram: recompute-all schedule A, B, C, B, D, A, E, C interleaved with ∇E, ∇D — activations are freed early and rebuilt when needed.]

How can we use less memory? Free activations early and recompute them.

RAM-optimized backpropagation policy: recompute all layers.

SLIDE 18

[Diagram: after ∇D, layers A, B, C are recomputed from the input to continue the backward pass.]

How can we use less memory? Free activations early and recompute them.

RAM-optimized backpropagation policy: recompute all layers.

SLIDE 19

[Diagram: the full recompute-all schedule through ∇C; peak RAM stays far below the no-recomputation peak.]

How can we use less memory? Free activations early and recompute them.

RAM-optimized backpropagation policy: recompute all layers.

SLIDE 20

[Diagram: backward pass in which layer C is recomputed once per gradient step — repeated C, ∇C evaluations.]

How to choose which layers to recompute?
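One classical answer, benchmarked later in the deck as the "AP √n" baseline (Chen et al. 2016), is to checkpoint roughly every √n-th layer and replay one segment at a time. A sketch assuming unit-size, unit-cost layers — precisely the uniformity assumption that breaks on real networks, per the challenges on the following slides:

```python
import math

# sqrt(n) checkpointing heuristic (Chen et al. 2016), sketched for a chain
# of n unit-size, unit-cost layers. Real layers violate this uniformity.

def sqrt_n_checkpoints(n):
    # Checkpoint every ~sqrt(n)-th layer of the chain.
    k = max(1, round(math.sqrt(n)))
    return list(range(0, n, k))

def peak_checkpoint_ram(n):
    # O(sqrt(n)) resident checkpoints plus one replayed segment of
    # O(sqrt(n)) activations at a time.
    k = max(1, round(math.sqrt(n)))
    return len(sqrt_n_checkpoints(n)) + k

print(sqrt_n_checkpoints(16))   # [0, 4, 8, 12]
print(peak_checkpoint_ram(16))  # 8
```

With uniform layers this gives O(√n) peak RAM for roughly one extra forward pass; with orders-of-magnitude runtime and size variation across layers, a fixed stride can be far from optimal.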

SLIDE 21

[Diagram: forward pass only]

How to choose which layers to recompute?

SLIDE 22

[Diagram: forward and backward pass]

How to choose which layers to recompute?

SLIDE 23

[Diagram: per-layer runtimes vary by up to 10⁶×]

Challenges for heuristics:

  • 1. Variable runtime per layer

SLIDE 24

[Diagram: per-layer RAM usage varies by up to 10³×]

Challenges for heuristics:

  • 1. Variable runtime per layer
  • 2. Variable RAM usage per layer

SLIDE 25

Challenges for heuristics:

  • 1. Variable runtime per layer
  • 2. Variable RAM usage per layer
  • 3. Real DNNs are non-linear
SLIDE 26

Prior work is suboptimal in the general setting:

  • Greedy heuristic [Chen 2016] [XLA authors 2017, 2020]
  • Divide-and-conquer heuristic [Griewank 2000] [Kowarz 2006] [Siskind 2018] [Kumar 2019]
  • Optimal only for specific architectures [Gruslys 2016] [Feng 2018] [Beaumont 2019]

Challenges:

  • 1. Variable runtime per layer
  • 2. Variable RAM usage per layer
  • 3. Real DNNs are non-linear

SLIDE 27

[Plot: RAM vs. compute, spanning "checkpoint every node" to "recompute all layers"]

Can we optimally trade off RAM for compute? Let's be:

  • 1. Hardware-aware
  • 2. RAM-aware
  • 3. DAG-flexible

SLIDE 28

A system for optimal tensor rematerialization

[System diagram: accurate cost model + flexible search space → optimal solution via an integer linear program, or near-optimal solution via two-phase rounding → graph rewriter]

Hardware- and RAM-aware. Solve for 10 s–1 hr, then train for ~1 month. GPU, CPU, and TPU support.

SLIDE 29

A system for optimal tensor rematerialization

[System diagram: cost model → ILP / two-phase rounding → graph rewriter]

SLIDE 30

A system for optimal tensor rematerialization

[Graph: A → B → C → ∇C → ∇B → ∇A]

SLIDE 31

A system for optimal tensor rematerialization

[Diagram: the graph A → B → C → ∇C → ∇B → ∇A replicated across stages t = 1…6]

R_{t,i} ∈ {0, 1} — layer i is (re)computed in stage t

SLIDE 32

[Diagram: same per-stage grid as slide 31]

A system for optimal tensor rematerialization

R_{t,i} ∈ {0, 1} — layer i is (re)computed in stage t

SLIDE 33

A system for optimal tensor rematerialization

[Diagram: per-stage grid, t = 1…6]

R_{t,i} ∈ {0, 1} — layer i is (re)computed in stage t
S_{t,i} ∈ {0, 1} — layer i is stored in RAM for stage t

SLIDE 34

A system for optimal tensor rematerialization

[Diagram: per-stage grid, t = 1…6]

R_{t,i} ∈ {0, 1} — layer i is (re)computed in stage t
S_{t,i} ∈ {0, 1} — layer i is stored in RAM for stage t

R = what is computed; S = what is in memory.

SLIDE 35

A system for optimal tensor rematerialization

R_{t,i} ∈ {0, 1}, S_{t,i} ∈ {0, 1}

[Heatmap: an example optimal S matrix (stage × layer) for SegNet]

SLIDE 36

A system for optimal tensor rematerialization

[System diagram: cost model → ILP / two-phase rounding → graph rewriter]

SLIDE 37

A system for optimal tensor rematerialization

Decision variables: R_{t,i} ∈ {0, 1} (layer i is (re)computed in stage t), S_{t,i} ∈ {0, 1} (layer i is stored for stage t).

Use the R matrix to create a linear objective that minimizes forward + backward cost:

min_{R,S}  Σ_t Σ_i  C_i · R_{t,i}

where C_i is the profiled cost of layer i.

SLIDE 38

A system for optimal tensor rematerialization

Decision variables: R_{t,i} ∈ {0, 1} (layer i is (re)computed in stage t), S_{t,i} ∈ {0, 1} (layer i is stored for stage t).

Objective — minimize forward + backward cost:

min_{R,S}  Σ_t Σ_i  C_i · R_{t,i}

Correctness constraints:

R_{t,j} ≤ R_{t,i} + S_{t,i} for every edge (i, j): "a layer's dependencies must be computed before evaluation"
S_{t,i} ≤ R_{t−1,i} + S_{t−1,i}: "a layer must be computed before it can be stored in RAM"

SLIDE 39

A system for optimal tensor rematerialization

Decision variables: R_{t,i} ∈ {0, 1} (layer i is (re)computed in stage t), S_{t,i} ∈ {0, 1} (layer i is stored for stage t), and V_{t,k} ∈ ℝ₊ — an implicit variable that models memory usage at each point of each stage.

min_{R,S,V}  Σ_t Σ_i  C_i · R_{t,i}                                  (minimize forward + backward cost)
s.t.  R_{t,j} ≤ R_{t,i} + S_{t,i},  S_{t,i} ≤ R_{t−1,i} + S_{t−1,i}   (correctness)
      V_{t,k} ≤ budget, …                                             (memory limit)

SLIDE 40

A system for optimal tensor rematerialization

Decision variables: R_{t,i}, S_{t,i} ∈ {0, 1} and the implicit memory-usage variable V_{t,k} ∈ ℝ₊.

min_{R,S,V}  Σ_t Σ_i  C_i · R_{t,i}                                  (minimize forward + backward cost)
s.t.  R_{t,j} ≤ R_{t,i} + S_{t,i},  S_{t,i} ≤ R_{t−1,i} + S_{t−1,i}   (correctness)
      V_{t,k} ≤ budget, …                                             (memory limit)

Memory accounting details are in the paper.
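To make the formulation concrete, here is a brute-force check of the objective and constraints on a four-op toy graph shaped like backprop through one layer (A → B → ∇B → ∇A, with ∇A also needing A). Pure Python, no ILP solver; the costs, sizes, and the per-stage memory proxy (sum of resident tensor sizes) are illustrative assumptions, not the paper's exact accounting:

```python
from itertools import product

# Tiny brute-force instance of the rematerialization ILP. Stage t must
# materialize op t (frontier-advancing), R[t][i] = "op i (re)computed in
# stage t", S[t][i] = "op i kept in RAM into stage t".
cost = [1, 1, 1, 1]
size = [1, 3, 1, 1]                      # B's activation is the big tensor
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]  # A -> B -> gradB -> gradA, A -> gradA
n = 4
free = [(t, i) for t in range(n) for i in range(t)]  # lower-triangular bits

def solve(budget):
    best = None
    for r_bits in product([0, 1], repeat=len(free)):
        for s_bits in product([0, 1], repeat=len(free)):
            R = [[1 if i == t else 0 for i in range(n)] for t in range(n)]
            S = [[0] * n for _ in range(n)]
            for (t, i), b in zip(free, r_bits): R[t][i] = b
            for (t, i), b in zip(free, s_bits): S[t][i] = b
            feasible = (
                # S_{t,i} <= R_{t-1,i} + S_{t-1,i}: stored only if materialized
                all(S[t][i] <= R[t - 1][i] + S[t - 1][i]
                    for t in range(1, n) for i in range(n))
                # R_{t,j} <= R_{t,i} + S_{t,i}: deps resident when computing
                and all(R[t][j] <= R[t][i] + S[t][i]
                        for t in range(n) for (i, j) in edges)
                # crude memory proxy: resident tensors must fit the budget
                and all(sum(size[i] * max(R[t][i], S[t][i]) for i in range(n))
                        <= budget for t in range(n)))
            if feasible:
                c = sum(cost[i] * R[t][i] for t in range(n) for i in range(n))
                best = c if best is None else min(best, c)
    return best

print(solve(budget=5))  # roomy: keep A resident, no recompute -> cost 4
print(solve(budget=4))  # tight: drop A, recompute it for gradA -> cost 5
```

With a roomy budget the solver keeps A resident (cost 4, no recomputation); tightening the budget by one unit forces A to be dropped and rematerialized for ∇A (cost 5) — the RAM-for-compute trade the ILP navigates at scale.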

SLIDE 41

A system for optimal tensor rematerialization — how long is the solve time?

9 hours 😴

min_{R,S,V}  Σ_t Σ_i  C_i · R_{t,i}   s.t. correctness and memory-limit constraints

SLIDE 42

A system for optimal tensor rematerialization

min_{R,S,V}  Σ_t Σ_i  C_i · R_{t,i}   s.t. correctness and memory-limit constraints

Tractability: partition the schedule into frontier-advancing stages — R_{t,t} = 1, with R, S, V lower triangular. This prunes the n! orderings of the n nodes.

9 hours → 0.2 seconds
SLIDE 43

A system for optimal tensor rematerialization

[System diagram: cost model → ILP / two-phase rounding → graph rewriter]

SLIDE 44

ILP optimization is NP-hard (combinatorial search). Is there a polynomial-time approximation?

Proposed method: two-phase rounding — round S, then solve the other variables optimally. Insight: given S, the optimal R is easy to compute.

  • 1. Relax the boolean constraints
  • 2. Solve the LP
  • 3. Round the solution

How do we maintain feasibility?
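The "given S, the optimal R is easy" insight can be sketched in a few lines: once the storage plan is rounded to booleans, the ops each stage must recompute are just the stage's target plus its transitive dependencies that were not checkpointed. The graph and storage sets below are illustrative, not from the talk:

```python
# Phase two of two-phase rounding, sketched: derive the recompute set R for
# one stage from a rounded storage plan S by a reachability sweep.

deps = {0: [], 1: [0], 2: [1], 3: [2, 0]}  # op 3 = gradA also needs A (op 0)

def recompute_set(t, stored):
    """Ops that stage t must (re)compute, given checkpoints `stored`."""
    need, computed = [t], set()
    while need:
        op = need.pop()
        if op in computed:
            continue
        computed.add(op)
        for d in deps[op]:
            if d not in stored:       # missing checkpoint -> recompute it too
                need.append(d)
    return sorted(computed)

# Rounded S for stage 3: only op 2 kept in RAM, op 0 was dropped.
print(recompute_set(3, stored={2}))     # [0, 3]: A is rebuilt on the fly
print(recompute_set(3, stored={0, 2}))  # [3]: everything needed is resident
```

Because the sweep always produces a dependency-closed recompute set, the rounded schedule stays feasible regardless of how the LP fractions were rounded.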

SLIDE 45

Evaluation: Questions

  • 1. What is the memory vs compute trade-off?
  • 2. How much can we increase batch/model size?
  • 3. How well does two-phase rounding do?
SLIDE 46

Evaluation: What is the memory vs compute trade-off?

[Plots: compute overhead (×) vs. GPU memory budget (GB) for U-Net (batch size 32) and MobileNet (batch size 512), against the best heuristic — up to a 1.2x speedup]

SLIDE 47

Evaluation: How much can we increase batch size?

VGG19, 224x224 images:

  • No rematerialization: batch size 167
  • Square-root heuristic: batch size 197 (1.18x larger)
  • Checkmate: batch size 289 (1.73x larger; 10 sec solve)

[Heatmaps: the stage × layer schedule matrix for each policy]

SLIDE 48

Evaluation: How much can we increase batch size?

Maximum batch size by policy (the chart normalizes each bar to the "checkpoint all" baseline, 0x–5x):

Policy              U-Net   FCN8   SegNet   VGG19   ResNet50   MobileNet
Checkpoint all         16     29       21     167         98         215
AP √n                  18     51       33     197        116         452
Lin. greedy            35     51       43     266        199         640
Checkmate (ours)       61     60       62     289        225        1105

Checkmate: 1.73x larger batches on VGG19, up to 5.1x larger on MobileNet.

SLIDE 49

Evaluation: How much can we increase batch size?

(Same chart as slide 48.)

*Ongoing work: BERT

2.3x larger batch size over TF2.0

Train BERT-Large w/o model parallelism

SLIDE 50

Evaluation: How well does two-phase rounding approximate the ILP?

Compute overhead vs. the optimal schedule:

Model       AP √n   AP greedy   Griewank log n   Two-phase LP rounding
MobileNet   1.14×   1.07×       7.07×            1.06×
VGG16       1.28×   1.06×       1.44×            1.01×
VGG19       1.54×   1.39×       1.75×            1.00×
U-Net       1.27×   1.23×       —                1.03×
ResNet50    1.20×   1.25×       —                1.05×

Two-phase rounding is within 6% of optimal cost (geomean), with a 43x solve-time speedup for ResNet50 and 440x for MobileNet.

SLIDE 51

Checkmate

Key ideas:

  • GPU memory limits are preventing the development of new deep learning models.
  • We present the first general solution for optimal and near-optimal graph rematerialization.
  • The formulation supports arbitrary DAGs and is both hardware-aware and memory-aware.
  • Integration takes just one line of code.

Code and paper: checkmateai.github.io Email me: parasj@berkeley.edu

Paras Jain, Ajay Jain, Ani Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, Joseph Gonzalez