checkmateai.github.io
Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization
Paras Jain
Joint work with: Ajay Jain, Ani Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, Joseph Gonzalez
BigGAN (2018)
Image generation
Brock et al. 2019
VideoBERT (2019)
Video generation
Sun et al. 2019
GPT-2 (2019)
Text generation
Radford et al. 2019
[Figure: parameter counts (×10⁶, up to ~1600M) of recent models such as ResNet-50, DeepLab, BERT-Large, and GPT-2; adapted from NVIDIA.]
[Figure: per-GPU RAM usage of state-of-the-art models, 2012-2018 (AlexNet, VGG19, Inception v3, ResNet-152, DenseNet-201, ResNeXt-101, FCN8s, RoR, iGAN), approaching the 0-15 GB GPU memory limit.]
State-of-the-art models have hit a memory capacity wall.
Limited GPU memory is slowing progress in new deep learning models!
Cited memory as a limiting factor: Chen et al. 2016, Gomez et al. 2017, Pohlen et al. 2017, Liu et al. 2019, Dai et al. 2019, Child et al. 2019
Compute is outstripping DRAM capacity growth.
[Figure: TOPS per GiB of memory capacity over time.]
Backprop is optimized for compute efficiency, not RAM usage.
Compute-optimized backprop
Compute-optimized backprop ↔ RAM-optimized backprop
Ideal: a scalable backprop algorithm that adapts to RAM constraints.
Checkmate explores the space between compute-optimized and RAM-optimized backprop: up to 5× larger inputs with 2× compute cost.
Strategy 1: keep all layers in RAM (compute-optimized backprop)
Forward pass: compute A → B → C → D → E → Loss (with Label), retaining every activation in RAM.
Backward pass: compute ∇E, ∇D, ∇C, ∇B, ∇A, each consuming its stored activation.
[Figure: RAM used over time; peak RAM occurs when all activations A-E are resident at once.]
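The keep-all strategy can be sketched as a small memory simulation. Layer names and sizes here are illustrative (not from the talk), and gradient buffers are assumed to match activation sizes:

```python
# Hypothetical sizes (MB) for the activations of a 5-layer chain A..E.
layers = ["A", "B", "C", "D", "E"]
sizes = {name: 4 for name in layers}

def keep_all_peak_ram(layers, sizes):
    """Peak RAM of compute-optimized backprop: every forward
    activation stays resident until its gradient consumes it."""
    resident, peak = 0, 0
    # Forward pass: compute and retain every activation.
    for name in layers:
        resident += sizes[name]
        peak = max(peak, resident)
    # Backward pass: each gradient buffer is allocated, then both
    # the gradient and its activation are freed (a simplification).
    for name in reversed(layers):
        resident += sizes[name]
        peak = max(peak, resident)
        resident -= 2 * sizes[name]
    return peak

print(keep_all_peak_ram(layers, sizes))  # → 24: all 5 activations + 1 gradient
```

The peak lands exactly where the slide's RAM plot shows it: at the start of the backward pass, when every forward activation is still resident.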
Strategy 2: recompute all layers as needed (RAM-optimized backprop)
Forward pass: compute A → B → C → D → E → Loss, discarding activations after use.
Backward pass: before each gradient, replay the forward prefix (e.g., recompute A, B, C, D for ∇D; A, B, C for ∇C).
[Figure: RAM used over time; peak RAM stays well below the no-recomputation peak, at the cost of repeated forward computation.]
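A matching sketch of the recompute-everything extreme, counting activation residency and total layer computations. This is a coarse model under the same illustrative sizes and unit costs; gradient buffers are ignored:

```python
layers = ["A", "B", "C", "D", "E"]
sizes = {name: 4 for name in layers}   # MB per activation (illustrative)
cost = {name: 1 for name in layers}    # compute cost per layer (illustrative)

def recompute_all(layers, sizes, cost):
    """RAM-optimized extreme: discard each activation after use, then
    replay the forward prefix A..layer_i before every gradient step."""
    peak, total_cost = 0, 0
    for name in layers:                    # forward pass, nothing retained
        total_cost += cost[name]
        peak = max(peak, sizes[name])
    for i in range(len(layers) - 1, -1, -1):
        for name in layers[: i + 1]:       # recompute A .. layers[i]
            total_cost += cost[name]
            peak = max(peak, sizes[name])
    return peak, total_cost

print(recompute_all(layers, sizes, cost))  # → (4, 20)
```

RAM stays at one activation (4 MB), but compute grows quadratically: 5 forward computations balloon to 20 once every gradient replays its prefix.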
[Figure: the full forward/backward graph with recomputation: each gradient ∇C, ∇B, ∇A triggers recomputation of the activations it needs.]
Prior approaches to rematerialization:
Greedy heuristics [Chen 2016] [XLA authors 2017, 2020]
Divide-and-conquer heuristics [Griewank 2000] [Kowarz 2006] [Siskind 2018] [Kumar 2019]
Optimal schedules for specific architectures [Gruslys 2016] [Feng 2018] [Beaumont 2019]
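The √n divide-and-conquer idea (in the spirit of Chen et al. 2016) can be illustrated with a toy scheduler for a linear chain; the segment size and cost accounting below are simplifications for intuition, not the cited algorithms:

```python
import math

def sqrt_n_schedule(n):
    """Sketch of sqrt(n) checkpointing on an n-layer chain: store one
    activation per segment boundary, and recompute the layers inside a
    segment during the backward pass."""
    seg = max(1, int(math.sqrt(n)))
    checkpoints = list(range(0, n, seg))       # stored boundary layers
    # Peak residency ~ all checkpoints + one segment being replayed.
    peak_activations = len(checkpoints) + seg
    extra_forward = n - len(checkpoints)       # non-checkpoints recomputed once
    return checkpoints, peak_activations, extra_forward

print(sqrt_n_schedule(16))  # → ([0, 4, 8, 12], 8, 12)
```

Peak memory drops from O(n) to O(√n) for roughly one extra forward pass, which is why this heuristic is a common middle ground between the two extremes above.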
The design space spans two extremes: checkpoint every node vs. recompute all layers.
Checkmate system overview:
Accurate cost model, flexible search space
Optimal solutions: Integer Linear Program
Near-optimal solutions: two-phase rounding
Graph rewriter
Hardware- and RAM-aware; solve for 10s-1hr, train for ~1 month; GPU, CPU, and TPU support.
First component: optimal solutions via an Integer Linear Program.
Example: chain A → B → C with gradients ∇C → ∇B → ∇A, unrolled over stages t = 1 … 6.
Decision variables:
S_{t,i} ∈ {0, 1}: layer i is (re)computed in stage t
T_{t,i} ∈ {0, 1}: layer i is stored in RAM for stage t
Decision variables:
S_{u,j} ∈ {0, 1}: layer j is (re)computed in stage u
T_{u,j} ∈ {0, 1}: layer j is stored in RAM for stage u
V_{u,k} ∈ ℝ₊: memory in use in stage u (an implicit variable to model memory usage at each stage)

Objective (minimize forward + backward cost, linear in the computation matrix):
min_{S,T,V} Σ_u Σ_j D_j S_{u,j}

Correctness constraints:
S_{u,k} ≤ S_{u,j} + T_{u,j} for each dependency j of k ("a layer's dependencies must be computed before evaluation")
T_{u,j} ≤ S_{u-1,j} + T_{u-1,j} ("a layer must be computed before it can be stored in RAM")

Memory limit: V_{u,k} ≤ budget, … (memory accounting details in paper)

Tractability: S_{u,u} = 1, and S, T, V are lower triangular. This fixes the execution order and prunes n! permutations.
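To make the formulation concrete, here is a brute-force sketch that enumerates binary S/T matrices for a 3-layer chain and checks the constraints above. The chain, unit costs, and 2-tensor budget are illustrative, and a real ILP solver replaces this enumeration:

```python
from itertools import product

N = 3                     # layers 1..N on a chain, one stage per layer
D = {1: 1, 2: 1, 3: 1}    # per-layer compute cost (illustrative)
BUDGET = 2                # max tensors resident in any stage (illustrative)

def feasible(S, T):
    """Check the slide's constraints on S[u][j] (computed in stage u)
    and T[u][j] (stored for stage u)."""
    for u in range(1, N + 1):
        if S[u][u] != 1:                        # frontier must advance
            return False
        for j in range(1, N + 1):
            if j > u and (S[u][j] or T[u][j]):  # lower-triangular
                return False
            prev = S[u - 1][j] + T[u - 1][j] if u > 1 else 0
            if T[u][j] > prev:                  # store only what existed
                return False
            if j > 1 and S[u][j] > S[u][j - 1] + T[u][j - 1]:
                return False                    # dependency available
        if sum(max(S[u][j], T[u][j]) for j in range(1, N + 1)) > BUDGET:
            return False                        # memory limit
    return True

best = None
cells = [(u, j) for u in range(1, N + 1) for j in range(1, u + 1)]
for s_bits in product((0, 1), repeat=len(cells)):
    for t_bits in product((0, 1), repeat=len(cells)):
        S = {u: {j: 0 for j in range(1, N + 1)} for u in range(1, N + 1)}
        T = {u: {j: 0 for j in range(1, N + 1)} for u in range(1, N + 1)}
        for (u, j), b in zip(cells, s_bits):
            S[u][j] = b
        for (u, j), b in zip(cells, t_bits):
            T[u][j] = b
        if feasible(S, T):
            cost = sum(D[j] * S[u][j] for u in S for j in S[u])
            if best is None or cost < best:
                best = cost
print("optimal cost:", best)  # → optimal cost: 3 (no recomputation needed)
```

With a budget of 2 tensors the optimum computes each layer exactly once; shrinking the budget forces recomputation (raising the objective) or infeasibility, which is exactly the trade-off the ILP navigates.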
Second component: near-optimal solutions via two-phase rounding.
Can we approximate the ILP in polynomial time? Proposed method: two-phase rounding. Round S, then solve the other variables optimally. Insight: given S, the optimal values of the remaining variables are easy to compute.
How to maintain feasibility?
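A sketch of the rounding-plus-repair idea (illustrative, not Checkmate's exact procedure): round a fractional computation matrix to binary, then walk each stage's dependency chain and either carry a missing dependency over in RAM or recompute it. The fractional matrix below is a hypothetical stand-in for an LP-relaxation solution:

```python
def round_and_repair(S_frac, n_stages, n_layers, thresh=0.5):
    """Phase 1: round the fractional computation matrix S to binary.
    Phase 2: restore feasibility on a chain by storing or recomputing
    each computed layer's missing dependency."""
    S = [[1 if S_frac[u][j] >= thresh else 0 for j in range(n_layers)]
         for u in range(n_stages)]
    T = [[0] * n_layers for _ in range(n_stages)]
    for u in range(n_stages):
        S[u][u] = 1                        # frontier layer always computed
        for j in range(u + 1, n_layers):
            S[u][j] = 0                    # keep S lower-triangular
        for j in range(u, 0, -1):          # walk dependencies right-to-left
            if S[u][j] and not S[u][j - 1]:
                if u > 0 and (S[u - 1][j - 1] or T[u - 1][j - 1]):
                    T[u][j - 1] = 1        # carry dependency over in RAM
                else:
                    S[u][j - 1] = 1        # otherwise recompute it now
    return S, T

# Hypothetical fractional solution for a 3-layer chain:
S_frac = [[1.0, 0.0, 0.0],
          [0.2, 0.9, 0.0],
          [0.1, 0.3, 1.0]]
S, T = round_and_repair(S_frac, 3, 3)
print(S)  # → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(T)  # → [[0, 0, 0], [1, 0, 0], [0, 1, 0]]
```

The repair step is what "maintaining feasibility" means here: rounding alone can drop a dependency, so the second phase patches the schedule at the cost of a little extra storage or compute.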
[Figure: compute overhead vs. GPU memory budget (GB) for U-Net (batch size 32) and MobileNet (batch size 512): Checkmate vs. the best heuristic. Checkmate achieves up to a 1.2× speedup over the best heuristic at the same memory budget.]
Case study: VGG19 on 224×224 images, maximum batch size at a fixed memory budget:
No rematerialization: batch size 167
Square-root heuristic: batch size 197 (1.18× larger)
Checkmate: batch size 289 (1.73× larger, 10-second solve)
[Figure: the S_{t,i} schedule matrices (stage × layer) for each strategy.]
Maximum batch size by strategy (U-Net, FCN8, SegNet, VGG19, ResNet50, MobileNet):
Checkpoint all: 16, 29, 21, 167, 98, 215
AP √n: 18, 51, 33, 197, 116, 452
AP greedy: 35, 51, 43, 266, 199, 640
Checkmate (ours): 61, 60, 62, 289, 225, 1105
Up to 1.73× larger batches on VGG19 and 5.1× larger on MobileNet.
Checkmate can train BERT-Large without model parallelism.
Compute overhead relative to the optimal ILP solution:
Model | AP √n | AP greedy | Griewank log n | Two-phase LP rounding
MobileNet | 1.14× | 1.07× | 7.07× | 1.06×
VGG16 | 1.28× | 1.06× | 1.44× | 1.01×
VGG19 | 1.54× | 1.39× | 1.75× | 1.00×
U-Net | 1.27× | 1.23× | |
ResNet50 | 1.20× | 1.25× | |
Two-phase rounding is within 6% of optimal cost (geomean): a 43× speedup for ResNet50 and 440× for MobileNet.
Key ideas:
Limited GPU memory is slowing the development of new deep learning models.
Checkmate computes optimal & near-optimal graph rematerialization schedules.
The approach is both hardware-aware and memory-aware.
Code and paper: checkmateai.github.io
Email me: parasj@berkeley.edu
Paras Jain, Ajay Jain, Ani Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, Joseph Gonzalez