checkmateai.github.io
Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization
Paras Jain
Joint work with: Ajay Jain, Ani Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, Joseph Gonzalez
BigGAN (2018)
Image generation
Brock et al. 2019
VideoBERT (2019)
Video generation
Sun et al. 2019
GPT-2 (2019)
Text generation
Radford et al. 2019
[Figure: parameter counts (×10⁶, up to ~1600M) of recent models such as ResNet-50, DeepLab, BERT-Large, and GPT-2; adapted from NVIDIA.]
[Figure: per-GPU RAM usage of state-of-the-art models, 2012-2018 (AlexNet, VGG19, Inception v3, ResNet-152, DenseNet-201, ResNeXt-101, FCN8s, RoR, iGAN), approaching the 0-15 GB GPU memory limit.]
State-of-the-art models have hit a memory capacity wall.
Limited GPU memory is slowing progress in new deep learning models!
Cited memory as a limiting factor: Chen et al. 2016, Gomez et al. 2017, Pohlen et al. 2017, Liu et al. 2019, Dai et al. 2019, Child et al. 2019
Compute is outstripping DRAM capacity growth.
[Figure: TOPS per GiB of memory capacity over time.]
Backprop is optimized for compute efficiency, not RAM usage.
Compute-optimized backprop
Compute-optimized backprop ↔ RAM-optimized backprop
Ideal: a scalable backprop algorithm that adapts to RAM constraints.
Checkmate explores the space between compute-optimized and RAM-optimized backprop: up to 5× larger inputs with 2× compute cost.
Strategy 1: keep all layers in RAM (compute-optimized backprop)
Forward pass: compute A → B → C → D → E → Loss (with Label), retaining every activation in RAM.
Backward pass: compute ∇E, ∇D, ∇C, ∇B, ∇A, each consuming its stored activation.
[Figure: RAM used over time; peak RAM occurs when all activations A-E are resident at once.]
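The keep-all strategy can be sketched as a small memory simulation. Layer names and sizes here are illustrative (not from the talk), and gradient buffers are assumed to match activation sizes:

```python
# Hypothetical sizes (MB) for the activations of a 5-layer chain A..E.
layers = ["A", "B", "C", "D", "E"]
sizes = {name: 4 for name in layers}

def keep_all_peak_ram(layers, sizes):
    """Peak RAM of compute-optimized backprop: every forward
    activation stays resident until its gradient consumes it."""
    resident, peak = 0, 0
    # Forward pass: compute and retain every activation.
    for name in layers:
        resident += sizes[name]
        peak = max(peak, resident)
    # Backward pass: each gradient buffer is allocated, then both
    # the gradient and its activation are freed (a simplification).
    for name in reversed(layers):
        resident += sizes[name]
        peak = max(peak, resident)
        resident -= 2 * sizes[name]
    return peak

print(keep_all_peak_ram(layers, sizes))  # → 24: all 5 activations + 1 gradient
```

The peak lands exactly where the slide's RAM plot shows it: at the start of the backward pass, when every forward activation is still resident.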
Strategy 2: recompute all layers as needed (RAM-optimized backprop)
Forward pass: compute A → B → C → D → E → Loss, discarding activations after use.
Backward pass: before each gradient, replay the forward prefix (e.g., recompute A, B, C, D for ∇D; A, B, C for ∇C).
[Figure: RAM used over time; peak RAM stays well below the no-recomputation peak, at the cost of repeated forward computation.]
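A matching sketch of the recompute-everything extreme, counting activation residency and total layer computations. This is a coarse model under the same illustrative sizes and unit costs; gradient buffers are ignored:

```python
layers = ["A", "B", "C", "D", "E"]
sizes = {name: 4 for name in layers}   # MB per activation (illustrative)
cost = {name: 1 for name in layers}    # compute cost per layer (illustrative)

def recompute_all(layers, sizes, cost):
    """RAM-optimized extreme: discard each activation after use, then
    replay the forward prefix A..layer_i before every gradient step."""
    peak, total_cost = 0, 0
    for name in layers:                    # forward pass, nothing retained
        total_cost += cost[name]
        peak = max(peak, sizes[name])
    for i in range(len(layers) - 1, -1, -1):
        for name in layers[: i + 1]:       # recompute A .. layers[i]
            total_cost += cost[name]
            peak = max(peak, sizes[name])
    return peak, total_cost

print(recompute_all(layers, sizes, cost))  # → (4, 20)
```

RAM stays at one activation (4 MB), but compute grows quadratically: 5 forward computations balloon to 20 once every gradient replays its prefix.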
[Figure: the full forward/backward graph with recomputation: each gradient ∇C, ∇B, ∇A triggers recomputation of the activations it needs.]
Prior approaches to rematerialization:
Greedy heuristics [Chen 2016] [XLA authors 2017, 2020]
Divide-and-conquer heuristics [Griewank 2000] [Kowarz 2006] [Siskind 2018] [Kumar 2019]
Optimal schedules for specific architectures [Gruslys 2016] [Feng 2018] [Beaumont 2019]
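The √n divide-and-conquer idea (in the spirit of Chen et al. 2016) can be illustrated with a toy scheduler for a linear chain; the segment size and cost accounting below are simplifications for intuition, not the cited algorithms:

```python
import math

def sqrt_n_schedule(n):
    """Sketch of sqrt(n) checkpointing on an n-layer chain: store one
    activation per segment boundary, and recompute the layers inside a
    segment during the backward pass."""
    seg = max(1, int(math.sqrt(n)))
    checkpoints = list(range(0, n, seg))       # stored boundary layers
    # Peak residency ~ all checkpoints + one segment being replayed.
    peak_activations = len(checkpoints) + seg
    extra_forward = n - len(checkpoints)       # non-checkpoints recomputed once
    return checkpoints, peak_activations, extra_forward

print(sqrt_n_schedule(16))  # → ([0, 4, 8, 12], 8, 12)
```

Peak memory drops from O(n) to O(√n) for roughly one extra forward pass, which is why this heuristic is a common middle ground between the two extremes above.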
The design space spans two extremes: checkpoint every node vs. recompute all layers.
Checkmate system overview:
Accurate cost model, flexible search space
Optimal solutions: Integer Linear Program
Near-optimal solutions: two-phase rounding
Graph rewriter
Hardware- and RAM-aware; solve for 10s-1hr, train for ~1 month; GPU, CPU, and TPU support.
First component: optimal solutions via an Integer Linear Program.
Example: chain A → B → C with gradients ∇C → ∇B → ∇A, unrolled over stages t = 1 … 6.
Decision variables:
S_{t,i} ∈ {0, 1}: layer i is (re)computed in stage t
T_{t,i} ∈ {0, 1}: layer i is stored in RAM for stage t
Decision variables:
S_{u,j} ∈ {0, 1}: layer j is (re)computed in stage u
T_{u,j} ∈ {0, 1}: layer j is stored in RAM for stage u
V_{u,k} ∈ ℝ₊: memory in use in stage u (an implicit variable to model memory usage at each stage)

Objective (minimize forward + backward cost, linear in the computation matrix):
min_{S,T,V} Σ_u Σ_j D_j S_{u,j}

Correctness constraints:
S_{u,k} ≤ S_{u,j} + T_{u,j} for each dependency j of k ("a layer's dependencies must be computed before evaluation")
T_{u,j} ≤ S_{u-1,j} + T_{u-1,j} ("a layer must be computed before it can be stored in RAM")

Memory limit: V_{u,k} ≤ budget, … (memory accounting details in paper)

Tractability: S_{u,u} = 1, and S, T, V are lower triangular. This fixes the execution order and prunes n! permutations.
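To make the formulation concrete, here is a brute-force sketch that enumerates binary S/T matrices for a 3-layer chain and checks the constraints above. The chain, unit costs, and 2-tensor budget are illustrative, and a real ILP solver replaces this enumeration:

```python
from itertools import product

N = 3                     # layers 1..N on a chain, one stage per layer
D = {1: 1, 2: 1, 3: 1}    # per-layer compute cost (illustrative)
BUDGET = 2                # max tensors resident in any stage (illustrative)

def feasible(S, T):
    """Check the slide's constraints on S[u][j] (computed in stage u)
    and T[u][j] (stored for stage u)."""
    for u in range(1, N + 1):
        if S[u][u] != 1:                        # frontier must advance
            return False
        for j in range(1, N + 1):
            if j > u and (S[u][j] or T[u][j]):  # lower-triangular
                return False
            prev = S[u - 1][j] + T[u - 1][j] if u > 1 else 0
            if T[u][j] > prev:                  # store only what existed
                return False
            if j > 1 and S[u][j] > S[u][j - 1] + T[u][j - 1]:
                return False                    # dependency available
        if sum(max(S[u][j], T[u][j]) for j in range(1, N + 1)) > BUDGET:
            return False                        # memory limit
    return True

best = None
cells = [(u, j) for u in range(1, N + 1) for j in range(1, u + 1)]
for s_bits in product((0, 1), repeat=len(cells)):
    for t_bits in product((0, 1), repeat=len(cells)):
        S = {u: {j: 0 for j in range(1, N + 1)} for u in range(1, N + 1)}
        T = {u: {j: 0 for j in range(1, N + 1)} for u in range(1, N + 1)}
        for (u, j), b in zip(cells, s_bits):
            S[u][j] = b
        for (u, j), b in zip(cells, t_bits):
            T[u][j] = b
        if feasible(S, T):
            cost = sum(D[j] * S[u][j] for u in S for j in S[u])
            if best is None or cost < best:
                best = cost
print("optimal cost:", best)  # → optimal cost: 3 (no recomputation needed)
```

With a budget of 2 tensors the optimum computes each layer exactly once; shrinking the budget forces recomputation (raising the objective) or infeasibility, which is exactly the trade-off the ILP navigates.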
Second component: near-optimal solutions via two-phase rounding.
Can we approximate the ILP in polynomial time? Proposed method: two-phase rounding. Round S, then solve the other variables optimally. Insight: given S, the optimal values of the remaining variables are easy to compute.
How to maintain feasibility?
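A sketch of the rounding-plus-repair idea (illustrative, not Checkmate's exact procedure): round a fractional computation matrix to binary, then walk each stage's dependency chain and either carry a missing dependency over in RAM or recompute it. The fractional matrix below is a hypothetical stand-in for an LP-relaxation solution:

```python
def round_and_repair(S_frac, n_stages, n_layers, thresh=0.5):
    """Phase 1: round the fractional computation matrix S to binary.
    Phase 2: restore feasibility on a chain by storing or recomputing
    each computed layer's missing dependency."""
    S = [[1 if S_frac[u][j] >= thresh else 0 for j in range(n_layers)]
         for u in range(n_stages)]
    T = [[0] * n_layers for _ in range(n_stages)]
    for u in range(n_stages):
        S[u][u] = 1                        # frontier layer always computed
        for j in range(u + 1, n_layers):
            S[u][j] = 0                    # keep S lower-triangular
        for j in range(u, 0, -1):          # walk dependencies right-to-left
            if S[u][j] and not S[u][j - 1]:
                if u > 0 and (S[u - 1][j - 1] or T[u - 1][j - 1]):
                    T[u][j - 1] = 1        # carry dependency over in RAM
                else:
                    S[u][j - 1] = 1        # otherwise recompute it now
    return S, T

# Hypothetical fractional solution for a 3-layer chain:
S_frac = [[1.0, 0.0, 0.0],
          [0.2, 0.9, 0.0],
          [0.1, 0.3, 1.0]]
S, T = round_and_repair(S_frac, 3, 3)
print(S)  # → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(T)  # → [[0, 0, 0], [1, 0, 0], [0, 1, 0]]
```

The repair step is what "maintaining feasibility" means here: rounding alone can drop a dependency, so the second phase patches the schedule at the cost of a little extra storage or compute.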
[Figure: compute overhead vs. GPU memory budget (GB) for U-Net (batch size 32) and MobileNet (batch size 512): Checkmate vs. the best heuristic. Checkmate achieves up to a 1.2× speedup over the best heuristic at the same memory budget.]
Case study: VGG19 on 224×224 images, maximum batch size at a fixed memory budget:
No rematerialization: batch size 167
Square-root heuristic: batch size 197 (1.18× larger)
Checkmate: batch size 289 (1.73× larger, 10-second solve)
[Figure: the S_{t,i} schedule matrices (stage × layer) for each strategy.]
Maximum batch size by strategy (U-Net, FCN8, SegNet, VGG19, ResNet50, MobileNet):
Checkpoint all: 16, 29, 21, 167, 98, 215
AP √n: 18, 51, 33, 197, 116, 452
AP greedy: 35, 51, 43, 266, 199, 640
Checkmate (ours): 61, 60, 62, 289, 225, 1105
Up to 1.73× larger batches on VGG19 and 5.1× larger on MobileNet.
Checkmate can train BERT-Large without model parallelism.
Compute overhead relative to the optimal ILP solution:
Model | AP √n | AP greedy | Griewank log n | Two-phase LP rounding
MobileNet | 1.14× | 1.07× | 7.07× | 1.06×
VGG16 | 1.28× | 1.06× | 1.44× | 1.01×
VGG19 | 1.54× | 1.39× | 1.75× | 1.00×
U-Net | 1.27× | 1.23× | |
ResNet50 | 1.20× | 1.25× | |
Two-phase rounding is within 6% of optimal cost (geomean): a 43× speedup for ResNet50 and 440× for MobileNet.
Key ideas:
Limited GPU memory is slowing the development of new deep learning models.
Checkmate computes optimal & near-optimal graph rematerialization schedules.
The approach is both hardware-aware and memory-aware.
Code and paper: checkmateai.github.io
Email me: parasj@berkeley.edu
Paras Jain, Ajay Jain, Ani Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, Joseph Gonzalez