Salus
Fine-grained GPU Sharing Primitives for Deep Learning Applications
Advisor: Mosharaf Chowdhury 2020-03-03 By Peifeng Yu
Deep Learning Becomes Ubiquitous
Applications: computer vision, natural language processing, speech, robotics
A Brief Introduction to Deep Learning
[Figure: labels Dog, Cat, Raccoon; Errors (✗)]
[Figure: label Cat]
Accelerate Deep Learning with GPUs
[Figure labels: Neural Networks, GPUs, Inherently Parallel, Matrix Operations, FLOPS]
Exclusive Access to GPU
An application can have multiple GPUs, but each GPU usually belongs to exactly one application at a time.
Advantages
Disadvantages
Exclusive Access: Lack of Flexibility
Model               Peak Memory Usage
VAE                 28M
Super Resolution    529M
Deep Speech         3993M
Inception4          11355M
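To make the spread in the table concrete, the short sketch below computes what fraction of a GPU each model occupies at its peak when it holds the GPU exclusively. The 16 GiB capacity (16384M) is an assumption for illustration only; the slides do not state the GPU memory size, and `peak_utilization` is a hypothetical helper name.

```python
# Peak memory usage per model, in M, copied from the table above.
PEAK_MEMORY_M = {
    "VAE": 28,
    "Super Resolution": 529,
    "Deep Speech": 3993,
    "Inception4": 11355,
}

# Assumption: a GPU with 16 GiB (16384M) of memory; not stated in the slides.
GPU_CAPACITY_M = 16 * 1024


def peak_utilization(peak_m, capacity_m=GPU_CAPACITY_M):
    """Fraction of GPU memory a job occupies at its peak when it owns the GPU."""
    return peak_m / capacity_m


for model, peak in PEAK_MEMORY_M.items():
    print(f"{model:<18}{peak:>7}M  {peak_utilization(peak):6.1%}")
```

Under this assumed capacity the smallest model leaves almost the whole GPU idle, which is the inflexibility the slide points at.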
How Can We Efficiently Share a GPU for Deep Learning Applications?
GPU Sharing
Approach                       Efficiency   Dynamic Memory   Flexible Scheduling
Static Partitioning (SP)       No           No               Yes
Multi-Process Service (MPS)    Yes          No               No
Ideal                          Yes          Yes              Yes
Design Goals
Minimize deployment overhead
Fine-grained GPU Sharing Primitives for Deep Learning
A consolidated execution service enabling sharing primitives without modifying any … with the goal to …
Salus in DL Stack
[Figure: DL stack layers: user scripts; deep learning frameworks (Tensorflow, PyTorch, CNTK, others …); Salus execution service; hardware (CPU, GPU, FPGA, …, ASIC)]
Salus in One Slide
Salus Adaptor: transfers the computation graph
Salus Execution Service: consolidates all GPU accesses
[Figure: user scripts, each with a DL framework and a Salus adaptor, connect to Salus (memory manager, session, scheduler), which accesses the GPU]
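The division of labor between adaptor and execution service can be sketched as a toy model. Every class and method name below (`ExecutionService`, `Adaptor`, `create_session`, `run_iteration`, `step`) is hypothetical and chosen for illustration; none of this is the actual Salus API, and the "graph" is reduced to a list of operation names.

```python
class ExecutionService:
    """Hypothetical single owner of the GPU; every job's work goes through it."""

    def __init__(self):
        self.sessions = {}

    def create_session(self, job_name, graph):
        # One session per job; the service keeps the transferred graph.
        session_id = len(self.sessions)
        self.sessions[session_id] = (job_name, list(graph))
        return session_id

    def run_iteration(self, session_id):
        # A real service would launch these operations on the GPU here.
        job_name, graph = self.sessions[session_id]
        return [f"{job_name}:{op}" for op in graph]


class Adaptor:
    """Hypothetical framework-side shim: ships the graph instead of using the GPU."""

    def __init__(self, service, job_name, graph):
        self.service = service
        self.session_id = service.create_session(job_name, graph)

    def step(self):
        return self.service.run_iteration(self.session_id)


service = ExecutionService()
job_a = Adaptor(service, "job_a", ["conv", "relu"])
job_b = Adaptor(service, "job_b", ["matmul"])
print(job_a.step())
print(job_b.step())
```

Because both adaptors talk to the same service object, the service sees every job's work and is the natural place for a memory manager and a scheduler.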
Sharing Primitives
Sharing Primitives: Efficient Job Switching
Existing Approaches                 Time Scale
Stop and restart (checkpointing)    10~100s
Generate snapshot [1]               ~1s
[1]: W. Xiao et al. “Gandiva: Introspective Cluster Scheduling for Deep Learning”. In: OSDI. 2018.
Bottleneck: data (memory) transfer
Understand DL Job Memory
Why not keep multiple jobs’ model in memory for fast switching?
Sharing Primitives: Efficient Job Switching
Job switching is done by determining which job’s iteration to run next.
A trade-off between maximum utilization and execution performance
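A minimal sketch of switching at iteration granularity follows. The `Job` class, `run_time_sliced`, and the round-robin order are assumptions made for illustration; the slides do not specify how the next job is chosen, and a real iteration would run GPU work rather than decrement a counter.

```python
from collections import deque


class Job:
    """Hypothetical stand-in for a DL job; one call to run_iteration is one step."""

    def __init__(self, name, iterations):
        self.name = name
        self.remaining = iterations

    def run_iteration(self):
        self.remaining -= 1
        return self.name

    @property
    def done(self):
        return self.remaining <= 0


def run_time_sliced(jobs):
    """Switch jobs only at iteration boundaries (round-robin is an assumed policy)."""
    queue = deque(jobs)
    trace = []
    while queue:
        job = queue.popleft()        # decide whose iteration runs next
        trace.append(job.run_iteration())
        if not job.done:
            queue.append(job)        # unfinished jobs wait for their next turn
    return trace


print(run_time_sliced([Job("A", 3), Job("B", 1), Job("C", 2)]))
```

In this sketch a switch is just a choice of which queue entry to pop, so no job state is saved or restored between iterations.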
Sharing Primitives
[Figure: memory over time for Job 1 and Job 2]
[Figure: memory over time for Job 1, Job 2, and Job 3, placed in Lane 0 and Lane 1]
Sharing Primitives: Memory Sharing
GPU lane = continuous physical memory + GPU stream
GPU Lane: Best Fit & Safety Condition
\[ \sum_j Q_j + \max_j U_j \le D_m \]
Static Partitioning:
\[ \sum_j Q_j + \sum_j U_j \le D_m \]
$Q_j$: Model and framework-internal memory for job $j$
$U_j$: Ephemeral memory for job $j$
$D_m$: Memory capacity of lane $m$
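The two conditions can be compared with a small sketch. It assumes that "best fit" means choosing the safe lane that is left with the least free memory, which the slides do not spell out; the `Job` and `Lane` classes, the function names, and all memory figures are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Job:
    name: str
    q: int  # Q_j: model and framework-internal memory
    u: int  # U_j: ephemeral memory


@dataclass
class Lane:
    capacity: int  # D_m
    jobs: List[Job] = field(default_factory=list)


def salus_usage(jobs: List[Job]) -> int:
    """Left-hand side of the safety condition: sum_j Q_j + max_j U_j."""
    return sum(j.q for j in jobs) + max((j.u for j in jobs), default=0)


def static_usage(jobs: List[Job]) -> int:
    """Left-hand side under static partitioning: sum_j Q_j + sum_j U_j."""
    return sum(j.q + j.u for j in jobs)


def is_safe(lane: Lane, job: Job) -> bool:
    return salus_usage(lane.jobs + [job]) <= lane.capacity


def best_fit(lanes: List[Lane], job: Job) -> Optional[Lane]:
    """Assumed reading of "best fit": the safe lane left with the least free memory."""
    safe = [lane for lane in lanes if is_safe(lane, job)]
    if not safe:
        return None
    chosen = min(safe, key=lambda lane: lane.capacity - salus_usage(lane.jobs + [job]))
    chosen.jobs.append(job)
    return chosen


# Hypothetical numbers, in M.
lanes = [Lane(capacity=8000), Lane(capacity=16000)]
a = Job("a", q=3000, u=4000)
b = Job("b", q=2000, u=3000)
lane_a = best_fit(lanes, a)  # the 8000M lane is safe and leaves the least slack
lane_b = best_fit(lanes, b)  # 5000 + 4000 > 8000, so only the 16000M lane is safe
```

For jobs `a` and `b` together, the safety condition needs 9000M while static partitioning needs 12000M, because the former reserves only the largest ephemeral region rather than all of them.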
Salus Scheduling Policies
FIFO is suboptimal
With Salus
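One way to see why FIFO can be suboptimal is head-of-line blocking: a long job at the front delays every short job behind it. The sketch below compares average completion time for FIFO order against a shortest-first order. It assumes all jobs arrive at time 0 and run one after another; the durations are made up, and shortest-first is only an illustrative alternative, not a policy named on this slide.

```python
def average_completion_time(durations):
    """Average completion time when jobs run back to back in the given order."""
    finished_at = 0
    total = 0
    for duration in durations:
        finished_at += duration
        total += finished_at
    return total / len(durations)


# Hypothetical job durations in minutes, listed in arrival order.
arrivals = [100, 5, 10]

fifo = average_completion_time(arrivals)
shortest_first = average_completion_time(sorted(arrivals))
print(f"FIFO: {fifo:.1f} min, shortest first: {shortest_first:.1f} min")
```

Reordering helps here only because the scheduler is free to pick which job runs next, which is what cheap job switching makes practical.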
Deployment and evaluation on Intel E5-2670 with 2x NVIDIA Tesla P100 with 15 workloads
A Production Trace
[1]: J. Gu et al. “Tiresias: A GPU Cluster Manager for Distributed Deep Learning”. In: NSDI. 2019.
Sub-second Level Switching
Hyper-parameter Exploration
Pack Inference Applications
Fine-grained GPU Sharing Primitives for Deep Learning
Open sourced at: https://github.com/SymbioticLab/Salus