Improving Preemptive Scheduling with Application-Transparent Checkpointing in Shared Clusters
Jack Li, Calton Pu (Georgia Institute of Technology); Yuan Chen, Vanish Talwar, Dejan Milojicic (Hewlett Packard Labs)
Shared Clusters for Big Data Systems
– Dynamic resource sharing across multiple frameworks, apps and users
– Examples: Google cluster (Omega), Mesos, Hadoop YARN, Bing's Dryad
– Compared to dedicated per-framework clusters, a shared cluster improves utilization and data sharing and reduces cost
[Diagram: a single shared hardware cluster under one cluster manager (e.g., YARN, Mesos) hosting batch (MapReduce), streaming (Storm), in-memory (Spark), graph (Giraph) and online (Vertica) frameworks, versus separate dedicated clusters for each framework]
Preemption in Shared Clusters
– Preemption coordinates resource sharing, guarantees QoS and enforces fairness
– Problem: preemption in shared clusters is expensive!
– Current schedulers simply kill preempted jobs and restart them later
– Significant resource waste
– Delays the completion time of long-running or low-priority jobs
Real World Examples
– Google cluster: 12.4% of scheduled tasks preempted and up to 30k CPU-hours (35% of total capacity) wasted!
– Microsoft Dryad cluster [1]: ~21% of jobs killed
– Facebook Hadoop cluster [2]: long-running jobs repeatedly killed and restarted
29-day trace from Google: 672,000 jobs on 12,500 machines

Task Priority     | Num. of Tasks | Percent Evicted
Free (0-1)        | 28.4M         | 20.26%
Middle (2-8)      | 17.3M         | 0.55%
Production (9-11) | 1.70M         | 1.02%

Latency Sensitivity | Num. of Tasks | Percent Evicted
0 (lowest)          | 37.4M         | 11.76%
1                   | 5.94M         | 18.87%
2                   | 3.70M         | 8.14%
3 (highest)         | 0.28M         | 14.80%

Even latency-sensitive tasks are evicted
[Figure: preemption rate (%) over the 30-day trace, shown separately for low-, medium- and high-priority tasks]
[Figure: preemption frequency distribution: distinct tasks (thousands) by number of preemptions (1 to >10)]
[1] Ananthanarayanan et al. Scarlett: Coping with Skewed Content Popularity in MapReduce Clusters. EuroSys 2011.
[2] Cheng et al. Mitigating the Negative Impact of Preemption on Heterogeneous MapReduce Workloads. CNSM 2011.
Many tasks are preempted: 43% are preempted more than once, and 17% are preempted 10 times or more
Checkpointing-based Preemptive Scheduling
Our solution: use checkpoint/restore for preemption instead of kill/restart
Use a system-level, application-transparent checkpointing mechanism
– Linux CRIU (Checkpoint/Restore In Userspace)
– Distributed and remote checkpoint/restart
Leverage fast storage such as NVM for efficient checkpointing
– Store checkpoints on NVM (NVMFS or NVRAM)
Adaptive preemption policies and optimization techniques
– Combine checkpoint and kill, local and remote checkpointing/resumption
– Incremental checkpointing with memory trackers
Application-transparent Suspend-Resume
Checkpointing using CRIU (Checkpoint/Restore In Userspace)
– Freezes a running program and suspends it in memory or dumps its state to disk
– Saves sockets, threads, namespaces, memory mappings, pipes
Dump
– Builds the process tree from /proc/$pid/task/$tid/children and seizes each process with ptrace
– Collects VMA areas, file descriptor numbers, registers, etc. of each process
Restore
– Reads the process tree from the image files and recreates the saved processes with clone()
– Creates new memory maps and fills them with the checkpointed data
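As a concrete illustration of the dump and restore steps above, the sketch below drives the CRIU command-line tool from Python. It is not taken from the slides; the PID, image directory and flags such as --shell-job are placeholders that may need to change for a real workload.

# Illustrative sketch: suspend and resume a process tree with the CRIU CLI.
# The PID and image directory below are placeholders, not values from the talk.
import subprocess

def checkpoint_task(pid: int, image_dir: str) -> None:
    # 'criu dump' freezes the process tree rooted at pid and writes its state
    # (memory mappings, file descriptors, sockets, ...) as image files.
    subprocess.run(
        ["criu", "dump", "-t", str(pid), "-D", image_dir, "--shell-job"],
        check=True)

def restore_task(image_dir: str) -> None:
    # 'criu restore' recreates the saved processes from the image files and
    # resumes them where the dump left off (-d detaches CRIU afterwards).
    subprocess.run(
        ["criu", "restore", "-D", image_dir, "--shell-job", "-d"],
        check=True)

if __name__ == "__main__":
    images = "/mnt/nvm/checkpoints/task-12345"  # e.g., an NVM-backed file system
    checkpoint_task(12345, images)
    restore_task(images)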
Suspend-Resume with DFS and NVM
Support distributed and remote checkpoint-resume
– Save checkpoints on HDFS
Checkpoint with NVM
– Use NVM as fast disk
– Save CRIU checkpoints in NVM-based file systems (e.g., PMFS)
– Use NVM as virtual memory (NVRAM)
– Copy checkpoints from DRAM to NVM using memory operations
– Shadow buffer
Incremental checkpointing
[Diagram: checkpointing with NVRAM: each node copies the process address space from DRAM into local or distributed shared NVRAM with memory operations]
[Diagram: distributed suspend-resume: a task is dumped to checkpoint files on node A's local storage (HDD, SSD or NVM), shared through a distributed file system (HDFS), and restored on node C]
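To make the remote restore path in the diagram concrete, here is a small sketch (an assumption, not the authors' code) that stages a local CRIU image directory on HDFS with the standard hdfs dfs commands so that another node can fetch it and run the restore.

# Illustrative sketch: share a CRIU image directory through HDFS so a task
# dumped on one node can be restored on another. All paths are placeholders.
import subprocess

def push_checkpoint(local_dir: str, hdfs_dir: str) -> None:
    # Create the target directory and upload the image files (-f overwrites).
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_dir, hdfs_dir], check=True)

def fetch_checkpoint(hdfs_dir: str, local_dir: str) -> None:
    # Download the image files on the node that will run 'criu restore'.
    subprocess.run(["hdfs", "dfs", "-get", hdfs_dir, local_dir], check=True)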
Suspend and Restore Performance
[Figure: total dump + restore time (s) vs. checkpoint size (2-10 GB). Left: local file system on HDD, SSD and NVM. Right: HDFS backed by HDD, SSD and PMFS]
Benefits of Incremental Checkpointing
5 GB initial dump size; 10% of the memory is changed and the process is dumped again
Storage | First Checkpoint | Second Checkpoint
HDD     | 169.18 s         | 15.34 s
SSD     | 43.73 s          | 4.08 s
PMFS    | 2.92 s           | 0.28 s
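The slides attribute the second-checkpoint speedup to incremental checkpointing with memory trackers. One way to obtain this behavior, sketched below as an assumption rather than the authors' exact setup, is CRIU's memory tracking: a pre-dump writes a full image, and later dumps write only the pages changed since the previous image.

# Illustrative sketch: incremental dumps with CRIU's memory tracking.
# The directory layout and PID are placeholders.
import subprocess

def pre_dump(pid: int, image_dir: str) -> None:
    # Full first image; --track-mem enables dirty-page tracking for later dumps.
    subprocess.run(
        ["criu", "pre-dump", "-t", str(pid), "-D", image_dir, "--track-mem"],
        check=True)

def incremental_dump(pid: int, image_dir: str, prev_image_dir: str) -> None:
    # Only memory modified since the image in prev_image_dir is written out
    # (the path is interpreted relative to image_dir).
    subprocess.run(
        ["criu", "dump", "-t", str(pid), "-D", image_dir,
         "--prev-images-dir", prev_image_dir, "--track-mem", "--shell-job"],
        check=True)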
Google Trace-driven Simulation
[Figure: simulation results comparing Preempt-Kill with checkpoint-based preemption (Basic-HDD, Basic-SSD, Basic-NVM): wasted CPU capacity (core-hours), energy consumption (kWh), and normalized response time of lowest-, medium- and highest-priority jobs]
[Figure: normalized completion time of low- and high-priority jobs and normalized power consumption as checkpoint bandwidth varies from 1 to 5 GB/s, for Wait, Kill and Checkpoint preemption]
Adaptive Policies and Optimization
– Adaptive preemption dynamically selects victim tasks and the preemption mechanism (checkpoint or kill) based on each task's progress and its checkpoint/restore overhead
– Adaptive resumption restores preempted jobs/tasks locally or remotely according to their overheads and the available resources
– Incremental checkpointing with memory trackers
Adaptive Preemption Algorithms
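The original slide's algorithm listing is not reproduced in this transcript. The sketch below is only a hedged illustration of the idea described on the previous slide: choose low-priority victims and checkpoint a task only when its checkpoint/restore overhead is smaller than the work that killing it would throw away. The fields, cost model and ordering are assumptions.

# Hedged illustration of an adaptive preemption decision, not the authors'
# actual algorithm. All fields and thresholds are assumed for the example.
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    priority: int              # lower value = lower priority
    elapsed_time_s: float      # work already done (lost if the task is killed)
    checkpoint_cost_s: float   # estimated time to dump and later restore

def choose_mechanism(task: Task) -> str:
    # Checkpoint only when saving the task's state is cheaper than redoing it.
    return "checkpoint" if task.checkpoint_cost_s < task.elapsed_time_s else "kill"

def pick_victims(tasks: list[Task], slots_needed: int) -> list[tuple[Task, str]]:
    # Prefer low-priority tasks that have done the least work as victims.
    victims = sorted(tasks, key=lambda t: (t.priority, t.elapsed_time_s))
    return [(t, choose_mechanism(t)) for t in victims[:slots_needed]]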
Performance Improvement with Adaptive Policies
[Figure: normalized response time of lowest-, medium- and highest-priority jobs with basic vs. adaptive preemption, shown separately for HDD, SSD and NVM]
Implementation with Hadoop YARN
YARN – cluster resource manager
– Global resource scheduler (ResourceManager)
– ApplicationMasters (jobs) are submitted to the RM
– Supports capacity and fair scheduling
DistributedShell
– Comes standard with YARN
– Runs a shell command in a set of containers in a distributed, parallel manner
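For reference, a DistributedShell job is launched by invoking its Client class through the yarn command; the sketch below (an illustration with an assumed jar path, version and resource sizes) wraps that invocation in Python.

# Illustrative sketch: launch a DistributedShell job that runs one shell
# command in several YARN containers. The jar path/version and resource
# sizes are assumptions; adjust them to your Hadoop installation.
import subprocess

DSHELL_JAR = "share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar"

def launch_distributed_shell(command: str, num_containers: int = 4) -> None:
    subprocess.run(
        ["yarn", "org.apache.hadoop.yarn.applications.distributedshell.Client",
         "-jar", DSHELL_JAR,
         "-shell_command", command,
         "-num_containers", str(num_containers),
         "-container_memory", "2048",   # MB per container
         "-container_vcores", "1"],
        check=True)

if __name__ == "__main__":
    launch_distributed_shell("sleep 300")   # each container runs the command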
[Diagram: preemption workflow on YARN]
1. A new job is submitted to the YARN ResourceManager (cluster scheduler).
2. The scheduler issues a preemption request to the Application Preemption Manager in the victim job's ApplicationMaster.
3. The Preemption Manager asks the YARN NodeManagers to suspend the victim tasks; CRIU dumps each task to local HDD, SSD or NVM (PMFS), or to HDFS.
4. Suspend completion is reported back to the ResourceManager.
5. When resources free up, containers are requested for the preempted tasks.
6. The tasks are resumed on the NodeManagers, with CRIU restoring them from the saved checkpoint files.
Testbed and Experiment Setup
– 8-node Hadoop YARN cluster
– Dual-socket Xeon 5650 CPUs (6 cores each)
– 96 GB memory (48 GB emulated NVM using PMFS)
– 500 GB HDD (un-optimized)
– 120 GB SSD
– 24 concurrent containers (1 CPU / 2 GB memory each)
– Workload
– Modeled after a Facebook workload [1]
– Mix of high- and low-priority jobs (7,000+ tasks)
[1] Cheng et al. Mitigating the Negative Impact of Preemption on Heterogeneous MapReduce Workloads. CNSM 2011.
Comparison of Different Preemption Policies on YARN
[Figure: basic preemption policies on YARN (Kill, Chk-HDD, Chk-SSD, Chk-NVM): CPU wastage (core-hours), energy consumption (kWh), average response time (min) of low- and high-priority jobs, and the response-time CDF]
Benefits of Adaptive Preemption
[Figure: response time (min) of low- and high-priority jobs with basic vs. adaptive preemption on HDD, SSD and NVM]
Overhead of Checkpoint-based Preemption
CPU overhead is negligible, but I/O overhead is significant on slow storage
[Figure: CPU overhead (%) and I/O overhead (%) of basic and adaptive checkpoint-based preemption on HDD, SSD and NVM]
Conclusion and Future Work
– Preemption in shared clusters is expensive; preemption based on application-transparent checkpointing improves resource efficiency and overall application performance.
– Adaptive preemption that combines checkpoint and kill further improves performance and reduces preemption cost.
– Leveraging emerging fast storage technologies such as NVM yields even more savings.
Future Work
– A wider range of applications
– Checkpointing with NVRAM
– Integration with other cluster scheduling policies
Thank you!
Contact: jack.li@cc.gatech.edu, yuan.chen@hpe.com