

SLIDE 1

Improving Preemptive Scheduling with Application-Transparent Checkpointing in Shared Clusters

Jack Li, Calton Pu (Georgia Institute of Technology)
Yuan Chen, Vanish Talwar, Dejan Milojicic (Hewlett Packard Labs)

SLIDE 2

Shared Clusters for Big Data Systems

– Dynamic resource sharing across multiple frameworks, apps and users

– Examples: Google cluster (Omega), Mesos, Hadoop YARN, Bing's Dryad

– Moving from dedicated clusters to a shared cluster improves utilization and data sharing and reduces cost

[Figure: Dedicated clusters (separate hardware for each framework: Batch (MR), Streaming (Storm), Online (Vertica)) vs. a shared cluster, where a cluster manager (e.g., YARN or Mesos) runs Batch (MR), Streaming (Storm), In-Memory (Spark), Graph (Giraph), and Online (Vertica) workloads on shared hardware]

SLIDE 3

Preemption in Shared Clusters

– Coordinate resource sharing, guarantee QoS, and enforce fairness

– Problem: preemption in shared clusters is expensive!

– Jobs are simply killed and restarted later
– Significant resource waste
– Delays the completion time of long-running or low-priority jobs


SLIDE 4

Real World Examples

– Google cluster: 12.4% of scheduled tasks preempted, wasting up to 30k CPU-hours (35% of total capacity)!
– Microsoft Dryad cluster[1]: ~21% of jobs killed
– Facebook Hadoop cluster[2]: long-running jobs repeatedly killed and restarted

Task Priority       | Num. of Tasks | Percent Evicted
Free (0-1)          | 28.4M         | 20.26%
Middle (2-8)        | 17.3M         | 0.55%
Production (9-11)   | 1.70M         | 1.02%

Latency Sensitivity | Num. of Tasks | Percent Evicted
0 (lowest)          | 37.4M         | 11.76%
1                   | 5.94M         | 18.87%
2                   | 3.70M         | 8.14%
3 (highest)         | 0.28M         | 14.80%

Even latency-sensitive tasks are evicted.
(29-day Google trace: 672,000 jobs on 12,500 machines)

[Figure: Preemption rate timeline: preemption rate (%) over the 30-day trace for low-, medium-, and high-priority tasks]

[1] Ananthanarayanan et al. Scarlett: Coping with Skewed Content Popularity in MapReduce Clusters. EuroSys 2011.
[2] Cheng et al. Mitigating the Negative Impact of Preemption on Heterogeneous MapReduce Workloads. CNSM 2011.

Many tasks preempted

SLIDE 5

Real World Examples


[Figure: Preemption frequency distribution: number of preemptions (1 to >10) vs. distinct tasks, in thousands]


Many tasks are preempted: 43% were preempted more than once, and 17% were preempted 10 or more times.
SLIDE 6

Checkpointing-based Preemptive Scheduling

Our solution: use checkpoint/restore for preemption instead of kill/restart

Use system level, application-transparent checkpointing mechanism

– Linux CRIU (Checkpoint/Restore In Userspace)
– Distributed and remote checkpoint/restart

Leverage fast storage such as NVM for efficient checkpointing

– Store checkpoints on NVM (NVMFS or NVRAM)

Adaptive preemption policies and optimization techniques

– Combine checkpoint and kill, local and remote checkpointing/resumption
– Incremental checkpointing with memory trackers

SLIDE 7

Application-transparent Suspend-Resume

Checkpointing using CRIU (Checkpoint/Restore In Userspace)
– Freeze a running program and suspend it in memory or dump it to disk
– Saves sockets, threads, namespaces, memory mappings, pipes

Dump

– Build the process tree from /proc/$pid/task/$tid/children and seize each process with ptrace
– Collect the VMA areas, file descriptor numbers, registers, etc. of each process

Restore

– Read the process tree from the image files and recreate the saved processes with clone()
– New memory mappings are created and filled with the checkpointed data
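
As a concrete sketch of this dump/restore flow, the snippet below drives the CRIU command-line tool from Python; the PID and image directory are hypothetical, CRIU must be installed and run with sufficient privileges, and the flags shown are those of the stock criu CLI.

    import subprocess
    from pathlib import Path

    CRIU = "criu"  # assumes the CRIU binary is on PATH and run with root privileges

    def suspend(pid: int, image_dir: str) -> None:
        """Dump: seize the process tree rooted at `pid`, save its state, and stop it."""
        Path(image_dir).mkdir(parents=True, exist_ok=True)
        subprocess.run(
            [CRIU, "dump",
             "--tree", str(pid),          # root of the process tree to checkpoint
             "--images-dir", image_dir,   # where the checkpoint image files are written
             "--shell-job",               # allow dumping a process attached to a terminal
             "--tcp-established"],        # also save established TCP connections
            check=True)

    def resume(image_dir: str) -> None:
        """Restore: recreate the saved process tree from the checkpoint images."""
        subprocess.run(
            [CRIU, "restore",
             "--images-dir", image_dir,
             "--shell-job",
             "--tcp-established",
             "--restore-detached"],       # detach so the restored tree keeps running
            check=True)

    # Hypothetical usage with an NVM-backed image directory (e.g. a PMFS mount):
    # suspend(12345, "/mnt/pmfs/ckpt/task-12345")
    # resume("/mnt/pmfs/ckpt/task-12345")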

SLIDE 8

Suspend-Resume with DFS and NVM

Support distributed and remote checkpoint-resume

– Save checkpoints on HDFS

Checkpoint with NVM

– Use NVM as fast disk

– Save CRIU checkpoints in NVM-based file systems (e.g., PMFS)

– Use NVM as virtual memory (NVRAM)

– Copy checkpoints from DRAM to NVM using memory operations
– Shadow buffer

Incremental checkpointing

[Figure: Checkpointing with NVRAM: each node's process address space in DRAM is copied with memory operations into node-local NVRAM that forms a distributed, shared NVRAM pool]

[Figure: Distributed checkpoint-restore: a process is dumped on one node to checkpoint files on HDD, SSD, or NVM, the files are shared through a distributed file system (HDFS), and the process is restored on another node]
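
A minimal sketch of remote suspend/resume through a distributed file system, assuming the criu and hdfs command-line tools are available on every node; all paths and the PID are illustrative.

    import subprocess

    def checkpoint_to_hdfs(pid: int, local_dir: str, hdfs_parent: str) -> None:
        """Dump a task locally (ideally to an NVM-backed mount), then publish the
        checkpoint images to HDFS so any node in the cluster can restore the task."""
        subprocess.run(["criu", "dump", "--tree", str(pid),
                        "--images-dir", local_dir, "--shell-job"], check=True)
        subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_parent], check=True)
        subprocess.run(["hdfs", "dfs", "-put", "-f", local_dir, hdfs_parent], check=True)

    def restore_from_hdfs(hdfs_dir: str, local_parent: str) -> None:
        """Fetch the checkpoint images onto the restoring node, then resume the
        process tree there."""
        subprocess.run(["hdfs", "dfs", "-get", hdfs_dir, local_parent], check=True)
        image_dir = local_parent.rstrip("/") + "/" + hdfs_dir.rstrip("/").rsplit("/", 1)[-1]
        subprocess.run(["criu", "restore", "--images-dir", image_dir,
                        "--shell-job", "--restore-detached"], check=True)

    # Hypothetical usage across two nodes:
    # Node A: checkpoint_to_hdfs(4321, "/mnt/pmfs/ckpt/task-4321", "/checkpoints")
    # Node C: restore_from_hdfs("/checkpoints/task-4321", "/mnt/pmfs/ckpt")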

SLIDE 9

Suspend and Restore Performance

[Figure: Total dump + restore time vs. checkpoint size (2-10 GB) on a local file system (HDD, SSD, NVM) and on HDFS (HDD, SSD, PMFS)]

SLIDE 10

Benefits of Incremental Checkpointing

5GB initial dump size, change 10% of the memory and dump again

Storage | First Checkpoint | Second Checkpoint
HDD     | 169.18 s         | 15.34 s
SSD     | 43.73 s          | 4.08 s
PMFS    | 2.92 s           | 0.28 s
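
A sketch of how such incremental dumps can be produced with CRIU's memory tracking: a pre-dump snapshots memory and starts tracking dirtied pages, and the final dump writes only the pages modified since. Directory names and the PID are illustrative, and a kernel with soft-dirty page tracking is assumed.

    import os
    import subprocess

    def pre_dump(pid: int, img_dir: str) -> None:
        """First pass: snapshot memory and start tracking dirtied pages;
        the task keeps running afterwards."""
        os.makedirs(img_dir, exist_ok=True)
        subprocess.run(["criu", "pre-dump", "--tree", str(pid),
                        "--images-dir", img_dir, "--track-mem", "--shell-job"],
                       check=True)

    def incremental_dump(pid: int, img_dir: str, prev_img_dir: str) -> None:
        """Final dump: write only the pages modified since the previous snapshot,
        which is why the second checkpoint above is so much cheaper."""
        os.makedirs(img_dir, exist_ok=True)
        rel_prev = os.path.relpath(prev_img_dir, img_dir)  # --prev-images-dir is relative
        subprocess.run(["criu", "dump", "--tree", str(pid),
                        "--images-dir", img_dir, "--track-mem",
                        "--prev-images-dir", rel_prev, "--shell-job"],
                       check=True)

    # Hypothetical usage against a PMFS mount:
    # pre_dump(7777, "/mnt/pmfs/ckpt/task-7777/1")
    # ... the task keeps running and modifies ~10% of its memory ...
    # incremental_dump(7777, "/mnt/pmfs/ckpt/task-7777/2", "/mnt/pmfs/ckpt/task-7777/1")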

SLIDE 11

Google Trace-driven Simulation

[Figures: Google trace-driven simulation results, comparing Preempt-Kill with Basic-HDD/SSD/NVM checkpointing: (a) resource wastage (wasted CPU capacity, core-hours); (b) energy consumption (kWh); (c) performance (normalized response time for lowest-, medium-, and highest-priority jobs); (d) normalized completion time of low- and high-priority jobs and normalized power consumption vs. checkpoint bandwidth (1-5 GB/s) for Wait, Kill, and Checkpoint; slide callouts: 23%, 6%, 76%, 5%, 74%]

SLIDE 12

Adaptive Policies and Optimization

– Adaptive preemption dynamically selects victim tasks and preemption mechanisms (checkpoint or kill) based on each task's progress and its checkpoint/restore overhead
– Adaptive resumption restores preempted jobs/tasks locally or remotely according to their overheads and the available resources
– Incremental checkpointing with memory trackers

SLIDE 13

Adaptive Preemption Algorithms

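The full algorithms are given in the paper; the snippet below is only an illustrative sketch of the selection logic summarized on the previous slide. The field names, cost model, and victim ordering are assumptions made for illustration, not the authors' exact algorithm.

    from dataclasses import dataclass

    @dataclass
    class Task:
        task_id: str
        work_done: float        # CPU-seconds of progress lost if the task is killed
        checkpoint_cost: float  # estimated dump + restore time, in seconds
        demand: float           # resources (e.g. containers) freed by preempting it

    def choose_preemption(tasks, resources_needed):
        """Pick victim tasks and a mechanism (checkpoint or kill) for each one."""
        plan, freed = [], 0.0

        # Prefer victims whose preemption wastes the least work per unit of resource freed.
        def waste(t):
            return min(t.work_done, t.checkpoint_cost) / max(t.demand, 1e-9)

        for t in sorted(tasks, key=waste):
            if freed >= resources_needed:
                break
            # Kill if redoing the lost work is cheaper than checkpointing and restoring;
            # otherwise suspend the task with CRIU and resume it later.
            mechanism = "kill" if t.work_done < t.checkpoint_cost else "checkpoint"
            plan.append((t.task_id, mechanism))
            freed += t.demand
        return plan

    # Example: free 2 containers from three hypothetical low-priority tasks.
    # choose_preemption([Task("a", 5.0, 30.0, 1), Task("b", 900.0, 12.0, 1),
    #                    Task("c", 300.0, 20.0, 1)], resources_needed=2)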

SLIDE 14

Performance Improvement with Adaptive Policies

[Figures: Normalized response time of lowest-, medium-, and highest-priority jobs with basic vs. adaptive preemption on HDD, SSD, and NVM; annotated improvements: 36%, 55%, 29%, 12%, 17%, 8%, 3%, 8%, 0.5%]
SLIDE 15

Implementation with Hadoop YARN

YARN – cluster resource manager

– Global resource scheduler (ResourceManager)
– ApplicationMasters (jobs) are submitted to the RM
– Supports capacity and fair scheduling

DistributedShell

– Comes standard with YARN
– Runs a shell command in a set of containers in a distributed and parallel manner
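
For illustration, a long-running DistributedShell job (the kind of workload that can later be suspended instead of killed) can be submitted as sketched below; the jar path and resource sizes are hypothetical, and the options follow the stock DistributedShell client.

    import subprocess

    # Hypothetical jar location; the version suffix depends on the installed Hadoop release.
    DSHELL_JAR = "/opt/hadoop/share/hadoop/yarn/hadoop-yarn-applications-distributedshell.jar"
    DSHELL_CLIENT = "org.apache.hadoop.yarn.applications.distributedshell.Client"

    def run_distributed_shell(command: str, num_containers: int = 4,
                              memory_mb: int = 2048) -> None:
        """Submit a DistributedShell job that runs `command` in parallel containers."""
        subprocess.run(
            ["yarn", "jar", DSHELL_JAR, DSHELL_CLIENT,
             "-jar", DSHELL_JAR,              # jar shipped to the ApplicationMaster
             "-shell_command", command,       # command executed in every container
             "-num_containers", str(num_containers),
             "-container_memory", str(memory_mb)],
            check=True)

    # e.g. a long-running batch step occupying 24 one-CPU / 2 GB containers:
    # run_distributed_shell("python long_batch_step.py", num_containers=24)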


[Figure: Preemption workflow with YARN]
1. New job submitted to the YARN ResourceManager / cluster scheduler
2. Preemption request sent to the Application Preemption Manager in the YARN ApplicationMaster
3. Victim tasks on each YARN NodeManager are suspended (CRIU dump) to HDD, SSD, NVM (PMFS), or HDFS
4. Suspend complete reported back to the ResourceManager, freeing the containers
5. Container request issued once resources become available again
6. Suspended tasks are resumed (CRIU restore) from their checkpoints

SLIDE 16

Testbed and Experiment Setup

– 8-node Hadoop YARN cluster
– Dual-socket Xeon 5650 CPUs (6 cores each)
– 96 GB memory (48 GB emulated NVM using PMFS)
– 500 GB HDD (un-optimized)
– 120 GB SSD
– 24 concurrent containers (1 CPU / 2 GB memory each)

– Workload

– Modeled after a Facebook workload[1]
– Mix of high- and low-priority jobs (7,000+ tasks)

[1] Cheng et al. Mitigating the Negative Impact of Preemption on Heterogeneous MapReduce Workloads. CNSM 2011.

SLIDE 17

Comparison of Different Preemption Policies on YARN

[Figures: YARN testbed results with basic preemption, comparing Kill with Chk-HDD/SSD/NVM: resource wastage (CPU wastage, core-hours), energy consumption (kWh), average response time (min) for low- and high-priority jobs, and a response-time CDF; slide callouts: 22%, 61%, 35%, 67%]

SLIDE 18

Benefits of Adaptive Preemption

[Figures: Response time (min) of low- and high-priority jobs with basic vs. adaptive preemption on HDD, SSD, and NVM; annotated improvements: 28%, 7% (HDD), 16%, 7% (SSD), 20%, 14% (NVM)]

SLIDE 19

Overhead of Checkpoint-based Preemption

CPU overhead is negligible, but I/O overhead is significant on slow storage

[Figures: CPU overhead (%) and I/O overhead (%) of basic and adaptive checkpoint-based preemption on HDD, SSD, and NVM]

SLIDE 20

Conclusion and Future Work

– Preemption in shared clusters is expensive; preemption using application-transparent checkpointing improves resource efficiency and overall application performance.
– Adaptive preemption that combines checkpoint and kill further improves performance and reduces preemption cost.
– Leveraging emerging fast storage technologies such as NVM yields even greater savings.

Future Work

– A wide range of applications
– Checkpointing with NVRAM
– Integration with other cluster scheduling policies


SLIDE 21

Thank you

Contact: jack.li@cc.gatech.edu, yuan.chen@hpe.com