Improving Preemptive Scheduling with Application-Transparent Checkpointing in Shared Clusters
Jack Li, Calton Pu (Georgia Institute of Technology); Yuan Chen, Vanish Talwar, Dejan Milojicic (Hewlett Packard Labs)
Shared Clusters for Big Data Systems
– Dynamic resource sharing across multiple frameworks, apps and users
– Examples: Google cluster (Omega), Mesos, Hadoop YARN, Bing's Dryad
– Compared to dedicated per-framework clusters, a shared cluster improves utilization and data sharing and reduces cost
[Diagram: a single shared hardware cluster under one cluster manager (e.g., YARN, Mesos) hosting batch (MapReduce), streaming (Storm), in-memory (Spark), graph (Giraph) and online (Vertica) frameworks, versus separate dedicated clusters for each framework]
Preemption in Shared Clusters
– Preemption coordinates resource sharing, guarantees QoS and enforces fairness
– Problem: preemption in shared clusters is expensive!
– Current schedulers simply kill preempted jobs and restart them later
– Significant resource waste
– Delays the completion time of long-running or low-priority jobs
Real World Examples
– Google cluster: 12.4% of scheduled tasks preempted and up to 30k CPU-hours (35% of total capacity) wasted!
– Microsoft Dryad cluster [1]: ~21% of jobs killed
– Facebook Hadoop cluster [2]: long-running jobs repeatedly killed and restarted
29-day trace from Google: 672,000 jobs on 12,500 machines

Task Priority     | Num. of Tasks | Percent Evicted
Free (0-1)        | 28.4M         | 20.26%
Middle (2-8)      | 17.3M         | 0.55%
Production (9-11) | 1.70M         | 1.02%

Latency Sensitivity | Num. of Tasks | Percent Evicted
0 (lowest)          | 37.4M         | 11.76%
1                   | 5.94M         | 18.87%
2                   | 3.70M         | 8.14%
3 (highest)         | 0.28M         | 14.80%

Even latency-sensitive tasks are evicted
[Figure: preemption rate (%) over the 30-day trace, shown separately for low-, medium- and high-priority tasks]
[Figure: preemption frequency distribution: distinct tasks (thousands) by number of preemptions (1 to >10)]
[1] Ananthanarayanan et al. Scarlett: Coping with Skewed Content Popularity in MapReduce Clusters. EuroSys 2011.
[2] Cheng et al. Mitigating the Negative Impact of Preemption on Heterogeneous MapReduce Workloads. CNSM 2011.
Many tasks are preempted: 43% are preempted more than once, and 17% are preempted 10 times or more
Checkpointing-based Preemptive Scheduling
Our solution: use checkpoint/restore for preemption instead of kill/restart
Use a system-level, application-transparent checkpointing mechanism
– Linux CRIU (Checkpoint/Restore In Userspace)
– Distributed and remote checkpoint/restart
Leverage fast storage such as NVM for efficient checkpointing
– Store checkpoints on NVM (NVMFS or NVRAM)
Adaptive preemption policies and optimization techniques
– Combine checkpoint and kill, local and remote checkpointing/resumption
– Incremental checkpointing with memory trackers
Application-transparent Suspend-Resume
Checkpointing using CRIU (Checkpoint/Restore In Userspace)
– Freezes a running program and suspends it in memory or dumps its state to disk
– Saves sockets, threads, namespaces, memory mappings, pipes
Dump
– Builds the process tree from /proc/$pid/task/$tid/children and seizes each process with ptrace
– Collects VMA areas, file descriptor numbers, registers, etc. of each process
Restore
– Reads the process tree from the image files and recreates the saved processes with clone()
– Creates new memory maps and fills them with the checkpointed data
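As a concrete illustration of the dump and restore steps above, the sketch below drives the CRIU command-line tool from Python. It is not taken from the slides; the PID, image directory and flags such as --shell-job are placeholders that may need to change for a real workload.

# Illustrative sketch: suspend and resume a process tree with the CRIU CLI.
# The PID and image directory below are placeholders, not values from the talk.
import subprocess

def checkpoint_task(pid: int, image_dir: str) -> None:
    # 'criu dump' freezes the process tree rooted at pid and writes its state
    # (memory mappings, file descriptors, sockets, ...) as image files.
    subprocess.run(
        ["criu", "dump", "-t", str(pid), "-D", image_dir, "--shell-job"],
        check=True)

def restore_task(image_dir: str) -> None:
    # 'criu restore' recreates the saved processes from the image files and
    # resumes them where the dump left off (-d detaches CRIU afterwards).
    subprocess.run(
        ["criu", "restore", "-D", image_dir, "--shell-job", "-d"],
        check=True)

if __name__ == "__main__":
    images = "/mnt/nvm/checkpoints/task-12345"  # e.g., an NVM-backed file system
    checkpoint_task(12345, images)
    restore_task(images)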
Suspend-Resume with DFS and NVM
Support distributed and remote checkpoint-resume
– Save checkpoints on HDFS
Checkpoint with NVM
– Use NVM as fast disk
– Save CRIU checkpoints in NVM-based file systems (e.g., PMFS)
– Use NVM as virtual memory (NVRAM)
– Copy checkpoints from DRAM to NVM using memory operations
– Shadow buffer
Incremental checkpointing
[Diagram: checkpointing with NVRAM: each node copies the process address space from DRAM into local or distributed shared NVRAM with memory operations]
[Diagram: distributed suspend-resume: a task is dumped to checkpoint files on node A's local storage (HDD, SSD or NVM), shared through a distributed file system (HDFS), and restored on node C]
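To make the remote restore path in the diagram concrete, here is a small sketch (an assumption, not the authors' code) that stages a local CRIU image directory on HDFS with the standard hdfs dfs commands so that another node can fetch it and run the restore.

# Illustrative sketch: share a CRIU image directory through HDFS so a task
# dumped on one node can be restored on another. All paths are placeholders.
import subprocess

def push_checkpoint(local_dir: str, hdfs_dir: str) -> None:
    # Create the target directory and upload the image files (-f overwrites).
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_dir, hdfs_dir], check=True)

def fetch_checkpoint(hdfs_dir: str, local_dir: str) -> None:
    # Download the image files on the node that will run 'criu restore'.
    subprocess.run(["hdfs", "dfs", "-get", hdfs_dir, local_dir], check=True)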
Suspend and Restore Performance
[Figure: total dump + restore time (s) vs. checkpoint size (2-10 GB). Left: local file system on HDD, SSD and NVM. Right: HDFS backed by HDD, SSD and PMFS]
Benefits of Incremental Checkpointing
5 GB initial dump size; 10% of the memory is changed and the process is dumped again
Storage | First Checkpoint | Second Checkpoint
HDD     | 169.18 s         | 15.34 s
SSD     | 43.73 s          | 4.08 s
PMFS    | 2.92 s           | 0.28 s
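The slides attribute the second-checkpoint speedup to incremental checkpointing with memory trackers. One way to obtain this behavior, sketched below as an assumption rather than the authors' exact setup, is CRIU's memory tracking: a pre-dump writes a full image, and later dumps write only the pages changed since the previous image.

# Illustrative sketch: incremental dumps with CRIU's memory tracking.
# The directory layout and PID are placeholders.
import subprocess

def pre_dump(pid: int, image_dir: str) -> None:
    # Full first image; --track-mem enables dirty-page tracking for later dumps.
    subprocess.run(
        ["criu", "pre-dump", "-t", str(pid), "-D", image_dir, "--track-mem"],
        check=True)

def incremental_dump(pid: int, image_dir: str, prev_image_dir: str) -> None:
    # Only memory modified since the image in prev_image_dir is written out
    # (the path is interpreted relative to image_dir).
    subprocess.run(
        ["criu", "dump", "-t", str(pid), "-D", image_dir,
         "--prev-images-dir", prev_image_dir, "--track-mem", "--shell-job"],
        check=True)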
Google Trace-driven Simulation
[Figure: simulation results comparing Preempt-Kill with checkpoint-based preemption (Basic-HDD, Basic-SSD, Basic-NVM): wasted CPU capacity (core-hours), energy consumption (kWh), and normalized response time of lowest-, medium- and highest-priority jobs]
[Figure: normalized completion time of low- and high-priority jobs and normalized power consumption as checkpoint bandwidth varies from 1 to 5 GB/s, for Wait, Kill and Checkpoint preemption]
Adaptive Policies and Optimization
– Adaptive preemption dynamically selects victim tasks and the preemption mechanism (checkpoint or kill) based on each task's progress and its checkpoint/restore overhead
– Adaptive resumption restores preempted jobs/tasks locally or remotely according to their overheads and the available resources
– Incremental checkpointing with memory trackers
Adaptive Preemption Algorithms
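The original slide's algorithm listing is not reproduced in this transcript. The sketch below is only a hedged illustration of the idea described on the previous slide: choose low-priority victims and checkpoint a task only when its checkpoint/restore overhead is smaller than the work that killing it would throw away. The fields, cost model and ordering are assumptions.

# Hedged illustration of an adaptive preemption decision, not the authors'
# actual algorithm. All fields and thresholds are assumed for the example.
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    priority: int              # lower value = lower priority
    elapsed_time_s: float      # work already done (lost if the task is killed)
    checkpoint_cost_s: float   # estimated time to dump and later restore

def choose_mechanism(task: Task) -> str:
    # Checkpoint only when saving the task's state is cheaper than redoing it.
    return "checkpoint" if task.checkpoint_cost_s < task.elapsed_time_s else "kill"

def pick_victims(tasks: list[Task], slots_needed: int) -> list[tuple[Task, str]]:
    # Prefer low-priority tasks that have done the least work as victims.
    victims = sorted(tasks, key=lambda t: (t.priority, t.elapsed_time_s))
    return [(t, choose_mechanism(t)) for t in victims[:slots_needed]]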
Performance Improvement with Adaptive Policies
[Figure: normalized response time of lowest-, medium- and highest-priority jobs with basic vs. adaptive preemption, shown separately for HDD, SSD and NVM]
Implementation with Hadoop YARN
YARN – cluster resource manager
– Global resource scheduler (ResourceManager)
– ApplicationMasters (jobs) are submitted to the RM
– Supports capacity and fair scheduling
DistributedShell
– Comes standard with YARN
– Runs a shell command in a set of containers in a distributed, parallel manner
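For reference, a DistributedShell job is launched by invoking its Client class through the yarn command; the sketch below (an illustration with an assumed jar path, version and resource sizes) wraps that invocation in Python.

# Illustrative sketch: launch a DistributedShell job that runs one shell
# command in several YARN containers. The jar path/version and resource
# sizes are assumptions; adjust them to your Hadoop installation.
import subprocess

DSHELL_JAR = "share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar"

def launch_distributed_shell(command: str, num_containers: int = 4) -> None:
    subprocess.run(
        ["yarn", "org.apache.hadoop.yarn.applications.distributedshell.Client",
         "-jar", DSHELL_JAR,
         "-shell_command", command,
         "-num_containers", str(num_containers),
         "-container_memory", "2048",   # MB per container
         "-container_vcores", "1"],
        check=True)

if __name__ == "__main__":
    launch_distributed_shell("sleep 300")   # each container runs the command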
[Diagram: preemption workflow on YARN]
1. A new job is submitted to the YARN ResourceManager (cluster scheduler).
2. The scheduler issues a preemption request to the Application Preemption Manager in the victim job's ApplicationMaster.
3. The Preemption Manager asks the YARN NodeManagers to suspend the victim tasks; CRIU dumps each task to local HDD, SSD or NVM (PMFS), or to HDFS.
4. Suspend completion is reported back to the ResourceManager.
5. When resources free up, containers are requested for the preempted tasks.
6. The tasks are resumed on the NodeManagers, with CRIU restoring them from the saved checkpoint files.
Testbed and Experiment Setup
– 8-node Hadoop YARN cluster
– Dual-socket Xeon 5650 CPUs (6 cores each)
– 96 GB memory (48 GB emulated NVM using PMFS)
– 500 GB HDD (un-optimized)
– 120 GB SSD
– 24 concurrent containers (1 CPU / 2 GB memory each)
– Workload
– Modeled after a Facebook workload [1]
– Mix of high- and low-priority jobs (7,000+ tasks)
[1] Cheng et al. Mitigating the Negative Impact of Preemption on Heterogeneous MapReduce Workloads. CNSM 2011.
Comparison of Different Preemption Policies on YARN
[Figure: basic preemption policies on YARN (Kill, Chk-HDD, Chk-SSD, Chk-NVM): CPU wastage (core-hours), energy consumption (kWh), average response time (min) of low- and high-priority jobs, and the response-time CDF]
Benefits of Adaptive Preemption
[Figure: response time (min) of low- and high-priority jobs with basic vs. adaptive preemption on HDD, SSD and NVM]
Overhead of Checkpoint-based Preemption
CPU overhead is negligible, but I/O overhead is significant on slow storage
[Figure: CPU overhead (%) and I/O overhead (%) of basic and adaptive checkpoint-based preemption on HDD, SSD and NVM]
Conclusion and Future Work
– Preemption in shared clusters is expensive; preemption based on application-transparent checkpointing improves resource efficiency and overall application performance.
– Adaptive preemption that combines checkpoint and kill further improves performance and reduces preemption cost.
– Leveraging emerging fast storage technologies such as NVM yields even more savings.
Future Work
– A wider range of applications
– Checkpointing with NVRAM
– Integration with other cluster scheduling policies
Thank you!
Contact: jack.li@cc.gatech.edu, yuan.chen@hpe.com