NEPTUNE: Scheduling Suspendable Tasks for Unified Stream/Batch Applications


SLIDE 1

NEPTUNE

Scheduling Suspendable Tasks for Unified Stream/Batch Applications

SoCC, Santa Cruz, California, November 2019

Panagiotis Garefalakis

Imperial College London

pgaref@imperial.ac.uk

Konstantinos Karanasos

Microsoft

kokarana@microsoft.com

Peter Pietzuch

Imperial College London

prp@imperial.ac.uk

SLIDE 2

Unified application example

Panagiotis Garefalakis - Imperial College London 2

[Diagram: a unified application pairs a training (batch) job, which iterates over historical data to produce a trained model, with an inference (stream) job that uses the model to serve low-latency responses on real-time data.]

SLIDE 3

Evolution of analytics frameworks

[Timeline: batch frameworks (around 2010), stream frameworks (around 2014), and unified stream/batch frameworks such as Spark Structured Streaming that support hybrid stream/batch applications (around 2018).]

SLIDE 4

Stream/Batch application requirements

> Latency: execute the inference job with minimum delay
> Throughput: batch jobs should not be compromised
> Efficiency: achieve high cluster resource utilization

Challenge: schedule stream/batch jobs to satisfy their diverse requirements

SLIDE 5

Stream/Batch application scheduling

[Diagram: the application submits an inference (stream) job and a training (batch) job through the driver's DAG scheduler; each job consists of stages of tasks with short (T) and long (3T) durations.]

SLIDE 6

Stream/Batch application scheduling

[Diagram: with static allocation, the stream and batch jobs run on dedicated executors; cores sit idle between stages, wasting resources.]

> Static allocation: dedicate resources to each job

Resources cannot be shared across jobs

SLIDE 7

Stream/Batch application scheduling

[Diagram: with FIFO on shared executors, batch tasks occupy all cores until completion while stream tasks queue behind them.]

> FIFO: first job runs to completion

Long batch jobs increase stream job latency

SLIDE 8

Stream/Batch application scheduling

[Diagram: with FAIR on shared executors, resources are weight-shared across jobs, but stream tasks still queue behind running batch tasks.]

> FAIR: weighted sharing of resources across jobs

Better packing, but with non-optimal latency

SLIDE 9

Stream/Batch application scheduling

[Diagram: with KILL on shared executors, batch tasks are preempted to make room for stream tasks, but killed tasks lose their progress and must re-execute.]

> KILL: avoid queueing by preempting batch tasks

Better latency at the expense of extra work

SLIDE 10

Stream/Batch application scheduling

[Diagram: with NEPTUNE on shared executors, batch tasks are suspended so stream tasks run immediately, then resumed with their progress preserved.]

> NEPTUNE: minimize queueing and wasted work!

SLIDE 11

Challenges

> How to minimize queuing for latency-sensitive jobs and wasted work?
  Implement suspendable tasks
> How to natively support stream/batch applications?
  Provide a unified execution framework
> How to satisfy different stream/batch application requirements and high-level objectives?
  Introduce custom scheduling policies

SLIDE 12

NEPTUNE: an execution framework for stream/batch applications

> Support suspendable tasks
> Introduce pluggable scheduling policies
> Unified execution framework on top of Spark Structured Streaming

SLIDE 13

Typical tasks

[Diagram: a typical task runs as a subroutine on the executor's stack, applying a function to an iterator over a partition of values, with its state held on that stack.]

> Tasks: apply a function to a partition of data
> Subroutines that run in the executor to completion
> Preemption problem:
  > Loss of progress (kill)
  > Unpredictable preemption times (checkpointing)

SLIDE 14

Suspendable tasks

[Diagram: a suspendable task runs its function on a separate coroutine stack; the executor calls into the coroutine, and yield points hand control back to the executor, with the task's state preserved on the coroutine stack.]

> Idea: use coroutines (https://github.com/storm-enroute/coroutines)
  > Separate stacks store task state
  > Yield points hand control back to the executor
> Cooperative preemption:
  > Suspend and resume in milliseconds
  > Work-preserving
> Transparent to the user
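NEPTUNE implements suspendable tasks with Scala coroutines (the storm-enroute library above). As a language-neutral sketch, the same cooperative-preemption idea can be modeled with Python generators, where each `yield` is a point at which the executor may pause the task without losing its progress; all names below are illustrative, not NEPTUNE's API:

```python
# Illustrative sketch only: NEPTUNE uses Scala coroutines, but Python
# generators exhibit the same suspend/resume behavior.
def suspendable_task(partition, func, yield_every=100):
    """Apply `func` to every record, yielding control periodically."""
    results = []
    for i, record in enumerate(partition):
        results.append(func(record))
        if (i + 1) % yield_every == 0:
            yield None           # yield point: executor may suspend here
    yield results                # final yield carries the task's result

class Executor:
    def run(self, task, should_suspend=lambda: False):
        """Drive a task; pause it when asked to make room for other work."""
        result = None
        while True:
            try:
                result = next(task)      # run until the next yield point
            except StopIteration:
                return ("done", result)  # last yielded value is the result
            if should_suspend():
                # Work-preserving: the generator keeps its state and can
                # simply be passed to run() again later.
                return ("suspended", task)

ex = Executor()
status, result = ex.run(suspendable_task(range(10), lambda x: x * 2, yield_every=4))
```

Suspending at a yield point and resuming later finishes the task with no repeated work, which is exactly the property the KILL approach lacks.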

SLIDE 15

Execution framework

> Idea: a centralized scheduler with pluggable policies
> Problem: the scheduler must not only assign tasks but also suspend and resume them

[Diagram: the DAG scheduler, incrementalizer, and optimizer feed the task scheduler, which launches tasks on executors according to a pluggable scheduling policy; using application and job priorities plus executor metrics, the policy decides which running low-priority tasks to suspend so that high-priority tasks can run, leaving the suspended tasks paused for later resumption.]
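As a hypothetical sketch of the "suspend & run" decision above (class and field names are invented for illustration, not NEPTUNE's actual interfaces), a policy can compare an incoming task's priority against the lowest-priority running task:

```python
import heapq

# Hypothetical sketch: a task scheduler that suspends the lowest-priority
# running task when a higher-priority task arrives and no core is free.
# Lower priority numbers mean higher priority.
class TaskScheduler:
    def __init__(self, cores):
        self.cores = cores
        self.running = []   # heap of (-priority, task_id): root = most suspendable

    def submit(self, priority, task_id):
        if len(self.running) < self.cores:
            heapq.heappush(self.running, (-priority, task_id))
            return ("launch", task_id)
        neg_pri, victim = self.running[0]        # lowest-priority running task
        if -neg_pri > priority:                  # incoming task outranks it
            heapq.heapreplace(self.running, (-priority, task_id))
            return ("suspend", victim, "launch", task_id)
        return ("queue", task_id)                # nothing to suspend: wait
```

With this rule, batch tasks submitted at low priority are suspended (not killed) the moment a stream task arrives, so stream tasks never queue behind batch work.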

SLIDE 16

Scheduling policies

> Idea: policies trigger task suspension and resumption
  > Guarantee that stream tasks bypass batch tasks
  > Satisfy higher-level objectives, such as balancing cluster load
  > Avoid starvation by suspending a task only up to a bounded number of times
> Load-balancing (LB): takes executors' memory conditions into account and equalizes the number of tasks per node
> Locality- and memory-aware (LMA): respects task locality preferences in addition to load balancing
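A minimal sketch of the load-balancing (LB) rule just described, assuming hypothetical executor records (the field names are illustrative, not NEPTUNE's metrics schema): among executors with enough free memory, place the task on the one running the fewest tasks.

```python
# Illustrative LB placement: equalize tasks per executor, subject to
# each executor's memory condition. Field names are hypothetical.
def place_task(executors, task_mem):
    """executors: list of dicts with 'id', 'tasks', and 'free_mem' (GB)."""
    eligible = [e for e in executors if e["free_mem"] >= task_mem]
    if not eligible:
        return None                          # no executor can host it yet
    target = min(eligible, key=lambda e: e["tasks"])
    target["tasks"] += 1                     # account for the placement
    target["free_mem"] -= task_mem
    return target["id"]

executors = [
    {"id": "e1", "tasks": 3, "free_mem": 4.0},
    {"id": "e2", "tasks": 1, "free_mem": 1.0},   # least loaded, but low on memory
    {"id": "e3", "tasks": 2, "free_mem": 8.0},
]
```

The LMA policy would additionally prefer executors holding the task's input partition, falling back to this rule when locality cannot be satisfied.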

SLIDE 17

Implementation

> Built as an extension to Apache Spark 2.4.0 (https://github.com/lsds/Neptune)
> Ported all ResultTask and ShuffleMapTask functionality across programming interfaces to coroutines
> Extended Spark's DAG scheduler to allow job stages with different requirements (priorities)
> Added additional executor performance metrics as part of the heartbeat mechanism

SLIDE 18

Azure deployment

> Cluster
  – 75 nodes with 4 cores and 32 GB of memory each
> Workloads
  – LDA: ML training/inference application uncovering hidden topics in a group of documents
  – Yahoo Streaming Benchmark (YSB): ad analytics on a stream of ad impressions
  – TPC-H: decision-support benchmark

SLIDE 19

Benefit of NEPTUNE in stream latency

> LDA: a training (batch) job uses all available resources, with a latency-sensitive inference (stream) job using 15% of the resources

[Chart: 5th-percentile, median, and 99th-percentile streaming latency (seconds) under static allocation (DIFF-EXEC), FIFO, FAIR, KILL, priority-only (PRI-ONLY), and NEPTUNE's policies (NEP-LB, NEP-CL), compared against isolated execution.]

NEPTUNE achieves latencies comparable to the ideal for the latency-sensitive jobs

SLIDE 20

Impact of resource demands on performance

> YSB: increasing the stream job's resource demands while the batch job uses all available resources

[Chart: streaming latency (seconds) and batch throughput (M events/s) as the share of cores used for streaming grows from 0% to 100%; batch throughput drops by only about 1.5%.]

Efficiently share resources with low impact on throughput

SLIDE 21

Summary

NEPTUNE supports complex unified applications with diverse job requirements!
> Suspendable tasks using coroutines
> Pluggable scheduling policies
> Continuous unified analytics

Thank you! Questions?
Panagiotis Garefalakis, pgaref@imperial.ac.uk
https://github.com/lsds/Neptune