Pigeon: an Effective Distributed, Hierarchical Datacenter Job Scheduler

Zhijun Wang, Huiyang Li, Zhongwei Li, Xiaocui Sun, Jia Rao, Hao Che and Hong Jiang
University of Texas at Arlington


slide-1
SLIDE 1

Pigeon: an Effective Distributed, Hierarchical Datacenter Job Scheduler

Zhijun Wang, Huiyang Li, Zhongwei Li, Xiaocui Sun, Jia Rao, Hao Che and Hong Jiang
University of Texas at Arlington

slide-2
SLIDE 2

Datacenter job scheduling challenges-I

  • Large scale
  • - Cluster size is large: tens of thousands of nodes/workers
  • - The number of tasks in a job can be large: tens of thousands of tasks in a job
  • - More than 50K tasks in a job in the Cloudera trace

slide-3
SLIDE 3

Datacenter job scheduling challenges-II

  • Heterogeneous workload
  • - Short jobs (e.g., user-facing applications) call for short response times
  • - Long jobs (e.g., data backup) call for a mean response time guarantee

slide-4
SLIDE 4

Centralized job scheduling

  • Scalability problem
  • - A scheduler manages all the workers' resources in a cluster

[Figure: Job, Scheduler, short job task queue, long job task queue, Workers]

slide-5
SLIDE 5

Distributed scheduling-Sparrow

  • Low efficiency: unbalanced probing
  • - A scheduler needs to maintain all probes.

[Figure: Job, Scheduler, Workers, task queue]

slide-6
SLIDE 6

Hybrid scheduling-Eagle, Hawk

  • All short jobs are put on reserved workers
  • Scalability problem

[Figure: Long job, Short job, Centralized Scheduler, Distributed Scheduler, Workers, task queue, Reserved workers (only serve short job tasks)]
slide-7
SLIDE 7

Pigeon

  • Contributions
  • 1. Introduce a master level for task distribution: a new architecture, a hierarchical job scheduler
  • 2. Fully solve the scalability problem
  • 3. High efficiency
slide-8
SLIDE 8

Overview of Pigeon

  • Master: centrally manages a group of workers and dispatches tasks to workers
  • Master is job agnostic

[Figure: Job, Distributed Scheduler, Task, Master, group of workers, Reserved workers]

slide-9
SLIDE 9

Job scheduling in Pigeon

[Figure: Job, Distributed Scheduler, Weighted fair queue (W), Idle worker list, workers]
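The slide only names the pieces (an idle worker list and a weighted fair queue with weight W), so the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that a task is sent to an idle worker when one exists, that queued tasks are split into a short-job queue and a long-job queue, and that W means "serve up to W short tasks for each long task". The class `Master` and all of its method names are hypothetical, and reserved workers are left out to keep the sketch short.

```python
from collections import deque


class Master:
    """Hypothetical master that centrally manages one group of workers."""

    def __init__(self, worker_ids, weight):
        self.idle_workers = deque(worker_ids)  # idle worker list
        self.short_queue = deque()             # queued short-job tasks
        self.long_queue = deque()              # queued long-job tasks
        self.weight = weight                   # W (assumed: short tasks per long task)
        self.short_served = 0                  # short tasks served since the last long task

    def submit(self, task, is_short):
        """Dispatch to an idle worker if there is one; otherwise queue the task."""
        if self.idle_workers:
            return self.idle_workers.popleft(), task
        (self.short_queue if is_short else self.long_queue).append(task)
        return None

    def _next_queued_task(self):
        """Weighted fair pick: up to W short tasks, then one long task."""
        long_turn = self.short_served >= self.weight
        if self.long_queue and (long_turn or not self.short_queue):
            self.short_served = 0
            return self.long_queue.popleft()
        if self.short_queue:
            self.short_served += 1
            return self.short_queue.popleft()
        return None

    def worker_finished(self, worker_id):
        """A freed worker takes the next queued task or rejoins the idle list."""
        task = self._next_queued_task()
        if task is None:
            self.idle_workers.append(worker_id)
            return None
        return worker_id, task
```

With two workers and W = 2, the first two submitted tasks go straight to the idle workers; later tasks wait in the queues and are handed out each time `worker_finished` is called, two short tasks for every long one while both queues are non-empty.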

slide-10
SLIDE 10

Why is Pigeon better?

Solves key challenges in existing schedulers:

  • Scalable: greatly reduces status maintenance costs in job schedulers
  • Efficient: removes head-of-line blocking and has statistical multiplexing gain within a group

Group size 100: the number of masters is 1% of the number of workers, which reduces status maintenance cost by 99%.

Group size 100: running at 90% load, the probability of a task finding an idle worker in a group is 1 - 0.9^100 ≈ 99.99734%.
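The two group-size figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes, as the slide's formula implies, that each worker in a group is busy independently with probability equal to the load; the function names are illustrative only.

```python
def idle_worker_probability(load, group_size):
    """Probability that at least one of `group_size` workers is idle,
    assuming each worker is busy independently with probability `load`."""
    return 1.0 - load ** group_size


def master_to_worker_ratio(group_size):
    """One master per group, so masters are 1/group_size of the workers."""
    return 1.0 / group_size


p_idle = idle_worker_probability(load=0.9, group_size=100)
ratio = master_to_worker_ratio(group_size=100)
print(f"P(idle worker in group) = {p_idle:.5%}")
print(f"masters / workers = {ratio:.0%}")
```

Under the same assumption, a group of 10 workers at 90% load gives only 1 - 0.9^10, roughly 65%, which illustrates why a larger group yields more multiplexing gain.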

slide-11
SLIDE 11

Modeling and Analysis

Consider a single type of job, where the fanout degree of a job is less than the number of masters. The task queueing time in a master is modeled as an M/M/K queue (K is the group size).

  • Zero queueing time: a job with no queueing time
  • The task execution time within a job is the same
  • Running at 30% higher utilization
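For readers who want to experiment with the M/M/K model named on this slide, the standard Erlang C formula gives the probability that an arriving task has to queue and the mean queueing time. This is the textbook result for an M/M/K queue, not a reproduction of the authors' analysis; the function names and the example rates are assumptions chosen for illustration.

```python
def erlang_c(servers, offered_load):
    """Probability that an arrival must wait in an M/M/K queue (Erlang C).

    `offered_load` is arrival_rate / service_rate and must be below `servers`.
    Uses the numerically stable Erlang B recursion, then converts B to C.
    """
    if not 0 <= offered_load < servers:
        raise ValueError("offered_load must satisfy 0 <= offered_load < servers")
    blocking = 1.0  # Erlang B with zero servers
    for n in range(1, servers + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    utilization = offered_load / servers
    return blocking / (1.0 - utilization * (1.0 - blocking))


def mean_queueing_time(servers, arrival_rate, service_rate):
    """Mean time a task spends waiting in queue in an M/M/K system."""
    offered_load = arrival_rate / service_rate
    return erlang_c(servers, offered_load) / (servers * service_rate - arrival_rate)


# Same 90% utilization, two group sizes (example rates are hypothetical).
p_wait_small = erlang_c(servers=10, offered_load=9.0)
p_wait_large = erlang_c(servers=100, offered_load=90.0)
print(p_wait_small, p_wait_large)
```

At equal utilization the larger group has the lower waiting probability, which is the statistical multiplexing effect the deck attributes to grouping workers under a master.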

slide-12
SLIDE 12

Evaluation--Implementation

  • Spark plug-in, Amazon EC2 cloud
  • 120-worker cluster (3 groups in Pigeon)
  • Measurement metrics: 50th, 90th and 99th percentile short and long job completion time
  • Compare with state-of-the-art schedulers: Eagle and Sparrow
  • Source code: https://github.com/ruby-/pigeon/

slide-13
SLIDE 13

Pigeon vs Eagle--Implementation

  • 20x~30x short job performance gains

[Figure: Eagle normalized to Pigeon]

slide-14
SLIDE 14

Pigeon vs Sparrow--Implementation

  • Pigeon works in a real cluster

[Figure: Sparrow normalized to Pigeon]

slide-15
SLIDE 15

Evaluation--Large Scale Simulation

  • Event-driven simulator
  • Google, Yahoo and Cloudera traces
  • Cluster size 3000--19000 workers
  • Measurement metrics: 50th, 90th and 99th percentile short and long job completion time
  • Compare with state-of-the-art hybrid scheduler: Eagle

slide-16
SLIDE 16

Google trace

Slowdown = job completion time / job execution time

  • Big performance gains for short jobs at high loads
  • Slightly better performance for long jobs
  • Pigeon is really scalable and efficient

[Figure: Eagle]
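As a small illustration of the slowdown metric defined on this slide and the percentile metrics used in the evaluation, the sketch below computes per-job slowdown and nearest-rank percentiles. The job timings are made-up sample values, not data from the traces, and the nearest-rank method is an assumption since the deck does not say how percentiles were computed.

```python
import math


def slowdown(completion_time, execution_time):
    """Slowdown = job completion time / job execution time."""
    return completion_time / execution_time


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in (0, 100])."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical (completion_time, execution_time) pairs in seconds.
jobs = [(12.0, 10.0), (30.0, 10.0), (5.0, 5.0), (8.0, 4.0)]
slowdowns = [slowdown(done, run) for done, run in jobs]

p50 = percentile(slowdowns, 50)
p90 = percentile(slowdowns, 90)
p99 = percentile(slowdowns, 99)
print(p50, p90, p99)
```

A slowdown of 1.0 means the job never waited; larger values mean more of the completion time was spent queueing rather than executing.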

slide-17
SLIDE 17

Conclusion

Pigeon: a new distributed and hierarchical job scheduler with a new scheduling architecture

  • 1. Excellent scalability, better than existing schedulers
  • 2. High efficiency with multiplexing
slide-18
SLIDE 18

Thank you! Questions?