Applied research group: systems+database people building prototypes, publishing papers



SLIDES 1-5

Applied research group

  • Systems+database people building prototypes, publishing papers
  • Collaborating with the Big Data product group at Microsoft
  • Shipping our code to production
  • Open-sourcing our code: Apache Hadoop, REEF, Heron
SLIDES 6-7

Research areas:

  • Resource management
  • Distributed tiered storage
  • Query optimization
  • Log analytics
  • Stream processing

SLIDES 8-17

[Diagram: a central Resource Manager coordinating several Node Managers]

  • 1. Request
  • 2. Allocation
  • 3. Start task

Do we really need a Resource Manager?
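The three-step flow above (request, allocation, start task) can be sketched as a minimal loop. All class and method names here are illustrative stand-ins, not YARN's actual AM-RM or NM protocol APIs.

```python
# Minimal sketch of the request/allocation/start-task loop between an
# application and a central Resource Manager. Names are illustrative;
# YARN's real protocols are considerably richer.

class ResourceManager:
    def __init__(self, nodes):
        self.free = dict(nodes)          # node name -> free container slots

    def allocate(self, slots_needed):    # step 1: app sends a request
        for node, slots in self.free.items():
            if slots >= slots_needed:
                self.free[node] -= slots_needed
                return node              # step 2: RM returns an allocation
        return None                      # no capacity right now

class NodeManager:
    def __init__(self, name):
        self.name, self.running = name, []

    def start_task(self, task):          # step 3: app starts the task on the node
        self.running.append(task)

rm = ResourceManager({"N1": 2, "N2": 2})
nms = {n: NodeManager(n) for n in ("N1", "N2")}

node = rm.allocate(1)                    # 1. Request
if node is not None:                     # 2. Allocation
    nms[node].start_task("j1-task-0")    # 3. Start task
```

The point of the later slides is that each trip through this loop costs a round trip to the central RM, which motivates the "do we really need a Resource Manager?" question.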

SLIDES 18-20

Hadoop 1 world vs Hadoop 2 world (stack, bottom to top):

  • Hardware
  • File System: HDFS 1 → HDFS 2
  • Cluster OS (Resource Management): Hadoop 1.x (MapReduce) → YARN
  • Programming Model(s): MR v1 → MR v2, Tez, Giraph, Storm, Dryad, REEF, ...
  • Application Frameworks: Hive / Pig
  • Users: ad-hoc apps

Scope of resource-management designs:

  • monolithic (e.g., Spark)
  • reuse of RM component (e.g., Heron)
  • YARN: layering abstractions

SLIDES 21-22

But is all this good enough for the Microsoft clusters?

SLIDE 23

Requirements:

  • High resource utilization
  • Scalability
  • Workload heterogeneity
  • Production jobs and predictability

SLIDE 24

100% utilization

SLIDES 25-30

  • Wide variety of workloads
  • Recurring jobs with deadlines (>60%)
  • Predictability
  • Over-provisioned clusters
SLIDES 31-32

CISL projects:

  • Rayon/Morpheus
  • Mercury/Yaq
  • YARN Federation
  • Medea

4 Hadoop committers in CISL; 404 patches as of last night

SLIDE 33

[Hadoop 3.0; ATC 2015, EuroSys 2016]

SLIDES 34-44

[Animation: jobs j1, j2 request containers from the RM and run on nodes N1, N2]

  • Feedback delays: nodes sit idle between allocations

Actual utilization by task duration:

  Task duration   5 sec    10 sec   50 sec   Mixed-5-50   Cosmos-gm
  Utilization     60.59%   78.35%   92.38%   78.54%       83.38%
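A back-of-the-envelope model shows why feedback delays hurt short tasks most: if a slot runs a task for t seconds and then sits idle for roughly d seconds while the next allocation round-trips through the RM, its utilization is about t / (t + d). This is my illustrative model, not the measurement methodology behind the numbers above; the delay value is hypothetical.

```python
# Illustrative model (not the paper's methodology): a slot alternates between
# running a task of duration t and idling for a feedback delay d while the
# next allocation round-trips through the Resource Manager.
def slot_utilization(task_secs, feedback_delay_secs):
    return task_secs / (task_secs + feedback_delay_secs)

# With a hypothetical ~3 s feedback delay, short tasks waste a large fraction
# of the slot, while long tasks amortize the delay:
for t in (5, 10, 50):
    print(t, round(slot_utilization(t, 3.0), 3))
# 5 -> 0.625, 10 -> 0.769, 50 -> 0.943
```

Note how the model reproduces the qualitative trend in the table: utilization climbs steeply as task duration grows relative to the feedback delay.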

SLIDE 45

  • Introduce task queuing at nodes
      • Mask feedback delays
      • Improve cluster utilization
      • Improve task throughput (by up to 40%)
  • Container types: GUARANTEED and OPPORTUNISTIC
      • Keep guarantees for important jobs
      • Use opportunistic execution to improve utilization
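The two container types can be sketched as follows: a node admits GUARANTEED containers immediately, keeps a local queue of OPPORTUNISTIC containers to fill otherwise-idle slots, and preempts an opportunistic container when a guaranteed one needs the slot. This is an illustrative sketch under those assumptions, not Hadoop's implementation.

```python
# Sketch (illustrative, not Hadoop's code) of a node that runs GUARANTEED
# containers immediately and uses a local queue of OPPORTUNISTIC containers
# to keep otherwise-idle slots busy.
from collections import deque

class Node:
    def __init__(self, slots):
        self.slots = slots
        self.running = []                      # list of (container, type)
        self.queue = deque()                   # waiting opportunistic containers

    def submit(self, container, kind):
        if kind == "GUARANTEED":
            if len(self.running) >= self.slots:
                self._preempt_opportunistic()  # make room at an opportunistic task's expense
            self.running.append((container, kind))   # guaranteed always admitted here
        else:
            self.queue.append(container)       # OPPORTUNISTIC waits its turn
            self._fill_idle_slots()

    def _fill_idle_slots(self):
        while self.queue and len(self.running) < self.slots:
            self.running.append((self.queue.popleft(), "OPPORTUNISTIC"))

    def _preempt_opportunistic(self):
        for entry in self.running:
            if entry[1] == "OPPORTUNISTIC":
                self.running.remove(entry)
                self.queue.appendleft(entry[0])  # requeue so it can retry later
                return
```

For example, on a one-slot node an opportunistic container runs while the slot is idle, then gets kicked back to the queue the moment a guaranteed container arrives: guarantees are kept, and the idle time is still harvested.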
SLIDES 46-56

[Animation: with node-side queues, jobs j1, j2 keep N1, N2 busy between RM allocations]
SLIDES 57-60

So all we need to do is use long queues?

SLIDES 61-63

Despite the utilization gains, long queues can be detrimental for job completion times. Proper queue management techniques are required.

SLIDES 64-67

[Animation: tasks queuing up at nodes N1, N2, N3]

SLIDES 68-69

Queue management techniques:

  • Place tasks to node queues
  • Prioritize task execution (queue reordering)
  • Bound queue lengths

Yaq improves median job completion time by 1.7x over YARN

SLIDES 70-75

[Animation: the RM places tasks on N1, N2, N3 using per-node queue length and queue wait time]
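The placement signals above differ in quality: queue length alone can mislead, because a short queue of long tasks waits longer than a long queue of short ones. A hedged sketch of wait-time-based placement (the function and the duration estimates are illustrative, not Yaq's actual interface):

```python
# Illustrative placement sketch: rather than the node with the shortest queue,
# pick the node with the smallest *estimated queue wait time*, i.e. the sum of
# the estimated durations of the tasks already queued there.
def place_task(node_queues):
    """node_queues: {node: [estimated durations (s) of tasks queued there]}"""
    return min(node_queues, key=lambda n: sum(node_queues[n]))

queues = {
    "N1": [50.0],            # 1 queued task, but a long one: ~50 s wait
    "N2": [5.0, 5.0, 5.0],   # 3 queued tasks, all short:    ~15 s wait
    "N3": [5.0, 50.0],       #                               ~55 s wait
}
print(place_task(queues))    # N2: shortest wait despite the longest queue
```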

SLIDES 76-80

Job-aware queue reordering policies:

  • Shortest Remaining Job First (SRJF)
  • Least Remaining Tasks First (LRTF)

[Example: queued tasks from jobs with j2: 5 tasks, j3: 9 tasks, j1: 21 tasks remaining]
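The reordering idea can be sketched in a few lines: under Least Remaining Tasks First, a node sorts its queue so that tasks belonging to the job with the fewest tasks still outstanding run first, letting short jobs finish early. The function name and data shapes are illustrative, not Yaq's API.

```python
# Illustrative sketch of job-aware queue reordering: Least Remaining Tasks
# First (LRTF) runs queued tasks of the job with the fewest tasks remaining.
def reorder_lrtf(queue, remaining_tasks):
    """queue: list of (task, job); remaining_tasks: {job: tasks still to run}"""
    return sorted(queue, key=lambda tj: remaining_tasks[tj[1]])

queue = [("t1", "j1"), ("t2", "j3"), ("t3", "j2"), ("t4", "j1")]
remaining = {"j1": 21, "j2": 5, "j3": 9}   # counts from the example on the slide
print([t for t, _ in reorder_lrtf(queue, remaining)])
# j2's task first, then j3's, then j1's: ['t3', 't2', 't1', 't4']
```

SRJF is the same pattern with a different key: estimated remaining *work* (time) of the job instead of its remaining task count.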

SLIDES 81-82

  • Unmanaged queues: lower throughput, longer job completion times
  • Yaq: 1.7x improvement in median JCT over YARN

SLIDE 83

Ongoing directions:

  • Container types: distributed scheduling with any distributed scheduler
  • Over-commitment and multi-tenancy
  • Pricing
SLIDES 84-85

Summary: task queuing at nodes improves cluster utilization; queue management techniques protect job completion time.