Applied research group
- Systems+database people building prototypes, publishing papers
- Collaborating with Big Data product group at MS
- Shipping our code to production
- Open-sourcing our code (Apache Hadoop, REEF, Heron)
- Resource management
- Distributed tiered storage
- Query optimization
- Log analytics
- Stream processing
Resource Manager with Node Managers:
- 1. Request
- 2. Allocation
- 3. Start task
- Do we really need a Resource Manager?
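The three-step flow above can be sketched as a toy scheduler. This is a minimal illustration, not the actual YARN API: the `ResourceManager` class, node names, and slot counts are all hypothetical.

```python
# Minimal sketch of the three-step YARN flow: an application (1) requests
# containers from the Resource Manager, (2) receives allocations on nodes
# with free capacity, and (3) starts its tasks on those Node Managers.

class ResourceManager:
    def __init__(self, nodes):
        self.free = dict(nodes)              # node -> free container slots

    def allocate(self, num_containers):
        """Step 2: grant up to num_containers on nodes with free slots."""
        grants = []
        for node, slots in self.free.items():
            while slots > 0 and len(grants) < num_containers:
                slots -= 1
                grants.append(node)
            self.free[node] = slots
        return grants

def run_job(rm, num_tasks):
    allocations = rm.allocate(num_tasks)     # steps 1+2: request -> allocation
    return [f"task on {node}" for node in allocations]   # step 3: start task

rm = ResourceManager({"NM1": 2, "NM2": 2, "NM3": 2})
print(run_job(rm, 4))    # tasks land on the first node managers with free slots
```

The point of the question on the slide is that every allocation round-trips through this central component, which is where the feedback delays discussed later come from.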
Hadoop 1 World vs Hadoop 2 World (layers: Users, Application Frameworks, Programming Model(s), Cluster OS / Resource Management, File System, Hardware):
- Hadoop 1: Hive / Pig on MR v1 (Hadoop 1.x MapReduce), HDFS 1; monolithic
- Hadoop 2: Ad-hoc apps, Hive / Pig; MR v2, Tez, Giraph, Storm, Dryad, REEF, ... on YARN; HDFS 2
- Reuse of RM component
- YARN layering abstractions
- Spark, Heron also run on YARN
But is all this good enough for the Microsoft clusters?
- High resource utilization (target 100% utilization, >60% today; avoid over-provisioned clusters)
- Scalability
- Workload heterogeneity (wide variety of workloads)
- Production jobs and predictability (deadlines, recurring jobs)
- Rayon/Morpheus:
- Mercury/Yaq:
- YARN Federation:
- Medea:
4 Hadoop committers in CISL; 404 patches as of last night
[Hadoop 3.0; ATC 2015, EuroSys 2016]
RM assigns jobs j1, j2 to nodes N1, N2:
- Feedback delays: nodes sit idle between allocations
- Actual utilization by workload (task duration):
  5 sec: 60.59% | 10 sec: 78.35% | 50 sec: 92.38% | Mixed-5-50: 78.54% | Cosmos-gm: 83.38%
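A back-of-the-envelope model shows why utilization rises with task duration in the table above: if a node sits idle for a feedback delay D between allocations, best-case utilization for tasks of duration d is roughly d / (d + D). The delay value D = 3 seconds below is a hypothetical number chosen for illustration only, not a measured figure from the slides.

```python
# Toy feedback-delay model: utilization ~ d / (d + D), where d is task
# duration and D is the idle gap between allocations.
# D = 3.0 is a HYPOTHETICAL illustrative value, not measured.

D = 3.0
for d in (5, 10, 50):
    util = d / (d + D)
    print(f"{d:>3} sec tasks -> {util:.1%} utilization")
```

Short tasks amortize the fixed delay poorly, matching the trend in the measured numbers (5 sec workloads hurt most, 50 sec least).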
- Introduce task queuing at nodes
  - Mask feedback delays
  - Improve cluster utilization
  - Improve task throughput (by up to 40%)
- Container types: GUARANTEED and OPPORTUNISTIC
  - Keep guarantees for important jobs
  - Use opportunistic execution to improve utilization
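The two container types can be sketched as a small node-local policy. The type names GUARANTEED and OPPORTUNISTIC come from the slide; the preemption logic below is a simplification for illustration, not Mercury's actual implementation.

```python
# Sketch of Mercury-style container types: GUARANTEED containers always get a
# slot (preempting opportunistic work if needed); OPPORTUNISTIC containers run
# only on spare capacity and are queued otherwise, masking feedback delays.

from collections import deque

class Node:
    def __init__(self, slots):
        self.slots = slots
        self.running = []                    # list of (task, type)
        self.queue = deque()                 # queued opportunistic tasks

    def submit(self, task, kind):
        if kind == "GUARANTEED":
            if len(self.running) >= self.slots:
                # preempt one opportunistic task and re-queue it at the front
                for i, (t, k) in enumerate(self.running):
                    if k == "OPPORTUNISTIC":
                        self.queue.appendleft((t, k))
                        del self.running[i]
                        break
            self.running.append((task, kind))
        else:
            if len(self.running) < self.slots:
                self.running.append((task, kind))
            else:
                self.queue.append((task, kind))   # wait for spare capacity

n = Node(slots=2)
n.submit("o1", "OPPORTUNISTIC")
n.submit("o2", "OPPORTUNISTIC")
n.submit("g1", "GUARANTEED")       # full node: o1 is preempted and re-queued
print([t for t, _ in n.running], [t for t, _ in n.queue])
```

This captures the slide's trade-off: important jobs keep their guarantees, while opportunistic tasks soak up whatever capacity is left.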
- So all we need to do is use long queues?
- Despite the utilization gains, long queues can be detrimental for job completion times
- Proper queue management techniques are required
Queue management at nodes N1, N2, N3:
- Place tasks to node queues
- Prioritize task execution (queue reordering)
- Bound queue lengths
Yaq improves median job completion time by 1.7x over YARN
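The first and third techniques can be sketched together: place each task on the node with the shortest queue, and stop placing once every queue hits a bound. This is an illustrative simplification; the bound value and the shortest-queue policy are assumptions, not Yaq's exact algorithm.

```python
# Sketch of queue placement with bounded queue lengths: pick the node with
# the shortest queue; if even that queue is at the bound, hold the task back
# (e.g. keep it at the RM) instead of letting queues grow without limit.

QUEUE_BOUND = 3                      # hypothetical per-node bound

def place(task, queues):
    node = min(queues, key=lambda n: len(queues[n]))   # shortest queue first
    if len(queues[node]) >= QUEUE_BOUND:
        return None                                    # all queues full
    queues[node].append(task)
    return node

queues = {"N1": [], "N2": [], "N3": []}
placements = [place(f"t{i}", queues) for i in range(5)]
print(placements)                    # tasks spread evenly across N1, N2, N3
```

Bounding matters because, as the previous slide notes, unbounded queues trade job completion time for utilization.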
Placement signals (RM and nodes N1, N2, N3):
- queue length
- queue wait time
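Why two signals? Queue length alone can mislead when task durations vary: a short queue of long tasks waits longer than a long queue of short tasks. A small sketch, with purely illustrative durations:

```python
# Comparing the two placement signals: queue length vs. estimated queue wait
# time (sum of queued task durations). The durations below are made up.

def wait_time(queue):
    return sum(d for _, d in queue)

queues = {
    "N1": [("a", 50.0)],                   # one long task
    "N2": [("b", 5.0), ("c", 5.0)],        # two short tasks
}
by_length = min(queues, key=lambda n: len(queues[n]))
by_wait   = min(queues, key=lambda n: wait_time(queues[n]))
print(by_length, by_wait)                  # the two signals disagree: N1 vs N2
```

Here length-based placement picks N1 (one queued task) while wait-time-based placement picks N2 (10 s of queued work vs. 50 s).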
Job-aware queue reordering:
- Shortest Remaining Job First (SRJF)
- Least Remaining Tasks First (LRTF)
Example at RM with nodes N1, N2, N3: j1: 21 tasks, j2: 5 tasks, j3: 9 tasks
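Using the remaining-task counts from the example, LRTF can be sketched as a sort over a node's queue. SRJF is analogous but orders by estimated remaining work (e.g. task count times mean task duration); with uniform durations the two orders coincide. The queue contents below are hypothetical.

```python
# Sketch of job-aware queue reordering (LRTF): run tasks of the job with the
# fewest remaining tasks first, so jobs near completion finish sooner.

remaining = {"j1": 21, "j2": 5, "j3": 9}     # remaining tasks per job

queued = ["j1", "j3", "j2", "j1", "j2"]      # tasks queued at one node
lrtf = sorted(queued, key=lambda j: remaining[j])
print(lrtf)   # j2's tasks (5 remaining) run first, j1's (21 remaining) last
```

FIFO would make j2's five remaining tasks wait behind j1's twenty-one; job-aware reordering is what drives the median JCT improvement on the next slide.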
- Without these techniques: lower throughput, longer job completion times
- 1.7x improvement in median JCT over YARN
- Container types beyond distributed scheduling: usable by any distributed scheduler
- Over-commitment
- Multi-tenancy
- Pricing