Riffle: Optimized Shuffle Service for Large-Scale Data Analytics
Haoyu Zhang Brian Cho Ergin Seyfe Avery Ching Michael J. Freedman
Princeton University Facebook
Batch analytics systems are widely used
Large-scale SQL queries
[*] Apache Spark @Scale: A 60 TB+ Production Use Case. https://tinyurl.com/yadx29gl
[Figure: Shuffle Time (sec) and I/O Request Count (/ 10^6) vs. Number of Tasks]
[Figure: Shuffle Fetch Size (KB) vs. Number of Tasks]
[Figure: Shuffle and Spill Time (sec) vs. Number of Map Tasks (300 to 10000)]
[Diagram: Riffle architecture.
 Driver: Job / Task Scheduler, Riffle Merge Scheduler.
 Worker Machine (Worker Node): Executors running Tasks, Riffle Shuffle Service, File System.
 Arrow labels: assign, report task statuses, report merge statuses, send merge requests.]
Optimized Shuffle Service
[Diagram: map tasks, merge requests, and reduce tasks]
Application Driver
Merge Scheduler Worker-Side Merger
[Figure: Map Stage and Reduce Stage Time (sec) for No Merge and N-Way Merge (N = 5, 10, 20, 40)]
[Figure: Read Block Size (KB) and Number of Reads (Request Count) for No Merge and N-Way Merge (N = 5, 10, 20, 40)]
Best-effort merge (95%)
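The slide gives only the label and the 95% figure. As a minimal sketch, assuming best-effort merge means the driver stops waiting once a configured fraction of merge operations has finished and reducers read the original unmerged files for any merge still pending, the logic might look like this. The names `map_stage_ready` and `files_to_fetch` are hypothetical and are not Riffle or Spark APIs.

```python
from typing import Dict, List, Set, Tuple


def map_stage_ready(finished_merges: int, total_merges: int,
                    threshold: float = 0.95) -> bool:
    """Return True once enough merges have finished to launch reducers.

    Assumption: `threshold` is the fraction of merges that must complete
    (the slide mentions 95%).
    """
    if total_merges == 0:
        return True  # nothing to merge, so nothing to wait for
    return finished_merges / total_merges >= threshold


def files_to_fetch(merge_groups: Dict[str, Tuple[str, List[str]]],
                   finished: Set[str]) -> List[str]:
    """Pick the files a reducer reads under best-effort merge.

    merge_groups maps a merge id to (merged_file, original_files).
    Finished merges contribute their single merged file; pending merges
    fall back to their original, unmerged files.
    """
    chosen: List[str] = []
    for merge_id, (merged_file, originals) in merge_groups.items():
        if merge_id in finished:
            chosen.append(merged_file)
        else:
            chosen.extend(originals)
    return chosen
```

With 95 of 100 merges finished the stage is considered ready; a straggling merge costs only the extra reads for its own unmerged files.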
[Diagram: input shuffle files, each containing … Block 65, Block 66, Block 67 …, are merged into one file laid out as Block 65-1, Block 65-2, …, Block 65-m, Block 66-1, Block 66-2, …, Block 66-m, …]
Buffered Read Buffered Write Merge
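The block layout in the diagram above can be sketched as follows: for each block index, the blocks from all m input files are written next to each other, so one reducer can read its data as a single contiguous region. This is an in-memory illustration only; it does not model the buffered disk reads and writes named on the slide, and `merge_shuffle_files` is a hypothetical name.

```python
from typing import List, Tuple


def merge_shuffle_files(files: List[List[bytes]]) -> Tuple[bytes, List[Tuple[int, int]]]:
    """Merge m shuffle files that each hold the same number of blocks.

    files[i][r] is block r of input file i. The merged output places
    block r of every input file side by side (r-1, r-2, ..., r-m in the
    diagram's notation). Returns the merged bytes plus an index of
    (offset, length) per block position.
    """
    if not files:
        return b"", []
    num_blocks = len(files[0])
    merged = bytearray()
    index: List[Tuple[int, int]] = []
    for r in range(num_blocks):
        start = len(merged)
        for f in files:          # append block r from each input file in order
            merged += f[r]
        index.append((start, len(merged) - start))
    return bytes(merged), index


# Three input files with two blocks each.
inputs = [[b"a1", b"b1"], [b"a2", b"b2"], [b"a3", b"b3"]]
merged, index = merge_shuffle_files(inputs)
```

After merging, a reducer needs one sequential read of `index[r]` instead of m separate small reads.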
[Diagram: Job 1 Driver, Job 2 Driver, …, Job k Driver send request 1 … request k to a shared set of Mergers]
Job  Data      Map     Reduce  Block  Description
1    167.6 GB  915     200     983 K  ad
2    1.15 TB   7,040   1,438   120 K  measurement
3    2.7 TB    8,064   2,500   147 K  measurement
4    267 TB    36,145  20,011  360 K  ad
[Figure: Shuffle IO Requests (/ 10^6) for Job1, Job2, Job3, and Job4. Series: No Merge, 512K, 10, 20, 40]
[Figure: Total Task Execution Time (Days) for Job1, Job2, Job3, and Job4. Series: No Merge, 512K, 10, 20, 40]
[Figure: axis label "Time / CPU Days"]
[Diagram: two groups of shuffle files, each file containing Block 1, Block 2, …, Block R. The first group is labeled "N files"; the second group is labeled "total average block size > merge threshold".]
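The two labels in the diagram, "N files" and "total average block size > merge threshold", read as two conditions for triggering a merge. A minimal sketch, assuming "total average block size" means the sum of each pending file's average block size (which is roughly the average block size of the merged file). The `ShuffleFile` type and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ShuffleFile:
    size_bytes: int   # total size of the shuffle file
    num_blocks: int   # R, the number of blocks in the file


def should_merge_n_files(pending: List[ShuffleFile], n: int) -> bool:
    """Fixed N-way policy: merge as soon as N files have accumulated."""
    return len(pending) >= n


def should_merge_by_block_size(pending: List[ShuffleFile],
                               threshold_bytes: int) -> bool:
    """Block-size policy: merge once the summed average block size of the
    pending files exceeds the threshold (assumed interpretation)."""
    total_avg = sum(f.size_bytes / f.num_blocks for f in pending)
    return total_avg > threshold_bytes


# Each file: 1000 bytes in 10 blocks, so an average block of 100 bytes.
five = [ShuffleFile(1000, 10) for _ in range(5)]
six = five + [ShuffleFile(1000, 10)]
```

Under this reading, the fixed N policy is simple but yields merged blocks whose size depends on the inputs, while the block-size policy merges a variable number of files to reach a target read size.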
[Diagram: Merger timeline showing Thread 1, Thread 2, and Thread 3 over time]