SLIDE 1

Riffle: Optimized Shuffle Service for Large-Scale Data Analytics

Haoyu Zhang, Brian Cho, Ergin Seyfe, Avery Ching, Michael J. Freedman

Princeton University and Facebook

SLIDE 2

Batch analytics systems are widely used

  • Large-scale SQL queries
  • Custom batch jobs
  • Pre-/Post-processing for ML

At Facebook:

  • 10s of PB of new data are generated every day for batch processing
  • 100s of TB of data are added to be processed by a single job
SLIDE 3

Batch analytics jobs: logical graph

[Diagram: a logical graph of operators such as map, filter, join, and groupBy; map/filter edges are narrow dependencies, join/groupBy edges are wide dependencies]
SLIDE 4

Batch analytics jobs: DAG execution plan

[Diagram: execution DAG split into Stage 1 and Stage 2, connected by a shuffle]

  • Shuffle: all-to-all communication between stages
  • >10x larger than available memory, strong fault tolerance requirements

→ on-disk shuffle files
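The all-to-all pattern can be sketched in plain Python, with in-memory lists standing in for the on-disk shuffle files (the function and variable names are illustrative, not Spark's actual API):

```python
def default_partition(key):
    # Deterministic toy partitioner (illustrative only).
    return ord(str(key)[0])

def map_stage(records, num_reducers):
    """A map task writes one shuffle 'file' with one block per reducer."""
    blocks = [[] for _ in range(num_reducers)]
    for key, value in records:
        blocks[default_partition(key) % num_reducers].append((key, value))
    return blocks

def reduce_fetch(shuffle_files, reducer_id):
    """Reducer r fetches block r from every map output:
    M map tasks x R reducers = M*R fetch requests in total."""
    fetched = []
    for map_output in shuffle_files:
        fetched.extend(map_output[reducer_id])
    return fetched

# Two map tasks, three reducers -> 2 * 3 = 6 block fetches overall.
inputs = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
files = [map_stage(recs, 3) for recs in inputs]
partition_a = reduce_fetch(files, default_partition("a") % 3)
```

Because the intermediate data is much larger than memory and must survive failures, the blocks live on disk rather than in RAM, which is what makes the fetch pattern in the next slides expensive.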

SLIDE 5

The case for tiny tasks

  • Benefits of slicing jobs into small tasks
  • Improve parallelism [Tinytasks HotOS 13] [Subsampling IC2E 14] [Monotask SOSP 17]
  • Improve load balancing [Sparrow SOSP 13]
  • Reduce straggler effect [Dolly NSDI 13] [SparkPerf NSDI 15]

SLIDE 6

The case against tiny tasks

  • Engineering experience often argues against running too many tasks
  • Medium scale → very large scale (10x larger than memory space)
  • Single-stage jobs → multi-stage jobs (> 50%)

“Although we were able to run the Spark job with such a high number of tasks, we found that there is significant performance degradation when the number of tasks is too high.”

[*] Apache Spark @Scale: A 60 TB+ Production Use Case. https://tinyurl.com/yadx29gl

SLIDE 7

Shuffle I/O grows quadratically with data

[Figures: shuffle time (sec), I/O request count (×10⁶), and shuffle fetch size (KB) vs. number of tasks (5,000 to 10,000)]

  • Large amount of fragmented I/O requests
  • Adversarial workload for hard drives!

SLIDE 8

Strawman: tune number of tasks in a job

  • Tasks spill intermediate data to disk if data splits exceed memory capacity
  • Larger task execution reduces shuffle I/O, but increases spill I/O
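The tradeoff can be seen with a toy I/O cost model (entirely assumed for illustration: each shuffle fetch costs one disk request, and any data beyond a task's memory is spilled to disk and read back in 1 MiB chunks):

```python
def strawman_io_requests(total_bytes, tasks, mem_per_task, io_size=2**20):
    """Toy model: total disk requests = M*R shuffle fetches plus spill
    write/read requests when a task's slice exceeds its memory."""
    shuffle_reqs = tasks * tasks                       # M = R = tasks
    per_task = total_bytes / tasks
    spilled = max(0.0, per_task - mem_per_task) * tasks
    spill_reqs = 2 * spilled / io_size                 # write, then read back
    return shuffle_reqs + spill_reqs

# 1 TiB job, 1 GiB of memory per task: too few tasks spill heavily,
# too many tasks fragment the shuffle; a middle point minimizes I/O.
cost = {t: strawman_io_requests(2**40, t, 2**30) for t in (100, 1000, 10000)}
```

In this model the optimum sits between the extremes, which is exactly why the task count must be retuned whenever the input volume changes.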

SLIDE 9

Strawman: tune number of tasks in a job

  • Need to retune when input data volume changes for each individual job
  • Bulky tasks can be detrimental [Dolly NSDI 13] [SparkPerf NSDI 15] [Monotask SOSP 17]
  • straggler problems, imbalanced workload, garbage collection overhead

[Figure: shuffle and spill time (sec) vs. number of map tasks, from 300 to 10,000]
SLIDE 10

[Diagram: small tasks cause a large amount of fragmented shuffle I/O; bulky tasks yield fewer, sequential shuffle I/O]
SLIDE 11

Riffle: optimized shuffle service

  • Riffle shuffle service: a long running instance on each physical node
  • Riffle scheduler: keeps track of shuffle files and issues merge requests

[Architecture diagram: each worker machine runs executors, a file system, and a long-running Riffle shuffle service; the driver hosts the job/task scheduler and the Riffle merge scheduler, which assigns tasks, receives task and merge statuses, and sends merge requests]
SLIDE 12

Riffle: optimized shuffle service

  • When receiving a merge request:
  • 1. Combines small shuffle files into larger ones
  • 2. Keeps the original file layout
  • Reducers fetch fewer, large blocks instead of many, small blocks

[Diagram: the application driver's merge scheduler sends merge requests to worker-side mergers, which combine map outputs before reducers fetch them]
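The merge step described above can be sketched as follows (a simplified in-memory stand-in for on-disk shuffle files, not Riffle's actual code): concatenate the partition-r blocks of N small shuffle files into one large block per partition, preserving the block-per-reducer layout.

```python
def merge_shuffle_files(files):
    """files: list of shuffle files, each a list of R per-reducer blocks.
    Returns one merged file with the same R-block layout."""
    num_partitions = len(files[0])
    merged = []
    for r in range(num_partitions):
        block = []
        for f in files:
            block.extend(f[r])        # append map outputs in order
        merged.append(block)
    return merged

# Three map files, two reducers: each reducer now issues 1 fetch, not 3.
small = [[["m0r0"], ["m0r1"]],
         [["m1r0"], ["m1r1"]],
         [["m2r0"], ["m2r1"]]]
big = merge_shuffle_files(small)
```

Keeping the layout is what lets reducers stay oblivious: they fetch block r as before, just from one merged file instead of many small ones.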

SLIDE 13

[Figure: map and reduce stage time (sec) under No Merge and N-way merge, N = 5, 10, 20, 40]

Results with merge operations on synthetic workload

  • Riffle reduces number of fetch requests by 10x
  • Reduce stage -393s, map stage +169s → job completes 35% faster

[Figures: read block size (KB) and number of read requests under No Merge and N-way merge, N = 5, 10, 20, 40]
SLIDE 14

[Figures: read block size (KB), number of reads, and map/reduce stage time (sec) under No Merge and N-way merge]

Best-effort merge: mixing merged and unmerged files

  • Reduce stage -393s, map stage +52s → job completes 53% faster
  • Riffle finishes the job with only ~50% of the cluster resources!

[Figure: map and reduce stage time (sec) with best-effort merge (95%), compared against No Merge and N-way merge]
SLIDE 15

Additional enhancements

  • Handling merge operation failures
  • Efficient memory management
  • Balancing merge requests across the cluster

[Diagrams: merging combines per-reducer block sequences using buffered reads and writes; k job drivers send merge requests to mergers across the cluster]
SLIDE 16

Experiment setup

  • Testbed: Spark on a 100-node cluster
  • 56 CPU cores, 256GB RAM, 10Gbps Ethernet links
  • Each node runs 14 executors, each with 4 cores, 14GB RAM
  • Workload: 4 representative production jobs at Facebook

    Data      Map     Reduce   Block   Description
1   167.6 GB  915     200      983 K   ad
2   1.15 TB   7,040   1,438    120 K   measurement
3   2.7 TB    8,064   2,500    147 K   measurement
4   267 TB    36,145  20,011   360 K   ad
SLIDE 17

Reduction in shuffle I/O requests

  • Riffle reduces # of I/O requests by 5--10x for medium / large scale jobs

[Figure: shuffle I/O requests (×10⁶) for Jobs 1-4 under No Merge, 512 KB block-size threshold, and N-way merge (N = 10, 20, 40)]
SLIDE 18

Savings in end-to-end job completion time

  • Map stage time is almost unaffected (with best-effort merge)
  • Reduces job completion time by 20--40% for medium / large jobs

[Figure: total task execution time (CPU days) for Jobs 1-4 under No Merge, 512 KB block-size threshold, and N-way merge (N = 10, 20, 40)]
SLIDE 19

Conclusion

  • Shuffle I/O becomes a scaling bottleneck for multi-stage jobs
  • Riffle efficiently schedules merge operations and mitigates merge stragglers
  • Riffle is deployed for Facebook’s production jobs processing PBs of data

SLIDE 20

Thanks!

Haoyu Zhang haoyuz@cs.princeton.edu http://www.haoyuzhang.org

SLIDE 21

Riffle merge policies

[Diagram: two merge policies, N-way merge (combine a fixed number N of shuffle files, each with blocks 1..R) and block-size merge (merge files until the total average block size exceeds the merge threshold)]
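The two policies can be sketched as grouping rules over map outputs (function names and structure are mine for illustration, not Riffle's implementation):

```python
def group_n_way(file_sizes, n):
    """N-way merge: group files into fixed batches of n for merging."""
    return [file_sizes[i:i + n] for i in range(0, len(file_sizes), n)]

def group_by_block_size(avg_block_sizes, threshold):
    """Block-size policy: keep adding files to a merge group until the
    accumulated average block size exceeds the threshold."""
    groups, current, total = [], [], 0
    for size in avg_block_sizes:
        current.append(size)
        total += size
        if total > threshold:          # merged block is now large enough
            groups.append(current)
            current, total = [], 0
    if current:
        groups.append(current)
    return groups

batches = group_n_way([64] * 10, 4)                    # groups of 4, 4, 2
sized = group_by_block_size([100, 200, 300, 250], 256)
```

The block-size policy adapts to skew: files with already-large blocks merge in small groups, while many tiny files merge together.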

SLIDE 22

Best-effort merge

  • Observation: slowdown in the map stage is mostly due to stragglers
  • Best-effort merge: mixing merged and unmerged shuffle files
  • When the number of finished merge requests exceeds a user-specified percentage threshold, stop waiting for more merge results
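A minimal sketch of the policy stated above (the threshold default and names are illustrative): once the fraction of finished merge requests reaches the user-specified percentage, the reduce stage starts, reading merged files where available and falling back to the original small shuffle files elsewhere.

```python
def ready_to_reduce(finished_merges, total_merges, percent_threshold=95):
    """Stop waiting once enough merge requests have completed."""
    return finished_merges * 100 >= total_merges * percent_threshold

def choose_inputs(merge_status, merged_files, unmerged_files):
    """merge_status[i] is True if merge request i completed; mix merged
    and unmerged shuffle files accordingly."""
    return [merged_files[i] if done else unmerged_files[i]
            for i, done in enumerate(merge_status)]

# 19 of 20 merges done -> 95% threshold met; read 19 merged + 1 unmerged.
status = [True] * 19 + [False]
inputs = choose_inputs(status,
                       [f"merged-{i}" for i in range(20)],
                       [f"small-{i}" for i in range(20)])
```

This trades a small amount of extra fragmented I/O for not blocking the whole job on a few straggling merge operations.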

[Diagram: merger threads over time, illustrating a straggling merge operation]