SLIDE 1

Riffle: Optimized Shuffle Service for Large-Scale Data Analytics

Haoyu Zhang, Brian Cho, Ergin Seyfe, Avery Ching, Michael J. Freedman

Princeton University and Facebook

SLIDE 2

Batch analytics systems are widely used

  • Large-scale SQL queries
  • Custom batch jobs
  • Pre-/Post-processing for ML

At Facebook:

  • 10s of PB of new data are generated every day for batch processing
  • 100s of TB of data are added to be processed by a single job
SLIDE 3

Batch analytics jobs: logical graph

[Diagram: a logical graph of operators such as map, filter, join, and groupBy; map/filter edges are narrow dependencies, join/groupBy edges are wide dependencies]
SLIDE 4

Batch analytics jobs: DAG execution plan

[Diagram: execution DAG split into Stage 1 and Stage 2, connected by a shuffle]

  • Shuffle: all-to-all communication between stages
  • >10x larger than available memory, strong fault tolerance requirements

→ on-disk shuffle files
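The all-to-all pattern can be sketched in plain Python, with in-memory lists standing in for the on-disk shuffle files (the function and variable names are illustrative, not Spark's actual API):

```python
def default_partition(key):
    # Deterministic toy partitioner (illustrative only).
    return ord(str(key)[0])

def map_stage(records, num_reducers):
    """A map task writes one shuffle 'file' with one block per reducer."""
    blocks = [[] for _ in range(num_reducers)]
    for key, value in records:
        blocks[default_partition(key) % num_reducers].append((key, value))
    return blocks

def reduce_fetch(shuffle_files, reducer_id):
    """Reducer r fetches block r from every map output:
    M map tasks x R reducers = M*R fetch requests in total."""
    fetched = []
    for map_output in shuffle_files:
        fetched.extend(map_output[reducer_id])
    return fetched

# Two map tasks, three reducers -> 2 * 3 = 6 block fetches overall.
inputs = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
files = [map_stage(recs, 3) for recs in inputs]
partition_a = reduce_fetch(files, default_partition("a") % 3)
```

Because the intermediate data is much larger than memory and must survive failures, the blocks live on disk rather than in RAM, which is what makes the fetch pattern in the next slides expensive.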

SLIDE 5

The case for tiny tasks

  • Benefits of slicing jobs into small tasks
  • Improve parallelism [Tinytasks HotOS 13] [Subsampling IC2E 14] [Monotask SOSP 17]
  • Improve load balancing [Sparrow SOSP 13]
  • Reduce straggler effect [Dolly NSDI 13] [SparkPerf NSDI 15]

SLIDE 6

The case against tiny tasks

  • Engineering experience often argues against running too many tasks
  • Medium scale → very large scale (10x larger than memory space)
  • Single-stage jobs → multi-stage jobs (> 50%)

“Although we were able to run the Spark job with such a high number of tasks, we found that there is significant performance degradation when the number of tasks is too high.”

[*] Apache Spark @Scale: A 60 TB+ Production Use Case. https://tinyurl.com/yadx29gl

SLIDE 7

Shuffle I/O grows quadratically with data

[Figures: shuffle time (sec), I/O request count (×10⁶), and shuffle fetch size (KB) vs. number of tasks (5,000 to 10,000)]

  • Large amount of fragmented I/O requests
  • Adversarial workload for hard drives!

SLIDE 8

Strawman: tune number of tasks in a job

  • Tasks spill intermediate data to disk if data splits exceed memory capacity
  • Larger task execution reduces shuffle I/O, but increases spill I/O
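The tradeoff can be seen with a toy I/O cost model (entirely assumed for illustration: each shuffle fetch costs one disk request, and any data beyond a task's memory is spilled to disk and read back in 1 MiB chunks):

```python
def strawman_io_requests(total_bytes, tasks, mem_per_task, io_size=2**20):
    """Toy model: total disk requests = M*R shuffle fetches plus spill
    write/read requests when a task's slice exceeds its memory."""
    shuffle_reqs = tasks * tasks                       # M = R = tasks
    per_task = total_bytes / tasks
    spilled = max(0.0, per_task - mem_per_task) * tasks
    spill_reqs = 2 * spilled / io_size                 # write, then read back
    return shuffle_reqs + spill_reqs

# 1 TiB job, 1 GiB of memory per task: too few tasks spill heavily,
# too many tasks fragment the shuffle; a middle point minimizes I/O.
cost = {t: strawman_io_requests(2**40, t, 2**30) for t in (100, 1000, 10000)}
```

In this model the optimum sits between the extremes, which is exactly why the task count must be retuned whenever the input volume changes.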

SLIDE 9

Strawman: tune number of tasks in a job

  • Need to retune when input data volume changes for each individual job
  • Bulky tasks can be detrimental [Dolly NSDI 13] [SparkPerf NSDI 15] [Monotask SOSP 17]
  • straggler problems, imbalanced workload, garbage collection overhead

[Figure: shuffle and spill time (sec) vs. number of map tasks, from 300 to 10,000]
SLIDE 10

[Diagram: small tasks cause a large amount of fragmented shuffle I/O; bulky tasks yield fewer, sequential shuffle I/O]
SLIDE 11

Riffle: optimized shuffle service

  • Riffle shuffle service: a long running instance on each physical node
  • Riffle scheduler: keeps track of shuffle files and issues merge requests

[Architecture diagram: each worker machine runs executors, a file system, and a long-running Riffle shuffle service; the driver hosts the job/task scheduler and the Riffle merge scheduler, which assigns tasks, receives task and merge statuses, and sends merge requests]
SLIDE 12

Riffle: optimized shuffle service

  • When receiving a merge request:
  • 1. Combines small shuffle files into larger ones
  • 2. Keeps the original file layout
  • Reducers fetch fewer, large blocks instead of many, small blocks

[Diagram: the application driver's merge scheduler sends merge requests to worker-side mergers, which combine map outputs before reducers fetch them]
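The merge step described above can be sketched as follows (a simplified in-memory stand-in for on-disk shuffle files, not Riffle's actual code): concatenate the partition-r blocks of N small shuffle files into one large block per partition, preserving the block-per-reducer layout.

```python
def merge_shuffle_files(files):
    """files: list of shuffle files, each a list of R per-reducer blocks.
    Returns one merged file with the same R-block layout."""
    num_partitions = len(files[0])
    merged = []
    for r in range(num_partitions):
        block = []
        for f in files:
            block.extend(f[r])        # append map outputs in order
        merged.append(block)
    return merged

# Three map files, two reducers: each reducer now issues 1 fetch, not 3.
small = [[["m0r0"], ["m0r1"]],
         [["m1r0"], ["m1r1"]],
         [["m2r0"], ["m2r1"]]]
big = merge_shuffle_files(small)
```

Keeping the layout is what lets reducers stay oblivious: they fetch block r as before, just from one merged file instead of many small ones.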

SLIDE 13

[Figure: map and reduce stage time (sec) under No Merge and N-way merge, N = 5, 10, 20, 40]

Results with merge operations on synthetic workload

  • Riffle reduces number of fetch requests by 10x
  • Reduce stage -393s, map stage +169s → job completes 35% faster

[Figures: read block size (KB) and number of read requests under No Merge and N-way merge, N = 5, 10, 20, 40]
SLIDE 14

[Figures: read block size (KB), number of reads, and map/reduce stage time (sec) under No Merge and N-way merge]

Best-effort merge: mixing merged and unmerged files

  • Reduce stage -393s, map stage +52s → job completes 53% faster
  • Riffle finishes the job with only ~50% of the cluster resources!

[Figure: map and reduce stage time (sec) with best-effort merge (95%), compared against No Merge and N-way merge]
SLIDE 15

Additional enhancements

  • Handling merge operation failures
  • Efficient memory management
  • Balancing merge requests across the cluster

[Diagrams: merging combines per-reducer block sequences using buffered reads and writes; k job drivers send merge requests to mergers across the cluster]
SLIDE 16

Experiment setup

  • Testbed: Spark on a 100-node cluster
  • 56 CPU cores, 256GB RAM, 10Gbps Ethernet links
  • Each node runs 14 executors, each with 4 cores, 14GB RAM
  • Workload: 4 representative production jobs at Facebook

    Data      Map     Reduce   Block   Description
1   167.6 GB  915     200      983 K   ad
2   1.15 TB   7,040   1,438    120 K   measurement
3   2.7 TB    8,064   2,500    147 K   measurement
4   267 TB    36,145  20,011   360 K   ad
SLIDE 17

Reduction in shuffle I/O requests

  • Riffle reduces # of I/O requests by 5--10x for medium / large scale jobs

[Figure: shuffle I/O requests (×10⁶) for Jobs 1-4 under No Merge, 512 KB block-size threshold, and N-way merge (N = 10, 20, 40)]
SLIDE 18

Savings in end-to-end job completion time

  • Map stage time is almost unaffected (with best-effort merge)
  • Reduces job completion time by 20--40% for medium / large jobs

[Figure: total task execution time (CPU days) for Jobs 1-4 under No Merge, 512 KB block-size threshold, and N-way merge (N = 10, 20, 40)]
SLIDE 19

Conclusion

  • Shuffle I/O becomes a scaling bottleneck for multi-stage jobs
  • Riffle efficiently schedules merge operations and mitigates merge stragglers
  • Riffle is deployed for Facebook’s production jobs processing PBs of data

SLIDE 20

Thanks!

Haoyu Zhang haoyuz@cs.princeton.edu http://www.haoyuzhang.org

SLIDE 21

Riffle merge policies

[Diagram: two merge policies, N-way merge (combine a fixed number N of shuffle files, each with blocks 1..R) and block-size merge (merge files until the total average block size exceeds the merge threshold)]
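The two policies can be sketched as grouping rules over map outputs (function names and structure are mine for illustration, not Riffle's implementation):

```python
def group_n_way(file_sizes, n):
    """N-way merge: group files into fixed batches of n for merging."""
    return [file_sizes[i:i + n] for i in range(0, len(file_sizes), n)]

def group_by_block_size(avg_block_sizes, threshold):
    """Block-size policy: keep adding files to a merge group until the
    accumulated average block size exceeds the threshold."""
    groups, current, total = [], [], 0
    for size in avg_block_sizes:
        current.append(size)
        total += size
        if total > threshold:          # merged block is now large enough
            groups.append(current)
            current, total = [], 0
    if current:
        groups.append(current)
    return groups

batches = group_n_way([64] * 10, 4)                    # groups of 4, 4, 2
sized = group_by_block_size([100, 200, 300, 250], 256)
```

The block-size policy adapts to skew: files with already-large blocks merge in small groups, while many tiny files merge together.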

SLIDE 22

Best-effort merge

  • Observation: slowdown in the map stage is mostly due to stragglers
  • Best-effort merge: mixing merged and unmerged shuffle files
  • When the number of finished merge requests exceeds a user-specified percentage threshold, stop waiting for more merge results
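A minimal sketch of the policy stated above (the threshold default and names are illustrative): once the fraction of finished merge requests reaches the user-specified percentage, the reduce stage starts, reading merged files where available and falling back to the original small shuffle files elsewhere.

```python
def ready_to_reduce(finished_merges, total_merges, percent_threshold=95):
    """Stop waiting once enough merge requests have completed."""
    return finished_merges * 100 >= total_merges * percent_threshold

def choose_inputs(merge_status, merged_files, unmerged_files):
    """merge_status[i] is True if merge request i completed; mix merged
    and unmerged shuffle files accordingly."""
    return [merged_files[i] if done else unmerged_files[i]
            for i, done in enumerate(merge_status)]

# 19 of 20 merges done -> 95% threshold met; read 19 merged + 1 unmerged.
status = [True] * 19 + [False]
inputs = choose_inputs(status,
                       [f"merged-{i}" for i in range(20)],
                       [f"small-{i}" for i in range(20)])
```

This trades a small amount of extra fragmented I/O for not blocking the whole job on a few straggling merge operations.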

[Diagram: merger threads over time, illustrating a straggling merge operation]