OPS: Optimized Shuffle Management System for Apache Spark
SLIDE 1

OPS: Optimized Shuffle Management System for Apache Spark

Yuchen Cheng*, Chunghsuan Wu*, Yanqiang Liu*, Rui Ren*, Hong Xu†, Bin Yang‡, Zhengwei Qi*

* Shanghai Jiao Tong University  † City University of Hong Kong  ‡ Intel Corporation

SLIDE 2

Data Processing in Spark
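
As context for this slide, a minimal word-count sketch in Scala (paths and app name are placeholders, not from the talk): reduceByKey splits the job into a map stage and a reduce stage, with a shuffle in between.

```scala
import org.apache.spark.sql.SparkSession

object ShuffleExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("shuffle-example").getOrCreate()
    val sc = spark.sparkContext

    // Map stage: each sub-task partitions its output by key and
    // persists the shuffle files to local disk.
    val pairs = sc.textFile("hdfs:///input/text")   // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))

    // reduceByKey introduces a shuffle dependency: reduce sub-tasks
    // fetch the persisted map output over the network, then aggregate.
    val counts = pairs.reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///output/counts")  // placeholder path
    spark.stop()
  }
}
```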

SLIDE 3

Dependent Shuffle Phase

  • Map phase
    • intensive disk I/O for persisted shuffle data
    • idle network I/O resources
  • Reduce phase
    • network I/O peaks
    • shuffle requests peak, with a significant trough in between
  • Observations
    • the slot-based resource scheduling method does not consider I/O resources
    • the execution logic couples data transmission with computation

SLIDE 4

Multi-Round Sub-Tasks

  • The number of sub-tasks is recommended to be at least twice the total number of CPUs in the cluster (see the configuration sketch after this list)
  • However, the intermediate data of this phase cannot be transmitted in time, except in the last round
  • Stragglers ☠
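
A minimal configuration sketch of the 2× rule of thumb, assuming the 400 total cores (100 nodes × 4 vCPUs) of the testbed described later; values are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object ParallelismConfig {
  def main(args: Array[String]): Unit = {
    // Assumed cluster size: 100 nodes x 4 vCPUs = 400 total cores.
    val totalCores = 400

    val spark = SparkSession.builder
      .appName("multi-round-sub-tasks")
      // At least 2x the core count, so every core runs >= 2 rounds of sub-tasks.
      .config("spark.default.parallelism", (2 * totalCores).toString)
      .config("spark.sql.shuffle.partitions", (2 * totalCores).toString)
      .getOrCreate()

    // ... job body ...
    spark.stop()
  }
}
```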

SLIDE 5

Overhead of Shuffle Phase

  • 512 GB two-stage sorting application
  • 640 to 6400 sub-tasks
  • As the number of sub-tasks increases (worked numbers follow this list),
    • the total execution time of the shuffle phase increases sharply
    • the number of shuffle requests grows quadratically with the number of sub-tasks
    • the amount of data transferred per shuffle request gradually decreases
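
Back-of-the-envelope numbers for that quadratic growth, assuming an all-to-all shuffle where every reduce sub-task fetches one block from every map sub-task (illustrative arithmetic, not measured data):

```scala
// Requests scale as (map sub-tasks) x (reduce sub-tasks); each request
// carries a correspondingly smaller share of the fixed 512 GB of data.
val totalBytes = 512L * 1024 * 1024 * 1024

for (tasks <- Seq(640, 1600, 3200, 6400)) {
  val requests = tasks.toLong * tasks
  val kbPerRequest = totalBytes / requests / 1024
  println(f"$tasks%5d sub-tasks -> $requests%,11d requests, ~$kbPerRequest%,d KB each")
}
```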

SLIDE 6

Optimizations: I/O Requests

  • Sailfish [SoCC ’12]
    • aggregates intermediate data files and uses batch processing
    • requires modification of the file system
  • Riffle [EuroSys ’18]
    • efficiently schedules merge operations
    • converts small, random shuffle I/O requests into much fewer large, sequential I/O requests
    • network I/O remains intensive

SLIDE 7

Optimizations: Shuffle Optimization

  • iShuffle [TPDS, 2017]
    • separates the shuffle phase from the reduce sub-tasks
    • low utilization of I/O resources (e.g., disk and network)
  • SCache [PPoPP ’18]
    • in-memory shuffle management with pre-scheduling
    • lacks support for larger-than-memory datasets

SLIDE 8

Our Goal

  • In-memory shuffle management with support for larger-than-memory datasets
  • Elimination of the synchronization barrier
  • Better utilization of I/O resources
  • Reduction in the number of shuffle requests

SLIDE 9

Proposed Design: OPS

  • Early-merge phase: Steps 1 and 2
  • Early-shuffle phase: Steps 3, 4, and 5
  • Local-fetch phase: Steps 6 and 7
SLIDE 10

Early-Merge

  • 1. The raw output data of the map sub-tasks is transferred directly to OPS
  • 2. Intermediate data is temporarily stored in memory and spilled to the disk of the designated node
  • 3. OPS releases the memory resources once the early-shuffling of the partition page is completed (a buffer sketch follows this list)
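
A hedged sketch of the early-merge path; EarlyMergeBuffer and its methods are hypothetical names for illustration, not OPS's actual code. Map output accumulates in an in-memory page per reduce partition; a full page is sealed and handed off for early-shuffling or spilled to the designated node's disk, after which its memory is freed.

```scala
import java.io.{BufferedOutputStream, FileOutputStream}
import scala.collection.mutable

final class EarlyMergeBuffer(pageSizeBytes: Int,
                             handOff: (Int, Seq[Array[Byte]]) => Unit) {
  private val pages = mutable.Map.empty[Int, mutable.ArrayBuffer[Array[Byte]]]
  private val bytes = mutable.Map.empty[Int, Int].withDefaultValue(0)

  // Step 1: raw map output is appended directly, bypassing local shuffle files.
  def append(partition: Int, record: Array[Byte]): Unit = {
    pages.getOrElseUpdate(partition, mutable.ArrayBuffer.empty) += record
    bytes(partition) += record.length
    if (bytes(partition) >= pageSizeBytes) seal(partition)
  }

  // Step 2: a full partition page is sealed and handed to the transferer
  // (early-shuffle) or spilled to disk; its memory is released (step 3).
  private def seal(partition: Int): Unit = {
    val page = pages.remove(partition).getOrElse(mutable.ArrayBuffer.empty)
    bytes(partition) = 0
    handOff(partition, page.toSeq)
  }
}

// Example spill sink: append the page to a local file for that partition.
def spillToDisk(partition: Int, page: Seq[Array[Byte]]): Unit = {
  val out = new BufferedOutputStream(
    new FileOutputStream(s"/tmp/ops-partition-$partition.page", true))
  try page.foreach(r => out.write(r)) finally out.close()
}
```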

SLIDE 11

Early-Shuffle

  • The transferer acts as a consumer, reading partition pages from the different partition queues in turn and transmitting them
    • until all corresponding partition queues are empty
  • The threshold can be set according to the bandwidth and memory size of the cluster (a consumer-loop sketch follows this list)
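
A hedged sketch of the transferer's consumer loop (function and parameter names are hypothetical): pages are drained round-robin from the per-partition queues and sent until the producers finish and every queue is empty.

```scala
import java.util.concurrent.BlockingQueue

// Drain the per-partition page queues round-robin, as a consumer.
def transfererLoop(queues: IndexedSeq[BlockingQueue[Array[Byte]]],
                   send: (Int, Array[Byte]) => Unit,
                   mapsStillRunning: () => Boolean): Unit = {
  while (mapsStillRunning() || queues.exists(!_.isEmpty)) {
    var sentAny = false
    for ((queue, partition) <- queues.zipWithIndex) {
      val page = queue.poll()          // non-blocking; skip empty queues
      if (page != null) { send(partition, page); sentAny = true }
    }
    if (!sentAny) Thread.sleep(1)      // back off briefly when idle
  }
}
```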

SLIDE 12

Early-Schedule

  • The execution of OPS's early-shuffle strategy depends on the scheduling results of the reduce sub-tasks
  • OPS is designed to trigger early-schedule in two cases (a trigger sketch follows this list):
    • when the first early-shuffle is triggered
    • when the number of completed map sub-tasks reaches the set threshold (5% by default)
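
A hedged sketch of the two trigger conditions (class name hypothetical; the 5% default comes from this slide):

```scala
final class EarlyScheduleTrigger(totalMapTasks: Int, threshold: Double = 0.05) {
  private var completedMaps = 0
  private var firstEarlyShuffleSeen = false

  def onMapTaskCompleted(): Unit = completedMaps += 1
  def onEarlyShuffleTriggered(): Unit = firstEarlyShuffleSeen = true

  // Schedule reduce sub-tasks as soon as either condition holds:
  // the first early-shuffle fired, or >= 5% of map sub-tasks finished.
  def shouldSchedule: Boolean =
    firstEarlyShuffleSeen ||
      completedMaps >= math.ceil(threshold * totalMapTasks).toInt
}
```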

SLIDE 13

Testbed

  • 100 t3.xlarge EC2 nodes, each with a 4-core CPU and 16 GB of memory
  • Hadoop YARN v2.8.5 and Spark v2.4.3
  • 10 GB of memory is allocated for early-merging


Metric              Value
CPU                 3.1 GHz Intel Xeon Platinum 8000 series (Skylake-SP or Cascade Lake)
vCPU                4
Memory              16 GB
Storage             AWS EBS SSD (gp2), 256 GB
Storage IOPS        750
Storage Bandwidth   250 Mbps
Network Bandwidth   4.8 Gbps
OS                  Amazon Linux 2

SLIDE 14

Workload

  • Sort application with 1.6 TB of random text


#   Input Splits   Partition Numbers   Rounds   Data / Task
1   1600           1600                4        1000 MB
2   2400           2400                6        670 MB
3   3200           3200                8        500 MB
4   4000           4000                10       400 MB
5   4800           4800                12       330 MB
6   5600           5600                14       290 MB
7   6400           6400                16       250 MB
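
A quick sanity check of the table, assuming the testbed's 100 nodes × 4 vCPUs = 400 total cores: rounds = sub-tasks / cores and data per task = 1.6 TB / sub-tasks.

```scala
val totalCores = 400            // 100 nodes x 4 vCPUs (testbed slide)
val inputMB = 1600.0 * 1024     // 1.6 TB of random text

for (tasks <- Seq(1600, 2400, 3200, 4000, 4800, 5600, 6400)) {
  val rounds = tasks / totalCores
  val mbPerTask = inputMB / tasks
  println(f"$tasks%5d sub-tasks: $rounds%2d rounds, ~$mbPerTask%.0f MB per task")
}
```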

SLIDE 15

I/O Throughput

  • OPS reduces the total execution time by about 41% and the execution time of the reduce phase by about 50%
  • Higher utilization of network I/O in the map phase
  • Higher utilization of disk I/O in the reduce phase

[Figure: disk and network I/O throughput over time for Spark, Spark + SCache, and Spark + OPS; annotations mark sequential disk I/O, random disk I/O at reduce start, and network I/O bursts]

SLIDE 16

Completion Time

  • OPS reduces the total completion time by 17%-41%
  • The completion time of the map phase is also steadily reduced

[Figure: completion time of the reduce phase and in total]

SLIDE 17

HiBench

  • OPS outperforms the baselines on shuffle-intensive workloads
    • e.g., Sort and TeraSort

SLIDE 18

Summary

  • Early-merge intermediate data to mitigate intensive disk I/O
  • Early-schedule reduce sub-tasks based on partition pages
  • Early-shuffle intermediate data stored in shared memory
  • Reduce shuffle overhead by nearly 50%

SLIDE 19

Thanks for your attention.

Yuchen Cheng
Shanghai Jiao Tong University
rudeigerc@sjtu.edu.cn