Optimizing Data Partitioning for Data-Parallel Computing

Qifa Ke, Vijayan Prabhakaran, Jingyue Wu, Junfeng Yang, Yinglian Xie, Yuan Yu
Microsoft Research Silicon Valley, Columbia University

Partition Data for Data-Parallel Computing?

    …… // 270 GB input data
    var output = …… input.GroupBy(x => x.UserId)
                         .Select(g => GetStats(g))

• Data partitioning controls the degree of parallelism
• What partition function to choose?
  – Hash partition, range partition, …?
• How many partitions to generate?
  – 100, 1000, 10000, …?
• Data partitioning affects both performance and costs
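To make the choice between partition functions concrete, here is a minimal Python sketch (not DryadLINQ code) of hash and range partitioning over a handful of user IDs. The helper names, the CRC32 hash, the sample keys, and the split points are all assumptions chosen for illustration.

```python
import bisect
import zlib
from collections import Counter


def hash_partition(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32 is an arbitrary choice)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def range_partition(key, boundaries):
    """Map a key to a partition using sorted split points.

    Keys <= boundaries[0] go to partition 0, keys in (boundaries[0], boundaries[1]]
    go to partition 1, and so on, giving len(boundaries) + 1 partitions.
    """
    return bisect.bisect_left(boundaries, key)


def partition_sizes(keys, assign):
    """Count how many records land in each partition under an assignment function."""
    return Counter(assign(k) for k in keys)


# Hypothetical user IDs: user 1 is "hot" and appears four times.
user_ids = [1, 1, 1, 1, 2, 3, 50, 51, 52, 99]

by_hash = partition_sizes(user_ids, lambda k: hash_partition(k, 4))
by_range = partition_sizes(user_ids, lambda k: range_partition(k, [10, 50, 75]))

print("hash :", dict(by_hash))
print("range:", dict(by_range))
```

With these made-up split points the range scheme puts six of the ten records in partition 0, which shows how both the function and its parameters shape the per-partition load.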

Problem 1: Do We Have a Skew?
• Data skew and computation skew

    // process 20 GB images in 100 partitions
    var output = Imgs.Select(x => ProcessImages(x))

• Image processing time depends on both the image and ProcessImages():
  – Number of images
  – Image features that ProcessImages() targets to compute
[Figure: partition size and computation time per partition, shown as fraction of data/computation (0 to 0.14) against partition ID (0 to 100)]
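The distinction between data skew and computation skew can be sketched with a simple ratio: the largest partition's load divided by the mean load. The numbers below are hypothetical, not measurements from the talk, and `skew_ratio` is an illustrative helper rather than a metric the authors define.

```python
def fractions(values):
    """Express each partition's load as a fraction of the total."""
    total = sum(values)
    return [v / total for v in values]


def skew_ratio(values):
    """Largest load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(values) / len(values)
    return max(values) / mean


# Hypothetical example: four partitions with equal bytes but unequal run time.
partition_bytes = [100, 100, 100, 100]
partition_seconds = [10, 10, 10, 70]

print("data skew       :", skew_ratio(partition_bytes))
print("computation skew:", skew_ratio(partition_seconds))
print("time fractions  :", fractions(partition_seconds))
```

Here the data is perfectly balanced while one partition accounts for most of the computation, which is the situation the slide describes.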

Problem 2: What’s Optimal?
• Balanced workload ≠ optimal performance
  – Tradeoff: workload balance vs. cross-node traffic

    // construct a user-user graph for botnet detection
    var records = input1.Apply(x => SelectRecords(x)).HashPartition(x => x.label, nump);
    var output = input1.Apply(records, (x, y) => ConstructGraph(x, y));

Optimal Data Partitioning
Given code and data, can we generate a data partitioning scheme that optimizes performance without running the code on the whole data set?
• Performance and cost metrics
  – Job latency
  – Number of processes
  – Memory consumption
  – Disk and network I/O

Why Not DB Solutions?
• Need to understand both code and data
• Programming model
  – Predefined operators (e.g., select, join) vs. arbitrary user-defined functions (UDF)
• Data model
  – Structured tables vs. unstructured data
  – Static, indexed data vs. dynamic dataset
  – Minimize intermediate disk writes vs. using disk as communication channel

Code Analysis
- Data processing flow
- Computational & I/O complexity
- Relevant data features
• Challenges: user-defined functions (UDF)
  – How data is accessed, processed, and transformed

    IEnumerable<stats> ProcessRecord(IEnumerable<record> users) {
        foreach (var u in users) {
            if (NumRecipients(u) > 10) {
                yield return GetStat(u);
            } else {
                yield return GetSimpleStat(u);
            }
        }
    }

• Number of recipients is a relevant feature
• Different records take different code paths to process

Data Analysis
Statistics of relevant data features
• Challenge: compact data representation
  – Representative samples of input data
  – Data summarizations
  – Approximate histogram
  – Approximate number of distinct keys
• Streaming algorithms in a distributed setting
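One standard streaming technique for drawing a representative sample in a single pass is reservoir sampling (Algorithm R). The sketch below is an illustration of that general technique, not necessarily the sampling method used by the authors; the function name and the fixed seed are assumptions.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Return up to k items chosen uniformly at random from an iterable.

    Uses Algorithm R: keep the first k items, then replace a random slot
    with item i (0-based) with probability k / (i + 1). Memory use is O(k).
    """
    rng = random.Random(seed)  # fixed seed only to keep the demo repeatable
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Hypothetical input: one million record IDs, sampled down to 10.
picked = reservoir_sample(range(1_000_000), 10)
print(picked)
```

Because the sample has a fixed size, each machine could in principle compute its own reservoir over its local data; how such per-machine samples are combined in the distributed setting is not specified in the slides.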

Cost Modeling and Optimization
• Modeling: compare different partitioning schemes
• Estimation: predict the potential cost
  – White-box approach
    • Analytically based on code/data analysis
  – Black-box approach
    • Sampling + regression analysis
• Optimization: search for best partitioning scheme
[Figure: system diagram. Labeled components: Input data, Code, EPG, Data Analysis, Code Analysis, Data Statistics & Samples, Computational & I/O Complexity, Cost Modeling & Estimation, Cost Optimization, Updated EPG, Optimized EPG]
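The black-box idea of "sampling + regression" followed by a search can be sketched as below. This is a toy illustration under stated assumptions: the timings are invented, the cost is assumed linear in partition size, and the latency model (per-partition time plus a fixed scheduling overhead per partition) is hypothetical rather than the authors' cost model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical measurements: run the job on small samples and time it.
sample_sizes_mb = [10, 20, 40, 80]
measured_seconds = [3.0, 5.0, 9.0, 17.0]
slope, intercept = fit_line(sample_sizes_mb, measured_seconds)


def predict_seconds(size_mb):
    """Predicted time to process one partition of the given size."""
    return slope * size_mb + intercept


def estimated_latency(total_mb, num_partitions, overhead_per_partition=0.05):
    """Toy latency model (an assumption): partitions run in parallel, and each
    extra partition adds a fixed scheduling overhead in seconds."""
    per_partition = predict_seconds(total_mb / num_partitions)
    return per_partition + overhead_per_partition * num_partitions


# Search a few candidate partition counts for a hypothetical 4000 MB input.
candidates = [10, 100, 1000]
best = min(candidates, key=lambda n: estimated_latency(4000, n))
print("fitted slope, intercept:", slope, intercept)
print("best partition count   :", best)
```

Under this toy model too few partitions leave each one slow and too many pay mostly overhead, so the search settles on the middle candidate.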

Conclusion
• Prepare your input before you start
  – Data partitioning is critical to performance
• New research opportunities in different fields
  – Programming language analysis
  – Data analysis
  – Optimization
  – Distributed systems
