Optimizing Data Partitioning for Data-Parallel Computing



  1. Optimizing Data Partitioning for Data-Parallel Computing
     Qifa Ke, Vijayan Prabhakaran, Jingyue Wu, Junfeng Yang, Yinglian Xie, Yuan Yu
     Microsoft Research Silicon Valley; Columbia University

  2. Partition Data for Data-Parallel Computing?
         …… // 270 GB input data
         var output = …… input.GroupBy(x => x.UserId)
                                .Select(g => GetStats(g))
     • Data partitioning controls the degree of parallelism
     • What partition function to choose?
       – Hash partition, range partition, …?
     • How many partitions to generate?
       – 100, 1000, 10000, ….?
     Data partitioning affects performance and costs
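The two partition functions named on this slide can be sketched as follows. This is a minimal illustration, not the DryadLINQ implementation: the helper names `hash_partition` and `range_partition`, the CRC32 hash, and the boundary values are all assumptions made for the example.

```python
import zlib
from bisect import bisect_left
from collections import Counter


def hash_partition(key, num_partitions):
    """Assign a key to a partition by hashing it (CRC32 is stable across runs)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def range_partition(key, boundaries):
    """Assign a key to a partition using sorted, inclusive upper boundaries.

    Keys <= boundaries[0] go to partition 0; keys above the last
    boundary go to the final partition, len(boundaries).
    """
    return bisect_left(boundaries, key)


# Hypothetical input: 1000 consecutive user IDs split into 4 partitions.
user_ids = list(range(1000))
hash_sizes = Counter(hash_partition(u, 4) for u in user_ids)
range_sizes = Counter(range_partition(u, [249, 499, 749]) for u in user_ids)

print("hash :", sorted(hash_sizes.items()))
print("range:", sorted(range_sizes.items()))
```

Range partitioning needs boundaries chosen from knowledge of the key distribution, while hash partitioning needs only a partition count; both can still produce uneven partitions when a few keys dominate the data.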

  3. Problem 1: Do We Have a Skew?
     • Data skew and computation skew
         // process 20 GB images in 100 partitions
         var output = Imgs.Select(x => ProcessImages(x));
     [Figure: Partition Size per partition. x-axis: Partition ID (0 to 100); y-axis: Fraction of Data/Computation (0 to 0.14)]

  4. Problem 1: Do We Have a Skew?
     • Data skew and computation skew
         // process 20 GB images in 100 partitions
         var output = Imgs.Select(x => ProcessImages(x));
     • Image processing time depends on both the images and ProcessImages():
       – Number of images
       – Image features that ProcessImages() is targeting to compute
     [Figure: Partition Size and Computation Time per partition. x-axis: Partition ID (0 to 100); y-axis: Fraction of Data/Computation (0 to 0.14)]
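One way to quantify the two kinds of skew is to compare each partition's share of the data with its share of the computation. The sketch below assumes per-partition byte counts and running times have already been measured; the numbers and the `skew_ratio` metric (maximum load over mean load) are illustrative choices, not taken from the talk.

```python
def fractions(values):
    """Return each partition's share of the total."""
    total = sum(values)
    return [v / total for v in values]


def skew_ratio(values):
    """Largest partition load divided by the mean load; 1.0 means perfectly balanced."""
    mean = sum(values) / len(values)
    return max(values) / mean


# Hypothetical measurements for 4 partitions (not from the slides):
# equal data sizes, but one partition holds images that are slow to process.
partition_bytes = [100, 100, 100, 100]
partition_seconds = [10, 10, 10, 70]

print("data fractions   :", fractions(partition_bytes))
print("compute fractions:", fractions(partition_seconds))
print("data skew        :", skew_ratio(partition_bytes))
print("computation skew :", skew_ratio(partition_seconds))
```

In this made-up case the data is perfectly balanced while the computation is not, which is the situation the chart on this slide describes: partition size alone does not predict running time.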

  5. Problem 2: What’s Optimal?
     • Balanced workload ≠ optimal performance
       – Tradeoff: workload vs. cross-node traffic
         // construct a user-user graph for botnet detection
         var records = input1.Apply(x => SelectRecords(x)).HashPartition(x=>x.label, nump);
         var output = input1.Apply(records, (x,y) => ConstructGraph(x,y));
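The tradeoff on this slide can be illustrated with a toy cost model. It assumes, purely for illustration, that compute time divides evenly across partitions while cross-node traffic grows linearly with the partition count (for example, when one dataset must be shipped to every partition). The function names and constants are hypothetical and are not the cost model from the talk.

```python
def estimated_latency(num_partitions, total_work, traffic_per_partition):
    """Toy estimate: evenly split compute plus traffic that grows with partition count."""
    compute = total_work / num_partitions
    traffic = traffic_per_partition * num_partitions
    return compute + traffic


def best_partition_count(candidates, total_work, traffic_per_partition):
    """Pick the candidate partition count with the lowest estimated latency."""
    return min(
        candidates,
        key=lambda n: estimated_latency(n, total_work, traffic_per_partition),
    )


# Hypothetical units: 10,000 units of work, 1 unit of traffic cost per partition.
candidates = [10, 50, 100, 200, 1000]
best = best_partition_count(candidates, total_work=10_000.0, traffic_per_partition=1.0)
print(best)  # prints 100
```

Under these assumptions, adding partitions past 100 keeps the workload balanced but makes the job slower, because the traffic term starts to dominate.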

  6. Optimal Data Partitioning
     Given code and data, can we generate a data partitioning scheme to optimize performance, without running the code on the whole data set?
     • Performance and cost metrics
       – Job latency
       – Number of processes
       – Memory consumption
       – Disk and network I/O

  7. Why Not DB Solutions?
     • Need to understand both code and data
     • Programming model
       – Predefined operators (e.g., select, join) vs. arbitrary user-defined functions (UDF)
     • Data model
       – Structured tables vs. unstructured data
       – Static, indexed data vs. dynamic dataset
       – Minimize intermediate disk writes vs. using disk as communication channel

  8. Code Analysis
     - Data processing flow
     - Computational & I/O complexity
     - Relevant data features
     • Challenges: user-defined functions (UDF)
       – How data is accessed, processed, and transformed
         IEnumerable<stats> ProcessRecord(IEnumerable<record> users) {
             foreach (var u in users) {
                 if (NumRecipients(u) > 10) {
                     yield return GetStat(u);
                 } else {
                     yield return GetSimpleStat(u);
                 }
             }
         }
     • Number of recipients is a relevant feature
     • Different records take different code paths to process

  9. Data Analysis
     Statistics of relevant data features
     • Challenge: compact data representation
       – Representative samples of input data
       – Data summarizations
       – Approximate histogram
       – Approximate number of distinct keys
     • Streaming algorithms in a distributed setting
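As one example of a compact, mergeable summary, the sketch below estimates the number of distinct keys with a K-minimum-values (KMV) sketch. KMV is swapped in here as a well-known streaming technique; the slides do not say which algorithm the authors use. The function names, the MD5-based hash, and the choice of `k` are assumptions for illustration.

```python
import hashlib
import heapq


def _hash01(key):
    """Map a key to a pseudo-uniform float in [0, 1] using the first 8 bytes of MD5."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_sketch(keys, k):
    """Stream over one partition's keys, keeping only the k smallest distinct hashes."""
    kept = set()
    heap = []  # max-heap of kept hashes, stored as negated values
    for key in keys:
        h = _hash01(key)
        if h in kept:
            continue
        if len(heap) < k:
            kept.add(h)
            heapq.heappush(heap, -h)
        elif h < -heap[0]:
            # Replace the current largest kept hash with the smaller new one.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    return sorted(kept)


def merge_sketches(a, b, k):
    """Combine two per-partition sketches into a sketch of their union."""
    return sorted(set(a) | set(b))[:k]


def estimate_distinct(sketch, k):
    """Exact count if fewer than k hashes were seen, otherwise (k - 1) / k-th smallest hash."""
    if len(sketch) < k:
        return len(sketch)
    return (k - 1) / sketch[-1]


# Hypothetical partitions with overlapping keys; the union has 10,000 distinct keys.
k = 1024
partition_a = (f"user{i}" for i in range(0, 6000))
partition_b = (f"user{i}" for i in range(4000, 10000))
merged = merge_sketches(kmv_sketch(partition_a, k), kmv_sketch(partition_b, k), k)
estimate = estimate_distinct(merged, k)
print(round(estimate))
```

Each partition needs only `k` numbers of state, and the per-partition sketches can be merged on any node, which is what makes this style of summary suitable for a distributed setting.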

  10. Cost Modeling and Optimization
     • Modeling: compare different partitioning schemes
     • Estimation: predict the potential cost
       – White-box approach
         • Analytically based on code/data analysis
       – Black-box approach
         • Sampling + regression analysis
     • Optimization: search for best partitioning scheme
     [Figure: workflow diagram. Labels: Input data, Data Analysis, Data Statistics & Samples, Code, EPG, Code Analysis, Computational & IO Complexity, Cost Modeling & Estimation, Cost Optimization, Updated EPG, Optimized EPG]
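The black-box approach (sampling plus regression) can be sketched as follows: time the job on a few small samples, fit a model, and extrapolate to the full input. This minimal version assumes cost is roughly linear in input size and uses ordinary least squares; the sample sizes and timings are made-up numbers, and a real UDF may need a non-linear model.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Extrapolate the fitted line to a new input size."""
    return slope * x + intercept


# Hypothetical timings of the job on samples of increasing size (MB -> seconds).
sample_sizes_mb = [10, 20, 40, 80]
measured_seconds = [2.5, 4.5, 8.5, 16.5]

slope, intercept = fit_line(sample_sizes_mb, measured_seconds)
predicted_seconds = predict(slope, intercept, 1000)  # estimate for a 1000 MB partition
print(slope, intercept, predicted_seconds)
```

Predictions like this one could then feed the optimization step, which compares candidate partitioning schemes by their estimated cost rather than by running the job on the whole data set.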

  11. Conclusion
     • Prepare your input before you start
       – Data partitioning is critical to performance
     • New research opportunities in different fields
       – Programming language analysis
       – Data analysis
       – Optimization
       – Distributed systems
