Optimizing Data Partitioning for Data-Parallel Computing



SLIDE 1

Optimizing Data Partitioning for Data-Parallel Computing

Qifa Ke, Vijayan Prabhakaran, Yinglian Xie, Yuan Yu (Microsoft Research Silicon Valley)
Jingyue Wu, Junfeng Yang (Columbia University)

SLIDE 2

Partition Data for Data-Parallel Computing


  • Data partitioning controls the degree of parallelism
  • What partition function to choose?

– Hash partition, range partition, …?

  • How many partitions to generate?

– 100, 1000, 10000, …?

Data partitioning performance and costs

// 270 GB input data
var output = input.GroupBy(x => x.UserId)
                  .Select(g => GetStats(g));
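To make the two partition functions named above concrete, here is a minimal sketch of how each maps a key to a partition index. The helpers `HashPartitionIndex` and `RangePartitionIndex`, the sample user IDs, and the range bounds are all hypothetical illustrations; they are not the DryadLINQ operators and do not come from the slides.

```csharp
using System;
using System.Linq;

// Hypothetical helper: hash partitioning spreads keys by hash value.
static int HashPartitionIndex(int key, int numPartitions)
{
    // Mask the sign bit so the modulo result is never negative.
    return (key.GetHashCode() & 0x7FFFFFFF) % numPartitions;
}

// Hypothetical helper: range partitioning keeps nearby keys together.
static int RangePartitionIndex(int key, int[] upperBounds)
{
    // upperBounds must be sorted; partition i holds keys <= upperBounds[i],
    // and one extra partition catches everything above the final bound.
    int i = Array.FindIndex(upperBounds, bound => key <= bound);
    return i >= 0 ? i : upperBounds.Length;
}

// Made-up user IDs; note the repeated key 42.
int[] userIds = { 3, 17, 42, 42, 42, 99, 250, 1001 };
int[] bounds = { 10, 100, 500 };

var byHash = userIds.GroupBy(id => HashPartitionIndex(id, 4)).OrderBy(p => p.Key);
var byRange = userIds.GroupBy(id => RangePartitionIndex(id, bounds)).OrderBy(p => p.Key);

foreach (var p in byHash)
    Console.WriteLine($"hash  partition {p.Key}: {string.Join(",", p)}");
foreach (var p in byRange)
    Console.WriteLine($"range partition {p.Key}: {string.Join(",", p)}");
```

Either function sends every record with the same key to the same partition, so a very frequent key still lands in a single partition regardless of how many partitions are requested.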

SLIDE 3

Problem 1: Do We Have a Skew?

  • Data skew and computation skew

[Figure: Partition Size per partition. x-axis: Partition ID (10 to 100); y-axis: Fraction of Data/Computation (0.02 to 0.14).]

// process 20 GB images in 100 partitions
var output = Imgs.Select(x => ProcessImages(x));

SLIDE 4

Problem 1: Do We Have a Skew?

  • Image processing time depends on both the images and ProcessImages():

– Number of images
– Image features that ProcessImages() is trying to compute

[Figure: Partition Size and Computation Time per partition. x-axis: Partition ID (10 to 100); y-axis: Fraction of Data/Computation (0.02 to 0.14).]

  • Data skew and computation skew

// process 20 GB images in 100 partitions
var output = Imgs.Select(x => ProcessImages(x));
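One simple way to check for both kinds of skew is to compare the most loaded partition with the average. The sketch below assumes that per-partition sizes and run times have already been collected; the `SkewFactor` metric and all of the numbers are illustrative assumptions, not measurements from the talk.

```csharp
using System;
using System.Linq;

// Skew factor: load of the largest partition divided by the mean load.
// 1.0 means perfectly balanced. The name and metric are illustrative choices.
static double SkewFactor(double[] load) => load.Max() / load.Average();

// Made-up numbers for four partitions.
double[] sizeGb = { 5.0, 5.0, 5.0, 5.0 };           // bytes are evenly spread
double[] runSeconds = { 40.0, 40.0, 40.0, 280.0 };  // but one partition is slow

Console.WriteLine($"data skew:        {SkewFactor(sizeGb):F2}");
Console.WriteLine($"computation skew: {SkewFactor(runSeconds):F2}");
```

With these inputs the data is perfectly balanced while the computation is not, which is the situation the slide describes: equal partition sizes do not guarantee equal processing time.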

SLIDE 5

Problem 2: What’s Optimal?

  • Balanced workload ≠ optimal performance

– Tradeoff: workload vs. cross-node traffic

// construct a user-user graph for botnet detection

var records = input1.Apply(x => SelectRecords(x))
                    .HashPartition(x => x.label, nump);
var output = input1.Apply(records, (x, y) => ConstructGraph(x, y));
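The tradeoff between workload and cross-node traffic can be illustrated with a toy cost model. Everything in the sketch below is an assumption made for illustration: the formula, the parameter values, and the helper names `EstimatedLatency` and `BestPartitionCount` do not come from the slides.

```csharp
using System;
using System.Linq;

// Toy cost model (an assumption, not from the slides): per-partition compute
// time shrinks as work is split across more partitions, while cross-node
// traffic grows with the number of partitions that must exchange data.
static double EstimatedLatency(int nump, double computeSeconds, double trafficSecondsPerPartition)
    => computeSeconds / nump + trafficSecondsPerPartition * nump;

// Pick the candidate partition count with the lowest estimated latency.
static int BestPartitionCount(int[] candidates, double computeSeconds, double trafficSecondsPerPartition)
    => candidates.OrderBy(n => EstimatedLatency(n, computeSeconds, trafficSecondsPerPartition)).First();

int[] options = { 10, 100, 1000, 10000 };
int best = BestPartitionCount(options, computeSeconds: 36000, trafficSecondsPerPartition: 0.5);
Console.WriteLine($"best nump among candidates: {best}");
```

Under this model the finest partitioning balances the work best but is not the cheapest, because the traffic term dominates once the partition count grows large.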

SLIDE 6

Optimal Data Partitioning

  • Performance and cost metrics

– Job latency
– Number of processes
– Memory consumption
– Disk and network I/O

Given code and data, can we generate a data partitioning scheme that optimizes performance, without running the code on the whole data set?

SLIDE 7

Why Not DB Solutions?

  • Need to understand both code and data
  • Programming model

– Predefined operators (e.g., select, join) vs. arbitrary user-defined functions (UDF)

  • Data model

– Structured tables vs. unstructured data
– Static, indexed data vs. dynamic dataset
– Minimize intermediate disk writes vs. using disk as communication channel

SLIDE 8

Code Analysis

  • Challenges: user-defined functions (UDFs)

– How data is accessed, processed, and transformed

  • Data processing flow
  • Computational & I/O complexity
  • Relevant data features

IEnumerable<stats> ProcessRecord(IEnumerable<record> users) {
    foreach (var u in users) {
        // The code path depends on the number of recipients.
        if (NumRecipients(u) > 10) {
            yield return GetStat(u);
        } else {
            yield return GetSimpleStat(u);
        }
    }
}

  • Number of recipients is a relevant feature
  • Different records take different code paths to process
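Once code analysis has identified the relevant feature, it can be used to estimate per-partition computation without running the UDF itself. The sketch below mirrors the branch in `ProcessRecord`; the relative costs of the two paths, the helper names `EstimateCost` and `PartitionCost`, and the recipient counts are all hypothetical.

```csharp
using System;
using System.Linq;

// Assumed relative costs of the two code paths. Real values would have to
// come from analyzing or profiling GetStat and GetSimpleStat.
const double StatPathCost = 10.0;
const double SimpleStatPathCost = 1.0;

// Mirrors the branch in ProcessRecord: the number of recipients decides
// which path a record takes.
double EstimateCost(int numRecipients)
    => numRecipients > 10 ? StatPathCost : SimpleStatPathCost;

// Estimated cost of a partition, given the recipient count of each record.
double PartitionCost(int[] recipientCounts)
    => recipientCounts.Sum(r => EstimateCost(r));

// Two made-up partitions holding the same number of records.
int[] partitionA = { 2, 3, 1, 4 };
int[] partitionB = { 50, 80, 2, 120 };

Console.WriteLine($"partition A estimated cost: {PartitionCost(partitionA)}");
Console.WriteLine($"partition B estimated cost: {PartitionCost(partitionB)}");
```

Both partitions hold four records, yet their estimated costs differ widely because most records in the second one take the other branch. Partitioning by record count alone would miss this.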

SLIDE 9

Data Analysis

Statistics of relevant data features

  • Challenge: compact data representation

– Representative samples of input data
– Data summarizations
– Approximate histogram
– Approximate number of distinct keys

  • Streaming algorithms in a distributed setting
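As one example of a streaming technique that yields a representative sample, the sketch below uses reservoir sampling (Algorithm R). The slides do not say which algorithms the authors use, so this is a stand-in chosen for illustration; the helper name `ReservoirSample`, the stream, the sample size, and the seed are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Reservoir sampling (Algorithm R): keeps a uniform random sample of k items
// from a stream of unknown length, using O(k) memory and a single pass.
static List<T> ReservoirSample<T>(IEnumerable<T> stream, int k, Random rng)
{
    var reservoir = new List<T>(k);
    int seen = 0;
    foreach (var item in stream)
    {
        seen++;
        if (reservoir.Count < k)
        {
            reservoir.Add(item);          // fill the reservoir first
        }
        else
        {
            int j = rng.Next(seen);       // uniform in [0, seen)
            if (j < k) reservoir[j] = item;
        }
    }
    return reservoir;
}

// Stand-in for one node's share of the input records.
var picked = ReservoirSample(Enumerable.Range(0, 100000), 100, new Random(42));
Console.WriteLine($"sampled {picked.Count} of 100000 records");
```

In a distributed setting each node could run this over its local data in one pass; combining the per-node samples into a single uniform sample would additionally require weighting by the number of records each node saw, which this sketch does not do.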
SLIDE 10

Cost Modeling and Optimization

  • Modeling: compare different partitioning schemes
  • Estimation: predict the potential cost

– White-box approach

  • Analytically based on code/data analysis

– Black-box approach

  • Sampling + regression analysis
  • Optimization: search for best partitioning scheme
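The black-box approach can be sketched as follows: run the job on a few samples of increasing size, fit a regression to the measured run times, and extrapolate to the full input. The linear model, the helper name `FitLine`, and the measurements below are assumptions for illustration; real jobs may not scale linearly.

```csharp
using System;
using System.Linq;

// Ordinary least-squares fit of y = intercept + slope * x.
static (double Intercept, double Slope) FitLine(double[] x, double[] y)
{
    double meanX = x.Average(), meanY = y.Average();
    double sxy = x.Zip(y, (xi, yi) => (xi - meanX) * (yi - meanY)).Sum();
    double sxx = x.Sum(xi => (xi - meanX) * (xi - meanX));
    double slope = sxy / sxx;
    return (meanY - slope * meanX, slope);
}

// Made-up measurements: run time of the job on samples of increasing size.
double[] sampleGb = { 1, 2, 4, 8 };
double[] runSeconds = { 13, 23, 43, 83 };   // constructed as 3 + 10 * GB

var (intercept, slope) = FitLine(sampleGb, runSeconds);

// Extrapolate to a 270 GB input, the size used in the GroupBy example.
double predicted = intercept + slope * 270;
Console.WriteLine($"predicted run time for 270 GB: {predicted:F0} s");
```

The same fit can be repeated for each candidate partitioning scheme, so that the optimization step compares predicted costs rather than measured ones.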

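[Figure: optimization pipeline. Inputs: input data; code and EPG. Stages: Data Analysis (produces data statistics and samples), Code Analysis (produces computational and I/O complexity), Cost Modeling & Estimation (produces an updated EPG), Cost Optimization (produces the optimized EPG).]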

SLIDE 11

Conclusion

  • Prepare your input before you start

– Data partitioning is critical to performance

  • New research opportunities in different fields

– Programming language analysis
– Data analysis
– Optimization
– Distributed systems