

SLIDE 1

Beyond the Embarrassingly Parallel

New Languages, Compilers, and Runtimes for Big-Data Processing

Madan Musuvathi

Microsoft Research

Joint work with

Mike Barnett (MSR), Saeed Maleki (MSR), Todd Mytkowicz (MSR) Yufei Ding (N.C.State), Daniel Lupei (EPFL), Charith Mendis (MIT), Mathias Peters (Humboldt Univ.), Veselin Raychev (EPFL)

SLIDE 2

parallelism

SLIDE 3

parallelism = independent computation

SLIDE 4

can we parallelize dependent computation?

SLIDE 5

“Inherently sequential” code is common

  • log processing
  • event-series pattern matching
  • machine learning algorithms
  • dynamic programming
  • …

(pipeline of dependent stages: F → G → H → …)

SLIDE 6

Running example: processing click logs

click log: S R R R S R S R R R R R P S R

influential reviews: S R+ P
(S = search, R = review, P = purchase)

problem: count influential reviews in the log

SLIDE 7

Running example: processing click logs

click log: S R R R S R S R R R R R P S R

bool search_done = false;
int num_reviews = 0;
int sum = 0;
for each record in input
  switch record.type:
    case SEARCH:
      if (!search_done) { num_reviews = 0; search_done = true; }
    case REVIEW:
      num_reviews++;
    case PURCHASE:
      if (search_done) { search_done = false; sum += num_reviews; }

influential reviews: S R+ P

Loop-carried state: (search_done, num_reviews, sum)
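The loop above transcribes directly into runnable form; here is a minimal Python sketch (the function name and the single-character record encoding are illustrative, not from the talk):

```python
def count_influential(log):
    """Count reviews inside S R+ P sessions, sequentially.
    The loop-carried state is (search_done, num_reviews, total)."""
    search_done = False
    num_reviews = 0
    total = 0
    for record in log:
        if record == "S":                 # SEARCH
            if not search_done:
                num_reviews = 0
                search_done = True
        elif record == "R":               # REVIEW
            num_reviews += 1
        elif record == "P":               # PURCHASE
            if search_done:
                search_done = False
                total += num_reviews
    return total

print(count_influential("SRRPRSRRRRRRPRP"))  # → 8
```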

SLIDE 8

Extracting parallelism from dependent computations

for each record in input
  switch record.type:
    case SEARCH:
      if (!search_done) { num_reviews = 0; search_done = true; }
    case REVIEW:
      num_reviews++;
    case PURCHASE:
      if (search_done) { search_done = false; sum += num_reviews; }

click log: S R R P R S R | R R R R R P R P

(the same loop runs independently on each chunk)

loop-carried state (search_done, num_reviews, sum) at chunk boundaries:
(false, 0, 0) → (true, 1, 2) → (false, 7, 8)

SLIDE 9

Extracting parallelism from dependent computations

for each record in input
  switch record.type:
    case SEARCH:
      if (!search_done) { num_reviews = 0; search_done = true; }
    case REVIEW:
      num_reviews++;
    case PURCHASE:
      if (search_done) { search_done = false; sum += num_reviews; }

click log: S R R P R S R | R R R R R P R P

chunk 1 runs concretely: (false, 0, 0) → (true, 1, 2)
chunk 2 runs symbolically from an unknown state (sd, nr, s), producing the summary:

  F(sd, nr, s) = (false, nr+6, sd ? s+nr+5 : s)

output = F(true, 1, 2) = (false, 7, 8)

// loop-carried state: (search_done, num_reviews, sum)

SLIDE 10

Recipe for breaking dependences

1. replace dependences with symbolic unknowns
2. compute symbolic summaries in parallel
3. combine symbolic summaries

success depends on:
1. fast symbolic execution
2. generation of concise summaries

chunk F runs concretely, producing f; chunks G and H run symbolically from an unknown x, producing g(x) and h(x)

output = h( g( f ) )

research challenges:

  • 1. identifying “compressible” computation
  • 2. using domain-specific structure
  • 3. automating the parallelization
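As an illustration of the recipe (my example, not one from the talk), consider the linear recurrence s = a·s + b applied per element: a chunk's effect on the unknown incoming state x is an affine function m·x + c, so the summaries of step 2 stay concise and step 3's combination is just function composition:

```python
from functools import reduce

def chunk_summary(chunk):
    """Run s = a*s + b over a chunk, starting from a symbolic unknown x.
    The summary stays affine: s_out = m*x + c."""
    m, c = 1, 0                    # identity summary: s_out = x
    for a, b in chunk:
        m, c = a * m, a * c + b    # a*(m*x + c) + b
    return (m, c)

def combine(f, g):
    """Compose two summaries: apply f's chunk first, then g's."""
    mf, cf = f
    mg, cg = g
    return (mg * mf, mg * cf + cg)

data = [(2, 1), (3, 0), (1, 5), (2, 2)]
chunks = [data[:2], data[2:]]                      # 1. break into chunks
summaries = [chunk_summary(ch) for ch in chunks]   # 2. summarize in parallel
m, c = reduce(combine, summaries)                  # 3. combine summaries
s0 = 4
print(m * s0 + c)  # → 66, same as running the loop sequentially over data
```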
SLIDE 11

Successful applications of this methodology

finite-state machines [ASPLOS ’14]

  • regular expression matching, Huffman decoding, …
  • 3x faster on a single core, linear speedup on multiple cores

dynamic programming [PPoPP ‘14, TOPC ’15, ICASSP ‘16]

  • linear speedup beyond the previous-best software Viterbi decoder
  • 7x speedup over state-of-the-art speech decoder

large-scale data processing [SOSP ’15]

  • automatically parallelizable language for temporal analysis

relational databases

  • optimize sessionization & windowed aggregates
  • 10x improvement over SQL server

machine learning

  • parallel stochastic gradient descent

(part 1 of the talk: large-scale data processing; part 2 of the talk: dynamic programming)

SLIDE 12

Auto-Parallelization Across Dependences

Large-scale data processing

SLIDE 13

Relational abstractions for data processing

map, reduce, join, filter, group-by

  • expressive, simple, and declarative
  • automatically parallelizable
  • decades of work on optimizations

select count(*) from objects
where type = square
group by color

(pipeline: filter → group-by → count)

SLIDE 14

Forces pushing beyond relational abstractions

queries today = relational skeleton + non-relational logic

relational skeleton: embarrassingly parallel, optimized

non-relational logic: not parallel, not optimized (temporal, iterative, stateful)

  • log analysis
  • sessionization
  • machine learning
SLIDE 15

Map-Reduce example

weblog: S R R S P R R P S R R S P R S P S R R S P R R P S R P S R P

users can: search (S), review (R), purchase (P)

SLIDE 16

Count the number of reviews read per user

the weblog is split across mapper1 and mapper2, which extract the review records (R) per user; reducer1 and reducer2 sum the per-user partial counts (psum: 3, 3, 4, 1), giving totals 3 + 4 = 7 and 3 + 1 = 4

SLIDE 17

Count influential reviews (SR+P) per user

S R R S P R R P S R R S R R P R S R R R P R P P S R R P R P S R R R R R P R P P S R S R R P S R

each chunk runs "match SR+P" in parallel, producing a per-chunk match summary; the summaries are combined, then reduced

data shuffled drops from terabytes to gigabytes
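One way to picture per-chunk match summaries for a finite-state matcher (a toy sketch, not the talk's implementation): run each chunk from every possible matcher state, so its summary is a small state → (state, matches) table, and tables compose. Here, for illustration, complete S R+ P sessions are counted:

```python
# Toy DFA for matching S R+ P: state 0 = idle, 1 = after S, 2 = after S R+
def step(state, ch):
    """One transition; returns (new_state, completed_match_count)."""
    if ch == "S":
        return (1, 0)
    if ch == "R":
        return (2, 0) if state in (1, 2) else (0, 0)
    if ch == "P":
        return (0, 1) if state == 2 else (0, 0)
    return (state, 0)

def summarize(chunk):
    """Chunk summary: for every possible start state, the end state
    and the number of matches completed inside the chunk."""
    summary = {}
    for s0 in (0, 1, 2):
        s, matches = s0, 0
        for ch in chunk:
            s, done = step(s, ch)
            matches += done
        summary[s0] = (s, matches)
    return summary

def combine(f, g):
    """Compose summaries: run f's chunk first, then g's chunk."""
    return {s0: (g[f[s0][0]][0], f[s0][1] + g[f[s0][0]][1]) for s0 in f}

left, right = summarize("SRRPSR"), summarize("RP")   # computed in parallel
full = combine(left, right)
print(full[0])  # → (0, 2): from the idle state, two matches complete
```

Because the summaries are tiny tables, shipping them to the reducer moves far less data than shipping the raw records.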

SLIDE 18

SymPLE [SOSP ‘15]

a language for specifying nonrelational parts of data-processing queries

a subset of C++

  • automatically parallelizes sequential code
  • exposes additional parallelism to the query optimizer
  • up to 2 orders of magnitude efficiency improvement

SLIDE 19

Count influential reviews

bool search_done = false;
int num_reviews = 0;
int sum = 0;
for each record in input
  switch record.type:
    case SEARCH:
      if (!search_done) { num_reviews = 0; search_done = true; }
    case REVIEW:
      num_reviews++;
    case PURCHASE:
      if (search_done) { search_done = false; sum += num_reviews; }

SLIDE 20

Count influential reviews

SymBool search_done = false;
SymInt num_reviews = 0;
SymInt sum = 0;
for each record in input
  switch record.type:
    case SEARCH:
      if (!search_done) { num_reviews = 0; search_done = true; }
    case REVIEW:
      num_reviews++;
    case PURCHASE:
      if (search_done) { search_done = false; sum += num_reviews; }

the user declares symbolic data types for the loop-carried state

overloaded operators encode efficient symbolic decision procedures for generating symbolic summaries
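The overloading idea can be sketched in a few lines of Python (a toy stand-in for SymPLE's C++ types, not its implementation): the symbolic value tracks an affine expression coeff·x + const over the unknown incoming state x, and the overloaded `+` keeps that representation closed:

```python
class SymInt:
    """Toy symbolic integer: value = coeff * x + const, where x is the
    unknown incoming value of the loop-carried variable."""
    def __init__(self, coeff=0, const=0):
        self.coeff, self.const = coeff, const

    def __add__(self, other):
        if isinstance(other, SymInt):
            return SymInt(self.coeff + other.coeff, self.const + other.const)
        return SymInt(self.coeff, self.const + other)   # SymInt + int

    __radd__ = __add__

    def concretize(self, x):
        """Substitute the actual incoming value once it is known."""
        return self.coeff * x + self.const

# summary of a chunk that runs `sum += n` for n in (3, 1, 4),
# starting from the unknown incoming sum x:
sum_ = SymInt(coeff=1)          # the symbolic unknown x itself
for n in (3, 1, 4):
    sum_ = sum_ + n
print(sum_.concretize(10))      # → 18, i.e. x + 8 at x = 10
```

Branching on symbolic values (as in the max example below) additionally requires path conditions, which is where the decision procedures come in.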

SLIDE 21

Computing max in parallel

max is, of course, associative, but this is not apparent from the code; SymPLE can parallelize this code anyway

SymInt curr_max = 0;
for each num_reviews in input
  if (curr_max < num_reviews)
    curr_max = num_reviews;
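Concretely, the summary of this loop over any chunk collapses to a single number: F(x) = max(x, local max of the chunk). A sketch of the resulting parallelization (the thread pool and chunk sizes are illustrative choices):

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def chunk_summary(chunk):
    """Summary of the max loop over one chunk: F(x) = max(x, local_max).
    The single number local_max represents the whole function."""
    local = 0
    for n in chunk:
        if local < n:
            local = n
    return local

data = [2, 8, 1, 5, 3, 9, 8, 2, 1]
chunks = [data[0:3], data[3:6], data[6:9]]
with ThreadPoolExecutor() as pool:
    summaries = list(pool.map(chunk_summary, chunks))   # in parallel

# combining the summaries applies each F in order: G(F(x)) = max(x, ...)
print(reduce(lambda x, m: max(x, m), summaries, 0))     # → 9
```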

SLIDE 22

Parallelize by breaking dependences

input: 2 8 1 | 5 3 9 | 8 2 1

the same loop runs on each chunk:

for each num_reviews in input
  if (curr_max < num_reviews)
    curr_max = num_reviews;

chunk 1 runs concretely → 8
chunk 2 runs symbolically from an unknown x → F(x)
chunk 3 runs symbolically from an unknown x → G(x)

output = G(F(8))

SLIDE 23

Parallelize by breaking dependences

chunk (5, 3, 9), starting from an unknown x:

for each num_reviews in (5, 3, 9)
  if (curr_max < num_reviews)
    curr_max = num_reviews;

→ F(x)

SLIDE 24

SymInt max = x;
for each num_reviews in (5, 3, 9)
  if (max < num_reviews)
    max = num_reviews;

unrolled, the loop is three guarded updates:
  iter 1: if (max < 5) max = 5;
  iter 2: if (max < 3) max = 3;
  iter 3: if (max < 9) max = 9;

symbolic execution from max = y branches on each comparison:
  y < 5 ⇒ max = 5; then 5 < 3 is infeasible; 5 < 9 ⇒ max = 9
  y ≥ 5 ⇒ max = y; then y < 3 is infeasible; y < 9 ⇒ max = 9, y ≥ 9 ⇒ max = y

resulting summary:
  y < 9 ⇒ max = 9
  y ≥ 9 ⇒ max = y

  • no branching when state becomes concrete
  • equivalent paths can be merged
  • decision procedure prunes infeasible paths

SLIDE 25

Parallelize by breaking dependences

input: 2 8 1 | 5 3 9 | 8 2 1

the same loop runs on each chunk:

for each num_reviews in input
  if (curr_max < num_reviews)
    curr_max = num_reviews;

chunk 1 runs concretely → 8
chunk 2 summary: y < 9 ⇒ curr_max = 9; y ≥ 9 ⇒ curr_max = y
chunk 3 summary: y < 8 ⇒ curr_max = 8; y ≥ 8 ⇒ curr_max = y

output: the chunk-2 summary applied to 8 gives 9; the chunk-3 summary applied to 9 gives 9

SLIDE 26

Single machine throughput

(chart: throughput in MB/s on Queries 1–4, comparing sequential execution against symbolic execution with 1, 2, and 4 threads; symbolic execution adds some overhead)

SLIDE 27

Reduction in data movement

(chart: megabytes of data shuffled from mappers to reducers on Queries 1–4, MapReduce vs. SymPLE; up to a 172x reduction)

SLIDE 28

Challenge

can we develop new abstractions for future data-processing needs?

  • move beyond embarrassingly parallel
  • automatically parallelizable

perform whole-query optimizations

  • unify relational and non-relational parts
  • extract filters, project unused parts of data, …
SLIDE 29

Manual Parallelization Across Dependences

Dynamic Programming

SLIDE 30

Speech decoders

speech signal → GMM/DNN → phonemes (/p/ee/p/aw/p/) → HMM → recognized text (“PPoPP”)

the HMM stage is the sequential bottleneck

SLIDE 31

Viterbi algorithm for Hidden Markov Models (HMM)

finds the most likely sequence of hidden states that explain an observation

hidden states q0, q1, q2 = language model states; the computation unfolds over time

recurrence equation:

  P_t(s) = max over q ∈ pred(s) of [ P_{t-1}(q) + TP_t(q → s) ]

(TP_t(q → s) is the transition score from state q to state s at time t)

SLIDE 32

Dynamic programming computes a sequence of stages

Viterbi: states q0, q1, q2 over time; stage = column of the trellis

LCS (diff): stage = anti-diagonal of the table

SLIDE 33

Our focus: parallelization across stages

recurrence across stages:

  S_t[j] = max over k of ( S_{t-1}[k] + c_{t,j,k} )

equivalently, S_t = A ⊙ S_{t-1}, where ⊙ is matrix multiplication in the tropical (max-plus) semiring; stages S_1 … S_n are distributed across processors P_1, P_2, …
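A minimal sketch of one stage of this recurrence as a max-plus matrix-vector product (the weights and the three-state size are made-up numbers, not from a real decoder):

```python
NEG_INF = float("-inf")

def tropical_matvec(A, v):
    """Max-plus product: result[j] = max_k (A[j][k] + v[k]),
    i.e. one stage of the recurrence S_t = A (.) S_{t-1}."""
    return [max(row[k] + v[k] for k in range(len(v))) for row in A]

# hypothetical 3-state stage: A[j][k] = score of moving from state k to j
A = [[0, -1, NEG_INF],
     [-2, 0, -1],
     [NEG_INF, -3, 0]]

S = [0, NEG_INF, NEG_INF]      # start in state 0
for _ in range(3):             # three stages
    S = tropical_matvec(A, S)
print(S)                       # → [0, -2, -5]
```

Because ⊙ is associative, products of A over different stage ranges can be computed in parallel and then combined, which is exactly the parallelization across stages discussed here.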

SLIDE 34

Solution in terms of finding shortest-paths

(diagram: shortest paths from source to dest)

SLIDE 35

Solution in terms of finding shortest-paths

each processor computes shortest paths from all sources to all destinations of its stages; parallelization cost = size of stages

SLIDE 36

Shortest paths converge to optimal routes

SLIDE 37

Convergence in LCS

SLIDE 38

(chart: speed of the Viterbi decoder on CDMA, in Mb/s (8–512, log scale) vs. number of threads (1–32), for five configurations)

SLIDE 39

Summary

parallelizable computation, from automatic to manual:

  • finite-state computation
  • event-series pattern matching
  • linear stochastic-gradient descent
  • linear-tropical dynamic programming
  • sessionization / windowed aggregates
  • Viterbi / speech decoding
  • your favorite problem?

“inherently sequential” ⇒ “embarrassingly parallel”