MapReduce Design Patterns

Now let’s look at important program “design patterns” for MapReduce.
• This section is based on the book by Jimmy Lin and Chris Dyer
• Programmer can control program execution only through implementation of mapper, reducer, combiner, and partitioner
• No explicit synchronization primitives
• So how can a programmer control execution and data flow?

Taking Control of MapReduce
• Store and communicate partial results through complex data structures for keys and values
• Run appropriate initialization code at beginning of task and termination code at end of task
• Preserve state in mappers and reducers across multiple input records and intermediate keys, respectively
• Control sort order of intermediate keys to control processing order at reducers
• Control set of keys assigned to a reducer
• Use “driver” program

(1) Local Aggregation
• Reduce size of intermediate results passed from mappers to reducers
  – Important for scalability: recall Amdahl’s Law
• Various options using combiner function and ability to preserve mapper state across multiple inputs
• For example, consider Word Count with the document-based version of Map

Word Count Baseline Algorithm
  map(docID a, doc d)
    for all term t in doc d do
      Emit(term t, count 1)

  reduce(term t, counts [c1, c2,…])
    sum = 0
    for all count c in counts do
      sum += c
    Emit(term t, count sum)

• Problem: frequent terms are emitted many times with count 1

Tally Counts Per Document
  map(docID a, doc d)
    H = new hashMap
    for all term t in doc d do
      H{t} ++
    for all term t in H do
      Emit(term t, count H{t})

• Same Reduce function as before
• Limitation: Map only aggregates counts within a single document
• Depending on split size and document size, a Map task might receive many documents
• Can we aggregate across all documents in the same Map task?
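To make the difference between the two Map versions concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the `run_job` driver, the function names, and the two toy documents are assumptions made for illustration, and the "shuffle" is just an in-memory dictionary.

```python
from collections import Counter, defaultdict


def map_baseline(doc_id, doc):
    """Baseline Map: emit (term, 1) for every term occurrence."""
    for term in doc.split():
        yield term, 1


def map_per_document(doc_id, doc):
    """Per-document tally: count in a local hash map, then emit once per term."""
    counts = Counter(doc.split())
    for term, count in counts.items():
        yield term, count


def reduce_sum(term, counts):
    """Reduce: sum all partial counts for one term."""
    return term, sum(counts)


def run_job(map_fn, docs):
    """Toy driver (hypothetical): map, group by key, reduce.

    Returns the final counts and the number of intermediate pairs emitted.
    """
    groups = defaultdict(list)
    emitted = 0
    for doc_id, doc in docs.items():
        for key, value in map_fn(doc_id, doc):
            groups[key].append(value)
            emitted += 1
    result = dict(reduce_sum(k, v) for k, v in groups.items())
    return result, emitted


# Made-up input: two tiny documents.
DOCS = {"d1": "a b a c a", "d2": "b a b"}

baseline_result, baseline_emitted = run_job(map_baseline, DOCS)
tally_result, tally_emitted = run_job(map_per_document, DOCS)

print(baseline_result == tally_result)  # same final counts
print(baseline_emitted, tally_emitted)  # fewer intermediate pairs with the tally
```

Both versions share the same Reduce function. Only the number of intermediate pairs differs, and the saving is limited to repetitions within a single document.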

Tally Counts Across Documents
  Class Mapper {
    initialize() {
      H = new hashMap
    }

    map(docID a, doc d) {
      for all term t in doc d do
        H{t} ++
    }

    close() {
      for all term t in H do
        Emit(term t, count H{t})
    }
  }

• Data structure H is a private member of the Mapper class
  – Local to a single task, i.e., does not introduce task synchronization issues
• Initialize is called when the task starts, i.e., before all map calls
  – configure() in old API
  – setup() in new API
• Close is called after the last document from the Map task has been processed
  – close() in old API
  – cleanup() in new API

Design Pattern for Local Aggregation
• In-mapper combining
  – Done by preserving state across map calls in the same task
• Advantages over using combiners
  – Combiner does not guarantee if, when or how often it is executed
  – Combiner combines data after it was generated, in-mapper combining avoids generating it!
• Drawbacks
  – Introduces complexity and hence probability for bugs
  – Higher memory consumption for managing state
    • Might have to write memory-management code to page data to disk

(2) Counting of Combinations
• Needed for computing correlations, associations, confusion matrix (how many times does a classifier confuse Y_i with Y_j)
• Co-occurrence matrix for a text corpus: how many times do two terms appear near each other
• Main idea: compute partial counts for some combinations, then aggregate them
  – At what granularity should Map work?

Pairs Design Pattern
  map(docID a, doc d)
    for all term w in doc d do
      for all term u NEAR w do
        Emit(pair (w, u), count 1)

  reduce(pair p, counts [c1, c2,…])
    sum = 0
    for all count c in counts do
      sum += c
    Emit(pair p, count sum)

• Can use combiner or in-mapper combining
• Good: easy to implement and understand
• Bad: huge intermediate-key space
  – Quadratic in number of distinct terms

Stripes Design Pattern
  map(docID a, doc d)
    for all term w in doc d do
      H = new hashMap
      for all term u NEAR w do
        H{u} ++
      Emit(term w, stripe H)

  reduce(term w, stripes [H1, H2,…])
    Hout = new hashMap
    for all stripe H in stripes do
      Hout = ElementWiseSum(Hout, H)
    Emit(term w, stripe Hout)

• Can use combiner or in-mapper combining
• Good: much smaller intermediate-key space
  – Linear in number of distinct terms
• Bad: more difficult to implement, Map needs to hold entire stripe in memory

Note About Stripes Map Code
• Pairs’ Map code only needs a single sequential scan of the document, keeping the current term w and a “sliding window” of the nearby terms to its left and right
• Stripes can do the same, but then it does not aggregate counts across multiple occurrences of the same term w in document d, i.e., would mostly produce counts of 1 in the hash map
• To aggregate across all occurrences of w in d, Stripes would have to repeatedly scan the document, once for each distinct term w in d
  – Could create an index to find repeated occurrences of w faster
• Or use a two-dim. hash map H[w][u] in the Map function, allowing a single-scan solution at higher memory cost
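The following Python sketch runs Pairs and Stripes side by side on toy input. Several things here are assumptions rather than part of the slides: "u NEAR w" is taken to mean "within `WINDOW` positions of w", the Stripes Map uses the two-dimensional hash map variant (single scan, aggregated per document), and the drivers `run_pairs` and `run_stripes` stand in for the shuffle and Reduce phases.

```python
from collections import Counter, defaultdict

WINDOW = 1  # assumption: "u NEAR w" means u is at most WINDOW positions from w


def neighbors(terms, i, window=WINDOW):
    """Yield the terms within `window` positions of index i, excluding i itself."""
    lo = max(0, i - window)
    hi = min(len(terms), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            yield terms[j]


def map_pairs(doc):
    """Pairs Map: one ((w, u), 1) record per co-occurrence."""
    terms = doc.split()
    for i, w in enumerate(terms):
        for u in neighbors(terms, i):
            yield (w, u), 1


def map_stripes(doc):
    """Stripes Map, two-dimensional variant: H[w][u] filled in a single scan."""
    terms = doc.split()
    stripes = defaultdict(Counter)
    for i, w in enumerate(terms):
        for u in neighbors(terms, i):
            stripes[w][u] += 1
    for w, stripe in stripes.items():
        yield w, stripe


def run_pairs(docs):
    """Toy driver: group pair keys and sum their counts."""
    groups = defaultdict(list)
    for doc in docs:
        for pair, count in map_pairs(doc):
            groups[pair].append(count)
    return {pair: sum(counts) for pair, counts in groups.items()}


def run_stripes(docs):
    """Toy driver: group stripes by term and add them element-wise."""
    groups = defaultdict(list)
    for doc in docs:
        for w, stripe in map_stripes(doc):
            groups[w].append(stripe)
    result = {}
    for w, stripes in groups.items():
        total = Counter()
        for stripe in stripes:
            total.update(stripe)  # Counter.update adds counts element-wise
        result[w] = total
    return result


DOCS = ["a b a c", "b a"]  # made-up documents
pairs = run_pairs(DOCS)
stripes = run_stripes(DOCS)
flattened = {(w, u): n for w, s in stripes.items() for u, n in s.items()}

print(pairs == flattened)        # both patterns compute the same matrix
print(len(pairs), len(stripes))  # distinct intermediate keys: pairs vs. stripes
```

Even on this tiny input the number of distinct intermediate keys differs: one key per (w, u) cell for Pairs versus one key per row w for Stripes.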

Pairs versus Stripes
• With combiner or in-mapper combining, Map would produce about the same amount of data in both cases
  – Two-dimensional index Pairs[w][u] with per-task counts for each pair (w,u) is the same as one-dimensional index of one-dimensional indexes (Stripes[w])[u]
• …and would also require about the same amount of memory to store the two-dimensional count data structure

Pairs versus Stripes (cont.)
• Without combiner or in-mapper combining, Pairs could produce significantly more mapper output
  – ((w,u),1) per pair for Pairs, versus per-document aggregates for Stripes
• …but it would need a lot less memory
  – Pairs essentially needs no extra storage beyond the current “window” of nearby words, while Stripes has to store the hash map H

Pairs versus Stripes (cont.)
• Does the number of keys matter?
  – Assume we use the same number of tasks, then Pairs just assigns more keys per task
  – Master works with tasks, hence no conceptual difference between Pairs and Stripes
• More fine-grained keys of Pairs allow more flexibility in assigning keys to tasks
  – Pairs can emulate Stripes’ row-wise key assignment to tasks
  – Stripes cannot emulate all Pairs assignments, e.g., “checkerboard” pattern for two tasks
• Greater number of distinct keys per task in Pairs tends to increase sorting cost, even if total data size is the same

Beyond Pairs and Stripes
• In general, it is not clear which approach is better
  – Some experiments indicate stripes win for co-occurrence matrix computation
• Pairs and Stripes are special cases of shapes for covering the entire matrix
  – Could use sub-stripes, or partition matrix horizontally and vertically into more square-like shapes etc.
• Can also be applied to higher-dimensional arrays
• Will see interesting version of this idea for joins

(3) Relative Frequencies
• Important for data mining
• E.g., for each species and color, estimate the probability of the color for that species
  – Probability of Northern Cardinal being red: P(color = red | species = N.C.)
• Count f(N.C.) = the frequency of observations for N.C. (marginal)
• Count f(N.C., red) = the frequency of observations for red N.C.’s (joint event)
• Estimate P(red | N.C.) as f(N.C., red) / f(N.C.)
• Similarly: normalize word co-occurrence vector for word w

Bird Probabilities Using Stripes
• Use species as intermediate key
  – One stripe per species, e.g., stripe[N.C.]
  – (stripe[species])[color] stores f(species, color)
• Map: for each observation of (species S, color C) in an observation event, increment (stripe[S])[C]
  – Output (S, stripe[S])
• Reduce: for each species S, add all stripes for S
  – Result: stripeSum[S] with total counts for each color for S
  – Can get f(S) by adding all color-counts in stripeSum[S]
  – Emit (stripeSum[S])[C] / f(S) for each color C
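The bird-probability computation with stripes can be sketched in Python as follows. This is an illustrative single-process simulation, not Hadoop code: the function names, the `relative_frequencies` driver, and the observation data in `SPLITS` are all made up for the example.

```python
from collections import Counter, defaultdict


def map_observations(observations):
    """Map: build one stripe per species from this task's observations."""
    stripes = defaultdict(Counter)
    for species, color in observations:
        stripes[species][color] += 1  # (stripe[S])[C] += 1
    for species, stripe in stripes.items():
        yield species, stripe         # Output (S, stripe[S])


def reduce_species(species, stripes):
    """Reduce: add all stripes for one species, then normalize by f(S)."""
    stripe_sum = Counter()
    for stripe in stripes:
        stripe_sum.update(stripe)         # element-wise sum
    marginal = sum(stripe_sum.values())   # f(S): sum of all color counts
    return {color: count / marginal for color, count in stripe_sum.items()}


def relative_frequencies(splits):
    """Toy driver (hypothetical): one Map call per split, group by species, reduce."""
    groups = defaultdict(list)
    for observations in splits:
        for species, stripe in map_observations(observations):
            groups[species].append(stripe)
    return {s: reduce_species(s, stripes) for s, stripes in groups.items()}


# Made-up observations, divided into two input splits.
SPLITS = [
    [("N.C.", "red"), ("N.C.", "red"), ("Blue Jay", "blue")],
    [("N.C.", "brown"), ("N.C.", "red"), ("Blue Jay", "blue")],
]

probs = relative_frequencies(SPLITS)
print(probs["N.C."]["red"])  # estimate of P(red | N.C.) = 3 / 4
```

Because the whole stripe for a species reaches a single Reduce call, the marginal f(S) is available locally and no second pass over the data is needed.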
