
Pairs Design Pattern and Stripes Design Pattern



Pairs Design Pattern

map(docID a, doc d)
  for all term w in doc d do
    for all term u NEAR w do
      Emit(pair (w, u), count 1)

reduce(pair p, counts [c1, c2, …])
  sum = 0
  for all count c in counts do
    sum += c
  Emit(pair p, count sum)

• Can use combiner or in-mapper combining
• Good: easy to implement and understand
• Bad: huge intermediate-key space (shuffling/sorting cost!)
  – Quadratic in number of distinct terms

Stripes Design Pattern

map(docID a, doc d)
  for all term w in doc d do
    H = new hashMap
    for all term u NEAR w do
      H{u}++
    Emit(term w, stripe H)

reduce(term w, stripes [H1, H2, …])
  Hout = new hashMap
  for all stripe H in stripes do
    Hout = ElementWiseSum(Hout, H)
  Emit(term w, stripe Hout)

• Can use combiner or in-mapper combining
• Good: much smaller intermediate-key space
  – Linear in number of distinct terms
• Bad: more difficult to implement; Map needs to hold an entire stripe in memory

(A runnable sketch of both patterns appears at the end of this group of slides.)

Beyond Pairs and Stripes

• In general, it is not clear which approach is better
  – Some experiments indicate stripes win for co-occurrence matrix computation
• Pairs and stripes are special cases of shapes for covering the entire matrix
  – Could use sub-stripes, or partition the matrix horizontally and vertically into more square-like shapes, etc.
• Can also be applied to higher-dimensional arrays
• Will see an interesting version of this idea for joins

(3) Relative Frequencies

• Important for data mining
• E.g., for each species and color, compute the probability of the color for that species
  – Probability of a Northern Cardinal being red, P(color = red | species = N.C.)
• Count f(N.C.), the frequency of observations for N.C. (the marginal)
• Count f(N.C., red), the frequency of observations for red N.C.'s (the joint event)
• P(red | N.C.) = f(N.C., red) / f(N.C.)
• Similarly: normalize the word co-occurrence vector for word w by dividing it by w's frequency

Bird Probabilities Using Stripes

• Use species as the intermediate key
  – One stripe per species, e.g., stripe[N.C.]
  – (stripe[species])[color] stores f(species, color)
• Map: for each observation of (species S, color C) in an observation event, increment (stripe[S])[C]
  – Output (S, stripe[S])
• Reduce: for each species S, add all stripes for S
  – Result: stripeSum[S] with total counts for each color for S
  – Can get f(S) by adding all values in stripeSum[S] together
  – Get probability P(color = C | species = S) as (stripeSum[S])[C] / f(S)

Discussion, Part 1

• Stripes are a great fit for relative-frequency computation
• All values for computing the final result are in the stripe
• Any smaller unit would miss some of the joint events needed for computing f(S), the marginal for the species
• So this would be a problem for the pairs pattern
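The following is a minimal, runnable in-memory sketch of the pairs and stripes mappers and reducers from the pseudocode above. The toy documents, the neighbors() helper (a stand-in for NEAR), and the run() driver that simulates shuffle-and-sort are illustrative assumptions; in a real Hadoop job these would be Mapper/Reducer classes and the framework's own grouping.

from collections import Counter, defaultdict

def neighbors(terms, i, window=2):
    # Stand-in for "NEAR": terms within `window` positions of terms[i].
    lo, hi = max(0, i - window), min(len(terms), i + window + 1)
    return [terms[j] for j in range(lo, hi) if j != i]

# Pairs pattern: intermediate key = (w, u), value = 1.
def pairs_map(doc):
    terms = doc.split()
    for i, w in enumerate(terms):
        for u in neighbors(terms, i):
            yield (w, u), 1

def pairs_reduce(pair, counts):
    return pair, sum(counts)

# Stripes pattern: intermediate key = w, value = hash map {u: count}.
def stripes_map(doc):
    terms = doc.split()
    for i, w in enumerate(terms):
        H = Counter(neighbors(terms, i))   # one stripe per occurrence of w
        yield w, H

def stripes_reduce(term, stripes):
    Hout = Counter()
    for H in stripes:                      # element-wise sum of all stripes
        Hout.update(H)
    return term, Hout

def run(mapper, reducer, docs):
    # Simulates the MapReduce shuffle/sort: group all mapper output by key.
    groups = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())

docs = ["a b c a", "b a c"]
print(run(pairs_map, pairs_reduce, docs))      # many small (w, u) keys
print(run(stripes_map, stripes_reduce, docs))  # few keys, each carrying a stripe

Running it shows the tradeoff directly: the pairs output has one entry per distinct (w, u) pair, while the stripes output has far fewer keys, each holding a hash map of co-occurrence counts.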

Bird Probabilities Using Pairs

• Intermediate key is (species, color)
• Map produces partial counts for each species-color combination in the input
• Reduce can compute f(species, color), the total count of each species-color combination
• But: cannot compute the marginal f(S) in the reduce task
  – Reduce needs to sum f(S, color) over all colors for species S

Pairs-Based Solution, Take 1

• Make sure all values f(S, color) for the same species end up in the same reduce task
  – Define a custom partitioning function on species
• Maintain state across different keys in the same reduce task
• This essentially simulates the stripes approach in the reduce task, creating big reduce tasks when there are many colors
• Can we do better?

Discussion, Part 2

• The pairs-based algorithm would work better if the marginal f(S) were known already
  – The reducer computes f(species, color) and then outputs f(species, color) / f(species)
• We can compute the species marginals f(species) in a separate MapReduce job first
• Better: fold this into a single MapReduce job
  – Problem: it is easy to compute f(S) from all the f(S, color), but how do we compute f(S) before knowing f(S, color)?

Bird Probabilities Using Pairs, Take 2

• Map: for each observation event, emit ((species S, color C), 1) and ((species S, dummyColor), 1) for each species-color combination encountered
• Use a custom partitioner that partitions based on the species component only
• Use a custom key comparator such that (S, dummyColor) comes before all (S, C) for real colors C
  – The reducer computes f(S) before the f(S, C)
  – The reducer keeps f(S) in state for the duration of the entire task
  – The reducer then computes f(S, C) for each C, outputting f(S, C) / f(S)
• Advantage: avoids having to manage all colors for a species together

(A small simulation of this order-inversion idea appears at the end of this group of slides.)

Order Inversion Design Pattern

• Occurs surprisingly often during data analysis
• Solution 1: use complex data structures that bring the right results together
  – Array structure used by the stripes pattern
• Solution 2: turn synchronization into an ordering problem
  – Key sort order enforces computation order
  – Partitioner for the key space assigns appropriate partial results to each reduce task
  – Reducer maintains task-level state across Reduce invocations
  – Works for the simpler pairs pattern, which uses simpler data structures and requires less reducer memory

(4) Secondary Sorting

• Recall the weather data: for simplicity, assume the observations are (date, stationID, temperature)
• Goal: for each station, create a time series of temperature measurements
• Per-station data: use stationID as the intermediate key
• Problem: reducers receive a huge number of (date, temp) pairs for each station
  – They have to be sorted by user code
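Below is a small, self-contained simulation of the order-inversion idea from "Bird Probabilities Using Pairs, Take 2". The marker value "*" plays the role of dummyColor, plain sorting stands in for Hadoop's custom key comparator, and grouping by species stands in for the custom partitioner; the observation data is made up for illustration.

from collections import defaultdict

# Made-up observation events: (species, color).
observations = [("N.C.", "red"), ("N.C.", "red"), ("N.C.", "brown"),
                ("Blue Jay", "blue")]

def relfreq_map(species, color):
    yield (species, "*"), 1       # "*" plays the role of dummyColor: marginal f(S)
    yield (species, color), 1     # joint event f(S, C)

# "Custom partitioner": group by species only, so one reducer sees all of a species.
partitions = defaultdict(lambda: defaultdict(list))
for s, c in observations:
    for (species, color), one in relfreq_map(s, c):
        partitions[species][(species, color)].append(one)

# "Custom key comparator": within a partition, process keys in sorted order,
# so the "*" key (the marginal) arrives before any real color.
for species, keyed in partitions.items():
    marginal = None                        # reducer state kept across keys
    for key in sorted(keyed):              # "*" sorts before letters
        total = sum(keyed[key])
        if key[1] == "*":
            marginal = total               # f(S), seen first thanks to the ordering
        else:
            print(key[0], key[1], total / marginal)   # P(color | species)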

Can Hadoop Do The Sorting?

• Use (stationID, date) as the intermediate key
  – Problem: records for the same station might end up in different reduce tasks
  – Solution: custom partitioner, using only the stationID component of the key for partitioning
• General value-to-key conversion design pattern
  – To partition by X and then sort each X-group by Y, make (X, Y) the key
  – Define a key comparator to order by the composite key (X, Y)
  – Define a partitioner and grouping comparator for (X, Y) that consider only X for partitioning and grouping
    • The grouping part is necessary if all dates for a station should be processed in the same Reduce invocation (otherwise each station-date combination ends up in a different Reduce invocation)

(A small simulation of value-to-key conversion appears at the end of this group of slides.)

Design Pattern Summary

• In-mapper combining: do the work of the combiner in the mapper
• Pairs and stripes: for keeping track of joint events
• Order inversion: convert the sequencing of computation into a sorting problem
• Value-to-key conversion: scalable solution for secondary sorting, without writing your own sort code

Tools for Synchronization

• Cleverly constructed data structures for keys and values to bring data together
• Preserving state in mappers and reducers, together with the capability to add initialization and termination code for an entire task
• Sort order of intermediate keys to control the order in which reducers process keys
• Custom partitioner to control which reducer processes which keys

Issues and Tradeoffs

• Number of key-value pairs
  – Object creation overhead
  – Time for sorting and shuffling pairs across the network
• Size of each key-value pair
  – (De-)serialization overhead
• Local aggregation
  – Opportunities to perform local aggregation vary
  – Combiners can make a big difference
  – Combiners vs. in-mapper combining
  – RAM vs. disk vs. network

Now that we have seen important design patterns and MapReduce algorithms for simpler problems, let's look at some more complex problems.

Joins in MapReduce

• Data sets S = {s1, …, s|S|} and T = {t1, …, t|T|}
• Find all pairs (si, tj) that satisfy some predicate
• Examples
  – Pairs of similar or complementary function summaries
  – Facebook and Twitter posts by the same user or from the same location
• Typical goal: minimize job completion time
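Below is a small, self-contained simulation of value-to-key conversion for the weather example in "Can Hadoop Do The Sorting?". In Hadoop this would be a composite key plus a custom partitioner, key comparator, and grouping comparator; here ordinary Python sorting and itertools.groupby stand in for those pieces, and the records are made-up sample data.

from itertools import groupby

# Made-up weather records: (date, stationID, temperature).
records = [("2011-10-14", "station-2", 12.5),
           ("2011-10-12", "station-1", 9.0),
           ("2011-10-13", "station-1", 10.5),
           ("2011-10-12", "station-2", 11.0)]

# Map: move the sort field into the key -> composite key (stationID, date).
mapped = [((station, date), temp) for date, station, temp in records]

# "Key comparator": sort by the full composite key, so within each station the
# values reach the reducer already ordered by date.
mapped.sort(key=lambda kv: kv[0])

# "Partitioner" + "grouping comparator": group on stationID only, so one Reduce
# invocation sees the whole, already-sorted time series for a station.
for station, group in groupby(mapped, key=lambda kv: kv[0][0]):
    series = [(date, temp) for (_, date), temp in group]
    print(station, series)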
