Parallel Nested Loops, Parallel Partition-Based Joins, and MapReduce
Lecture slides, 9/15/2011

Parallel Nested Loops
• For each tuple s_i in S
  – For each tuple t_j in T
    • If s_i = t_j, then add (s_i, t_j) to the output
• Create partitions S_1, S_2, T_1, and T_2
• Have processors work on (S_1, T_1), (S_1, T_2), (S_2, T_1), and (S_2, T_2)
• Can create the partitions of S and T in parallel
  – Can build an appropriate local index on each chunk if desired
• Nice and easy, but…
  – How to choose chunk sizes for given S, T, and #processors?
  – There is data duplication, possibly a lot of it
    • Especially undesirable for highly selective joins with a small result

Parallel Partition-Based
• Create n partitions of S by hashing each S-tuple s, e.g., to bucket number (s mod n)
• Create n partitions of T in the same way
• Run a join algorithm on each pair of corresponding partitions (a short sketch of this scheme appears after these slides)
• Choose n = number of processors
• Each processor can locally choose its favorite join algorithm
• No data replication, but…
  – Does not work well for skewed data
  – Limited parallelism if the range of values is small

More Join Thoughts
• What about non-equi joins?
  – Find pairs (s_i, t_j) that satisfy a predicate like inequality, band, or similarity (e.g., when s and t are documents)
• Hash partitioning will not work any more
• Now things are becoming really tricky…
• We will discuss these issues in a future lecture.

Median
• Find the median of a set of integers
• Holistic aggregate function
  – The chunk assigned to a processor might contain mostly smaller or mostly larger values, and the processor does not know this without communicating extensively with the others
• A parallel implementation might not do much better than a sequential one
• Efficient approximation algorithms exist (a sampling-based sketch also appears below)

Parallel Office Tools
• Parallelize Word, Excel, an email client?
• Impossible without rewriting them as multi-threaded applications
  – They seem to naturally have a low degree of parallelism
• Leverage economies of scale instead: n processors (or cores) support n desktop users by hosting the service in the Cloud
  – E.g., Google Docs

Before exploring parallel algorithms in more depth, how do we know if our parallel algorithm or implementation actually does well or not?
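
Before turning to that question: a minimal sketch of the partition-based parallel equi-join described above, assuming S and T are plain lists of integers and using Python's multiprocessing pool as the "processors". All names here (partition, join_partition, parallel_partition_join) are mine, not from the slides.

    # Hash-partition S and T into n buckets (s mod n), then join corresponding
    # partitions in parallel. Each worker uses a simple local hash join.
    from multiprocessing import Pool

    def partition(tuples, n):
        buckets = [[] for _ in range(n)]
        for x in tuples:
            buckets[x % n].append(x)        # bucket number (s mod n), as on the slide
        return buckets

    def join_partition(pair):
        s_part, t_part = pair
        index = {}
        for t in t_part:                    # build a local index on the T partition
            index.setdefault(t, []).append(t)
        return [(s, t) for s in s_part for t in index.get(s, [])]

    def parallel_partition_join(S, T, n):
        s_parts, t_parts = partition(S, n), partition(T, n)
        with Pool(n) as pool:               # n = number of processors
            parts = pool.map(join_partition, list(zip(s_parts, t_parts)))
        return [pair for part in parts for pair in part]

    if __name__ == "__main__":
        print(parallel_partition_join([1, 2, 3, 4, 5], [2, 4, 4, 6], n=2))
        # -> [(2, 2), (4, 4), (4, 4)]

Skew is the weak point called out on the slide: if most values hash to the same bucket, one worker ends up doing most of the work.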

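For the median slide: one common approximation approach, shown here as an assumed example rather than the method the lecture necessarily has in mind, is to have every processor sample its chunk and take the median of the combined sample.

    # Approximate the median by sampling each processor's chunk.
    # Chunk layout, sample size, and names are illustrative assumptions.
    import random

    def approximate_median(chunks, samples_per_chunk=100):
        sample = []
        for chunk in chunks:                           # one chunk per processor
            k = min(samples_per_chunk, len(chunk))
            sample.extend(random.sample(chunk, k))
        sample.sort()
        return sample[len(sample) // 2]                # median of the combined sample

    # Contiguous chunks: each processor sees mostly small or mostly large values,
    # which is exactly the skew problem mentioned on the slide.
    chunks = [list(range(i * 1250, (i + 1) * 1250)) for i in range(8)]
    print(approximate_median(chunks))                  # close to the true median (~5000)

Because every chunk contributes to the sample, no single processor needs to know whether its own values are globally small or large.
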
Measures of Success
• If the sequential version takes time t, then the parallel version on n processors should take time t/n
  – Speedup = sequentialTime / parallelTime
  – Note: the job, i.e., the work to be done, is fixed
• Response time should stay constant if the number of processors increases at the same rate as the "amount of work"
  – Scaleup = workDoneParallel / workDoneSequential
  – Note: the time to work on the job is fixed

Things to Consider: Amdahl's Law
• Consider a job taking sequential time 1 and consisting of two sequential tasks taking time t_1 and 1 - t_1, respectively
• Assume we can perfectly parallelize the first task on n processors
  – Parallel time: t_1/n + (1 - t_1)
• Speedup = 1 / (1 - t_1(n-1)/n)
  – t_1 = 0.9, n = 2: speedup = 1.81
  – t_1 = 0.9, n = 10: speedup = 5.3
  – t_1 = 0.9, n = 100: speedup = 9.2
  – Maximum possible speedup for t_1 = 0.9 is 1/(1 - 0.9) = 10
  (these figures are verified in the short sketch below)

Implications of Amdahl's Law
• Parallelize the tasks that take the longest
• Sequential steps limit the maximum possible speedup
  – Communication between tasks, e.g., to transmit intermediate results, can inherently limit speedup, no matter how well the tasks themselves can be parallelized
• If a fraction x of the job is inherently sequential, speedup can never exceed 1/x
  – No point running it on an excessive number of processors

Performance Metrics
• Total execution time
  – Part of both speedup and scaleup
• Total resources (maybe only of type X) consumed
• Total amount of money paid
• Total energy consumed
• Optimize some combination of the above
  – E.g., minimize total execution time, subject to a money budget constraint

Popular Strategies
• Load balancing
  – Avoid overloading one processor while another is idle
  – Careful: if better balancing increases the total load, it might not be worth it
  – Careful: optimizes for response time, but not necessarily for other metrics like $ paid
• Static load balancing
  – Needs a cost analyzer like in a DBMS
• Dynamic load balancing
  – Easy: Web search
  – Hard: join

Let's see how MapReduce works.
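
Before that, the quick numerical check of the Amdahl's Law figures referenced above (a minimal sketch; the function name and the rounding are mine, not from the slides):

    # Speedup when a fraction t1 of the job is perfectly parallelized on n
    # processors and the remaining (1 - t1) stays sequential.
    def speedup(t1, n):
        return 1.0 / (t1 / n + (1.0 - t1))

    for n in (2, 10, 100):
        print(n, round(speedup(0.9, n), 2))   # 1.82, 5.26, 9.17 (the slides show 1.81, 5.3, 9.2)

    print(round(1.0 / (1.0 - 0.9), 1))        # 10.0, the limit as n grows

Even with 100 processors, the 10% sequential part caps the speedup just below 10.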

MapReduce
• Proposed by Google in a research paper:
  – Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
• MapReduce implementations like Hadoop differ in details, but the main principles are the same

Overview
• MapReduce = programming model and associated implementation for processing large data sets
• The programmer essentially just specifies two (sequential) functions: map and reduce
• Program execution is automatically parallelized on large clusters of commodity PCs
  – MapReduce could be implemented on different architectures, but Google proposed it for clusters

Overview (continued)
• A clever abstraction that is a good fit for many real-world problems
• The programmer focuses on the algorithm itself
• The runtime system takes care of all the messy details:
  – Partitioning of the input data
  – Scheduling program execution
  – Handling machine failures
  – Managing inter-machine communication

Programming Model
• Transforms a set of input key-value pairs into a set of output values (note the small modification compared to the paper)
• Map: (k1, v1) → list(k2, v2)
• The MapReduce library groups all intermediate pairs with the same key together
• Reduce: (k2, list(v2)) → list(k3, v3)
  – Usually zero or one output value per group
  – Intermediate values are supplied via an iterator (to handle lists that do not fit in memory)

Example: Word Count
• Insight: we can count each document in parallel, then aggregate the counts
• The final aggregation has to happen in Reduce
  – We need a count per word, hence use the word itself as the intermediate key (k2)
  – The intermediate counts are the intermediate values (v2)
• Parallel counting can happen in Map
  – For each document, output a set of pairs, each being a word in the document and its frequency of occurrence in that document
  – Alternative: output (word, "1") for each word encountered

Word Count in MapReduce
Count the number of occurrences of each word in a document collection:

    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));

This is almost all the coding needed (we also need a MapReduce specification object with the names of the input and output files, and optional tuning parameters). A runnable Python analogue appears below.
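
The runnable Python analogue of the pseudocode above, as referenced on the slide. This is only a single-process sketch: the grouping that the MapReduce library performs across machines is simulated with a dictionary, and all names (run_mapreduce, map_fn, reduce_fn) are mine.

    # In-process word count in the MapReduce style: map emits (word, 1),
    # the "library" groups pairs by key, reduce sums the counts per word.
    from collections import defaultdict

    def map_fn(doc_name, contents):
        for word in contents.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        yield (word, sum(counts))

    def run_mapreduce(inputs, map_fn, reduce_fn):
        groups = defaultdict(list)
        for k1, v1 in inputs:                      # map phase
            for k2, v2 in map_fn(k1, v1):
                groups[k2].append(v2)
        output = []                                # group by key, then reduce phase
        for k2 in sorted(groups):
            output.extend(reduce_fn(k2, groups[k2]))
        return output

    docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog the end")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # [('brown', 1), ('dog', 1), ('end', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 3)]

Note that map_fn emits (word, 1) once per occurrence, which is the "alternative" mentioned on the Example: Word Count slide; emitting per-document frequencies instead would reduce the amount of intermediate data.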

Execution Overview
• Data is stored in files
  – Files are partitioned into smaller splits, typically 64 MB
  – Splits are stored (usually also replicated) on different cluster machines
• A master node controls program execution and keeps track of progress
  – It does not participate in data processing
• Some workers will execute the Map function; let's call them mappers
• Some workers will execute the Reduce function; let's call them reducers

Execution Overview (continued)
• The master assigns map and reduce tasks to workers, taking data location into account
• A mapper reads its assigned file split and writes intermediate key-value pairs to local disk
• The mapper informs the master about the result locations, and the master in turn informs the reducers
• Reducers pull the data from the appropriate mapper disk locations
• After the map phase is completed, reducers sort their data by key
• For each key, the Reduce function is executed and its output is appended to the final output file
• When all reduce tasks are completed, the master wakes up the user program
(a toy in-process simulation of this flow appears at the end of this section)

Execution Overview (diagram)
[Figure from the original slides; the diagram itself is not recoverable from the extracted text.]

Master Data Structures
• The master keeps track of the status of each map and reduce task and who is working on it
  – Idle, in-progress, or completed
• The master stores the location and size of the output of each completed map task
  – This information is pushed incrementally to workers with in-progress reduce tasks

Example: Equi-Join
• Given two data sets S = (s_1, s_2, …) and T = (t_1, t_2, …) of integers, find all pairs (s_i, t_j) where s_i.A = t_j.A
• We can only combine the s_i and t_j in Reduce
  – To ensure that the right tuples end up in the same Reduce invocation, use the join attribute A as the intermediate key (k2)
  – The intermediate value is the actual tuple to be joined
• Map needs to output (s.A, s) for each S-tuple s (and similarly for T-tuples)

Equi-Join in MapReduce
• Join condition: S.A = T.A
• Map(s) = (s.A, s); Map(t) = (t.A, t)
• Reduce computes the Cartesian product of the set of S-tuples and the set of T-tuples with the same key (a small runnable sketch follows these slides)
• [Figure: data flow from DFS nodes through mappers to reducers and back to DFS nodes. Input tuples s5,1; s3,2; s1,1 and t3,1; t1,2; t8,1 become intermediate pairs keyed on A, grouped as 1,[(s5,1)(t3,1)(s1,1)(t8,1)] and 2,[(s3,2)(t1,2)], and reduced to the output pairs (s5,t3), (s1,t3), (s1,t8), (s5,t8), (s3,t1). Stages: Input (k1,v1), Map, Map Output list(k2,v2), Transfer, Reduce (k2,list(v2)), Reduce Output list(v3).]
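
The small runnable sketch of the equi-join example referenced above. The tuple layout (A, payload) and the tagging of tuples with their source table are my assumptions; the grouping dictionary stands in for the MapReduce library's shuffle.

    # Equi-join S.A = T.A in the MapReduce style: map tags every tuple with its
    # join attribute as the intermediate key; reduce builds the Cartesian product
    # of the S-tuples and T-tuples that share a key.
    from collections import defaultdict

    def map_tuple(table, record):
        a = record[0]                        # join attribute A
        return (a, (table, record))          # (k2, v2) = (A, tagged tuple)

    def reduce_join(a, tagged_tuples):
        s_side = [r for tag, r in tagged_tuples if tag == "S"]
        t_side = [r for tag, r in tagged_tuples if tag == "T"]
        return [(s, t) for s in s_side for t in t_side]   # Cartesian product per key

    S = [(1, "s5"), (2, "s3"), (1, "s1")]    # (A, payload), matching s5,1  s3,2  s1,1 above
    T = [(1, "t3"), (2, "t1"), (1, "t8")]

    groups = defaultdict(list)               # the grouping the MapReduce library does
    for rec in S:
        k, v = map_tuple("S", rec)
        groups[k].append(v)
    for rec in T:
        k, v = map_tuple("T", rec)
        groups[k].append(v)

    for a in sorted(groups):
        print(a, reduce_join(a, groups[a]))
    # 1 [((1, 's5'), (1, 't3')), ((1, 's5'), (1, 't8')), ((1, 's1'), (1, 't3')), ((1, 's1'), (1, 't8'))]
    # 2 [((2, 's3'), (2, 't1'))]

Because Reduce really does build a Cartesian product per key, a heavily skewed join attribute sends most of the work to a single reducer, echoing the skew caveat from the partition-based join slide.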

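To tie the execution overview together, here is the toy in-process simulation referenced above. It is a sketch under obvious simplifying assumptions (no distributed file system, no failures, no master process, no network): split the input, run a "mapper" per split, hash-partition intermediate pairs across reducers, sort each reducer's data by key, and run Reduce once per key.

    # Minimal in-process simulation of the MapReduce execution overview:
    # input splits -> mappers -> hash partition by key -> per-reducer sort -> reduce.
    from itertools import groupby

    def simulate(splits, map_fn, reduce_fn, num_reducers=2):
        # Map phase: one "mapper" per split; in the real system the per-reducer
        # partitions below would sit on each mapper's local disk.
        partitions = [[] for _ in range(num_reducers)]
        for split in splits:
            for k1, v1 in split:
                for k2, v2 in map_fn(k1, v1):
                    partitions[hash(k2) % num_reducers].append((k2, v2))

        # Reduce phase: each "reducer" sorts its partition by key, then runs
        # the Reduce function once per key and appends to the final output.
        output = []
        for part in partitions:
            part.sort(key=lambda kv: kv[0])
            for k2, group in groupby(part, key=lambda kv: kv[0]):
                output.extend(reduce_fn(k2, [v for _, v in group]))
        return output

    # Word count again, as the job to run.
    def wc_map(doc, text):
        return [(w, 1) for w in text.split()]

    def wc_reduce(word, counts):
        return [(word, sum(counts))]

    splits = [[("d1", "a b a")], [("d2", "b c")]]
    print(sorted(simulate(splits, wc_map, wc_reduce)))   # [('a', 2), ('b', 2), ('c', 1)]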