Algorithms for MapReduce Combiners Partition and Sort Pairs vs - PowerPoint PPT Presentation

Algorithms for MapReduce

Outline: Combiners; Partition and Sort; Pairs vs Stripes

Assignment 1 released

Due 16:00 on 20 October.
Correctness is not enough! Most marks are for efficiency.

Combining, Sorting, and Partitioning

. . . and algorithms exploiting these options.

Important: learn and apply optimization tricks.
Less important: these specific examples.

Last lecture: hash table has unbounded size

    #!/usr/bin/python3
    import sys

    def spill(cache):
        # Emit every cached word with its count, tab-separated.
        for word, count in cache.items():
            print(word + "\t" + str(count))

    cache = {}
    for line in sys.stdin:
        for word in line.split():
            cache[word] = cache.get(word, 0) + 1
    spill(cache)

Solution: bounded size

    #!/usr/bin/python3
    import sys

    def spill(cache):
        # Emit every cached word with its count, tab-separated.
        for word, count in cache.items():
            print(word + "\t" + str(count))

    cache = {}
    for line in sys.stdin:
        for word in line.split():
            cache[word] = cache.get(word, 0) + 1
            if len(cache) >= 10:  # Limit: 10 entries
                spill(cache)
                cache.clear()
    spill(cache)  # Flush whatever is left at end of input.

Combiners

Combiners formalize the local aggregation we just did:

    Map machine:  Mapper -> Combiner -> Local Disk

Specifying a Combiner

Hadoop has built-in support for combiners:

    hadoop jar hadoop-streaming-2.7.3.jar \
        -files count_map.py,count_reduce.py \
        -input /data/assignments/ex1/webSmall.txt \
        -output /user/$USER/combined \
        -mapper count_map.py \
        -combiner count_reduce.py \
        -reducer count_reduce.py

    hadoop jar ...   Run Hadoop
    -files           Copy to workers
    -input           Read text file
    -output          Write here
    -mapper          Simple mapper
    -combiner        Combiner sums
    -reducer         Reducer sums

How is this implemented?

Mapper’s Initial Sort

    Map
     -> Partition (aka Shard): assign destination reducer
     -> RAM buffer: remember what fits in RAM
     -> Sort: sort batch in RAM
     -> Combine: optional combiner
     -> Disk

(In the diagram, the RAM buffer, Sort, Combine, and Disk stages are drawn twice, side by side.)

Merge Sort

When the mapper runs out of RAM, it spills to disk.
⇒ Chunks of sorted data called “spills”.

Mappers merge their spills into one per reducer.
Reducers merge input from multiple mappers.

    Spill 0    Spill 1         Merged, after combiner
    a 3        a 5             a 8
    c 4        b 9       →     b 9
    d 2        c 6             c 10
                               . . .
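The merge step can be sketched in a few lines of Python. This is only an illustration of the idea, not Hadoop's implementation: the function name `merge_spills` is hypothetical, and the spills are small in-memory lists holding the values from the slide rather than files on disk.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_spills(spills, combine=sum):
    """Merge already-sorted spills of (key, count) pairs.

    heapq.merge streams the spills in key order without re-sorting them;
    groupby then hands each run of equal keys to the combiner.
    """
    merged = heapq.merge(*spills, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, combine(count for _, count in group)

# The two spills from the slide, each sorted by key.
spill_0 = [("a", 3), ("c", 4), ("d", 2)]
spill_1 = [("a", 5), ("b", 9), ("c", 6)]

result = list(merge_spills([spill_0, spill_1]))
print(result)  # [('a', 8), ('b', 9), ('c', 10), ('d', 2)]
```

Because each spill is already sorted, the merge needs only one record per spill in memory at a time.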

Combiner Summary

Combiners optimize merge sort and reduce network traffic.

They may run in:
    Mapper initial sort
    Mapper merge
    Reducer merge

Combiner FAQ

Hadoop might not run your combiner at all!
Combiners will see a mix of mapper and combiner output.
Hadoop won’t partition or sort combiner output again.
⇒ Don’t change the key.
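One way to check a combiner against these rules is to confirm that the final result is the same whether it runs zero, one, or several times. The sketch below does that for a summing combiner on made-up word counts; `sum_counts` is a hypothetical stand-in for a script like count_reduce.py, not the actual assignment code.

```python
from itertools import groupby
from operator import itemgetter

def sum_counts(pairs):
    """Sum counts per key. Keys are never changed, so the same function
    can serve as combiner and as reducer."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, sum(count for _, count in group))
            for key, group in groupby(ordered, key=itemgetter(0))]

mapper_output = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)]

# Combiner never runs: the reducer sees raw mapper output.
no_combiner = sum_counts(mapper_output)

# Combiner runs on the first three records only, so the next stage
# sees a mix of combiner output and raw mapper output.
partially_combined = sum_counts(mapper_output[:3]) + mapper_output[3:]
one_pass = sum_counts(partially_combined)

# Combiner runs again on its own output before the reducer.
two_passes = sum_counts(sum_counts(partially_combined))

print(no_combiner == one_pass == two_passes)  # True
```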

Combiner Efficiency: Sort vs Hash Table

Hadoop sorts before combining
⇒ Duplicate keys are sorted
⇒ slow

Our in-mapper implementation used a hash table.
Also reduces Java ↔ Python overhead.

In-mapper is usually faster, but we’ll let you use either one.

Problem: Averaging

We’re given temperature readings from cities:

    Key              Value
    San Francisco    22
    Edinburgh        14
    Los Angeles      23
    Edinburgh        12
    Edinburgh        9
    Los Angeles      21

Find the average temperature in each city.

Map: (city, temperature) ↦ (city, temperature)
Combine: Same as reducer?
Reduce: Count, sum temperatures, and divide.

Problem: Averaging (with a combiner)

Reusing the reducer as the combiner would average partial averages, which is wrong whenever the partial groups differ in size. Carry the count along instead:

Map: (city, temperature) ↦ (city, count = 1, temperature)
Combine: Sum count and temperature fields.
Reduce: Sum count, sum temperatures, and divide.
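A minimal sketch of this scheme in Python, using the readings from the slide. The function names and the split of the input across two simulated map machines are assumptions made for illustration; this is not a Hadoop Streaming job.

```python
from collections import defaultdict

def map_reading(city, temperature):
    # Attach count = 1 so that partial results can be added up later.
    return city, (1, temperature)

def combine(values):
    """Sum the count and temperature fields of a list of (count, total)."""
    return (sum(count for count, _ in values),
            sum(total for _, total in values))

def reduce_average(values):
    count, total = combine(values)
    return total / count

readings = [("San Francisco", 22), ("Edinburgh", 14), ("Los Angeles", 23),
            ("Edinburgh", 12), ("Edinburgh", 9), ("Los Angeles", 21)]

def run(readings, split):
    """Simulate two map machines: readings[:split] and readings[split:]."""
    shuffled = defaultdict(list)
    for machine in (readings[:split], readings[split:]):
        local = defaultdict(list)
        for city, temperature in machine:
            key, value = map_reading(city, temperature)
            local[key].append(value)
        for city, values in local.items():
            shuffled[city].append(combine(values))  # combiner, per machine
    return {city: reduce_average(values) for city, values in shuffled.items()}

averages = run(readings, split=4)

# For contrast: averaging the two Edinburgh readings on machine 1 first,
# then averaging that with the reading on machine 2, gives a wrong answer.
naive_edinburgh = ((14 + 12) / 2 + 9) / 2
print(averages["Edinburgh"], naive_edinburgh)
```

With this split, Edinburgh arrives at the reducer as (2, 26) and (1, 9), which sum to (3, 35) and give 35/3; the naive average of averages gives 11.0.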

Pattern: Combiners

Combiners reduce communication by aggregating locally.
Many times they are the same as reducers (e.g. summing).
. . . but not always (e.g. averaging).

Custom Partitioner and Sorting Function

Mapper’s Initial Sort

    Map
     -> Partition (aka Shard): custom partitioner
     -> RAM buffer
     -> Sort: custom sort function
     -> Combine
     -> Disk

Problem: Comparing Output

    Alice’s Word Counts    Bob’s Word Counts
    a 20                   i 13
    why 12                 a 20
    hi 2                   hi 2
    the 31                 why 12
    i 13                   the 31

Map, then send words to a consistent place: reducers.

    Reducer 1 receives, from both students:   a 20, the 31, i 13
    Reducer 2 receives, from both students:   hi 2, why 12

Within each reducer, the Alice/Bob values for a word arrive unordered.

Comparing Output Detail

Map: (word, count) ↦ (word, student, count) [1]
Partition: By word
Sort: By (word, student)
Reduce: Verify both values are present and match.
        Deduct marks from Alice/Bob as appropriate.

Exploit sort to control input order.

[1] The mapper can tell Alice and Bob apart by input file name.
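The partition and sort rules can be simulated locally as follows. This is a sketch under stated assumptions: `NUM_REDUCERS`, `partition`, and `shuffle` are hypothetical names, and the CRC32 hash is just a deterministic stand-in for whatever hash a real partitioner would use. It shows the idea only, not how Hadoop is configured to do it.

```python
import zlib

NUM_REDUCERS = 2  # hypothetical cluster size

def partition(key):
    """Choose a reducer from the word alone, ignoring the student,
    so both students' records for a word meet at the same reducer."""
    word, _student = key
    return zlib.crc32(word.encode("utf-8")) % NUM_REDUCERS

def shuffle(records):
    """Distribute ((word, student), count) records and sort each bucket."""
    reducers = [[] for _ in range(NUM_REDUCERS)]
    for key, count in records:
        reducers[partition(key)].append((key, count))
    for bucket in reducers:
        # Sort by the full (word, student) key: alice precedes bob.
        bucket.sort(key=lambda record: record[0])
    return reducers

alice = {"a": 20, "why": 12, "hi": 2, "the": 31, "i": 13}
bob = {"i": 13, "a": 20, "hi": 2, "why": 12, "the": 31}
records = ([((word, "alice"), count) for word, count in alice.items()] +
           [((word, "bob"), count) for word, count in bob.items()])

reducers = shuffle(records)
for index, bucket in enumerate(reducers):
    print(index, bucket)
```

Partitioning on the word alone while sorting on the full key is what lets each reducer see a word's records together and in a known student order.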

Problem: Comparing Output (with custom sort)

Same data flow as before: map, then send words to a consistent place (reducers). With the custom sort, the Alice/Bob values for each word now arrive ordered.

Pattern: Exploit the Sort

Without Custom Sort
    Reducer buffers all students in RAM
    ⇒ Might run out of RAM

With Custom Sort
    TA appears first, reducer streams through students.
    Constant reducer memory.
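A streaming reducer along these lines might look like the sketch below. It assumes the input is already sorted by (word, student) and that the TA's reference record carries a tag that sorts before every student name; the empty-string tag `TA`, the function name `grade`, and the sample data are all hypothetical.

```python
TA = ""  # hypothetical tag: the empty string sorts before any student name

def grade(sorted_records):
    """Compare each student's count with the TA's reference count.

    Input: (word, student, count) tuples sorted by (word, student), so the
    TA's record, if any, is the first record for its word. Only the current
    reference is kept in memory, regardless of the number of students.
    """
    ref_word, ref_count = None, None
    for word, student, count in sorted_records:
        if student == TA:
            ref_word, ref_count = word, count
        elif word != ref_word:
            yield student, word, "word not in reference"
        elif count != ref_count:
            yield student, word, "expected %d, got %d" % (ref_count, count)

records = sorted([
    ("a", TA, 20), ("a", "alice", 20), ("a", "bob", 19),
    ("hi", TA, 2), ("hi", "alice", 2),
    ("zed", "bob", 1),
])

problems = list(grade(records))
print(problems)
# [('bob', 'a', 'expected 20, got 19'), ('bob', 'zed', 'word not in reference')]
```

This sketch does not detect a student who omits a word entirely; doing so would need extra information, such as a class roster, which is outside what the slide describes.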

Problem: Word Co-occurrence

Count pairs of words that appear in the same line.

First try: pairs

• Each mapper takes a sentence:
  – Generate all co-occurring term pairs
  – For all pairs, emit (a, b) → count
• Reducers sum up counts associated with these pairs
• Use combiners!

www.inf.ed.ac.uk

Pairs: pseudo-code

    class Mapper
      method map(docid a, doc d)
        for all w in d do
          for all u in neighbours(w) do
            emit(pair(w, u), 1);

    class Reducer
      method reduce(pair p, counts [c1, c2, …])
        sum = 0;
        for all c in [c1, c2, …] do
          sum = sum + c;
        emit(p, sum);
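The pseudo-code above translates fairly directly into Python. The sketch below makes one assumption the slide leaves open: neighbours(w) is taken to mean every other word position on the same line. The shuffle is simulated with an in-memory dictionary, and the helper name `count_pairs` is hypothetical.

```python
from collections import defaultdict

def map_pairs(line):
    """Emit ((w, u), 1) for every ordered pair of distinct positions."""
    words = line.split()
    for i, w in enumerate(words):
        for j, u in enumerate(words):
            if i != j:  # assumption: a position is not its own neighbour
                yield (w, u), 1

def reduce_pairs(pair, counts):
    """Sum the counts for one pair."""
    return pair, sum(counts)

def count_pairs(lines):
    """Simulate map, shuffle (group by pair), and reduce in memory."""
    grouped = defaultdict(list)
    for line in lines:
        for pair, one in map_pairs(line):
            grouped[pair].append(one)
    return dict(reduce_pairs(pair, counts) for pair, counts in grouped.items())

counts = count_pairs(["a b a"])
print(counts)  # {('a', 'b'): 2, ('a', 'a'): 2, ('b', 'a'): 2}
```

Under this reading of neighbours, a line of n words emits n(n-1) pairs, which gives a sense of how much data the shuffle has to move.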

Analysing pairs

• Advantages
  – Easy to implement, easy to understand
• Disadvantages
  – Lots of pairs to sort and shuffle around (upper bound?)
  – Not many opportunities for combiners to work
