Comments

  • Programming model might appear very limited
  • But, Map and Reduce can do anything with their input
    – Could implement a Turing machine inside…
    – …which could compute anything, but…
    – …would not result in a good parallel implementation
  • Challenge: find the best MapReduce implementation for a given problem

Basic MapReduce Program Design

  • Tasks that can be performed independently on a data object, with a large number of such objects: Map
  • Tasks that require combining multiple data objects: Reduce
  • Sometimes it is easier to start program design with Map, sometimes with Reduce
  • Select keys and values such that the right objects end up together in the same Reduce invocation
  • Might have to partition a complex task into multiple MapReduce sub-tasks

Choosing M and R

  • M = number of map tasks, R = number of reduce tasks
  • Larger M, R: creates smaller tasks, enabling easier load balancing and faster recovery (the many small tasks of a failed machine can be spread over the others)
  • Limitation: O(M+R) scheduling decisions and O(M·R) in-memory state at the master
    – Very small tasks are not worth the startup cost
  • Recommendation:
    – Choose M so that the split size is approximately 64 MB
    – Choose R as a small multiple of the number of workers; alternatively, choose R a little smaller than the number of workers to finish the reduce phase in one “wave”
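For example (illustrative numbers, not from the original slides): a 640 GB input with 64 MB splits gives M = 10,000; on a cluster with 100 reduce slots, R = 95 lets the reduce phase finish in a single wave, while R = 400 gives finer-grained recovery at the cost of four waves.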


Grep

  • Find all lines matching some pattern
  • No need to combine anything
    – Reduce is not needed, i.e., it is just the identity function
  • Map takes a line and outputs it if it matches the pattern
  • Map could also take an entire document and emit all matching lines
    – Not a good idea if there is a single large document, but works well if there are many documents
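A minimal sketch of the per-line variant in the old org.apache.hadoop.mapred API (the same API as the Sort code later in these slides); the class name and the hard-coded pattern "xyz" are illustrative only:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Map emits a line only if it matches the pattern; there is
    // nothing to combine, so the job needs no real Reduce.
    public class GrepMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, NullWritable> {

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, NullWritable> output,
                      Reporter reporter) throws IOException {
        if (line.toString().contains("xyz")) {   // the pattern test
          output.collect(line, NullWritable.get());
        }
      }
    }

With jobConf.setNumReduceTasks(0), map output is written out directly and no identity Reduce is even invoked.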


Reverse Web-Link Graph

  • For each URL, find all pages (URLs) pointing to it (incoming links)
  • Problem: a Web page has only outgoing links
  • Need all (anySource, P) links for each page P
    – Suggests Reduce with P as the key and the source as the value
  • Map: for page source, create a (target, source) pair for each link to a target found in the page
  • Reduce: since target is the key, it receives all sources pointing to that target
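A sketch of both functions in the old mapred API; it assumes the input records are (source URL, page contents) pairs, and the whitespace split stands in for a real link parser:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class ReverseLinks {

      // Map inverts each link: for page "source", emit (target, source)
      // for every target it links to.
      public static class LinkMapper extends MapReduceBase
          implements Mapper<Text, Text, Text, Text> {
        public void map(Text source, Text page,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {
          for (String target : page.toString().split("\\s+")) { // stand-in parser
            output.collect(new Text(target), source);
          }
        }
      }

      // Reduce: all sources pointing to one target arrive together;
      // concatenate them into the incoming-links list.
      public static class SourceListReducer extends MapReduceBase
          implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text target, Iterator<Text> sources,
                           OutputCollector<Text, Text> output,
                           Reporter reporter) throws IOException {
          StringBuilder list = new StringBuilder();
          while (sources.hasNext()) {
            list.append(sources.next().toString()).append(' ');
          }
          output.collect(target, new Text(list.toString().trim()));
        }
      }
    }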


Inverted Index

  • For each word, create the list of documents (document IDs) containing it
  • Same as the reverse Web-link graph problem
    – “Source URL” is now “document ID”
    – “Target URL” is now “word”
  • Can augment this to create a list of (document ID, position) pairs for each word
    – Map emits (word, (document ID, position)) while parsing a document
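The position-augmented Map might look as follows (old mapred API, naive whitespace tokenizer, input records assumed to be (document ID, document text) pairs); the Reduce side is the same list-building pattern as in the reverse Web-link graph:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class IndexMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

      public void map(Text docId, Text doc,
                      OutputCollector<Text, Text> output,
                      Reporter reporter) throws IOException {
        String[] words = doc.toString().split("\\s+");
        for (int position = 0; position < words.length; position++) {
          // Emit (word, (document ID, position)) while parsing the document.
          output.collect(new Text(words[position]),
                         new Text(docId + "," + position));
        }
      }
    }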



Distributed Sorting

  • Can Map do the pre-sorting and Reduce the merging?
    – Use a set of input records as Map input
    – Map pre-sorts it and a single reducer merges the runs
    – Does not scale!
  • We need to get multiple reducers involved
    – What should we use as the intermediate key?

Distributed Sorting, Revisited

  • Quicksort-style partitioning
  • For simplicity, consider the case with 2 machines
    – Goal: each machine sorts about half of the data
  • Assuming we can find the median record, assign all smaller records to machine 1, all others to machine 2
  • Sort locally on each machine, then “concatenate” the output

Partitioning Sort in MapReduce

  • Consider 2 reducers for simplicity
  • Run a MapReduce job to find the approximate median of the data
    – Hadoop also offers InputSampler
      • Writes the keys that define the partitions, to be used by TotalOrderPartitioner
      • Runs on the client and downloads input data splits, hence only useful if data is sampled from few splits, i.e., the splits themselves should contain random data samples
  • Map outputs (sortKey, record) for an input record
  • All records with sortKey < median are assigned to reduce task 1, all others to reduce task 2, using a partitioner
  • Reduce sorts its assigned set of records

Partitioning Sort in MapReduce

  • MapReduce has class Partitioner<KEY, VALUE>
    – Method int getPartition(KEY key, VALUE value, int numPartitions) allows assigning keys to partitions
  • Example for numPartitions = 2
    – Partition 1 gets all numbers less than the median
    – Partition 2 gets all larger numbers
  • What about concatenating the output?
    – Not necessary, except for many small files (big files are broken up anyway)
  • Generalizes obviously to more reducers
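A sketch of such a partitioner for numPartitions = 2 in the old mapred API; the median is assumed to arrive through a made-up configuration key "sort.median", e.g., set by the sampling job:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class MedianPartitioner
        implements Partitioner<LongWritable, Writable> {

      private long median;

      public void configure(JobConf job) {
        // "sort.median" is an illustrative config key, not a Hadoop one.
        median = job.getLong("sort.median", 0);
      }

      public int getPartition(LongWritable key, Writable value,
                              int numPartitions) {
        // Keys below the median go to partition 0 (the slide's
        // "partition 1"), the rest to partition 1.
        return (key.get() < median) ? 0 : 1;
      }
    }

The job installs it with jobConf.setPartitionerClass(MedianPartitioner.class).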


MapReduce and Key Sorting

  • The MapReduce environment guarantees that for each reduce task the assigned set of intermediate keys is processed in key order
    – After receiving all (key2, val2) pairs from the mappers, a reducer sorts them by key2, then calls Reduce on each (key2, list(val2)) group in order
  • Can leverage this guarantee for the partitioning sort
    – Reduce simply emits the records unchanged
    – No need for user sort code in the Reduce function!

    package org.apache.hadoop.examples;

    import java.io.IOException;
    import java.net.URI;
    import java.util.*;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;
    import org.apache.hadoop.mapred.lib.InputSampler;
    import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    /**
     * This is the trivial map/reduce program that does absolutely nothing
     * other than use the framework to fragment and sort the input values.
     *
     * To run: bin/hadoop jar build/hadoop-examples.jar sort
     *            [-m <i>maps</i>] [-r <i>reduces</i>]
     *            [-inFormat <i>input format class</i>]
     *            [-outFormat <i>output format class</i>]
     *            [-outKey <i>output key class</i>]
     *            [-outValue <i>output value class</i>]
     *            [-totalOrder <i>pcnt</i> <i>num samples</i> <i>max splits</i>]
     *            <i>in-dir</i> <i>out-dir</i>
     */
    public class Sort<K,V> extends Configured implements Tool {
      private RunningJob jobResult = null;

      static int printUsage() {
        System.out.println("sort [-m <maps>] [-r <reduces>] " +
                           "[-inFormat <input format class>] " +
                           "[-outFormat <output format class>] " +
                           "[-outKey <output key class>] " +
                           "[-outValue <output value class>] " +
                           "[-totalOrder <pcnt> <num samples> <max splits>] " +
                           "<input> <output>");
        ToolRunner.printGenericCommandUsage(System.out);
        return -1;
      }

Sort Code in Hadoop 1.0.3 Distribution; part 1: boilerplate code


Sort Code in Hadoop 1.0.3 Distribution; part 2: Map and Reduce definition

      /**
       * The main driver for sort program.
       * Invoke this method to submit the map/reduce job.
       * @throws IOException When there is communication problems with the
       *                     job tracker.
       */
      public int run(String[] args) throws Exception {
        JobConf jobConf = new JobConf(getConf(), Sort.class);
        jobConf.setJobName("sorter");

        jobConf.setMapperClass(IdentityMapper.class);
        jobConf.setReducerClass(IdentityReducer.class);

        JobClient client = new JobClient(jobConf);
        ClusterStatus cluster = client.getClusterStatus();
        int num_reduces = (int) (cluster.getMaxReduceTasks() * 0.9);
        String sort_reduces = jobConf.get("test.sort.reduces_per_host");
        if (sort_reduces != null) {
          num_reduces = cluster.getTaskTrackers() *
                          Integer.parseInt(sort_reduces);
        }
        Class<? extends InputFormat> inputFormatClass =
          SequenceFileInputFormat.class;
        Class<? extends OutputFormat> outputFormatClass =
          SequenceFileOutputFormat.class;
        Class<? extends WritableComparable> outputKeyClass =
          BytesWritable.class;
        Class<? extends Writable> outputValueClass = BytesWritable.class;
        List<String> otherArgs = new ArrayList<String>();
        InputSampler.Sampler<K,V> sampler = null;


Sort Code in Hadoop 1.0.3 Distribution; part 3: more boilerplate code

        for(int i=0; i < args.length; ++i) {
          try {
            if ("-m".equals(args[i])) {
              jobConf.setNumMapTasks(Integer.parseInt(args[++i]));
            } else if ("-r".equals(args[i])) {
              num_reduces = Integer.parseInt(args[++i]);
            } else if ("-inFormat".equals(args[i])) {
              inputFormatClass =
                Class.forName(args[++i]).asSubclass(InputFormat.class);
            } else if ("-outFormat".equals(args[i])) {
              outputFormatClass =
                Class.forName(args[++i]).asSubclass(OutputFormat.class);
            } else if ("-outKey".equals(args[i])) {
              outputKeyClass =
                Class.forName(args[++i]).asSubclass(WritableComparable.class);
            } else if ("-outValue".equals(args[i])) {
              outputValueClass =
                Class.forName(args[++i]).asSubclass(Writable.class);
            } else if ("-totalOrder".equals(args[i])) {
              double pcnt = Double.parseDouble(args[++i]);
              int numSamples = Integer.parseInt(args[++i]);
              int maxSplits = Integer.parseInt(args[++i]);
              if (0 >= maxSplits) maxSplits = Integer.MAX_VALUE;
              sampler =
                new InputSampler.RandomSampler<K,V>(pcnt, numSamples, maxSplits);
            } else {
              otherArgs.add(args[i]);
            }
          } catch (NumberFormatException except) {
            System.out.println("ERROR: Integer expected instead of " + args[i]);
            return printUsage();
          } catch (ArrayIndexOutOfBoundsException except) {
            System.out.println("ERROR: Required parameter missing from " +
                args[i-1]);
            return printUsage(); // exits
          }
        }

Sort Code in Hadoop 1.0.3 Distribution; part 4: job settings, sampling to get pivot elements, run job command

        // Set user-supplied (possibly default) job configs
        jobConf.setNumReduceTasks(num_reduces);

        jobConf.setInputFormat(inputFormatClass);
        jobConf.setOutputFormat(outputFormatClass);

        jobConf.setOutputKeyClass(outputKeyClass);
        jobConf.setOutputValueClass(outputValueClass);

        // Make sure there are exactly 2 parameters left.
        if (otherArgs.size() != 2) {
          System.out.println("ERROR: Wrong number of parameters: " +
              otherArgs.size() + " instead of 2.");
          return printUsage();
        }
        FileInputFormat.setInputPaths(jobConf, otherArgs.get(0));
        FileOutputFormat.setOutputPath(jobConf, new Path(otherArgs.get(1)));

        if (sampler != null) {
          System.out.println("Sampling input to effect total-order sort...");
          jobConf.setPartitionerClass(TotalOrderPartitioner.class);
          Path inputDir = FileInputFormat.getInputPaths(jobConf)[0];
          inputDir = inputDir.makeQualified(inputDir.getFileSystem(jobConf));
          Path partitionFile = new Path(inputDir, "_sortPartitioning");
          TotalOrderPartitioner.setPartitionFile(jobConf, partitionFile);
          InputSampler.<K,V>writePartitionFile(jobConf, sampler);
          URI partitionUri = new URI(partitionFile.toString() +
                                     "#" + "_sortPartitioning");
          DistributedCache.addCacheFile(partitionUri, jobConf);
          DistributedCache.createSymlink(jobConf);
        }

        System.out.println("Running on " +
            cluster.getTaskTrackers() +
            " nodes to sort from " +
            FileInputFormat.getInputPaths(jobConf)[0] + " into " +
            FileOutputFormat.getOutputPath(jobConf) +
            " with " + num_reduces + " reduces.");
        Date startTime = new Date();
        System.out.println("Job started: " + startTime);
        jobResult = JobClient.runJob(jobConf);
        Date end_time = new Date();
        System.out.println("Job ended: " + end_time);
        System.out.println("The job took " +
            (end_time.getTime() - startTime.getTime()) / 1000 +
            " seconds.");
        return 0;
      }

Sort Code in Hadoop 1.0.3 Distribution; part 5: main function

      public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new Sort(), args);
        System.exit(res);
      }

      /**
       * Get the last job that was run using this instance.
       * @return the results of the last job that was run
       */
      public RunningJob getResult() {
        return jobResult;
      }
    }


Google Paper Experiments

  • 1800-machine cluster
    – 2 GHz Xeon, 4 GB memory, two 160 GB IDE disks, gigabit Ethernet link
    – Less than 1 msec round-trip time
  • Grep workload
    – Scan 10^10 100-byte records (~1 TB), searching for a rare 3-character pattern that occurs in 92,337 records
    – M = 15,000 (64 MB splits: 1 TB / 64 MB ≈ 15,000), R = 1

Grep Progress Over Time

  • Rate at which input is scanned rises as more mappers are added
  • Rate drops as tasks finish; done after 80 sec
  • About 1 min of startup overhead beforehand
    – Propagation of the program to the workers
    – Delays due to the distributed file system for opening input files and getting the information needed for the locality optimization


Sort

  • Sort 10^10 100-byte records (~1 TB of data)
  • Less than 50 lines of user code
  • M = 15,000 (64 MB splits), R = 4,000
  • Uses key distribution information for intelligent partitioning
  • Entire computation takes 891 sec
    – 1283 sec without the backup task optimization (a few slow machines delay completion)
    – 933 sec if 200 out of 1746 workers are killed several minutes into the computation

MapReduce at Google (2004)

  • Machine learning algorithms, clustering
  • Data extraction for reports of popular queries
  • Extraction of page properties, e.g., geographical location
  • Graph computations
  • Google indexing system for Web search (>20 TB of data)
    – Sequence of 5–10 MapReduce operations
    – Smaller, simpler code: from 3800 LOC to 700 LOC for one computation phase
    – Easier to change code
    – Easier to operate, because the MapReduce library takes care of failures
    – Easy to improve performance by adding more machines

Summary

  • Programming model that hides the details of parallelization, fault tolerance, locality optimization, and load balancing
  • Simple model, but fits many common problems
    – User writes Map and Reduce functions
    – Can also provide combine and partition functions
  • Implementation on a cluster scales to 1000s of machines
  • An open-source implementation, Hadoop, is available

MapReduce relies heavily on the underlying distributed file system. Let’s take a closer look to see how it works.

The Distributed File System

  • Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October 2003

Motivation

  • The abstraction of a single global file system greatly simplifies programming in MapReduce
  • A MapReduce job just reads from a file and writes output back to a file (or multiple files)
  • Frees the programmer from worrying about messy details
    – How many chunks to create and where to store them
    – Replicating chunks and dealing with failures
    – Coordinating concurrent file access at a low level
    – Keeping track of the chunks


Google File System (GFS)

  • GFS in 2003: 1000s of storage nodes, 300 TB of disk space, heavily accessed by 100s of clients
  • Goals: performance, scalability, reliability, availability
  • Differences compared to other file systems
    – Frequent component failures
    – Huge files (multi-GB or even TB common)
    – Workload properties
  • Design the system to make important operations efficient

Data and Workload Properties

  • Modest number of large files
    – Few million files, most 100 MB+
    – Manage multi-GB files efficiently
  • Reads: large streaming (1 MB+) or small random (a few KB)
  • Many large sequential append writes, few small writes at arbitrary positions
  • Concurrent append operations
    – E.g., producer–consumer queues or many-way merging
  • High sustained bandwidth more important than low latency
    – Bulk data processing

File System Interface

  • Like a typical file system interface
    – Files organized in directories
    – Operations: create, delete, open, close, read, write
  • Special operations
    – Snapshot: creates a copy of a file or directory tree at low cost
    – Record append: concurrent append guaranteeing atomicity of each individual client’s append

Architecture Overview

  • 1 master, multiple chunkservers, many clients
    – All are commodity Linux machines
  • Files divided into fixed-size chunks
    – Stored on chunkservers’ local disks as Linux files
    – Replicated on multiple chunkservers
  • Master maintains all file system metadata: namespace, access control info, mapping from files to chunks, chunk locations

Why a Single Master?

  • Simplifies the design
  • Master can make decisions with global knowledge
  • Potential problems:
    – Can become a bottleneck
      • Avoid file reads and writes through the master
    – Single point of failure
      • Ensure quick recovery

High-Level Functionality

  • Master controls system-wide activities like chunk lease management, garbage collection, chunk migration
  • Master communicates with chunkservers through HeartBeat messages to give instructions and collect state
  • Clients get metadata from the master, but access files directly through the chunkservers
  • No GFS-level file caching
    – Little benefit for streaming access or large working sets
    – No cache coherence issues
    – On the chunkserver, standard Linux file caching is sufficient


Read Operation

  • Client: from (file name, byte offset), compute the chunk index, then get the chunk locations from the master
    – Client buffers the location info for some time
  • Client requests the data from a nearby chunkserver
    – Future requests use the cached location info
  • Optimization: batch requests for multiple chunks into a single request
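GFS itself is not open source, so the following is only a hypothetical sketch (illustrative names) of the client-side offset-to-chunk-index computation described above:

    public class ChunkIndex {
        static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB chunks

        // Translate a byte offset within a file into a chunk index;
        // the client then asks the master for that chunk's locations.
        static long chunkIndexFor(long byteOffset) {
            return byteOffset / CHUNK_SIZE;
        }

        public static void main(String[] args) {
            // Reading at offset 200 MB falls into chunk 3 (0-based).
            System.out.println(chunkIndexFor(200L * 1024 * 1024));
        }
    }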


Chunk Size

  • 64 MB, stored as a Linux file on a chunkserver
  • Advantages of a large chunk size
    – Fewer interactions with the master (recall: large sequential reads and writes)
    – Smaller chunk location information
      • Smaller metadata at the master, might even fit in main memory
      • Can be cached at the client even for TB-size working sets
  • Disadvantage: fewer chunks => fewer options for load balancing
    – Fixable with a higher replication factor
    – Address hotspots by letting clients read from other clients

Practical Considerations

  • Number of chunks is limited by the master’s memory size (see the back-of-the-envelope check below)
    – Only 64 bytes of metadata per 64 MB chunk; most chunks are full
    – Less than 64 bytes of namespace data per file
  • Chunk location information at the master is not persistent
    – Master polls the chunkservers at startup, then keeps the info up to date because it controls chunk placement
    – Eliminates the problem of keeping master and chunkservers in sync (frequent chunkserver failures, restarts)
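Back-of-the-envelope check (illustrative data size): 1 PB of data in 64 MB chunks is about 16.8 million chunks; at 64 bytes each, that is only about 1 GB of chunk metadata, which fits comfortably in the master’s main memory.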


Consistency Model

  • GFS uses a relaxed consistency model
  • File namespace updates are atomic (e.g., file creation)
    – Handled exclusively by the master, using locking
    – The operations log defines a global total order
  • State of a file region after an update
    – Consistent: all clients will always see the same data, regardless of which chunk replica they access
    – Defined: consistent, and reflecting the entire update

Relaxed Consistency

  • GFS guarantees that after a sequence of successful updates, the updated file region is defined and contains the data of the last update
    – Applies updates to all chunk replicas in the same order
    – Uses chunk version numbers to detect stale replicas (where the chunkserver was down during an update)
  • Stale replicas are never involved in an update or given to clients asking the master for chunk locations
  • But a client might read from a stale replica when it uses cached chunk location data
    – Not all clients read the same data
    – Can address this problem for append-only updates

Leases, Update Order

  • Leases are used for a consistent update order across replicas
    – Master grants a lease to one replica (the primary)
    – Primary picks a serial update order
    – Other replicas follow this order
  • A lease has an initial timeout of 60 sec, but the primary can request extensions from the master
    – Piggybacked on HeartBeat messages
    – Master can revoke a lease (e.g., to rename a file)
    – If there is no communication with the primary, the master grants a new lease after the old one expires


Updating a Chunk

  1. Client asks the master: who has the lease for the chunk?
  2. Master replies with the identity of the primary and the secondary replicas
  3. Client pushes the data to all replicas
  4. After receiving all acks, the client sends the write request to the primary, which assigns it a serial number
  5. Primary forwards the write request to all other replicas
  6. Secondaries ack update success
  7. Primary replies to the client
     – Also reports errors
     – Client retries steps 3–7 on error
  • Large writes are broken down into chunks

Data Flow

  • Decoupled from the control flow for efficient network use
  • Data is pipelined linearly along a chain of chunkservers
    – Full outbound bandwidth for the fastest transfer (instead of dividing it in a non-linear topology)
    – Avoids network bottlenecks by forwarding to the “next closest” destination machine
    – Minimizes latency: once a chunkserver receives data, it starts forwarding immediately
  • Switched network with full-duplex links
    – Sending does not reduce the receive rate
  • 1 MB distributable in 80 msec (see the estimate below)
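Rough estimate, following the GFS paper: pushing B bytes through a chain of R replicas ideally takes B/T + RL, where T is the per-link throughput and L the per-hop latency. With 100 Mbps links, 1 MB = 8 Mbit needs 8 Mbit / 100 Mbps = 80 msec, and L is well below 1 msec per hop.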


Namespace Management

  • Want to support concurrent master operations
  • Solution: locks on regions of the namespace for proper serialization
    – Read-write lock for each node in the namespace tree
  • Operations lock all nodes on the path to the accessed node
    – For an operation on /d1/d2/leaf, acquire read locks on /d1 and /d1/d2, and the appropriate read or write lock on /d1/d2/leaf
  • File creation: read lock on the parent directory
    – Concurrent updates in the same directory are possible, e.g., multiple file creations
    – Locks are acquired in a consistent total order to prevent deadlocks
      • First ordered by level in the namespace tree, then lexicographically within the same level
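An illustrative sketch (not GFS code) of per-node read-write locking along a path, with Java's ReentrantReadWriteLock standing in for GFS's internal locks; ordering across multiple paths and lock release are omitted for brevity:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class NamespaceLocks {
        private final Map<String, ReentrantReadWriteLock> locks =
            new ConcurrentHashMap<>();

        private ReentrantReadWriteLock lockFor(String path) {
            return locks.computeIfAbsent(path,
                                         p -> new ReentrantReadWriteLock());
        }

        // For an operation on /d1/d2/leaf: read-lock /d1 and /d1/d2,
        // then write-lock the leaf itself. Locking root-to-leaf gives
        // a consistent order along any single path.
        public void lockForWrite(String[] pathComponents) {
            StringBuilder prefix = new StringBuilder();
            for (int i = 0; i < pathComponents.length; i++) {
                prefix.append('/').append(pathComponents[i]);
                if (i < pathComponents.length - 1) {
                    lockFor(prefix.toString()).readLock().lock();  // ancestor
                } else {
                    lockFor(prefix.toString()).writeLock().lock(); // leaf
                }
            }
        }
    }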


Replica Placement

  • Goals: scalability, reliability, availability
  • A difficult problem
    – 100s of chunkservers spread across many machine racks, accessed from 100s of clients in the same or different racks
    – Communication may cross one or more network switches
    – Bandwidth into or out of a rack may be less than the aggregate bandwidth of all the machines within the rack
  • Spread replicas across racks
    – Good: fault tolerance; reads benefit from the aggregate bandwidth of multiple racks
    – Bad: writes flow through multiple racks
  • Master can move replicas or create/delete them to react to system changes and failures

Lazy Garbage Collection

  • File deletion is immediately logged by the master, but the file is only renamed to a hidden name
    – Removed later during a regular scan of the file system namespace
    – This batch-style process amortizes the cost and runs when master load is low
  • Orphaned chunks are identified during a regular scan of the chunk namespace
  • Chunkservers report their chunks to the master in HeartBeat messages
  • Master replies with the identities of chunks it does not know
    – The chunkserver can delete them
  • Simple and reliable: lost deletion messages (from the master) and failures during chunk creation are no problem
  • Disadvantage: difficult to fine-tune space usage when storage is tight, e.g., after frequent creation/deletion of temp files
    – Solution: use different policies in different parts of the namespace

Stale Replicas

  • Occur when a chunkserver misses updates while it is down
  • Master maintains a chunk version number
    – Before granting a new lease on a chunk, the master increases its version number
    – It informs all up-to-date replicas of the new number
  • Master and replicas keep the version number in persistent state
    – This happens before the client is notified and hence before it can start updating the chunk
  • When chunkservers report their chunks, they include the version numbers
    – Older than on the master: garbage collect the replica
    – Newer than on the master: the master must have failed after granting the lease; the master takes the higher version to be up to date
  • Master also includes the version number in replies to the client and chunkserver during update-related communication


Achieving High Availability

  • Master and chunkservers can restore their state and start in seconds
  • Chunk replication
  • Master replication, i.e., of the operation log and checkpoints
  • But: only one master process
    – Can restart almost immediately
    – Permanent failure: monitoring infrastructure outside GFS starts a new master with the replicated operation log (clients use a DNS alias)
  • Shadow masters for read-only access
    – May lag behind the primary by a fraction of a second

Experiments

  • Chunkserver metadata is mostly checksums for 64 KB blocks
    – Individual servers have 50–100 MB of metadata
    – Reading this from disk during recovery is fast

Results

  • Clusters had been up for 1 week at the time of measurement
  • Cluster A’s network configuration has a max read rate of 750 MB/s
    – Actually reached a sustained rate of 580 MB/s
  • Cluster B’s peak rate is 1300 MB/s, but applications never used more than 380 MB/s
  • Master was not a bottleneck, despite the large number of operations sent to it

Summary

  • GFS supports large-scale data processing workloads on commodity hardware
  • Component failures are treated as the norm, not the exception
    – Constant monitoring, replication of crucial data
    – Relaxed consistency model
    – Fast, automatic recovery
  • Optimized for huge files, appends, large sequential reads
  • High aggregate throughput for concurrent readers and writers
    – Separation of file system control (through the master) from data transfer (between chunkservers and clients)