Basic MapReduce Program Design
Basic MapReduce Program Design
• Tasks that can be performed independently on a data object, large number of them: Map
• Tasks that require combining of multiple input objects: Reduce
• Sometimes it is easier to start program design with Map, sometimes with Reduce
• Select keys and values such that the right objects end up together in the same Reduce invocation
• Might have to partition a complex task into multiple MapReduce sub-tasks

Basic MapReduce Program Design (cont.)
• Programming model might appear very limited
• But, map and reduce can do anything with their data
  – Could implement a Turing machine inside…
  – …which could compute anything, but…
  – …would not result in a good parallel implementation
• Challenge: find the best MapReduce implementation for a given problem

Choosing M and R
• M = number of map tasks, R = number of reduce tasks
• Larger M, R: creates smaller tasks, enabling easier load balancing and faster recovery (many small tasks from a failed machine)
• Limitation: O(M+R) scheduling decisions and O(M·R) in-memory state at the master
  – Very small tasks are not worth the startup cost
• Recommendation:
  – Choose M so that the split size is approximately 64 MB
  – Choose R a small multiple of the number of workers; alternatively, choose R a little smaller than #workers to finish the reduce phase in one “wave”

Grep
• Find all lines matching some pattern
• No need to combine anything
  – Reduce is not needed, i.e., just the identity function
• Map takes a line and outputs it if it matches the pattern
• Map could also take an entire document and emit all matching lines
  – Not a good idea if there is a single large document, but works well if there are many documents

Reverse Web-Link Graph
• For each URL, find all pages (URLs) pointing to it (incoming links)
• Problem: a Web page has only outgoing links
• Need all (anySource, P) links for each page P
  – Suggests Reduce with P as the key and the source as the value
• Map: for page source, create a (target, source) pair for each link to a target found in the page
• Reduce: since target is the key, it will receive all sources pointing to that target

Inverted Index
• For each word, create the list of documents (document IDs) containing it
• Same as the reverse Web-link graph problem
  – “Source URL” is now “document ID”
  – “Target URL” is now “word”
• Can augment this to create a list of (document ID, position) pairs for each word
  – Map emits (word, (document ID, position)) while parsing a document
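The inverted-index pattern above can be sketched in plain Java; this simulates the map phase (emit one (word, docId) pair per word) and the shuffle/reduce grouping with in-memory collections. The class and method names are illustrative, not from any Hadoop distribution:

```java
import java.util.*;

// Simulated inverted-index job: map emits (word, docId) pairs,
// the "shuffle" groups them by word, reduce collects the doc list.
public class InvertedIndex {

    // Map phase: for each document, emit one (word, docId) pair per word.
    static List<Map.Entry<String, Integer>> map(int docId, String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            pairs.add(Map.entry(word, docId));
        }
        return pairs;
    }

    // Shuffle + reduce: group document IDs by word, deduplicated and sorted.
    static Map<String, List<Integer>> build(List<String> docs) {
        Map<String, SortedSet<Integer>> grouped = new TreeMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (Map.Entry<String, Integer> pair : map(id, docs.get(id))) {
                grouped.computeIfAbsent(pair.getKey(), k -> new TreeSet<>())
                       .add(pair.getValue());
            }
        }
        Map<String, List<Integer>> index = new TreeMap<>();
        grouped.forEach((word, ids) -> index.put(word, new ArrayList<>(ids)));
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("the quick fox", "the lazy dog");
        System.out.println(build(docs));
        // {dog=[1], fox=[0], lazy=[1], quick=[0], the=[0, 1]}
    }
}
```

Note how the key choice does the work, exactly as the slides say: emitting the word as the intermediate key is what makes all document IDs for that word "end up together in the same Reduce invocation".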

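The grep task described earlier needs no reduce logic at all; a minimal in-memory sketch of the map-only behavior (the class and method names are illustrative):

```java
import java.util.*;
import java.util.regex.Pattern;

// Map-only grep: each "map" call takes one line and emits it iff it
// matches the pattern; reduce is the identity and can be omitted.
public class Grep {

    static List<String> grep(List<String> lines, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (p.matcher(line).find()) { // map emits only matching lines
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("error: disk full", "ok", "error: timeout");
        System.out.println(grep(lines, "error"));
        // [error: disk full, error: timeout]
    }
}
```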
Distributed Sorting
• Can Map do the pre-sorting and Reduce the merging?
  – Use the set of input records as Map input
  – Map pre-sorts it and a single reducer merges the sorted runs
  – Does not scale!
• We need to get multiple reducers involved
  – What should we use as the intermediate key?

Distributed Sorting, Revisited
• Quicksort-style partitioning
• For simplicity, consider the case with 2 machines
  – Goal: each machine sorts about half of the data
• Assuming we can find the median record, assign all smaller records to machine 1, all others to machine 2
• Sort locally on each machine, then “concatenate” the output

Partitioning Sort in MapReduce
• Consider 2 reducers for simplicity
• Run a MapReduce job to find the approximate median of the data
  – Hadoop also offers InputSampler
    • Writes the keys that define the partitions, to be used by TotalOrderPartitioner
    • Runs on the client and downloads input data splits, hence only useful if data is sampled from few splits, i.e., the splits themselves should contain random data samples
• Map outputs (sortKey, record) for an input record
• All sortKey < median are assigned to reduce task 1, all others to reduce task 2, using a partitioner
• Reduce sorts its assigned set of records
• Generalizes naturally to more reducers

Partitioning Sort in MapReduce (cont.)
• MapReduce has class Partitioner<KEY, VALUE>
  – Method int getPartition(KEY key, VALUE value, int numPartitions) allows assigning keys to partitions
• Example for numPartitions = 2
  – Partition 1 gets all numbers less than the median
  – Partition 2 gets all larger numbers
• What about concatenating the output?
  – Not necessary, except for many small files (big files are broken up anyway)

MapReduce and Key Sorting
• The MapReduce environment guarantees that, for each reduce task, the assigned set of intermediate keys is processed in key order
  – After receiving all (key2, val2) pairs from the mappers, the reducer sorts them by key2, then calls Reduce on each (key2, list(val2)) group in order
• Can leverage this guarantee for the partitioning sort
  – Reduce simply emits the records unchanged
  – No need for user sort code in the Reduce function!

Sort Code in Hadoop 1.0.3 Distribution (Part 1: Boilerplate Code)

```java
package org.apache.hadoop.examples;

import java.io.IOException;
import java.net.URI;
import java.util.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * This is the trivial map/reduce program that does absolutely nothing
 * other than use the framework to fragment and sort the input values.
 *
 * To run: bin/hadoop jar build/hadoop-examples.jar sort
 *            [-m <i>maps</i>] [-r <i>reduces</i>]
 *            [-inFormat <i>input format class</i>]
 *            [-outFormat <i>output format class</i>]
 *            [-outKey <i>output key class</i>]
 *            [-outValue <i>output value class</i>]
 *            [-totalOrder <i>pcnt</i> <i>num samples</i> <i>max splits</i>]
 *            <i>in-dir</i> <i>out-dir</i>
 */
public class Sort<K,V> extends Configured implements Tool {
  private RunningJob jobResult = null;

  static int printUsage() {
    System.out.println("sort [-m <maps>] [-r <reduces>] " +
                       "[-inFormat <input format class>] " +
                       "[-outFormat <output format class>] " +
                       "[-outKey <output key class>] " +
                       "[-outValue <output value class>] " +
                       "[-totalOrder <pcnt> <num samples> <max splits>] " +
                       "<input> <output>");
    ToolRunner.printGenericCommandUsage(System.out);
    return -1;
  }
  // … continued in part 2
```
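The getPartition contract described above can be illustrated without Hadoop on the classpath. This standalone sketch mirrors the shape of Hadoop's getPartition(key, value, numPartitions), specialized to integer keys and the two-partition median split; the class name and the median value are made up for illustration:

```java
// Standalone sketch of the partitioner described above: keys below the
// (precomputed, approximate) median go to reduce task 0, all others to
// reduce task 1.
public class MedianPartitioner {
    private final int median;

    MedianPartitioner(int median) {
        this.median = median;
    }

    // Same shape as Hadoop's getPartition(key, value, numPartitions),
    // specialized to int keys and numPartitions == 2.
    int getPartition(int key, int numPartitions) {
        return key < median ? 0 : numPartitions - 1;
    }

    public static void main(String[] args) {
        MedianPartitioner p = new MedianPartitioner(50);
        System.out.println(p.getPartition(12, 2)); // 0
        System.out.println(p.getPartition(73, 2)); // 1
    }
}
```

Because the partition function is monotone in the key, the per-reducer outputs can simply be laid end to end to obtain a totally ordered result.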

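Putting the pieces together, the whole two-reducer partitioning sort can be simulated in memory: partition the records around the median, sort each "reduce task" locally, and concatenate. All names are illustrative, and the sketch computes the exact median for simplicity, whereas a real job would only sample the input to approximate it:

```java
import java.util.*;

// In-memory simulation of the two-reducer partitioning sort:
// records below the median go to reducer 0, the rest to reducer 1;
// each reducer sorts locally, and concatenation yields a global sort.
public class PartitionSort {

    static List<Integer> sort(List<Integer> records) {
        // Find the median (exact here; a real job samples to approximate it).
        List<Integer> sorted = new ArrayList<>(records);
        Collections.sort(sorted);
        int median = sorted.get(sorted.size() / 2);

        // "Partitioner": route each record to one of the two reduce tasks.
        List<Integer> reducer0 = new ArrayList<>(); // keys < median
        List<Integer> reducer1 = new ArrayList<>(); // keys >= median
        for (int r : records) {
            (r < median ? reducer0 : reducer1).add(r);
        }

        // Each reduce task sorts its assigned set of records.
        Collections.sort(reducer0);
        Collections.sort(reducer1);

        // "Concatenate" the reducer outputs: the result is globally sorted.
        List<Integer> out = new ArrayList<>(reducer0);
        out.addAll(reducer1);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sort(List.of(5, 3, 9, 1, 7)));
        // [1, 3, 5, 7, 9]
    }
}
```

In real Hadoop the two local Collections.sort calls would be unnecessary: as the key-sorting slide explains, the framework already delivers each reducer's keys in sorted order, so Reduce only has to emit its records unchanged.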