Data-Intensive Programming: MapReduce
Timo Aaltonen, Department of Pervasive Computing
Outline
- MapReduce
Original Map and Reduce
- map(foo, i)
– Apply function foo to every item of i and return a list of the results
– Example: map(square, [1, 2, 3, 4]) = [1, 4, 9, 16]
- reduce(bar, i)
– Apply two-argument function bar cumulatively to the items of i, from left to right, to reduce the values in i to a single value
– Example: reduce(sum, [1, 4, 9, 16]) = (((1+4)+9)+16) = 30
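The same idea in Java streams, as a minimal sketch (one concrete realization of these functional primitives; the class name is illustrative):

import java.util.List;

public class MapReduceOrigins {
    public static void main(String[] args) {
        // map: apply a function to every item and collect the results
        List<Integer> squares = List.of(1, 2, 3, 4).stream()
                .map(x -> x * x)
                .toList();                                   // [1, 4, 9, 16]

        // reduce: fold the items into a single value, left to right
        int sum = squares.stream().reduce(0, Integer::sum);  // 30

        System.out.println(squares + " -> " + sum);
    }
}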
MapReduce
- MapReduce is a programming model for distributed processing of large data sets
- Scales nearly linearly
– Twice as many nodes -> twice as fast
– Achieved by exploiting data locality: computation is moved close to the data
- Simple programming model
– The programmer only needs to write two functions: Map and Reduce
Map & Reduce
- Map maps input data to (key, value) pairs
- Reduce processes the list of values for a given key
- The MapReduce framework (such as Hadoop) takes care of the rest
– Distributes the job among nodes
– Moves data to and from the nodes
– Handles node failures
– etc.
[Figure: WordCount data flow. The input lines "Sheena is a punk rocker", "Sheena is a punk rocker now" and "Well she's a punk punk" are split across map nodes. MAP emits a (word, 1) pair for every word, SHUFFLE routes all pairs with the same key to the same reduce node, and REDUCE sums the counts per word, e.g. (Sheena, 2), (is, 2), (a, 3), (punk, 4).]
MapReduce
- Map(k1, v1) → list(k2, v2)
- Reduce(k2, list(v2)) → list(v3)
Map & Reduce in Hadoop
- In Hadoop, Map and Reduce functions can be written in
– Java (org.apache.hadoop.mapreduce.lib)
– C++, using Hadoop Pipes
– any language, using Hadoop Streaming
- There are also a number of third-party programming frameworks for Hadoop MapReduce
– For Java, Scala, Python, Ruby, PHP, …
Mapper Java example
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCount {

    public static class MyMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String txt = value.toString();
            String[] items = txt.split("\\s+");
            for (int i = 0; i < items.length; i++) {
                word.set(items[i]);
                context.write(word, one);
            }
        }
    }
– In Mapper<LongWritable, Text, Text, IntWritable>, the first two type parameters are the input key and value types and the last two are the output key and value types
- The Mapper input types depend on the defined InputFormat
– By default TextInputFormat
- Key (LongWritable): the byte offset of the line in the file
- Value (Text): the line itself
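As a sketch of changing the default, the driver can select another standard InputFormat, which changes the Mapper's input types accordingly (fragment; job is the Job object from the driver shown later):

import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

// Default is TextInputFormat: (LongWritable offset, Text line).
// KeyValueTextInputFormat splits each line at the first tab into
// (Text key, Text value), so the Mapper must declare Text, Text inputs.
job.setInputFormatClass(KeyValueTextInputFormat.class);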
Reducer Java example
    // requires import org.apache.hadoop.mapreduce.Reducer;
    public static class MyReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum = sum + value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
– The Reducer input key and value types must match the Mapper output types; the last two type parameters are the output key and value types
Job of the MapReduce example
- The main driver program configures the MapReduce job and submits it to the Hadoop YARN cluster:

public static void main(String[] args) throws Exception {
    …
    Job job = new Job();
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);
- The map and job output key and value classes are defined:

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
Job of the MapReduce example
- FileInputFormat feeds the input splits to map instances from the wc-files directory:

    FileInputFormat.addInputPath(job, new Path("wc-files"));
- The result is written to the wc-output directory:

    FileOutputFormat.setOutputPath(job, new Path("wc-output"));
- The driver waits until the Hadoop runtime environment has executed the job:

        job.waitForCompletion(true);
    } // main
} // class
MapReduce WordCount Demo
- Building the program:
% mkdir wordcount_classes
% javac -classpath `hadoop classpath` \
    -d wordcount_classes WordCount.java
% jar -cvf wordcount.jar -C wordcount_classes/ .
- Submit WordCount to YARN:
% hadoop jar wordcount.jar fi.tut.WordCount /data output
MapReduce WordCount Demo
- The result:
% hdfs dfs -ls output
Found 2 items
-rw-r--r--   1 hduser supergroup          0 2016-09-08 11:46 output/_SUCCESS
-rw-r--r--   1 hduser supergroup     470923 2016-09-08 11:46 output/part-r-00000
% hdfs dfs -get output/part-r-00000
% tail -10 part-r-00000
zoology,  2
zu        2
À         2
à         4
çela,     1
…
Fault Tolerance and Speculative Execution
- Faults are handled by restarting tasks
- All managed in the background
- No need to manage side effects or process state
- Speculative execution helps prevent bottlenecks: a slow task is speculatively re-run on another node, and the first copy to finish wins
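Speculative execution can also be tuned per job; a sketch using the standard Hadoop 2.x configuration keys (shown here against the WordCount driver's Job object):

// Speculative execution is on by default; these keys control it per job
job.getConfiguration().setBoolean("mapreduce.map.speculative", true);
job.getConfiguration().setBoolean("mapreduce.reduce.speculative", false);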
Combiners
[Figure: on map node 1, the Combiner merges the map output pairs (A, 1), (A, 1), (A, 1) into a single pair (A, 3) before it is sent to the reduce node for key A.]
Combiners
- A Combiner can "compress" data on a mapper node before sending it forward
- Combiner input/output types must equal the mapper output types
- In Hadoop Java, Combiners use the Reducer interface:

    job.setCombinerClass(MyReducer.class);
Reducer as a Combiner
- A Reducer can be used as a Combiner if it is commutative and associative
– E.g. max is:
    max(1, 2, max(3, 4, 5)) = max(max(2, 4), max(1, 5, 3)) = 5
- true for any order of function applications
– E.g. avg is not:
    avg(1, 2, avg(3, 4, 5)) = 2.3333 ≠ avg(avg(2, 4), avg(1, 5, 3)) = 3
- Note: if the Reducer is not commutative and associative, Combiners can still be used
– The Combiner just has to be different from the Reducer and designed for the specific case, as in the sketch below
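For example, averaging works with a Combiner that emits partial (sum, count) pairs, because partial sums are associative even though avg itself is not. A sketch (the class names and the "sum,count" encoding are illustrative, not from the lecture):

// Assumes the mapper emits Text values of the form "value,1"
public static class AvgCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0, count = 0;
        for (Text v : values) {                 // each value is "sum,count"
            String[] parts = v.toString().split(",");
            sum += Long.parseLong(parts[0]);
            count += Long.parseLong(parts[1]);
        }
        context.write(key, new Text(sum + "," + count));
    }
}

// The Reducer accumulates the same way but emits the final mean
public static class AvgReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0, count = 0;
        for (Text v : values) {
            String[] parts = v.toString().split(",");
            sum += Long.parseLong(parts[0]);
            count += Long.parseLong(parts[1]);
        }
        context.write(key, new DoubleWritable((double) sum / count));
    }
}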
Adding a Combiner to WordCount
[Figure: with a Combiner, the map output (walk, 1), (run, 1), (walk, 1) is combined into (walk, 2), (run, 1) on the map node before the shuffle.]
Hadoop Streaming
- Map and Reduce functions can be implemented in any language with the Hadoop Streaming API
- Input is read from standard input
- Output is written to standard output
- Input/output items are lines of the form key\tvalue
– \t is the tab character
- Reducer input lines are grouped by key
– One reducer instance may receive multiple keys, so the reducer must detect key changes itself (see the sketch below)
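Because one reducer instance may see several keys, a streaming reducer has to flush its running total whenever the key changes. A minimal WordCount reducer sketch in Java (any language works the same way; the class name is illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Input lines "word<TAB>count" arrive sorted by key on stdin;
// emit "word<TAB>total" on stdout every time the key changes.
public class StreamReducer {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String currentKey = null;
        long sum = 0;
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            if (currentKey != null && !currentKey.equals(parts[0])) {
                System.out.println(currentKey + "\t" + sum);  // key changed: flush
                sum = 0;
            }
            currentKey = parts[0];
            sum += Long.parseLong(parts[1].trim());
        }
        if (currentKey != null) {
            System.out.println(currentKey + "\t" + sum);      // flush the last key
        }
    }
}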
Run Hadoop Streaming
- Debug using Unix pipes:
cat sample.txt | ./mapper.py | sort | ./reducer.py
- On Hadoop:
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
    -input sample.txt \
    -output output \
    -mapper ./mapper.py \
    -reducer ./reducer.py
MapReduce Examples
- These are from https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
- Counting and Summing
– WordCount
- Filtering ("Grepping"), Parsing, and Validation
– Problem: there is a set of records, and it is required to collect all records that meet some condition, or to transform each record (independently of the other records) into another representation
– Solution: the Mapper takes records one by one and emits accepted items or their transformed versions, as in the sketch below
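A minimal sketch of such a filtering mapper (the "ERROR" condition and class name are made-up examples):

// Pass through only lines containing "ERROR"; NullWritable is used
// because only the record itself matters, not a computed value
public static class GrepMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().contains("ERROR")) {
            context.write(value, NullWritable.get());
        }
    }
}

A job like this can run map-only by setting job.setNumReduceTasks(0).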
MapReduce Examples
- Collating
– Problem Statement: there is a set of items and some function of one item. It is required to save all items that have the same function value into one file, or to perform some other computation that requires all such items to be processed as a group. The most typical example is building inverted indexes.
– Solution: the Mapper computes the given function for each item and emits the function value as the key and the item itself as the value. The Reducer obtains all items grouped by function value and processes or saves them. In the case of inverted indexes, the items are terms (words) and the function value is the ID of the document where the term was found.
Index
[Figure: inverted index example. Page A contains "This page contains text"; Page B contains "My page contains too". Map emits (term, page) pairs such as (This, A), (page, A), …, (too, B). Reduced output: This: A; page: A, B; contains: A, B; text: A; too: B.]
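A sketch of the inverted index from the figure (assuming FileInputFormat, so the document name can be read from the input split; duplicate document IDs are not removed, for brevity):

// Map: emit (term, document) for every term occurrence;
// FileSplit is org.apache.hadoop.mapreduce.lib.input.FileSplit
public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String doc = ((FileSplit) context.getInputSplit()).getPath().getName();
        for (String term : value.toString().split("\\s+")) {
            context.write(new Text(term), new Text(doc));
        }
    }
}

// Reduce: concatenate the document list for each term
public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder docs = new StringBuilder();
        for (Text v : values) {
            if (docs.length() > 0) docs.append(", ");
            docs.append(v.toString());
        }
        context.write(key, new Text(docs.toString()));
    }
}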
Conclusions
- Map tasks spread out the load (the logic)