Data-intensive programming: MapReduce, Timo Aaltonen, Department of Pervasive Computing (PowerPoint PPT Presentation)



SLIDE 1

Data-intensive programming MapReduce

Timo Aaltonen Department of Pervasive Computing

SLIDE 2

Outline

  • MapReduce
SLIDE 3

Original Map and Reduce

  • map(foo, i)

– Apply function foo to every item of i and return a list of the results

– Example: map(square, [1, 2, 3, 4]) = [1, 4, 9, 16]

  • reduce(bar, i)

– Apply two-argument function bar cumulatively to the items of i, from left to right, to reduce the values in i to a single value

– Example: reduce(sum, [1, 4, 9, 16]) = ((1+4)+9)+16 = 30
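The two calls above can be tried directly in Python 3, where map is a builtin and reduce lives in functools (square and add are stand-ins for the foo/bar placeholders, defined here):

```python
from functools import reduce

def square(x):
    return x * x

def add(a, b):
    # two-argument function, applied cumulatively by reduce
    return a + b

squares = list(map(square, [1, 2, 3, 4]))   # [1, 4, 9, 16]
total = reduce(add, squares)                # ((1 + 4) + 9) + 16 = 30
print(squares, total)
```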

SLIDE 4

MapReduce

  • MapReduce is a programming model for distributed processing of large data sets
  • Scales nearly linearly

– Twice as many nodes -> twice as fast
– Achieved by exploiting data locality

  • Computation is moved close to the data
  • Simple programming model

– Programmer only needs to write two functions: Map and Reduce

SLIDE 5

Map & Reduce

  • Map maps input data to (key, value) pairs
  • Reduce processes the list of values for a given key
  • The MapReduce framework (such as Hadoop) takes care of the rest

– Distributes the job among nodes
– Moves the data to and from the nodes
– Handles node failures
– etc.

SLIDE 6

Input (split over map nodes A-D): Sheena is a punk rocker Sheena is a punk rocker now Well she's a punk punk

MAP

Each word is emitted as a (word, 1) pair: (Sheena, 1), (is, 1), (a, 1), (punk, 1), (rocker, 1), …

SHUFFLE

Pairs with the same key are routed to the same reduce node (nodes 1-3)

REDUCE

The values of each key are summed: (Sheena, 2), (is, 2), (a, 3), (punk, 4), …
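The three phases can be simulated in plain Python (an illustrative sketch of the data flow, not Hadoop code):

```python
from itertools import groupby

text = "Sheena is a punk rocker Sheena is a punk rocker now Well she's a punk punk"

# MAP: emit a (word, 1) pair for every word
mapped = [(word, 1) for word in text.split()]

# SHUFFLE: sort so that equal keys become adjacent, then group by key
groups = groupby(sorted(mapped), key=lambda kv: kv[0])

# REDUCE: sum the values of each key
counts = {word: sum(v for _, v in pairs) for word, pairs in groups}
print(counts["punk"], counts["a"], counts["Sheena"])  # 4 3 2
```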

SLIDE 7

MapReduce

[Figure: the same WordCount data flow drawn pair by pair: MAP emits (Sheena, 1), (is, 1), (a, 1), (punk, 1), (rocker, 1), …; SHUFFLE groups equal keys together; REDUCE sums each group to (Sheena, 2), (is, 2), (a, 3), (punk, 4), …]

SLIDE 8

MapReduce

MAP: Map(k1, v1) → list(k2, v2)

REDUCE: Reduce(k2, list(v2)) → list(v3)

SLIDE 9

Map & Reduce in Hadoop

  • In Hadoop, Map and Reduce functions can be written in

– Java (org.apache.hadoop.mapreduce.lib)
– C++, using Hadoop Pipes
– any language, using Hadoop Streaming

  • There are also a number of third-party programming frameworks for Hadoop MapReduce

– For Java, Scala, Python, Ruby, PHP, …

SLIDE 10

Mapper Java example

public class WordCount {
  public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String txt = value.toString();
      String[] items = txt.split("\\s");
      for (int i = 0; i < items.length; i++) {
        Text word = new Text(items[i]);
        IntWritable one = new IntWritable(1);
        context.write(word, one);
      }
    }
  }

Input key and value types: LongWritable and Text; output key and value types: Text and IntWritable.

  • The Mapper input types depend on the defined InputFormat
  • By default TextInputFormat
  • Key (LongWritable): position in the file
  • Value (Text): the line
SLIDE 11

Reducer Java example

public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum = sum + value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

Input key and value types: Text and IntWritable; output key and value types: Text and IntWritable.

SLIDE 12

Job of the MapReduce example

  • The main driver program configures the MapReduce job and submits it to the Hadoop YARN cluster:

public static void main(String[] args) throws Exception {
  …
  Job job = new Job();
  job.setJarByClass(WordCount.class);
  job.setJobName("Word Count");
  job.setMapperClass(MyMapper.class);
  job.setReducerClass(MyReducer.class);

SLIDE 13

Output key and value types

  • The map output and final output key/value classes are defined:

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

SLIDE 14

Job of the MapReduce example

  • FileInputFormat feeds the input splits to map instances from the wc-files directory:

FileInputFormat.addInputPath(job, new Path("wc-files"));

  • The result is written to the wc-output directory:

FileOutputFormat.setOutputPath(job, new Path("wc-output"));

  • The driver waits until the Hadoop runtime environment has executed the job:

job.waitForCompletion(true);
} // main
} // class

SLIDE 15

MapReduce WordCount Demo

  • Building the program:

% mkdir wordcount_classes
% javac -classpath `hadoop classpath` \
    -d wordcount_classes WordCount.java
% jar -cvf wordcount.jar -C wordcount_classes/ .

  • Send WordCount to YARN:

% hadoop jar wordcount.jar fi.tut.WordCount /data output

SLIDE 16

MapReduce WordCount Demo

  • The result:

% hdfs dfs -ls output
Found 2 items
-rw-r--r--   1 hduser supergroup        0 2016-09-08 11:46 output/_SUCCESS
-rw-r--r--   1 hduser supergroup   470923 2016-09-08 11:46 output/part-r-00000
% hdfs dfs -get output/part-r-00000
% tail -10 part-r-00000
zoology, 2
zu      2
À       2
à       4
çela,   1
…

SLIDE 17

Fault Tolerance and Speculative Execution

  • Faults are handled by restarting tasks
  • All managed in the background
  • No need to manage side-effects or process state
  • Speculative execution helps prevent bottlenecks

SLIDE 18

Combiners

[Figure: on Map node 1 the mapper emits (A, 1), (A, 1), (A, 1), (B, 1); the Combiner compresses the three A pairs into (A, 3) before they are sent to the reduce node for key A]

SLIDE 19

Combiners

  • A Combiner can "compress" data on a mapper node before sending it forward
  • Combiner input/output types must equal the mapper output types
  • In Hadoop Java, Combiners use the Reducer interface:

job.setCombinerClass(MyReducer.class);

SLIDE 20

Reducer as a Combiner

  • A Reducer can be used as a Combiner if it is commutative and associative

– E.g. max is:

  • max(1, 2, max(3, 4, 5)) = max(max(2, 4), max(1, 5, 3)) = 5
  • true for any order of function applications

– E.g. avg is not:

  • avg(1, 2, avg(3, 4, 5)) = 2.33333 ≠ avg(avg(2, 4), avg(1, 5, 3)) = 3

  • Note: if the Reducer is not commutative and associative, Combiners can still be used

– The Combiner just has to be different from the Reducer and designed for the specific case
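The max/avg claim is easy to check numerically; a quick Python sketch:

```python
def avg(*xs):
    # plain arithmetic mean of the arguments
    return sum(xs) / len(xs)

# max is commutative and associative: any grouping of the inputs agrees
assert max(1, 2, max(3, 4, 5)) == max(max(2, 4), max(1, 5, 3)) == 5

# avg is not: averaging partial averages gives a different (wrong) result
partial = avg(1, 2, avg(3, 4, 5))        # 2.333...
grouped = avg(avg(2, 4), avg(1, 5, 3))   # 3.0
true_mean = avg(1, 2, 3, 4, 5)           # 3.0
print(partial, grouped, true_mean)
```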

SLIDE 21

Adding a Combiner to WordCount

[Figure: the mapper emits (walk, 1), (run, 1), (walk, 1); the Combiner merges these into (run, 1), (walk, 2) before the Shuffle phase]

SLIDE 22

Hadoop Streaming

  • Map and Reduce functions can be implemented in any language with the Hadoop Streaming API
  • Input is read from standard input
  • Output is written to standard output
  • Input/output items are lines of the form key\tvalue

– \t is the tabulator character

  • Reducer input lines are grouped by key

– One reducer instance may receive multiple keys
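A minimal sketch of what the mapper.py and reducer.py used on the next slide could contain for WordCount, written here as two generator functions over input lines (in the real scripts, lines would come from sys.stdin and results would go to print):

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit one "word<tab>1" line per word
    for line in lines:
        for word in line.split():
            yield word + "\t1"

def reducer(lines):
    # Reduce phase: input lines arrive sorted by key; sum each key's counts
    pairs = (line.split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word + "\t" + str(sum(int(v) for _, v in group))

# Mimic 'cat sample.txt | ./mapper.py | sort | ./reducer.py':
lines = ["Sheena is a punk rocker"]
print(list(reducer(sorted(mapper(lines)))))
```

The sort between the two functions plays the role of the shuffle, exactly as in the Unix-pipe debugging command on the next slide.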

SLIDE 23

Run Hadoop Streaming

  • Debug using Unix pipes:

cat sample.txt | ./mapper.py | sort | ./reducer.py

  • On Hadoop:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
    -input sample.txt \
    -output output \
    -mapper ./mapper.py \
    -reducer ./reducer.py
SLIDE 24

MapReduce Examples

  • These are from https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
  • Counting and Summing

– WordCount

  • Filtering ("Grepping"), Parsing, and Validation

– Problem: There is a set of records, and it is required to collect all records that meet some condition, or to transform each record (independently from other records) into another representation
– Solution: Mapper takes records one by one and emits accepted items or their transformed versions

SLIDE 25

MapReduce Examples

  • Collating

– Problem Statement: There is a set of items and some function of one item. It is required to save all items that have the same value of the function into one file, or to perform some other computation that requires all such items to be processed as a group. The most typical example is building of inverted indexes.
– Solution: Mapper computes a given function for each item and emits the value of the function as a key and the item itself as a value. Reducer obtains all items grouped by function value and processes or saves them. In the case of inverted indexes, items are terms (words) and the function is a document ID where the term was found.

SLIDE 26

Index

Page A: "This page contains text"; Page B: "My page contains too"

Mapper output: (This, A), (page, A), (contains, A), (text, A), (My, B), (page, B), (contains, B), (too, B)

Reduced output: This: A; page: A, B; contains: A, B; text: A; My: B; too: B
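The collating pattern behind this index can be sketched in plain Python (an illustration of the data flow, not Hadoop code; the page contents are the two example pages):

```python
from itertools import groupby

pages = {"A": "This page contains text", "B": "My page contains too"}

# MAP: emit a (term, page_id) pair for every term occurrence
mapped = [(term, pid) for pid, text in pages.items() for term in text.split()]

# SHUFFLE + REDUCE: group the pairs by term and collect the page ids
index = {term: sorted({pid for _, pid in group})
         for term, group in groupby(sorted(mapped), key=lambda kv: kv[0])}
print(index)
```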

SLIDE 27

Conclusions

  • Map tasks spread out the load (the logic)

– may have hundreds or millions of mappers
– fast for large amounts of data