Data-Intensive Programming: MapReduce
Timo Aaltonen, Department of Pervasive Computing
Outline
- MapReduce
Original Map and Reduce
- map(foo, i)
– Apply function foo to every item of i and return a list of the results
– Example: map(square, [1, 2, 3, 4]) = [1, 4, 9, 16]
- reduce(bar, i)
– Apply two-argument function bar cumulatively to the items of i, from left to right, to reduce the values in i to a single value
– Example: reduce(sum, [1, 4, 9, 16]) = (((1+4)+9)+16) = 30
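The same idea in Java streams, as a minimal sketch (one concrete realization of these functional primitives; the class name is illustrative):

import java.util.List;

public class MapReduceOrigins {
    public static void main(String[] args) {
        // map: apply a function to every item and collect the results
        List<Integer> squares = List.of(1, 2, 3, 4).stream()
                .map(x -> x * x)
                .toList();                                   // [1, 4, 9, 16]

        // reduce: fold the items into a single value, left to right
        int sum = squares.stream().reduce(0, Integer::sum);  // 30

        System.out.println(squares + " -> " + sum);
    }
}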
MapReduce
- MapReduce is a programming model for distributed processing of large data sets
- Scales nearly linearly
– Twice as many nodes -> twice as fast
– Achieved by exploiting data locality: computation is moved close to the data
- Simple programming model
– The programmer only needs to write two functions: Map and Reduce
Map & Reduce
- Map maps input data to (key, value) pairs
- Reduce processes the list of values for a given key
- The MapReduce framework (such as Hadoop) takes care of the rest
– Distributes the job among nodes
– Moves data to and from the nodes
– Handles node failures
– etc.
[Figure: WordCount data flow. The input lines "Sheena is a punk rocker", "Sheena is a punk rocker now" and "Well she's a punk punk" are split across map nodes. MAP emits a (word, 1) pair for every word, SHUFFLE routes all pairs with the same key to the same reduce node, and REDUCE sums the counts per word, e.g. (Sheena, 2), (is, 2), (a, 3), (punk, 4).]
MapReduce
- Map(k1, v1) → list(k2, v2)
- Reduce(k2, list(v2)) → list(v3)
Map & Reduce in Hadoop
- In Hadoop, Map and Reduce functions can be written in
– Java (org.apache.hadoop.mapreduce.lib)
– C++, using Hadoop Pipes
– any language, using Hadoop Streaming
- There are also a number of third-party programming frameworks for Hadoop MapReduce
– For Java, Scala, Python, Ruby, PHP, …
Mapper Java example
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCount {

    public static class MyMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String txt = value.toString();
            String[] items = txt.split("\\s+");
            for (int i = 0; i < items.length; i++) {
                word.set(items[i]);
                context.write(word, one);
            }
        }
    }
– In Mapper<LongWritable, Text, Text, IntWritable>, the first two type parameters are the input key and value types and the last two are the output key and value types
- The Mapper input types depend on the defined InputFormat
– By default TextInputFormat
- Key (LongWritable): the byte offset of the line in the file
- Value (Text): the line itself
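As a sketch of changing the default, the driver can select another standard InputFormat, which changes the Mapper's input types accordingly (fragment; job is the Job object from the driver shown later):

import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

// Default is TextInputFormat: (LongWritable offset, Text line).
// KeyValueTextInputFormat splits each line at the first tab into
// (Text key, Text value), so the Mapper must declare Text, Text inputs.
job.setInputFormatClass(KeyValueTextInputFormat.class);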
Reducer Java example
    // requires import org.apache.hadoop.mapreduce.Reducer;
    public static class MyReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum = sum + value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
– The Reducer input key and value types must match the Mapper output types; the last two type parameters are the output key and value types
Job of the MapReduce example
- The main driver program configures the MapReduce job and submits it to the Hadoop YARN cluster:

public static void main(String[] args) throws Exception {
    …
    Job job = new Job();
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);
- The map and job output key and value classes are defined:

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
Job of the MapReduce example
- FileInputFormat feeds the input splits to map instances from the wc-files directory:

    FileInputFormat.addInputPath(job, new Path("wc-files"));
- The result is written to the wc-output directory:

    FileOutputFormat.setOutputPath(job, new Path("wc-output"));
- The driver waits until the Hadoop runtime environment has executed the job:

        job.waitForCompletion(true);
    } // main
} // class
MapReduce WordCount Demo
- Building the program:
% mkdir wordcount_classes
% javac -classpath `hadoop classpath` \
    -d wordcount_classes WordCount.java
% jar -cvf wordcount.jar -C wordcount_classes/ .
- Submit WordCount to YARN:
% hadoop jar wordcount.jar fi.tut.WordCount /data output
MapReduce WordCount Demo
- The result:
% hdfs dfs -ls output
Found 2 items
-rw-r--r--   1 hduser supergroup          0 2016-09-08 11:46 output/_SUCCESS
-rw-r--r--   1 hduser supergroup     470923 2016-09-08 11:46 output/part-r-00000
% hdfs dfs -get output/part-r-00000
% tail -10 part-r-00000
zoology,  2
zu        2
À         2
à         4
çela,     1
…
Fault Tolerance and Speculative Execution
- Faults are handled by restarting tasks
- All managed in the background
- No need to manage side effects or process state
- Speculative execution helps prevent bottlenecks: a slow task is speculatively re-run on another node, and the first copy to finish wins
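Speculative execution can also be tuned per job; a sketch using the standard Hadoop 2.x configuration keys (shown here against the WordCount driver's Job object):

// Speculative execution is on by default; these keys control it per job
job.getConfiguration().setBoolean("mapreduce.map.speculative", true);
job.getConfiguration().setBoolean("mapreduce.reduce.speculative", false);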
Combiners
[Figure: on map node 1, the Combiner merges the map output pairs (A, 1), (A, 1), (A, 1) into a single pair (A, 3) before it is sent to the reduce node for key A.]
Combiners
- A Combiner can "compress" data on a mapper node before sending it forward
- Combiner input/output types must equal the mapper output types
- In Hadoop Java, Combiners use the Reducer interface:

    job.setCombinerClass(MyReducer.class);
Reducer as a Combiner
- A Reducer can be used as a Combiner if it is commutative and associative
– E.g. max is:
    max(1, 2, max(3, 4, 5)) = max(max(2, 4), max(1, 5, 3)) = 5
- true for any order of function applications
– E.g. avg is not:
    avg(1, 2, avg(3, 4, 5)) = 2.3333 ≠ avg(avg(2, 4), avg(1, 5, 3)) = 3
- Note: if the Reducer is not commutative and associative, Combiners can still be used
– The Combiner just has to be different from the Reducer and designed for the specific case, as in the sketch below
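For example, averaging works with a Combiner that emits partial (sum, count) pairs, because partial sums are associative even though avg itself is not. A sketch (the class names and the "sum,count" encoding are illustrative, not from the lecture):

// Assumes the mapper emits Text values of the form "value,1"
public static class AvgCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0, count = 0;
        for (Text v : values) {                 // each value is "sum,count"
            String[] parts = v.toString().split(",");
            sum += Long.parseLong(parts[0]);
            count += Long.parseLong(parts[1]);
        }
        context.write(key, new Text(sum + "," + count));
    }
}

// The Reducer accumulates the same way but emits the final mean
public static class AvgReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0, count = 0;
        for (Text v : values) {
            String[] parts = v.toString().split(",");
            sum += Long.parseLong(parts[0]);
            count += Long.parseLong(parts[1]);
        }
        context.write(key, new DoubleWritable((double) sum / count));
    }
}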
Adding a Combiner to WordCount
[Figure: with a Combiner, the map output (walk, 1), (run, 1), (walk, 1) is combined into (walk, 2), (run, 1) on the map node before the shuffle.]
Hadoop Streaming
- Map and Reduce functions can be implemented in any language with the Hadoop Streaming API
- Input is read from standard input
- Output is written to standard output
- Input/output items are lines of the form key\tvalue
– \t is the tab character
- Reducer input lines are grouped by key
– One reducer instance may receive multiple keys, so the reducer must detect key changes itself (see the sketch below)
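Because one reducer instance may see several keys, a streaming reducer has to flush its running total whenever the key changes. A minimal WordCount reducer sketch in Java (any language works the same way; the class name is illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Input lines "word<TAB>count" arrive sorted by key on stdin;
// emit "word<TAB>total" on stdout every time the key changes.
public class StreamReducer {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String currentKey = null;
        long sum = 0;
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            if (currentKey != null && !currentKey.equals(parts[0])) {
                System.out.println(currentKey + "\t" + sum);  // key changed: flush
                sum = 0;
            }
            currentKey = parts[0];
            sum += Long.parseLong(parts[1].trim());
        }
        if (currentKey != null) {
            System.out.println(currentKey + "\t" + sum);      // flush the last key
        }
    }
}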
Run Hadoop Streaming
- Debug using Unix pipes:
cat sample.txt | ./mapper.py | sort | ./reducer.py
- On Hadoop:
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
    -input sample.txt \
    -output output \
    -mapper ./mapper.py \
    -reducer ./reducer.py
MapReduce Examples
- These are from https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
- Counting and Summing
– WordCount
- Filtering ("Grepping"), Parsing, and Validation
– Problem: there is a set of records, and it is required to collect all records that meet some condition, or to transform each record (independently of the other records) into another representation
– Solution: the Mapper takes records one by one and emits accepted items or their transformed versions, as in the sketch below
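A minimal sketch of such a filtering mapper (the "ERROR" condition and class name are made-up examples):

// Pass through only lines containing "ERROR"; NullWritable is used
// because only the record itself matters, not a computed value
public static class GrepMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().contains("ERROR")) {
            context.write(value, NullWritable.get());
        }
    }
}

A job like this can run map-only by setting job.setNumReduceTasks(0).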
MapReduce Examples
- Collating
– Problem Statement: there is a set of items and some function of one item. It is required to save all items that have the same function value into one file, or to perform some other computation that requires all such items to be processed as a group. The most typical example is building inverted indexes.
– Solution: the Mapper computes the given function for each item and emits the function value as the key and the item itself as the value. The Reducer obtains all items grouped by function value and processes or saves them. In the case of inverted indexes, the items are terms (words) and the function value is the ID of the document where the term was found.
Index
[Figure: inverted index example. Page A contains "This page contains text"; Page B contains "My page contains too". Map emits (term, page) pairs such as (This, A), (page, A), …, (too, B). Reduced output: This: A; page: A, B; contains: A, B; text: A; too: B.]
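A sketch of the inverted index from the figure (assuming FileInputFormat, so the document name can be read from the input split; duplicate document IDs are not removed, for brevity):

// Map: emit (term, document) for every term occurrence;
// FileSplit is org.apache.hadoop.mapreduce.lib.input.FileSplit
public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String doc = ((FileSplit) context.getInputSplit()).getPath().getName();
        for (String term : value.toString().split("\\s+")) {
            context.write(new Text(term), new Text(doc));
        }
    }
}

// Reduce: concatenate the document list for each term
public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder docs = new StringBuilder();
        for (Text v : values) {
            if (docs.length() > 0) docs.append(", ");
            docs.append(v.toString());
        }
        context.write(key, new Text(docs.toString()));
    }
}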
Conclusions
- Map tasks spread out the load (the logic)