  1. Data-intensive programming: MapReduce. Timo Aaltonen, Department of Pervasive Computing

  2. Outline • MapReduce

  3. Original Map and Reduce • map(foo, i) – Apply function foo to every item of i and return a list of the results – Example: map(square, [1, 2, 3, 4]) = [1, 4, 9, 16] • reduce(bar, i) – Apply two-argument function bar cumulatively to the items of i, from left to right, to reduce the values in i to a single value – Example: reduce(sum, [1, 4, 9, 16]) = (((1+4)+9)+16) = 30
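
These primitives work like Python's built-in map and functools.reduce; a minimal runnable sketch of the two examples above (the helper names square and add are illustrative, not from the slides):

    from functools import reduce

    def square(x):
        return x * x

    def add(a, b):        # two-argument sum, applied cumulatively left to right
        return a + b

    squares = list(map(square, [1, 2, 3, 4]))   # [1, 4, 9, 16]
    total = reduce(add, squares)                # (((1 + 4) + 9) + 16) = 30
    print(squares, total)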

  4. MapReduce • MapReduce is a programming model for distributed processing of large data sets • Scales nearly linearly – Twice as many nodes -> twice as fast – Achieved by exploiting data locality • Computation is moved close to the data • Simple programming model – Programmer only needs to write two functions: Map and Reduce

  5. Map & Reduce • Map maps input data to (key, value) pairs • Reduce processes the list of values for a given key • The MapReduce framework (such as Hadoop) takes care of the rest – Distributes the job among nodes – Moves the data to and from the nodes – Handles node failures – etc.

  6. [Figure: WordCount data flow labeled MAP, SHUFFLE, REDUCE. Map nodes A–D each emit (word, 1) pairs for their lines of the lyrics "Sheena is a punk rocker …". The shuffle groups the pairs by word and routes each word to one reduce node, and reduce nodes 1–3 sum the counts per word, e.g. Sheena 21, is 23, a 31, punk 37.]

  7. MapReduce [Figure: the same MAP, SHUFFLE, REDUCE flow in condensed form, showing the (word, 1) pairs being grouped by key and summed into Sheena 21, is 23, a 31, punk 37.]

  8. MapReduce • Map(k1, v1) → list(k2, v2) • Reduce(k2, list(v2)) → list(v3) • In WordCount: k1 is a line offset, v1 a line, k2 a word, v2 the count 1, and v3 the word's total count

  9. Map & Reduce in Hadoop • In Hadoop, Map and Reduce functions can be written in – Java • org.apache.hadoop.mapreduce.lib – C++ using Hadoop Pipes – any language, using Hadoop Streaming • Also a number of third party programming frameworks for Hadoop MapReduce – For Java, Scala, Python, Ruby, PHP, …

  10. Mapper Java example • The generic parameters of Mapper give the input key and value types (LongWritable, Text) and the output key and value types (Text, IntWritable):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

      public static class MyMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {

        @Override
        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String txt = value.toString();
          String[] items = txt.split("\\s");   // split the line on whitespace
          for (int i = 0; i < items.length; i++) {
            Text word = new Text(items[i]);
            IntWritable one = new IntWritable(1);
            context.write(word, one);          // emit (word, 1)
          }
        }
      }

• The Mapper input types depend on the defined InputFormat – By default TextInputFormat • Key (LongWritable): the byte offset of the line in the file • Value (Text): the line

  11. Reducer Java example • The generic parameters of Reducer give the input key and value types (Text, IntWritable) and the output key and value types (Text, IntWritable):

      public static class MyReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable value : values) {
            sum = sum + value.get();                 // add up the counts for this word
          }
          context.write(key, new IntWritable(sum));  // emit (word, total)
        }
      }

  12. Job of the MapReduce example • The main driver program configures the MapReduce job and submits it to the Hadoop YARN cluster:

      public static void main(String[] args) throws Exception {
        …
        Job job = new Job();
        job.setJarByClass(WordCount.class);
        job.setJobName("Word Count");
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

  13. • The key and value classes of the intermediate (map output) and the final (reduce output) data are defined:

        // intermediate (map output) key and value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // final (reduce output) key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

  14. Job of the MapReduce example • FileInputFormat feeds the input splits to the map instances from the wc-files directory:

        FileInputFormat.addInputPath(job, new Path("wc-files"));

• FileOutputFormat writes the result to the wc-output directory:

        FileOutputFormat.setOutputPath(job, new Path("wc-output"));

• The driver waits until the Hadoop runtime environment has executed the program:

        job.waitForCompletion(true);
      } // main
    } // class

  15. MapReduce WordCount Demo • Building the program:

    % mkdir wordcount_classes
    % javac -classpath `hadoop classpath` \
        -d wordcount_classes WordCount.java
    % jar -cvf wordcount.jar -C wordcount_classes/ .

• Send WordCount to YARN:

    % hadoop jar wordcount.jar fi.tut.WordCount /data output

  16. MapReduce WordCount Demo • The result:

    % hdfs dfs -ls output
    Found 2 items
    -rw-r--r--   1 hduser supergroup        0 2016-09-08 11:46 output/_SUCCESS
    -rw-r--r--   1 hduser supergroup   470923 2016-09-08 11:46 output/part-r-00000
    % hdfs dfs -get output/part-r-00000
    % tail -10 part-r-00000
    zoology, 2
    zu 2
    À 2
    à 4
    çela, 1
    …

  17. Fault Tolerance and Speculative Execution • Faults are handled by restarting tasks – All managed in the background – Works because tasks are stateless and free of side effects • Speculative execution runs duplicate copies of slow tasks and takes the result of whichever copy finishes first, preventing stragglers from becoming bottlenecks

  18. Combiners [Figure: a map node emits (A, 1) three times and (B, 1) once; a Combiner on the map node merges the three (A, 1) pairs into (A, 3) before they are sent to the reduce node responsible for key A.]

  19. Combiners • A Combiner can "compress" data on a mapper node before sending it forward • Combiner input/output types must equal the mapper output types • In Hadoop Java, Combiners use the Reducer interface:

    job.setCombinerClass(MyReducer.class);

  20. Reducer as a Combiner • A Reducer can be used as a Combiner if its function is commutative and associative – E.g. max is: max(1, 2, max(3, 4, 5)) = max(max(2, 4), max(1, 5, 3)) = 5, and this holds for any grouping and order of applications – E.g. avg is not: avg(1, 2, avg(3, 4, 5)) = 2.3333… ≠ avg(avg(2, 4), avg(1, 5, 3)) = 3 • Note: if the Reducer is not commutative and associative, Combiners can still be used – The Combiner just has to be different from the Reducer and designed for the specific case (see the sketch below)
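
A common design for the average case (a minimal sketch, not from the slides) is to carry (sum, count) pairs through the combine step and divide only at the very end; merging such pairs is commutative and associative even though avg itself is not:

    # Illustrative sketch: partial aggregates are (sum, count) pairs.
    def combine(p, q):
        return (p[0] + q[0], p[1] + q[1])   # merge two partial aggregates

    values = [1, 2, 3, 4, 5]
    pairs = [(v, 1) for v in values]        # map: each value -> (value, 1)

    # Any grouping yields the same total, so a combiner may pre-merge freely:
    left = combine(combine(pairs[0], pairs[1]), pairs[2])   # (6, 3)
    right = combine(pairs[3], pairs[4])                     # (9, 2)
    total = combine(left, right)                            # (15, 5)
    print(total[0] / total[1])              # reduce: 15 / 5 = 3.0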

  21. Adding a Combiner to WordCount [Figure: on one map node, the map output (walk, 1), (run, 1), (walk, 1) passes through the Combiner, which merges it into (walk, 2), (run, 1) before the shuffle.]

  22. Hadoop Streaming • Map and Reduce functions can be implemented in any language with the Hadoop Streaming API • Input is read from standard input • Output is written to standard output • Input/output items are lines of the form key \t value – \t is the tab character • Reducer input lines are grouped by key – One reducer instance may receive multiple keys
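
For example, the mapper.py and reducer.py invoked in the demo below could look roughly like this (a sketch assuming a plain WordCount; these are not the original scripts):

    #!/usr/bin/env python
    # mapper.py: emit "word \t 1" for every word read from standard input
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py: sum the counts of consecutive lines that share a key;
    # the Streaming framework delivers reducer input sorted by key
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(value)
    if current is not None:
        print("%s\t%d" % (current, count))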

  23. Run Hadoop Streaming • Debug using Unix pipes: cat sample.txt | ./mapper.py | sort | ./reducer.py • On Hadoop: hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \ -input sample.txt \ -output output \ -mapper ./mapper.py \ -reducer ./reducer.py

  24. MapReduce Examples • These are from https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/ • Counting and Summing – WordCount • Filtering ("Grepping"), Parsing, and Validation – Problem: There is a set of records, and it is required to collect all records that meet some condition, or to transform each record (independently of the other records) into another representation – Solution: The Mapper takes records one by one and emits the accepted items or their transformed versions (a sketch follows below)
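
A Streaming-style filtering mapper might look like this (a minimal sketch; the condition "line contains ERROR" and the name filter_mapper.py are illustrative assumptions):

    #!/usr/bin/env python
    # filter_mapper.py: emit only the records that meet the condition;
    # no reduce step is needed, accepted records pass through unchanged
    import sys

    for line in sys.stdin:
        if "ERROR" in line:          # the filtering condition
            sys.stdout.write(line)   # emit the accepted record as-is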

  25. MapReduce Examples • Collating – Problem Statement: There is a set of items and some function of one item. It is required to save all items that have the same value of the function into one file, or to perform some other computation that requires all such items to be processed as a group. The most typical example is the building of inverted indexes. – Solution: The Mapper computes the given function for each item and emits the value of the function as the key and the item itself as the value. The Reducer obtains all items grouped by function value and processes or saves them. In the case of inverted indexes, the emitted key is a term (word) and the value is the ID of the document where the term was found (see the example and the sketch on the next slide).

  26. [Figure: inverted-index example. Page A contains the text "This page contains text"; page B contains "My page contains too". The map phase emits (term, page) pairs: (This, A), (page, A), (contains, A), (text, A), (My, B), (page, B), (contains, B), (too, B). The reduced output is the index: This: A; page: A, B; contains: A, B; text: A; My: B; too: B.]
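
A Streaming-style sketch of that index builder (hypothetical script names, and the mapper is assumed to receive lines of the form "pageID \t text"):

    #!/usr/bin/env python
    # index_mapper.py: emit "term \t pageID" for every term on the page
    import sys

    for line in sys.stdin:
        page_id, _, text = line.rstrip("\n").partition("\t")
        for term in text.split():
            print("%s\t%s" % (term, page_id))

    #!/usr/bin/env python
    # index_reducer.py: collect the page IDs of each term (input sorted by term)
    import sys

    current, pages = None, []
    for line in sys.stdin:
        term, page_id = line.rstrip("\n").split("\t")
        if term != current:
            if current is not None:
                print("%s: %s" % (current, ", ".join(pages)))
            current, pages = term, []
        if page_id not in pages:     # record each page only once per term
            pages.append(page_id)
    if current is not None:
        print("%s: %s" % (current, ", ".join(pages)))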

  27. Conclusions • Map tasks spread out the load (the logic) – A job may have hundreds or even millions of mappers – Fast for large amounts of data
