
MapReduce: a presentation by Andrew Crotty and Alex Galakatos



  1. MapReduce
     Andrew Crotty, Alex Galakatos

  2. What is MapReduce?
     MapReduce is a framework for:
     - parallelizable problems
     - large datasets
     - cluster/grid computing

  3. Background
     - Google project
     - Implemented many special-purpose computations
     - Needed an abstraction
     - "MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004

  4. Map
     - User-defined function
     - Takes input key/value pairs
     - Returns intermediate key/value pairs
     - Intermediate pairs are grouped by key and passed to Reduce

  5. Reduce
     - User-defined function
     - Takes an intermediate key and its corresponding set of values
     - Returns a merged result (e.g., aggregates)
     - Result is usually smaller than the input

  6. Example
     - Problem: count the number of word occurrences in a very large document
     - Solution:
       - Map: emit each word with initial count 1
       - Reduce: emit aggregated counts

  7. Word Count: Map function

         map(String text) {
           // split the text on whitespace and emit one pair per word
           for (String word : text.split("\\s+")) {
             emit(word, 1);
           }
         }

  8. Word Count: Reduce function

         reduce(String word, Iterable<Integer> counts) {
           // add up every count emitted for this word
           int sum = 0;
           for (int count : counts) {
             sum += count;
           }
           emit(word, sum);
         }
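The map and reduce functions on these slides are pseudocode. A minimal runnable sketch in Python could look like the following. The in-memory grouping stands in for the framework's shuffle, and the names `map_fn`, `reduce_fn`, and `word_count` are illustrative only, not part of any MapReduce API.

```python
from collections import defaultdict


def map_fn(text):
    """Map: emit (word, 1) for every whitespace-separated word."""
    for word in text.split():
        yield (word, 1)


def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return (word, sum(counts))


def word_count(documents):
    # Group intermediate values by key. A real framework does this
    # across machines during the shuffle; here it is a local dict.
    grouped = defaultdict(list)
    for text in documents:
        for word, count in map_fn(text):
            grouped[word].append(count)
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


print(word_count(["the quick brown fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```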

  9. Shuffle
     - Happens between the map and reduce phases
     - Transfers all intermediate values for a particular key to a single node
     - High network load
     - Any problems with word count?
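One common way to guarantee that every value for a key lands on the same node is hash partitioning: the reducer index is a hash of the key modulo the number of reducers. The sketch below assumes that scheme; the slides do not say which partitioner is used, and `partition_for` and `shuffle` are hypothetical names. `zlib.crc32` is used because Python's built-in `hash` for strings varies between processes.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    """Pick a reducer index for a key using a process-independent hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapper_outputs, num_reducers):
    """Route (key, value) pairs from every mapper into per-reducer groups."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for key, value in pairs:
            reducers[partition_for(key, num_reducers)][key].append(value)
    return reducers


mapper_outputs = [
    [("the", 1), ("fox", 1), ("the", 1)],  # output of mapper 0
    [("the", 1), ("dog", 1)],              # output of mapper 1
]
reducers = shuffle(mapper_outputs, num_reducers=3)

# Every pair crosses the "network" once, so word count moves one
# (word, 1) pair per word occurrence, however repetitive the input is.
pairs_moved = sum(len(values) for r in reducers for values in r.values())
print(pairs_moved)
```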

  10. Combiner
      - The word count map function produces repetitive intermediate key/value pairs
      - The user can provide an optional function to perform partial merging
      - The function must be commutative and associative
      - Its logic is usually the same as the reduce function
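A small sketch of the effect, assuming word count as the job: the combiner sums counts on the mapper's side, so fewer pairs reach the shuffle while the final totals stay the same. The function names are illustrative.

```python
from collections import Counter


def map_fn(text):
    """Map: one (word, 1) pair per word occurrence."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Combiner: partially sum counts locally, before the shuffle."""
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())


def reduce_all(pairs):
    """Reduce: sum whatever counts arrive, raw or partially merged."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


raw = map_fn("to be or not to be")
combined = combine(raw)
print(len(raw), len(combined))  # 6 4
print(reduce_all(raw) == reduce_all(combined))  # True
```

Summation works here because partial sums can be regrouped and reordered freely. An average would not: averaging partial averages generally gives a different answer, so a combiner for it would have to carry sums and counts instead.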

  11. Execution Overview
      1) Partition data
      2) Map phase
      3) Combiner phase (optional)
      4) Shuffle data
      5) Reduce phase
      6) Return result
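The six steps can be traced in a single-process simulation. This is only a sketch: a real cluster runs the map and reduce tasks in parallel on different machines, and the hash partitioning in step 4 is an assumption, as are all function names.

```python
import zlib
from collections import Counter, defaultdict


def partition_data(lines, num_splits):
    """1) Partition the input into splits, one per map task."""
    return [lines[i::num_splits] for i in range(num_splits)]


def map_phase(split):
    """2) Map: emit (word, 1) for each word in the split."""
    return [(word, 1) for line in split for word in line.split()]


def combine_phase(pairs):
    """3) Combiner (optional): pre-sum counts within one map task."""
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())


def shuffle_phase(all_pairs, num_reducers):
    """4) Shuffle: send every pair for a key to the same reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in all_pairs:
        for word, count in pairs:
            index = zlib.crc32(word.encode("utf-8")) % num_reducers
            buckets[index][word].append(count)
    return buckets


def reduce_phase(bucket):
    """5) Reduce: sum the counts for each word in one reducer's bucket."""
    return {word: sum(counts) for word, counts in bucket.items()}


def run_job(lines, num_splits=2, num_reducers=2):
    splits = partition_data(lines, num_splits)
    combined = [combine_phase(map_phase(split)) for split in splits]
    result = {}
    for bucket in shuffle_phase(combined, num_reducers):
        result.update(reduce_phase(bucket))  # 6) Return result
    return result


print(run_job(["a b a", "b c", "a"]))
```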

  12. Uses
      - Distributed search
      - Distributed sort
      - Large-scale indexing
      - Log file analysis
      - Machine learning
      - Many more...

  13. Advantages
      - Simple programming model
      - Can express many different problems
      - Allows seamless horizontal scalability

  14. Criticisms
      - Lack of novelty
      - No performance enhancements
      - Restricted framework

  15. DBMS Complement
      - NOT a replacement
      - Useful for:
        1) ETL and "read once" datasets
        2) Complex analytics
        3) Semi-structured data
        4) Quick-and-dirty analyses

  16. Hadoop

  17. What is Hadoop?
      - Created in 2005 by Doug Cutting and Mike Cafarella
      - Open-source MapReduce implementation
      - Written in Java
      - Supported by Apache

  18. HDFS
      - Distributed file system
      - Highly scalable and fault tolerant
      - Replication for:
        - Availability
        - Data locality
      - Rack-aware

  19. Amazon Web Services
      - S3
      - EC2
      - Elastic MapReduce
        - Managed Hadoop Framework
        - Run "job flows"
      - Much more...

  20. Elastic MapReduce
      - Job Flows
        - Java jar file
        - Streaming
        - Hive / Pig
        - HBase
      - Word count (streaming)
        - Write map and reduce functions in Python
        - Upload input data and functions to S3
        - Output written to S3

  21. Mapper
      - Reads from stdin and writes to stdout
      - Splits each line and emits (word, 1)
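The mapper code itself is not reproduced in the slide text. A sketch of what a streaming word-count mapper could look like in Python follows; it assumes tab-separated key/value output, which is the Hadoop streaming default, and the function names are illustrative.

```python
import sys


def map_lines(lines):
    """Yield one tab-separated "word<TAB>1" record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read text lines from stdin and write (word, 1) records to stdout."""
    for record in map_lines(stdin):
        stdout.write(record + "\n")


# When saved as a standalone script for a streaming job, the file
# would end with:
#     if __name__ == "__main__":
#         main()
```

Keeping the logic in `map_lines` lets it be exercised on a plain list of strings without a cluster.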

  22. Reducer
      - Goes through the sorted words and sums the counts for identical words
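The reducer code is likewise not part of the slide text. The sketch below assumes the input arrives as tab-separated `word<TAB>count` lines already sorted by word, so that identical words are adjacent; the function names are illustrative.

```python
import sys
from itertools import groupby


def reduce_lines(lines):
    """Sum the counts for each run of identical words in sorted input."""
    records = (
        line.rstrip("\n").split("\t", 1) for line in lines if line.strip()
    )
    # groupby only merges adjacent equal keys, which is why sorted
    # input is required.
    for word, group in groupby(records, key=lambda record: record[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read sorted records from stdin and write one total per word."""
    for record in reduce_lines(stdin):
        stdout.write(record + "\n")


# When saved as a standalone script for a streaming job, the file
# would end with:
#     if __name__ == "__main__":
#         main()
```

Because the function streams through its input, it holds only one word's running total in memory at a time.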

  23. Demo

  24. Tupleware
      - Distributed analytics framework
      - Supports MapReduce-style programs
      - Machine learning/visualization use cases
      - CPU is the bottleneck
      - Optimize for CPU efficiency:
        - Cache-aware
        - Register-aware
        - Vectorized loops

  25. Potential Projects
      - SQL interpreter
      - Language bindings
      - Visualization
      - Comparison benchmarks
      - Many more...

  26. Questions?
