SLIDE 1 MapReduce
Andrew Crotty Alex Galakatos
SLIDE 2
MapReduce is a framework for:
- Parallelizable problems
- Large datasets
- Cluster/grid computing
What is MapReduce?
SLIDE 3
- Google project
- Implemented many special-purpose computations
- Needed an abstraction
- MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004
Background
SLIDE 4
- User-defined function
- Takes input key/value pairs
- Returns intermediate key/value pairs
- Grouped by key and passed to Reduce
Map
SLIDE 5
- User-defined function
- Takes an intermediate key and the corresponding set of values
- Returns a merged result (e.g., aggregates)
- Result is usually smaller than the input set
Reduce
SLIDE 6
Problem: count the number of word occurrences in a very large document
Solution:
- Map: emit each word with initial count 1
- Reduce: emit aggregated counts
Example
SLIDE 7
function map(String text) {
    for (String word : text.split(" ")) {
        emit(word, 1);
    }
}
Word Count: Map
SLIDE 8
function reduce(String word, Iterator<Integer> counts) {
    int sum = 0;
    for (int count : counts) {
        sum += count;
    }
    emit(word, sum);
}
Word Count: Reduce
SLIDE 9
- Happens between the map and reduce phases
- Transfers all intermediate values for a particular key to a single node
- High network load
- Any problems with word count?
Shuffle
SLIDE 10
- The word count map function produces repetitive intermediate key/value pairs
- The user can provide an optional function to perform partial merging
- Must be commutative and associative
- Logic is usually the same as the reduce function
Combiner
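The partial merging described above can be sketched in Python for the word count case (function and variable names here are illustrative, not part of any Hadoop API):

```python
from collections import defaultdict

def combine(pairs):
    # Partially merge (word, count) pairs on the mapper's own node,
    # before the shuffle. Addition is commutative and associative,
    # so merging early does not change the final reduce result.
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())
```

Running the mapper's output through such a combiner collapses repeated (word, 1) pairs into one pair per word, reducing the data sent over the network during the shuffle.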
SLIDE 11
1) Partition data
2) Map phase
3) Combiner phase (optional)
4) Shuffle data
5) Reduce phase
6) Return result
Execution Overview
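As a sketch of how these phases fit together, here is a single-process Python simulation of the word count job (in a real deployment each phase runs distributed across the cluster; the names are illustrative):

```python
from collections import defaultdict

def run_word_count(documents):
    # 1) Partition data: one input split per document in this sketch.
    # 2) Map phase: emit (word, 1) for every word in every split.
    intermediate = []
    for doc in documents:
        for word in doc.split():
            intermediate.append((word, 1))
    # 3) Combiner phase is optional and skipped here.
    # 4) Shuffle: group all intermediate values by key.
    groups = defaultdict(list)
    for word, count in intermediate:
        groups[word].append(count)
    # 5) Reduce phase: merge each key's values into a single count.
    # 6) Return result.
    return {word: sum(counts) for word, counts in groups.items()}
```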
SLIDE 12
- Distributed search
- Distributed sort
- Large-scale indexing
- Log file analysis
- Machine learning
- Many more...
Uses
SLIDE 13
- Simple programming model
- Can express many different problems
- Allows seamless horizontal scalability
Advantages
SLIDE 14
- Lack of novelty
- No performance enhancements
- Restricted framework
Criticisms
SLIDE 15
NOT a replacement for a DBMS
Useful for:
1) ETL and "read once" datasets
2) Complex analytics
3) Semi-structured data
4) Quick-and-dirty analyses
DBMS Complement
SLIDE 16
Hadoop
SLIDE 17
- Created in 2005 by Doug Cutting and Mike Cafarella
- Open-source MapReduce implementation
- Written in Java
- Supported by Apache
What is Hadoop?
SLIDE 18
- Distributed file system
- Highly scalable and fault tolerant
- Replication for:
  - Availability
  - Data locality
- Rack-aware
HDFS
SLIDE 19
- S3
- EC2
- Elastic MapReduce
  - Managed Hadoop framework
  - Run "job flows"
- Much more...
Amazon Web Services
SLIDE 20
Job Flows:
- Java jar file
- Streaming
- Hive / Pig
- HBase
Word count (streaming):
- Write map and reduce functions in Python
- Upload input data and functions to S3
- Output written to S3
Elastic MapReduce
SLIDE 21
- Reads/writes to stdin and stdout
- Splits each line and emits (word, 1)
Mapper
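A minimal sketch of such a streaming mapper in Python (function names are illustrative; Hadoop Streaming only requires reading lines from stdin and writing tab-separated key/value pairs to stdout):

```python
import sys

def map_line(line):
    # Split one input line into words and emit a (word, 1) pair per word.
    return [(word, 1) for word in line.split()]

def run(stream=sys.stdin, out=sys.stdout):
    # Hadoop Streaming pipes the raw input text on stdin;
    # each emitted pair becomes one tab-separated line on stdout.
    for line in stream:
        for word, count in map_line(line):
            out.write("%s\t%d\n" % (word, count))
```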
SLIDE 22
Goes through the sorted words and sums the counts for identical words
Reducer
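A sketch of the corresponding streaming reducer (names illustrative). Because Hadoop Streaming delivers the shuffled pairs sorted by key, a single pass that sums each run of identical words is enough:

```python
import sys

def reduce_sorted(lines):
    # Input lines look like "word\tcount" and arrive sorted by word
    # after the shuffle; sum the counts for each run of equal words.
    current_word, current_sum = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                yield (current_word, current_sum)
            current_word, current_sum = word, 0
        current_sum += int(count)
    if current_word is not None:
        yield (current_word, current_sum)

def run(stream=sys.stdin, out=sys.stdout):
    for word, total in reduce_sorted(stream):
        out.write("%s\t%d\n" % (word, total))
```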
SLIDE 23
Demo
SLIDE 24
- Distributed analytics framework
- Supports MapReduce-style programs
- Machine learning/visualization use cases
- CPU is the bottleneck
- Optimize for CPU efficiency:
  - Cache-aware
  - Register-aware
  - Vectorized loops
Tupleware
SLIDE 25
- SQL interpreter
- Language bindings
- Visualization
- Comparison benchmarks
- Many more...
Potential Projects
SLIDE 26
Questions?