MapReduce Andrew Crotty Alex Galakatos What is MapReduce? - - PowerPoint PPT Presentation

SLIDE 1

MapReduce

Andrew Crotty Alex Galakatos

SLIDE 2

MapReduce is a framework for:
• parallelizable problems
• large datasets
• cluster/grid computing

What is MapReduce?

SLIDE 3

• Google project
• Implemented many special-purpose computations
• Needed an abstraction
• MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004

Background

SLIDE 4

• User-defined function
• Takes input key/value pairs
• Returns intermediate key/value pairs
• Grouped by key and passed to Reduce

Map

SLIDE 5

• User-defined function
• Takes an intermediate key and the corresponding set of values
• Returns merged result (e.g., aggregates)
• Result is usually smaller than the input

Reduce

SLIDE 6

• Problem: count the number of word occurrences in a very large document
• Solution:
  ◦ Map: emit each word with initial count 1
  ◦ Reduce: emit aggregated counts

Example

SLIDE 7

function map(String text) {
    // Split the text on whitespace and emit each word with a count of 1.
    for (String word : text.split("\\s+")) {
        emit(word, 1);
    }
}

Word Count: Map

SLIDE 8

function reduce(String word, Iterable<Integer> counts) {
    // Sum every count emitted for this word, then emit the total.
    int sum = 0;
    for (int count : counts) {
        sum += count;
    }
    emit(word, sum);
}

Word Count: Reduce
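
The two word-count functions above can be tried without a cluster. The following is a minimal single-process sketch in Python, not the framework itself: `word_count` is a hypothetical driver that stands in for the grouping the framework would do between map and reduce, and the sample sentences are made up for illustration.

```python
from collections import defaultdict


def map_fn(text):
    """Map: emit (word, 1) for every whitespace-separated word."""
    for word in text.split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)


def word_count(documents):
    """Hypothetical single-process driver standing in for the framework."""
    grouped = defaultdict(list)
    for text in documents:
        for word, count in map_fn(text):
            grouped[word].append(count)  # group intermediate values by key
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


print(word_count(["the quick brown fox", "the lazy dog the end"]))
```

Only `map_fn` and `reduce_fn` are user code in the MapReduce model; everything inside `word_count` is work the framework would normally take over.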

SLIDE 9

• Happens between map and reduce phases
• Transfer all intermediate values for particular key to single node
• High network load
• Any problems with word count?

Shuffle
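
One way to picture the shuffle is as hash partitioning: each intermediate key is assigned to exactly one reducer, so all of its values end up on the same node. The sketch below is an illustration under assumptions, not how any particular framework implements it; `partition`, `shuffle`, and the CRC32-modulo scheme are hypothetical choices made for this example.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick the reducer for a key with a stable hash (an assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Route every (key, value) pair to one reducer and group values by key.

    map_outputs holds one list of pairs per map task.
    """
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            reducers[partition(key, num_reducers)][key].append(value)
    return reducers


map_outputs = [
    [("the", 1), ("fox", 1), ("the", 1)],  # output of map task 0
    [("the", 1), ("dog", 1)],              # output of map task 1
]
for node, groups in enumerate(shuffle(map_outputs, num_reducers=2)):
    print(node, dict(groups))
```

Note that every single `("the", 1)` pair is transferred individually, which hints at the problem with word count raised on the slide: frequent words generate a lot of redundant traffic.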

SLIDE 10

• Word count map function produces repetitive intermediate key/value pairs
• User can provide optional function to perform partial merging
• Must be commutative and associative
• Logic is usually same as reduce function

Combiner
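
A combiner for word count can be sketched as a local sum over one map task's output. Addition is commutative and associative, so merging partial sums gives the same totals as summing the raw pairs. The function name `combine` and the sample text are assumptions made for this illustration.

```python
from collections import Counter


def map_fn(text):
    """Map: emit (word, 1) for every whitespace-separated word."""
    for word in text.split():
        yield word, 1


def combine(pairs):
    """Partially merge one map task's output before the shuffle."""
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())


text = "to be or not to be"
raw = list(map_fn(text))
combined = combine(raw)
print(len(raw), "pairs before combining,", len(combined), "after")
print(combined)
```

Because `combine` accepts and returns (word, count) pairs, it can be applied any number of times, or not at all, without changing the final result. That is why the framework is free to treat it as optional.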

SLIDE 11

1) Partition data
2) Map phase
3) Combiner phase (optional)
4) Shuffle data
5) Reduce phase
6) Return result

Execution Overview
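
The six steps can be traced in a toy, single-process version of the word-count job. This is only a sketch of the data flow: `run_job`, the round-robin split, and the `num_map_tasks` parameter are hypothetical, and a real cluster would run the map tasks and reducers on separate machines.

```python
from collections import Counter, defaultdict


def run_job(lines, num_map_tasks=2):
    """Simulate the six execution steps for word count in one process."""
    # 1) Partition data: deal input lines out to map tasks round-robin.
    splits = [lines[i::num_map_tasks] for i in range(num_map_tasks)]

    shuffled = defaultdict(list)
    for split in splits:
        # 2) Map phase: emit (word, 1) for each word in the split.
        pairs = [(word, 1) for line in split for word in line.split()]
        # 3) Combiner phase (optional): merge counts within this map task.
        combined = Counter()
        for word, count in pairs:
            combined[word] += count
        # 4) Shuffle data: bring all values for a key together.
        for word, count in combined.items():
            shuffled[word].append(count)

    # 5) Reduce phase: aggregate the partial counts for each word.
    # 6) Return result.
    return {word: sum(counts) for word, counts in shuffled.items()}


print(run_job(["a b", "b c", "a a"]))
```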

SLIDE 12

• Distributed search
• Distributed sort
• Large-scale indexing
• Log file analysis
• Machine learning
• Many more...

Uses

SLIDE 13

• Simple programming model
• Can express many different problems
• Allows seamless horizontal scalability

Advantages

SLIDE 14

• Lack of novelty
• No performance enhancements
• Restricted framework

Criticisms

SLIDE 15

• NOT a replacement for a DBMS
• Useful for:
  1) ETL and "read once" datasets
  2) Complex analytics
  3) Semi-structured data
  4) Quick-and-dirty analyses

DBMS Complement

SLIDE 16

Hadoop

SLIDE 17

• Created in 2005 by Doug Cutting and Mike Cafarella
• Open-source MapReduce implementation
• Written in Java
• Supported by Apache

What is Hadoop?

SLIDE 18

• Distributed file system
• Highly scalable and fault tolerant
• Replication for:
  ◦ Availability
  ◦ Data locality
• Rack-aware

HDFS

SLIDE 19

• S3
• EC2
• Elastic MapReduce
  ◦ Managed Hadoop Framework
  ◦ Run "job flows"
• Much more...

Amazon Web Services

SLIDE 20

• Job Flows
  ◦ Java jar file
  ◦ Streaming
  ◦ Hive / Pig
  ◦ HBase
• Word count (streaming)
  ◦ Write map and reduce functions in Python
  ◦ Upload input data and functions to S3
  ◦ Output written to S3

Elastic MapReduce

SLIDE 21

• Reads from stdin and writes to stdout
• Splits each line and emits (word, 1)

Mapper
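
A streaming mapper along these lines might look like the sketch below. It assumes the usual Hadoop Streaming convention of one record per line with the key and value separated by a tab; the function name `map_lines` is hypothetical, and this is not the script used in the demo.

```python
#!/usr/bin/env python3
import sys


def map_lines(lines):
    """Yield one tab-separated "word<TAB>1" record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


if __name__ == "__main__":
    # Streaming feeds the input split on stdin and collects stdout.
    for record in map_lines(sys.stdin):
        print(record)
```

Keeping the logic in a function that takes any iterable of lines makes the mapper easy to test locally before uploading it.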

SLIDE 22

• Go through the sorted words and sum the counts for identical words

Reducer
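
A matching streaming reducer can rely on the input arriving sorted by key, so all records for one word are adjacent. The sketch below assumes tab-separated "word<TAB>count" lines as input; `parse` and `reduce_lines` are hypothetical names, and this is not the script used in the demo.

```python
#!/usr/bin/env python3
import sys
from itertools import groupby


def parse(lines):
    """Turn "word<TAB>count" lines into (word, count) pairs."""
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        word, _, count = line.rstrip("\n").rpartition("\t")
        yield word, int(count)


def reduce_lines(lines):
    """Sum counts for each run of identical words in key-sorted input."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield f"{word}\t{sum(count for _, count in group)}"


if __name__ == "__main__":
    for record in reduce_lines(sys.stdin):
        print(record)
```

Summing counts rather than counting lines means the same script also works when a combiner has already merged some of the pairs. If the two scripts are saved under hypothetical names such as mapper.py and reducer.py, the pair can be checked locally with a shell pipeline: cat input.txt | python3 mapper.py | sort | python3 reducer.py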

SLIDE 23

Demo

SLIDE 24

• Distributed analytics framework
• Supports MapReduce-style programs
• Machine learning/visualization use cases
• CPU is the bottleneck
• Optimize for CPU efficiency:
  ◦ Cache-aware
  ◦ Register-aware
  ◦ Vectorized loops

Tupleware

SLIDE 25

• SQL interpreter
• Language bindings
• Visualization
• Comparison benchmarks
• Many more...

Potential Projects

SLIDE 26

Questions?