SLIDE 1 MapReduce
Andrew Crotty Alex Galakatos
SLIDE 2
MapReduce is a framework for:
- Parallelizable problems
- Large datasets
- Cluster/grid computing
What is MapReduce?
SLIDE 3
- Google project
- Implemented many special-purpose computations
- Needed an abstraction
- MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004
Background
SLIDE 4
- User-defined function
- Takes input key/value pairs
- Returns intermediate key/value pairs
- Grouped by key and passed to Reduce
Map
SLIDE 5
- User-defined function
- Takes an intermediate key and the corresponding set of values
- Returns a merged result (e.g., aggregates)
- Result is usually smaller than the input set
Reduce
SLIDE 6
Problem: count the number of word occurrences in a very large document
Solution:
- Map: emit each word with initial count 1
- Reduce: emit aggregated counts
Example
SLIDE 7
function map(String text) {
    for (String word : text.split(" ")) {
        emit(word, 1);
    }
}
Word Count: Map
SLIDE 8
function reduce(String word, Iterator<Integer> counts) {
    int sum = 0;
    for (int count : counts) {
        sum += count;
    }
    emit(word, sum);
}
Word Count: Reduce
SLIDE 9
- Happens between the map and reduce phases
- Transfers all intermediate values for a particular key to a single node
- High network load
- Any problems with word count?
Shuffle
SLIDE 10
- The word count map function produces repetitive intermediate key/value pairs
- The user can provide an optional function to perform partial merging
- Must be commutative and associative
- Logic is usually the same as the reduce function
Combiner
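The partial merging described above can be sketched in Python for the word count case (function and variable names here are illustrative, not part of any Hadoop API):

```python
from collections import defaultdict

def combine(pairs):
    # Partially merge (word, count) pairs on the mapper's own node,
    # before the shuffle. Addition is commutative and associative,
    # so merging early does not change the final reduce result.
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())
```

Running the mapper's output through such a combiner collapses repeated (word, 1) pairs into one pair per word, reducing the data sent over the network during the shuffle.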
SLIDE 11
1) Partition data
2) Map phase
3) Combiner phase (optional)
4) Shuffle data
5) Reduce phase
6) Return result
Execution Overview
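As a sketch of how these phases fit together, here is a single-process Python simulation of the word count job (in a real deployment each phase runs distributed across the cluster; the names are illustrative):

```python
from collections import defaultdict

def run_word_count(documents):
    # 1) Partition data: one input split per document in this sketch.
    # 2) Map phase: emit (word, 1) for every word in every split.
    intermediate = []
    for doc in documents:
        for word in doc.split():
            intermediate.append((word, 1))
    # 3) Combiner phase is optional and skipped here.
    # 4) Shuffle: group all intermediate values by key.
    groups = defaultdict(list)
    for word, count in intermediate:
        groups[word].append(count)
    # 5) Reduce phase: merge each key's values into a single count.
    # 6) Return result.
    return {word: sum(counts) for word, counts in groups.items()}
```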
SLIDE 12
- Distributed search
- Distributed sort
- Large-scale indexing
- Log file analysis
- Machine learning
- Many more...
Uses
SLIDE 13
- Simple programming model
- Can express many different problems
- Allows seamless horizontal scalability
Advantages
SLIDE 14
- Lack of novelty
- No performance enhancements
- Restricted framework
Criticisms
SLIDE 15
NOT a replacement for a DBMS
Useful for:
1) ETL and "read once" datasets
2) Complex analytics
3) Semi-structured data
4) Quick-and-dirty analyses
DBMS Complement
SLIDE 16
Hadoop
SLIDE 17
- Created in 2005 by Doug Cutting and Mike Cafarella
- Open-source MapReduce implementation
- Written in Java
- Supported by Apache
What is Hadoop?
SLIDE 18
- Distributed file system
- Highly scalable and fault tolerant
- Replication for:
  - Availability
  - Data locality
- Rack-aware
HDFS
SLIDE 19
- S3
- EC2
- Elastic MapReduce
  - Managed Hadoop framework
  - Run "job flows"
- Much more...
Amazon Web Services
SLIDE 20
Job Flows:
- Java jar file
- Streaming
- Hive / Pig
- HBase
Word count (streaming):
- Write map and reduce functions in Python
- Upload input data and functions to S3
- Output written to S3
Elastic MapReduce
SLIDE 21
- Reads/writes to stdin and stdout
- Splits each line and emits (word, 1)
Mapper
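A minimal sketch of such a streaming mapper in Python (function names are illustrative; Hadoop Streaming only requires reading lines from stdin and writing tab-separated key/value pairs to stdout):

```python
import sys

def map_line(line):
    # Split one input line into words and emit a (word, 1) pair per word.
    return [(word, 1) for word in line.split()]

def run(stream=sys.stdin, out=sys.stdout):
    # Hadoop Streaming pipes the raw input text on stdin;
    # each emitted pair becomes one tab-separated line on stdout.
    for line in stream:
        for word, count in map_line(line):
            out.write("%s\t%d\n" % (word, count))
```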
SLIDE 22
Goes through the sorted words and sums the counts for identical words
Reducer
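A sketch of the corresponding streaming reducer (names illustrative). Because Hadoop Streaming delivers the shuffled pairs sorted by key, a single pass that sums each run of identical words is enough:

```python
import sys

def reduce_sorted(lines):
    # Input lines look like "word\tcount" and arrive sorted by word
    # after the shuffle; sum the counts for each run of equal words.
    current_word, current_sum = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                yield (current_word, current_sum)
            current_word, current_sum = word, 0
        current_sum += int(count)
    if current_word is not None:
        yield (current_word, current_sum)

def run(stream=sys.stdin, out=sys.stdout):
    for word, total in reduce_sorted(stream):
        out.write("%s\t%d\n" % (word, total))
```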
SLIDE 23
Demo
SLIDE 24
- Distributed analytics framework
- Supports MapReduce-style programs
- Machine learning/visualization use cases
- CPU is the bottleneck
- Optimize for CPU efficiency:
  - Cache-aware
  - Register-aware
  - Vectorized loops
Tupleware
SLIDE 25
- SQL interpreter
- Language bindings
- Visualization
- Comparison benchmarks
- Many more...
Potential Projects
SLIDE 26
Questions?