SLIDE 1 MapReduce:
Simplified Data Processing on Large Clusters
- J. Dean, S. Ghemawat, OSDI, 2004.
Review by Mariana Marasoiu for R212
SLIDE 2
Motivation: Large scale data processing
We want to:
- Extract data from large datasets
- Run on big clusters of computers
- Be easy to program
SLIDE 3
Solution: MapReduce
A new programming model: Map & Reduce
Provides:
- Automatic parallelization and distribution
- Fault tolerance
- I/O scheduling
- Status and monitoring
SLIDE 4
Word count example.
Input: (1, "you are in Cambridge") (2, "I like Cambridge") (3, "we live in Cambridge")
Map output: (you, 1) (are, 1) (in, 1) (Cambridge, 1) (I, 1) (like, 1) (Cambridge, 1) (we, 1) (live, 1) (in, 1) (Cambridge, 1)
map (in_key, in_value) → list(out_key, intermediate_value)
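A minimal Python sketch of this map function for the word count example (hypothetical name wc_map; the paper's actual interface is a C++ library):

    # Word count map: in_key is a document id, in_value its text.
    # Emits one (word, 1) pair per word, matching the pairs shown above.
    def wc_map(in_key, in_value):
        for word in in_value.split():
            yield (word, 1)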
SLIDE 5
Intermediate key/value pairs after the map phase:
(you, 1) (are, 1) (in, 1) (Cambridge, 1) (I, 1) (like, 1) (Cambridge, 1) (we, 1) (live, 1) (in, 1) (Cambridge, 1)
SLIDE 6
Partition
Map output: (you, 1) (are, 1) (in, 1) (Cambridge, 1) (I, 1) (like, 1) (Cambridge, 1) (we, 1) (live, 1) (in, 1) (Cambridge, 1)
After partitioning, grouped by key:
Partition 1: (we, 1) (you, 1) (live, 1) (are, 1)
Partition 2: (Cambridge, 1) (Cambridge, 1) (Cambridge, 1) (in, 1) (in, 1) (I, 1) (like, 1)
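The grouping above is what the partitioning function produces: each intermediate key is routed to one of R reduce tasks. The paper's default is hash(key) mod R; a one-line Python sketch:

    # Default partitioning: the same key always lands on the same reduce
    # task, so all pairs for a word are counted together.
    def partition(out_key, R):
        return hash(out_key) % R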
SLIDE 7
Reduce
Partition 1: (we, 1) (you, 1) (live, 1) (are, 1)
Partition 2: (Cambridge, 1) (Cambridge, 1) (Cambridge, 1) (in, 1) (in, 1) (I, 1) (like, 1)
Reduce output: (you, 1) (are, 1) (in, 2) (Cambridge, 3) (I, 1) (like, 1) (we, 1) (live, 1)
reduce (out_key, list(intermediate_value)) → list(out_value)
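And a matching sketch of the word-count reduce function: it receives one key together with the list of all intermediate values emitted for that key, and sums them (hypothetical name wc_reduce):

    # Word count reduce: e.g. ("Cambridge", [1, 1, 1]) -> ("Cambridge", 3).
    def wc_reduce(out_key, intermediate_values):
        yield (out_key, sum(intermediate_values))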
SLIDE 8 Execution overview: the job starts from the input files (File 1, File 2, File 3)
SLIDE 9 The user program forks a master and a pool of workers
SLIDE 10 The master assigns map tasks and reduce tasks to idle workers
SLIDE 11 The input is divided into M splits; each map worker reads its assigned split
SLIDE 12 Map workers write intermediate key/value pairs to their local disks
SLIDE 13 Reduce workers fetch intermediate data from the map workers via remote reads
SLIDE 14 Each reduce worker writes one of the R output files (Output File 1, Output File 2)
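To make this data flow concrete, here is a single-process Python simulation of the whole pipeline (a sketch under my own naming, with no real cluster, master process, or fault tolerance):

    from collections import defaultdict

    # Word-count map and reduce, repeated so the sketch runs standalone.
    def wc_map(in_key, in_value):
        for word in in_value.split():
            yield (word, 1)

    def wc_reduce(out_key, values):
        yield (out_key, sum(values))

    def run_mapreduce(inputs, map_fn, reduce_fn, R=2):
        # Map phase: apply map_fn to each input record and partition the
        # intermediate pairs across R reduce tasks by hash(key) mod R.
        partitions = [defaultdict(list) for _ in range(R)]
        for in_key, in_value in inputs.items():
            for out_key, value in map_fn(in_key, in_value):
                partitions[hash(out_key) % R][out_key].append(value)
        # Reduce phase: each reduce task walks its keys in sorted order
        # (the ordering guarantee) and fills one output "file" (a dict).
        outputs = []
        for groups in partitions:
            out = {}
            for out_key, values in sorted(groups.items()):
                out.update(reduce_fn(out_key, values))
            outputs.append(out)
        return outputs

    docs = {1: "you are in Cambridge", 2: "I like Cambridge",
            3: "we live in Cambridge"}
    print(run_mapreduce(docs, wc_map, wc_reduce))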
SLIDE 15
Fine task granularity
- Choose M so that each split holds between 16MB and 64MB of input data
- Choose R as a small multiple of the number of workers
- E.g. M = 200,000 and R = 5,000 on 2,000 workers
Advantages:
- Dynamic load balancing
- Fast failure recovery: many small tasks can be re-executed across the cluster
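As a sanity check on these numbers (my arithmetic, not the paper's): at the 64MB upper bound, a 1TB input yields roughly 15,000 map tasks:

    # Rough split arithmetic, assuming the 64MB upper bound from the slide.
    input_bytes = 10**12              # e.g. 1TB of input
    split_bytes = 64 * 2**20          # 64MB per map task
    M = input_bytes // split_bytes
    print(M)                          # 14901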
SLIDE 16
Fault tolerance
Workers:
- Failure detected via periodic heartbeat
- Completed and in-progress map tasks are re-executed
- In-progress reduce tasks are re-executed
- Task completion is committed through the master
Master:
Not handled: failure of the single master is considered unlikely (the computation is simply aborted)
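A hedged sketch of the worker re-execution rule (hypothetical data structures, not the paper's code). The asymmetry matters: map output sits on the failed worker's local disk, so even completed map tasks are redone, while completed reduce output is already safe in the global file system:

    from dataclasses import dataclass

    @dataclass
    class Task:
        kind: str                  # "map" or "reduce"
        state: str                 # "idle", "in_progress", or "completed"
        worker: int | None = None  # worker the task is/was assigned to

    def handle_worker_failure(failed, tasks):
        # Called by the master when a worker misses its heartbeats.
        for t in tasks:
            if t.worker != failed:
                continue
            if t.kind == "map":
                # Intermediate output lived on the dead worker's disk:
                # reschedule the task even if it had completed.
                t.state, t.worker = "idle", None
            elif t.state == "in_progress":
                # Completed reduce output is in the global file system,
                # so only in-progress reduce tasks are re-executed.
                t.state, t.worker = "idle", None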
SLIDE 17
Refinements
- Locality optimization
- Backup tasks
- Ordering guarantees
- Combiner function
- Skipping bad records
- Local execution
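Of these, the combiner function is the easiest to show in code: it does partial merging on the map worker before intermediate data crosses the network. A sketch for word count (assumed function name, not the paper's API):

    from collections import defaultdict

    # Combiner: locally merge (word, 1) pairs emitted by one map task,
    # so the network carries (word, n) instead of n separate pairs.
    def combine(pairs):
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return list(counts.items())

    print(combine([("Cambridge", 1), ("in", 1), ("Cambridge", 1)]))
    # [('Cambridge', 2), ('in', 1)]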
SLIDE 18
Performance
Tests run on 1800 machines:
- Dual 2GHz Intel Xeon processors with Hyper-Threading enabled
- 4GB of memory
- Two 160GB IDE disks
- Gigabit Ethernet link
2 Benchmarks:
- MR_Grep: 10^10 × 100-byte records, 92k matches
- MR_Sort: 10^10 × 100-byte records
SLIDE 19
MR_Grep
Completes in ~150 seconds, including ~60 seconds of startup overhead
SLIDE 20
MR_Sort
Three executions compared:
- Normal execution
- No backup tasks
- 200 tasks killed
SLIDE 21
Experience
- Rewrite of the indexing system for Google web search
- Large scale machine learning
- Clustering for Google News
- Data extraction for Google Zeitgeist
- Large scale graph computations
SLIDE 22
Conclusions
MapReduce:
- a useful abstraction
- simplifies large-scale computations
- easy to use
However:
- expensive for small applications
- long startup time (~1 min)
- chaining of map-reduce phases?