MapReduce: Simplified Data Processing on Large Clusters

Sep 27, 2022

MapReduce: Simplified Data Processing on Large Clusters
Dean J. and Ghemawat S., Google, 2008
Presented by Robert Hoff, 14 Feb 2012

MapReduce
● Distributed Execution Engine
● For Processing Large Datasets
● Provides a restrictive programming model to achieve this
Originated in 2003 to solve search-related problems
● Inverted Indices (Pagerank)
● Word Count
● Most Frequent Queries

Previously at Google
● Issues of parallelisation, fault-tolerance and load-balancing were specific to each problem
● Using ideas from functional programming, map and reduce don't have side effects and can be parallelised
● This method turned out to be applicable to most of their computational requirements
Related Work
● Systems already existed that provided restricted programming models and used these to parallelise the computations.

MapReduce's main contributions at the time
● Fault Tolerance (running on top of commodity HW)
● Higher level of abstraction

Can consider separately:
Programming Interface
Execution Engine (The Implementation)
map (k1,v1) -> list(k2,v2)
reduce (k2, list(v2)) -> list(v3)
Map and reduce are client-supplied functions (may be anything). These are applied to an input set that can be broken into n (k1, v1) pieces.
map (k1,v1) -> list(k2,v2)
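The two signatures above can be illustrated with a minimal single-process sketch. It is not the distributed engine the slides describe: `run_mapreduce`, `length_map` and `collect_reduce` are hypothetical names chosen for illustration, and the grouping step simply stands in for whatever the real system does between the two phases.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Apply map_fn to every (k1, v1) pair, group by k2, then reduce each group.

    Hypothetical single-process driver, for illustration only.
    """
    groups = defaultdict(list)
    # Map phase: map (k1, v1) -> list(k2, v2)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Reduce phase: reduce (k2, list(v2)) -> list(v3)
    return {k2: list(reduce_fn(k2, values)) for k2, values in groups.items()}


def length_map(doc_name, text):
    # Emit (word length, word) for every word in the document.
    for word in text.split():
        yield len(word), word


def collect_reduce(length, words):
    # Return the distinct words of this length in sorted order.
    return sorted(set(words))


result = run_mapreduce(
    [("doc1", "map reduce map"), ("doc2", "key value")],
    length_map,
    collect_reduce,
)
print(result)
```

Because neither client function touches shared state, each call to `map_fn` or `reduce_fn` could in principle run on a different machine, which is the property the slides attribute to the functional style.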
reduce (k2, list(v2)) -> list(v3)
Word Count Example
Map must finish before reduce starts
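A word count in this style might look like the sketch below. The function names (`wc_map`, `wc_reduce`, `word_count`) and the sample documents are assumptions made for illustration; the point is that all intermediate pairs are collected before any reduce call runs, mirroring the "map must finish before reduce starts" constraint.

```python
from collections import defaultdict


def wc_map(doc_name, text):
    # map (k1, v1) -> list(k2, v2): emit (word, 1) for every word.
    return [(word, 1) for word in text.lower().split()]


def wc_reduce(word, counts):
    # reduce (k2, list(v2)) -> list(v3): a single total per word.
    return [sum(counts)]


def word_count(documents):
    # Map phase: run to completion and keep every intermediate pair.
    intermediate = []
    for name, text in documents.items():
        intermediate.extend(wc_map(name, text))

    # Barrier: only now are the pairs grouped by key and handed to reduce.
    grouped = defaultdict(list)
    for word, one in intermediate:
        grouped[word].append(one)
    return {word: wc_reduce(word, ones)[0] for word, ones in grouped.items()}


counts = word_count({"a": "the quick fox", "b": "the lazy dog the end"})
print(counts)
```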
Twitter Hashtag Count

Implementation
● Single Master
● Assigns Workers
● Fault Tolerant (includes failed and lagging workers)
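The master's role listed above can be sketched as a toy scheduler that hands tasks to workers and re-queues a task when its worker fails. This is a simplified simulation under stated assumptions, not the scheme the paper implements: `run_master` is a hypothetical name, failures are supplied up front as a set rather than detected, and lagging workers are not modelled.

```python
from collections import deque


def run_master(tasks, workers, failing_workers):
    """Assign tasks round-robin; re-queue any task whose worker has failed.

    Hypothetical simulation: failing_workers is a set of worker names
    that fail on their first assignment.
    """
    pending = deque(tasks)
    healthy = list(workers)
    completed = {}
    while pending:
        if not healthy:
            raise RuntimeError("no healthy workers left")
        task = pending.popleft()
        worker = healthy.pop(0)
        if worker in failing_workers:
            # Drop the failed worker and put the task back in the queue.
            pending.append(task)
            continue
        healthy.append(worker)  # rotate the worker to the back of the line
        completed[task] = worker
    return completed


tasks = ["t0", "t1", "t2", "t3"]
completed = run_master(tasks, ["w0", "w1", "w2"], failing_workers={"w1"})
print(completed)
```

Re-running a task on another worker is safe here only because map and reduce are assumed to be free of side effects.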
Performance – Grep
● Searches 10^10 100-byte records for a three-character pattern
● 10^12 bytes = 1,000,000 MB ≈ 15,000 x 64 MB chunks
● 1800 Worker Machines

Experience
MapReduce has been applied to an increasing number of useful problems
● Machine learning (e.g. statistical translation)
● Clustering for Google News
● Graph Computations (social network data)
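The sizing figures on the grep slide can be checked with a few lines of arithmetic. The sketch assumes decimal megabytes (10^6 bytes), as the slide's "1,000,000 MB" implies; the variable names are illustrative.

```python
RECORDS = 10**10      # number of records searched
RECORD_BYTES = 100    # size of each record
CHUNK_MB = 64         # input split size
WORKERS = 1800        # worker machines

total_bytes = RECORDS * RECORD_BYTES   # 10**12 bytes
total_mb = total_bytes // 10**6        # decimal megabytes
chunks = total_mb / CHUNK_MB           # number of 64 MB input chunks
chunks_per_worker = chunks / WORKERS   # average load per machine

print(total_bytes, total_mb, chunks, round(chunks_per_worker, 2))
```

The exact quotient is 15,625 chunks, which the slide rounds to 15,000; either way each worker handles fewer than nine chunks on average.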
Further / Future Work
Since the MapReduce programming model is restrictive and can only be applied to a limited set of problems, research is ongoing on execution engines that have higher generality
● DryadLINQ
● CIEL

Further / Future Work
The ideas of MapReduce, or of any other Distributed Execution Engine, may be applied to many-core architectures. One example is the open-source Phoenix (from Stanford), which automatically manages thread creation, dynamic task scheduling, data partitioning, and fault tolerance across processor nodes.
The paper - Remarks
● MapReduce solves Google's problems well.
● Results and ideas are highly replicable.
● But it is somewhat disassociated from other research and lacks comparisons to other work (it solves Google's problems well enough, so why bother?)

Conclusion
● MapReduce is still in use by Google today, solving a growing number of problems.
● MapReduce has become the leading programming model of choice for processing large data sets
● Open-Source versions (e.g. Hadoop) are employed by many other organisations
