MapReduce
320302 Databases & Web Services (P. Baumann)
Why MapReduce?
- Motivation: Large Scale Data Processing
- Want to process lots of data ( > 1 TB)
- Want to parallelize across
hundreds/thousands of CPUs
- … Want to make this easy
- MPI has programming overhead
- MapReduce Idea: simple, highly scalable,
generic parallelization model
- Automatic parallelization & distribution
- Fault-tolerant
- Clean abstraction for programmers
- status & monitoring tools
Who Uses MapReduce?
- At Google:
- Index construction for Google Search
- Article clustering for Google News
- Statistical machine translation
- At Yahoo!:
- “Web map” powering Yahoo! Search
- Spam detection for Yahoo! Mail
- At Facebook:
- Data mining
- Ad optimization
Overview
- MapReduce: the concept
- Hadoop: the implementation
- Query Languages for Hadoop
- Spark: the improvement
- MapReduce vs databases
- Conclusion
MapReduce: the concept
Credits:
- David Maier
- Shiva Teja Reddi Gopidi
Preamble: Merits of Functional Programming (FP)
- FP: input determines output – and nothing else
- No other knowledge used (global variables!)
- No other data modified (global variables!)
- Every function invocation generates new data
- Opposite: procedural programming → side effects
- Unforeseeable interference between parallel processes
→ difficult/impossible to ensure deterministic result
- (function, value set) must form a monoid
- Advantage of FP: parallelization can be arranged automatically
- can (automatically!) reorder or parallelize execution; data flow is implicit
Programming Model
- Goals: large data sets, processing distributed over 1,000s of nodes
- Abstraction to express simple computations
- Hide details of parallelization, data distribution, fault tolerance, load balancing
- MapReduce engine performs all housekeeping
- Inspired by primitives from functional PLs like Lisp, Scheme, Haskell
- Input, output are sets of key/value pairs
- Users implement interface of two functions:
map(inKey, inValue) → (outKey, intermediateValueList)      // aka "group by" in SQL
reduce(outKey, intermediateValueList) → outValueList       // aka aggregation in SQL
Ex 1: Count Word Occurrences
map(String inKey, String inValue):
  // inKey: document name
  // inValue: document contents
  for each word w in inValue:
    EmitIntermediate(w, "1");

reduce(String outKey, Iterator intermediateValues):
  // outKey: a word
  // intermediateValues: a list of counts
  int result = 0;
  for each v in intermediateValues:
    result += ParseInt(v);
  Emit(AsString(result));

[Google]
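The pseudocode above can be mirrored in plain Python. The `map_reduce` driver below is a toy, single-process stand-in for the engine (not Hadoop's API); the shuffle phase is simulated with a dictionary of lists:

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    # in_key: document name, in_value: document contents
    for word in in_value.split():
        yield (word, 1)

def reduce_fn(out_key, values):
    # out_key: a word, values: list of counts
    yield sum(values)

def map_reduce(inputs, map_fn, reduce_fn):
    # Shuffle: group all intermediate values by output key
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for k, v in map_fn(in_key, in_value):
            groups[k].append(v)
    # Reduce: one call per distinct intermediate key
    return {k: list(reduce_fn(k, vs)) for k, vs in sorted(groups.items())}

docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
counts = map_reduce(docs, map_fn, reduce_fn)
# counts["the"] == [2]: "the" occurs once in each document
```

A real engine runs map and reduce calls on different machines; only the programming model is the same here.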
Ex 2: Distributed Grep
- map function emits the line if it matches the given pattern
- reduce function is the identity: just copies supplied intermediate data to output
- Application 1: Count of URL Access Frequency
- logs of web page requests → map() → <URL, 1>
- all values for same URL → reduce() → <URL, total count>
- Application 2: Inverted Index
- document → map() → sequence of <word, document ID> pairs
- all pairs for a given word → reduce() sorts document IDs → <word, list(document ID)>
- set of all output pairs = simple inverted index
- easy to extend for word positions
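The inverted-index pattern above can be sketched as a minimal single-process Python program (whitespace tokenization assumed for illustration):

```python
from collections import defaultdict

def map_doc(doc_id, text):
    # Emit one <word, document ID> pair per word occurrence
    for word in text.split():
        yield (word, doc_id)

def reduce_word(word, doc_ids):
    # Sort and deduplicate document IDs for this word
    return (word, sorted(set(doc_ids)))

def inverted_index(docs):
    groups = defaultdict(list)      # the shuffle: pairs grouped by word
    for doc_id, text in docs:
        for word, d in map_doc(doc_id, text):
            groups[word].append(d)
    return dict(reduce_word(w, ds) for w, ds in groups.items())

index = inverted_index([(1, "to be or not to be"), (2, "not a problem")])
# index["not"] == [1, 2]; index["be"] == [1]
```

Extending for word positions would only change the emitted value from `doc_id` to `(doc_id, position)`.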
Ex 3: Relational Join
- Map function M: “hash on key attribute”:
( ?, tuple) → list(key, tuple)
- Reduce function R: “join on each k value”: (key, list(tuple)) → list(tuple)
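The two functions M and R can be sketched in Python as a repartition join over toy relations (the relation names, attributes, and tuples below are invented for illustration):

```python
from collections import defaultdict

def map_tuple(relation, tup, key_attr):
    # M, "hash on key attribute": tag each tuple with its join-key value
    yield (tup[key_attr], (relation, tup))

def reduce_join(key, tagged):
    # R, "join on each key value": combine tuples from both relations
    left = [t for rel, t in tagged if rel == "R"]
    right = [t for rel, t in tagged if rel == "S"]
    return [{**l, **r} for l in left for r in right]

def mr_join(R, S, key_attr):
    groups = defaultdict(list)              # shuffle by join key
    for rel, tuples in (("R", R), ("S", S)):
        for tup in tuples:
            for k, v in map_tuple(rel, tup, key_attr):
                groups[k].append(v)
    out = []
    for k, tagged in groups.items():
        out.extend(reduce_join(k, tagged))
    return out

R = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
S = [{"id": 1, "city": "NYC"}]
joined = mr_join(R, S, "id")
# joined == [{"id": 1, "name": "ann", "city": "NYC"}]
```

Note how the join key must be materialized and shuffled explicitly; this is exactly the rigidity the later "MapReduce vs databases" discussion criticizes.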
[Figure: Map & Reduce data flow. Input key/value pairs from data stores 1..n feed map tasks, each emitting intermediate (key, values) pairs. A barrier aggregates intermediate values by output key. Reduce tasks then produce the final values for key 1, key 2, key 3.]
Map Reduce Patent
- Google granted US Patent 7,650,331, January 2010
- System and method for efficient large-scale data processing
A large-scale data processing system and method includes one or more application-independent map modules configured to read input data and to apply at least one application-specific map operation to the input data to produce intermediate data values, wherein the map operation is automatically parallelized across multiple processors in the parallel processing environment. A plurality of intermediate data structures are used to store the intermediate data values. One or more application-independent reduce modules are configured to retrieve the intermediate data values and to apply at least one application-specific reduce operation to the intermediate data values to provide output data.
Hadoop: a MapReduce implementation
Credits:
- David Maier, U Wash
- Costin Raiciu
- “The Google File System” by S. Ghemawat, H. Gobioff, and S.-T. Leung, 2003
- https://hadoop.apache.org/docs/r1.0.4/hdfs_design.html
Hadoop Distributed File System
- HDFS = scalable, fault-tolerant file system
- modeled after Google File System (GFS)
- 64 MB blocks ("chunks")
[“The Google File System” by S. Ghemawat, H. Gobioff, and S.-T. Leung, 2003]
GFS
- Goals:
- Many inexpensive commodity components – failures happen routinely
- Optimized for a small number of large files (ex: a few million files of 100+ MB each)
- relies on local storage on each node
- unlike parallel file systems, which typically use dedicated I/O servers (ex: IBM GPFS)
- metadata (file-chunk mapping, replica locations, ...) in master node's RAM
- Operation log on master's local disk, replicated to remote machines → enables master crash recovery!
- "Shadow masters" for read-only access
HDFS differences?
- No random write; append only
- Implemented in Java, emphasizes platform independence
- terminology: namenode ↔ master, block ↔ chunk, ...
GFS Consistency
- Relaxed consistency model
- tailored to Google's highly distributed applications; simple & efficient to implement
- File namespace mutations are atomic
- handled exclusively by master; locking guarantees atomicity & correctness
- master's log defines global total order of operations
- State of file region after data mutation
- consistent: all clients always see same data, regardless of replica they read from
- defined: consistent, plus all clients see the entire data mutation
- undefined but consistent: result of concurrent successful mutations; all clients see
same data, but may not reflect any one mutation
- inconsistent: result of a failed mutation
GFS Consistency: Consequences
- Implications for applications
- better not distribute records across chunks!
- rely on appends rather than overwrites
- application-level checksums, checkpointing, writing self-validating & self-identifying
records
- Typical use cases (or “hacking around relaxed consistency”)
- writer generates file from beginning to end and then atomically renames it to a
permanent name under which it is accessed
- writer inserts periodical checkpoints, readers only read up to checkpoint
- many writers concurrently append to a file to merge results; readers skip occasional
padding and repetition using checksums
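The first use case, write-then-atomic-rename, can be sketched on a local POSIX filesystem (GFS itself is not POSIX; this just illustrates the idiom, with invented names):

```python
import os
import tempfile

def write_atomically(path, data):
    # Writer generates the file in full under a temporary name, then
    # atomically renames it; readers never observe a half-written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # force data to stable storage first
        os.replace(tmp, path)      # atomic rename: readers see old or new file
    except BaseException:
        os.unlink(tmp)
        raise

# Writer publishes the completed file under its permanent name:
out = os.path.join(tempfile.mkdtemp(), "results.txt")
write_atomically(out, "checkpoint 1\n")
```

The temp file must live in the same directory as the target, since `os.replace` is only atomic within one filesystem.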
Replica Placement
- Goals of placement policy
- scalability, reliability and availability, maximize network bandwidth utilization
- Background: GFS clusters are highly distributed
- 100s of chunkservers across many racks
- accessed from 100s of clients from same or different racks
- traffic between machines on different racks may cross many switches
- bandwidth between racks typically lower than within rack
- Selecting a chunkserver
- place chunks on servers with below-average disk space utilization
- place chunks on servers with low number of recent writes
- spread chunks across racks (see above)
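The selection criteria above can be sketched as a toy scoring function (the field names, scoring, and fallback policy here are invented for illustration, not GFS's actual implementation):

```python
def pick_replica_targets(servers, n_replicas=3):
    # servers: list of dicts with "id", "rack", "disk_used" (fraction of
    # disk in use) and "recent_writes" (count). Prefer below-average disk
    # utilization and few recent writes; spread across distinct racks first.
    avg_used = sum(s["disk_used"] for s in servers) / len(servers)
    ranked = sorted(servers,
                    key=lambda s: (s["disk_used"] > avg_used,
                                   s["recent_writes"], s["disk_used"]))
    chosen, racks = [], set()
    for s in ranked:                      # first pass: one server per rack
        if s["rack"] not in racks:
            chosen.append(s)
            racks.add(s["rack"])
        if len(chosen) == n_replicas:
            return [s["id"] for s in chosen]
    for s in ranked:                      # fallback: allow rack reuse
        if s not in chosen:
            chosen.append(s)
        if len(chosen) == n_replicas:
            break
    return [s["id"] for s in chosen]

servers = [
    {"id": "a", "rack": 1, "disk_used": 0.9, "recent_writes": 0},
    {"id": "b", "rack": 1, "disk_used": 0.2, "recent_writes": 1},
    {"id": "c", "rack": 2, "disk_used": 0.3, "recent_writes": 0},
    {"id": "d", "rack": 3, "disk_used": 0.4, "recent_writes": 5},
]
targets = pick_replica_targets(servers)
# server "a" is skipped: above-average disk use, and rack 1 already covered
```

Spreading across racks trades write bandwidth (replicas cross switches) for availability under rack failures.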
Hadoop Job Management Framework
- JobTracker = daemon service for submitting & tracking MapReduce jobs
- TaskTracker = slave-node daemon in the cluster accepting tasks
(Map, Reduce, & Shuffle operations) from a JobTracker
Discussion:
- Pro: replication & automated restart of failed tasks
→ highly reliable & available
- Con: 1 JobTracker per Hadoop cluster, 1 TaskTracker per slave node
→ single point of failure
Optimizations / 1
- Problem:
No reduce can start until map is complete → single slow disk controller can rate-limit whole process
- Solution:
Master redundantly executes slow (“straggler”) map tasks; uses results of first copy to finish
- Why is it safe to redundantly execute map tasks?
Wouldn’t this mess up the total computation?
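The FP preamble answers this: a map task is a pure function of its input chunk, so re-executing it cannot change the result. A tiny sketch of that argument:

```python
def pure_map(chunk):
    # A map task in the functional style: output depends only on the
    # input chunk; no global state, no side effects
    return [(w, 1) for w in chunk.split()]

chunk = "the quick fox"
first = pure_map(chunk)    # original attempt (imagine it straggles)
backup = pure_map(chunk)   # speculative backup execution on another node
# The master keeps whichever copy finishes first; both are deterministic
# functions of the same input, so the results are identical:
assert first == backup
```

The only remaining requirement is that the master commits exactly one copy's output, discarding the other.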
Optimizations / 2
- Problem:
excessive data transport between map() and reduce() workers
- Approach:
“Combiner” functions can run on same machine as a mapper
- “mini-reduce phase” followed by “final” reduce phase
- saves bandwidth
- Under what conditions is it sound to use a combiner?
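One sufficient condition: the reduction must be associative and commutative (the monoid condition from the FP preamble), so that combining locally and then reducing globally equals reducing all raw values at once. A sketch:

```python
from collections import defaultdict

def group(pairs):
    # Shuffle step: gather values by key
    acc = defaultdict(list)
    for k, v in pairs:
        acc[k].append(v)
    return acc

def reduce_sum(values):
    return sum(values)

# One mapper's raw output, and the same output after a local combiner pass:
raw = [("the", 1), ("fox", 1), ("the", 1)]
combined = [(k, reduce_sum(vs)) for k, vs in group(raw).items()]

# sum is associative and commutative (a monoid with identity 0), so
# reducing the combined values equals reducing the raw values:
for key in ("the", "fox"):
    assert reduce_sum(group(combined)[key]) == reduce_sum(group(raw)[key])

# A plain average would NOT be a sound combiner:
# avg(avg(1, 1), avg(4)) = 2.5, but avg(1, 1, 4) = 2.0
```

Non-monoid aggregates like the average can still use a combiner if reformulated, e.g. as (sum, count) pairs that are only divided in the final reduce.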
Discussion
- MapReduce concept:
- One-input, two-stage data flow → extremely rigid
- Most suitable for independent data
- Good: word count
- Not optimal: join, graphs, arrays, ...
- HDFS assumes shared-nothing & locality, but datacenters often run SANs
- (Well-known) algorithms need cumbersome rewriting = special-skill programming
- Query frontends: Pig Latin, Hive, etc.
- map(), reduce() = procedural Java code → hard to optimize
- Hadoop implementation:
- All intermediate data communicated via disk
- Task scheduler: central point of failure
- HDFS not standards conformant (eg, POSIX)
Query Languages for MapReduce
Credits:
- Matei Zaharia
Motivation
- MapReduce is powerful
- many algorithms can be expressed as a series of MR jobs
- But fairly low-level
- must think about keys, values, partitioning, etc.
- Can we capture common “job patterns”?
- Like eg SQL does
Pig
- Started at Yahoo! Research
- Runs about 50% of Yahoo!'s jobs
- Features:
- Expresses sequences of MapReduce jobs
- Data model: nested "bags" of items
- Provides relational (SQL) operators (JOIN, GROUP BY, etc)
- Easy to plug in Java functions
Example Problem
- user data in one file
- website data in another
- find top 5 most visited pages
- by users aged 18-25
(Load Users → Filter by age) + (Load Pages) → Join on name → Group on url → Count clicks → Order by clicks → Take top 5
[http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt]
In MapReduce
[http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt]
In Pig Latin:

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, count(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
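For comparison, the same dataflow in plain Python over invented toy data (a sketch of the logic only, not of how Pig compiles or executes it):

```python
users = [("amy", 20), ("bob", 30), ("cat", 22)]
pages = [("amy", "a.com"), ("amy", "b.com"), ("cat", "a.com"), ("bob", "c.com")]

# filter Users by age >= 18 and age <= 25
filtered = [(n, a) for n, a in users if 18 <= a <= 25]
names = {n for n, _ in filtered}

# join Filtered by name, Pages by user
joined = [(u, url) for u, url in pages if u in names]

# group by url, count clicks
clicks = {}
for _, url in joined:
    clicks[url] = clicks.get(url, 0) + 1

# order by clicks desc, limit 5
top5 = sorted(clicks.items(), key=lambda kv: kv[1], reverse=True)[:5]
# "bob" (age 30) is filtered out, so c.com gets no clicks
```

Pig's value is that each of these steps becomes a declarative statement the system can translate into parallel MapReduce jobs.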
Translation to MapReduce
Quite natural translation of job components into Pig Latin:

(Load Users → Filter by age) + (Load Pages) → Join on name → Group on url → Count clicks → Order by clicks → Take top 5

Users = load …
Filtered = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count() …
Sorted = order …
Top5 = limit …
Hive
- Relational database built on Hadoop
- table schemas, SQL-like query language
- can call Hadoop Streaming scripts
- Common relational features:
- table partitioning, complex data types, sampling
- some query optimization
- Developed at Facebook, now Apache
- Today: "data warehouse infrastructure"

SELECT word, count(1) AS count
FROM (SELECT explode(split(line, '\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word
MapReduce vs (Relational) Databases
Credits: David Maier
Grep Task: Load Times
["A Comparison of Approaches to Large-Scale Data Analysis" by A. Pavlo et al., 2009]
Grep Task: Execution Times
["A Comparison of Approaches to Large-Scale Data Analysis" by A. Pavlo et al., 2009]
MapReduce Criticism
- Efficiency
- master makes O(M + R) scheduling decisions
- master stores O(M * R) states in memory
- “Why not use a parallel DBMS instead?”
- map/reduce is a “giant step backwards”
- no schema, no indexes, no high-level language
- not novel at all
- does not provide features of traditional DBMS
- incompatible with DBMS tools
Analytics Tasks
- Data set
- 600K unique HTML documents
- 155M user visit records (20 GB/node)
- 18M ranking records (1 GB/node)
CREATE TABLE Documents (
  url      VARCHAR(100) PRIMARY KEY,
  contents TEXT
);
CREATE TABLE UserVisits (
  sourceIP     VARCHAR(16),
  destURL      VARCHAR(100),
  visitDate    DATE,
  adRevenue    FLOAT,
  userAgent    VARCHAR(64),
  countryCode  VARCHAR(3),
  languageCode VARCHAR(3),
  searchWord   VARCHAR(32),
  duration     INT
);
CREATE TABLE Rankings (
  pageURL     VARCHAR(100) PRIMARY KEY,
  pageRank    INT,
  avgDuration INT
);
["A Comparison of Approaches to Large-Scale Data Analysis" by A. Pavlo et al., 2009]
Select Task
- SQL Query:
- Relational DBMS
- use index on pageRank column
- Relative performance degrades
as number of nodes increases
- Hadoop start-up costs increase
with cluster size
SELECT pageURL, pageRank FROM Rankings WHERE pageRank > X
["A Comparison of Approaches to Large-Scale Data Analysis" by A. Pavlo et al., 2009]
Aggregation Task
“total ad revenue for each source IP, based on user visits table”
Variant 1:
SELECT sourceIP, SUM(adRevenue)
FROM UserVisits
GROUP BY sourceIP

Variant 2:
SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue)
FROM UserVisits
GROUP BY SUBSTR(sourceIP, 1, 7)
["A Comparison of Approaches to Large-Scale Data Analysis" by A. Pavlo et al., 2009]
Variant 1: 2.5M groups; Variant 2: 2,000 groups
Join Task
SQL query (below) vs. MapReduce program:
- filter records outside date range, join with
rankings file
- compute total ad revenue and average
page rank based on source IP
- produce largest total ad revenue record
- Phases run in strict sequential order
["A Comparison of Approaches to Large-Scale Data Analysis" by A. Pavlo et al., 2009]

SELECT INTO Temp UV.sourceIP,
       AVG(R.pageRank) AS avgPageRank,
       SUM(UV.adRevenue) AS totalRevenue
FROM Rankings AS R, UserVisits AS UV
WHERE R.pageURL = UV.destURL
  AND UV.visitDate BETWEEN DATE('2000-01-15') AND DATE('2000-01-22')
GROUP BY UV.sourceIP

SELECT sourceIP, avgPageRank, totalRevenue
FROM Temp
ORDER BY totalRevenue DESC
LIMIT 1
Summary: MapReduce vs Parallel (R)DBMS
- MapReduce: No schema, no index, no high-level language
- faster loading vs. faster execution
- easier prototyping vs. easier maintenance
- Fault tolerance
- restart of single worker vs. restart of transaction
- Installation and tool support
- easy to setup map/reduce vs. challenging to configure parallel DBMS
- no tools for tuning vs. tools for automatic performance tuning
- Performance per node
- results seem to indicate that parallel DBMSs achieve the same performance as map/reduce in smaller clusters

In a nutshell:
- (R)DBMSs: efficiency, QoS
- MapReduce: cluster scalability
Spark: improving Hadoop
Credits:
- Matei Zaharia
Motivation
- MapReduce aiming at “big data” analysis on large, unreliable clusters
- After initial hype, shortcomings perceived:
ease of use (programming!), efficiency, tool integration, ...
- …as soon as organizations started using it widely, users wanted more:
- More complex, multi-stage applications
- More interactive queries
- More low-latency online processing
[Figure: iterative jobs (Stage 1 → Stage 2 → Stage 3), interactive mining (Query 1, Query 2, Query 3), and stream processing (Job 1, Job 2, …)]
Spark vs Hadoop
- Spark = cluster-computing framework by Berkeley AMPLab
- Now Apache
- Inherits HDFS, MapReduce from Hadoop
- But:
- Disk-based communication → in-memory communication
- Java → Scala
Resilient Distributed Datasets (RDDs)
- Partitioned collections of records
that can be stored in memory across the cluster
- Manipulated through a diverse set of transformations
- map, filter, join, etc
- Fault recovery without costly replication
- Remember series of transformations that built RDD (its lineage)
- Can recompute lost data based on input files
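The lineage idea can be sketched with a toy RDD class in Python (illustrative only, not Spark's API): each RDD remembers just its parent and the transformation that derived it, so a lost cached partition can be recomputed instead of replicated.

```python
class RDD:
    # Toy RDD: stores its lineage (parent + transformation), caches results
    def __init__(self, compute, parent=None):
        self._compute = compute
        self.parent = parent
        self._cache = None

    def map(self, f):
        return RDD(lambda data: [f(x) for x in data], parent=self)

    def filter(self, p):
        return RDD(lambda data: [x for x in data if p(x)], parent=self)

    def collect(self):
        if self._cache is None:                  # cache miss / lost partition:
            src = self.parent.collect() if self.parent else None
            self._cache = self._compute(src)     # recompute from lineage
        return self._cache

def text_file(lines):
    return RDD(lambda _: list(lines))

lines = text_file(["ERROR disk", "INFO ok", "ERROR net"])
errors = lines.filter(lambda l: l.startswith("ERROR"))
words = errors.map(lambda l: l.split()[1])
assert words.collect() == ["disk", "net"]

# Simulate losing the cached data; the lineage rebuilds it transparently:
words._cache = None
assert words.collect() == ["disk", "net"]
```

Real Spark partitions each RDD across the cluster and recomputes only the lost partitions, walking the lineage back as far as needed (ultimately to the input files).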
Example: Log Mining
- Load error messages from a log into memory, then interactively search for
various patterns
Base RDD → transformed RDDs (Scala):

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
messages.cache()

Interactive queries against the cached RDD:

messages.filter(_.contains("foo")).count
messages.filter(_.contains("bar")).count
. . .

[Figure: driver ships tasks to workers; each worker builds its partition of messages from an HDFS block (Block 1..3), keeps it in memory (Cache 1..3), and returns results to the driver]

Result: 1 TB of data in 5-7 sec (vs 170 sec on disk)
Ex: Logistic Regression Performance
- Find best line separating two sets of points
- 29 GB dataset
- 20x EC2 m1.xlarge 4-core machines
- Result:
[Figure: running time (s) vs. number of iterations (1-30) for Hadoop and Spark; inset shows the random initial line converging to the target separator]
- Hadoop: 127 s / iteration
- Spark: 174 s first iteration, 6 s per further iteration
Conclusion
- MapReduce = specialized (synchronous) distributed processing paradigm
- Optimized for horizontal scaling in commodity clusters (!), fault tolerance
- Efficiency? Hardware, energy, ... (see [0], [1], [2], [3] etc.)
- “Adding more compute servers did not yield significant improvement” [src]
- Well suited for sets, less so for highly connected data (graphs, arrays)
- Need to rewrite algorithms
- Apache Hadoop = MapReduce implementation (HDFS, Java)
- Apache Spark = improved MapReduce implementation (HDFS, DSS, Scala)
- Query languages on top of MapReduce
- HLQLs: Pig, Hive, JAQL, ASSET, …