SLIDE 1

IR&DM ’13/’14

V.4 MapReduce

  • 1. System Architecture
  • 2. Programming Model
  • 3. Hadoop

Based on MRS Chapter 4 and RU Chapter 2

SLIDE 3


Why MapReduce?

  • Large clusters of commodity computers (as opposed to a few supercomputers)

  • Challenges:
  • load balancing
  • fault tolerance
  • ease of programming

  • MapReduce
  • system for distributed data processing
  • programming model
  • Full details: [Ghemawat et al. ’03] [Dean and Ghemawat ’04]

[Photos: Jeff Dean and Sanjay Ghemawat]

Jeff Dean Facts:

  • When Jeff Dean designs software, he first codes the binary and then writes the source as documentation.
  • Compilers don’t warn Jeff Dean. Jeff Dean warns compilers.
  • Jeff Dean's keyboard has two keys: 1 and 0.
  • When Graham Bell invented the telephone, he saw a missed call from Jeff Dean.

Source: http://www.quora.com/Jeff-Dean/What-are-all-the-Jeff-Dean-facts

SLIDE 4

1. System Architecture

  • Google File System (GFS)
  • distributed file system for large clusters
  • tunable replication factor
  • single master
  • manages namespace (/home/user/data)
  • coordinates replication of data chunks
  • first point of contact for clients
  • many chunkservers
  • keep data chunks (typically 64 MB)
  • send/receive data chunks to/from clients
  • Full details: [Ghemawat et al. ’03]

[Figure: GFS architecture. The GFS master maps file names (e.g., /foo/bar) to chunks (e.g., 1df2, 2ef0, 3ef1); GFS chunkservers store the chunks (e.g., 2ef0, 5ef0, 3ef1, 1df2, 3ef2, 5af1); the GFS client exchanges control messages with the master and data with the chunkservers.]

SLIDE 5

System Architecture (cont’d)

  • MapReduce (MR)
  • system for distributed data processing
  • moves computation to the data for locality
  • copes with failure of workers
  • single master
  • coordinates execution of job
  • (re-)assigns map/reduce tasks to workers
  • many workers
  • execute assigned map/reduce tasks
  • Full details: [Dean and Ghemawat ’04]

[Figure: MapReduce architecture. The MR master receives jobs from the MR client, assigns map/reduce tasks to the MR workers, and receives progress reports; each MR worker is co-located with a GFS chunkserver.]

SLIDE 6

2. Programming Model

  • Inspired by functional programming (i.e., no side effects)
  • Input/output are key-value pairs (k, v) (e.g., string and int)
  • Users implement two functions
  • map: (k1, v1) => list(k2, v2)
  • reduce: (k2, list(v2)) => list(k3, v3) with input sorted by key k2
  • Anatomy of a MapReduce job
  • Workers execute map() on their portion of the input data in GFS
  • Intermediate data from map() is partitioned and sorted
  • Workers execute reduce() on their partition and write output data to GFS
  • Users may implement combine() for local aggregation of intermediate data and compare() to control how data is sorted

SLIDE 7

WordCount

  • Problem: Count how often every word w occurs in the document collection (i.e., determine cf(w))

map(long did, string content) {
  for (string word : content.split()) {
    emit(word, 1)
  }
}

reduce(string word, list<int> counts) {
  int total = 0
  for (int count : counts) {
    total += count
  }
  emit(word, total)
}
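
Hadoop (see Section 3) executes exactly this kind of job; as a point of reference, here is a minimal sketch of WordCount against Hadoop's Java MapReduce API (the Mapper/Reducer classes follow the org.apache.hadoop.mapreduce package; the job driver with input/output paths is omitted):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable did, Text content, Context context)
        throws IOException, InterruptedException {
      for (String token : content.toString().split("\\s+")) {  // tokenize
        word.set(token);
        context.write(word, ONE);                              // emit(word, 1)
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable count : counts) {
        total += count.get();                                  // sum partial counts
      }
      context.write(word, new IntWritable(total));             // emit(word, total)
    }
  }
}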

SLIDE 13


Execution of WordCount

[Figure: execution of the WordCount job. The input documents d123 ("a x b b a y") and d242 ("b y a x a b") are read from GFS by map tasks M1, …, Mn, which run map() and emit intermediate pairs keyed by word. The intermediate pairs are split into m partitions via partition() and sorted within each partition via compare(). Reduce tasks R1, …, Rm run reduce() on their partitions and write the output (a, 4), (b, 4), (x, 2), (y, 2) back to GFS. The map() and reduce() functions are those of the previous slide.]

SLIDE 14

Inverted Index Construction

  • Problem: Construct a positional inverted index with postings containing positions (e.g., {d123, 3, [1, 9, 20]})

map(long did, string content) {
  int pos = 0
  map<string, list<int>> positions = new map<string, list<int>>()
  for (string word : content.split()) {                 // tokenize document content
    positions.get(word).add(pos++)                      // aggregate word positions
  }
  for (string word : positions.keys()) {
    emit(word, new posting(did, positions.get(word)))   // emit posting
  }
}

reduce(string word, list<posting> postings) {
  postings.sort()        // sort postings (e.g., by did)
  emit(word, postings)   // emit posting list
}

SLIDE 15

3. Hadoop

  • Open source implementation of GFS and MapReduce

  • Hadoop File System (HDFS)
  • name node (master)
  • data node (chunkserver)

  • Hadoop MapReduce
  • job tracker (master)
  • task tracker (worker)

  • Has been successfully deployed on clusters of 10,000s of machines
  • In productive use at Yahoo!, Facebook, and many more

[Photo: Doug Cutting]

SLIDE 16

Jim Gray Benchmark

  • Jim Gray Benchmark:
  • sort large amounts of 100-byte records (the first 10 bytes are the key)
  • minute sort: sort as many records as possible in under a minute
  • gray sort: must sort at least 100 TB and must run at least 1 hour

  • November 2008: Google sorts 1 TB in 68 s and 1 PB in 6:02 h on MapReduce using a cluster of 4,000 computers and 48,000 hard disks
    http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html

  • May 2009: Yahoo! sorts 1 TB in 62 s and 1 PB in 16:15 h on Hadoop using a cluster of approximately 3,800 computers and 15,200 hard disks
    http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/

SLIDE 17

Summary of V.4

  • MapReduce
    a system for distributed data processing
    a programming model
  • Hadoop
    a widely-used open-source implementation of MapReduce

SLIDE 18

Additional Literature for V.4

  • Apache Hadoop (http://hadoop.apache.org)
  • J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004
  • J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, CACM 51(1):107-113, 2008
  • S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google File System, SOSP 2003
  • J. Lin and C. Dyer: Data-Intensive Text Processing with MapReduce, Morgan & Claypool Publishers, 2010 (http://lintool.github.io/MapReduceAlgorithms)

SLIDE 19

V.5 Near-Duplicate Detection

  • 1. Shingling
  • 2. SpotSigs
  • 3. Min-Wise Independent Permutations
  • 4. Locality-Sensitive Hashing

Based on MRS Chapter 19 and RU Chapter 3

SLIDE 24


Near-Duplicate Detection

  • Why near-duplicate detection?
  • smaller indexes and thus faster response times
  • improved result quality

  • Building blocks of a near-duplicate detection method
  • document representation (e.g., bag of words, bag of n-grams, set of links, anchor text of inlinks, set of relevant queries, feature vector)
  • similarity measure (e.g., Jaccard coefficient, cosine similarity)
  • near-duplicate detection algorithm
  • sorting- and indexing-based approaches
  • similarity hashing (e.g., MIPs, LSH)

SLIDE 25

1. Shingling

  • Observation: Duplicates on the Web are often slightly perturbed (e.g., due to different boilerplate, minor rewordings, etc.)

  • Document fingerprinting (e.g., SHA-1 or MD5) is not effective, since we need to allow for minor differences between documents

  • Shingling represents document d as a set S(d) of word-level n-grams (shingles) and compares documents based on these sets (see the sketch below)

Example (n = 3): "the little brown fox jumps over the green frog" yields the shingles { the little brown, little brown fox, brown fox jumps, fox jumps over, … }
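
A minimal sketch of shingle extraction in plain Java (class and method names are illustrative, not from the literature):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class Shingler {
  // Return the set S(d) of word-level n-grams (shingles) of a document.
  static Set<String> shingles(String content, int n) {
    String[] words = content.split("\\s+");
    Set<String> result = new HashSet<>();
    for (int i = 0; i + n <= words.length; i++) {
      result.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
    }
    return result;
  }
}

For the sentence above, shingles(content, 3) yields exactly the set shown, from "the little brown" to "the green frog".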

SLIDE 26

Shingling

  • Encode shingles by hash fingerprints (e.g., using SHA-1), yielding a set of numbers S(d) ⊆ [1, …, n] (e.g., for n = 2^64), for example:
    { the little brown, little brown fox, brown fox jumps, fox jumps over, … } => { 141,944, 13,031,980, 21,111,978, 6,012,014, … }

  • Compare suspected near-duplicate documents d and d’ by the following measures (sketched in code below)
  • Resemblance (Jaccard coefficient): |S(d) ∩ S(d’)| / |S(d) ∪ S(d’)|
  • Containment (relative overlap): |S(d) ∩ S(d’)| / |S(d)|
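
Both measures are simple set computations; a sketch in Java (generic over the element type, so it applies to raw shingles from the Shingler sketch above as well as to their fingerprints):

import java.util.HashSet;
import java.util.Set;

class Similarity {
  // Resemblance: |S(d) ∩ S(d')| / |S(d) ∪ S(d')| (Jaccard coefficient).
  static <T> double resemblance(Set<T> a, Set<T> b) {
    Set<T> intersection = new HashSet<>(a);
    intersection.retainAll(b);
    int union = a.size() + b.size() - intersection.size();
    return union == 0 ? 0.0 : (double) intersection.size() / union;
  }

  // Containment: |S(d) ∩ S(d')| / |S(d)| (relative overlap).
  static <T> double containment(Set<T> a, Set<T> b) {
    Set<T> intersection = new HashSet<>(a);
    intersection.retainAll(b);
    return a.isEmpty() ? 0.0 : (double) intersection.size() / a.size();
  }
}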

SLIDE 27

Shingle-Based Clustering

  • Remove near-duplicate document d’ if resemblance or containment is above a user-specified threshold τ

  • How to avoid comparing all pairs of documents? (a code sketch follows the steps below)
  • 1. Compute shingle set S(d) for each document d
  • 2. Build inverted index: shingle => list of document identifiers
  • 3. Compute a (d, d’, c) table with common-shingle count c by considering all pairs of documents (d, d’) per shingle
  • 4. Keep all pairs of documents (d, d’) with similarity above the threshold and add (d, d’) as an edge to a graph
  • 5. Compute connected components of the graph (using a union-find algorithm) as clusters of near-duplicate documents
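
A condensed in-memory sketch of steps 2-5 (resemblance is estimated from the common-shingle count c via |S(d) ∪ S(d')| = |S(d)| + |S(d')| - c; encoding a document pair as a single long key is an implementation shortcut, not part of the original algorithm):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ShingleClustering {
  // Union-find with path halving.
  static int find(int[] parent, int x) {
    while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
    return x;
  }

  // docs.get(i) = shingle set S(d_i); returns a cluster id per document.
  static int[] cluster(List<Set<Long>> docs, double tau) {
    int n = docs.size();
    // Step 2: inverted index shingle -> list of document identifiers.
    Map<Long, List<Integer>> index = new HashMap<>();
    for (int i = 0; i < n; i++)
      for (long shingle : docs.get(i))
        index.computeIfAbsent(shingle, k -> new ArrayList<>()).add(i);
    // Step 3: common-shingle count c per document pair (encoded as i * n + j).
    Map<Long, Integer> common = new HashMap<>();
    for (List<Integer> ds : index.values())
      for (int x = 0; x < ds.size(); x++)
        for (int y = x + 1; y < ds.size(); y++)
          common.merge((long) ds.get(x) * n + ds.get(y), 1, Integer::sum);
    // Steps 4+5: threshold the resemblance, union-find the connected components.
    int[] parent = new int[n];
    for (int i = 0; i < n; i++) parent[i] = i;
    for (Map.Entry<Long, Integer> e : common.entrySet()) {
      int i = (int) (e.getKey() / n), j = (int) (e.getKey() % n);
      int c = e.getValue();  // |S(d_i) ∩ S(d_j)|
      double r = (double) c / (docs.get(i).size() + docs.get(j).size() - c);
      if (r > tau) parent[find(parent, i)] = find(parent, j);
    }
    for (int i = 0; i < n; i++) parent[i] = find(parent, i);
    return parent;
  }
}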

SLIDE 28

Super Shingles and Complexity

  • Super shingles (shingles over shingles) can be used to speed up steps 2 and 3 of the algorithm, since documents with many common shingles are likely to have a common super shingle

  • The algorithm considers only pairs of documents that have at least one shingle in common, but the worst case remains at O(n^2)

  • Problem: Shingle sets can become quite large, making the similarity computation expensive

  • Full details: [Broder et al. ’97]

SLIDE 29

2. SpotSigs

  • Problem: Near-duplicate detection on the Web fails for web pages with the same core content but different navigation, headers, etc.
  • Observation: Stopwords tend to occur mostly in the core content
  • SpotSigs considers only those shingles that begin with a stopword

  • Problem: How can we perform fewer similarity computations?
  • Upper bound for the Jaccard coefficient (assuming |A| ≤ |B| w.l.o.g.):
    r(A, B) = |A ∩ B| / |A ∪ B| ≤ min(|A|, |B|) / max(|A|, |B|) = |A| / |B|

SLIDE 30

SpotSigs

  • Do not compare any sets A and B with |A| / |B| ≤ τ (see the sketch below)
  • Given similarity threshold τ, partition the documents (based on their signature set cardinality) into partitions P1, …, Pn
  • Consider document pairs in Pi × Pj (i ≤ j) only if
    max{|S(d)| : d ∈ Pi} / min{|S(d)| : d ∈ Pj} > τ
  • Clever partitioning ensures that at most neighboring partitions have to be compared
  • Full details: [Theobald et al. ’08]

[Figure: documents partitioned by signature set cardinality into P1, …, Pn with boundaries 10, 20, 30, 40, 50, 60, 70, …, 1,000, 1,010; for τ = 0.6, neighboring partitions with cardinality ratio 0.83 must be compared, while partitions with ratio 0.5 can be skipped.]
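
A sketch of the resulting pair-level filter (only the pruning test; the partitioning logic itself is not shown):

class SpotSigsPruning {
  // Upper bound r(A, B) <= |A| / |B| for |A| <= |B|: an exact similarity
  // computation is only worthwhile if this bound exceeds tau.
  static boolean worthComparing(java.util.Set<?> a, java.util.Set<?> b, double tau) {
    int small = Math.min(a.size(), b.size());
    int large = Math.max(a.size(), b.size());
    return large > 0 && (double) small / large > tau;
  }
}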

SLIDE 31

3. Min-Wise Independent Permutations

  • Statistical sketch to estimate the resemblance of S(d) and S(d’)
  • consider m independent random permutations of the two sets, implemented by applying m independent hash functions
  • keep the minimum value observed for each of the m hash functions, yielding an m-dimensional MIPs vector for each document
  • estimate the resemblance of S(d) and S(d’) based on MIPs(d) and MIPs(d’):
    r̂(d, d’) = |{1 ≤ i ≤ m : MIPs(d)[i] = MIPs(d’)[i]}| / m

  • Full details: [Broder et al. ’00]

SLIDE 32

Min-Wise Independent Permutations

  • MIPs are an unbiased estimator of resemblance:
    P[min{h(x) : x ∈ A} = min{h(y) : y ∈ B}] = |A ∩ B| / |A ∪ B|

  • MIPs can be seen as repeated random sampling of x, y from A, B

Example: for the set of shingle fingerprints S(d) = {3, 8, 12, 17, 21, 24}:
  h1(x) = 7x + 3 mod 51 => {24, 8, 36, 20, 48, 18}, minimum 8
  h2(x) = 5x + 6 mod 51 => {21, 46, 15, 40, 9, 24}, minimum 9
  …
  hm(x) = 3x + 9 mod 51 => {18, 33, 45, 9, 21, 30}, minimum 9
This yields the MIPs vector MIPs(d) = (8, 9, …, 9). Comparing MIPs(d) = (8, 9, 5, 9) with MIPs(d’) = (8, 1, 2, 9) gives an estimated resemblance of 2/4, since the two vectors agree in two of four positions.
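
A sketch of MIPs computation with random linear hash functions h_i(x) = (a_i x + b_i) mod P, analogous to the example above (the modulus P and the random coefficients are assumptions of this sketch):

import java.util.Arrays;
import java.util.Random;
import java.util.Set;

class MinHash {
  private static final long P = (1L << 31) - 1;  // prime modulus 2^31 - 1
  private final long[] a, b;                     // m random hash functions

  MinHash(int m, long seed) {
    Random rnd = new Random(seed);
    a = new long[m];
    b = new long[m];
    for (int i = 0; i < m; i++) {
      a[i] = 1 + Math.floorMod(rnd.nextLong(), P - 1);  // a_i in [1, P-1]
      b[i] = Math.floorMod(rnd.nextLong(), P);          // b_i in [0, P-1]
    }
  }

  // MIPs vector: per hash function, the minimum of h_i(x) over x in S(d).
  long[] signature(Set<Long> shingleFingerprints) {
    long[] mips = new long[a.length];
    Arrays.fill(mips, Long.MAX_VALUE);
    for (long x : shingleFingerprints) {
      long xr = Math.floorMod(x, P);  // reduce fingerprint into [0, P)
      for (int i = 0; i < a.length; i++)
        mips[i] = Math.min(mips[i], Math.floorMod(a[i] * xr + b[i], P));
    }
    return mips;
  }

  // Estimated resemblance: fraction of positions where the vectors agree.
  static double estimateResemblance(long[] u, long[] v) {
    int matches = 0;
    for (int i = 0; i < u.length; i++) if (u[i] == v[i]) matches++;
    return (double) matches / u.length;
  }
}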

SLIDE 33

4. Locality-Sensitive Hashing (for MIPs)

  • General idea behind locality-sensitive hashing (LSH)
  • hash each item l times so that similar items map to the same bucket
  • consider pairs of items similar that mapped at least once to the same bucket

  • Locality-sensitive hashing with MIPs vectors (see the sketch below)
  • compute l independent MIPs vectors of length m for each document
  • consider document pairs with at least one common MIPs vector

Example: S(d) = {3, 8, 12, 17, 21, 24} with MIPs1(d) = (8, 2), MIPs2(d) = (9, 1), …, MIPsl(d) = (5, 7) and S(d’) = {3, 5, 12, 17, 22, 24} with MIPs1(d’) = (3, 8), MIPs2(d’) = (9, 1), …, MIPsl(d’) = (2, 5) share MIPs2 and are thus considered similar.
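
A sketch of the corresponding candidate-generation index, reusing the MinHash sketch above (hashing a full MIPs vector to an int bucket key via Arrays.hashCode is an implementation choice of this sketch):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LshIndex {
  private final MinHash[] tables;  // l independent MIPs computations
  private final List<Map<Integer, List<Integer>>> buckets = new ArrayList<>();

  LshIndex(int l, int m) {
    tables = new MinHash[l];
    for (int i = 0; i < l; i++) {
      tables[i] = new MinHash(m, i);  // one seed per table
      buckets.add(new HashMap<>());
    }
  }

  // Hash a document into one bucket per MIPs vector.
  void add(int docId, Set<Long> shingles) {
    for (int i = 0; i < tables.length; i++) {
      int key = Arrays.hashCode(tables[i].signature(shingles));
      buckets.get(i).computeIfAbsent(key, k -> new ArrayList<>()).add(docId);
    }
  }

  // Candidates: documents sharing at least one bucket with the query.
  Set<Integer> candidates(Set<Long> shingles) {
    Set<Integer> result = new HashSet<>();
    for (int i = 0; i < tables.length; i++) {
      int key = Arrays.hashCode(tables[i].signature(shingles));
      result.addAll(buckets.get(i).getOrDefault(key, List.of()));
    }
    return result;
  }
}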

SLIDE 34

LSH Analysis

  • Let r = r(d, d’) denote the resemblance between d and d’
  • P[MIPsi(d) = MIPsi(d’)] = r^m : same i-th MIPs vector
  • 1 - r^m : different i-th MIPs vector
  • (1 - r^m)^l : all MIPs vectors different
  • 1 - (1 - r^m)^l : at least one MIPs vector in common

[Figure: probability of considering a pair similar as a function of resemblance, compared against the ideal step function at threshold τ, plotted for m=1/l=1, m=10/l=1, m=1/l=10, and m=2/l=5.]
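
The detection probability 1 - (1 - r^m)^l is straightforward to evaluate; a small sketch that reproduces the example on the next slide:

class LshProbability {
  // P[at least one of the l MIPs vectors of length m agrees] for resemblance r.
  static double detectionProbability(double r, int m, int l) {
    return 1.0 - Math.pow(1.0 - Math.pow(r, m), l);
  }

  public static void main(String[] args) {
    // Miss probability for r = 0.8, m = 5, l = 20: (1 - 0.8^5)^20 ≈ 3.56e-4.
    System.out.println(1.0 - detectionProbability(0.8, 5, 20));
  }
}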

SLIDE 39


LSH Analysis

  • Example: For a pair of documents d and d’ with r(d, d’) = 0.8, m = 5, and l = 20, the probability of missing the pair is (1 - 0.8^5)^20 ≈ 3.56 × 10^-4

  • Full details: [Gionis et al. ‘99]

SLIDE 40

Summary of V.5

  • Near-Duplicate Detection
    essential for smaller indexes and better result quality
  • Shingling
    to deal with small perturbations in otherwise duplicate documents
  • SpotSigs
    focuses on shingles beginning with a stopword
    uses smart blocking to compare fewer document pairs
  • Min-Wise Independent Permutations
    as a statistical sketch to approximate resemblance
  • Locality-Sensitive Hashing
    as a method to reduce the number of document comparisons

SLIDE 41

Additional Literature for V.5

  • A. Broder, S. Glassman, M. Manasse, and G. Zweig: Syntactic Clustering of the Web, WWW 1997
  • A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher: Min-Wise Independent Permutations, JCSS 60(3):630-659, 2000
  • A. Gionis, P. Indyk, and R. Motwani: Similarity Search in High Dimensions via Hashing, VLDB 1999
  • M. Henzinger: Finding Near-Duplicate Web Pages: a Large-Scale Evaluation of Algorithms, SIGIR 2006
  • M. Theobald, J. Siddharth, and A. Paepcke: SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections, SIGIR 2008