CS 744: MAPREDUCE Shivaram Venkataraman Fall 2019

ANNOUNCEMENTS
• Assignment 1 out
• CloudLab notes on Piazza
• No teams yet?

Course stack (top to bottom):
- Applications: Machine Learning, SQL, Streaming, Graph
- Computational Engines
- Scalable Storage Systems
- Resource Management
- Datacenter Architecture

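BACKGROUND: PTHREADS

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Thread body: wait one second, then print a greeting. */
    void *myThreadFun(void *vargp) {
        sleep(1);
        printf("Hello World\n");
        return NULL;
    }

    int main() {
        pthread_t thread_id_1, thread_id_2;
        /* Start two threads that both run myThreadFun. */
        pthread_create(&thread_id_1, NULL, myThreadFun, NULL);
        pthread_create(&thread_id_2, NULL, myThreadFun, NULL);
        /* Wait for both threads to finish before exiting. */
        pthread_join(thread_id_1, NULL);
        pthread_join(thread_id_2, NULL);
        exit(0);
    }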

BACKGROUND: MPI

    mpirun -n 4 -f host_file ./mpi_hello_world

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char** argv) {
        MPI_Init(NULL, NULL);

        // Get the number of processes
        int world_size;
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        // Get the rank of the process
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        // Print off a hello world message
        printf("Hello world from rank %d out of %d processors\n",
               world_rank, world_size);

        // Finalize the MPI environment.
        MPI_Finalize();
    }

MOTIVATION
Build Google Web Search
- Crawl documents, build inverted indexes etc.
Need for
- automatic parallelization
- network, disk optimization
- handling of machine failures

OUTLINE
- Programming Model
- Execution Overview
- Fault Tolerance
- Optimizations

PROGRAMMING MODEL
Data type: Each record is (key, value)
Map function: (K_in, V_in) → list(K_inter, V_inter)
Reduce function: (K_inter, list(V_inter)) → list(K_out, V_out)

Example: Word Count

    def mapper(line):
        for word in line.split():
            output(word, 1)

    def reducer(key, values):
        output(key, sum(values))
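To make the word count example concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps. It is an illustration under assumptions, not the framework's implementation: `run_mapreduce` is a hypothetical helper, and the mapper and reducer yield pairs instead of calling the slide's `output(...)`.

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for word in line.split():
        yield word, 1

def reducer(key, values):
    # Sum all counts collected for one word.
    yield key, sum(values)

def run_mapreduce(records, mapper, reducer):
    # Hypothetical local driver: everything runs in one process, in memory.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    output = []
    for key in sorted(groups):  # sort: visit keys in order
        output.extend(reducer(key, groups[key]))
    return output

lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
counts = dict(run_mapreduce(lines, mapper, reducer))
print(counts)
```

A real deployment would run many mapper and reducer tasks on different machines; the sketch only shows the data flow between the two user-supplied functions.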

Word Count Execution
[Figure: Input → Map → Shuffle & Sort → Reduce → Output. Three input splits ("the quick brown fox", "the fox ate the mouse", "how now brown cow") each feed one Map task; two Reduce tasks produce the output.]

Word Count Execution
[Figure: Input → Map → Shuffle & Sort → Reduce → Output, with intermediate data]
Map outputs:
- "the quick brown fox": the, 1; quick, 1; brown, 1; fox, 1
- "the fox ate the mouse": the, 1; fox, 1; ate, 1; the, 1; mouse, 1
- "how now brown cow": how, 1; now, 1; brown, 1; cow, 1
Reduce outputs:
- First Reduce task: brown, 2; fox, 2; how, 1; now, 1; the, 3
- Second Reduce task: ate, 1; cow, 1; mouse, 1; quick, 1
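The shuffle step has to send every pair with the same key to the same Reduce task. One common way to do that is to hash the key modulo the number of reducers; the sketch below assumes that scheme. The `partition` helper and the use of `zlib.crc32` are illustrative choices, and the resulting assignment need not match the one drawn in the figure.

```python
import zlib

def partition(key, num_reducers):
    # Deterministic hash, so every mapper routes a given key to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

map_output = [("the", 1), ("quick", 1), ("brown", 1), ("fox", 1)]
num_reducers = 2

# Bucket the map output by destination reducer.
buckets = {r: [] for r in range(num_reducers)}
for key, value in map_output:
    buckets[partition(key, num_reducers)].append((key, value))

for reducer_id, pairs in buckets.items():
    print(reducer_id, pairs)
```

Python's built-in `hash` is avoided here because it is randomized per process for strings, which would route the same key differently on different workers.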

ASSUMPTIONS
1. Commodity networking, less bisection bandwidth
2. Failures are common
3. Local storage is cheap
4. Replicated FS

Word Count Execution
- Submit a job to the JobTracker
- Automatically split work
- Schedule tasks with locality
[Figure: three Map tasks, one per input split: "the quick brown fox", "the fox ate the mouse", "how now brown cow"]

Fault Recovery
If a task crashes:
– Retry on another node
– If the same task repeatedly fails, end the job
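The retry policy for a crashed task can be sketched as a small loop. This is an illustration only: `run_with_retries`, the node names, and the limit of 4 attempts are assumptions made for the example, not values taken from the slides.

```python
def run_with_retries(task, nodes, max_attempts=4):
    # Try the task on successive nodes; give up after max_attempts failures.
    last_error = None
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]  # move to another node on each retry
        try:
            return task(node)
        except Exception as err:
            last_error = err
    # The same task failed repeatedly: end the job.
    raise RuntimeError(f"task failed {max_attempts} times; ending the job") from last_error

def flaky_task(node):
    # Hypothetical task that crashes only on node "n1".
    if node == "n1":
        raise IOError("simulated crash on n1")
    return f"done on {node}"

result = run_with_retries(flaky_task, ["n1", "n2", "n3"])
print(result)
```

Retrying is safe here because the task has no side effects besides its return value; the same reasoning is why deterministic map and reduce functions make re-execution a workable recovery strategy.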

Fault Recovery
If a node crashes:
– Relaunch its current tasks on other nodes
What about task inputs? File system replication

Fault Recovery
If a task is going slowly (straggler):
– Launch second copy of task on another node
– Take the output of whichever finishes first
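Straggler handling can be sketched with threads standing in for nodes. The helper `run_with_backup`, the `patience` threshold, and the sleep durations are all assumptions chosen for illustration; a real scheduler decides when a task counts as slow by its own criteria.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_with_backup(task, primary, backup, patience=0.2):
    pool = ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(task, primary)]
    # Give the primary copy some time to finish on its own.
    done, _ = wait(futures, timeout=patience)
    if not done:
        # Primary looks like a straggler: launch a second copy on another node.
        futures.append(pool.submit(task, backup))
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)  # do not block on the slower copy
    # Take the output of whichever copy finished first.
    return next(iter(done)).result()

def task(node):
    # Hypothetical workload: the node named "slow" takes much longer.
    time.sleep(0.6 if node == "slow" else 0.01)
    return node

winner = run_with_backup(task, primary="slow", backup="fast")
print(winner)
```

The slower copy keeps running in the background in this sketch; a real system would also need to discard or kill it.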

MORE DESIGN
- Master failure
- Locality
- Task Granularity

REFINEMENTS
- Combiner functions
- Counters
- Skipping bad records
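A combiner pre-aggregates map output on the map side, so fewer pairs cross the network during the shuffle. The sketch below shows the effect for word count on one input split; `map_with_combiner` is a hypothetical helper that folds the map and combine steps together for brevity.

```python
from collections import Counter

def map_with_combiner(lines):
    # Combiner: sum counts locally before anything is sent to the reducers.
    local = Counter()
    for line in lines:
        for word in line.split():
            local[word] += 1
    return list(local.items())

split = ["the fox ate", "the mouse"]

# Without a combiner, every word occurrence becomes its own (word, 1) pair.
without_combiner = [(word, 1) for line in split for word in line.split()]
with_combiner = map_with_combiner(split)

print(len(without_combiner), len(with_combiner))
```

This only works because addition is associative and commutative; a reduce function without those properties cannot simply be reused as a combiner.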

Jeff Dean, LADIS 2009

DISCUSSION https://forms.gle/hK8wFDxBDfS6chD28

DISCUSSION
Indexing pipeline where you start with HTML documents. You want to index the documents after removing the most commonly occurring words.
1. Compute most common words.
2. Remove them and build the index.
What are the main shortcomings of using MapReduce?
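As a starting point for the discussion, the two steps can be written as two separate jobs. This is one possible sketch, not the expected answer: `run_job`, the document IDs, the plain-text stand-ins for HTML, and `top_k = 1` are all assumptions made for the example.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    # Hypothetical local driver: map, group by key, then reduce in key order.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]

docs = [("d1", "the quick brown fox"), ("d2", "the fox ate the mouse")]

# Job 1: compute word frequencies, then pick the most common words.
def count_map(doc):
    _, text = doc
    for word in text.split():
        yield word, 1

def count_reduce(word, ones):
    yield word, sum(ones)

counts = run_job(docs, count_map, count_reduce)
top_k = 1
common = {word for word, _ in sorted(counts, key=lambda kv: -kv[1])[:top_k]}

# Job 2: read the documents again, drop common words, build the inverted index.
def index_map(doc):
    doc_id, text = doc
    for word in set(text.split()):
        if word not in common:
            yield word, doc_id

def index_reduce(word, doc_ids):
    yield word, sorted(doc_ids)

index = dict(run_job(docs, index_map, index_reduce))
print(common, index)
```

Note that the second job scans the same input a second time and depends on a result that the first job has to finish and hand over first.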

NEXT STEPS
• Next lecture: Spark
• Assignment 1: Use Piazza!
• Project topics: End of this week
