CS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 ANNOUNCEMENTS - - PowerPoint PPT Presentation



slide-1
SLIDE 1

CS 744: MAPREDUCE

Shivaram Venkataraman Fall 2019

slide-2
SLIDE 2

ANNOUNCEMENTS

  • Assignment 1 out
  • CloudLab notes on Piazza
  • No teams yet?
slide-3
SLIDE 3

Course stack (top to bottom):

  • Applications
  • Machine Learning, SQL, Streaming, Graph
  • Computational Engines
  • Scalable Storage Systems
  • Resource Management
  • Datacenter Architecture

slide-4
SLIDE 4

BACKGROUND: PTHREADS

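#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Body of each thread: sleep briefly, then print a greeting. */
void *myThreadFun(void *vargp) {
    sleep(1);
    printf("Hello World\n");
    return NULL;
}

int main() {
    pthread_t thread_id_1, thread_id_2;

    /* Start two threads that both run myThreadFun. */
    pthread_create(&thread_id_1, NULL, myThreadFun, NULL);
    pthread_create(&thread_id_2, NULL, myThreadFun, NULL);

    /* Wait for both threads to finish before exiting. */
    pthread_join(thread_id_1, NULL);
    pthread_join(thread_id_2, NULL);
    exit(0);
}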

slide-5
SLIDE 5

BACKGROUND: MPI

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Print off a hello world message
    printf("Hello world from rank %d out of %d processors\n",
           world_rank, world_size);

    // Finalize the MPI environment.
    MPI_Finalize();
}

mpirun -n 4 -f host_file ./mpi_hello_world

slide-6
SLIDE 6

MOTIVATION

Build Google Web Search

  • Crawl documents, build inverted indexes etc.

Need for

  • automatic parallelization
  • network, disk optimization
  • handling of machine failures
slide-7
SLIDE 7

OUTLINE

  • Programming Model
  • Execution Overview
  • Fault Tolerance
  • Optimizations
slide-8
SLIDE 8

PROGRAMMING MODEL

Data type: Each record is (key, value)

Map function: (K_in, V_in) → list(K_inter, V_inter)

Reduce function: (K_inter, list(V_inter)) → list(K_out, V_out)
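To make the two signatures concrete, here is a minimal single-process sketch in Python. It is not how a real MapReduce system runs (no parallelism, no disks, no failures); it only shows how map output is grouped by intermediate key before reduce is called. The names run_mapreduce, max_map and max_reduce, and the sample records, are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to every (key, value) record, group by intermediate key,
    then apply reduce_fn to each (key, list_of_values) group."""
    groups = defaultdict(list)
    for k_in, v_in in records:
        for k_inter, v_inter in map_fn(k_in, v_in):
            groups[k_inter].append(v_inter)

    output = []
    for k_inter in sorted(groups):  # sorted only to make the output stable
        output.extend(reduce_fn(k_inter, groups[k_inter]))
    return output


# Hypothetical job: maximum temperature per city.
def max_map(_line_no, line):
    city, temp = line.split(",")
    yield city, int(temp)


def max_reduce(city, temps):
    yield city, max(temps)


records = [(0, "madison,21"), (1, "austin,35"), (2, "madison,27")]
result = run_mapreduce(records, max_map, max_reduce)
print(result)  # [('austin', 35), ('madison', 27)]
```

The user supplies only the two functions; everything inside run_mapreduce is what the framework is expected to provide.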

slide-9
SLIDE 9

Example: Word Count

# output(key, value) is assumed to be provided by the framework
# and emits one (key, value) pair.

def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
slide-10
SLIDE 10

Word Count Execution

Input splits (one per map task):

  • the quick brown fox
  • the fox ate the mouse
  • how now brown cow

Stages: Input → Map (3 tasks) → Shuffle & Sort → Reduce (2 tasks) → Output

slide-11
SLIDE 11

Word Count Execution

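Input splits (one per map task):

  • Map 1: the quick brown fox
  • Map 2: the fox ate the mouse
  • Map 3: how now brown cow

Map output:

  • Map 1: (the, 1) (brown, 1) (fox, 1) (quick, 1)
  • Map 2: (the, 1) (fox, 1) (the, 1) (ate, 1) (mouse, 1)
  • Map 3: (how, 1) (now, 1) (brown, 1) (cow, 1)

Reduce output:

  • Reduce 1: (brown, 2) (fox, 2) (how, 1) (now, 1) (the, 3)
  • Reduce 2: (ate, 1) (cow, 1) (mouse, 1) (quick, 1)

Stages: Input → Map → Shuffle & Sort → Reduce → Output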
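The execution above can be simulated in a few lines of Python. This is a sketch under stated assumptions: three map tasks and two reduce tasks as on the slide, everything in one process, and a CRC32-based partition function picked only because it is deterministic across runs. The names word_count and partition are hypothetical, and the assignment of words to reducers will generally differ from the slide's.

```python
import zlib
from collections import defaultdict

SPLITS = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
NUM_REDUCERS = 2


def mapper(line):
    for word in line.split():
        yield word, 1


def reducer(key, values):
    return key, sum(values)


def partition(key, num_reducers):
    # Assumption: CRC32 as the hash. Python's built-in hash() on strings is
    # randomized per process, so it would not give repeatable partitions.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def word_count(splits, num_reducers):
    # Map phase: one map task per split; each buckets its output per reducer.
    map_outputs = []
    for line in splits:
        buckets = [[] for _ in range(num_reducers)]
        for word, count in mapper(line):
            buckets[partition(word, num_reducers)].append((word, count))
        map_outputs.append(buckets)

    # Shuffle & sort: reducer r pulls bucket r from every map task,
    # groups values by key, and processes keys in sorted order.
    results = []
    for r in range(num_reducers):
        grouped = defaultdict(list)
        for buckets in map_outputs:
            for word, count in buckets[r]:
                grouped[word].append(count)
        results.append([reducer(word, grouped[word]) for word in sorted(grouped)])
    return results


per_reducer = word_count(SPLITS, NUM_REDUCERS)
totals = dict(pair for part in per_reducer for pair in part)
print(totals)
```

Because every occurrence of a word hashes to the same reducer, each word appears in exactly one reducer's output, which is what lets the reducers run independently.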

slide-12
SLIDE 12

ASSUMPTIONS

slide-13
SLIDE 13

ASSUMPTIONS

  1. Commodity networking, less bisection bandwidth
  2. Failures are common
  3. Local storage is cheap
  4. Replicated FS
slide-14
SLIDE 14

Word Count Execution

  • Submit a job to the JobTracker
  • Automatically split work
  • Schedule tasks with locality

[Figure: the JobTracker assigns three map tasks, one per input split: "the quick brown fox", "the fox ate the mouse", "how now brown cow"]
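Scheduling with locality can be sketched as follows. This is a simple greedy illustration, not the JobTracker's actual algorithm: it assumes the scheduler knows which nodes hold a replica of each input split and hands each task to a free node that stores its input when one exists. The function name, node names and split names are all hypothetical.

```python
def schedule_with_locality(tasks, replicas, free_nodes):
    """Greedy sketch: prefer a free node that stores the task's input split.

    tasks: list of split names, one map task per split
    replicas: dict mapping split name -> set of nodes holding a copy
    free_nodes: list of idle node names
    Returns {split: (node, is_local)}.
    """
    available = list(free_nodes)
    assignment = {}

    # First pass: place tasks on nodes that already hold their input.
    for split in tasks:
        local = [n for n in available if n in replicas.get(split, set())]
        if local:
            node = local[0]
            available.remove(node)
            assignment[split] = (node, True)

    # Second pass: remaining tasks take any leftover node and read remotely.
    for split in tasks:
        if split not in assignment and available:
            assignment[split] = (available.pop(0), False)
    return assignment


replicas = {
    "split-0": {"node-a", "node-b"},
    "split-1": {"node-b", "node-c"},
    "split-2": {"node-d"},
}
plan = schedule_with_locality(
    ["split-0", "split-1", "split-2"], replicas, ["node-a", "node-b", "node-e"]
)
print(plan)
```

Here split-0 and split-1 run where their data lives, while split-2 has no free local node and must read its input over the network from node-e.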

slide-15
SLIDE 15

Fault Recovery

If a task crashes:

  • Retry on another node
  • If the same task repeatedly fails, end the job

[Figure: three map tasks over the input splits "the quick brown fox", "the fox ate the mouse", "how now brown cow"]
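The retry policy for crashed tasks can be sketched as a small loop. This is an illustration under assumptions: the limit of 4 attempts, the JobFailed exception, and the round-robin choice of the next node are all hypothetical, not taken from the slides.

```python
class JobFailed(Exception):
    """Raised when one task has failed too many times (hypothetical name)."""


def run_with_retries(task, nodes, max_attempts=4):
    """Run task(node), moving to another node after each crash.

    max_attempts=4 is an assumed limit chosen for this sketch.
    """
    errors = []
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]  # try a different node each time
        try:
            return task(node)
        except Exception as exc:  # any exception counts as a task crash here
            errors.append((node, repr(exc)))
    raise JobFailed(f"task failed {max_attempts} times: {errors}")


def flaky_task(node):
    # Hypothetical task that crashes only on node-a.
    if node == "node-a":
        raise RuntimeError("disk error")
    return f"done on {node}"


result = run_with_retries(flaky_task, ["node-a", "node-b", "node-c"])
print(result)  # done on node-b
```

Retrying is safe only because map and reduce tasks are expected to be deterministic and free of side effects, so rerunning one produces the same output.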

slide-16
SLIDE 16

Fault Recovery

If a node crashes:

  • Relaunch its current tasks on other nodes
  • What about task inputs? File system replication

[Figure: three map tasks over the input splits "the quick brown fox", "the fox ate the mouse", "how now brown cow"]

slide-17
SLIDE 17

Fault Recovery

If a task is going slowly (straggler):

  • Launch second copy of task on another node
  • Take the output of whichever finishes first

[Figure: the map task for "the quick brown fox" runs as two copies on different nodes, alongside the map tasks for "the fox ate the mouse" and "how now brown cow"]
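The straggler rule can be sketched with Python threads standing in for nodes. This is an illustration under assumptions: a fixed timeout (slow_after) decides when a task counts as slow, which is a simplification, and the function and node names are hypothetical. The delays in map_task are made up to simulate one slow machine.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_with_backup(task, primary, backup, slow_after):
    """Start task on `primary`; if it has not finished after `slow_after`
    seconds, start a second copy on `backup` and take whichever ends first."""
    pool = ThreadPoolExecutor(max_workers=2)
    futures = {pool.submit(task, primary): primary}

    done, _ = wait(list(futures), timeout=slow_after)
    if not done:
        # The primary looks like a straggler: launch the backup copy.
        futures[pool.submit(task, backup)] = backup
        done, _ = wait(list(futures), return_when=FIRST_COMPLETED)

    winner = next(iter(done))
    # Do not block on the losing copy; a real system would kill it instead.
    pool.shutdown(wait=False)
    return futures[winner], winner.result()


def map_task(node):
    # Simulated run times: one slow machine, one healthy machine.
    time.sleep({"slow-node": 0.6, "backup-node": 0.05}[node])
    return f"output from {node}"


node, output = run_with_backup(map_task, "slow-node", "backup-node", slow_after=0.1)
print(node, output)
```

Taking either copy's output is only correct because both copies compute the same result from the same input split.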

slide-18
SLIDE 18

MORE DESIGN

  • Master failure
  • Locality
  • Task Granularity

slide-19
SLIDE 19

REFINEMENTS

  • Combiner functions
  • Counters
  • Skipping bad records
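As an illustration of the first refinement, a combiner pre-aggregates map output on the map side so that fewer pairs cross the network during the shuffle. The sketch below reuses the word count mapper; the function names are hypothetical, and the combiner here applies the same summing logic as the reducer, which works because addition is associative and commutative.

```python
from collections import Counter


def map_without_combiner(line):
    # Emits one (word, 1) pair per occurrence.
    return [(word, 1) for word in line.split()]


def map_with_combiner(line):
    # The combiner sums counts per word locally before anything is shuffled.
    combined = Counter()
    for word, count in map_without_combiner(line):
        combined[word] += count
    return sorted(combined.items())


line = "the fox ate the mouse"
plain = map_without_combiner(line)
combined = map_with_combiner(line)
print(len(plain), len(combined))  # 5 4
print(combined)  # [('ate', 1), ('fox', 1), ('mouse', 1), ('the', 2)]
```

The saving is small on a five-word line but grows with skewed inputs, where a few frequent words would otherwise dominate the shuffle.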
slide-20
SLIDE 20

Jeff Dean, LADIS 2009

slide-21
SLIDE 21

DISCUSSION

https://forms.gle/hK8wFDxBDfS6chD28

slide-22
SLIDE 22

DISCUSSION

Indexing pipeline where you start with HTML documents. You want to index the documents after removing the most commonly occurring words.

  1. Compute most common words.
  2. Remove them and build the index.

What are the main shortcomings of using MapReduce?
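As a starting point for the discussion, the pipeline can be sketched as two separate jobs run one after the other. This is a single-process illustration under assumptions: documents are already plain text (HTML parsing is omitted), "most common" means the top k words with k = 1 picked arbitrarily, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def job1_word_counts(docs):
    """Job 1: count every word across all documents."""
    counts = Counter()
    for text in docs.values():
        counts.update(text.lower().split())
    return counts


def job2_build_index(docs, stop_words):
    """Job 2: build word -> sorted list of document ids, skipping stop words."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word not in stop_words:
                index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


docs = {
    "d1": "the quick brown fox",
    "d2": "the fox ate the mouse",
    "d3": "how now brown cow",
}
counts = job1_word_counts(docs)
stop_words = {word for word, _ in counts.most_common(1)}  # k = 1 is arbitrary
index = job2_build_index(docs, stop_words)
print(stop_words)
print(index["fox"])
```

Note that in this sketch the second job cannot start until the first has fully finished, and it reads every document a second time.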

slide-23
SLIDE 23

DISCUSSION

slide-24
SLIDE 24

NEXT STEPS

  • Next lecture: Spark
  • Assignment 1: Use Piazza!
  • Project topics: End of this week