

  1. Welcome back!
     CS 744: MAPREDUCE
     Shivaram Venkataraman, Fall 2020

  2. ANNOUNCEMENTS
     • Assignment 1 deliverables
       – Code (comments, formatting)
       – Report (similar to the evaluation sections in papers)
         • Partitioning analysis (graphs, tables, figures etc.)
         • Persistence analysis (graphs, tables, figures etc.)
         • Fault-tolerance analysis (graphs, tables, figures etc.)
     • See Piazza for Spark installation

  3. The course stack, top to bottom:
     Applications: Machine Learning, SQL, Streaming, Graph
     Computational Engines: MapReduce
     Scalable Storage Systems: HDFS
     Resource Management
     Datacenter Architecture

  4. BACKGROUND: PTHREADS
     Threads execute in parallel on a single machine and share the heap;
     synchronization across threads uses locks and condition variables.

     #include <pthread.h>
     #include <stdio.h>
     #include <stdlib.h>
     #include <unistd.h>

     void *myThreadFun(void *vargp) {
         sleep(1);
         printf("Hello World\n");
         return NULL;
     }

     int main() {
         pthread_t thread_id_1, thread_id_2;
         pthread_create(&thread_id_1, NULL, myThreadFun, NULL);
         pthread_create(&thread_id_2, NULL, myThreadFun, NULL);
         pthread_join(thread_id_1, NULL);   // wait for both threads to exit
         pthread_join(thread_id_2, NULL);
         exit(0);
     }

  5. BACKGROUND: MPI
     Used in supercomputing: n processes with ranks 0, 1, ..., n-1 run the
     same program with fine-grained synchronization and message passing
     (e.g., a rank sends a message to rank 0). The hosts for this job are
     listed in host_file:

     mpirun -n 4 -f host_file ./mpi_hello_world

     #include <mpi.h>
     #include <stdio.h>

     int main(int argc, char** argv) {
         MPI_Init(NULL, NULL);

         // Get the number of processes
         int world_size;
         MPI_Comm_size(MPI_COMM_WORLD, &world_size);

         // Get the rank of the process
         int world_rank;
         MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

         // Print off a hello world message
         printf("Hello world from rank %d out of %d processors\n",
                world_rank, world_size);

         // Finalize the MPI environment.
         MPI_Finalize();
     }

  6. MOTIVATION
     Build Google Web Search: crawl documents, build inverted indexes, etc.
     Need for automatic parallelization: programmers didn't want to worry about
     – network and disk optimization
     – handling of machine failures (contrast: an mpirun job dies if one of
       its processes crashes)

  7. OUTLINE
     - Programming Model
     - Execution Overview
     - Fault Tolerance
     - Optimizations

  8. PROGRAMMING MODEL
     Data type: each record is a (key, value) pair
     Map function:    (K_in, V_in) → list(K_inter, V_inter)
     Reduce function: (K_inter, list(V_inter)) → list(K_out, V_out)

  9. Example: Word Count

     def mapper(line):
         for word in line.split():
             output(word, 1)        # emit intermediate (word, 1) pairs

     def reducer(key, values):
         output(key, sum(values))   # e.g., ("cow", [1, 1]) → ("cow", 2)
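To make the model above concrete, here is a minimal single-process sketch of what a runtime does with these two functions: map over every record, group the intermediate pairs by key, then reduce per key. This is an in-memory toy, not the paper's distributed implementation, and passing the output function explicitly (instead of the slide's global output) is my adaptation.

    from collections import defaultdict

    def run_mapreduce(lines, mapper, reducer):
        intermediate = []                      # (key, value) pairs from the map phase
        def emit_intermediate(k, v):
            intermediate.append((k, v))
        for line in lines:
            mapper(line, emit_intermediate)

        groups = defaultdict(list)             # shuffle & sort: group values by key
        for k, v in intermediate:
            groups[k].append(v)

        results = {}
        def emit_final(k, v):
            results[k] = v
        for k, vs in sorted(groups.items()):   # one reduce call per distinct key
            reducer(k, vs, emit_final)
        return results

    def mapper(line, output):
        for word in line.split():
            output(word, 1)

    def reducer(key, values, output):
        output(key, sum(values))

    lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
    print(run_mapreduce(lines, mapper, reducer))   # {'ate': 1, 'brown': 2, ..., 'the': 3}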

  10. Word Count Execution
      Pipeline: Input → Map → Shuffle & Sort → Reduce → Output
      The input data is chunked by the shared file system (GFS); one Map task
      runs per chunk:
      – "the quick brown fox"
      – "the fox ate the mouse"
      – "how now brown cow"
      Each mapper emits intermediate (key, value) pairs, which a partition
      function routes to reducers, e.g. hash("the") mod 2 = 0 and
      hash("brown") mod 2 = 1.
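The partition step can be sketched in a few lines; the default partitioner is hash(key) mod R, and everything else here (the pair list, the bucket printout) is illustrative.

    def partition(key, num_reducers):
        # Default MapReduce partitioner: hash(key) mod R. Python's built-in
        # hash() is illustrative only; a real system uses a stable hash so
        # the same key always lands on the same reducer across machines.
        return hash(key) % num_reducers

    R = 2
    buckets = {r: [] for r in range(R)}
    for k, v in [("the", 1), ("quick", 1), ("brown", 1), ("fox", 1)]:
        buckets[partition(k, R)].append((k, v))
    print(buckets)   # each bucket becomes the input of one reduce task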

  11. Word Count Execution (with combiner)
      A combiner runs the reduce function locally on each mapper's output
      before the shuffle, e.g. (the, 1), (the, 1) → (the, 2). Reducers then
      fetch intermediate files from numbered ports on each mapper (reducer 0
      fetches partition 0, reducer 1 fetches partition 1).
      Final output: the, 3; brown, 2; fox, 2; quick, 1; how, 1; now, 1;
      ate, 1; mouse, 1; cow, 1.
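A hedged sketch of the combiner idea: because word count's reduce (sum) is associative and commutative, each mapper can pre-aggregate its own output, shrinking what crosses the network. Counter is a convenience stand-in for running the reduce function locally.

    from collections import Counter

    def map_with_combiner(line):
        # Run the mapper, then combine counts locally before the shuffle, so
        # a mapper emits at most one (word, count) pair per distinct word.
        return list(Counter(line.split()).items())

    print(map_with_combiner("the fox ate the mouse"))
    # [('the', 2), ('fox', 1), ('ate', 1), ('mouse', 1)]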

  12. ASSUMPTIONS
      Failures are the norm.
      1. Only have tasks: no message passing or shared memory other than the
         intermediate KV pairs.
      2. Local storage (disk) is cheap and abundant.
      3. Applications can be written in this model.
      4. Input is a splittable collection of records.

  13. ASSUMPTIONS
      1. Commodity networking, less bisection bandwidth
      2. Failures are common
      3. Local storage is cheap
      4. Replicated FS

  14. Word Count Execution with a MapReduce framework
      Users submit a job (the MR program plus its configuration) to the
      Master (JobTracker), which automatically splits the work into tasks,
      schedules tasks with locality, and launches the Map and Reduce tasks
      on workers.

  15. Fault Recovery
      If a task crashes:
      – Retry on another node (the input can still be read, thanks to
        replication)
      – If the same task repeatedly fails, end the job
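A minimal sketch of this retry policy, assuming a task can be rerun anywhere its input is replicated; the node names, the 50% failure rate, and the MAX_ATTEMPTS limit are all made up for illustration.

    import random

    class TaskFailure(Exception): pass
    class JobFailed(Exception): pass

    MAX_ATTEMPTS = 4   # assumed knob; the paper does not name a specific limit

    def flaky_task(node):
        # Stand-in for running a map/reduce task: fails half the time.
        if random.random() < 0.5:
            raise TaskFailure(f"task crashed on {node}")
        return f"output from {node}"

    def run_with_retries(task, nodes):
        for attempt in range(MAX_ATTEMPTS):
            node = nodes[attempt % len(nodes)]   # retry on another node
            try:
                return task(node)
            except TaskFailure:
                continue
        # If the same task repeatedly fails, end the job.
        raise JobFailed(f"task failed {MAX_ATTEMPTS} times")

    try:
        print(run_with_retries(flaky_task, ["node-a", "node-b", "node-c"]))
    except JobFailed as e:
        print("job ended:", e)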

  16. Fault Recovery
      If a node crashes:
      – Relaunch its current tasks on other nodes
      – What about task inputs? File system replication

  17. Fault Recovery
      If a task is going slowly (straggler):
      – Launch a second copy of the task on another node
      – Take the output of whichever copy finishes first
      This assumes the task runs much slower than the other mappers because
      something about its node is making it slow, and that tasks are
      deterministic, so either copy's output is acceptable.
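Straggler mitigation can be sketched as two racing copies of the same deterministic task; the node names and delays are made up, and threads stand in for cluster workers.

    from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait
    import time

    def task_on(node, delay):
        # Stand-in for the same deterministic task on two different nodes;
        # the delays simulate a straggler.
        time.sleep(delay)
        return f"result from {node}"

    with ThreadPoolExecutor(max_workers=2) as pool:
        # Launch the original and a backup copy; take whichever finishes first.
        futures = {pool.submit(task_on, "slow-node", 5.0),
                   pool.submit(task_on, "backup-node", 0.5)}
        done, not_done = wait(futures, return_when=FIRST_COMPLETED)
        print(done.pop().result())   # "result from backup-node"
        for f in not_done:
            f.cancel()               # best-effort cancel of the slower copy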

  18. MORE DESIGN
      – Master failure: the master checkpoints its metadata; the probability
        of the master failing is low, so if it does fail, retry the job
      – Locality: GFS metadata records where chunks are, so Map tasks are
        scheduled near their input
      – Task granularity
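A hedged sketch of the locality point: the scheduler consults the file system's chunk-to-replica map and prefers a worker that already holds the input chunk. All names here are invented.

    # Maps each input chunk to the nodes holding its GFS replicas (made up).
    chunk_locations = {
        "chunk-1": ["node-a", "node-b", "node-c"],
        "chunk-2": ["node-b", "node-d", "node-e"],
    }

    def schedule_map_task(chunk, idle_nodes):
        # Prefer an idle node that already stores a replica of the chunk;
        # otherwise fall back to any idle node (paying a network read).
        for node in chunk_locations.get(chunk, []):
            if node in idle_nodes:
                return node
        return next(iter(idle_nodes))

    print(schedule_map_task("chunk-1", {"node-b", "node-z"}))   # node-b (local)
    print(schedule_map_task("chunk-2", {"node-z"}))             # node-z (remote read)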

  19. MAPREDUCE: SUMMARY
      – Simplify programming on large clusters with frequent failures
      – Limited but general functional API
        – Map, Reduce, Sort
        – No other synchronization / communication; intermediate data is
          "pushed" between stages
      – Fault recovery, straggler mitigation through retries

  20. DISCUSSION https://forms.gle/mAHD4QuMXko7vnjB6

  21. DISCUSSION
      List one similarity and one difference between MPI and MapReduce.
      Similarity: both are parallel computing models aimed at performance.
      Differences:
      – MapReduce: limited programming model; fault tolerance; intermediate
        KV pairs go through storage
      – MPI: expressive; fine-grained messages over the network

  22. DISCUSSION
      Indexing pipeline where you start with HTML documents. You want to
      index the documents after removing the most commonly occurring words.
      1. Compute the most common words (the, a, ...).
      2. Remove them and build the index.
      What are the main shortcomings of using MapReduce to do this?
      ① How does step 2 access the list of common words from step 1?
      ② Both MR jobs read the same inputs: repeated disk I/O, with no way
        to compose the two jobs to avoid it.
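To see shortcoming ② concretely, here is a toy version of the two passes. In real MapReduce each job rereads the whole corpus from the file system, and the stop-word list from pass 1 must be shipped to every task of pass 2 as side input; all helpers and names below are made up.

    from collections import Counter

    def job1_common_words(corpus_lines, k):
        # Pass 1 (an MR job in the real pipeline): count words over the
        # whole corpus and keep the k most common ones.
        counts = Counter(w for line in corpus_lines for w in line.split())
        return {w for w, _ in counts.most_common(k)}

    def job2_build_index(corpus_lines, stop_words):
        # Pass 2 (a second MR job): reread the same corpus and build an
        # inverted index, skipping the common words from pass 1.
        index = {}
        for doc_id, line in enumerate(corpus_lines):
            for word in line.split():
                if word not in stop_words:
                    index.setdefault(word, set()).add(doc_id)
        return index

    corpus = ["the quick brown fox", "the fox ate the mouse"]
    stop = job1_common_words(corpus, k=1)    # first full read of the corpus
    print(job2_build_index(corpus, stop))    # second full read of the corpus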

  23. Resiliency trade-offs: jobs that finish fast see barely any failures;
      a job that runs longer is more likely to lose workers, and if parts of
      the map output are lost, those map tasks also need to be redone.

  24. Jeff Dean, LADIS 2009

  25. NEXT STEPS
      • Next lecture: Spark
      • Assignment 1: Use Piazza!
