

  1. Massively Parallel Computation Philip Bille

2. Sequential Computation • Computation. • Read and write in storage. • Arithmetic and boolean operations. • Control-flow (if-then-else, while-do, ...). • Scalability. • Massive data. • Efficiency constraints. • Limited resources. [Figure: a single CPU reading rows of binary data from storage.]

3. Massively Parallel Computation • Massively parallel computation. • Lots of sequential processors. • Parallelism. • Communication. • Failures and error recovery. • Deadlock and race conditions. • Predictability. • Implementation.

  4. MapReduce

  5. MapReduce • “MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.” — Wikipedia.

6. MapReduce • Dataflow. • Split. Partition data into segments and distribute them to different machines. • Map. Map data items to a list of <key, value> pairs. • Shuffle. Group data with the same key and send it to the same machine. • Reduce. Take a list of values with the same key <key, [value_1, ..., value_k]> and output a list of new data items. • You only write the map and reduce functions. • Goals. • Few rounds, maximum parallelism. • Even work distribution. • Small total work.

7. MapReduce [Figure: dataflow through the stages input → splitting → mapping → shuffling → reducing → output.] map(data item) → list of <key, value> pairs. reduce(key, [value_1, value_2, ..., value_k]) → list of new items.
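The dataflow above is easy to simulate on one machine. Below is a minimal single-process sketch in Python (ours, not from the slides; the name run_mapreduce is our own): it applies the map function to every item, groups the resulting pairs by key in place of the distributed shuffle, and feeds each group to the reduce function. The examples on the following slides can be run against it.

```python
from collections import defaultdict
from itertools import chain

def run_mapreduce(items, map_fn, reduce_fn):
    """Single-machine simulation of one MapReduce round."""
    # Map: every input item yields a list of <key, value> pairs.
    pairs = chain.from_iterable(map_fn(item) for item in items)
    # Shuffle: group values by key (a real cluster hashes keys to machines).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: every <key, [value_1, ..., value_k]> yields new output items.
    return [out for key, values in groups.items()
            for out in reduce_fn(key, values)]
```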

8. Word Counting • Input. • A document of words. • Output. • The frequency of each word. • Document: “Deer Bear River Car Car River Deer Car Bear.” • (Bear, 2), (Car, 3), (Deer, 2), (River, 2)

9. [Figure: the word-count dataflow: input → splitting → mapping → shuffling → reducing → output.] map(word) → <word, 1>. reduce(word, [1, 1, .., 1]) → <word, number of 1's>.
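Word counting plugged into the run_mapreduce sketch above (our illustration; the slide only gives the two signatures):

```python
def wc_map(word):
    return [(word, 1)]            # map(word) -> <word, 1>

def wc_reduce(word, ones):
    return [(word, sum(ones))]    # reduce(word, [1, 1, ..]) -> <word, count>

doc = "Deer Bear River Car Car River Deer Car Bear"
print(sorted(run_mapreduce(doc.split(), wc_map, wc_reduce)))
# [('Bear', 2), ('Car', 3), ('Deer', 2), ('River', 2)]
```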

10. Inverted Index • Input. • A set of documents. • Output. • For each word, the list of documents that contain it. • Document 1: “Deer Bear River Car Car River Deer Car Bear.” • Document 2: “Deer Antelope Stream River Stream” • (Bear, [1]), (Car, [1]), (Deer, [1,2]), (River, [1,2]), (Antelope, [2]), (Stream, [2])
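The same pattern for the inverted index, again on top of run_mapreduce (the function names ii_map and ii_reduce are ours):

```python
def ii_map(doc):
    doc_id, text = doc
    # Emit <word, doc_id> once per distinct word in the document.
    return [(word, doc_id) for word in set(text.split())]

def ii_reduce(word, doc_ids):
    # The posting list: sorted ids of the documents containing the word.
    return [(word, sorted(doc_ids))]

docs = [(1, "Deer Bear River Car Car River Deer Car Bear"),
        (2, "Deer Antelope Stream River Stream")]
print(sorted(run_mapreduce(docs, ii_map, ii_reduce)))
# [('Antelope', [2]), ('Bear', [1]), ('Car', [1]), ('Deer', [1, 2]),
#  ('River', [1, 2]), ('Stream', [2])]
```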

11. Common Friends • Input. • Friend lists: A → B C D, B → A C D E, C → A B D E, D → A B C E, E → B C D. • Output. • For each pair of friends, the list of their common friends: (A B) → (C D), (A C) → (B D), (A D) → (B C), (B C) → (A D E), (B D) → (A C E), (B E) → (C D), (C D) → (A B E), (C E) → (B D), (D E) → (B C). [Figure: the friendship graph on A, B, C, D, E.]

12. Map: each person X with friend list L emits, for every friend Y in L, the pair <sorted key (X Y), L>. • A → B C D emits (A B) → B C D, (A C) → B C D, (A D) → B C D. • B → A C D E emits (A B) → A C D E, (B C) → A C D E, (B D) → A C D E, (B E) → A C D E. • C → A B D E emits (A C) → A B D E, (B C) → A B D E, (C D) → A B D E, (C E) → A B D E. • D → A B C E emits (A D) → A B C E, (B D) → A B C E, (C D) → A B C E, (D E) → A B C E. • E → B C D emits (B E) → B C D, (C E) → B C D, (D E) → B C D.

13. Shuffle: group by key, so each pair of friends receives both friend lists. • (A B) → (A C D E) (B C D) • (A C) → (A B D E) (B C D) • (A D) → (A B C E) (B C D) • (B C) → (A B D E) (A C D E) • (B D) → (A B C E) (A C D E) • (B E) → (A C D E) (B C D) • (C D) → (A B C E) (A B D E) • (C E) → (A B D E) (B C D) • (D E) → (A B C E) (B C D)

14. Reduce: intersect the two lists for each pair to get the common friends. • (A B) → (C D) • (A C) → (B D) • (A D) → (B C) • (B C) → (A D E) • (B D) → (A C E) • (B E) → (C D) • (C D) → (A B E) • (C E) → (B D) • (D E) → (B C)

15. [Figure: the complete common-friends dataflow, input → splitting → mapping → shuffling → reducing → output, combining the map, shuffle, and reduce steps of slides 12-14.]
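The whole common-friends pipeline as code on top of run_mapreduce (a sketch, with names of our choosing): the mapper keys each emission by the sorted pair so both ends of a friendship hit the same key, and the reducer intersects the two lists that arrive for every pair.

```python
def cf_map(entry):
    person, friends = entry
    # Emit <sorted pair, full friend list> for every friend of this person.
    return [(tuple(sorted((person, f))), friends) for f in friends]

def cf_reduce(pair, friend_lists):
    # Friendship is symmetric, so exactly two lists arrive per pair.
    a, b = friend_lists
    return [(pair, sorted(set(a) & set(b)))]

friends = {"A": "BCD", "B": "ACDE", "C": "ABDE", "D": "ABCE", "E": "BCD"}
print(sorted(run_mapreduce(friends.items(), cf_map, cf_reduce)))
# [(('A', 'B'), ['C', 'D']), (('A', 'C'), ['B', 'D']), (('A', 'D'), ['B', 'C']),
#  (('B', 'C'), ['A', 'D', 'E']), (('B', 'D'), ['A', 'C', 'E']), ...]
```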

16. K-means • Input. • A list of points and an integer k. • Output. • k clusters. • Algorithm (sequential). 1. Pick k random centers. 2. Assign each point to the nearest center. 3. Move each center to the centroid of its cluster. 4. Repeat 2-3 until all centers are stable.
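A compact sequential sketch of steps 1-4 (our code; the helpers dist2 and centroid are names we introduce and reuse below):

```python
import random

def dist2(p, q):
    # Squared Euclidean distance; monotone in distance, so fine for argmin.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(pts):
    return tuple(sum(cs) / len(pts) for cs in zip(*pts))

def kmeans(points, k):
    centers = random.sample(points, k)            # 1. pick k random centers
    while True:
        clusters = [[] for _ in centers]          # 2. assign each point to
        for p in points:                          #    its nearest center
            i = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[i].append(p)
        new = [centroid(c) if c else centers[i]   # 3. move each center to its
               for i, c in enumerate(clusters)]   #    cluster's centroid
        if new == centers:                        # 4. repeat until stable
            return centers, clusters
        centers = new
```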

17. K-means in MapReduce • K-means iteration. • map(point, list of centers) → <closest center, point> • reduce(center, [point_1, ..., point_k]) → centroid of point_1, ..., point_k
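One iteration expressed as map and reduce, reusing run_mapreduce and the dist2/centroid helpers from the sequential sketch (our illustration):

```python
def km_map(item):
    point, centers = item
    # Key the point by its closest center.
    return [(min(centers, key=lambda c: dist2(point, c)), point)]

def km_reduce(center, points):
    # The center moves to the centroid of the points assigned to it.
    return [centroid(points)]

points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centers = [(0, 0), (10, 10)]
centers = run_mapreduce([(p, centers) for p in points], km_map, km_reduce)
# -> [(0.0, 1.0), (10.0, 11.0)]; a driver re-runs this until centers are stable.
```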

18. MapReduce Architecture • Master. • Dispatches map and reduce tasks to workers. • Worker. • Performs map and reduce tasks. • Buffered input/output. • Splitting and shuffling via hashing. • Combiners. • Fault tolerance. • Worker checkpointing. • Master restart. [Figure: the user program forks a master and workers; the master assigns map tasks over input splits 0-4 and assigns reduce tasks; map workers write intermediate files to local disk, reduce workers read them remotely and write output files 0 and 1.]
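Two of those ideas in miniature (a sketch under our assumptions, not Google's implementation): shuffling routes each key to a reducer by hashing, and a combiner pre-aggregates map output locally before it crosses the network.

```python
import hashlib

NUM_REDUCERS = 4

def partition(key):
    # Deterministic hash so every machine routes the same key to the same
    # reducer (Python's built-in hash() is salted per process, so avoid it).
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_REDUCERS

def wc_combine(word, ones):
    # Combiner for word count: collapse one pair per occurrence into one
    # pair per word before the shuffle, cutting network traffic.
    return [(word, sum(ones))]
```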

19. MapReduce and Massively Parallel Computation • Parallelism. • Communication. • Failures and error recovery. • Deadlock and race conditions. • Predictability. • Implementation. [Figure: the word-count dataflow from slide 9: input → splitting → mapping → shuffling → reducing → output, with map(word) → <word, 1> and reduce(word, [1, 1, .., 1]) → <word, number of 1's>.]

  20. MapReduce Applications • Design patterns. • Counting, summing, filtering, sorting • Cross-correlation (data mining) • Iterative message processing (graph processing, clustering) • More examples. • Text search • URL access frequency • Reverse web-link graph

21. MapReduce Implementations and Users • Implementations. • Google MapReduce (2004) • Apache Hadoop (2006) • CouchDB (2005) • Disco Project (2008) • Infinispan (2009) • Riak (2009) • Example uses. • Yahoo (2008): 10,000 Linux cores, the Yahoo! Search Webmap. • Facebook (2012): analytics on 100 PB of storage, growing by over 0.5 PB per day. • TimesMachine (2008): digitized full-page scans of 150 years of the NYT, processed on AWS.
