Massively Parallel Computation Philip Bille Sequential Computation - PowerPoint PPT Presentation

Massively Parallel Computation Philip Bille

Sequential Computation • Computation. • Read and write in storage • Arithmetic and boolean operations • Control-flow (if-then-else, while-do, ...) • Scalability. • Massive data. • Efficiency constraints. • Limited resources. [Figure: a CPU reading and writing binary words in storage.]

Massively Parallel Computation • Massively parallel computation. • Lots of sequential processors. • Parallelism. • Communication. • Failures and error recovery. • Deadlock and race conditions • Predictability • Implementation

MapReduce

MapReduce • “MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.” — Wikipedia.

MapReduce • Dataflow. • Split. Partition data into segments and distribute to different machines. • Map. Map data items to a list of <key, value> pairs. • Shuffle. Group data with the same key and send to the same machine. • Reduce. Take a list of values with the same key, <key, [value_1, ..., value_k]>, and output a list of new data items. • You only write the map and reduce functions. • Goals. • Few rounds, maximum parallelism. • Work distribution. • Small total work.

MapReduce • input → splitting → mapping → shuffling → reducing → output • map(data item) → list of <key, value> pairs • reduce(key, [value_1, value_2, ..., value_k]) → list of new items
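The dataflow can be simulated on a single machine to make the two signatures concrete. The following is a minimal sketch, not a distributed implementation: the name `run_mapreduce` is hypothetical, splitting is skipped, and the shuffle is an in-memory dictionary.

```python
from collections import defaultdict


def run_mapreduce(items, map_fn, reduce_fn):
    """Single-process simulation of map, shuffle, and reduce (hypothetical helper)."""
    # Map: every data item yields a list of (key, value) pairs.
    intermediate = []
    for item in items:
        intermediate.extend(map_fn(item))

    # Shuffle: group all values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: each (key, [values]) group yields a list of new items.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


# Usage: keep the largest value seen for each key.
pairs = [("a", 3), ("b", 1), ("a", 5)]
largest = run_mapreduce(
    pairs,
    lambda item: [item],                         # map: pass the pair through
    lambda key, values: [(key, max(values))],    # reduce: maximum per key
)
print(largest)  # [('a', 5), ('b', 1)]
```

Only `map_fn` and `reduce_fn` change from one application to the next, which mirrors the point that the programmer writes just these two functions.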

Word Counting • Input. • Document of words • Output. • Frequency of each word • Document: “Deer Bear River Car Car River Deer Car Bear.” • (Bear, 2), (Car, 3), (Deer, 2), (River, 2)

input → splitting → mapping → shuffling → reducing → output • map(word) → <word, 1> • reduce(word, [1, 1, ..., 1]) → <word, number of 1's>
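As a rough illustration, the word-counting map and reduce functions can be run on the example document in plain Python. The function names are hypothetical, and the shuffle is simulated with a dictionary on one machine.

```python
from collections import defaultdict


def map_word(word):
    # map(word) -> <word, 1>
    return [(word, 1)]


def reduce_word(word, ones):
    # reduce(word, [1, 1, ..., 1]) -> <word, number of 1's>
    return [(word, sum(ones))]


def word_count(document):
    # Splitting: strip the final period and break the text into words.
    words = document.replace(".", "").split()

    # Mapping and shuffling: collect the 1's emitted for each word.
    groups = defaultdict(list)
    for word in words:
        for key, value in map_word(word):
            groups[key].append(value)

    # Reducing: one output pair per distinct word.
    result = []
    for word in sorted(groups):
        result.extend(reduce_word(word, groups[word]))
    return result


counts = word_count("Deer Bear River Car Car River Deer Car Bear.")
print(counts)  # [('Bear', 2), ('Car', 3), ('Deer', 2), ('River', 2)]
```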

Inverted Index • Input. • Set of documents • Output. • List of documents that contain each word. • Document 1: “Deer Bear River Car Car River Deer Car Bear.” • Document 2: “Deer Antelope Stream River Stream” • (Bear, [1]), (Car, [1]), (Deer, [1,2]), (River, [1,2]), (Antelope, [2]), (Stream, [2])
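The slide does not spell out the map and reduce functions for the inverted index. One plausible formulation, given here as an assumption, is map(document id, text) → <word, document id> and reduce(word, [ids]) → <word, sorted list of distinct ids>. All function names below are hypothetical.

```python
from collections import defaultdict


def map_document(doc_id, text):
    # Emit <word, doc_id> for every word occurrence in the document.
    return [(word, doc_id) for word in text.replace(".", "").split()]


def reduce_postings(word, doc_ids):
    # Collapse repeated ids so each document appears once per word.
    return (word, sorted(set(doc_ids)))


def inverted_index(documents):
    # Shuffle: group document ids by word.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for word, emitted_id in map_document(doc_id, text):
            groups[word].append(emitted_id)
    return dict(reduce_postings(word, ids) for word, ids in groups.items())


index = inverted_index({
    1: "Deer Bear River Car Car River Deer Car Bear.",
    2: "Deer Antelope Stream River Stream",
})
print(index["Deer"])  # [1, 2]
```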

Common Friends • Input. • Friends lists • Output. • For pairs of friends, a list of common friends
Example input (friends lists):
A → B C D
B → A C D E
C → A B D E
D → A B C E
E → B C D
Example output (common friends):
(A B) → (C D)
(A C) → (B D)
(A D) → (B C)
(B C) → (A D E)
(B D) → (A C E)
(B E) → (C D)
(C D) → (A B E)
(C E) → (B D)
(D E) → (B C)
[Figure: friendship graph on A, B, C, D, E.]

Map (keys are sorted pairs; each friends list emits one <key, value> pair per friend, with the full list as value):
A → B C D emits (A B) → B C D, (A C) → B C D, (A D) → B C D
B → A C D E emits (A B) → A C D E, (B C) → A C D E, (B D) → A C D E, (B E) → A C D E
C → A B D E emits (A C) → A B D E, (B C) → A B D E, (C D) → A B D E, (C E) → A B D E
D → A B C E emits (A D) → A B C E, (B D) → A B C E, (C D) → A B C E, (D E) → A B C E
E → B C D emits (B E) → B C D, (C E) → B C D, (D E) → B C D

Group by key (the two values emitted for each pair end up together):
(A B) → (A C D E) (B C D)
(A C) → (A B D E) (B C D)
(A D) → (A B C E) (B C D)
(B C) → (A B D E) (A C D E)
(B D) → (A B C E) (A C D E)
(B E) → (A C D E) (B C D)
(C D) → (A B C E) (A B D E)
(C E) → (A B D E) (B C D)
(D E) → (A B C E) (B C D)

Reduce (intersect the two lists grouped under each key):
(A B) → (A C D E) (B C D) reduces to (A B) → (C D)
(A C) → (A B D E) (B C D) reduces to (A C) → (B D)
(A D) → (A B C E) (B C D) reduces to (A D) → (B C)
(B C) → (A B D E) (A C D E) reduces to (B C) → (A D E)
(B D) → (A B C E) (A C D E) reduces to (B D) → (A C E)
(B E) → (A C D E) (B C D) reduces to (B E) → (C D)
(C D) → (A B C E) (A B D E) reduces to (C D) → (A B E)
(C E) → (A B D E) (B C D) reduces to (C E) → (B D)
(D E) → (A B C E) (B C D) reduces to (D E) → (B C)

input → splitting → mapping → shuffling → reducing → output [Figure: the complete dataflow for the common friends example, from the friends lists A → B C D, B → A C D E, C → A B D E, D → A B C E, E → B C D to the output pairs (A B) → (C D), ..., (D E) → (B C).]
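The map, group-by-key, and reduce steps of the common friends example can be sketched in a few lines of Python. This is a single-machine illustration with hypothetical function names, and it assumes friendships are symmetric, so every pair of friends receives exactly two friends lists.

```python
from collections import defaultdict


def map_friends(person, friends):
    # One pair per friend; sorting the key makes (A, B) and (B, A) identical.
    return [(tuple(sorted((person, friend))), set(friends)) for friend in friends]


def reduce_common(pair, friend_sets):
    # The common friends are the intersection of the grouped friends lists.
    return (pair, sorted(set.intersection(*friend_sets)))


def common_friends(friend_lists):
    # Group by key: collect the friends lists emitted for each pair.
    groups = defaultdict(list)
    for person, friends in friend_lists.items():
        for pair, friend_set in map_friends(person, friends):
            groups[pair].append(friend_set)
    return dict(reduce_common(pair, groups[pair]) for pair in sorted(groups))


result = common_friends({
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D", "E"],
    "C": ["A", "B", "D", "E"],
    "D": ["A", "B", "C", "E"],
    "E": ["B", "C", "D"],
})
print(result[("A", "B")])  # ['C', 'D']
print(result[("B", "C")])  # ['A', 'D', 'E']
```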

K-means • Input • List of points, integer k • Output • k clusters • Algorithm (sequential). 1. Pick k random centers. 2. Assign each point to the nearest center. 3. Move each center to the centroid of its cluster. 4. Repeat steps 2-3 until all centers are stable.
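The four steps translate fairly directly into code. The sketch below makes several assumptions that the slide leaves open: points are tuples of numbers, distance is Euclidean, the random centers are drawn from the input points, an empty cluster keeps its old center, and `max_rounds` caps the number of iterations. All names are hypothetical.

```python
import math
import random


def nearest(point, centers):
    # Index of the center closest to the point (Euclidean distance).
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def centroid(points):
    # Coordinate-wise mean of a non-empty list of points.
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, seed=0, max_rounds=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # 1. pick k random centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_rounds):
        clusters = [[] for _ in range(k)]
        for point in points:                     # 2. assign to nearest center
            clusters[nearest(point, centers)].append(point)
        new_centers = [                          # 3. move centers to centroids
            centroid(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:               # 4. stop once centers are stable
            break
        centers = new_centers
    return centers, clusters


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, clusters = k_means(points, k=2)
print(sorted(centers))  # [(0.0, 0.5), (10.0, 10.5)]
```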

K-means in MapReduce • K-means iteration. • map(point, list of centers) → <closest center, point> • reduce(center, [point_1, ..., point_k]) → centroid of point_1, ..., point_k
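One iteration of this scheme can be sketched as follows. The code runs on one machine and simulates the shuffle with a dictionary; it assumes points and centers are tuples of numbers, uses Euclidean distance, and leaves a center unchanged when no point is assigned to it. The function names are hypothetical.

```python
import math
from collections import defaultdict


def map_point(point, centers):
    # map(point, list of centers) -> <closest center, point>
    closest = min(centers, key=lambda center: math.dist(point, center))
    return (closest, point)


def reduce_center(center, points):
    # reduce(center, [points]) -> centroid of the points
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans_iteration(points, centers):
    # Shuffle: group the points by the center they were mapped to.
    groups = defaultdict(list)
    for point in points:
        center, assigned = map_point(point, centers)
        groups[center].append(assigned)
    # A center that attracted no points keeps its position (assumption).
    return [
        reduce_center(center, groups[center]) if center in groups else center
        for center in centers
    ]


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
new_centers = kmeans_iteration(points, [(1, 1), (9, 9)])
print(new_centers)  # [(0.0, 1.0), (10.0, 11.0)]
```

A driver program would call `kmeans_iteration` repeatedly, feeding each result back in as the next list of centers, until the centers stop changing.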

MapReduce Architecture • Master. • Dispatches map and reduce tasks to workers. • Worker. • Performs map and reduce tasks. • Buffered input/output. • Splitting and shuffling via hashing. • Combiners. • Fault tolerance. • Worker checkpointing. • Master restart. [Figure: execution overview. The user program (1) forks the master and the workers; the master (2) assigns map and reduce tasks; map workers (3) read the input splits 0-4 and (4) write intermediate files to local disks; reduce workers (5) read them remotely and (6) write output files 0 and 1. Stages: input files, map phase, intermediate files (on local disks), reduce phase, output files.]
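Shuffling via hashing and combiners can be illustrated with word counting. In the sketch below each map worker first combines its own <word, 1> pairs locally, and a hash of the key then selects the reduce worker. The choice of `zlib.crc32` as the hash function is an assumption made here for reproducibility (Python's built-in `hash` for strings varies between runs); the slides do not name a hash function, and all function names are hypothetical.

```python
import zlib
from collections import Counter


def partition(key, num_reducers):
    # The same key always hashes to the same reduce worker.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def combine(pairs):
    # Combiner: pre-sum the counts on the map worker to cut communication.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())


def shuffle(map_outputs, num_reducers):
    # map_outputs holds one list of <word, count> pairs per map worker.
    buckets = [[] for _ in range(num_reducers)]
    for pairs in map_outputs:
        for word, count in combine(pairs):
            buckets[partition(word, num_reducers)].append((word, count))
    return buckets


map_outputs = [
    [("Car", 1), ("Car", 1), ("Deer", 1)],   # output of map worker 0
    [("Car", 1), ("Deer", 1)],               # output of map worker 1
]
buckets = shuffle(map_outputs, num_reducers=3)
print(sum(len(bucket) for bucket in buckets))  # 4 pairs sent instead of 5
```

The combiner is safe here because addition is associative and commutative, so summing partial counts on the map side does not change the final totals.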

MapReduce and Massively Parallel Computation • Parallelism. • Communication. • Failures and error recovery. • Deadlock and race conditions • Predictability • Implementation [Figure: the word-counting dataflow, input → splitting → mapping → shuffling → reducing → output, with map(word) → <word, 1> and reduce(word, [1, 1, ..., 1]) → <word, number of 1's>.]

MapReduce Applications • Design patterns. • Counting, summing, filtering, sorting • Cross-correlation (data mining) • Iterative message processing (graph processing, clustering) • More examples. • Text search • URL access frequency • Reverse web-link graph

MapReduce Implementation and Users • Implementations. • Google MapReduce (2004) • Apache Hadoop (2006) • CouchDB (2005) • Disco Project (2008) • Infinispan (2009) • Riak (2009) • Example uses. • Yahoo (2008): 10,000 Linux cores, the Yahoo! Search Webmap. • Facebook (2012): Analytics on 100 PB of storage, growing by 0.5 PB per day. • TimesMachine (2008): Digitized full-page scans of 150 years of NYT on AWS.
