SLIDE 1
Data-Intensive Distributed Computing
Part 1: MapReduce Algorithm Design (3/3)
431/451/631/651 (Fall 2020) Ali Abedi
These slides are available at https://www.student.cs.uwaterloo.ca/~cs451/
SLIDE 2
We now talk more about combiner design.
[Figure: the MapReduce dataflow. Mappers emit intermediate (key, value) pairs; combiners perform local aggregation on each map task's output; partitioners assign intermediate keys to reducers; the framework groups values by key; reducers produce the final output.]
Important detail: reducers process keys in sorted order
SLIDE 3
Importance of Local Aggregation
Ideal scaling characteristics:
Twice the data, twice the running time
Twice the resources, half the running time
Why can’t we achieve this?
Synchronization requires communication
Communication kills performance
Thus… avoid communication!
Reduce intermediate data via local aggregation
Combiners can help
SLIDE 4
Combiner Design
Combiners and reducers share the same method signature
Sometimes, reducers can serve as combiners
Often, not…
Remember: combiners are optional optimizations
Should not affect algorithm correctness May be run 0, 1, or multiple times
Example: find the average of the integers associated with the same key
SLIDE 5
Why can’t we use the reducer as a combiner?
Computing the Mean: Version 1
class Mapper {
  def map(key: String, value: Int) = {
    emit(key, value)
  }
}

class Reducer {
  def reduce(key: String, values: Iterable[Int]) = {
    var sum = 0; var cnt = 0
    for (value <- values) {
      sum += value
      cnt += 1
    }
    emit(key, sum / cnt)
  }
}
Sample input: (a, 7) (a, 18) (c, 4) (b, 1) (c, 10) (a, 3) …
AVG(4, 4, 2, 2, 2) ≠ AVG(AVG(4, 4), AVG(2, 2, 2)) = 3
No, we cannot use the reducer as a combiner, because we cannot take averages of partial averages! The math would be wrong.
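The failure can be checked with plain arithmetic (a minimal sketch; the two lists stand for the values of one key split across two hypothetical map tasks):

```python
# Why the reducer can't serve as a combiner for the mean: averaging
# partial averages gives the wrong answer unless the groups are equal-sized.
part1 = [4, 4]        # values seen by map task 1
part2 = [2, 2, 2]     # values seen by map task 2

true_mean = sum(part1 + part2) / len(part1 + part2)                   # 14 / 5 = 2.8
mean_of_means = (sum(part1) / len(part1) + sum(part2) / len(part2)) / 2  # (4 + 2) / 2 = 3.0

assert true_mean != mean_of_means
```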
SLIDE 6
Computing the Mean: Version 2

class Mapper {
  def map(key: String, value: Int) = emit(key, value)
}

class Combiner {
  def reduce(key: String, values: Iterable[Int]) = {
    var sum = 0; var cnt = 0
    for (value <- values) {
      sum += value
      cnt += 1
    }
    emit(key, (sum, cnt))
  }
}

class Reducer {
  def reduce(key: String, values: Iterable[Pair]) = {
    var sum = 0; var cnt = 0
    for ((s, c) <- values) {
      sum += s
      cnt += c
    }
    emit(key, sum / cnt)
  }
}

Why doesn’t this work?
Sample input: (a, 7) (a, 18) (c, 4) (b, 1) (c, 10) (a, 3) …
The input to the reducer might come from the mapper or from the combiner, but the output types of the mapper and the combiner differ. This implementation assumes that combiners always run, which is not true.
SLIDE 7
Computing the Mean: Version 3

class Mapper {
  def map(key: String, value: Int) = emit(key, (value, 1))
}

class Combiner {
  def reduce(key: String, values: Iterable[Pair]) = {
    var sum = 0; var cnt = 0
    for ((s, c) <- values) {
      sum += s
      cnt += c
    }
    emit(key, (sum, cnt))
  }
}

class Reducer {
  def reduce(key: String, values: Iterable[Pair]) = {
    var sum = 0; var cnt = 0
    for ((s, c) <- values) {
      sum += s
      cnt += c
    }
    emit(key, sum / cnt)
  }
}

The problem is fixed by modifying the mapper’s output to match the combiner’s output: both now emit (sum, count) pairs.
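Version 3 can be sketched outside Hadoop as a small simulation (the names `map_fn`, `combine_fn`, and `reduce_fn` are hypothetical helpers, not the Hadoop API); note the answer stays correct even when the combiner runs on only one map task’s output:

```python
from collections import defaultdict

def map_fn(key, value):
    # Mapper emits (value, 1) so mapper and combiner outputs have the same type.
    return [(key, (value, 1))]

def combine_fn(key, pairs):
    s = sum(p[0] for p in pairs)
    c = sum(p[1] for p in pairs)
    return [(key, (s, c))]

def reduce_fn(key, pairs):
    s = sum(p[0] for p in pairs)
    c = sum(p[1] for p in pairs)
    return [(key, s / c)]

# Simulate: the combiner runs on one map task's output but not the other,
# which is legal because combiners may run 0, 1, or multiple times.
task1 = map_fn('a', 7) + map_fn('a', 18)
task2 = map_fn('a', 3)                      # combiner skipped here
combined = combine_fn('a', [v for _, v in task1])

shuffled = defaultdict(list)
for k, v in combined + task2:
    shuffled[k].append(v)
result = dict(pair for k, vs in shuffled.items() for pair in reduce_fn(k, vs))
# result['a'] == (7 + 18 + 3) / 3
```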
SLIDE 8
Performance
Sample input: (a, 7) (a, 18) (c, 4) (b, 1) (c, 10) (a, 3) …

200m integers across three char keys:
V1 (Baseline): ~120s
V3 (+ Combiner): ~90s

Using a combiner significantly improves performance.
SLIDE 9
In-Mapper Combiner
SLIDE 10
class Mapper {
  val counts = new Map()
  def map(key: Long, value: String) = {
    for (word <- tokenize(value)) {
      counts(word) += 1
    }
  }
  def cleanup() = {
    for ((k, v) <- counts) {
      emit(k, v)
    }
  }
}
Key idea: preserve state across input key-value pairs!
Word count with in-mapper combiner
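The same pattern can be sketched in plain Python (a hypothetical `InMapperCombiningMapper` class, not the Hadoop Mapper API):

```python
from collections import defaultdict

class InMapperCombiningMapper:
    """Word count with in-mapper combining: counts accumulate in an
    in-memory map across all input records and are emitted once in
    cleanup(). A sketch of the pattern, not a framework class."""

    def __init__(self):
        self.counts = defaultdict(int)   # state preserved across map() calls

    def map(self, key, line):
        for word in line.split():
            self.counts[word] += 1       # no emit here: aggregate locally

    def cleanup(self):
        # Emit each word once per map task instead of once per occurrence.
        return list(self.counts.items())

m = InMapperCombiningMapper()
m.map(0, "a rose is a rose")
m.map(1, "is a rose")
emitted = dict(m.cleanup())
```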
SLIDE 11
In-mapper combining
Fold the functionality of the combiner into the mapper by preserving state across multiple map calls
Advantages
Speed
Why is this faster than actual combiners?
Disadvantages
Explicit memory management required
In-mapper combining is faster than regular combiners because it happens entirely in memory, in contrast with regular combining, which reads from and writes to disk.
SLIDE 12
Computing the Mean: Version 4
class Mapper {
  val sums = new Map()
  val counts = new Map()
  def map(key: String, value: Int) = {
    sums(key) += value
    counts(key) += 1
  }
  def cleanup() = {
    for (key <- counts.keys) {
      emit(key, (sums(key), counts(key)))
    }
  }
}
Sample input: (a, 7) (a, 18) (c, 4) (b, 1) (c, 10) (a, 3) …
Using in-mapper combining (IMC) to improve the performance of computing the mean.
SLIDE 13
Performance
200m integers across three char keys:
V1 (Baseline): ~120s
V3 (+ Combiner): ~90s
V4 (+ IMC): ~60s
SLIDE 14
Algorithm Design
SLIDE 15
Term co-occurrence
Term co-occurrence matrix for a text collection
M = N x N matrix (N = vocabulary size)
Mij: number of times terms i and j co-occur in some context (for concreteness, let’s say context = sentence)
Why?
Distributional profiles as a way of measuring semantic distance
Semantic distance useful for many language processing tasks
Applications in lots of other domains
SLIDE 16
How many times do two words co-occur?
Two approaches:
Pairs
Stripes
SLIDE 17
First Try: “Pairs”
Each mapper takes a sentence:
Generate all co-occurring term pairs
For all pairs, emit (a, b) → count
Reducers sum up counts associated with these pairs
Use combiners!
SLIDE 18
Pairs: Pseudo-Code
class Mapper {
  def map(key: Long, value: String) = {
    for (u <- tokenize(value)) {
      for (v <- neighbors(u)) {
        emit((u, v), 1)
      }
    }
  }
}

class Reducer {
  def reduce(key: Pair, values: Iterable[Int]) = {
    var sum = 0
    for (value <- values) {
      sum += value
    }
    emit(key, sum)
  }
}
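A minimal simulation of the pairs approach (hypothetical `pairs_map`/`pairs_reduce` helpers, with context = sentence):

```python
from collections import defaultdict

def pairs_map(sentence):
    """'Pairs' mapper: emit ((u, v), 1) for every co-occurring pair in
    the sentence (a word is not its own neighbor at the same position)."""
    words = sentence.split()
    out = []
    for i, u in enumerate(words):
        for j, v in enumerate(words):
            if i != j:
                out.append(((u, v), 1))
    return out

def pairs_reduce(emitted):
    # Reducer (or combiner): sum the counts for each pair key.
    counts = defaultdict(int)
    for pair, one in emitted:
        counts[pair] += one
    return counts

counts = pairs_reduce(pairs_map("a b a"))
# ('a', 'b') co-occurs once for each occurrence of 'a'
```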
SLIDE 19
“Pairs” Analysis
Advantages
Easy to implement, easy to understand
Disadvantages
Lots of pairs to sort and shuffle around (upper bound?)
Not many opportunities for combiners to work
SLIDE 20
Another Try: “Stripes”
Idea: group together pairs into an associative array
Each mapper takes a sentence:
Generate all co-occurring term pairs
For each term, emit a → { b: count_b, c: count_c, d: count_d, … }

(a, b) → 1, (a, c) → 2, (a, d) → 5, (a, e) → 3, (a, f) → 2  ⇒  a → { b: 1, c: 2, d: 5, e: 3, f: 2 }
Reducers perform element-wise sum of associative arrays
a → { b: 1, d: 5, e: 3 } + a → { b: 1, c: 2, d: 2, f: 2 } = a → { b: 2, c: 2, d: 7, e: 3, f: 2 }
SLIDE 21
Stripes: Pseudo-Code
class Mapper {
  def map(key: Long, value: String) = {
    for (u <- tokenize(value)) {
      val map = new Map()
      for (v <- neighbors(u)) {
        map(v) += 1
      }
      emit(u, map)
    }
  }
}

class Reducer {
  def reduce(key: String, values: Iterable[Map]) = {
    val map = new Map()
    for (value <- values) {
      map += value  // element-wise sum of associative arrays
    }
    emit(key, map)
  }
}
Mapper output: a → { b: 1, c: 2, d: 5, e: 3, f: 2 }
Reducer sum: a → { b: 1, d: 5, e: 3 } + a → { b: 1, c: 2, d: 2, f: 2 } = a → { b: 2, c: 2, d: 7, e: 3, f: 2 }
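A minimal simulation of the stripes approach (hypothetical `stripes_map`/`stripes_reduce` helpers; `Counter.update` performs the element-wise sum):

```python
from collections import defaultdict, Counter

def stripes_map(sentence):
    """'Stripes' mapper: for each term u, emit u -> {v: count} over the
    sentence (a word is not its own neighbor at the same position)."""
    words = sentence.split()
    out = []
    for i, u in enumerate(words):
        stripe = Counter(v for j, v in enumerate(words) if j != i)
        out.append((u, stripe))
    return out

def stripes_reduce(emitted):
    # Reducer: element-wise sum of the associative arrays for each key.
    merged = defaultdict(Counter)
    for u, stripe in emitted:
        merged[u].update(stripe)   # Counter.update adds counts
    return merged

merged = stripes_reduce(stripes_map("a b a"))
```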
SLIDE 22
“Stripes” Analysis
Advantages
Far less sorting and shuffling of key-value pairs
Can make better use of combiners
Disadvantages
More difficult to implement
Underlying object more heavyweight
Overhead associated with data structure manipulations
Fundamental limitation in terms of size of event space
SLIDE 23
Cluster size: 38 cores
Data Source: Associated Press Worldstream (APW) of the English Gigaword Corpus (v3), which contains 2.27 million documents (1.8 GB compressed, 5.7 GB uncompressed)
SLIDE 24
Pairs vs. Stripes

There is a tradeoff at work here! Pairs performs better than Stripes on a smaller cluster because communication is fairly limited anyway (fewer machines means that each machine does more of the work and results can be aggregated more locally), and thus the overhead of Stripes causes it to perform worse. However, as the cluster grows, communication increases, and Stripes starts to shine.
SLIDE 25
Tradeoffs
Pairs:
Generates a lot more key-value pairs
Fewer combining opportunities
More sorting and shuffling
Simple aggregation at reduce
Stripes:
Generates fewer key-value pairs
More opportunities for combining
Less sorting and shuffling
More complex (slower) aggregation at reduce
SLIDE 26
Relative Frequencies
How do we estimate relative frequencies from counts?
Why do we want to do this?
How do we do this with MapReduce?
SLIDE 27
f(B|A): “Stripes”
a → { b1: 3, b2: 12, b3: 7, b4: 1, … }
Easy!
One pass to compute (a, *)
Another pass to directly compute f(B|A)
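The second pass can be sketched on the stripe from this slide (plain Python, no framework; the marginal for a is just the sum of the stripe’s values):

```python
# f(B|A) from a stripe: once the reducer has the element-wise-summed
# stripe for key 'a', the marginal count is the sum of its values.
stripe_a = {'b1': 3, 'b2': 12, 'b3': 7, 'b4': 1}   # values from the slide
marginal = sum(stripe_a.values())                   # the (a, *) count
rel_freq = {b: n / marginal for b, n in stripe_a.items()}
```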
SLIDE 28
f(B|A): “Pairs”
What’s the issue?
Computing relative frequencies requires marginal counts
But the marginal cannot be computed until you see all counts
Buffering is a bad idea!
Solution:
What if we could get the marginal count to arrive at the reducer first?
SLIDE 29
f(B|A): “Pairs”

Input to the reducer, with the marginal (a, *) arriving first (the reducer holds this value in memory):
(a, *) → 32
(a, b1) → 3   → (a, b1) → 3 / 32
(a, b2) → 12  → (a, b2) → 12 / 32
(a, b3) → 7   → (a, b3) → 7 / 32
(a, b4) → 1   → (a, b4) → 1 / 32
…
For this to work:
Emit an extra (a, *) for every b_n in the mapper
Make sure all a’s get sent to the same reducer (use a partitioner)
Make sure (a, *) comes first (define the sort order)
Hold state in the reducer across different key-value pairs
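This order-inversion trick can be simulated in a few lines (a sketch; the empty string, which sorts before every real term, stands in for the special key *):

```python
from collections import defaultdict

STAR = ''   # sorts before any real word, so (a, STAR) reaches the reducer first

def map_with_marginal(pairs):
    """For every observed pair count ((a, b), n), also emit ((a, STAR), n)
    so the reducer can accumulate the marginal before any (a, b)."""
    out = []
    for (a, b), n in pairs:
        out.append(((a, b), n))
        out.append(((a, STAR), n))
    return out

def reduce_sorted(emitted):
    # Simulate the framework: group by key, then visit keys in sorted
    # order, so (a, STAR) is seen before every (a, b).
    grouped = defaultdict(int)
    for k, n in emitted:
        grouped[k] += n
    result, marginal = {}, {}
    for (a, b) in sorted(grouped):
        if b == STAR:
            marginal[a] = grouped[(a, b)]    # state held across keys
        else:
            result[(a, b)] = grouped[(a, b)] / marginal[a]
    return result

freqs = reduce_sorted(map_with_marginal([(('a', 'b1'), 3), (('a', 'b2'), 1)]))
```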
SLIDE 30
Pairs: Pseudo-Code
class Partitioner {
  def getPartition(key: Pair, value: Int, numTasks: Int): Int = {
    // Partition on the left element only, so every (a, *) and (a, b)
    // goes to the same reducer.
    return key.left % numTasks
  }
}
One more thing…
SLIDE 31
Synchronization: Pairs vs. Stripes
Approach 1: turn synchronization into an ordering problem
Sort keys into correct order of computation
Partition key space so each reducer receives appropriate set of partial results
Hold state in reducer across multiple key-value pairs to perform computation
Illustrated by the “pairs” approach
Approach 2: data structures that bring partial results together
Each reducer receives all the data it needs to complete the computation
Illustrated by the “stripes” approach
SLIDE 32
Secondary Sorting
What if we want to sort the values as well?
E.g., k → (v1, r), (v3, r), (v4, r), (v8, r)…
MapReduce sorts input to reducers by key
Values may be arbitrarily ordered
SLIDE 33
Secondary Sorting: Solutions
Solution 1
Buffer values in memory, then sort
Why is this a bad idea?

Solution 2
“Value-to-key conversion”: form a composite intermediate key, (k, v1)
Let the execution framework do the sorting
Preserve state across multiple key-value pairs to handle processing
Anything else we need to do?
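Value-to-key conversion can be sketched without a framework (a toy `records` list; sorting the composite (k, v) keys plays the role of the shuffle’s sort):

```python
# Value-to-key conversion: move the value into a composite key (k, v) and
# let the framework's sort deliver values in order; the reducer then holds
# state across composite keys to regroup by the original key k.
records = [('k', 8), ('k', 1), ('k', 4), ('j', 3), ('j', 2)]

composite = sorted((k, v) for k, v in records)   # framework sorts (k, v)

grouped = {}
for k, v in composite:            # reducer preserves state across keys
    grouped.setdefault(k, []).append(v)
# grouped['k'] now holds its values in sorted order
```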