SLIDE 1

Common Patterns and Pitfalls for Implementing Algorithms in Spark

Hossein Falaki @mhfalaki hossein@databricks.com

SLIDE 2

Challenges of numerical computation over big data

When applying any algorithm to big data, watch for:

1. Correctness
2. Performance
3. Trade-off between accuracy and performance


SLIDE 3

Three Practical Examples

Point estimation (Variance)
Approximate estimation (Cardinality)
Matrix operations (PageRank)

We use these examples to demonstrate Spark internals, data flow, and challenges of implementing algorithms for Big Data.

SLIDE 4

1. Big Data Variance

The plain variance formula requires two passes over the data:

$$\mathrm{Var}(X) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

First pass: compute the mean µ. Second pass: sum the squared deviations from µ.

SLIDE 5

Fast but inaccurate solution

$$\mathrm{Var}(X) = E[X^2] - E[X]^2 = \frac{\sum x^2}{N} - \left(\frac{\sum x}{N}\right)^2$$

Can be performed in a single pass, but subtracts two very close and large numbers!
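To make the cancellation problem concrete, here is a minimal plain-Scala sketch (no Spark) that compares the two-pass formula with the single-pass E[X²] − E[X]² formula. The object name, method names, and sample data are hypothetical and chosen only for illustration; the true population variance of the sample is 22.5.

```scala
object VarianceDemo {
  // Hypothetical sample: small spread around a very large offset
  val data: Array[Double] = Array(4.0, 7.0, 13.0, 16.0).map(_ + 1e9)

  // Two passes: first compute the mean, then sum the squared deviations
  def twoPass(xs: Array[Double]): Double = {
    val mean = xs.sum / xs.length
    xs.map(x => (x - mean) * (x - mean)).sum / xs.length
  }

  // One pass: accumulate sum(x) and sum(x^2), then subtract two huge numbers
  def naiveOnePass(xs: Array[Double]): Double = {
    var sum = 0.0
    var sumSq = 0.0
    xs.foreach { x => sum += x; sumSq += x * x }
    val n = xs.length
    sumSq / n - (sum / n) * (sum / n)
  }
}
```

With this data `VarianceDemo.twoPass(VarianceDemo.data)` returns 22.5, while `naiveOnePass` lands far from it, because each x² is near 10^18 and a Double can no longer represent the low-order digits that carry the variance.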

SLIDE 6

Accumulator Pattern


An object that incrementally tracks the variance

    class RunningVar {
      var variance: Double = 0.0

      // Compute initial variance for numbers
      // (an auxiliary constructor must call the primary constructor first)
      def this(numbers: Iterator[Double]) = {
        this()
        numbers.foreach(this.add(_))
      }

      // Update variance for a single value
      def add(value: Double): Unit = { ... }
    }

SLIDE 7

Parallelize for performance


Distribute adding values in map phase
Merge partial results in reduce phase

    class RunningVar {
      ...
      // Merge another RunningVar object
      // and update variance
      def merge(other: RunningVar) = { ... }
    }
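The slides leave the bodies of add and merge elided. One possible way to fill them in is sketched below, assuming Welford's incremental update for add and the pairwise combination formula of Chan et al. for merge. Both choices, the extra fields count, mean, and m2, and the RunningVarDemo object are assumptions made for illustration, not the presenter's implementation.

```scala
// Hypothetical completion of RunningVar; the update rules are assumptions.
class RunningVar {
  var count: Long = 0L
  var mean: Double = 0.0
  var m2: Double = 0.0 // sum of squared deviations from the running mean

  // Build a tracker from all values of one iterator (e.g. one partition)
  def this(numbers: Iterator[Double]) = {
    this()
    numbers.foreach(this.add(_))
  }

  // Population variance of everything seen so far
  def variance: Double = if (count == 0) 0.0 else m2 / count

  // Welford's update for a single value
  def add(value: Double): Unit = {
    count += 1
    val delta = value - mean
    mean += delta / count
    m2 += delta * (value - mean)
  }

  // Merge another tracker into this one (pairwise formula)
  def merge(other: RunningVar): RunningVar = {
    if (other.count == 0) return this
    if (count == 0) {
      count = other.count; mean = other.mean; m2 = other.m2
      return this
    }
    val total = count + other.count
    val delta = other.mean - mean
    m2 += other.m2 + delta * delta * count * other.count / total
    mean += delta * other.count / total
    count = total
    this
  }
}

object RunningVarDemo {
  val data: Seq[Double] = Seq(4.0, 7.0, 13.0, 16.0).map(_ + 1e9)
  // grouped(2) stands in for Spark partitions in this plain-Scala sketch
  val result: RunningVar =
    data.grouped(2).map(part => new RunningVar(part.iterator)).reduce((a, b) => a.merge(b))
}
```

Because the tracker only ever works with deviations from the running mean, `RunningVarDemo.result.variance` stays at about 22.5 even though the values sit near 10^9.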

SLIDE 8

Computing Variance in Spark

Use the RunningVar in Spark:

    doubleRDD
      .mapPartitions(v => Iterator(new RunningVar(v)))
      .reduce((a, b) => a.merge(b))

Or simply use the Spark API:

    doubleRDD.variance()

SLIDE 9

2. Approximate Estimations
Often an approximate estimate is good enough, especially if it can be computed faster or cheaper

1. Trade accuracy for memory
2. Trade accuracy for running time

We really like the cases where there is a bound on the error that can be controlled

SLIDE 10

Cardinality Problem

Example: Count the number of unique words in Shakespeare's work.

Using a HashSet requires ~10GB of memory
This can be much worse in many real-world applications involving large strings, such as counting web visitors

SLIDE 11

Linear Probabilistic Counting

1. Allocate a bitmap of size m and initialize it to zero.
   A. Hash each value to a position in the bitmap
   B. Set the corresponding bit to 1
2. Count the number of empty bit entries: v

$$\text{count} \approx -m \ln\left(\frac{v}{m}\right)$$
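The steps above can be sketched in plain Scala as follows. The class name LPCounter follows the deck, but its body, the default bitmap size, the use of MurmurHash3, and the demo object are assumptions made for illustration. Merging two counters of the same size is a bitwise OR of their bitmaps, which is what makes the counter usable per partition.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical linear counter; only the name comes from the slides.
class LPCounter(values: Iterator[String], val m: Int = 1 << 16) {
  // Step 1: bitmap of size m, all zero
  private val bits = new java.util.BitSet(m)
  values.foreach(add)

  // Steps A and B: hash the value to a position and set that bit
  def add(value: String): Unit =
    bits.set(java.lang.Math.floorMod(MurmurHash3.stringHash(value), m))

  // Two bitmaps of the same size merge with a bitwise OR
  def merge(other: LPCounter): LPCounter = {
    require(other.m == m, "bitmaps must have the same size")
    bits.or(other.bits)
    this
  }

  // Step 2: count empty bits v and apply count ≈ -m ln(v / m).
  // A completely full bitmap (v = 0) yields positive infinity.
  def getCardinality: Double = {
    val v = m - bits.cardinality()
    -m * math.log(v.toDouble / m)
  }
}

object LPCounterDemo {
  val words: Seq[String] = (0 until 10000).map(i => s"word$i")
  // Two stand-in partitions; every word appears once in each
  val partitions: Seq[Seq[String]] = Seq(words, words.reverse)
  val estimate: Double =
    partitions.map(p => new LPCounter(p.iterator)).reduce((a, b) => a.merge(b)).getCardinality
}
```

The bitmap here takes 65,536 bits (8 KB) regardless of how long the strings are. The estimate degrades as the bitmap fills up, so m has to be sized for the expected cardinality.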

SLIDE 12

The Spark API

Use the LPCounter in Spark:

    rdd
      .mapPartitions(v => Iterator(new LPCounter(v)))
      .reduce((a, b) => a.merge(b))
      .getCardinality

Or simply use the Spark API:

    myRDD.countApproxDistinct(0.01)

SLIDE 13

3. Google PageRank


Popular algorithm originally introduced by Google

SLIDE 14

PageRank Algorithm

Start each page with a rank of 1
On each iteration:
   A. $\text{contrib} = \dfrac{\text{curRank}}{|\text{neighbors}|}$
   B. $\text{curRank} = 0.15 + 0.85 \sum_{\text{neighbors}} \text{contrib}_i$
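The two steps can be traced on a small in-memory graph before moving to RDDs. The sketch below uses a hypothetical four-page link structure (not the graph drawn in the example slides) and plain Scala collections; all names are illustrative.

```scala
object PageRankSketch {
  // Hypothetical link structure: page -> pages it links to
  val links: Map[String, Seq[String]] = Map(
    "a" -> Seq("b", "c"),
    "b" -> Seq("c"),
    "c" -> Seq("a"),
    "d" -> Seq("c")
  )

  // Start each page with a rank of 1
  val initial: Map[String, Double] = links.keys.map(page => page -> 1.0).toMap

  def step(ranks: Map[String, Double]): Map[String, Double] = {
    // A. each page sends curRank / |neighbors| to every neighbor
    val contribs: Seq[(String, Double)] = links.toSeq.flatMap {
      case (page, neighbors) =>
        neighbors.map(dest => (dest, ranks(page) / neighbors.size))
    }
    // B. curRank = 0.15 + 0.85 * (sum of received contributions)
    val received: Map[String, Double] =
      contribs.groupBy(_._1).map { case (page, cs) => page -> cs.map(_._2).sum }
    links.keys.map(page => page -> (0.15 + 0.85 * received.getOrElse(page, 0.0))).toMap
  }

  def run(iterations: Int): Map[String, Double] =
    (1 to iterations).foldLeft(initial)((ranks, _) => step(ranks))
}
```

After one step, page "b" receives 0.5 from "a" and ends up at 0.15 + 0.85 × 0.5 = 0.575. Page "d" has no inbound links; this sketch keeps it at 0.15 by defaulting its received sum to zero, which is a design choice of the sketch.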

SLIDE 15

PageRank Example

[Figure: link graph of four pages; every page starts with rank 1.0]

SLIDE 16

PageRank Example

[Figure: page ranks 1.0, 1.0, 1.0, 1.0; contributions sent along the links: 1.0, 0.5, 0.5, 0.5, 1.0]

SLIDE 17

PageRank Example

[Figure: page ranks after the update: 0.58, 0.58, 1.85, 1.0]

SLIDE 18

PageRank Example

[Figure: page ranks 0.58, 0.58, 1.85, 1.0; contributions sent along the links: 0.58, 0.29, 0.5, 0.5, 1.85]

SLIDE 19

PageRank Example

[Figure: page ranks after the update: 0.58, 0.39, 1.31, 1.72]

SLIDE 20

PageRank Example

[Figure: page ranks eventually: 0.73, 0.46, 1.44, 1.37]

SLIDE 21

PageRank as Matrix Multiplication

Rank of each page is the probability of landing on that page for a random surfer on the web
Probability of visiting all pages after k steps is

$$V_k = A^k \times V^t$$

V: the initial rank vector
A: the link structure (sparse matrix)

SLIDE 22

Data Representation in Spark

Each page is identified by its unique URL rather than an index
Ranks vector (V): RDD[(URL, Double)]
Links matrix (A): RDD[(URL, List[URL])]

SLIDE 23

Spark Implementation

    val links = // load RDD of (url, neighbors) pairs
    var ranks = // load RDD of (url, rank) pairs

    for (i <- 1 to ITERATIONS) {
      // Each page sends rank / |neighbors| to every page it links to
      val contribs = links.join(ranks).flatMap {
        case (url, (neighbors, rank)) =>
          neighbors.map(dest => (dest, rank / neighbors.size))
      }
      // Sum the contributions per page and apply the damping factor
      ranks = contribs.reduceByKey(_ + _)
                      .mapValues(0.15 + 0.85 * _)
    }
    ranks.saveAsTextFile(...)

SLIDE 24

Matrix Multiplication

Repeatedly multiply sparse matrix and vector

[Figure: Links (url, neighbors) and Ranks (url, rank) feed iteration 1, iteration 2, iteration 3, …; the same file is read over and over]

SLIDE 25

Spark can do much better

Using cache(), keep neighbors in memory
Do not write intermediate results to disk

[Figure: Links (url, neighbors) is joined with Ranks (url, rank) in every iteration (join, join, join, …); the same RDD is grouped over and over]

SLIDE 26

Spark can do much better

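Do not partition neighbors every time

[Figure: Links (url, neighbors) is partitioned once with partitionBy, then joined with Ranks (url, rank) in every iteration (join, join, join, …); matching keys stay on the same node]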
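Why partitioning once helps can be illustrated without Spark. The sketch below assumes a hash rule of the form "non-negative hashCode modulo the partition count", which is the idea behind a hash partitioner; the object, the URLs, and the partition count are hypothetical. When links and ranks are placed with the same rule, each link row finds its rank inside its own partition, so the join needs no data movement for links.

```scala
object PartitionSketch {
  val numPartitions = 4

  // Same placement rule for every dataset keyed by URL
  def partitionOf(url: String): Int =
    java.lang.Math.floorMod(url.hashCode, numPartitions)

  // Hypothetical data
  val links: Seq[(String, Seq[String])] = Seq(
    "a.com" -> Seq("b.com"),
    "b.com" -> Seq("a.com", "c.com"),
    "c.com" -> Seq("a.com")
  )
  val ranks: Seq[(String, Double)] = Seq("a.com" -> 1.0, "b.com" -> 1.0, "c.com" -> 1.0)

  val linksByPartition = links.groupBy { case (url, _) => partitionOf(url) }
  val ranksByPartition = ranks.groupBy { case (url, _) => partitionOf(url) }

  // Join partition by partition, looking only at ranks stored in the same partition
  val joined: Seq[(String, (Seq[String], Double))] = linksByPartition.toSeq.flatMap {
    case (p, localLinks) =>
      val localRanks = ranksByPartition.getOrElse(p, Seq.empty[(String, Double)]).toMap
      localLinks.flatMap { case (url, neighbors) =>
        localRanks.get(url).map(rank => (url, (neighbors, rank)))
      }
  }
}
```

Every row of links is matched locally, so `PartitionSketch.joined` has three entries. In Spark the analogous effect comes from calling partitionBy once on links and caching the result, rather than letting each join repartition it.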

SLIDE 27

Spark Implementation

    val rawLinks = // load RDD of (url, neighbors) pairs
    var ranks = // load RDD of (url, rank) pairs

    // partitionBy returns a new RDD, so keep the partitioned, cached result
    // (hashFunction is a Partitioner, e.g. a HashPartitioner)
    val links = rawLinks.partitionBy(hashFunction).cache()
    for (i <- 1 to ITERATIONS) {
      val contribs = links.join(ranks).flatMap {
        case (url, (neighbors, rank)) =>
          neighbors.map(dest => (dest, rank / neighbors.size))
      }
      ranks = contribs.reduceByKey(_ + _)
                      .mapValues(0.15 + 0.85 * _)
    }
    ranks.saveAsTextFile(...)

SLIDE 28

Conclusions

When applying any algorithm to big data, watch for:

1. Correctness
2. Performance
   Cache RDDs to avoid I/O
   Avoid unnecessary computation
3. Trade-off between accuracy and performance
