Common Patterns and Pitfalls for Implementing Algorithms in Spark



slide-1
SLIDE 1

Common Patterns and Pitfalls for Implementing Algorithms in Spark

Hossein Falaki @mhfalaki hossein@databricks.com

slide-2
SLIDE 2

Challenges of numerical computation over big data

When applying any algorithm to big data, watch for:

  1. Correctness
  2. Performance
  3. Trade-off between accuracy and performance

slide-3
SLIDE 3

Three Practical Examples

  • Point estimation (Variance)
  • Approximate estimation (Cardinality)
  • Matrix operations (PageRank)


We use these examples to demonstrate Spark internals, data flow, and challenges of implementing algorithms for Big Data.

slide-4
SLIDE 4
1. Big Data Variance

  • The plain variance formula requires two passes over data

$$\mathrm{Var}(X) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

First pass: compute the mean µ. Second pass: sum the squared deviations from µ.

slide-5
SLIDE 5

Fast but inaccurate solution

$$\mathrm{Var}(X) = E[X^2] - E[X]^2 = \frac{\sum x^2}{N} - \left(\frac{\sum x}{N}\right)^2$$

Can be performed in a single pass, but it subtracts two very close and large numbers!
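A minimal sketch of the pitfall, in plain Scala without Spark. The object name, method names, and sample data are hypothetical and chosen only for illustration: four values with a small spread around a very large mean, so that the two terms of the single-pass formula are huge and almost equal.

```scala
object VarianceCancellation {
  // Two-pass formula: first pass for the mean, second pass for the squared deviations.
  def twoPass(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    xs.map(x => (x - mean) * (x - mean)).sum / xs.size
  }

  // Single-pass formula: E[X^2] - E[X]^2, accumulating both sums in one loop.
  def singlePass(xs: Seq[Double]): Double = {
    var sum = 0.0
    var sumSq = 0.0
    xs.foreach { x => sum += x; sumSq += x * x }
    val n = xs.size
    sumSq / n - (sum / n) * (sum / n)
  }

  // Hypothetical data: deviations of -6, -3, 3, 6 around a mean of 1e9 + 10.
  val data: Seq[Double] = Seq(1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16)
}
```

On this data the two-pass result is the true population variance, 22.5. The single-pass result is far off, because each squared value is around 1e18, where a Double can no longer represent differences of this size.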

slide-6
SLIDE 6

Accumulator Pattern

An object that incrementally tracks the variance

class RunningVar {
  var variance: Double = 0.0

  // Compute initial variance for numbers
  def this(numbers: Iterator[Double]) {
    this() // an auxiliary constructor must call the primary constructor first
    numbers.foreach(this.add(_))
  }

  // Update variance for a single value
  def add(value: Double) { ... }
}

slide-7
SLIDE 7

Parallelize for performance

  • Distribute adding values in map phase
  • Merge partial results in reduce phase

class RunningVar {
  ...
  // Merge another RunningVar object
  // and update variance
  def merge(other: RunningVar) = { ... }
}
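The slides leave the bodies of add and merge elided. One common way to fill them in, shown here only as a sketch and not necessarily the speaker's implementation, is Welford's online update for add and the pairwise combination formula of Chan et al. for merge. The fields n, mean, and m2 and the helper object LocalVariance are assumptions introduced for this example.

```scala
// Hypothetical RunningVar: tracks count, mean, and the sum of squared deviations.
class RunningVar {
  private var n: Long = 0L        // number of values seen so far
  private var mean: Double = 0.0  // running mean
  private var m2: Double = 0.0    // running sum of squared deviations from the mean

  // Compute initial variance for numbers
  def this(numbers: Iterator[Double]) = {
    this()
    numbers.foreach(x => add(x))
  }

  // Update variance for a single value (Welford's update)
  def add(value: Double): Unit = {
    n += 1
    val delta = value - mean
    mean += delta / n
    m2 += delta * (value - mean)
  }

  // Merge another RunningVar object and update variance
  def merge(other: RunningVar): RunningVar = {
    if (other.n > 0) {
      val total = n + other.n
      val delta = other.mean - mean
      m2 += other.m2 + delta * delta * n * other.n / total
      mean += delta * other.n / total
      n = total
    }
    this
  }

  // Population variance (divides by N, as in the variance formula on the slides)
  def variance: Double = if (n == 0) Double.NaN else m2 / n
}

object LocalVariance {
  // Local stand-in for the map and reduce phases: one RunningVar per
  // partition, then a pairwise merge of the partial results.
  def ofPartitions(partitions: Seq[Seq[Double]]): Double =
    partitions.map(p => new RunningVar(p.iterator)).reduce((a, b) => a.merge(b)).variance
}
```

Because merge only combines three numbers per partition, the partial results are cheap to ship between nodes, and the result does not depend on how the data was split.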

slide-8
SLIDE 8

Computing Variance in Spark

  • Use the RunningVar in Spark

doubleRDD
  .mapPartitions(v => Iterator(new RunningVar(v)))
  .reduce((a, b) => a.merge(b))

  • Or simply use the Spark API

doubleRDD.variance()

slide-9
SLIDE 9
2. Approximate Estimations

  • Often an approximate estimate is good enough, especially if it can be computed faster or cheaper
      1. Trade accuracy for memory
      2. Trade accuracy for running time
  • We really like the cases where there is a bound on the error that can be controlled

slide-10
SLIDE 10

Cardinality Problem

Example: Count the number of unique words in Shakespeare’s work.

  • Using a HashSet requires ~10GB of memory
  • This can be much worse in many real-world applications involving large strings, such as counting web visitors

slide-11
SLIDE 11

Linear Probabilistic Counting

  1. Allocate a bitmap of size m and initialize to zero.
      A. Hash each value to a position in the bitmap
      B. Set corresponding bit to 1
  2. Count number of empty bit entries: v

$$\text{count} \approx -m \ln \frac{v}{m}$$
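The steps above can be sketched in plain Scala as follows. This is an illustrative counter over strings, not the speaker's LPCounter: the explicit bitmap size parameter m, the choice of MurmurHash3 as the hash function, and the method names add and merge are assumptions made for this example.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// Hypothetical linear probabilistic counter with a bitmap of m bits.
class LPCounter(val m: Int) {
  // Step 1: a bitmap of size m, initially all zero
  private val bitmap = new mutable.BitSet(m)

  // Build a counter from all values of one partition
  def this(values: Iterator[String], m: Int) = {
    this(m)
    values.foreach(v => add(v))
  }

  // Steps A and B: hash the value to a position and set that bit
  def add(value: String): Unit = {
    val position = java.lang.Math.floorMod(MurmurHash3.stringHash(value), m)
    bitmap += position
  }

  // Two counters of the same size merge by OR-ing their bitmaps
  def merge(other: LPCounter): LPCounter = {
    require(m == other.m, "bitmaps must have the same size")
    bitmap |= other.bitmap
    this
  }

  // Step 2: count the empty entries v and estimate -m * ln(v / m)
  def getCardinality: Double = {
    val v = m - bitmap.size
    if (v == 0) Double.PositiveInfinity else -m * math.log(v.toDouble / m)
  }
}
```

Merging by bitwise OR gives exactly the bitmap that a single counter would have built over the union of the inputs, which is what makes the counter usable per partition. When the bitmap is completely full, no empty entries remain and the estimate is unbounded, so m has to be sized for the expected cardinality.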

slide-12
SLIDE 12

The Spark API

  • Use the LPCounter in Spark

rdd
  .mapPartitions(v => Iterator(new LPCounter(v)))
  .reduce((a, b) => a.merge(b))
  .getCardinality

  • Or simply use the Spark API

myRDD.countApproxDistinct(0.01)

slide-13
SLIDE 13
3. Google PageRank

Popular algorithm originally introduced by Google

slide-14
SLIDE 14

PageRank Algorithm

  • Start each page with a rank of 1
  • On each iteration:
      A. $\text{contrib} = \dfrac{\text{curRank}}{|\text{neighbors}|}$
      B. $\text{curRank} = 0.15 + 0.85 \sum_{\text{neighbors}} \text{contrib}_i$
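A minimal local sketch of steps A and B, using plain Scala collections instead of RDDs. The four-page graph below is hypothetical: the graph drawn on the slides is not recoverable from the text, so this one was chosen such that its ranks agree with the numbers shown on the PageRank Example slides. The object and method names are likewise assumptions.

```scala
object LocalPageRank {
  // Hypothetical graph: page -> pages it links to (its neighbors).
  val links: Map[String, Seq[String]] = Map(
    "a" -> Seq("c"),
    "b" -> Seq("a", "c"),
    "c" -> Seq("d"),
    "d" -> Seq("b", "c")
  )

  // One iteration of the algorithm.
  def iterate(links: Map[String, Seq[String]], ranks: Map[String, Double]): Map[String, Double] = {
    // Step A: each page sends curRank / |neighbors| to every neighbor
    val contribs: Seq[(String, Double)] = links.toSeq.flatMap { case (page, neighbors) =>
      neighbors.map(dest => (dest, ranks(page) / neighbors.size))
    }
    // Step B: new rank = 0.15 + 0.85 * (sum of the contributions received)
    contribs.groupBy(_._1).map { case (page, received) =>
      (page, 0.15 + 0.85 * received.map(_._2).sum)
    }
  }

  // Start each page with a rank of 1, then iterate.
  // Assumes every page has at least one incoming link, as in the graph above.
  def run(links: Map[String, Seq[String]], iterations: Int): Map[String, Double] = {
    val initial: Map[String, Double] = links.keys.map(page => (page, 1.0)).toMap
    (1 to iterations).foldLeft(initial)((ranks, _) => iterate(links, ranks))
  }
}
```

On this graph, one iteration gives ranks of about 0.58, 0.58, 1.85, and 1.0, and after many iterations the ranks settle near 0.46, 0.73, 1.44, and 1.37 for pages a, b, c, and d.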

slide-15
SLIDE 15

PageRank Example

Figure: four-page graph with initial ranks 1.0, 1.0, 1.0, 1.0

slide-16
SLIDE 16

PageRank Example

Figure: ranks 1.0, 1.0, 1.0, 1.0

Contributions shown on the links: 1.0, 0.5, 0.5, 0.5, 1.0

slide-17
SLIDE 17

PageRank Example

Figure: ranks after the first iteration 0.58, 0.58, 1.85, 1.0

slide-18
SLIDE 18

PageRank Example

Figure: ranks 0.58, 0.58, 1.85, 1.0

Contributions shown on the links: 0.58, 0.29, 0.5, 0.5, 1.85

slide-19
SLIDE 19

PageRank Example

Figure: ranks after the second iteration 0.58, 0.39, 1.31, 1.72

slide-20
SLIDE 20

PageRank Example

Figure: eventually the ranks converge to 0.73, 0.46, 1.44, 1.37

slide-21
SLIDE 21

PageRank as Matrix Multiplication

  • Rank of each page is the probability of landing on that page for a random surfer on the web
  • Probability of visiting all pages after k steps is

$$V_k = A^k \times V^{t}$$

V: the initial rank vector
A: the link structure (sparse matrix)
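A small sketch of the matrix view, in plain Scala. It assumes a hypothetical four-page graph, stores A sparsely as (row, column) -> value with A(dest, src) = 1 / |neighbors(src)|, and computes A^k × V as k successive sparse matrix-vector products. Like the formula above, it leaves out the 0.15 and 0.85 constants of the iterative algorithm. All names are assumptions for illustration.

```scala
object MatrixPageRank {
  // Hypothetical graph: page index -> indices of the pages it links to.
  val outLinks: Map[Int, Seq[Int]] = Map(
    0 -> Seq(2),
    1 -> Seq(0, 2),
    2 -> Seq(3),
    3 -> Seq(1, 2)
  )

  // Sparse link matrix: only the non-zero entries are stored.
  // Every column sums to 1, so a probability vector stays a probability vector.
  val a: Map[(Int, Int), Double] = outLinks.toSeq.flatMap { case (src, dests) =>
    dests.map(dest => ((dest, src), 1.0 / dests.size))
  }.toMap

  // One sparse matrix-vector product: result(row) = sum over col of A(row, col) * v(col)
  def multiply(a: Map[(Int, Int), Double], v: Vector[Double]): Vector[Double] = {
    val result = Array.fill(v.size)(0.0)
    a.foreach { case ((row, col), value) => result(row) += value * v(col) }
    result.toVector
  }

  // V_k = A^k x V, computed as k successive products instead of forming A^k.
  def power(a: Map[(Int, Int), Double], v: Vector[Double], k: Int): Vector[Double] =
    (1 to k).foldLeft(v)((vec, _) => multiply(a, vec))
}
```

Starting from the uniform vector (0.25, 0.25, 0.25, 0.25), one step yields (0.125, 0.125, 0.5, 0.25). Repeated products never materialize the dense matrix A^k, which is why the sparse representation matters at web scale.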

slide-22
SLIDE 22

Data Representation in Spark

  • Each page is identified by its unique URL rather than an index
  • Ranks vector (V): RDD[(URL, Double)]
  • Links matrix (A): RDD[(URL, List[URL])]
slide-23
SLIDE 23

Spark Implementation

val links = // load RDD of (url, neighbors) pairs
var ranks = // load RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)

slide-24
SLIDE 24

Matrix Multiplication

  • Repeatedly multiply sparse matrix and vector

Figure labels: Links (url, neighbors); Ranks (url, rank); iteration 1, iteration 2, iteration 3

Same file read over and over
slide-25
SLIDE 25

Spark can do much better

  • Using cache(), keep neighbors in memory
  • Do not write intermediate results to disk

Figure labels: Links (url, neighbors); Ranks (url, rank); join, join, join

Grouping same RDD over and over
slide-26
SLIDE 26

Spark can do much better

  • Do not partition neighbors every time

Figure labels: Links (url, neighbors); Ranks (url, rank); partitionBy; join, join, join; Same node
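Why partitioning once helps can be sketched without Spark. The example assumes a hash-based partitioner that assigns a key to a partition via a non-negative modulus of its hash code. The function partitionOf, the partition count, and the sample data are hypothetical. When both datasets are laid out with the same function, the entries for a given URL always share a partition index, so a join can match them locally.

```scala
object CoPartitioning {
  // Hypothetical stand-in for a hash partitioner: key -> partition index.
  def partitionOf(key: String, numPartitions: Int): Int =
    java.lang.Math.floorMod(key.hashCode, numPartitions)

  // Lay out key-value pairs by partition index.
  def partition[V](data: Seq[(String, V)], numPartitions: Int): Map[Int, Seq[(String, V)]] =
    data.groupBy { case (key, _) => partitionOf(key, numPartitions) }

  // Hypothetical links and ranks keyed by the same URLs.
  val links: Seq[(String, Seq[String])] = Seq(
    "a.com" -> Seq("c.com"),
    "b.com" -> Seq("a.com", "c.com"),
    "c.com" -> Seq("d.com"),
    "d.com" -> Seq("b.com", "c.com")
  )
  val ranks: Seq[(String, Double)] = links.map { case (url, _) => (url, 1.0) }

  // True if every URL's links entry and ranks entry share a partition index.
  def coLocated(numPartitions: Int): Boolean = {
    val rankPartitions = partition(ranks, numPartitions)
    partition(links, numPartitions).forall { case (index, entries) =>
      entries.forall { case (url, _) =>
        rankPartitions.getOrElse(index, Seq.empty).exists(_._1 == url)
      }
    }
  }
}
```

The links dataset never changes between iterations, so computing this layout once and keeping it cached avoids repeating the same shuffle on every join.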

slide-27
SLIDE 27

Spark Implementation

val links = // load RDD of (url, neighbors) pairs
var ranks = // load RDD of (url, rank) pairs

// partitionBy returns a new RDD, so keep a reference to the partitioned, cached copy
val partitionedLinks = links.partitionBy(hashFunction).cache()

for (i <- 1 to ITERATIONS) {
  val contribs = partitionedLinks.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)

slide-28
SLIDE 28

Conclusions

When applying any algorithm to big data, watch for:

  1. Correctness
  2. Performance
      • Cache RDDs to avoid I/O
      • Avoid unnecessary computation
  3. Trade-off between accuracy and performance

slide-29
SLIDE 29