
SLIDE 1

Numerical Computing with Spark

Hossein Falaki

SLIDE 2

Challenges of numerical computation over big data

When applying any algorithm to big data, watch for:

  1. Correctness
  2. Performance
  3. Trade-off between accuracy and performance

SLIDE 3

Three Practical Examples

  • Point estimation (Variance)
  • Approximate estimation (Cardinality)
  • Matrix operations (PageRank)


We use these examples to demonstrate Spark internals, data flow, and challenges of implementing algorithms for Big Data.

SLIDE 4
1. Big Data Variance

The plain variance formula requires two passes over the data:

Var(X) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2

First pass: compute the mean µ. Second pass: sum the squared deviations.

SLIDE 5

Fast but inaccurate solution

Var(X) = E[X^2] - E[X]^2 = \frac{\sum_i x_i^2}{N} - \left( \frac{\sum_i x_i}{N} \right)^2

This can be performed in a single pass, but it subtracts two very close and large numbers!

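A runnable sketch of the problem in plain Scala, without Spark (the data set and function names are illustrative assumptions, not from the slides): for values tightly clustered around a large mean, E[X^2] and E[X]^2 are huge, nearly equal numbers, so the single-pass formula loses its significant digits in the subtraction, while the two-pass formula does not.

```scala
object VariancePrecision {
  // Single-pass formula: Var(X) = E[X^2] - E[X]^2 (numerically fragile)
  def naiveVariance(xs: Seq[Double]): Double = {
    val n = xs.size.toDouble
    xs.map(x => x * x).sum / n - math.pow(xs.sum / n, 2)
  }

  // Two-pass formula: first compute the mean, then sum squared deviations
  def twoPassVariance(xs: Seq[Double]): Double = {
    val n = xs.size.toDouble
    val mu = xs.sum / n
    xs.map(x => (x - mu) * (x - mu)).sum / n
  }

  def main(args: Array[String]): Unit = {
    // 1,000 values tightly clustered around a large mean; true variance is 0.25
    val xs = (0 until 1000).map(i => 1e9 + (i % 2).toDouble)
    println(s"single-pass: ${naiveVariance(xs)}") // wrong: precision lost in the subtraction
    println(s"two-pass:    ${twoPassVariance(xs)}") // 0.25
  }
}
```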

SLIDE 6

Accumulator Pattern


An object that incrementally tracks the variance

class RunningVar {
  var variance: Double = 0.0

  // Compute initial variance for numbers
  def this(numbers: Iterator[Double]) = {
    this()
    numbers.foreach(this.add(_))
  }

  // Update variance for a single value
  def add(value: Double): Unit = { ... }
}

SLIDE 7

Parallelize for performance


  • Distribute adding values in map phase
  • Merge partial results in reduce phase

class RunningVar {
  ...

  // Merge another RunningVar object
  // and update variance
  def merge(other: RunningVar) = { ... }
}
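A runnable sketch of the full accumulator (the count/mean/m2 fields and the update formulas are assumptions, not from the slides): Welford's online algorithm handles add, and the pairwise-merge formula of Chan et al. handles merge, which is exactly what the reduce phase needs.

```scala
// Accumulator for variance: Welford's online update plus a pairwise merge.
class RunningVar {
  var count: Long = 0L
  var mean: Double = 0.0
  var m2: Double = 0.0 // running sum of squared deviations from the mean

  // Compute initial statistics for numbers
  def this(numbers: Iterator[Double]) = {
    this()
    numbers.foreach(this.add(_))
  }

  // Update the running statistics with a single value
  def add(value: Double): Unit = {
    count += 1
    val delta = value - mean
    mean += delta / count
    m2 += delta * (value - mean)
  }

  // Merge another RunningVar object and update the statistics
  def merge(other: RunningVar): RunningVar = {
    if (other.count > 0) {
      val delta = other.mean - mean
      val total = count + other.count
      mean = (count * mean + other.count * other.mean) / total
      m2 += other.m2 + delta * delta * count * other.count / total
      count = total
    }
    this
  }

  def variance: Double = if (count > 0) m2 / count else Double.NaN
}
```

Merging the accumulator for (1, 2, 3) with the one for (4, 5, 6) yields the same variance as processing 1 through 6 in one pass, which is what makes the map/reduce split correct.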

SLIDE 8

Computing Variance in Spark


doubleRDD
  .mapPartitions(v => Iterator(new RunningVar(v)))
  .reduce((a, b) => a.merge(b))

  • Use the RunningVar in Spark
  • Or simply use the Spark API

doubleRDD.variance()

SLIDE 9
2. Approximate Estimations

Often an approximate estimate is good enough, especially if it can be computed faster or cheaper:

  1. Trade accuracy for memory
  2. Trade accuracy for running time

We really like the cases where there is a bound on the error that can be controlled.


SLIDE 10

Cardinality Problem

Example: count the number of unique words in Shakespeare's works.

  • Using a HashSet requires ~10 GB of memory
  • This can be much worse in many real-world applications involving large strings, such as counting web visitors

SLIDE 11

Linear Probabilistic Counting

  1. Allocate a bitmap of size m and initialize it to zero.
     A. Hash each value to a position in the bitmap
     B. Set the corresponding bit to 1
  2. Count the number of empty bit entries: v

count ≈ −m · ln(v / m)
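A minimal, self-contained sketch of linear probabilistic counting in plain Scala (the class shape mirrors the LPCounter used on the next slide, but the bitmap size and the choice of MurmurHash3 are assumptions):

```scala
import scala.util.hashing.MurmurHash3

// Linear probabilistic counting: hash each value into an m-bit bitmap,
// then estimate the cardinality from the fraction of bits still empty.
class LPCounter(m: Int) {
  private val bitmap = new Array[Boolean](m)

  def add(value: String): Unit = {
    // Map the hash to a bitmap position in [0, m)
    val pos = ((MurmurHash3.stringHash(value) % m) + m) % m
    bitmap(pos) = true
  }

  // Merge another counter by OR-ing the bitmaps (both must use the same m)
  def merge(other: LPCounter): LPCounter = {
    for (i <- 0 until m) bitmap(i) |= other.bitmap(i)
    this
  }

  // count ≈ -m * ln(v / m), where v is the number of empty entries
  def getCardinality: Long = {
    val v = bitmap.count(!_).toDouble
    math.round(-m * math.log(v / m))
  }
}
```

Duplicates set bits that are already set, so adding the same value twice costs nothing, and merging two counters gives the same bitmap as counting their union on one machine.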

SLIDE 12

The Spark API


rdd
  .mapPartitions(v => Iterator(new LPCounter(v)))
  .reduce((a, b) => a.merge(b))
  .getCardinality

  • Use the LPCounter in Spark
  • Or simply use the Spark API

myRDD.countApproxDistinct(0.01)

SLIDE 13
3. Google PageRank


Popular algorithm originally introduced by Google

SLIDE 14

PageRank Algorithm

  • Start each page with a rank of 1
  • On each iteration:
    A. Each page sends a contribution of contrib = curRank / |neighbors| to each of its neighbors
    B. Each page updates curRank = 0.15 + 0.85 · Σ contrib_i, summed over the contributions it received
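The two steps above can be sketched in a few lines of plain Scala (the four-page link structure here is an assumed toy graph for illustration, not the diagram from the slides):

```scala
object PageRankSketch {
  // Hypothetical link structure: page -> outgoing neighbors
  val links: Map[String, List[String]] = Map(
    "a" -> List("b", "c"),
    "b" -> List("c"),
    "c" -> List("a"),
    "d" -> List("c")
  )

  // One PageRank iteration over the whole graph
  def iterate(ranks: Map[String, Double]): Map[String, Double] = {
    // A. each page sends curRank / |neighbors| to each of its neighbors
    val contribs: Map[String, Double] = links.toList
      .flatMap { case (url, neighbors) =>
        neighbors.map(dest => (dest, ranks(url) / neighbors.size))
      }
      .groupBy(_._1)
      .map { case (url, cs) => url -> cs.map(_._2).sum }

    // B. curRank = 0.15 + 0.85 * (sum of received contributions)
    ranks.map { case (url, _) =>
      url -> (0.15 + 0.85 * contribs.getOrElse(url, 0.0))
    }
  }

  def main(args: Array[String]): Unit = {
    // Start each page with a rank of 1, then iterate
    var ranks: Map[String, Double] = links.keys.map(_ -> 1.0).toMap
    for (_ <- 1 to 10) ranks = iterate(ranks)
    ranks.toSeq.sortBy(-_._2).foreach { case (u, r) => println(f"$u: $r%.3f") }
  }
}
```

In this toy graph the page with the most incoming links ("c") ends up with the highest rank, and a page with no incoming links ("d") settles at the damping floor of 0.15.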

SLIDE 15

PageRank Example

[Diagram: four pages, all starting with rank 1.0]

SLIDE 16

PageRank Example

[Diagram: ranks 1.0, 1.0, 1.0, 1.0; contributions of 1.0, 0.5, 0.5, 0.5, 1.0 sent along the links]

SLIDE 17

PageRank Example

[Diagram: ranks after one iteration: 0.58, 0.58, 1.85, 1.0]

SLIDE 18

PageRank Example

[Diagram: ranks 0.58, 0.58, 1.85, 1.0; contributions of 0.58, 0.29, 0.5, 0.5, 1.85 sent along the links]

SLIDE 19

PageRank Example

[Diagram: ranks after the next iteration: 0.58, 0.39, 1.31, 1.72]

SLIDE 20

PageRank Example

[Diagram: ranks 0.73, 0.46, 1.44, 1.37]

Eventually the ranks converge.

SLIDE 21

PageRank as Matrix Multiplication

  • The rank of each page is the probability of landing on that page for a random surfer on the web
  • The probability of visiting all pages after k steps is

V_k = A^k × V

V: the initial rank vector
A: the link structure (sparse matrix)
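The same computation can be viewed as repeated sparse matrix-vector products, sketched here in plain Scala (the toy adjacency data is an assumption, and damping is applied the same way as in the earlier update rule):

```scala
object MatrixPageRank {
  // Assumed toy link structure: page j -> outgoing neighbors
  val links: Map[Int, List[Int]] =
    Map(0 -> List(1, 2), 1 -> List(2), 2 -> List(0), 3 -> List(2))
  val n = 4

  // Sparse entries of A: A(i, j) = 1 / outdegree(j) when page j links to page i
  val entries: Seq[(Int, Int, Double)] =
    links.toSeq.flatMap { case (j, outs) =>
      outs.map(i => (i, j, 1.0 / outs.size))
    }

  // One step: V' = 0.15 + 0.85 * (A x V), touching only the nonzero entries
  def step(v: Array[Double]): Array[Double] = {
    val av = Array.fill(n)(0.0)
    for ((i, j, w) <- entries) av(i) += w * v(j)
    av.map(x => 0.15 + 0.85 * x)
  }

  def main(args: Array[String]): Unit = {
    var v = Array.fill(n)(1.0)     // initial rank vector V
    for (_ <- 1 to 10) v = step(v) // damped version of V_k = A^k x V
    println(v.mkString(", "))
  }
}
```

Because A is sparse, each step costs O(number of links) rather than O(n²), which is what makes the RDD formulation on the following slides feasible.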

SLIDE 22

Data Representation in Spark

  • Each page is identified by its unique URL rather than an index
  • Ranks vector (V): RDD[(URL, Double)]
  • Links matrix (A): RDD[(URL, List[URL])]
SLIDE 23

Spark Implementation


val links = // load RDD of (url, neighbors) pairs
var ranks = // load RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)

SLIDE 24

Matrix Multiplication

  • Repeatedly multiply the sparse matrix and the vector

[Diagram: Links (url, neighbors) and Ranks (url, rank) joined on iteration 1, iteration 2, iteration 3; the same file is read over and over]
SLIDE 25

Spark can do much better

  • Using cache(), keep the neighbors in memory
  • Do not write intermediate results to disk

[Diagram: cached Links (url, neighbors) joined with Ranks (url, rank) on each iteration; the same RDD is grouped over and over]
SLIDE 26

Spark can do much better

  • Do not partition the neighbors every time

[Diagram: Links (url, neighbors) partitioned once with partitionBy, then joined with Ranks (url, rank) on each iteration so that matching keys stay on the same node]

SLIDE 27

Spark Implementation


val links = /* load RDD of (url, neighbors) pairs */
  .partitionBy(hashFunction)
  .cache() // partitionBy returns a new RDD, so keep it inside the assignment
var ranks = // load RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)

SLIDE 28

Conclusions

When applying any algorithm to big data, watch for:

  1. Correctness
  2. Performance
     • Cache RDDs to avoid I/O
     • Avoid unnecessary computation
  3. Trade-off between accuracy and performance

SLIDE 29

Numerical Computing with Spark