Numerical Computing with Spark
Hossein Falaki
Challenges of numerical computation over big data

When applying any algorithm to big data, watch for:
1. Correctness
2. Performance
3. Trade-off between accuracy and performance

Three Practical Examples
We use these examples to demonstrate Spark internals, data flow, and challenges of implementing algorithms for Big Data.
The plain variance formula requires two passes:

variance = (1/N) · Σ_{i=1}^{N} (x_i − μ)²

First pass: compute the mean μ. Second pass: sum the squared deviations (x_i − μ)².

The equivalent formula

variance = (1/N) · Σ_{i=1}^{N} x_i² − μ²

can be computed in a single pass, but it subtracts two very close and large numbers!
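This danger is easy to reproduce outside Spark; a small plain-Python sketch (the values are illustrative) comparing the two formulas on data with a large mean:

```python
# Population variance of three values with a large common offset.
xs = [1e9, 1e9 + 1, 1e9 + 2]  # true variance = 2/3
n = len(xs)

# Two-pass formula: first compute the mean, then sum squared deviations.
mean = sum(xs) / n
two_pass = sum((x - mean) ** 2 for x in xs) / n

# Single-pass formula E[x^2] - E[x]^2: subtracts two very close,
# very large numbers and loses every significant digit here.
single_pass = sum(x * x for x in xs) / n - mean * mean

print(two_pass)     # 0.6666666666666666
print(single_pass)  # 0.0 -- catastrophic cancellation
```

In IEEE double precision the squares of these values are all rounded to multiples of 128, so the single-pass result collapses to exactly zero while the two-pass result is correct.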
An object that incrementally tracks the variance
class RunningVar {
  var variance: Double = 0.0

  def this(numbers: Iterator[Double]) = {
    this()
    numbers.foreach(this.add(_))
  }

  def add(value: Double): Unit = { ... }
}
class RunningVar {
  ...
  // Merge another RunningVar object
  // and update the variance
  def merge(other: RunningVar): RunningVar = { ... }
}
doubleRDD
  .mapPartitions(v => Iterator(new RunningVar(v)))
  .reduce((a, b) => a.merge(b))
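As a runnable illustration of this pattern, here is a plain-Python sketch of a mergeable accumulator in the spirit of RunningVar. The field names, Welford's single-pass update, and Chan et al.'s pairwise merge are assumptions of this sketch, not the deck's actual implementation:

```python
class RunningVar:
    """Tracks count, mean, and M2 (sum of squared deviations);
    population variance = M2 / count."""

    def __init__(self, numbers=()):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0
        for x in numbers:
            self.add(x)

    def add(self, value):
        # Welford's numerically stable single-pass update.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def merge(self, other):
        # Chan et al.'s pairwise combine: each partition is summarized
        # independently, then reduced in any order.
        # (Assumes at least one side is non-empty.)
        total = self.count + other.count
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.count * other.count / total
        self.mean += delta * other.count / total
        self.count = total
        return self

    @property
    def variance(self):
        return self.m2 / self.count

# Merging per-partition accumulators matches accumulating the whole dataset:
whole = RunningVar([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
merged = RunningVar([2.0, 4.0, 4.0, 4.0]).merge(RunningVar([5.0, 5.0, 7.0, 9.0]))
```

Because `merge` is associative, the partition summaries can be reduced pairwise in any order, which is exactly what `reduce` over `mapPartitions` output relies on.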
Or simply use Spark's built-in method: doubleRDD.variance()
Sometimes an approximate result is good enough, especially if it can be computed faster or cheaper, with an error that can be controlled.
Approximate distinct counting is common in applications involving large sets of strings, such as counting unique web visitors.
Example: Count the number of unique words in Shakespeare's works.
rdd
  .mapPartitions(v => Iterator(new LPCounter(v)))
  .reduce((a, b) => a.merge(b))
  .getCardinality
Or use Spark's built-in approximate counter: myRDD.countApproxDistinct(0.01)
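A name like LPCounter suggests linear probabilistic counting; here is a self-contained Python sketch of that technique (the bitmap size, hash choice, and class shape are illustrative assumptions, not Spark's implementation). Each element sets one bit of a bitmap via a hash, bitmaps from different partitions merge with bitwise OR, and cardinality is estimated from the fraction of zero bits:

```python
import hashlib
import math

class LPCounter:
    """Toy linear probabilistic counter (illustrative sketch)."""

    def __init__(self, items=(), m=16384):
        self.m = m      # bitmap size in bits
        self.bits = 0   # bitmap stored as a big integer
        for item in items:
            self.add(item)

    def add(self, item):
        # Hash each item to one of m bit positions and set that bit.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        self.bits |= 1 << (h % self.m)

    def merge(self, other):
        # Bitmaps merge with bitwise OR, so partitions can be
        # counted independently and reduced pairwise.
        self.bits |= other.bits
        return self

    def cardinality(self):
        # Linear counting estimate: n ~ -m * ln(V), V = zero-bit fraction.
        zeros = self.m - bin(self.bits).count("1")
        return -self.m * math.log(zeros / self.m)

# 1000 distinct words, each repeated three times, split across "partitions":
words = [f"word{i}" for i in range(1000)] * 3
part1 = LPCounter(words[:1500])
part2 = LPCounter(words[1500:])
estimate = part1.merge(part2).cardinality()
```

Duplicates hash to the same bit, so repetitions cost nothing, and the fixed-size bitmap bounds memory regardless of input size; at this load factor the estimate lands within a few percent of 1000.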
Popular algorithm originally introduced by Google
PageRank Algorithm

contrib = curRank / |neighbors|

curRank = 0.15 + 0.85 · Σ_i contrib_i
[Figure: PageRank iterations on a four-page example graph. All pages start with rank 1.0; each page sends rank/|neighbors| contributions along its links. The ranks evolve from (1.0, 1.0, 1.0, 1.0) to (0.58, 0.58, 1.85, 1.0), then (0.58, 0.39, 1.31, 1.72), and after further iterations (0.73, 0.46, 1.44, 1.37).]
Eventually, each page's rank converges to the probability of visiting that page for a random surfer on the web.
V: the initial rank vector
A: the link structure (sparse matrix)
In Spark it is more natural to key each page by its URL rather than an index.
val links = // load RDD of (url, neighbors) pairs
var ranks = // load RDD of (url, rank) pairs

for (i <- 1 to iterations) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs
    .reduceByKey(_ + _)
    .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)
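The same loop can be traced outside Spark; a plain-Python sketch on a hypothetical four-page graph (the graph itself is an illustrative assumption), mirroring the join, flatMap, reduceByKey, and mapValues steps:

```python
# Hypothetical link structure: url -> list of outgoing neighbor urls.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = {url: 1.0 for url in links}  # initial rank vector

for _ in range(50):
    # flatMap step: every page sends rank/|neighbors| to each neighbor.
    contribs = {url: 0.0 for url in links}
    for url, neighbors in links.items():
        for dest in neighbors:
            contribs[dest] += ranks[url] / len(neighbors)
    # reduceByKey + mapValues step: damped sum of incoming contributions.
    ranks = {url: 0.15 + 0.85 * c for url, c in contribs.items()}
```

After the iterations settle, "c" (linked from three pages) ranks highest, while "d", which no page links to, bottoms out at the damping floor of 0.15.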
[Figure: naive data flow. Links (url, neighbors) is joined with Ranks (url, rank) in iteration 1, iteration 2, iteration 3, ...; the same file is read in every iteration.]
[Figure: with Links loaded once, each iteration's join still re-groups the same Links RDD against the new Ranks.]
[Figure: after partitionBy, each Links partition and the matching Ranks partition live on the same node, so the join no longer shuffles Links.]
val links = // load RDD of (url, neighbors) pairs
var ranks = // load RDD of (url, rank) pairs

// Partition links by URL and cache them, so the join stops
// re-shuffling the link table on every iteration:
val partitionedLinks = links
  .partitionBy(new HashPartitioner(numPartitions))
  .persist()

for (i <- 1 to iterations) {
  val contribs = partitionedLinks.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs
    .reduceByKey(_ + _)
    .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)
In summary, when applying any algorithm to big data, watch for:
1. Correctness
2. Performance
3. Trade-off between accuracy and performance