  1. Numerical Computing with Spark Hossein Falaki

  2. Challenges of numerical computation over big data
     When applying any algorithm to big data, watch for:
     1. Correctness
     2. Performance
     3. Trade-off between accuracy and performance

  3. Three Practical Examples
     • Point estimation (Variance)
     • Approximate estimation (Cardinality)
     • Matrix operations (PageRank)
     We use these examples to demonstrate Spark internals, data flow, and challenges of implementing algorithms for Big Data.

  4. 1. Big Data Variance
     The plain variance formula requires two passes over the data:
       Var(X) = (1/N) · Σ_{i=1}^{N} (x_i − µ)²
     First pass: compute the mean µ. Second pass: sum the squared deviations.

  5. Fast but inaccurate solution
       Var(X) = E[X²] − (E[X])² = (Σ x²)/N − ((Σ x)/N)²
     Can be performed in a single pass, but it subtracts two very close and large numbers!
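     To see the cancellation in practice, here is a small, self-contained Scala illustration; the specific values are made up for the example, and the population-variance form above is used throughout.

     object VarianceCancellation {
       def main(args: Array[String]): Unit = {
         // Three values with a large mean and a tiny spread;
         // the true population variance is 2/3.
         val xs = Array(1e8, 1e8 + 1, 1e8 + 2)
         val n = xs.length

         // Single-pass formula E[X²] − (E[X])²: both terms are about 1e16,
         // where a Double keeps only ~16 significant digits, so the
         // subtraction cancels nearly every meaningful digit.
         val singlePass = xs.map(x => x * x).sum / n - math.pow(xs.sum / n, 2)

         // Two-pass formula: subtract the mean first, then square.
         val mean = xs.sum / n
         val twoPass = xs.map(x => (x - mean) * (x - mean)).sum / n

         println(s"single-pass: $singlePass") // typically 0.0 or 2.0, far from 2/3
         println(s"two-pass:    $twoPass")    // 0.6666...
       }
     }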

  6. Accumulator Pattern
     An object that incrementally tracks the variance:

     class RunningVar {
       var variance: Double = 0.0

       // Compute initial variance for numbers
       def this(numbers: Iterator[Double]) {
         numbers.foreach(this.add(_))
       }

       // Update variance for a single value
       def add(value: Double) { ... }
     }

  7. Parallelize for performance
     • Distribute adding values in the map phase
     • Merge partial results in the reduce phase

     class RunningVar {
       ...
       // Merge another RunningVar object
       // and update variance
       def merge(other: RunningVar) = { ... }
     }
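     The slides leave the bodies of add and merge elided. As a minimal sketch of one way they could be filled in, the version below keeps a running count, mean, and sum of squared deviations (fields that are my assumption, not shown on the slides), uses Welford's online update in add, and combines partial results in merge with the standard pairwise-combination formula; merge returns the updated object so it drops straight into the reduce call on the next slide.

     class RunningVar extends Serializable {
       var count: Long = 0L
       var mean: Double = 0.0
       var m2: Double = 0.0 // running sum of squared deviations from the mean

       // Compute initial variance for a partition's numbers
       def this(numbers: Iterator[Double]) = {
         this()
         numbers.foreach(add)
       }

       // Update for a single value (Welford's online update)
       def add(value: Double): Unit = {
         count += 1
         val delta = value - mean
         mean += delta / count
         m2 += delta * (value - mean)
       }

       // Merge another RunningVar object and update the running statistics
       def merge(other: RunningVar): RunningVar = {
         if (other.count > 0) {
           val total = count + other.count
           val delta = other.mean - mean
           mean = (count * mean + other.count * other.mean) / total
           m2 += other.m2 + delta * delta * count * other.count / total
           count = total
         }
         this
       }

       def variance: Double = if (count > 0) m2 / count else Double.NaN
     }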

  8. Computing Variance in Spark
     • Use the RunningVar in Spark:

       doubleRDD
         .mapPartitions(v => Iterator(new RunningVar(v)))
         .reduce((a, b) => a.merge(b))

     • Or simply use the Spark API:

       doubleRDD.variance()

  9. 2. Approximate Estimations
     • Often an approximate estimate is good enough, especially if it can be computed faster or cheaper
       1. Trade accuracy for memory
       2. Trade accuracy for running time
     • We really like the cases where there is a bound on the error that can be controlled

  10. Cardinality Problem
      Example: count the number of unique words in Shakespeare's works.
      • Using a HashSet requires ~10GB of memory
      • This can be much worse in many real-world applications involving large strings, such as counting web visitors

  11. Linear Probabilistic Counting
      1. Allocate a bitmap of size m and initialize it to zero
         A. Hash each value to a position in the bitmap
         B. Set the corresponding bit to 1
      2. Count the number of empty bit entries, v:
           count ≈ −m · ln(v / m)
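     As a minimal sketch of these steps in Scala (the class and method names mirror the usage on the next slide, but the bitmap size of 2^20 bits, the MurmurHash3 hash, and the method bodies are assumptions for illustration):

     import scala.util.hashing.MurmurHash3

     class LPCounter(m: Int = 1 << 20) extends Serializable {
       private val bitmap = new java.util.BitSet(m)

       // Build a counter from a partition's values
       def this(values: Iterator[String]) = {
         this()
         values.foreach(add)
       }

       // Hash each value to a position in [0, m) and set the corresponding bit
       def add(value: String): Unit = {
         val hash = MurmurHash3.stringHash(value)
         bitmap.set(((hash % m) + m) % m)
       }

       // Union of the two bitmaps: anything seen on either side stays marked
       def merge(other: LPCounter): LPCounter = {
         bitmap.or(other.bitmap)
         this
       }

       // count ≈ −m · ln(v / m), where v is the number of still-empty bits
       def getCardinality: Long = {
         val v = m - bitmap.cardinality()
         if (v == 0) m.toLong // bitmap saturated; the estimate is unreliable
         else math.round(-m * math.log(v.toDouble / m))
       }
     }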

  12. The Spark API
      • Use the LPCounter in Spark:

        rdd
          .mapPartitions(v => Iterator(new LPCounter(v)))
          .reduce((a, b) => a.merge(b))
          .getCardinality

      • Or simply use the Spark API:

        myRDD.countApproxDistinct(0.01)

  13. 3. Google PageRank
      Popular algorithm originally introduced by Google

  14. PageRank Algorithm
      • Start each page with a rank of 1
      • On each iteration:
        A. Each page sends a contribution of curRank / |neighbors| to each of its neighbors
        B. Each page sets curRank = 0.15 + 0.85 · Σ_i contrib_i over the contributions it received
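     To make the two steps concrete, here is a small, purely local Scala sketch (no Spark); the four-page link structure is a made-up example rather than the graph from the diagrams that follow.

     object PageRankLocal {
       def main(args: Array[String]): Unit = {
         // Link structure: page -> pages it links to (hypothetical example)
         val links: Map[String, Seq[String]] = Map(
           "a" -> Seq("b", "c"),
           "b" -> Seq("a"),
           "c" -> Seq("a", "b", "d"),
           "d" -> Seq("c")
         )

         // Start each page with a rank of 1.0
         var ranks: Map[String, Double] = links.keys.map(_ -> 1.0).toMap

         for (_ <- 1 to 10) {
           // Step A: each page sends curRank / |neighbors| to every neighbor
           val contribs = links.toSeq.flatMap { case (page, neighbors) =>
             neighbors.map(dest => dest -> ranks(page) / neighbors.size)
           }
           // Step B: curRank = 0.15 + 0.85 * sum of contributions received
           val received = contribs.groupBy(_._1).map { case (p, cs) => p -> cs.map(_._2).sum }
           ranks = links.keys.map(p => p -> (0.15 + 0.85 * received.getOrElse(p, 0.0))).toMap
         }

         ranks.toSeq.sortBy(-_._2).foreach { case (p, r) => println(f"$p: $r%.2f") }
       }
     }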

  15. PageRank Example (diagram: a four-page graph where every page starts with rank 1.0)

  16. PageRank Example (diagram: each page sends contributions of rank / |links|, here 0.5 and 1.0, along its links)

  17. PageRank Example (diagram: ranks after the first iteration: 1.85, 1.0, 0.58, 0.58)

  18. PageRank Example (diagram: the second round of contributions sent along the links)

  19. PageRank Example (diagram: ranks after the second iteration: 1.72, 1.31, 0.58, 0.39)

  20. PageRank Example (diagram: the ranks eventually converge to approximately 1.44, 1.37, 0.73, 0.46)

  21. PageRank as Matrix Multiplication
      • The rank of each page is the probability that a random surfer on the web lands on that page
      • The probability distribution over pages after k steps is V_k = A^k × V
        V: the initial rank vector
        A: the link structure (a sparse matrix)

  22. Data Representation in Spark
      • Each page is identified by its unique URL rather than an index
      • Rank vector (V): RDD[(URL, Double)]
      • Links matrix (A): RDD[(URL, List(URL))]
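     A minimal sketch of how these two RDDs might be built, assuming a hypothetical input file "links.txt" with one "sourceURL destinationURL" pair per line (the file name, its format, and the GraphLoader helper are illustrations, not from the slides); note that Spark's groupByKey produces an Iterable rather than a List:

     import org.apache.spark.SparkContext
     import org.apache.spark.rdd.RDD

     object GraphLoader {
       def loadGraph(sc: SparkContext)
           : (RDD[(String, Iterable[String])], RDD[(String, Double)]) = {
         // Links matrix A: for every page, the collection of pages it links to
         val links = sc.textFile("links.txt")
           .map { line =>
             val Array(src, dst) = line.split("\\s+")
             (src, dst)
           }
           .distinct()
           .groupByKey()

         // Initial rank vector V: every page starts with rank 1.0
         val ranks = links.mapValues(_ => 1.0)

         (links, ranks)
       }
     }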

  23. Spark Implementation

      val links = // load RDD of (url, neighbors) pairs
      var ranks = // load RDD of (url, rank) pairs

      for (i <- 1 to ITERATIONS) {
        val contribs = links.join(ranks).flatMap {
          case (url, (links, rank)) =>
            links.map(dest => (dest, rank / links.size))
        }
        ranks = contribs.reduceByKey(_ + _)
                        .mapValues(0.15 + 0.85 * _)
      }
      ranks.saveAsTextFile(...)

  24. Matrix Multiplication
      • Repeatedly multiply the sparse matrix and the vector
      (diagram: Links (url, neighbors) and Ranks (url, rank) flowing through iteration 1, iteration 2, iteration 3, …; the same file is read over and over)

  25. Spark can do much better
      • Using cache(), keep the neighbors in memory
      • Do not write intermediate results to disk
      (diagram: Links (url, neighbors) joined with Ranks (url, rank) on every iteration; the same RDD is grouped over and over)

  26. Spark can do much better
      • Do not partition the neighbors every time
      (diagram: after partitionBy, Links (url, neighbors) stay on the same node as the corresponding Ranks (url, rank) across the joins)

  27. Spark Implementation

      var links = // load RDD of (url, neighbors) pairs
      var ranks = // load RDD of (url, rank) pairs

      // partitionBy returns a new RDD, so keep the partitioned, cached copy
      // (hashFunction is a Partitioner, e.g. a HashPartitioner)
      links = links.partitionBy(hashFunction).cache()

      for (i <- 1 to ITERATIONS) {
        val contribs = links.join(ranks).flatMap {
          case (url, (links, rank)) =>
            links.map(dest => (dest, rank / links.size))
        }
        ranks = contribs.reduceByKey(_ + _)
                        .mapValues(0.15 + 0.85 * _)
      }
      ranks.saveAsTextFile(...)

  28. Conclusions
      When applying any algorithm to big data, watch for:
      1. Correctness
      2. Performance
         • Cache RDDs to avoid I/O
         • Avoid unnecessary computation
      3. Trade-off between accuracy and performance

  29. Numerical Computing with Spark
