SLIDE 1

Numerically Stable Parallel Computation of (Co-)Variance

Erich Schubert, Michael Gertz 30th Int. Conf. on Scientific and Statistical Database Management (SSDBM ’18) July 9–11, 2018, Bozen-Bolzano, Italy

Ruprecht-Karls-Universität Heidelberg

{schubert,gertz}@informatik.uni-heidelberg.de

SLIDE 2

Variance & Covariance

SLIDE 3

Variance

Variance is a widely used summary statistic:

  Var(X) = E[(X − E[X])²]

where E[·] denotes the expected value (e.g., the arithmetic average). Variance is the "expected squared deviation from the mean". To estimate the variance from a data sample ("two-pass algorithm"):

  1. compute µ_X = (1/n) Σᵢ xᵢ
  2. compute Var(X) = (1/(n−1)) Σᵢ (xᵢ − µ_X)² (or with normalization factor 1/n)

From this we can, e.g., compute the standard deviation σ_X := √Var(X), the most common measure of spread.
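A minimal sketch of this two-pass estimator in Python (function and variable names are ours, for illustration):

    import math

    def two_pass_variance(xs):
        """Sample variance in two passes: the mean first, then the squared deviations."""
        n = len(xs)
        mean = sum(xs) / n                                 # pass 1: mu_X
        return sum((x - mean) ** 2 for x in xs) / (n - 1)  # pass 2: 1/(n-1) normalization

    xs = [2.0, 4.0, 4.0, 6.0]
    print(two_pass_variance(xs))             # 2.666...
    print(math.sqrt(two_pass_variance(xs)))  # sigma_X, the standard deviation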


SLIDE 4

Covariance

Covariance is similar, but for two variables:

  Cov(X, Y) = E[(X − E[X]) (Y − E[Y])]

In particular, we have Cov(X, X) = Var(X). Used for example in:

◮ Pearson correlation: ρ_X,Y = Cov(X, Y) / (√Var(X) · √Var(Y)) = Cov(X, Y) / (σ_X · σ_Y)
◮ Principal Component Analysis (PCA): decomposition of the covariance matrix
◮ Gaussian Mixture Modeling ("EM clustering"): uses weighted (co-)variance
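A direct transcription of these definitions in Python (a sketch, names ours), including the Cov(X, X) = Var(X) identity:

    def covariance(xs, ys):
        """Two-pass sample covariance with 1/(n-1) normalization."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

    def pearson(xs, ys):
        """Pearson correlation: Cov(X, Y) / (sigma_X * sigma_Y)."""
        return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5

    xs, ys = [1.0, 2.0, 4.0], [1.0, 3.0, 5.0]
    print(covariance(xs, xs))  # equals the sample variance of xs
    print(pearson(xs, ys))     # ≈ 0.98: the two series move together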


SLIDE 6

Variance

In most statistics textbooks, we will find this "textbook algorithm":

  Var(X) = E[(X − E[X])²] = E[X²] − E[X]²   (1)

This is:

◮ mathematically correct (proven; cf. the König–Huygens formula, Steiner translation)
◮ attractive (just one pass over the data, aggregating Σᵢ xᵢ, Σᵢ xᵢ², and N)
◮ numerically problematic with floating-point computations

Use Equation (1) only analytically, not with floating-point data.
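For contrast, a sketch of the textbook one-pass aggregation (our transcription); it needs only the aggregates Σᵢ xᵢ, Σᵢ xᵢ², and N, which is exactly what makes it tempting for databases:

    def textbook_variance(xs):
        """One-pass E[X^2] - E[X]^2 form: mathematically correct, numerically
        unstable. Shown only to make the three aggregates explicit."""
        n, sum_x, sum_xx = 0, 0.0, 0.0
        for x in xs:
            n += 1
            sum_x += x
            sum_xx += x * x
        return (sum_xx - sum_x * sum_x / n) / (n - 1)  # sample variance, 1/(n-1)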


SLIDE 8

Catastrophic Cancellation

For illustration, assume floating-point numbers with four significant digits of precision (digits after the gap are beyond that precision and get lost), computing E[X²] − E[X]² = Var(X):

    0.1234 374
  − 0.0001 234
  = 0.1233 140    ✓ the leading digits of the result are correct

    0.1234 3762
  − 0.1234 1521
  = 0.0000 0000   ✗ all retained digits cancel; the true difference 0.00002241 is lost entirely

If Var(X) ≫ E[X]², precision is good. But as E[X]² grows, we lose numerical precision. We can first center our data so that E[X] = 0; then

  Var(X) = E[(X − E[X])²] = E[X²]   (using E[X] = 0)

But then we need two passes over the data set. For large data, this will be 2× slower.
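The four-digit illustration above can be reproduced with Python's decimal module (our example; the unary plus rounds each operand to the context precision):

    from decimal import Decimal, getcontext

    getcontext().prec = 4           # four significant digits, as in the illustration
    ex2 = +Decimal("0.12343762")    # E[X^2]  -> stored as 0.1234
    exsq = +Decimal("0.12341521")   # E[X]^2  -> stored as 0.1234
    print(ex2 - exsq)               # 0.0000: every significant digit has cancelled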

Many algorithms compute the variance in a single pass over the data set, e.g., Welford [Wel62], Neely [Nee66], Rodden [Rod67], Van Reeken [Van68], Youngs and Cramer [YC71], Ling [Lin74], Hanson [Han75], Cotton [Cot75], West [Wes79], Chan and Lewis [CL79], Donald Knuth in TAoCP II [Knu81], Chan et al. [CGL82; CGL83], ...


SLIDE 9

Solved Problem?

Already solved since the 70s?

◮ mostly incremental (add one sample) variants of variance
◮ still broken (or slow) in many SQL databases & toolkits!

Let us build a small unit test with two values, [µ − 1, µ + 1], which have mean µ:

  Var(X) = (1/(2−1)) · ((µ − 1 − µ)² + (µ + 1 − µ)²) = (−1)² + 1² = 2

Easy with µ = 0, but we will use µ = 100 000 000, and thus µ² = 10¹⁶. Double precision: about 16 decimal digits (52+1 bit mantissa). Single precision: about 6 decimal digits (23+1 bit mantissa).

This breaks way too early for many use cases!
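The same unit test takes a few lines of Python (our transcription), comparing the two-pass estimator against the textbook formula in double precision:

    xs = [99999999.0, 100000001.0]  # mu = 1e8, true variance = 2
    n = len(xs)
    mean = sum(xs) / n
    two_pass = sum((x - mean) ** 2 for x in xs) / (n - 1)
    textbook = (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)
    print(two_pass)                 # 2.0 -- correct
    print(textbook)                 # 0.0 -- mu^2 = 1e16 consumed all 16 digits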


SLIDE 10

Solved Problem?

PostgreSQL 9.6:

  SELECT VAR_SAMP(x::float8), COVAR_SAMP(x, x)
  FROM (SELECT 99999999 AS x UNION SELECT 100000001 AS x) AS x;

Result: 0 ✗ and 0 ✗ (expected: 2)


SLIDE 11

Solved Problem?

MySQL 5.6:

  SELECT VAR_SAMP(X)
  FROM (SELECT 99999999 AS X UNION SELECT 100000001 AS X) AS X;

Result: 2 ✓ (but no covariance function?)

MySQL is one of the few databases that implements a numerically stable approach.


SLIDE 12

Solved Problem?

MS SQL Server 2017:

  SELECT VAR(x)
  FROM (SELECT 99999999 AS x UNION SELECT 100000001 AS x) AS x;

Result: 0 ✗ (and no covariance function?)


SLIDE 13

Solved Problem?

HyPer 0.6:

  SELECT VAR_SAMP(x)
  FROM (SELECT 99999999::REAL AS x UNION SELECT 100000001::REAL AS x);

Result: 0 ✗ (and no covariance function?)


SLIDE 15

Contributions

In this paper, we revisit the 1970s results, and contribute (co-)variance computation that is:

◮ numerically stable: based on the 1970s methods
◮ weighted: but with arbitrary weighting
◮ parallelizable: enabling partitioned computation (AVX, ...)
◮ for both covariance and variance


SLIDE 16

Weighted Incremental (Co-)Variance

SLIDE 17

Generalized Form

To derive the general form, we first:

  1. remove the scaling factor 1/(n−1) (resp. 1/n) for now (a simplification!)
  2. partition the data into parts A and B
  3. add weights ωᵢ to each observation (using Ω_A = Σ_{i∈A} ωᵢ)

Then we get, for any partition A and variables X, Y, the weighted means

  x̄_A = (1/Ω_A) Σ_{i∈A} ωᵢ xᵢ
  ȳ_A = (1/Ω_A) Σ_{i∈A} ωᵢ yᵢ

and the weighted sum of cross-deviations

  Cov(X, Y)_A ∝ XY_A = Σ_{i∈A} ωᵢ (xᵢ − x̄_A)(yᵢ − ȳ_A)

We can get the usual covariance with ωᵢ = 1 and Cov(X, Y) = XY / (Ω − 1).


SLIDE 18

Generalized Form

Using a variant of König–Huygens and some algebra (see the paper for details), we get the equations to merge two partitions A and B:

  Ω_AB = Ω_A + Ω_B
  x̄_AB = (Ω_A x̄_A + Ω_B x̄_B) / Ω_AB
  ȳ_AB = (Ω_A ȳ_A + Ω_B ȳ_B) / Ω_AB
  XY_AB = XY_A + XY_B + (Ω_A Ω_B / Ω_AB) (x̄_A − x̄_B)(ȳ_A − ȳ_B)

All differences are at data precision. ✓

Benefits of this form:

◮ a partition P can be summarized to a few values: Ω_P, x̄_P, ȳ_P, XY_P
◮ two partition summaries can be combined into one
◮ we can partition our data using AVX, GPU, clusters, ...

Note: for |B| = 1 and ωᵢ = 1, this gives the "online" equations known from the literature:

  XY_Ab = XY_A + 0 + (|A| / (|A| + 1)) (x̄_A − x_b)(ȳ_A − y_b)
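A minimal sketch of these merge equations in Python (class and function names are ours); the paper's implementations target SQL aggregates and AVX, but the algebra is identical:

    from dataclasses import dataclass

    @dataclass
    class CovSummary:
        """Summary of one partition: Omega, the weighted means, and
        XY = sum_i w_i (x_i - xbar)(y_i - ybar)."""
        w: float = 0.0   # Omega
        mx: float = 0.0  # xbar
        my: float = 0.0  # ybar
        xy: float = 0.0  # XY

    def merge(a: CovSummary, b: CovSummary) -> CovSummary:
        """Combine two partition summaries with the merge equations above."""
        if a.w == 0.0:
            return b
        if b.w == 0.0:
            return a
        w = a.w + b.w
        mx = (a.w * a.mx + b.w * b.mx) / w
        my = (a.w * a.my + b.w * b.my) / w
        xy = a.xy + b.xy + a.w * b.w / w * (a.mx - b.mx) * (a.my - b.my)
        return CovSummary(w, mx, my, xy)

    def covar_samp(s: CovSummary) -> float:
        """Sample covariance XY / (Omega - 1), for unit weights."""
        return s.xy / (s.w - 1.0)

    # The two-value unit test, now stable; a single observation is just
    # a partition with Omega = w and XY = 0.
    s = CovSummary()
    for v in (99999999.0, 100000001.0):
        s = merge(s, CovSummary(1.0, v, v, 0.0))
    print(covar_samp(s))  # 2.0 ✓ (using Cov(X, X) = Var(X))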


SLIDE 19

Example: AVX Parallelization of Variance

Advanced Vector Extensions (AVX) are modern vector instructions that apply the same operation to 4–8 doubles (8–16 single-precision floats) in parallel.

[Diagram: input data is loaded into AVX registers; per-lane AVX accumulators build the partition sums Σ X and the squared deviations Σ (X − E[X])²; a final reduction merges the lanes into the output. Combining the per-lane partial results needs weighted aggregation.]

Because the final reduction cost is negligible for larger data sets, our parallel AVX versions are ≈ 4× faster than the comparable non-parallel versions.

On GPUs, we could do this 1000× parallel (but beware other GPU precision challenges)!
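In plain Python, the lane-partitioned scheme looks like the sketch below (ours, specialized to variance, i.e., y = x): four independent "lanes" each summarize a strided partition, and a short reduction merges them at the end.

    def merge_var(a, b):
        """Merge two variance summaries (weight, mean, sum of squared deviations)."""
        (wa, ma, xxa), (wb, mb, xxb) = a, b
        if wa == 0.0:
            return b
        w = wa + wb
        m = (wa * ma + wb * mb) / w
        xx = xxa + xxb + wa * wb / w * (ma - mb) ** 2
        return (w, m, xx)

    data = [100000000.0 + (i % 5 - 2) for i in range(1000)]  # mean 1e8, small spread

    lanes = [(0.0, 0.0, 0.0)] * 4               # four "AVX lanes"
    for i, x in enumerate(data):
        lanes[i % 4] = merge_var(lanes[i % 4], (1.0, x, 0.0))

    w, m, xx = merge_var(merge_var(lanes[0], lanes[1]), merge_var(lanes[2], lanes[3]))
    print(xx / (w - 1.0))                       # ≈ 2.002, stable despite the large mean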


SLIDE 20

Experiments

SLIDE 21

Numeric Precision of Variance

Numeric precision of different (unweighted) variance algorithms:

[Plot: relative error (error/mean, 10⁻²⁵ to 10⁵) versus mean/standard-deviation ratio (10⁻⁴ to 10¹⁰), in double precision, comparing: Textbook E[X²] − E[X]²; Two-pass E[(X − E[X])²]; Hanson; West; Youngs-Cramer; Two-pass with Neely's improvement.]


SLIDE 22

Performance and Accuracy of Variance

Excerpt of results (see the article for many more variants) on 100,000,000 synthetic doubles:

  Method            Variant  |  Runtime (s)                     |  Precision of variance (decimal digits)
                             |  Min     Mean    Median  Max     |  Best    Mean    Median  Worst
  Textbook          double   |  168.85  168.97  168.93  169.22  |  12.848   4.086   6.150  -11.153
  Welford / Knuth   double   |  929.17  929.97  929.93  931.18  |  13.224   7.441   8.787    0.963
  Youngs & Cramer   double   |  212.20  212.53  212.49  213.31  |  12.840   8.284   9.588    0.454
  Welford / Knuth   AVX×4    |   50.93   51.47   51.22   52.83  |  14.041   8.892  11.306    0.547
  Youngs & Cramer   AVX×4    |   49.82   52.88   52.98   54.85  |  14.353   9.524  10.600    0.805
  Two-pass          double   |  336.78  337.02  336.93  337.38  |  14.135  10.168  12.383    1.045
  Two-pass          AVX×4    |   91.04   92.17   91.74   95.31  |  14.239  11.907  13.595    1.108
  Two-pass (Neely)  double   |  337.15  337.41  337.32  338.08  |  14.135  12.372  12.485   10.042
  Two-pass (Neely)  AVX×4    |   89.07   90.47   89.75   95.90  |  14.375  13.240  13.591   10.707

Textbook: E[X²] − E[X]²; Two-pass: E[(X − E[X])²]; Welford: [Wel62]; Knuth: [Knu81]; Youngs-Cramer: [YC71]; Neely's two-pass improvement: [Nee66].

(The runtime difference between Welford / Knuth and Youngs & Cramer is explained by CPU pipelining; see "Pipelining Effects" at the end.)


SLIDE 23

Example: Gaussian Mixture Modeling

Comparing different EM clustering implementations:

[Plot: average error (10⁻²⁰ to 10²⁰) versus relative distance of clusters (0.1 to 10⁸), comparing R Mclust (two-pass), Spark single-core (textbook), Spark multi-core (textbook), sklearn (two-pass), ELKI (textbook), ELKI (two-pass), and ELKI (Welford).]


SLIDE 24

Example: Gaussian Mixture Modeling

Comparing different EM clustering implementations:

[Plot: average runtime in seconds (5 to 20) versus relative distance of clusters (0.1 to 10⁸) for the same implementations: R Mclust (two-pass), Spark single-/multi-core (textbook), sklearn (two-pass), ELKI (textbook, two-pass, Welford).]


SLIDE 25

Example: Exponentially Weighted Moving Variance

Improving time series analysis with a weighted moving standard deviation:
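One way to realize this with the weighted form (our own sketch, not code from the paper): keep a running summary (Ω, x̄, XX) and decay all previous weights by a factor λ before each new observation. Scaling every ωᵢ by λ multiplies Ω and XX by λ while leaving x̄ unchanged; the new point is then merged in as a partition of weight one.

    def ewm_update(state, x, lam=0.99):
        """Exponentially weighted moving variance: decay, then merge one point."""
        w, m, xx = state
        w, xx = lam * w, lam * xx             # decay all previous weights by lam
        if w == 0.0:
            return (1.0, x, 0.0)              # first observation
        nw = w + 1.0
        nxx = xx + w / nw * (x - m) ** 2      # merge step with Omega_B = 1, XX_B = 0
        nm = (w * m + x) / nw
        return (nw, nm, nxx)

    state = (0.0, 0.0, 0.0)
    for x in [1.0, 2.0, 100.0, 3.0, 2.0]:     # a spike in the series
        state = ewm_update(state, x)
        w, m, xx = state
        print(round((xx / w) ** 0.5, 2))      # the moving std. deviation tracks the spike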


SLIDE 26

Example: Exponentially Weighted Moving Correlation

Weighted moving correlation of tickers (covariance normalized by the standard deviations).


SLIDE 27

◮ We can compute weighted (co-)variance accurately in parallel, on partitions, and distributed
◮ Numerical precision matters: do not compute E[X²] − E[X]² with floats
◮ Even basic statistics can be tricky to compute reliably
◮ Test your tools; do not blindly trust them


SLIDE 28

Outline

Variance & Covariance
  Definition
  Computing (Co-)Variance
  Contributions
Weighted Incremental (Co-)Variance
  General Form
  Example: AVX Parallelization
Experiments
Bibliography


SLIDE 29

Bibliography i

[CGL82] T. F. Chan, G. H. Golub, and R. J. LeVeque. "Updating Formulae and a Pairwise Algorithm for Computing Sample Variances". In: COMPSTAT 1982. 1982, pp. 30–41.

[CGL83] T. F. Chan, G. H. Golub, and R. J. LeVeque. "Algorithms for Computing the Sample Variance: Analysis and Recommendations". In: The American Statistician 37.3 (1983), pp. 242–247.

[CL79] T. F. Chan and J. G. Lewis. "Computing Standard Deviations: Accuracy". In: Communications of the ACM 22.9 (1979), pp. 526–531.

[Cot75] I. W. Cotton. "Remark on Stably Updating Mean and Standard Deviation of Data". In: Communications of the ACM 18.8 (1975), p. 458.


SLIDE 30

Bibliography ii

[Han75] R. J. Hanson. "Stably Updating Mean and Standard Deviation of Data". In: Communications of the ACM 18.1 (1975), pp. 57–58.

[Knu81] D. E. Knuth. The Art of Computer Programming, Volume II: Seminumerical Algorithms, 2nd Edition. Addison-Wesley, 1981. ISBN: 0-201-03822-6.

[Lin74] R. F. Ling. "Comparison of Several Algorithms for Computing Sample Means and Variances". In: Journal of the American Statistical Association 69.348 (1974), pp. 859–866.

[Nee66] P. M. Neely. "Comparison of Several Algorithms for Computation of Means, Standard Deviations and Correlation Coefficients". In: Communications of the ACM 9.7 (1966), pp. 496–499.

[Rod67] B. E. Rodden. "Error-free methods for statistical computations". In: Communications of the ACM 10.3 (1967), pp. 179–180.


SLIDE 31

Bibliography iii

[Van68] A. J. Van Reeken. "Letters to the editor: Dealing with Neely's algorithms". In: Communications of the ACM 11.3 (1968), pp. 149–150.

[Wel62] B. P. Welford. "Note on a Method for Calculating Corrected Sums of Squares and Products". In: Technometrics 4.3 (1962), pp. 419–420.

[Wes79] D. H. D. West. "Updating mean and variance estimates: an improved method". In: Communications of the ACM 22.9 (1979), pp. 532–535.

[YC71] E. A. Youngs and E. M. Cramer. "Some Results Relevant to Choice of Sum and Sum-of-Product Algorithms". In: Technometrics 13.3 (1971), pp. 657–665.


SLIDE 32

Pipelining Effects

West / Welford / Knuth: these methods use one multiplication less, but are slower. Youngs & Cramer: this method uses one multiplication more, but is faster.

[Dataflow diagrams, West [Wes79] vs. Youngs and Cramer [YC71]: in West's update, the multiplications that update XX_A depend on the result of the division (the new mean); in Youngs & Cramer's update, the multiplications are independent of the division result.]

The runtime difference can be explained by CPU pipelining (and slow divisions): work that does not depend on the division can proceed while the division is still executing, whereas work that depends on its result must wait. With our AVX code, we compute the division only once, broadcast it, and use it via AVX multiplication, which allows better pipelining.

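For concreteness, the two unweighted single-sample updates side by side (a Python sketch, ours; the point is the dependency structure, not the language):

    def welford_update(n, mean, m2, x):
        """Welford [Wel62] / Knuth [Knu81]: one multiplication fewer, but the
        product needs the new mean, i.e., the division result."""
        n += 1
        delta = x - mean
        mean += delta / n            # division ...
        m2 += delta * (x - mean)     # ... this multiplication must wait for it
        return n, mean, m2

    def youngs_cramer_update(n, t, s, x):
        """Youngs & Cramer [YC71]: one multiplication more, but d * d is
        independent of the division, so the CPU can overlap the two."""
        n += 1
        t += x                       # running sum of x
        if n > 1:
            d = n * x - t
            s += d * d / (n * (n - 1))
        return n, t, s

    # Both yield the sample variance 2.0 on the earlier unit test:
    n = mean = m2 = 0
    for x in (99999999.0, 100000001.0):
        n, mean, m2 = welford_update(n, mean, m2, x)
    print(m2 / (n - 1))              # 2.0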