Statistical Computing with Matrix Biostatistics 615/815 Lecture 14: - PowerPoint PPT Presentation

. . October 25th, 2012 Biostatistics 615/815 - Lecture 14 Hyun Min Kang October 25th, 2012 Hyun Min Kang Statistical Computing with Matrix Biostatistics 615/815 Lecture 14: . . Summary PCA . Bulk Simple Regression Multiple Regression Simple Regression . . . . . . . . . . . . . . . . . 1 / 43 . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. Linear Regression October 25th, 2012 Biostatistics 615/815 - Lecture 14 Hyun Min Kang . . Key inferences under linear model . . . . Linear model . 2 / 43 Summary . PCA Bulk Simple Regression Multiple Regression Simple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . • y = X β + ϵ , where X is n × p matrix • Under normality assumption, y i ∼ N ( X i β, σ 2 ) . • Effect size : ˆ β = ( X T X ) − 1 X T y • Residual variance : ˆ σ 2 = ( y − X ˆ β ) T ( y − X ˆ β )/( n − p ) • Variance/SE of ˆ ˆ Var (ˆ σ 2 ( X T X ) − 1 β : β ) = ˆ • p-value for testing H 0 : β i = 0 or H o : R β = 0 .

. Bulk Simple Regression October 25th, 2012 Biostatistics 615/815 - Lecture 14 Hyun Min Kang . Using R to solve linear model Summary . PCA 3 / 43 . . Multiple Regression Simple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . > y <- rnorm(100) > x <- rnorm(100) > summary(lm(y~x)) Call: lm(formula = y ~ x) Residuals: Min 1Q Median 3Q Max -2.15759 -0.69613 0.08565 0.70014 2.62488 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.02722 0.10541 0.258 0.797 x -0.18369 0.10559 -1.740 0.085 . --- Signif. codes: ... Residual standard error: 1.05 on 98 degrees of freedom Multiple R-squared: 0.02996, Adjusted R-squared: 0.02006 F-statistic: 3.027 on 1 and 98 DF, p-value: 0.08505

. Bulk Simple Regression October 25th, 2012 Biostatistics 615/815 - Lecture 14 Hyun Min Kang . Summary . PCA 4 / 43 . Multiple Regression . Simple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dealing with large data with lm > y <- rnorm(5000000) > x <- rnorm(5000000) > system.time(print(summary(lm(y~x)))) Call: lm(formula = y ~ x) Residuals: Min 1Q Median 3Q Max -5.1310 -0.6746 0.0004 0.6747 5.0860 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.0005130 0.0004473 -1.147 0.251 x 0.0002359 0.0004473 0.527 0.598 Residual standard error: 1 on 4999998 degrees of freedom Multiple R-squared: 5.564e-08, Adjusted R-squared: -1.444e-07 F-statistic: 0.2782 on 1 and 4999998 DF, p-value: 0.5979 user system elapsed 57.434 14.229 100.607

. . October 25th, 2012 Biostatistics 615/815 - Lecture 14 Hyun Min Kang Can we leverage this simplicity to make a faster inference? . . Question of interest . . . A simpler model . A case for simple linear regression Summary Simple Regression . . . . . . . . . . . . . . . . . Multiple Regression . Bulk Simple Regression PCA 5 / 43 . . . . . . . . . . . . . . . . . . . . . . . . . . . . • y = β 0 + x β 1 + ϵ • X = [1 x ] , β = [ β 0 β 1 ] T .

. . October 25th, 2012 Biostatistics 615/815 - Lecture 14 Hyun Min Kang x x . . Making faster inferences . y . . . . Ingredients for simplification 6 / 43 A faster inference with simple linear model . . . . . . . . . . . . . Summary . Simple Regression Multiple Regression Bulk Simple Regression . . . PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . • σ 2 y = ( y − y ) T ( y − y )/( n − 1) • σ 2 x = ( x − x ) T ( x − x )/( n − 1) • σ xy = ( x − x ) T ( y − y )/( n − 1) √ σ 2 x σ 2 • ρ xy = σ xy / √ • ˆ σ 2 y / σ 2 β 1 = ρ xy √ • SE (ˆ σ 2 y (1 − ρ 2 xy )/( n − 2) σ 2 β 1 ) = √ ( n − 2)/(1 − ρ 2 • t = ρ xy xy ) follows t-distribution with d.f. n − 2

. Bulk Simple Regression October 25th, 2012 Biostatistics 615/815 - Lecture 14 Hyun Min Kang . Summary . PCA 7 / 43 Multiple Regression Simple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A faster R implementation # note that this is an R function, not C++ fastSimpleLinearRegression <- function(y, x) { y <- y - mean(y) x <- x - mean(x) n <- length(y) stopifnot(length(x) == n) # for error handling s2y <- sum( y * y ) / ( n - 1 ) # \sigma_y^2 s2x <- sum( x * x ) / ( n - 1 ) # \sigma_x^2 sxy <- sum( x * y ) / ( n - 1 ) # \sigma_xy rxy <- sxy / sqrt( s2y * s2x ) # \rho_xy b <- rxy * sqrt( s2y / s2x ) se.b <- sqrt( s2y * ( 1 - rxy * rxy ) / (n-2) / s2x ) tstat <- rxy * sqrt( ( n - 2 ) / ( 1 - rxy * rxy ) ) p <- pt( abs(tstat) , n - 2 , lower.tail=FALSE )*2 return(list( beta = b , se.beta = se.b , t.stat = tstat, p.value = p )) }

. Bulk Simple Regression October 25th, 2012 Biostatistics 615/815 - Lecture 14 Hyun Min Kang . Now it became must faster Summary . PCA 8 / 43 Multiple Regression Simple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . > system.time(print(fastSimpleLinearRegression(y,x))) $beta [1] 0.0002358472 $se.beta [1] 0.0004473 $t.stat [1] 0.5274646 $p.value [1] 0.597871 user system elapsed 0.382 1.849 3.042

• As fast performance as before • But do not store all the data into memory • R cannot be the solution in such cases - use C++ instead . . What we want . . . . . . . . Hyun Min Kang Biostatistics 615/815 - Lecture 14 October 25th, 2012 . . . Problem . . . . . . . . . . . . . . . . . Simple Regression Multiple Regression Bulk Simple Regression PCA . Summary Dealing with even larger data . 9 / 43 . . . . . . . . . . . . . . . . . . . . . . . . . . . . • Supposed that we now have 5 billion input data points • The issue is how to load the data • Storing 10 billion double will require 80 GB or larger memory

. Summary October 25th, 2012 Biostatistics 615/815 - Lecture 14 Hyun Min Kang . . What we want . . . Problem . Dealing with even larger data . . PCA . . . . . . . . . . . . . . . . . Simple Regression Multiple Regression Bulk Simple Regression 9 / 43 . . . . . . . . . . . . . . . . . . . . . . . . . . . . • Supposed that we now have 5 billion input data points • The issue is how to load the data • Storing 10 billion double will require 80 GB or larger memory • As fast performance as before • But do not store all the data into memory • R cannot be the solution in such cases - use C++ instead

x n y n xy n . y i n • nx x n i • • . . . . . . ny i n n October 25th, 2012 Biostatistics 615/815 - Lecture 14 Hyun Min Kang nxy xy i • . ny y i n • nx x . Extracting sufficient statistics from stream . PCA . Sufficient statistics for simple linear regression . Streaming the inputs to extract sufficient statistics Summary . Bulk Simple Regression . Multiple Regression Simple Regression . . . . . . . . . . . . . . . . . . . 1 n . . . . . 10 / 43 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ˆ 2 σ 2 x = Var ( x ) = ( x − x ) T ( x − x )/( n − 1) ˆ 3 σ 2 y = Var ( y ) = ( y − y ) T ( y − y )/( n − 1) ˆ 4 σ xy = Cov ( x , y ) = ( x − x ) T ( y − y )/( n − 1)

. . . 1 n . . . . . . . Extracting sufficient statistics from stream . . Hyun Min Kang Biostatistics 615/815 - Lecture 14 October 25th, 2012 . . Sufficient statistics for simple linear regression Bulk Simple Regression . . . . . . . . . . . . . . . . . Simple Regression . Multiple Regression 10 / 43 Summary . Streaming the inputs to extract sufficient statistics PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . ˆ 2 σ 2 x = Var ( x ) = ( x − x ) T ( x − x )/( n − 1) ˆ 3 σ 2 y = Var ( y ) = ( y − y ) T ( y − y )/( n − 1) ˆ 4 σ xy = Cov ( x , y ) = ( x − x ) T ( y − y )/( n − 1) • ∑ n i =1 x = nx • ∑ n i =1 y = ny i =1 x 2 = σ 2 x ( n − 1) + nx 2 • ∑ n i =1 y 2 = σ 2 y ( n − 1) + ny 2 • ∑ n • ∑ n i =1 xy = σ xy ( n − 1) + nxy

. Bulk Simple Regression October 25th, 2012 Biostatistics 615/815 - Lecture 14 Hyun Min Kang . Implementation : Streamed simple linear regression Summary . PCA 11 / 43 Multiple Regression Simple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . #include <iostream> #include <fstream> #include <boost/math/distributions/students_t.hpp> using namespace boost::math; // for calculating p-values from t-statistic int main(int argc, char** argv) { std::ifstream ifs(argv[1]); // read file from the file arguments double x, y; // temporay values to store the input double sumx = 0, sumsqx = 0, sumy = 0, sumsqy = 0, sumxy = 0; int n = 0; // extract a set of sufficient statistics while( ifs >> y >> x ) { // assuming each input line feeds y and x sumx += x; sumy += y; sumxy += (x*y); sumsqx += (x*x); sumsqy += (y*y); ++n; }

Statistical Computing with Matrix Biostatistics 615/815 Lecture 14: - PowerPoint PPT Presentation

. . October 25th, 2012 Biostatistics 615/815 - Lecture 14 Hyun Min Kang October 25th, 2012 Hyun Min Kang Statistical Computing with Matrix Biostatistics 615/815 Lecture 14: . . Summary PCA . Bulk Simple Regression Multiple Regression

Lecture 1 : Introduction to Statistical Computing Biostatistics 615/815 - Statistical Computing .

Lecture 1 : Introduction to Statistical Computing Biostatistics 615/815 - Statistical Computing .

R packages, and Matrix Library Biostatistics 615/815 Lecture 13: . . . . Matrix Computation

Statistical Computing Biostatistics 615/815 . . . . . . . Implementation . Overview .

Biostatistics 615/815 Lecture 13: . Programming with Matrix Hyun Min Kang February 17th, 2011

Interfacing C++ and R Biostatistics 615/815 Lecture 12: . . Summary . 1 / 34 . Cumsum

Multidimensional Optimizations Biostatistics 615/815 Lecture 19: . . . . . . Summary . .

and Elementary Data Structures Linear Sorting Algorithms Biostatistics 615/815 Lecture 6: . .

Biostatistics 615/815 Lecture 4: . . . . . . . User-defined Data Types, Divide and Conquer

Implementing Linear Regression Biostatistics 615/815 Lecture 14: . . . . . . . Summary

C++ Basics & Implementing Fishers Exact Test Biostatistics 615/815 - Lecture 3 . .

C++ Basics & Implementing Fishers Exact Test Biostatistics 615/815 - Lecture 3 . .

Data Structures Biostatistics 615/815 Lecture 7: . . . . . . Summary . . Tree List

Dynamic Programming Biostatistics 615/815 Lecture 8: . . Summary Edit Distance . MTP

and Monte Carlo Methods Random Numbers Biostatistics 615/815 Lecture 15: . . Summary . 1 /

Graph Algorithms Boost Library Biostatistics 615/815 Lecture 10: . . . . . . .

Post-Mortem Trust Planning, Modifications and Allocations: Tax Elections Available to the Executor

Los Angeles County Los Angeles County Regional DMMP Regional DMMP Pilot Studies Pilot Studies

Full year results presentation October 2017 02 Redefine International P .L.C. 2017 Full Year

SYNCHRONOUS COMMUNICATIONS USING GPUDIRECT Davide Rossetti, Elena Agostini 1 GPUDIRECT ELSEWHERE

Irradiated Material Advanced Repair Welding Molten Salt Reactor Workshop 2018 October 3, 2018 1

Research D a y 2 0 2 0 Welcome to Research Day 2020. Let us introduce you to the most important

1) 30% showing of interest required by 603 of the Public Employe Relations Act (PERA). Showing

WestConnex Acquisition Transurban equity raising For illustrative purposes only and is not

Statistical Computing with Matrix Biostatistics 615/815 Lecture 14: - PowerPoint PPT Presentation

. . October 25th, 2012 Biostatistics 615/815 - Lecture 14 Hyun Min Kang October 25th, 2012 Hyun Min Kang Statistical Computing with Matrix Biostatistics 615/815 Lecture 14: . . Summary PCA . Bulk Simple Regression Multiple Regression

Lecture 1 : Introduction to Statistical Computing Biostatistics 615/815 - Statistical Computing .

Lecture 1 : Introduction to Statistical Computing Biostatistics 615/815 - Statistical Computing .

R packages, and Matrix Library Biostatistics 615/815 Lecture 13: . . . . Matrix Computation

Statistical Computing Biostatistics 615/815 . . . . . . . Implementation . Overview .

Biostatistics 615/815 Lecture 13: . Programming with Matrix Hyun Min Kang February 17th, 2011

Interfacing C++ and R Biostatistics 615/815 Lecture 12: . . Summary . 1 / 34 . Cumsum

Multidimensional Optimizations Biostatistics 615/815 Lecture 19: . . . . . . Summary . .

and Elementary Data Structures Linear Sorting Algorithms Biostatistics 615/815 Lecture 6: . .

Biostatistics 615/815 Lecture 4: . . . . . . . User-defined Data Types, Divide and Conquer

Implementing Linear Regression Biostatistics 615/815 Lecture 14: . . . . . . . Summary

C++ Basics &amp; Implementing Fishers Exact Test Biostatistics 615/815 - Lecture 3 . .

C++ Basics &amp; Implementing Fishers Exact Test Biostatistics 615/815 - Lecture 3 . .

Data Structures Biostatistics 615/815 Lecture 7: . . . . . . Summary . . Tree List

Dynamic Programming Biostatistics 615/815 Lecture 8: . . Summary Edit Distance . MTP

and Monte Carlo Methods Random Numbers Biostatistics 615/815 Lecture 15: . . Summary . 1 /

Graph Algorithms Boost Library Biostatistics 615/815 Lecture 10: . . . . . . .

Post-Mortem Trust Planning, Modifications and Allocations: Tax Elections Available to the Executor

Los Angeles County Los Angeles County Regional DMMP Regional DMMP Pilot Studies Pilot Studies

Full year results presentation October 2017 02 Redefine International P .L.C. 2017 Full Year

SYNCHRONOUS COMMUNICATIONS USING GPUDIRECT Davide Rossetti, Elena Agostini 1 GPUDIRECT ELSEWHERE

Irradiated Material Advanced Repair Welding Molten Salt Reactor Workshop 2018 October 3, 2018 1

Research D a y 2 0 2 0 Welcome to Research Day 2020. Let us introduce you to the most important

1) 30% showing of interest required by 603 of the Public Employe Relations Act (PERA). Showing

WestConnex Acquisition Transurban equity raising For illustrative purposes only and is not

C++ Basics & Implementing Fishers Exact Test Biostatistics 615/815 - Lecture 3 . .

C++ Basics & Implementing Fishers Exact Test Biostatistics 615/815 - Lecture 3 . .