SLIDE 1

On corrections of classical multivariate tests for high-dimensional data

Jian-feng Yao

with Zhidong Bai, Dandan Jiang, Shurong Zheng

SLIDE 2

Overview

Introduction
    High-dimensional data and new challenge in statistics
    A two-sample problem
Sample covariance matrices
    Sample vs. population covariance matrices
    Marčenko-Pastur distributions
    Bai and Silverstein's CLT for linear spectral statistics
Random Fisher matrices
Testing covariance matrices I
    Simulation study I
Testing covariance matrices II
    Simulation study II
Multivariate regressions
Conclusions

SLIDE 4

High-dimensional data

High-dimensional data ≠ high-dimensional models

◮ Nonparametric regression: a very high-dimensional model (i.e. an infinite-dimensional model) but with one-dimensional data: yi = f(xi) + εi, f : R → R, i = 1, . . . , n

◮ High-dimensional data: observation vectors yi ∈ Rp, with p relatively large w.r.t. the sample size n

SLIDE 5

High-dimensional data

Some typical data dimensions:

  data                  dimension p   sample size n   ratio n/p
  portfolio             ~ 50          500             10
  climate survey        320           600             1.9
  speech analysis       a · 10²       b · 10²         ~ 1
  ORL face data base    1440          320             0.22
  micro-arrays          2000          200             0.1

◮ Important: the data ratio n/p is not always large; it can even be ≪ 1.

◮ Note: we often use the inverse data ratio y = p/n.

SLIDE 6

High-dimensional effect by an example

The two-sample problem:

◮ two independent samples:

x1, . . . , xn1 ∼ (µ1, Σ),   y1, . . . , yn2 ∼ (µ2, Σ)

◮ we want to test H0 : µ1 = µ2 against H1 : µ1 ≠ µ2.

◮ Classical approach: Hotelling's T² test

T^2 = \frac{n_1 n_2}{n}\,(\bar x - \bar y)' S_n^{-1} (\bar x - \bar y),

where

\bar x = \frac{1}{n_1}\sum_{i=1}^{n_1} x_i, \qquad \bar y = \frac{1}{n_2}\sum_{j=1}^{n_2} y_j, \qquad n = n_1 + n_2,

S_n = \frac{1}{n-2}\Big[\sum_{i=1}^{n_1}(x_i - \bar x)(x_i - \bar x)' + \sum_{j=1}^{n_2}(y_j - \bar y)(y_j - \bar y)'\Big].

S_n is the pooled sample covariance matrix.
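For concreteness, a minimal sketch of the T² statistic above in Python (numpy/scipy), together with the classical fixed-p F calibration; the function name and interface are illustrative, not the speaker's code.

```python
import numpy as np
from scipy import stats

def hotelling_t2(x, y):
    """Two-sample Hotelling T^2 (sketch). x: (n1, p) sample, y: (n2, p) sample."""
    n1, p = x.shape
    n2 = y.shape[0]
    n = n1 + n2
    xbar, ybar = x.mean(axis=0), y.mean(axis=0)
    # pooled sample covariance matrix S_n with divisor n - 2
    Sn = ((x - xbar).T @ (x - xbar) + (y - ybar).T @ (y - ybar)) / (n - 2)
    d = xbar - ybar
    T2 = n1 * n2 / n * d @ np.linalg.solve(Sn, d)
    # classical calibration: T2 (n-p-1) / ((n-2) p) ~ F(p, n-p-1) under H0 (Gaussian, fixed p)
    F = (n - p - 1) / ((n - 2) * p) * T2
    return T2, 1 - stats.f.cdf(F, p, n - p - 1)
```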

SLIDE 7

The two-sample problem:

Hotelling's T² test: nice properties

◮ invariance under linear transformations;
◮ finite-sample optimality in the Gaussian case; asymptotic optimality otherwise.

Hotelling's T² test: bad news

◮ low power even for moderate data dimensions;
◮ high numerical instability in computing S_n^{-1}, even for p = 40;
◮ very little is known in the non-Gaussian case;
◮ fatal deficiency: when p > n − 2, S_n is not invertible.

SLIDE 8

Dempster’s non-exact test (NET)

Dempster A.P., ’58, ’60

◮ A reasonable test should be based on \bar x - \bar y even when p > n − 2;
◮ choose a new basis of R^n and project the data so that
  1. axis 1 carries the grand mean (n_1\mu_1 + n_2\mu_2)/n;
  2. axis 2 carries \bar x - \bar y.

◮ Let the data matrix be X_{n \times p} = (x_1, \dots, x_{n_1}, y_1, \dots, y_{n_2})' and let H_n = (h_1, \dots, h_n)' be the (orthonormal) change of basis:

Z_{n \times p} = H_n X = (z_1, \dots, z_n)', \qquad h_1 = \frac{1}{\sqrt{n}}\,\mathbf 1_n, \qquad h_2 = \Big(\sqrt{\tfrac{n_2}{n\,n_1}}\,\mathbf 1_{n_1}',\ -\sqrt{\tfrac{n_1}{n\,n_2}}\,\mathbf 1_{n_2}'\Big)'.

Under normality, we have:

◮ the z_i's are n independent N_p(\ast, \Sigma);
◮ E z_1 = \frac{1}{\sqrt{n}}\,(n_1\mu_1 + n_2\mu_2), \qquad E z_2 = \sqrt{\tfrac{n_1 n_2}{n}}\,(\mu_1 - \mu_2), \qquad E z_i = 0, \quad i = 3, \dots, n.

SLIDE 9

Dempster’s non-exact test (NET)

Test statistic:

◮ F_D = (n-2)\,\dfrac{\|z_2\|^2}{\|z_3\|^2 + \cdots + \|z_n\|^2}

◮ Under H0,

\|z_j\|^2 \sim Q := \sum_{k=1}^{r} \alpha_k\,\chi^2_1(k),

where \alpha_1 \ge \cdots \ge \alpha_r > 0 are the non-null eigenvalues of \Sigma.

◮ The distribution of F_D is complicated.
◮ Approximations (hence a non-exact test), treating \Sigma as if \Sigma = I_p:
  1. Q \simeq m\,\chi^2_r;
  2. next, estimate r by \hat r.
◮ Finally, under H0, F_D \simeq F(\hat r, (n-2)\hat r).
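A small numerical sketch of Dempster's statistic (Python/numpy; names are mine). The random completion of the orthonormal basis below is one arbitrary choice of H_n; as the next slide points out, the exact behaviour of the test depends on that choice.

```python
import numpy as np

def dempster_fd(x, y, rng=np.random.default_rng(0)):
    """Dempster's non-exact test statistic F_D (sketch). x: (n1, p), y: (n2, p)."""
    n1, n2 = x.shape[0], y.shape[0]
    n = n1 + n2
    X = np.vstack([x, y])                                  # n x p data matrix
    h1 = np.ones(n) / np.sqrt(n)
    h2 = np.concatenate([np.full(n1,  np.sqrt(n2 / (n * n1))),
                         np.full(n2, -np.sqrt(n1 / (n * n2)))])
    # complete (h1, h2) into an orthonormal basis of R^n (arbitrary completion)
    H, _ = np.linalg.qr(np.column_stack([h1, h2, rng.standard_normal((n, n - 2))]))
    Z = H.T @ X                                            # rows z_1', ..., z_n'
    return (n - 2) * np.sum(Z[1] ** 2) / np.sum(Z[2:] ** 2)
```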

SLIDE 10

Dempster’s non-exact test (NET)

Problems with the NET test:

◮ It is difficult to construct the orthogonal transformation H_n = {h_j} for large n;
◮ even under Gaussianity, the exact power function depends on the choice of H_n.

SLIDE 11

Bai and Saranadasa’s test (ANT)

Bai & Saranadasa, ’96

◮ Consider directly the statistic

M_n = \|\bar x - \bar y\|^2 - \frac{n}{n_1 n_2}\,\mathrm{tr}\,S_n\,;

◮ quite generally, under very mild conditions (here RMT comes in!),

\frac{M_n}{\sigma_n} \Longrightarrow N(0, 1), \qquad \sigma_n^2 := \mathrm{Var}(M_n) = \frac{2\,n^2(n-1)}{n_1^2 n_2^2 (n-2)}\,\mathrm{tr}\,\Sigma^2\,.

◮ A ratio-consistent estimator:

\hat\sigma_n^2 = \frac{2\,n(n-1)(n-2)}{n_1^2 n_2^2 (n-3)}\Big[\mathrm{tr}\,S_n^2 - \frac{1}{n-2}\,(\mathrm{tr}\,S_n)^2\Big], \qquad \hat\sigma_n^2 / \sigma_n^2 \xrightarrow{P} 1.

◮ Finally, under H0,

Z_n = \frac{M_n}{\hat\sigma_n} \Longrightarrow N(0, 1).

This is Bai and Saranadasa's asymptotic normal test (ANT).
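The statistic translates almost line by line into code. A minimal sketch (Python/numpy/scipy, my own naming), with a one-sided p-value for illustration:

```python
import numpy as np
from scipy import stats

def bai_saranadasa_ant(x, y):
    """Bai-Saranadasa ANT (sketch). x: (n1, p), y: (n2, p)."""
    n1, p = x.shape
    n2 = y.shape[0]
    n = n1 + n2
    xbar, ybar = x.mean(axis=0), y.mean(axis=0)
    # pooled sample covariance matrix with divisor n - 2
    Sn = ((x - xbar).T @ (x - xbar) + (y - ybar).T @ (y - ybar)) / (n - 2)
    Mn = np.sum((xbar - ybar) ** 2) - n / (n1 * n2) * np.trace(Sn)
    # ratio-consistent estimator of Var(Mn)
    var_hat = (2 * n * (n - 1) * (n - 2)) / (n1**2 * n2**2 * (n - 3)) \
              * (np.trace(Sn @ Sn) - np.trace(Sn) ** 2 / (n - 2))
    Zn = Mn / np.sqrt(var_hat)
    return Zn, 1 - stats.norm.cdf(Zn)   # reject H0 for large Zn
```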

SLIDE 12

Comparison between T², NET and ANT

Power functions:

◮ Assume p → ∞, n → ∞, p/n → y ∈ (0, 1), n_1/n → κ.
◮ For Hotelling's T², Dempster's NET and Bai-Saranadasa's ANT:

\beta_H(\mu) = \Phi\Big(-\xi_\alpha + \sqrt{\tfrac{n(1-y)}{2y}}\,\kappa(1-\kappa)\,\|\Sigma^{-1/2}\mu\|^2\Big) + o(1),

\beta_D(\mu) = \Phi\Big(-\xi_\alpha + \tfrac{n}{\sqrt{2\,\mathrm{tr}\,\Sigma^2}}\,\kappa(1-\kappa)\,\|\mu\|^2\Big) + o(1) = \beta_{BS}(\mu),

where \alpha is the test size, \mu = \mu_1 - \mu_2 and \xi_\alpha = \Phi^{-1}(1-\alpha).

◮ Important: because of the factor (1 − y), T² loses power as y increases, i.e. as p grows relative to n.

SLIDE 13

Comparison between T², NET and ANT

Simulation results 1: Gaussian case

◮ Choice of covariance: \Sigma = (1-\rho)I_p + \rho J_p, with J_p = \mathbf 1_p \mathbf 1_p'.
◮ Non-centrality parameter \eta = \dfrac{\|\mu_1 - \mu_2\|^2}{\sqrt{\mathrm{tr}\,\Sigma^2}}, \quad (n_1, n_2) = (25, 20), \ n = 45.

SLIDE 14

A summary of the introduction

◮ High-dimensional effects need to be taken into account;
◮ surprisingly, asymptotic methods based on RMT perform well even for small p (as low as p = 4);
◮ many classical multivariate analysis methods have to be re-examined with respect to high-dimensional effects.

SLIDE 16

The Marčenko-Pastur distribution

Theorem (Marčenko & Pastur, 1967). Assume:

◮ X is a p × n matrix of i.i.d. variables with mean 0 and variance 1, and \Sigma = I_p;
◮ the entries are not necessarily Gaussian, but have a finite 4th moment;
◮ p → ∞, n → ∞, p/n → y ∈ (0, 1].

Then the empirical distribution of the eigenvalues of S_n = \frac{1}{n} X X^{T} converges to the distribution with density

f(x) = \frac{1}{2\pi y x}\,\sqrt{(x-a)(b-x)}, \qquad a \le x \le b,

where a = (1-\sqrt{y})^2 and b = (1+\sqrt{y})^2.
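A short sketch (Python/numpy, illustrative names) of the Marčenko-Pastur density above, with a quick Monte-Carlo check against the eigenvalues of a simulated S_n:

```python
import numpy as np

def mp_density(x, y):
    """Marchenko-Pastur density with index y = p/n in (0, 1]."""
    a, b = (1 - np.sqrt(y)) ** 2, (1 + np.sqrt(y)) ** 2
    x = np.asarray(x, dtype=float)
    f = np.zeros_like(x)
    inside = (x > a) & (x < b)
    f[inside] = np.sqrt((x[inside] - a) * (b - x[inside])) / (2 * np.pi * y * x[inside])
    return f

# quick check: the histogram of the eigenvalues of S_n should follow mp_density(., p/n)
p, n = 200, 400
X = np.random.randn(p, n)
eigvals = np.linalg.eigvalsh(X @ X.T / n)
```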

SLIDE 17

The Marčenko-Pastur distribution

f(x) = \frac{1}{2\pi y x}\,\sqrt{(x-a)(b-x)}, \qquad (1-\sqrt{y})^2 = a \le x \le b = (1+\sqrt{y})^2.

  y ∼ p/n    a      b
  1/8        0.42   1.83
  1/4        0.25   2.25
  1/2        0.09   2.91

[Figure: densities of the Marčenko-Pastur law for y = 1/8, 1/4 and 1/2.]

SLIDE 18

An explanation of the power deficiency of Hotelling's T²

◮ When p increases, even in the Gaussian case, S_n is far from its population counterpart \Sigma;
◮ when y = p/n ∼ 1, the left edge a ∼ 0: the small eigenvalues of S_n yield an instability of the T² statistic

T^2 = \frac{n_1 n_2}{n}\,(\bar x - \bar y)' S_n^{-1} (\bar x - \bar y)\,.

SLIDE 19

Bai and Silverstein's CLT for linear spectral statistics of S_n

Set:

◮ the empirical spectral distribution (ESD)

F_n = \frac{1}{p}\sum_{j=1}^{p} \delta_{\lambda_j},

where the \lambda_j's are the p eigenvalues of S_n;
◮ y_n = p/n;
◮ [a, b] \subset U, with U open \subset \mathbb C;
◮ for any g analytic on U,

G_n(g) = p\,\big[F_n(g) - \mu_{y_n}(g)\big],

where \mu_\alpha is the Marčenko-Pastur distribution of index \alpha \in (0, 1).

SLIDE 20

A CLT for linear spectral statistics

Bai and Silverstein, '04

Theorem. Assume that

◮ g_1, \dots, g_k are k analytic functions on U;
◮ the matrix entries x_{ij} are i.i.d. real-valued random variables such that E x_{ij} = 0, E x_{ij}^2 = 1, E x_{ij}^4 = 3;
◮ as n, p → ∞, y_n = p/n → y ∈ (0, 1).

Then (G_n(g_1), \dots, G_n(g_k)) \Rightarrow N_k(m, V), with a given mean vector m = m(g_1, \dots, g_k) and asymptotic covariance matrix V = V(g_1, \dots, g_k).

Other versions exist: Lytova & Pastur '09; Bai & Wang '09.

SLIDE 22

Random Fisher matrices

◮ Two independent samples:

x_1, \dots, x_{n_1} \sim (0, I_p), \qquad y_1, \dots, y_{n_2} \sim (0, I_p),

with i.i.d. coordinates of mean 0 and variance 1.

◮ Associated sample covariance matrices:

S_1 = \frac{1}{n_1}\sum_{i=1}^{n_1} x_i x_i^{*}, \qquad S_2 = \frac{1}{n_2}\sum_{j=1}^{n_2} y_j y_j^{*}.

◮ Fisher matrix:

V_n = S_1 S_2^{-1}, \qquad \text{where } n_2 > p \text{ (so that } S_2 \text{ is invertible)}.

SLIDE 23

Random Fisher matrices

◮ Assume

y_{n_1} = \frac{p}{n_1} \to y_1 \in (0, 1), \qquad y_{n_2} = \frac{p}{n_2} \to y_2 \in (0, 1).

◮ Under mild moment conditions, the ESD F^{V_n} of V_n has a LSD F_{y_1, y_2} with density

\ell(x) = \begin{cases} \dfrac{(1-y_2)\,\sqrt{(b-x)(x-a)}}{2\pi x\,(y_1 + y_2 x)}, & a \le x \le b,\\ 0, & \text{otherwise}, \end{cases}

where a = (1-y_2)^{-2}\big(1 - \sqrt{y_1 + y_2 - y_1 y_2}\big)^2 and b = (1-y_2)^{-2}\big(1 + \sqrt{y_1 + y_2 - y_1 y_2}\big)^2.
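A sketch of this limiting density in Python/numpy (names are mine), with a small simulated Fisher matrix for comparison:

```python
import numpy as np

def fisher_lsd_density(x, y1, y2):
    """Limiting spectral density of the Fisher matrix V_n = S1 S2^{-1},
    with indexes y1 = p/n1 and y2 = p/n2 in (0, 1)."""
    h = np.sqrt(y1 + y2 - y1 * y2)
    a = (1 - h) ** 2 / (1 - y2) ** 2
    b = (1 + h) ** 2 / (1 - y2) ** 2
    x = np.asarray(x, dtype=float)
    f = np.zeros_like(x)
    inside = (x > a) & (x < b)
    xi = x[inside]
    f[inside] = (1 - y2) * np.sqrt((b - xi) * (xi - a)) / (2 * np.pi * xi * (y1 + y2 * xi))
    return f

# Monte-Carlo check: eigenvalues of a simulated Fisher matrix
p, n1, n2 = 100, 400, 500
X1, X2 = np.random.randn(p, n1), np.random.randn(p, n2)
S1, S2 = X1 @ X1.T / n1, X2 @ X2.T / n2
eigvals = np.linalg.eigvals(S1 @ np.linalg.inv(S2)).real
```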

SLIDE 24

CLT for LSS of random Fisher matrices

◮ Let

\Big[\,I_{(0,1)}(y_1)\,\frac{(1-\sqrt{y_1})^2}{(1+\sqrt{y_2})^2}\,,\ \frac{(1+\sqrt{y_1})^2}{(1-\sqrt{y_2})^2}\,\Big] \subset \widetilde U, \qquad \widetilde U \text{ open} \subset \mathbb C;

◮ for an analytic function f on \widetilde U, define

\widetilde G_n(f) = p \int_{-\infty}^{+\infty} f(x)\,\big[F^{V_n}_{n} - F_{y_{n_1}, y_{n_2}}\big](dx),

where F_{y_{n_1}, y_{n_2}} is the LSD with indexes y_{n_k}, k = 1, 2.

SLIDE 25

CLT for LSS of random Fisher matrices

Zheng, '08

Theorem. Assume E x_{11}^4 < \infty, E y_{11}^4 < \infty and let

\beta_x = E|x_{11}|^4 - 3, \qquad \beta_y = E|y_{11}|^4 - 3.

Then, for any analytic functions f_1, \dots, f_k defined on \widetilde U,

\big[\widetilde G_n(f_1), \dots, \widetilde G_n(f_k)\big] \Longrightarrow N_k(m, \upsilon),

with suitable asymptotic mean and covariance functions m and \upsilon.

SLIDE 26

CLT for LSS of random Fisher matrices

Zheng, '08

Limiting mean function m:

m(f_j) = \lim_{r \to 1^+}\,\big[(1) + (2) + (3)\big], \quad \text{where}

(1) = \frac{1}{4\pi i}\oint_{|\zeta|=1} f_j(z(\zeta))\Big[\frac{1}{\zeta - \frac{1}{r}} + \frac{1}{\zeta + \frac{1}{r}} - \frac{2}{\zeta + \frac{y_2}{h r}}\Big]\,d\zeta,

(2) = \frac{\beta_x\,y_1(1-y_2)^2}{2\pi i\,h^2}\oint_{|\zeta|=1} \frac{f_j(z(\zeta))}{\big(\zeta + \frac{y_2}{h r}\big)^3}\,d\zeta,

(3) = \frac{\beta_y\,y_2(1-y_2)}{2\pi i\,h}\oint_{|\zeta|=1} f_j(z(\zeta))\,\frac{\zeta + \frac{1}{h r}}{\big(\zeta + \frac{y_2}{h r}\big)^3}\,d\zeta,

where z(\zeta) = (1-y_2)^{-2}\big[1 + h^2 + 2h\,\Re(\zeta)\big] and h = \sqrt{y_1 + y_2 - y_1 y_2}.

SLIDE 27

CLT for LSS of random Fisher matrices

Zheng, '08

Limiting covariance function \upsilon:

\upsilon(f_j, f_\ell) = \lim_{1 < r_1 < r_2 \to 1^+}\,\big[(4) + (5)\big], \quad \text{where}

(4) = -\frac{1}{2\pi^2}\oint_{|\zeta_2|=1}\oint_{|\zeta_1|=1} \frac{f_j(z(r_1\zeta_1))\,f_\ell(z(r_2\zeta_2))\,r_1 r_2}{(r_2\zeta_2 - r_1\zeta_1)^2}\,d\zeta_1\,d\zeta_2,

(5) = -\frac{(\beta_x y_1 + \beta_y y_2)(1-y_2)^2}{4\pi^2 h^2}\oint_{|\zeta_1|=1} \frac{f_j(z(\zeta_1))}{\big(\zeta_1 + \frac{y_2}{h r_1}\big)^2}\,d\zeta_1 \oint_{|\zeta_2|=1} \frac{f_\ell(z(\zeta_2))}{\big(\zeta_2 + \frac{y_2}{h r_2}\big)^2}\,d\zeta_2,

for j, \ell \in \{1, \dots, k\}.

SLIDE 29

One-sample test on covariance matrices

◮ A sample x_1, \dots, x_n \sim N_p(\mu, \Sigma);
◮ we want to test H0 : \Sigma = I_p;
◮ in the high-dimensional case, several previous works exist: Ledoit & Wolf '02; Schott '07; Srivastava '05, . . .
◮ we focus on the LR statistic:

T_n = n\,\big[\mathrm{tr}\,S_n - \log|S_n| - p\big], \qquad S_n = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar x)(x_i - \bar x)'.

Classical LRT:

◮ The data dimension p is fixed; when n → ∞, T_n \Rightarrow \chi^2_{p(p+1)/2}.
◮ We will see that it rapidly becomes deficient when p is not "small".

SLIDE 30

RMT-corrected LRT:

Bai, Jiang, Yao and Zheng '09

Theorem. Assume p/n → y ∈ (0, 1) and let g(x) = x − log x − 1. Then, under H0 and as n → ∞,

\frac{T_n}{n} - p \cdot F^{y_n}(g) \Longrightarrow N\big(m(g), \upsilon(g)\big),

where F^{y_n} is the Marčenko-Pastur law of index y_n, and

m(g) = -\frac{\log(1-y)}{2}, \qquad \upsilon(g) = -2\,\log(1-y) - 2y.
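A minimal sketch of the corrected test (Python/numpy/scipy; the interface is mine). The slide does not give the value of F^{y_n}(g); the closed form used below, 1 − (y_n − 1)/y_n · log(1 − y_n), is my own evaluation of the Marčenko-Pastur integral of g and should be checked against the paper.

```python
import numpy as np
from scipy import stats

def corrected_lrt_identity(X, alpha=0.05):
    """RMT-corrected one-sample LRT of H0: Sigma = I_p (sketch). X: (n, p), p < n."""
    n, p = X.shape
    yn = p / n
    Xc = X - X.mean(axis=0)
    Sn = Xc.T @ Xc / n
    Tn = n * (np.trace(Sn) - np.linalg.slogdet(Sn)[1] - p)
    mp_g = 1 - (yn - 1) / yn * np.log(1 - yn)   # integral of g(x) = x - log x - 1 under MP(y_n)
    m = -np.log(1 - yn) / 2
    v = -2 * np.log(1 - yn) - 2 * yn
    Z = (Tn / n - p * mp_g - m) / np.sqrt(v)
    return Z, Z > stats.norm.ppf(1 - alpha)     # reject H0 for large Z (one-sided, illustrative)
```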

SLIDE 31

Comparison of LRT and CLRT by simulation

◮ Nominal test level α = 0.05;
◮ for each (p, n), 10,000 independent replications with real Gaussian variables;
◮ powers are estimated under the alternative H1: \Sigma = \mathrm{diag}(1, 0.05, 0.05, 0.05, \dots, 0.05).

                 CLRT                                    LRT
  (p, n)         Size     Diff. with 5%   Power          Size     Power
  (5, 500)       0.0803   0.0303          0.6013         0.0521   0.5233
  (10, 500)      0.0690   0.0190          0.9517         0.0555   0.9417
  (50, 500)      0.0594   0.0094          1              0.2252   1
  (100, 500)     0.0537   0.0037          1              0.9757   1
  (300, 500)     0.0515   0.0015          1              1        1

SLIDE 32

On a plot

[Figure: empirical type-I error of the CLRT and the LRT as the dimension p grows from 50 to 300, with n = 500.]

SLIDE 34

Two-sample test on covariance matrices

◮ Two samples:

x_1, \dots, x_{n_1} \sim N_p(\mu_1, \Sigma_1), \qquad y_1, \dots, y_{n_2} \sim N_p(\mu_2, \Sigma_2);

◮ we want to test H0 : \Sigma_1 = \Sigma_2.
◮ The associated sample covariance matrices are

S_1 = \frac{1}{n_1}\sum_{i=1}^{n_1}(x_i - \bar x)(x_i - \bar x)', \qquad S_2 = \frac{1}{n_2}\sum_{i=1}^{n_2}(y_i - \bar y)(y_i - \bar y)'.

◮ Consider the LR statistic

L_1 = \frac{\big|S_1 S_2^{-1}\big|^{n_1/2}}{\big|c_1 S_1 S_2^{-1} + c_2 I_p\big|^{n/2}},

where n = n_1 + n_2 and c_k = n_k/n, k = 1, 2.

SLIDE 35

Two-sample test on covariance matrices

Classical LRT:

◮ The data dimension p is fixed; when n_1, n_2 → ∞ and under H0,

T_n = -2\,\log L_1 \Rightarrow \chi^2_{p(p+1)/2}.

◮ We will see that it rapidly becomes deficient when p is not "small".

SLIDE 36

RMT-corrected LRT:

Bai, Jiang, Yao and Zheng '09

Theorem. Assume that the conditions of the CLT for LSS of Fisher matrices hold, and let

f(x) = \log(y_1 + y_2 x) - \frac{y_2}{y_1 + y_2}\,\log x - \log(y_1 + y_2).

Then, under H0 and as n_1 \wedge n_2 → ∞,

\frac{-2\,\log L_1}{n} - p \cdot F_{y_{n_1}, y_{n_2}}(f) \Longrightarrow N\big(m(f), \upsilon(f)\big),

with

m(f) = \frac12\Big[\log\Big(\frac{y_1 + y_2 - y_1 y_2}{y_1 + y_2}\Big) - \frac{y_1}{y_1 + y_2}\,\log(1 - y_2) - \frac{y_2}{y_1 + y_2}\,\log(1 - y_1)\Big],

\upsilon(f) = -\frac{2 y_2^2}{(y_1 + y_2)^2}\,\log(1 - y_1) - \frac{2 y_1^2}{(y_1 + y_2)^2}\,\log(1 - y_2) - 2\,\log\frac{y_1 + y_2}{y_1 + y_2 - y_1 y_2}.
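A sketch of the corrected two-sample test (Python/numpy/scipy; names are mine). The centering term F_{y_{n1},y_{n2}}(f) is not written on this slide, so the sketch obtains it by numerically integrating f against the Fisher LSD density of the previous section; the paper gives a closed form.

```python
import numpy as np
from scipy import integrate, stats

def corrected_lrt_equal_cov(x, y):
    """RMT-corrected LRT of H0: Sigma1 = Sigma2 (sketch). x: (n1, p), y: (n2, p)."""
    n1, p = x.shape
    n2 = y.shape[0]
    n = n1 + n2
    y1, y2 = p / n1, p / n2
    c1, c2 = n1 / n, n2 / n
    S1 = np.cov(x, rowvar=False, bias=True)          # divisor n1, as on the slide
    S2 = np.cov(y, rowvar=False, bias=True)          # divisor n2
    F = S1 @ np.linalg.inv(S2)                       # Fisher matrix S1 S2^{-1}
    logdet = lambda A: np.linalg.slogdet(A)[1]
    minus2logL1 = -n1 * logdet(F) + n * logdet(c1 * F + c2 * np.eye(p))
    # f and the Fisher LSD density with finite-sample indexes (y1, y2)
    f = lambda t: np.log(y1 + y2 * t) - y2 / (y1 + y2) * np.log(t) - np.log(y1 + y2)
    h = np.sqrt(y1 + y2 - y1 * y2)
    a, b = (1 - h) ** 2 / (1 - y2) ** 2, (1 + h) ** 2 / (1 - y2) ** 2
    dens = lambda t: (1 - y2) * np.sqrt((b - t) * (t - a)) / (2 * np.pi * t * (y1 + y2 * t))
    centering, _ = integrate.quad(lambda t: f(t) * dens(t), a, b)
    m = 0.5 * (np.log((y1 + y2 - y1 * y2) / (y1 + y2))
               - y1 / (y1 + y2) * np.log(1 - y2) - y2 / (y1 + y2) * np.log(1 - y1))
    v = (-2 * y2**2 / (y1 + y2)**2 * np.log(1 - y1)
         - 2 * y1**2 / (y1 + y2)**2 * np.log(1 - y2)
         - 2 * np.log((y1 + y2) / (y1 + y2 - y1 * y2)))
    Z = (minus2logL1 / n - p * centering - m) / np.sqrt(v)
    return Z, 1 - stats.norm.cdf(Z)
```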

SLIDE 37

Comparison of LRT and CLRT by simulation

◮ Nominal test level α = 0.05;
◮ for each (p, n_1, n_2), 10,000 independent replications with real Gaussian variables;
◮ powers are estimated under the alternative H1:

\Sigma_1 \Sigma_2^{-1} = \mathrm{diag}(3, 1, 1, \dots, 1).

SLIDE 38

Comparison of LRT and CLRT by simulation

With (y1, y2) = (0.05, 0.05):

                       CLRT                                     LRT
  (p, n1, n2)          Size     Diff. with 5%   Power           Size     Power
  (5, 100, 100)        0.0770    0.0270         1               0.0582   1
  (10, 200, 200)       0.0680    0.0180         1               0.0684   1
  (20, 400, 400)       0.0593    0.0093         1               0.0872   1
  (40, 800, 800)       0.0526    0.0026         1               0.1339   1
  (80, 1600, 1600)     0.0501    0.0001         1               0.2687   1
  (160, 3200, 3200)    0.0491   −0.0009         1               0.6488   1
  (320, 6400, 6400)    0.0447   −0.0053         0.9671          1        1

SLIDE 39

Comparison of LRT and CLRT by simulation

With (y1, y2) = (0.05, 0.1):

                       CLRT                                     LRT
  (p, n1, n2)          Size     Diff. with 5%   Power           Size     Power
  (5, 100, 50)         0.0781   0.0281          0.9925          0.0640   0.9849
  (10, 200, 100)       0.0617   0.0117          0.9847          0.0752   0.9904
  (20, 400, 200)       0.0573   0.0073          0.9775          0.1104   0.9938
  (40, 800, 400)       0.0561   0.0061          0.9765          0.2115   0.9975
  (80, 1600, 800)      0.0521   0.0021          0.9702          0.4954   0.9998
  (160, 3200, 1600)    0.0520   0.0020          0.9702          0.9433   1
  (320, 6400, 3200)    0.0510   0.0010          1               0.9939   1

SLIDE 40

Comparisons of LRT and CLRT

[Figure: empirical type-I error of the CLRT and the LRT vs. the dimension p (50 to 300); left panel: y1 = 0.05, y2 = 0.05; right panel: y1 = 0.05, y2 = 0.1.]

SLIDE 42

A general linear hypothesis in a multivariate regression

A p-dimensional regression model:

x_i = B z_i + \varepsilon_i, \quad i = 1, \dots, n,

where \varepsilon_i \sim N_p(0, \Sigma), x_i \in \mathbb R^p, z_i \in \mathbb R^q and n \ge p + q.

A general linear hypothesis:

◮ Write a block decomposition B = (B_1, B_2) with q_1 and q_2 columns;
◮ we want to test

H0 : B_1 = B_1^{*},

for a given B_1^{*}.

SLIDE 43

Wilks' Λ

◮ Let \hat\Sigma_0 and \hat\Sigma_1 be the likelihood estimators of \Sigma under H0 and under the alternative, respectively.
◮ The LRT statistic equals

L_0/L_1 = (\Lambda_n)^{n/2}, \qquad \Lambda_n = \frac{|\hat\Sigma_1|}{|\hat\Sigma_0|},

where \Lambda_n is the celebrated Wilks' Λ: Wilks '32, '34; Bartlett '34.

◮ Classical (low-dimensional) approximation of the LRT: for fixed p and q, n → ∞ and under H0,

U_n = -n\,\log \Lambda_n \Rightarrow \chi^2_{p q_1}.

◮ Less biased Bartlett correction:

\widetilde U_n = -k\,\log \Lambda_n, \qquad k = n - q - \tfrac12\,(p - q_1 + 1).
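A sketch of the classical Wilks' Λ test with Bartlett's correction (Python/numpy/scipy). The interface and variable names are mine, and it assumes B1 corresponds to the first q1 columns of the regressor matrix:

```python
import numpy as np
from scipy import stats

def wilks_lambda_test(X, Z, q1, B1_star):
    """Classical Wilks' Lambda test of H0: B1 = B1* in x_i = B z_i + eps_i (sketch).
    X: (n, p) responses, Z: (n, q) regressors, B1_star: (p, q1)."""
    n, p = X.shape
    q = Z.shape[1]
    # unrestricted fit: regress X on the full Z
    B_hat = np.linalg.lstsq(Z, X, rcond=None)[0]
    Sigma1 = (X - Z @ B_hat).T @ (X - Z @ B_hat) / n
    # restricted fit under H0: subtract the fixed part B1* z_{i,1:q1}, regress on Z2
    X0 = X - Z[:, :q1] @ B1_star.T
    B2_hat = np.linalg.lstsq(Z[:, q1:], X0, rcond=None)[0]
    Sigma0 = (X0 - Z[:, q1:] @ B2_hat).T @ (X0 - Z[:, q1:] @ B2_hat) / n
    log_lambda = np.linalg.slogdet(Sigma1)[1] - np.linalg.slogdet(Sigma0)[1]   # log Lambda_n
    k = n - q - 0.5 * (p - q1 + 1)                      # Bartlett's correction factor
    U = -k * log_lambda
    return U, 1 - stats.chi2.cdf(U, p * q1)             # chi^2_{p q1} approximation
```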

SLIDE 44

High-dimensional correction of Wilks' Λ

Bai, Jiang, Yao and Zheng, '10

Theorem. Let p → ∞, q_1 → ∞, n − q → ∞ with

y_{n_1} = \frac{p}{q_1} \to y_1 \in (0, 1), \qquad y_{n_2} = \frac{p}{n - q} \to y_2 \in (0, 1).

Then, under H0,

T_n = \upsilon(f)^{-1/2}\,\big[-\log \Lambda_n - p \cdot F_{y_{n_1}, y_{n_2}}(f) - m(f)\big] \Longrightarrow N(0, 1),

where m(f), \upsilon(f) and F_{y_{n_1}, y_{n_2}}(f) are suitable constants computed from f(x) = \log\big(1 + \tfrac{y_{n_2}}{y_{n_1}}\,x\big).

SLIDE 45

The centering term:

F_{y_{n_1}, y_{n_2}}(f) = \frac{y_{n_2} - 1}{y_{n_2}}\,\log c_n + \frac{y_{n_1} - 1}{y_{n_1}}\,\log(c_n - d_n h_n) + \frac{y_{n_1} + y_{n_2}}{y_{n_1} y_{n_2}}\,\log\Big(\frac{c_n h_n - d_n y_{n_2}}{h_n}\Big),

where

h_n = \sqrt{y_{n_1} + y_{n_2} - y_{n_1} y_{n_2}}, \qquad a_n, b_n = \frac{(1 \mp h_n)^2}{(1 - y_{n_2})^2},

c_n, d_n = \frac12\Big[\sqrt{1 + \tfrac{y_{n_2}}{y_{n_1}}\,b_n} \pm \sqrt{1 + \tfrac{y_{n_2}}{y_{n_1}}\,a_n}\Big], \qquad c_n > d_n.

SLIDE 46

The limiting parameters:

m(f) = \frac12\,\log\frac{(c^2 - d^2)\,h^2}{(c h - y_2 d)^2}, \qquad \upsilon(f) = 2\,\log\Big(\frac{c^2}{c^2 - d^2}\Big),

where

h = \sqrt{y_1 + y_2 - y_1 y_2}, \qquad a_0, b_0 = \frac{(1 \mp h)^2}{(1 - y_2)^2},

c, d = \frac12\Big[\sqrt{1 + \tfrac{y_2}{y_1}\,b_0} \pm \sqrt{1 + \tfrac{y_2}{y_1}\,a_0}\Big], \qquad c > d.

SLIDE 47

A simulation experiment

[Figure: empirical power of LRT, CLRT, ST1 and ST2 as a function of the non-centrality parameter c0; left panel: p = 10, n = 100, q = 50, q1 = 30; right panel: p = 20, n = 100, q = 60, q1 = 50.]

◮ Gaussian entries;
◮ non-centrality parameter c0 ∼ d(H, H0).

SLIDE 49

Some conclusions:

◮ High-dimensional effects should be taken into account;
◮ RMT for sample covariance matrices is a powerful tool for correcting classical multivariate procedures;
◮ each time some Σ has to be estimated, one should be careful with the "natural" estimator S_n: for high-dimensional data, S_n ≠ Σ.
◮ Yet RMT is not sufficiently developed for statistics:
  1. dependent observations: time series;
  2. not identically distributed variables.

SLIDE 50

Some references

◮ Bai, Z. D. and Saranadasa, H. (1996). Effect of high dimension: by an example of a two sample problem. Statistica Sinica 6, 311–329.
◮ Bai, Z. D. and Silverstein, J. W. (2004). CLT for linear spectral statistics of large-dimensional sample covariance matrices. Ann. Probab. 32, 553–605.
◮ Bai, Z. D., Jiang, D., Yao, J. and Zheng, S. (2009). Corrections to LRT on large-dimensional covariance matrix by RMT. Ann. Statist. 37, 3822–3840.
◮ Bai, Z. D., Jiang, D., Yao, J. and Zheng, S. (2010). Large regression analysis. Submitted.
◮ Dempster, A. P. (1958). A high dimensional two sample significance test. Ann. Math. Statist. 29, 995–1010.
◮ Zheng, S. (2008). Central limit theorem for linear spectral statistics of large dimensional F matrices. Preprint, Northeast Normal University.
