Insights and algorithms for the multivariate square-root lasso
Aaron J. Molstad
Department of Statistics and Genetics Institute, University of Florida
June 12th, 2020, Statistical Learning Seminar
Outline of the talk
1. Multivariate response linear regression
2. Considerations in high-dimensional settings
3. The multivariate square-root lasso
   ◮ Motivation/interpretation
   ◮ Theoretical tuning
   ◮ Computation
   ◮ Simulation studies
   ◮ Genomic data example
Multivariate response linear regression model
The multivariate response linear regression model assumes the measured response for the i-th subject, $y_i \in \mathbb{R}^q$, is a realization of the random vector
$$Y_i = \beta' x_i + \epsilon_i, \quad (i = 1, \dots, n),$$
where
◮ $x_i \in \mathbb{R}^p$ is the p-variate predictor for the i-th subject,
◮ $\beta \in \mathbb{R}^{p \times q}$ is the unknown regression coefficient matrix,
◮ $\epsilon_i \in \mathbb{R}^q$ are iid random vectors with mean zero and covariance $\Sigma \equiv \Omega^{-1} \in \mathbb{S}^q_+$.
Let the observed data be organized into $Y = (y_1, \dots, y_n)' \in \mathbb{R}^{n \times q}$ and $X = (x_1, \dots, x_n)' \in \mathbb{R}^{n \times p}$.
Multivariate response linear regression model
The most natural estimator when n > p is the least-squares estimator (i.e., the minimizer of the squared Frobenius norm):
$$\hat\beta_{OLS} = \arg\min_{\beta \in \mathbb{R}^{p \times q}} \|Y - X\beta\|_F^2,$$
where $\|A\|_F^2 = \mathrm{tr}(A'A) = \sum_{i,j} A_{i,j}^2$.
Setting the gradient to zero,
$$X'X\hat\beta_{OLS} - X'Y = 0 \implies \hat\beta_{OLS} = (X'X)^{-1}X'Y,$$
which is the same estimator we would get if we performed q separate least-squares regressions.
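As a quick numerical illustration of this equivalence (a minimal sketch; the dimensions and variable names are illustrative, not from the talk), the stacked multivariate least-squares solution matches q column-by-column regressions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 100, 5, 3
X = rng.standard_normal((n, p))
beta = rng.standard_normal((p, q))
Y = X @ beta + rng.standard_normal((n, q))

# Multivariate least squares: solve (X'X) B = X'Y for all q responses at once.
B_multi = np.linalg.solve(X.T @ X, X.T @ Y)

# q separate univariate least-squares fits, one per response column.
B_sep = np.column_stack(
    [np.linalg.lstsq(X, Y[:, k], rcond=None)[0] for k in range(q)]
)

print(np.allclose(B_multi, B_sep))  # True: the two estimators coincide
```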
Multivariate response linear regression model
If we assume the errors are multivariate normal, then the maximum likelihood estimator is
$$\arg\min_{\beta \in \mathbb{R}^{p \times q},\, \Omega \in \mathbb{S}^q_+} \left\{ \mathrm{tr}\left[ n^{-1}(Y - X\beta)\Omega(Y - X\beta)' \right] - \log\det(\Omega) \right\}.$$
◮ One might expect equivalence to least squares only when $\Omega \propto I_q$ is known and fixed. However, the first-order optimality conditions for β are $X'X\hat\beta_{MLE}\Omega - X'Y\Omega = 0$, which imply $\hat\beta_{MLE} = (X'X)^{-1}X'Y = \hat\beta_{OLS}$ regardless of Ω. Intuitive...? What about the errors?
When $n < p$, $\hat\beta_{OLS}$ is non-unique, so we may want to apply some type of shrinkage/regularization, or impose some type of parsimonious parametric restriction.
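As a small numerical sketch of the point above (illustrative names and dimensions only): for any fixed positive definite Ω, the gradient of the weighted criterion $\mathrm{tr}[n^{-1}(Y - X\beta)\Omega(Y - X\beta)']$ vanishes at the OLS solution, so weighting by Ω does not change the unpenalized estimator of β.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 80, 4, 3
X = rng.standard_normal((n, p))
Y = X @ rng.standard_normal((p, q)) + rng.standard_normal((n, q))

# An arbitrary fixed positive definite "precision" matrix Omega.
A = rng.standard_normal((q, q))
Omega = A @ A.T + q * np.eye(q)

def weighted_loss(B):
    R = Y - X @ B
    return np.trace(R @ Omega @ R.T) / n

B_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Gradient of the weighted criterion: 2 n^{-1} (X'X B Omega - X'Y Omega).
grad_at_ols = 2 * (X.T @ X @ B_ols @ Omega - X.T @ Y @ Omega) / n
print(np.max(np.abs(grad_at_ols)))  # ~ 0: OLS already satisfies the optimality condition

# Any perturbation of the OLS solution increases the weighted criterion.
perturbed = B_ols + 0.01 * rng.standard_normal((p, q))
print(weighted_loss(B_ols) <= weighted_loss(perturbed))  # True
```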
Estimating β in high dimensions

When p and q are large, one way to estimate β is to minimize some loss plus a penalty:
$$\arg\min_{\beta \in \mathbb{R}^{p \times q},\, \theta \in \Theta} \left\{ \ell(\beta, \theta) + \lambda P(\beta) \right\},$$
where the choice of penalty depends on assumptions about β:
◮ Elementwise sparse: $P(\beta) = \sum_{j,k} |\beta_{j,k}|$ (Tibshirani, 1996)
◮ Row-wise sparse: $P(\beta) = \sum_{j=1}^{p} \|\beta_{j,\cdot}\|_2$ (Yuan and Lin, 2007; Obozinski et al., 2011)
◮ "Bi-level" sparse: $P(\beta) = \alpha \sum_{j,k} |\beta_{j,k}| + (1 - \alpha) \sum_{j=1}^{p} \|\beta_{j,\cdot}\|_2$ (Peng et al., 2012; Simon et al., 2013)
◮ Low-rank: $P(\beta) = \|\beta\|_* = \sum_{j=1}^{\min(p,q)} \varphi_j(\beta)$ or $P(\beta) = \mathrm{Rank}(\beta)$ (Yuan et al., 2007; Bunea et al., 2011; Chen et al., 2013)
High-dimensional maximum likelihood
The penalized normal maximum likelihood estimator with Ω known is
$$\hat\beta_P \in \arg\min_{\beta \in \mathbb{R}^{p \times q}} \left\{ \mathrm{tr}\left[ n^{-1}(Y - X\beta)\Omega(Y - X\beta)' \right] + \lambda P(\beta) \right\}.$$
Then, the first-order optimality conditions are
$$X'X\hat\beta_P\Omega - X'Y\Omega + \lambda\,\partial P(\hat\beta_P) \ni 0,$$
where $\partial P(\hat\beta_P)$ is the subgradient of P evaluated at $\hat\beta_P$.
⟹ $\hat\beta_P$, the shrinkage estimator, depends on Ω.
⟹ Equivalent to penalized least squares if $\Omega \propto I_q$.
Can we ignore the error covariance in these high-dimensional settings? No! But of course, Ω is unknown in practice.
Penalized normal maximum likelihood
When Ω is unknown and the $\epsilon_i$ are normally distributed, the (doubly) penalized maximum likelihood estimator is
$$\arg\min_{\beta \in \mathbb{R}^{p \times q},\, \Omega \in \mathbb{S}^q_+} \left\{ F(\beta, \Omega) + \lambda_\beta P_\beta(\beta) + \lambda_\Omega P_\Omega(\Omega) \right\},$$
where $F(\beta, \Omega) = \mathrm{tr}\left[ n^{-1}(Y - X\beta)\Omega(Y - X\beta)' \right] - \log\det(\Omega)$.
◮ Rothman et al. (2010) and Yin and Li (2011) use ℓ1-penalties on both β and Ω.
◮ Chen and Huang (2016) impose low-rank and sparsity-inducing penalties on β.
◮ . . .
If estimating β is the primary goal, we may want to avoid these methods since they can require
◮ estimating $O(q^2)$ nuisance parameters,
◮ solving a non-convex optimization problem,
◮ extremely long computing times.
Can we estimate β in high-dimensional settings and account for the error dependence
1. without sacrificing convexity,
2. without requiring an explicit estimate of the error precision matrix?
Multivariate square-root lasso
The multivariate square-root lasso (MSRL) estimator is
$$\arg\min_{\beta \in \mathbb{R}^{p \times q}} \left\{ \frac{1}{\sqrt{n}}\|Y - X\beta\|_* + \lambda P(\beta) \right\}, \qquad (1)$$
where
◮ $\|A\|_* = \mathrm{tr}\left[(A'A)^{1/2}\right] = \sum_j \varphi_j(A)$ is the nuclear norm (trace norm), which sums the singular values of its matrix argument,
◮ P is a convex penalty function, so that (1) is convex.
Past work and our contributions
The ℓ1-penalized version of the MSRL was proposed by van de Geer (2016) and van de Geer and Stucky (2016) as a means for constructing confidence intervals for univariate response regression coefficients.
Our contributions:
◮ establish statistical theory and, as a consequence, a new direct tuning procedure,
◮ propose two specialized algorithms with convergence guarantees to compute the MSRL,
◮ demonstrate the usefulness of the MSRL for multivariate response linear regression with dependent errors.
Remainder of the talk
1. Review of multivariate response linear regression
2. Considerations in high dimensions
3. The multivariate square-root lasso
   ◮ Motivation/interpretation
   ◮ Theoretical tuning
   ◮ Computation
   ◮ Simulation studies
   ◮ Genomic data example
Why the nuclear norm?
Motivation
$$\|A\|_* = \mathrm{tr}\left[(A'A)^{1/2}\right] = \mathrm{tr}\left[A(A'A)^{-1/2}A'\right].$$
Hence, we can write
$$\frac{1}{\sqrt{n}}\|Y - X\beta\|_* = \mathrm{tr}\left[ n^{-1}(Y - X\beta)\,\Omega_\beta^{1/2}\,(Y - X\beta)' \right], \quad \text{where } \Omega_\beta = \left[ n^{-1}(Y - X\beta)'(Y - X\beta) \right]^{-1}.$$
The nuclear norm of the residuals is a weighted residual sum of squares, where the weight is
◮ an estimate of the square root of the error precision matrix,
◮ a function of the optimization variable β only.
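A quick numerical check of this identity (an illustrative sketch; it assumes the residual matrix has full column rank so that Ω_β exists, and the data here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 60, 4, 3
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, q))
beta = rng.standard_normal((p, q))
R = Y - X @ beta                       # residual matrix at an arbitrary beta

# Left-hand side: nuclear norm of the residuals, scaled by 1/sqrt(n).
lhs = np.sum(np.linalg.svd(R, compute_uv=False)) / np.sqrt(n)

# Right-hand side: weighted residual sum of squares with weight Omega_beta^{1/2},
# where Omega_beta = (n^{-1} R'R)^{-1}; the square root is via eigendecomposition.
S = R.T @ R / n
w, V = np.linalg.eigh(S)
Omega_beta_half = V @ np.diag(w ** -0.5) @ V.T
rhs = np.trace(R @ Omega_beta_half @ R.T) / n

print(np.isclose(lhs, rhs))  # True
```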
Multivariate square-root lasso
Define $(\bar\beta, \bar\Sigma)$ as
$$\arg\min_{\beta \in \mathbb{R}^{p \times q},\, \Sigma \succ 0} \left\{ \frac{1}{2n}\,\mathrm{tr}\left[ (Y - X\beta)\Sigma^{-1/2}(Y - X\beta)' \right] + \frac{\mathrm{tr}(\Sigma^{1/2})}{2} + \lambda P(\beta) \right\}.$$
Let $\hat\beta$ be the multivariate square-root lasso estimator with the same tuning parameter. If the residual matrix $Y - X\hat\beta$ has q nonzero singular values, then $(\bar\beta, \bar\Sigma)$ satisfies
$$\bar\Sigma = n^{-1}(Y - X\hat\beta)'(Y - X\hat\beta) \quad \text{and} \quad \bar\beta = \hat\beta.$$
The MSRL solves the joint optimization problem!
Relation to univariate square-root lasso
The MSRL generalizes the univariate square-root lasso (Owen, 2007; Belloni et al., 2011): when q = 1, so that $y \in \mathbb{R}^n$, the MSRL is equivalent to
$$\arg\min_{\eta \in \mathbb{R}^p} \left\{ \frac{1}{\sqrt{n}}\|y - X\eta\|_2 + \gamma\|\eta\|_1 \right\}. \qquad (2)$$
◮ Belloni et al. (2011) proved that the value of γ leading to near-oracle error bounds does not depend on any unknown quantities, unlike ℓ1-penalized least squares.
◮ Many extensions and improvements of (2) exist, e.g., Bunea, Lederer, and She (2013); Liu, Wang, and Zhao (2015); Ndiaye et al. (2016); Tian et al. (2019).
Assumptions
Definitions:
◮ Let s denote the number of nonzero entries in β.
◮ $P(\beta) = \sum_{j,k} |\beta_{j,k}|$.
◮ Normalize the predictors so that $\|X_j\|_2 = 1$ for $j = 1, \dots, p$.
◮ Define $\|A\|_{\max} = \max_{j,k} |A_{j,k}|$.
Assumptions:
A1. E, the n × q error matrix (i.e., Y = Xβ + E), has q nonzero singular values almost surely.
A2. The error matrix E is left-spherical, i.e., for any n × n orthogonal matrix O, OE has the same matrix-variate distribution as E.
◮ A1 and A2 would hold if n > q and the rows of E were iid, mean-zero multivariate normal random vectors with covariance $\Sigma \in \mathbb{S}^q_+$.
Frobenius norm error bound
Proposition: Suppose A1 is true. Let $U_*D_*V_*' = Y - X\beta$ be the singular value decomposition. If $\lambda \geq \frac{c}{\sqrt{n}}\|X'U_*V_*'\|_{\max}$ for some constant c > 1, then
$$\|\hat\beta - \beta\|_F \leq \frac{\bar c}{\kappa(E, c)}\,\lambda\sqrt{s},$$
where $\bar c = (c+1)/(c-1)$. If A2 is also true, then the distribution of $\|X'U_*V_*'\|_{\max}$ does not depend on Ω.
Empirically, $\kappa(E, c)$ can be bounded below in terms of $k_c$, the restricted eigenvalue, $\varphi_1(\Sigma)$, the largest eigenvalue of the error covariance matrix, and some positive constant M.
Under A1 and A2, $U_*V_*'$ has a uniform distribution on the set of n × q semiorthogonal matrices (e.g., Eaton, 1989).
◮ We can use simulation to approximate the distribution of $\frac{c}{\sqrt{n}}\|X'U_*V_*'\|_{\max}$, and set λ equal to some empirical quantile (e.g., the 95th).
Frobenius norm error bound
Theorem: Suppose A1 and A2 are true. If n is sufficiently large relative to q, $\alpha \in (0, 1)$, and $\lambda = 3\left(2n^{-1}\log(4pq/\alpha)\right)^{1/2}$, then
$$\|\hat\beta - \beta\|_F \leq \frac{9}{\kappa(E, 2)}\sqrt{\frac{2s\log(4pq/\alpha)}{n}}$$
with probability at least 1 − α.
Unlike the penalized squared Frobenius norm estimator, the tuning parameter λ does not depend on any unknown quantities!
What’s left?
1. Review of multivariate response linear regression
2. Considerations in high dimensions
3. The multivariate square-root lasso
   ◮ Motivation/interpretation
   ◮ Theoretical tuning
   ◮ Computation
   ◮ Simulation studies
   ◮ Genomic data example
Constrained problem
Computing the MSRL is difficult because, although convex, its objective is the sum of two non-differentiable functions:
$$\arg\min_{\beta \in \mathbb{R}^{p \times q}} \left\{ \|Y - X\beta\|_* + \tilde\lambda P(\beta) \right\}.$$
We split the two functions by rewriting the MSRL as a constrained optimization problem:
$$\underset{\beta \in \mathbb{R}^{p \times q},\, \Phi \in \mathbb{R}^{n \times q}}{\mathrm{minimize}} \;\left\{ \|\Phi\|_* + \tilde\lambda P(\beta) \right\} \quad \text{subject to} \quad \Phi = Y - X\beta,$$
and solve the constrained problem using dual ascent via the alternating direction method of multipliers (ADMM).
Prox-linear ADMM
Let $\Gamma \in \mathbb{R}^{n \times q}$ be a Lagrangian dual variable and $\rho > 0$ a step-size parameter. The augmented Lagrangian for the constrained problem is
$$G_\rho(\beta, \Phi, \Gamma) = \|\Phi\|_* + \tilde\lambda P(\beta) + \mathrm{tr}\left[\Gamma'(Y - X\beta - \Phi)\right] + \frac{\rho}{2}\|Y - X\beta - \Phi\|_F^2.$$
Following Boyd et al. (2011), the updating equations of ADMM are
$$\Phi_{k+1} = \arg\min_{\Phi \in \mathbb{R}^{n \times q}} G_\rho(\beta_k, \Phi, \Gamma_k), \qquad \beta_{k+1} = \arg\min_{\beta \in \mathbb{R}^{p \times q}} G_\rho(\beta, \Phi_{k+1}, \Gamma_k), \qquad \Gamma_{k+1} = \Gamma_k + \rho(Y - X\beta_{k+1} - \Phi_{k+1}).$$
Prox-linear ADMM
We approximate $G_\rho(\beta, \Phi_{k+1}, \Gamma_k)$ at $\beta_k$ with
$$M_{\rho,\eta}(\beta, \Phi_{k+1}, \Gamma_k; \beta_k) \equiv G_\rho(\beta, \Phi_{k+1}, \Gamma_k) + \frac{\rho}{2}\,\mathrm{tr}\left\{(\beta - \beta_k)'(\eta I_p - X'X)(\beta - \beta_k)\right\},$$
with $\eta \in \mathbb{R}$ chosen so that $\eta I_p - X'X \succeq 0$. Then, we update
$$\beta_{k+1} = \arg\min_{\beta \in \mathbb{R}^{p \times q}} M_{\rho,\eta}(\beta, \Phi_{k+1}, \Gamma_k; \beta_k) = \arg\min_{\beta \in \mathbb{R}^{p \times q}} \left\{ \frac{1}{2}\|\beta - Z_k\|_F^2 + \tilde\lambda P(\beta) \right\},$$
where $Z_k = \beta_k + \eta^{-1}X'\left(Y + \rho^{-1}\Gamma_k - \Phi_{k+1} - X\beta_k\right)$, which can be solved in closed form for all aforementioned penalties.
[Figure: illustration of the prox-linear step, showing $G_\rho(\beta, \Phi_{k+1}, \Gamma_k)$, its majorizer $M_{\rho,\eta}(\beta, \Phi_{k+1}, \Gamma_k; \beta_k)$ at the current iterate $\beta_k$, and the next iterate $\beta_{k+1}$.]
Prox-linear ADMM
By the majorize-minimize principle,
$$G_\rho(\beta_{k+1}, \Phi_{k+1}, \Gamma_k) \leq M_{\rho,\eta}(\beta_{k+1}, \Phi_{k+1}, \Gamma_k; \beta_k) \leq M_{\rho,\eta}(\beta_k, \Phi_{k+1}, \Gamma_k; \beta_k) = G_\rho(\beta_k, \Phi_{k+1}, \Gamma_k),$$
so we are guaranteed a non-increasing augmented Lagrangian with this simple approximation scheme.
Proposition: Iterates of our proposed ADMM with the MM approximation are guaranteed to converge to their optimal values.
Algorithm summary for ℓ1-penalized version
1. Decompose with the SVD: $UDV' = Y + \rho^{-1}\Gamma_k - X\beta_k$
2. Compute $\Phi_{k+1} \leftarrow U\,\mathrm{Diag}\left\{\max(D - \rho^{-1}, 0)\right\}V'$
3. Compute $Z_k \leftarrow \beta_k + \eta^{-1}X'\left(Y + \rho^{-1}\Gamma_k - \Phi_{k+1} - X\beta_k\right)$
4. For all $(l, m) \in \{1, \dots, p\} \times \{1, \dots, q\}$, compute $[\beta_{k+1}]_{l,m} \leftarrow \max(|[Z_k]_{l,m}| - \tilde\lambda,\, 0)\,\mathrm{sign}([Z_k]_{l,m})$
5. Compute $\Gamma_{k+1} \leftarrow \Gamma_k + \rho(Y - X\beta_{k+1} - \Phi_{k+1})$
6. If not converged, set $k \leftarrow k + 1$ and return to 1.
◮ Everything can be computed in closed form assuming the proximal operator of P has a closed form.
◮ github.com/ajmolstad/MSRL
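Below is a compact, illustrative sketch of these six steps for the ℓ1-penalized MSRL. It is not the reference implementation at github.com/ajmolstad/MSRL; the default ρ, the choice of η, the stopping rule, and the function name are my own, and the soft-threshold level follows the slide's convention for λ̃.

```python
import numpy as np

def msrl_admm_l1(X, Y, lam_tilde, rho=1.0, max_iter=500, tol=1e-6):
    """Prox-linear ADMM sketch for: min_beta ||Y - X beta||_* + lam_tilde * ||beta||_1."""
    p, q = X.shape[1], Y.shape[1]
    eta = np.linalg.norm(X, 2) ** 2 + 1e-8   # ensures eta * I_p - X'X is positive semidefinite
    beta = np.zeros((p, q))
    Phi = Y.copy()
    Gamma = np.zeros_like(Y)
    for _ in range(max_iter):
        # Steps 1-2: SVD of Y + Gamma/rho - X beta, then soft-threshold the singular
        # values by 1/rho (the proximal operator of the nuclear norm) to update Phi.
        U, d, Vt = np.linalg.svd(Y + Gamma / rho - X @ beta, full_matrices=False)
        Phi = (U * np.maximum(d - 1.0 / rho, 0.0)) @ Vt
        # Steps 3-4: prox-linear point Z_k, then entrywise soft-thresholding.
        # The threshold uses lam_tilde directly, following step 4 above (any
        # 1/(rho*eta) rescaling is assumed absorbed into lam_tilde).
        Z = beta + X.T @ (Y + Gamma / rho - Phi - X @ beta) / eta
        beta_new = np.sign(Z) * np.maximum(np.abs(Z) - lam_tilde, 0.0)
        # Step 5: dual variable update.
        Gamma = Gamma + rho * (Y - X @ beta_new - Phi)
        # Step 6: crude stopping rule based on the change in beta.
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta
```

Per iteration, the dominant costs are one thin SVD of an n × q matrix and two matrix products involving X, which is what lets the approach scale to large p.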
Comparison to off-the-shelf solver
The previous way to compute the MSRL was with the generic solver CVX, which was used by van de Geer and Stucky (2016).

                Normal errors            t5 errors
  ξ           ADMM        CVX         ADMM        CVX
 0.30          3.74      405.62        4.57      546.55
 0.50          3.84      500.26        3.94      482.23
 0.70          3.40      512.32        3.69      506.57
 0.90          2.43      436.76        2.94      531.84
 0.95          2.28      460.67        2.28      459.05

Table: Average computing times (in seconds) using prox-linear ADMM versus CVX with n = 200, p = 500, and q = 50, with errors having correlation matrix $[\Sigma^*]_{j,k} = \xi\,1(j \neq k) + 1(j = k)$.
Simulations
For 100 independent replications with p = 500 and q = 50:
◮ Generate $X \sim N_{500}(0, \Sigma^*_X)$ where $[\Sigma^*_X]_{j,k} = 0.5^{|j-k|}$;
◮ Given X = x, generate $y = \beta'x + \epsilon$ where $\epsilon \sim N_{50}(0, \Sigma)$, with $\Sigma = D\tilde\Sigma D$, where D is diagonal with entries equally spaced from 3 to 0.5 and $[\tilde\Sigma]_{j,k} = \xi\,1(j \neq k) + 1(j = k)$;
◮ Each column of β has 5 randomly selected entries set equal to −1 or 1, and zeros elsewhere.
We measure the performance of the various estimators using the mean squared error $\|\hat\beta - \beta\|_F^2 / (pq)$.
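A sketch of this data-generating process (illustrative; the function name, seed handling, and use of numpy are my own choices):

```python
import numpy as np

def simulate(n=200, p=500, q=50, xi=0.7, seed=0):
    rng = np.random.default_rng(seed)
    # Predictor covariance with (j, k) entry 0.5^{|j - k|}.
    idx = np.arange(p)
    Sigma_X = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma_X, size=n)
    # Error covariance Sigma = D Sigma_tilde D with equicorrelation xi.
    D = np.diag(np.linspace(3.0, 0.5, q))
    Sigma_tilde = xi * np.ones((q, q)) + (1 - xi) * np.eye(q)
    Sigma = D @ Sigma_tilde @ D
    # Coefficient matrix: 5 random +/-1 entries per column, zeros elsewhere.
    beta = np.zeros((p, q))
    for k in range(q):
        rows = rng.choice(p, size=5, replace=False)
        beta[rows, k] = rng.choice([-1.0, 1.0], size=5)
    E = rng.multivariate_normal(np.zeros(q), Sigma, size=n)
    return X, X @ beta + E, beta

X, Y, beta_true = simulate()

def mse(B):
    # Mean squared error criterion from the slide: ||B - beta||_F^2 / (p q).
    return np.sum((B - beta_true) ** 2) / beta_true.size
```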
Penalized maximum likelihood estimators
Recall: when Ω is unknown and the $\epsilon_i$ are normally distributed, the (doubly) penalized maximum likelihood estimator is
$$\arg\min_{\beta \in \mathbb{R}^{p \times q},\, \Omega \in \mathbb{S}^q_+} \left\{ F(\beta, \Omega) + \lambda_\beta P_\beta(\beta) + \lambda_\Omega P_\Omega(\Omega) \right\},$$
where $F(\beta, \Omega) = \mathrm{tr}\left[ n^{-1}(Y - X\beta)\Omega(Y - X\beta)' \right] - \log\det(\Omega)$.
◮ The doubly penalized maximum likelihood estimator uses ℓ1-penalties on both β and Ω, but can take hours to compute.
◮ Oracle.MaxLik uses an ℓ1-penalty on β and fixes Ω at the true error precision matrix.
◮ Pen.MaxLik uses a two-step approximation to the doubly penalized maximum likelihood estimator.
Simulations: Validated tuning
Figure: Average log-model error, $\log(\|\hat\beta - \beta^*\|_F^2 / (pq))$, over 100 independent replications with n = 200, p = 500, q = 50, and $[\Sigma^*]_{j,k} = \xi\,1(j \neq k) + 1(j = k)$, comparing Lasso.1, Lasso.q, Calibrated, Oracle.MaxLik, Pen.MaxLik, and MSRL.CV.
Simulations: Validated tuning computing times
Figure: Average solution path computing times (in seconds) over 100 independent replications with n = 200, p = 500, q = 50, and $[\Sigma^*]_{j,k} = \xi\,1(j \neq k) + 1(j = k)$, comparing MSRL and Pen.MaxLik.
Theoretical tuning
Recall that our error bounds held if we selected, for some c > 1,
(i) $\lambda \geq \frac{c}{\sqrt{n}}\|X'U_*V_*'\|_{\max}$,
where $U_*D_*V_*' = E$ is the SVD of the error matrix. Further, using the distribution of $X'U_*V_*'$, we showed that if
(ii) $\lambda = c\left(\frac{2\log(4pq/\alpha)}{n}\right)^{1/2}$,
then (i) holds with probability at least 1 − α.
Theoretical tuning
Based on these results, in each replication we also tried both
$$\lambda = \frac{1.01}{\sqrt{n}}\, Q_{95}\!\left(\|X'O\|_{\max}\right), \qquad \text{(MSRL-q95)}$$
where we approximate the 95th percentile of $\|X'O\|_{\max}$ through simulation, and
$$\lambda = 1.01\sqrt{\frac{2\log(4pq/0.05)}{n}}. \qquad \text{(MSRL-Asymp)}$$
Following Belloni et al. (2011), we use refitting (i.e., the SUR MLE based only on the selected predictors) to mitigate the extra bias.
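An illustrative sketch of both tuning-parameter computations (the simulation size and function names are my own; a uniformly distributed semiorthogonal O is obtained by orthonormalizing a Gaussian matrix, and the statistic ||X'O||_max is unaffected by column-sign conventions):

```python
import numpy as np

def lambda_q95(X, q, n_sim=200, quantile=0.95, seed=0):
    """MSRL-q95: (1.01/sqrt(n)) times an empirical quantile of ||X'O||_max,
    with O uniform on the n x q semiorthogonal matrices."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    stats = np.empty(n_sim)
    for b in range(n_sim):
        # Orthonormalize a Gaussian matrix to draw a uniformly distributed O.
        O, _ = np.linalg.qr(rng.standard_normal((n, q)))
        stats[b] = np.max(np.abs(X.T @ O))
    return 1.01 / np.sqrt(n) * np.quantile(stats, quantile)

def lambda_asymp(n, p, q, alpha=0.05):
    """MSRL-Asymp: closed-form tuning parameter; no simulation needed."""
    return 1.01 * np.sqrt(2.0 * np.log(4.0 * p * q / alpha) / n)
```

Either choice yields a single λ, so no cross-validation grid is required.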
Simulations: Theoretical tuning with all methods refit
Figure: Average log-model error, $\log(\|\hat\beta - \beta^*\|_F^2 / (pq))$, over 100 independent replications with n = 200, p = 500, q = 50, and $[\tilde\Sigma]_{j,k} = \xi\,1(j \neq k) + 1(j = k)$, comparing MSRL-Asymp, MSRL-CV, MSRL-q95, and Pen-MaxLik, with all methods refit.
Simulations: variable selection accuracy
(True positive / false positive)
Method           ξ = 0.3         ξ = 0.5         ξ = 0.7         ξ = 0.9         ξ = 0.95
Pen.MaxLik      85.52 / 5.18    87.76 / 5.27    90.44 / 5.35    94.59 / 5.64    96.12 / 5.89
Oracle.MaxLik   85.60 / 5.96    87.76 / 6.09    90.40 / 6.23    94.42 / 6.34    95.89 / 5.37
MSRL.CV         84.94 / 4.53    87.08 / 4.64    90.01 / 4.76    94.16 / 5.02    95.82 / 5.23
MSRL.Asymp      47.78 / 0.01    52.60 / 0.01    59.35 / 0.01    71.81 / 0.01    77.80 / 0.01
MSRL.Asymp-2    79.61 / 0.88    82.64 / 0.89    86.42 / 0.90    91.80 / 0.91    94.18 / 0.92
MSRL.q95        56.03 / 0.02    60.76 / 0.02    67.45 / 0.03    79.13 / 0.03    84.33 / 0.03
MSRL.q95-2      81.72 / 1.60    84.48 / 1.61    87.71 / 1.62    92.70 / 1.63    94.82 / 1.64
Table: Average true positive and false positive rates over 100 independent replications for identifying nonzero entries of β.
Direct tuning
Direct tuning without refitting often led to slightly larger squared-norm error than did validation-based tuning.
◮ Rescaling both direct tuning parameters by 1/2 led to nearly the same accuracy as the validation-based tuning parameter, a result also observed by Belloni et al. (2011).
◮ Direct tuning does not require cross-validation: it requires computing the MSRL for only a single value of the tuning parameter.
◮ Computing times were less than 4 seconds in all replications (recalling that pq = 25,000).
GBM data example
◮ We used our method to model the linear relationship between microRNA expression and gene expression in patients with glioblastoma multiforme, an aggressive brain cancer, using data collected by The Cancer Genome Atlas (TCGA) project.
◮ Since microRNAs can act as either oncogenes (genes which may cause cancer) or tumor suppressors, it is of scientific interest to quantify how gene expression regulates microRNA expression.
◮ Following Wang (2015), we predict the expression of the m microRNAs with the largest median absolute deviations using the expression of the g genes with the largest median absolute deviations.
GBM data example
Table: Weighted prediction error and nuclear norm prediction error averaged over 100 training/testing splits for the five considered methods in the GBM data analysis.

                  Weighted prediction error          Nuclear norm prediction error
m                    20              40                  20              40
g                500    1000     500    1000        500    1000     500    1000
MSRL.cv         0.6424  0.6103  0.6698  0.6435     0.2128  0.2069  0.3388  0.3317
Lasso.1         0.6518  0.6164  0.6747  0.6442     0.2146  0.2086  0.3403  0.3329
Lasso.q         0.6518  0.6167  0.6764  0.6455     0.2148  0.2088  0.3422  0.3347
MSRL∗           0.6413  0.6073  0.6690  0.6413     0.2127  0.2068  0.3387  0.3319
Pen.MaxLik∗     0.6416  0.6060  0.6659  0.6354     0.2130  0.2069  0.3387  0.3314
Conclusion
◮ In high-dimensional multivariate response linear regression, error dependence should not be ignored.
◮ The multivariate square-root lasso is a convex alternative to doubly penalized normal maximum likelihood estimators.
◮ The MSRL performed as well as or better than the penalized MLE in the settings we considered.
◮ The MSRL is significantly faster to compute since we avoid estimating the error precision matrix.
◮ The directly tuned version can be computed almost instantaneously even for large p, although bias may be an issue.