 
A Fast New Bayesian Approach to High-Dimensional Nonparametric Regression Without MCMC

Ray Bai
Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania
Joint work with Gemma E. Moran (co-first author), Joseph Antonelli (co-first author), Yong Chen, and Mary R. Boland
April 2, 2019

Ray Bai (University of Pennsylvania) Spike-and-Slab Group Lasso April 2, 2019

Overview

1. Nonparametric Regression and Generalized Additive Models
2. The Spike-and-Slab Group Lasso (SSGL) Prior
3. Fast Implementation of the SSGL
4. Generalized Additive Models with Interaction
5. Simulation Studies
6. Case Study: Estimating the Health Effects of Environmental Exposures

Classical Linear Regression

Consider the classical linear regression model,

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I}_n),$$

where:
- $\mathbf{y}$ is an $n$-dimensional response vector,
- $\mathbf{X}_{n \times (p+1)} = [\mathbf{1}, \mathbf{X}_1, \ldots, \mathbf{X}_p]$ is a design matrix with $n$ samples and $p$ covariates (and a column $\mathbf{1} = (1, \ldots, 1)^T$ for the intercept).

We are mainly interested in estimating $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)'$ and performing model selection from the $p$ covariates.

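As a minimal sketch of the model above, the following Python snippet simulates data from $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ and computes the ordinary least squares estimate of $\boldsymbol{\beta}$. The sample size, coefficient values, noise level, and variable names are illustrative assumptions and do not come from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative sizes: n samples, p covariates.
n, p = 200, 3
X = rng.normal(size=(n, p))

# Design matrix [1, X_1, ..., X_p]: a column of ones for the intercept.
X_design = np.column_stack([np.ones(n), X])

# Hypothetical true coefficients (beta_0, beta_1, ..., beta_p).
beta_true = np.array([1.0, 2.0, 0.0, -1.5])
y = X_design @ beta_true + rng.normal(scale=0.5, size=n)

# OLS estimate: minimizes ||y - X beta||^2.
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# The OLS residuals are orthogonal to every column of the design matrix.
residuals = y - X_design @ beta_hat
```

Note that `beta_hat` contains a nonzero estimate for every covariate, including the one whose true coefficient is zero, which is why model selection needs a separate step.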
Classical Linear Regression

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I}_n)$$

PROS:
- Widely used.
- Relatively easy to interpret.

CONS:
- The assumption that the covariates have a linear relationship with the response is very restrictive.
- We typically need to check model diagnostics like residual plots to ensure that the linear model is a good fit. What if it's not?

Example Where the Linear Model Fails

The plot below shows ragweed pollen levels plotted against the day in the current ragweed season. There seems to be a relationship between pollen levels and day in the ragweed season, with a peak around day 25 and then a plateau.

Example Where the Linear Model Fails

Below, we plot the ordinary least squares (OLS) linear model for this data set: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$. As you can see, the linear model fails to capture the true relationship between day and pollen level.

Nonparametric Regression

Below, we fit a smoothing spline between day and pollen level instead, i.e. $\hat{y} = \hat{\beta}_0 + \hat{f}(x)$, where $\hat{f}$ is a nonlinear function of $x$. As we can see, the nonparametric method provides a much better fit.

Nonparametric Regression

Henceforth, we assume that $\mathbf{y} = (y_1, \ldots, y_n)'$ has been centered, i.e. $\sum_{i=1}^n y_i = 0$, so there is no intercept in our model.

We can model the response as a (possibly nonlinear) function of the covariates. Let $\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})'$ denote the vector of the $p$ observed predictor values for observation $i$. We have the following model:

$$y_i = f(\mathbf{x}_i) + \varepsilon_i = f(x_{i1}, \ldots, x_{ip}) + \varepsilon_i, \qquad E(\varepsilon_i) = 0, \quad E(\varepsilon_i^2) < \infty.$$

For nonparametric regression (as opposed to linear regression), we make minimal or no assumptions about the specific functional form of $f$.

Ways to Perform Nonparametric Regression

$$y_i = f(\mathbf{x}_i) + \varepsilon_i, \qquad E(\varepsilon_i) = 0, \quad E(\varepsilon_i^2) < \infty.$$

We can model the function $f(\cdot)$ using a number of methods:
- random forests
- gradient boosting
- neural networks
- kernel smoothing
- generalized additive models (the topic of this talk)

We can also perform model selection from the $p$ covariates using these methods, so there is no real loss in interpretability, and we gain flexibility.

These methods are gaining popularity in machine learning because they often outperform linear regression for both estimation and prediction.

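To make one of the listed methods concrete, here is a minimal sketch of kernel smoothing (a Nadaraya-Watson estimator with a Gaussian kernel) compared against a straight-line fit. The data are synthetic and only loosely mimic a single seasonal peak; they are not the ragweed data from the talk. The bandwidth, noise level, and function names are assumptions chosen for illustration.

```python
import numpy as np

def kernel_smooth(x_train, y_train, x_eval, bandwidth):
    """Nadaraya-Watson estimate: a locally weighted average of y_train."""
    diffs = (x_eval[:, None] - x_train[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs**2)  # Gaussian kernel weights
    return (weights @ y_train) / weights.sum(axis=1)

rng = np.random.default_rng(1)

# Synthetic response with one peak near day 25 (assumed shape, not real data).
day = np.arange(1.0, 91.0)
true_f = 60.0 * np.exp(-0.5 * ((day - 25.0) / 6.0) ** 2) + 5.0
level = true_f + rng.normal(scale=4.0, size=day.size)

# Straight-line fit; np.polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(day, level, deg=1)
linear_fit = intercept + slope * day

# Kernel-smoothed fit evaluated at the observed days.
smooth_fit = kernel_smooth(day, level, day, bandwidth=3.0)

rss_linear = np.sum((level - linear_fit) ** 2)
rss_smooth = np.sum((level - smooth_fit) ** 2)
```

On data like these the residual sum of squares of the smoother is far below that of the line, because the line cannot follow the peak. The bandwidth controls the trade-off between smoothness and fidelity and would normally be tuned, for example by cross-validation.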
Parametric vs. Nonparametric Statistics

Table: Parametric vs. nonparametric statistics.

Parametric:
- Parameter space is finite-dimensional ($p + 2$ unknowns): $p + 1$ coefficients in $\boldsymbol{\beta} \in \mathbb{R}^{p+1}$ and noise variance $\sigma^2$.
- Strong assumptions about the relationship (e.g. linearity) and/or the distributional family (e.g. normality).

Nonparametric:
- Parameter space is infinite-dimensional, e.g. the set of all continuous functions $f$ on $[0, 1]$ (say).
- Minimal or no assumptions about the relationship (can be any shape) or the distributional family.

Generalized Additive Models (GAMs)

Suppose we have $p$ covariates. Let $(x_{i1}, \ldots, x_{ip})'$ denote the $p$ observed predictor values for the $i$th observation. Using generalized additive models, we model the response $y_i$ as follows:

$$y_i = f_1(x_{i1}) + f_2(x_{i2}) + \ldots + f_p(x_{ip}) + \varepsilon_i = \sum_{j=1}^{p} f_j(x_{ij}) + \varepsilon_i,$$

where the $f_j$'s can be smooth, nonlinear functions and we assume $\varepsilon_i \sim N(0, \sigma^2)$.

Question: How can we estimate the $f_j$'s, $j = 1, \ldots, p$?

Generalized Additive Models

Assume that each $f_j$ can be approximated by a linear combination of basis functions $\mathcal{B}_j = \{ g_{j1}, \ldots, g_{jd} \}$, i.e.,

$$f_j(X_{ij}) \approx \sum_{k=1}^{d} g_{jk}(X_{ij}) \beta_{jk}.$$

We have a system of equations for each $f_j$. For the $j$th covariate, we are approximating for the $n$ observations:

$$\begin{aligned}
f_j(X_{1j}) &\approx g_{j1}(X_{1j}) \beta_{j1} + g_{j2}(X_{1j}) \beta_{j2} + \ldots + g_{jd}(X_{1j}) \beta_{jd}, \\
f_j(X_{2j}) &\approx g_{j1}(X_{2j}) \beta_{j1} + g_{j2}(X_{2j}) \beta_{j2} + \ldots + g_{jd}(X_{2j}) \beta_{jd}, \\
&\;\;\vdots \\
f_j(X_{nj}) &\approx g_{j1}(X_{nj}) \beta_{j1} + g_{j2}(X_{nj}) \beta_{j2} + \ldots + g_{jd}(X_{nj}) \beta_{jd}.
\end{aligned}$$

We denote the unknown weight vectors by $\boldsymbol{\beta}_j = (\beta_{j1}, \ldots, \beta_{jd})^T$.

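The basis approximation above can be sketched numerically for a single covariate. The snippet below uses a truncated power cubic spline basis as one possible choice of $\mathcal{B}_j$; the basis, the knot locations, and the target function $\sin(2\pi x)$ are assumptions for illustration and are not specified in the talk.

```python
import numpy as np

def truncated_power_basis(x, knots):
    """Return the n x d matrix with (i, k) entry g_k(x_i), d = 3 + len(knots).

    Columns are x, x^2, x^3 and (x - kappa)_+^3 for each knot kappa.
    There is no constant column, matching a model without an intercept.
    """
    x = np.asarray(x, dtype=float)
    columns = [x, x**2, x**3]
    columns += [np.clip(x - kappa, 0.0, None) ** 3 for kappa in knots]
    return np.column_stack(columns)

# Hypothetical covariate values and a hypothetical smooth function f_j.
x_j = np.linspace(0.0, 1.0, 200)
f_j = np.sin(2.0 * np.pi * x_j)

interior_knots = [0.2, 0.4, 0.6, 0.8]
G_j = truncated_power_basis(x_j, interior_knots)  # shape (n, d) = (200, 7)

# Least-squares weights beta_j so that G_j @ beta_j approximates f_j.
beta_j, *_ = np.linalg.lstsq(G_j, f_j, rcond=None)
rmse = np.sqrt(np.mean((f_j - G_j @ beta_j) ** 2))
```

Each row of `G_j` corresponds to one of the $n$ equations in the system above, and `beta_j` plays the role of $\boldsymbol{\beta}_j$.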
Matrix Representation of GAMs

Let $\widetilde{\mathbf{X}}_j$ denote the $n \times d$ matrix with $(i, k)$th entry $\widetilde{\mathbf{X}}_j(i, k) = g_{jk}(X_{ij})$. Let $\boldsymbol{\beta}_j = (\beta_{j1}, \ldots, \beta_{jd})^T$ be the unknown weight vectors. Then we may represent the GAM in matrix form as

$$\mathbf{y} - \boldsymbol{\delta} = \sum_{j=1}^{p} \widetilde{\mathbf{X}}_j \boldsymbol{\beta}_j + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim N_n(\mathbf{0}, \sigma^2 \mathbf{I}_n),$$

where $\boldsymbol{\delta}$ is an $n \times 1$ vector of the lower-order bias (or approximation error).

The approximation error is typically $O(n^{-\alpha})$ for some $\alpha > 0$, so the bias goes to zero as the sample size increases.

We have $p$ design matrices $\widetilde{\mathbf{X}}_j$, $j = 1, \ldots, p$, each of dimension $n \times d$. These are matrices of basis functions of the covariates.

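The matrix form can be illustrated by building one $n \times d$ block per covariate and stacking the blocks side by side. This is a sketch under assumptions: the centered polynomial basis, the simulated functions, and the plain least-squares fit are stand-ins chosen for brevity. In particular, the fit here is unpenalized and is not the SSGL method of the talk.

```python
import numpy as np

def centered_poly_basis(x, d):
    """Hypothetical basis: columns x, x^2, ..., x^d, each centered to mean zero."""
    G = np.column_stack([x**k for k in range(1, d + 1)])
    return G - G.mean(axis=0)

rng = np.random.default_rng(2)
n, p, d = 300, 3, 4
X = rng.uniform(-1.0, 1.0, size=(n, p))

# Additive truth: f_1 and f_2 are nonlinear, f_3 is identically zero.
y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=n)
y = y - y.mean()  # center the response so no intercept is needed

# One n x d basis matrix per covariate, then the stacked n x (p*d) design.
blocks = [centered_poly_basis(X[:, j], d) for j in range(p)]
X_tilde = np.hstack(blocks)

beta_hat, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)
beta_blocks = beta_hat.reshape(p, d)  # row j holds the weights beta_j

# The fitted values are the sum of the per-covariate contributions.
fitted = sum(blocks[j] @ beta_blocks[j] for j in range(p))
```

Stacking turns the GAM into a linear model with $pd$ coefficients arranged in $p$ groups of size $d$, which is the structure that group-wise selection acts on.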
Choosing a Basis Expansion

How do we choose the set of basis functions $\mathcal{B}_j = \{ g_{j1}, \ldots, g_{jd} \}$ to approximate $f_j$?

There are a lot of possibilities:
- Hermite polynomials
- Laguerre polynomials
- Fourier series
- splines

Of the above, splines are the most commonly used in practice since they are the most flexible (although Fourier series are useful for wavelet analysis and modeling data that is known to be periodic).

Splines

Splines are piecewise polynomial functions over an interval $[a, b]$, separated into sub-intervals. The endpoints of the sub-intervals are called knots.

Cubic splines impose the conditions that the piecewise polynomials are cubic and that they are continuous over $[a, b]$, $C^1$-continuous, and $C^2$-continuous (that is, the first and second derivatives are also continuous at the inner knots).

Figure: Image retrieved from https://calculus7.org/tag/spline/ . Accessed 12 Mar. 2019.

Natural Cubic Splines

Suppose that the interval $[a, b]$ is partitioned by $n + 1$ knots:

$$a := t_0 < t_1 < t_2 < \ldots < t_{n-1} < t_n := b.$$

Most software uses equidistant points for the knots and chooses a "default" number of knots but allows the practitioner to override these defaults.

Define the piecewise polynomial as

$$S(x) = \begin{cases}
S_0(x), & t_0 \le x \le t_1, \\
S_1(x), & t_1 \le x \le t_2, \\
\;\;\vdots \\
S_{n-1}(x), & t_{n-1} \le x \le t_n.
\end{cases}$$

The natural cubic spline imposes the condition that $S_0''(t_0) = S_{n-1}''(t_n) = 0$.

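As a sketch of how the natural boundary condition enters a computation, the snippet below builds a natural cubic spline that interpolates given values at the knots. It solves the standard linear system for the second derivatives $M_i = S''(t_i)$, with $M_0 = M_n = 0$ enforcing $S_0''(t_0) = S_{n-1}''(t_n) = 0$. The function name, the knots, and the interpolated values are assumptions for illustration. Interpolation is used here only to show the construction; it is not the regression setting of the talk.

```python
import numpy as np

def natural_cubic_spline(t, y):
    """Return (S, M): the interpolating natural cubic spline and its
    second derivatives M_i = S''(t_i) at the knots t_0 < ... < t_n."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(t) - 1
    h = np.diff(t)  # sub-interval lengths h_i = t_{i+1} - t_i

    A = np.zeros((n + 1, n + 1))
    rhs = np.zeros(n + 1)
    A[0, 0] = 1.0  # natural condition S''(t_0) = 0
    A[n, n] = 1.0  # natural condition S''(t_n) = 0
    for i in range(1, n):
        # Continuity of the first derivative at the inner knot t_i.
        A[i, i - 1] = h[i - 1]
        A[i, i] = 2.0 * (h[i - 1] + h[i])
        A[i, i + 1] = h[i]
        rhs[i] = 6.0 * ((y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1])
    M = np.linalg.solve(A, rhs)

    def S(x):
        x = np.asarray(x, dtype=float)
        # Index i of the piece S_i whose sub-interval contains x.
        i = np.clip(np.searchsorted(t, x, side="right") - 1, 0, n - 1)
        left, right = t[i + 1] - x, x - t[i]
        return (
            (M[i] * left**3 + M[i + 1] * right**3) / (6.0 * h[i])
            + (y[i] / h[i] - M[i] * h[i] / 6.0) * left
            + (y[i + 1] / h[i] - M[i + 1] * h[i] / 6.0) * right
        )

    return S, M

# Equidistant knots on [0, 1] and hypothetical values to interpolate.
knots = np.linspace(0.0, 1.0, 6)
values = np.sin(2.0 * np.pi * knots)
S, M = natural_cubic_spline(knots, values)
```

Calling `S(knots)` reproduces `values`, and the first and last entries of `M` are zero by construction.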
Variable Selection with GAMs

Letting $f_j(\mathbf{X}_j)$ denote an $n \times 1$ vector with $i$th component $f_j(x_{ij})$ and letting $\widetilde{\mathbf{X}}_j$ denote the $j$th design matrix of basis functions corresponding to the $j$th predictor, i.e. $\widetilde{\mathbf{X}}_j(i, k) = g_{jk}(X_{ij})$, recall that we have chosen to model our data as:

$$\mathbf{y} = \sum_{j=1}^{p} f_j(\mathbf{X}_j) + \boldsymbol{\varepsilon} = \sum_{j=1}^{p} \widetilde{\mathbf{X}}_j \boldsymbol{\beta}_j + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim N_n(\mathbf{0}, \sigma^2 \mathbf{I}_n).$$

When the dimensionality of the covariates $p$ is high (including $p \gg n$), we often want to impose a low-dimensional structure such as sparsity. That is, we assume that the response $\mathbf{y}$ depends on only a few of the $p$ covariates.

Thus, most of the $f_j$'s are identically zero ($f_j = 0$) and do not contribute to predicting the response. This is equivalent to assuming that most of the weight vectors satisfy $\boldsymbol{\beta}_j = \mathbf{0}_d$.

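The group-level sparsity assumption can be sketched as follows: a covariate is active exactly when its whole weight vector $\boldsymbol{\beta}_j$ is nonzero, and inactive blocks contribute nothing to the signal. The dimensions, the nonzero weights, and the random matrices standing in for the basis matrices $\widetilde{\mathbf{X}}_j$ are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# High-dimensional setting with p much larger than n (assumed sizes).
n, p, d = 50, 1000, 4

# Row j of beta is the weight vector beta_j; only two groups are nonzero.
beta = np.zeros((p, d))
beta[0] = [1.0, -0.5, 0.25, 0.0]
beta[3] = [0.0, 2.0, 0.0, -1.0]

# A covariate is active when its group norm ||beta_j||_2 is nonzero,
# even if some individual entries within the group are zero.
group_norms = np.linalg.norm(beta, axis=1)
active = np.flatnonzero(group_norms > 0)

# Random placeholders for the p basis matrices, each of size n x d.
blocks = rng.normal(size=(p, n, d))

# Signal from all p blocks versus signal from the active blocks only.
full_signal = np.einsum("jnk,jk->n", blocks, beta)
sparse_signal = sum(blocks[j] @ beta[j] for j in active)
```

The two signals agree, so selecting variables in a GAM amounts to deciding which of the $p$ groups of $d$ weights are nonzero rather than thresholding individual coefficients.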