Variations on Nonparametric Additive Models: Computational and Statistical Aspects
John Lafferty
Department of Statistics & Department of Computer Science, University of Chicago
Collaborators
Sivaraman Balakrishnan (CMU), Mathias Drton (Chicago), Rina Foygel (Chicago), Michael Horrell (Chicago), Han Liu (Princeton), Kriti Puniyani (CMU), Pradeep Ravikumar (Univ. Texas, Austin), Larry Wasserman (CMU)
2
Perspective
Even the simplest models can be interesting, challenging, and useful for large, high-dimensional data.
3
Motivation
- Great progress has been made on understanding sparsity for high-dimensional linear models
- Many problems have clear nonlinear structure
- We are interested in purely functional methods for high-dimensional, nonparametric inference, with no basis expansions
4
Additive Models
[Figure: estimated additive component functions plotted against the covariates Age, Bmi, Map, and Tc]
5
Additive Models
- Fully nonparametric methods appear hopeless: logarithmic scaling, p = log n (e.g., “Rodeo”, Lafferty and Wasserman (2008))
- Additive models are a useful compromise: exponential scaling, p = exp(n^c) (e.g., “SpAM”, Ravikumar et al. (2009))
6
Themes of this talk
- Variations on additive models enjoy most of the good statistical
and computational properties of sparse linear models
- Thresholded backfitting algorithms, via subdifferential calculus
- RKHS formulations are problematic
- A little nonparametricity goes a long way
7
Outline
- Sparse additive models
- Nonparametric reduced rank regression
- Functional sparse CCA
- The nonparanormal
- Conclusions
8
Sparse Additive Models
Ravikumar, Lafferty, Liu and Wasserman, JRSS B (2009)
Additive model: Yi = ∑_{j=1}^p mj(Xij) + εi,  i = 1, . . . , n
High dimensional: n ≪ p, with most mj = 0.
Optimization: minimize E( Y − ∑_j mj(Xj) )²  subject to  ∑_{j=1}^p √(E(mj²)) ≤ Ln,  E(mj) = 0
Related work by Bühlmann and van de Geer (2009), Koltchinskii and Yuan (2010), Raskutti, Wainwright and Yu (2011)
9
Sparse Additive Models
C = { m ∈ R⁴ : √(m11² + m21²) + √(m12² + m22²) ≤ L }
[Figure: the set C and its projections π12C and π13C]
10
Stationary Conditions
Lagrangian: L(f, λ, µ) = (1/2) E( Y − ∑_{j=1}^p mj(Xj) )² + λ ∑_{j=1}^p √(E(mj²(Xj)))
Let Rj = Y − ∑_{k≠j} mk(Xk) be the jth residual. Stationary condition:
mj − E(Rj | Xj) + λ vj = 0 a.e., where vj ∈ ∂√(E(mj²)) satisfies
vj = mj / √(E(mj²)) if E(mj²) ≠ 0;  E(vj²) ≤ 1 otherwise
11
Stationary Conditions
Rewriting, mj + λ vj = E(Rj | Xj) ≡ Pj, so
( 1 + λ / √(E(mj²)) ) mj = Pj  if √(E(Pj²)) > λ;  mj = 0 otherwise.
This implies the soft-thresholding rule
mj = [ 1 − λ / √(E(Pj²)) ]₊ Pj
12
SpAM Backfitting Algorithm
Input: Data (Xi, Yi), regularization parameter λ.
Iterate until convergence; for each j = 1, . . . , p:
  Compute residual: Rj = Y − ∑_{k≠j} mk(Xk)
  Estimate the projection Pj = E(Rj | Xj) by smoothing: Pj = Sj Rj
  Estimate the norm: sj = √( (1/n) ∑_i Pj(Xij)² )
  Soft-threshold: mj ← [ 1 − λ/sj ]₊ Pj
Output: Estimator m(Xi) = ∑_j mj(Xij).
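To make the update concrete, here is a minimal sketch of the SpAM backfitting iteration in Python. It is not the authors' implementation: the Nadaraya-Watson smoother, the bandwidth, the fixed iteration count, and the toy data are illustrative assumptions.

```python
import numpy as np

def kernel_smoother(x, r, bandwidth=0.1):
    """Nadaraya-Watson smoother: estimate E[r | x] at the sample points."""
    d = (x[:, None] - x[None, :]) / bandwidth
    w = np.exp(-0.5 * d ** 2)
    return (w @ r) / w.sum(axis=1)

def spam_backfit(X, y, lam, bandwidth=0.1, n_iter=50):
    """Soft-thresholded backfitting for sparse additive models (SpAM)."""
    n, p = X.shape
    m = np.zeros((n, p))                 # component functions evaluated at the sample points
    y = y - y.mean()                     # center the response
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - m.sum(axis=1) + m[:, j]               # residual R_j, leaving out component j
            p_j = kernel_smoother(X[:, j], r_j, bandwidth)  # projection: P_j = S_j R_j
            p_j -= p_j.mean()                               # enforce E(m_j) = 0
            s_j = np.sqrt(np.mean(p_j ** 2))                # estimated norm
            m[:, j] = max(0.0, 1.0 - lam / s_j) * p_j if s_j > 0 else 0.0   # soft-threshold
    return m

# Toy usage: only the first two covariates carry signal; the rest should be zeroed out.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 10))
y = np.sin(2 * np.pi * X[:, 0]) + (X[:, 1] - 0.5) ** 2 + 0.1 * rng.normal(size=200)
print(np.round(np.sqrt((spam_backfit(X, y, lam=0.05) ** 2).mean(axis=0)), 3))  # component norms
```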
13
Example: Boston Housing Data
Predict house value Y from 10 covariates. We added 20 irrelevant (random) covariates to test the method. Y = house value; n = 506, p = 30.
Y = β0 + m1(crime) + m2(tax) + · · · + m30(X30) + ε.
Note that m11 = · · · = m30 = 0. We choose λ by minimizing the estimated risk. SpAM yields 6 nonzero functions. It correctly reports that the estimated m11 = · · · = m30 = 0.
14
Example Fits
[Figure: two example fitted component functions from the Boston housing model]
15
L2 norms of fitted functions versus 1/λ
[Figure: L2 norms of the fitted component functions plotted against 1/λ; curves are labeled by covariate index (17, 7, 5, 6, 3, 8, 10, 4)]
16
RKHS Version
Raskutti, Wainwright and Yu (2011)
Sample optimization:
min_f (1/n) ∑_{i=1}^n ( yi − ∑_{j=1}^p mj(xij) )² + λ ∑_j ‖mj‖_Hj + µ ∑_j ‖mj‖_{L2(Pn)}
where ‖mj‖_{L2(Pn)} = √( (1/n) ∑_{i=1}^n mj²(xij) ).
By the representer theorem, with mj represented at the sample points as Kj αj,
min_α (1/n) ∑_{i=1}^n ( yi − ∑_{j=1}^p (Kj αj)_i )² + λ ∑_j √(αjᵀ Kj αj) + µ ∑_j √(αjᵀ Kj² αj)
This is a finite-dimensional SOCP, but no scalable algorithms are known.
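For illustration, the finite-dimensional problem can be handed directly to a generic conic solver. The sketch below (assuming a Gaussian kernel, a small jitter for numerical stability, and cvxpy's default solver) is exactly the kind of formulation that does not scale, which is the point of the remark above.

```python
import numpy as np
import cvxpy as cp

def gaussian_gram(x, scale=0.2):
    """Gram matrix K_j for a one-dimensional covariate."""
    return np.exp(-0.5 * ((x[:, None] - x[None, :]) / scale) ** 2)

def rkhs_additive_fit(X, y, lam, mu):
    """Doubly penalized RKHS additive model posed as an explicit SOCP."""
    n, p = X.shape
    K = [gaussian_gram(X[:, j]) + 1e-8 * np.eye(n) for j in range(p)]
    L = [np.linalg.cholesky(K[j]) for j in range(p)]                   # K_j = L_j L_j^T
    alpha = [cp.Variable(n) for _ in range(p)]
    fit = y - sum(K[j] @ alpha[j] for j in range(p))
    rkhs_pen = sum(cp.norm(L[j].T @ alpha[j], 2) for j in range(p))    # sqrt(a^T K_j a)
    emp_pen = sum(cp.norm(K[j] @ alpha[j], 2) for j in range(p)) / np.sqrt(n)  # sqrt(a^T K_j^2 a / n)
    problem = cp.Problem(cp.Minimize(cp.sum_squares(fit) / n + lam * rkhs_pen + mu * emp_pen))
    problem.solve()                                                    # generic interior-point SOCP solve
    return [K[j] @ alpha[j].value for j in range(p)]                   # fitted m_j at the sample points
```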
17
Open Problems
- Under what conditions do the backfitting algorithms converge?
- What guarantees can be given on the solution to the infinite
dimensional optimization?
- Is it possible to simultaneously adapt to unknown smoothness
and sparsity?
18
Multivariate Regression
Y ∈ Rq and X ∈ Rp. Regression function M(X) = E(Y | X). Linear model M(X) = BX where B ∈ Rq×p.
Reduced rank regression: r = rank(B) ≤ C.
Recent work has studied the properties and high-dimensional scaling of reduced rank regression with the nuclear norm
‖B‖∗ := ∑_{j=1}^{min(p,q)} σj(B)
as a convex surrogate for the rank constraint (Yuan et al., 2007; Negahban and Wainwright, 2011).
19
Nonparametric Reduced Rank Regression
Foygel, Horrell, Drton and Lafferty (2012)
Nonparametric multivariate regression M(X) = (m1(X), . . . , mq(X))ᵀ, with each component an additive model
mk(X) = ∑_{j=1}^p mkj(Xj)
What is the nonparametric analogue of the ‖B‖∗ penalty?
20
Recall: Sparse Vectors and ℓ1 Relaxation
[Figure: the set of sparse vectors {‖x‖₀ ≤ t} and its convex hull, the ℓ1 ball {‖x‖₁ ≤ t}]
21
Low-Rank Matrices and Convex Relaxation
[Figure: the set of low-rank matrices {rank(X) ≤ t} and its convex hull, the nuclear norm ball {‖X‖∗ ≤ t}]
22
Nuclear Norm Regularization
Algorithms for nuclear norm minimization are a lot like iterative soft thresholding for lasso problems. To project a matrix B onto the nuclear norm ball ‖X‖∗ ≤ t:
- Compute the SVD: B = U diag(σ) Vᵀ
- Soft-threshold the singular values: B ← U diag(Softλ(σ)) Vᵀ
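A numpy sketch of the singular value soft-thresholding step (the proximal operator of the nuclear norm); an exact projection onto the ball ‖X‖∗ ≤ t would instead choose the threshold so the thresholded singular values sum to t.

```python
import numpy as np

def soft_threshold_singular_values(B, lam):
    """Replace each singular value sigma by max(sigma - lam, 0), keeping the singular vectors."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

B = np.random.default_rng(1).normal(size=(8, 5))
print(np.linalg.matrix_rank(soft_threshold_singular_values(B, lam=1.0)))  # typically < min(8, 5)
```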
23
Low Rank Functions
What does it mean for a set of functions m1(x), . . . , mq(x) to be low rank? Let x1, . . . , xn be a collection of points. We require that the n × q matrix M(x1:n) = [mk(xi)] is low rank. Stochastic setting: M = [mk(Xi)]. The natural penalty is
‖M‖∗ = ∑_{s=1}^q σs(M) = ∑_{s=1}^q √(λs(MᵀM))
Population version:
|||M|||∗ := ‖ Cov(M(X))^{1/2} ‖∗ = ‖ Σ(M)^{1/2} ‖∗
24
Constrained Rank Additive Models (CRAM)
Let Σj = Cov(Mj). Two natural penalties:
- Penalty 1: ‖Σ1^{1/2}‖∗ + ‖Σ2^{1/2}‖∗ + · · · + ‖Σp^{1/2}‖∗
- Penalty 2: ‖( Σ1^{1/2} Σ2^{1/2} · · · Σp^{1/2} )‖∗
Population risk functional (first penalty):
(1/2) E‖ Y − ∑_j Mj(Xj) ‖₂² + λ ∑_j |||Mj|||∗
25
Stationary Conditions
The subdifferential is
∂|||F|||∗ = { [E(F Fᵀ)]^{-1/2} F + H },  where |||H|||sp ≤ 1, E(F Hᵀ) = 0, E(F Fᵀ) H = 0
Let P(X) := E(Y | X) and consider the optimization
(1/2) E‖ Y − M(X) ‖₂² + λ |||M|||∗
Let E(P Pᵀ) = U diag(τ) Uᵀ be its spectral decomposition. Define
M = U diag([1 − λ/√τ]₊) Uᵀ P
Then M is a stationary point of the optimization, satisfying E(Y | X) = M(X) + λ V(X) a.e., for some V ∈ ∂|||M|||∗.
26
CRAM Backfitting Algorithm (Penalty 1)
Input: Data (Xi, Yi), regularization parameter λ.
Iterate until convergence; for each j = 1, . . . , p:
  Compute residual: Rj = Y − ∑_{k≠j} Mk(Xk)
  Estimate the projection Pj = E(Rj | Xj) by smoothing: Pj = Sj Rj
  Compute the spectral decomposition: (1/n) Pj Pjᵀ = U diag(τ) Uᵀ
  Soft-threshold: Mj = U diag([1 − λ/√τ]₊) Uᵀ Pj
Output: Estimator M(Xi) = ∑_j Mj(Xij).
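The matrix soft-thresholding step of this update, sketched in numpy. The smoothing that produces Pj is assumed to have been done already; here Pj is stored as a q × n matrix with one row per response.

```python
import numpy as np

def cram_soft_threshold(P_j, lam):
    """CRAM update: shrink the spectrum of (1/n) P_j P_j^T and apply the result to P_j."""
    n = P_j.shape[1]
    tau, U = np.linalg.eigh(P_j @ P_j.T / n)          # eigenvalues tau and eigenvectors U
    scale = np.maximum(1.0 - lam / np.sqrt(np.maximum(tau, 1e-12)), 0.0)
    return U @ np.diag(scale) @ U.T @ P_j             # M_j = U diag([1 - lam/sqrt(tau)]_+) U^T P_j
```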
27
Example
Data of Smith et al. (1962), chemical measurements for 33 individual urine specimens. q = 5 response variables: pigment creatinine, and the concentrations (in mg/ml) of phosphate, phosphorus, creatinine and choline. p = 3 covariates: weight of subject, volume and specific gravity of specimen. We use Penalty 2 with local linear smoothing. We take λ = 1 and bandwidth h = .3.
28
[Figure: matrix of fitted component functions for each covariate Xj (weight, volume, specific gravity) against each response Yk (pigment creatinine, phosphate, phosphorus, creatinine, choline)]
29
Statistical Scaling for Prediction
Let F be the class of matrices of functions having a functional SVD M(X) = U D V(X)ᵀ, where E(VᵀV) = I and V(X) = [vsj(Xj)] with each vsj in a second-order Sobolev space. Define
Mn = { M ∈ F : ‖D‖∗ = o( ( n / (q + log(pq)) )^{1/4} ) }.
Let M̂ minimize the empirical risk (1/n) ∑_i ‖ Yi − ∑_j Mj(Xij) ‖₂² over the class Mn. Then
R(M̂) − inf_{M ∈ Mn} R(M) → 0 in probability.
30
Nonparametric CCA
Canonical correlation analysis (CCA; Hotelling, 1936) is a classical method for finding correlations between components of two random vectors X ∈ Rp and Y ∈ Rq. Sparse versions have been proposed for high-dimensional data (Witten & Tibshirani, 2009). Sparse additive models can be extended to this setting.
31
Sparse Additive Functional CCA
Balakrishnan, Puniyani and Lafferty (2012)
Population version of the optimization:
max_{f∈F, g∈G} E( f(X) g(Y) )
subject to
max_j E(fj²) ≤ 1,  ∑_{j=1}^p √(E(fj²)) ≤ Cf
max_k E(gk²) ≤ 1,  ∑_{k=1}^q √(E(gk²)) ≤ Cg
Estimated with analogues of SpAM backfitting, together with screening procedures. See the ICML paper.
32
Regression vs. Graphical Models
assumptions      regression               graphical models
parametric       lasso                    graphical lasso
nonparametric    sparse additive model    nonparanormal
33
The Nonparanormal
Liu, Lafferty and Wasserman, JMLR 2009
A random vector X = (X1, . . . , Xp)ᵀ has a nonparanormal distribution, X ∼ NPN(µ, Σ, f), in case Z ≡ f(X) ∼ N(µ, Σ), where f(X) = (f1(X1), . . . , fp(Xp)). Joint density:
pX(x) = 1 / ( (2π)^{p/2} |Σ|^{1/2} ) · exp( −(1/2) (f(x) − µ)ᵀ Σ⁻¹ (f(x) − µ) ) · ∏_{j=1}^p |f′j(xj)|
- Semiparametric Gaussian copula
34
Examples
35
The Nonparanormal
- Define hj(x) = Φ⁻¹(Fj(x)), where Fj(x) = P(Xj ≤ x).
- Let Λ be the covariance matrix of Z = h(X). Then Xj ⫫ Xk | Xrest if and only if (Λ⁻¹)jk = 0.
- Hence we need to:
  1. Estimate hj(x) = Φ⁻¹(Fj(x)), using an estimate of the CDF Fj.
  2. Estimate the covariance matrix of Z = h(X) using the glasso.
36
Winsorizing the CDF
Truncation to estimate Fj for n > p: clip the empirical CDF to the interval [δn, 1 − δn], where
δn ≡ 1 / ( 4 n^{1/4} √(π log n) )
[Figure: the empirical CDF and its Winsorized (truncated) version]
37
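A minimal sketch of the two-step procedure: Winsorized empirical CDFs, the Gaussian quantile transform, then the graphical lasso on the transformed data. The n/(n+1) scaling of the empirical CDF, the sklearn GraphicalLasso call, and the toy data are assumptions for illustration, not the implementation from the paper.

```python
import numpy as np
from scipy.stats import norm
from sklearn.covariance import GraphicalLasso

def nonparanormal_transform(X):
    """h_j(x) = Phi^{-1}(Winsorized empirical CDF of X_j), applied column by column."""
    n, p = X.shape
    delta_n = 1.0 / (4.0 * n ** 0.25 * np.sqrt(np.pi * np.log(n)))   # truncation level
    Z = np.empty_like(X, dtype=float)
    for j in range(p):
        ranks = np.argsort(np.argsort(X[:, j])) + 1       # ranks 1..n
        F_hat = ranks / (n + 1.0)                         # scaled empirical CDF
        Z[:, j] = norm.ppf(np.clip(F_hat, delta_n, 1.0 - delta_n))
    return Z

# Usage: transform, then estimate the inverse covariance with the graphical lasso.
rng = np.random.default_rng(0)
X = np.exp(rng.multivariate_normal(np.zeros(5), 0.5 + 0.5 * np.eye(5), size=200))  # non-Normal margins
model = GraphicalLasso(alpha=0.1).fit(nonparanormal_transform(X))
print(np.round(model.precision_, 2))   # zeros in the precision matrix encode conditional independences
```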
Properties
- LLW (2009) show that the resulting procedure has the same
theoretical properties as the glasso, even with dimension p increasing with n.
- The truncation of the empirical distribution is crucial for the
theoretical results when p is large, although in practice it does not seem to matter too much.
- If the nonparanormal is used when the data are actually Normal,
little efficiency is lost.
38
Gene-Gene Interactions for Arabidopsis thaliana
[Image: Arabidopsis thaliana (source: wikipedia.org)]
Dataset from Affymetrix microarrays, sample size n = 118, p = 40 genes (isoprenoid pathway).
39
Example Results
[Figure: estimated adjacency matrices over the 40 genes for the nonparanormal (NPN) and the glasso, and their difference]
40
Transformations for 3 Genes
[Figure: estimated transformations hj for genes x5, x8, and x18]
- These genes have highly non-Normal marginal distributions.
- The graphs are different at these genes.
41
Graphs on the S&P 500
- Data from Yahoo! Finance (finance.yahoo.com).
- Daily closing prices for 452 stocks in the S&P 500 between 2003
and 2008 (before onset of the “financial crisis”).
- Log returns: Xtj = log( St,j / St−1,j ).
- Winsorized to trim outliers.
- In the following graphs, each node is a stock, and color indicates GICS industry: Consumer Discretionary, Consumer Staples, Energy, Financials, Health Care, Industrials, Information Technology, Materials, Telecommunications Services, Utilities.
42
S&P Data: Glasso vs. Nonparanormal
[Figure: edges common to the glasso and nonparanormal graphs, and edges in the difference]
43
The Nonparanormal SKEPTIC
Liu, Han, Yuan, Lafferty & Wasserman, 2012
Assuming X ∼ NPN(f, Σ⁰), we have
Σ⁰jk = 2 sin( π ρjk / 6 )
where ρ is Spearman's rho: ρjk := Corr( Fj(Xj), Fk(Xk) ).
Empirical estimate, with rij the rank of Xij among X1j, . . . , Xnj and r̄j the average rank:
ρjk = ∑_i (rij − r̄j)(rik − r̄k) / √( ∑_i (rij − r̄j)² · ∑_i (rik − r̄k)² )
A similar relation holds for Kendall's tau.
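A direct sketch of the rank-correlation plug-in. The scipy calls and the O(p²) loop are assumptions made for clarity; in practice the Spearman matrix can be computed in one vectorized pass. The Kendall branch uses the standard sin(π τ / 2) identity.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def skeptic_correlation(X, method="spearman"):
    """Rank-based estimate of the nonparanormal correlation matrix Sigma^0."""
    n, p = X.shape
    S = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            if method == "spearman":
                rho, _ = spearmanr(X[:, j], X[:, k])
                S[j, k] = S[k, j] = 2.0 * np.sin(np.pi * rho / 6.0)
            else:
                tau, _ = kendalltau(X[:, j], X[:, k])
                S[j, k] = S[k, j] = np.sin(np.pi * tau / 2.0)
    return S   # plug into the graphical lasso in place of the usual sample correlation
```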
44
The Nonparanormal SKEPTIC
Using a Hoeffding inequality for U-statistics, we get
max_jk | Σρjk − Σ⁰jk | ≤ ( 3√(2π) / 2 ) √( (log d + log n) / n )
with probability at least 1 − 1/n². We can thus estimate the covariance at the parametric rate.
Punch line: for graph and covariance estimation, no loss in statistical or computational efficiency comes from using the nonparanormal rather than the Normal graphical model.
45
Conclusions
- Thresholded backfitting algorithms derived from subdifferential
calculus
- RKHS formulations are problematic
- Theory for infinite dimensional optimizations still incomplete
- Many extensions possible: Nonparanormal component analysis,
etc.
- Variations on additive models enjoy most of the good statistical
and computational properties of sparse linear models, with relaxed assumptions
- We’re building a toolbox for large scale, high dimensional
nonparametric inference.
46