SLIDE 1

Regularization and shrinkage for model selection in sparse GLM models. Challenging problems in Statistical Learning Workshop

  • A. Antoniadis

LJK-Université Joseph Fourier, Grenoble, March 17 & 18, 2011


SLIDE 2

Introduction

During the 1990s, the nonparametric regression and signal processing literature was dominated by (nonlinear) wavelet shrinkage and wavelet thresholding estimators. When sampling points are not equispaced, Antoniadis & Fan (2001) address the problem with new regularization procedures, such as penalized least squares regression, and establish their connection with model selection in nonparametric regression models. They suggest using nonconvex penalties (SCAD) to increase model sparsity and accuracy. This was extended to handle variable selection via penalized ordinary least squares regression in general sparse linear models by Fan & Li (2001).

SLIDE 3

Summary

Starting from the thresholding rules, we review several thresholding procedures that have been used for wavelet denoising and establish their connection with penalized ordinary least squares with separable penalties.

When dealing with nonorthogonal designs in high-dimensional linear models, sparsity can be achieved via thresholding-based iterative selection procedures for model selection and shrinkage. Finally, we extend the iterative thresholding procedures to generalized linear models with possibly nonorthogonal designs, since one may use them as feature selection tools in high-dimensional logistic or multinomial regression.

SLIDE 4

Outline.

Objective: build a model with a subset of “predictors”.

  • Denoising: wavelet thresholding; shrinkage and nonlinear diffusion
  • Relations to variational methods: convenient penalties
  • Extension to nonequispaced designs: connections with LASSO
  • Penalized least squares and iterative thresholding: surrogates and the MM algorithm
  • Penalized likelihood and iterative thresholding for GLMs: appropriate surrogates

SLIDE 5

Wavelet decompositions

A mother wavelet $\psi$ together with its translations and dilations $\psi_{j,k}(x) = 2^{j/2}\psi(2^j x - k)$ provides the orthogonal expansion

$$f = \sum_{j,k\in\mathbb{Z}} \langle f, \psi_{j,k}\rangle\, \psi_{j,k}$$

SLIDE 6

and, with the help of the scaling function $\phi$,

$$f = \sum_{k\in\mathbb{Z}} \langle f, \phi_{j_0,k}\rangle\, \phi_{j_0,k} + \sum_{k\in\mathbb{Z},\ j\ge j_0} \langle f, \psi_{j,k}\rangle\, \psi_{j,k}.$$

SLIDE 7

The discrete wavelet transform

Given a vector of function values $g = (g(t_1), \dots, g(t_n))'$ at equally spaced points $t_i$, the discrete wavelet transform of $g$ is given by $d = Wg$, where $d$ is an $n \times 1$ vector comprising both discrete scaling coefficients $c_{j_0 k}$ and discrete wavelet coefficients $d_{jk}$, and $W$ is an orthogonal $n \times n$ matrix associated with the chosen orthonormal wavelet basis. The $c_{j_0 k}$ and $d_{jk}$ are related to their continuous counterparts $\langle g, \phi_{j_0,k}\rangle$ and $\langle g, \psi_{j,k}\rangle$ (with an approximation error of order $n^{-1}$) via the relationships $c_{j_0 k} \approx \sqrt{n}\,\langle g, \phi_{j_0,k}\rangle$ and $d_{jk} \approx \sqrt{n}\,\langle g, \psi_{j,k}\rangle$. The factor $\sqrt{n}$ arises because of the difference between the continuous and discrete orthonormality conditions.

SLIDE 8

Denoising by wavelet thresholding

Wavelet series allow a parsimonious and sparse expansion for a wide variety of functions, including inhomogeneous cases. Due to the orthogonality of the matrix $W$, the DWT of white noise is also an array of independent $N(0,1)$ random variables, so

$$\hat c_{j_0 k} = c_{j_0 k} + \sigma\,\epsilon_{j_0 k}, \qquad k = 0, 1, \dots, 2^{j_0}-1,$$
$$\hat d_{jk} = d_{jk} + \sigma\,\epsilon_{jk}, \qquad j = j_0, \dots, J-1,\ k = 0, \dots, 2^{j}-1,$$

where $\hat c_{j_0 k}$ and $\hat d_{jk}$ are respectively the empirical scaling and the empirical wavelet coefficients of the noisy data $y$, and the $\epsilon_{jk}$ are independent $N(0,1)$ random variables.

SLIDE 9

Exploiting sparsity

The sparseness of the wavelet expansion makes it reasonable to assume that essentially only a few ‘large’ $d_{jk}$ contain information about the underlying function $g$, while ‘small’ $d_{jk}$ can be attributed to the noise which uniformly contaminates all wavelet coefficients. Thus, simple denoising algorithms that use the wavelet transform consist of three steps:
1) Calculate the wavelet transform of the noisy signal.
2) Modify the noisy wavelet coefficients according to some rule.
3) Compute the inverse transform using the modified coefficients.
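As an illustration, here is a minimal sketch of these three steps in Python. It assumes the PyWavelets package (pywt); the choice of wavelet, the decomposition level, and the MAD-based noise estimate with the universal threshold $\sigma\sqrt{2\log n}$ are illustrative assumptions, not prescriptions from the slides.

```python
import numpy as np
import pywt

def wavelet_denoise(y, wavelet="db4", level=4):
    # 1) Calculate the wavelet transform of the noisy signal.
    coeffs = pywt.wavedec(y, wavelet, level=level)
    # Estimate sigma from the finest-level detail coefficients (MAD).
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    lam = sigma * np.sqrt(2.0 * np.log(len(y)))      # universal threshold
    # 2) Modify the noisy wavelet coefficients (soft thresholding here).
    coeffs[1:] = [pywt.threshold(c, lam, mode="soft") for c in coeffs[1:]]
    # 3) Compute the inverse transform using the modified coefficients.
    return pywt.waverec(coeffs, wavelet)[: len(y)]
```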

SLIDE 10

Thresholding rules

Mathematically, wavelet coefficients are estimated using either the hard or the soft thresholding rule, given respectively by

$$\delta^H_\lambda(\hat d_{jk}) = \begin{cases} 0 & \text{if } |\hat d_{jk}| \le \lambda,\\ \hat d_{jk} & \text{if } |\hat d_{jk}| > \lambda,\end{cases}
\qquad
\delta^S_\lambda(\hat d_{jk}) = \begin{cases} 0 & \text{if } |\hat d_{jk}| \le \lambda,\\ \hat d_{jk} - \lambda & \text{if } \hat d_{jk} > \lambda,\\ \hat d_{jk} + \lambda & \text{if } \hat d_{jk} < -\lambda.\end{cases}$$
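In code, the two rules are one-liners; a minimal NumPy sketch, vectorized over an array of empirical coefficients (the function names are mine):

```python
import numpy as np

def hard_threshold(d, lam):
    """'Keep' or 'kill': zero out coefficients with |d| <= lambda."""
    return np.where(np.abs(d) > lam, d, 0.0)

def soft_threshold(d, lam):
    """'Shrink' or 'kill': surviving coefficients are pulled toward zero by lambda."""
    return np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)
```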

SLIDE 11

Advantages and disadvantages

Thresholding allows the data itself to decide which wavelet coefficients are significant; hard thresholding (a discontinuous function) is a ‘keep’ or ‘kill’ rule, while soft thresholding (a continuous function) is a ‘shrink’ or ‘kill’ rule. Bruce & Gao (1996) and Marron, Adak, Johnstone, Neumann & Patil (1998) have shown that simple threshold values with hard thresholding result in a larger variance of the function estimate, while the same threshold values with soft thresholding shift the estimated coefficients by an amount $\lambda$ even when $|\hat d_{jk}|$ stands well above the noise level, creating unnecessary bias when the true coefficients are large. Also, due to its discontinuity, hard thresholding can be unstable, that is, sensitive to small changes in the data.

SLIDE 12

Remedies

To remedy the drawbacks of both hard and soft thresholding rules, Gao (1998) considered the nonnegative garrote thresholding

$$\delta^G_\lambda(\hat d_{jk}) = \begin{cases} 0 & \text{if } |\hat d_{jk}| \le \lambda,\\[2pt] \hat d_{jk} - \dfrac{\lambda^2}{\hat d_{jk}} & \text{if } |\hat d_{jk}| > \lambda,\end{cases}$$

which is also a “shrink” or “kill” rule (a continuous function). The resulting wavelet thresholding estimators offer, in small samples, advantages over both hard thresholding and soft thresholding.

SLIDE 13

Other rules

In the same spirit as Gao (1998), Antoniadis & Fan (2001) (AF for short) suggested the SCAD thresholding rule

$$\delta^{SCAD}_\lambda(\hat d_{jk}) = \begin{cases} \operatorname{sign}(\hat d_{jk})\max\big(0, |\hat d_{jk}| - \lambda\big) & \text{if } |\hat d_{jk}| \le 2\lambda,\\[4pt] \dfrac{(a-1)\,\hat d_{jk} - a\lambda\operatorname{sign}(\hat d_{jk})}{a-2} & \text{if } 2\lambda < |\hat d_{jk}| \le a\lambda,\\[4pt] \hat d_{jk} & \text{if } |\hat d_{jk}| > a\lambda,\end{cases}$$

which is a “shrink” or “kill” rule (a piecewise linear function). It does not over-penalize large values of $|\hat d_{jk}|$ and hence does not create excessive bias when the wavelet coefficients are large. AF (2001), based on a Bayesian argument, recommended the value $a = 3.7$.
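Both rules transcribe directly to NumPy; a sketch (the `where` guard against division by zero is an implementation detail, not part of the formulas):

```python
import numpy as np

def garrote_threshold(d, lam):
    d = np.asarray(d, dtype=float)
    safe = np.where(d == 0.0, 1.0, d)          # dummy value; masked out below
    return np.where(np.abs(d) > lam, d - lam**2 / safe, 0.0)

def scad_threshold(d, lam, a=3.7):
    d = np.asarray(d, dtype=float)
    ad = np.abs(d)
    soft = np.sign(d) * np.maximum(ad - lam, 0.0)             # |d| <= 2*lam
    mid = ((a - 1.0) * d - a * lam * np.sign(d)) / (a - 2.0)  # 2*lam < |d| <= a*lam
    return np.where(ad <= 2.0 * lam, soft, np.where(ad <= a * lam, mid, d))
```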

SLIDE 14

Standard thresholding functions δλ

[Figure: plots of the four rules: Hard (1994), Soft (1994), NNG (1998), SCAD (2001).]

Hard: high variance due to the discontinuities at $\pm\lambda$. Soft: oversmoothing (important bias due to the constant attenuation). NNG, SCAD: a compromise between hard and soft.

SLIDE 15

Wavelet shrinkage and nonlinear diffusion

Nonlinear diffusion filtering and wavelet shrinkage are methods that serve the same purpose, namely discontinuity-preserving denoising. One drawback of the DWT is that the coefficients of the discretized signal are not circularly shift equivariant, so that circularly shifting the observed series by some amount will not circularly shift the discrete wavelet transform coefficients by the same amount, which seriously degrades the quality of the denoising achieved. The idea of denoising via cycle spinning is to apply denoising not only to y, but also to all possible unique circularly shifted versions of y, and to average the results.

SLIDE 16

Translation invariant Haar wavelet shrinkage

We can now exhibit a general connection between translation-invariant Haar wavelet shrinkage and a discretized version of a nonlinear diffusion. The scaling and wavelet filters $h$ and $\tilde h$ corresponding to the Haar transform are

$$h = \tfrac{1}{\sqrt2}(\dots, 0, 1, 1, 0, \dots), \qquad \tilde h = \tfrac{1}{\sqrt2}(\dots, 0, -1, 1, 0, \dots).$$

Given a discrete signal $f = (f_k)_{k\in\mathbb{Z}}$, a shift-invariant soft wavelet shrinkage of $f$ on a single-level decomposition with the Haar wavelet creates a filtered signal $u = (u_k)_{k\in\mathbb{Z}}$ given by

$$u_k = \tfrac14\big(f_{k-1} + 2f_k + f_{k+1}\big) + \tfrac{1}{2\sqrt2}\left[-\delta^S_\lambda\!\Big(\tfrac{f_{k+1}-f_k}{\sqrt2}\Big) + \delta^S_\lambda\!\Big(\tfrac{f_k-f_{k-1}}{\sqrt2}\Big)\right],$$

where $\delta^S_\lambda$ denotes the soft shrinkage operator with threshold $\lambda$.
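A minimal sketch of this single-level, translation-invariant Haar soft shrinkage, assuming periodic boundary conditions (the slides do not specify the boundary handling):

```python
import numpy as np

def soft(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ti_haar_soft_shrinkage(f, lam):
    """One step of shift-invariant single-level Haar soft shrinkage."""
    fm, fp = np.roll(f, 1), np.roll(f, -1)     # f_{k-1} and f_{k+1}, periodic
    s = np.sqrt(2.0)
    return ((fm + 2.0 * f + fp) / 4.0
            + (soft((f - fm) / s, lam) - soft((fp - f) / s, lam)) / (2.0 * s))
```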

SLIDE 17

Diffusion

Because the Haar wavelet filters are simple difference filters (a finite-difference approximation of derivatives), the above rule looks like a discretized version of a differential equation:

$$u_k = f_k + \frac{f_{k+1}-f_k}{4} - \frac{f_k-f_{k-1}}{4} + \frac{1}{2\sqrt2}\left[-\delta^S_\lambda\!\Big(\tfrac{f_{k+1}-f_k}{\sqrt2}\Big) + \delta^S_\lambda\!\Big(\tfrac{f_k-f_{k-1}}{\sqrt2}\Big)\right]$$
$$= f_k + \left[\frac{f_{k+1}-f_k}{4} - \frac{1}{2\sqrt2}\,\delta^S_\lambda\!\Big(\tfrac{f_{k+1}-f_k}{\sqrt2}\Big)\right] - \left[\frac{f_k-f_{k-1}}{4} - \frac{1}{2\sqrt2}\,\delta^S_\lambda\!\Big(\tfrac{f_k-f_{k-1}}{\sqrt2}\Big)\right].$$
SLIDE 18

Rewriting, we obtain

$$\frac{u_k - f_k}{\Delta t} = (f_{k+1}-f_k)\,g\big(|f_{k+1}-f_k|\big) - (f_k-f_{k-1})\,g\big(|f_k-f_{k-1}|\big),$$

with a function $g$ and a time step size $\Delta t$ defined by

$$\Delta t\, g(|s|) = \frac14 - \frac{1}{2\sqrt2\,|s|}\,\delta^S_\lambda\!\Big(\tfrac{|s|}{\sqrt2}\Big).$$

The above appears as the first iteration of an explicit (Euler forward) scheme for a nonlinear diffusion filter with initial state $f$, time step size $\Delta t$ and spatial step size 1. Therefore the shrinkage rule corresponds to a discretization of the differential equation $\partial_t u = \partial_x\big((\partial_x u)\,g(|\partial_x u|)\big)$, with initial condition $u(0) = f$. This equation is a 1-D variant of the Perona-Malik diffusion equation, well known in image processing, and the function $g$ is called the diffusivity.

SLIDE 19

Nonlinear diffusion filtering

In the 1-D case the basic idea is to obtain a family $u(x, t)$ of filtered versions of a continuous signal $f$ as the solution of the diffusion process stated in the previous equation, with $f$ as initial condition, $u(x, 0) = f(x)$, and reflecting boundary conditions. The diffusivity $g$ controls the speed of diffusion depending on the magnitude of the gradient. Usually, $g$ is chosen such that it is equal to one for small magnitudes of the gradient and goes down to zero for large gradients. Hence the diffusion stops at positions where the gradient is large; these areas are considered as singularities of the signal.

SLIDE 20

A connection with shrinkage

The following proposition relates properties of shrinkage functions and diffusivities; it is an easy consequence of the relation between $g$ and $\delta_\lambda$. We state it for $\Delta t = 1/4$, a common and widely used choice for the Perona-Malik equation. Let $\Delta t = 1/4$. Then the diffusivity and the shrinkage function are related through

$$g(|x|) = 1 - \frac{\sqrt2}{|x|}\,\delta_\lambda\!\Big(\tfrac{|x|}{\sqrt2}\Big).$$
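This correspondence is easy to evaluate numerically; a sketch, where `delta` stands for any shrinkage rule with signature `delta(x, lam)` and the guard at the origin is an implementation assumption:

```python
import numpy as np

def diffusivity_from_shrinkage(delta, x, lam):
    """g(|x|) = 1 - sqrt(2)/|x| * delta_lambda(|x|/sqrt(2)), valid for dt = 1/4."""
    ax = np.abs(np.asarray(x, dtype=float))
    ax = np.where(ax == 0.0, np.finfo(float).tiny, ax)   # avoid division by zero
    return 1.0 - np.sqrt(2.0) / ax * delta(ax / np.sqrt(2.0), lam)
```

With the soft rule this reproduces the stabilized total variation diffusivity listed on slide 22.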
SLIDE 21

Properties

The following properties hold:

  • 1. If $\delta_\lambda$ performs shrinkage then the diffusion is always forward, i.e. $\delta_\lambda(|x|) \le |x| \iff g(x) \ge 0$.
  • 2. If $\delta_\lambda$ is differentiable at 0 then $g(x) \to 1$ as $x \to 0 \iff \delta_\lambda(0) = 0$ and $\delta'_\lambda(0) = 0$.
  • 3. If the diffusion stops for large gradients then the shrinkage function has linear growth at infinity, i.e. $g(x) \to 0$ as $x \to \infty \iff \delta_\lambda(x)/x \to 1$ as $x \to \infty$.

SLIDE 22

Examples

We choose $\Delta t = 1/4$ and derive the corresponding diffusivities by plugging in specific shrinkage functions.

Linear shrinkage. A linear shrinkage rule, producing linear wavelet denoising, is given by $\delta_\lambda(x) = \frac{x}{1+\lambda}$. The corresponding diffusivity is constant, $g(|x|) = \frac{\lambda}{1+\lambda}$, and the diffusion is linear.

Soft shrinkage. The soft shrinkage function $\delta_\lambda(x) = \operatorname{sign}(x)(|x|-\lambda)_+$ gives

$$g(|x|) = 1 - \frac{(|x| - \sqrt2\,\lambda)_+}{|x|},$$

which is a stabilized total variation diffusivity.

Hard shrinkage. The hard shrinkage function $\delta_\lambda(x) = x\,\big(1 - I_{\{|x|\le\lambda\}}(x)\big)$ leads to

$$g(|x|) = I_{\{|x|\le\sqrt2\,\lambda\}}(|x|),$$

which is a piecewise linear diffusion that degenerates for large gradients.

SLIDE 23

Garrote shrinkage. The nonnegative garrote shrinkage $\delta_\lambda(x) = \big(x - \tfrac{\lambda^2}{x}\big)\big(1 - I_{\{|x|\le\lambda\}}(x)\big)$ leads to a stabilized unbounded BFB diffusivity (Keeling and Stollberger (2002)) given by

$$g(|x|) = I_{\{|x|\le\sqrt2\,\lambda\}}(|x|) + \frac{2\lambda^2}{x^2}\, I_{\{|x|>\sqrt2\,\lambda\}}(|x|).$$

Firm shrinkage. Firm shrinkage yields a diffusivity that degenerates to 0 for sufficiently large gradients:

$$g(|x|) = \begin{cases} 1 & \text{if } |x| \le \sqrt2\,\lambda_1,\\[4pt] \dfrac{\lambda_1}{\lambda_2-\lambda_1}\Big(\dfrac{\sqrt2\,\lambda_2}{|x|} - 1\Big) & \text{if } \sqrt2\,\lambda_1 < |x| \le \sqrt2\,\lambda_2,\\[4pt] 0 & \text{if } |x| > \sqrt2\,\lambda_2.\end{cases}$$

SCAD shrinkage. SCAD shrinkage also gives a diffusivity that

SLIDE 24

degenerates to 0:

$$g(|x|) = \begin{cases} 1 & \text{if } |x| \le \sqrt2\,\lambda,\\[4pt] \dfrac{\sqrt2\,\lambda}{|x|} & \text{if } \sqrt2\,\lambda < |x| \le 2\sqrt2\,\lambda,\\[4pt] \dfrac{a\sqrt2\,\lambda}{(a-2)\,|x|} - \dfrac{1}{a-2} & \text{if } 2\sqrt2\,\lambda < |x| \le a\sqrt2\,\lambda,\\[4pt] 0 & \text{if } |x| > a\sqrt2\,\lambda.\end{cases}$$

SLIDE 25

Examples ...

Shrinkage functions (top) and corresponding diffusivities (bottom). Plotted for λ = 1, λ1 = 1, λ2 = 2 (Firm) and a = 3.7 (Scad). The dashed line is the diagonal.

SLIDE 26

From diffusion to shrinkage

Conversely, one can ask what the shrinkage functions corresponding to famous diffusivities look like. Expressed in terms of $g$, the shrinkage function is $\delta_\lambda(|x|) = |x|\,\big(1 - g(\sqrt2\,|x|)\big)$, and the dependence of the shrinkage function on the threshold parameter $\lambda$ is naturally fulfilled because diffusivities usually involve a parameter too. This remark leads to new shrinkage functions.

SLIDE 27

New shrinkage rules

Charbonnier diffusivity. The Charbonnier diffusivity (Charbonnier et al. (1994)) is given by $g(|x|) = \big(1 + \tfrac{x^2}{\lambda^2}\big)^{-1/2}$ and corresponds to the shrinkage function

$$\delta_\lambda(x) = x\left(1 - \frac{\lambda}{\sqrt{\lambda^2 + 2x^2}}\right).$$

Perona-Malik diffusivity. The Perona-Malik diffusivity (Perona and Malik (1990)) is defined by $g(|x|) = \big(1 + \tfrac{x^2}{\lambda^2}\big)^{-1}$ and leads to the shrinkage function

$$\delta_\lambda(x) = \frac{2x^3}{2x^2 + \lambda^2}.$$

Weickert diffusivity. Weickert (1998) introduced the diffusivity $g(|x|) = I_{\{|x|>0\}}(x)\Big(1 - \exp\big(-\tfrac{3.31488}{(|x|/\lambda)^8}\big)\Big)$, which leads to the shrinkage function

$$\delta_\lambda(x) = x\,\exp\left(-\,0.20718\,\frac{\lambda^8}{x^8}\right).$$
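These derived rules are immediate to implement; a NumPy sketch of the Charbonnier and Perona-Malik shrinkage functions above:

```python
import numpy as np

def charbonnier_shrink(x, lam):
    # delta_lambda(x) = x * (1 - lam / sqrt(lam^2 + 2 x^2))
    return x * (1.0 - lam / np.sqrt(lam**2 + 2.0 * x**2))

def perona_malik_shrink(x, lam):
    # delta_lambda(x) = 2 x^3 / (2 x^2 + lam^2)
    return 2.0 * x**3 / (2.0 * x**2 + lam**2)
```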
SLIDE 28

Classical diffusivities

“Classical” diffusivities (top) and corresponding shrinkage functions (bottom)

SLIDE 29

Motivation for shrinkage

We have developed the connection between diffusivities and shrinkage functions. It is well known that shrinkage methods perform very well (asymptotic optimality, shown by Donoho and Johnstone). But why do they work so well? Is there a mathematical motivation for shrinkage methods? They can all be interpreted as cases of a broad class of penalized least squares estimators. This unified treatment and the general results of AF on penalized wavelet estimators allow a systematic derivation of oracle inequalities and minimax properties for a large class of wavelet estimators.

SLIDE 30

Penalized least-squares wavelet estimators

The traditional regularization problem can be formulated in the wavelet domain by finding the minimizer in $\theta$ of

$$\ell(\theta) = \|Wy - \theta\|_n^2 + 2\lambda \sum_{i>i_0} p(|\theta_i|),$$

where $\theta$ is the vector of the wavelet coefficients of the unknown regression function $g$, $p$ is a given penalty function, and $i_0$ is a given integer corresponding to penalizing wavelet coefficients above a certain resolution level $j_0$. To facilitate the presentation we have re-indexed the double array $d_{j,k}$ as a single array $\theta_i$, and we will write $p_\lambda$ for the penalty function $\lambda p$ in what follows.

SLIDE 31

Separable penalized least-squares

With the choice of an additive penalty $\sum_{i>i_0} p(|\theta_i|)$, the minimization problem becomes separable, i.e. it is equivalent to minimizing

$$\ell(\theta_i) = (z_i - \theta_i)^2 + 2\lambda\, p(|\theta_i|)$$

for each coordinate $i$ larger than $i_0$. Therefore the estimate of any coordinate $\theta_i$ depends solely on the empirical wavelet coefficient $z_i$. The performance of the resulting wavelet estimator depends on the penalty and on the regularization parameter $\lambda$.

SLIDE 32

Conditions on p

Usually, $p$ is chosen to be symmetric and increasing on $[0, +\infty)$. AF provide some insights into how to choose a penalty function. A good penalty function should result in:

  • unbiasedness: no over-penalization of large coefficients, to avoid unnecessary modeling biases;
  • sparsity: insignificant coefficients should be set to zero, to reduce model complexity;
  • stability: continuity of the penalty, to avoid instability and large variability in model prediction.

We will now show how to derive the penalties corresponding to the thresholding rules defined previously, and check that almost all of them satisfy these conditions.

SLIDE 33

Shrinkage functions and penalties

Let $\delta_\lambda : \mathbb{R} \to \mathbb{R}$ be a thresholding function that is increasing and antisymmetric, with $0 \le \delta_\lambda(x) \le x$ for $x \ge 0$ and $\delta_\lambda(x) \to \infty$ as $x \to \infty$. Then there exists a continuous positive penalty function $p_\lambda$, with $p_\lambda(x) \le p_\lambda(y)$ whenever $|x| \le |y|$, such that $\delta_\lambda(z)$ is the unique solution of the minimization problem $\min_\theta (z-\theta)^2 + 2p_\lambda(|\theta|)$ for every $z$ at which $\delta_\lambda$ is continuous. From the proof of this result one gets an almost analytical expression for $p_\lambda$: denoting by $r_\lambda$ the generalized inverse of $\delta_\lambda$, defined by $r_\lambda(x) = \sup\{z \,|\, \delta_\lambda(z) \le x\}$, one gets that, for any $z > 0$,

$$p_\lambda(z) = \int_0^z \big(r_\lambda(u) - u\big)\,du.$$
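The penalty can be recovered numerically from any thresholding rule by discretizing the generalized inverse and the integral; a rough sketch (the grid bounds and resolution are arbitrary assumptions):

```python
import numpy as np

def penalty_from_threshold(delta, z, lam, num=2000):
    """p_lambda(z) = integral_0^z (r_lambda(u) - u) du, where
    r_lambda(u) = sup{t : delta_lambda(t) <= u} is the generalized inverse."""
    t = np.linspace(0.0, 10.0 * (abs(z) + lam), num)   # search grid for the sup
    vals = delta(t, lam)
    u = np.linspace(0.0, abs(z), num)
    r = np.array([t[vals <= ui].max() for ui in u])    # generalized inverse r(u)
    return float(np.sum(r - u) * (u[1] - u[0]))        # Riemann sum for the integral
```

For soft thresholding this returns approximately $\lambda z$, i.e. the $L_1$ penalty of the next slide.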

SLIDE 34

Penalties and thresholding

We find, in particular, the well-known ridge regression $L_2$ penalty $p_\lambda(|\theta|) = \lambda|\theta|^2$ corresponding to the linear shrinkage function, the $L_1$ penalty $p_\lambda(|\theta|) = \lambda|\theta|$ corresponding to the soft thresholding rule, and the hard thresholding penalty function $p_\lambda(|\theta|) = \lambda^2 - (|\theta| - \lambda)^2\, I_{\{|\theta|<\lambda\}}(|\theta|)$ that results in the hard thresholding rule.

SLIDE 35

Penalties

Penalties corresponding to the shrinkage and thresholding functions with the same name

SLIDE 36

Remarks

The quadratic penalty, while continuous, is not singular at zero, and the resulting estimator is not thresholded. All other penalties are singular at zero, thus resulting in thresholding rules that enforce sparseness of the solution. The hard thresholding penalty is not continuous at the threshold, so it may induce oscillation of the reconstructed signal (lack of stability). For soft thresholding, the resulting estimator of large coefficients is shifted by an amount $\lambda$ (unnecessary bias when the coefficients are large); the same holds for the Charbonnier and Perona-Malik penalties. All remaining penalties are singular at zero (encourage sparse solutions), continuous (stable) and do not create excessive bias when the wavelet coefficients are large.

SLIDE 37

Properties

Most importantly, all these other penalties satisfy the conditions of Theorem 1 in AF (2001). The implication of this fact is a systematic derivation of oracle inequalities and minimax properties for the resulting wavelet estimators via Theorem 2 of AF. In particular, the optimal hard and soft universal threshold $\lambda = \sigma\sqrt{2\log n}$ given by Donoho and Johnstone (1994) leads to a sharp asymptotic risk upper bound, and the resulting penalized estimators are adaptively minimax, within a factor of logarithmic order, over a wide range of Besov spaces.

SLIDE 38

And the nonequispaced case?

A first possible approach: assume that $t_i = n_i/2^J$ for some $n_i$ and some resolution $J$. Parameter: let $f$ be the underlying regression function collected at all dyadic points $\{i/2^J,\ i = 1, \dots, 2^J\}$. Apply the wavelet transform to $f$: $\theta = Wf$ and $f = W^T\theta$, to get an overparametrized linear model: $Y_n = A\theta + \varepsilon$.

SLIDE 39

The wavelet basis on which $f$ is projected is chosen by fixing the resolution $J$, and is truncated by retaining the rows of $A$. The estimate of $\theta$, and therefore of $f$, is recovered by penalized least squares:

$$\min_\theta\ \tfrac12\|Y_n - A\theta\|^2 + \sum_{i\in I_N} p_\lambda(|\theta_i|).$$

The penalty function $p_\lambda$ is nonconvex and irregular at zero. Computational challenge: with irregular designs, the matrix $A$ is no longer orthonormal, and this is a linear regression problem with a number $p$ of unknown parameters much larger than the number $n$ of observations.
SLIDE 40

A linear regression model

We therefore switch to the problem of obtaining a reasonable estimate of an unknown vector of parameters $\beta$ given a vector $Y$ of measurements,

$$\underset{n\times1}{Y} = \underset{n\times p}{X}\ \underset{p\times1}{\beta} + \underset{n\times1}{\epsilon},$$

where $X$ is a known predictor matrix and $\epsilon$ is a (Gaussian) noise vector with covariance $\sigma^2 I_n$. Typically the number $p$ of unknown parameters is much larger than the number $n$ of observations.

SLIDE 41

Shrinkage, thresholding and complexity

Regularize the solution by minimizing a penalized loss function:

$$\min_{\beta\in\mathbb{R}^p}\ \|Y - X\beta\|^2 + \lambda T(\beta) \quad\Longleftrightarrow\quad \min_{\beta\in\mathbb{R}^p}\ \|Y - X\beta\|^2 \ \text{ subject to } T(\beta) \le t.$$

This is penalized or constrained least squares. The penalty term is usually chosen to encourage sparsity in the optimal $\beta$, while the regularization parameter $\lambda$ (or $t$) is connected to the complexity of the fitted model. One often needs to solve for multiple values of $\lambda$, e.g. to adjust sparsity to some desired level or to perform cross-validation.

SLIDE 42

Least absolute shrinkage and selection operator

LASSO (Tibshirani, 1996; Chen, Donoho & Saunders, 1999 (basis pursuit); Donoho et al., 2002-2004). For appropriate values of $\lambda$ (or $t$, or $\epsilon$) solve one of the following equivalent optimization problems:

$$\min_{\beta\in\mathbb{R}^p}\ \|Y - X\beta\|^2 + \lambda\|\beta\|_1$$
$$\min_{\beta\in\mathbb{R}^p}\ \|Y - X\beta\|^2 \ \text{ subject to } \|\beta\|_1 \le t$$
$$\min\ \|\beta\|_1 \ \text{ subject to } \|Y - X\beta\|^2 \le \epsilon.$$

SLIDE 43

Lasso and thresholding

We have already seen the simple model case with $n = p$ and $X$ orthonormal: $X^TX = I_p$ (the wavelet denoising case). In this case, the LASSO selector is given by the soft thresholding formula

$$\hat\beta^{\,soft}_j = \begin{cases} Z_j - \lambda & \text{if } Z_j \ge \lambda,\\ 0 & \text{if } |Z_j| < \lambda,\\ Z_j + \lambda & \text{if } Z_j \le -\lambda,\end{cases} \qquad \text{with } Z_j = (X^TY)_j.$$

The MSE for this selector is roughly $\lambda^2 + \sum_{j=1}^p \min(|Z_j|^2, \lambda^2)$, and this is basically the best possible among all selectors in this model.

SLIDE 44

MM algorithm for optimization

We have seen that optimizing the penalized loss function

$$\min_{\beta\in\mathbb{R}^p}\ \|Y - X\beta\|^2 + \lambda T(\beta)$$

with $T(\beta) = \|\beta\|_1$ leads to the LASSO selector, which can be easily calculated by soft thresholding when $X$ is orthogonal, as in wavelet denoising. If we concentrate on orthogonal design matrices, the $\ell_1$ penalty is far from the only choice: as seen before, several other penalties lead to good denoising procedures. To retain the separability of the optimization problem when the penalty is separable (univariate minimization), and to keep easy optimization via thresholding and shrinkage in the general case, we are going to use an MM algorithm.

SLIDE 45

A short tutorial on the class of MM algorithms

Goal: solve a difficult minimization problem, like minimizing the function shown here in black.

SLIDE 46-49

What is an MM algorithm?

[Figure: the objective f shown in black, with successive iterates x0, f(x0), x1, f(x1), ...]

Choose a starting point x0. Construct a majorizing function of f(x) at x0. Minimize the majorizer (at x1). Repeat.

So “MM” stands for “Majorize-Minimize”.

SLIDE 50

Numerical analysis

MM is merely a new name for an old technique; the idea for these algorithms dates back at least as far as Ortega and Rheinboldt (1970). Statisticians have been applying it to various problems for about 30 years:

  • Multidimensional scaling (de Leeuw and Heiser; Groenen)
  • Robust regression (Schlossmacher; Huber)
  • Least squares estimation (Bijleveld and de Leeuw; Kiers and Ten Berge)
  • Quadratic lower bound principle (Böhning and Lindsay)
  • Medical imaging (Lange and Fessler; De Pierro)

There are also some surveys of the general method.

SLIDE 51

Numerical analysis

Lange, Hunter and Yang (2000) used the term “optimization transfer” for a while but ultimately settled on “MM”, which works for both minimization and maximization. A successful MM algorithm substitutes a simple optimization problem for a difficult one; iteration is the price to pay for simplifying the original problem.

SLIDE 52

Optimization by an MM algorithm

Let $\theta^{(m)}$ represent a fixed value of the parameter $\theta$, and let $Q(\theta\,|\,\theta^{(m)})$ denote a real-valued function of $\theta$ whose form depends on $\theta^{(m)}$. The function $Q(\theta\,|\,\theta^{(m)})$ is said to majorize a real-valued function $S(\theta)$ at the point $\theta^{(m)}$ provided that

$$Q(\theta\,|\,\theta^{(m)}) \ge S(\theta) \quad\text{for all } \theta, \qquad (1)$$
$$Q(\theta^{(m)}\,|\,\theta^{(m)}) = S(\theta^{(m)}). \qquad (2)$$

SLIDE 53

The surface $\theta \mapsto Q(\theta\,|\,\theta^{(m)})$ lies above the surface $S(\theta)$ and is tangent to it at the point $\theta = \theta^{(m)}$. Ordinarily, $\theta^{(m)}$ represents the current iterate in a search for the minimum of $S(\theta)$. In a majorize-minimize MM algorithm, one minimizes the majorizing function $Q(\theta\,|\,\theta^{(m)})$ rather than the actual function $S(\theta)$:

$$\text{MM algorithm} \iff \theta^{(m+1)} = \operatorname*{argmin}_\theta\, Q(\theta\,|\,\theta^{(m)}).$$

SLIDE 54

Monotonicity

If $\theta^{(m+1)}$ is a minimizer of $Q(\theta\,|\,\theta^{(m)})$ then the MM algorithm forces the actual function $S(\theta)$ downhill. Indeed, the inequality

$$S(\theta^{(m+1)}) = Q(\theta^{(m+1)}\,|\,\theta^{(m)}) + S(\theta^{(m+1)}) - Q(\theta^{(m+1)}\,|\,\theta^{(m)}) \le Q(\theta^{(m)}\,|\,\theta^{(m)}) + S(\theta^{(m)}) - Q(\theta^{(m)}\,|\,\theta^{(m)}) = S(\theta^{(m)})$$

follows directly from the fact that $Q(\theta^{(m+1)}\,|\,\theta^{(m)}) \le Q(\theta^{(m)}\,|\,\theta^{(m)})$ and definitions (1) and (2).

SLIDE 55

Return to penalized least squares

Recall that we want to minimize the penalized loss function $R_\lambda(\beta)$:

$$\min_{\beta\in\mathbb{R}^p}\ \tfrac12\|Y - X\beta\|_2^2 + \lambda T(\beta),$$

with $T(\beta)$ one of the separable penalties associated with “nice” thresholding functions. Denote by $S_\lambda(\beta)$ the above penalized loss function and pick a constant $c > 0$ such that $\lambda_{\max}(X^TX) \le c$, so that $cI_p - X^TX$ is positive semi-definite. Since $X$ can be rescaled, assume that $c = 1$. Define

$$\Xi(\beta\,|\,\gamma) = \tfrac12\|\beta-\gamma\|_2^2 - \tfrac12\|X(\beta-\gamma)\|_2^2, \qquad (3)$$

which depends on an auxiliary $p$-dimensional vector $\gamma$.

SLIDE 56

Constructing a majorizing function

Since $I_p - X^TX$ is positive semi-definite, the functional $\Xi$ defined in (3) is convex in $\beta$ for any choice of $\gamma$, nonnegative, and vanishes at $\beta = \gamma$. Therefore, adding $\Xi(\beta\,|\,\gamma)$ to $S_\lambda(\beta)$ creates a majorizing function for $S_\lambda(\beta)$:

$$S^{\mathrm{sur}}_\lambda(\beta\,|\,\gamma) = \tfrac12\|Y - X\beta\|_2^2 + \lambda T(\beta) + \Xi(\beta\,|\,\gamma) = \tfrac12\|Y - X\beta\|_2^2 + \lambda T(\beta) + \tfrac12\big\langle (I_p - \Sigma)(\beta - \gamma),\ (\beta - \gamma)\big\rangle,$$

where $\langle x, w\rangle = x^Tw$ and $\Sigma = X^TX$.

SLIDE 57

Apply the MM methodology

Approach the minimizer of $S_\lambda(\beta)$ by the following iterative process: starting from an arbitrarily chosen $\beta^{(0)}$, determine the minimizer of $S^{\mathrm{sur}}_\lambda(\beta\,|\,\gamma)$ for $\gamma = \beta^{(0)}$; each successive iterate $\beta^{(n)}$ is then the minimizer of the surrogate functional anchored at the previous iterate, i.e. $\gamma = \beta^{(n-1)}$. The iterative algorithm goes as follows:

$$\beta^{(0)} \ \text{arbitrary}; \qquad \beta^{(m)} = \operatorname*{argmin}_\beta\ S^{\mathrm{sur}}_\lambda(\beta\,|\,\beta^{(m-1)}).$$

Under reasonable conditions on $X$, and for most of the “nice” penalties $T(\beta)$ reviewed before, the algorithm converges.

SLIDE 58

Calculus with particular penalties

Suppose first that $\lambda = 0$ in $S_\lambda(\beta)$ (no penalization). Then

$$S^{\mathrm{sur}}_0(\beta\,|\,\gamma) = \tfrac12\|\beta\|_2^2 - \big\langle \beta,\ (I - \Sigma)\gamma + X^Ty\big\rangle + \tfrac12\|y\|_2^2 + \tfrac12\|\gamma\|_2^2 - \tfrac12\|X\gamma\|_2^2.$$

Given the current anchor $\gamma$, minimizing this expression with respect to $\beta$ is equivalent to minimizing

$$\tfrac12\big\|\beta - \big((I - \Sigma)\gamma + X^Ty\big)\big\|_2^2,$$

which leads, using $\gamma = \beta^{(n)}$, to the solution

$$\beta^{(n+1)} = \beta^{(n)} + X^T\big(y - X\beta^{(n)}\big),$$

known as the Landweber iterative method.
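A minimal sketch of the Landweber iteration, assuming $X$ has already been rescaled so that $\lambda_{\max}(X^TX) \le 1$ as above:

```python
import numpy as np

def landweber(X, y, n_iter=200):
    """beta <- beta + X^T (y - X beta), starting from zero."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta = beta + X.T @ (y - X @ beta)
    return beta
```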

SLIDE 59

Ridge regression (Tikhonov regularization)

Suppose now that $T(\beta) = \|\beta\|_2^2$ in $S_\lambda(\beta)$. A calculus similar to that of the previous slide leads to the following iterative procedure for finding the minimum:

$$\beta^{(n+1)} = \frac{1}{\lambda + 1}\Big(\beta^{(n)} + X^T\big(y - X\beta^{(n)}\big)\Big),$$

known as the damped Landweber iterative method. In both cases, under reasonable conditions on $X$, the sequence $\beta^{(n)}$ converges to a generalized solution of the minimization problem.

SLIDE 60

Iterative shrinkage thresholding

Using the same arguments, but for a penalty $T(\beta)$ associated with a particular thresholding function $\delta_\lambda$, minimizing the functional $S^{\mathrm{sur}}_\lambda(\beta\,|\,\gamma)$ with anchor $\gamma$ is equivalent to minimizing

$$\tfrac12\big\|\beta - \big((I - \Sigma)\gamma + X^Ty\big)\big\|_2^2 + \lambda T(\beta),$$

and leads to

$$\beta^{(n+1)} = \delta_\lambda\Big(\beta^{(n)} + X^T\big(y - X\beta^{(n)}\big)\Big), \qquad (4)$$

known to belong to the class of iterative thresholding algorithms (when $T(\beta)$ is an $\ell_p$ penalty, $0 < p < \infty$). These usually converge; see e.g. Daubechies, Defrise & De Mol (2004), Combettes & Wajs (2005) and Bredies, Lorenz & Maass (2005).
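Iteration (4) is a few lines of NumPy once a thresholding rule is chosen; with the `soft_threshold` sketch from earlier it becomes the classical iterative soft thresholding algorithm of Daubechies, Defrise & De Mol. $X$ is again assumed rescaled so that $\lambda_{\max}(X^TX) \le 1$.

```python
import numpy as np

def iterative_thresholding(X, y, lam, threshold, n_iter=500):
    """Iteration (4): beta <- delta_lambda(beta + X^T (y - X beta)).
    `threshold(v, lam)` may be the soft, hard, garrote or SCAD rule."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta = threshold(beta + X.T @ (y - X @ beta), lam)
    return beta

# e.g. beta_hat = iterative_thresholding(X, y, lam, soft_threshold)
```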

SLIDE 61

For particular inverse problems such algorithms have been studied in the recent literature by many authors, especially in sparse regularization and compressed sensing. For convex penalties T(β):

  • IST as expectation-maximization (Figueiredo and Nowak, 2001, 2003)
  • IST as majorization-minimization (De Mol, Defrise, 2002; Daubechies, Defrise, De Mol, 2004; Figueiredo, Nowak, Bioucas-Dias, 2005, 2007)

Other authors independently proposed IST-like schemes for signal/image recovery: Starck, Nguyen and Murtagh (2003); Starck, Candès and Donoho (2003); Bect, Blanc-Féraud, Aubert and Chambolle (2004); Tropp, Donoho and others (2005); Candès (2006); Elad, Matalon and Zibulevsky (2006); Hale, Yin and Zhang (2007), ....

SLIDE 62

Summary

Consider a thresholding function $\delta_\lambda(\cdot)$ satisfying:
a) $\delta_\lambda(\cdot)$ is an odd function;
b) $\delta_\lambda(\cdot)$ is a shrinkage rule ($0 \le \delta^+_\lambda(t) \le t$ for all $t \ge 0$);
c) $\delta^+_\lambda$ is nondecreasing and coercive.
Most often $\delta_\lambda$ thresholds, i.e. $\delta^+_\lambda(t) = 0$ for $0 \le t \le \tau$ for some $\tau \ge 0$. Then (Antoniadis, 2007) a penalty can be defined with the following 3-step procedure:

  • 1. Define, for $u \ge 0$, $\delta^{-1}_\lambda(u) = \sup\{t;\ \delta_\lambda(t) \le u\}$ and $\delta^{-1}_\lambda(-u) = -\delta^{-1}_\lambda(u)$.
  • 2. Set $r_\lambda(u) = \delta^{-1}_\lambda(u) - u$ for all $u$.
  • 3. Put $P_\lambda(\theta) = \int_0^{|\theta|} r_\lambda(u)\,du$.

SLIDE 63

Summary

Then (Antoniadis, 2007) the minimization problem

$$\min_\theta\ (t - \theta)^2/2 + P_\lambda(\theta)$$

has the unique optimal solution $\hat\theta = \delta_\lambda(t)$ for any $t$ at which $\delta_\lambda(\cdot)$ is continuous. One may therefore come back to the original minimization problem using iterative thresholding procedures with such thresholding functions $\delta_\lambda$. For example, with soft thresholding one obtains the iterative thresholding algorithm of DDD (2004); if $\delta_\lambda$ is hard thresholding, one uses the $\ell_0$ penalty and an algorithm by Tropp or Elad, ....

SLIDE 64

Convergence

If $p < n$ and $\Sigma$ is not singular, the iterative Landweber mapping is a contraction and the sequence of iterates $\beta^{(n)}$ converges to a stationary point of the function we want to minimize. But what about the case $p > n$ with $\Sigma$ singular? DDD (2004) have shown that for soft thresholding the algorithm converges, mainly because the Landweber iteration operator is nonexpansive, i.e. $\|Tx - Ty\| \le \|x - y\|$. However, most of the thresholding rules one may consider are usually not nonexpansive, and one then needs appropriate conditions on the design matrix $X$ and on the sparsity of $\beta$ (see e.g. Candès and Tao (2007), Foucart (2008), ...).

SLIDE 65

The bounded curvature condition (BCC)

We will say that a penalty $P_\lambda(\beta)$ satisfies the BCC for some positive semi-definite matrix $B$ if, for any $\eta \in \mathbb{R}^p$, one has

$$P_\lambda(\beta + \eta) \ge P_\lambda(\beta) + \langle \eta, r_\lambda\rangle - \tfrac12\,\eta^T B\,\eta,$$

where $r_\lambda = r_\lambda(\beta)$ is computed componentwise. Many thresholding rules of practical interest satisfy the BCC with some $B$: for example, soft thresholding with $B = 0$ (because $\|\beta\|_1$ is convex), hard thresholding with $B = I_p$, SCAD thresholding with $B = I_p/(a-1)$, ....

SLIDE 66

Convergence with BCC

Given the iterations (4), if $\lambda_{\max}(\Sigma) \le \max\big(1,\ 2 - \lambda_{\max}(B)\big)$, then

$$R_\lambda(\beta^{(n)}) \ge R_\lambda(\beta^{(n+1)}). \qquad (5)$$

Moreover, if $\lambda_{\max}(\Sigma) < \max\big(1,\ 2 - \lambda_{\max}(B)\big)$, there exists a constant $C > 0$ (depending only on $X$ and $B$) such that

$$R_\lambda(\beta^{(n)}) - R_\lambda(\beta^{(n+1)}) \ge C\,\big\|\beta^{(n)} - \beta^{(n+1)}\big\|_2^2. \qquad (6)$$

We can therefore use iterative thresholding in the following form:

$$\beta^{(n+1)} = \delta_{\lambda/k_0^2}\Big(\beta^{(n)} + \tfrac{1}{k_0^2}\,X^T\big(y - X\beta^{(n)}\big)\Big),$$

where $k_0 = \lambda_{\max}(X) = \|X\|_2$.
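For an arbitrary design matrix this rescaled iteration reads, in a sketch that computes $k_0$ as the spectral norm of $X$:

```python
import numpy as np

def scaled_iterative_thresholding(X, y, lam, threshold, n_iter=500):
    """beta <- delta_{lam/k0^2}( beta + X^T (y - X beta) / k0^2 )."""
    k2 = np.linalg.norm(X, 2) ** 2        # k0^2 = ||X||_2^2
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta = threshold(beta + X.T @ (y - X @ beta) / k2, lam / k2)
    return beta
```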
SLIDE 67

Some special cases

Suppose that for the iterations (4) one uses:

  • Soft thresholding: if $\lambda_{\max}(X) < \sqrt2$ then (6) holds.
  • Hard thresholding: if $\lambda_{\max}(X) \le 1$ then (5) holds, and if $\lambda_{\max}(X) < 1$ it is (6) that holds.
  • SCAD thresholding: if $\lambda_{\max}(X) < \sqrt{2 - \tfrac{1}{a-1}}$ then (6) holds.

So given any initial point for $\beta$, if one of these conditions holds then the algorithm converges to a fixed point of (4).

SLIDE 68

Optimum

Let $\beta^\star$ be a fixed point of (4) and suppose that $\lambda_{\max}(B) \le 1$. If $\lambda_{\max}(B) \le \lambda_{\min}(\Sigma) \le \lambda_{\max}(\Sigma) \le 2 - \lambda_{\max}(B)$, then $\beta^\star$ is a global minimizer of $R_\lambda(\beta)$. Although this was known for nonconvex penalties in the orthogonal case, the same conclusion holds as long as $X$ is not too far from orthogonality (characterized in terms of $B$). This is related to the RIP condition used in sparse learning (see Candès and Tao (2007) and Foucart (2008)) and is very closely related to, and inspired by, the restricted eigenvalue property (Donoho, Elad and Temlyakov (2006) and Bickel, Ritov and Tsybakov (2007)).

SLIDE 69

Estimation and risk

Assume that the errors are Gaussian, that $p_n = O(n^\xi)$ as $n \to \infty$ for some $\xi > 1$, and that the number of nonzero coefficients $\beta_{0j,n}$ is independent of $n$ and finite (S-sparsity). Then, under the assumptions that all entries of the design matrix are uniformly bounded and that the thresholding function used is sandwiched between the soft and the hard thresholding (see AF), the estimate of $\beta_0$ obtained from (4) is sparsistent and, as long as the S-sparsity remains bounded, it leads to an optimal squared-error bound, up to a $\log p$ factor. The proof relies upon similar results by Bunea, Tsybakov and Wegkamp (2007).

SLIDE 70

Iterative shrinkage thresholding for Generalized Linear Models

Consider now independent observations $Y_1, \dots, Y_n$ where $Y_i$ follows a distribution in the natural exponential family

$$f(y_i; \theta_i) = \exp\big(y_i\theta_i - b(\theta_i) + c(y_i)\big),$$

where $\theta_i$ is the natural parameter. Let $L_i = \log f(y_i; \theta_i)$ and $L = \sum_i L_i$. Clearly $L_i = y_i\theta_i - b(\theta_i) + c(y_i)$, and thus

$$\partial L_i/\partial\theta_i = y_i - b'(\theta_i), \qquad \partial^2 L_i/\partial\theta_i^2 = -b''(\theta_i).$$

It is well known that $E(\partial L_i/\partial\theta_i) = 0$ and $E(\partial L_i/\partial\theta_i)^2 = -E(\partial^2 L_i/\partial\theta_i^2)$ hold in general for the exponential family. Therefore

$$\mu_i := E(y_i) = b'(\theta_i), \qquad \operatorname{var}(y_i) = b''(\theta_i).$$

SLIDE 71

Generalized Linear Models

Let $X = [x_1, x_2, \dots, x_n]^T$ be the model matrix. We will use the canonical link function, that is, the link function $x_i^T\beta = g(\mu_i)$ determined by $g(\mu_i) = \theta_i$. Obviously $g = (b')^{-1}$.

For instance, when $Y_i \sim \mathrm{Bernoulli}(\pi_i)$,

$$f(y_i; \theta_i) = \exp\Big(y_i\log\tfrac{\pi_i}{1-\pi_i} + \log(1-\pi_i)\Big) = \exp\big(y_i\theta_i - \log(1 + e^{\theta_i})\big),$$

for which $\theta_i = \log\tfrac{\pi_i}{1-\pi_i}$, $\mu_i = \pi_i$, $b(t) = \log(1 + e^t)$, and $g(t) = \log\tfrac{t}{1-t}$ (the logit link).

In the Poisson case where $y_i \sim \mathrm{Poi}(\omega_i)$,

$$f(y_i; \theta_i) = \tfrac{1}{y_i!}\,e^{-\omega_i}\omega_i^{y_i} = \exp(y_i\log\omega_i - \omega_i - \log y_i!) = \exp\big(y_i\theta_i - e^{\theta_i} + c(y_i)\big),$$

with $\theta_i = \log\omega_i$, $\mu_i = \omega_i$, $b(t) = e^t$, and $g(t) = \log t$ (the log link).

SLIDE 72

A surrogate function

We consider the penalized GLM problem

$$\min_\beta\ -L(\beta) + P_\lambda(\beta) \ \big(=: F(\beta)\big), \qquad (7)$$

where $L = \sum_{i=1}^n L_i$ and $P_\lambda(\beta) = \sum_{i=1}^p P_\lambda(\beta_i)$ is a (separable) penalty with $\lambda$ as the regularization parameter. We assume again that $\beta$ is sparse, and use (7) for predictive learning.

SLIDE 73

Optimization

Directly tackling (7) for a general penalty can be a difficult task. Use instead

$$G(\beta, \gamma) = -\sum_{i=1}^n L_i(\gamma) + P(\gamma; \lambda) + \tfrac12\|\gamma - \beta\|_2^2 - \sum_{i=1}^n \big(b(x_i^T\gamma) - b(x_i^T\beta)\big) + \sum_{i=1}^n \mu_i(\beta)\big(x_i^T\gamma - x_i^T\beta\big),$$

where $\mu_i(\beta) = g^{-1}(x_i^T\beta) = b'(x_i^T\beta)$.

Given $\beta$, minimizing $G$ over $\gamma$ is equivalent to

$$\min_\gamma\ \tfrac12\big\|\gamma - \big(\beta + X^Ty - X^T\mu(\beta)\big)\big\|_2^2 + P(\gamma; \lambda).$$

This problem is an OLS problem with an orthogonal design.

SLIDE 74

Equivalent optimization

Given $\gamma$, minimizing $G$ over $\beta$ is equivalent to

$$\min_\beta\ \tfrac12\|\gamma - \beta\|_2^2 - \sum_{i=1}^n \Big(b(x_i^T\gamma) - b(x_i^T\beta) - b'(x_i^T\beta)\big(x_i^T\gamma - x_i^T\beta\big)\Big).$$

Taking its derivative with respect to $\beta$ gives

$$\big(I - I(\beta)\big)\,(\beta - \gamma) = 0,$$

where $I(\beta) = X^TWX$ with $W = \mathrm{diag}\big(b''(x_i^T\beta)\big)$; $I(\beta)$ is the observed/expected information matrix $\big[-\partial^2 L(\beta)/\partial\beta_h\,\partial\beta_l\big]$ at $\beta$. Intuitively, the optimal value of $G$ is achieved at $\gamma = \beta$ as long as $X$ is scaled down properly. It is easy to verify that $\min_\beta G(\beta, \beta)$ is equivalent to $\min_\beta F(\beta)$. The advantage of optimizing $G$ instead of $F$ is that, given $\beta$, the problem is orthogonal and separable in $\gamma$, and we can adopt nonconvex penalties.

SLIDE 75

Iterative shrinkage for GLMs

Now, given a thresholding function $\delta$ corresponding to one of the penalties already seen, use MM to obtain the estimates. The iterations simplify to

$$\beta^{(j+1)} = \delta_\lambda\big(\beta^{(j)} + X^Ty - X^T\mu(\beta^{(j)})\big),$$

with $X$ scaled down properly at each iteration according to the weights $\mathrm{diag}\big(W(\beta^{(j)})\big)$. This provides a generalization of iterative shrinkage to any GLM.
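A minimal sketch for logistic regression with the canonical logit link; dividing by $k^2 = \lambda_{\max}(X^TX)/4$ is one way to “scale X down properly”, since $b''(t) \le 1/4$ for the logit link, and the default soft thresholding is an illustrative choice:

```python
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def glm_iterative_shrinkage(X, y, lam, threshold=soft_threshold, n_iter=500):
    """beta <- delta_lambda( beta + X^T (y - mu(beta)) ) for logistic regression."""
    k2 = 0.25 * np.linalg.norm(X, 2) ** 2     # b''(t) <= 1/4 for the logit link
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))  # mu_i(beta) = b'(x_i^T beta)
        beta = threshold(beta + X.T @ (y - mu) / k2, lam / k2)
    return beta
```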

SLIDE 76

THANK YOU!