SLIDE 1

Pathwise Coordinate Optimization for Nonconvex Sparse Learning

Tuo Zhao
http://www.princeton.edu/~tuoz

Department of Computer Science, Johns Hopkins University

Mar. 25, 2015
SLIDE 2

Collaborators

This is joint work with

  • Prof. Han Liu at Princeton University,
  • Prof. Tong Zhang at Rutgers University and Baidu,
  • Xingguo Li at University of Minnesota.

Manuscript: http://arxiv.org/abs/1412.7477
Software Package: http://cran.r-project.org/web/packages/picasso/

SLIDE 3

Outline

  • Background
  • Pathwise Coordinate Optimization
  • Computational and Statistical Theories
  • Numerical Simulations
  • Conclusions

SLIDE 4

Background

SLIDE 5

Regularized M-Estimation

Let β∗ denote the parameter to be estimated. We solve the regularized M-estimation problem

  min_{β ∈ ℝ^d} Fλ(β), where Fλ(β) = L(β) + Rλ(β),

L(β) is a smooth loss function, and Rλ(β) is a regularization function with a tuning parameter λ.

Examples: Lasso and Logistic Lasso (Tibshirani, 1996), Group Lasso (Yuan and Lin, 2006), Graphical Lasso (Yuan and Lin, 2007; Banerjee et al., 2008; Friedman et al., 2008), ...

SLIDE 6

Regularization Functions

Rλ(β) is coordinate separable:

  Rλ(β) = Σ_{j=1}^d rλ(βj).

Rλ(β) is decomposable:

  Rλ(β) = λ‖β‖₁ + Hλ(β) = Σ_{j=1}^d [λ|βj| + hλ(βj)].

Examples: Smoothly Clipped Absolute Deviation (SCAD, Fan and Li, 2001) and Minimax Concave Penalty (MCP, Zhang, 2010).

SLIDE 7

Regularization Functions

For any γ > 2, SCAD is defined as

  rλ(βj) = λ|βj|                                  if |βj| ≤ λ,
           −(|βj|² − 2λγ|βj| + λ²) / (2(γ − 1))   if λ < |βj| ≤ λγ,
           (γ + 1)λ² / 2                          if |βj| > λγ;

  hλ(βj) = 0                                      if |βj| ≤ λ,
           (2λ|βj| − |βj|² − λ²) / (2(γ − 1))     if λ < |βj| ≤ λγ,
           ((γ + 1)λ² − 2λ|βj|) / 2               if |βj| > λγ.

SLIDE 8

Regularization Functions

For any γ > 1, MCP is defined as

  rλ(βj) = λ(|βj| − |βj|² / (2λγ))   if |βj| ≤ λγ,
           λ²γ / 2                   if |βj| > λγ;

  hλ(βj) = −|βj|² / (2γ)             if |βj| ≤ λγ,
           (λ²γ − 2λ|βj|) / 2        if |βj| > λγ.

(A code sketch of both penalties follows below.)
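To make the two penalty definitions concrete, here is a minimal NumPy sketch of rλ and the induced concave part hλ = rλ − λ|·| for SCAD and MCP; the function names and the vectorized layout are my own, the formulas are the ones on slides 7 and 8.

```python
import numpy as np

def scad_penalty(beta, lam, gamma):
    """Elementwise SCAD penalty r_lambda (requires gamma > 2)."""
    b = np.abs(np.asarray(beta, dtype=float))
    mid = (2 * gamma * lam * b - b**2 - lam**2) / (2 * (gamma - 1))
    return np.where(b <= lam, lam * b,
                    np.where(b <= gamma * lam, mid, (gamma + 1) * lam**2 / 2))

def mcp_penalty(beta, lam, gamma):
    """Elementwise MCP penalty r_lambda (requires gamma > 1)."""
    b = np.abs(np.asarray(beta, dtype=float))
    return np.where(b <= gamma * lam,
                    lam * (b - b**2 / (2 * lam * gamma)),
                    gamma * lam**2 / 2)

def concave_part(penalty, beta, lam, gamma):
    """h_lambda from the decomposition r_lambda(b) = lam * |b| + h_lambda(b)."""
    b = np.asarray(beta, dtype=float)
    return penalty(b, lam, gamma) - lam * np.abs(b)
```

Plotting these functions for λ = 1 and γ = 2.01 reproduces the curves in Figure 1 on the next slide.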

SLIDE 9

Regularization Functions

[Figure 1: The penalties rλ(θj) (left) and their concave parts hλ(θj) (right) for SCAD, MCP, and ℓ₁, with λ = 1 and γ = 2.01.]

SLIDE 10

Loss Functions

X ∈ ℝ^{n×d} is the design matrix, y ∈ ℝ^n is the response vector.

Least Squares Loss:

  L(β) = (1 / 2n) ‖y − Xβ‖₂².

Logistic Loss:

  L(β) = (1/n) Σ_{i=1}^n [ log(1 + exp(X_{i∗}ᵀβ)) − y_i X_{i∗}ᵀβ ].

Others: Huber Loss, Multi-category Logistic Loss, ... (a code sketch of the two losses above follows below).
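A short sketch of both losses and their gradients, which the coordinate updates later rely on; the gradients follow by direct differentiation, and the function names are mine.

```python
import numpy as np

def least_squares_loss(beta, X, y):
    """L(beta) = ||y - X beta||_2^2 / (2n)."""
    n = X.shape[0]
    r = y - X @ beta
    return r @ r / (2 * n)

def least_squares_grad(beta, X, y):
    n = X.shape[0]
    return -X.T @ (y - X @ beta) / n

def logistic_loss(beta, X, y):
    """L(beta) = mean_i [ log(1 + exp(x_i' beta)) - y_i x_i' beta ]."""
    z = X @ beta
    return np.mean(np.logaddexp(0.0, z) - y * z)   # stable log(1 + e^z)

def logistic_grad(beta, X, y):
    n = X.shape[0]
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))          # sigmoid(x_i' beta)
    return X.T @ (p - y) / n
```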

SLIDE 11

Reformulation

We rewrite the regularized M-estimation problem as

  min_{β ∈ ℝ^d} Fλ(β), where Fλ(β) = L̃λ(β) + λ‖β‖₁.

  • L̃λ(β) = L(β) + Hλ(β) is smooth but nonconvex.
  • λ‖β‖₁ is nonsmooth but convex.

Remark: This decomposition is amenable to theoretical analysis.

SLIDE 12

Randomized Coordinate Descent Algorithm

At the t-th iteration, we randomly select a coordinate j from the d coordinates. We then take β^{(t+1)}_{\j} ← β^{(t)}_{\j} and update coordinate j by one of the following rules.

Exact Coordinate Minimization (Fu, 1998):

  β_j^{(t+1)} ← argmin_{βj} L̃λ(βj; β^{(t)}_{\j}) + λ|βj|.

Inexact Coordinate Minimization (Shalev-Shwartz, 2011):

  β_j^{(t+1)} ← argmin_{βj} (βj − β_j^{(t)}) ∇_j L̃λ(β^{(t)}) + (L/2)(βj − β_j^{(t)})² + λ|βj|,

where L is the step size parameter. The inexact update has the closed form shown in the sketch below.
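The inexact coordinate minimization above has a closed-form solution: soft thresholding applied to a coordinate gradient step. A minimal sketch, with helper names of my own choosing (grad_j is assumed to be ∇_j L̃λ(β^{(t)}) supplied by the caller):

```python
import numpy as np

def soft_threshold(u, t):
    """S_t(u) = sign(u) * max(|u| - t, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def inexact_coordinate_update(beta, j, grad_j, lam, L):
    """Minimize (b - beta_j) * grad_j + (L/2)(b - beta_j)^2 + lam * |b| over b.

    The minimizer is soft thresholding of a coordinate gradient step:
    b* = S_{lam/L}(beta_j - grad_j / L).
    """
    beta = beta.copy()
    beta[j] = soft_threshold(beta[j] - grad_j / L, lam / L)
    return beta
```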

SLIDE 13

Examples

Sparse Linear Regression + MCP:

  T_{j,λ}(β^{(t)}) = β̃_j^{(t+1)}                   if |β̃_j^{(t+1)}| ≥ γλ,
                     Sλ(β̃_j^{(t+1)}) / (1 − 1/γ)   if |β̃_j^{(t+1)}| < γλ,

where β̃_j^{(t+1)} = X_{∗j}ᵀ(y − X_{∗\j} β^{(t)}_{\j}) / n and Sλ(u) = sign(u) · max{|u| − λ, 0} is the soft thresholding operator (see the sketch below).

Sparse Logistic Regression + MCP:

  T_{j,λ}(β^{(t)}) = S_{λ/L}(β_j^{(t)} − ∇_j L̃λ(β^{(t)}) / L).

Remark: Sublinear convergence to local optima, without statistical guarantees (Shalev-Shwartz, 2011).
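A sketch of the MCP thresholding operator and the resulting exact coordinate update for least squares, following the display above. It assumes standardized design columns (X_{∗j}ᵀX_{∗j}/n = 1); the function names are mine.

```python
import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def mcp_threshold(z, lam, gamma):
    """T_{j,lam} for least squares + MCP: rescaled soft thresholding in the
    concave region, and no shrinkage at all once |z| >= gamma * lam."""
    if abs(z) >= gamma * lam:
        return z
    return soft_threshold(z, lam) / (1.0 - 1.0 / gamma)

def coordinate_update_mcp(beta, j, X, y, lam, gamma):
    """Exact coordinate minimization, assuming X[:, j]' X[:, j] / n = 1."""
    n = X.shape[0]
    partial_resid = y - X @ beta + X[:, j] * beta[j]   # leave coordinate j out
    z = X[:, j] @ partial_resid / n                    # tilde beta_j^{(t+1)}
    beta = beta.copy()
    beta[j] = mcp_threshold(z, lam, gamma)
    return beta
```

The absence of shrinkage for |z| ≥ γλ is what makes MCP nearly unbiased on strong signals, in contrast to the ℓ₁ penalty.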

SLIDE 14

Pathwise Coordinate Optimization

SLIDE 15

Pathwise Coordinate Optimization

  • Much faster than other competing algorithms.
  • Very simple implementation.
  • Easily scales to large problems.
  • NO computational analysis in the existing literature.
  • NO statistical guarantee on the obtained estimator.

Our Contribution:

  • The FIRST pathwise coordinate optimization algorithm with both computational and statistical guarantees.
  • The FIRST two-step estimator with both computational and statistical guarantees.

SLIDE 16

Pathwise Coordinate Optimization

(Friedman et al. 2007; Mazumder et al. 2011)

[Figure 2: The pathwise coordinate optimization framework contains 3 nested loops: (I) warm start initialization (outer loop, over the regularization parameters); (II) active set identification (middle loop); (III) active coordinate minimization (inner loop).]

SLIDE 17

Restricted Strong Convexity and Smoothness

Motivation: For any β, β′ ∈ ℝ^d such that |{j | βj ≠ 0 or β′j ≠ 0}| ≤ s, we have

  L̃λ(β′) − L̃λ(β) − (β′ − β)ᵀ∇L̃λ(β) ≥ (C−(s)/2) ‖β′ − β‖₂²,
  L̃λ(β′) − L̃λ(β) − (β′ − β)ᵀ∇L̃λ(β) ≤ (C+(s)/2) ‖β′ − β‖₂²,

where C−(s), C+(s) > 0 are two constants depending on s.

Remark: An algorithm that maintains SPARSE solutions throughout all iterations behaves as if it were minimizing a STRONGLY CONVEX function. Therefore linear convergence can be expected.

SLIDE 18

Warm Start Initialization (Outer Loop)

We choose a sequence of DECREASING regularization parameters {λK}_{K=0}^N:

  λ0 ≥ λ1 ≥ λ2 ≥ ... ≥ λN−1 ≥ λN.

The algorithm yields a sequence of output solutions {β̂^{K}}_{K=0}^N from sparse to dense,

  β̂^{K} ← argmin_β L̃_{λK}(β) + λK ‖β‖₁.

SLIDE 19

Warm Start Initialization (Outer Loop)

We choose λ0 = ‖∇L(0)‖∞; then

  min_{ξ ∈ ∂‖0‖₁} ‖∇L(0) + ∇H_{λ0}(0) + λ0ξ‖∞ = 0 and β̂^{0} = 0.

The regularization sequence {λK}_{K=0}^N is geometrically decreasing: λK = ηλK−1 with η ∈ (0, 1).

When solving the optimization problem with λK, we use β̂^{K−1} as the INITIALIZATION (a sketch of the outer loop follows below).
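A minimal sketch of the outer loop under these choices. solve_fixed_lambda stands in for the middle and inner loops at a fixed λK; it is a placeholder of mine, not part of the talk.

```python
import numpy as np

def lambda_path(grad_at_zero, N, eta):
    """lam_0 = ||grad L(0)||_inf, then lam_K = eta * lam_{K-1}, K = 1, ..., N."""
    lam0 = np.max(np.abs(grad_at_zero))
    return lam0 * eta ** np.arange(N + 1)

def pathwise_solve(grad_at_zero, N, eta, solve_fixed_lambda):
    """Outer loop: beta_hat^{0} = 0 at lam_0; each later problem is
    warm-started from the previous output solution."""
    lams = lambda_path(grad_at_zero, N, eta)
    beta = np.zeros_like(grad_at_zero, dtype=float)    # beta_hat^{0} = 0
    path = [beta]
    for lam in lams[1:]:
        beta = solve_fixed_lambda(beta, lam)           # warm start initialization
        path.append(beta)
    return lams, path
```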

SLIDE 20

Geometric Interpretation

[Figure 3: The solution path θ̂^{0}, θ̂^{1}, ..., θ̂^{N} and the basins of attraction C_{λ1}, ..., C_{λN} around θ∗: each warm start θ̂^{K−1} falls in the basin of attraction for λK. Large regularization parameters suppress the overselection of irrelevant variables {j | β∗j = 0} and yield highly sparse solutions.]

SLIDE 21

Active Set Strategy (Friedman et al. 2007)

Define A = {j | βj ≠ 0} as the set of indices of nonzero coordinates, and Ā = {j | βj = 0} as the set of indices of zero coordinates. A naive updating scheme is:

(1) Active coordinate minimization: cyclically update the βj's in A until convergence.
(2) Sweeping coordinates: check all βj in A, and if any coordinate becomes zero, move it to Ā.
(3) Adding coordinates: update the βj's over Ā only once, and if any coordinate becomes nonzero, move it to A. Then we go back to (1).

Remark: Heuristic tricks without theoretical guarantees (a sketch of this cycle follows below).
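In Python-flavored pseudocode, the naive cycle reads as follows. update_coordinate is a stand-in for the coordinate thresholding operator, and this is the heuristic scheme above, not PICASSO.

```python
import numpy as np

def naive_active_set(beta, update_coordinate, tol=1e-6, max_cycles=100):
    """(1) cycle over the active set A until convergence; then (2)/(3) one
    pass over the zero set A-bar, moving activated coordinates into A."""
    beta = beta.copy()
    for _ in range(max_cycles):
        # (1) Active coordinate minimization over A = {j : beta_j != 0}.
        while True:
            old = beta.copy()
            for j in np.flatnonzero(beta):
                beta = update_coordinate(beta, j)
            if np.max(np.abs(beta - old)) < tol:
                break
        # (2)+(3) Sweep/add: one pass over A-bar; activated coordinates join A.
        added = False
        for j in np.flatnonzero(beta == 0):
            beta = update_coordinate(beta, j)
            added = added or beta[j] != 0
        if not added:
            return beta        # no coordinate wants to enter: done
    return beta
```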

SLIDE 22

Active Set Identification (Middle Loop)

For notational simplicity, the outer loop index K is omitted.

Greedy Selection: At the m-th iteration, we have β^{[m]} and define

  Am = {j | β^{[m]}_j ≠ 0} and Ām = {j | β^{[m]}_j = 0}.

  • β^{[m+0.5]} ← active coordinate minimization over Am.
  • km ← argmax_{k ∈ Ām} |∇_k L̃λ(β^{[m+0.5]})|.
  • β^{[m+1]}_{km} ← T_{km,λ}(β^{[m+0.5]}) and β^{[m+1]}_{\km} ← β^{[m+0.5]}_{\km}.

Remark: Conservative coordinate selection.

SLIDE 23

Active Set Identification (Middle Loop)

For notational simplicity, the outer loop index K is omitted.

Randomized Selection: At the m-th iteration, we have β^{[m]} and define

  Am = {j | β^{[m]}_j ≠ 0} and Ām = {j | β^{[m]}_j = 0}.

  • β^{[m+0.5]} ← active coordinate minimization over Am.
  • Randomly select km ∈ Ām such that |∇_{km} L̃λ(β^{[m+0.5]})| ≥ δλ.
  • β^{[m+1]}_{km} ← T_{km,λ}(β^{[m+0.5]}) and β^{[m+1]}_{\km} ← β^{[m+0.5]}_{\km}.

Remark: Conservative coordinate selection.

SLIDE 24

Active Set Identification (Middle Loop)

For notational simplicity, the outer loop index K is omitted.

Truncated Cyclic Selection: At the m-th iteration, we have β^{[m]} and define

  Am = {j | β^{[m]}_j ≠ 0} and Ām = {j | β^{[m]}_j = 0}.

  • β^{[m+0.5]} ← active coordinate minimization over Am.
  • For all k ∈ Ām, take

      β^{[m+0.5]}_k ← T_{k,λ}(β^{[m+0.5]})   if |∇_k L̃λ(β^{[m+0.5]})| ≥ δλ,
      β^{[m+0.5]}_k unchanged                if |∇_k L̃λ(β^{[m+0.5]})| < δλ.

  • β^{[m+1]} ← β^{[m+0.5]}.

Remark: Prevents the overselection of irrelevant variables (a sketch of this rule follows below).
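A sketch of one middle loop iteration with truncated cyclic selection. grad_L, threshold_op, and active_minimize abstract ∇L̃λ, T_{k,λ}, and the inner loop respectively; refreshing the gradient after each accepted update is my reading of the sequential cyclic pass.

```python
import numpy as np

def truncated_cyclic_step(beta, grad_L, threshold_op, lam, delta, active_minimize):
    """One middle loop iteration with truncated cyclic selection."""
    beta = active_minimize(beta)                   # inner loop: beta^[m+0.5]
    for k in np.flatnonzero(beta == 0):            # cyclic pass over A_m-bar
        if abs(grad_L(beta)[k]) >= delta * lam:    # truncation rule
            beta = threshold_op(beta, k)           # activate coordinate k
        # otherwise |grad_k| < delta * lam: leave beta_k at zero
    return beta                                    # beta^[m+1]
```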

SLIDE 25

Active Set Identification (Middle Loop)

[Figure 4: Starting from θ^{[m+0.5]}, coordinates added to reach θ^{[m+1]} by cyclic search (failure), greedy selection (success), randomized selection (success), and truncated cyclic selection (success). The cyclic search in Friedman et al. 2007 and Mazumder et al. 2011 may overselect irrelevant variables; the three proposed rules add only relevant blocks.]

SLIDE 26

Active Coordinate Minimization (Inner Loop)

[Figure 5: The inner loop from θ^{[0]} to θ̂, alternating "add" and "sweep" steps. The active coordinate minimization not only decreases the objective value, but also sweeps some variables from the active set.]

SLIDE 27

Computational and Statistical Theories

SLIDE 28

Preliminaries

Definition (Sparse Eigenvalues): Given an integer s ≥ 1,

  ρ+(s) = sup_{‖v‖₀ ≤ s} (vᵀ∇²L(β)v / ‖v‖₂²),  ρ−(s) = inf_{‖v‖₀ ≤ s} (vᵀ∇²L(β)v / ‖v‖₂²),

and ρ̃−(s) = ρ−(s) − α.

Lemma (Restricted Curvature): Given ρ+(s) > ρ−(s) > α, for any β, β′ ∈ ℝ^d such that |{j | βj ≠ 0 or β′j ≠ 0}| ≤ s,

  L̃λ(β′) − L̃λ(β) − (β′ − β)ᵀ∇L̃λ(β) ≤ (ρ+(s)/2) ‖β′ − β‖₂²,
  L̃λ(β′) − L̃λ(β) − (β′ − β)ᵀ∇L̃λ(β) ≥ (ρ̃−(s)/2) ‖β′ − β‖₂²,

where Hλ satisfies Hλ(β′) − Hλ(β) − (β′ − β)ᵀ∇Hλ(β) ≥ −(α/2) ‖β′ − β‖₂².

SLIDE 29

Preliminaries

Assumption A: λN ≥ 4‖∇L(β∗)‖∞ and η ∈ [23/24, 1). The regularization parameters are LARGE enough to eliminate irrelevant variables (Negahban et al. 2012).

Assumption B: Given ‖β∗‖₀ ≤ s∗, there exists an s̃ such that

  (1) s̃ ≥ (484κ² + 100κ)s∗,
  (2) ρ+(s∗ + 2s̃ + 2) < +∞,
  (3) ρ̃−(s∗ + 2s̃ + 2) > 0,

where κ = ρ+(s∗ + 2s̃ + 2) / ρ̃−(s∗ + 2s̃ + 2). The algorithm can tolerate AT MOST s̃ + 1 nonzero irrelevant variables throughout all iterations (Bickel, 2009; Zhang, 2009).

SLIDE 30

Global Convergence (Greedy PICASSO)

Suppose that Assumptions A and B hold. We have the following results:

(Solution Sparsity) Throughout all iterations of PICASSO, any solution β satisfies ‖β_{S̄}‖₀ ≤ s̃ + 1.

(Sparse Optimum) At the K-th iteration of the outer loop, PICASSO converges to a unique sparse local optimum β̄^{λK} satisfying

  ‖β̄^{λK}_{S̄}‖₀ ≤ s̃ and min_{ξ ∈ ∂‖β̄^{λK}‖₁} ‖∇L̃_{λK}(β̄^{λK}) + λKξ‖∞ = 0.

(Logarithmic Iteration Complexity) To attain F_{λN}(β̂^{N}) − F_{λN}(β̄^{λN}) ≤ ε, the number of active set identification iterations is at most O(N · log(1/ε)).

SLIDE 31

Two-step Method

Step 1 (Convex Relaxation): Obtain β̂^{relax} satisfying

  min_{ξ ∈ ∂‖β̂^{relax}‖₁} ‖∇L(β̂^{relax}) + λ0ξ‖∞ ≤ λ0/8

(a sketch follows below).

Step 2 (PICASSO): Solve the optimization problem with PICASSO, and use β̂^{relax} as the initialization for λ0.

Remark: The low precision makes Step 1 very efficient.
Remark: The restricted strong convexity holds for ‖β − β∗‖₂ ≤ R (e.g., logistic loss, Huber loss), where R is a constant and does not scale with (n, d, s∗). All previous theoretical results hold.
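A sketch of Step 1 under the stated stopping rule, using ISTA as one possible low-precision solver; the talk does not commit to a particular solver, and the function names are mine.

```python
import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def step1_convex_relaxation(grad, d, lam0, step, max_iter=1000):
    """Approximately solve the ell_1 problem at lam_0 with ISTA, stopping once
    min_{xi in subdiff ||beta||_1} ||grad L(beta) + lam0 * xi||_inf <= lam0 / 8."""
    beta = np.zeros(d)
    for _ in range(max_iter):
        g = grad(beta)
        # Optimality residual: xi = sign(beta_j) on nonzeros; xi free in [-1, 1] on zeros.
        res = np.where(beta != 0,
                       np.abs(g + lam0 * np.sign(beta)),
                       np.maximum(np.abs(g) - lam0, 0.0))
        if res.max() <= lam0 / 8:
            break                                  # low-precision stopping rule
        beta = soft_threshold(beta - step * g, step * lam0)
    return beta                                    # initialization for PICASSO at lam_0
```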

SLIDE 32

Nearly Unbiased Estimation

Suppose that Assumptions A and B hold. We have

  ‖β̂^{N} − β∗‖₂ = O( ‖∇_{S1}L(β∗)‖₂ / ρ̃−(s∗ + 2s̃) [strong signals] + λN √|S2| / ρ̃−(s∗ + s̃) [weak signals] ),

where S1 = {j | |β∗j| ≥ γλN} and S2 = {j | 0 < |β∗j| < γλN}.

Clarification: To establish the theoretical analysis for each individual problem, we need to assume that the model is CORRECTLY specified. This is a very common assumption in high dimensional statistical theories.

SLIDE 33

Model Specification

Sparse Linear Regression: We consider the linear model y = Xβ∗ + ε, where ε ∼ N(0, σ²In) is the observational noise vector.

Sparse Logistic Regression: We consider the logistic model

  y_i ∼ Bernoulli( exp(X_{i∗}ᵀβ∗) / (1 + exp(X_{i∗}ᵀβ∗)) ) for i = 1, ..., n.

A simulation sketch for both models follows below.
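The sketch below simulates data from the two models; the dimensions and the choice of sparse β∗ are illustrative, not the settings used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s_star, sigma = 200, 1000, 5, 1.0          # illustrative sizes

beta_star = np.zeros(d)
beta_star[:s_star] = rng.choice([-1.0, 1.0], size=s_star)   # sparse truth
X = rng.standard_normal((n, d))                  # design matrix

# Sparse linear regression: y = X beta* + eps, eps ~ N(0, sigma^2 I_n).
y_linear = X @ beta_star + sigma * rng.standard_normal(n)

# Sparse logistic regression: y_i ~ Bernoulli(exp(x_i' beta*) / (1 + exp(x_i' beta*))).
p = 1.0 / (1.0 + np.exp(-(X @ beta_star)))
y_logistic = rng.binomial(1, p)
```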

SLIDE 34

Application to Sparse Linear Regression

Verify Assumption A: Given λN = 8σ√(log d / n), with HIGH PROBABILITY we have λN ≥ 4‖∇L(β∗)‖∞.

Verify Assumption B: Suppose that each row of X is independently sampled from a sub-Gaussian distribution with mean 0 and covariance Σ, where Λmin(Σ) ≥ ψmin and Λmax(Σ) ≤ ψmax. Given α = ψmin/4, there exists an s̃ such that for large enough n, with HIGH PROBABILITY, we have

  (1) s̃ ≥ (484κ² + 100κ) · s∗,
  (2) ρ̃−(s∗ + 2s̃ + 2) ≥ ψmin/4,
  (3) ρ+(s∗ + 2s̃ + 2) ≤ 3ψmax/2.

SLIDE 35

Application to Sparse Linear Regression

Parameter Estimation: Given α = ψmin/4 and λN = 8σ√(log d / n), we have

  ‖β̂^{N} − β∗‖₂ = O_P( σ√(s∗₁ / n) [strong signals] + σ√(s∗₂ log d / n) [weak signals] ),

where s∗₁ = |{j | |β∗j| ≥ γλN}| and s∗₂ = |{j | 0 < |β∗j| < γλN}|.

MCP vs. ℓ₁:

  ‖β̂^{ℓ₁} − β∗‖₂ = O_P( σ√(s∗ log d / n) ).

SLIDE 36

Application to Sparse Linear Regression

Minimum Signal Strength:

  min_{j ∈ S} |β∗j| ≥ (C′σ / ψmin) √(log d / n).

Support Recovery: Given α = ψmin/4 and λN = 8σ√(log d / n), we have

  β̄^{λN} = argmin_β (1 / 2n) ‖y − Xβ‖₂² subject to β_{S̄} = 0

with high probability.

MCP vs. ℓ₁: Restricted Strong Convexity vs. Irrepresentability.

SLIDE 37

Application to Sparse Logistic Regression

Verify Assumption A: Given λN = 8√(log d / n), with high probability we have λN ≥ 4‖∇L(β∗)‖∞.

Verify Assumption B: Suppose that each row of X is independently sampled from a sub-Gaussian distribution with mean 0 and covariance Σ, where Λmin(Σ) ≥ ψmin and Λmax(Σ) ≤ ψmax. Given α = ψmin/4, there exists an s̃ such that for large enough n and any ‖β − β∗‖₂ ≤ R, with high probability, we have

  (1) s̃ ≥ (484κ² + 100κ) · s∗,
  (2) ρ̃−(s∗ + 2s̃ + 2) ≥ ψmin/4,
  (3) ρ+(s∗ + 2s̃ + 2) ≤ 3ψmax/2.

SLIDE 38

Application to Sparse Logistic Regression

Parameter Estimation: Given α = ψmin/4 and λN = 8√(log d / n), we have

  ‖β̂^{N} − β∗‖₂ = O_P( √(s∗₁ / n) [strong signals] + √(s∗₂ log d / n) [weak signals] ),

where s∗₁ = |{j | |β∗j| ≥ γλN}| and s∗₂ = |{j | 0 < |β∗j| < γλN}|.

MCP vs. ℓ₁:

  ‖β̂^{ℓ₁} − β∗‖₂ = O_P( √(s∗ log d / n) ).

SLIDE 39

Numerical Simulations

SLIDE 40

Numerical Simulations

  • PICASSO with greedy selection, denoted by "G-PICASSO".
  • PICASSO with randomized selection, denoted by "R-PICASSO".
  • PICASSO with truncated cyclic selection, denoted by "TC-PICASSO".
  • SPARSENET, proposed in Mazumder et al. 2011.
  • PISTA, proposed in Wang et al. 2014.

SLIDE 41

Numerical Simulations

Table 1: Quantitative comparison on sparse linear regression (N = 100, n = 60, d = 1000, σ = 1, λN = 0.25√(log d / n), γ = 1.05).

Method       ‖β̂ − β∗‖₂       ‖β̂_S‖₀         ‖β̂_{S̄}‖₀      Correct Selection   Timing (sec)
G-PICASSO    0.8003(0.8908)   2.812(0.4997)   0.844(2.066)   666/1000            0.0169(0.0027)
R-PICASSO    0.8102(0.9663)   2.791(0.5355)   0.902(2.353)   653/1000            0.0186(0.0034)
TC-PICASSO   0.8057(0.8374)   2.800(0.4839)   0.888(2.038)   645/1000            0.0167(0.0024)
SPARSENET    1.1260(1.2708)   2.669(0.6942)   1.678(3.191)   514/1000            0.0171(0.0025)
PISTA        0.8135(0.8998)   2.797(0.5115)   0.881(2.112)   664/1000            2.1771(0.3805)

SLIDE 42

Conclusions

SLIDE 43

Conclusions

Multistage convex relaxation (Zhang, 2010; Zhang, 2012):
  • No theoretical guarantee on the iteration complexity;
  • Needs to be combined with an efficient solver for each subproblem.

One-step convex relaxation method (Zou and Li, 2008; Wang and Li, 2013; Fan et al. 2014):
  • Attains suboptimal statistical rates of convergence;
  • Requires a stronger minimum signal strength assumption;
  • Needs to be combined with an efficient solver for each subproblem.

Path-following proximal gradient algorithm (Wang et al. 2014):
  • Worse empirical computational performance;
  • Requires ‖β∗‖₂ ≤ R/2 for sparse generalized linear model estimation.

SLIDE 44

Conclusions

Proximal gradient algorithm (Loh and Wainwright, 2013): solves

  min_{β ∈ ℝ^d} L(β) + Rλ(β) subject to ‖β‖₁ ≤ R/2.    (1)

  • Sophisticated parameter tuning;
  • Inexact convergence;
  • Slower parameter estimation rates of convergence;
  • Requires ‖β∗‖₁ ≤ R/2 for all nonconvex sparse learning problems.

Pathwise Calibrated Sparse Shooting Algorithm (PICASSO):
  • Concrete theoretical guarantees;
  • Empirically very efficient;
  • Weaker assumptions.

SLIDE 45

Thank You! Questions?