Online Learning with Model Selection (Lizhe Sun, Adrian Barbu)

Online Learning with Model Selection Lizhe Sun, Adrian Barbu Florida State University abarbu@stat.fsu.edu October 16, 2019 Lizhe Sun, Adrian Barbu (FSU) Online Learning with Model Selection October 16, 2019 1 / 40

Outline
1. Introduction
2. Literature Review
3. Online Learning Algorithms by Running Averages
4. Theoretical Analysis
5. Numerical Results
6. Future Work

Introduction: Online Learning
1. In big data learning, we often encounter datasets so large that they cannot fit in computer memory.
2. Online learning methods address this issue by constructing the model sequentially, one example at a time.
3. We assume that the samples are either i.i.d. or chosen by an adversary.

Introduction: The Framework for an Online Learning Algorithm
Assume $w_1 = 0$ and that we can only access data samples $\{(x_i, y_i) : i = 1, 2, \cdots\}$ streaming in one at a time.
for $i = 1, 2, \cdots$
    Receive observation $x_i \in \mathbb{R}^p$
    Predict $\hat{y}_i$
    Receive the true value $y_i \in \mathbb{R}$
    Suffer the loss $f(w_i; z_i)$, where $z_i = (x_i, y_i)$
    Update $w_{i+1}$ from $w_i$ and $z_i$
end
Target: minimize the cumulative loss $\frac{1}{n}\sum_{i=1}^{n} f(w_i; z_i)$.
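The protocol above can be sketched in a few lines. This is only an illustration under assumptions the slide does not fix: the loss is taken to be the squared loss, the update is a plain gradient step, and the names `online_loop` and `eta` are hypothetical.

```python
import numpy as np

def online_loop(stream, p, eta=0.05):
    """Run the online protocol on an iterable of (x_i, y_i) pairs.

    Assumptions (not from the slides): squared loss f(w; z) = (w.x - y)^2 / 2
    and a gradient step with constant step size eta as the update rule.
    """
    w = np.zeros(p)                              # w_1 = 0
    losses = []
    for x, y in stream:
        y_hat = w @ x                            # predict before y_i is revealed
        losses.append(0.5 * (y_hat - y) ** 2)    # suffer f(w_i; z_i)
        w = w - eta * (y_hat - y) * x            # update w_{i+1} from w_i and z_i
    return w, float(np.mean(losses))             # final weights, average loss

# Synthetic noiseless stream, used only to exercise the loop.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w, avg_loss = online_loop(zip(X, y), p=3)
```

Only the current weight vector is kept in memory; each sample is used once and then discarded.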

Introduction: Regret
In the theoretical analysis of online learning, it is of interest to bound the regret
$$R_n = \frac{1}{n}\sum_{i=1}^{n} f(w_i; z_i) - \min_{w} \frac{1}{n}\sum_{i=1}^{n} f(w; z_i),$$
which measures what is lost compared to offline learning and, in a way, measures the convergence speed of online algorithms.
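As a rough numerical illustration of the regret, the sketch below compares the average loss of an online learner with that of the best fixed $w$ in hindsight. The squared loss, the constant-step gradient update, and the name `average_regret` are assumptions made for this example, not part of the slides.

```python
import numpy as np

def average_regret(X, y, eta=0.05):
    """Average regret R_n of online gradient steps vs. the best fixed w in hindsight.

    Assumptions (not from the slides): squared loss and constant step size eta.
    """
    n, p = X.shape
    w = np.zeros(p)
    online = 0.0
    for x_i, y_i in zip(X, y):
        r = w @ x_i - y_i
        online += 0.5 * r ** 2                       # f(w_i; z_i), before the update
        w -= eta * r * x_i
    w_star = np.linalg.lstsq(X, y, rcond=None)[0]    # offline minimizer
    offline = 0.5 * np.sum((X @ w_star - y) ** 2)
    return (online - offline) / n

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.0]) + 0.1 * rng.normal(size=5000)
regret_short = average_regret(X[:200], y[:200])      # regret after 200 samples
regret_long = average_regret(X, y)                   # regret after 5000 samples
```

On this synthetic stream the average regret shrinks as more samples arrive, which is the sense in which regret tracks convergence speed.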

Literature Review: Stochastic Gradient Descent (SGD)
SGD is the most widely used method in traditional online learning.
The original idea can be traced back to Robbins and Monro (1951) and Kiefer and Wolfowitz (1952).
However, the SGD algorithm cannot select features.

Literature Review: Online Learning with Sparsity
To learn a better model, we need to consider feature selection in online learning.
Langford et al. (2009) proposed the truncated gradient framework.
Shalev-Shwartz and Tewari (2011) designed stochastic mirror descent.
Truncated second-order methods appear in Fan et al. (2018), Langford et al. (2009), and Wu et al. (2017).

Literature Review: OPG and RDA
Two main frameworks for online learning with regularization:
1. Online Proximal Gradient Descent (OPG)
2. Regularized Dual Averaging (RDA)
OPG was designed by Duchi and Singer (2009) and Duchi et al. (2010), and RDA was proposed by Xiao (2010).
Some variants were designed by Suzuki (2013) and Ouyang et al. (2013):
1. OPG-SADMM
2. RDA-SADMM

Literature Review: Hazan et al. (2007)
An online Newton method.
Uses an idea similar to running averages to update the inverse of the Hessian matrix.
Has $O(p^2)$ computational complexity.
Did not address the issues of variable standardization and feature selection.

Literature Review: Summary
The classical online learning algorithms, such as SGD, cannot select features.
In recent years, many new online learning algorithms have been proposed to select features.
However, neither in theory nor in numerical experiments do the proposed algorithms recover the true features.
This concern motivates us to develop our running averages framework.

Online Learning Algorithms by Running Averages: Framework of Running Averages Algorithms
Figure: The running averages are updated as the data is received. The model is extracted from the running averages only when desired.

Online Learning Algorithms by Running Averages: Running Averages
Given samples $x_i = (x_{i1}, x_{i2}, \cdots, x_{ip})^T \in \mathbb{R}^p$ and responses $y_i \in \mathbb{R}$, we can compute running averages as follows:
$S_x = \mu_x = \frac{1}{n}\sum_{i=1}^{n} x_i$, $S_y = \mu_y = \frac{1}{n}\sum_{i=1}^{n} y_i$
$S_{xx} = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^T$
$S_{xy} = \frac{1}{n}\sum_{i=1}^{n} y_i x_i$
$S_{yy} = \frac{1}{n}\sum_{i=1}^{n} y_i^2$
Sample size: $n$
These can be updated online, e.g. $\mu_x^{(n+1)} = \frac{n}{n+1}\mu_x^{(n)} + \frac{1}{n+1}x_{n+1}$.
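A minimal sketch of these updates follows. It applies the same $\frac{n}{n+1}$, $\frac{1}{n+1}$ weighting shown for $\mu_x$ to every running average; the class name `RunningAverages` and its attribute names are hypothetical.

```python
import numpy as np

class RunningAverages:
    """Running averages mu_x, mu_y, S_xx, S_xy, S_yy updated one sample at a time."""

    def __init__(self, p):
        self.n = 0
        self.mu_x = np.zeros(p)
        self.mu_y = 0.0
        self.S_xx = np.zeros((p, p))
        self.S_xy = np.zeros(p)
        self.S_yy = 0.0

    def update(self, x, y):
        # new_average = n/(n+1) * old_average + 1/(n+1) * new_term
        a, b = self.n / (self.n + 1), 1.0 / (self.n + 1)
        self.mu_x = a * self.mu_x + b * x
        self.mu_y = a * self.mu_y + b * y
        self.S_xx = a * self.S_xx + b * np.outer(x, x)
        self.S_xy = a * self.S_xy + b * y * x
        self.S_yy = a * self.S_yy + b * y * y
        self.n += 1

# Stream synthetic samples through the accumulator.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = rng.normal(size=500)
ra = RunningAverages(p=4)
for x_i, y_i in zip(X, y):
    ra.update(x_i, y_i)
```

The memory footprint is $O(p^2)$ regardless of how many samples have been seen.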

Online Learning Algorithms by Running Averages: Standardization of Running Averages
The standardization of the data matrix $X$ and the vector $y$:
$\tilde{X} = (X - 1_n \mu_x^T) D$
$\tilde{y} = y - \mu_y 1_n$
$D$ is the diagonal matrix holding the inverses of the standard deviations of the columns of $X$.
The equivalent standardization using running averages:
$S_{\tilde{x}\tilde{y}} = \frac{1}{n}\tilde{X}^T \tilde{y} = \frac{1}{n} D X^T y - \mu_y D \mu_x = D S_{xy} - \mu_y D \mu_x$
$S_{\tilde{x}\tilde{x}} = \frac{1}{n}\tilde{X}^T \tilde{X} = D\left(\frac{X^T X}{n} - \mu_x \mu_x^T\right) D = D(S_{xx} - \mu_x \mu_x^T) D$
We will assume the data is standardized in all algorithms below.
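The two identities above can be checked numerically. The sketch below assumes the standard deviations in $D$ use the $1/n$ normalization, so that they can be read off the diagonal of $S_{xx} - \mu_x\mu_x^T$; the function name `standardize_averages` is hypothetical.

```python
import numpy as np

def standardize_averages(mu_x, mu_y, S_xx, S_xy):
    """Return (S_xx, S_xy) of the standardized data, using only running averages."""
    cov = S_xx - np.outer(mu_x, mu_x)            # S_xx - mu_x mu_x^T
    d = 1.0 / np.sqrt(np.diag(cov))              # diagonal of D (assumes 1/n variance)
    S_xx_std = d[:, None] * cov * d[None, :]     # D (S_xx - mu_x mu_x^T) D
    S_xy_std = d * S_xy - mu_y * d * mu_x        # D S_xy - mu_y D mu_x
    return S_xx_std, S_xy_std

rng = np.random.default_rng(0)
n = 300
X = rng.normal(loc=3.0, scale=[1.0, 2.0, 5.0], size=(n, 3))
y = X @ np.array([1.0, 0.5, -0.2]) + rng.normal(size=n)
S_xx_std, S_xy_std = standardize_averages(X.mean(0), y.mean(), X.T @ X / n, X.T @ y / n)

# Direct standardization of the data matrix, for comparison.
X_t = (X - X.mean(0)) / X.std(0)
y_t = y - y.mean()
```

Both routes give the same matrices, so standardization never requires revisiting the raw data.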

Online Learning Algorithms by Running Averages: Online Least Squares (OLS)
Normal equations: $\frac{1}{n} X^T X \beta = \frac{1}{n} X^T y$.
Since $\frac{1}{n} X^T X$ and $\frac{1}{n} X^T y$ can be computed by running averages, we obtain $S_{xx}\beta = S_{xy}$.
Thus, online least squares is equivalent to offline least squares.
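A short check of this equivalence, on synthetic data chosen only for illustration: the running averages are accumulated one sample at a time, and the solution of $S_{xx}\beta = S_{xy}$ is compared with a batch least squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 400, 5
X = rng.normal(size=(n, p))
y = X @ np.arange(1.0, p + 1) + 0.1 * rng.normal(size=n)

# Accumulate the running averages one sample at a time.
S_xx, S_xy = np.zeros((p, p)), np.zeros(p)
for i, (x_i, y_i) in enumerate(zip(X, y)):
    S_xx += (np.outer(x_i, x_i) - S_xx) / (i + 1)   # same as i/(i+1)*S_xx + x x^T/(i+1)
    S_xy += (y_i * x_i - S_xy) / (i + 1)

beta_online = np.linalg.solve(S_xx, S_xy)             # solves S_xx beta = S_xy
beta_offline = np.linalg.lstsq(X, y, rcond=None)[0]   # batch least squares on all data
```

The two coefficient vectors agree up to floating-point error, provided $S_{xx}$ is invertible.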

Online Learning Algorithms by Running Averages: Online Least Squares with Thresholding (OLSth)
Aimed at solving the following constrained minimization problem:
$$\min_{\beta,\ \|\beta\|_0 \le k} \frac{1}{2n}\|y - X\beta\|^2.$$
A non-convex and NP-hard problem because of the sparsity constraint.

Online Learning Algorithms by Running Averages: Algorithm 1 Online Least Squares with Thresholding (OLSth)
Input: Running averages $S_{xx}$, $S_{xy}$, sample size $n$, sparsity level $k$.
Output: Trained regression parameter vector $\beta$ with $\|\beta\|_0 \le k$.
1: Fit the model by OLS, obtaining $\hat{\beta}$
2: Keep only the $k$ variables with largest $|\hat{\beta}_j|$
3: Fit the model on the selected features by OLS
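The three steps of Algorithm 1 can be sketched as follows. The function name `olsth` and the synthetic data are illustrative assumptions; the features are drawn with unit variance so that they are approximately standardized, as the slides require.

```python
import numpy as np

def olsth(S_xx, S_xy, k):
    """OLS with thresholding from running averages; returns a k-sparse beta."""
    beta_full = np.linalg.solve(S_xx, S_xy)               # step 1: OLS on all features
    keep = np.sort(np.argsort(-np.abs(beta_full))[:k])    # step 2: k largest |beta_j|
    beta = np.zeros_like(beta_full)
    # step 3: refit OLS on the selected features only
    beta[keep] = np.linalg.solve(S_xx[np.ix_(keep, keep)], S_xy[keep])
    return beta, keep

rng = np.random.default_rng(0)
n, p = 1000, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[[1, 4, 7]] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.5 * rng.normal(size=n)
beta_hat, selected = olsth(X.T @ X / n, X.T @ y / n, k=3)
```

Step 1 needs an invertible $S_{xx}$, so this sketch applies only once enough samples have been seen ($n > p$).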

Online Learning Algorithms by Running Averages: Online Feature Selection with Annealing (OFSA)
An iterative thresholding algorithm (Barbu et al., 2017).
Can simultaneously estimate coefficients and select features.
Uses the gradient $\frac{\partial}{\partial\beta}\frac{\|y - X\beta\|^2}{2n} = S_{xx}\beta - S_{xy}$, which can be updated online.
Uses an annealing schedule $M_e$ to gradually remove features.
Figure: Different annealing schedules $M_e$ vs. iteration number $e$.

Online Learning Algorithms by Running Averages: Algorithm 2 Online Feature Selection with Annealing (OFSA)
Input: Running averages $S_{xx}$, $S_{xy}$, sample size $n$, sparsity level $k$, annealing parameter $\mu$.
Output: Trained regression parameter vector $\beta$ with $\|\beta\|_0 \le k$.
Initialize $\beta = 0$.
for $t = 1$ to $N^{iter}$ do
    Update $\beta \leftarrow \beta - \eta(S_{xx}\beta - S_{xy})$
    Keep only the $M_t$ variables with highest $|\beta_j|$ and renumber them $1, ..., M_t$.
end for
Fit the model on the selected features by OLS.
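A sketch of Algorithm 2 under stated assumptions. The slides show the annealing schedules only in a figure, so the inverse schedule $M_t = k + (p - k)\max\left(0, \frac{N^{iter} - 2t}{2t\mu + N^{iter}}\right)$ used here is an assumption, as are the default values of `mu`, `eta`, and `n_iter` and the function name `ofsa`.

```python
import numpy as np

def ofsa(S_xx, S_xy, k, mu=10.0, eta=0.1, n_iter=200):
    """Feature selection with annealing on running averages (illustrative sketch)."""
    p = len(S_xy)
    idx = np.arange(p)                    # original indices of surviving features
    beta = np.zeros(p)
    for t in range(1, n_iter + 1):
        # gradient step restricted to the surviving features
        beta = beta - eta * (S_xx[np.ix_(idx, idx)] @ beta - S_xy[idx])
        # assumed annealing schedule: shrinks from about p down to k
        m_t = k + int((p - k) * max(0.0, (n_iter - 2 * t) / (2 * t * mu + n_iter)))
        m_t = min(m_t, len(idx))
        order = np.sort(np.argsort(-np.abs(beta))[:m_t])   # keep M_t largest |beta_j|
        idx, beta = idx[order], beta[order]
    out = np.zeros(p)
    out[idx] = np.linalg.solve(S_xx[np.ix_(idx, idx)], S_xy[idx])  # final OLS refit
    return out, idx

rng = np.random.default_rng(0)
n, p = 1000, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[[2, 5, 11]] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.5 * rng.normal(size=n)
beta_hat, selected = ofsa(X.T @ X / n, X.T @ y / n, k=3)
```

Unlike OLSth, the loop never solves the full $p \times p$ system; only the final refit on the $k$ selected features requires a linear solve.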

Online Learning Algorithms by Running Averages: Online Lasso and Online Adaptive Lasso
The Lasso estimator, proposed in (Tibshirani, 1996), solves the optimization problem
$$\arg\min_{\beta} \frac{1}{2n}\|y - X\beta\|^2 + \lambda\sum_{j=1}^{p} |\beta_j|,$$
where $\lambda > 0$ is a tuning parameter.
However, because the Lasso estimator cannot in general recover the true features, Zou (2006) proposed the adaptive Lasso, which solves the weighted Lasso
$$\arg\min_{\beta} \frac{1}{2}\|y - X\beta\|^2 + \lambda_n\sum_{j=1}^{p} w_j|\beta_j|,$$
where $w_j$ is the weight for $\beta_j$, $j = 1, 2, \cdots, p$.
The weights can be computed from the OLS coefficients when $n > p$.

Online Learning Algorithms by Running Averages: Algorithm 3 Online Adaptive Lasso (OALa)
Input: Running averages $S_{xx}$, $S_{xy}$, sample size $n$, penalty $\lambda$.
Output: Trained sparse regression parameter vector $\beta$.
1: Compute the OLS estimate $\hat{\beta}^{ols}$.
2: Define a $p \times p$ diagonal weight matrix $\Sigma_w$ with the $|\hat{\beta}^{ols}_j|$ as diagonal entries.
3: Denote $S^w_{xx} = \Sigma_w S_{xx} \Sigma_w$ and $S^w_{xy} = \Sigma_w S_{xy}$.
Initialize $\beta = 0$.
for $t = 1$ to $N^{iter}$ do
    Update $\beta \leftarrow \beta - \eta(S^w_{xx}\beta - S^w_{xy})$
    Update $\beta \leftarrow S_{\eta\lambda}(\beta)$ ($S_{\eta\lambda}(\cdot)$ is the soft thresholding operator).
end for
Fit the model on the selected features by OLS.
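Algorithm 3 can be sketched as a proximal gradient loop on the rescaled running averages. The step size choice $\eta = 1/\lambda_{\max}(S^w_{xx})$, the iteration count, and the names `oala` and `soft_threshold` are assumptions for this example; the slides do not specify them.

```python
import numpy as np

def soft_threshold(v, tau):
    """Soft thresholding operator S_tau(v), applied elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def oala(S_xx, S_xy, lam, n_iter=500):
    """Online adaptive Lasso from running averages (illustrative sketch)."""
    w = np.abs(np.linalg.solve(S_xx, S_xy))      # steps 1-2: diagonal of Sigma_w
    S_w_xx = w[:, None] * S_xx * w[None, :]      # step 3: Sigma_w S_xx Sigma_w
    S_w_xy = w * S_xy                            #         Sigma_w S_xy
    eta = 1.0 / np.linalg.eigvalsh(S_w_xx)[-1]   # assumed step size: 1 / largest eigenvalue
    beta = np.zeros_like(w)
    for _ in range(n_iter):
        beta = beta - eta * (S_w_xx @ beta - S_w_xy)   # gradient step
        beta = soft_threshold(beta, eta * lam)         # soft thresholding step
    selected = np.flatnonzero(beta)              # features with nonzero coefficient
    out = np.zeros_like(w)
    if selected.size:                            # final OLS refit on selected features
        out[selected] = np.linalg.solve(S_xx[np.ix_(selected, selected)], S_xy[selected])
    return out, selected

rng = np.random.default_rng(0)
n, p = 1000, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[[1, 4, 7]] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.5 * rng.normal(size=n)
beta_hat, selected = oala(X.T @ X / n, X.T @ y / n, lam=0.1)
```

Rescaling by $\Sigma_w$ means that features with small OLS coefficients are penalized more heavily, which is how the adaptive weights enter without changing the soft thresholding step.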
