SLIDE 1

DISI, Università di Genova

Genova, October 30 2004

CBCL, Massachusetts Institute of Technology

Model Selection and Fast Rates for Regularized Least-Squares

Andrea Caponnetto


SLIDE 2

Plan

  • Regularized least-squares (RLS) in statistical learning
  • Bounds on the expected risk and model selection
  • Evaluating the approximation and the sample errors
  • Fast rates of convergence of the risk to its minimum


SLIDE 3

Training sets

  • The set $Z = X \times Y$, with the input space $X$ a compact subset of $\mathbb{R}^n$ and the output space $Y$ a compact subset of $\mathbb{R}$.
  • The probability measure $\rho$ on the space $Z$.
  • The training set $\mathbf{z} = ((x_1, y_1), \dots, (x_\ell, y_\ell))$, a sequence of $\ell$ independent identically distributed elements of $Z$.


SLIDE 4

Regression using RLS

The estimator $f^\lambda_{\mathbf z}$ is defined as the unique hypothesis minimizing the sum of loss and complexity

$$f^\lambda_{\mathbf z} = \operatorname*{argmin}_{f \in H} \left\{ \frac{1}{\ell} \sum_{i=1}^{\ell} \left( f(x_i) - y_i \right)^2 + \lambda \|f\|_H^2 \right\},$$

where

  • the hypothesis space $H$ is the reproducing kernel Hilbert space (RKHS) with kernel $K : X \times X \to \mathbb{R}$,
  • the parameter $\lambda$ tunes the balance between the two terms.
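By the representer theorem this infinite-dimensional problem reduces to an $\ell \times \ell$ linear system. A minimal Python sketch of the resulting estimator (the function names and the Gaussian-kernel example are illustrative assumptions, not from the slides):

```python
import numpy as np

def rls_fit(X, y, kernel, lam):
    """Regularized least-squares sketch: by the representer theorem the
    minimizer is f(x) = sum_i c_i K(x_i, x), with coefficients
    c = (K + lam * ell * I)^{-1} y, K being the Gram matrix."""
    ell = len(X)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    c = np.linalg.solve(K + lam * ell * np.eye(ell), y)
    return lambda x: sum(ci * kernel(xi, x) for ci, xi in zip(c, X))

# Toy usage: noisy samples of a sine wave, Gaussian kernel.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=30)
y = np.sin(3 * X) + 0.1 * rng.standard_normal(30)
f = rls_fit(X, y, lambda a, b: np.exp(-(a - b) ** 2), lam=1e-3)
print(f(0.5))
```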


SLIDE 5

A criterion for model selection

In the context of RLS, a criterion for model selection is a rule to choose $\lambda$ in order to achieve high performance. The performance of the estimator $f^\lambda_{\mathbf z}$ is measured by the expected risk

$$I[f^\lambda_{\mathbf z}] = \int_{X \times Y} \left( f^\lambda_{\mathbf z}(x) - y \right)^2 d\rho(x, y).$$

  • It is a random variable,
  • it depends on the unknown distribution $\rho$.


SLIDE 6

A criterion for model selection (cont.)

The best we can do is to determine a function $B(\lambda, \eta, \ell)$ which bounds the expected risk $I[f^\lambda_{\mathbf z}]$ with confidence level $1 - \eta$, that is,

$$\operatorname{Prob}_{\mathbf z \in Z^\ell} \left\{ I[f^\lambda_{\mathbf z}] \le \inf_{f \in H} I[f] + B(\lambda, \eta, \ell) \right\} \ge 1 - \eta.$$

Then a natural criterion for model selection consists of choosing the regularization parameter that minimizes this bound,

$$\lambda_0(\eta, \ell) = \operatorname*{argmin}_{\lambda > 0} \, \{ B(\lambda, \eta, \ell) \}.$$
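In code the rule is a one-line grid search. A minimal sketch, assuming a callable bound `B(lam, eta, ell)` (a hypothetical signature; the logarithmic grid is an arbitrary choice):

```python
import numpy as np

def select_lambda(B, eta, ell, grid=None):
    """The slide's rule: the lambda on a grid minimizing the bound B."""
    if grid is None:
        grid = np.logspace(-8, 1, 200)  # arbitrary search grid
    return min(grid, key=lambda lam: B(lam, eta, ell))
```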


SLIDE 7

Main contributions in the literature

  • Model selection performed by bounds using covering numbers as a measure of capacity of a compact hypothesis space [F. Cucker, S. Smale, 2001, 2002]
  • Use of stability of the estimator and concentration inequalities as tools to bound the risk [O. Bousquet, A. Elisseeff, 2000]
  • Direct estimates of integral operators by concentration inequalities, with no need of covering numbers [E. De Vito et al., 2004]
  • Use of a Bernstein form of McDiarmid's concentration inequality to improve the rates [S. Smale, D. Zhou, 2004]


SLIDE 8

A concentration inequality (McDiarmid, 1989)

  • Let $\xi$ be a random variable, $\xi : Z^\ell \to \mathbb{R}$,
  • let $\mathbf z^i$ be the training set with the $i$-th example replaced by $(x'_i, y'_i)$,
  • assume that there is a constant $C$ such that $| \xi(\mathbf z) - \xi(\mathbf z^i) | \le C$ for all $\mathbf z, \mathbf z^i, i$;

then McDiarmid's inequality tells us that

$$\operatorname{Prob}_{\mathbf z \in Z^\ell} \left( | \xi(\mathbf z) - \mathbb{E}_{\mathbf z}(\xi) | \ge \epsilon \right) \le 2 \exp\left( -\frac{2\epsilon^2}{\ell C^2} \right).$$
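A quick numerical sanity check (illustrative, not from the slides): for the empirical mean of $\ell$ variables in $[0, 1]$, replacing one example moves $\xi$ by at most $C = 1/\ell$, so the bound reads $2 \exp(-2\ell\epsilon^2)$.

```python
import numpy as np

# Monte Carlo check of McDiarmid for xi(z) = empirical mean of
# Bernoulli(0.5) samples: C = 1/ell, so the tail bound is
# 2 * exp(-2 * ell * eps**2).
rng = np.random.default_rng(0)
ell, eps, trials = 100, 0.1, 100_000
xi = rng.integers(0, 2, size=(trials, ell)).mean(axis=1)
print(np.mean(np.abs(xi - 0.5) >= eps))   # empirical tail frequency
print(2 * np.exp(-2 * ell * eps**2))      # McDiarmid bound (~0.27)
```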


SLIDE 9

A Bernstein form of McDiarmid's inequality (Y. Ying, 2004)

Bounding both the variations

$$| \xi(\mathbf z) - \mathbb{E}_i \, \xi(\mathbf z^i) | \le C \quad \text{for all } \mathbf z, i$$

and the variances

$$\mathbb{E}_i \left( \xi(\mathbf z) - \mathbb{E}_i \, \xi(\mathbf z^i) \right)^2 \le \sigma^2 \quad \text{for all } \mathbf z, i,$$

it holds that

$$\operatorname{Prob}_{\mathbf z \in Z^\ell} \left( | \xi(\mathbf z) - \mathbb{E}_{\mathbf z}(\xi) | \ge \epsilon \right) \le 2 \exp\left( -\frac{\epsilon^2}{2 \left( C\epsilon/3 + \ell\sigma^2 \right)} \right).$$


SLIDE 10

Structure of the bound

$$I[f^\lambda_{\mathbf z}] \;\le\; \underbrace{\inf_{f \in H} I[f]}_{\text{irreducible err.}} \;+\; \Big( \underbrace{A(\lambda)}_{\text{approximation err.}} + \underbrace{S(\lambda, \eta, \ell)}_{\text{sample err.}} \Big)^2 .$$

  • The irreducible error is a measure of the intrinsic randomness of the outputs $y$ for a drawn input $x$.
  • The approximation error $A(\lambda)$ is a measure of the increase of risk due to the regularization.
  • The sample error $S(\lambda, \eta, \ell)$ is a measure of the increase of risk due to finite sampling.
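The square on the error sum comes from the standard $L^2$ decomposition of the risk; a short derivation sketch in the deck's notation (it assumes, as is standard, that $\inf_{f \in H} I[f] = I[f_\rho]$, with $f_\rho$ and $f^\lambda$ as defined on the next slides):

$$I[f] = \int_X \left( f(x) - f_\rho(x) \right)^2 d\nu(x) + I[f_\rho],$$

so that, by the triangle inequality in $L^2(X, \nu)$,

$$I[f^\lambda_{\mathbf z}] - \inf_{f \in H} I[f] = \| f^\lambda_{\mathbf z} - f_\rho \|^2_{L^2(X,\nu)} \le \Big( \underbrace{\| f^\lambda - f_\rho \|}_{A(\lambda)} + \underbrace{\| f^\lambda_{\mathbf z} - f^\lambda \|}_{\le\, S(\lambda, \eta, \ell)} \Big)^2 .$$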


SLIDE 11

The bound on the sample error

It can be proved that, given $0 < \eta < 1$ and $\lambda > 0$, with probability at least $1 - \eta$, the sample error is bounded by

$$S(\lambda, \eta, \ell) = \kappa M C_\eta \, (\ell\lambda)^{-\frac12} \left( 1 + \kappa (\ell\lambda)^{-\frac12} \right) \left( 1 + \kappa^2 C_\eta \, \ell^{-\frac12} \lambda^{-1} \right),$$

where the constants $M$, $\kappa$ and $C_\eta$ are defined by

$$Y \subset [-M, M], \qquad \kappa^2 \ge K(x, x) \ \text{for all } x, \qquad C_\eta = 1 - \tfrac{4}{3} \log\eta + \sqrt{-8 \log\eta}.$$
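Combining this with the approximation-error estimate $A(\lambda) \le C_r \lambda^r$ from slide 13 gives an explicit bound to minimize, as prescribed on slide 6. A sketch (the constants $M$, $\kappa$, $C_r$, $r$ are placeholders the user must supply):

```python
import numpy as np

def sample_error(lam, eta, ell, M=1.0, kappa=1.0):
    """The slide's bound S(lambda, eta, ell), with M and kappa as above."""
    C_eta = 1 - (4 / 3) * np.log(eta) + np.sqrt(-8 * np.log(eta))
    s = kappa * M * C_eta / np.sqrt(ell * lam)
    return s * (1 + kappa / np.sqrt(ell * lam)) \
             * (1 + kappa**2 * C_eta / (np.sqrt(ell) * lam))

def risk_bound(lam, eta, ell, C_r=1.0, r=1.0):
    """B(lambda, eta, ell), up to the irreducible error: (A + S)^2."""
    return (C_r * lam**r + sample_error(lam, eta, ell)) ** 2

# Model selection as on slide 6: minimize the bound over a grid.
grid = np.logspace(-6, 0, 200)
print(min(grid, key=lambda lam: risk_bound(lam, eta=0.05, ell=1000)))
```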


SLIDE 12

The approximation error

It can be proved that

$$A(\lambda) = \left\| f^\lambda - f_\rho \right\|_{L^2(X, \nu)},$$

where

  • $f_\rho(x) = \int_Y y \, d\rho(y|x)$ is the regression function,
  • $f^\lambda$ is the RLS estimator in the limit case of infinite sampling, that is,
$$f^\lambda = \operatorname*{argmin}_{f \in H} \left\{ I[f] + \lambda \|f\|_H^2 \right\},$$
  • $\nu$ is the marginal distribution of $\rho$ on the input space $X$.


SLIDE 13

Bounding the approximation error

It is well known that bounding the approximation error requires some assumption on the distribution $\rho$.

  • Let us denote by $L_K$ the integral operator on $L^2(X, \nu)$ defined by
$$(L_K f)(s) = \int_X K(s, x) f(x) \, d\nu(x).$$
  • Assuming that the regression function $f_\rho$ belongs to the range of the operator $(L_K)^r$ (for some $r \in (0, 1]$), then
$$A(\lambda) \le C_r \, \lambda^r.$$
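The estimate follows from a short spectral argument (a standard computation sketched here, with $g$ the assumed preimage $f_\rho = (L_K)^r g$): in $L^2(X, \nu)$ the infinite-sample estimator is $f^\lambda = (L_K + \lambda)^{-1} L_K f_\rho$, hence

$$f^\lambda - f_\rho = -\lambda (L_K + \lambda)^{-1} f_\rho = -\lambda (L_K + \lambda)^{-1} (L_K)^r g,$$

and since $\sup_{\sigma \ge 0} \lambda \sigma^r / (\sigma + \lambda) \le \lambda^r$ for $r \in (0, 1]$,

$$A(\lambda) \le \lambda^r \, \| g \|_{L^2(X,\nu)}, \qquad \text{i.e. } C_r = \| g \|_{L^2(X,\nu)}.$$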


SLIDE 14

Rates of convergence

Given the explicit form for the bound on the expected risk, the associated optimal choice for $\lambda$ can be directly computed. It results that $\lambda_0(\ell) = O(\ell^{-\alpha})$, where

$$\alpha = \begin{cases} \dfrac{2}{2r+3} & \text{for } 0 < r \le \tfrac12 \\[2mm] \dfrac{1}{2r+1} & \text{for } \tfrac12 < r \le 1 \end{cases}$$

This choice implies the following convergence rate of the risk to its minimum, $I[f^\lambda_{\mathbf z}] - \inf_{f \in H} I[f] \le O(\ell^{-\beta})$, where

$$\beta = \begin{cases} \dfrac{4r}{2r+3} & \text{for } 0 < r \le \tfrac12 \\[2mm] \dfrac{2r}{2r+1} & \text{for } \tfrac12 < r \le 1 \end{cases}$$
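Where these exponents come from (a balancing sketch under the bounds above, added as a plausibility check): for $\tfrac12 < r \le 1$ the leading term of $S$ is $O((\ell\lambda)^{-1/2})$, while for $0 < r \le \tfrac12$ the quadratic term $O(\ell^{-1} \lambda^{-3/2})$ dominates; equating each with $A(\lambda) \sim \lambda^r$ gives

$$\lambda^r \sim (\ell\lambda)^{-\frac12} \;\Rightarrow\; \lambda_0 \sim \ell^{-\frac{1}{2r+1}}, \qquad \lambda^r \sim \ell^{-1} \lambda^{-\frac32} \;\Rightarrow\; \lambda_0 \sim \ell^{-\frac{2}{2r+3}},$$

and in both regimes the bound scales as $(\lambda_0^r)^2$, so that $\beta = 2r\alpha$.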


SLIDE 15

Fast rates

Under the maximum regularity assumption $r = 1$ ($f_\rho$ belonging to the range of $L_K$), these results give the optimal rate

$$I[f^\lambda_{\mathbf z}] - \inf_{f \in H} I[f] \le O\left( \ell^{-\frac23} \log \tfrac{1}{\eta} \right).$$

This improves

  • the rate in [T. Zhang, 2003] in its dependency on the confidence level $\eta$, from $O(\eta^{-1})$ to logarithmic,
  • and the rate in [S. Smale, D. Zhou, 2004] in its dependency on $\ell$, from $O(\ell^{-1/2})$ to $O(\ell^{-2/3})$.


SLIDE 16

The degree of ill-posedness of $L_K$

We will assume the following decay condition on the eigenvalues $\sigma_i^2$ of the integral operator $L_K$, for some $p \ge 1$:

$$\sigma_i^2 \le C_p \, i^{-p}.$$

  • The parameter $p$ is known as the degree of ill-posedness of the operator $L_K$.
  • This condition can be related to the smoothness properties of the kernel $K$ and the marginal probability density.
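One way to get an empirical feel for $p$ (a rough sketch, not from the slides): the eigenvalues of the normalized Gram matrix $K/\ell$ approximate the eigenvalues $\sigma_i^2$ of $L_K$, so the decay exponent can be estimated from a log-log fit.

```python
import numpy as np

# Eigenvalues of the normalized Gram matrix approximate those of L_K;
# fit the decay exponent p on the leading eigenvalues (log-log slope).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=500)
K = np.exp(-(X[:, None] - X[None, :]) ** 2)      # Gaussian kernel
sigma2 = np.sort(np.linalg.eigvalsh(K))[::-1] / len(X)
i = np.arange(1, 21)
p_hat = -np.polyfit(np.log(i), np.log(sigma2[:20]), 1)[0]
print(p_hat)  # for the Gaussian kernel the true decay is near-exponential
```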


SLIDE 17

Improved bound on the sample error

Defining the function

$$\Theta(\lambda, \eta, \ell) = \kappa C_\eta \, (\ell\lambda)^{-\frac12} \left( \kappa (\ell\lambda)^{-\frac12} + \sqrt{ \frac{p}{p-1} \left( \frac{C_p}{\lambda} \right)^{\frac1p} } \right),$$

if $\lambda$, $\eta$ and $\ell$ are such that $\Theta(\lambda, \eta, \ell) \le 1$, then with probability at least $1 - \eta$ the sample error is bounded by

$$S(\lambda, \eta, \ell) = \tfrac12 \, \kappa C_r C_\eta \, \lambda^{r - \frac12} \ell^{-\frac12} \left( 1 + \kappa (\ell\lambda)^{-\frac12} \right) (1 - \Theta)^{-1} + M \kappa^{-1} \lambda^{\frac12} \, \Theta \left( 1 + \tfrac12 \Theta (1 - \Theta)^{-1} \right).$$


SLIDE 18

Improved rates of convergence

The new bound can be used to obtain improved rates of convergence when $\tfrac12 < r \le 1$; in fact, in this case

$$\lambda_0(\ell) = O(\ell^{-\alpha}) \quad \text{with} \quad \alpha = \frac{p}{2rp + 1},$$

and correspondingly

$$I[f^\lambda_{\mathbf z}] - \inf_{f \in H} I[f] \le O(\ell^{-\beta}) \quad \text{with} \quad \beta = \frac{2rp}{2rp + 1}.$$

For large $p$ the resulting convergence rate approaches $O(\ell^{-1})$.
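Again a balancing plausibility check (a sketch using the dominant terms of the improved bound on slide 17, where $\lambda^{1/2}\Theta \sim \ell^{-1/2} \lambda^{-1/(2p)}$ up to constants):

$$\lambda^r \sim \ell^{-\frac12} \lambda^{-\frac{1}{2p}} \;\Rightarrow\; \lambda_0 \sim \ell^{-\frac{p}{2rp+1}}, \qquad \beta = 2r\alpha = \frac{2rp}{2rp+1} \longrightarrow 1 \quad (p \to \infty).$$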


SLIDE 19

Conclusions

  • The estimate of the sample error $S(\lambda, \eta, \ell)$ does not require using covering numbers as a capacity measure of the hypothesis space,
  • under the assumption of exponential decay of the eigenvalues of $L_K$, rates arbitrarily close to $O(\ell^{-1})$ can be achieved,
  • due to the logarithmic dependence on the confidence in the expression of the bounds, the convergence results hold almost surely and not just in probability.
