Model Selection and Fast Rates for Regularized Least-Squares (Andrea Caponnetto)



  1. Model Selection and Fast Rates for Regularized Least-Squares. Andrea Caponnetto. DISI, Università di Genova; CBCL, Massachusetts Institute of Technology. Genova, October 30, 2004.

  2. Plan
  • Regularized least-squares (RLS) in statistical learning
  • Bounds on the expected risk and model selection
  • Evaluating the approximation and the sample errors
  • Fast rates of convergence of the risk to its minimum

  3. Training sets
  • The set $Z = X \times Y$, with the input space $X$ a compact subset of $\mathbb{R}^n$ and the output space $Y$ a compact subset of $\mathbb{R}$.
  • The probability measure $\rho$ on the space $Z$.
  • The training set $z = ((x_1, y_1), \dots, (x_\ell, y_\ell))$, a sequence of $\ell$ independent identically distributed elements of $Z$.

  4. Regression using RLS
  The estimator $f_z^\lambda$ is defined as the unique hypothesis minimizing the sum of loss and complexity
      $f_z^\lambda = \operatorname{argmin}_{f \in \mathcal{H}} \left\{ \frac{1}{\ell} \sum_{i=1}^{\ell} (f(x_i) - y_i)^2 + \lambda \|f\|_{\mathcal{H}}^2 \right\}$,
  where
  • the hypothesis space $\mathcal{H}$ is the reproducing kernel Hilbert space (RKHS) with kernel $K : X \times X \to \mathbb{R}$,
  • the parameter $\lambda$ tunes the balance between the two terms.
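  As a concrete illustration, here is a minimal sketch of the RLS estimator in Python, assuming a Gaussian kernel on a one-dimensional input space; the kernel choice, the bandwidth and the function names are assumptions of the example, not part of the slides. By the representer theorem the minimizer admits the finite expansion $f_z^\lambda(\cdot) = \sum_i c_i K(\cdot, x_i)$, so computing it reduces to solving an $\ell \times \ell$ linear system.

      import numpy as np

      def gaussian_kernel(a, b, sigma=0.2):
          # K(s, t) = exp(-|s - t|^2 / (2 sigma^2)); a, b are 1-D arrays of inputs
          # and the bandwidth sigma is an arbitrary choice for the example.
          return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

      def rls_fit(x, y, lam, kernel=gaussian_kernel):
          # Minimizing (1/ell) sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2 over the RKHS
          # reduces, via the representer theorem, to the system (K + ell*lam*I) c = y.
          ell = len(x)
          K = kernel(x, x)
          return np.linalg.solve(K + ell * lam * np.eye(ell), y)

      def rls_predict(x_new, x, c, kernel=gaussian_kernel):
          # Evaluate f_z^lambda(x_new) = sum_i c_i K(x_new, x_i).
          return kernel(x_new, x) @ c

  Larger values of $\lambda$ shrink $\|f_z^\lambda\|_{\mathcal{H}}$ at the price of a worse fit to the data, which is exactly the trade-off the following slides quantify.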

  5. A criterion for model selection
  In the context of RLS, a criterion for model selection is a rule for choosing $\lambda$ so as to achieve high performance. The performance of the estimator $f_z^\lambda$ is measured by the expected risk
      $I[f_z^\lambda] = \int_{X \times Y} (f_z^\lambda(x) - y)^2 \, d\rho(x, y)$.
  • It is a random variable,
  • it depends on the unknown distribution $\rho$.
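  In a simulation where $\rho$ is known (unlike the learning problem the slide describes), the expected risk can be approximated by Monte Carlo on a large fresh sample, which makes both bullet points concrete. The toy distribution below is an arbitrary choice made only for illustration.

      import numpy as np

      rng = np.random.default_rng(0)

      def sample_rho(m):
          # Toy rho on X x Y with X = [0, 1] and Y = [-1, 1]: uniform inputs,
          # noisy sinusoidal outputs clipped so that Y stays compact.
          x = rng.uniform(0.0, 1.0, size=m)
          y = np.clip(np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(m), -1.0, 1.0)
          return x, y

      def expected_risk(f, m=200_000):
          # Monte Carlo approximation of I[f] = int (f(x) - y)^2 d rho(x, y).
          x, y = sample_rho(m)
          return np.mean((f(x) - y) ** 2)

      # For a fixed hypothesis the risk is just a number; for the data-dependent
      # estimator f_z^lambda it becomes a random variable through z.
      print(expected_risk(lambda x: np.sin(2 * np.pi * x)))  # roughly the noise level
      print(expected_risk(lambda x: np.zeros_like(x)))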

  6. A criterion for model selection (cont.)
  The best we can do is to determine a function $B(\lambda, \eta, \ell)$ which bounds the expected risk $I[f_z^\lambda]$ with confidence level $1 - \eta$, that is,
      $\operatorname{Prob}_{z \in Z^\ell} \left\{ I[f_z^\lambda] \le \inf_{f \in \mathcal{H}} I[f] + B(\lambda, \eta, \ell) \right\} \ge 1 - \eta$.
  A natural criterion for model selection then consists of choosing the regularization parameter that minimizes this bound,
      $\lambda_0(\eta, \ell) = \operatorname{argmin}_{\lambda > 0} B(\lambda, \eta, \ell)$.
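  Operationally, once an explicit bound $B(\lambda, \eta, \ell)$ is available, the criterion amounts to a one-dimensional minimization over $\lambda$, for instance on a logarithmic grid. The sketch below uses a placeholder bound with the approximation-plus-sample structure of the later slides; the constants, the grid and the function names are arbitrary choices for illustration.

      import numpy as np

      def bound_B(lam, eta, ell, C_r=1.0, r=1.0, kappa=1.0, M=1.0):
          # Placeholder for B(lambda, eta, ell): squared sum of an approximation
          # term <= C_r * lambda^r and a crude sample term of order
          # (ell * lambda)^(-1/2); purely illustrative constants.
          A = C_r * lam ** r
          S = kappa * M * np.log(1.0 / eta) / np.sqrt(ell * lam)
          return (A + S) ** 2

      def select_lambda(eta, ell, grid=np.logspace(-6, 0, 200)):
          # lambda_0(eta, ell) = argmin over the grid of B(lambda, eta, ell).
          return grid[np.argmin([bound_B(lam, eta, ell) for lam in grid])]

      print(select_lambda(eta=0.05, ell=1000))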

  7. Main contributions in the literature
  • Model selection performed by bounds using covering numbers as a measure of capacity of a compact hypothesis space [F. Cucker, S. Smale, 2001, 2002]
  • Use of stability of the estimator and concentration inequalities as tools to bound the risk [O. Bousquet, A. Elisseeff, 2000]
  • Direct estimates of integral operators by concentration inequalities, with no need of covering numbers [E. De Vito et al., 2004]
  • Use of a Bernstein form of McDiarmid's concentration inequality to improve the rates [S. Smale, D. Zhou, 2004]

  8. A concentration inequality (McDiarmid, 1989)
  • Let $\xi$ be a random variable, $\xi : Z^\ell \to \mathbb{R}$,
  • let $z^i$ be the training set with the $i$-th example replaced by $(x_i', y_i')$,
  • assume that there is a constant $C$ such that $|\xi(z) - \xi(z^i)| \le C$ for all $z$, $z^i$, $i$;
  then McDiarmid's inequality tells us that
      $\operatorname{Prob}_{z \in Z^\ell} \left( |\xi(z) - \mathbb{E}_z(\xi)| \ge \epsilon \right) \le 2 \exp\left( -\frac{2\epsilon^2}{\ell C^2} \right)$.
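  For intuition, take $\xi(z)$ to be the empirical mean of $\ell$ i.i.d. variables with values in $[0, 1]$: replacing one example changes $\xi$ by at most $C = 1/\ell$, so the bound becomes $2\exp(-2\epsilon^2\ell)$. The sketch below compares this with a Monte Carlo estimate of the deviation probability; the Bernoulli distribution and the numbers are arbitrary toy choices.

      import numpy as np

      rng = np.random.default_rng(0)
      ell, eps, trials = 200, 0.1, 20_000

      # xi(z) = mean of ell Bernoulli(0.3) draws; each draw lies in [0, 1],
      # so replacing one example moves xi(z) by at most C = 1 / ell.
      means = rng.binomial(1, 0.3, size=(trials, ell)).mean(axis=1)
      empirical = np.mean(np.abs(means - 0.3) >= eps)

      C = 1.0 / ell
      mcdiarmid = 2 * np.exp(-2 * eps ** 2 / (ell * C ** 2))  # = 2 exp(-2 eps^2 ell)

      print(f"empirical deviation probability: {empirical:.4f}")
      print(f"McDiarmid bound:                 {mcdiarmid:.4f}")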

  9. A Bernstein form of McDiarmid's inequality (Y. Ying, 2004)
  Bounding both the variations
      $|\xi(z) - \mathbb{E}_i\, \xi(z^i)| \le C$  for all $z$, $i$,
  and the variances
      $\mathbb{E}_i \left( \xi(z) - \mathbb{E}_i\, \xi(z^i) \right)^2 \le \sigma^2$  for all $z$, $i$,
  it holds that
      $\operatorname{Prob}_{z \in Z^\ell} \left( |\xi(z) - \mathbb{E}_z(\xi)| \ge \epsilon \right) \le 2 \exp\left( -\frac{\epsilon^2}{2\left( C\epsilon/3 + \ell\sigma^2 \right)} \right)$.
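  The two tails are easy to compare numerically: the Bernstein form replaces the worst-case range $C$ in the exponent by a variance term plus a small $C\epsilon/3$ correction, so it is tighter roughly whenever $\ell\sigma^2$ falls well below $\ell C^2/4$. The helper below simply evaluates both displayed expressions on arbitrary illustrative numbers.

      import numpy as np

      def mcdiarmid_tail(eps, ell, C):
          return 2 * np.exp(-2 * eps ** 2 / (ell * C ** 2))

      def bernstein_tail(eps, ell, C, sigma2):
          return 2 * np.exp(-eps ** 2 / (2 * (C * eps / 3 + ell * sigma2)))

      # Illustrative numbers only: C is of order 1/ell, and the two tails are
      # compared for decreasing values of the variance parameter sigma^2.
      eps, ell = 0.05, 1000
      C = 1.0 / ell
      for sigma2 in (1e-6, 1e-7, 1e-8):
          print(sigma2, mcdiarmid_tail(eps, ell, C), bernstein_tail(eps, ell, C, sigma2))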

  10. Structure of the bound
      $I[f_z^\lambda] \le \underbrace{\inf_{f \in \mathcal{H}} I[f]}_{\text{irreducible err.}} + \big( \underbrace{A(\lambda)}_{\text{approximation err.}} + \underbrace{S(\lambda, \eta, \ell)}_{\text{sample err.}} \big)^2$.
  • The irreducible error is a measure of the intrinsic randomness of the outputs $y$ for a drawn input $x$.
  • The approximation error $A(\lambda)$ is a measure of the increase of risk due to the regularization.
  • The sample error $S(\lambda, \eta, \ell)$ is a measure of the increase of risk due to finite sampling.

  11. The bound on the sample error
  It can be proved that, given $0 < \eta < 1$ and $\lambda > 0$, with probability at least $1 - \eta$ the sample error is bounded by
      $S(\lambda, \eta, \ell) = \kappa M C_\eta (\ell\lambda)^{-\frac{1}{2}} \left( 1 + \kappa (\ell\lambda)^{-\frac{1}{2}} \right) \left( 1 + \kappa^2 C_\eta \, \ell^{-\frac{1}{2}} \lambda^{-1} \right)$,
  where the constants $M$, $\kappa$ and $C_\eta$ are defined by
      $Y \subset [-M, M]$,   $\kappa^2 \ge K(x, x)$ for all $x$,   $C_\eta = \frac{4}{3} \log\frac{1}{\eta} + \sqrt{8 \log\frac{1}{\eta}}$.

  12. The approximation error
  It can be proved that
      $A(\lambda) = \left\| f^\lambda - f_\rho \right\|_{L^2(X, \nu)}$,
  where
  • $f_\rho(x) = \int_Y y \, d\rho(y \mid x)$ is the regression function,
  • $f^\lambda$ is the RLS estimator in the limit case of infinite sampling, that is, $f^\lambda = \operatorname{argmin}_{f \in \mathcal{H}} \left\{ I[f] + \lambda \|f\|_{\mathcal{H}}^2 \right\}$,
  • $\nu$ is the marginal distribution of $\rho$ on the input space $X$.

  13. Bounding the approximation error
  It is well known that bounding the approximation error requires some assumption on the distribution $\rho$.
  • Let us denote by $L_K$ the integral operator on $L^2(X, \nu)$ defined by
      $(L_K f)(s) = \int_X K(s, x) f(x) \, d\nu(x)$.
  • Assuming that the regression function $f_\rho$ belongs to the range of the operator $(L_K)^r$ (for some $r \in (0, 1]$), then
      $A(\lambda) \le C_r \lambda^r$.
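  One way to see the bound at work: in the eigenbasis of $L_K$ the infinite-sample solution can be written as $f^\lambda = (L_K + \lambda)^{-1} L_K f_\rho$, so if $f_\rho = (L_K)^r g$ the coordinates of $f^\lambda - f_\rho$ are $-\lambda\, t_i^r g_i / (t_i + \lambda)$, with $t_i$ the eigenvalues of $L_K$, and each is bounded in absolute value by $\lambda^r |g_i|$. The sketch below checks $A(\lambda) \le \|g\|\,\lambda^r$ numerically; the eigenvalue decay, the coefficients and the value of $r$ are arbitrary choices for the demonstration.

      import numpy as np

      # Synthetic eigenvalues t_i of L_K and coefficients g_i of g in the
      # eigenbasis, where f_rho = (L_K)^r g; all chosen arbitrarily.
      i = np.arange(1, 2001)
      t = i ** (-2.0)          # eigenvalues of L_K
      g = 1.0 / i              # expansion coefficients of g
      r = 0.7                  # assumed source condition

      def approximation_error(lam):
          # Coordinates of f^lambda - f_rho are -lam * t^r * g / (t + lam),
          # so A(lambda) is the l2 norm of that vector.
          return np.sqrt(np.sum((lam * t ** r * g / (t + lam)) ** 2))

      C_r = np.sqrt(np.sum(g ** 2))        # here C_r = ||g||
      for lam in (1e-1, 1e-2, 1e-3, 1e-4):
          print(f"lam={lam:.0e}  A(lam)={approximation_error(lam):.5f}  "
                f"C_r*lam^r={C_r * lam ** r:.5f}")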

  14. Rates of convergence
  Given the explicit form of the bound on the expected risk, the associated optimal choice of $\lambda$ can be computed directly. It turns out that $\lambda_0(\ell) = O(\ell^{-\alpha})$, where
      $\alpha = \frac{2}{2r+3}$ for $0 < r \le \frac{1}{2}$,   $\alpha = \frac{1}{2r+1}$ for $\frac{1}{2} < r \le 1$.
  This choice implies the following convergence rate of the risk to its minimum: $I[f_z^\lambda] - \inf_{f \in \mathcal{H}} I[f] \le O(\ell^{-\beta})$, where
      $\beta = \frac{4r}{2r+3}$ for $0 < r \le \frac{1}{2}$,   $\beta = \frac{2r}{2r+1}$ for $\frac{1}{2} < r \le 1$.
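  The piecewise exponents are easy to tabulate; the small helper below (name and usage are mine) evaluates them and shows, in particular, that $r = 1$ gives $\beta = 2/3$, the fast rate of the next slide.

      def rates(r):
          # Exponents alpha and beta in lambda_0 ~ ell^(-alpha) and
          # I[f_z^lambda] - inf I[f] ~ ell^(-beta), as on this slide.
          assert 0 < r <= 1
          if r <= 0.5:
              return 2 / (2 * r + 3), 4 * r / (2 * r + 3)
          return 1 / (2 * r + 1), 2 * r / (2 * r + 1)

      for r in (0.25, 0.5, 0.75, 1.0):
          print(r, rates(r))    # r = 1 gives beta = 2/3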

  15. Fast rates
  Under the maximum regularity assumption $r = 1$ ($f_\rho$ belonging to the range of $L_K$), these results give the optimal rate
      $I[f_z^\lambda] - \inf_{f \in \mathcal{H}} I[f] \le O\!\left( \ell^{-\frac{2}{3}} \log\frac{1}{\eta} \right)$.
  This improves
  • the rate in [T. Zhang, 2003] in its dependency on the confidence level $\eta$, from $O(\eta^{-1})$ to logarithmic,
  • and the rate in [S. Smale, D. Zhou, 2004] in its dependency on $\ell$, from $O(\ell^{-1/2})$ to $O(\ell^{-2/3})$.

  16. The degree of ill-posedness of $L_K$
  We will assume the following decay condition on the eigenvalues $\sigma_i^2$ of the integral operator $L_K$, for some $p \ge 1$:
      $\sigma_i^2 \le C_p \, i^{-p}$.
  • The parameter $p$ is known as the degree of ill-posedness of the operator $L_K$.
  • This condition can be related to the smoothness properties of the kernel $K$ and of the marginal probability density.
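  The decay exponent can be estimated from a spectrum by a least-squares fit in log-log coordinates; for instance, the eigenvalues of the empirical kernel matrix divided by $\ell$ are a natural proxy for those of $L_K$. The sketch below fits $p$ on a synthetic spectrum with known decay; the data, the noise and the fitting recipe are assumptions of the example.

      import numpy as np

      def estimate_degree(eigenvalues):
          # Fit sigma_i^2 ~ C_p * i^(-p) by least squares in log-log coordinates;
          # p is minus the slope of log(eigenvalue) against log(index).
          eigenvalues = np.sort(eigenvalues)[::-1]
          i = np.arange(1, len(eigenvalues) + 1)
          keep = eigenvalues > 1e-12            # drop the numerically zero tail
          slope, _ = np.polyfit(np.log(i[keep]), np.log(eigenvalues[keep]), 1)
          return -slope

      # Synthetic spectrum with known decay p = 2, perturbed by mild noise.
      rng = np.random.default_rng(0)
      i = np.arange(1, 501)
      spectrum = i ** (-2.0) * np.exp(0.05 * rng.standard_normal(i.size))
      print(estimate_degree(spectrum))          # close to 2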

  17. Improved bound on the sample error
  Define the function
      $\Theta(\lambda, \eta, \ell) = \kappa C_\eta (\ell\lambda)^{-\frac{1}{2}} \left[ \kappa (\ell\lambda)^{-\frac{1}{2}} + \left( \frac{C_p}{(p-1)\lambda} \right)^{\frac{1}{p}} \right]$.
  Given $\lambda$, $\eta$ and $\ell$ such that $\Theta(\lambda, \eta, \ell) \le 1$, then with probability at least $1 - \eta$ the sample error is bounded by
      $S(\lambda, \eta, \ell) = 2 \kappa C_r C_\eta \, \lambda^{r - \frac{1}{2}} \ell^{-\frac{1}{2}} \left( 1 + \kappa (\ell\lambda)^{-\frac{1}{2}} \right) (1 - \Theta)^{-1} \left[ 1 + \left( 1 + M \kappa^{-1} \lambda^{\frac{1}{2}} \right) \Theta \, (1 - \Theta)^{-1} \right]$.

  18. Improved rates of convergence
  The new bound can be used to obtain improved rates of convergence when $\frac{1}{2} < r \le 1$. In fact, in this case
      $\lambda_0(\ell) = O(\ell^{-\alpha})$ with $\alpha = \frac{p}{2rp+1}$,
  and correspondingly
      $I[f_z^\lambda] - \inf_{f \in \mathcal{H}} I[f] \le O(\ell^{-\beta})$ with $\beta = \frac{2rp}{2rp+1}$.
  For large $p$ the resulting convergence rate approaches $O(\ell^{-1})$.
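  As before, the improved exponents can be tabulated directly; the helper below (names are mine) shows how $\beta$ approaches 1 as $p$ grows, here under the maximal regularity $r = 1$.

      def improved_rates(r, p):
          # alpha and beta from this slide, valid for 1/2 < r <= 1 and p >= 1.
          assert 0.5 < r <= 1 and p >= 1
          return p / (2 * r * p + 1), 2 * r * p / (2 * r * p + 1)

      for p in (1, 2, 5, 20, 100):
          print(p, improved_rates(1.0, p))   # beta tends to 1 as p grows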

  19. Conclusions
  • The estimate of the sample error $S(\lambda, \eta, \ell)$ does not require using covering numbers as a capacity measure of the hypothesis space,
  • under the assumption of exponential decay of the eigenvalues of $L_K$, rates arbitrarily close to $O(\ell^{-1})$ can be achieved,
  • due to the logarithmic dependence on the confidence level in the expression of the bounds, the convergence results hold almost surely and not just in probability.
