SLIDE 1

DISI, Università di Genova

Genova, October 30 2004

CBCL, Massachusetts Institute of Technology

Model Selection and Fast Rates for Regularized Least-Squares

Andrea Caponnetto


SLIDE 2

Plan

  • Regularized least-squares (RLS) in statistical learning
  • Bounds on the expected risk and model selection
  • Evaluating the approximation and the sample errors
  • Fast rates of convergence of the risk to its minimum


SLIDE 3

Training sets

  • The set $Z = X \times Y$, with the input space $X$ a compact subset of $\mathbb{R}^n$ and the output space $Y$ a compact subset of $\mathbb{R}$.
  • The probability measure $\rho$ on the space $Z$.
  • The training set $\mathbf{z} = ((x_1, y_1), \dots, (x_\ell, y_\ell))$, a sequence of $\ell$ independent identically distributed elements of $Z$.


SLIDE 4

Regression using RLS

The estimator $f^\lambda_{\mathbf z}$ is defined as the unique hypothesis minimizing the sum of loss and complexity

$$f^\lambda_{\mathbf z} = \operatorname*{argmin}_{f \in H} \left\{ \frac{1}{\ell} \sum_{i=1}^{\ell} \left( f(x_i) - y_i \right)^2 + \lambda \|f\|_H^2 \right\},$$

where

  • the hypothesis space $H$ is the reproducing kernel Hilbert space (RKHS) with kernel $K : X \times X \to \mathbb{R}$,
  • the parameter $\lambda$ tunes the balance between the two terms.
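By the representer theorem this infinite-dimensional problem reduces to an $\ell \times \ell$ linear system. A minimal Python sketch of the resulting estimator (the function names and the Gaussian-kernel example are illustrative assumptions, not from the slides):

```python
import numpy as np

def rls_fit(X, y, kernel, lam):
    """Regularized least-squares sketch: by the representer theorem the
    minimizer is f(x) = sum_i c_i K(x_i, x), with coefficients
    c = (K + lam * ell * I)^{-1} y, K being the Gram matrix."""
    ell = len(X)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    c = np.linalg.solve(K + lam * ell * np.eye(ell), y)
    return lambda x: sum(ci * kernel(xi, x) for ci, xi in zip(c, X))

# Toy usage: noisy samples of a sine wave, Gaussian kernel.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=30)
y = np.sin(3 * X) + 0.1 * rng.standard_normal(30)
f = rls_fit(X, y, lambda a, b: np.exp(-(a - b) ** 2), lam=1e-3)
print(f(0.5))
```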


SLIDE 5

A criterion for model selection

In the context of RLS, a criterion for model selection is a rule to choose $\lambda$ in order to achieve high performance. The performance of the estimator $f^\lambda_{\mathbf z}$ is measured by the expected risk

$$I[f^\lambda_{\mathbf z}] = \int_{X \times Y} \left( f^\lambda_{\mathbf z}(x) - y \right)^2 d\rho(x, y).$$

  • It is a random variable,
  • it depends on the unknown distribution $\rho$.


SLIDE 6

A criterion for model selection (cont.)

The best we can do is to determine a function $B(\lambda, \eta, \ell)$ which bounds the expected risk $I[f^\lambda_{\mathbf z}]$ with confidence level $1 - \eta$, that is,

$$\operatorname{Prob}_{\mathbf z \in Z^\ell} \left\{ I[f^\lambda_{\mathbf z}] \le \inf_{f \in H} I[f] + B(\lambda, \eta, \ell) \right\} \ge 1 - \eta.$$

Then a natural criterion for model selection consists of choosing the regularization parameter that minimizes this bound,

$$\lambda_0(\eta, \ell) = \operatorname*{argmin}_{\lambda > 0} \, \{ B(\lambda, \eta, \ell) \}.$$
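In code the rule is a one-line grid search. A minimal sketch, assuming a callable bound `B(lam, eta, ell)` (a hypothetical signature; the logarithmic grid is an arbitrary choice):

```python
import numpy as np

def select_lambda(B, eta, ell, grid=None):
    """The slide's rule: the lambda on a grid minimizing the bound B."""
    if grid is None:
        grid = np.logspace(-8, 1, 200)  # arbitrary search grid
    return min(grid, key=lambda lam: B(lam, eta, ell))
```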


SLIDE 7

Main contributions in the literature

  • Model selection performed by bounds using covering numbers as a measure of capacity of a compact hypothesis space [F. Cucker, S. Smale, 2001, 2002]
  • Use of stability of the estimator and concentration inequalities as tools to bound the risk [O. Bousquet, A. Elisseeff, 2000]
  • Direct estimates of integral operators by concentration inequalities, with no need of covering numbers [E. De Vito et al., 2004]
  • Use of a Bernstein form of McDiarmid's concentration inequality to improve the rates [S. Smale, D. Zhou, 2004]


SLIDE 8

A concentration inequality (McDiarmid, 1989)

  • Let $\xi$ be a random variable, $\xi : Z^\ell \to \mathbb{R}$,
  • let $\mathbf z^i$ be the training set with the $i$-th example replaced by $(x'_i, y'_i)$,
  • assume that there is a constant $C$ such that $| \xi(\mathbf z) - \xi(\mathbf z^i) | \le C$ for all $\mathbf z, \mathbf z^i, i$;

then McDiarmid's inequality tells us that

$$\operatorname{Prob}_{\mathbf z \in Z^\ell} \left( | \xi(\mathbf z) - \mathbb{E}_{\mathbf z}(\xi) | \ge \epsilon \right) \le 2 \exp\left( -\frac{2\epsilon^2}{\ell C^2} \right).$$
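A quick numerical sanity check (illustrative, not from the slides): for the empirical mean of $\ell$ variables in $[0, 1]$, replacing one example moves $\xi$ by at most $C = 1/\ell$, so the bound reads $2 \exp(-2\ell\epsilon^2)$.

```python
import numpy as np

# Monte Carlo check of McDiarmid for xi(z) = empirical mean of
# Bernoulli(0.5) samples: C = 1/ell, so the tail bound is
# 2 * exp(-2 * ell * eps**2).
rng = np.random.default_rng(0)
ell, eps, trials = 100, 0.1, 100_000
xi = rng.integers(0, 2, size=(trials, ell)).mean(axis=1)
print(np.mean(np.abs(xi - 0.5) >= eps))   # empirical tail frequency
print(2 * np.exp(-2 * ell * eps**2))      # McDiarmid bound (~0.27)
```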


SLIDE 9

A Bernstein form of McDiarmid's inequality (Y. Ying, 2004)

Bounding both the variations

$$| \xi(\mathbf z) - \mathbb{E}_i \, \xi(\mathbf z^i) | \le C \quad \text{for all } \mathbf z, i$$

and the variances

$$\mathbb{E}_i \left( \xi(\mathbf z) - \mathbb{E}_i \, \xi(\mathbf z^i) \right)^2 \le \sigma^2 \quad \text{for all } \mathbf z, i,$$

it holds that

$$\operatorname{Prob}_{\mathbf z \in Z^\ell} \left( | \xi(\mathbf z) - \mathbb{E}_{\mathbf z}(\xi) | \ge \epsilon \right) \le 2 \exp\left( -\frac{\epsilon^2}{2 \left( C\epsilon/3 + \ell\sigma^2 \right)} \right).$$


SLIDE 10

Structure of the bound

$$I[f^\lambda_{\mathbf z}] \;\le\; \underbrace{\inf_{f \in H} I[f]}_{\text{irreducible err.}} \;+\; \Big( \underbrace{A(\lambda)}_{\text{approximation err.}} + \underbrace{S(\lambda, \eta, \ell)}_{\text{sample err.}} \Big)^2 .$$

  • The irreducible error is a measure of the intrinsic randomness of the outputs $y$ for a drawn input $x$.
  • The approximation error $A(\lambda)$ is a measure of the increase of risk due to the regularization.
  • The sample error $S(\lambda, \eta, \ell)$ is a measure of the increase of risk due to finite sampling.
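The square on the error sum comes from the standard $L^2$ decomposition of the risk; a short derivation sketch in the deck's notation (it assumes, as is standard, that $\inf_{f \in H} I[f] = I[f_\rho]$, with $f_\rho$ and $f^\lambda$ as defined on the next slides):

$$I[f] = \int_X \left( f(x) - f_\rho(x) \right)^2 d\nu(x) + I[f_\rho],$$

so that, by the triangle inequality in $L^2(X, \nu)$,

$$I[f^\lambda_{\mathbf z}] - \inf_{f \in H} I[f] = \| f^\lambda_{\mathbf z} - f_\rho \|^2_{L^2(X,\nu)} \le \Big( \underbrace{\| f^\lambda - f_\rho \|}_{A(\lambda)} + \underbrace{\| f^\lambda_{\mathbf z} - f^\lambda \|}_{\le\, S(\lambda, \eta, \ell)} \Big)^2 .$$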


SLIDE 11

The bound on the sample error

It can be proved that, given $0 < \eta < 1$ and $\lambda > 0$, with probability at least $1 - \eta$, the sample error is bounded by

$$S(\lambda, \eta, \ell) = \kappa M C_\eta \, (\ell\lambda)^{-\frac12} \left( 1 + \kappa (\ell\lambda)^{-\frac12} \right) \left( 1 + \kappa^2 C_\eta \, \ell^{-\frac12} \lambda^{-1} \right),$$

where the constants $M$, $\kappa$ and $C_\eta$ are defined by

$$Y \subset [-M, M], \qquad \kappa^2 \ge K(x, x) \ \text{for all } x, \qquad C_\eta = 1 - \tfrac{4}{3} \log\eta + \sqrt{-8 \log\eta}.$$
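Combining this with the approximation-error estimate $A(\lambda) \le C_r \lambda^r$ from slide 13 gives an explicit bound to minimize, as prescribed on slide 6. A sketch (the constants $M$, $\kappa$, $C_r$, $r$ are placeholders the user must supply):

```python
import numpy as np

def sample_error(lam, eta, ell, M=1.0, kappa=1.0):
    """The slide's bound S(lambda, eta, ell), with M and kappa as above."""
    C_eta = 1 - (4 / 3) * np.log(eta) + np.sqrt(-8 * np.log(eta))
    s = kappa * M * C_eta / np.sqrt(ell * lam)
    return s * (1 + kappa / np.sqrt(ell * lam)) \
             * (1 + kappa**2 * C_eta / (np.sqrt(ell) * lam))

def risk_bound(lam, eta, ell, C_r=1.0, r=1.0):
    """B(lambda, eta, ell), up to the irreducible error: (A + S)^2."""
    return (C_r * lam**r + sample_error(lam, eta, ell)) ** 2

# Model selection as on slide 6: minimize the bound over a grid.
grid = np.logspace(-6, 0, 200)
print(min(grid, key=lambda lam: risk_bound(lam, eta=0.05, ell=1000)))
```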


SLIDE 12

The approximation error

It can be proved that

$$A(\lambda) = \left\| f^\lambda - f_\rho \right\|_{L^2(X, \nu)},$$

where

  • $f_\rho(x) = \int_Y y \, d\rho(y|x)$ is the regression function,
  • $f^\lambda$ is the RLS estimator in the limit case of infinite sampling, that is,
$$f^\lambda = \operatorname*{argmin}_{f \in H} \left\{ I[f] + \lambda \|f\|_H^2 \right\},$$
  • $\nu$ is the marginal distribution of $\rho$ on the input space $X$.


SLIDE 13

Bounding the approximation error

It is well known that bounding the approximation error requires some assumption on the distribution $\rho$.

  • Let us denote by $L_K$ the integral operator on $L^2(X, \nu)$ defined by
$$(L_K f)(s) = \int_X K(s, x) f(x) \, d\nu(x).$$
  • Assuming that the regression function $f_\rho$ belongs to the range of the operator $(L_K)^r$ (for some $r \in (0, 1]$), then
$$A(\lambda) \le C_r \, \lambda^r.$$
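The estimate follows from a short spectral argument (a standard computation sketched here, with $g$ the assumed preimage $f_\rho = (L_K)^r g$): in $L^2(X, \nu)$ the infinite-sample estimator is $f^\lambda = (L_K + \lambda)^{-1} L_K f_\rho$, hence

$$f^\lambda - f_\rho = -\lambda (L_K + \lambda)^{-1} f_\rho = -\lambda (L_K + \lambda)^{-1} (L_K)^r g,$$

and since $\sup_{\sigma \ge 0} \lambda \sigma^r / (\sigma + \lambda) \le \lambda^r$ for $r \in (0, 1]$,

$$A(\lambda) \le \lambda^r \, \| g \|_{L^2(X,\nu)}, \qquad \text{i.e. } C_r = \| g \|_{L^2(X,\nu)}.$$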


SLIDE 14

Rates of convergence

Given the explicit form for the bound on the expected risk, the associated optimal choice for $\lambda$ can be directly computed. It results that $\lambda_0(\ell) = O(\ell^{-\alpha})$, where

$$\alpha = \begin{cases} \dfrac{2}{2r+3} & \text{for } 0 < r \le \tfrac12 \\[2mm] \dfrac{1}{2r+1} & \text{for } \tfrac12 < r \le 1 \end{cases}$$

This choice implies the following convergence rate of the risk to its minimum, $I[f^\lambda_{\mathbf z}] - \inf_{f \in H} I[f] \le O(\ell^{-\beta})$, where

$$\beta = \begin{cases} \dfrac{4r}{2r+3} & \text{for } 0 < r \le \tfrac12 \\[2mm] \dfrac{2r}{2r+1} & \text{for } \tfrac12 < r \le 1 \end{cases}$$
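Where these exponents come from (a balancing sketch under the bounds above, added as a plausibility check): for $\tfrac12 < r \le 1$ the leading term of $S$ is $O((\ell\lambda)^{-1/2})$, while for $0 < r \le \tfrac12$ the quadratic term $O(\ell^{-1} \lambda^{-3/2})$ dominates; equating each with $A(\lambda) \sim \lambda^r$ gives

$$\lambda^r \sim (\ell\lambda)^{-\frac12} \;\Rightarrow\; \lambda_0 \sim \ell^{-\frac{1}{2r+1}}, \qquad \lambda^r \sim \ell^{-1} \lambda^{-\frac32} \;\Rightarrow\; \lambda_0 \sim \ell^{-\frac{2}{2r+3}},$$

and in both regimes the bound scales as $(\lambda_0^r)^2$, so that $\beta = 2r\alpha$.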


SLIDE 15

Fast rates

Under the maximum regularity assumption $r = 1$ ($f_\rho$ belonging to the range of $L_K$), these results give the optimal rate

$$I[f^\lambda_{\mathbf z}] - \inf_{f \in H} I[f] \le O\left( \ell^{-\frac23} \log \tfrac{1}{\eta} \right).$$

This improves

  • the rate in [T. Zhang, 2003] in its dependency on the confidence level $\eta$, from $O(\eta^{-1})$ to logarithmic,
  • and the rate in [S. Smale, D. Zhou, 2004] in its dependency on $\ell$, from $O(\ell^{-1/2})$ to $O(\ell^{-2/3})$.


SLIDE 16

The degree of ill-posedness of $L_K$

We will assume the following decay condition on the eigenvalues $\sigma_i^2$ of the integral operator $L_K$, for some $p \ge 1$:

$$\sigma_i^2 \le C_p \, i^{-p}.$$

  • The parameter $p$ is known as the degree of ill-posedness of the operator $L_K$.
  • This condition can be related to the smoothness properties of the kernel $K$ and the marginal probability density.
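One way to get an empirical feel for $p$ (a rough sketch, not from the slides): the eigenvalues of the normalized Gram matrix $K/\ell$ approximate the eigenvalues $\sigma_i^2$ of $L_K$, so the decay exponent can be estimated from a log-log fit.

```python
import numpy as np

# Eigenvalues of the normalized Gram matrix approximate those of L_K;
# fit the decay exponent p on the leading eigenvalues (log-log slope).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=500)
K = np.exp(-(X[:, None] - X[None, :]) ** 2)      # Gaussian kernel
sigma2 = np.sort(np.linalg.eigvalsh(K))[::-1] / len(X)
i = np.arange(1, 21)
p_hat = -np.polyfit(np.log(i), np.log(sigma2[:20]), 1)[0]
print(p_hat)  # for the Gaussian kernel the true decay is near-exponential
```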


SLIDE 17

Improved bound on the sample error

Defining the function

$$\Theta(\lambda, \eta, \ell) = \kappa C_\eta \, (\ell\lambda)^{-\frac12} \left( \kappa (\ell\lambda)^{-\frac12} + \sqrt{ \frac{p}{p-1} \left( \frac{C_p}{\lambda} \right)^{\frac1p} } \right),$$

if $\lambda$, $\eta$ and $\ell$ are such that $\Theta(\lambda, \eta, \ell) \le 1$, then with probability at least $1 - \eta$ the sample error is bounded by

$$S(\lambda, \eta, \ell) = \tfrac12 \, \kappa C_r C_\eta \, \lambda^{r - \frac12} \ell^{-\frac12} \left( 1 + \kappa (\ell\lambda)^{-\frac12} \right) (1 - \Theta)^{-1} + M \kappa^{-1} \lambda^{\frac12} \, \Theta \left( 1 + \tfrac12 \Theta (1 - \Theta)^{-1} \right).$$


SLIDE 18

Improved rates of convergence

The new bound can be used to obtain improved rates of convergence when $\tfrac12 < r \le 1$; in fact, in this case

$$\lambda_0(\ell) = O(\ell^{-\alpha}) \quad \text{with} \quad \alpha = \frac{p}{2rp + 1},$$

and correspondingly

$$I[f^\lambda_{\mathbf z}] - \inf_{f \in H} I[f] \le O(\ell^{-\beta}) \quad \text{with} \quad \beta = \frac{2rp}{2rp + 1}.$$

For large $p$ the resulting convergence rate approaches $O(\ell^{-1})$.
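Again a balancing plausibility check (a sketch using the dominant terms of the improved bound on slide 17, where $\lambda^{1/2}\Theta \sim \ell^{-1/2} \lambda^{-1/(2p)}$ up to constants):

$$\lambda^r \sim \ell^{-\frac12} \lambda^{-\frac{1}{2p}} \;\Rightarrow\; \lambda_0 \sim \ell^{-\frac{p}{2rp+1}}, \qquad \beta = 2r\alpha = \frac{2rp}{2rp+1} \longrightarrow 1 \quad (p \to \infty).$$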


SLIDE 19

Conclusions

  • The estimate of the sample error $S(\lambda, \eta, \ell)$ does not require using covering numbers as a capacity measure of the hypothesis space,
  • under the assumption of exponential decay of the eigenvalues of $L_K$, rates arbitrarily close to $O(\ell^{-1})$ can be achieved,
  • due to the logarithmic dependence on the confidence in the expression of the bounds, the convergence results hold almost surely and not just in probability.
