SLIDE 1

Learning from examples as an inverse problem

E. De Vito

Dipartimento di Matematica, Università di Modena e Reggio Emilia

Genova, October 30, 2004

SLIDE 2

Plan of the talk

  • 1. Motivations
  • 2. Statistical learning theory and the regularized least-squares algorithm
  • 3. Linear inverse problem
  • 4. Formal connection between 2. and 3.
  • 5. Conclusions

SLIDE 3

Motivations

  • 1. Learning theory is mainly developed in a probabilistic framework
  • 2. The learning problem can be seen as the regression problem of approximating a function from sparse data and, hence, is an ill-posed problem
  • 3. Learning algorithms are a particular instance of the regularization theory developed for ill-posed problems
  • 4. The stability of the solution is with respect to perturbations of the data, which play the role of noise

SLIDE 4

A question and a few references

-----------------------------------------------------------
Is learning theory a linear inverse problem?
-----------------------------------------------------------

  • 1. T. Poggio, F. Girosi, Science 247 (1990) 978-982
  • 2. F. Girosi, M. Jones, T. Poggio, Neural Comp. 7 (1995) 219-269
  • 3. V. Vapnik, Statistical Learning Theory, 1998
  • 4. T. Evgeniou, M. Pontil, T. Poggio, Adv. Comp. Math. 13 (2000) 1-50
  • 5. F. Cucker, S. Smale, Bull. Amer. Math. Soc. 39 (2002) 1-49

SLIDE 5

Statistical learning theory: building blocks

  • 1. A relation between two sets of variables, X and Y. The relation is unknown, up to a set of ℓ examples z = ((x1, y1), . . . , (xℓ, yℓ)), and the aim of learning theory is to describe it by means of a function f : X → Y
  • 2. A quantitative measure of how well a function f describes the relation between x ∈ X and y ∈ Y
  • 3. A hypothesis space H of functions encoding some a priori knowledge of the relation
  • 4. An algorithm that provides an estimator f_z ∈ H for any training set z
  • 5. A quantitative measure of the performance of the algorithm

SLIDE 6
1. The distribution ρ

  • 1. The input space X is a subset of R^m
  • 2. The output space Y is R (regression)
  • 3. The relation between x and y is described by an unknown probability distribution ρ on X × Y

SLIDE 7
2. The expected risk

  • 1. The expected risk of a function f : X → Y is

$$I[f] = \int_{X \times Y} (f(x) - y)^2 \, d\rho(x, y)$$

and measures how well f describes the relation between x and y modeled by ρ

  • 2. The regression function

$$g(x) = \int_Y y \, d\rho(y|x)$$

is the minimizer of the expected risk over the set of all functions f : X → R (here ρ(y|x) is the conditional distribution of y given x)
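A quick numerical illustration (not from the talk; the model for ρ and the noise level are assumptions) that the conditional mean g minimizes the expected risk:

```python
# Monte Carlo sketch: the regression function g(x) = E[y|x] minimizes I[f].
import numpy as np

rng = np.random.default_rng(0)
ell = 200_000

# Illustrative model for rho: x uniform on [0, 1], y = g(x) + Gaussian noise.
g = lambda t: np.sin(2 * np.pi * t)                # regression function E[y|x]
x = rng.uniform(0.0, 1.0, size=ell)
y = g(x) + 0.3 * rng.standard_normal(ell)

risk = lambda f: np.mean((f(x) - y) ** 2)          # Monte Carlo estimate of I[f]

print("I[g]       ~", risk(g))                          # ~ 0.09, the noise variance
print("I[g + 0.2] ~", risk(lambda t: g(t) + 0.2))       # larger by ~ 0.04
print("I[0]       ~", risk(lambda t: np.zeros_like(t))) # larger still
```

Any competitor's risk exceeds I[g] by exactly its squared L²(X, ν)-distance from g, which is the decomposition exploited later in the talk.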

7

slide-8
SLIDE 8
3. The hypothesis space H

The space H is a reproducing kernel Hilbert space:

  • 1. The elements of H are functions f : X → R
  • 2. The following reproducing property holds:

$$f(x) = \langle f, K_x \rangle_H, \qquad K_x \in H$$

  • 3. The function

$$f_H = \operatorname*{argmin}_{f \in H} \, I[f]$$

is the best estimator in H
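A small sketch (the Gaussian kernel and the expansion points are assumptions chosen only for illustration) of how the reproducing property turns inner products in H into kernel evaluations:

```python
# For f = sum_j c_j K(., x_j), the reproducing property f(x) = <f, K_x>_H
# reduces to evaluating the kernel at the expansion points.
import numpy as np

def K(s, t, sigma=0.5):
    """Gaussian reproducing kernel on R (an illustrative choice)."""
    return np.exp(-((s - t) ** 2) / (2 * sigma ** 2))

centers = np.array([0.1, 0.4, 0.9])   # expansion points x_j
c = np.array([1.0, -0.5, 2.0])        # coefficients c_j

def f(x):
    """A generic element of the span of the K_{x_j}, which is dense in H."""
    return sum(cj * K(x, xj) for cj, xj in zip(c, centers))

# <f, K_x>_H = sum_j c_j <K(., x_j), K_x>_H = sum_j c_j K(x, x_j) = f(x)
x = 0.37
inner = sum(cj * K(x, xj) for cj, xj in zip(c, centers))
assert np.isclose(inner, f(x))
```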

SLIDE 9
4. The regularized least-squares algorithm

  • 1. The examples (x1, y1), . . . , (xℓ, yℓ) are drawn independently and identically distributed according to ρ
  • 2. Given λ > 0, the regularized least-squares estimator is

$$f_z^\lambda = \operatorname*{argmin}_{f \in H} \left\{ \frac{1}{\ell} \sum_{i=1}^{\ell} (f(x_i) - y_i)^2 + \lambda \|f\|_H^2 \right\}$$

for each training set z ∈ (X × Y)^ℓ

  • 3. f_z^λ is a random variable defined on the probability space (X × Y)^ℓ and taking values in the Hilbert space H
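A minimal sketch of this estimator (the Gaussian kernel, the data model, and λ are illustrative assumptions): by the representer theorem, f_z^λ = Σ_i α_i K(·, x_i) with α = (K + λℓI)⁻¹ y, where K is the ℓ × ℓ kernel matrix.

```python
# Kernel ridge regression: the closed form of the regularized
# least-squares estimator for the 1/ell-normalized empirical risk.
import numpy as np

rng = np.random.default_rng(0)
ell, lam = 50, 1e-3

x = np.sort(rng.uniform(0.0, 1.0, ell))
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(ell)

def gram(s, t, sigma=0.1):
    """Gaussian kernel matrix with entries K(s_i, t_j)."""
    return np.exp(-((s[:, None] - t[None, :]) ** 2) / (2 * sigma ** 2))

K = gram(x, x)
alpha = np.linalg.solve(K + lam * ell * np.eye(ell), y)

def f_z(t):
    """Evaluate f_z^lambda = sum_i alpha_i K(., x_i) at new points t."""
    return gram(t, x) @ alpha

print(f_z(np.array([0.25, 0.50])))   # should be close to sin(pi/2)=1, sin(pi)=0
```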

SLIDE 10
5. Probabilistic bounds and consistency

  • 1. A probabilistic bound B(λ, ℓ, η) is a function depending on the regularization parameter λ, the number ℓ of examples, and the confidence level 1 − η such that

$$\mathrm{Prob}_{z \in (X \times Y)^\ell} \Big[\, 0 \le I[f_z^\lambda] - I[f_H] \le B(\lambda, \ell, \eta) \,\Big] \ge 1 - \eta$$

  • 2. B(λ, ℓ, η) measures the performance of the algorithm
  • 3. B decreases as a function of η and of ℓ
  • 4. The algorithm is consistent if it is possible to choose λ, as a function λ_ℓ of ℓ, so that, for all ǫ > 0,

$$\lim_{\ell \to +\infty} \mathrm{Prob}_{z \in (X \times Y)^\ell} \Big[\, I[f_z^{\lambda_\ell}] - I[f_H] \ge \epsilon \,\Big] = 0$$

SLIDE 11

Plan of the talk

  • 1. Motivations
  • 2. Statistical learning theory and the regularized least-squares algorithm
  • 3. Linear inverse problem
  • 4. Formal connection between 2. and 3.
  • 5. Conclusions

SLIDE 12

The linear inverse problem

  • 1. The operator A : H → K
  • 2. The exact datum g ∈ K
  • 3. The exact problem: find f ∈ H such that Af = g
  • 4. The noisy datum g_δ ∈ K
  • 5. The measure of the noise: $\|g - g_\delta\|_K \le \delta$
  • 6. The regularized solution of the noisy problem is

$$f_\delta^\lambda = \operatorname*{argmin}_{f \in H} \left\{ \|Af - g_\delta\|_K^2 + \lambda \|f\|_H^2 \right\}, \qquad \lambda > 0$$
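A finite-dimensional numerical sketch of this Tikhonov-regularized solution (the matrix A, the noise level, and λ are illustrative assumptions):

```python
# Tikhonov regularization of a noisy linear problem A f = g_delta:
# the minimizer of ||A f - g_delta||^2 + lam ||f||^2 is
# f = (A^T A + lam I)^{-1} A^T g_delta.
import numpy as np

rng = np.random.default_rng(0)
n, m, delta, lam = 20, 10, 1e-2, 1e-3

A = rng.standard_normal((n, m)) / np.sqrt(n)
f_true = rng.standard_normal(m)
g = A @ f_true                                       # exact datum

noise = rng.standard_normal(n)
g_delta = g + delta * noise / np.linalg.norm(noise)  # ||g - g_delta|| = delta

f_reg = np.linalg.solve(A.T @ A + lam * np.eye(m), A.T @ g_delta)
print("error:", np.linalg.norm(f_reg - f_true))
```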

SLIDE 13

Comments

  • 1. The regularization parameter λ > 0 ensures existence and uniqueness of the minimizer $f_\delta^\lambda$
  • 2. The theory can be extended to the case of a noisy operator A_δ : H → K
  • 3. The measures of the noise are

$$\|g - g_\delta\|_K \le \delta_1 \qquad \|A - A_\delta\|_{\mathcal{L}(H,K)} \le \delta_2$$

  • 4. Both g and g_δ belong to the same space
  • 5. Both A and A_δ belong to the same space

SLIDE 14

The reconstruction error

  • 1. The reconstruction error $\|f_\delta^\lambda - f^\dagger\|_H$ measures the distance between $f_\delta^\lambda$ and the generalized solution

$$f^\dagger = \operatorname*{argmin}_{f \in H} \, \|Af - g\|_K^2$$

(if the minimizer is not unique, f† is the minimizer of minimal norm)

  • 2. The parameter λ is chosen, as a function λ(δ) of δ, so that

$$\lim_{\delta \to 0} \left\| f_\delta^{\lambda(\delta)} - f^\dagger \right\|_H = 0$$
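A short sketch (illustrative rank-deficient matrix, exact datum) of the generalized solution as the minimal-norm least-squares solution, and of the convergence of the regularized solutions to it as λ → 0:

```python
# f_dagger is the minimal-norm minimizer of ||A f - g||, i.e. pinv(A) @ g;
# with exact data, the Tikhonov solution converges to it as lam -> 0.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5)) @ np.diag([1.0, 1.0, 1.0, 0.1, 0.0])  # rank 4
g = rng.standard_normal(8)

f_dagger = np.linalg.pinv(A) @ g

for lam in [1e-1, 1e-3, 1e-6]:
    f_lam = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ g)
    print(lam, np.linalg.norm(f_lam - f_dagger))   # shrinks as lam -> 0
```

With noisy data this limit fails, which is exactly why λ must be tied to the noise level δ as above.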

SLIDE 15

The residual

  • 1. The residual of $f_\delta^\lambda$ is

$$\|A f_\delta^\lambda - A f^\dagger\|_K = \|A f_\delta^\lambda - Pg\|_K$$

where P is the projection onto the closure of Im A.

  • 2. The residual is a weaker measure than the reconstruction error:

$$\|Af - Pg\|_{L^2(X,\nu)} \le \|A\|_{\mathcal{L}(H,K)} \, \|f - f_H\|_H$$

SLIDE 16

Plan of the talk

  • 1. Motivations
  • 2. Statistical learning theory and the regularized least-squares algorithm
  • 3. Linear inverse problem
  • 4. Formal connection between 2. and 3.

[E. De Vito, A. Caponnetto, L. Rosasco, preprint ('04)]

  • 5. Conclusions

SLIDE 17

I am looking for ...

  • 1. An operator A : H → K
  • 2. An exact datum g such that f_H is the generalized solution of the inverse problem Af = g
  • 3. A noisy datum g_δ and, possibly, a noisy operator A_δ
  • 4. A noise measure δ, in terms of the number ℓ of examples in the training set, with the property that the algorithm is consistent if δ converges to zero
SLIDE 18

The power of the square

The expected risk of a function f : X → R is

$$I[f] = \int_{X \times Y} (f(x) - y)^2 \, d\rho(x, y) = \|f - g\|_{L^2(X,\nu)}^2 + I[g]$$

where ν is the marginal distribution on X,

$$\|f\|_{L^2(X,\nu)}^2 = \int_X f(x)^2 \, d\nu(x),$$

and g is the regression function
SLIDE 19

The exact problem

The equation

$$I[f] = \|f - g\|_{L^2(X,\nu)}^2 + I[g]$$

suggests that:

  • 1. the data space K is L²(X, ν)
  • 2. the exact operator A : H → L²(X, ν) is the canonical immersion, Af = f (the norm of f in H is different from the norm of f in L²(X, ν))
  • 3. the exact datum is the regression function g
SLIDE 20

Comments

  • 1. The ideal solution f_H, which is the minimizer of the expected risk over H, is the generalized solution of the inverse problem Af = g.
  • 2. For any f ∈ H,

$$I[f] - I[f_H] = \|Af - Pg\|_{L^2(X,\nu)}^2$$

where P is the projection onto the closure of H in L²(X, ν)

  • 3. The function f is a good estimator if it is an approximation of Pg in the L²-norm, that is, if f has a small residual
SLIDE 21

but ...

  • 1. The regularized least-squares estimator is

$$f_z^\lambda = \operatorname*{argmin}_{f \in H} \left\{ \frac{1}{\ell} \|A_x f - y\|_{\mathbb{R}^\ell}^2 + \lambda \|f\|_H^2 \right\}$$

where

$$A_x : H \to \mathbb{R}^\ell \qquad (A_x f)_i = f(x_i)$$

$$x = (x_1, \ldots, x_\ell) \in X^\ell \qquad y = (y_1, \ldots, y_\ell) \in \mathbb{R}^\ell$$

  • 2. f_z^λ is the regularized solution of the discretized problem

$$A_x f = y$$
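A finite-dimensional sketch of the sampling operator A_x (the feature map and the data model are illustrative assumptions; here H = R^5 with a Fourier feature map):

```python
# (A_x f)_i = f(x_i) = <f, phi(x_i)>, so A_x is the ell x d matrix whose
# rows are phi(x_i)^T; the estimator solves the regularized normal equations.
import numpy as np

rng = np.random.default_rng(0)
ell, lam = 30, 1e-2

x = rng.uniform(0.0, 1.0, ell)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(ell)

def phi(t):
    """Feature map: f(t) = <f, phi(t)> for f in H = R^5."""
    return np.stack([np.ones_like(t),
                     np.cos(2 * np.pi * t), np.sin(2 * np.pi * t),
                     np.cos(4 * np.pi * t), np.sin(4 * np.pi * t)], axis=-1)

A_x = phi(x)                      # the ell x 5 sampling matrix

# Minimizer of (1/ell)||A_x f - y||^2 + lam ||f||^2:
f_z = np.linalg.solve(A_x.T @ A_x / ell + lam * np.eye(5), A_x.T @ y / ell)
print(f_z)                        # the sin(2*pi*t) coordinate should dominate
```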

SLIDE 22

Where has the noise gone?

  • 1. The exact problem: Af = g, with A : H → L²(X, ν) and g ∈ L²(X, ν)
  • 2. The noisy problem: A_x f = y, with A_x : H → R^ℓ and y ∈ R^ℓ
  • 3. g and y belong to different spaces
  • 4. A and A_x belong to different spaces
  • 5. A_x and y are random variables

SLIDE 23

A possible solution

  • 1. The regularized solution of the inverse problem Af = g is

$$f^\lambda = \operatorname*{argmin}_{f \in H} \left\{ \|Af - g\|_{L^2(X,\nu)}^2 + \lambda \|f\|_H^2 \right\}$$

  • 2. The functions f^λ and f_z^λ are explicitly given by

$$f^\lambda = (T + \lambda)^{-1} h \qquad T = A^* A \qquad h = A^* g$$

$$f_z^\lambda = (T_x + \lambda)^{-1} h_z \qquad T_x = A_x^* A_x \qquad h_z = A_x^* y$$

  • 3. the vectors h and h_z belong to H
  • 4. T and T_x are Hilbert–Schmidt operators from H to H
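A sketch comparing the exact regularized solution f^λ = (T + λ)⁻¹h with its empirical counterpart f_z^λ = (T_x + λ)⁻¹h_z, in an illustrative finite-dimensional feature space (the feature map, the regression function, and the quadrature are assumptions):

```python
# T = E_x[<., K_x> K_x] and h = E_{x,y}[y K_x] become, in a feature space
# H = R^3, the matrix E[phi phi^T] and the vector E[g(x) phi(x)];
# T_x and h_z are their empirical versions built from ell samples.
import numpy as np

rng = np.random.default_rng(0)
lam, ell = 1e-2, 2000

def phi(t):
    return np.stack([np.ones_like(t),
                     np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)], axis=-1)

# "Population" T and h by fine quadrature, for x ~ Uniform[0,1] and
# regression function g(t) = sin(2*pi*t):
t = np.linspace(0.0, 1.0, 10_000)
P = phi(t)
T = P.T @ P / t.size
h = P.T @ np.sin(2 * np.pi * t) / t.size

# Empirical T_x and h_z (the noise in y enters only through h_z):
x = rng.uniform(0.0, 1.0, ell)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(ell)
Px = phi(x)
T_x = Px.T @ Px / ell
h_z = Px.T @ y / ell

f_lam = np.linalg.solve(T + lam * np.eye(3), h)
f_zlam = np.linalg.solve(T_x + lam * np.eye(3), h_z)
print(np.linalg.norm(f_zlam - f_lam))   # small, and shrinks as ell grows
```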

SLIDE 24

The noise

  • 1. The quantities

$$\delta_1 = \|h_z - h\|_H \qquad \delta_2 = \|T_x - T\|_{\mathcal{L}(H)}$$

are the measures of the noise associated to the training set z = (x, y)

  • 2. Up to a rescaling of the constants,

$$\Big| \bigl( I[f_z^\lambda] - I[f_H] \bigr) - \bigl( I[f^\lambda] - I[f_H] \bigr) \Big| \le \frac{1}{\sqrt{\lambda}} \left( \frac{\|T_x - T\|_{\mathcal{L}(H)}}{\sqrt{\lambda}} + \|h_z - h\|_H \right)$$

  • 3. δ1 and δ2 do not depend on λ and are of a probabilistic nature; the effect of the regularization procedure is factorized out by analytic methods

SLIDE 25

Generalized Bennett inequality

  • 1. Since H is a reproducing kernel Hilbert space, that is, $f(x) = \langle f, K_x \rangle_H$:

$$h_z = \frac{1}{\ell} \sum_{i=1}^{\ell} y_i K_{x_i} \qquad h = \mathbb{E}_{x,y}[\, y K_x \,]$$

$$T_x = \frac{1}{\ell} \sum_{i=1}^{\ell} \langle \cdot, K_{x_i} \rangle_H \, K_{x_i} \qquad T = \mathbb{E}_x[\, \langle \cdot, K_x \rangle_H \, K_x \,]$$

  • 2. Theorem [Smale–Yao ('04)]: let ξ : X × Y → H be a random variable with $\|\xi(x, y)\|_H \le 1$. Then

$$\mathrm{Prob}_{z \in (X \times Y)^\ell} \left[\, \Bigl\| \frac{1}{\ell} \sum_{i=1}^{\ell} \xi(x_i, y_i) - \mathbb{E}_{x,y}[\xi] \Bigr\|_H \ge \epsilon \,\right] \le 2 \exp\!\left( - \frac{\ell \, \epsilon \log(1 + \epsilon)}{2} \right)$$
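A Monte Carlo sanity check of this tail bound (the H-valued variable, taken uniform on the unit sphere of R^5 so that ‖ξ‖ ≤ 1 and E[ξ] = 0, and all the sizes are illustrative assumptions):

```python
# Compare the empirical tail of || (1/ell) sum_i xi_i - E[xi] ||
# with the Bennett-type bound 2*exp(-(ell/2) * eps * log(1 + eps)).
import numpy as np

rng = np.random.default_rng(0)
d, ell, eps, trials = 5, 50, 0.2, 2000

def sample(n):
    """n draws of xi, uniform on the unit sphere of R^d (so E[xi] = 0)."""
    v = rng.standard_normal((n, d))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

deviations = np.array([np.linalg.norm(sample(ell).mean(axis=0))
                       for _ in range(trials)])
empirical = np.mean(deviations >= eps)
bound = 2 * np.exp(-(ell / 2) * eps * np.log(1 + eps))
print(f"empirical tail {empirical:.3f} <= bound {bound:.3f}")
```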

SLIDE 26

The probabilistic bound

With probability greater than 1 − η,

$$\Big| \bigl( I[f_z^\lambda] - I[f_H] \bigr) - \bigl( I[f^\lambda] - I[f_H] \bigr) \Big| \le \frac{C_1}{\sqrt{\lambda^2 \ell}} + \frac{C_2}{\sqrt{\lambda \ell}} \log\frac{4}{\eta} + o\!\left( \frac{1}{\sqrt{\lambda^2 \ell}} \right)$$

  • 1. The subset of z ∈ (X × Y)^ℓ for which the bound holds depends on ℓ and η, but not on λ
  • 2. C1 and C2 are simple numerical constants
  • 3. The term $o\bigl( 1/\sqrt{\lambda^2 \ell} \bigr)$ also depends on η

SLIDE 27

Consistency

  • 1. For given λ > 0 and ǫ > 0,

$$\lim_{\ell \to +\infty} \mathrm{Prob}_{z \in (X \times Y)^\ell} \Big[\, \bigl| I[f_z^\lambda] - I[f^\lambda] \bigr| \ge \epsilon \,\Big] = 0$$

that is, I[f_z^λ] concentrates around I[f^λ]

  • 2. The theory of inverse problems gives

$$\lim_{\lambda \to 0} I[f^\lambda] = I[f_H]$$

  • 3. With a choice of λ_ℓ, as a function of ℓ, such that

$$\lim_{\ell \to +\infty} \lambda_\ell = 0 \qquad \lim_{\ell \to +\infty} \frac{1}{\ell \lambda_\ell^2} = 0,$$

the algorithm is consistent:

$$\lim_{\ell \to +\infty} \mathrm{Prob}_{z \in (X \times Y)^\ell} \Big[\, I[f_z^{\lambda_\ell}] - I[f_H] \ge \epsilon \,\Big] = 0$$
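For instance (an admissible schedule, not one prescribed by the slide), λ_ℓ = ℓ^(−1/3) satisfies both conditions, as a quick check confirms:

```python
# lambda_ell -> 0 and 1/(ell * lambda_ell^2) = ell**(-1/3) -> 0.
for ell in [10**2, 10**4, 10**6, 10**8]:
    lam = ell ** (-1.0 / 3.0)
    print(f"ell={ell:>9}  lambda={lam:.4f}  1/(ell*lambda^2)={1/(ell*lam**2):.4f}")
```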

SLIDE 28

Summary

inverse problem                        |   | learning theory
---------------------------------------|---|---------------------------------------
K                                      | = | L²(X, ν)
norm in K                              | = | expected risk
exact operator A                       | = | immersion of H into L²(X, ν)
exact datum g                          | = | regression function
generalized solution f†                | = | ideal solution f_H
reconstruction error, (residual)²      | = | I[f] − I[f_H]
noisy space                            | ≃ | R^ℓ
noisy operator A_x                     | ≃ | immersion of H into R^ℓ
noise on g                             | ≃ | δ1 = ‖A_x* y − A* g‖_H
noise on A                             | ≃ | δ2 = ‖A_x* A_x − A* A‖_L(H)
regularization δ → 0                   | ≃ | consistency ℓ → +∞

SLIDE 29

Conclusions

  • 1. Is learning theory a linear inverse problem? Yes, but almost surely
  • 2. A probabilistic bound on I[f_z^λ] − I[f_H] is given
  • 3. By means of analytic methods, the dependence of the bound on the regularization parameter is factorized out
  • 4. By means of probabilistic inequalities, the noise levels δ1 and δ2 are related to the number ℓ of examples
  • 5. The bound is simple and very general, but not optimal