Learning from examples as an inverse problem
1. Learning from examples as an inverse problem
   E. De Vito
   Dipartimento di Matematica, Università di Modena e Reggio Emilia
   Genova, October 30, 2004

2. Plan of the talk
   1. Motivations
   2. Statistical learning theory and regularized least-squares algorithm
   3. Linear inverse problem
   4. Formal connection between 2. and 3.
   5. Conclusions

3. Motivations
   1. Learning theory is mainly developed in a probabilistic framework
   2. The learning problem can be seen as the regression problem of approximating a function from sparse data and is therefore an ill-posed problem
   3. Learning algorithms are a particular instance of the regularization theory developed for ill-posed problems
   4. The stability of the solution is with respect to perturbations of the data, which play the role of noise

4. A question and a few references
   Is learning theory a linear inverse problem?
   1. T. Poggio, F. Girosi, Science 247 (1990) 978-982
   2. F. Girosi, M. Jones, T. Poggio, Neural Comp. 7 (1995) 219-269
   3. V. Vapnik, Statistical Learning Theory, 1998
   4. T. Evgeniou, M. Pontil, T. Poggio, Adv. Comp. Math. 13 (2000) 1-50
   5. F. Cucker, S. Smale, Bull. Amer. Math. Soc. 39 (2002) 1-49

5. Statistical learning theory: building blocks
   1. A relation between two sets of variables, X and Y. The relation is unknown, up to a set of ℓ examples z = ((x_1, y_1), ..., (x_ℓ, y_ℓ)), and the aim of learning theory is to describe it by means of a function f : X → Y
   2. A quantitative measure of how well a function f describes the relation between x ∈ X and y ∈ Y
   3. A hypothesis space H of functions encoding some a priori knowledge of the relation
   4. An algorithm that provides an estimator f_z ∈ H for any training set z
   5. A quantitative measure of the performance of the algorithm

6. 1. The distribution ρ
   1. The input space X is a subset of R^m
   2. The output space Y is R (regression)
   3. The relation between x and y is described by an unknown probability distribution ρ on X × Y

7. 2. The expected risk
   1. The expected risk of a function f : X → Y is
         I[f] = \int_{X \times Y} (f(x) - y)^2 \, d\rho(x, y)
      and measures how well f describes the relation between x and y modeled by ρ
   2. The regression function
         g(x) = \int_Y y \, d\rho(y \mid x)
      is the minimizer of the expected risk over the set of all functions f : X → R
      (ρ(y | x) is the conditional distribution of y given x)

8. 3. The hypothesis space H
   The space H is a reproducing kernel Hilbert space:
   1. The elements of H are functions f : X → R
   2. The following reproducing property holds:
         f(x) = \langle f, K_x \rangle_H, \qquad K_x \in H
   3. The function
         f_H = \operatorname{argmin}_{f \in H} I[f]
      is the best estimator in H

9. 4. The regularized least-squares algorithm
   1. The examples (x_1, y_1), ..., (x_ℓ, y_ℓ) are drawn independently and identically distributed according to ρ
   2. Given λ > 0, the regularized least-squares estimator is
         f_z^\lambda = \operatorname{argmin}_{f \in H} \Bigl\{ \frac{1}{\ell} \sum_{i=1}^{\ell} (f(x_i) - y_i)^2 + \lambda \|f\|_H^2 \Bigr\}
      for each training set z ∈ (X × Y)^ℓ (a numerical sketch follows this slide)
   3. f_z^λ is a random variable defined on the probability space (X × Y)^ℓ and taking values in the Hilbert space H
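
A minimal Python sketch of this estimator via the representer theorem, which gives f_z^λ = Σ_i c_i K_{x_i} with c = (K + λℓI)^{-1} y, where K here denotes the ℓ × ℓ kernel matrix with entries K(x_i, x_j); the Gaussian kernel, its width, λ and the toy data below are illustrative choices, not taken from the talk:

    import numpy as np

    def gaussian_kernel(A, B, sigma=1.0):
        """Kernel matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sq / (2 * sigma**2))

    def rls_fit(X, y, lam, sigma=1.0):
        """Representer-theorem coefficients c = (K + lam * ell * I)^{-1} y."""
        ell = len(y)
        K = gaussian_kernel(X, X, sigma)
        return np.linalg.solve(K + lam * ell * np.eye(ell), y)

    def rls_predict(X_train, c, X_new, sigma=1.0):
        """Evaluate f_z^lambda(x) = sum_i c_i K(x_i, x) at new points."""
        return gaussian_kernel(X_new, X_train, sigma) @ c

    # Toy training set z = ((x_1, y_1), ..., (x_ell, y_ell))
    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(50, 1))
    y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(50)

    c = rls_fit(X, y, lam=1e-2)
    print(rls_predict(X, c, np.array([[0.5]])))

Nothing in the computation depends on the particular kernel: solving one ℓ × ℓ linear system gives the estimator for any reproducing kernel.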

10. 5. Probabilistic bounds and consistency
   1. A probabilistic bound B(λ, ℓ, η) is a function of the regularization parameter λ, the number ℓ of examples and the confidence level 1 − η such that
         \operatorname{Prob}_{z \in (X \times Y)^\ell} \bigl[\, 0 \le I[f_z^\lambda] - I[f_H] \le B(\lambda, \ell, \eta) \,\bigr] \ge 1 - \eta
   2. B(λ, ℓ, η) measures the performance of the algorithm
   3. B decreases as a function of η and of ℓ
   4. The algorithm is consistent if it is possible to choose λ, as a function of ℓ, so that, for all ε > 0,
         \lim_{\ell \to +\infty} \operatorname{Prob}_{z \in (X \times Y)^\ell} \bigl[\, I[f_z^{\lambda_\ell}] - I[f_H] \ge \varepsilon \,\bigr] = 0

11. Plan of the talk
   1. Motivations
   2. Statistical learning and regularized least-squares algorithm
   3. Linear inverse problem
   4. Formal connection between 2. and 3.
   5. Conclusions

12. The linear inverse problem
   1. The operator A : H → K
   2. The exact datum g ∈ K
   3. The exact problem: find f ∈ H such that Af = g
   4. The noisy datum g_δ ∈ K
   5. The measure of the noise: ‖g − g_δ‖_K ≤ δ
   6. The regularized solution of the noisy problem is, for λ > 0,
         f_\delta^\lambda = \operatorname{argmin}_{f \in H} \bigl\{ \|Af - g_\delta\|_K^2 + \lambda \|f\|_H^2 \bigr\}
      (sketched below)
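
A finite-dimensional Python sketch of items 1-6, with H = R^10, K = R^30, a random matrix A and noise added by hand (all the numbers are illustrative); the minimizer of ‖Af − g_δ‖² + λ‖f‖² is (AᵀA + λI)^{-1} Aᵀ g_δ:

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((30, 10))                 # operator A : H = R^10 -> K = R^30
    f_true = rng.standard_normal(10)
    g = A @ f_true                                    # exact datum
    g_delta = g + 0.05 * rng.standard_normal(30)      # noisy datum, ||g - g_delta||_K <= delta

    lam = 1e-2
    # Regularized (Tikhonov) solution of the noisy problem:
    f_lam_delta = np.linalg.solve(A.T @ A + lam * np.eye(10), A.T @ g_delta)
    print(np.linalg.norm(f_lam_delta - f_true))       # reconstruction error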

13. Comments
   1. The regularization parameter λ > 0 ensures existence and uniqueness of the minimizer f_δ^λ
   2. The theory can be extended to the case of a noisy operator A_δ : H → K
   3. The measures of the noise are
         \|g - g_\delta\|_K \le \delta_1, \qquad \|A - A_\delta\|_{L(H,K)} \le \delta_2
   4. Both g and g_δ belong to the same space
   5. Both A and A_δ belong to the same space

14. The reconstruction error
   1. The reconstruction error ‖f_δ^λ − f†‖_H measures the distance between f_δ^λ and the generalized solution
         f^\dagger = \operatorname{argmin}_{f \in H} \|Af - g\|_K^2
      (if the minimizer is not unique, f† is the minimizer of minimal norm; a numerical check follows this slide)
   2. The parameter λ is chosen, as a function of δ, so that
         \lim_{\delta \to 0} \|f_{\delta}^{\lambda_\delta} - f^\dagger\|_H = 0
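
A short Python check, in the same finite-dimensional setting as above and with an illustrative rank-deficient A, that the Tikhonov solution computed from the exact datum tends to the generalized solution f† = A⁺g as λ → 0, and that f† is the minimal-norm minimizer when the minimizer is not unique:

    import numpy as np

    A = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0]])       # rank-deficient: the null space is spanned by e_3
    g = np.array([1.0, 2.0, 3.0])         # the e_3 component of g lies outside the range of A
    f_dagger = np.linalg.pinv(A) @ g      # generalized (minimal-norm) solution

    for lam in [1e-1, 1e-3, 1e-6]:
        f_lam = np.linalg.solve(A.T @ A + lam * np.eye(3), A.T @ g)
        print(lam, np.linalg.norm(f_lam - f_dagger))   # -> 0 as lam -> 0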

15. The residual
   1. The residual of f_δ^λ is
         \|A f_\delta^\lambda - A f^\dagger\|_K = \|A f_\delta^\lambda - P g\|_K
      where P is the projection onto the closure of Im A
   2. The residual is a weaker measure than the reconstruction error:
         \|A f - P g\|_K \le \|A\|_{L(H,K)} \, \|f - f^\dagger\|_H

16. Plan of the talk
   1. Motivations
   2. Statistical learning and regularized least-squares algorithm
   3. Linear inverse problem
   4. Formal connection between 2. and 3. [E. De Vito, A. Caponnetto, L. Rosasco, preprint (2004)]
   5. Conclusions

17. I am looking for ...
   1. An operator A : H → K
   2. An exact datum g such that f_H is the generalized solution of the inverse problem Af = g
   3. A noisy datum g_δ and, possibly, a noisy operator A_δ
   4. A noise measure δ, in terms of the number ℓ of examples in the training set, with the property that the algorithm is consistent if δ converges to zero

18. The power of the square
   The expected risk of a function f : X → R is
         I[f] = \int_{X \times Y} (f(x) - y)^2 \, d\rho(x, y) = \|f - g\|_{L^2(X,\nu)}^2 + I[g]
   where ν is the marginal distribution on X,
         \|f\|_{L^2(X,\nu)}^2 = \int_X f(x)^2 \, d\nu(x),
   and g is the regression function (a derivation of the identity follows this slide)
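
The identity is obtained by adding and subtracting g(x), expanding the square, and using the definition of the regression function; spelled out (a standard computation that the slide leaves implicit):

    I[f] = \int_{X \times Y} \bigl( (f(x) - g(x)) + (g(x) - y) \bigr)^2 \, d\rho(x, y)
         = \|f - g\|_{L^2(X,\nu)}^2
           + 2 \int_X (f(x) - g(x)) \Bigl( \int_Y (g(x) - y) \, d\rho(y \mid x) \Bigr) d\nu(x)
           + I[g],

and the middle term vanishes because, by the definition of g, the inner integral is zero for every x. In particular I[f] ≥ I[g], so the regression function minimizes the expected risk.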

19. The exact problem
   The equation I[f] = \|f - g\|_{L^2(X,\nu)}^2 + I[g] suggests that:
   1. the data space K is L²(X, ν)
   2. the exact operator A : H → L²(X, ν) is the canonical immersion, Af = f
      (the norm of f in H is different from the norm of f in L²(X, ν); illustrated below)
   3. the exact datum is the regression function g
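
A small numerical illustration of the parenthetical remark, in the simplest case where H is the RKHS of the linear kernel, so H ≅ R^m, f(x) = w·x, and Af is the same function viewed in L²(X, ν); the Gaussian marginal ν below is an illustrative choice:

    import numpy as np

    rng = np.random.default_rng(4)
    m = 3
    w = rng.standard_normal(m)                     # f(x) = w . x, an element of H ~= R^m
    cov = np.diag([4.0, 1.0, 0.25])                # nu = N(0, cov), the marginal on X

    x = rng.multivariate_normal(np.zeros(m), cov, size=100_000)
    norm_H = np.linalg.norm(w)                     # ||f||_H = ||w||
    norm_L2 = np.sqrt(np.mean((x @ w) ** 2))       # Monte Carlo estimate of ||Af||_{L^2(X,nu)}
    print(norm_H, norm_L2)                         # in general two different numbers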

20. Comments
   1. The ideal solution f_H, which is the minimizer of the expected risk over H, is the generalized solution of the inverse problem Af = g
   2. For any f ∈ H,
         I[f] - I[f_H] = \|Af - Pg\|_{L^2(X,\nu)}^2
      where P is the projection onto the closure of H in L²(X, ν) (a derivation follows this slide)
   3. The function f is a good estimator if it is an approximation of Pg in L²-norm, that is, if f has a small residual
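
A short derivation of item 2, filling in a step that is implicit on the slide: use the decomposition of slide 18, the orthogonality of g − Pg to the closure of H in L²(X, ν), and the fact that Af_H = Pg when the minimizer f_H is attained. Then

    I[f] - I[f_H] = \|Af - g\|_{L^2(X,\nu)}^2 - \|Af_H - g\|_{L^2(X,\nu)}^2
                  = \bigl( \|Af - Pg\|^2 + \|g - Pg\|^2 \bigr) - \bigl( \|Af_H - Pg\|^2 + \|g - Pg\|^2 \bigr)
                  = \|Af - Pg\|_{L^2(X,\nu)}^2 .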

21. but ...
   1. The regularized least-squares estimator is
         f_z^\lambda = \operatorname{argmin}_{f \in H} \Bigl\{ \frac{1}{\ell} \|A_{\mathbf{x}} f - \mathbf{y}\|_{\mathbb{R}^\ell}^2 + \lambda \|f\|_H^2 \Bigr\}
      where
         A_{\mathbf{x}} : H \to \mathbb{R}^\ell, \quad (A_{\mathbf{x}} f)_i = f(x_i), \qquad \mathbf{x} = (x_1, \dots, x_\ell) \in X^\ell, \quad \mathbf{y} = (y_1, \dots, y_\ell) \in \mathbb{R}^\ell
   2. f_z^λ is the regularized solution of the discretized problem A_x f = y

22. Where has the noise gone?
   1. The exact problem: Af = g, with A : H → L²(X, ν) and g ∈ L²(X, ν)
   2. The noisy problem: A_x f = y, with A_x : H → R^ℓ and y ∈ R^ℓ
   3. g and y belong to different spaces
   4. A and A_x belong to different spaces
   5. A_x and y are random variables

23. A possible solution
   1. The regularized solution of the inverse problem Af = g is
         f^\lambda = \operatorname{argmin}_{f \in H} \bigl\{ \|Af - g\|_{L^2(X,\nu)}^2 + \lambda \|f\|_H^2 \bigr\}
   2. The functions f^λ and f_z^λ are explicitly given by (checked numerically below)
         f^\lambda = (T + \lambda)^{-1} h, \qquad T = A^* A, \quad h = A^* g,
         f_z^\lambda = (T_{\mathbf{x}} + \lambda)^{-1} h_z, \qquad T_{\mathbf{x}} = A_{\mathbf{x}}^* A_{\mathbf{x}}, \quad h_z = A_{\mathbf{x}}^* \mathbf{y}
   3. The vectors h and h_z belong to H
   4. T and T_x are Hilbert-Schmidt operators from H to H
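
A minimal Python check of these closed forms in the simplest RKHS, the one of the linear kernel K(x, x′) = x·x′ (so H ≅ R^m and K_x = x; this choice, like the toy data, is purely illustrative): then T_x = (1/ℓ) Σ_i x_i x_iᵀ, h_z = (1/ℓ) Σ_i y_i x_i, and (T_x + λ)^{-1} h_z coincides with the minimizer of the empirical functional of slide 9:

    import numpy as np

    rng = np.random.default_rng(2)
    ell, m, lam = 200, 5, 1e-1
    X = rng.standard_normal((ell, m))              # rows are x_1, ..., x_ell
    y = X @ rng.standard_normal(m) + 0.1 * rng.standard_normal(ell)

    T_x = X.T @ X / ell                            # T_x = (1/ell) sum_i <., x_i> x_i
    h_z = X.T @ y / ell                            # h_z = (1/ell) sum_i y_i x_i
    f_z_lam = np.linalg.solve(T_x + lam * np.eye(m), h_z)

    # The same vector, computed directly as the minimizer of
    # (1/ell) sum_i (w . x_i - y_i)^2 + lam ||w||^2:
    w_direct = np.linalg.solve(X.T @ X + lam * ell * np.eye(m), X.T @ y)
    print(np.allclose(f_z_lam, w_direct))          # True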

24. The noise
   1. The quantities
         \delta_1 = \|h_z - h\|_H, \qquad \delta_2 = \|T_{\mathbf{x}} - T\|_{L(H)}
      are the measures of the noise associated with the training set z = (x, y) (illustrated below)
   2. Up to a rescaling of the constants,
         \Bigl| \bigl( I[f_z^\lambda] - I[f_H] \bigr) - \bigl( I[f^\lambda] - I[f_H] \bigr) \Bigr|
         \le \frac{\|T_{\mathbf{x}} - T\|_{L(H)}}{\sqrt{\lambda}} + \frac{\|h_z - h\|_H}{\sqrt{\lambda}}
   3. δ_1 and δ_2 do not depend on λ and are of probabilistic nature; the effect of the regularization procedure is factorized by analytic methods
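
A Monte Carlo illustration of the role of δ_1 and δ_2, again with the illustrative linear kernel and a model where T = E_x[x xᵀ] and h = E_{x,y}[y x] are known exactly (x standard Gaussian, y = w·x plus mean-zero noise): both noise measures shrink, roughly like 1/√ℓ, as the number of examples grows:

    import numpy as np

    rng = np.random.default_rng(3)
    m = 5
    w = rng.standard_normal(m)          # regression function g(x) = w . x
    T = np.eye(m)                       # E[x x^T] for x ~ N(0, I)
    h = w.copy()                        # E[y x] = E[(w . x) x] = T w = w

    for ell in [100, 1_000, 10_000]:
        X = rng.standard_normal((ell, m))
        y = X @ w + 0.1 * rng.standard_normal(ell)
        T_x = X.T @ X / ell
        h_z = X.T @ y / ell
        delta1 = np.linalg.norm(h_z - h)            # ||h_z - h||_H
        delta2 = np.linalg.norm(T_x - T, 2)         # operator norm ||T_x - T||_{L(H)}
        print(ell, delta1, delta2)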

25. Generalized Bennett inequality
   1. Since H is a reproducing kernel Hilbert space, that is, f(x) = ⟨f, K_x⟩_H,
         h_z = \frac{1}{\ell} \sum_{i=1}^{\ell} y_i K_{x_i}, \qquad h = \mathbb{E}_{x,y}[\, y K_x \,],
         T_{\mathbf{x}} = \frac{1}{\ell} \sum_{i=1}^{\ell} \langle \cdot, K_{x_i} \rangle_H \, K_{x_i}, \qquad T = \mathbb{E}_x[\, \langle \cdot, K_x \rangle_H \, K_x \,]
   2. Theorem [Smale-Yao (2004)]. Let ξ : X × Y → H be a random variable with ‖ξ(x, y)‖_H ≤ 1. Then
         \operatorname{Prob}_{z \in (X \times Y)^\ell} \Bigl[ \Bigl\| \frac{1}{\ell} \sum_{i=1}^{\ell} \xi(x_i, y_i) - \mathbb{E}_{x,y}[\xi] \Bigr\|_H \ge \epsilon \Bigr]
         \le 2 \exp\Bigl( -\frac{\ell}{2}\, \epsilon \log(1 + \epsilon) \Bigr)
