
Regression and classification from an infinite dimensional predictor


  1. 1 Introduction and motivations

     Regression and classification from an infinite dimensional predictor

     Settings: $(X, Y)$ is a random pair of variables where
     • $Y \in \{-1, 1\}$ (binary classification problem) or $Y \in \mathbb{R}$;
     • $X \in (\mathcal{X}, \langle \cdot, \cdot \rangle_{\mathcal{X}})$, an infinite dimensional Hilbert space.
     We are given a learning set $S_n = \{(X_i, Y_i)\}_{i=1}^{n}$ of $n$ i.i.d. copies of $(X, Y)$.

     Purpose: find $\phi_n : \mathcal{X} \to \{-1, 1\}$ or $\mathbb{R}$ that is universally consistent.
     Classification case: $\lim_{n \to +\infty} P(\phi_n(X) \neq Y) = L^*$, where $L^* = \inf_{\phi : \mathcal{X} \to \{-1,1\}} P(\phi(X) \neq Y)$ is the Bayes risk.
     Regression case: $\lim_{n \to +\infty} E\big[(\phi_n(X) - Y)^2\big] = L^*$, where $L^* = \inf_{\phi : \mathcal{X} \to \mathbb{R}} E\big[(\phi(X) - Y)^2\big]$ will also be called the Bayes risk.

     An example: predicting the rate of yellow berry in durum wheat from its NIR spectrum.

     Using derivatives: practically, $X^{(m)}$ is often more relevant than $X$ for the prediction.
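The deck contains no code, but as a running illustration of this setting, here is a small Python sketch (the data-generating mechanism and all names are hypothetical) that builds a learning set of curves observed on a finite grid, standing in for sampled NIR spectra with a binary label:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_toy_functional_data(n=100, d=64):
    """Toy stand-in for sampled spectra: smooth random curves on a grid,
    with a binary label driven by one of the curve's coefficients."""
    t = np.sort(rng.uniform(0.0, 1.0, size=d))           # sampling grid tau_d (non-uniform)
    coefs = rng.normal(size=(n, 4))
    # each curve is a small random combination of smooth basis functions
    basis = np.stack([np.ones_like(t), t, np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
    X_tau = coefs @ basis                                 # shape (n, d): X_i(t), t in tau_d
    X_tau += 0.05 * rng.normal(size=X_tau.shape)          # observation noise
    Y = np.where(coefs[:, 1] > 0, 1, -1)                  # label depends on the slope coefficient
    return t, X_tau, Y

t, X_tau, Y = make_toy_functional_data()
print(X_tau.shape, Y[:10])
```

Later sketches reuse this kind of sampled data to illustrate the smoothing-spline steps.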

  2. But $X \mapsto X^{(m)}$ induces information loss: any predictor built on $X^{(m)} = D^m X$ is a particular predictor built on $X$, so that
     $$\inf_{\phi : D^m\mathcal{X} \to \{-1,1\}} P\big(\phi(X^{(m)}) \neq Y\big) \;\geq\; \inf_{\phi : \mathcal{X} \to \{-1,1\}} P\big(\phi(X) \neq Y\big) = L^*$$
     and
     $$\inf_{\phi : D^m\mathcal{X} \to \mathbb{R}} E\Big[\big(\phi(X^{(m)}) - Y\big)^2\Big] \;\geq\; \inf_{\phi : \mathcal{X} \to \mathbb{R}} E\Big[\big(\phi(X) - Y\big)^2\Big] = L^*.$$

  3. Sampled functions

     Practically, the $(X_i)_i$ are not perfectly known; only a discrete sampling is given:
     $$X_i^{\tau_d} = (X_i(t))_{t \in \tau_d} \quad \text{where} \quad \tau_d = \{t_1^{\tau_d}, \ldots, t_{|\tau_d|}^{\tau_d}\}.$$
     The sampling can be non-uniform... and the data can be corrupted by noise.

     Then, $X_i^{(m)}$ is estimated from $X_i^{\tau_d}$ by $\widehat{X}^{(m)}_{i,\tau_d}$, which also induces information loss:
     $$\inf_{\phi : D^m\mathcal{X} \to \{-1,1\}} P\big(\phi(\widehat{X}^{(m)}_{\tau_d}) \neq Y\big) \;\geq\; \inf_{\phi : D^m\mathcal{X} \to \{-1,1\}} P\big(\phi(X^{(m)}) \neq Y\big) \;\geq\; L^*$$
     and
     $$\inf_{\phi : D^m\mathcal{X} \to \mathbb{R}} E\Big[\big(\phi(\widehat{X}^{(m)}_{\tau_d}) - Y\big)^2\Big] \;\geq\; \inf_{\phi : D^m\mathcal{X} \to \mathbb{R}} E\Big[\big(\phi(X^{(m)}) - Y\big)^2\Big] \;\geq\; L^*.$$
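To see why estimating $X^{(m)}$ from noisy samples is delicate, and hence why the deck turns to smoothing splines, here is a small sketch (toy data, same spirit as above) comparing a naive finite-difference derivative of a noisy sampled curve with the true derivative:

```python
import numpy as np

# Hypothetical toy curve, in the same spirit as the previous sketch.
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0.0, 1.0, size=64))          # non-uniform sampling grid tau_d
x_true = np.sin(2 * np.pi * t)                       # one "true" curve X(t)
x_obs = x_true + 0.05 * rng.normal(size=t.size)      # noisy sampled version X^{tau_d}

# Naive first derivative by finite differences on the non-uniform grid.
dx_naive = np.gradient(x_obs, t)
dx_true = 2 * np.pi * np.cos(2 * np.pi * t)

# Small observation noise is strongly amplified by direct differencing,
# which is why the deck estimates X^{(m)} through smoothing splines instead.
print("mean abs error of naive derivative:", np.mean(np.abs(dx_naive - dx_true)))
```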

  4. Purpose of the presentation

     Find a classifier or a regression function $\phi_{n,\tau_d}$ built from $\widehat{X}^{(m)}_{\tau_d}$ such that the risk of $\phi_{n,\tau_d}$ asymptotically reaches the Bayes risk $L^*$:
     $$\lim_{|\tau_d| \to +\infty} \lim_{n \to +\infty} P\big(\phi_{n,\tau_d}(\widehat{X}^{(m)}_{\tau_d}) \neq Y\big) = L^*$$
     or
     $$\lim_{|\tau_d| \to +\infty} \lim_{n \to +\infty} E\Big[\big(\phi_{n,\tau_d}(\widehat{X}^{(m)}_{\tau_d}) - Y\big)^2\Big] = L^*.$$
     Main idea: use a relevant way to estimate $X^{(m)}$ from $X^{\tau_d}$ (by smoothing splines) and combine the consistency of splines with the consistency of an $\mathbb{R}^{|\tau_d|}$-classifier or regression function.

     2 A general consistency result

     Basics about smoothing splines I

     Suppose that $\mathcal{X}$ is the Sobolev space
     $$\mathcal{H}^m = \big\{ h \in L^2[0,1] \mid \forall j = 1, \ldots, m, \ D^j h \text{ exists (weak sense) and } D^m h \in L^2 \big\}$$
     equipped with the scalar product
     $$\langle u, v \rangle_{\mathcal{H}^m} = \langle D^m u, D^m v \rangle_{L^2} + \sum_{j=1}^{m} B^j u \, B^j v,$$
     where the $B^j$ are $m$ boundary conditions such that $\operatorname{Ker} B \cap \mathcal{P}^{m-1} = \{0\}$.

     $(\mathcal{H}^m, \langle \cdot, \cdot \rangle_{\mathcal{H}^m})$ is a RKHS: there exist $k_0 : \mathcal{P}^{m-1} \times \mathcal{P}^{m-1} \to \mathbb{R}$ and $k_1 : \operatorname{Ker} B \times \operatorname{Ker} B \to \mathbb{R}$ such that
     $$\forall u \in \mathcal{P}^{m-1}, \ t \in [0,1]: \ \langle u, k_0(t, \cdot) \rangle_{\mathcal{H}^m} = u(t)
     \quad \text{and} \quad
     \forall u \in \operatorname{Ker} B, \ t \in [0,1]: \ \langle u, k_1(t, \cdot) \rangle_{\mathcal{H}^m} = u(t).$$
     See [Berlinet and Thomas-Agnan, 2004] for further details.
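As a concrete check of these definitions (not from the deck; the boundary condition $B^1 h = h(0)$ is chosen here purely for illustration), the case $m = 1$ can be worked out by hand:

```latex
% Worked instance for m = 1 with the single boundary condition B^1 h = h(0).
\[
  \mathcal{H}^1 = \{ h \in L^2[0,1] : D h \in L^2 \}, \qquad
  \langle u, v \rangle_{\mathcal{H}^1} = \int_0^1 u'(s)\, v'(s)\, ds + u(0)\, v(0),
\]
\[
  k_0(s, t) = 1, \qquad k_1(s, t) = \min(s, t).
\]
% Reproducing property on Ker B = { h : h(0) = 0 }: since k_1(t, 0) = 0,
\[
  \langle u, k_1(t, \cdot) \rangle_{\mathcal{H}^1}
  = \int_0^1 u'(s)\, \mathbf{1}_{\{ s \le t \}}\, ds + u(0) \cdot 0
  = u(t) - u(0) = u(t) \quad \text{for } u \in \operatorname{Ker} B .
\]
```

These expressions match the general formulas for $k_0$ and $k_1$ given on the next slide when $m = 1$.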

  5. Basics about smoothing splines II

     A simple example of boundary conditions: $h(0) = h^{(1)}(0) = \ldots = h^{(m-1)}(0) = 0$. Then,
     $$k_0(s,t) = \sum_{k=0}^{m-1} \frac{t^k s^k}{(k!)^2}
     \quad \text{and} \quad
     k_1(s,t) = \int_0^1 \frac{(t-w)_+^{m-1} (s-w)_+^{m-1}}{((m-1)!)^2} \, dw.$$

     Estimating the predictors with smoothing splines I

     Assumption (A1)
     • $|\tau_d| \geq m - 1$;
     • the sampling points are distinct in $[0,1]$;
     • the $B^j$ are linearly independent from $h \mapsto h(t)$ for all $t \in \tau_d$.

     [Kimeldorf and Wahba, 1971]: for $x^{\tau_d}$ in $\mathbb{R}^{|\tau_d|}$, there exists a unique $\hat{x}_{\lambda,\tau_d} \in \mathcal{H}^m$ solution of
     $$\arg\min_{h \in \mathcal{H}^m} \ \frac{1}{|\tau_d|} \sum_{l=1}^{|\tau_d|} \big( h(t_l) - x_l^{\tau_d} \big)^2 + \lambda \int_{[0,1]} \big( h^{(m)}(t) \big)^2 \, dt,$$
     and $\hat{x}_{\lambda,\tau_d} = S_{\lambda,\tau_d} x^{\tau_d}$ where $S_{\lambda,\tau_d} : \mathbb{R}^{|\tau_d|} \to \mathcal{H}^m$. These assumptions are fulfilled by the previous simple example as long as $0 \notin \tau_d$.

     Estimating the predictors with smoothing splines II

     $S_{\lambda,\tau_d}$ is given by
     $$S_{\lambda,\tau_d} = \omega^T \big( U (K_1 + \lambda I_{|\tau_d|})^{-1} U^T \big)^{-1} U (K_1 + \lambda I_{|\tau_d|})^{-1}
     + \eta^T (K_1 + \lambda I_{|\tau_d|})^{-1} \Big( I_{|\tau_d|} - U^T \big( U (K_1 + \lambda I_{|\tau_d|})^{-1} U^T \big)^{-1} U (K_1 + \lambda I_{|\tau_d|})^{-1} \Big)
     = \omega^T M_0 + \eta^T M_1$$
     with
     • $\{\omega_1, \ldots, \omega_m\}$ a basis of $\mathcal{P}^{m-1}$, $\omega = (\omega_1, \ldots, \omega_m)^T$ and $U = (\omega_i(t))_{i=1,\ldots,m;\ t \in \tau_d}$;
     • $\eta = (k_1(t, \cdot))^T_{t \in \tau_d}$ and $K_1 = (k_1(t, t'))_{t, t' \in \tau_d}$.

     The observations of the predictor $X$ (NIR spectra) are then estimated from their sampling $X^{\tau_d}$ by $\widehat{X}_{\lambda,\tau_d}$.
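Below is a minimal numerical sketch of this construction (not the deck's code; the helper names are mine) for the simple example above with $m = 2$, where $k_1$ has a closed form. It follows the slide's formula literally, with $M = K_1 + \lambda I_{|\tau_d|}$; other references scale $\lambda$ by $|\tau_d|$, which only changes the meaning of $\lambda$.

```python
import numpy as np

def k1_m2(s, t):
    """k1(s,t) for m = 2 with boundary conditions h(0) = h'(0) = 0:
    k1(s,t) = integral_0^1 (s-w)_+ (t-w)_+ dw, in closed form."""
    a = np.minimum(s, t)
    return s * t * a - (s + t) * a**2 / 2 + a**3 / 3

def spline_smoother_matrices(t, lam, m=2):
    """Build M0 and M1 so that the smoothing spline of samples x is
    x_hat(.) = omega(.)^T (M0 @ x) + eta(.)^T (M1 @ x).
    Follows the slide's formula with M = K1 + lam * I (some texts use |tau_d| * lam)."""
    assert m == 2, "closed-form kernel implemented for m = 2 only"
    n = t.size
    U = np.vstack([np.ones(n), t])                     # omega_1 = 1, omega_2 = t, shape (m, n)
    S, T = np.meshgrid(t, t, indexing="ij")
    K1 = k1_m2(S, T)                                   # Gram matrix (k1(t_l, t_l'))
    Minv = np.linalg.inv(K1 + lam * np.eye(n))
    A_inv = np.linalg.inv(U @ Minv @ U.T)
    M0 = A_inv @ U @ Minv                              # shape (m, n)
    M1 = Minv @ (np.eye(n) - U.T @ M0)                 # shape (n, n)
    return U, K1, M0, M1

def evaluate_spline(t_new, t, x, lam):
    """Evaluate the smoothing spline of the samples (t, x) at new points t_new."""
    U, K1, M0, M1 = spline_smoother_matrices(t, lam)
    d, c = M0 @ x, M1 @ x                              # coefficients on {omega_j} and {k1(t_l, .)}
    omega_new = np.vstack([np.ones(t_new.size), t_new])
    eta_new = k1_m2(t_new[:, None], t[None, :])
    return omega_new.T @ d + eta_new @ c

# quick check on a noisy toy curve
rng = np.random.default_rng(2)
t = np.sort(rng.uniform(0.05, 1.0, size=40))           # keep 0 out of tau_d (cf. assumption A1)
x = np.sin(2 * np.pi * t) + 0.05 * rng.normal(size=t.size)
grid = np.linspace(0.05, 1.0, 200)
fit = evaluate_spline(grid, t, x, lam=1e-4)
print("max |fit - sin| on the grid:", np.max(np.abs(fit - np.sin(2 * np.pi * grid))))
```

The coefficient vectors `M0 @ x` and `M1 @ x` are all that is needed to evaluate $\widehat{X}_{\lambda,\tau_d}$ or its derivatives.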

  6. Two important consequences

     1. No information loss:
     $$\inf_{\phi : \mathcal{H}^m \to \{-1,1\}} P\big(\phi(\widehat{X}_{\lambda,\tau_d}) \neq Y\big) = \inf_{\phi : \mathbb{R}^{|\tau_d|} \to \{-1,1\}} P\big(\phi(X^{\tau_d}) \neq Y\big)$$
     and
     $$\inf_{\phi : \mathcal{H}^m \to \mathbb{R}} E\Big[\big(\phi(\widehat{X}_{\lambda,\tau_d}) - Y\big)^2\Big] = \inf_{\phi : \mathbb{R}^{|\tau_d|} \to \mathbb{R}} E\Big[\big(\phi(X^{\tau_d}) - Y\big)^2\Big].$$

     2. Easy way to use derivatives:
     $$\langle S_{\lambda,\tau_d} u^{\tau_d}, S_{\lambda,\tau_d} v^{\tau_d} \rangle_{\mathcal{H}^m}
     = \langle \hat{u}_{\lambda,\tau_d}, \hat{v}_{\lambda,\tau_d} \rangle_{\mathcal{H}^m}
     \simeq \langle \hat{u}^{(m)}_{\lambda,\tau_d}, \hat{v}^{(m)}_{\lambda,\tau_d} \rangle_{L^2}$$
     with
     $$\langle S_{\lambda,\tau_d} u^{\tau_d}, S_{\lambda,\tau_d} v^{\tau_d} \rangle_{\mathcal{H}^m}
     = (u^{\tau_d})^T M_0^T W M_0 \, v^{\tau_d} + (u^{\tau_d})^T M_1^T K_1 M_1 \, v^{\tau_d}
     = (u^{\tau_d})^T M_{\lambda,\tau_d} \, v^{\tau_d}
     = (Q_{\lambda,\tau_d} u^{\tau_d})^T (Q_{\lambda,\tau_d} v^{\tau_d}),$$
     where $K_1$, $M_0$ and $M_1$ have been previously defined, $W = (\langle \omega_i, \omega_j \rangle_{\mathcal{H}^m})_{i,j=1,\ldots,m}$, $M_{\lambda,\tau_d}$ is symmetric positive definite, and $Q_{\lambda,\tau_d}$ is the Cholesky triangle of $M_{\lambda,\tau_d}$: $Q_{\lambda,\tau_d}^T Q_{\lambda,\tau_d} = M_{\lambda,\tau_d}$.
     Remark: $Q_{\lambda,\tau_d}$ is calculated only from the RKHS, $\lambda$ and $\tau_d$: it does not depend on the data set.

     Classification and regression based on derivatives

     Suppose that we know a consistent classifier or regression function in $\mathbb{R}^{|\tau_d|}$ that is based on the $\mathbb{R}^{|\tau_d|}$ scalar product or norm. The corresponding derivative-based classifier or regression function is obtained by using the norm induced by $Q_{\lambda,\tau_d}$.

     Example: nonparametric kernel regression
     $$\Psi : u \in \mathbb{R}^{|\tau_d|} \mapsto \frac{\sum_{i=1}^n T_i \, K\!\left( \frac{\| u - U_i \|_{\mathbb{R}^{|\tau_d|}}}{h_n} \right)}{\sum_{i=1}^n K\!\left( \frac{\| u - U_i \|_{\mathbb{R}^{|\tau_d|}}}{h_n} \right)},$$
     where $(U_i, T_i)_{i=1,\ldots,n}$ is a learning set in $\mathbb{R}^{|\tau_d|} \times \mathbb{R}$.
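A short sketch of this plug-in idea (hypothetical helper names; Gaussian kernel chosen for illustration): the kernel regression $\Psi$ only uses norms of differences, so supplying a matrix $Q$ amounts to replacing the Euclidean norm by the norm induced by $Q^T Q$. In the setting above, $Q$ would be the Cholesky triangle $Q_{\lambda,\tau_d}$ of $M_{\lambda,\tau_d}$, computable from the matrices of the previous sketch.

```python
import numpy as np

def nadaraya_watson(u, U, T, h, Q=None):
    """Kernel regression estimate at point u from a learning set (U, T).
    U: (n, d) inputs, T: (n,) responses, h: bandwidth.
    If Q is given, distances are ||Q (u - U_i)||, i.e. the norm induced by Q^T Q."""
    diffs = U - u                                   # (n, d)
    if Q is not None:
        diffs = diffs @ Q.T                         # row i becomes (Q (U_i - u))^T
    dists = np.linalg.norm(diffs, axis=1)
    weights = np.exp(-0.5 * (dists / h) ** 2)       # Gaussian kernel K
    return np.sum(weights * T) / np.sum(weights)

# toy check in R^d with the plain Euclidean norm (Q = None);
# for the derivative-based version, pass Q = Cholesky triangle of M_{lambda, tau_d}
rng = np.random.default_rng(3)
U = rng.normal(size=(200, 10))
T = U[:, 0] + 0.1 * rng.normal(size=200)
u_new = rng.normal(size=10)
print(nadaraya_watson(u_new, U, T, h=1.0))
```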

  7. The derivative-based kernel regression estimate is then $\phi_{n,\tau_d} = \Psi \circ Q_{\lambda,\tau_d}$:
     $$\phi_{n,\tau_d} : x \in \mathcal{H}^m \mapsto
     \frac{\sum_{i=1}^n Y_i \, K\!\left( \frac{\| Q_{\lambda,\tau_d} x^{\tau_d} - Q_{\lambda,\tau_d} X_i^{\tau_d} \|_{\mathbb{R}^{|\tau_d|}}}{h_n} \right)}{\sum_{i=1}^n K\!\left( \frac{\| Q_{\lambda,\tau_d} x^{\tau_d} - Q_{\lambda,\tau_d} X_i^{\tau_d} \|_{\mathbb{R}^{|\tau_d|}}}{h_n} \right)}
     \;\simeq\;
     \frac{\sum_{i=1}^n Y_i \, K\!\left( \frac{\| x^{(m)} - X_i^{(m)} \|_{L^2}}{h_n} \right)}{\sum_{i=1}^n K\!\left( \frac{\| x^{(m)} - X_i^{(m)} \|_{L^2}}{h_n} \right)}.$$

     Remark for consistency

     Classification case (approximately the same holds for regression):
     $$P\big(\phi_{n,\tau_d}(\widehat{X}_{\lambda,\tau_d}) \neq Y\big) - L^*
     = \Big[ P\big(\phi_{n,\tau_d}(\widehat{X}_{\lambda,\tau_d}) \neq Y\big) - L_d^* \Big] + \big[ L_d^* - L^* \big],$$
     where $L_d^* = \inf_{\phi : \mathbb{R}^{|\tau_d|} \to \{-1,1\}} P(\phi(X^{\tau_d}) \neq Y)$.
     1. For all fixed $d$,
     $$\lim_{n \to +\infty} P\big(\phi_{n,\tau_d}(\widehat{X}_{\lambda,\tau_d}) \neq Y\big) = L_d^*$$
     as long as the $\mathbb{R}^{|\tau_d|}$-classifier is consistent, because there is a one-to-one mapping between $X^{\tau_d}$ and $\widehat{X}_{\lambda,\tau_d}$.
     2. $L_d^* - L^* \leq E\big[ \big| E(Y \mid \widehat{X}_{\lambda,\tau_d}) - E(Y \mid X) \big| \big]$: with consistency of the spline estimate $\widehat{X}_{\lambda,\tau_d}$ and an assumption on the regularity of $E(Y \mid X = \cdot)$, consistency would be proved. But continuity of $E(Y \mid X = \cdot)$ is a strong assumption in the infinite dimensional case, and is not easy to check.

     Spline consistency

     Let $\lambda$ depend on $d$ and denote by $(\lambda_d)_d$ the sequence of regularization parameters. Also introduce
     $$\overline{\Delta}_{\tau_d} := \max\{t_1, t_2 - t_1, \ldots, 1 - t_{|\tau_d|}\},
     \qquad
     \underline{\Delta}_{\tau_d} := \min_{1 \leq i < |\tau_d|} \{ t_{i+1} - t_i \}.$$
     Assumption (A2)
     • $\exists R$ such that $\overline{\Delta}_{\tau_d} / \underline{\Delta}_{\tau_d} \leq R$ for all $d$;
     • $\lim_{d \to +\infty} |\tau_d| = +\infty$;
     • $\lim_{d \to +\infty} \lambda_d = 0$.

     [Ragozin, 1983]: under (A1) and (A2), there exist $A_{R,m}$ and $B_{R,m}$ such that for any $x \in \mathcal{H}^m$ and any $\lambda_d > 0$,
     $$\big\| \hat{x}_{\lambda_d,\tau_d} - x \big\|_{L^2}^2 \leq \left( A_{R,m} \lambda_d + B_{R,m} \frac{1}{|\tau_d|^{2m}} \right) \| D^m x \|_{L^2}^2 \xrightarrow[d \to +\infty]{} 0.$$
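As a quick numerical sanity check of this bound (a sketch, not the deck's experiment), one can fit cubic smoothing splines ($m = 2$) to a fixed smooth curve on finer and finer grids with $\lambda_d \to 0$ and watch the $L^2$ error shrink. SciPy's `make_smoothing_spline` is used here as a stand-in for the construction of slide 5; its penalty convention has no $1/|\tau_d|$ factor, so only the constants differ.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

def l2_error(spl, f, grid):
    """Approximate L^2([0,1]) distance between a fitted spline and a target function."""
    return np.sqrt(np.mean((spl(grid) - f(grid)) ** 2))

f = lambda t: np.sin(2 * np.pi * t)        # a fixed x in H^m (here m = 2)
grid = np.linspace(0.0, 1.0, 2000)

for d in [10, 20, 40, 80, 160]:
    tau_d = np.linspace(0.0, 1.0, d)       # regular design: mesh ratio R = 1
    lam_d = 1.0 / d**2                      # lambda_d -> 0 as |tau_d| -> +infinity
    spl = make_smoothing_spline(tau_d, f(tau_d), lam=lam_d)
    print(f"|tau_d| = {d:4d}   lambda_d = {lam_d:.1e}   L2 error = {l2_error(spl, f, grid):.2e}")
```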

  8. Bayes risk consistency

     Assumption (A3a): $E\big[ \| D^m X \|_{L^2}^2 \big]$ is finite and $Y \in \{-1, 1\}$,
     or
     Assumption (A3b): $\tau_d \subset \tau_{d+1}$ for all $d$ and $E(Y^2)$ is finite.

     Under (A1)-(A3), $\lim_{d \to +\infty} L_d^* = L^*$.

     Proof under assumption (A3a)

     The proof is based on a result of [Faragó and Győrfi, 1975]: for a pair of random variables $(X, Y)$ taking their values in $\mathcal{X} \times \{-1, 1\}$, where $(\mathcal{X}, \delta)$ is an arbitrary metric space, and for a sequence of functions $T_d : \mathcal{X} \to \mathcal{X}$ such that
     $$E\big( \delta(T_d(X), X) \big) \xrightarrow[d \to +\infty]{} 0,$$
     one has $\lim_{d \to +\infty} \inf_{\phi : \mathcal{X} \to \{-1,1\}} P(\phi(T_d(X)) \neq Y) = L^*$.
     • $T_d$ is the spline estimate based on the sampling;
     • the inequality of [Ragozin, 1983] about this estimate, combined with (A3a), gives exactly the assumption of Faragó and Győrfi's theorem (with $\delta$ the $L^2$ distance).
     Then the result follows.

     Proof under assumption (A3b)

     Under (A3b), the samplings are nested, so the $\sigma$-fields generated by the $\widehat{X}_{\lambda_d,\tau_d}$ (equivalently, by the $X^{\tau_d}$, using the one-to-one mapping above) are increasing, and $\big( E(Y \mid \widehat{X}_{\lambda_d,\tau_d}) \big)_d$ is a martingale; since $E(Y^2)$ is finite, it is bounded in $L^2$, hence uniformly integrable, and thus converges in $L^1$-norm. Using the consistency of $(\widehat{X}_{\lambda_d,\tau_d})_d$ to $X$ ends the proof.
