1. Nonparametric prediction
László Györfi
Budapest University of Technology and Economics
Department of Computer Science and Information Theory
e-mail: gyorfi@szit.bme.hu
www.szit.bme.hu/~gyorfi

2. Universal prediction: squared loss
$y_i$ real valued, $x_i$ vector valued.
At time instant $i$ the predictor is asked to guess $y_i$ with knowledge of the past $(x_1, \ldots, x_i, y_1, \ldots, y_{i-1}) = (x_1^i, y_1^{i-1})$.
The predictor is a sequence of functions $g = \{g_i\}_{i=1}^{\infty}$; $g_i(x_1^i, y_1^{i-1})$ is the estimate of $y_i$.
After $n$ time instants the empirical squared error for the sequence $x_1^n, y_1^n$ is
$$L_n(g) = \frac{1}{n} \sum_{i=1}^n \left( g_i(x_1^i, y_1^{i-1}) - y_i \right)^2.$$
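As an illustration (not part of the original slides), a minimal Python sketch of this empirical squared error, assuming the sequential predictions have already been computed:

```python
import numpy as np

def empirical_squared_loss(predictions, y):
    """L_n(g) = (1/n) * sum_i (g_i(x_1^i, y_1^{i-1}) - y_i)^2,
    where predictions[i] is the guess for y[i] made from the past only."""
    predictions = np.asarray(predictions, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean((predictions - y) ** 2))
```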

3. Regression function estimation
$Y$ real valued, $X$ observation vector.
Regression problem: $\min_f E\{(Y - f(X))^2\}$.
Regression function: $m(x) = E\{Y \mid X = x\}$.
For each function $f$ one has
$$E\{(f(X) - Y)^2\} = E\{(m(X) - Y)^2\} + E\{(m(X) - f(X))^2\}.$$

4. Data: $D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$
Regression function estimate: $m_n(x) = m_n(x, D_n)$.
Usual consistency conditions:
- $m(x)$ is smooth
- $X$ has a density
- $Y$ is bounded
Nonparametric features:
- construction of the estimate
- consistency

5. Universal consistency
Definition 1. The estimator $m_n$ is called (weakly) universally consistent if
$$E\{(m(X) - m_n(X))^2\} \to 0$$
for all distributions of $(X, Y)$ with $E Y^2 < \infty$.

6. Local averaging estimates
Stone (1977):
$$m_n(x) = \sum_{i=1}^n W_{ni}(x; X_1, \ldots, X_n) Y_i.$$

7. k-nearest neighbor estimate
$W_{ni}$ is $1/k$ if $X_i$ is one of the $k$ nearest neighbors of $x$ among $X_1, \ldots, X_n$, and $W_{ni}$ is $0$ otherwise.
Theorem 1. If $k_n \to \infty$ and $k_n/n \to 0$, then the $k$-nearest neighbor estimate is weakly universally consistent.
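A minimal Python sketch of this estimate (not from the slides); the choice $k_n = \lfloor \sqrt{n} \rfloor$ mentioned in the comment is only one example satisfying the conditions of Theorem 1:

```python
import numpy as np

def knn_regression_estimate(x, X, Y, k):
    """m_n(x): average of the Y_i whose X_i are the k nearest neighbors of x.
    For consistency one may take e.g. k = int(np.sqrt(len(Y))), so that
    k_n -> infinity and k_n / n -> 0."""
    X = np.atleast_2d(np.asarray(X, dtype=float))   # shape (n, d)
    dists = np.linalg.norm(X - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]                 # indices of the k nearest X_i
    return float(np.mean(np.asarray(Y, dtype=float)[nearest]))
```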

8. Partitioning estimate
Partition $\mathcal{P}_n = \{A_{n,1}, A_{n,2}, \ldots\}$.
$$m_n(x) = \frac{\sum_{i=1}^n Y_i K_n(x, X_i)}{\sum_{i=1}^n K_n(x, X_i)},$$
where
$$K_n(x, u) = \sum_j I_{[x \in A_{n,j},\, u \in A_{n,j}]}.$$
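A sketch (not from the slides) of the cubic partitioning estimate, where the cells $A_{n,j}$ are taken to be axis-aligned cubes of side $h$; the convention $0/0 = 0$ for empty cells is an assumption:

```python
import numpy as np

def partitioning_estimate(x, X, Y, h):
    """Cubic partitioning estimate: average the Y_i whose X_i fall in the
    same cube of side length h as x."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    cell_of_x = np.floor(np.asarray(x, dtype=float) / h)
    same_cell = np.all(np.floor(X / h) == cell_of_x, axis=1)
    if not same_cell.any():
        return 0.0                        # empty cell: 0/0 taken to be 0
    return float(np.asarray(Y, dtype=float)[same_cell].mean())
```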

9. Theorem 2. If for all spheres $S$ centered at the origin
$$\lim_{n \to \infty} \sup_{j:\, A_{n,j} \cap S \neq \emptyset} \operatorname{diam}(A_{n,j}) = 0$$
and
$$\lim_{n \to \infty} \frac{|\{j:\, A_{n,j} \cap S \neq \emptyset\}|}{n} = 0,$$
then the partitioning estimate is weakly universally consistent.
Example: $A_{n,j}$ are cubes with volume $h_n^d$, $h_n \to 0$, $n h_n^d \to \infty$.

10. Kernel estimate
Kernel function $K(x) \geq 0$, bandwidth $h_n > 0$.
$$m_n(x) = \frac{\sum_{i=1}^n Y_i K\!\left(\frac{x - X_i}{h_n}\right)}{\sum_{i=1}^n K\!\left(\frac{x - X_i}{h_n}\right)}$$
Theorem 3. If $h_n \to 0$ and $n h_n^d \to \infty$, then under some conditions on $K$ the kernel estimate is weakly universally consistent.
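A sketch (not from the slides) of this kernel estimate; the slides leave $K$ unspecified, so the Gaussian kernel below is just one concrete choice:

```python
import numpy as np

def kernel_estimate(x, X, Y, h):
    """Kernel estimate with a Gaussian kernel K and bandwidth h."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    d2 = np.sum((X - np.asarray(x, dtype=float)) ** 2, axis=1)
    weights = np.exp(-d2 / (2.0 * h ** 2))     # K((x - X_i) / h), Gaussian choice
    if weights.sum() == 0.0:
        return 0.0                             # 0/0 taken to be 0
    return float(np.dot(weights, np.asarray(Y, dtype=float)) / weights.sum())
```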

11. Least squares estimates
Empirical $L_2$ error:
$$\frac{1}{n} \sum_{j=1}^n |f(X_j) - Y_j|^2$$
Class of functions $\mathcal{F}_n$.
Select a function from $\mathcal{F}_n$ which minimizes the empirical error: $m_n \in \mathcal{F}_n$ and
$$\frac{1}{n} \sum_{j=1}^n |m_n(X_j) - Y_j|^2 = \min_{f \in \mathcal{F}_n} \frac{1}{n} \sum_{j=1}^n |f(X_j) - Y_j|^2.$$
The class $\mathcal{F}_n$ grows slowly as $n$ grows.
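A sketch (not from the slides) of empirical $L_2$ error minimization over one concrete class, polynomials of a fixed degree in the univariate case; the class and the least-squares solver are illustrative choices:

```python
import numpy as np

def least_squares_polynomial(X, Y, degree):
    """Select from F_n = {polynomials of the given degree} the function that
    minimizes the empirical L2 error (ordinary least squares on a
    polynomial basis); univariate X for simplicity."""
    X = np.asarray(X, dtype=float).ravel()
    design = np.vander(X, degree + 1)               # columns x^deg, ..., x, 1
    coeffs, *_ = np.linalg.lstsq(design, np.asarray(Y, dtype=float), rcond=None)
    return lambda x: np.polyval(coeffs, x)          # the selected m_n
```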

12. Examples for $\mathcal{F}_n$:
- polynomials
- splines
- neural networks
- radial basis functions

13. Dependent data: time series
The data $D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$
- are dependent
- long-range dependent
- form a stationary and ergodic process
For given $n$, the problem is the following minimization:
$$\min_g E\{(g(X_{n+1}, D_n) - Y_{n+1})^2\}.$$
The best predictor is the conditional expectation $E\{Y_{n+1} \mid X_{n+1}, D_n\}$, which cannot be learned from data.

14. There is no prediction sequence with
$$\lim_{n \to \infty} \left( g_n(X_{n+1}, D_n) - E\{Y_{n+1} \mid X_{n+1}, D_n\} \right) = 0 \quad \text{a.s.}$$
for all stationary and ergodic sequences.
Our aim is to achieve the optimum
$$L^* = \lim_{n \to \infty} \min_g E\{(g(X_{n+1}, D_n) - Y_{n+1})^2\},$$
which is impossible in this strong sense.

15. Universal consistency
There are universally Cesàro-consistent prediction sequences $g_n$:
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n \left( g_i(X_{i+1}, D_i) - Y_{i+1} \right)^2 = L^* \quad \text{a.s.}$$
for all stationary and ergodic sequences.
Such a prediction sequence is called universally consistent.
We show a construction of a universally consistent predictor by combining predictors (experts).

16. Lemma. Let $\tilde{h}_1, \tilde{h}_2, \ldots$ be a sequence of prediction strategies (experts), and let $\{q_k\}$ be a probability distribution on the set of positive integers. Assume that $\tilde{h}_i(y_1^{n-1}) \in [-B, B]$ and $y_1^n \in [-B, B]^n$. Define
$$w_{t,k} = q_k e^{-(t-1) L_{t-1}(\tilde{h}_k)/c}$$
with $c \geq 8B^2$, and
$$v_{t,k} = \frac{w_{t,k}}{\sum_{i=1}^{\infty} w_{t,i}}.$$
Reminder:
$$L_n(g) = \frac{1}{n} \sum_{i=1}^n \left( g_i(x_1^i, y_1^{i-1}) - y_i \right)^2.$$

17. If the prediction strategy $\tilde{g}$ is defined by
$$\tilde{g}_t(y_1^{t-1}) = \sum_{k=1}^{\infty} v_{t,k} \tilde{h}_k(y_1^{t-1}), \qquad t = 1, 2, \ldots$$
then for every $n \geq 1$,
$$L_n(\tilde{g}) \leq \inf_k \left( L_n(\tilde{h}_k) - \frac{c \ln q_k}{n} \right).$$
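To make the mixture concrete, a minimal Python sketch (not from the slides) of one step of this exponentially weighted combination, truncated to finitely many experts so that the sums are computable:

```python
import numpy as np

def exponentially_weighted_prediction(expert_preds, cumulative_losses, q, c):
    """One step of the mixture g~_t from the lemma (finitely many experts).

    expert_preds[k]      : prediction of expert h~_k for the current y_t
    cumulative_losses[k] : (t - 1) * L_{t-1}(h~_k), unnormalized loss so far
    q[k]                 : prior weight q_k
    c                    : scale parameter, c >= 8 * B**2
    """
    w = np.asarray(q, dtype=float) * np.exp(-np.asarray(cumulative_losses) / c)
    v = w / w.sum()                         # the normalized weights v_{t,k}
    return float(np.dot(v, expert_preds))   # sum_k v_{t,k} * h~_k(y_1^{t-1})
```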

18. Special case: $N$ predictors, $\{q_k\}$ the uniform distribution; then
$$L_n(\tilde{g}) \leq \min_k L_n(\tilde{h}_k) + \frac{c \ln N}{n}.$$

19. Dependent data: time series
Stationary and ergodic data $(X_1, Y_1), \ldots, (X_n, Y_n)$. Assume that $|Y_0| \leq B$.
An elementary predictor (expert) is denoted by $h^{(k,\ell)}$, $k, \ell = 1, 2, \ldots$. Let $G_\ell$ be a quantizer of $\mathbb{R}^d$ and $H_\ell$ be a quantizer of $\mathbb{R}$.
For given $k, \ell$, let $I_n$ be the set of time instants $k < i < n$ for which there is a match of the length-$k$ quantized sequences:
$$G_\ell(x_{i-k}^i) = G_\ell(x_{n-k}^n) \quad \text{and} \quad H_\ell(y_{i-k}^{i-1}) = H_\ell(y_{n-k}^{n-1}).$$

20. Then the prediction of this expert is the average of the $y_i$'s with $i \in I_n$:
$$h^{(k,\ell)}(x_1^n, y_1^{n-1}) = \frac{\sum_{i \in I_n} y_i}{|I_n|}.$$
These predictors are not universally consistent, since for small $k$ the bias is large, and for large $k$ the variance is large because of the few matchings. The same is true for the quantizers. The problem is how to choose $k, \ell$ in a data-dependent way. The solution is the combination of experts.
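A sketch (not from the slides) of one such pattern-matching expert; `Gl` and `Hl` stand for the quantizers $G_\ell$ and $H_\ell$ and are assumed to map a single observation to a discrete cell, and returning 0 when $I_n$ is empty is a convention:

```python
import numpy as np

def pattern_matching_expert(x, y, k, Gl, Hl):
    """Expert h^{(k,l)}: average those y_i whose quantized length-k context
    matches the current one.  x = (x_1, ..., x_n), y = (y_1, ..., y_{n-1});
    guards for very short sequences (n <= k + 1) are omitted."""
    n = len(x)
    qx = [Gl(v) for v in x]                    # quantized observations
    qy = [Hl(v) for v in y]                    # quantized responses
    ctx_x = qx[n - k - 1:n]                    # G_l(x_{n-k}^n)
    ctx_y = qy[n - k - 1:n - 1]                # H_l(y_{n-k}^{n-1})
    matches = [y[i - 1]                        # y_i for i in I_n
               for i in range(k + 1, n)        # time instants k < i < n
               if qx[i - k - 1:i] == ctx_x and qy[i - k - 1:i - 1] == ctx_y]
    return float(np.mean(matches)) if matches else 0.0
```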

21. The combination of predictors can be derived according to the previous lemma. Let $\{q_{k,\ell}\}$ be a probability distribution over the pairs $(k, \ell)$, and for $c = 8B^2$ put
$$w_{t,k,\ell} = q_{k,\ell} e^{-(t-1) L_{t-1}(h^{(k,\ell)})/c}$$
and
$$v_{t,k,\ell} = \frac{w_{t,k,\ell}}{\sum_{i,j=1}^{\infty} w_{t,i,j}}.$$
Then the combined prediction is
$$g_t(x_1^t, y_1^{t-1}) = \sum_{k,\ell=1}^{\infty} v_{t,k,\ell}\, h^{(k,\ell)}(x_1^t, y_1^{t-1}).$$
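Putting the pieces together, a sketch (not from the slides) of running the combined predictor sequentially; the infinite family of $(k, \ell)$ experts is truncated here to a finite list for computability:

```python
import numpy as np

def run_combined_predictor(x_seq, y_seq, experts, q, B):
    """Sequential combined predictor g_t over a finite list of experts,
    each a function (x_1^t, y_1^{t-1}) -> prediction (e.g. the
    pattern-matching experts h^{(k,l)} for a finite grid of (k, l))."""
    c = 8.0 * B ** 2
    q = np.asarray(q, dtype=float)
    cum_loss = np.zeros(len(experts))                # (t-1) * L_{t-1}(h^{(k,l)})
    predictions = []
    for t in range(1, len(y_seq) + 1):
        preds = np.array([h(x_seq[:t], y_seq[:t - 1]) for h in experts])
        w = q * np.exp(-cum_loss / c)
        v = w / w.sum()
        predictions.append(float(np.dot(v, preds)))  # g_t(x_1^t, y_1^{t-1})
        cum_loss += (preds - y_seq[t - 1]) ** 2      # reveal y_t, update losses
    return predictions
```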

22. Theorem. If the quantizers $G_\ell$ and $H_\ell$ "are asymptotically fine" and $P\{Y_i \in [-B, B]\} = 1$, then the combined predictor $g$ is universally consistent.

L. Györfi, G. Lugosi (2001), "Strategies for sequential prediction of stationary time series", in Modelling Uncertainty: An Examination of its Theory, Methods and Applications, M. Dror, P. L'Ecuyer, F. Szidarovszky (Eds.), pp. 225-248, Kluwer Academic Publishers.

23. 0-1 loss
$y_i$ takes values in the finite set $\{1, 2, \ldots, M\}$. At time instant $i$ the classifier decides on $y_i$ based on the past observation $(x_1^i, y_1^{i-1})$.
After $n$ rounds the empirical error for $x_1^n, y_1^n$ is
$$L_n(g) = \frac{1}{n} \sum_{i=1}^n I_{\{g(x_1^i, y_1^{i-1}) \neq y_i\}},$$
i.e., the loss is the 0-1 loss, and $L_n(g)$ is the relative frequency of errors.

24. Pattern recognition
$Y$ is $\{1, 2, \ldots, M\}$-valued, $X$ is a feature vector.
Classifier: $g : \mathbb{R}^d \to \{1, 2, \ldots, M\}$.
Probability of error: $L_g = P(g(X) \neq Y)$.
A posteriori probabilities: $P_i(x) = P\{Y = i \mid X = x\}$.
Bayes decision: $g^*(x) = \arg\max_i P_i(x)$.
Bayes error: $L^*$.

25. Universal consistency
Data: $(X_1, Y_1), \ldots, (X_n, Y_n)$,
$$g_n(x) = g_n((X_1, Y_1), \ldots, (X_n, Y_n), x).$$
Definition 2. The classifier $g_n$ is called (weakly) universally consistent if
$$P(g_n(X) \neq Y) \to L^*$$
for all distributions of $(X, Y)$.

26. Local majority voting
$k$-nearest neighbor rule:
$$g_n(x) = \arg\max_j \sum_{i=1}^n W_{n,i}(x) I_{\{Y_i = j\}},$$
Partitioning rule:
$$g_n(x) = \arg\max_j \sum_{i=1}^n I_{\{X_i \in A_n(x)\}} I_{\{Y_i = j\}},$$
Kernel rule:
$$g_n(x) = \arg\max_j \sum_{i=1}^n K\!\left(\frac{X_i - x}{h}\right) I_{\{Y_i = j\}}.$$
The $k$-NN rule, the partitioning rule and the kernel rule are strongly universally consistent.
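For concreteness, a sketch (not from the slides) of the partitioning (majority vote) rule with cubic cells of side $h$; the default label returned for an empty cell is an arbitrary convention:

```python
import numpy as np

def partitioning_classifier(x, X, Y, h):
    """Partitioning rule: predict the most frequent label among the Y_i
    whose X_i fall in the same cube A_n(x) of side h as x."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    Y = np.asarray(Y)
    same_cell = np.all(np.floor(X / h) == np.floor(np.asarray(x, dtype=float) / h),
                       axis=1)
    if not same_cell.any():
        return Y[0]                        # empty cell: arbitrary default label
    labels, counts = np.unique(Y[same_cell], return_counts=True)
    return labels[np.argmax(counts)]       # majority vote within the cell
```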

27. Empirical error minimization
Empirical error:
$$\frac{1}{n} \sum_{j=1}^n I_{\{g(X_j) \neq Y_j\}}$$
Class of classifiers $\mathcal{G}_n$.
Select a classifier from $\mathcal{G}_n$ which minimizes the empirical error: $g_n \in \mathcal{G}_n$ and
$$\frac{1}{n} \sum_{j=1}^n I_{\{g_n(X_j) \neq Y_j\}} = \min_{g \in \mathcal{G}_n} \frac{1}{n} \sum_{j=1}^n I_{\{g(X_j) \neq Y_j\}}.$$
The VC dimension of $\mathcal{G}_n$ grows slowly as $n$ grows.
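A minimal sketch (not from the slides) of selecting the empirical error minimizer, assuming the class $\mathcal{G}_n$ is given as a finite list of candidate classifiers:

```python
import numpy as np

def empirical_error_minimizer(classifiers, X, Y):
    """Return the classifier in G_n (a finite list of functions x -> label)
    with the smallest empirical 0-1 error on (X_1, Y_1), ..., (X_n, Y_n)."""
    errors = [np.mean([g(x) != y for x, y in zip(X, Y)]) for g in classifiers]
    best = int(np.argmin(errors))
    return classifiers[best], float(errors[best])
```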

28. Examples for $\mathcal{G}_n$:
- polynomial classifiers
- tree classifiers
- neural network classifiers
- radial basis function classifiers

29. Dependent data: time series
The data $D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ form a stationary and ergodic process.
For given $n$, the problem is the following minimization:
$$\min_g P\{g(X_{n+1}, D_n) \neq Y_{n+1}\},$$
which cannot be learned from data.
Our aim is to achieve the optimum
$$R^* = \lim_{n \to \infty} \min_g P\{g(X_{n+1}, D_n) \neq Y_{n+1}\},$$
which is impossible in this strong sense.

30. There are universally Cesàro-consistent classifier sequences $g_n$:
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n I_{\{g_i(X_{i+1}, D_i) \neq Y_{i+1}\}} = R^* \quad \text{a.s.}$$
for all stationary and ergodic sequences.
Such a classifier sequence is called universally consistent.
