Less is More: Nyström Computational Regularization
Alessandro Rudi, Raffaello Camoriano, Lorenzo Rosasco
University of Genova - Istituto Italiano di Tecnologia, Massachusetts Institute of Technology
ale_rudi@mit.edu, Dec 10th, NIPS 2015


  1. Less is More: Nyström Computational Regularization
     Alessandro Rudi, Raffaello Camoriano, Lorenzo Rosasco
     University of Genova - Istituto Italiano di Tecnologia, Massachusetts Institute of Technology
     ale_rudi@mit.edu, Dec 10th, NIPS 2015

  2. A Starting Point
     Classically: statistics and optimization are distinct steps in algorithm design.

  3. A Starting Point
     Classically: statistics and optimization are distinct steps in algorithm design.
     Large scale: consider the interplay between statistics and optimization! (Bottou, Bousquet '08)

  4. Supervised Learning
     Problem: estimate $f^*$

  5. Supervised Learning
     Problem: estimate $f^*$ given $S_n = \{(x_1, y_1), \dots, (x_n, y_n)\}$
     [figure: sample points $(x_1, y_1), \dots, (x_5, y_5)$ scattered around the unknown curve $f^*$]

  6. Supervised Learning
     Problem: estimate $f^*$ given $S_n = \{(x_1, y_1), \dots, (x_n, y_n)\}$
     The Setting: $y_i = f^*(x_i) + \varepsilon_i$, $i \in \{1, \dots, n\}$
     ◮ $\varepsilon_i \in \mathbb{R}$, $x_i \in \mathbb{R}^d$ random (with unknown distribution)
     ◮ $f^*$ unknown
     [figure: sample points $(x_1, y_1), \dots, (x_5, y_5)$ scattered around the unknown curve $f^*$]
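
To make the setting concrete, here is a minimal synthetic-data sketch of the model on slide 6. The particular $f^*$, noise level, and sampling distribution are illustrative assumptions, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 200, 1                                   # sample size and input dimension (illustrative)
f_star = lambda x: np.sin(3.0 * x).sum(axis=1)  # a hypothetical, "unknown" target f*

X = rng.uniform(-1.0, 1.0, size=(n, d))         # inputs x_i in R^d, random with unknown law
eps = 0.1 * rng.standard_normal(n)              # noise eps_i in R
y = f_star(X) + eps                             # observed labels y_i = f*(x_i) + eps_i
```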

  7. Outline
     ◮ Learning with kernels
     ◮ Data Dependent Subsampling

  8. Non-linear/non-parametric learning
     $f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$

  9. Non-linear/non-parametric learning
     $f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$
     ◮ $q$ nonlinear function

  10. Non-linear/non-parametric learning
     $f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$
     ◮ $q$ nonlinear function
     ◮ $w_i \in \mathbb{R}^d$ centers

  11. Non-linear/non-parametric learning
     $f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$
     ◮ $q$ nonlinear function
     ◮ $w_i \in \mathbb{R}^d$ centers
     ◮ $c_i \in \mathbb{R}$ coefficients

  12. Non-linear/non-parametric learning
     $f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$
     ◮ $q$ nonlinear function
     ◮ $w_i \in \mathbb{R}^d$ centers
     ◮ $c_i \in \mathbb{R}$ coefficients
     ◮ $M = M_n$ could/should grow with $n$

  13. Non-linear/non-parametric learning
     $f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$
     ◮ $q$ nonlinear function
     ◮ $w_i \in \mathbb{R}^d$ centers
     ◮ $c_i \in \mathbb{R}$ coefficients
     ◮ $M = M_n$ could/should grow with $n$
     Question: how to choose $w_i$, $c_i$ and $M$ given $S_n$?
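
As a concrete instance of the expansion above, the sketch below evaluates $f(x) = \sum_{i=1}^M c_i\, q(x, w_i)$ with a Gaussian $q(x, w) = \exp(-\|x - w\|^2 / 2\sigma^2)$. The kernel, bandwidth, centers, and coefficients are illustrative choices, since the slide deliberately leaves them open.

```python
import numpy as np

def q(X, W, sigma=0.5):
    """Gaussian q(x, w), evaluated for every row x of X against every row w of W."""
    sq_dists = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def f(X, W, c, sigma=0.5):
    """f(x) = sum_{i=1}^M c_i q(x, w_i), evaluated at each row of X."""
    return q(X, W, sigma) @ c

# Toy usage: M = 3 centers in R^2 with arbitrary coefficients.
W = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # centers w_i
c = np.array([1.0, -0.5, 0.3])                       # coefficients c_i
print(f(np.array([[0.2, 0.1]]), W, c))
```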

  14. Learning with Positive Definite Kernels
     There is an elegant answer if:
     ◮ $q$ is symmetric
     ◮ all the matrices $\hat Q_{ij} = q(x_i, x_j)$ are positive semi-definite (i.e. they have non-negative eigenvalues)

  15. Learning with Positive Definite Kernels
     There is an elegant answer if:
     ◮ $q$ is symmetric
     ◮ all the matrices $\hat Q_{ij} = q(x_i, x_j)$ are positive semi-definite (i.e. they have non-negative eigenvalues)
     Representer Theorem (Kimeldorf, Wahba '70; Schölkopf et al. '01):
     ◮ $M = n$
     ◮ $w_i = x_i$
     ◮ $c_i$ by convex optimization!

  16. Kernel Ridge Regression (KRR), a.k.a. Penalized Least Squares
     $\hat f_\lambda = \operatorname{argmin}_{f \in \mathcal{H}}\ \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|^2$

  17. Kernel Ridge Regression (KRR), a.k.a. Penalized Least Squares
     $\hat f_\lambda = \operatorname{argmin}_{f \in \mathcal{H}}\ \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|^2$
     where $\mathcal{H} = \{\, f \mid f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i),\ c_i \in \mathbb{R},\ w_i \in \mathbb{R}^d \text{ (any center!)},\ M \in \mathbb{N} \text{ (any length!)} \,\}$

  18. Kernel Ridge Regression (KRR), a.k.a. Penalized Least Squares
     $\hat f_\lambda = \operatorname{argmin}_{f \in \mathcal{H}}\ \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|^2$
     where $\mathcal{H} = \{\, f \mid f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i),\ c_i \in \mathbb{R},\ w_i \in \mathbb{R}^d \text{ (any center!)},\ M \in \mathbb{N} \text{ (any length!)} \,\}$
     Solution: $\hat f_\lambda(x) = \sum_{i=1}^{n} c_i\, q(x, x_i)$ with $\hat c = (\hat Q + \lambda n I)^{-1} \hat y$
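
A minimal sketch of the closed-form solution on slide 18, assuming a Gaussian kernel for $q$ (any positive definite kernel would do); function names and hyperparameters are illustrative, not the authors' code.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def krr_fit(X, y, lam, sigma=0.5):
    """Solve (Q + lam * n * I) c = y for the coefficient vector c."""
    n = X.shape[0]
    Q = gaussian_kernel(X, X, sigma)                  # n x n kernel matrix: O(n^2) space
    c = np.linalg.solve(Q + lam * n * np.eye(n), y)   # direct solve: O(n^3) time
    return c

def krr_predict(X_test, X_train, c, sigma=0.5):
    """f_lambda(x) = sum_i c_i q(x, x_i)."""
    return gaussian_kernel(X_test, X_train, sigma) @ c
```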

  19. KRR: Statistics

  20. KRR: Statistics
     Well understood statistical properties:
     Classical Theorem. If $f^* \in \mathcal{H}$, then
     $E(\hat f_{\lambda^*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}, \qquad \lambda^* = \frac{1}{\sqrt{n}}$

  21. KRR: Statistics
     Well understood statistical properties:
     Classical Theorem. If $f^* \in \mathcal{H}$, then
     $E(\hat f_{\lambda^*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}, \qquad \lambda^* = \frac{1}{\sqrt{n}}$
     Remarks

  22. KRR: Statistics
     Well understood statistical properties:
     Classical Theorem. If $f^* \in \mathcal{H}$, then
     $E(\hat f_{\lambda^*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}, \qquad \lambda^* = \frac{1}{\sqrt{n}}$
     Remarks
     1. Optimal nonparametric bound

  23. KRR: Statistics
     Well understood statistical properties:
     Classical Theorem. If $f^* \in \mathcal{H}$, then
     $E(\hat f_{\lambda^*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}, \qquad \lambda^* = \frac{1}{\sqrt{n}}$
     Remarks
     1. Optimal nonparametric bound
     2. Results for general kernels (e.g. splines/Sobolev etc.): $E(\hat f_{\lambda^*}(x) - f^*(x))^2 \lesssim n^{-\frac{2s}{2s+1}}, \quad \lambda^* = n^{-\frac{1}{2s+1}}$

  24. KRR: Statistics
     Well understood statistical properties:
     Classical Theorem. If $f^* \in \mathcal{H}$, then
     $E(\hat f_{\lambda^*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}, \qquad \lambda^* = \frac{1}{\sqrt{n}}$
     Remarks
     1. Optimal nonparametric bound
     2. Results for general kernels (e.g. splines/Sobolev etc.): $E(\hat f_{\lambda^*}(x) - f^*(x))^2 \lesssim n^{-\frac{2s}{2s+1}}, \quad \lambda^* = n^{-\frac{1}{2s+1}}$
     3. Adaptive tuning via cross validation
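
The cross-validation remark can be sketched as follows. This reuses the hypothetical krr_fit / krr_predict helpers from the sketch after slide 18 and a user-chosen grid of $\lambda$ values, so it illustrates the tuning strategy rather than the authors' procedure.

```python
import numpy as np

def cv_select_lambda(X, y, lambdas, k=5, sigma=0.5, seed=0):
    """Return the lambda with the smallest average validation error over k folds."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    avg_errors = []
    for lam in lambdas:
        fold_errors = []
        for i in range(k):
            val = folds[i]
            tr = np.concatenate([folds[j] for j in range(k) if j != i])
            c = krr_fit(X[tr], y[tr], lam, sigma)          # helper from the earlier sketch
            pred = krr_predict(X[val], X[tr], c, sigma)
            fold_errors.append(np.mean((pred - y[val]) ** 2))
        avg_errors.append(np.mean(fold_errors))
    return lambdas[int(np.argmin(avg_errors))]
```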

  25. KRR: Optimization
     $\hat f_\lambda(x) = \sum_{i=1}^{n} c_i\, q(x, x_i)$ with $\hat c = (\hat Q + \lambda n I)^{-1} \hat y$
     Linear System Complexity
     ◮ Space $O(n^2)$
     ◮ Time $O(n^3)$
     [figure: the $n \times n$ linear system $\hat Q\, \hat c = \hat y$]

  26. KRR: Optimization
     $\hat f_\lambda(x) = \sum_{i=1}^{n} c_i\, q(x, x_i)$ with $\hat c = (\hat Q + \lambda n I)^{-1} \hat y$
     Linear System Complexity
     ◮ Space $O(n^2)$
     ◮ Time $O(n^3)$
     [figure: the $n \times n$ linear system $\hat Q\, \hat c = \hat y$]
     BIG DATA? Running out of space before running out of time... Can this be fixed?

  27. Outline
     ◮ Learning with kernels
     ◮ Data Dependent Subsampling

  28. Subsampling
     1. pick $w_i$ at random...

  29. Subsampling
     1. pick $w_i$ at random... from the training set (Smola, Schölkopf '00): $\{\tilde w_1, \dots, \tilde w_M\} \subset \{x_1, \dots, x_n\}, \quad M \ll n$

  30. Subsampling
     1. pick $w_i$ at random... from the training set (Smola, Schölkopf '00): $\{\tilde w_1, \dots, \tilde w_M\} \subset \{x_1, \dots, x_n\}, \quad M \ll n$
     2. perform KRR on $\mathcal{H}_M = \{\, f \mid f(x) = \sum_{i=1}^{M} c_i\, q(x, \tilde w_i),\ c_i \in \mathbb{R} \,\}$ (no longer "any center, any length": the $\tilde w_i$ and $M$ are now fixed)

  31. Subsampling
     1. pick $w_i$ at random... from the training set (Smola, Schölkopf '00): $\{\tilde w_1, \dots, \tilde w_M\} \subset \{x_1, \dots, x_n\}, \quad M \ll n$
     2. perform KRR on $\mathcal{H}_M = \{\, f \mid f(x) = \sum_{i=1}^{M} c_i\, q(x, \tilde w_i),\ c_i \in \mathbb{R} \,\}$ (no longer "any center, any length": the $\tilde w_i$ and $M$ are now fixed)
     Linear System Complexity
     ◮ Space $O(n^2) \to O(nM)$
     ◮ Time $O(n^3) \to O(nM^2)$
     [figure: the $n \times M$ linear system $\hat Q_M\, \hat c = \hat y$]

  32. Subsampling
     1. pick $w_i$ at random... from the training set (Smola, Schölkopf '00): $\{\tilde w_1, \dots, \tilde w_M\} \subset \{x_1, \dots, x_n\}, \quad M \ll n$
     2. perform KRR on $\mathcal{H}_M = \{\, f \mid f(x) = \sum_{i=1}^{M} c_i\, q(x, \tilde w_i),\ c_i \in \mathbb{R} \,\}$ (no longer "any center, any length": the $\tilde w_i$ and $M$ are now fixed)
     Linear System Complexity
     ◮ Space $O(n^2) \to O(nM)$
     ◮ Time $O(n^3) \to O(nM^2)$
     [figure: the $n \times M$ linear system $\hat Q_M\, \hat c = \hat y$]
     What about statistics? What's the price for efficient computations?
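
A minimal sketch of the subsampled solver, under the usual assumptions (centers drawn uniformly from the training points, Gaussian $q$). Writing, in notation not used on the slides, $\hat Q_{nM}$ for the $n \times M$ kernel block and $\hat Q_{MM}$ for the $M \times M$ block, one standard way to compute the coefficients is to solve $(\hat Q_{nM}^\top \hat Q_{nM} + \lambda n\, \hat Q_{MM})\, c = \hat Q_{nM}^\top \hat y$, which only ever forms $n \times M$ and $M \times M$ matrices.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def nystrom_krr_fit(X, y, M, lam, sigma=0.5, seed=0):
    """KRR restricted to H_M: M centers drawn uniformly from the training set."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).choice(n, size=M, replace=False)
    W = X[idx]                                   # the subsampled centers
    Q_nM = gaussian_kernel(X, W, sigma)          # n x M block: O(nM) space
    Q_MM = gaussian_kernel(W, W, sigma)          # M x M block
    A = Q_nM.T @ Q_nM + lam * n * Q_MM           # M x M system: O(nM^2) time to assemble/solve
    c = np.linalg.lstsq(A, Q_nM.T @ y, rcond=None)[0]   # least-squares solve, robust to ill-conditioning
    return W, c

def nystrom_krr_predict(X_test, W, c, sigma=0.5):
    return gaussian_kernel(X_test, W, sigma) @ c
```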

  33. Putting our Result in Context
     ◮ *Many* different subsampling schemes (Smola, Schölkopf '00; Williams, Seeger '01; ... 20+)

  34. Putting our Result in Context
     ◮ *Many* different subsampling schemes (Smola, Schölkopf '00; Williams, Seeger '01; ... 20+)
     ◮ Theoretical guarantees mainly on matrix approximation (Mahoney, Drineas '09; Cortes et al. '10; Kumar et al. '12; ... 10+): $\|\hat Q - \hat Q_M\| \lesssim \frac{1}{\sqrt{M}}$

  35. Putting our Result in Context
     ◮ *Many* different subsampling schemes (Smola, Schölkopf '00; Williams, Seeger '01; ... 20+)
     ◮ Theoretical guarantees mainly on matrix approximation (Mahoney, Drineas '09; Cortes et al. '10; Kumar et al. '12; ... 10+): $\|\hat Q - \hat Q_M\| \lesssim \frac{1}{\sqrt{M}}$
     ◮ Few prediction guarantees, either suboptimal or in restricted settings (Cortes et al. '10; Jin et al. '11; Bach '13; Alaoui, Mahoney '14)

  36. Main Result
     Theorem. If $f^* \in \mathcal{H}$, then
     $E(\hat f_{\lambda^*, M^*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}, \qquad \lambda^* = \frac{1}{\sqrt{n}}, \quad M^* = \frac{1}{\lambda^*}$

  37. Main Result
     Theorem. If $f^* \in \mathcal{H}$, then
     $E(\hat f_{\lambda^*, M^*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}, \qquad \lambda^* = \frac{1}{\sqrt{n}}, \quad M^* = \frac{1}{\lambda^*}$
     Remarks

  38. Main Result
     Theorem. If $f^* \in \mathcal{H}$, then
     $E(\hat f_{\lambda^*, M^*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}, \qquad \lambda^* = \frac{1}{\sqrt{n}}, \quad M^* = \frac{1}{\lambda^*}$
     Remarks
     1. Subsampling achieves the optimal bound. . .

  39. Main Result
     Theorem. If $f^* \in \mathcal{H}$, then
     $E(\hat f_{\lambda^*, M^*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}, \qquad \lambda^* = \frac{1}{\sqrt{n}}, \quad M^* = \frac{1}{\lambda^*}$
     Remarks
     1. Subsampling achieves the optimal bound. . .
     2. . . . with $M^* \sim \sqrt{n}$ !!

  40. Main Result
     Theorem. If $f^* \in \mathcal{H}$, then
     $E(\hat f_{\lambda^*, M^*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}, \qquad \lambda^* = \frac{1}{\sqrt{n}}, \quad M^* = \frac{1}{\lambda^*}$
     Remarks
     1. Subsampling achieves the optimal bound. . .
     2. . . . with $M^* \sim \sqrt{n}$ !!
     3. More generally: $E_x(\hat f_{\lambda^*, M^*}(x) - f^*(x))^2 \lesssim n^{-\frac{2s}{2s+1}}, \quad \lambda^* = n^{-\frac{1}{2s+1}}, \quad M^* = \frac{1}{\lambda^*} = n^{\frac{1}{2s+1}}$

  41. Main Result
     Theorem. If $f^* \in \mathcal{H}$, then
     $E(\hat f_{\lambda^*, M^*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}, \qquad \lambda^* = \frac{1}{\sqrt{n}}, \quad M^* = \frac{1}{\lambda^*}$
     Remarks
     1. Subsampling achieves the optimal bound. . .
     2. . . . with $M^* \sim \sqrt{n}$ !!
     3. More generally: $E_x(\hat f_{\lambda^*, M^*}(x) - f^*(x))^2 \lesssim n^{-\frac{2s}{2s+1}}, \quad \lambda^* = n^{-\frac{1}{2s+1}}, \quad M^* = \frac{1}{\lambda^*} = n^{\frac{1}{2s+1}}$
     Note: an interesting insight is obtained by rewriting the result. . .
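
A small worked example of the theorem's scalings; the numbers are illustrative and not from the talk. With $\lambda^* = n^{-1/2}$ and $M^* = \sqrt{n}$, the subsampled system is dramatically smaller than the full one.

```python
n = 1_000_000                  # illustrative sample size
lam_star = n ** -0.5           # lambda* = 1/sqrt(n) = 1e-3
M_star = int(round(n ** 0.5))  # M* = sqrt(n) = 1000

# Rough size of the objects involved (matrix entries / flops):
full_space, full_time = n ** 2, n ** 3                # O(n^2) = 1e12, O(n^3) = 1e18
sub_space, sub_time = n * M_star, n * M_star ** 2     # O(nM*) = 1e9,  O(nM*^2) = 1e12

print(lam_star, M_star, full_space // sub_space, full_time // sub_time)
```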

  42. Computational Regularization (CoRe)
     A simple idea: "swap" the role of $\lambda$ and $M$. . .
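
The slide's idea can be sketched as a validation sweep over $M$ with a small fixed $\lambda$, so that $M$ itself acts as the regularization parameter. This reuses the hypothetical nystrom_krr_fit / nystrom_krr_predict helpers from the earlier sketch and only illustrates the principle; it is not the algorithm presented in the talk.

```python
import numpy as np

def core_select_M(X_tr, y_tr, X_val, y_val, M_grid, lam=1e-6, sigma=0.5):
    """Fix a small lambda and let the number of centers M control regularization."""
    val_errors = []
    for M in M_grid:
        W, c = nystrom_krr_fit(X_tr, y_tr, M, lam, sigma)   # helper from the earlier sketch
        pred = nystrom_krr_predict(X_val, W, c, sigma)
        val_errors.append(np.mean((pred - y_val) ** 2))
    best_M = M_grid[int(np.argmin(val_errors))]
    return best_M, val_errors
```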
