RegML 2020 Class 3 Early Stopping and Spectral Regularization - PowerPoint PPT Presentation


  1. RegML 2020 Class 3: Early Stopping and Spectral Regularization. Lorenzo Rosasco, UNIGE-MIT-IIT

  2. Learning problem. Solve
        min_w E(w),   E(w) = ∫ dρ(x, y) L(w⊤x, y)
     given (x_1, y_1), ..., (x_n, y_n). Beyond linear models: non-linear features and kernels.

  3. Regularization by penalization. Replace
        min_w Ê(w)   by   min_w Ê(w) + λ‖w‖²  =:  min_w Ê_λ(w)
     ◮ Ê(w) = (1/n) ∑_{i=1}^n L(w⊤x_i, y_i)
     ◮ λ > 0 is the regularization parameter

  4. Loss functions and computational methods
     ◮ Logistic loss: log(1 + e^{−y w⊤x})
     ◮ Hinge loss: |1 − y w⊤x|_+
     Iterative solvers: w_{t+1} = w_t − γ_t ∇Ê_λ(w_t), ...

  5. Square loss: (1 − y w⊤x)² = (y − w⊤x)²   (using y ∈ {−1, +1})

  6. Square loss: (1 − y w⊤x)² = (y − w⊤x)², so
        Ê(w) = (1/n) ‖X̂w − ŷ‖²,   Ê_λ(w) = Ê(w) + λ‖w‖²
     with
     ◮ X̂ the n × d data matrix
     ◮ ŷ the n × 1 output vector.

  7. Ridge regression / Tikhonov regularization
        Ê_λ(w) = (1/n) ‖X̂w − ŷ‖² + λ‖w‖²
     is smooth and strongly convex; setting the gradient to zero,
        ∇Ê_λ(w) = (2/n) X̂⊤(X̂w − ŷ) + 2λw = 0   ⇒   (X̂⊤X̂ + λnI) w = X̂⊤ŷ

  8. Linear systems
        (X̂⊤X̂ + λnI) w = X̂⊤ŷ
     ◮ nd² operations to form X̂⊤X̂
     ◮ roughly d³ to solve the linear system
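
A minimal numpy sketch of this primal solve; the problem sizes, the value of λ, and the synthetic data below are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# A minimal sketch, assuming synthetic data: solve (X^T X + lambda*n*I) w = X^T y.
rng = np.random.default_rng(0)
n, d, lam = 100, 10, 0.1
X = rng.standard_normal((n, d))                   # n x d data matrix
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

A = X.T @ X + lam * n * np.eye(d)                 # ~ n d^2 to form X^T X
b = X.T @ y
w = np.linalg.solve(A, b)                         # ~ d^3 to solve the d x d system
```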

  9. Representer theorem for square loss
        f(x) = x⊤w   ⇒   f(x) = ∑_{i=1}^n x⊤x_i c_i

  10. Representer theorem for square loss
        f(x) = x⊤w   ⇒   f(x) = ∑_{i=1}^n x⊤x_i c_i
     Using the SVD of X̂,
        w = (X̂⊤X̂ + λnI)⁻¹ X̂⊤ŷ = X̂⊤ (X̂X̂⊤ + λnI)⁻¹ ŷ,   c := (X̂X̂⊤ + λnI)⁻¹ ŷ
        ⇒   w = X̂⊤c = ∑_{i=1}^n x_i c_i

  11. Beyond linear models
        f(x) = x⊤w = ∑_{i=1}^n x⊤x_i c_i,
        w = (X̂⊤X̂ + λnI)⁻¹ X̂⊤ŷ,   c = (X̂X̂⊤ + λnI)⁻¹ ŷ
     ◮ non-linear features: x ↦ Φ(x) = (φ_1(x), ..., φ_p(x)),   f(x) = Φ(x)⊤w
     ◮ non-linear kernels: K̂ = X̂X̂⊤,   f(x) = ∑_{i=1}^n K(x, x_i) c_i.
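
A minimal sketch of the dual (kernel) form c = (K̂ + λnI)⁻¹ŷ; the Gaussian kernel, its width, and the toy one-dimensional data are illustrative assumptions.

```python
import numpy as np

# A minimal sketch of kernel ridge regression: c = (K + lambda*n*I)^{-1} y.
# The Gaussian kernel, its width, and the toy data are assumptions.
def gaussian_kernel(A, B, sigma=0.2):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(0)
n, lam = 50, 1e-3
X = rng.uniform(0, 1, (n, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(n)

K = gaussian_kernel(X, X)                        # K_ij = K(x_i, x_j)
c = np.linalg.solve(K + lam * n * np.eye(n), y)  # dual coefficients

def f(x_new):
    # f(x) = sum_i K(x, x_i) c_i
    return gaussian_kernel(np.atleast_2d(x_new), X) @ c
```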

  12. Interlude: linear systems and stability
        Aw = y,   A = diag(a_1, ..., a_d),   cond(A) = a_1 / a_d < ∞
        w = A⁻¹y,   A⁻¹ = diag(a_1⁻¹, ..., a_d⁻¹)
     More generally,
        A = UΣU⊤,   Σ = diag(σ_1, ..., σ_d)
        A⁻¹ = UΣ⁻¹U⊤,   Σ⁻¹ = diag(σ_1⁻¹, ..., σ_d⁻¹)

  13. Tikhonov regularization
        (1/n) ∑_{i=1}^n (y_i − w⊤x_i)²   ↦   (1/n) ∑_{i=1}^n (y_i − w⊤x_i)² + λ‖w‖²
        X̂⊤X̂ w = X̂⊤ŷ   ↦   (X̂⊤X̂ + λnI) w = X̂⊤ŷ
     Overfitting and numerical stability

  14. Beyond Tikhonov: TSVD
        X̂⊤X̂ = VΣV⊤,   w_M = (X̂⊤X̂)⁻¹_M X̂⊤ŷ
     ◮ (X̂⊤X̂)⁻¹_M = VΣ⁻¹_M V⊤
     ◮ Σ⁻¹_M = diag(σ_1⁻¹, ..., σ_M⁻¹, 0, ..., 0)
     Also known as principal component regression (PCR)
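
A minimal sketch of TSVD/PCR on synthetic data: diagonalize X̂⊤X̂ and invert only the top M eigenvalues. The truncation level M and the data are illustrative assumptions.

```python
import numpy as np

# A minimal sketch of TSVD / principal component regression: keep only the
# top-M eigendirections of X^T X when inverting. M and the data are assumptions.
rng = np.random.default_rng(0)
n, d, M = 100, 10, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

sigma, V = np.linalg.eigh(X.T @ X)       # X^T X = V diag(sigma) V^T
order = np.argsort(sigma)[::-1]          # sort eigenvalues in decreasing order
sigma, V = sigma[order], V[:, order]

inv_sigma_M = np.zeros(d)
inv_sigma_M[:M] = 1.0 / sigma[:M]        # invert sigma_1..sigma_M, zero out the rest
w_M = V @ (inv_sigma_M * (V.T @ (X.T @ y)))
```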

  15. Principal component analysis (PCA): dimensionality reduction
        X̂⊤X̂ = VΣV⊤
     The eigenvectors are the directions of
     ◮ maximum variance
     ◮ best reconstruction

  16. TSVD and PCA
        TSVD   ⇔   PCA + ERM
     Regularization by projection

  17. TSVD/PCR beyond linearity. Non-linear function
        f(x) = ∑_{i=1}^p w_i φ_i(x) = Φ(x)⊤w
     with
        w = (Φ̂⊤Φ̂)⁻¹_M Φ̂⊤ŷ,   Φ̂ = (Φ(x_1), ..., Φ(x_n))⊤ ∈ R^{n×p}.
     Let
        Φ̂⊤Φ̂ = VΣV⊤,   (Φ̂⊤Φ̂)⁻¹_M = VΣ⁻¹_M V⊤
        Σ = diag(σ_1, ..., σ_p),   Σ⁻¹_M = diag(σ_1⁻¹, ..., σ_M⁻¹, 0, ...)

  18. TSVD/PCR with kernels
        f(x) = ∑_{i=1}^n K(x, x_i) c_i,   c = K̂⁻¹_M ŷ
        K̂_ij = K(x_i, x_j),   K̂ = UΣU⊤,   Σ = diag(σ_1, ..., σ_n)
        K̂⁻¹_M = UΣ⁻¹_M U⊤,   Σ⁻¹_M = diag(σ_1⁻¹, ..., σ_M⁻¹, 0, ...)

  19. Early stopping regularization. Another example of regularization: early stopping of an iterative procedure applied to noisy data.

  20. Gradient descent for square loss
        ∑_{i=1}^n (y_i − w⊤x_i)² = ‖X̂w − ŷ‖²
        w_{t+1} = w_t − γ X̂⊤(X̂w_t − ŷ)
     ◮ no penalty
     ◮ stepsize chosen a priori: γ = 2 / ‖X̂⊤X̂‖
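
A minimal sketch of this iteration on synthetic data; the stopping time t_max (an illustrative choice) plays the role of the regularization parameter, and the step size below is a conservative value at half the a priori bound stated on the slide.

```python
import numpy as np

# A minimal sketch: gradient descent on ||Xw - y||^2 with no penalty; the
# iteration count t_max acts as the regularization knob. Data and t_max are
# illustrative; gamma is a conservative choice below the 2/||X^T X|| bound.
rng = np.random.default_rng(0)
n, d, t_max = 100, 10, 50
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

gamma = 1.0 / np.linalg.norm(X.T @ X, 2)     # spectral norm of X^T X
w = np.zeros(d)
for t in range(t_max):
    w = w - gamma * X.T @ (X @ w - y)        # w_{t+1} = w_t - gamma X^T (X w_t - y)
```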

  21. Early stopping at work: fitting on the training set, iteration #1. [Plot of the current fit over the training data.]

  22. Early stopping at work: fitting on the training set, iteration #2. [Plot of the current fit over the training data.]

  23. Early stopping at work: fitting on the training set, iteration #7. [Plot of the current fit over the training data.]

  24. Early stopping at work: fitting on the training set, iteration #5000. [Plot of the current fit over the training data.]

  25. Semi-convergence
        min_w E(w)   vs   min_w Ê(w)

  26. Connection to Tikhonov or TSVD
        w_{t+1} = w_t − γ X̂⊤(X̂w_t − ŷ) = (I − γ X̂⊤X̂) w_t + γ X̂⊤ŷ
     By induction,
        w_t = γ ∑_{j=0}^{t−1} (I − γ X̂⊤X̂)^j X̂⊤ŷ
     a truncated power series.

  27. Neumann series
        γ ∑_{j=0}^{t−1} (I − γ X̂⊤X̂)^j

  28. Neumann series
        γ ∑_{j=0}^{t−1} (I − γ X̂⊤X̂)^j
     ◮ For |a| < 1:
        (1 − a)⁻¹ = ∑_{j=0}^∞ a^j   ⇒   a⁻¹ = ∑_{j=0}^∞ (1 − a)^j

  29. Neumann series
        γ ∑_{j=0}^{t−1} (I − γ X̂⊤X̂)^j
     ◮ For |a| < 1:
        (1 − a)⁻¹ = ∑_{j=0}^∞ a^j   ⇒   a⁻¹ = ∑_{j=0}^∞ (1 − a)^j
     ◮ For a symmetric positive definite A ∈ R^{d×d} with ‖A‖ < 1:
        A⁻¹ = ∑_{j=0}^∞ (I − A)^j

  30. Stable matrix inversion: truncated Neumann series
        (X̂⊤X̂)⁻¹ = γ ∑_{j=0}^∞ (I − γ X̂⊤X̂)^j   ≈   γ ∑_{j=0}^{t−1} (I − γ X̂⊤X̂)^j
     compare to
        (X̂⊤X̂)⁻¹ ≈ (X̂⊤X̂ + λnI)⁻¹
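
A minimal numeric check of this truncation on a random, symmetric positive definite X̂⊤X̂; the sizes, the step size, and the checkpoints below are illustrative assumptions.

```python
import numpy as np

# A minimal numeric check: for A = X^T X (symmetric positive definite here) and
# gamma = 1 / ||A||, the partial sums gamma * sum_{j<t} (I - gamma A)^j approach
# A^{-1} as t grows. Sizes and the list of checkpoints are illustrative.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
A = X.T @ X
gamma = 1.0 / np.linalg.norm(A, 2)

A_inv = np.linalg.inv(A)
approx = np.zeros_like(A)
term = np.eye(A.shape[0])                     # (I - gamma A)^0
for t in range(1, 1001):
    approx += gamma * term                    # add gamma * (I - gamma A)^(t-1)
    term = term @ (np.eye(A.shape[0]) - gamma * A)
    if t in (10, 100, 1000):
        err = np.linalg.norm(approx - A_inv)
        print(f"t = {t:4d}   ||truncated series - A^-1|| = {err:.2e}")
```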

  31. Spectral filtering. Different instances of the same principle.
     ◮ Tikhonov
        w_λ = (X̂⊤X̂ + λnI)⁻¹ X̂⊤ŷ
     ◮ Early stopping
        w_t = γ ∑_{j=0}^{t−1} (I − γ X̂⊤X̂)^j X̂⊤ŷ
     ◮ TSVD
        w_M = (X̂⊤X̂)⁻¹_M X̂⊤ŷ
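
All three estimators can be written as w = V g(Σ) V⊤ X̂⊤ŷ, where g is a filter function applied to the eigenvalues of X̂⊤X̂ = VΣV⊤. A minimal sketch of that common structure on synthetic data; the filter expressions mirror the three formulas above, while the data and the values of λ, t, M are illustrative assumptions.

```python
import numpy as np

# A minimal sketch of the spectral-filtering view: each estimator is
# w = V g(Sigma) V^T X^T y for a different filter g on the eigenvalues of X^T X.
# Data and the parameters lam, t, M are illustrative.
rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

sigma, V = np.linalg.eigh(X.T @ X)           # eigen-decomposition of X^T X
b = V.T @ (X.T @ y)

def filtered(g):
    return V @ (g(sigma) * b)                # w = V g(Sigma) V^T X^T y

lam, t, M = 0.1, 100, 5
gamma = 1.0 / sigma.max()
w_tik  = filtered(lambda s: 1.0 / (s + lam * n))                          # Tikhonov
w_es   = filtered(lambda s: (1.0 - (1.0 - gamma * s) ** t) / s)           # early stopping
w_tsvd = filtered(lambda s: np.where(s >= np.sort(s)[-M], 1.0 / s, 0.0))  # TSVD (top M)
```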

  32. Statistics and optimization
        w_t = γ ∑_{j=0}^{t−1} (I − γ X̂⊤X̂)^j X̂⊤ŷ
     The difference is in the computations:
        w_{t+1} = w_t − γ X̂⊤(X̂w_t − ŷ)
     ◮ Tikhonov: O(nd² + d³)
     ◮ TSVD: O(nd² + d²M)
     ◮ GD: O(ndt)

  33. Regularization path and warm restart
        min_w E(w)   vs   min_w Ê(w)

  34. Beyond linear models. Non-linear function
        f(x) = ∑_{i=1}^p w_i φ_i(x) = Φ(x)⊤w
     ◮ Replace x by Φ(x) = (φ_1(x), ..., φ_p(x))
     ◮ Replace X̂ by Φ̂ = (Φ(x_1), ..., Φ(x_n))⊤ ∈ R^{n×p}.
        w_{t+1} = w_t − γ Φ̂⊤(Φ̂w_t − ŷ)
     Computational cost O(npt).

  35. Early stopping and kernels
        f(x) = ∑_{i=1}^n K(x_i, x) c_i
     By induction,
        c_{t+1} = c_t − γ (K̂c_t − ŷ),   K̂_ij = K(x_i, x_j)
     Computational complexity O(n²t).
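
A minimal sketch of this kernel iteration; the Gaussian kernel, its width, the step size, and the stopping time t_max are illustrative assumptions.

```python
import numpy as np

# A minimal sketch of early stopping in the kernel setting: iterate
# c_{t+1} = c_t - gamma (K c_t - y) and stop after t_max steps.
def gaussian_kernel(A, B, sigma=0.2):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(0)
n, t_max = 50, 200
X = rng.uniform(0, 1, (n, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(n)

K = gaussian_kernel(X, X)                    # K_ij = K(x_i, x_j)
gamma = 1.0 / np.linalg.norm(K, 2)
c = np.zeros(n)
for t in range(t_max):                       # t_max is the regularization knob
    c = c - gamma * (K @ c - y)              # each step costs O(n^2)

def f(x_new):
    # f(x) = sum_i K(x_i, x) c_i
    return gaussian_kernel(np.atleast_2d(x_new), X) @ c
```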

  36. What about other loss functions?
     ◮ PCA + ERM
     ◮ Gradient / subgradient descent
     Iterations for regularization, not only optimization!

  37. Going big... Bottleneck of kernel methods: memory, since K̂ is O(n²).

  38. Approaches to large scale
     ◮ (Random) features: find Φ̂ : X → R^M, with M ≪ n, such that K(x, x′) ≈ Φ̂(x)⊤Φ̂(x′)

  39. Approaches to large scale
     ◮ (Random) features: find Φ̂ : X → R^M, with M ≪ n, such that K(x, x′) ≈ Φ̂(x)⊤Φ̂(x′)
     ◮ Subsampling (Nyström): replace
        f(x) = ∑_{i=1}^n K(x, x_i) c_i   by   f(x) = ∑_{i=1}^M K(x, x̃_i) c_i
        with x̃_i subsampled from the training set, M ≪ n

  40. Approaches to large scale
     ◮ (Random) features: find Φ̂ : X → R^M, with M ≪ n, such that K(x, x′) ≈ Φ̂(x)⊤Φ̂(x′)
     ◮ Subsampling (Nyström): replace
        f(x) = ∑_{i=1}^n K(x, x_i) c_i   by   f(x) = ∑_{i=1}^M K(x, x̃_i) c_i
        with x̃_i subsampled from the training set, M ≪ n
     ◮ Greedy!
     ◮ Neural nets
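
As one concrete instance of the random features idea above, here is a minimal sketch of random Fourier features for the Gaussian kernel. This particular construction is not spelled out on the slides; the dimensions M, d and the kernel width are illustrative assumptions.

```python
import numpy as np

# A minimal sketch of random Fourier features approximating a Gaussian kernel
# K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)). This construction (random
# frequencies + cosine features) is an assumption; M, d, sigma are illustrative.
rng = np.random.default_rng(0)
d, M, sigma = 5, 500, 1.0

W = rng.normal(0.0, 1.0 / sigma, (M, d))     # random frequencies omega_1..omega_M
b = rng.uniform(0.0, 2.0 * np.pi, M)         # random phases

def feature_map(X):
    # X: (n, d) -> (n, M); feature_map(x) @ feature_map(x') ~= K(x, x')
    return np.sqrt(2.0 / M) * np.cos(X @ W.T + b)

x, xp = rng.standard_normal(d), rng.standard_normal(d)
exact  = np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))
approx = (feature_map(x[None, :]) @ feature_map(xp[None, :]).T).item()
print(f"exact kernel = {exact:.3f}, random-features approximation = {approx:.3f}")
```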

  41. This class: regularization beyond penalization
     ◮ Regularization by projection
     ◮ Regularization by early stopping

  42. Next class: multioutput learning
     ◮ Multitask learning
     ◮ Vector valued learning
     ◮ Multiclass learning
