RegML 2020 Class 3: Early Stopping and Spectral Regularization
Lorenzo Rosasco (UNIGE-MIT-IIT)
Learning problem
Solve
$$\min_w E(w), \qquad E(w) = \int d\rho(x, y)\, L(w^\top x, y)$$
given $(x_1, y_1), \dots, (x_n, y_n)$.

Beyond linear models: non-linear features and kernels.
Regularization by penalization
Replace
$$\min_w E(w)$$
by
$$\min_w \underbrace{\hat E(w) + \lambda \|w\|^2}_{\hat E_\lambda(w)}$$
◮ $\hat E(w) = \frac{1}{n}\sum_{i=1}^{n} L(w^\top x_i, y_i)$
◮ $\lambda > 0$ regularization parameter
Loss functions and computational methods
◮ Logistic loss: $\log(1 + e^{-y w^\top x})$
◮ Hinge loss: $|1 - y w^\top x|_+$

$$w_{t+1} = w_t - \gamma_t \nabla \hat E_\lambda(w_t), \quad \dots$$
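For concreteness, here is a minimal subgradient-descent sketch for the regularized hinge loss in NumPy; the stepsize rule $\gamma_t = 1/(\lambda t)$ and all names are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np

def subgradient_descent_hinge(X, y, lam, steps=1000):
    """Subgradient descent on the lam-regularized hinge loss.
    Labels y are assumed to be +1/-1."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, steps + 1):
        margins = y * (X @ w)
        active = margins < 1
        # subgradient of (1/n) sum |1 - y_i w^T x_i|_+ plus 2*lam*w
        g = -(X[active].T @ y[active]) / n + 2 * lam * w
        w -= g / (lam * t)     # decreasing stepsize gamma_t = 1/(lam*t), an illustrative choice
    return w
```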
Square loss
$$(1 - y w^\top x)^2 = (y - w^\top x)^2$$
$$\hat E_\lambda(w) = \hat E(w) + \lambda\|w\|^2 \quad \text{with} \quad \hat E(w) = \frac{1}{n}\|\hat X w - \hat y\|^2$$
◮ $\hat X$: $n \times d$ data matrix
◮ $\hat y$: $n \times 1$ output vector.
Ridge regression / Tikhonov regression
$$\hat E_\lambda(w) = \frac{1}{n}\|\hat X w - \hat y\|^2 + \lambda\|w\|^2$$
Smooth and strongly convex:
$$\nabla \hat E_\lambda(w) = \frac{2}{n}\hat X^\top(\hat X w - \hat y) + 2\lambda w = 0 \;\Longrightarrow\; (\hat X^\top \hat X + \lambda n I)\, w = \hat X^\top \hat y$$
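A minimal NumPy sketch of this closed-form solution (function and variable names are illustrative):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge / Tikhonov regression: solve (X^T X + lam*n*I) w = X^T y."""
    n, d = X.shape
    A = X.T @ X + lam * n * np.eye(d)    # d x d, symmetric positive definite
    return np.linalg.solve(A, X.T @ y)   # direct solve, roughly O(d^3)
```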
Linear systems
$$(\hat X^\top \hat X + \lambda n I)\, w = \hat X^\top \hat y$$
◮ $O(nd^2)$ to form $\hat X^\top \hat X$
◮ roughly $O(d^3)$ to solve the linear system
Representer theorem for square loss
$$f(x) = x^\top w \;\Longrightarrow\; f(x) = \sum_{i=1}^{n} x^\top x_i\, c_i$$
Using the SVD of $\hat X$:
$$w = (\hat X^\top \hat X + \lambda n I)^{-1} \hat X^\top \hat y = \hat X^\top \underbrace{(\hat X \hat X^\top + \lambda n I)^{-1} \hat y}_{c} \;\Longrightarrow\; w = \hat X^\top c = \sum_{i=1}^{n} x_i c_i$$
Beyond linear models
$$f(x) = x^\top w = \sum_{i=1}^{n} x^\top x_i\, c_i, \qquad w = (\hat X^\top \hat X + \lambda n I)^{-1}\hat X^\top \hat y, \qquad c = (\hat X \hat X^\top + \lambda n I)^{-1}\hat y$$
◮ non-linear features: $x \mapsto \phi(x) = (\phi_1(x), \dots, \phi_n(x))$, $f(x) = \phi(x)^\top w$
◮ non-linear kernels: $\hat X \hat X^\top = \hat K$, $f(x) = \sum_{i=1}^{n} K(x, x_i)\, c_i$.
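As an illustration, a minimal kernel ridge regression sketch in NumPy; the Gaussian kernel, its bandwidth, and all names are assumptions made for the example, not part of the slides.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_ridge_fit(X, y, lam, sigma=1.0):
    """Solve (K + lam*n*I) c = y for the coefficient vector c."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def kernel_ridge_predict(X_train, c, X_test, sigma=1.0):
    """f(x) = sum_i K(x, x_i) c_i evaluated at each test point."""
    return gaussian_kernel(X_test, X_train, sigma) @ c
```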
Interlude: linear systems and stability
$$Aw = y, \quad A = \mathrm{diag}(a_1, \dots, a_d), \quad c = \frac{a_1}{a_d} < \infty, \quad w = A^{-1}y, \quad A^{-1} = \mathrm{diag}(a_1^{-1}, \dots, a_d^{-1})$$
More generally, $A = U\Sigma U^\top$, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_d)$, and $A^{-1} = U\Sigma^{-1}U^\top$, $\Sigma^{-1} = \mathrm{diag}(\sigma_1^{-1}, \dots, \sigma_d^{-1})$.
Tikhonov Regularization
$$\frac{1}{n}\sum_{i=1}^{n} (y_i - w^\top x_i)^2 \;\to\; \frac{1}{n}\sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \lambda\|w\|^2$$
$$\hat X^\top \hat X w = \hat X^\top \hat y \;\to\; (\hat X^\top \hat X + \lambda n I)\, w = \hat X^\top \hat y$$
Overfitting and numerical stability.
Beyond Tikhonov: TSVD
$$\hat X^\top \hat X = V\Sigma V^\top, \qquad w_M = (\hat X^\top \hat X)^{-1}_M\, \hat X^\top \hat y$$
◮ $(\hat X^\top \hat X)^{-1}_M = V \Sigma^{-1}_M V^\top$
◮ $\Sigma^{-1}_M = \mathrm{diag}(\sigma_1^{-1}, \dots, \sigma_M^{-1}, 0, \dots, 0)$

Also known as principal component regression (PCR).
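A minimal TSVD/PCR sketch in NumPy (an illustrative implementation, with all names assumed for the example):

```python
import numpy as np

def tsvd_regression(X, y, M):
    """TSVD / principal component regression:
    keep only the top-M eigendirections of X^T X when inverting."""
    evals, V = np.linalg.eigh(X.T @ X)      # ascending eigenvalues of the symmetric matrix
    evals, V = evals[::-1], V[:, ::-1]      # sort descending
    inv = np.zeros_like(evals)
    inv[:M] = 1.0 / evals[:M]               # sigma_i^{-1} for i <= M, 0 otherwise
    return V @ (inv * (V.T @ (X.T @ y)))    # w_M = V Sigma_M^{-1} V^T X^T y
```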
Principal component analysis (PCA)
Dimensionality reduction: $\hat X^\top \hat X = V\Sigma V^\top$. The eigenvectors are the directions of
◮ maximum variance
◮ best reconstruction
TSVD and PCA
TSVD ⇔ PCA + ERM. Regularization by projection.
TSVD/PCR beyond linearity
Non-linear function: $f(x) = \sum_{i=1}^{p} w_i\,\phi_i(x) = \Phi(x)^\top w$ with $w = (\hat\Phi^\top \hat\Phi)^{-1}_M\, \hat\Phi^\top \hat y$.

Let $\hat\Phi = (\Phi(x_1), \dots, \Phi(x_n))^\top \in \mathbb{R}^{n \times p}$.
$$\hat\Phi^\top \hat\Phi = V\Sigma V^\top, \qquad (\hat\Phi^\top \hat\Phi)^{-1}_M = V\Sigma^{-1}_M V^\top$$
$$\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_p), \qquad \Sigma^{-1}_M = \mathrm{diag}(\sigma_1^{-1}, \dots, \sigma_M^{-1}, 0, \dots)$$
TSVD/PCR with kernels
$$f(x) = \sum_{i=1}^{n} K(x, x_i)\, c_i, \qquad c = \hat K^{-1}_M\, \hat y$$
◮ $\hat K_{ij} = K(x_i, x_j)$
◮ $\hat K = U\Sigma U^\top$, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_n)$
◮ $\hat K^{-1}_M = U\Sigma^{-1}_M U^\top$, $\Sigma^{-1}_M = \mathrm{diag}(\sigma_1^{-1}, \dots, \sigma_M^{-1}, 0, \dots)$
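The kernel version in NumPy, as a sketch (names are assumed; the kernel matrix K is taken as given):

```python
import numpy as np

def kernel_tsvd_fit(K, y, M):
    """Kernel TSVD/PCR: invert the kernel matrix only on its top-M eigendirections."""
    evals, U = np.linalg.eigh(K)            # ascending eigenvalues of the n x n kernel matrix
    evals, U = evals[::-1], U[:, ::-1]      # sort descending
    inv = np.zeros_like(evals)
    inv[:M] = 1.0 / evals[:M]               # sigma_i^{-1} for the top M, 0 elsewhere
    return U @ (inv * (U.T @ y))            # c = K_M^{-1} y
```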
Early stopping regularization
Another example of regularization: early stopping of an iterative procedure applied to noisy data.
Gradient descent for square loss
$$w_{t+1} = w_t - \gamma\, \hat X^\top(\hat X w_t - \hat y)$$
$$\sum_{i=1}^{n} (y_i - w^\top x_i)^2 = \|\hat X w - \hat y\|^2$$
◮ no penalty
◮ stepsize chosen a priori: $\gamma = \frac{2}{\|\hat X^\top \hat X\|}$
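A minimal sketch of gradient descent with early stopping in NumPy; the conservative stepsize $1/\|\hat X^\top \hat X\|$ and the names are illustrative choices.

```python
import numpy as np

def gradient_descent_ls(X, y, t_max):
    """Gradient descent on ||Xw - y||^2 from w = 0; the iteration count t_max
    plays the role of the regularization parameter (early stopping)."""
    n, d = X.shape
    gamma = 1.0 / np.linalg.norm(X.T @ X, 2)   # conservative a-priori stepsize
    w = np.zeros(d)
    path = []
    for _ in range(t_max):
        w = w - gamma * X.T @ (X @ w - y)
        path.append(w.copy())                  # the whole regularization path comes for free
    return w, path
```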
Early stopping at work

[Figures: fit on the training set at iterations #1, #2, #7, and #5000]
Semi-convergence
$$\min_w E(w) \quad \text{vs.} \quad \min_w \hat E(w)$$
Connection to Tikhonov or TSVD
$$w_{t+1} = w_t - \gamma\,\hat X^\top(\hat X w_t - \hat y) = (I - \gamma\,\hat X^\top\hat X)\, w_t + \gamma\,\hat X^\top \hat y$$
By induction,
$$w_t = \gamma \underbrace{\sum_{j=0}^{t-1} (I - \gamma\,\hat X^\top\hat X)^j}_{\text{truncated power series}}\, \hat X^\top \hat y$$
Neumann series
$$\gamma \sum_{j=0}^{t-1} (I - \gamma\,\hat X^\top\hat X)^j$$
◮ for $|a| < 1$: $(1-a)^{-1} = \sum_{j=0}^{\infty} a^j \;\Longrightarrow\; a^{-1} = \sum_{j=0}^{\infty} (1-a)^j$
◮ for $A \in \mathbb{R}^{d\times d}$, $\|A\| < 1$, invertible: $A^{-1} = \sum_{j=0}^{\infty} (I - A)^j$
Stable matrix inversion
Truncated Neumann series:
$$(\hat X^\top\hat X)^{-1} = \gamma \sum_{j=0}^{\infty} (I - \gamma\,\hat X^\top\hat X)^j \;\approx\; \gamma \sum_{j=0}^{t-1} (I - \gamma\,\hat X^\top\hat X)^j$$
Compare to $(\hat X^\top\hat X)^{-1} \approx (\hat X^\top\hat X + \lambda n I)^{-1}$.
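A small numerical check of this approximation on synthetic data (all names and values assumed for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
A = X.T @ X                                   # invertible with probability 1 here
gamma = 1.0 / np.linalg.norm(A, 2)

def neumann_inverse(A, gamma, t):
    """gamma * sum_{j=0}^{t-1} (I - gamma*A)^j, the early-stopping approximation of A^{-1}."""
    d = A.shape[0]
    S, P = np.zeros((d, d)), np.eye(d)
    for _ in range(t):
        S += P
        P = P @ (np.eye(d) - gamma * A)
    return gamma * S

for t in (5, 50, 5000):
    err = np.linalg.norm(neumann_inverse(A, gamma, t) - np.linalg.inv(A))
    print(t, err)   # the error shrinks as t grows: few iterations = heavier regularization
```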
Spectral filtering
Different instances of the same principle:
◮ Tikhonov: $w_\lambda = (\hat X^\top\hat X + \lambda n I)^{-1}\hat X^\top\hat y$
◮ Early stopping: $w_t = \gamma \sum_{j=0}^{t-1} (I - \gamma\,\hat X^\top\hat X)^j\, \hat X^\top\hat y$
◮ TSVD: $w_M = (\hat X^\top\hat X)^{-1}_M\, \hat X^\top\hat y$
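All three can be read as replacing $1/\sigma$ by a filtered version $F(\sigma)$ of the eigenvalues of $\hat X^\top\hat X$; the sketch below compares the three filter functions (a minimal illustration, with names and parameter values assumed):

```python
import numpy as np

def tikhonov_filter(sigma, lam_n):
    return 1.0 / (sigma + lam_n)

def early_stopping_filter(sigma, gamma, t):
    # gamma * sum_{j=0}^{t-1} (1 - gamma*sigma)^j = (1 - (1 - gamma*sigma)**t) / sigma
    return (1.0 - (1.0 - gamma * sigma) ** t) / sigma

def tsvd_filter(sigma, threshold):
    return np.where(sigma >= threshold, 1.0 / sigma, 0.0)

sigma = np.logspace(-4, 0, 5)                 # a few eigenvalues across scales
print(tikhonov_filter(sigma, 1e-2))
print(early_stopping_filter(sigma, 1.0, 100))
print(tsvd_filter(sigma, 1e-2))
```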
Statistics and optimization
$$w_t = \gamma \sum_{j=0}^{t-1} (I - \gamma\,\hat X^\top\hat X)^j\, \hat X^\top\hat y$$
The difference is in the computations: $w_{t+1} = w_t - \gamma\,\hat X^\top(\hat X w_t - \hat y)$
◮ Tikhonov: $O(nd^2 + d^3)$
◮ TSVD: $O(nd^2 + d^2 M)$
◮ GD: $O(ndt)$
Regularization path and warm restart
$$\min_w E(w) \quad \text{vs.} \quad \min_w \hat E(w)$$
Beyond linear models
Non-linear function: $f(x) = \sum_{i=1}^{p} w_i\,\phi_i(x) = \Phi(x)^\top w$
◮ Replace $x$ by $\Phi(x) = (\phi_1(x), \dots, \phi_p(x))$
◮ Replace $\hat X$ by $\hat\Phi = (\Phi(x_1), \dots, \Phi(x_n))^\top \in \mathbb{R}^{n\times p}$
$$w_{t+1} = w_t - \gamma\,\hat\Phi^\top(\hat\Phi w_t - \hat y)$$
Computational cost $O(npt)$.
Early-stopping and kernels
$$f(x) = \sum_{i=1}^{n} K(x_i, x)\, c_i$$
By induction,
$$c_{t+1} = c_t - \gamma(\hat K c_t - \hat y), \qquad \hat K_{ij} = K(x_i, x_j)$$
Computational complexity: $O(n^2 t)$.
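A minimal sketch of this kernel iteration in NumPy (the stepsize choice and names are assumptions for the example):

```python
import numpy as np

def kernel_gd_early_stopping(K, y, t_max):
    """Gradient descent on the coefficients c, stopped after t_max steps.
    K is the n x n kernel matrix, y the n-vector of outputs."""
    n = K.shape[0]
    gamma = 1.0 / np.linalg.norm(K, 2)     # a-priori stepsize, an illustrative choice
    c = np.zeros(n)
    for _ in range(t_max):
        c = c - gamma * (K @ c - y)        # O(n^2) per iteration
    return c
```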
What about other loss functions?
◮ PCA + ERM
◮ Gradient / subgradient descent. Iterations for regularization, not only optimization!
Going big...
Bottleneck of kernel methods: memory. Storing $\hat K$ is $O(n^2)$.
Approaches to large scale
◮ (Random) features: find $\Phi : X \to \mathbb{R}^M$, with $M \ll n$, such that $K(x, x') \approx \Phi(x)^\top \Phi(x')$
◮ Subsampling (Nyström): replace $f(x) = \sum_{i=1}^{n} K(x, x_i)\,c_i$ by $f(x) = \sum_{i=1}^{M} K(x, \tilde x_i)\,c_i$, with $\tilde x_i$ subsampled from the training set, $M \ll n$
◮ Greedy!
◮ Neural nets
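As an example of the first approach, a random Fourier features sketch approximating a Gaussian kernel; the construction follows the standard Rahimi-Recht recipe, and all names and parameter values are illustrative.

```python
import numpy as np

def random_fourier_features(X, M, sigma=1.0, seed=0):
    """Map X (n x d) to M random features such that phi(x)^T phi(x')
    approximates the Gaussian kernel exp(-||x - x'||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, M)) / sigma   # random frequencies
    b = rng.uniform(0.0, 2 * np.pi, M)        # random phases
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)

# With these features, any linear method from the previous slides
# (ridge, TSVD, gradient descent with early stopping) runs at O(nM) cost per step.
```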
This class
Regularization beyond penalization:
◮ Regularization by projection
◮ Regularization by early stopping
Next class
Multioutput learning:
◮ Multitask learning
◮ Vector-valued learning
◮ Multiclass learning