

1. Regularization via Spectral Filtering. Lorenzo Rosasco, MIT 9.520, Class 7.

2. About this class. Goal: to discuss how a class of regularization methods, originally designed for solving ill-posed inverse problems, gives rise to regularized learning algorithms. These algorithms are kernel methods that can be easily implemented and have a common derivation, but different computational and theoretical properties.

3. Plan. From ERM to Tikhonov regularization. Linear ill-posed problems and stability. Spectral regularization and filtering. Examples of algorithms.

4. Basic Notation. Training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. $X$ is the $n \times d$ input matrix. $Y = (y_1, \ldots, y_n)$ is the output vector. $k$ denotes the kernel function, $K$ the $n \times n$ kernel matrix with entries $K_{ij} = k(x_i, x_j)$, and $\mathcal{H}$ the RKHS with kernel $k$. The RLS estimator solves
$$\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2 .$$

5. Representer Theorem. We have seen that the RKHS allows us to write the RLS estimator in the form
$$f_S^{\lambda}(x) = \sum_{i=1}^{n} c_i \, k(x, x_i)$$
with $(K + n\lambda I)c = Y$, where $c = (c_1, \ldots, c_n)$.
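
Computationally this is a single regularized linear solve. A minimal numpy sketch, assuming a Gaussian kernel; the names `gaussian_kernel`, `rls_fit`, and `rls_predict` are illustrative, not from the lecture:

```python
import numpy as np

def gaussian_kernel(X1, X2, width=1.0):
    """Gaussian kernel matrix k(x, x') = exp(-||x - x'||^2 / (2*width^2))."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / (2 * width**2))

def rls_fit(K, Y, lam):
    """Solve (K + n*lam*I) c = Y for the RLS coefficients c."""
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), Y)

def rls_predict(c, X_train, X_test, width=1.0):
    """Evaluate f(x) = sum_i c_i k(x, x_i) on the test points."""
    return gaussian_kernel(X_test, X_train, width) @ c
```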

6. The Role of Regularization. We observed that adding a penalization term can be interpreted as a way to control smoothness and avoid overfitting:
$$\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 \;\Rightarrow\; \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2 .$$

7. Empirical Risk Minimization. Similarly, we can prove that the solution of empirical risk minimization
$$\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2$$
can be written as
$$f_S(x) = \sum_{i=1}^{n} c_i \, k(x, x_i)$$
where the coefficients satisfy $Kc = Y$.

8. The Role of Regularization. Now we can observe that adding a penalty also has an effect from a numerical point of view:
$$Kc = Y \;\Rightarrow\; (K + n\lambda I)c = Y$$
It stabilizes a possibly ill-conditioned matrix inversion problem. This is the point of view of regularization for (ill-posed) inverse problems.

9. Ill-posed Inverse Problems. Hadamard introduced the definition of ill-posedness. Ill-posed problems are typically inverse problems. If $g \in \mathcal{G}$ and $f \in \mathcal{F}$, with $\mathcal{G}, \mathcal{F}$ Hilbert spaces and $L$ a linear, continuous operator, consider the equation $g = Lf$. The direct problem is to compute $g$ given $f$; the inverse problem is to compute $f$ given the data $g$. The inverse problem of finding $f$ is well-posed when the solution exists, is unique, and is stable, that is, it depends continuously on the data $g$. Otherwise the problem is ill-posed.

12. Linear System for ERM. In the finite-dimensional case the main problem is numerical stability. For example, in the learning setting the kernel matrix can be decomposed as $K = Q \Sigma Q^T$, with $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$, $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0$, and $q_1, \ldots, q_n$ the corresponding eigenvectors. Then
$$c = K^{-1} Y = Q \Sigma^{-1} Q^T Y = \sum_{i=1}^{n} \frac{1}{\sigma_i} \langle q_i, Y \rangle \, q_i .$$
In the directions corresponding to small eigenvalues, small perturbations of the data can cause large changes in the solution. The problem is ill-conditioned.
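
To see the instability concretely, the following illustrative sketch builds a symmetric matrix with one tiny eigenvalue and compares $c = K^{-1} Y$ before and after a small perturbation of $Y$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric matrix with one very small eigenvalue (ill-conditioned).
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
K = Q @ np.diag([10.0, 5.0, 1.0, 0.1, 1e-8]) @ Q.T

Y = rng.standard_normal(5)
Y_pert = Y + 1e-6 * rng.standard_normal(5)   # tiny perturbation of the data

c = np.linalg.solve(K, Y)
c_pert = np.linalg.solve(K, Y_pert)

# The relative change in the solution is amplified by roughly 1/sigma_min.
print(np.linalg.norm(c - c_pert) / np.linalg.norm(c))
```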

14. Regularization as a Filter. For Tikhonov regularization,
$$c = (K + n\lambda I)^{-1} Y = Q (\Sigma + n\lambda I)^{-1} Q^T Y = \sum_{i=1}^{n} \frac{1}{\sigma_i + n\lambda} \langle q_i, Y \rangle \, q_i .$$
Regularization filters out the undesired components: for $\sigma_i \gg \lambda n$, $\frac{1}{\sigma_i + n\lambda} \sim \frac{1}{\sigma_i}$; for $\sigma_i \ll \lambda n$, $\frac{1}{\sigma_i + n\lambda} \sim \frac{1}{\lambda n}$.

15. Matrix Function. Note that we can look at a scalar function $G_\lambda(\sigma)$ as a function on the kernel matrix. Using the eigendecomposition of $K$ we can define $G_\lambda(K) = Q G_\lambda(\Sigma) Q^T$, meaning
$$G_\lambda(K) Y = \sum_{i=1}^{n} G_\lambda(\sigma_i) \langle q_i, Y \rangle \, q_i .$$
For Tikhonov,
$$G_\lambda(\sigma) = \frac{1}{\sigma + n\lambda} .$$
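
In code, $G_\lambda(K) Y$ can be computed directly from the eigendecomposition. An illustrative sketch (the helper names are not from the lecture):

```python
import numpy as np

def apply_filter(K, Y, G):
    """Compute G(K) Y = sum_i G(sigma_i) <q_i, Y> q_i via the eigendecomposition of K."""
    sigmas, Q = np.linalg.eigh(K)        # columns of Q are the eigenvectors q_i
    return Q @ (G(sigmas) * (Q.T @ Y))

def tikhonov_filter(lam, n):
    """Tikhonov filter G_lambda(sigma) = 1 / (sigma + n*lambda)."""
    return lambda sigmas: 1.0 / (sigmas + n * lam)

# For the Tikhonov filter this reproduces the RLS solution (K + n*lam*I)^{-1} Y:
# c = apply_filter(K, Y, tikhonov_filter(lam, K.shape[0]))
```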

16. Regularization in Inverse Problems. In the inverse problems literature many algorithms are known besides Tikhonov regularization. Each algorithm is defined by a suitable filter function $G_\lambda$. This class of algorithms is known collectively as spectral regularization. The algorithms are not necessarily based on penalized empirical risk minimization.

17. Algorithms. Gradient descent (also known as the Landweber iteration or L2 boosting); the ν-method (accelerated Landweber); iterated Tikhonov; truncated singular value decomposition (TSVD); principal component regression (PCR). The spectral filtering perspective leads to a unified framework.

18. Properties of Spectral Filters. Not every scalar function defines a regularization scheme. Roughly speaking, a good filter function must have the following properties: as $\lambda$ goes to 0, $G_\lambda(\sigma) \to 1/\sigma$, so that $G_\lambda(K) \to K^{-1}$; $\lambda$ controls the magnitude of the (smaller) eigenvalues of $G_\lambda(K)$.

19. Spectral Regularization for Learning. We can define a class of kernel methods as follows. Spectral regularization: we look for estimators
$$f_S^{\lambda}(x) = \sum_{i=1}^{n} c_i \, k(x, x_i)$$
where $c = G_\lambda(K) Y$.
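
Every algorithm listed on slide 17 fits this template: choose a filter $G_\lambda$ and set $c = G_\lambda(K) Y$. A self-contained illustrative sketch using the TSVD filter, which discards eigencomponents below a threshold (the exact thresholding convention here is an assumption, not taken from the lecture):

```python
import numpy as np

def spectral_fit(K, Y, G):
    """Coefficients c = G_lambda(K) Y for a generic scalar filter G_lambda."""
    sigmas, Q = np.linalg.eigh(K)
    return Q @ (G(sigmas) * (Q.T @ Y))

def tsvd_filter(lam):
    """TSVD filter: G_lambda(sigma) = 1/sigma if sigma >= lam, and 0 otherwise."""
    def G(sigmas):
        out = np.zeros_like(sigmas)
        keep = sigmas >= lam
        out[keep] = 1.0 / sigmas[keep]
        return out
    return G

# c = spectral_fit(K, Y, tsvd_filter(lam)); predictions are f(x) = sum_i c_i k(x, x_i).
```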

20. Gradient Descent. Consider the (Landweber) iteration:
set $c^0 = 0$; for $i = 1, \ldots, t-1$: $\; c^i = c^{i-1} + \eta (Y - K c^{i-1})$.
If the largest eigenvalue of $K$ is smaller than $n$, the above iteration converges if we choose the step-size $\eta = 2/n$. The above iteration can be seen as the minimization of the empirical risk $\frac{1}{n} \|Y - Kc\|_2^2$ via gradient descent.
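
A direct translation of the iteration into numpy (illustrative names; the default step-size follows the condition stated above):

```python
import numpy as np

def landweber(K, Y, t, eta=None):
    """Run t gradient-descent steps c <- c + eta*(Y - K c), starting from c = 0."""
    n = K.shape[0]
    if eta is None:
        eta = 2.0 / n      # valid when the largest eigenvalue of K is smaller than n
    c = np.zeros(n)
    for _ in range(t):
        c = c + eta * (Y - K @ c)
    return c
```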

21. Gradient Descent as Spectral Filtering. Note that $c^0 = 0$, $c^1 = \eta Y$, $c^2 = \eta Y + \eta(I - \eta K)Y$, and
$$c^3 = \eta Y + \eta(I - \eta K)Y + \eta\big(Y - K(\eta Y + \eta(I - \eta K)Y)\big) = \eta Y + \eta(I - \eta K)Y + \eta(I - 2\eta K + \eta^2 K^2)Y .$$
One can prove by induction that the solution at the $t$-th iteration is given by
$$c = \eta \sum_{i=0}^{t-1} (I - \eta K)^i Y .$$
The filter function is
$$G_\lambda(\sigma) = \eta \sum_{i=0}^{t-1} (1 - \eta \sigma)^i .$$
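
The closed form can be checked numerically. In the illustrative sketch below, running the iteration for $t$ steps and evaluating $\eta \sum_{i=0}^{t-1} (I - \eta K)^i Y$ give the same coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 20, 50
A = rng.standard_normal((n, n))
K = A @ A.T / n                            # a symmetric positive semidefinite matrix
Y = rng.standard_normal(n)
eta = 1.0 / np.linalg.eigvalsh(K).max()    # step-size small enough for convergence

# t steps of the Landweber iteration, starting from c = 0.
c_iter = np.zeros(n)
for _ in range(t):
    c_iter = c_iter + eta * (Y - K @ c_iter)

# Closed form: c = eta * sum_{i=0}^{t-1} (I - eta*K)^i Y.
c_closed = np.zeros(n)
M = np.eye(n)
for _ in range(t):
    c_closed = c_closed + eta * (M @ Y)
    M = M @ (np.eye(n) - eta * K)

print(np.allclose(c_iter, c_closed))       # True
```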

22. Landweber Iteration. Note that $\sum_{i \ge 0} x^i = 1/(1-x)$ for $|x| < 1$, and the identity also holds when $x$ is replaced by a matrix with spectral radius smaller than one. If we consider the kernel matrix (or rather $I - \eta K$) we get
$$K^{-1} = \eta \sum_{i=0}^{\infty} (I - \eta K)^i \sim \eta \sum_{i=0}^{t-1} (I - \eta K)^i .$$
The filter function of the Landweber iteration corresponds to a truncated power expansion of $K^{-1}$.

23. Early Stopping. The regularization parameter is the number of iterations. Roughly speaking, $t \sim 1/\lambda$. Large values of $t$ correspond to minimization of the empirical risk and tend to overfit. Small values of $t$ tend to oversmooth (recall that we start from $c = 0$). Early stopping of the iteration has a regularization effect.
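
In practice the stopping time can be chosen by monitoring a held-out set. An illustrative sketch; the function name and the validation-based selection rule are assumptions, not from the lecture:

```python
import numpy as np

def landweber_early_stopping(K_train, Y_train, K_val, Y_val, max_iter=500, eta=None):
    """Landweber iteration that returns the iterate with the lowest validation error.

    K_val holds the kernel values k(x_val_j, x_train_i), so predictions are K_val @ c.
    """
    n = K_train.shape[0]
    if eta is None:
        eta = 2.0 / n
    c = np.zeros(n)
    best_c, best_err, best_t = c.copy(), np.inf, 0
    for t in range(1, max_iter + 1):
        c = c + eta * (Y_train - K_train @ c)
        err = np.mean((K_val @ c - Y_val) ** 2)
        if err < best_err:
            best_c, best_err, best_t = c.copy(), err, t
    return best_c, best_t
```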

24. Gradient Descent at Work. [Figure: a one-dimensional regression example, with $X \in [0, 1]$ on the horizontal axis and $Y \in [-2, 2]$ on the vertical axis.]

25. Gradient Descent at Work. [Figure: same axes as the previous slide.]
