Spectral regularization methods for statistical inverse learning - PowerPoint PPT Presentation

Spectral regularization methods for statistical inverse learning problems. G. Blanchard, Universität Potsdam. van Dantzig seminar, 23/06/2016. Joint work with N. Mücke (U. Potsdam).


1. Spectral regularization methods for statistical inverse learning problems
   G. Blanchard, Universität Potsdam
   van Dantzig seminar, 23/06/2016
   Joint work with N. Mücke (U. Potsdam)

2. Outline
   1 General regularization and kernel methods
   2 Inverse learning/regression and relation to kernels
   3 Rates for linear spectral regularization methods
   4 Beyond the regular spectrum case

3. (Outline repeated: section divider.)

4. INTRODUCTION: RANDOM DESIGN REGRESSION
   ◮ Consider the familiar regression setting on a random design, $Y_i = f^*(X_i) + \varepsilon_i$, where $(X_i, Y_i)_{1 \le i \le n}$ is an i.i.d. sample from $P_{XY}$ on the space $\mathcal{X} \times \mathbb{R}$,
   ◮ with $\mathbb{E}[\varepsilon_i \mid X_i] = 0$.
   ◮ For an estimator $\hat f$ we consider the prediction error $\|\hat f - f^*\|_X^2 = \mathbb{E}\big[(\hat f(X) - f^*(X))^2\big]$, which we want to be as small as possible (in expectation or with large probability).
   ◮ We can also be interested in the squared reconstruction error $\|\hat f - f^*\|_{\mathcal{H}}^2$, where $\|\cdot\|_{\mathcal{H}}$ is a certain Hilbert norm of interest to the user.
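As a quick illustration of the random-design setting and the prediction error above, here is a minimal Python/NumPy sketch (not from the talk: the regression function, noise level, and design distribution are invented for illustration) that draws an i.i.d. sample, fits a crude estimator, and approximates $\|\hat f - f^*\|_X^2$ by Monte Carlo on fresh design points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical choices (not from the slides): f*(x) = sin(2*pi*x),
# X uniform on [0, 1], Gaussian noise with E[eps | X] = 0.
f_star = lambda x: np.sin(2 * np.pi * x)
n, sigma = 200, 0.3

X = rng.uniform(0, 1, n)
Y = f_star(X) + sigma * rng.normal(size=n)

# A crude estimator: least-squares fit of a degree-5 polynomial.
coeffs = np.polyfit(X, Y, deg=5)
f_hat = lambda x: np.polyval(coeffs, x)

# Monte Carlo approximation of the prediction error
# ||f_hat - f*||_X^2 = E[(f_hat(X) - f*(X))^2] over fresh design points.
X_test = rng.uniform(0, 1, 100_000)
pred_error = np.mean((f_hat(X_test) - f_star(X_test)) ** 2)
print(f"estimated prediction error: {pred_error:.4f}")
```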

5. LINEAR CASE
   ◮ Very classical is the linear case: $\mathcal{X} = \mathbb{R}^p$, $f^*(x) = \langle x, \beta^* \rangle$, and in the usual matrix form ($X_i^t$ form the rows of the design matrix $X$): $Y = X \beta^* + \varepsilon$.
   ◮ The ordinary least squares solution is $\hat\beta_{\mathrm{OLS}} = (X^t X)^\dagger X^t Y$.
   ◮ Prediction error corresponds to $\mathbb{E}\big[\langle \beta^* - \hat\beta, X \rangle^2\big]$.
   ◮ Reconstruction error corresponds to $\|\beta^* - \hat\beta\|^2$.
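A minimal NumPy sketch of the linear case; the dimensions, design distribution, and true $\beta^*$ are illustrative assumptions. It computes $\hat\beta_{\mathrm{OLS}} = (X^t X)^\dagger X^t Y$ and estimates both error measures from the slide.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 500, 20, 0.5

beta_star = rng.normal(size=p)            # assumed true parameter
X = rng.normal(size=(n, p))               # rows X_i^t of the design matrix
Y = X @ beta_star + sigma * rng.normal(size=n)

# OLS via the Moore-Penrose pseudoinverse of X^t X.
beta_ols = np.linalg.pinv(X.T @ X) @ X.T @ Y

# Reconstruction error ||beta* - beta_hat||^2.
rec_error = np.sum((beta_star - beta_ols) ** 2)

# Prediction error E[<beta* - beta_hat, X>^2], estimated on fresh design points.
X_new = rng.normal(size=(100_000, p))
pred_error = np.mean((X_new @ (beta_star - beta_ols)) ** 2)

print(f"reconstruction error: {rec_error:.4f}, prediction error: {pred_error:.4f}")
```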

6. EXTENDING THE SCOPE OF LINEAR REGRESSION
   ◮ Common strategy to model more complex functions: map the input variable $x \in \mathcal{X}$ to a so-called "feature space" through $\tilde x = \Phi(x)$.
   ◮ Typical examples (say with $\mathcal{X} = [0,1]$) are $\tilde x = \Phi(x) = (1, x, x^2, \dots, x^p) \in \mathbb{R}^{p+1}$; $\tilde x = \Phi(x) = (1, \cos(2\pi x), \sin(2\pi x), \cos(3\pi x), \sin(3\pi x), \dots) \in \mathbb{R}^{2p+1}$.
   ◮ Problem: the large number of parameters to estimate requires regularization to avoid overfitting.
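The two feature maps above are easy to write down explicitly; the sketch below is only an illustration, and its Fourier map uses frequencies $2\pi k x$, a common convention that differs slightly from the frequencies listed on the slide.

```python
import numpy as np

def poly_features(x, p):
    """Phi(x) = (1, x, x^2, ..., x^p) for an array of scalar inputs x."""
    x = np.atleast_1d(x)
    return np.stack([x ** k for k in range(p + 1)], axis=-1)

def fourier_features(x, p):
    """Phi(x) = (1, cos(2*pi*x), sin(2*pi*x), cos(4*pi*x), sin(4*pi*x), ...),
    with p cosine/sine pairs (one of several possible conventions)."""
    x = np.atleast_1d(x)
    cols = [np.ones_like(x)]
    for k in range(1, p + 1):
        cols += [np.cos(2 * np.pi * k * x), np.sin(2 * np.pi * k * x)]
    return np.stack(cols, axis=-1)

x = np.linspace(0, 1, 5)
print(poly_features(x, p=3).shape)      # (5, 4)  -> features in R^{p+1}
print(fourier_features(x, p=3).shape)   # (5, 7)  -> features in R^{2p+1}
```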

7. REGULARIZATION METHODS
   ◮ The main idea of regularization is to replace $(X^t X)^\dagger$ by an approximate inverse, for instance:
   ◮ Ridge regression / Tikhonov: $\hat\beta_{\mathrm{Ridge}}(\lambda) = (X^t X + \lambda I_p)^{-1} X^t Y$
   ◮ PCA projection / spectral cut-off: restrict $X^t X$ to its $k$ first eigenvectors, $\hat\beta_{\mathrm{PCA}}(k) = (X^t X)^\dagger_{|k} X^t Y$
   ◮ Gradient descent / Landweber iteration / $L^2$ boosting:
     $\hat\beta_{\mathrm{LW}}(k) = \hat\beta_{\mathrm{LW}}(k-1) + X^t\big(Y - X \hat\beta_{\mathrm{LW}}(k-1)\big) = \sum_{i=0}^{k} (I - X^t X)^i \, X^t Y$
     (assuming $\|X^t X\|_{\mathrm{op}} \le 1$).
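All three regularized estimators have direct NumPy implementations. The sketch below is illustrative only: the data are simulated, the design is rescaled so that $\|X^t X\|_{\mathrm{op}} \le 1$, and the Landweber recursion is started from $\hat\beta_{\mathrm{LW}}(0) = 0$, an assumption not stated on the slide.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 300, 30, 0.5
beta_star = rng.normal(size=p)
X = rng.normal(size=(n, p))
X = X / np.linalg.norm(X.T @ X, 2) ** 0.5   # rescale so that ||X^t X||_op <= 1
Y = X @ beta_star + sigma * rng.normal(size=n)

A = X.T @ X                                  # the p x p matrix X^t X
b = X.T @ Y                                  # X^t Y

# Ridge / Tikhonov: (X^t X + lambda I_p)^{-1} X^t Y.
lam = 0.1
beta_ridge = np.linalg.solve(A + lam * np.eye(p), b)

# PCA projection / spectral cut-off: invert X^t X only on its k top eigenvectors.
k = 10
evals, evecs = np.linalg.eigh(A)             # eigenvalues in ascending order
top = np.argsort(evals)[::-1][:k]
beta_pca = evecs[:, top] @ (np.diag(1.0 / evals[top]) @ (evecs[:, top].T @ b))

# Landweber iteration / L2 boosting, started (here) from beta = 0.
beta_lw = np.zeros(p)
for _ in range(k):
    beta_lw = beta_lw + X.T @ (Y - X @ beta_lw)

for name, beta in [("ridge", beta_ridge), ("pca", beta_pca), ("landweber", beta_lw)]:
    print(name, float(np.sum((beta - beta_star) ** 2)))   # reconstruction errors
```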

8. GENERAL FORM: SPECTRAL REGULARIZATION
   ◮ General form of a regularization method: $\hat\beta_{\mathrm{Spec}}(\zeta, \lambda) = \zeta_\lambda(X^t X) \, X^t Y$ for some well-chosen function $\zeta_\lambda : \mathbb{R}_+ \to \mathbb{R}_+$ acting on the spectrum and "approximating" the function $x \mapsto 1/x$.
   ◮ $\lambda > 0$: regularization parameter; $\lambda \to 0$ $\Leftrightarrow$ less regularization.
   ◮ Notation of functional calculus: if $X^t X = Q^T \mathrm{diag}(\lambda_1, \dots, \lambda_p)\, Q$, then $\zeta(X^t X) := Q^T \mathrm{diag}(\zeta(\lambda_1), \dots, \zeta(\lambda_p))\, Q$.
   ◮ Many are well known from the inverse problem literature.
   ◮ Examples:
     ◮ Tikhonov: $\zeta_\lambda(t) = (t + \lambda)^{-1}$
     ◮ Spectral cut-off: $\zeta_\lambda(t) = t^{-1} \mathbf{1}\{t \ge \lambda\}$
     ◮ Landweber iteration: $\zeta_k(t) = \sum_{i=0}^{k} (1 - t)^i$.
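The functional-calculus notation translates almost literally into code: diagonalize the symmetric matrix $X^t X$, apply $\zeta_\lambda$ to the eigenvalues, and rotate back. The sketch below (simulated data, arbitrary $\lambda$) checks that the Tikhonov filter reproduces the explicit ridge formula; the other filters plug into the same helper.

```python
import numpy as np

def apply_spectral_filter(A, zeta):
    """Return zeta(A) for a symmetric matrix A via functional calculus:
    A = Q diag(lam) Q^T  ->  zeta(A) = Q diag(zeta(lam)) Q^T."""
    evals, Q = np.linalg.eigh(A)
    return Q @ np.diag(zeta(evals)) @ Q.T

rng = np.random.default_rng(3)
n, p, lam = 200, 15, 0.1
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)
A = X.T @ X

# Filters from the slide (Landweber plugs in the same way).
zeta_tikhonov = lambda t: 1.0 / (t + lam)
zeta_cutoff = lambda t: np.where(t >= lam, 1.0 / np.maximum(t, lam), 0.0)

beta_spec = apply_spectral_filter(A, zeta_tikhonov) @ X.T @ Y
beta_ridge = np.linalg.solve(A + lam * np.eye(p), X.T @ Y)
print(np.allclose(beta_spec, beta_ridge))    # True: same estimator

beta_cut = apply_spectral_filter(A, zeta_cutoff) @ X.T @ Y
print(beta_cut.shape)                        # spectral cut-off estimator, shape (p,)
```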

9. COEFFICIENT EXPANSION
   ◮ A useful trick of functional calculus is the "shift rule": $\zeta(X^t X)\, X^t = X^t\, \zeta(X X^t)$.
   ◮ Interpretation: $\hat\beta_{\mathrm{Spec}}(\zeta,\lambda) = \zeta(X^t X)\, X^t Y = X^t\, \zeta(X X^t)\, Y = \sum_{i=1}^{n} \hat\alpha_i X_i$, with $\hat\alpha = \zeta(G)\, Y$, where $G = X X^t$ is the $(n, n)$ Gram matrix of $(X_1, \dots, X_n)$.
   ◮ This representation is more economical if $p \gg n$.
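For the Tikhonov filter the shift rule is the classical identity $(X^t X + \lambda I_p)^{-1} X^t = X^t (X X^t + \lambda I_n)^{-1}$, which can be checked numerically. The sketch below (arbitrary simulated data) verifies that the primal $p \times p$ computation and the dual $n \times n$ computation through the Gram matrix $G = X X^t$ give the same coefficient vector, which is the point of the expansion when $p \gg n$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 50, 500, 0.1          # p >> n: the dual (Gram) form is cheaper
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# Primal form: zeta(X^t X) X^t Y with the Tikhonov filter, a p x p solve.
beta_primal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Dual form via the shift rule zeta(X^t X) X^t = X^t zeta(X X^t):
# alpha = zeta(G) Y with G = X X^t (an n x n solve), then beta = sum_i alpha_i X_i.
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(n), Y)
beta_dual = X.T @ alpha

print(np.allclose(beta_primal, beta_dual))   # True
```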

10. THE "KERNELIZATION" ANSATZ
   ◮ Let $\Phi$ be a feature mapping into a (possibly infinite-dimensional) Hilbert feature space $\mathcal{H}$.
   ◮ Representing $\tilde x = \Phi(x) \in \mathcal{H}$ explicitly is cumbersome/impossible in practice, but if we can quickly compute the kernel $K(x, x') := \langle \tilde x, \tilde x' \rangle = \langle \Phi(x), \Phi(x') \rangle$, then the kernel Gram matrix $\tilde G_{ij} = \langle \tilde x_i, \tilde x_j \rangle = K(x_i, x_j)$ is accessible.
   ◮ We can hence directly "kernelize" any classical regularization technique using the implicit representation $\hat\beta_{\mathrm{Spec}}(\zeta,\lambda) = \sum_{i=1}^{n} \hat\alpha_i \tilde X_i$, with $\hat\alpha = \zeta(\tilde G)\, Y$;
   ◮ the value of $\hat f(x) = \langle \hat\beta, \tilde x \rangle$ can then be computed for any $x$: $\hat f(x) = \sum_{i=1}^{n} \hat\alpha_i K(X_i, x)$.
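A minimal sketch of the kernelization recipe with the Tikhonov/ridge filter and a Gaussian kernel; the kernel choice, bandwidth, target function, and the absence of any $1/n$ scaling in front of the Gram matrix are illustrative conventions, not prescriptions from the talk.

```python
import numpy as np

rng = np.random.default_rng(5)

def gaussian_kernel(x, y, sigma=0.2):
    """K(x, y) = exp(-|x - y|^2 / (2 sigma^2)) for 1-d inputs, broadcast over arrays."""
    return np.exp(-np.subtract.outer(x, y) ** 2 / (2 * sigma ** 2))

# Simulated data (illustrative): noisy observations of an unknown function.
f_star = lambda x: np.sin(2 * np.pi * x)
n, lam = 100, 1e-2
X = rng.uniform(0, 1, n)
Y = f_star(X) + 0.2 * rng.normal(size=n)

# Kernelized Tikhonov: alpha = zeta_lambda(G) Y = (G + lam I)^{-1} Y.
G = gaussian_kernel(X, X)
alpha = np.linalg.solve(G + lam * np.eye(n), Y)

# Prediction at new points: f_hat(x) = sum_i alpha_i K(X_i, x).
x_grid = np.linspace(0, 1, 200)
f_hat = gaussian_kernel(x_grid, X) @ alpha
print(float(np.mean((f_hat - f_star(x_grid)) ** 2)))  # approximate L2 error on the grid
```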

11. REPRODUCING KERNEL METHODS
   ◮ If $\mathcal{H}$ is a Hilbert feature space, it is useful to identify it as a space of real functions on $\mathcal{X}$ of the form $f(x) = \langle w, \Phi(x) \rangle$. The canonical feature mapping is then $\Phi(x) = K(x, \cdot)$ and the "reproducing kernel" property reads $f(x) = \langle f, \Phi(x) \rangle = \langle f, K(x, \cdot) \rangle$.
   ◮ Classical kernels on $\mathbb{R}^d$ include:
     ◮ Gaussian kernel: $K(x, y) = \exp\big(-\|x - y\|^2 / 2\sigma^2\big)$
     ◮ Polynomial kernel: $K(x, y) = (1 + \langle x, y \rangle)^p$
     ◮ Spline kernels, Matérn kernel, inverse quadratic kernel, ...
   ◮ The success of reproducing kernel methods since the early 00's is due to their versatility and ease of use: beyond vector spaces, kernels have been constructed on various non-Euclidean data (text, genome, graphs, probability distributions, ...).
   ◮ One of the tenets of "learning theory" is a distribution-free point of view; in particular, the sampling distribution (of the $X_i$'s) is unknown to the user and could be very general.
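A short sketch of the two closed-form kernels listed above, checking on random points that the resulting Gram matrices are positive semidefinite up to numerical tolerance; the parameters $\sigma$ and $p$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for rows of x and y.
    sq_dist = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2 * sigma ** 2))

def polynomial_kernel(x, y, p=3):
    # K(x, y) = (1 + <x, y>)^p for rows of x and y.
    return (1.0 + x @ y.T) ** p

X = rng.normal(size=(50, 4))        # 50 points in R^4
for K in (gaussian_kernel(X, X), polynomial_kernel(X, X)):
    eigmin = np.linalg.eigvalsh(K).min()
    print(eigmin >= -1e-8)          # Gram matrix is PSD up to numerical error
```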

12. (Outline repeated: section divider.)

13. SETTING: "INVERSE LEARNING" PROBLEM
   ◮ We refer to "inverse learning" (or inverse regression) for an inverse problem where we have noisy observations at random design points:
     $(X_i, Y_i)_{i=1,\dots,n}$ i.i.d.: $Y_i = (Af^*)(X_i) + \varepsilon_i$.  (ILP)
   ◮ The goal is to recover $f^* \in \mathcal{H}_1$.
   ◮ Early works on closely related subjects: from the splines literature in the 80's (e.g. O'Sullivan '90).
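To make the model (ILP) concrete, here is a hedged toy simulation, entirely an assumption and not the talk's example: take $A$ to be the integration operator $(Af)(x) = \int_0^x f(t)\,dt$ on $[0,1]$ and $f^*(t) = \cos(2\pi t)$, so that $Af^*$ has the closed form $\sin(2\pi x)/(2\pi)$, observed with noise at random design points.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy instance of the inverse learning problem Y_i = (A f*)(X_i) + eps_i.
# A is the integration operator (A f)(x) = int_0^x f(t) dt on [0, 1];
# with f*(t) = cos(2*pi*t) we get (A f*)(x) = sin(2*pi*x) / (2*pi) in closed form.
f_star = lambda t: np.cos(2 * np.pi * t)
Af_star = lambda x: np.sin(2 * np.pi * x) / (2 * np.pi)

n, sigma = 200, 0.05
X = rng.uniform(0, 1, n)                     # random design points
Y = Af_star(X) + sigma * rng.normal(size=n)  # noisy indirect observations

# The statistician sees only (X_i, Y_i) and knowledge of A,
# and wants to recover f*, not A f*.
print(X[:3], Y[:3])
```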

14. MAIN ASSUMPTION FOR INVERSE LEARNING
   Model (ILP): $Y_i = (Af^*)(X_i) + \varepsilon_i$, $i = 1, \dots, n$, where $A : \mathcal{H}_1 \to \mathcal{H}_2$.
   Observe:
   ◮ $\mathcal{H}_2$ should be a space of real-valued functions on $\mathcal{X}$.
   ◮ The geometrical structure of the "measurement errors" will be dictated by the statistical properties of the sampling scheme; there is no need to assume or consider any a priori Hilbert structure on $\mathcal{H}_2$.
   ◮ The crucial structural assumption is the following:
   Assumption: the family of evaluation functionals $(S_x)_{x \in \mathcal{X}}$, defined by $S_x : \mathcal{H}_1 \to \mathbb{R}$, $f \mapsto S_x(f) := (Af)(x)$, is uniformly bounded, i.e., there exists $\kappa < \infty$ such that for any $x \in \mathcal{X}$, $|S_x(f)| \le \kappa \|f\|_{\mathcal{H}_1}$.

15. GEOMETRY OF INVERSE LEARNING
   ◮ Inverse learning under the previous assumption was essentially considered by Caponnetto et al. (2006).
   ◮ Riesz's theorem implies the existence, for any $x \in \mathcal{X}$, of $F_x \in \mathcal{H}_1$ such that $\forall f \in \mathcal{H}_1: (Af)(x) = \langle f, F_x \rangle$.
   ◮ $K(x, y) := \langle F_x, F_y \rangle$ defines a positive semidefinite kernel on $\mathcal{X}$, with associated reproducing kernel Hilbert space (RKHS) denoted $\mathcal{H}_K$.
   ◮ As a pure function space, $\mathcal{H}_K$ coincides with $\mathrm{Im}(A)$.
   ◮ Assuming $A$ is injective, $A$ is in fact an isometric isomorphism between $\mathcal{H}_1$ and $\mathcal{H}_K$.
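Continuing the toy integration-operator example (still an illustration, not from the talk): on $\mathcal{H}_1 = L^2([0,1])$ one has $(Af)(x) = \langle f, \mathbf{1}_{[0,x]} \rangle$, so the Riesz representer is $F_x = \mathbf{1}_{[0,x]}$ and the induced kernel is $K(x,y) = \langle F_x, F_y \rangle = \min(x,y)$. The sketch below checks this identity against a grid discretization of the $L^2$ inner product.

```python
import numpy as np

# For A f(x) = int_0^x f(t) dt on L^2([0, 1]):
#   (A f)(x) = <f, F_x>_{L^2} with representer F_x(t) = 1{t <= x},
# so the induced kernel is K(x, y) = <F_x, F_y> = min(x, y).
m = 10_000
t, dt = np.linspace(0, 1, m, endpoint=False), 1.0 / m   # grid for L^2 inner products

def F(x):
    return (t <= x).astype(float)       # discretized representer F_x = 1{[0, x]}

for x, y in [(0.2, 0.7), (0.5, 0.5), (0.9, 0.3)]:
    inner = np.sum(F(x) * F(y)) * dt    # grid approximation of <F_x, F_y>_{L^2}
    print(round(inner, 3), min(x, y))   # the two numbers agree up to grid error
```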
