Kernel partial least squares for stationary data


1. Kernel partial least squares for stationary data
   Tatyana Krivobokova, Marco Singer, Axel Munk (Georg-August-Universität Göttingen)
   Bert de Groot (Max Planck Institute for Biophysical Chemistry)
   Van Dantzig Seminar, 06 April 2017

2. Motivating example

Proteins
• are large biological molecules
• often require dynamics to function
• have a high-dimensional configuration space

The group of Bert de Groot seeks to identify a relationship between the collective atomic motions of a protein and a specific (biological) function of that protein.

3. Motivating example

The data come from Molecular Dynamics (MD) simulations:
• $Y_t \in \mathbb{R}$ is a functional quantity of interest at time $t$, $t = 1, \ldots, n$
• $X_t \in \mathbb{R}^{3N}$ are the Euclidean coordinates of $N$ atoms at time $t$

Stylized facts:
• $d = 3N$ is typically high, but $d \ll n$
• $\{X_t\}_t$, $\{Y_t\}_t$ are (non-)stationary time series
• some (large) atom movements might be unrelated to $Y_t$

The functional quantity $Y_t$ is to be modelled as a function of $X_t$.

4. Yeast aquaporin (AQY1)

• Gated water channel
• $Y_t$ is the opening diameter (red line)
• 783 backbone atoms
• $n = 20{,}000$ observations over a 100 ns timeframe

5. AQY1 time series

[Figure: movements of the first atom (coordinate, left panel) and the diameter of the channel opening (in nm, right panel), both plotted against time in ns]

6. Model

Assume
$Y_t = f(X_t) + \epsilon_t, \quad t = 1, \ldots, n,$
where
• $\{X_t\}_t$ is a $d$-dimensional stationary time series
• $\{\epsilon_t\}_t$ is an i.i.d. zero-mean sequence independent of $\{X_t\}_t$
• $f \in L^2(P_{\tilde X})$, where $\tilde X$ is independent of $\{X_t\}_t$ and $\{\epsilon_t\}_t$ and $P_{\tilde X} = P_{X_1}$.

The closeness of an estimator $\hat f$ of $f$ is measured by
$\|\hat f - f\|_2^2 = \mathbb{E}_{\tilde X}\{\hat f(\tilde X) - f(\tilde X)\}^2.$
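
A minimal Monte Carlo sketch of how the error norm above can be approximated, under stand-in choices for $f$, $\hat f$ and the law of $\tilde X$ (all hypothetical, purely for illustration; the slides do not prescribe any of these):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # hypothetical "true" regression function
    return np.sin(x[..., 0]) + 0.5 * x[..., 1]

def f_hat(x):
    # hypothetical estimator of f (placeholder)
    return np.sin(x[..., 0])

# independent copy X_tilde with a stand-in distribution for P_X
X_tilde = rng.normal(size=(100_000, 2))

# Monte Carlo approximation of ||f_hat - f||^2 = E[(f_hat(X_tilde) - f(X_tilde))^2]
err2 = np.mean((f_hat(X_tilde) - f(X_tilde)) ** 2)
print(err2)
```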

7. Simple linear case

Hub, J.S. and de Groot, B.L. (2009) assumed a linear model
$Y_i = X_i^T\beta + \epsilon_i, \quad i = 1, \ldots, n, \quad X_i \in \mathbb{R}^d,$
or, in matrix form, $Y = X\beta + \epsilon$, ignored the dependence in the data and tried to regularise the estimator by using PCA.

8. Motivating example

[Figure: PC regression with 50 components; left panel: fit over time in ns; right panel: correlation against the number of components]

9. Motivating example

Partial Least Squares (PLS) leads to superior results.

[Figure: correlation against the number of components for PLS and PCR]

10. Regularisation with PCR and PLS

Consider a linear regression model with fixed design $Y = X\beta + \epsilon$. In the following let $A = X^TX$ and $b = X^TY$.

PCR and PLS regularise $\beta$ with a transformation $H \in \mathbb{R}^{d \times s}$ s.t.
$\hat\beta_s = H \arg\min_{\alpha \in \mathbb{R}^s} \frac{1}{n}\|Y - XH\alpha\|^2 = H(H^TAH)^{-1}H^Tb,$
where $s \le d$ plays the role of a regularisation parameter.

In PCR the matrix $H$ consists of the first $s$ eigenvectors of $A = X^TX$.
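
A numpy sketch of the PCR instance of this scheme: $H$ collects the first $s$ eigenvectors of $A = X^TX$ and $\hat\beta_s = H(H^TAH)^{-1}H^Tb$. The design $X$, response $Y$ and the sizes are synthetic placeholders, not the protein data.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, s = 200, 10, 3                      # hypothetical sizes
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=d) + rng.normal(size=n)

A = X.T @ X                               # A = X^T X
b = X.T @ Y                               # b = X^T Y

# H = eigenvectors of A belonging to the s largest eigenvalues  ->  PCR
eigval, eigvec = np.linalg.eigh(A)        # eigh returns ascending order
H = eigvec[:, ::-1][:, :s]

# beta_s = H (H^T A H)^{-1} H^T b
beta_s = H @ np.linalg.solve(H.T @ A @ H, H.T @ b)
```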

11. Regularisation with PLS

In PLS one derives $H = (h_1, \ldots, h_s)$, $h_i \in \mathbb{R}^d$, as follows (see the sketch after this slide):
1. Find $h_1 = \arg\max_{h \in \mathbb{R}^d,\,\|h\|=1} \mathrm{cov}(Xh, Y)^2 \propto X^TY = b$
2. Project $Y$ orthogonally: $X\hat\beta_1 = Xh_1(h_1^TAh_1)^{-1}h_1^TX^TY$
3. Iterate the procedure according to
   $h_i = \arg\max_{h \in \mathbb{R}^d,\,\|h\|=1} \mathrm{cov}(Xh, Y - X\hat\beta_{i-1})^2, \quad i = 2, \ldots, s$

Clearly, $\hat\beta_s$ is highly non-linear in $Y$.
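
A sketch of this greedy iteration in numpy. Each new direction is proportional to $X^T$ times the current residual; for the projection step the sketch refits on all directions collected so far, matching the $H$-representation $\hat\beta_i = H(H^TAH)^{-1}H^Tb$ of the previous slide (an assumption about how the slide's step 2 generalises, not the authors' code).

```python
import numpy as np

def pls_directions(X, Y, s):
    """Greedy PLS as on the slide: each new direction maximises
    cov(Xh, current residual)^2, i.e. h_i is proportional to X^T (Y - X beta_{i-1})."""
    n, d = X.shape
    A, b = X.T @ X, X.T @ Y
    H = np.empty((d, 0))
    beta = np.zeros(d)
    for i in range(s):
        h = X.T @ (Y - X @ beta)          # direction maximising the squared covariance
        h = h / np.linalg.norm(h)
        H = np.column_stack([H, h])
        # refit on all directions collected so far: beta = H (H^T A H)^{-1} H^T b
        beta = H @ np.linalg.solve(H.T @ A @ H, H.T @ b)
    return H, beta
```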

12. Regularisation with PLS

For PLS it is known that $h_i \in \mathcal{K}_i(A, b)$, $i = 1, \ldots, s$, where
$\mathcal{K}_i(A, b) = \mathrm{span}\{b, Ab, \ldots, A^{i-1}b\}$
is a Krylov space of order $i$. With this, an alternative definition of PLS is
$\hat\beta_s = \arg\min_{\beta \in \mathcal{K}_s(A, b)} \|Y - X\beta\|^2.$

Note that any $\beta_s \in \mathcal{K}_s(A, b)$ can be represented as
$\beta_s = P_s(A)b = P_s(X^TX)X^TY = X^TP_s(XX^T)Y,$
where $P_s$ is a polynomial of degree at most $s - 1$ (see the Krylov-basis sketch below).
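
The Krylov characterisation suggests an equivalent computation: build a basis of $\mathcal{K}_s(A, b)$ and minimise $\|Y - X\beta\|^2$ over it. A sketch with a hypothetical helper (the orthonormalisation is only for numerical stability; this is an illustrative check, not a production PLS routine):

```python
import numpy as np

def pls_via_krylov(X, Y, s):
    """beta_s = argmin over K_s(A, b) of ||Y - X beta||^2,
    with K_s(A, b) = span{b, Ab, ..., A^{s-1} b}."""
    A, b = X.T @ X, X.T @ Y
    # columns b, Ab, ..., A^{s-1} b
    V = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(s)])
    Q, _ = np.linalg.qr(V)                 # orthonormal basis of the Krylov space
    gamma, *_ = np.linalg.lstsq(X @ Q, Y, rcond=None)
    return Q @ gamma                       # beta_s in the original coordinates
```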

13. Regularisation with PLS

For the implementation and the proofs the residual polynomials
$R_s(x) = 1 - xP_s(x)$
are of interest. The polynomials $R_s$
• are orthogonal w.r.t. an appropriate inner product
• satisfy a recurrence relation $R_{s+1}(x) = a_sxR_s(x) + b_sR_s(x) + c_sR_{s-1}(x)$
• are convex on $[0, r_s]$, where $r_s$ is the first root of $R_s(x)$, and satisfy $R_s(0) = 1$.

14. PLS and conjugate gradient

PLS is closely related to the conjugate gradient (CG) algorithm for
$A\beta = X^TX\beta = X^TY = b.$
The solution of this linear equation by CG is defined by
$\hat\beta_s^{CG} = \arg\min_{\beta \in \mathcal{K}_s(A, b)} \|b - A\beta\|^2 = \arg\min_{\beta \in \mathcal{K}_s(A, b)} \|X^T(Y - X\beta)\|^2.$
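
A minimal textbook CG sketch for the normal equations $A\beta = X^TX\beta = X^TY = b$; after $s$ steps the iterate lies in $\mathcal{K}_s(A, b)$, which is exactly the connection to PLS used above. This is standard CG, not the authors' implementation; $X$ and $Y$ are assumed given as in the earlier sketches.

```python
import numpy as np

def cg_normal_equations(X, Y, s):
    """s steps of conjugate gradients applied to X^T X beta = X^T Y,
    started at beta_0 = 0; beta_s lies in the Krylov space K_s(A, b)."""
    A, b = X.T @ X, X.T @ Y
    beta = np.zeros(X.shape[1])
    r = b.copy()                  # residual b - A beta
    p = r.copy()                  # search direction
    for _ in range(s):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        beta = beta + alpha * p
        r_new = r - alpha * Ap
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return beta
```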

15. CG in deterministic setting

The CG algorithm has been studied in Nemirovskii (1986) as follows:
• Consider $\bar A\beta = \bar b$ for a linear bounded operator $\bar A: \mathcal{H} \to \mathcal{H}$
• Assume that only approximations $A$ of $\bar A$ and $b$ of $\bar b$ are given
• Set $\hat\beta_s^{CG} = \arg\min_{\beta \in \mathcal{K}_s(A, b)} \|b - A\beta\|_{\mathcal{H}}^2.$

16. CG in deterministic setting

Assume
(A1) $\max\{\|\bar A\|_{op}, \|A\|_{op}\} \le L$, $\|\bar A - A\|_{op} \le \epsilon$ and $\|\bar b - b\|_{\mathcal{H}} \le \delta$
(A2) The stopping index $\hat s$ satisfies the discrepancy principle
     $\hat s = \min\{s > 0: \|b - A\hat\beta_s\|_{\mathcal{H}} < \tau(\delta + \epsilon\|\hat\beta_s\|_{\mathcal{H}})\}, \quad \tau > 0$
(A3) $\beta = \bar A^\mu u$ for $\|u\|_{\mathcal{H}} \le R$, $\mu, R > 0$ (source condition).

Theorem (Nemirovskii, 1986). Let (A1)–(A3) hold and $\hat s < \infty$. Then for any $\theta \in [0, 1]$
$\|\bar A^\theta(\hat\beta_{\hat s} - \beta)\|_{\mathcal{H}}^2 \le C(\mu, \tau)\, R^{\frac{2(1-\theta)}{1+\mu}}\, (\delta + \epsilon RL^\mu)^{\frac{2(\theta+\mu)}{1+\mu}}.$
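
A sketch of the discrepancy-principle stopping rule wrapped around the CG iteration above, following (A2) as reconstructed here (the roles of $\delta$ and $\epsilon$ match (A1)). The inputs $A$, $b$, $\tau$, $\delta$, $\epsilon$ are assumed to be supplied by the user; this is an illustration of the stopping rule, not the authors' code.

```python
import numpy as np

def cg_with_discrepancy(A, b, tau, delta, eps, max_iter=100):
    """Run CG on A beta = b and stop at the first s with
    ||b - A beta_s|| < tau * (delta + eps * ||beta_s||)   (discrepancy principle)."""
    beta = np.zeros(A.shape[1])
    r = b.copy()
    p = r.copy()
    for s in range(1, max_iter + 1):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        beta = beta + alpha * p
        r_new = r - alpha * Ap
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
        if np.linalg.norm(b - A @ beta) < tau * (delta + eps * np.linalg.norm(beta)):
            return beta, s                # stopping index s_hat
    return beta, max_iter
```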

17. Kernel regression

A nonparametric model
$Y_i = f(X_i) + \epsilon_i, \quad i = 1, \ldots, n, \quad X_i \in \mathbb{R}^d,$
is handled in the reproducing kernel Hilbert space (RKHS) framework. Let $\mathcal{H}$ be an RKHS, that is,
• $(\mathcal{H}, \langle\cdot,\cdot\rangle_{\mathcal{H}})$ is a Hilbert space of functions $f: \mathbb{R}^d \to \mathbb{R}$ with
• a kernel function $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, s.t. $k(\cdot, x) \in \mathcal{H}$ and $f(x) = \langle f, k(\cdot, x)\rangle_{\mathcal{H}}$, $x \in \mathbb{R}^d$, $f \in \mathcal{H}$.

The unknown $f$ is estimated by $\hat f = \sum_{i=1}^n \hat\alpha_i k(\cdot, X_i)$.
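
A sketch of the representer form $\hat f = \sum_i \hat\alpha_i k(\cdot, X_i)$, assuming a Gaussian kernel; the kernel choice and the bandwidth are illustrative assumptions, not prescribed by the slides, and the coefficients `alpha` are taken as given (e.g. from kernel PLS later on).

```python
import numpy as np

def gauss_kernel(x, y, bw=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 bw^2)), a bounded kernel (illustrative choice)."""
    diff = x[:, None, :] - y[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=-1) / (2 * bw ** 2))

def f_hat(x_new, X, alpha, bw=1.0):
    """Evaluate f_hat(x) = sum_i alpha_i k(x, X_i) at the rows of x_new."""
    return gauss_kernel(x_new, X, bw) @ alpha
```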

18. Kernel regression

Define the following operators:
• Sample evaluation operator (analogue of $X$): $T_n: f \in \mathcal{H} \mapsto \{f(X_1), \ldots, f(X_n)\}^T \in \mathbb{R}^n$
• Sample kernel integral operator (analogue of $X^T/n$): $T_n^*: u \in \mathbb{R}^n \mapsto n^{-1}\sum_{i=1}^n k(\cdot, X_i)u_i \in \mathcal{H}$
• Sample kernel covariance operator (analogue of $X^TX/n$): $S_n = T_n^*T_n: f \in \mathcal{H} \mapsto n^{-1}\sum_{i=1}^n f(X_i)k(\cdot, X_i) \in \mathcal{H}$
• Sample kernel (analogue of $XX^T/n$): $K_n = T_nT_n^* = n^{-1}\{k(X_i, X_j)\}_{i,j=1}^n$
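
These finite-sample operators translate into a few lines of numpy once a kernel is fixed (a vectorised kernel like the Gaussian one sketched above is assumed). The outputs of $T_n^*$ and $S_n$ are returned as callables, since they are functions in $\mathcal{H}$; this is a didactic sketch, not the authors' implementation.

```python
import numpy as np

def make_sample_operators(X, kernel):
    """Discrete analogues of X, X^T/n, X^T X/n and X X^T/n from the slide."""
    n = X.shape[0]

    def T_n(f):                       # evaluation: f -> (f(X_1), ..., f(X_n))
        return np.array([f(x) for x in X])

    def T_n_star(u):                  # kernel integral: u -> n^{-1} sum_i u_i k(., X_i)
        return lambda x: kernel(x[None, :], X)[0] @ u / n

    def S_n(f):                       # covariance: f -> n^{-1} sum_i f(X_i) k(., X_i)
        return T_n_star(T_n(f))

    K_n = kernel(X, X) / n            # sample kernel matrix n^{-1} {k(X_i, X_j)}
    return T_n, T_n_star, S_n, K_n
```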

19. Kernel PLS and kernel CG

Now we can define the kernel PLS estimator as
$\hat\alpha_s = \arg\min_{\alpha \in \mathcal{K}_s(K_n, Y)} \|Y - K_n\alpha\|^2 = \arg\min_{\alpha \in \mathcal{K}_s(T_nT_n^*, Y)} \|Y - T_nT_n^*\alpha\|^2,$
or, equivalently, for $f = T_n^*\alpha$,
$\hat f_s = \arg\min_{f \in \mathcal{K}_s(S_n, T_n^*Y)} \|Y - T_nf\|^2, \quad s = 1, \ldots, n.$
The kernel CG estimator is then defined as
$\hat f_s^{CG} = \arg\min_{f \in \mathcal{K}_s(S_n, T_n^*Y)} \|T_n^*(Y - T_nf)\|_{\mathcal{H}}^2.$
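
In terms of the sample kernel matrix, the kernel PLS coefficients $\hat\alpha_s$ can be obtained by minimising $\|Y - K_n\alpha\|^2$ over the Krylov space $\mathcal{K}_s(K_n, Y)$, reusing the Krylov construction from the linear case. A sketch ($K_n$ and $Y$ assumed given, e.g. from the operator sketch above; illustration only, not a numerically hardened implementation):

```python
import numpy as np

def kernel_pls(K_n, Y, s):
    """alpha_s = argmin over K_s(K_n, Y) of ||Y - K_n alpha||^2;
    the fitted function is then f_s = T_n^* alpha_s = n^{-1} sum_i alpha_i k(., X_i)."""
    V = np.column_stack([np.linalg.matrix_power(K_n, j) @ Y for j in range(s)])
    Q, _ = np.linalg.qr(V)                       # basis of the Krylov space K_s(K_n, Y)
    gamma, *_ = np.linalg.lstsq(K_n @ Q, Y, rcond=None)
    return Q @ gamma
```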

20. Results for Kernel CG and PLS

Blanchard and Krämer (2010)
• used a stochastic setting with i.i.d. data $(Y_i, X_i)$
• proved convergence rates for kernel CG using ideas from Nemirovskii (1986), Hanke (1995), Caponnetto & de Vito (2007)
• argued that the proofs for kernel CG cannot be directly transferred to kernel PLS.

In this work we
• use a stochastic setting with dependent data
• prove convergence rates for kernel PLS, building on Hanke (1995) and Blanchard and Krämer (2010).

21. Kernel PLS: assumptions

Consider now the model specified for the protein data
$Y_t = f(X_t) + \epsilon_t, \quad t = 1, \ldots, n.$
Let $\mathcal{H}$ be an RKHS with kernel $k$ and assume
(C1) $\mathcal{H}$ is separable;
(C2) $\exists\ \kappa > 0$ s.t. $|k(x, y)| \le \kappa$ for all $x, y \in \mathbb{R}^d$, and $k$ is measurable.

Under (C1) the Hilbert–Schmidt norm of operators from $\mathcal{H}$ to $\mathcal{H}$ is well-defined, and (C2) implies that all functions in $\mathcal{H}$ are bounded.

22. Kernel PLS: assumptions

Let $T$ and $T^*$ be the population versions of $T_n$ and $T_n^*$:
$T: f \in \mathcal{H} \mapsto f \in L^2(P_{\tilde X})$
$T^*: f \in L^2(P_{\tilde X}) \mapsto \int f(x)\,k(\cdot, x)\,dP_{\tilde X}(x) \in \mathcal{H}.$
This yields population versions of $S_n$ and $K_n$:
$S = T^*T \quad \text{and} \quad K = TT^*.$
The operators $T$ and $T^*$ are adjoint, and $S$, $K$ are self-adjoint.

23. Kernel PLS: assumptions

As in Nemirovskii (1986) we use the source condition as an assumption on the regularity of $f$:
(SC) $\exists\ r \ge 0$, $R > 0$ and $u \in L^2(P_{\tilde X})$ s.t. $f = K^ru$ and $\|u\|_2 \le R$.

If $r \ge 1/2$, then $f \in L^2(P_{\tilde X})$ coincides a.s. with some $f_{\mathcal{H}} \in \mathcal{H}$ ($f = Tf_{\mathcal{H}}$). The setting with $r < 1/2$ is referred to as the outer case.

24. Kernel PLS: assumptions

Under suitable regularity conditions, by Mercer's theorem,
$k(x, y) = \sum_{i=1}^{\infty} \eta_i\,\phi_i(x)\,\phi_i(y)$
for an orthonormal basis $\{\phi_i\}_{i=1}^{\infty}$ of $L^2(P_{\tilde X})$ and eigenvalues $\eta_1 \ge \eta_2 \ge \ldots$. Hence,
$\mathcal{H} = \Big\{f: f = \sum_i \theta_i\phi_i \in L^2(P_{\tilde X}) \ \text{and}\ \sum_i \frac{\theta_i^2}{\eta_i} < \infty\Big\}.$
The source condition corresponds to $f \in \mathcal{H}_r$, where
$\mathcal{H}_r = \Big\{f: f = \sum_i \theta_i\phi_i \in L^2(P_{\tilde X}) \ \text{and}\ \sum_i \frac{\theta_i^2}{\eta_i^{2r}} \le R^2\Big\}.$
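
A short derivation (in LaTeX, assuming the Mercer expansion above) of why the source condition is exactly membership in $\mathcal{H}_r$: write $u = \sum_i u_i\phi_i$ with $\sum_i u_i^2 \le R^2$; applying $K^r$ multiplies each coefficient by $\eta_i^r$.

```latex
f = K^r u = \sum_i \eta_i^r u_i \, \phi_i
  \quad\Longrightarrow\quad
  \theta_i = \eta_i^r u_i
  \quad\Longrightarrow\quad
  \sum_i \frac{\theta_i^2}{\eta_i^{2r}} = \sum_i u_i^2 = \|u\|_2^2 \le R^2 .
```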
