

  1. Scalable Gaussian processes with a twist of Probabilistic Numerics. Kurt Cutajar, EURECOM, Sophia Antipolis, France. Data Science Meetup, October 30th 2017

  2. Agenda
  • Kernel Methods
  • Scalable Gaussian Processes (using Preconditioning)
  • Probabilistic Numerics

  3. Kernel Methods
  • Operate in a high-dimensional, implicit feature space
  • Rely on the construction of an n × n Gram matrix K
  • E.g. RBF: k(x_i, x_j) = σ² exp(−½ d²), where d² = (x_i − x_j)^⊤ Λ (x_i − x_j)
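To make the Gram-matrix construction concrete, here is a minimal NumPy sketch of the ARD-RBF kernel above (the function and variable names are mine, not from the slides; Λ is assumed to be a diagonal matrix of inverse squared lengthscales):

```python
import numpy as np

def rbf_gram(X, variance=1.0, lengthscales=None):
    """ARD-RBF Gram matrix: K[i, j] = variance * exp(-0.5 * d2),
    with d2 = (x_i - x_j)^T Lambda (x_i - x_j) and Lambda = diag(1 / lengthscale^2)."""
    n, d = X.shape
    if lengthscales is None:
        lengthscales = np.ones(d)
    Z = X / lengthscales                                    # rescale each input dimension
    sq = np.sum(Z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T          # pairwise squared distances
    return variance * np.exp(-0.5 * np.maximum(d2, 0.0))    # clip tiny negatives from round-off

X = np.random.randn(500, 3)
K = rbf_gram(X, lengthscales=np.array([1.0, 2.0, 0.5]))
print(K.shape)  # (500, 500): the n x n Gram matrix is what limits scalability
```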

  4. Kernel Methods • Wide variety of kernel functions available [Figure taken from David Duvenaud's PhD Thesis]

  5. Kernel Methods • Choice is not always straightforward! [Figure taken from David Duvenaud's PhD Thesis]

  6. All About that Bass Bayes
  • posterior = (likelihood × prior) / marginal likelihood
  • p(par | X, y) = p(y | X, par) p(par) / p(y | X)

  7. All About that Bass Bayes - Making Predictions
  • We average over all possible parameter values, weighted by their posterior probability
  • p(y* | x*, X, y) = ∫ p(y* | x*, par) p(par | X, y) d par = N(E[y*], V[y*])
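Purely as an illustration of "averaging over parameter values" (this sketch is not from the slides), a Monte Carlo approximation of the predictive integral, assuming we can draw samples from the posterior p(par | X, y) and evaluate the per-sample predictive moments:

```python
import numpy as np

def mc_predictive(x_star, posterior_samples, predictive_mean_var):
    """Approximate p(y* | x*, X, y) = integral of p(y* | x*, par) p(par | X, y) d par
    by averaging over posterior draws of the parameters."""
    moments = np.array([predictive_mean_var(x_star, par) for par in posterior_samples])
    means, variances = moments[:, 0], moments[:, 1]
    mean = means.mean()                       # law of total expectation
    var = variances.mean() + means.var()      # law of total variance
    return mean, var                          # first two moments of the predictive mixture
```

Here the mixture over parameter draws is simply summarised by its first two moments, matching the N(E[y*], V[y*]) form on the slide.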

  8. Gaussian Processes

  9. Gaussian Processes - Prior Distribution over Functions [Figure: samples drawn from the GP prior with covariance K∞; axes: input vs. label]

  10. Gaussian Processes - Conditioned on Observations [Figure: prior samples with the observed data points overlaid; prior covariance K∞]

  11. Gaussian Processes - Posterior Distribution over Functions [Figure: posterior samples passing through the observations; posterior covariance K_y]

  12. Gaussian Processes - GP regression example [Figure: three panels showing the GP prior (K∞), the observed data, and the inference result (K_y); axes: input vs. label]
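The conditioning step illustrated above has a closed form; here is a minimal NumPy sketch of exact GP regression (my own implementation choices: a plain RBF kernel and a Cholesky factorisation of K_y = K + noise·I):

```python
import numpy as np
from numpy.linalg import cholesky, solve

def rbf(X1, X2, variance=1.0, lengthscale=1.0):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, X_star, noise_var=1e-2):
    """Posterior mean K_*^T K_y^{-1} y and covariance K_** - K_*^T K_y^{-1} K_*."""
    K_y = rbf(X, X) + noise_var * np.eye(len(X))
    K_s, K_ss = rbf(X, X_star), rbf(X_star, X_star)
    L = cholesky(K_y)                       # the O(n^3) step this talk tries to avoid
    alpha = solve(L.T, solve(L, y))         # K_y^{-1} y
    V = solve(L, K_s)
    return K_s.T @ alpha, K_ss - V.T @ V

X = np.linspace(-4, 4, 20)[:, None]
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(20)
mean, cov = gp_posterior(X, y, np.linspace(-4, 4, 100)[:, None])
```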

  13. Bayesian Learning vs Deep Learning
  • Deep Learning
    + Scalable to very large datasets
    + Increased model flexibility/capacity
    − Frequentist approaches make only point estimates
    − Less robust to overfitting
  • Bayesian Learning
    + Incorporates uncertainty in predictions
    + Works well with smaller datasets
    − Lack of conjugacy necessitates approximation
    − Expensive computational and storage requirements

  14. Bayesian Learning vs Deep Learning - Deep Gaussian Processes
  • Deep probabilistic models
  • Composition of functions: f(x) = (h^(N_h − 1)(·; θ^(N_h − 1)) ∘ ... ∘ h^(0)(·; θ^(0)))(x)
  [Figure: stacked layers h^(0)(x), h^(1)(·), ... illustrating the composition]
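A toy sketch of the composition idea only (this is not the inference scheme of the paper cited two slides below): each layer is stood in for by a random nonlinear feature map, and the deep model is their composition f = h^(2) ∘ h^(1) ∘ h^(0).

```python
import numpy as np

def make_layer(d_in, d_out, rng):
    """Stand-in for one layer h^(l)(.; theta^(l)): a random affine map plus a nonlinearity."""
    W = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
    b = rng.uniform(0.0, 2.0 * np.pi, d_out)
    return lambda X: np.cos(X @ W + b)

rng = np.random.default_rng(0)
layers = [make_layer(2, 50, rng), make_layer(50, 50, rng), make_layer(50, 1, rng)]

def f(X):
    for h in layers:          # f(x) = (h^(2) o h^(1) o h^(0))(x)
        X = h(X)
    return X

print(f(rng.standard_normal((5, 2))).shape)  # (5, 1)
```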

  15. Bayesian Learning vs Deep Learning - Deep Gaussian Processes
  • Inference requires calculating the marginal likelihood:
    p(Y | X, θ) = ∫ p(Y | F^(N_h), θ^(N_h)) × p(F^(N_h) | F^(N_h − 1), θ^(N_h − 1)) × ... × p(F^(1) | X, θ^(0)) dF^(N_h) ... dF^(1)
  • Very challenging!

  16. Bayesian Learning vs Deep Learning - Deep Gaussian Processes
  [Figure: graphical model of a deep GP with random feature expansions; nodes X, Φ^(0), F^(1), Φ^(1), F^(2), Y and parameters Ω^(0), W^(0), Ω^(1), W^(1), θ^(0), θ^(1)]
  • Cutajar et al., Random Feature Expansions for Deep Gaussian Processes, ICML 2017
  • Yarin Gal, Bayesian Deep Learning, PhD Thesis

  17. Scalable Gaussian Processes

  18. Gaussian Processes
  • Marginal likelihood: log p(y | par) = −½ y^⊤ K_y^{−1} y − ½ log|K_y| + const.
  • Derivatives wrt par:
    ∂ log p(y | par) / ∂par_i = −½ Tr(K_y^{−1} ∂K_y/∂par_i) + ½ y^⊤ K_y^{−1} (∂K_y/∂par_i) K_y^{−1} y
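A minimal NumPy sketch of these two formulas for a single kernel parameter, using the exact Cholesky-based computation that the rest of the talk tries to avoid (function names are mine):

```python
import numpy as np
from numpy.linalg import cholesky, solve

def log_marginal_and_grad(y, K_y, dK_dpar):
    """log p(y | par) = -0.5 y^T K_y^{-1} y - 0.5 log|K_y| + const, and
    d/dpar = -0.5 Tr(K_y^{-1} dK) + 0.5 y^T K_y^{-1} dK K_y^{-1} y."""
    n = len(y)
    L = cholesky(K_y)                              # O(n^3)
    alpha = solve(L.T, solve(L, y))                # K_y^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))     # log|K_y| from the Cholesky factor
    log_marg = -0.5 * y @ alpha - 0.5 * log_det - 0.5 * n * np.log(2.0 * np.pi)
    K_inv = solve(L.T, solve(L, np.eye(n)))        # explicit inverse, only for the trace term
    grad = -0.5 * np.trace(K_inv @ dK_dpar) + 0.5 * alpha @ dK_dpar @ alpha
    return log_marg, grad
```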

  19. Gaussian Processes - Stochastic Trace Estimation [Figure taken from Shakir Mohamed's Machine Learning Blog]

  20. Gaussian Processes - Stochastic Gradients
  • Stochastic estimate of the trace: assuming E[r r^⊤] = I, then
    Tr(K_y^{−1} ∂K_y/∂par_i) = Tr(K_y^{−1} (∂K_y/∂par_i) E[r r^⊤]) = E[r^⊤ K_y^{−1} (∂K_y/∂par_i) r]
  • Stochastic gradient (with N_r probe vectors r^(j)):
    −(1 / (2 N_r)) Σ_{j=1}^{N_r} r^(j)⊤ K_y^{−1} (∂K_y/∂par_i) r^(j) + ½ y^⊤ K_y^{−1} (∂K_y/∂par_i) K_y^{−1} y
  • Linear systems only!
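A small NumPy sketch of the trace estimator itself, with Rademacher probes so that E[r r^⊤] = I (my own toy example). In the GP setting the matrix would be K_y^{−1} ∂K_y/∂par_i, applied by solving one linear system per probe vector, which is why only linear system solves are needed:

```python
import numpy as np

def hutchinson_trace(matvec, n, n_probes=100, rng=None):
    """Estimate Tr(A) by (1/N_r) * sum_j r_j^T (A r_j), with E[r r^T] = I."""
    rng = rng or np.random.default_rng(0)
    total = 0.0
    for _ in range(n_probes):
        r = rng.choice([-1.0, 1.0], size=n)    # Rademacher probe vector
        total += r @ matvec(r)
    return total / n_probes

A = np.random.randn(200, 200)
A = A @ A.T                                    # a symmetric test matrix
print(np.trace(A), hutchinson_trace(lambda r: A @ r, 200, n_probes=500))  # close, not exact
```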

  23. Solving Linear Systems
  • Involve the solution of linear systems K z = v
  • Cholesky Decomposition
    • O(n²) space and O(n³) time - unfeasible for large n
    • K must be stored in memory!
  • Conjugate Gradient
    • Numerical solution of linear systems
    • O(t n²) for t CG iterations - in theory t = n (possibly worse!)
  [Figure: CG iterates moving from z0 towards the solution z]
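A sketch of solving K z = v with conjugate gradient via SciPy; CG only ever touches K through matrix-vector products, so K could even be applied matrix-free (the kernel and sizes here are placeholders of my own choosing):

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

n = 800
X = np.random.randn(n, 2)
sq = np.sum(X**2, 1)
K = np.exp(-0.5 * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T)) + 1e-2 * np.eye(n)
v = np.random.randn(n)

K_op = LinearOperator((n, n), matvec=lambda z: K @ z)   # CG only needs the product K @ z
z, info = cg(K_op, v)
print(info, np.linalg.norm(K @ z - v))                  # info == 0 means CG reported convergence
```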

  24. Solving Linear Systems
  • Preconditioned Conjugate Gradient (henceforth PCG)
  • Transforms linear system to be better conditioned, improving convergence
  • Yields a new linear system of the form P^{−1} K z = P^{−1} v
  • O(t n²) for t PCG iterations - in practice t ≪ n
  [Figure: convergence paths of CG vs PCG from z0 to the solution z]
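A sketch of the PCG mechanics with SciPy, whose M argument applies P^{−1}. For simplicity this uses a plain diagonal (Jacobi) preconditioner, which does little for a dense RBF kernel whose diagonal is nearly constant; the low-rank preconditioners on the following slides are designed to do much better. The toy kernel matrix is the same as in the previous sketch:

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

n = 800
X = np.random.randn(n, 2)
sq = np.sum(X**2, 1)
K_y = np.exp(-0.5 * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T)) + 1e-2 * np.eye(n)
v = np.random.randn(n)

iters = {"CG": 0, "PCG": 0}
def counter(key):
    def cb(xk):                 # called once per iteration with the current iterate
        iters[key] += 1
    return cb

z_cg, _ = cg(K_y, v, callback=counter("CG"))                        # plain CG
P_inv = LinearOperator((n, n), matvec=lambda r: r / np.diag(K_y))   # Jacobi: P = diag(K_y)
z_pcg, _ = cg(K_y, v, M=P_inv, callback=counter("PCG"))             # preconditioned CG
print(iters)
```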

  25. Preconditioning Approaches
  • Suppose we want to precondition K_y = K + λ I
  • Our choice of preconditioner, P, should:
    • Be easy to invert
    • Approximate K_y as closely as possible
  • For low-rank preconditioners we employ the Woodbury inversion lemma:
    (A + U C V)^{−1} = A^{−1} − A^{−1} U (C^{−1} + V A^{−1} U)^{−1} V A^{−1}
  • For other preconditioners we solve inner linear systems once again using CG!
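A sketch of the Woodbury trick for a low-rank-plus-noise preconditioner of the form P = K_XU K_UU^{−1} K_UX + λI (the Nyström choice listed on the next slide); applying P^{−1} this way costs O(n m²) per PCG iteration instead of O(n³). Names and the jitter value are my own choices:

```python
import numpy as np
from numpy.linalg import cholesky, solve
from scipy.sparse.linalg import LinearOperator

def nystrom_precond(K_XU, K_UU, lam):
    """LinearOperator applying P^{-1} for P = K_XU K_UU^{-1} K_UX + lam*I via Woodbury:
    P^{-1} r = r/lam - (1/lam) K_XU (lam*K_UU + K_UX K_XU)^{-1} K_UX r."""
    n, m = K_XU.shape
    inner = lam * K_UU + K_XU.T @ K_XU            # m x m matrix, cheap to factorise
    L = cholesky(inner + 1e-10 * np.eye(m))       # small jitter for numerical safety
    def apply_P_inv(r):
        t = solve(L.T, solve(L, K_XU.T @ r))
        return r / lam - (K_XU @ t) / lam
    return LinearOperator((n, n), matvec=apply_P_inv)

# Usage with PCG on K_y z = v (K is the noise-free Gram matrix, idx indexes m inducing points):
# P_inv = nystrom_precond(K[:, idx], K[np.ix_(idx, idx)], lam=1e-2)
# z, info = cg(K_y, v, M=P_inv)
```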

  28. Preconditioning Approaches
  • Nyström: P = K_XU K_UU^{−1} K_UX + λ I, where U ⊂ X
  • FITC: P = K_XU K_UU^{−1} K_UX + diag(K − K_XU K_UU^{−1} K_UX) + λ I
  • PITC: P = K_XU K_UU^{−1} K_UX + bldiag(K − K_XU K_UU^{−1} K_UX) + λ I
  • Spectral: P_ij = σ² Σ_{r=1}^{m} cos[2π s_r^⊤ (x_i − x_j)] + λ I_ij
  • Partial SVD: K = A Λ A^⊤ ⇒ P = A[·, 1:m] Λ[1:m, 1:m] A^⊤[1:m, ·] + λ I
  • Block Jacobi: P = bldiag(K) + λ I
  • SKI: P = W K_UU W^⊤ + λ I, where K_UU is Kronecker
  • Regularization: P = K + λ I + δ I
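For concreteness, a sketch of two of the simpler entries in the list, block Jacobi and regularization, built here as dense matrices (in practice only the action of P^{−1} is ever needed, which is where the real savings come from; function names are mine):

```python
import numpy as np
from scipy.linalg import block_diag

def block_jacobi_precond(K, lam, block_size):
    """P = bldiag(K) + lam*I: keep only the diagonal blocks of K, invertible block by block."""
    n = K.shape[0]
    blocks = [K[i:i + block_size, i:i + block_size] for i in range(0, n, block_size)]
    return block_diag(*blocks) + lam * np.eye(n)

def regularization_precond(K, lam, delta):
    """P = K + lam*I + delta*I: the extra jitter delta makes the preconditioning system easier."""
    return K + (lam + delta) * np.eye(K.shape[0])
```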

  29. Comparison of Preconditioners vs CG [Figure: convergence of the different preconditioners compared against plain CG]

  30. Experimental Setup - GP Kernel Parameter Optimization
  • Exact gradient-based optimization using Cholesky decomposition (CHOL)
  • Stochastic gradient-based optimization
    • Linear systems solved with CG and PCG
  • GP Approximations
    • Variational learning of inducing variables (VAR)
    • Fully Independent Training Conditional (FITC)
    • Partially Independent Training Conditional (PITC)

  31. Results - ARD Kernel
  [Figure: RMSE / error rate and negative test log-likelihood against log10(seconds) for PCG, CG, CHOL, FITC, PITC and VAR on four datasets]
  • Regression: Power plant (n = 9568, d = 4), Protein (n = 45730, d = 9)
  • Classification: Spam (n = 4061, d = 57), EEG (n = 14979, d = 14)

  32. Follow-up Work
  • Faster Kernel Ridge Regression Using Sketching and Preconditioning, Avron et al. (2017)
  • FALKON: An Optimal Large Scale Kernel Method, Rosasco et al. (2017)
  • Large Linear Multi-output Gaussian Process Learning for Time Series, Feinberg et al. (2017)
  • Scaling up the Automatic Statistician: Scalable Structure Discovery using Gaussian Processes, Kim et al. (2017)

  33. Follow-up work ... but what's left to do now?

  35. Probabilistic Numerics
