High Dimensional Bayesian Optimisation and Bandits via Additive Models



1. High Dimensional Bayesian Optimisation and Bandits via Additive Models. Kirthevasan Kandasamy, Jeff Schneider, Barnabás Póczos. ICML '15, July 8, 2015.

2. Bandits & Optimisation. Motivating example: maximum likelihood inference in computational astrophysics, where a cosmological simulator maps parameters (e.g. the Hubble constant, baryonic density) to an observation.


3. Bandits & Optimisation: optimising an expensive black-box function.

4. Examples: hyper-parameter tuning in ML; optimal control strategies in robotics.

5. Setup: $f : [0,1]^D \to \mathbb{R}$ is an expensive, black-box, nonconvex function. Let $x_* = \operatorname{argmax}_x f(x)$. [Figure: a one-dimensional $f(x)$ with the maximiser $x_*$ and the maximum $f(x_*)$ marked.]


6. Optimisation $\cong$ minimise the simple regret: $S_T = f(x_*) - \max_{t=1,\dots,T} f(x_t)$.

7. Bandits $\cong$ minimise the cumulative regret: $R_T = \sum_{t=1}^{T} \big( f(x_*) - f(x_t) \big)$.

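To make the two regret notions concrete, here is a minimal Python sketch (the function names are illustrative, not from the paper) computing both from a sequence of observed values:

```python
import numpy as np

def simple_regret(f_star, f_values):
    """S_T = f(x*) - max_t f(x_t): shortfall of the best query found so far."""
    return f_star - np.max(f_values)

def cumulative_regret(f_star, f_values):
    """R_T = sum_t (f(x*) - f(x_t)): total shortfall accumulated over all queries."""
    return np.sum(f_star - np.asarray(f_values))
```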

8. Gaussian Process (Bayesian) Optimisation. Model $f \sim \mathcal{GP}(0, \kappa)$. [Figure: samples from the GP prior on $[0,1]$.]

9. Condition on the observed evaluations to obtain the posterior GP. [Figure: posterior mean with confidence band.]
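A minimal sketch of this model-then-condition step, using scikit-learn's GP regressor as a stand-in (the paper does not prescribe a particular library, and the data here are synthetic):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Model f ~ GP(0, kappa) with a squared-exponential (RBF) kernel.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1), alpha=1e-6)

# Condition on the evaluations gathered so far to obtain the posterior GP.
X_obs = np.array([[0.1], [0.4], [0.9]])   # query points x_1, ..., x_t
y_obs = np.sin(10 * X_obs).ravel()        # synthetic stand-in for the expensive f
gp.fit(X_obs, y_obs)

# Posterior mean mu_{t-1}(x) and std sigma_{t-1}(x) on a test grid.
X_test = np.linspace(0, 1, 100).reshape(-1, 1)
mu, sigma = gp.predict(X_test, return_std=True)
```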

10. Maximise an acquisition function $\varphi_t$ and query its maximiser: $x_t = \operatorname{argmax}_x \varphi_t(x)$. GP-UCB: $\varphi_t(x) = \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x)$ (Srinivas et al. 2010). [Figure: $\varphi_t(x)$ with its maximiser $x_t = 0.828$ marked.]

11. Other choices of $\varphi_t$: Expected Improvement (GP-EI), Thompson sampling, etc.
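Continuing the sketch above, the GP-UCB selection rule fits in a few lines; a dense grid stands in for the inner optimisation of $\varphi_t$ (any global optimiser could be substituted):

```python
def ucb_acquisition(gp, X_cand, beta_t):
    """GP-UCB acquisition: phi_t(x) = mu_{t-1}(x) + sqrt(beta_t) * sigma_{t-1}(x)."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    return mu + np.sqrt(beta_t) * sigma

# Next query point: maximise phi_t over a dense grid on [0, 1].
X_cand = np.linspace(0, 1, 1000).reshape(-1, 1)
x_next = X_cand[np.argmax(ucb_acquisition(gp, X_cand, beta_t=4.0))]
```

The value `beta_t = 4.0` is only illustrative; the theory prescribes a schedule for $\beta_t$ that grows slowly with $t$.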

12. Scaling to Higher Dimensions. Two key challenges:
- Statistical difficulty: nonparametric sample complexity is exponential in $D$.
- Computational difficulty: optimising $\varphi_t$ to within $\zeta$ accuracy requires $O(\zeta^{-D})$ effort.

13. Existing work:
- Chen et al. 2012: $f$ depends on a small number of variables; find those variables, then run GP-UCB.
- Wang et al. 2013: $f$ varies along a lower-dimensional subspace; run GP-EI on a random subspace.
- Djolonga et al. 2013: $f$ varies along a lower-dimensional subspace; find the subspace, then run GP-UCB.

14. In short, existing work (Chen et al. 2012; Wang et al. 2013; Djolonga et al. 2013) assumes $f$ varies only along a low-dimensional subspace and performs BO on that subspace. This assumption is too strong in realistic settings.

15. Additive Functions. Structural assumption: $f(x) = f^{(1)}(x^{(1)}) + f^{(2)}(x^{(2)}) + \dots + f^{(M)}(x^{(M)})$, where $x^{(j)} \in \mathcal{X}^{(j)} = [0,1]^d$, the groups are disjoint ($x^{(i)} \cap x^{(j)} = \emptyset$), and $d \ll D$.

16. E.g. $f(x_{\{1,\dots,10\}}) = f^{(1)}(x_{\{1,3,9\}}) + f^{(2)}(x_{\{2,4,8\}}) + f^{(3)}(x_{\{5,6,10\}})$. Call $\{\mathcal{X}^{(j)}\}_{j=1}^{M} = \{(1,3,9), (2,4,8), (5,6,10)\}$ the "decomposition".
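A decomposition is just a partition of the coordinate indices into disjoint groups. A minimal sketch of this structure (names are illustrative; the components stand for arbitrary callables on $d$-dimensional arrays):

```python
import numpy as np

# The example decomposition above, with coordinates 0-indexed.
decomposition = [(0, 2, 8), (1, 3, 7), (4, 5, 9)]

def additive_f(x, components, decomposition):
    """f(x) = sum_j f^(j)(x^(j)), where x^(j) picks out group j's coordinates."""
    return sum(f_j(x[list(group)])
               for f_j, group in zip(components, decomposition))
```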

17. Assume each $f^{(j)} \sim \mathcal{GP}(0, \kappa^{(j)})$. Then $f \sim \mathcal{GP}(0, \kappa)$, where $\kappa(x, x') = \kappa^{(1)}(x^{(1)}, x^{(1)\prime}) + \dots + \kappa^{(M)}(x^{(M)}, x^{(M)\prime})$.

18. Given observations $(X, Y) = \{(x_i, y_i)\}_{i=1}^{T}$ and a test point $x_\dagger$, each component has a Gaussian posterior: $f^{(j)}(x^{(j)}_\dagger) \mid X, Y \sim \mathcal{N}\big(\mu^{(j)}, \sigma^{(j)2}\big)$.
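A hand-rolled NumPy sketch of the additive squared-exponential kernel makes the structure explicit (a GP library's restrict-kernel-to-a-subset-of-dimensions feature would serve equally well):

```python
import numpy as np

def se_kernel(X1, X2, h=0.2, A=1.0):
    """Squared-exponential kernel A * exp(-||x - x'||^2 / (2 h^2))."""
    sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return A * np.exp(-sq_dists / (2 * h ** 2))

def additive_kernel(X1, X2, decomposition):
    """kappa(x, x') = sum_j kappa^(j)(x^(j), x'^(j)): each term sees only its group."""
    return sum(se_kernel(X1[:, list(group)], X2[:, list(group)])
               for group in decomposition)
```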

19. Outline: 1. GP-UCB. 2. The Add-GP-UCB algorithm (bounds on $S_T$: exponential in $D$ → linear in $D$; an easy-to-optimise acquisition function; performs well even when $f$ is not additive). 3. Experiments. 4. Conclusion & some open questions.

20. GP-UCB: $x_t = \operatorname{argmax}_{x \in \mathcal{X}} \, \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x)$.

21. Squared-exponential (SE) kernel: $\kappa(x, x') = A \exp\big(-\|x - x'\|^2 / (2h^2)\big)$. Theorem (Srinivas et al. 2010): let $f \sim \mathcal{GP}(0, \kappa)$; then w.h.p. $S_T \in O\big(\sqrt{D^D (\log T)^D / T}\big)$.

22. GP-UCB on additive $\kappa$: suppose $f \sim \mathcal{GP}(0, \kappa)$ where $\kappa(x, x') = \kappa^{(1)}(x^{(1)}, x^{(1)\prime}) + \dots + \kappa^{(M)}(x^{(M)}, x^{(M)\prime})$ and each $\kappa^{(j)}$ is an SE kernel.

23. Can be shown: if each $\kappa^{(j)}$ is an SE kernel, $S_T \in O\big(\sqrt{D^2 d^d (\log T)^d / T}\big)$.

24. But $\varphi_t = \mu_{t-1} + \beta_t^{1/2} \sigma_{t-1}$ is still $D$-dimensional!

25. Add-GP-UCB: $\tilde{\varphi}_t(x) = \sum_{j=1}^{M} \big( \mu^{(j)}_{t-1}(x^{(j)}) + \beta_t^{1/2} \sigma^{(j)}_{t-1}(x^{(j)}) \big)$.

26. Each summand $\tilde{\varphi}^{(j)}_t(x^{(j)}) = \mu^{(j)}_{t-1}(x^{(j)}) + \beta_t^{1/2} \sigma^{(j)}_{t-1}(x^{(j)})$ depends only on its own group, so maximise each $\tilde{\varphi}^{(j)}_t$ separately. This requires only $O(\mathrm{poly}(D)\, \zeta^{-d})$ effort (vs $O(\zeta^{-D})$ for GP-UCB).

27. Theorem: let $f^{(j)} \sim \mathcal{GP}(0, \kappa^{(j)})$ and $f = \sum_j f^{(j)}$. Then w.h.p. $S_T \in O\big(\sqrt{D^2 d^d (\log T)^d / T}\big)$.
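A sketch of one Add-GP-UCB selection step, under a simplifying assumption: each group $j$ is modelled by its own fitted $d$-dimensional regressor with a `predict(X, return_std=True)` interface (as in the earlier sketches), whereas in the paper all components share one joint posterior. The point being illustrated is that each inner maximisation runs over $[0,1]^d$, not $[0,1]^D$:

```python
import numpy as np

def add_gp_ucb_step(group_gps, decomposition, beta_t, n_grid=50):
    """Maximise each phi^(j) over its own d-dimensional group, then
    assemble the full D-dimensional query from the group maximisers."""
    D = sum(len(group) for group in decomposition)  # groups partition {0,...,D-1}
    x_next = np.empty(D)
    for gp_j, group in zip(group_gps, decomposition):
        # Grid over [0,1]^d for this group only: O(n_grid^d) points, not O(n_grid^D).
        axes = np.meshgrid(*[np.linspace(0, 1, n_grid)] * len(group))
        X_cand = np.stack([a.ravel() for a in axes], axis=1)
        mu, sigma = gp_j.predict(X_cand, return_std=True)
        x_next[list(group)] = X_cand[np.argmax(mu + np.sqrt(beta_t) * sigma)]
    return x_next
```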

28. Summary of theoretical results (for the SE kernel):
- GP-UCB with no assumption on $f$: $S_T \in O\big(D^{D/2} (\log T)^{D/2}\, T^{-1/2}\big)$; maximising $\varphi_t$ takes $O(\zeta^{-D})$ effort.
- GP-UCB on additive $f$: $S_T \in O\big(D\, T^{-1/2}\big)$; maximising $\varphi_t$ takes $O(\zeta^{-D})$ effort.
- Add-GP-UCB on additive $f$: $S_T \in O\big(D\, T^{-1/2}\big)$; maximising $\tilde{\varphi}_t$ takes $O(\mathrm{poly}(D)\, \zeta^{-d})$ effort.

29. Add-GP-UCB on a 2-D example: $f(x_{\{1,2\}}) = f^{(1)}(x_{\{1\}}) + f^{(2)}(x_{\{2\}})$. [Figure: contours of the objective over $[0,1]^2$, with the components $f^{(1)}(x_{\{1\}})$ and $f^{(2)}(x_{\{2\}})$ plotted along the axes.]



30. The two one-dimensional acquisitions are maximised separately: $\tilde{\varphi}^{(1)}(x_{\{1\}})$ gives $x^{(1)}_t = 0.869$ and $\tilde{\varphi}^{(2)}(x_{\{2\}})$ gives $x^{(2)}_t = 0.141$; together they form the next query $x_t$. [Figure: contours with the acquisition curves $\tilde{\varphi}^{(1)}$ and $\tilde{\varphi}^{(2)}$ along the axes and the selected coordinates marked.]
