
Multi-fidelity Bayesian Optimisation
Kirthevasan Kandasamy, Carnegie Mellon University
Facebook Inc., Menlo Park, CA. September 26, 2017.
Slides: www.cs.cmu.edu/~kkandasa/talks/fb-mf-slides.pdf


  1. Multi-fidelity Bandit Optimisation in 2 Fidelities (1 Approximation) (Kandasamy et al. NIPS 2016b)
[Figure: $f^{(2)}$, $f^{(1)}$ and the optimum $x_\star$]
- Optimise $f = f^{(2)}$: find $x_\star = \operatorname{argmax}_x f^{(2)}(x)$.
- But we have an approximation $f^{(1)}$ to $f^{(2)}$.
- $f^{(1)}$ costs $\lambda^{(1)}$, $f^{(2)}$ costs $\lambda^{(2)}$, with $\lambda^{(1)} < \lambda^{(2)}$. The "cost" could be computation time, money, etc.
- $f^{(1)}, f^{(2)} \sim \mathcal{GP}(0, \kappa)$.
- $\| f^{(2)} - f^{(1)} \|_\infty \le \zeta^{(1)}$, where $\zeta^{(1)}$ is known.

  2. Multi-fidelity Bandit Optimisation in 2 Fidelities (1 Approximation) (Kandasamy et al. NIPS 2016b)
[Figure: $f^{(2)}$, $f^{(1)}$ and $x_\star$]
At time $t$: determine the point $x_t \in \mathcal{X}$ and fidelity $m_t \in \{1, 2\}$ for querying.

  3. Multi-fidelity Bandit Optimisation in 2 Fidelities (1 Approximation) (Kandasamy et al. NIPS 2016b)
[Figure: $f^{(2)}$, $f^{(1)}$ and $x_\star$]
At time $t$: determine the point $x_t \in \mathcal{X}$ and fidelity $m_t \in \{1, 2\}$ for querying.
End Goal: maximise $f^{(2)}$; we do not care about the maximum of $f^{(1)}$.

  4. Multi-fidelity Bandit Optimisation in 2 Fidelities (1 Approximation) (Kandasamy et al. NIPS 2016b)
[Figure: $f^{(2)}$, $f^{(1)}$ and $x_\star$]
At time $t$: determine the point $x_t \in \mathcal{X}$ and fidelity $m_t \in \{1, 2\}$ for querying.
End Goal: maximise $f^{(2)}$; we do not care about the maximum of $f^{(1)}$.
Simple Regret: $S(\Lambda) = f^{(2)}(x_\star) - \max_{t:\, m_t = 2} f^{(2)}(x_t)$.
Capital $\Lambda$ ← the amount of the resource spent, e.g. seconds or dollars.

  5. Multi-fidelity Bandit Optimisation in 2 Fidelities (1 Approximation) (Kandasamy et al. NIPS 2016b)
[Figure: $f^{(2)}$, $f^{(1)}$ and $x_\star$]
At time $t$: determine the point $x_t \in \mathcal{X}$ and fidelity $m_t \in \{1, 2\}$ for querying.
End Goal: maximise $f^{(2)}$; we do not care about the maximum of $f^{(1)}$.
Simple Regret: $S(\Lambda) = f^{(2)}(x_\star) - \max_{t:\, m_t = 2} f^{(2)}(x_t)$.
Capital $\Lambda$ ← the amount of the resource spent, e.g. seconds or dollars.
No reward for querying $f^{(1)}$, but cheap evaluations are used to guide the search for $x_\star$ at $f^{(2)}$.
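The following is a minimal Python sketch, not from the talk, of the bookkeeping this definition implies: every query at either fidelity consumes capital, but only fidelity-2 queries count towards the reported value. The objective functions, costs and the random query strategy are hypothetical placeholders.

```python
import numpy as np

# Hypothetical two-fidelity setup: f2 stands in for the expensive target f^(2),
# f1 for a cheap approximation f^(1). Neither comes from the talk.
f2 = lambda x: -(x - 0.7) ** 2
f1 = lambda x: -(x - 0.5) ** 2
costs = {1: 1.0, 2: 10.0}            # lambda^(1) < lambda^(2)
f2_max = f2(0.7)                     # f^(2)(x_star), known here only to report regret

budget, spent = 100.0, 0.0           # capital Lambda
best_fidelity2_value = -np.inf
rng = np.random.default_rng(0)

while spent < budget:
    x_t = rng.uniform(0.0, 1.0)                  # stand-in for the optimiser's choice of x_t
    m_t = 1 if rng.uniform() < 0.7 else 2        # stand-in for the fidelity choice m_t
    spent += costs[m_t]
    if m_t == 1:
        _ = f1(x_t)                  # cheap query: informs the search but earns no reward
    else:
        best_fidelity2_value = max(best_fidelity2_value, f2(x_t))

simple_regret = f2_max - best_fidelity2_value
print(f"capital spent: {spent:.0f}, simple regret S(Lambda): {simple_regret:.4f}")
```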

  6. Challenges
[Figure: the target $f^{(2)} = f$ and its optimum $x_\star$]

  7. Challenges
[Figure: $f^{(2)}$ with the band $f^{(2)} \pm \zeta^{(1)}$]

  8. Challenges
[Figure: $f^{(2)}$, $f^{(1)}$ and $x_\star$]

  9. Challenges
[Figure: $f^{(2)}$, $f^{(1)}$ and $x_\star$]
- $f^{(1)}$ is not just a noisy version of $f^{(2)}$.

  10. Challenges
[Figure: $f^{(2)}$, $f^{(1)}$, $x_\star$ and $x^{(1)}_\star$]
- $f^{(1)}$ is not just a noisy version of $f^{(2)}$.
- We cannot just maximise $f^{(1)}$: its maximiser $x^{(1)}_\star$ is suboptimal for $f^{(2)}$.


  14. Challenges
[Figure: $f^{(2)}$, $f^{(1)}$, $x_\star$ and $x^{(1)}_\star$]
- $f^{(1)}$ is not just a noisy version of $f^{(2)}$.
- We cannot just maximise $f^{(1)}$: its maximiser $x^{(1)}_\star$ is suboptimal for $f^{(2)}$.
- We need to explore $f^{(2)}$ sufficiently well around the high-valued regions of $f^{(1)}$, but over a region that is not too large.


  16. Challenges
[Figure: $f^{(2)}$, $f^{(1)}$, $x_\star$ and $x^{(1)}_\star$]
- $f^{(1)}$ is not just a noisy version of $f^{(2)}$.
- We cannot just maximise $f^{(1)}$: its maximiser $x^{(1)}_\star$ is suboptimal for $f^{(2)}$.
- We need to explore $f^{(2)}$ sufficiently well around the high-valued regions of $f^{(1)}$, but over a region that is not too large.
Key Message: we will explore $\mathcal{X}$ using $f^{(1)}$ and use $f^{(2)}$ mostly in a promising region $\mathcal{X}_\alpha$.

  17. MF-GP-UCB (Kandasamy et al. NIPS 2016b): Multi-fidelity Gaussian Process Upper Confidence Bound
[Figure: $f^{(2)}$, $f^{(1)}$ and $x_\star$]

  18. MF-GP-UCB (Kandasamy et al. NIPS 2016b): Multi-fidelity Gaussian Process Upper Confidence Bound
[Figure: $f^{(2)}$, $f^{(1)}$ and $x_\star$]
- Construct an upper confidence bound $\varphi_t$ for $f^{(2)}$ and choose the point $x_t = \operatorname{argmax}_{x \in \mathcal{X}} \varphi_t(x)$.

  19. MF-GP-UCB (Kandasamy et al. NIPS 2016b): Multi-fidelity Gaussian Process Upper Confidence Bound
[Figure: the posterior and upper confidence bound at $t = 11$, with the chosen point $x_t$]
- Construct an upper confidence bound $\varphi_t$ for $f^{(2)}$ and choose the point $x_t = \operatorname{argmax}_{x \in \mathcal{X}} \varphi_t(x)$.
$\varphi^{(1)}_t(x) = \mu^{(1)}_{t-1}(x) + \beta_t^{1/2}\, \sigma^{(1)}_{t-1}(x) + \zeta^{(1)}$
$\varphi^{(2)}_t(x) = \mu^{(2)}_{t-1}(x) + \beta_t^{1/2}\, \sigma^{(2)}_{t-1}(x)$
$\varphi_t(x) = \min\{ \varphi^{(1)}_t(x),\, \varphi^{(2)}_t(x) \}$

  20. MF-GP-UCB (Kandasamy et al. NIPS 2016b): Multi-fidelity Gaussian Process Upper Confidence Bound
[Figure: at $t = 11$ the chosen point $x_t$ is queried at fidelity $m_t = 2$, with the threshold $\gamma^{(1)}$ shown]
- Construct an upper confidence bound $\varphi_t$ for $f^{(2)}$ and choose the point $x_t = \operatorname{argmax}_{x \in \mathcal{X}} \varphi_t(x)$.
$\varphi^{(1)}_t(x) = \mu^{(1)}_{t-1}(x) + \beta_t^{1/2}\, \sigma^{(1)}_{t-1}(x) + \zeta^{(1)}$
$\varphi^{(2)}_t(x) = \mu^{(2)}_{t-1}(x) + \beta_t^{1/2}\, \sigma^{(2)}_{t-1}(x)$
$\varphi_t(x) = \min\{ \varphi^{(1)}_t(x),\, \varphi^{(2)}_t(x) \}$
- Choose the fidelity
$m_t = \begin{cases} 1 & \text{if } \beta_t^{1/2}\, \sigma^{(1)}_{t-1}(x_t) > \gamma^{(1)} \\ 2 & \text{otherwise.} \end{cases}$
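Below is a minimal sketch of the two selection rules on this slide, on a discretised 1-D domain. It assumes the GP posterior means and standard deviations at each fidelity have already been computed elsewhere; all numerical values are illustrative, not the reference implementation.

```python
import numpy as np

def mf_gp_ucb_step(X_grid, mu1, sigma1, mu2, sigma2, zeta1, beta_t, gamma1):
    """Return the next point x_t and fidelity m_t in the two-fidelity setting."""
    phi1 = mu1 + np.sqrt(beta_t) * sigma1 + zeta1   # UCB for f^(2) via f^(1) plus the bias bound
    phi2 = mu2 + np.sqrt(beta_t) * sigma2           # direct UCB for f^(2)
    phi = np.minimum(phi1, phi2)                    # combined upper bound
    i = int(np.argmax(phi))
    x_t = X_grid[i]
    # Query the cheap fidelity while it is still informative at x_t.
    m_t = 1 if np.sqrt(beta_t) * sigma1[i] > gamma1 else 2
    return x_t, m_t

# Toy usage with made-up posterior quantities on a 1-D grid.
X_grid = np.linspace(0.0, 1.0, 201)
mu1 = -(X_grid - 0.5) ** 2
mu2 = -(X_grid - 0.7) ** 2
sigma1 = 0.05 * np.ones_like(X_grid)
sigma2 = 0.20 * np.ones_like(X_grid)
print(mf_gp_ucb_step(X_grid, mu1, sigma1, mu2, sigma2,
                     zeta1=0.1, beta_t=4.0, gamma1=0.05))
```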

  21. Theoretical Results for MF-GP-UCB
GP-UCB, $\kappa$ an SE kernel (Srinivas et al. 2010): with high probability,
$S(\Lambda) = f^{(2)}(x_\star) - \max_{t:\, m_t = 2} f^{(2)}(x_t) \;\lesssim\; \sqrt{\dfrac{\mathrm{vol}(\mathcal{X})}{\Lambda}}$.

  22. Theoretical Results for MF-GP-UCB
GP-UCB, $\kappa$ an SE kernel (Srinivas et al. 2010): with high probability,
$S(\Lambda) = f^{(2)}(x_\star) - \max_{t:\, m_t = 2} f^{(2)}(x_t) \;\lesssim\; \sqrt{\dfrac{\mathrm{vol}(\mathcal{X})}{\Lambda}}$.
MF-GP-UCB, $\kappa$ an SE kernel (Kandasamy et al. NIPS 2016b): for all $\alpha > 0$, with high probability,
$S(\Lambda) \;\lesssim\; \sqrt{\dfrac{\mathrm{vol}(\mathcal{X}_\alpha)}{\Lambda}} + \sqrt{\dfrac{\mathrm{vol}(\mathcal{X})}{\Lambda^{2-\alpha}}}$,
where $\mathcal{X}_\alpha = \{ x : f^{(2)}(x_\star) - f^{(1)}(x) \le C_\alpha\, \zeta^{(1)} \}$.
A good approximation (small $\zeta^{(1)}$) $\Rightarrow \mathrm{vol}(\mathcal{X}_\alpha) \ll \mathrm{vol}(\mathcal{X})$.

  23. MF-GP-UCB with multiple approximations

  24. MF-GP-UCB with multiple approximations
Things work out.

  25. Experiment: Viola & Jones Face Detection
22 threshold values for each cascade ($d = 22$). Fidelities with dataset sizes (300, 3000) ($M = 2$).
[Plot: experimental results; only axis ticks survived extraction]

  26. Experiment: Cosmological Maximum Likelihood Inference
- Type Ia supernovae data.
- Maximum likelihood inference for 3 cosmological parameters:
  - Hubble constant $H_0$
  - dark energy fraction $\Omega_\Lambda$
  - dark matter fraction $\Omega_M$
- Likelihood: Robertson-Walker metric (Robertson 1936); requires numerical integration for each point in the dataset.

  27. Experiment: Cosmological Maximum Likelihood Inference
3 cosmological parameters ($d = 3$). Fidelities: integration on grids of size $(10^2, 10^4, 10^6)$ ($M = 3$).
[Plot: experimental results; only axis ticks survived extraction]

  28. MF-GP-UCB Synthetic Experiment: Hartmann-3D ($d = 3$, $M = 3$)
[Histogram: query frequencies for Hartmann-3D, i.e. the number of queries at each fidelity $m = 1, 2, 3$, binned by $f^{(3)}(x)$]

  29. Outline
1. A finite number of approximations (Kandasamy et al. NIPS 2016b)
   - Formalism, intuition and challenges
   - Algorithm
   - Theoretical results
   - Experiments
2. A continuous spectrum of approximations (Kandasamy et al. ICML 2017)
   - Formalism
   - Algorithm
   - Theoretical results
   - Experiments

  30. Why continuous approximations?
- Use an arbitrary amount of data?

  31. Why continuous approximations?
- Use an arbitrary amount of data?
- Iterative algorithms: use an arbitrary number of iterations?

  32. Why continuous approximations?
- Use an arbitrary amount of data?
- Iterative algorithms: use an arbitrary number of iterations?
E.g. train an ML model with $N_\bullet$ data and $T_\bullet$ iterations, but use $N < N_\bullet$ data and $T < T_\bullet$ iterations to approximate the cross-validation performance at $(N_\bullet, T_\bullet)$.

  33. Why continuous approximations?
- Use an arbitrary amount of data?
- Iterative algorithms: use an arbitrary number of iterations?
E.g. train an ML model with $N_\bullet$ data and $T_\bullet$ iterations, but use $N < N_\bullet$ data and $T < T_\bullet$ iterations to approximate the cross-validation performance at $(N_\bullet, T_\bullet)$.
Approximations come from a continuous 2-D "fidelity space" of $(N, T)$ values.

  34. Why continuous approximations?
- Use an arbitrary amount of data?
- Iterative algorithms: use an arbitrary number of iterations?
E.g. train an ML model with $N_\bullet$ data and $T_\bullet$ iterations, but use $N < N_\bullet$ data and $T < T_\bullet$ iterations to approximate the cross-validation performance at $(N_\bullet, T_\bullet)$.
Approximations come from a continuous 2-D "fidelity space" of $(N, T)$ values.
Scientific studies: simulations and numerical computations at varying continuous levels of granularity.

  35. Multi-fidelity Optimisation with Continuous Approximations (Kandasamy et al. ICML 2017)
A fidelity space $\mathcal{Z}$ and domain $\mathcal{X}$.
$\mathcal{Z}$ ← all $(N, T)$ values. $\mathcal{X}$ ← all hyper-parameter values.
[Figure: the spaces $\mathcal{Z}$ and $\mathcal{X}$]

  36. Multi-fidelity Optimisation with Continuous Approximations (Kandasamy et al. ICML 2017)
A fidelity space $\mathcal{Z}$ and domain $\mathcal{X}$.
$\mathcal{Z}$ ← all $(N, T)$ values. $\mathcal{X}$ ← all hyper-parameter values.
$g : \mathcal{Z} \times \mathcal{X} \to \mathbb{R}$. $g([N, T], x)$ ← cross-validation accuracy when training with $N$ data for $T$ iterations at hyper-parameter $x$.
[Figure: $g(z, x)$ over $\mathcal{Z} \times \mathcal{X}$]

  37. Multi-fidelity Optimisation with Continuous Approximations (Kandasamy et al. ICML 2017)
A fidelity space $\mathcal{Z}$ and domain $\mathcal{X}$.
$\mathcal{Z}$ ← all $(N, T)$ values. $\mathcal{X}$ ← all hyper-parameter values.
$g : \mathcal{Z} \times \mathcal{X} \to \mathbb{R}$. $g([N, T], x)$ ← cross-validation accuracy when training with $N$ data for $T$ iterations at hyper-parameter $x$.
We wish to optimise $f(x) = g(z_\bullet, x)$, where $z_\bullet \in \mathcal{Z}$; here $z_\bullet = [N_\bullet, T_\bullet]$.
[Figure: $g(z, x)$ and the slice $f(x) = g(z_\bullet, x)$]

  38. Multi-fidelity Optimisation with Continuous Approximations (Kandasamy et al. ICML 2017)
A fidelity space $\mathcal{Z}$ and domain $\mathcal{X}$.
$\mathcal{Z}$ ← all $(N, T)$ values. $\mathcal{X}$ ← all hyper-parameter values.
$g : \mathcal{Z} \times \mathcal{X} \to \mathbb{R}$. $g([N, T], x)$ ← cross-validation accuracy when training with $N$ data for $T$ iterations at hyper-parameter $x$.
We wish to optimise $f(x) = g(z_\bullet, x)$, where $z_\bullet \in \mathcal{Z}$; here $z_\bullet = [N_\bullet, T_\bullet]$.
End Goal: find $x_\star = \operatorname{argmax}_x f(x)$.
[Figure: $g(z, x)$, the slice $f(x) = g(z_\bullet, x)$ and $x_\star$]

  39. Multi-fidelity Optimisation with Continuous Approximations (Kandasamy et al. ICML 2017)
A fidelity space $\mathcal{Z}$ and domain $\mathcal{X}$.
$\mathcal{Z}$ ← all $(N, T)$ values. $\mathcal{X}$ ← all hyper-parameter values.
$g : \mathcal{Z} \times \mathcal{X} \to \mathbb{R}$. $g([N, T], x)$ ← cross-validation accuracy when training with $N$ data for $T$ iterations at hyper-parameter $x$.
We wish to optimise $f(x) = g(z_\bullet, x)$, where $z_\bullet \in \mathcal{Z}$; here $z_\bullet = [N_\bullet, T_\bullet]$.
End Goal: find $x_\star = \operatorname{argmax}_x f(x)$.
A cost function $\lambda : \mathcal{Z} \to \mathbb{R}_+$; e.g. $\lambda(z) = \lambda(N, T) = O(N^2 T)$.
[Figure: $g(z, x)$, the slice $f(x) = g(z_\bullet, x)$, $x_\star$, and the cost $\lambda(z)$ over $\mathcal{Z}$ with $z_\bullet$ marked]
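The following sketch spells this setup out in code for the $(N, T)$ example. `train_and_validate` is a hypothetical stand-in for a real training routine, and the surrogate value it returns is made up; only the structure (the fidelity space, $g$, $f = g(z_\bullet, \cdot)$, and the cost $\lambda$) mirrors the slide.

```python
import numpy as np

N_max, T_max = 50_000, 100          # (N_max, T_max) plays the role of z•

def train_and_validate(N, T, x):
    """Hypothetical: train with N data for T iterations at hyper-parameter x."""
    return float(-(x - 0.3) ** 2 + 0.01 * np.log(N) + 0.001 * T)  # made-up surrogate

def g(z, x):                        # g : Z x X -> R
    N, T = z
    return train_and_validate(N, T, x)

def f(x):                           # f(x) = g(z•, x), the quantity we actually optimise
    return g((N_max, T_max), x)

def cost(z):                        # lambda(z) = O(N^2 T), as in the slide's example
    N, T = z
    return (N ** 2) * T

print(g((1_000, 10), 0.25), f(0.25), cost((1_000, 10)))
```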

  40. Multi-fidelity Simple Regret (Kandasamy et al. ICML 2017)
[Figures: $g(z, x)$ with the slice $f(x) = g(z_\bullet, x)$ and $x_\star$; the cost $\lambda(z)$ over $\mathcal{Z}$ with $z_\bullet$]
End Goal: find $x_\star = \operatorname{argmax}_x f(x)$.

  41. Multi-fidelity Simple Regret (Kandasamy et al. ICML 2017)
[Figures: $g(z, x)$ with the slice $f(x) = g(z_\bullet, x)$ and $x_\star$; the cost $\lambda(z)$ over $\mathcal{Z}$ with $z_\bullet$]
End Goal: find $x_\star = \operatorname{argmax}_x f(x)$.
Simple regret after capital $\Lambda$: $S(\Lambda) = f(x_\star) - \max_{t:\, z_t = z_\bullet} f(x_t)$.
$\Lambda$ ← the amount of a resource spent, e.g. computation time or money.

  42. Multi-fidelity Simple Regret (Kandasamy et al. ICML 2017)
[Figures: $g(z, x)$ with the slice $f(x) = g(z_\bullet, x)$ and $x_\star$; the cost $\lambda(z)$ over $\mathcal{Z}$ with $z_\bullet$]
End Goal: find $x_\star = \operatorname{argmax}_x f(x)$.
Simple regret after capital $\Lambda$: $S(\Lambda) = f(x_\star) - \max_{t:\, z_t = z_\bullet} f(x_t)$.
$\Lambda$ ← the amount of a resource spent, e.g. computation time or money.
No reward for maximising low fidelities, but cheap evaluations at $z \ne z_\bullet$ are used to speed up the search for $x_\star$.

  43. BOCA: Bayesian Optimisation with Continuous Approximations (Kandasamy et al. ICML 2017)

  44. BOCA: Bayesian Optimisation with Continuous Approximations (Kandasamy et al. ICML 2017)
Model $g \sim \mathcal{GP}(0, \kappa)$ and compute the posterior GP:
mean $\mu_{t-1} : \mathcal{Z} \times \mathcal{X} \to \mathbb{R}$, std-dev $\sigma_{t-1} : \mathcal{Z} \times \mathcal{X} \to \mathbb{R}_+$.

  45. BOCA: Bayesian Optimisation with Continuous Approximations (Kandasamy et al. ICML 2017)
Model $g \sim \mathcal{GP}(0, \kappa)$ and compute the posterior GP:
mean $\mu_{t-1} : \mathcal{Z} \times \mathcal{X} \to \mathbb{R}$, std-dev $\sigma_{t-1} : \mathcal{Z} \times \mathcal{X} \to \mathbb{R}_+$.
(1) $x_t$ ← maximise an upper confidence bound for $f(x) = g(z_\bullet, x)$:
$x_t = \operatorname{argmax}_{x \in \mathcal{X}} \; \mu_{t-1}(z_\bullet, x) + \beta_t^{1/2}\, \sigma_{t-1}(z_\bullet, x)$.


  47. BOCA: Bayesian Optimisation with Continuous Approximations (Kandasamy et al. ICML 2017)
Model $g \sim \mathcal{GP}(0, \kappa)$ and compute the posterior GP:
mean $\mu_{t-1} : \mathcal{Z} \times \mathcal{X} \to \mathbb{R}$, std-dev $\sigma_{t-1} : \mathcal{Z} \times \mathcal{X} \to \mathbb{R}_+$.
(1) $x_t$ ← maximise an upper confidence bound for $f(x) = g(z_\bullet, x)$:
$x_t = \operatorname{argmax}_{x \in \mathcal{X}} \; \mu_{t-1}(z_\bullet, x) + \beta_t^{1/2}\, \sigma_{t-1}(z_\bullet, x)$.
(2) $\mathcal{Z}_t \approx \{ z_\bullet \} \cup \{ z : \sigma_{t-1}(z, x_t) \ge \gamma(z) \}$.
(3) $z_t = \operatorname{argmin}_{z \in \mathcal{Z}_t} \lambda(z)$ (the cheapest $z$ in $\mathcal{Z}_t$).


  51. BOCA: Bayesian Optimisation with Continuous Approximations (Kandasamy et al. ICML 2017)
Model $g \sim \mathcal{GP}(0, \kappa)$ and compute the posterior GP:
mean $\mu_{t-1} : \mathcal{Z} \times \mathcal{X} \to \mathbb{R}$, std-dev $\sigma_{t-1} : \mathcal{Z} \times \mathcal{X} \to \mathbb{R}_+$.
(1) $x_t$ ← maximise an upper confidence bound for $f(x) = g(z_\bullet, x)$:
$x_t = \operatorname{argmax}_{x \in \mathcal{X}} \; \mu_{t-1}(z_\bullet, x) + \beta_t^{1/2}\, \sigma_{t-1}(z_\bullet, x)$.
(2) $\mathcal{Z}_t \approx \{ z_\bullet \} \cup \{ z : \sigma_{t-1}(z, x_t) \ge \gamma(z) \}$, where $\gamma(z) = \xi(z) \left( \lambda(z) / \lambda(z_\bullet) \right)^q$.
(3) $z_t = \operatorname{argmin}_{z \in \mathcal{Z}_t} \lambda(z)$ (the cheapest $z$ in $\mathcal{Z}_t$).
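Here is a minimal sketch of one BOCA round on discretised fidelity and domain grids, following steps (1)-(3) above. The posterior mean/std of $g$, the cost $\lambda$, the information-gap function $\xi$ and the exponent $q$ are assumed to be supplied; the toy values below are illustrative rather than the released implementation.

```python
import numpy as np

def boca_step(Z_grid, X_grid, mu, sigma, lam, z_star_idx, beta_t, xi, q):
    """One round of the point/fidelity selection in steps (1)-(3).

    mu, sigma: posterior mean/std of g on the grid, shape (len(Z_grid), len(X_grid)).
    z_star_idx indexes the target fidelity z•; lam is the cost, xi the information gap.
    """
    # (1) x_t maximises the UCB for f(x) = g(z•, x).
    ucb = mu[z_star_idx] + np.sqrt(beta_t) * sigma[z_star_idx]
    j = int(np.argmax(ucb))
    x_t = X_grid[j]
    # (2) Keep z• plus every fidelity whose posterior std at x_t exceeds gamma(z).
    lam_star = lam(Z_grid[z_star_idx])
    gamma = [xi(z) * (lam(z) / lam_star) ** q for z in Z_grid]
    candidates = [k for k in range(len(Z_grid))
                  if k == z_star_idx or sigma[k, j] >= gamma[k]]
    # (3) Among those, query the cheapest fidelity.
    k_t = min(candidates, key=lambda k: lam(Z_grid[k]))
    return Z_grid[k_t], x_t

# Toy usage with made-up posterior quantities on a 1-D fidelity space and domain.
Z_grid = [0.25, 0.5, 1.0]                      # 1.0 plays the role of z•
X_grid = np.linspace(0.0, 1.0, 101)
mu = np.tile(-(X_grid - 0.6) ** 2, (3, 1))
sigma = np.stack([0.30 * np.ones_like(X_grid),
                  0.20 * np.ones_like(X_grid),
                  0.40 * np.ones_like(X_grid)])
print(boca_step(Z_grid, X_grid, mu, sigma, lam=lambda z: 10.0 * z,
                z_star_idx=2, beta_t=4.0, xi=lambda z: 0.8, q=1.0))
```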

  52. Theoretical Results for BOCA
$\kappa : (\mathcal{Z} \times \mathcal{X})^2 \to \mathbb{R}$, $g \sim \mathcal{GP}(0, \kappa)$, with $\kappa([z, x], [z', x']) = \kappa_{\mathcal{X}}(x, x') \cdot \kappa_{\mathcal{Z}}(z, z')$.

  53. Theoretical Results for BOCA
$\kappa : (\mathcal{Z} \times \mathcal{X})^2 \to \mathbb{R}$, $g \sim \mathcal{GP}(0, \kappa)$, with $\kappa([z, x], [z', x']) = \kappa_{\mathcal{X}}(x, x') \cdot \kappa_{\mathcal{Z}}(z, z')$.
[Figures: a "good" approximation and a "bad" approximation of $f$ by $g(z, \cdot)$ at $z \ne z_\bullet$]
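A small sketch of this product kernel with squared-exponential factors; the length-scales and evaluation points are illustrative. A smooth $\kappa_{\mathcal{Z}}$ (long fidelity length-scale) corresponds to the "good" case where low fidelities are informative about $g(z_\bullet, \cdot)$, and a rough $\kappa_{\mathcal{Z}}$ to the "bad" case.

```python
import numpy as np

def se_kernel(a, b, lengthscale):
    """Squared-exponential kernel between two 1-D arrays of points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def product_kernel(Z1, X1, Z2, X2, ls_z=0.5, ls_x=0.2):
    """kappa over (z, x) pairs: entrywise product of the fidelity and domain kernels."""
    return se_kernel(Z1, Z2, ls_z) * se_kernel(X1, X2, ls_x)

# Covariance matrix for three (z, x) pairs; a larger ls_z couples fidelities more strongly.
Z = np.array([0.2, 0.5, 1.0])
X = np.array([0.1, 0.4, 0.9])
K = product_kernel(Z, X, Z, X)
print(np.round(K, 3))
```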
