SLIDE 1

High Dimensional Bayesian Optimisation and Bandits via Additive Models

Kirthevasan Kandasamy, Jeff Schneider, Barnabás Póczos

ICML ’15, July 8, 2015

SLIDE 2

Bandits & Optimisation

Maximum Likelihood inference in Computational Astrophysics.

[Diagram: cosmological simulator producing an observation]

E.g., parameters such as the Hubble constant and the baryonic density.



SLIDE 5

Bandits & Optimisation

Expensive Blackbox Function.

Examples: hyper-parameter tuning in ML; optimal control strategy in Robotics.

SLIDE 6

Bandits & Optimisation

f : [0, 1]^D → ℝ is an expensive, black-box, nonconvex function. Let x* = argmax_x f(x).

[Figure: a 1-D function f with its maximiser x* and value f(x*) marked]


SLIDE 8

Bandits & Optimisation

f : [0, 1]^D → ℝ is an expensive, black-box, nonconvex function. Let x* = argmax_x f(x).

Optimisation ≅ Minimise Simple Regret:
S_T = f(x*) − max_{t=1,…,T} f(x_t).

SLIDE 9

Bandits & Optimisation

f : [0, 1]^D → ℝ is an expensive, black-box, nonconvex function. Let x* = argmax_x f(x).

Bandits ≅ Minimise Cumulative Regret:
R_T = Σ_{t=1}^{T} ( f(x*) − f(x_t) ).
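Both regret notions are straightforward to compute once the query history is known; below is a minimal sketch (the helper name `regrets` and the NumPy setup are illustrative, not code from the talk):

```python
import numpy as np

def regrets(f_opt, f_queried):
    """Simple and cumulative regret of a query sequence.

    f_opt     : f(x*), the optimum value (unknown in practice).
    f_queried : values f(x_t) for t = 1, ..., T.
    """
    f_queried = np.asarray(f_queried, dtype=float)
    simple = f_opt - np.max(f_queried)        # S_T: gap to the best query so far
    cumulative = np.sum(f_opt - f_queried)    # R_T: summed per-step gaps
    return simple, cumulative
```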


SLIDE 11

Gaussian Process (Bayesian) Optimisation

Model f ∼ GP(0, κ).

[Figure: samples from the GP prior over a 1-D domain]

SLIDE 12

Gaussian Process (Bayesian) Optimisation

Model f ∼ GP(0, κ). Obtain posterior GP.

[Figure: GP posterior mean and confidence band after a few observations]
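The posterior here follows the standard zero-mean GP regression formulas; a minimal Cholesky-based sketch (with an assumed noise level, not code from the talk):

```python
import numpy as np

def gp_posterior(K, k_star, k_ss_diag, y, noise=1e-3):
    """Posterior mean and std at test points for a zero-mean GP.

    K         : (n, n) kernel matrix of the observed points.
    k_star    : (n, m) cross-kernel between observed and test points.
    k_ss_diag : (m,) prior variances at the test points.
    y         : (n,) observed values.
    """
    L = np.linalg.cholesky(K + noise * np.eye(len(y)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = k_star.T @ alpha                       # posterior mean
    v = np.linalg.solve(L, k_star)
    var = k_ss_diag - np.sum(v * v, axis=0)     # posterior variance
    return mu, np.sqrt(np.maximum(var, 0.0))
```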

SLIDE 13

Gaussian Process (Bayesian) Optimisation

Model f ∼ GP(0, κ).

Maximise acquisition function ϕ_t: x_t = argmax_x ϕ_t(x).

GP-UCB: ϕ_t(x) = µ_{t−1}(x) + β_t^{1/2} σ_{t−1}(x) (Srinivas et al. 2010)

[Figure: acquisition ϕ_t over [0, 1] with maximiser x_t = 0.828]

SLIDE 14

Gaussian Process (Bayesian) Optimisation

Other choices of ϕ_t: Expected Improvement (GP-EI), Thompson Sampling, etc.
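As a concrete illustration of the GP-UCB rule, here is a minimal sketch that scores a grid of candidates (the placeholder posterior arrays stand in for a real GP posterior, e.g. from `gp_posterior` above):

```python
import numpy as np

def gp_ucb(mu, sigma, beta_t):
    """GP-UCB score: phi_t(x) = mu_{t-1}(x) + sqrt(beta_t) * sigma_{t-1}(x)."""
    return mu + np.sqrt(beta_t) * sigma

# 1-D toy usage: pick the next query from a dense grid of candidates.
grid = np.linspace(0.0, 1.0, 1001)
mu = np.sin(2 * np.pi * grid)        # placeholder posterior mean
sigma = 0.5 * np.ones_like(grid)     # placeholder posterior std
x_next = grid[np.argmax(gp_ucb(mu, sigma, beta_t=2.0))]
```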

SLIDE 15

Scaling to Higher Dimensions

Two Key Challenges:

◮ Statistical Difficulty: nonparametric sample complexity is exponential in D.

◮ Computational Difficulty: optimising ϕ_t to within ζ accuracy requires O(ζ^{−D}) effort.

SLIDE 16

Scaling to Higher Dimensions

Existing Work:

◮ (Chen et al. 2012): f depends on a small number of variables. Find the variables and then run GP-UCB.

◮ (Wang et al. 2013): f varies along a lower dimensional subspace. GP-EI on a random subspace.

◮ (Djolonga et al. 2013): f varies along a lower dimensional subspace. Find the subspace and then run GP-UCB.

SLIDE 17

Scaling to Higher Dimensions

Existing Work (Chen et al. 2012, Wang et al. 2013, Djolonga et al. 2013):

◮ Assumes f varies only along a low dimensional subspace.
◮ Performs BO on a low dimensional subspace.
◮ Assumption too strong in realistic settings.

SLIDE 18

Additive Functions

Structural assumption:
f(x) = f^{(1)}(x^{(1)}) + f^{(2)}(x^{(2)}) + ⋯ + f^{(M)}(x^{(M)}),
where x^{(j)} ∈ X^{(j)} = [0, 1]^d, d ≪ D, and x^{(i)} ∩ x^{(j)} = ∅.

SLIDE 19

Additive Functions

E.g., for D = 10:
f(x^{{1,…,10}}) = f^{(1)}(x^{{1,3,9}}) + f^{(2)}(x^{{2,4,8}}) + f^{(3)}(x^{{5,6,10}}).

Call {X^{(j)}}_{j=1}^{M} = {(1, 3, 9), (2, 4, 8), (5, 6, 10)} the “decomposition”.


SLIDE 21

Additive Functions

Structural assumption:
f(x) = f^{(1)}(x^{(1)}) + ⋯ + f^{(M)}(x^{(M)}), with x^{(j)} ∈ X^{(j)} = [0, 1]^d, d ≪ D, x^{(i)} ∩ x^{(j)} = ∅.

Assume each f^{(j)} ∼ GP(0, κ^{(j)}). Then f ∼ GP(0, κ) where
κ(x, x′) = κ^{(1)}(x^{(1)}, x′^{(1)}) + ⋯ + κ^{(M)}(x^{(M)}, x′^{(M)}).

Given (X, Y) = {(x_i, y_i)}_{i=1}^{T} and a test point x_†,
f^{(j)}(x_†^{(j)}) | X, Y ∼ N(µ^{(j)}, σ^{(j)2}).
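An additive kernel of this form is easy to assemble from per-group kernels; a minimal NumPy sketch under assumed SE hyper-parameters A and h (shared across groups here purely for brevity):

```python
import numpy as np

def se_kernel(X1, X2, A=1.0, h=0.5):
    """Squared-exponential kernel: A * exp(-||x - x'||^2 / (2 h^2))."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return A * np.exp(-sq / (2.0 * h**2))

def additive_kernel(X1, X2, groups, A=1.0, h=0.5):
    """kappa(x, x') = sum_j kappa^(j)(x^(j), x'^(j)) over disjoint coordinate groups."""
    return sum(se_kernel(X1[:, g], X2[:, g], A, h) for g in groups)

# The running example (0-indexed): D = 10, M = 3 groups of size d = 3.
groups = [[0, 2, 8], [1, 3, 7], [4, 5, 9]]
X = np.random.rand(5, 10)
K = additive_kernel(X, X, groups)   # (5, 5) additive kernel matrix
```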

SLIDE 22

Outline

1. GP-UCB
2. The Add-GP-UCB algorithm
   ◮ Bounds on S_T: exponential in D → linear in D.
   ◮ An easy-to-optimise acquisition function.
   ◮ Performs well even when f is not additive.
3. Experiments
4. Conclusion & some open questions


SLIDE 24

GP-UCB

x_t = argmax_{x∈X} µ_{t−1}(x) + β_t^{1/2} σ_{t−1}(x)

Squared Exponential (SE) Kernel: κ(x, x′) = A exp( −‖x − x′‖² / (2h²) ).

Theorem (Srinivas et al. 2010): Let f ∼ GP(0, κ). Then w.h.p.,
S_T ∈ O( √( D^D (log T)^D / T ) ).



SLIDE 27

GP-UCB on additive κ

Suppose f ∼ GP(0, κ) where κ(x, x′) = κ^{(1)}(x^{(1)}, x′^{(1)}) + ⋯ + κ^{(M)}(x^{(M)}, x′^{(M)}), with each κ^{(j)} an SE kernel.

Can be shown: S_T ∈ O( √( D² d^d (log T)^d / T ) ).

But ϕ_t = µ_{t−1} + β_t^{1/2} σ_{t−1} is still D-dimensional!



SLIDE 30

Add-GP-UCB

ϕ_t(x) = Σ_{j=1}^{M} ϕ^{(j)}_t(x^{(j)}), where ϕ^{(j)}_t(x^{(j)}) = µ^{(j)}_{t−1}(x^{(j)}) + β_t^{1/2} σ^{(j)}_{t−1}(x^{(j)}).

Maximise each ϕ^{(j)}_t separately. Requires only O(poly(D) ζ^{−d}) effort (vs O(ζ^{−D}) for GP-UCB).

Theorem: Let f^{(j)} ∼ GP(0, κ^{(j)}) and f = Σ_j f^{(j)}. Then w.h.p.,
S_T ∈ O( √( D² d^d (log T)^d / T ) ).
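Because ϕ_t decouples across groups, each ϕ^{(j)}_t can be maximised over its own d-dimensional cube and the results concatenated; a minimal sketch (random search stands in for the DiRect optimiser used in the talk's experiments):

```python
import numpy as np

def maximise_add_acq(phi_js, groups, D, n_candidates=1000, seed=0):
    """Maximise each group acquisition phi^(j) separately and assemble x_t.

    phi_js : list of callables; phi_js[j] maps an (n, d_j) array of points
             in [0, 1]^{d_j} to n acquisition values.
    groups : list of index lists, the decomposition {X^(j)}.
    """
    rng = np.random.default_rng(seed)
    x_t = np.empty(D)
    for phi_j, g in zip(phi_js, groups):
        cand = rng.random((n_candidates, len(g)))    # candidates in [0, 1]^d
        x_t[g] = cand[np.argmax(phi_j(cand))]        # best candidate for group j
    return x_t
```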

SLIDE 31

Summary of Theoretical Results (for SE Kernel)

◮ GP-UCB with no assumption on f: S_T ∈ O( D^{D/2} (log T)^{D/2} T^{−1/2} ).
◮ GP-UCB on additive f: S_T ∈ O( D T^{−1/2} ); maximising ϕ_t takes O(ζ^{−D}) effort.
◮ Add-GP-UCB on additive f: S_T ∈ O( D T^{−1/2} ); maximising ϕ_t takes O(poly(D) ζ^{−d}) effort.

SLIDE 32

Add-GP-UCB: f(x^{{1,2}}) = f^{(1)}(x^{{1}}) + f^{(2)}(x^{{2}})

[Figure: the two 1-D components f^{(1)}(x^{{1}}) and f^{(2)}(x^{{2}})]




SLIDE 36

Add-GP-UCB: f(x^{{1,2}}) = f^{(1)}(x^{{1}}) + f^{(2)}(x^{{2}})

Maximise the two group acquisitions ϕ̃^{(1)}, ϕ̃^{(2)} separately:
x_t^{(1)} = 0.869, x_t^{(2)} = 0.141, giving x_t = (0.869, 0.141).

[Figure: ϕ̃^{(1)}(x^{{1}}) and ϕ̃^{(2)}(x^{{2}}) with their maximisers marked]




SLIDE 40

Additive modeling in non-additive settings

◮ Additive models are common in high dimensional regression. E.g.: backfitting, MARS, COSSO, RODEO, SpAM, etc.
  f(x^{{1,…,D}}) = f(x^{{1}}) + f(x^{{2}}) + ⋯ + f(x^{{D}}).

◮ Additive models are statistically simpler ⟹ worse bias, but much better variance in the low sample regime.

◮ In BO applications queries are expensive, so we usually cannot afford many queries.

◮ Observation: Add-GP-UCB does well even when f is not additive.
  ◮ Better bias/variance trade-off in high dimensional regression.
  ◮ Easy-to-maximise acquisition function.

SLIDE 41

Unknown Kernel/Decomposition in Practice

Learn the kernel hyper-parameters and the decomposition {X^{(j)}} by periodically maximising the GP marginal likelihood.
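One simple way to realise this is to draw candidate decompositions and keep the one with the highest marginal likelihood; a minimal sketch (the `marginal_likelihood` callable is a stand-in for a GP library fit, not an API from the paper's code):

```python
import numpy as np

def random_decomposition(D, d, rng):
    """Random partition of {0, ..., D-1} into groups of size <= d."""
    perm = rng.permutation(D)
    return [perm[i:i + d].tolist() for i in range(0, D, d)]

def choose_decomposition(X, y, d, n_tries, marginal_likelihood, seed=0):
    """Pick, among random candidates, the decomposition maximising the
    GP marginal likelihood of the data seen so far."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    cands = [random_decomposition(D, d, rng) for _ in range(n_tries)]
    return max(cands, key=lambda groups: marginal_likelihood(X, y, groups))
```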

SLIDE 42

Experiments

[Plot: simple regret vs. number of queries]

Add-∗: knows the true decomposition. Add-d/M: uses M groups of size ≤ d.

Use 1000 DiRect evaluations to maximise the acquisition function. DiRect: Dividing Rectangles (Jones et al. 1993).
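For reference, recent SciPy versions ship a DIRECT implementation, so maximising a hypothetical acquisition `phi` under a fixed evaluation budget might look like the sketch below; `direct` minimises, hence the negation:

```python
import numpy as np
from scipy.optimize import direct

# Hypothetical 2-D acquisition function; any callable of an ndarray works.
def phi(x):
    return -np.sum((x - 0.3) ** 2)

# Budget of ~1000 function evaluations, as in the experiments.
res = direct(lambda x: -phi(x), bounds=[(0.0, 1.0), (0.0, 1.0)], maxfun=1000)
x_next = res.x
```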

SLIDE 43

Experiments

[Plot: simple regret vs. number of queries]

Same setup, but using 4000 DiRect evaluations to maximise the acquisition function.

SLIDE 44

SDSS Luminous Red Galaxies

[Diagram: cosmological simulator producing an observation; e.g., Hubble constant, baryonic density]

◮ Task: find maximum likelihood cosmological parameters.
◮ 20 dimensions, but only 9 parameters are relevant.
◮ Each query takes 2-5 seconds.
◮ Use 500 DiRect evaluations to maximise the acquisition function.

SLIDE 45

SDSS Luminous Red Galaxies

[Plot: log likelihood vs. number of queries]

REMBO: (Wang et al. 2013).

SLIDE 46

Viola & Jones Face Detection

A cascade of 22 weak classifiers. An image is classified negative if its score falls below the threshold at any stage.

◮ Task: find optimal threshold values on a training set of 1000 images.
◮ 22 dimensions.
◮ Each query takes 30-40 seconds.
◮ Use 1000 DiRect evaluations to maximise the acquisition function.

SLIDE 47

Viola & Jones Face Detection

[Plot: classification accuracy vs. number of queries]




SLIDE 51

Summary

◮ Additive assumption improves regret: exponential in D → linear in D.
◮ Acquisition function is easy to maximise.
◮ Even when f is not additive, Add-GP-UCB does well in practice.
◮ Similar results hold for Matérn kernels and in the bandit setting.

Some open questions:

◮ How to choose (d, M)?
◮ Can we generalise to other acquisition functions?

Code available: github.com/kirthevasank/add-gp-bandits

Jeff’s Talk: Friday 2pm @ Van Gogh. Thank You.