SLIDE 1

Near-linear Time Gaussian Process Optimization with Adaptive Batching and Resparsification

  • D. Calandriello* 1, L. Carratino* 2, A. Lazaric 3, M. Valko 1, L. Rosasco 2,4

* equal contribution. 1 DeepMind, 2 MaLGa - UniGe, 3 Facebook, 4 MIT - IIT

SLIDE 17

Bayesian/Bandit Optimization

Set of candidates A = {x1, . . . , xA} ⊂ R^d, unknown reward function f : A → R.

For t = 1, . . . , T:
  (1) Select candidate xt using model ut (ideally ut ≈ f)
  (2) Receive noisy feedback yt = f(xt) + ηt
  (3) Update model ut

Performance measure: cumulative regret w.r.t. the best candidate x∗: RT = Σ_{t=1}^{T} (f(x∗) − f(xt)).

Use a Gaussian process / kernelized bandit to model f.
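The select/feedback/update loop above is easy to sketch in code. The following is a minimal, illustrative Python version (the function names and the toy reward are assumptions, not the paper's code), together with the cumulative-regret measure RT:

```python
import numpy as np

def bandit_loop(candidates, f, select, T, noise=0.1, seed=0):
    """Generic Bayesian/bandit optimization loop:
    (1) select a candidate with the current model,
    (2) receive noisy feedback y_t = f(x_t) + eta_t,
    (3) update the model (here: the history handed back to `select`)."""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for t in range(T):
        xt = select(candidates, xs, ys)             # step (1)
        yt = f(xt) + noise * rng.standard_normal()  # step (2)
        xs.append(xt)                               # step (3): the model is
        ys.append(yt)                               # whatever `select` builds from (xs, ys)
    return xs, ys

def cumulative_regret(f, candidates, chosen):
    """R_T = sum_{t=1}^T f(x*) - f(x_t), with x* the best candidate."""
    best = max(f(x) for x in candidates)
    return sum(best - f(x) for x in chosen)
```

For instance, a selector that always returns the optimum x∗ = 0.5 of f(x) = −(x − 0.5)² incurs zero cumulative regret, since every term f(x∗) − f(xt) vanishes.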

SLIDE 20

Gaussian Process Optimization

Well studied: exploration vs. exploitation → no-regret (low error).
Performance vs. scalability?
Batch BKB: no-regret and scalable.

Image from Berkeley’s CS 188
SLIDE 24

Why Scalable GP Optimization is Hard

Experimental scalability: sequential vs. batch feedback.
Computational scalability: exact GP vs. approximate GP.
Batching and approximations increase regret.

SLIDE 25

Landscape of No-Regret GP Optimization

Two axes: sequential vs. batched selection, exact vs. approximate GP.

  • Exact GP: GP-UCB, IGP-UCB, GP-TS (sequential); GP-BUCB, Async-TS (batched)
  • Approximate GP: BKB (sequential); Batch BKB (batched)

Costs range from O(T^3) down to O(T^2) and O(T).

Our solution: a new adaptive schedule for

  • batch size
  • approximation updates

SLIDE 34

Choosing good candidates with GP-UCB

Xt = {x1, . . . , xt}, Yt = {y1, . . . , yt}

Exact GP-UCB: ut(·) = μ(· | Xt, Yt) + βt σ(· | Xt)
[Sri+10]: ut is a valid UCB.

Sparse GP-UCB: ũt(·) = μ̃(· | Xt, Yt, Dt) + β̃t σ̃(· | Xt, Dt), with Dt ⊂ Xt inducing points
[Cal+19]: ũt is a valid UCB if Dt is updated at every t.
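The exact GP-UCB score ut can be computed directly from the Gram matrix on the observed points. A minimal numpy sketch (RBF kernel, with the regularization λ playing the role of the noise variance; all names are illustrative, not from the paper's code):

```python
import numpy as np

def rbf(X, Z, lengthscale=1.0):
    """Squared-exponential kernel matrix k(X, Z)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_ucb_scores(cands, Xt, yt, beta=2.0, lam=1e-2):
    """u_t(x) = mu(x | X_t, Y_t) + beta * sigma(x | X_t)  (exact GP-UCB)."""
    K = rbf(Xt, Xt) + lam * np.eye(len(Xt))   # regularized Gram matrix
    Ks = rbf(cands, Xt)                       # cross-kernel k(x, X_t)
    mu = Ks @ np.linalg.solve(K, yt)          # posterior mean
    var = rbf(cands, cands).diagonal() \
        - np.einsum('ij,ij->i', Ks, np.linalg.solve(K, Ks.T).T)
    sigma = np.sqrt(np.clip(var, 0.0, None))  # posterior standard deviation
    return mu + beta * sigma
```

Note how the score trades off exploitation (high μ) against exploration (high σ): a candidate far from all observed points keeps a large σ and can outscore an already-observed good one.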

SLIDE 37

Performance vs Scalability

ũt(·) = μ̃(· | Xt, Yt, Dt) + β̃t σ̃(· | Xt, Dt)

Better performance: collect more feedback, update inducing points (resparsify).
Worse scalability: experimental cost, resparsification cost.
Improve scalability: batching feedback (GP-BUCB), batching resparsification?
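A rough picture of why inducing points help: with a dictionary Dt of size m ≪ t, every linear solve involves m×m matrices instead of t×t. The sketch below uses a standard Nyström/DTC-style sparse posterior as a stand-in for the slide's μ̃, σ̃ (it is not BKB's exact estimator); as a sanity check, with D = Xt it reduces to the exact GP posterior.

```python
import numpy as np

def rbf(X, Z, lengthscale=1.0):
    """Squared-exponential kernel matrix k(X, Z)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def sparse_posterior(cands, Xt, yt, D, lam=1e-1):
    """DTC-style inducing-point posterior: every solve is an m x m system
    with m = |D|, instead of the t x t system of the exact GP."""
    KDD = rbf(D, D)
    KtD = rbf(Xt, D)                           # t x m cross-kernel
    A = KtD.T @ KtD + lam * KDD                # m x m system matrix
    KcD = rbf(cands, D)
    mu = KcD @ np.linalg.solve(A, KtD.T @ yt)  # sparse posterior mean
    q = np.einsum('ij,ij->i', KcD,
                  np.linalg.solve(KDD + 1e-12 * np.eye(len(D)), KcD.T).T)
    r = np.einsum('ij,ij->i', KcD, np.linalg.solve(A, KcD.T).T)
    var = rbf(cands, cands).diagonal() - q + lam * r   # sparse posterior variance
    return mu, np.clip(var, 0.0, None)
```

Resparsification corresponds to recomputing D from the current Xt, which is exactly the cost the batching rule on the next slide tries to amortize.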

SLIDE 40

Delayed Resparsification

New adaptive batching rule: do not resparsify until Σ_{i∈Batch} σ̃²(xi) exceeds 1.

“Not too big” Lemma: ũt remains a valid UCB.
“Not too small” Lemma: batch size = Ω(t).

[Plot: BBKB batch size as a function of t.]
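The batching rule above can be sketched as a simple accumulator: selected points join the current batch, with no resparsification, until their accumulated approximate posterior variance crosses the threshold. This is an illustrative sketch of the stopping condition only, not the BBKB implementation:

```python
def fill_batch(variances, threshold=1.0):
    """Keep adding selected points to the batch, without resparsifying,
    until the accumulated sigma~^2(x_i) over the batch reaches `threshold`.
    `variances` yields sigma~^2(x_i) for each selected point, in order."""
    batch, acc = [], 0.0
    for i, var in enumerate(variances):
        batch.append(i)
        acc += var
        if acc >= threshold:  # batch is "big enough": resparsify now
            break
    return batch, acc
```

With variances [0.4, 0.3, 0.5, 0.2] and threshold 1, the batch closes after the third point (accumulated variance 1.2), so the inducing points are recomputed once every few selections rather than at every t.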

SLIDE 42

Batch-BKB Theorem

With high probability, Batch-BKB achieves no-regret with time complexity O(T d_eff^2), where d_eff ≪ T is the effective dimension / degrees of freedom of the GP.

Comparisons:

  • Same regret as GP-UCB/IGP-UCB, with better scalability (from O(T^3) to O(T d_eff^2))
  • Larger batches than GP-BUCB
  • Better regret and better scalability than Async-TS

SLIDE 43

In practice: Scalability

[Plot: runtime (sec) vs. t on Cadata (A = 20640, d = 8, T = 2000), comparing Batch-GPUCB, BKB, Global-BBKB, GPUCB, async-TS.]

[Plot: runtime (sec) vs. t on NAS-bench-101 (A = 12416, d = 19, T = 12000), comparing eps-Greedy, Regularized evolution, Global-BBKB.]

SLIDE 44

In practice: Performance

[Plot: NAS-bench-101, cumulative regret normalized by uniform sampling (Rt / Rt^unif) vs. t, comparing eps-Greedy, Regularized evolution, Global-BBKB.]

[Plot: NAS-bench-101, simple regret vs. t, comparing the same methods.]

SLIDE 45

Thank you