SLIDE 1

Near-linear Time Gaussian Process Optimization with Adaptive Batching and Resparsification

  • D. Calandriello* 1, L. Carratino* 2, A. Lazaric 3, M. Valko 1, L. Rosasco 2,4

* equal contribution. 1 DeepMind, 2 MaLGa - UniGe, 3 Facebook, 4 MIT - IIT

SLIDE 17

Bayesian/Bandit Optimization

Set of candidates A = {x1, . . . , xA} ⊂ R^d, unknown reward function f : A → R.

For t = 1, . . . , T:
  (1) Select candidate xt using model ut (ideally ut ≈ f)
  (2) Receive noisy feedback yt = f(xt) + ηt
  (3) Update model ut

Performance measure: cumulative regret w.r.t. the best candidate x∗: RT = Σ_{t=1}^{T} (f(x∗) − f(xt)).

Use a Gaussian process / kernelized bandit to model f.
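The select/feedback/update loop above is easy to sketch in code. The following is a minimal, illustrative Python version (the function names and the toy reward are assumptions, not the paper's code), together with the cumulative-regret measure RT:

```python
import numpy as np

def bandit_loop(candidates, f, select, T, noise=0.1, seed=0):
    """Generic Bayesian/bandit optimization loop:
    (1) select a candidate with the current model,
    (2) receive noisy feedback y_t = f(x_t) + eta_t,
    (3) update the model (here: the history handed back to `select`)."""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for t in range(T):
        xt = select(candidates, xs, ys)             # step (1)
        yt = f(xt) + noise * rng.standard_normal()  # step (2)
        xs.append(xt)                               # step (3): the model is
        ys.append(yt)                               # whatever `select` builds from (xs, ys)
    return xs, ys

def cumulative_regret(f, candidates, chosen):
    """R_T = sum_{t=1}^T f(x*) - f(x_t), with x* the best candidate."""
    best = max(f(x) for x in candidates)
    return sum(best - f(x) for x in chosen)
```

For instance, a selector that always returns the optimum x∗ = 0.5 of f(x) = −(x − 0.5)² incurs zero cumulative regret, since every term f(x∗) − f(xt) vanishes.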

SLIDE 20

Gaussian Process Optimization

Well studied: exploration vs. exploitation → no-regret (low error).
Performance vs. scalability?
Batch BKB: no-regret and scalable.

Image from Berkeley’s CS 188
SLIDE 24

Why Scalable GP Optimization is Hard

Experimental scalability: sequential vs. batch feedback.
Computational scalability: exact GP vs. approximate GP.
Batching and approximations increase regret.

SLIDE 25

Landscape of No-Regret GP Optimization

Two axes: sequential vs. batched selection, exact vs. approximate GP.

  • Exact GP: GP-UCB, IGP-UCB, GP-TS (sequential); GP-BUCB, Async-TS (batched)
  • Approximate GP: BKB (sequential); Batch BKB (batched)

Costs range from O(T^3) down to O(T^2) and O(T).

Our solution: a new adaptive schedule for

  • batch size
  • approximation updates

SLIDE 34

Choosing good candidates with GP-UCB

Xt = {x1, . . . , xt}, Yt = {y1, . . . , yt}

Exact GP-UCB: ut(·) = μ(· | Xt, Yt) + βt σ(· | Xt)
[Sri+10]: ut is a valid UCB.

Sparse GP-UCB: ũt(·) = μ̃(· | Xt, Yt, Dt) + β̃t σ̃(· | Xt, Dt), with Dt ⊂ Xt inducing points
[Cal+19]: ũt is a valid UCB if Dt is updated at every t.
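The exact GP-UCB score ut can be computed directly from the Gram matrix on the observed points. A minimal numpy sketch (RBF kernel, with the regularization λ playing the role of the noise variance; all names are illustrative, not from the paper's code):

```python
import numpy as np

def rbf(X, Z, lengthscale=1.0):
    """Squared-exponential kernel matrix k(X, Z)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_ucb_scores(cands, Xt, yt, beta=2.0, lam=1e-2):
    """u_t(x) = mu(x | X_t, Y_t) + beta * sigma(x | X_t)  (exact GP-UCB)."""
    K = rbf(Xt, Xt) + lam * np.eye(len(Xt))   # regularized Gram matrix
    Ks = rbf(cands, Xt)                       # cross-kernel k(x, X_t)
    mu = Ks @ np.linalg.solve(K, yt)          # posterior mean
    var = rbf(cands, cands).diagonal() \
        - np.einsum('ij,ij->i', Ks, np.linalg.solve(K, Ks.T).T)
    sigma = np.sqrt(np.clip(var, 0.0, None))  # posterior standard deviation
    return mu + beta * sigma
```

Note how the score trades off exploitation (high μ) against exploration (high σ): a candidate far from all observed points keeps a large σ and can outscore an already-observed good one.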

SLIDE 37

Performance vs Scalability

ũt(·) = μ̃(· | Xt, Yt, Dt) + β̃t σ̃(· | Xt, Dt)

Better performance: collect more feedback, update inducing points (resparsify).
Worse scalability: experimental cost, resparsification cost.
Improve scalability: batching feedback (GP-BUCB), batching resparsification?
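A rough picture of why inducing points help: with a dictionary Dt of size m ≪ t, every linear solve involves m×m matrices instead of t×t. The sketch below uses a standard Nyström/DTC-style sparse posterior as a stand-in for the slide's μ̃, σ̃ (it is not BKB's exact estimator); as a sanity check, with D = Xt it reduces to the exact GP posterior.

```python
import numpy as np

def rbf(X, Z, lengthscale=1.0):
    """Squared-exponential kernel matrix k(X, Z)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def sparse_posterior(cands, Xt, yt, D, lam=1e-1):
    """DTC-style inducing-point posterior: every solve is an m x m system
    with m = |D|, instead of the t x t system of the exact GP."""
    KDD = rbf(D, D)
    KtD = rbf(Xt, D)                           # t x m cross-kernel
    A = KtD.T @ KtD + lam * KDD                # m x m system matrix
    KcD = rbf(cands, D)
    mu = KcD @ np.linalg.solve(A, KtD.T @ yt)  # sparse posterior mean
    q = np.einsum('ij,ij->i', KcD,
                  np.linalg.solve(KDD + 1e-12 * np.eye(len(D)), KcD.T).T)
    r = np.einsum('ij,ij->i', KcD, np.linalg.solve(A, KcD.T).T)
    var = rbf(cands, cands).diagonal() - q + lam * r   # sparse posterior variance
    return mu, np.clip(var, 0.0, None)
```

Resparsification corresponds to recomputing D from the current Xt, which is exactly the cost the batching rule on the next slide tries to amortize.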

SLIDE 40

Delayed Resparsification

New adaptive batching rule: do not resparsify until Σ_{i∈Batch} σ̃²(xi) exceeds 1.

“Not too big” Lemma: ũt remains a valid UCB.
“Not too small” Lemma: batch size = Ω(t).

[Plot: BBKB batch size as a function of t.]
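The batching rule above can be sketched as a simple accumulator: selected points join the current batch, with no resparsification, until their accumulated approximate posterior variance crosses the threshold. This is an illustrative sketch of the stopping condition only, not the BBKB implementation:

```python
def fill_batch(variances, threshold=1.0):
    """Keep adding selected points to the batch, without resparsifying,
    until the accumulated sigma~^2(x_i) over the batch reaches `threshold`.
    `variances` yields sigma~^2(x_i) for each selected point, in order."""
    batch, acc = [], 0.0
    for i, var in enumerate(variances):
        batch.append(i)
        acc += var
        if acc >= threshold:  # batch is "big enough": resparsify now
            break
    return batch, acc
```

With variances [0.4, 0.3, 0.5, 0.2] and threshold 1, the batch closes after the third point (accumulated variance 1.2), so the inducing points are recomputed once every few selections rather than at every t.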

SLIDE 42

Batch-BKB Theorem

With high probability, Batch-BKB achieves no-regret with time complexity O(T d_eff^2), where d_eff ≪ T is the effective dimension / degrees of freedom of the GP.

Comparisons:

  • Same regret as GP-UCB/IGP-UCB, with better scalability (from O(T^3) to O(T d_eff^2))
  • Larger batches than GP-BUCB
  • Better regret and better scalability than Async-TS

SLIDE 43

In practice: Scalability

[Plot: runtime (sec) vs. t on Cadata (A = 20640, d = 8, T = 2000), comparing Batch-GPUCB, BKB, Global-BBKB, GPUCB, async-TS.]

[Plot: runtime (sec) vs. t on NAS-bench-101 (A = 12416, d = 19, T = 12000), comparing eps-Greedy, Regularized evolution, Global-BBKB.]

SLIDE 44

In practice: Performance

[Plot: NAS-bench-101, cumulative regret normalized by uniform sampling (Rt / Rt^unif) vs. t, comparing eps-Greedy, Regularized evolution, Global-BBKB.]

[Plot: NAS-bench-101, simple regret vs. t, comparing the same methods.]

SLIDE 45

Thank you