

1. Near-linear Time Gaussian Process Optimization with Adaptive Batching and Resparsification
D. Calandriello*¹, L. Carratino*², A. Lazaric³, M. Valko¹, L. Rosasco²,⁴ (*equal contribution)
¹DeepMind, ²MaLGa - UniGe, ³Facebook, ⁴MIT - IIT

2-17. Bayesian/Bandit Optimization
Set of candidates A = {x_1, ..., x_A} ⊂ R^d, unknown reward function f : A → R.
For t = 1, ..., T:
(1) Select candidate x_t using model u_t (ideally u_t ≈ f)
(2) Receive noisy feedback y_t = f(x_t) + η_t
(3) Update model u_t
Performance measure: cumulative regret w.r.t. the best candidate x*, $R_T = \sum_{t=1}^{T} \big( f(x^*) - f(x_t) \big)$.
Use a Gaussian process / kernelized bandit to model f.
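To make the loop concrete, here is a minimal, self-contained Python sketch of the protocol above. It is not the authors' code: the candidate set, the quadratic reward f, the noise level, and the uniform-random selection placeholder are all illustrative assumptions; a real method replaces the selection step with a model-based rule such as GP-UCB (next slides).

```python
# A minimal sketch of the loop on slides 2-17 (not the authors' code).
# The candidate set, the quadratic reward f, the noise level, and the
# uniform-random selection placeholder are all illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(-1, 1, size=(100, 2))       # candidate set A ⊂ R^d
f = lambda x: -np.sum(x ** 2, axis=-1)      # unknown reward (assumed here)
noise, T = 0.1, 50

best = f(A).max()                           # f(x*), needed only to measure regret
X, Y, regret = [], [], 0.0                  # history X_t, Y_t and R_T
for t in range(T):
    # (1) select a candidate; a real method uses the model u_t (e.g. GP-UCB)
    x = A[rng.integers(len(A))]             # placeholder: uniform choice
    # (2) receive noisy feedback y_t = f(x_t) + eta_t
    y = f(x) + noise * rng.standard_normal()
    # (3) update the model (here: just store the data)
    X.append(x); Y.append(y)
    regret += best - f(x)                   # accumulate f(x*) - f(x_t)
print(f"R_T after {T} rounds: {regret:.2f}")
```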

18-20. Gaussian Process Optimization
Well studied: exploration vs exploitation → no-regret (low error). Performance vs scalability?
Batch-BKB: no-regret and scalable.
(Image from Berkeley's CS 188)

21-24. Why Scalable GP Optimization is Hard
Experimental scalability: sequential vs batch feedback.
Computational scalability: exact GP vs approximate GP.
Batching and approximations increase regret.

25-28. Landscape of No-Regret GP Optimization

                         sequential                 batched
exact GP, Õ(T³):         GP-UCB, IGP-UCB, GP-TS     GP-BUCB, Async-TS
approximate GP:          BKB, Õ(T²)                 Batch-BKB (ours), Õ(T)

Our solution: a new adaptive schedule for Batch-BKB's batch size and approximation (resparsification) updates.

29-34. Choosing good candidates with GP-UCB
History: X_t = {x_1, ..., x_t}, Y_t = {y_1, ..., y_t}.
Exact GP-UCB: $u_t(\cdot) = \mu(\cdot \mid X_t, Y_t) + \beta_t \, \sigma(\cdot \mid X_t)$. [Sri+10]: u_t is a valid UCB.
Sparse GP-UCB: $\tilde{u}_t(\cdot) = \tilde{\mu}(\cdot \mid X_t, Y_t, D_t) + \tilde{\beta}_t \, \tilde{\sigma}(\cdot \mid X_t, D_t)$, with D_t ⊂ X_t the inducing points. [Cal+19]: $\tilde{u}_t$ is a valid UCB if D_t is updated at every t.
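A minimal sketch of the exact GP-UCB score above, assuming an RBF kernel with unit prior variance. The length-scale, beta, regularization lam, and the toy data are illustrative assumptions, not the paper's setup.

```python
# A sketch of the exact GP-UCB score u_t = mu + beta_t * sigma (slides 29-34),
# assuming an RBF kernel with unit prior variance. The length-scale, beta,
# regularization lam, and the toy data are illustrative assumptions.
import numpy as np

def rbf(X, Z, ls=0.5):
    """RBF kernel matrix between the rows of X and Z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def gp_ucb_scores(A, X, y, beta=2.0, lam=0.1):
    """u_t(a) = mu(a | X_t, Y_t) + beta_t * sigma(a | X_t) for all a in A."""
    K = rbf(X, X) + lam * np.eye(len(X))      # regularized kernel matrix
    Ka = rbf(A, X)                            # cross-covariances k(A, X_t)
    mu = Ka @ np.linalg.solve(K, y)           # posterior mean
    var = 1.0 - np.einsum("ij,ji->i", Ka, np.linalg.solve(K, Ka.T))
    sigma = np.sqrt(np.clip(var, 0.0, None))  # posterior std (k(a, a) = 1)
    return mu + beta * sigma

rng = np.random.default_rng(0)
A = rng.uniform(-1, 1, size=(100, 2))
X = A[:5]                                     # 5 already-evaluated candidates
y = -np.sum(X ** 2, axis=1) + 0.1 * rng.standard_normal(5)
x_next = A[np.argmax(gp_ucb_scores(A, X, y))]   # step (1): argmax of the UCB
```

The sparse variant replaces K and Ka with their Nyström approximations built on the inducing points D_t, which is where the resparsification cost discussed next comes from.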

35-37. Performance vs Scalability
Better performance: collect more feedback and update the inducing points (resparsify): $\tilde{u}_t(\cdot) = \tilde{\mu}(\cdot \mid X_t, Y_t, D_t) + \tilde{\beta}_t \, \tilde{\sigma}(\cdot \mid X_t, D_t)$.
Worse scalability: experimental cost and resparsification cost.
Improve scalability: batch the feedback (GP-BUCB); can we also batch the resparsification?

38-40. Delayed Resparsification
New adaptive batching rule: do not resparsify until $\sum_{i \in \text{batch}} \tilde{\sigma}^2(x_i) \geq 1$.
"Not too big" lemma: $\tilde{u}_t$ remains a valid UCB.
"Not too small" lemma: batch size = Ω(t).
[Plot: BBKB's batch size vs. t; the batch size grows roughly linearly, reaching about 4000 by t = 12000.]
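A sketch of how such an adaptive rule can be driven: accumulate the approximate posterior variances of the points in the current batch and close the batch, triggering a resparsification/model update, once the sum crosses the budget. The helper name, the budget C = 1, and the assumed variance decay are illustrative.

```python
# A sketch of the adaptive batching rule on slides 38-40: keep growing the
# current batch while the accumulated approximate posterior variances stay
# below a budget, then close the batch and resparsify. The helper name, the
# budget C = 1, and the assumed variance decay are illustrative.
import numpy as np

def run_batches(sigma2_stream, C=1.0):
    """Group per-step variances sigma~^2(x_i) into batches under budget C."""
    batches, current, acc = [], [], 0.0
    for s2 in sigma2_stream:
        current.append(s2)
        acc += s2                     # sum_{i in batch} sigma~^2(x_i)
        if acc >= C:                  # budget exhausted: close the batch and
            batches.append(current)   # resparsify / update the model here
            current, acc = [], 0.0
    if current:
        batches.append(current)
    return batches

# Variances shrink as the model improves, so later batches absorb more
# points before hitting the budget -- the Omega(t) growth in the plot.
sigma2 = 1.0 / np.sqrt(np.arange(1, 201))   # assumed decay, for illustration
print([len(b) for b in run_batches(sigma2)])
```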

41-42. Batch-BKB Theorem
With high probability, Batch-BKB achieves no-regret with time complexity $O(T d_{\text{eff}}^2)$, where $d_{\text{eff}} \ll T$ is the effective dimension (degrees of freedom) of the GP.
Comparisons:
- Same regret as GP-UCB/IGP-UCB with better scalability (from $O(T^3)$ to $O(T d_{\text{eff}}^2)$)
- Larger batches than GP-BUCB
- Better regret and better scalability than Async-TS
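For reference, a common definition of the effective dimension in this line of work, written here under the assumption of a regularization parameter λ (see the paper for the exact constants used in the theorem):

```latex
% A common definition of the effective dimension / degrees of freedom
% (an assumption here; see the paper for the exact form). K_T is the
% kernel matrix of the selected points, lambda a regularization
% parameter, and lambda_i(K_T) the eigenvalues of K_T.
d_{\mathrm{eff}}(\lambda, T)
  = \operatorname{tr}\!\left( K_T \left( K_T + \lambda I \right)^{-1} \right)
  = \sum_{i=1}^{T} \frac{\lambda_i(K_T)}{\lambda_i(K_T) + \lambda}
```

When the kernel eigenvalues decay quickly, $d_{\text{eff}} \ll T$, which is what makes the $O(T d_{\text{eff}}^2)$ runtime near-linear in T.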

43. In practice: Scalability
Datasets: Cadata (A = 20640, d = 8, T = 2000) and NAS-bench-101 (A = 12416, d = 19, T = 12000).
[Plots: running time (sec) vs. t on each dataset, comparing Global-BBKB against Batch-GPUCB, BKB, GPUCB, async-TS, eps-Greedy, and Regularized evolution.]
