online learning with kernel losses

Online Learning with Kernel Losses Aldo Pacchiano UC Berkeley - PowerPoint PPT Presentation


  1. Online Learning with Kernel Losses. Aldo Pacchiano, UC Berkeley. Joint work with Niladri Chatterji and Peter Bartlett.

  2. Talk Overview • Intro to Online Learning • Linear Bandits • Kernel Bandits

  10. Online Learning. For rounds $t = 1, \dots, n$, a learner plays against an adversary. The learner chooses an action $a_t \in \mathcal{A}$. The adversary reveals a loss (or reward) $\ell_t \in \mathcal{W}$; the losses can be i.i.d. or adversarial. The learner's objective is to minimize the regret $R(n) = \sum_{t=1}^{n} \ell_t(a_t) - \min_{a^* \in \mathcal{A}} \sum_{t=1}^{n} \ell_t(a^*)$.
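
The regret definition above can be made concrete with a small sketch. It assumes a finite action set and a hypothetical table layout `losses[t][a]` (neither is specified in the slides); the function name and the toy numbers are illustrative only.

```python
def regret(losses, actions):
    """Regret of a played sequence against the best fixed action in hindsight.

    losses[t][a] is the loss of action a in round t (assumed table layout);
    actions[t] is the index of the action the learner played in round t.
    """
    learner_total = sum(losses[t][a] for t, a in enumerate(actions))
    num_actions = len(losses[0])
    # Cumulative loss of each fixed action, minimized over actions.
    best_fixed_total = min(
        sum(round_losses[a] for round_losses in losses)
        for a in range(num_actions)
    )
    return learner_total - best_fixed_total


# Toy example: two actions, three rounds.
losses = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
actions = [0, 1, 1]
print(regret(losses, actions))  # learner pays 2.0, best fixed action pays 1.0
```

Note that the comparator is a single action fixed for all rounds, not the best action in each round.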

  14. Full information vs. bandit feedback. Full information: the learner gets to see all of $\ell_t(\cdot)$. Bandit feedback: the learner only sees the value $\ell_t(a_t)$.

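  19. Multi-Armed Bandits. [Figure: three arms with reward distributions $P_1, P_2, P_3$ and means $\mu_1, \mu_2, \mu_3$.] The learner chooses $a_t \in \{1, \dots, K\}$ and gets reward $X_{a_t} \sim P_{a_t}$. Regret: $R(n) = \max_{a^* \in \{1, \dots, K\}} n \mu_{a^*} - \mathbb{E}\left[\sum_{t=1}^{n} X_{a_t}\right]$. MAB regret: $R(n) = O(\sqrt{K n \log(n)})$ [Auer et al. 2002].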
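
The slide states only the regret bound. As a rough illustration, here is a minimal sketch of UCB1, one algorithm with a guarantee of this form, assuming the citation refers to the finite-time UCB analysis. The Bernoulli arms, their means, the horizon and the seed are all illustrative assumptions, not values from the talk.

```python
import math
import random


def ucb1(means, n, seed=0):
    """Run UCB1 on Bernoulli arms with the given means for n >= len(means) rounds.

    Returns (pull counts per arm, pseudo-regret of the pulls made).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, n + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialize the estimates
        else:
            # Empirical mean plus an upper-confidence bonus.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    # Pseudo-regret: n * mu_best minus the summed means of the pulled arms.
    pseudo_regret = n * max(means) - sum(c * m for c, m in zip(counts, means))
    return counts, pseudo_regret


counts, pseudo_regret = ucb1([0.2, 0.5, 0.8], n=2000)
print(counts, round(pseudo_regret, 1))
```

The pull counts should concentrate on the arm with the largest mean, which is what keeps the regret sublinear in $n$.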

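  21. Structured losses. Example: packet routing on a network $(V, E)$. Arms are paths, $a_t \in \mathcal{A} \subset \{0, 1\}^E$. The loss is the delay, with $w_t \in \mathcal{W} = [0, 1]^E$. Treating every path as a separate arm gives exponential MAB regret: $R(n) = O(\sqrt{|\text{num paths}| \cdot n \log(n)})$. But the delay is linear: $\langle a_t, w_t \rangle$.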

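  28. Linear Bandits. The learner chooses an action $a_t \in \mathcal{A} \subset \mathbb{R}^d$. The adversary's loss is $\ell_t(a) = \langle w_t, a \rangle$ for $w_t \in \mathcal{W} \subset \mathbb{R}^d$; the losses can be i.i.d. or adversarial. The learner only experiences $\langle w_t, a_t \rangle$. Expected regret: $R(n) = \mathbb{E}\left[\sum_{t=1}^{n} \langle w_t, a_t \rangle\right] - \inf_{a \in \mathcal{A}} \sum_{t=1}^{n} \langle w_t, a \rangle$. MAB reduces to linear bandits with $\mathcal{A} = \{e_1, \dots, e_d\}$ and $\mathcal{W} = [0, 1]^d$.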
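
The reduction from MAB to linear bandits can be checked directly: playing the basis vector $e_i$ against the loss vector $w_t$ yields the $i$-th coordinate of $w_t$, which is the loss of arm $i$. A minimal sketch, with an illustrative loss vector and hypothetical helper names:

```python
def inner(u, v):
    """Euclidean inner product <u, v>."""
    return sum(x * y for x, y in zip(u, v))


def basis_vector(i, d):
    """Standard basis vector e_i in R^d (0-indexed)."""
    return [1.0 if j == i else 0.0 for j in range(d)]


w_t = [0.3, 0.9, 0.1]  # illustrative per-arm losses in [0, 1]^d
d = len(w_t)
# Linear loss <w_t, e_i> for each basis-vector action.
arm_losses = [inner(w_t, basis_vector(i, d)) for i in range(d)]
print(arm_losses == w_t)  # prints True
```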

  35. Exponential weights for adversarial linear bandits. For $t = 1, \dots, n$: (1) sample $a_t \sim p_t = (1 - \gamma) q_t + \gamma \nu$, a mixture of the exploitation distribution $q_t$ and the exploration distribution $\nu$; (2) see $\langle w_t, a_t \rangle$; (3) build the loss estimator $\hat{w}_t$; (4) update with exponential weights, $q_t(a) \propto \exp(-\eta \langle \hat{w}_t, a \rangle)\, q_{t-1}(a)$.
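
The loop above can be sketched for the simplest case. This is a specialization chosen to keep the code short, not the general algorithm from the talk: the action set is assumed to be the basis vectors $\{e_1, \dots, e_d\}$ and $\nu$ is assumed uniform, so $\mathbb{E}_{a \sim p_t}[a a^\top]$ is the diagonal matrix of $p_t$ and the estimator $(\Sigma_t)^{-1} a_t \langle w_t, a_t \rangle$ reduces to importance weighting. The function name, step sizes and loss sequence are hypothetical.

```python
import math
import random


def exp_weights_basis(loss_vectors, eta, gamma, seed=0):
    """Exponential weights with bandit feedback over actions e_1, ..., e_d."""
    rng = random.Random(seed)
    d = len(loss_vectors[0])
    q = [1.0 / d] * d   # exploitation distribution, starts uniform
    nu = [1.0 / d] * d  # exploration distribution (assumed uniform)
    total_loss = 0.0
    for w_t in loss_vectors:
        # Mixture of exploitation and exploration.
        p = [(1.0 - gamma) * q[i] + gamma * nu[i] for i in range(d)]
        a_t = rng.choices(range(d), weights=p)[0]
        observed = w_t[a_t]  # bandit feedback: only <w_t, e_{a_t}>
        total_loss += observed
        # With basis-vector actions Sigma_t = diag(p), so the estimator is
        # zero except at coordinate a_t, which holds observed / p[a_t].
        w_hat = [0.0] * d
        w_hat[a_t] = observed / p[a_t]
        # Exponential-weights update, then renormalize.
        unnormalized = [q[i] * math.exp(-eta * w_hat[i]) for i in range(d)]
        z = sum(unnormalized)
        q = [u / z for u in unnormalized]
    return q, total_loss


# Arm 0 always has loss 0, the other arms always have loss 1.
loss_vectors = [[0.0, 1.0, 1.0]] * 500
q, total_loss = exp_weights_basis(loss_vectors, eta=0.05, gamma=0.1)
print([round(x, 4) for x in q], total_loss)
```

Because $p_t \ge \gamma \nu$, every arm keeps probability at least $\gamma / d$, which keeps the importance weights bounded.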

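  38. Exponential weights: $q_t(a) \propto \exp\left(-\eta \sum_{i=1}^{t} \langle \hat{w}_i, a \rangle\right)$. [Figure: the action set $\mathcal{A}$ with the cumulative estimate $\sum_{i=1}^{t} \hat{w}_i$, and the resulting distribution $q_t$ over $\mathcal{A}$.]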

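  42. Unbiased estimator of the loss. Let $\Sigma_t = \mathbb{E}_{a \sim p_t}\left[a a^\top\right]$ and set $\hat{w}_t = (\Sigma_t)^{-1} a_t \langle w_t, a_t \rangle$. Then $\hat{w}_t$ is an unbiased estimator of $w_t$: $\mathbb{E}_{a_t \sim p_t}\left[\hat{w}_t \mid \mathcal{F}_{t-1}\right] = \left(\mathbb{E}_{a \sim p_t}\left[a a^\top\right]\right)^{-1} \mathbb{E}_{a_t \sim p_t}\left[a_t \langle w_t, a_t \rangle \mid \mathcal{F}_{t-1}\right] = \left(\mathbb{E}_{a \sim p_t}\left[a a^\top\right]\right)^{-1} \mathbb{E}_{a_t \sim p_t}\left[a_t a_t^\top \mid \mathcal{F}_{t-1}\right] w_t = w_t$.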
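
The unbiasedness identity can be verified exactly on a tiny instance. The sketch below uses a hypothetical three-point action set in $\mathbb{R}^2$, a hypothetical sampling distribution $p_t$ and a hypothetical loss vector $w_t$, and works in exact rational arithmetic so the check involves no rounding.

```python
from fractions import Fraction as F

actions = [(F(1), F(0)), (F(0), F(1)), (F(1), F(1))]  # hypothetical action set
p_t = [F(1, 2), F(1, 4), F(1, 4)]                     # hypothetical sampling distribution
w_t = (F(3), F(-2))                                   # hidden loss vector


def inner(u, v):
    return sum(x * y for x, y in zip(u, v))


# Sigma_t = E_{a ~ p_t}[a a^T], a 2x2 matrix.
sigma = [
    [sum(p * a[i] * a[j] for p, a in zip(p_t, actions)) for j in range(2)]
    for i in range(2)
]
det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
sigma_inv = [
    [sigma[1][1] / det, -sigma[0][1] / det],
    [-sigma[1][0] / det, sigma[0][0] / det],
]


def estimator(a):
    """w_hat = Sigma_t^{-1} a <w_t, a> for the played action a."""
    scale = inner(w_t, a)  # the only feedback the learner observes
    return tuple(scale * inner(row, a) for row in sigma_inv)


# Average the estimator over a_t ~ p_t.
expected_w_hat = tuple(
    sum(p * estimator(a)[i] for p, a in zip(p_t, actions)) for i in range(2)
)
print(expected_w_hat == w_t)  # prints True
```

The check requires $\Sigma_t$ to be invertible, which here holds because the actions span $\mathbb{R}^2$ and each has positive probability.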

  43. Linear bandits regret. Theorem (Linear Bandits Regret) [see, for example, Bubeck '11]: $R(n) \le \gamma n + \frac{\log(|\mathcal{A}|)}{\eta} + \eta \sum_{t=1}^{n} \mathbb{E}\, \mathbb{E}_{a \sim p_t}\left[\langle \hat{w}_t, a \rangle^2\right]$. Resulting bounds by choice of exploration distribution: exploration over a barycentric spanner [Dani, Hayes, Kakade '08], $O(d \sqrt{n \log(|\mathcal{A}|)}) = O(d^{3/2} \sqrt{n})$; uniform over $\mathcal{A}$ [Cesa-Bianchi, Lugosi '12], $O(\sqrt{d n \log(|\mathcal{A}|)}) = O(d \sqrt{n})$; John's distribution [Bubeck, Cesa-Bianchi, Kakade '12], $O(d \sqrt{n})$.

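  46. Linear bandits regret: dimension dependence. Variance bound: $\mathbb{E}\left[\mathbb{E}_{a \sim p_t}\left[\langle \hat{w}_t, a \rangle^2\right]\right] \le d$. Plugging this into $R(n) \le \gamma n + \frac{\log(|\mathcal{A}|)}{\eta} + \eta \sum_{t=1}^{n} \mathbb{E}\, \mathbb{E}_{a \sim p_t}\left[\langle \hat{w}_t, a \rangle^2\right]$ bounds the last term by $\eta d n$.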
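
As a short worked step, the variance bound shows where the $\sqrt{d n \log(|\mathcal{A}|)}$ rate comes from. The sketch below sets aside the $\gamma n$ term, whose size depends on the exploration distribution and on constraints linking $\gamma$ and $\eta$ that the slides do not spell out, and only balances the two remaining terms.

```latex
\begin{align*}
f(\eta) &= \frac{\log(|\mathcal{A}|)}{\eta} + \eta\, d\, n, \\
f'(\eta) = 0 \;&\Longrightarrow\; \eta^{*} = \sqrt{\frac{\log(|\mathcal{A}|)}{d\, n}}, \\
f(\eta^{*}) &= 2 \sqrt{d\, n \log(|\mathcal{A}|)}.
\end{align*}
```

With this choice the two $\eta$-dependent terms are equal, which is the usual way such bounds are tuned.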

  47. Recap • Intro to Online Learning • Linear Bandits • Kernel Bandits

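  48. Online Quadratic Losses. Actions: $a_t \in \mathcal{A} = \{a \text{ s.t. } \|a\|_2 \le 1\}$. Losses: $\ell_t(a) = \langle b_t, a \rangle + a^\top B_t a$, with $B_t$ symmetric and the loss possibly non-convex. The offline problem $\min_{a \in \mathcal{A}} \ell_t(a)$ has a polynomial-time solution (strong duality). [Figure: plot of the example surface $z = x^2 - 0.5 y^2 + x y - 0.5 x + 0.5 y + 1$; the slide also carries the label "Covfefe" and photos of Peter Bartlett and Niladri Chatterji.]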

  49. Linearization of quadratic losses. Quadratic losses are linear in the space of (matrix, vector) pairs: $\ell(a) = \langle b_t, a \rangle + a^\top B_t a = \left\langle \begin{pmatrix} B_t \\ b_t \end{pmatrix}, \begin{pmatrix} a a^\top \\ a \end{pmatrix} \right\rangle$. We can therefore use the linear bandits machinery: exponential weights for quadratic bandits.
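
The linearization can be checked numerically: flattening $(a a^\top, a)$ and $(B_t, b_t)$ into long vectors turns the quadratic loss into a plain inner product. A minimal sketch with illustrative integer values and hypothetical helper names:

```python
def inner(u, v):
    return sum(x * y for x, y in zip(u, v))


def quadratic_loss(a, B, b):
    """Direct evaluation of <b, a> + a^T B a."""
    Ba = [inner(row, a) for row in B]
    return inner(b, a) + inner(a, Ba)


def lift_action(a):
    """Flattened (a a^T, a): the feature vector of action a."""
    return [x * y for x in a for y in a] + list(a)


def lift_loss(B, b):
    """Flattened (B, b): the adversary's vector in the lifted space."""
    return [entry for row in B for entry in row] + list(b)


a = [1, -2]
B = [[2, 1], [1, -3]]  # symmetric and indefinite, so the loss is non-convex
b = [4, 5]

direct = quadratic_loss(a, B, b)
lifted = inner(lift_loss(B, b), lift_action(a))
print(direct, lifted)  # prints -20 -20
```

The lifted space has dimension $d^2 + d$, so the dimension that enters the linear bandit bounds grows accordingly.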

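  50. Exponential weights for adversarial quadratic bandits. For $t = 1, \dots, n$: (1) sample $a_t \sim p_t = (1 - \gamma) q_t + \gamma \nu$, a mixture of the exploitation distribution $q_t$ and the exploration distribution $\nu$; (2) see $\langle b_t, a_t \rangle + a_t^\top B_t a_t$; (3) build the loss estimator $\begin{pmatrix} \hat{B}_t \\ \hat{b}_t \end{pmatrix}$; (4) update with exponential weights, $q_t(a) \propto \exp\left(-\eta \left(\langle \hat{b}_t, a \rangle + a^\top \hat{B}_t a\right)\right) q_{t-1}(a)$. Sampling is polynomial time.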

  51. Beyond "Finite Dimensional" Losses. Evasion games: obstacle avoidance. $\ell_t(a) = \exp(-\|a - w_t\|^2)$. Gaussian kernel: infinite dimensional.
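
For intuition, the Gaussian kernel loss is largest when the action coincides with the obstacle position $w_t$ and decays with squared distance. A minimal sketch, with a hypothetical function name and illustrative points:

```python
import math


def gaussian_kernel_loss(a, w_t):
    """l_t(a) = exp(-||a - w_t||^2): highest when the action sits on the obstacle."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, w_t))
    return math.exp(-sq_dist)


obstacle = [0.0, 0.0]
on_obstacle = gaussian_kernel_loss([0.0, 0.0], obstacle)
near = gaussian_kernel_loss([1.0, 0.0], obstacle)
far = gaussian_kernel_loss([2.0, 2.0], obstacle)
print(on_obstacle, near, far)
```

Unlike the quadratic case, this loss has no finite-dimensional lifting of the kind used for slide 49, which is the point of the slide's "infinite dimensional" remark.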
