
Meta-Learning Contextual Bandit Exploration - Amr Sharaf, Hal Daumé III - PowerPoint PPT Presentation



  1. Meta-Learning Contextual Bandit Exploration. Amr Sharaf (University of Maryland, amr@cs.umd.edu) and Hal Daumé III (Microsoft Research & University of Maryland, me@hal3.name). Abstract.

  2. Can we learn to explore in contextual bandits?

  3. Contextual Bandits: News Display

  4. Contextual Bandits: News Display (figure: several candidate articles, each flagged NEW)

  5. Contextual Bandits: News Display (figure: one article flagged NEW)

  6. Contextual Bandits: News Display. Goal: Maximize Sum of Rewards.
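
The news-display example above can be made concrete with a small simulation of the contextual bandit protocol: at each round the learner sees a user context, chooses one article, and observes a reward only for the chosen article, with the goal of maximizing the sum of rewards. The sketch below is purely illustrative; the environment, dimensions, and the uniform-random placeholder policy are assumptions, not anything from the slides.

```python
import numpy as np

# Minimal contextual-bandit loop for the news-display example (illustrative only).
rng = np.random.default_rng(0)
n_articles, context_dim, horizon = 4, 10, 1000

# Hidden per-article weight vectors that determine click probabilities (unknown to the learner).
true_weights = rng.normal(size=(n_articles, context_dim))

total_reward = 0.0
for t in range(horizon):
    x_t = rng.normal(size=context_dim)              # user context for this visit
    click_probs = 1.0 / (1.0 + np.exp(-(true_weights @ x_t)))

    a_t = rng.integers(n_articles)                  # placeholder policy: explore uniformly
    r_t = float(rng.random() < click_probs[a_t])    # bandit feedback: reward only for the chosen arm
    total_reward += r_t                             # objective: maximize the sum of rewards

print(f"average reward: {total_reward / horizon:.3f}")
```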

  7. Training Mêlée by Imitation (diagram): roll-in with the learned policy π over examples/time steps 1 … t-1; at step t, deviate and consider both an exploit and an explore action, scoring each by rolling out with the expert π* to obtain loss_exploit and loss_explore; π* is available at training time; goal: learn π.
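
The roll-in / roll-out picture on this slide can be summarized in a short training routine: roll in with the learned policy π up to a random deviation point, try an exploit and an explore action there, and score each by rolling out with the expert π*. The sketch below assumes hypothetical interfaces (env.reset, env.step, pi.act, pi.greedy, pi.explore, pi.update, pi_star.rollout_return); it illustrates the roll-in/roll-out scheme, not the exact Mêlée algorithm.

```python
import random

def melee_training_round(env, pi, pi_star, horizon):
    """One imitation round following the slide's roll-in / roll-out picture (a sketch)."""
    t_dev = random.randrange(horizon)                 # where the one-step deviation happens
    state = env.reset()

    for _ in range(t_dev):                            # roll-in with the learned policy pi
        state, _ = env.step(state, pi.act(state))

    losses = {}
    for label, action in (("exploit", pi.greedy(state)), ("explore", pi.explore(state))):
        s, r = env.step(state, action)                # deviate with this action...
        # ...then roll out with the expert pi* for the remaining steps
        losses[label] = -(r + pi_star.rollout_return(env, s, horizon - t_dev - 1))

    pi.update(state, losses)                          # train pi to prefer the lower-loss choice
```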

  8. Generalization: Meta-Features
     - No direct dependency on the contexts x.
     - Features include:
       - Calibrated predicted probability p(a_t | f_t, x_t);
       - Entropy of the predicted probability distribution;
       - A one-hot encoding of the predicted action f_t(x_t);
       - The current time step t;
       - Average observed rewards for each action.
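
Because these meta-features depend only on the classifier's predictions and on running statistics, not on the raw context x, they can be assembled in a few lines. The helper below is a hedged sketch: the function name, the use of the argmax action as the "predicted action", and the feature ordering are assumptions for illustration.

```python
import numpy as np

def meta_features(probs, t, avg_rewards):
    """Assemble the meta-feature vector listed on the slide (illustrative sketch).

    probs: calibrated probability distribution over actions from the current classifier f_t;
    t: current time step; avg_rewards: average observed reward per action so far.
    """
    probs = np.asarray(probs, dtype=float)
    a_hat = int(np.argmax(probs))                      # predicted action f_t(x_t)

    one_hot = np.zeros_like(probs)
    one_hot[a_hat] = 1.0

    entropy = -np.sum(probs * np.log(probs + 1e-12))   # entropy of the predicted distribution

    return np.concatenate([
        [probs[a_hat]],                        # calibrated probability of the predicted action
        [entropy],
        one_hot,                               # one-hot encoding of the predicted action
        [float(t)],                            # current time step
        np.asarray(avg_rewards, dtype=float),  # average observed reward per action
    ])

# Example: 4 actions at time step 17.
print(meta_features([0.7, 0.1, 0.1, 0.1], t=17, avg_rewards=[0.4, 0.2, 0.1, 0.3]))
```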

  9. A representative learning curve (figure).

  10. Win / Loss Statistics. Win statistics: each (row, column) entry shows the number of times the row algorithm won against the column algorithm, minus the number of losses.

  11. Win / Loss Statistics (continued). Win statistics: each (row, column) entry shows the number of times the row algorithm won against the column algorithm, minus the number of losses.
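
The win/loss tables on these two slides can be computed directly from per-dataset results. The sketch below assumes a hypothetical scores[algorithm][dataset] layout and made-up numbers; it only shows how the "wins minus losses" entries are formed.

```python
import numpy as np

def win_loss_matrix(scores):
    """Entry (i, j): number of datasets where algorithm i beats algorithm j, minus losses."""
    algs = list(scores)
    datasets = list(next(iter(scores.values())))
    mat = np.zeros((len(algs), len(algs)), dtype=int)
    for i, a in enumerate(algs):
        for j, b in enumerate(algs):
            if i != j:
                wins = sum(scores[a][d] > scores[b][d] for d in datasets)
                losses = sum(scores[a][d] < scores[b][d] for d in datasets)
                mat[i, j] = wins - losses
    return algs, mat

# Toy example with made-up numbers, just to show the shape of the table.
scores = {"melee": {"d1": 0.9, "d2": 0.8}, "eps-greedy": {"d1": 0.8, "d2": 0.85}}
print(win_loss_matrix(scores))
```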

  12. Theoretical Guarantees
     - The no-regret property of AggreVaTe can be leveraged in our meta-learning setting.
     - We relate the regret of the learner to the overall regret of π.
     - This shows that, if the underlying classifier improves sufficiently quickly, Mêlée will achieve sublinear regret.
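
The claim on this slide has the shape of a standard imitation-learning reduction. The display below is a heavily hedged sketch of that shape only (not the paper's actual theorem, constants, or conditions): if the per-round regret of the underlying cost-sensitive classifier shrinks quickly enough, the summed contextual-bandit regret is sublinear.

```latex
% Illustrative sketch of the reduction's shape; epsilon_t and the bound are
% placeholders, not the paper's stated result.
\[
  \mathrm{Regret}_{\mathrm{bandit}}(T) \;\lesssim\; \sum_{t=1}^{T} \epsilon_t,
  \qquad\text{so}\qquad
  \sum_{t=1}^{T} \epsilon_t = o(T) \;\Longrightarrow\; \mathrm{Regret}_{\mathrm{bandit}}(T) = o(T),
\]
where $\epsilon_t$ denotes the regret of the underlying cost-sensitive classifier at round $t$.
```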

  13. Conclusion
     - Q: Can we learn to explore in contextual bandits?
     - A: Yes, by imitating an expert exploration policy;
     - Generalize across bandit problems using meta-features;
     - Outperform alternative strategies in most settings;
     - We provide theoretical guarantees.
