Meta-Learning Contextual Bandit Exploration, by Amr Sharaf and Hal Daumé III (PowerPoint presentation)



SLIDE 1

Meta-Learning Contextual Bandit Exploration

Amr Sharaf, University of Maryland, amr@cs.umd.edu
Hal Daumé III, Microsoft Research & University of Maryland, me@hal3.name

Abstract

SLIDE 2

Can we learn to explore in contextual bandits?


SLIDE 3

Contextual Bandits: News Display


SLIDE 4

Contextual Bandits: News Display


SLIDE 5

Contextual Bandits: News Display


SLIDE 6

Contextual Bandits: News Display


Goal: Maximize Sum of Rewards
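The interaction protocol behind these news-display slides can be sketched as a simple loop (a toy sketch: the environment, reward rule, and function names are illustrative, not from the paper):

```python
import random

def run_bandit(policy, rounds, n_actions=3, seed=0):
    """Simulate the contextual bandit protocol: at each round, observe a
    context, choose one action (e.g. which news article to display), and
    observe the reward for that action only (bandit feedback)."""
    rng = random.Random(seed)
    total_reward = 0.0
    for _ in range(rounds):
        context = [rng.random() for _ in range(4)]    # e.g. user features
        action = policy(context, n_actions)
        # Toy reward rule: one context-dependent action pays off.
        best = int(context[0] * n_actions) % n_actions
        total_reward += 1.0 if action == best else 0.0
    return total_reward

# A uniformly random policy: the simplest exploration baseline.
def uniform_policy(context, n_actions):
    return random.randrange(n_actions)
```

The learner only ever sees the reward of the chosen action, which is why some exploration strategy is needed at all.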

SLIDE 7

Training Mêlée by Imitation


Access to the expert policy π* at training time; goal: learn an exploration policy π.

[Diagram: over examples/time, roll in with the learned policy π for steps 1 through t-1, deviate at step t to either explore or exploit, and roll out with π* to estimate the loss of each deviation.]
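The roll-in/roll-out procedure on this slide can be sketched on a toy problem (everything below, including the environment and the per-action deviations, is illustrative; in Mêlée the candidate deviations are "explore" vs. "exploit"):

```python
import copy

GOOD = [1, 0, 1, 0, 1]  # toy data: hidden best action at each step

class ToyEpisode:
    """Minimal stand-in for an episode: reward 1 when the chosen
    action matches the hidden best action for the current step."""
    def __init__(self):
        self.t, self.total = 0, 0.0

    def step(self, action):
        self.total += 1.0 if action == GOOD[self.t] else 0.0
        self.t += 1

def rollout_value(episode, policy):
    """Finish a copy of the episode with `policy`; return its total reward."""
    ep = copy.deepcopy(episode)
    while ep.t < len(GOOD):
        ep.step(policy(ep.t))
    return ep.total

def collect_example(pi, pi_star, t):
    """One cost-sensitive training example: roll in with the learned
    policy pi up to step t, score each one-step deviation by rolling
    out with the expert pi_star, and return loss = -reward for each."""
    ep = ToyEpisode()
    while ep.t < t - 1:                  # roll-in with pi
        ep.step(pi(ep.t))
    costs = {}
    for a in (0, 1):                     # candidate deviations at step t
        branch = copy.deepcopy(ep)
        branch.step(a)
        costs[a] = -rollout_value(branch, pi_star)
    return costs
```

The resulting per-deviation costs are what the imitation learner trains on: the meta-policy learns to pick the deviation the expert roll-out scores best.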

SLIDE 8

Generalization: Meta-Features

  • No direct dependency on the contexts x.
  • Features include:
      • calibrated predicted probability p(a_t | f_t, x_t);
      • entropy of the predicted probability distribution;
      • a one-hot encoding for the predicted action f_t(x_t);
      • the current time step t;
      • average observed rewards for each action.
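The feature list above can be assembled in a few lines (a sketch; the function name and input interface are assumptions, but the features mirror the list):

```python
import math

def meta_features(probs, predicted_action, t, avg_rewards):
    """Build the meta-feature vector: calibrated probability of the
    predicted action, entropy of the distribution, a one-hot encoding
    of the predicted action, the time step, and per-action average
    rewards. Note that nothing here looks at the raw context x."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    one_hot = [1.0 if a == predicted_action else 0.0
               for a in range(len(probs))]
    return [probs[predicted_action], entropy, *one_hot,
            float(t), *avg_rewards]
```

Because the features depend only on the classifier's outputs and running statistics, the same meta-policy can transfer across bandit problems with different context spaces.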


SLIDE 9

A representative learning curve


SLIDE 10

Win / Loss Statistics

Win statistics: each (row, column) entry shows the number of times the row algorithm won against the column, minus the number of losses.


SLIDE 11

Win / Loss Statistics

Win statistics: each (row, column) entry shows the number of times the row algorithm won against the column, minus the number of losses.


SLIDE 12

Theoretical Guarantees

  • The no-regret property of Aggrevate can be leveraged in our meta-learning setting.
  • We relate the regret of the learner to the overall regret of π.
  • This shows that, if the underlying classifier improves sufficiently quickly, Mêlée will achieve sublinear regret.
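"Sublinear regret" here is the standard notion (a sketch of the definition only, not the paper's exact statement or constants): the cumulative reward gap to the best fixed policy in hindsight grows slower than the horizon T.

```latex
\text{Regret}(T) \;=\; \max_{\pi^\dagger \in \Pi} \sum_{t=1}^{T} r_t\bigl(\pi^\dagger(x_t)\bigr) \;-\; \sum_{t=1}^{T} r_t(a_t),
\qquad
\frac{\text{Regret}(T)}{T} \to 0 \ \text{ as } T \to \infty.
```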


SLIDE 13

Conclusion

  • Q: Can we learn to explore in contextual bandits?
  • A: Yes, by imitating an expert exploration policy;
  • Mêlée generalizes across bandit problems using meta-features;
  • it outperforms alternative exploration strategies in most settings;
  • and we provide theoretical guarantees.
