


SLIDE 1

Warm-starting contextual bandits: robustly combining supervised and bandit feedback

Chicheng Zhang1; Alekh Agarwal1; Hal Daumé III1,2; John Langford1; Sahand Negahban3

1Microsoft Research, 2University of Maryland, 3Yale University

SLIDE 2

Warm-starting contextual bandits

  • For timestep t = 1, 2, …, T:
  • Observe context x_t, with associated cost vector c_t = (c_t(1), …, c_t(K)) drawn from distribution D
  • Take an action a_t ∈ {1, …, K}
  • Receive cost c_t(a_t) ∈ [0,1]
  • Goal: incur low cumulative cost: Σ_{t=1}^{T} c_t(a_t)

[Diagram: interaction loop between learning algorithm and user — the user reveals context x_t, the algorithm plays action a_t, and the user returns cost c_t(a_t); the full cost vector c_t stays hidden]
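As an illustration of this protocol, here is a minimal sketch (not from the talk) of the bandit interaction loop. The linear-cost environment, the epsilon-greedy strategy, and all parameter values are invented for the example; the slide only specifies the protocol itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem sizes: T rounds, K actions, d-dimensional contexts.
T, K, d = 1000, 4, 5

# Hypothetical environment: each action's cost is a noisy linear
# function of the context, clipped to [0, 1].
true_w = rng.uniform(-1, 1, size=(K, d))

def draw_example():
    x = rng.uniform(0, 1, size=d)                                  # context x_t
    c = np.clip(true_w @ x + 0.1 * rng.standard_normal(K), 0, 1)   # cost vector c_t
    return x, c

# Epsilon-greedy learner with per-action ridge-regression estimates.
A = np.stack([np.eye(d) for _ in range(K)])  # regularized Gram matrix per action
b = np.zeros((K, d))
eps = 0.1

total_cost = 0.0
for t in range(T):
    x, c = draw_example()        # the learner will only ever see c[a_t]
    w_hat = np.array([np.linalg.solve(A[a], b[a]) for a in range(K)])
    if rng.random() < eps:
        a_t = int(rng.integers(K))        # explore uniformly
    else:
        a_t = int(np.argmin(w_hat @ x))   # exploit: lowest predicted cost
    cost = c[a_t]                # bandit feedback: only c_t(a_t) is revealed
    A[a_t] += np.outer(x, x)
    b[a_t] += cost * x
    total_cost += cost

print(total_cost / T)  # average cost per round
```

The key constraint the sketch enforces is the bandit one: only the cost of the chosen action feeds back into the estimates, which is why exploration is needed at all.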

SLIDE 3

Warm-starting contextual bandits

  • Receive warm-starting examples S = {(x, c)} ~ W
  • For timestep t = 1, 2, …, T:
  • Observe context x_t, with associated cost vector c_t = (c_t(1), …, c_t(K)) drawn from distribution D
  • Take an action a_t ∈ {1, …, K}
  • Receive cost c_t(a_t) ∈ [0,1]
  • Goal: incur low cumulative cost: Σ_{t=1}^{T} c_t(a_t)

[Diagram: the same interaction loop, but the fully labeled warm-start set S is also available to the learning algorithm before the bandit rounds begin]
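One natural use of the fully labeled set S is to pretrain a policy before any bandit round is played. The sketch below assumes a hypothetical per-action linear cost model and least-squares fitting; it is only one illustrative way to exploit S, not the paper's method. Note that S carries full cost vectors (supervised feedback), unlike the bandit rounds.

```python
import numpy as np

rng = np.random.default_rng(1)

K, d, n_warm = 4, 5, 200

# Hypothetical warm-start set S = {(x, c)} with FULL cost vectors,
# drawn from a source distribution W that may differ from D
# (modeled here by perturbing the cost weights).
true_w = rng.uniform(-1, 1, size=(K, d))
shift = 0.2 * rng.uniform(-1, 1, size=(K, d))   # W != D

X_warm = rng.uniform(0, 1, size=(n_warm, d))
C_warm = np.clip(X_warm @ (true_w + shift).T, 0, 1)

# Warm start: initialize per-action least-squares estimates from S,
# so the bandit loop begins with a non-trivial policy instead of
# exploring from scratch.
A = np.stack([np.eye(d) + X_warm.T @ X_warm for _ in range(K)])
b = np.array([X_warm.T @ C_warm[:, a] for a in range(K)])
w_hat = np.array([np.linalg.solve(A[a], b[a]) for a in range(K)])

def initial_policy(x):
    """Greedy policy learned purely from the warm-start data."""
    return int(np.argmin(w_hat @ x))

print(initial_policy(np.full(d, 0.5)))
```

Because every example in S is fully labeled, all K per-action regressions get updated from each warm-start example, whereas a bandit round updates only the chosen action's estimate.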

SLIDE 4

Warm-starting contextual bandits: motivation

  • Some labeled examples often exist in applications, e.g.
  • News recommendation: editorial relevance annotations
  • Healthcare: historical medical records w/ prescribed treatments
  • Leveraging historical data can reduce unsafe exploration


SLIDE 5

Warm-starting contextual bandits: motivation

  • Some labeled examples often exist in applications, e.g.
  • News recommendation: editorial relevance annotations
  • Healthcare: historical medical records w/ prescribed treatments
  • Leveraging historical data can reduce unsafe exploration


Key Challenge: W may not be the same as D

  • Editors fail to capture users’ preferences
  • Medical record data from another population

How to utilize the warm-starting examples robustly and effectively?

SLIDE 6

Algorithm & performance guarantees

ARRoW-CB: iteratively finds the best relative weighting of warm-start and bandit examples to rapidly learn a good policy

SLIDE 7

Algorithm & performance guarantees

  • Theorem (informal):

Compared to algorithms that ignore S,* the regret of ARRoW-CB is

  • never much worse (robustness)
  • much smaller, if W and D are close enough, and |S| is large enough

ARRoW-CB: iteratively finds the best relative weighting of warm-start and bandit examples to rapidly learn a good policy

*S ~ W is the warm-start data
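The weighting idea can be illustrated on a toy problem. The sketch below is NOT ARRoW-CB's actual update rule; it is a simplified stand-in showing the core mechanic the slide describes: sweep candidate relative weightings of warm-start versus bandit examples and keep the one that performs best on fresh bandit data. The mean-estimation setup and all constants are invented.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 1-D stand-in for policy learning: the warm-start source W is
# biased relative to the bandit distribution D, and bandit data is scarce.
warm = rng.normal(0.6, 0.1, size=500)    # plentiful samples "from W" (biased)
bandit = rng.normal(0.5, 0.1, size=60)   # scarce samples "from D"

train, val = bandit[:40], bandit[40:]

def fit(lam):
    # For squared loss, the minimizer of
    #   lam * loss(warm-start) + (1 - lam) * loss(bandit train)
    # is the correspondingly weighted average of the two sample means.
    return lam * warm.mean() + (1 - lam) * train.mean()

# Sweep a grid of relative weightings; validate each candidate on
# held-out bandit data and keep the best.
grid = np.linspace(0.0, 1.0, 11)
errors = [np.mean((val - fit(lam)) ** 2) for lam in grid]
best_lam = float(grid[int(np.argmin(errors))])
print(best_lam, fit(best_lam))
```

The robustness claim maps onto the endpoints of the grid: lambda = 0 recovers the bandit-only estimate, so a well-chosen weighting can never do much worse, while a nonzero lambda helps exactly when W is close to D and S is large.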

SLIDE 8

Empirical evaluation

  • 524 datasets from openml.org
  • CDFs of normalized errors

[Plot: schematic CDF curves for two algorithms; x-axis: normalized error e; y-axis: % of settings with error ≤ e]

SLIDE 9

Empirical evaluation

  • 524 datasets from openml.org
  • CDFs of normalized errors


  • Moderate noise setting
  • Algorithms:

ARRoW-CB, Sup-Only, Bandit-Only, Sim-Bandit (uses both sources)

[Plot: CDFs of normalized error e for ARRoW-CB, Sup-Only, Bandit-Only, and Sim-Bandit; y-axis: % of settings with error ≤ e]
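To make the evaluation plot easier to read, here is a sketch of how such a normalized-error CDF curve is computed. The error values are synthetic and the algorithm names are reused purely as labels; none of these numbers come from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical normalized final errors (in [0, 1]) for two algorithms
# across many datasets/settings.
n_datasets = 524
errors = {
    "ARRoW-CB": rng.beta(2, 5, size=n_datasets),
    "Bandit-Only": rng.beta(3, 3, size=n_datasets),
}

def cdf_at(e, thresholds):
    """Fraction of settings with error <= each threshold."""
    e = np.sort(e)
    return np.searchsorted(e, thresholds, side="right") / len(e)

# Each curve: at threshold e on the x-axis, the height is the fraction
# of settings whose normalized error is <= e, so curves that rise
# faster (up and to the left) indicate a better algorithm.
thresholds = np.linspace(0.0, 1.0, 101)
curves = {name: cdf_at(e, thresholds) for name, e in errors.items()}

for name, ys in curves.items():
    print(name, ys[50])  # fraction of settings with error <= 0.5
```

This aggregation is a common way to compare algorithms over hundreds of datasets at once, since it shows the whole error distribution rather than a single average.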

SLIDE 10


Poster Thu #52