Warm-starting contextual bandits: robustly combining supervised and bandit feedback. Chicheng Zhang 1, Alekh Agarwal 1, Hal Daumé III 1,2, John Langford 1, Sahand Negahban 3. 1 Microsoft Research, 2 University of Maryland, 3 Yale University


  1. Warm-starting contextual bandits: robustly combining supervised and bandit feedback. Chicheng Zhang 1, Alekh Agarwal 1, Hal Daumé III 1,2, John Langford 1, Sahand Negahban 3. 1 Microsoft Research, 2 University of Maryland, 3 Yale University

  2. Warm-starting contextual bandits
  • For timestep t = 1, 2, …, T:
    • Observe context x_t with associated cost vector c_t = (c_t(1), …, c_t(K)); (x_t, c_t) is drawn from distribution D
    • Take an action a_t ∈ {1, …, K}
    • Receive cost c_t(a_t) ∈ [0, 1]
  • Goal: incur low cumulative cost Σ_{t=1}^T c_t(a_t)
  [Diagram: learning algorithm interacting with the user over T rounds]
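The interaction protocol on this slide can be sketched in code. This is only an illustration of the protocol, not the paper's algorithm: the epsilon-greedy strategy, the per-action linear cost models, and the synthetic distribution below are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sketch of the contextual bandit protocol (not ARRoW-CB):
# an epsilon-greedy learner with one linear cost model per action.
# Dimensions, epsilon, and the synthetic distribution D are made up here.
d, K, T, eps = 5, 3, 2000, 0.1
true_w = rng.normal(size=(K, d))      # hidden parameters defining D
A = [np.eye(d) for _ in range(K)]     # per-action least-squares statistics
b = [np.zeros(d) for _ in range(K)]

total_cost = 0.0
for t in range(T):
    x = rng.normal(size=d)                 # observe context x_t
    c = 1 / (1 + np.exp(-true_w @ x))      # hidden cost vector c_t in [0, 1]
    if rng.random() < eps:                 # explore uniformly at random
        a = int(rng.integers(K))
    else:                                  # exploit current cost estimates
        a = int(np.argmin([np.linalg.solve(A[k], b[k]) @ x for k in range(K)]))
    total_cost += c[a]                     # only c_t(a_t) is observed
    A[a] += np.outer(x, x)                 # update the chosen action's model
    b[a] += c[a] * x

print(total_cost / T)                      # average cost; lower is better
```

Uniform random play would incur average cost around 0.5 on this synthetic problem, so an average well below that indicates the learner is exploiting bandit feedback.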

  3. Warm-starting contextual bandits
  • Receive warm-starting examples S = {(x, c)} ~ W (fully labeled)
  • For timestep t = 1, 2, …, T:
    • Observe context x_t with associated cost vector c_t = (c_t(1), …, c_t(K)); (x_t, c_t) is drawn from distribution D
    • Take an action a_t ∈ {1, …, K}
    • Receive cost c_t(a_t) ∈ [0, 1]
  • Goal: incur low cumulative cost Σ_{t=1}^T c_t(a_t)
  [Diagram: learning algorithm interacting with the user over T rounds, with fully labeled set S as additional input]
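The difference from the previous slide is the supervised phase: each warm-start example is fully labeled, so it reveals the cost of every action at once. A minimal sketch of that phase, under the assumption (made up for this example) that costs are a clipped linear function of the context and that W equals D:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sketch of the warm-start phase (illustrative, not ARRoW-CB): fully labeled
# examples reveal the whole cost vector, so every action's model can be
# updated from each example before any bandit interaction begins.
d, K, n_warm = 5, 3, 500
w_star = rng.normal(size=(K, d))           # hidden cost parameters
A = np.eye(d)                              # shared least-squares statistics
b = np.zeros((K, d))

for _ in range(n_warm):                    # supervised phase: (x, c) ~ W
    x = rng.normal(size=d)
    c = np.clip(0.5 + 0.1 * (w_star @ x), 0.0, 1.0)  # all K costs observed
    A += np.outer(x, x)                    # one context serves every action
    b += np.outer(c, x)                    # unlike bandit feedback: update all K

est = np.linalg.solve(A, b.T).T            # warm-started per-action models

# How often do the warm-started models already pick the cheapest action?
X_test = rng.normal(size=(500, d))
agree = np.mean((est @ X_test.T).argmin(0) == (w_star @ X_test.T).argmin(0))
print(agree)
```

The bandit loop from the previous slide would then start from these warm-started statistics instead of from scratch, which is exactly where the benefit of warm starting comes from when W matches D.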

  4. Warm-starting contextual bandits: motivation
  • Some labeled examples often exist in applications, e.g.
    • News recommendation: editorial relevance annotations
    • Healthcare: historical medical records w/ prescribed treatments
  • Leveraging historical data can reduce unsafe exploration

  5. Warm-starting contextual bandits: motivation
  • Some labeled examples often exist in applications, e.g.
    • News recommendation: editorial relevance annotations
    • Healthcare: historical medical records w/ prescribed treatments
  • Leveraging historical data can reduce unsafe exploration
  • Key challenge: W may not be the same as D
    • Editors fail to capture users' preferences
    • Medical record data from another population
  How to utilize the warm-starting examples robustly and effectively?

  6. Algorithm & performance guarantees
  ARRoW-CB: iteratively finds the best relative weighting of warm-start and bandit examples to rapidly learn a good policy

  7. Algorithm & performance guarantees
  ARRoW-CB: iteratively finds the best relative weighting of warm-start and bandit examples to rapidly learn a good policy
  • Theorem (informal): Compared to algorithms that ignore S,* the regret of ARRoW-CB is
    - never much worse (robustness)
    - much smaller, if W and D are close enough and |S| is large enough
  * S ~ W is the warm-start data
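The idea of choosing a relative weighting between the two data sources can be illustrated on a toy regression problem. This is not the ARRoW-CB algorithm itself; the weighted least-squares model, the lambda grid, and the validation step are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy illustration of relative weighting (not ARRoW-CB itself): fit a model
# on a lambda-weighted mix of warm-start and bandit examples, and pick the
# lambda that does best on held-out data from the bandit distribution D.
d = 5
w_bandit = rng.normal(size=d)                  # target distribution D
w_warm = w_bandit + 0.5 * rng.normal(size=d)   # shifted source W

def sample(w, n):
    X = rng.normal(size=(n, d))
    y = X @ w + 0.1 * rng.normal(size=n)
    return X, y

Xw, yw = sample(w_warm, 500)                   # many warm-start examples
Xb, yb = sample(w_bandit, 50)                  # few bandit examples
Xv, yv = sample(w_bandit, 500)                 # held-out validation from D

def fit(lam):
    # weighted ridge: weight lam on warm-start rows, (1 - lam) on bandit rows
    W = np.concatenate([np.full(len(yw), lam), np.full(len(yb), 1 - lam)])
    X = np.vstack([Xw, Xb])
    y = np.concatenate([yw, yb])
    A = (X * W[:, None]).T @ X + 1e-3 * np.eye(d)
    return np.linalg.solve(A, (X * W[:, None]).T @ y)

errs = {lam: np.mean((Xv @ fit(lam) - yv) ** 2)
        for lam in (0.0, 0.25, 0.5, 0.75, 1.0)}
best = min(errs, key=errs.get)
print(best, errs[best])
```

With a large shift between W and D, relying only on warm-start data (lambda = 1) hurts, which is the robustness concern in the theorem; a data-driven choice of lambda is never much worse than ignoring S (lambda = 0).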

  8. Empirical evaluation
  • 524 datasets from openml.org
  • CDFs of normalized errors: for each algorithm, the fraction of settings with error ≤ e, plotted against the threshold e
  [Plot: CDF curves for Algorithm 1 and Algorithm 2; y-axis: % settings w/ error ≤ e, x-axis: e]
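The evaluation metric on this slide is an empirical CDF over settings: for each threshold e, count the fraction of settings where an algorithm's normalized error is at most e, so a higher curve is better. A minimal sketch with made-up error values (the real ones come from the 524 openml.org datasets):

```python
import numpy as np

# Sketch of the CDF-of-normalized-errors metric. The error values below are
# invented for illustration; in the paper each value comes from one setting.
errors = {
    "Algorithm 1": np.array([0.05, 0.10, 0.20, 0.40, 0.80]),
    "Algorithm 2": np.array([0.10, 0.30, 0.50, 0.70, 0.90]),
}

def cdf_at(errs, e):
    # fraction of settings with normalized error <= e
    return float(np.mean(errs <= e))

for name, errs in errors.items():
    print(name, [round(cdf_at(errs, e), 2) for e in (0.1, 0.5, 0.9)])
# Algorithm 1 → [0.4, 0.8, 1.0]; Algorithm 2 → [0.2, 0.6, 1.0]
```

Here Algorithm 1's curve dominates Algorithm 2's at every threshold, which is how the plots on the following slides should be read.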

  9. Empirical evaluation
  • 524 datasets from openml.org
  • CDFs of normalized errors: fraction of settings with error ≤ e, plotted against e
  • Moderate noise setting
  • Algorithms: ARRoW-CB, Sup-Only, Bandit-Only, Sim-Bandit (uses both sources)
  [Plot: CDF curves for the four algorithms; y-axis: % settings w/ error ≤ e, x-axis: e]

  10. Empirical evaluation
  • 524 datasets from openml.org
  • CDFs of normalized errors: fraction of settings with error ≤ e, plotted against e
  • Moderate noise setting
  • Algorithms: ARRoW-CB, Sup-Only, Bandit-Only, Sim-Bandit (uses both sources)
  • Poster Thu #52
  [Plot: CDF curves for the four algorithms; y-axis: % settings w/ error ≤ e, x-axis: e]
