Warm-starting contextual bandits: robustly combining supervised and bandit feedback
Chicheng Zhang (1), Alekh Agarwal (1), Hal Daumé III (1,2), John Langford (1), Sahand Negahban (3)
(1) Microsoft Research, (2) University of Maryland, (3) Yale University
with associated cost $c_t = (c_t(1), \ldots, c_t(K))$ from distribution $D$
[Figure: interaction between the learning algorithm and the user, labeled with context $x_t$, action $a_t$, and observed cost $c_t(a_t)$]
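The interaction between the learning algorithm and the user can be sketched as a simple loop. The code below is a minimal illustration, not the method from the talk: the toy data source `sample_example`, the tabular epsilon-greedy learner, and all parameter names are assumptions introduced here. It only shows the feedback structure, in which the full cost vector exists but the learner observes just the cost of the action it chose.

```python
import random


def sample_example(rng, num_actions):
    """Hypothetical stand-in for the data distribution: returns (context, costs).

    The toy context is the index of the single zero-cost action.
    """
    context = rng.randrange(num_actions)
    costs = [0.0 if a == context else 1.0 for a in range(num_actions)]
    return context, costs


def run_bandit_protocol(num_rounds, num_actions, epsilon=0.1, seed=0):
    """Run an epsilon-greedy learner and return its average observed cost."""
    rng = random.Random(seed)
    # Per-(context, action) running totals and counts: a toy tabular policy.
    totals = [[0.0] * num_actions for _ in range(num_actions)]
    counts = [[0] * num_actions for _ in range(num_actions)]
    cumulative_cost = 0.0
    for _ in range(num_rounds):
        x_t, c_t = sample_example(rng, num_actions)  # c_t is hidden from the learner
        if rng.random() < epsilon:
            a_t = rng.randrange(num_actions)  # explore uniformly at random
        else:
            # Exploit: pick the action with the lowest estimated cost.
            # Untried actions default to an estimate of 0.0 (optimistic).
            means = [
                totals[x_t][a] / counts[x_t][a] if counts[x_t][a] else 0.0
                for a in range(num_actions)
            ]
            a_t = min(range(num_actions), key=means.__getitem__)
        observed = c_t[a_t]  # bandit feedback: only the chosen action's cost
        totals[x_t][a_t] += observed
        counts[x_t][a_t] += 1
        cumulative_cost += observed
    return cumulative_cost / num_rounds


average_cost = run_bandit_protocol(num_rounds=2000, num_actions=3)
```

A fully labeled example, by contrast, would reveal every entry of the cost vector at once, which is what makes supervised warm-start data more informative per example than bandit feedback.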
[Figure: diagram with a panel labeled "Fully labeled"; the symbol labels are not recoverable]
*[garbled symbols] is the warm-start data
[Figure: curves for Algorithm 1 and Algorithm 2. x-axis: error threshold; y-axis: % of settings with error ≤ that threshold]
Algorithms compared: ARRoW-CB, Sup-Only, Bandit-Only, Sim-Bandit (uses both sources)
[Figure: x-axis: error threshold; y-axis: % of settings with error ≤ that threshold]
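The y-axis quantity, the percentage of settings whose error is at most a given threshold, can be computed as sketched below. This is an illustration only: the function names are hypothetical, and the error values are invented placeholders, not results from the talk.

```python
def fraction_within(errors, threshold):
    """Return the percentage of settings whose error is at most `threshold`."""
    if not errors:
        raise ValueError("errors must be non-empty")
    return 100.0 * sum(e <= threshold for e in errors) / len(errors)


def threshold_curve(errors, thresholds):
    """Return (threshold, percentage) points for plotting one algorithm's curve."""
    return [(t, fraction_within(errors, t)) for t in thresholds]


# Invented per-setting errors, for illustration only (not data from the talk).
errors_by_algorithm = {
    "Algorithm 1": [0.05, 0.10, 0.40, 0.80],
    "Algorithm 2": [0.02, 0.05, 0.10, 0.30],
}
thresholds = [0.0, 0.1, 0.5, 1.0]
curves = {
    name: threshold_curve(errs, thresholds)
    for name, errs in errors_by_algorithm.items()
}
```

Under this summary, a curve that lies higher at every threshold corresponds to an algorithm that reaches low error in more settings.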