CAB: Continuous Adaptive Blending for Policy Evaluation and Learning



SLIDE 1

CAB: Continuous Adaptive Blending for Policy Evaluation and Learning

Yi Su*, Lequn Wang*, Michele Santacatterina and Thorsten Joachims

SLIDE 2

Example: Netflix

Context 𝑦: User / viewing history
Action 𝑧: Movie to be placed in this slot
Reward 𝑠: Whether the user clicks it
(The slide shows candidate movies for the slot.)
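One logged interaction in this setup bundles the context, the action actually shown, the observed reward, and the logging propensity. A minimal sketch of such a record (the field and example names are illustrative, not from the talk):

```python
from dataclasses import dataclass

@dataclass
class LoggedInteraction:
    """One record of logged bandit feedback, in the slide's notation."""
    context: str       # y: user / viewing history
    action: str        # z: the movie placed in the slot
    reward: float      # s: 1.0 if the user clicked, else 0.0
    propensity: float  # pi0(z | y): probability the logging policy chose z

d = LoggedInteraction(context="user_42", action="movie_7",
                      reward=1.0, propensity=0.25)
```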

SLIDE 3

[Figure: A/B testing draws a separate dataset D_i from each candidate policy π_i to estimate V̂(π_i); off-policy evaluation draws one dataset D from the logging policy π₀ and reuses it for every candidate policy.]

Goal: Off-Policy Evaluation and Learning

Online: A/B testing, where each candidate policy must be deployed to gather its own data.
Offline: off-policy evaluation, where one logged dataset from the logging policy π₀ is reused to evaluate any candidate policy.

Evaluation: estimate the expected performance V(π) of a new policy π.
Learning: ERM for batch learning from bandit feedback:

    π* = argmax_{π ∈ Π} V̂(π)

Logged bandit feedback data:

    D = {(y_i, z_i, s_i, π₀(z_i | y_i))}_{i=1}^{n}
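Given such logged data, the basic inverse-propensity-scoring estimator reweights each logged reward by how much more (or less) likely the new policy is to take the logged action than the logging policy was. A minimal sketch (function and variable names are ours, not from the talk):

```python
def ips_estimate(logs, pi):
    """Inverse propensity scoring (IPS) estimate of V(pi).

    logs: iterable of (y, z, s, p0) tuples, where p0 = pi0(z | y) is the
    propensity recorded when the data was logged.
    pi(z, y): probability that the new policy shows action z in context y.
    """
    n = 0
    total = 0.0
    for (y, z, s, p0) in logs:
        total += pi(z, y) / p0 * s   # reweight the logged reward
        n += 1
    return total / n

# Sanity check: if the new policy equals the logging policy, the estimate
# is just the average observed reward.
logs = [("u1", "a", 1.0, 0.5), ("u1", "b", 0.0, 0.5)]
same_policy = lambda z, y: 0.5
v_hat = ips_estimate(logs, same_policy)   # average reward = 0.5
```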

SLIDE 4

Main Approaches

Contribution I: Present a family of counterfactual estimators.

Contribution II: Design a new estimator that inherits the desirable properties of existing estimators.

SLIDE 5

Contribution I: Interpolated Counterfactual Estimator Family

Notation: G πœ€(𝑦, 𝑧) be the estimated reward for action 𝑧 given context 𝑦. Let I 𝜌. be the estimated (known) logging policy.

Interpolated Counterfactual Estimator (ICE) Family
Given a triplet 𝒳 = (x¹, x², x³) of weighting functions:

    V̂_𝒳(π) = (1/n) Σ_{i=1}^{n} Σ_{z ∈ 𝒵} π(z | y_i) x¹_{i,z} β_{i,z}
            + (1/n) Σ_{i=1}^{n} π(z_i | y_i) x²_i γ_i
            + (1/n) Σ_{i=1}^{n} π(z_i | y_i) x³_i δ_i

Model the world: β_{i,z} = ŝ(y_i, z) (high bias, small variance).
Model the bias: γ_i = s_i / π̂₀(z_i | y_i) (high variance; unbiased when the propensity is known).
Control variate: δ_i = ŝ(y_i, z_i) / π̂₀(z_i | y_i) (variance reduction, but rules out use in LTR).
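The three weighting functions let familiar estimators drop out as special cases: the Direct Method (DM) keeps only the model term, IPS keeps only the propensity term, and Doubly Robust (DR) combines all three. A minimal sketch of the family (names and the weighting-function signatures are our assumptions):

```python
def ice_estimate(logs, pi, s_hat, actions, x1, x2, x3):
    """Interpolated Counterfactual Estimator family, following the
    slide's three-term decomposition.

    logs: list of (y, z, s, p0) tuples; pi(z, y): target policy;
    s_hat(y, z): estimated reward model; actions: the action space Z;
    x1(i, z), x2(i), x3(i): per-term weighting functions.
    """
    n = len(logs)
    dm = sum(pi(z, y) * x1(i, z) * s_hat(y, z)
             for i, (y, _, _, _) in enumerate(logs) for z in actions) / n
    ips = sum(pi(z, y) * x2(i) * s / p0
              for i, (y, z, s, p0) in enumerate(logs)) / n
    cv = sum(pi(z, y) * x3(i) * s_hat(y, z) / p0
             for i, (y, z, _, p0) in enumerate(logs)) / n
    return dm + ips + cv

# Known estimators as members of the family:
#   DM:  x = (1, 0, 0)   "model the world"
#   IPS: x = (0, 1, 0)   "model the bias"
#   DR:  x = (1, 1, -1)  DM plus an IPS-style control variate
logs = [("u", "a", 1.0, 0.5), ("u", "b", 0.0, 0.5)]
actions = ["a", "b"]
pi = lambda z, y: 0.5                          # target = logging policy
s_hat = lambda y, z: 1.0 if z == "a" else 0.0  # perfect reward model here
one, zero = (lambda *a: 1.0), (lambda *a: 0.0)
dm  = ice_estimate(logs, pi, s_hat, actions, one, zero, zero)
ips = ice_estimate(logs, pi, s_hat, actions, zero, one, zero)
dr  = ice_estimate(logs, pi, s_hat, actions, one, one, lambda i: -1.0)
```

With a perfect reward model and a target policy equal to the logging policy, all three members agree, which makes the toy setup a useful consistency check.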

SLIDE 6

Contribution II: Continuous Adaptive Blending (CAB) Estimator

✓ Can be substantially less biased than clipped IPS and DM.
✓ Has low variance compared to IPS and DR.
✓ Subdifferentiable, so it supports gradient-based learning: POEM (Swaminathan & Joachims, 2015a), BanditNet (Joachims et al., 2018).
✓ Unlike DR, can be used in off-policy Learning to Rank (LTR) algorithms (Joachims et al., 2017).
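The slides do not spell out CAB's weighting functions. One reading of the construction, which we sketch here as an assumption, is a continuous clip with threshold M: the IPS term keeps weight w = min(M·π̂₀, π)/π, and the model-based term receives the remaining 1 − w, so low-propensity actions fall back smoothly onto the reward model instead of being hard-clipped.

```python
def cab_weight(p_target, p_log, M):
    """Continuous clipping weight (a sketch, assuming threshold M):
    w = 1 where the propensity ratio is small, and shrinks toward 0
    where M * pi0 < pi, shifting mass onto the reward model."""
    return min(M * p_log, p_target) / p_target

def cab_estimate(data, pi, pi0, s_hat, actions, M):
    """Blended estimate: (1 - w)-weighted DM term plus w-weighted IPS term.

    data: list of (y, z, s); pi / pi0: target / logging policy as
    pi(z, y); s_hat(y, z): reward model; actions: action space Z.
    """
    n = len(data)
    total = 0.0
    for (y, z_i, s) in data:
        for z in actions:  # model term, down-weighted where IPS is trusted
            w = cab_weight(pi(z, y), pi0(z, y), M)
            total += pi(z, y) * (1.0 - w) * s_hat(y, z) / n
        # propensity term, continuously clipped
        w_i = cab_weight(pi(z_i, y), pi0(z_i, y), M)
        total += pi(z_i, y) * w_i * s / (pi0(z_i, y) * n)
    return total

# Toy check: with a reward model that matches the logged rewards, the
# large-M limit (pure IPS) and the M = 0 limit (pure DM) agree.
data = [("u", "a", 1.0), ("u", "b", 0.0)]
pi_t = lambda z, y: 0.8 if z == "a" else 0.2   # target policy
pi_0 = lambda z, y: 0.5                        # uniform logging policy
s_hat = lambda y, z: 1.0 if z == "a" else 0.0
```

Intermediate values of M interpolate continuously between those two limits, which is what makes the blending subdifferentiable.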

See our poster at Pacific Ballroom #221 Thursday (Today) 6:30-9:00pm