A Contextual-Bandit Approach to Personalized News Article - - PowerPoint PPT Presentation
A Contextual-Bandit Approach to Personalized News Article Recommendation
Lihong Li, Wei Chu, John Langford, Robert E. Schapire
Presenter: Qingyun Wu
News Recommendation Cycle
A K-armed Bandit Formulation
- A gambler must decide which of K non-identical slot machines (we call them arms) to play in a sequence of trials in order to maximize total reward.
How to pull arms to maximize reward?
- News website <-> gambler
- Candidate news articles <-> arms
- User click <-> reward
A K-armed Bandit formulation
Setting
- Set of K choices (arms)
- Each choice $a$ is associated with an unknown probability distribution $p_a$ supported in $[0,1]$
- We play the game for $T$ rounds
- In each round $t$: (1) we pick article $j$; (2) we observe a random sample $X_t$ from $p_j$
Our goal: maximize $\sum_{t=1}^{T} X_t$
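The setting above can be sketched as a small simulation. This is only an illustration under stated assumptions: each arm is taken to be a Bernoulli distribution (one possible distribution supported in [0,1]), and the names `BernoulliBandit`, `play`, and `choose_arm` are hypothetical, not from the slides.

```python
import random


class BernoulliBandit:
    """K arms; arm a pays 1 with hidden probability means[a], else 0.

    Bernoulli rewards are an assumption made for this sketch; the slides
    only require a distribution supported in [0, 1].
    """

    def __init__(self, means, seed=0):
        self.means = list(means)
        self.rng = random.Random(seed)

    @property
    def k(self):
        return len(self.means)

    def pull(self, arm):
        # Draw one random sample X_t from the chosen arm's distribution.
        return 1 if self.rng.random() < self.means[arm] else 0


def play(bandit, choose_arm, rounds):
    """Play `rounds` rounds and return (total reward, history).

    choose_arm(t, history) is a hypothetical policy callback that returns
    the index of the arm to pull in round t.
    """
    history = []
    total = 0
    for t in range(1, rounds + 1):
        arm = choose_arm(t, history)
        reward = bandit.pull(arm)
        history.append((arm, reward))
        total += reward  # the quantity we want to maximize: sum of X_t
    return total, history


if __name__ == "__main__":
    bandit = BernoulliBandit([0.2, 0.5, 0.8])
    total, _ = play(bandit, lambda t, history: 2, rounds=1000)
    print("total reward when always pulling arm 2:", total)
```

The policy is passed in as a callback so that different strategies can be compared against the same environment.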
Ideal Solution
Pick $\arg\max_a \mu_a$
But we DO NOT know the mean.
Every time we pull an arm we learn a bit more about the distribution.
Feasible Solution
Exploitation VS. Exploration
Exploitation: pull the arm for which we currently have the highest estimated mean reward.
Exploration: pull an arm we have never pulled before.
Extreme examples:
- Greedy strategy (too confident): always take the arm with the highest average reward.
- Random strategy (too unconfident): choose an arm at random.
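The two extremes can be compared in a few lines of Python. This is a rough sketch, assuming Bernoulli rewards and assuming the greedy strategy tries each arm once before trusting its averages; the function names `run`, `greedy`, and `uniform_random` are made up for illustration.

```python
import random


def run(strategy, means, rounds, seed=0):
    """Simulate `rounds` pulls on Bernoulli arms with hidden `means`."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k    # number of pulls per arm
    sums = [0.0] * k    # summed reward per arm
    total = 0
    for _ in range(rounds):
        arm = strategy(counts, sums, rng)
        reward = 1 if rng.random() < means[arm] else 0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total, counts


def greedy(counts, sums, rng):
    # Assumption: try every arm once, then always exploit the best average.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(range(len(counts)), key=lambda a: sums[a] / counts[a])


def uniform_random(counts, sums, rng):
    # Pure exploration: ignore everything learned so far.
    return rng.randrange(len(counts))


if __name__ == "__main__":
    means = [0.2, 0.5, 0.8]
    for name, strategy in [("greedy", greedy), ("random", uniform_random)]:
        total, counts = run(strategy, means, rounds=3000)
        print(name, "total reward:", total, "pulls per arm:", counts)
```

Greedy can lock onto a poor arm after an unlucky first pull, while the random strategy never uses what it has learned; neither balances the two goals.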
Don't just look at the mean (that is, the expected reward), but also at the confidence!
How to trade off exploration and exploitation
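UCB (Upper Confidence Bound) algorithm
Pick $\arg\max_a \left( \hat{\mu}_a + \alpha \cdot \text{Variance} \right)$
Pick $\arg\max_a \left( \hat{\mu}_a + \alpha \cdot \text{UCB} \right)$
UCB1: pick $\arg\max_a \left( \hat{\mu}_a + \sqrt{\frac{2 \ln T}{n_a}} \right)$, where $n_a$ is the number of times arm $a$ has been pulled so far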
Reference: Finite-time Analysis of the Multiarmed Bandit Problem, Peter Auer, Nicolò Cesa-Bianchi, Paul Fischer http://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf
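A minimal sketch of the UCB1 rule, for illustration only. It assumes Bernoulli rewards, reads the T in the formula as the number of rounds played so far, and pulls each arm once at the start so that every n_a is positive; the function names are hypothetical.

```python
import math
import random


def ucb1_index(mean, n_arm, t):
    """Empirical mean plus the UCB1 confidence term sqrt(2 ln t / n_a)."""
    return mean + math.sqrt(2.0 * math.log(t) / n_arm)


def ucb1(means, rounds, seed=0):
    """Run UCB1 on simulated Bernoulli arms; return (counts, sums)."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k    # n_a: pulls per arm
    sums = [0.0] * k    # summed reward per arm
    for t in range(1, rounds + 1):
        if t <= k:
            arm = t - 1  # initialization: pull every arm once
        else:
            arm = max(
                range(k),
                key=lambda a: ucb1_index(sums[a] / counts[a], counts[a], t),
            )
        reward = 1 if rng.random() < means[arm] else 0
        counts[arm] += 1
        sums[arm] += reward
    return counts, sums


if __name__ == "__main__":
    counts, sums = ucb1([0.2, 0.5, 0.8], rounds=5000)
    print("pulls per arm:", counts)
```

The confidence term shrinks as an arm is pulled more often, so rarely tried arms keep getting occasional pulls while most rounds go to the arm that looks best.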
A confidence interval is a range of values within which the mean lies with a certain probability.
Make use of Contextual Information
User features: demographic information, geographic features, behavioral categories
Article features: URL categories, topic categories
Assumption about the reward:
The expected reward of an arm $a$ is linear in its $d$-dimensional feature $x_{t,a}$, with some unknown coefficient vector $\theta_a^*$; namely, for all $t$,
$E(r_{t,a} \mid x_{t,a}) = x_{t,a}^T \theta_a^*$
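UCB (Upper Confidence Bound) algorithm
Assumption: $E(r_{t,a} \mid x_{t,a}) = x_{t,a}^T \theta_a^*$
Parameter estimation (ridge regression): $\hat{\theta}_a = (D_a^T D_a + I_d)^{-1} D_a^T c_a$, where the rows of $D_a$ are the contexts previously observed for arm $a$ and $c_a$ holds the corresponding click responses
Bound of the variance (holds with high probability): $\left| x_{t,a}^T \hat{\theta}_a - E(r_{t,a} \mid x_{t,a}) \right| \le \alpha \sqrt{x_{t,a}^T (D_a^T D_a + I_d)^{-1} x_{t,a}}$
This is the bound we need!
Pick $\arg\max_a \left( x_{t,a}^T \hat{\theta}_a + \alpha \sqrt{x_{t,a}^T (D_a^T D_a + I_d)^{-1} x_{t,a}} \right)$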
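The selection rule above can be sketched with NumPy, keeping one ridge-regression model per arm. This is an illustrative sketch, not the authors' implementation: the class name `LinUCBArm`, the helper `choose_arm`, and the incremental bookkeeping (`A` for $D_a^T D_a + I_d$, `b` for $D_a^T c_a$) are assumptions made for the example.

```python
import numpy as np


class LinUCBArm:
    """Per-arm ridge regression with an upper confidence bound."""

    def __init__(self, d, alpha):
        self.alpha = alpha
        self.A = np.eye(d)      # D_a^T D_a + I_d, updated incrementally
        self.b = np.zeros(d)    # D_a^T c_a, updated incrementally

    def theta_hat(self):
        # Ridge estimate: (D_a^T D_a + I_d)^{-1} D_a^T c_a
        return np.linalg.solve(self.A, self.b)

    def ucb(self, x):
        # Predicted reward plus alpha times the confidence width.
        A_inv_x = np.linalg.solve(self.A, x)
        return float(x @ self.theta_hat() + self.alpha * np.sqrt(x @ A_inv_x))

    def update(self, x, reward):
        # Appending row x to D_a adds x x^T to A and reward * x to b.
        self.A += np.outer(x, x)
        self.b += reward * x


def choose_arm(arms, contexts):
    """Return the index of the arm with the largest upper confidence bound."""
    scores = [arm.ucb(x) for arm, x in zip(arms, contexts)]
    return int(np.argmax(scores))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical true coefficient vectors, used only to simulate clicks.
    true_thetas = [np.array([0.1, 0.2]), np.array([0.3, 0.6])]
    arms = [LinUCBArm(d=2, alpha=1.0) for _ in true_thetas]
    for _ in range(2000):
        contexts = [rng.random(2) for _ in arms]
        a = choose_arm(arms, contexts)
        click = float(rng.random() < contexts[a] @ true_thetas[a])
        arms[a].update(contexts[a], click)
    print([arm.theta_hat() for arm in arms])
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly; for large $d$ one would more likely maintain the inverse incrementally.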