Generative Adversarial User Model for Reinforcement Learning Based Recommendation System
Xinshi Chen1, Shuang Li1, Hui Li2, Shaohua Jiang2, Yuan Qi2, Le Song1,2
1Georgia Tech, 2Ant Financial
ICML 2019
[Figure: the recommendation loop. At each time t the system displays a set of items and the user makes a choice; the state then transitions from t to t+1, t+2, and so on. The reward for each choice is unknown (reward = ?).]
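The interaction loop above can be sketched as follows. This is a minimal toy sketch: the random item embeddings, the Gumbel-noise choice rule, and the moving-average state update are all illustrative assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

n_items, k, l = 50, 8, 5                 # catalog size, embedding dim, display size
items = rng.normal(size=(n_items, k))    # hypothetical item embeddings
state = np.zeros(k)                      # user state at time t

for t in range(3):
    display = rng.choice(n_items, size=l, replace=False)   # system displays l items
    scores = items[display] @ state + rng.gumbel(size=l)   # user's noisy utilities
    choice = display[int(np.argmax(scores))]               # user's choice
    state = 0.9 * state + 0.1 * items[choice]              # state at t+1 (toy update)
    # reward = ?   (unknown: this is what the user model must supply)
```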
e.g. (1) For AlphaGo Zero, 4.9 million games of self-play were generated for training. (2) RL for an Atari game takes more than 50 hours of GPU training.
GAN User Model as a Simulated Environment. The RL recommendation policy interacts with the learned user model instead of with real users: given the user state s^t and the displayed items A^t, the model generates the user's choice b^t ~ φ(s^t, A^t), and r(s^t, b^t) supplies the reward.
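The user's choice distribution φ can be sketched as a softmax over per-item scores. This is a hedged sketch: the bilinear score standing in for r(s^t, b), and all shapes and parameters, are illustrative assumptions.

```python
import numpy as np

def choice_prob(state, displayed, theta):
    """phi(s^t, A^t): probability of each displayed item being chosen.
    A bilinear score stands in for r(s^t, b) (illustrative assumption)."""
    scores = displayed @ theta @ state        # one score per displayed item
    p = np.exp(scores - scores.max())         # numerically stable softmax
    return p / p.sum()

rng = np.random.default_rng(1)
k = 4
theta = rng.normal(size=(k, k))               # hypothetical model parameters
state = rng.normal(size=k)                    # user state s^t
displayed = rng.normal(size=(6, k))           # embeddings of 6 displayed items
p = choice_prob(state, displayed, theta)      # choice distribution over the display
b = rng.choice(6, p=p)                        # sampled choice b^t ~ phi(s^t, A^t)
```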
[Figure: user state construction. The embeddings g*_{t-1}, …, g*_{t-n} of the n most recently clicked items are mixed through a learned position weight matrix (entries x_{11}, …, x_{nm}), or alternatively through an LSTM with hidden state h^{t-1}, and the results are concatenated with the embedding g_j^t of candidate item j to form the state s_j^t.]
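The position-weighting variant of the state construction can be sketched as below. The shapes, the tanh nonlinearity, and the function name are illustrative assumptions.

```python
import numpy as np

def user_state(clicked_embs, W):
    """Mix the n most recent clicked-item embeddings with a learned
    position weight matrix W, then concatenate the mixtures.
    Shapes and the tanh nonlinearity are illustrative assumptions."""
    mixed = W.T @ clicked_embs           # (m, k): m position-weighted mixtures
    return np.tanh(mixed).reshape(-1)    # concatenated state vector

rng = np.random.default_rng(2)
n, k, m = 5, 8, 3                        # history length, embedding dim, mixtures
hist = rng.normal(size=(n, k))           # embeddings of the last n clicked items
W = rng.normal(size=(n, m))              # position weight matrix
s = user_state(hist, W)                  # state vector of dimension m * k
```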
At each step, the system displays l items chosen from all L available items, so the action space is combinatorial.
The optimal display is the joint argmax over all l slots:

b_1*, b_2*, …, b_l* = arg max_{b_1, …, b_l} R(s^t, b_1, b_2, …, b_l)

Define the partially maximized values

R_1*(s^t, b_1) = max_{b_{2:l}} R(s^t, b_1, b_{2:l})
R_2*(s^t, b_1, b_2) = max_{b_{3:l}} R(s^t, b_1, b_2, b_{3:l})
…

Then the joint argmax decomposes into a cascade of single-item argmaxes:

b_1* = arg max_{b_1} R_1*(s^t, b_1)
b_2* = arg max_{b_2} R_2*(s^t, b_1*, b_2)
⋯
b_l* = arg max_{b_l} R_l*(s^t, b_1*, b_2*, …, b_{l-1}*, b_l)
[Figure: cascading argmax network. From state t, an argmax over R_1(t, b_1; θ_1) outputs b_1*; an argmax over R_2(t, b_1*, b_2; θ_2) outputs b_2*; …; an argmax over R_l(t, b_{1:l-1}*, b_l; θ_l) outputs b_l*.]
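The cascade can be sketched as a greedy loop. The toy heads below stand in for the learned networks R_k(t, b_{1:k-1}, b_k; θ_k); their scoring rule is an illustrative assumption.

```python
import numpy as np

def cascade_argmax(R_fns, state, candidates):
    """Greedy cascade: b_1* = argmax R_1(s, b_1), then b_2* = argmax
    R_2(s, b_1*, b_2), and so on. This turns one argmax over O(L^l) item
    tuples into l argmaxes over at most L items each."""
    chosen, remaining = [], list(candidates)
    for R_k in R_fns:                    # one head per display slot
        scores = [R_k(state, chosen, b) for b in remaining]
        best = remaining[int(np.argmax(scores))]
        chosen.append(best)
        remaining.remove(best)           # display each item at most once
    return chosen

# Toy heads standing in for the learned networks R_k(t, b_1:k-1, b_k; theta_k).
rng = np.random.default_rng(3)
L, l = 10, 3
vals = rng.normal(size=L)
R_fns = [lambda s, prev, b: vals[b] - 0.1 * len(prev) for _ in range(l)]
picked = cascade_argmax(R_fns, state=None, candidates=range(L))
```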
Experiments: predictive performance of the user model, and recommendation policy based on the user model.
A Cascading-DQN policy pre-trained on a GAN user model can quickly achieve a high CTR even when it is applied to a new set of users.