

SLIDE 1

Generative Adversarial User Model for Reinforcement Learning Based Recommendation System

Xinshi Chen1, Shuang Li1, Hui Li2, Shaohua Jiang2, Yuan Qi2, Le Song1,2

1Georgia Tech, 2Ant Financial

ICML 2019

SLIDE 2

RL for Recommendation System

  • A user’s interest evolves over time based on what she observes.
  • The recommender’s actions can significantly influence this evolution.
  • An RL-based recommender can account for the user’s long-term interest.

[Diagram: interaction loop. The system displays items, the user makes a choice, and the user's state evolves: state at t → state at t+1 → state at t+2.]
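To make the loop concrete, here is a minimal sketch of the interaction framed as an MDP. All names (`env`, `policy`, `run_session`) are hypothetical stand-ins, not the paper's API.

```python
# Minimal sketch of the loop above as an MDP (names hypothetical, not from
# the paper). The system's action is the displayed item set; the user's
# click and her evolving interest form the environment.

def run_session(env, policy, horizon):
    state = env.reset()                       # user's initial state
    for _ in range(horizon):
        displayed = policy(state)             # system displays a set of items
        state, clicked = env.step(displayed)  # user chooses; state evolves
```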

SLIDE 3

Challenges

  • 1. The user is the environment.
  • 2. The reward function (the user’s interest) is unknown.

Training an RL policy requires a huge number of interactions with users.

[Diagram: the same interaction loop, now annotated with reward = ? at each step: the system displays items, the user chooses, and the state evolves from state at t to state at t+2, with rewards unknown.]

e.g. (1) AlphaGo Zero generated 4.9 million games of self-play for training. (2) RL for an Atari game takes more than 50 GPU-hours to train.

SLIDE 4

Our solution

We propose

  • A Generative Adversarial User Model
  • to model the user’s actions
  • to recover the user’s reward function
  • Use the GAN user model as a simulator to pre-train the RL policy offline (a sketch follows below).

[Diagram: the GAN user model serves as a simulated environment; the RL policy (system) learns through simulated interactions with it.]
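A hedged sketch of the offline pre-training loop: the learned user model stands in for real users. All method names on `user_model` and `policy` are illustrative assumptions, not the paper's interfaces.

```python
# Sketch: pre-train an RL policy against the learned user model instead of
# live users (method names illustrative; update details omitted).

def pretrain(policy, user_model, n_episodes, horizon):
    for _ in range(n_episodes):
        state = user_model.initial_state()
        for _ in range(horizon):
            items = policy.act(state)                      # display a set of items
            click = user_model.sample_click(state, items)  # simulated user choice
            reward = user_model.reward(state, click)       # learned reward r
            next_state = user_model.next_state(state, click)
            policy.update(state, items, click, reward, next_state)
            state = next_state
```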

SLIDE 5

Generative Adversarial User Model

2 components:

User’s reward $r(s^t, a^t)$

  • $a^t$ is the clicked item.
  • $s^t$ is the user’s experience (state).

User’s behavior $\phi(s^t, \mathcal{A}^t)$

  • $\mathcal{A}^t$ contains the items displayed by the system.
  • The user acts $a^t \sim \phi$ to maximize her expected reward:

$$\phi^*(s^t, \mathcal{A}^t) = \arg\max_{\phi} \; \mathbb{E}_{\phi}\left[ r(s^t, a^t) \right] - R(\phi)/\eta$$

where $R(\phi)$ is a regularizer on the behavior distribution and $\eta$ is a scaling parameter.

[Diagram: displayed items $\mathcal{A}^t$ → user choice $a^t \sim \phi(s^t, \mathcal{A}^t)$ → reward $r(s^t, a^t)$.]
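With the negative Shannon entropy regularizer $R(\phi) = \sum_a \phi(a)\log\phi(a)$, a special case treated in the paper, the maximization above has a closed-form softmax solution over the displayed items. A minimal numpy sketch (function name and example rewards are illustrative):

```python
import numpy as np

def behavior_model(item_rewards, eta=1.0):
    """Best-response user behavior under a negative-entropy regularizer.

    Maximizing  E_phi[r] - R(phi)/eta  with  R(phi) = sum phi log phi
    over the simplex yields a softmax over the displayed items' rewards.
    """
    z = eta * np.asarray(item_rewards, dtype=float)
    z -= z.max()                      # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# e.g. the user's click distribution over 4 displayed items
print(behavior_model([0.2, 1.5, -0.3, 0.9], eta=2.0))
```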

SLIDE 6

Generative Adversarial Training

In analogy to a GAN:

  • $\phi$ (behavior) acts as the generator
  • $r$ (reward) acts as the discriminator

Jointly learned via a mini-max formulation:

$$\min_{\theta} \max_{\phi} \; \left( \mathbb{E}_{\phi}\left[ \sum_{t=1}^{T} r_{\theta}(s^t, a^t) \right] - R(\phi)/\eta \right) - \sum_{t=1}^{T} r_{\theta}\big(s^t_{\text{true}}, a^t_{\text{true}}\big)$$

where $(s^t_{\text{true}}, a^t_{\text{true}})$ are the states and clicks observed in the real data.
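A hedged sketch of one stochastic update of this objective, assuming the entropy regularizer so the inner max over $\phi$ is the softmax best response; `reward_net`, its signature, and the tensor shapes are assumptions for illustration.

```python
import torch

def minimax_step(reward_net, optimizer, item_feats, true_idx, eta=1.0):
    """One stochastic update of the mini-max objective over theta.

    item_feats: (batch, k, d) features of the k displayed items;
    true_idx: (batch,) index of the item the user actually clicked.
    With the entropy regularizer the inner max over phi is available in
    closed form, so only the reward parameters theta need a gradient step.
    """
    r = reward_net(item_feats).squeeze(-1)        # (batch, k) item rewards
    phi = torch.softmax(eta * r, dim=-1)          # generator's best response
    entropy = -(phi * phi.clamp_min(1e-12).log()).sum(-1)
    gen_value = (phi * r).sum(-1) + entropy / eta      # E_phi[r] - R(phi)/eta
    data_value = r.gather(-1, true_idx.unsqueeze(-1)).squeeze(-1)
    loss = (gen_value - data_value).mean()        # theta minimizes this gap
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return float(loss)
```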
SLIDE 7

Model Parameterization

[Diagram: position-weight parameterization. The embeddings $x_1, \dots, x_n$ of the $n$ most recently clicked items form a history matrix; they are scaled by learned position weights $g^*_{t-1}, \dots, g^*_{t-n}$, concatenated, and mapped through a weight matrix to the state $s^t$ (the LSTM variant instead carries a hidden state $h_{t-1}$).]

2 architectures for aggregating historical information (i.e. the state $s^t$): (1) LSTM, (2) Position Weight. A sketch of the position-weight variant follows.
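A minimal numpy sketch of the position-weight aggregation (architecture 2), assuming the state is built by scaling each of the last $n$ clicked-item embeddings with a learned per-position weight and concatenating; the sizes and random values are made up for illustration.

```python
import numpy as np

# Illustrative position-weight state aggregation: scale the embedding of
# each of the last n clicked items by a learned per-position weight, then
# concatenate into one state vector (a cheap alternative to an LSTM).
rng = np.random.default_rng(0)
n, d = 5, 8                              # history length, embedding dim
w = rng.random(n)                        # learned position weights
history = rng.standard_normal((n, d))    # embeddings of last n clicked items

state = (w[:, None] * history).reshape(-1)   # state s^t, shape (n*d,)
```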

SLIDE 8

Set Recommendation RL policy

display $k$ items out of all $K$ available items

set recommendation → combinatorial action space of size $\binom{K}{k}$

$$(a_1^*, a_2^*, \dots, a_k^*) = \arg\max_{a_1, \dots, a_k} Q(s^t, a_1, a_2, \dots, a_k)$$

Intractable to compute directly!
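For a sense of scale (the catalogue size here is illustrative, not a number from the paper): choosing $k = 10$ items from $K = 1000$ candidates already rules out exhaustive search.

```python
from math import comb

# Number of possible display sets when choosing k = 10 of K = 1000 items.
print(comb(1000, 10))   # about 2.6e23 -- far too many to enumerate
```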

SLIDE 9

Set Recommendation RL policy

$$Q^{1*}(s^t, a_1) := \max_{a_{2:k}} Q(s^t, a_1, a_{2:k})$$

$$Q^{2*}(s^t, a_1, a_2) := \max_{a_{3:k}} Q(s^t, a_1, a_2, a_{3:k})$$

$$(a_1^*, a_2^*, \dots, a_k^*) = \arg\max_{a_1, \dots, a_k} Q(s^t, a_1, a_2, \dots, a_k)$$

decompose

$$a_1^* = \arg\max_{a_1} Q^{1*}(s^t, a_1)$$
$$a_2^* = \arg\max_{a_2} Q^{2*}(s^t, a_1^*, a_2)$$
$$\vdots$$
$$a_k^* = \arg\max_{a_k} Q^{k*}(s^t, a_1^*, a_2^*, \dots, a_k)$$

We design a cascading Q network to compute the optimal action with linear complexity (see the sketch after Slide 10):

SLIDE 10

Set Recommendation RL policy: Cascading DQN

[Diagram: cascading DQN. The state $s^t$ and candidate items feed $k$ Q-networks evaluated in sequence: $\arg\max_{a_1} Q^1(s^t, a_1; \theta^1)$ gives $a_1^*$, which feeds $Q^2(s^t, a_1^*, a_2; \theta^2)$ to give $a_2^*$, and so on through $Q^k(s^t, a^*_{1:k-1}, a_k; \theta^k)$.]
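A sketch of the greedy cascade at inference time; the `q_nets` interface is an assumption, but it shows how $k$ argmax passes over $K$ candidates replace one argmax over $\binom{K}{k}$ sets.

```python
import torch

def cascade_select(q_nets, state, candidates):
    """Pick k items one at a time with the cascading Q-networks.

    q_nets[j] scores every candidate as the (j+1)-th item, given the state
    and the items already chosen (this network interface is assumed).
    Total cost is k * K scores instead of enumerating (K choose k) sets.
    """
    chosen = []
    mask = torch.zeros(len(candidates), dtype=torch.bool)
    for q in q_nets:
        scores = q(state, chosen, candidates)             # (K,) Q-values
        scores = scores.masked_fill(mask, float("-inf"))  # forbid repeats
        idx = int(scores.argmax())
        chosen.append(candidates[idx])
        mask[idx] = True
    return chosen
```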

SLIDE 11

Experiments

  • Predictive performance of the user model
  • Recommendation policy based on the user model

SLIDE 12

Experiments

A cascading-DQN policy pre-trained on the GAN user model quickly achieves a high CTR, even when applied to a new set of users.

SLIDE 13

Thanks!

Poster: Pacific Ballroom #252, Tue, 06:30 PM

Contact: xinshi.chen@gatech.edu