Generative Adversarial User Model for Reinforcement Learning Based Recommendation System




  1. Generative Adversarial User Model for Reinforcement Learning Based Recommendation System. Xinshi Chen¹, Shuang Li¹, Hui Li², Shaohua Jiang², Yuan Qi², Le Song¹,². ¹Georgia Tech, ²Ant Financial. ICML 2019.

  2. RL for Recommendation System. [Diagram: the system displays items to the user at state $s_t$; the user makes a choice; the state evolves to $s_{t+1}$, $s_{t+2}$, ...] • A user's interest evolves over time based on what she observes. • The recommender's actions can significantly influence this evolution. • An RL-based recommender can take the user's long-term interest into account.

  3. Challenges. Training an RL policy is hard in this setting: 1. The user is the environment, and training requires lots of interactions with users. E.g., (1) AlphaGo Zero generated 4.9 million games of self-play for training; (2) RL for Atari games takes more than 50 hours of GPU training. 2. The reward function (the user's interest) is unknown.

  4. Our Solution. We propose: • A Generative Adversarial User Model, to model the user's behavior and to recover the user's reward. • Using the GAN user model as a simulated environment to pre-train the RL policy offline. [Diagram: RL policy ↔ simulated interactions ↔ GAN user model (simulated environment).]

  5. Generative Adversarial User Model. Two components: • User's behavior $\phi(s_t, \mathcal{A}_t)$: the set $\mathcal{A}_t$ contains the items displayed by the system; given her experience (state) $s_t$, the user chooses an item $a_t \sim \phi(s_t, \mathcal{A}_t)$. • User's reward $r(s_t, a_t)$: the user acts to maximize her expected reward, $\phi^*(s_t, \mathcal{A}_t) = \arg\max_{\phi \in \Delta} \mathbb{E}_\phi[r(s_t, a_t)] - R(\phi)/\eta$, where $R$ is a regularizer and $\eta$ a regularization parameter.
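
When $R(\phi)$ is the negative Shannon entropy $\sum_a \phi(a)\log\phi(a)$, the maximization above has a closed-form softmax solution, $\phi^*(a) \propto \exp(\eta\, r(s_t, a))$. A minimal NumPy sketch of the resulting choice model (the reward values and $\eta$ below are made-up illustrations, not numbers from the paper):

```python
# With R(phi) = sum_a phi(a) log phi(a), the regularized maximization over the
# probability simplex yields the softmax policy phi*(a) ∝ exp(eta * r(s_t, a)).
import numpy as np

eta = 2.0                                # regularization strength (assumed)
r = np.array([0.1, 0.7, 0.4, 0.7, 0.2])  # hypothetical rewards r(s_t, a) for 5 displayed items

phi_star = np.exp(eta * r) / np.exp(eta * r).sum()   # user's click distribution
print(phi_star)                          # higher-reward items are clicked more often
a_t = np.random.default_rng(0).choice(len(r), p=phi_star)  # one simulated user click
```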

  6. Generative Adversarial Training. In analogy to a GAN: • $\phi$ (behavior) acts as the generator. • $r$ (reward) acts as the discriminator. The two are jointly learned via a mini-max formulation: $\min_\theta \max_\phi \big( \mathbb{E}_\phi\big[\textstyle\sum_{t=1}^{T} r_\theta(s_t, a_t)\big] - R(\phi)/\eta \big) - \sum_{t=1}^{T} r_\theta(s_t^{\text{true}}, a_t^{\text{true}})$, where $(s_t^{\text{true}}, a_t^{\text{true}})$ are the observed user behaviors.
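
A hedged PyTorch sketch of one alternating update under this objective, on toy shapes with random placeholder data; `RewardNet`, `BehaviorNet`, and all hyperparameters are hypothetical stand-ins for the paper's architectures:

```python
import torch
import torch.nn as nn

STATE_DIM, ITEM_DIM, K = 16, 8, 5   # state size, item-feature size, display-set size (assumed)
ETA = 1.0                           # regularization strength (assumed)

class RewardNet(nn.Module):         # discriminator: r_theta(s_t, a_t)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ITEM_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, state, items):                 # items: (B, K, ITEM_DIM)
        s = state.unsqueeze(1).expand(-1, items.size(1), -1)
        return self.net(torch.cat([s, items], -1)).squeeze(-1)   # (B, K) rewards

class BehaviorNet(nn.Module):       # generator: phi(s_t, A_t), distribution over displayed items
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ITEM_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, state, items):
        s = state.unsqueeze(1).expand(-1, items.size(1), -1)
        logits = self.net(torch.cat([s, items], -1)).squeeze(-1)
        return torch.softmax(logits, dim=-1)                     # (B, K) choice probabilities

reward, behavior = RewardNet(), BehaviorNet()
opt_r = torch.optim.Adam(reward.parameters(), lr=1e-3)
opt_phi = torch.optim.Adam(behavior.parameters(), lr=1e-3)

for step in range(200):
    state = torch.randn(32, STATE_DIM)        # placeholder user states
    items = torch.randn(32, K, ITEM_DIM)      # placeholder displayed items
    clicked = torch.randint(K, (32,))         # placeholder logged (true) clicks

    # Generator step: maximize E_phi[r] - R(phi)/eta, with R = negative entropy.
    phi = behavior(state, items)
    r = reward(state, items)
    entropy = -(phi * (phi + 1e-8).log()).sum(-1).mean()
    gen_value = (phi * r.detach()).sum(-1).mean() + entropy / ETA
    opt_phi.zero_grad(); (-gen_value).backward(); opt_phi.step()

    # Discriminator step: push model-predicted reward below the reward of real clicks.
    phi = behavior(state, items).detach()
    r = reward(state, items)
    true_r = r.gather(1, clicked.unsqueeze(1)).mean()
    disc_loss = (phi * r).sum(-1).mean() - true_r
    opt_r.zero_grad(); disc_loss.backward(); opt_r.step()
```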

  7. Model Parameterization. Two architectures for aggregating historical information (i.e., the state $s_t$): (1) LSTM: the embeddings of the user's past clicked items $f^*_{t-m}, \ldots, f^*_{t-1}$ are fed into an LSTM, and the hidden state $h_{t-1}$ is used as $s_t$. (2) Position Weight: the past clicked-item embeddings are combined via a learned position-weight matrix $W = (w_{ij}) \in \mathbb{R}^{m \times n}$, and the weighted vectors are concatenated to form $s_t$.
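
A minimal sketch of the position-weight aggregation; the shapes $m$, $n$, $d$ and the module name are assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class PositionWeightState(nn.Module):
    def __init__(self, m=10, n=4, d=8):           # m past clicks, n output columns, d-dim embeddings
        super().__init__()
        self.W = nn.Parameter(torch.randn(m, n) * 0.1)   # learned position-weight matrix
    def forward(self, history):                   # history: (B, m, d) clicked-item embeddings
        # Each output column j is a position-weighted sum of the past embeddings.
        mixed = torch.einsum('bmd,mn->bnd', history, self.W)
        return mixed.flatten(1)                   # state s_t: (B, n*d), columns concatenated

s_t = PositionWeightState()(torch.randn(2, 10, 8))
print(s_t.shape)   # torch.Size([2, 32])
```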

  8. Set Recommendation RL Policy. At each step the system displays $k$ items out of all $K$ available items: $(a_1^*, \ldots, a_k^*) = \arg\max_{a_1, \ldots, a_k} Q(s_t, a_1, a_2, \ldots, a_k)$. The combinatorial action space of size $\binom{K}{k}$ makes this computation intractable!
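
To see the blow-up concretely (the numbers below are illustrative, not from the paper):

```python
from math import comb

# For a pool of K=1000 candidate items and display sets of size k=5:
print(comb(1000, 5))   # 8,250,291,250,200 candidate sets, vs. k*K = 5,000
                       # Q evaluations for the cascading approach on the next slide
```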

  9. Set Recommendation RL Policy. We design a cascading Q-network to compute the optimal action with complexity linear in $k$. The joint maximization $(a_1^*, \ldots, a_k^*) = \arg\max_{a_1, \ldots, a_k} Q(s_t, a_1, \ldots, a_k)$ decomposes into a cascade: $a_1^* = \arg\max_{a_1} Q^{1*}(s_t, a_1)$, where $Q^{1*}(s_t, a_1) := \max_{a_{2:k}} Q(s_t, a_1, a_{2:k})$; $a_2^* = \arg\max_{a_2} Q^{2*}(s_t, a_1^*, a_2)$, where $Q^{2*}(s_t, a_1^*, a_2) := \max_{a_{3:k}} Q(s_t, a_1^*, a_2, a_{3:k})$; ...; $a_k^* = \arg\max_{a_k} Q^{k*}(s_t, a_1^*, \ldots, a_{k-1}^*, a_k)$.

  10. Set Recommendation RL Policy: Cascading DQN. [Diagram: the state $s_t$ and candidate items feed $Q^1(s, a_1; \theta^1) \to \arg\max \to a_1^*$; then $Q^2(s, a_1^*, a_2; \theta^2) \to \arg\max \to a_2^*$; ...; $Q^k(s, a_{1:k-1}^*, a_k; \theta^k) \to \arg\max \to a_k^*$.]
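
A toy sketch of the resulting greedy selection; the per-slot Q-values are random stand-ins, whereas in the real model each $Q^j$ is a network conditioned on $s_t$ and the items already chosen:

```python
import torch

K, k = 100, 5                                  # candidate pool size, display-set size (assumed)
q_values = [torch.randn(K) for _ in range(k)]  # stand-ins for Q^j(s_t, a*_{1:j-1}, .)

chosen = []
for j in range(k):
    scores = q_values[j].clone()
    if chosen:
        scores[chosen] = float('-inf')         # each item can be displayed at most once
    chosen.append(scores.argmax().item())      # a_j* = argmax over the remaining items
print(chosen)   # k items selected with O(k*K) Q evaluations instead of C(K, k)
```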

  11. Experiments: (1) predictive performance of the user model; (2) recommendation policy based on the user model.

  12. Experiments. A cascading-DQN policy pre-trained on the GAN user model can quickly achieve a high CTR even when it is applied to a new set of users.

  13. Thanks! Poster: Pacific Ballroom #252, Tue, 06:30 PM. Contact: xinshi.chen@gatech.edu
