Learning Action Representations for Reinforcement Learning


SLIDE 1

Learning Action Representations for Reinforcement Learning

Yash Chandak, Georgios Theocharous, James Kostas, Scott Jordan, Philip Thomas

SLIDE 2

Reinforcement Learning

SLIDES 3-10

Problem Statement

Thousands of possible actions!

  • Personalized tutoring systems
  • Advertisement/marketing
  • Medical treatment - drug prescription
  • Portfolio management
  • Video/song recommendation
  • Option selection
SLIDES 11-14

Key Insights

  • Actions are not independent discrete quantities.
  • There is a low-dimensional structure underlying their behavior pattern.
  • This structure can be learned independently of the reward.
  • Instead of raw actions, the agent can act in this space of behavior, and feedback can be generalized to similar actions.

SLIDE 15

Proposed Method

SLIDES 16-17

Algorithm

(a) Supervised learning of action representations. (b) Learning the internal policy with policy gradients.
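A minimal runnable sketch of how the two phases might alternate, using toy stand-ins: the environment, the Gaussian internal policy, the nearest-embedding mapping, and every name below are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, DIM = 16, 2            # toy sizes; state and embedding share DIM here
SIGMA, LR_REP, LR_POL = 0.5, 0.05, 0.01

TRUE_EFFECT = rng.normal(size=(N_ACTIONS, DIM))            # each action's true displacement
embeddings = rng.normal(scale=0.1, size=(N_ACTIONS, DIM))  # learned action representations
W = np.zeros((DIM, DIM))                                   # internal policy: mean = s @ W

def env_step(s, a):
    """Toy dynamics: the action displaces the state; reward favors the origin."""
    s_next = s + 0.1 * TRUE_EFFECT[a] + 0.01 * rng.normal(size=DIM)
    return s_next, -float(np.linalg.norm(s_next))

for iteration in range(200):
    s, batch = rng.normal(size=DIM), []
    for _ in range(16):                              # collect one short episode
        e = s @ W + SIGMA * rng.normal(size=DIM)     # sample e ~ pi_i(. | s)
        a = int(np.argmin(((embeddings - e) ** 2).sum(axis=1)))  # f: e -> action
        s_next, r = env_step(s, a)
        batch.append((s, e, a, s_next, r))
        s = s_next

    # (a) Supervised learning of action representations (reward-independent):
    # pull the taken action's embedding toward the observed displacement.
    for s_t, _, a_t, s_n, _ in batch:
        embeddings[a_t] += LR_REP * ((s_n - s_t) - embeddings[a_t])

    # (b) Policy gradient (plain REINFORCE here) on the internal policy,
    # entirely in the representation space; f gets no reward-based update.
    G = sum(r for *_, r in batch)                    # episode return
    for s_t, e_t, *_ in batch:
        W += LR_POL * G * np.outer(s_t, e_t - s_t @ W) / SIGMA ** 2
```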

SLIDES 19-22

Results

SLIDE 23

Real-world Applications at Adobe

HelpX: Actions = 1498 tutorials. Photoshop: Actions = 1843 tools.

SLIDE 24

Poster #112 Today

SLIDE 26

Results (Action representations)

Maze domain: the actual behavior of all 2^12 actions, alongside the learned representations of those 2^12 actions.

SLIDE 27

Policy decomposition
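The diagram for this slide is not preserved. As a hedged reconstruction from the surrounding slides, which introduce an internal policy over representations and a mapping P(a|e) back to discrete actions, the overall policy π_o factors as

    π_o(a | s) = ∫_E π_i(e | s) P(a | e) de

where π_i proposes a point e in the low-dimensional representation space E, and P(a|e) converts it into one of the original discrete actions.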

SLIDE 28

Case 1: Action representations are known

  • The internal policy acts in the space of action representations.
  • Any existing policy gradient algorithm can be used to improve its local performance, independent of the mapping function; see the sketch below.
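A minimal sketch of Case 1, assuming a fixed, known set of action embeddings, a Gaussian internal policy, and a nearest-embedding mapping; the class and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

class EmbeddingPolicy:
    """pi_o = f . pi_i: a Gaussian internal policy over the representation
    space plus a FIXED mapping from representations to discrete actions
    (nearest embedding here, standing in for the known mapping)."""

    def __init__(self, s_dim, action_embeddings, sigma=0.5):
        self.E = action_embeddings                   # known mapping, never trained
        self.e_dim = action_embeddings.shape[1]
        self.W = np.zeros((s_dim, self.e_dim))       # pi_i's trainable parameters
        self.sigma = sigma

    def act(self, s):
        e = s @ self.W + self.sigma * rng.normal(size=self.e_dim)
        a = int(np.argmin(((self.E - e) ** 2).sum(axis=1)))
        return a, e

    def grad_log_pi_i(self, s, e):
        # The only gradient a policy-gradient method needs: f contributes
        # no trainable parameters, so updates touch pi_i alone.
        return np.outer(s, e - s @ self.W) / self.sigma ** 2

# Any policy-gradient update plugs in directly, e.g. a single REINFORCE step:
policy = EmbeddingPolicy(s_dim=4, action_embeddings=rng.normal(size=(10, 2)))
s = rng.normal(size=4)
a, e = policy.act(s)
ret = 1.0                                            # placeholder return estimate
policy.W += 0.01 * ret * policy.grad_log_pi_i(s, e)
```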

SLIDE 29

Case 2: Learning action representations

  • P(a|e), required to map a representation to an action, can be learned by satisfying the earlier assumption:

    P(a | s, s') = ∫ P(a | e) P(e | s, s') de

  • We parameterize P(a|e) and P(e|s,s') with learnable functions f and g, respectively.
  • Observed transition tuples come from the required distribution.
  • Parameters can be learned by minimizing a stochastic estimate of the KL divergence; see the sketch after this list.
  • The procedure is independent of the reward.
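A minimal sketch of this supervised procedure, assuming a linear g and a softmax f; the names and the toy data generator are assumptions. Minimizing the KL divergence on observed samples reduces to a cross-entropy loss on the taken action.

```python
import numpy as np

rng = np.random.default_rng(2)
N_ACTIONS, S_DIM, E_DIM, LR = 16, 4, 2, 0.1

E = rng.normal(scale=0.1, size=(N_ACTIONS, E_DIM))  # f: one embedding per action
V = rng.normal(scale=0.1, size=(2 * S_DIM, E_DIM))  # g: (s, s') -> e, linear here

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

TRUE = rng.normal(size=(N_ACTIONS, S_DIM))          # latent per-action effect
def sample_transition():
    """Toy stand-in for observed transition tuples (s, a, s')."""
    s = rng.normal(size=S_DIM)
    a = int(rng.integers(N_ACTIONS))
    return s, a, s + TRUE[a] + 0.1 * rng.normal(size=S_DIM)

for step in range(2000):
    s, a, s_next = sample_transition()
    x = np.concatenate([s, s_next])
    e = x @ V                                        # g(s, s')
    p = softmax(E @ e)                               # f(a | e) for every action

    # Cross-entropy gradient of -log p[a] w.r.t. the logits: p - onehot(a).
    d_logits = p.copy()
    d_logits[a] -= 1.0
    grad_e = d_logits @ E                            # dL/de, taken before E moves
    E -= LR * np.outer(d_logits, e)                  # update f
    V -= LR * np.outer(x, grad_e)                    # backpropagate into g
```

Note that the reward never enters this loop, matching the last bullet above.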
SLIDE 30

Experiments

Toy Maze:

  • Agent in a continuous state space with n actuators.
  • 2^n actions: an exponentially large action space.
  • Long horizon and a single goal reward.

Adobe Datasets:

  • N-gram based multi-time-step user behavior model built from passive data.
  • Rewards defined using a surrogate objective.
  • Photoshop tool recommendation (1843 tools).
  • HelpX tutorial recommendation (1498 tutorials).
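For the maze, the exponential blow-up comes from treating every on/off pattern of the n actuators as a distinct action; a brief illustration:

```python
from itertools import product

n = 12                                       # actuators, as in the maze figures
actions = list(product([0, 1], repeat=n))    # one tuple per on/off pattern
assert len(actions) == 2 ** n                # 4096 distinct actions for n = 12
```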

SLIDE 31

Advantages

  • Exploits structure in the space of actions.
  • Quick generalization of feedback to similar actions.
  • Fewer parameters are updated using high-variance policy gradients.
  • Drop-in extension for existing policy gradient algorithms.