

slide-1
SLIDE 1

Deep Reinforcement Learning

  • M. Soleymani

Sharif University of Technology, Fall 2017. Slides are based on Fei-Fei Li and colleagues' lectures (CS231n, Stanford, 2017) and some from Sergey Levine's lectures (CS294-112, Berkeley, 2016).

slide-2
SLIDE 2

Supervised Learning

  • Data: (x, y)

– x is data
– y is label

  • Goal: Learn a function to map x -> y
  • Examples: classification, regression, object detection, semantic segmentation, image captioning, etc.

slide-3
SLIDE 3

Unsupervised Learning

  • Data: x

– Just data, no labels!

  • Goal: Learn some underlying hidden structure of the data
  • Examples: clustering, dimensionality reduction, feature learning, density estimation, etc.

slide-4
SLIDE 4

Reinforcement Learning

  • Goal: Learn how to take actions in order to maximize reward

– Concerned with taking sequences of actions

  • Described in terms of an agent interacting with a previously unknown environment, trying to maximize cumulative reward
  • An agent interacts with an environment, which provides numeric reward signals

slide-5
SLIDE 5

Overview

  • What is Reinforcement Learning?
  • Markov Decision Processes
  • Q-Learning
  • Policy Gradients
slide-6
SLIDE 6

Reinforcement Learning

slide-7
SLIDE 7

Reinforcement Learning

slide-8
SLIDE 8

Reinforcement Learning

slide-9
SLIDE 9

Reinforcement Learning

slide-10
SLIDE 10

Robot Locomotion

slide-11
SLIDE 11

Motor Control and Robotics

  • Robotics:

– Observations: camera images, joint angles
– Actions: joint torques
– Rewards: stay balanced, navigate to target locations, serve and protect humans

slide-12
SLIDE 12

Atari Games

slide-13
SLIDE 13

Go

slide-14
SLIDE 14

How Does RL Relate to Other Machine Learning Problems?

  • Differences between RL and supervised learning:

– You don't have full access to the function you're trying to optimize; you must query it through interaction.
– You are interacting with a stateful world: the inputs $x_t$ depend on your previous actions.

slide-15
SLIDE 15

How can we mathematically formalize the RL problem?

slide-16
SLIDE 16

Markov Decision Process

slide-17
SLIDE 17

Markov Decision Process

  • At time step t=0, the environment samples the initial state $s_0 \sim p(s_0)$
  • Then, for t=0 until done:

– Agent selects action $a_t$
– Environment samples reward $r_t \sim R(\cdot \mid s_t, a_t)$
– Environment samples next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$
– Agent receives reward $r_t$ and next state $s_{t+1}$

  • A policy $\pi$ is a function from S to A that specifies what action to take in each state
  • Objective: find the policy $\pi^*$ that maximizes the cumulative discounted reward:

$$r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots = \sum_{t \ge 0} \gamma^t r_t$$
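To make the loop above concrete, here is a minimal sketch of the agent-environment interaction in Python. The `env` interface (reset/step) and the policy callable are illustrative assumptions, not part of the slides.

```python
def run_episode(env, policy, gamma=0.99, max_steps=1000):
    """Roll out one episode and return the cumulative discounted reward.

    Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done),
    and that `policy` maps a state to an action; both are hypothetical interfaces.
    """
    s = env.reset()
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        a = policy(s)                   # agent selects action a_t
        s_next, r, done = env.step(a)   # environment samples r_t and s_{t+1}
        total += discount * r           # accumulate gamma^t * r_t
        discount *= gamma
        s = s_next
        if done:
            break
    return total
```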

slide-18
SLIDE 18

A simple MDP: Grid World

slide-19
SLIDE 19

A simple MDP: Grid World

slide-20
SLIDE 20

The optimal policy $\pi^*$

  • We want to find the optimal policy $\pi^*$ that maximizes the sum of rewards.
  • How do we handle the randomness (initial state, transition probability, ...)?

– Maximize the expected sum of rewards!

slide-21
SLIDE 21

Definitions: Value function and Q-value function

slide-22
SLIDE 22

Value function for policy $\pi$

$$V^\pi(s) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, \pi\right]
\qquad
Q^\pi(s, a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a, \pi\right]$$

  • $V^\pi(s)$: how good it is for the agent to be in state $s$ when its policy is $\pi$

– It is simply the expected sum of discounted rewards upon starting in state $s$ and taking actions according to $\pi$

Bellman equations:

$$V^\pi(s) = \mathbb{E}\left[r + \gamma V^\pi(s') \,\middle|\, s, \pi\right]
\qquad
Q^\pi(s, a) = \mathbb{E}\left[r + \gamma Q^\pi(s', a') \,\middle|\, s, a, \pi\right]$$

slide-23
SLIDE 23

Bellman optimality equation

$$V^*(s) = \max_{a \in \mathcal{A}(s)} \mathbb{E}\left[r + \gamma V^*(s') \,\middle|\, s, a\right]$$

$$Q^*(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a\right]$$

slide-24
SLIDE 24

Bellman equation

slide-25
SLIDE 25

Optimal policy

  • It can also be computed as:

$$\pi^*(s) = \arg\max_{a \in \mathcal{A}(s)} Q^*(s, a)$$

slide-26
SLIDE 26

Solving for the optimal policy: Value iteration
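The body of this slide is not in the transcript; for reference, the standard value-iteration update it refers to applies the Bellman optimality backup at every state until convergence (a sketch, using the notation above):

$$V_{i+1}(s) \leftarrow \max_{a} \mathbb{E}\left[r + \gamma V_i(s') \,\middle|\, s, a\right]$$

Under standard conditions $V_i$ converges to $V^*$ as $i \to \infty$.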

slide-27
SLIDE 27

Solving for the optimal policy: Q-learning algorithm

  • Initialize $Q(s, a)$ arbitrarily
  • Repeat (for each episode):

– Initialize $s$
– Repeat (for each step of the episode):
  • Choose $a$ from $s$ using a policy derived from $Q$ (e.g., greedy, ε-greedy)
  • Take action $a$, receive reward $r$, observe new state $s'$
  • $Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$
  • $s \leftarrow s'$
– until $s$ is terminal
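A minimal tabular sketch of this update, assuming a small discrete environment with an illustrative reset()/step()/actions interface (not part of the slides):

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = defaultdict(float)  # Q[(s, a)] initialized (arbitrarily) to 0

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(env.actions)               # explore
        return max(env.actions, key=lambda a: Q[(s, a)])    # exploit

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(s)
            s_next, r, done = env.step(a)
            # TD target bootstraps from the greedy value of the next state
            target = r + (0.0 if done else gamma * max(Q[(s_next, b)] for b in env.actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```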

slide-28
SLIDE 28

Problem

  • Not scalable.

– Must compute Q(s, a) for every state-action pair.
– Computationally infeasible to compute for the entire state space!

  • Solution: use a function approximator to estimate Q(s, a).

– E.g., a neural network!

slide-29
SLIDE 29

Solving for the optimal policy: Q-learning

slide-30
SLIDE 30

Solving for the optimal policy: Q-learning

Iteratively try to make the Q-value close to the target value ($y_i$) it should have according to the Bellman equation. [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]
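For reference, the loss the slide alludes to (from Mnih et al.) regresses the network's Q-value onto a bootstrapped Bellman target; a sketch in the notation used above:

$$L_i(\theta_i) = \mathbb{E}_{s,a}\left[\left(y_i - Q(s, a; \theta_i)\right)^2\right],
\qquad
y_i = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \,\middle|\, s, a\right]$$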

slide-31
SLIDE 31

Case Study: Playing Atari Games (seen before)

[Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

slide-32
SLIDE 32

Q-network Architecture

[Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

  • The last FC layer has a 4-d output (if there are 4 actions): $Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3), Q(s_t, a_4)$
  • The number of actions is between 4 and 18, depending on the Atari game
  • A single feedforward pass computes the Q-values for all actions from the current state => efficient!
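A sketch of this architecture in PyTorch, following the conv/FC sizes reported in the Nature 2015 paper for an 84x84x4 input; treat it as an illustration rather than the authors' exact implementation:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 grayscale 84x84 frames to one Q-value per action."""
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),   # one output per action: Q(s, a_1..a_K)
        )

    def forward(self, x):
        return self.head(self.features(x))

# One forward pass gives Q-values for all actions of the current state.
q_values = QNetwork(n_actions=4)(torch.zeros(1, 4, 84, 84))
```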

slide-33
SLIDE 33

Training the Q-network: Experience Replay

  • Learning from batches of consecutive samples is problematic:

– Samples are correlated => inefficient learning
– Current Q-network parameters determine the next training samples, which can lead to bad feedback loops

  • Address these problems using experience replay:

– Continually update a replay memory table of transitions $(s_t, a_t, r_t, s_{t+1})$
– Train the Q-network on random minibatches of transitions from the replay memory

  • Each transition can contribute to multiple weight updates => greater data efficiency
  • Smooths out learning and avoids oscillations or divergence in the parameters
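A minimal sketch of such a replay memory in Python; the capacity and the tuple layout are assumptions for illustration:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Random minibatches break the correlation between consecutive samples
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```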

slide-34
SLIDE 34

Putting it together: Deep Q-Learning with Experience Replay

[Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

slide-35
SLIDE 35

Putting it together: Deep Q-Learning with Experience Replay

Initialize replay memory, Q-network [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

slide-36
SLIDE 36

Putting it together: Deep Q-Learning with Experience Replay

Play M episodes (full games) [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

slide-37
SLIDE 37

Putting it together: Deep Q-Learning with Experience Replay

Initialize state (starting game screen pixels) at the beginning of each episode [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

slide-38
SLIDE 38

Putting it together: Deep Q-Learning with Experience Replay

For each time-step of game [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

slide-39
SLIDE 39

Putting it together: Deep Q-Learning with Experience Replay

With small probability, select a random action (explore); otherwise select the greedy action from the current policy.

[Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

slide-40
SLIDE 40

Putting it together: Deep Q-Learning with Experience Replay

Take the selected action; observe the reward and next state [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

slide-41
SLIDE 41

Putting it together: Deep Q-Learning with Experience Replay

Store transition in replay memory [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

slide-42
SLIDE 42

Putting it together: Deep Q-Learning with Experience Replay

Sample a random minibatch of transitions and perform a gradient descent step [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]
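Putting the preceding steps together, here is a compact sketch of the training loop, reusing the illustrative QNetwork and ReplayMemory sketched above; the environment interface, optimizer choice, and hyperparameters are assumptions, not the paper's exact settings:

```python
import random
import torch
import torch.nn.functional as F

def train_dqn(env, q_net, memory, n_episodes=100, batch_size=32,
              gamma=0.99, epsilon=0.1, lr=1e-4):
    optimizer = torch.optim.RMSprop(q_net.parameters(), lr=lr)
    for _ in range(n_episodes):                       # play M episodes
        s, done = env.reset(), False                  # initial state (game screen)
        while not done:                               # each time-step of the game
            if random.random() < epsilon:             # explore with small probability
                a = random.randrange(env.n_actions)
            else:                                     # otherwise act greedily
                a = q_net(s.unsqueeze(0)).argmax(dim=1).item()
            s_next, r, done = env.step(a)             # observe reward and next state
            memory.push(s, a, r, s_next, done)        # store transition in replay memory
            s = s_next
            if len(memory) >= batch_size:             # gradient step on a random minibatch
                batch = memory.sample(batch_size)
                states, actions, rewards, next_states, dones = map(list, zip(*batch))
                states, next_states = torch.stack(states), torch.stack(next_states)
                actions = torch.tensor(actions)
                rewards = torch.tensor(rewards, dtype=torch.float32)
                dones = torch.tensor(dones, dtype=torch.float32)
                q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
                with torch.no_grad():
                    target = rewards + gamma * (1 - dones) * q_net(next_states).max(dim=1).values
                loss = F.mse_loss(q_sa, target)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```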

slide-43
SLIDE 43
slide-44
SLIDE 44

Results on 49 Games

  • The architecture and hyperparameter values were the same for all 49 games.
  • DQN achieved performance comparable to or better than an experienced human on 29 out of 49 games.

[V. Mnih et al., Human-level control through deep reinforcement learning, Nature 2015]

slide-45
SLIDE 45

Policy Gradients

  • What is a problem with Q-learning?

– The Q-function can be very complicated!

  • It is hard to learn the exact value of every (state, action) pair
  • But the policy can be much simpler
  • Can we learn a policy directly, e.g., find the best policy from a collection of policies?

slide-46
SLIDE 46

The goal of RL

the policy that must be learnt

slide-47
SLIDE 47

Policy Gradients
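The figures on this slide are not in the transcript; the standard formulation they build on defines the objective as the expected return of trajectories sampled from the policy, with its gradient obtained via the log-derivative trick (a sketch in the notation used below):

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[r(\tau)\right],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[r(\tau)\, \nabla_\theta \log p_\theta(\tau)\right]$$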

slide-48
SLIDE 48

REINFORCE algorithm

slide-49
SLIDE 49

REINFORCE algorithm

slide-50
SLIDE 50

REINFORCE algorithm

slide-51
SLIDE 51

Can estimate with Monte Carlo sampling

slide-52
SLIDE 52

REINFORCE algorithm

slide-53
SLIDE 53

REINFORCE algorithm

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t \ge 0} r\!\left(s_t^{(i)}, a_t^{(i)}\right)\right) \left(\sum_{t \ge 0} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right)\right)$$

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} r\!\left(\tau^{(i)}\right) \nabla_\theta \log p_\theta\!\left(\tau^{(i)}\right)$$

slide-54
SLIDE 54

REINFORCE Algorithm

  • Repeat:

– Sample trajectories $\tau^{(i)}$ from $\pi_\theta(a \mid s)$ (run the policy)
– $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t \ge 0} r\!\left(s_t^{(i)}, a_t^{(i)}\right)\right) \left(\sum_{t \ge 0} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right)\right)$
– $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
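A minimal PyTorch sketch of this loop for a discrete-action policy; the policy network, environment interface, and hyperparameters are illustrative assumptions:

```python
import torch

def reinforce_step(env, policy_net, optimizer, n_trajectories=10, gamma=1.0):
    """One REINFORCE update: sample trajectories, then ascend the policy gradient."""
    loss = 0.0
    for _ in range(n_trajectories):
        s, done = env.reset(), False
        log_probs, rewards = [], []
        while not done:                                        # run the policy
            probs = torch.softmax(policy_net(s.unsqueeze(0)), dim=1).squeeze(0)
            dist = torch.distributions.Categorical(probs)
            a = dist.sample()
            log_probs.append(dist.log_prob(a))                 # log pi_theta(a_t | s_t)
            s, r, done = env.step(a.item())
            rewards.append(r)
        episode_return = sum((gamma ** t) * r for t, r in enumerate(rewards))
        # Negative sign: optimizers minimize, and we want gradient *ascent* on J(theta)
        loss = loss - episode_return * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    (loss / n_trajectories).backward()
    optimizer.step()
```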

slide-55
SLIDE 55

Evaluating the policy gradient

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t \ge 0} r\!\left(s_t^{(i)}, a_t^{(i)}\right)\right) \left(\sum_{t \ge 0} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right)\right)$$

(Figure: a sampled trajectory of states $s_t$ and actions $a_t \sim \pi_\theta(a_t \mid s_t)$, with total reward $r(\tau^{(i)})$.)

slide-56
SLIDE 56
  • Policy gradient:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t \ge 0} r\!\left(s_t^{(i)}, a_t^{(i)}\right)\right) \left(\sum_{t \ge 0} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right)\right)$$

  • Maximum likelihood:

$$\nabla_\theta J_{\mathrm{ML}}(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t \ge 0} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right)$$

slide-57
SLIDE 57

Intuition

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} r\!\left(\tau^{(i)}\right) \sum_{t \ge 0} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right)$$

  • If a trajectory's total reward is high, push up the probabilities of the actions taken along it; if it is low, push them down.
  • However, this also suffers from high variance, because credit assignment is really hard.
slide-58
SLIDE 58

What did we just do?

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} r\!\left(\tau^{(i)}\right) \nabla_\theta \log p_\theta\!\left(\tau^{(i)}\right)
= \frac{1}{N} \sum_{i=1}^{N} r\!\left(\tau^{(i)}\right) \sum_{t \ge 0} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right)$$

$$\nabla_\theta J_{\mathrm{ML}}(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log p_\theta\!\left(\tau^{(i)}\right)$$

slide-59
SLIDE 59

Reducing variance

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t \ge 0} r\!\left(s_t^{(i)}, a_t^{(i)}\right)\right) \left(\sum_{t \ge 0} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right)\right)$$

  • Causality: the action at time $t$ cannot affect rewards received before time $t$, so each log-probability term is weighted only by the rewards that follow it:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t \ge 0} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) \left(\sum_{t' \ge t} r\!\left(s_{t'}^{(i)}, a_{t'}^{(i)}\right)\right)$$

slide-60
SLIDE 60

Variance reduction

Full-trajectory reward:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t \ge 0} r\!\left(s_t^{(i)}, a_t^{(i)}\right)\right) \left(\sum_{t \ge 0} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right)\right)$$

Reward to go (causality):

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t \ge 0} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) \left(\sum_{t' \ge t} r\!\left(s_{t'}^{(i)}, a_{t'}^{(i)}\right)\right)$$

Discounted reward to go:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t \ge 0} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) \left(\sum_{t' \ge t} \gamma^{t'-t}\, r\!\left(s_{t'}^{(i)}, a_{t'}^{(i)}\right)\right)$$

slide-61
SLIDE 61

Variance reduction: Baseline

  • Problem: the raw value of a trajectory isn't necessarily meaningful.

– For example, if rewards are all positive, you keep pushing up the probabilities of actions.

  • What is important, then?

– Whether a reward is better or worse than what you expect to get.

  • Idea: introduce a baseline function dependent on the state.

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t \ge 0} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) \left(\sum_{t' \ge t} \gamma^{t'-t}\, r\!\left(s_{t'}^{(i)}, a_{t'}^{(i)}\right) - b\!\left(s_t^{(i)}\right)\right)$$

Simple baseline: $b = \frac{1}{N} \sum_{i=1}^{N} r\!\left(\tau^{(i)}\right)$. The average reward is not the best baseline, but it's pretty good!

slide-62
SLIDE 62

How to choose the baseline?

  • A simple baseline: a constant moving average of rewards experienced so far from all trajectories (see the sketch below)
  • The variance reduction techniques seen so far are typically used in "vanilla REINFORCE"
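A minimal sketch of such a baseline as a running average over all trajectories seen so far; the class and method names are hypothetical:

```python
class RunningMeanBaseline:
    """Running average of all trajectory rewards observed so far."""
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, trajectory_reward):
        self.total += trajectory_reward
        self.count += 1
        # Subtract this value from returns before the policy-gradient update.
        return self.total / self.count
```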

slide-63
SLIDE 63

Policy gradient in practice

  • Remember that the gradient has high variance

– This isn't the same as supervised learning!
– Gradients will be really noisy!

  • Consider using much larger batches
  • Tweaking learning rates is very hard

– Adaptive step size rules like ADAM can be OK-ish
– There are also policy gradient-specific learning rate adjustment methods

slide-64
SLIDE 64

REINFORCE Algorithm

  • Repeat:

– Sample $\tau^{(i)}$ from $\pi_\theta(a_t \mid s_t)$ (run the policy)
– $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t \ge 0} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) \underbrace{\left(\sum_{t' \ge t} \gamma^{t'-t}\, r\!\left(s_{t'}^{(i)}, a_{t'}^{(i)}\right)\right)}_{\hat{Q}_t^{(i)}:\ \text{reward to go}}$
– $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

slide-65
SLIDE 65

How to choose the baseline?

  • A better baseline (to push up the probability of an action from a state): push it up if this action was better than the expected value of what we should get from that state.
  • We are happy with an action $a_t$ in a state $s_t$ if this difference is large
  • We are unhappy with an action if it's small
slide-66
SLIDE 66

Improving the policy gradient

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t \ge 0} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) \underbrace{\left(\sum_{t' \ge t} \gamma^{t'-t}\, r\!\left(s_{t'}^{(i)}, a_{t'}^{(i)}\right)\right)}_{\hat{Q}_t^{(i)}:\ \text{reward to go}}$$

  • $\hat{Q}_t^{(i)}$: estimate of the expected reward if we take action $a_t^{(i)}$ in state $s_t^{(i)}$
  • $Q(s_t, a_t) = \sum_{t' \ge t} \mathbb{E}_{p_\theta}\left[\gamma^{t'-t}\, r(s_{t'}, a_{t'}) \mid s_t, a_t\right]$

– True expected reward to go

  • $V(s_t) = \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[Q(s_t, a_t)\right]$
  • $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t \ge 0} \left(Q\!\left(s_t^{(i)}, a_t^{(i)}\right) - V\!\left(s_t^{(i)}\right)\right) \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right)$

slide-67
SLIDE 67

State & state-action value functions

  • $Q^\pi(s_t, a_t) = \sum_{t' \ge t} \mathbb{E}_{p_\theta}\left[\gamma^{t'-t}\, r(s_{t'}, a_{t'}) \mid s_t, a_t\right]$

– True expected reward to go

  • $V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[Q^\pi(s_t, a_t)\right]$

– Total reward from $s_t$

  • $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$

– How much better $a_t$ is

  • $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t \ge 0} A^\pi\!\left(s_t^{(i)}, a_t^{(i)}\right) \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right)$

Remark: the advantage function captures how much better an action was than expected.

slide-68
SLIDE 68

Improving the policy gradient: summary

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t \ge 0} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) \left(\sum_{t' \ge t} \gamma^{t'-t}\, r\!\left(s_{t'}^{(i)}, a_{t'}^{(i)}\right) - b\right)$$

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t \ge 0} A^\pi\!\left(s_t^{(i)}, a_t^{(i)}\right) \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right)$$

Instead of using the unbiased but high-variance single-sample estimate, use $A^\pi$, which is an estimate of the expectation.

slide-69
SLIDE 69

Actor-Critic Algorithm

  • Problem: we don't know the value functions
  • Can we learn them?

– Yes, like Q-learning!

  • We can combine policy gradients and Q-learning by training both an actor (the policy) and a critic (the Q-function).

– The actor decides which action to take
– The critic tells the actor how good its action was and how it should adjust
– This also alleviates the task of the critic, as it only has to learn the values of (state, action) pairs generated by the policy

slide-70
SLIDE 70

Value function fitting


slide-71
SLIDE 71

An actor-critic algorithm

$$y_t \approx r(s_t, a_t) + \gamma V^\pi_\phi(s_{t+1})$$

$$\mathcal{L}(\phi) = \sum_t \left(V^\pi_\phi(s_t) - y_t\right)^2$$

(repeat)
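A minimal sketch of one such actor-critic update in PyTorch; the network definitions, environment interface, and one-step bootstrapped targets are illustrative assumptions:

```python
import torch

def actor_critic_step(transition, policy_net, value_net, pi_opt, v_opt, gamma=0.99):
    """One update from a single transition (s, a, r, s_next, done).

    Assumes s and s_next are 1-D feature tensors, value_net(s) returns a
    scalar-shaped tensor, and policy_net(s) returns action logits.
    """
    s, a, r, s_next, done = transition

    # Critic target: y_t = r(s_t, a_t) + gamma * V_phi(s_{t+1})
    with torch.no_grad():
        bootstrap = 0.0 if done else value_net(s_next).item()
    y = torch.tensor(r + gamma * bootstrap)

    # Fit the value function by minimizing (V_phi(s_t) - y_t)^2
    v_loss = (value_net(s).squeeze() - y) ** 2
    v_opt.zero_grad(); v_loss.backward(); v_opt.step()

    # Actor update: weight log pi(a_t | s_t) by the advantage estimate y_t - V_phi(s_t)
    with torch.no_grad():
        advantage = y - value_net(s).squeeze()
    log_prob = torch.log_softmax(policy_net(s), dim=-1)[a]
    pi_loss = -advantage * log_prob
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```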

slide-72
SLIDE 72

Actor-critic methods

Actor-critic methods combine value optimization and policy optimization.

slide-73
SLIDE 73

Advantages of Policy-based RL

  • Advantages

– Better convergence properties
– Effective in high-dimensional or continuous action spaces
– Can learn stochastic policies

  • Disadvantages

– Typically converges to a local rather than global optimum
– Evaluating a policy is typically inefficient and high variance

slide-74
SLIDE 74

RL in Other ML Problems

  • Hard attention

– Observation: current image window
– Action: where to look
– Reward: classification

  • Sequential/structured prediction, e.g., machine translation

– Observations: words in source language
– Actions: emit word in target language
– Rewards: sentence-level metric, e.g., BLEU score

  • V. Mnih et al., "Recurrent models of visual attention", NIPS 2014.
  • M. Ranzato et al., "Sequence level training with recurrent neural networks", 2015.

slide-75
SLIDE 75

REINFORCE in action: Recurrent Attention Model (RAM)

  • Objective: image classification
  • Take a sequence of "glimpses" selectively focusing on regions of the image, to predict the class

– Inspiration from human perception and eye movements
– Saves computational resources => scalability
– Able to ignore clutter / irrelevant parts of the image

[Mnih et al. 2014]

slide-76
SLIDE 76

REINFORCE in action: Recurrent Attention Model (RAM)

  • Objective: image classification
  • State: glimpses seen so far
  • Action: (x, y) coordinates (center of glimpse) of where to look next in the image
  • Reward: 1 at the final timestep if the image is correctly classified, 0 otherwise
  • Glimpsing is a non-differentiable operation => learn a policy for how to take glimpse actions using REINFORCE

[Mnih et al. 2014]

slide-77
SLIDE 77

REINFORCE in action: Recurrent Attention Model (RAM)

  • Given the state of glimpses seen so far, use an RNN to model the state and output the next action

[Mnih et al. 2014]

slide-78
SLIDE 78

REINFORCE in action: Recurrent Attention Model (RAM)

  • Given the state of glimpses seen so far, use an RNN to model the state and output the next action

[Mnih et al. 2014]

slide-79
SLIDE 79

REINFORCE in action: Recurrent Attention Model (RAM)

  • Given the state of glimpses seen so far, use an RNN to model the state and output the next action
slide-80
SLIDE 80

REINFORCE in action: Recurrent Attention Model (RAM)

slide-81
SLIDE 81

Sequence generation: disadvantages of previous methods

  • The model was trained on a different distribution of inputs from the ones it encounters at test time (generated by itself)
  • Errors made along the way will quickly accumulate (exposure bias)
  • The loss function used to train these models is at the word level
  • Training these models to directly optimize metrics like BLEU (by which they are typically evaluated) is hard

– because these metrics are not differentiable
– BLEU: comparing the sequence of actions from the current policy against the optimal action sequence
  • M. Ranzato et al., “Sequence level training with recurrent neural networks”, 2015.
slide-82
SLIDE 82

Sequence level evaluation

  • A greedy left-to-right process does not necessarily produce the most likely sequence according to the model
  • One of the existing methods to reduce this effect is beam search

– It pursues not only one but k next-word candidates at each point.

  • M. Ranzato et al., “Sequence level training with recurrent neural networks”, 2015.
slide-83
SLIDE 83

Sequence level training

  • Start from the greedy policy and then slowly deviate from it to let the model explore and make use of its own predictions.

– The greedy policy is obtained by maximum likelihood on the training data

  • M. Ranzato et al., “Sequence level training with recurrent neural networks”, 2015.
slide-84
SLIDE 84

Sequence level training: Loss function

  • The baseline reward $\bar{r}_t$ is estimated by a linear regressor which takes as input the hidden states $h_t$ of the RNN

  • M. Ranzato et al., “Sequence level training with recurrent neural networks”, 2015.
slide-85
SLIDE 85

Sequence level training

(Figure: results comparing training with XENT (cross-entropy) and XE+R.)

  • M. Ranzato et al., "Sequence level training with recurrent neural networks", 2015.

slide-86
SLIDE 86

More policy gradients: AlphaGo

slide-87
SLIDE 87

More policy gradients: AlphaGo

  • How to beat the Go world champion:

– Featurize the board (stone color, move legality, bias, ...)
– Initialize the policy network with supervised training from professional Go games, then continue training using policy gradient (play against itself from random previous iterations; +1/-1 reward for winning/losing)
– Also learn a value network (critic)
– Finally, combine the policy and value networks in a Monte Carlo Tree Search algorithm to select actions by lookahead search

slide-88
SLIDE 88

Summary

  • Policy gradients: very general, but suffer from high variance, so they require a lot of samples.

– Challenge: sample efficiency

  • Q-learning: does not always work, but when it works it is usually more sample-efficient.

– Challenge: exploration

  • Guarantees:

– Policy gradients: converge to a local optimum of $J(\theta)$, often good enough!
– Q-learning: no guarantees, since you are approximating the Bellman equation with a complicated function approximator