SLIDE 1

Deep Reinforcement Learning

Shan-Hung Wu

shwu@cs.nthu.edu.tw

Department of Computer Science, National Tsing Hua University, Taiwan

Machine Learning

SLIDE 2

Outline

1. Introduction
2. Value-based Deep RL: Deep Q-Network; Improvements
3. Policy-based Deep RL: Pathwise Derivative Methods; Policy Gradient/Optimization Methods; Variance Reduction and Actor-Critic

SLIDE 5

(Tabular) RL

Q-learning: Q∗(s,a) ← Q∗(s,a) + η[(R(s,a,s′) + γ maxa′ Q∗(s′,a′)) − Q∗(s,a)]
SARSA: Qπ(s,a) ← Qπ(s,a) + η[(R(s,a,s′) + γQπ(s′,π(s′))) − Qπ(s,a)]
In realistic environments with large state/action spaces, this requires a large table to store the Q∗/Qπ values
Maze: O(10^1), Tetris: O(10^60), Atari: O(10^16922) pixel configurations
Continuous states/actions?

May not be able to visit all (s,a)’s in limited training time
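To make the tabular updates above concrete, here is a minimal Python sketch of Q-learning with an explicit table; the environment interface (reset/step/sample_action) is an assumption for illustration, not something specified on the slides.

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions, episodes=500,
                       eta=0.1, gamma=0.99, eps=0.1):
    """Q-learning with one table entry per (s, a) pair."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration policy derived from the current Q table
            a = env.sample_action() if np.random.rand() < eps else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # TD update toward R(s, a, s') + gamma * max_a' Q(s', a')
            target = r + gamma * Q[s_next].max() * (not done)
            Q[s, a] += eta * (target - Q[s, a])
            s = s_next
    return Q
```

The cost of the table here is exactly the O(|S| × |A|) storage issue raised above.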

SLIDE 7

Generalizing across States

Idea: to learn a function fQ∗(s,a;Θ) (resp. fQπ) that approximates Q∗(s,a) (resp. Qπ(s,a)), ∀s,a

Trained by a small number (millions) of samples
Generalizes to unseen states/actions
Smaller Θ to store
E.g., in Q-learning, Q∗ should satisfy the Bellman optimality equation: Q∗(s,a) = ∑s′ P(s′|s;a)[R(s,a,s′) + γ maxa′ Q∗(s′,a′)], ∀s,a
Algorithm (TD estimate): initialize Θ arbitrarily, iterate until convergence:
1. Take action a from s using some exploration policy π′ derived from fQ∗ (e.g., ε-greedy)
2. Observe s′ and reward R(s,a,s′), then update Θ using SGD: Θ ← Θ − η∇ΘC, where C(Θ) = [R(s,a,s′) + γ maxa′ fQ∗(s′,a′;Θ) − fQ∗(s,a;Θ)]^2

SLIDE 9

Works, with Careful Feature Engineering

Tetris: [1]

States: O(10^60) configurations
Actions: rotations and translations of the falling piece
f(s,a;Θ) and C(Θ) modeled as an approximate linear programming problem
Hand-crafted features (22 in total)
Why not use a deep neural network to represent QΘ?
One model for different tasks
Automatically learned features

SLIDE 10

Deep RL

Value-based: use DNNs to represent value/Q-function

E.g., DQN
π∗(s) ← argmaxa Q∗(s,a) is only feasible if actions are discrete

Policy-based: use DNNs to represent policy π

E.g., DDPG, Actor-Critic, A3C, TRPO, PPO

Model-based: deep RL when MDP/env. model is known

E.g., AlphaGo

SLIDE 11

Outline

1. Introduction
2. Value-based Deep RL: Deep Q-Network; Improvements
3. Policy-based Deep RL: Pathwise Derivative Methods; Policy Gradient/Optimization Methods; Variance Reduction and Actor-Critic

SLIDE 14

DNNs for Q∗

Use a DNN fQ∗(s,a;Θ) to represent Q∗(s,a)
Algorithm (TD): initialize Θ arbitrarily, iterate until convergence:
1. Take action a from s using some exploration policy π′ derived from fQ∗ (e.g., ε-greedy)
2. Observe s′ and reward R(s,a,s′), then update Θ using SGD: Θ ← Θ − η∇ΘC, where C(Θ) = [R(s,a,s′) + γ maxa′ fQ∗(s′,a′;Θ) − fQ∗(s,a;Θ)]^2
However, this diverges due to:
Samples are correlated (violating the i.i.d. assumption on training examples)
Non-stationary target (fQ∗(s′,a′;Θ) changes as Θ is updated for the current a)

SLIDE 15

Outline

1. Introduction
2. Value-based Deep RL: Deep Q-Network; Improvements
3. Policy-based Deep RL: Pathwise Derivative Methods; Policy Gradient/Optimization Methods; Variance Reduction and Actor-Critic

SLIDE 17

Deep Q-Network (DQN)

Naive TD algorithm diverges due to:
Samples are correlated
Non-stationary target
Stabilization techniques proposed by (Nature) DQN [5]:
Experience replay
Delayed target network

SLIDE 19

Experience Replay

Use a replay memory D to store recently seen transitions (s,a,r,s′)'s
Sample a mini-batch from D and update Θ
Algorithm (TD): initialize Θ arbitrarily, iterate until convergence:
1. Take action a from s using π′ derived from fQ∗ (e.g., ε-greedy)
2. Observe s′ and reward R, add (s,a,R,s′) to D
3. Sample a mini-batch of (s(i),a(i),R(i),s(i+1))'s from D and do Θ ← Θ − η∇ΘC, where C(Θ) = ∑i [R(i) + γ maxa′ fQ∗(s(i+1),a′;Θ) − fQ∗(s(i),a(i);Θ)]^2
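A replay memory is just a bounded FIFO buffer with uniform mini-batch sampling; the sketch below is a minimal version (the capacity and batch size are illustrative defaults, not values from the slides).

```python
import random
from collections import deque

class ReplayMemory:
    """Stores recent transitions (s, a, r, s_next, done) and samples them uniformly."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # breaks temporal correlation
        return map(list, zip(*batch))   # lists of states, actions, rewards, next states, dones

    def __len__(self):
        return len(self.buffer)
```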

SLIDE 21

Delayed Target Network

To avoid chasing a moving target, set the target value as a network output parametrized by an old Θ−
Algorithm (TD): initialize Θ arbitrarily and Θ− = Θ, iterate until convergence:
1. Take action a from s using π′ derived from fQ∗ (e.g., ε-greedy)
2. Observe s′ and reward R, add (s,a,R,s′) to D
3. Sample a mini-batch of (s(i),a(i),R(i),s(i+1))'s from D and do Θ ← Θ − η∇ΘC, where C(Θ) = ∑i [R(i) + γ maxa′ fQ∗(s(i+1),a′;Θ−) − fQ∗(s(i),a(i);Θ)]^2
4. Update Θ− ← Θ every K iterations
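Combining the replay memory with the delayed target network, one DQN update might look like the following PyTorch sketch; q_net and target_net are assumed to map a batch of states to one Q-value per action, and memory is assumed to be a buffer like the one sketched earlier, with states stored as tensors.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, memory, optimizer, gamma=0.99, batch_size=32):
    """One SGD step on the squared TD error, with targets from the frozen network."""
    s, a, r, s_next, done = memory.sample(batch_size)
    s, s_next = torch.stack(s), torch.stack(s_next)
    a = torch.tensor(a, dtype=torch.int64).unsqueeze(1)
    r = torch.tensor(r, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a).squeeze(1)                # f_Q*(s(i), a(i); Theta)
    with torch.no_grad():                                  # target uses the old Theta^-
        target = r + gamma * (1 - done) * target_net(s_next).max(1).values
    loss = F.mse_loss(q_sa, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target(q_net, target_net):
    """Step 4: copy Theta into Theta^- every K iterations."""
    target_net.load_state_dict(q_net.state_dict())
```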

SLIDE 23

Other Tricks

Optimization techniques matter in deep RL

Optimization error may lead to wrong transitions (trajectories)
And a bad final policy

Reward clipping for better conditioned gradients

Can't differentiate between small and large rewards
Better to use batch normalization

Use RMSProp instead of vanilla SGD for adaptive learning rate

SLIDE 24

DQN on Atari

49 Atari 2600 games
States: raw pixels
Actions: 18 joystick/button positions
Rewards: changes in score

SLIDE 26

Network Architecture

End-to-end from raw pixels to Q∗(s,a)
CNN + fully connected layers
Input: state s, a stack of raw pixels from the last 4 frames
Output: 18 Q∗(s,a)'s (one for each action)
Network architecture is fixed across all games
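A PyTorch sketch of a network in this style (4 stacked frames in, one Q-value per action out); the specific layer sizes below are illustrative choices, not values taken from the slides.

```python
import torch.nn as nn

class AtariQNet(nn.Module):
    """CNN + fully connected layers: 4 stacked 84x84 frames -> one Q*(s, a) per action."""
    def __init__(self, n_actions=18):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),        # 18 outputs, one per joystick/button action
        )

    def forward(self, x):                     # x: (batch, 4, 84, 84), pixels scaled to [0, 1]
        return self.head(self.features(x))
```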

SLIDE 27

Results

SLIDE 28

Effect of Stability Techniques

Delayed target network is less useful for large networks

SLIDE 29

Predicted Q∗ Values for Pong

SLIDE 30

Outline

1. Introduction
2. Value-based Deep RL: Deep Q-Network; Improvements
3. Policy-based Deep RL: Pathwise Derivative Methods; Policy Gradient/Optimization Methods; Variance Reduction and Actor-Critic

SLIDE 31

Improvements since DQN

Stabilization:

Double DQN [9]
Prioritized replay [7]

Modeling additional prior:

Dueling network [10]

Exploration:

NoisyNet [2]

Large-scale implementation

SLIDE 32

Double DQN I

DQN update rule: Θ ← Θ−η∇ΘC, where

C(Θ) = ∑i [R(i) + γ maxa′ fQ∗(s(i+1),a′;Θ−) − fQ∗(s(i),a(i);Θ)]^2

There is an upward bias in maxa′ fQ∗(s(i+1),a′;Θ−)

fQ∗(s(i+1),a′;Θ−) with high positive error is preferred

At each step, the positive error is added to fQ∗(s(i),a(i);Θ)

SLIDE 33

Double DQN II

Double DQN (DDQN) [9]:

C(Θ) = ∑i [R(i) + γ fQ∗(s(i+1), argmaxa′ fQ∗(s(i+1),a′;Θ); Θ−) − fQ∗(s(i),a(i);Θ)]^2
Uses Θ to select the best action
Uses Θ− to evaluate the best action

Random (unbiased) error added to fQ∗(s(i),a(i);Θ) at each step
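The only change relative to DQN is how the bootstrap target is built: the online network selects the action, the delayed network evaluates it. A sketch of that step alone, using the same tensor conventions as the DQN update above:

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    """R(i) + gamma * f_Q*(s(i+1), argmax_a' f_Q*(s(i+1), a'; Theta); Theta^-)."""
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)         # select with Theta
        q_eval = target_net(s_next).gather(1, best_a).squeeze(1)   # evaluate with Theta^-
        return r + gamma * (1 - done) * q_eval
```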

SLIDE 34

Prioritized Replay [7]

Not all (s,a,R,s′)'s from D are equally helpful to training fQ∗
Sample (s,a,R,s′)'s with probability proportional to the "surprise" in terms of the Bellman equation: |R + γ maxa′ fQ∗(s′,a′;Θ−) − fQ∗(s,a;Θ)|
Rank-based alternative: make D a priority queue
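A minimal sketch of proportional prioritization: each transition's priority is its absolute Bellman error, and sampling probabilities follow those priorities. The exponent alpha and the small eps are assumed hyperparameters, not values given on the slide.

```python
import numpy as np

def prioritized_indices(td_errors, batch_size=32, alpha=0.6, eps=1e-6):
    """Sample transition indices with probability proportional to |TD error|^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha   # eps keeps every probability positive
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)
```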

SLIDE 38

Dueling Network [10]

Q∗(s,a) = V∗(s)+A∗(s,a)

A∗(s,a) the advantage function of a

Idea: to model this prior and learn fQ∗(s,a) = fV∗(s)+fA∗(s,a)

Not well-defined: fQ∗(s,a) = (fV∗(s)+c)+(fA∗(s,a)−c) for any c

Dueling DQN: fQ∗(s,a) = fV∗(s)+(fA∗(s,a)−maxa′ fA∗(s,a′))

The best action a∗ has zero advantage and fQ∗(s,a∗) = fV∗(s)

Stabilized version: fQ∗(s,a) = fV∗(s) + (fA∗(s,a) − (1/|A|) ∑a′ fA∗(s,a′))

fV∗ and fA∗ are off-target (by a constant) but fA∗ changes more slowly
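A PyTorch sketch of the dueling head using the mean-subtracted (stabilized) combination from the last bullet; in_dim is the output size of whatever shared feature trunk (e.g., the Atari CNN above) feeds it.

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Splits into a state-value stream f_V*(s) and an advantage stream f_A*(s, a)."""
    def __init__(self, in_dim, n_actions, hidden=512):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, features):
        v = self.value(features)                        # (batch, 1)
        a = self.advantage(features)                    # (batch, n_actions)
        return v + (a - a.mean(dim=1, keepdim=True))    # f_Q*(s, a) with the mean advantage removed
```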

SLIDE 39

Improvement over Prioritized DDQN

SLIDE 40

See, Attend, and Drive

Atari game: Enduro
Attention masks: ∂fV∗/∂s (s) and ∂fA∗/∂s (s)

fQ∗(s,a) = fV∗(s) + fA∗(s,a)

fV∗ pays attention to the road
fA∗ pays attention only when there are obstacles in front

SLIDE 41

NoisyNet [2]

Instead of using ε-greedy for exploration, add noise to Θ
The level of noise is learned by SGD along with Θ
Improvement over Dueling DQN:

SLIDE 43

Scaling Up DQN on Single Machine

Exploits multi-threading of modern CPUs/GPUs
Run/train multiple agents in parallel (one per thread/GPU)
Θ shared between threads in main memory
Data-parallelism

Parallelism decorrelates samples

Alternative to experience replay

SLIDE 44

Scaling Out DQN with Gorila [6]

Distributed system architecture for large-scale RL
10x faster than Nature DQN
Applied to recommender systems at Google

SLIDE 45

Outline

1. Introduction
2. Value-based Deep RL: Deep Q-Network; Improvements
3. Policy-based Deep RL: Pathwise Derivative Methods; Policy Gradient/Optimization Methods; Variance Reduction and Actor-Critic

SLIDE 48

Why Policy Network?

Policy-based deep RL: use a DNN gπ(s;Φ) to approximate π(s). Why?
In DQN (or any method based on value/policy iteration), one needs to solve π∗(s) = argmaxa′ Q∗(s,a′) or π̂(s) = argmaxa′ Qπ(s,a′)
Not applicable to a continuous action space A, common in, e.g., robotics

π may be easier to learn than Q or V

SLIDE 51

Modeling π

π can be either

deterministic: gπ(s;Φ) = a, or stochastic: gπ(s;Φ) = P(a|s)

Pathwise derivative methods

For deterministic π and continuous A
To find Φ such that gπ(s;Φ) gives the action a maximizing Q∗(s,a)
Changes the trajectory of an episode in the graph of cumulative rewards

Policy gradient/optimization methods

For stochastic π
To find Φ such that gπ yields trajectories with high cumulative rewards
Do not change the trajectory (but its probability) of an episode

SLIDE 52

Outline

1. Introduction
2. Value-based Deep RL: Deep Q-Network; Improvements
3. Policy-based Deep RL: Pathwise Derivative Methods; Policy Gradient/Optimization Methods; Variance Reduction and Actor-Critic

SLIDE 54

Deep Deterministic Policy Gradient (DDPG) [4]

Based on DQN

Q-learning is off-policy and works with changing exploration strategies

Deterministic policy: gπ∗(s;Φ) = a ∈ R
Goal: to find Φ maximizing Es[fQ∗(s,a;Θ)], where a = gπ∗(s;Φ)
SGD update rule: Φ ← Φ + η ∂Es[fQ∗(s,a;Θ)]/∂Φ = Φ + η Es[∂fQ∗/∂a (s,a;Θ) · ∂gπ∗/∂Φ (s;Φ)]

SLIDE 55

DDPG Algorithm (TD)

Initialize Θ and Φ arbitrarily, set Θ− = Θ and Φ− = Φ, iterate until convergence:
1. Take action a = gπ∗(s;Φ) + z from s, where z is random noise for exploration
2. Observe s′ and reward R, add (s,a,R,s′) to D
3. Sample a mini-batch of (s(i),a(i),R(i),s(i+1))'s from D
4. Update Θ: Θ ← Θ − η∇ΘC, where C(Θ) = ∑i [R(i) + γ fQ∗(s(i+1), gπ∗(s(i+1);Φ−); Θ−) − fQ∗(s(i),a(i);Θ)]^2
5. Update Φ: Φ ← Φ + λ ∑i ∂fQ∗/∂a (s(i), gπ∗(s(i);Φ); Θ) · ∂gπ∗/∂Φ (s(i);Φ)
6. Update Θ− ← τΘ + (1−τ)Θ− and Φ− ← τΦ + (1−τ)Φ−
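A condensed PyTorch sketch of steps 4-6: the critic is regressed onto the delayed-network target, the actor is pushed along ∂fQ∗/∂a · ∂gπ∗/∂Φ simply by backpropagating −fQ∗(s, gπ∗(s)) through the actor, and the delayed copies are updated softly. The actor/critic modules, their optimizers, and the sampled batch are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_tgt, critic_tgt,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    s, a, r, s_next = batch                              # tensors sampled from the replay memory D

    # Step 4: critic update on the squared TD error, targets from the delayed networks
    with torch.no_grad():
        target = r + gamma * critic_tgt(s_next, actor_tgt(s_next)).squeeze(1)
    critic_loss = F.mse_loss(critic(s, a).squeeze(1), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Step 5: actor update, ascend E_s[f_Q*(s, g_pi*(s; Phi); Theta)] by minimizing its negation
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Step 6: soft update, Theta^- <- tau*Theta + (1 - tau)*Theta^- (likewise for Phi^-)
    for tgt, src in ((actor_tgt, actor), (critic_tgt, critic)):
        for p_tgt, p in zip(tgt.parameters(), src.parameters()):
            p_tgt.data.mul_(1 - tau).add_(tau * p.data)
```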

SLIDE 56

Limitations

Only applicable to a continuous action space A
Cannot backprop through samples when calculating ∂Es[fQ∗(s,gπ∗(s;Φ);Θ)]/∂Φ

For discrete A, it’s more natural to use a DNN to model a stochastic policy: gπ(s) = P(a|s),∀a

SLIDE 57

Outline

1. Introduction
2. Value-based Deep RL: Deep Q-Network; Improvements
3. Policy-based Deep RL: Pathwise Derivative Methods; Policy Gradient/Optimization Methods; Variance Reduction and Actor-Critic

SLIDE 60

Episodic Policy Gradient

Policy gradient/optimization methods

For a stochastic policy: gπ(s) = P(a|s;Φ), ∀a (discrete or continuous)
Do not change the trajectory (but its probability) of an episode

Given an episode, let τ = {(s(t),a(t),R(t),s(t+1))}t be the sequence of state-action transitions

Action a(t) sampled from gπ(s(t))

Let R(τ) = ∑t γ^t R(t); our goal: argmaxΦ Eτ[R(τ);Φ] = argmaxΦ ∑τ P(τ;Φ)R(τ)

SLIDE 63

Policy Gradient

Let J(Φ) = ∑τ P(τ;Φ)R(τ), we have:

∇ΦJ(Φ) = ∇Φ ∑τ P(τ;Φ)R(τ)
= ∑τ ∇ΦP(τ;Φ) R(τ)
= ∑τ P(τ;Φ) (∇ΦP(τ;Φ) / P(τ;Φ)) R(τ)
= ∑τ P(τ;Φ) ∇Φ logP(τ;Φ) R(τ)
= ∑τ P(τ;Φ) ∇Φ log ∏t P(s(t+1)|s(t),a(t)) P(a(t)|s(t);Φ) R(τ)
= ∑τ P(τ;Φ) ∇Φ ∑t [logP(s(t+1)|s(t),a(t)) + logP(a(t)|s(t);Φ)] R(τ)
= ∑τ P(τ;Φ) ∑t ∇Φ logP(a(t)|s(t);Φ) R(τ)
= ∑τ P(τ;Φ) ∑t ∇Φ logP(a(t)|s(t);Φ) ∑t′ γ^t′ R(t′)(s(t′),a(t′),s(t′+1))
= ∑τ P(τ;Φ) ∑t ∇Φ logP(a(t)|s(t);Φ) ∑_{t′=t}^{H} γ^t′ R(t′)

Assumes that the environment is MDP-like

But no need for the exact model

SLIDE 65

REINFORCE Algorithm

∇ΦJ(Φ) = ∑τ P(τ;Φ) ∑t ∇Φ logP(a(t)|s(t);Φ) ∑_{t′=t}^{H} γ^t′ R(t′)
REINFORCE (MC estimate): initialize Φ arbitrarily, iterate until convergence:
1. Run episodes {τ(i)}i by sampling actions from g(·;Φ)
2. For each time step t in an episode, compute R(i,t) = ∑_{t′=t}^{H(i)} γ^t′ R(i,t′)
3. Update Φ using SGD: Φ ← Φ + η∇ΦĴ, where ∇ΦĴ(Φ) = ∑i,t ∇Φ logP(a(i,t)|s(i,t);Φ) R(i,t)

REINFORCE-style policy gradient: ∇ log prob. of actions × episodic rewards
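A PyTorch sketch of one REINFORCE update for a discrete action space; policy is assumed to output action logits, and the episode to be given as lists of states (tensors), actions (ints), and rewards (floats).

```python
import torch
from torch.distributions import Categorical

def reinforce_update(policy, optimizer, states, actions, rewards, gamma=0.99):
    # Step 2: R(i, t) = sum_{t' >= t} gamma^t' * R(i, t')  (discounted from the episode start, as above)
    returns, running = [], 0.0
    for t in reversed(range(len(rewards))):
        running = (gamma ** t) * rewards[t] + running
        returns.insert(0, running)
    returns = torch.tensor(returns)

    # Step 3: grad log-probability of the taken actions, weighted by the episodic returns
    log_probs = Categorical(logits=policy(torch.stack(states))).log_prob(torch.tensor(actions))
    loss = -(log_probs * returns).sum()      # minimizing the negation ascends J(Phi)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```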

SLIDE 66

Outline

1. Introduction
2. Value-based Deep RL: Deep Q-Network; Improvements
3. Policy-based Deep RL: Pathwise Derivative Methods; Policy Gradient/Optimization Methods; Variance Reduction and Actor-Critic

SLIDE 69

Variance

∇ΦJ(Φ) ∝ ∑i,t ∇Φ logP(a(i,t)|s(i,t);Φ) ∑_{t′=t}^{H(i)} γ^t′ R(i,t′)
∑_{t′=t}^{H(i)} γ^t′ R(i,t′) is an MC estimate of Qπ(s(i,t),a(i,t)) = E_{{s(t′),a(t′)}t′}[∑t′ γ^t′ R(t′) | s(0) = s(i,t), a(0) = a(i,t)] using samples rolled out from a single episode

TD vs. MC estimate:
TD: biased, but low variance
MC: unbiased, but high variance

How to lower the variance of the vanilla policy gradient algorithm?
To reduce the magnitude of ∑_{t′=t}^{H(i)} γ^t′ R(i,t′)
E.g., use a smaller γ or subtract a baseline from ∑_{t′=t}^{H(i)} γ^t′ R(i,t′)
To approximate ∑_{t′=t}^{H(i)} γ^t′ R(i,t′) by a DNN and take advantage of its generalizability
To collect more samples

SLIDE 71

Baseline I

∇ΦJ(Φ) ∝ ∑i,t ∇Φ logP(a(i,t)|s(i,t);Φ) (∑_{t′=t}^{H(i)} γ^t′ R(i,t′) − b)
b reduces variance without adding bias as long as it is independent of the actions:
∑τ P(τ;Φ) ∑t ∇Φ logP(a(t)|s(t);Φ) b = ∑τ P(τ;Φ) ∇Φ logP(τ;Φ) b = ∑τ P(τ;Φ) (∇ΦP(τ;Φ) / P(τ;Φ)) b = ∇Φ ∑τ P(τ;Φ) b = ∇Φ b = 0
Of what value?

SLIDE 73

Baseline II

∇ΦJ(Φ) ∝ ∑i,t ∇Φ logP(a(i,t)|s(i,t);Φ) (∑_{t′=t}^{H(i)} γ^t′ R(i,t′) − b)
The larger the b the better, but ∑_{t′=t}^{H(i)} γ^t′ R(i,t′) − b still needs to guide gπ to output good τ
∑_{t′=t}^{H(i)} γ^t′ R(i,t′) is an estimate of Qπ(s(i,t),a(i,t))
b is an estimate of Vπ(s(i,t)) = E_{{s(t′),a(t′)}t′}[∑t′ γ^t′ R(t′) | s(0) = s(i,t)] [3]
∑_{t′=t}^{H(i)} γ^t′ R(i,t′) − b estimates Qπ(s(i,t),a(i,t)) − Vπ(s(i,t)), the advantage of π at state s(i,t)
In REINFORCE: b = (1 / |{(j,t′′) : s(j,t′′) = s(i,t)}|) ∑_{(j,t′′): s(j,t′′)=s(i,t)} ∑_{t′=t′′}^{H(j)} γ^t′ R(j,t′)

SLIDE 75

Function Approximations

∇ΦJ(Φ) ∝ ∑i,t ∇Φ logP(a(i,t)|s(i,t);Φ) (∑_{t′=t}^{H(i)} γ^t′ R(i,t′) − b)
∑_{t′=t}^{H(i)} γ^t′ R(i,t′) estimates Qπ(s(i,t),a(i,t)) using rollouts from a single episode
Actor-critic: why not use a DNN fQπ(s,a;Θ) to approximate Qπ(s,a), ∀s,a?
Baseline b = ∑_{j: s(j,t)=s(i,t)} ∑_{t′=t}^{H(j)} γ^t′ R(j,t′) estimates Vπ(s(i,t))

Advantage actor-critic: approximates Vπ(s) with fVπ(s;Θ)

SLIDE 76

Advantage Actor-Critic (b = fVπ(s;Θ))

∇ΦJ(Φ) ∝ ∑i,t ∇Φ logP(a(i,t)|s(i,t);Φ) (∑_{t′=t}^{H(i)} γ^t′ R(i,t′) − b)
∑_{t′=t}^{H(i)} γ^t′ R(i,t′) ≈ Qπ(s(i,t),a(i,t)) can be approximated by R(i,t) + γ fVπ(s(i,t+1);Θ)
No need for fQπ
Bellman expectation equation for a stochastic π: Vπ(s) = ∑a π(a|s) ∑s′ P(s′|s;a)[R(s,a,s′) + γVπ(s′)], ∀s
Algorithm (TD): initialize Θ and Φ arbitrarily, iterate until convergence:
1. Take an action a from s using g(s;Φ)
2. Observe s′ and reward R, compute Q̂π ← R + γ fVπ(s′;Θ)
3. Update fVπ: Θ ← Θ − η∇Θ[Q̂π − fVπ(s;Θ)]^2
4. Update gπ: Φ ← Φ + λ∇Φ logP(a|s;Φ)(Q̂π − fVπ(s;Θ))
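A PyTorch sketch of one advantage actor-critic step for a single transition, following the four steps above; policy (actor, action logits) and value_net (critic, scalar fVπ) are assumed networks, each with its own optimizer.

```python
import torch
from torch.distributions import Categorical

def a2c_step(policy, value_net, policy_opt, value_opt,
             s, a, r, s_next, done, gamma=0.99):
    s, s_next = s.unsqueeze(0), s_next.unsqueeze(0)     # single transition -> batch of one

    # Step 2: TD target Q_hat = R + gamma * f_V(s'; Theta)
    with torch.no_grad():
        q_hat = r + gamma * (1.0 - done) * value_net(s_next).squeeze()

    # Step 3: critic update, minimize (Q_hat - f_V(s; Theta))^2
    v = value_net(s).squeeze()
    value_loss = (q_hat - v) ** 2
    value_opt.zero_grad(); value_loss.backward(); value_opt.step()

    # Step 4: actor update, ascend log P(a|s; Phi) * (Q_hat - f_V(s; Theta))
    advantage = (q_hat - v).detach()                    # treat the advantage as a constant
    log_prob = Categorical(logits=policy(s).squeeze(0)).log_prob(torch.tensor(a))
    policy_loss = -log_prob * advantage
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()
```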

SLIDE 78

Pitfall: Exploration

To learn fVπ based on value iteration, the agent has to explore enough
But gπ is optimized for exploitation only: Φ ← Φ + λ∇Φ logP(a|s;Φ)(Q̂π − fVπ(s;Θ))
Solution? Maximize the entropy of gπ(s;Φ) as well:
Φ ← Φ + λ∇Φ[logP(a|s;Φ)(Q̂π − fVπ(s;Θ)) + µH(a ∼ gπ(s;Φ))]

The larger µ, the more exploration
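Only the actor loss changes; a sketch of the entropy-regularized version of step 4 from the actor-critic update above, where mu plays the role of µ:

```python
import torch
from torch.distributions import Categorical

def actor_loss_with_entropy(policy, s, a, advantage, mu=0.01):
    """-[log P(a|s; Phi) * advantage + mu * H(a ~ g_pi(s; Phi))], to be minimized."""
    dist = Categorical(logits=policy(s).squeeze(0))          # s: a (1, ...) state batch of one
    return -(dist.log_prob(torch.tensor(a)) * advantage + mu * dist.entropy())
```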

SLIDE 80

Asynchronous Advantage Actor-Critic (A3C)

TD estimate reduces variance at the cost of bias/divergence
A3C: use asynchronous workers to stabilize fVπ training

An alternative to experience replay

SLIDE 81

A3C on Labyrinth

Task: to collect apples (+1 reward) and escape (+10 reward)
End-to-end learning from pixels to policy
State s(t) modeled by a recurrent neural network (LSTM)

To have long-term memory

SLIDE 84

Variance Reduction by Having More Samples

Update rule for gπ in A3C: Φ ← Φ + λ∇Φ logP(a(t)|s(t);Φ) Âπ(t),
where Âπ(t) = Q̂π(t) − fVπ(s(t);Θ) = (R(t) + γ fVπ(s(t+1);Θ)) − fVπ(s(t);Θ)
Bellman expectation equation holds for multiple time differences:
Qπ(s(t),a(t)) = E[R(t) + γVπ(s(t+1)) | s(t), a(t)]
= E[R(t) + γR(t+1) + γ^2 Vπ(s(t+2))]
= E[R(t) + γR(t+1) + γ^2 R(t+2) + γ^3 Vπ(s(t+3))]
= ··· = E[R(t) + γR(t+1) + γ^2 R(t+2) + γ^3 R(t+3) + ···]
A3C replaces Âπ(t) with a K-step lookahead:
ÂK(t) ← R(t) + γR(t+1) + ··· + γ^(K−1) R(t+K−1) + γ^K fVπ(s(t+K);Θ) − fVπ(s(t);Θ)

Update of Φ lags K time steps behind the current action

SLIDE 86

Generalized Advantage Estimation (GAE)

ÂK(t) = R(t) + γR(t+1) + ··· + γ^(K−1) R(t+K−1) + γ^K fVπ(s(t+K);Θ) − fVπ(s(t);Θ)
Define TD error at time t: δ(t) = R(t) + γ fVπ(s(t+1);Θ) − fVπ(s(t);Θ)
We have ÂK(t) = δ(t) + γδ(t+1) + ··· + γ^(K−1) δ(t+K−1)
GAE [8]: let Âπ(t) be the exponential moving average of Â1(t), Â2(t), ···:
Âπ(t) ← Â1(t) + λÂ2(t) + λ^2 Â3(t) + ···
= δ(t) + λ(δ(t) + γδ(t+1)) + λ^2 (δ(t) + γδ(t+1) + γ^2 δ(t+2)) + ···
= (1/(1−λ)) δ(t) + (λγ/(1−λ)) δ(t+1) + (λ^2 γ^2/(1−λ)) δ(t+2) + ···
∝ δ(t) + λγ δ(t+1) + (λγ)^2 δ(t+2) + ···
Biased, but with much lower variance
δ(t)'s can have lower magnitudes when fVπ is good enough
In TD: Âπ(t) ← δ(t) + λγ δ(t+1) + ··· + (λγ)^K δ(t+K)
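A plain-Python sketch of GAE over one rollout: compute the TD errors δ(t), then accumulate them backwards in time with weight λγ; values is assumed to hold fVπ(s(t);Θ) for t = 0..T, including the bootstrap value of the final state.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """A_hat(t) = delta(t) + (lam*gamma) delta(t+1) + (lam*gamma)^2 delta(t+2) + ..."""
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    advantages, acc = [0.0] * T, 0.0
    for t in reversed(range(T)):
        acc = deltas[t] + lam * gamma * acc     # exponentially weighted sum of future TD errors
        advantages[t] = acc
    return advantages
```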

SLIDE 90

Policy Optimization vs. Policy/Value Iteration

Policy optimization:

Optimize policy “directly”
More compatible with auxiliary objectives & rich NN architectures (e.g., RNN)
More likely to work across different tasks/settings

Policy/value-iteration-based methods (e.g., DQN, DDPG):

Optimize policy “indirectly” (via Q/V, exploiting Bellman equations)
More compatible with different exploration strategies
Sensitive to tasks/settings, but more sample-efficient when it works

SLIDE 91

Reference I

[1] Vivek F. Farias and Benjamin Van Roy. Tetris: A study of randomized constraint sampling. In Probabilistic and Randomized Methods for Design Under Uncertainty, pages 189–201. Springer, 2006.
[2] Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, et al. Noisy networks for exploration. arXiv preprint arXiv:1706.10295, 2017.
[3] Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.

SLIDE 92

Reference II

[4] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[5] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[6] Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015.

SLIDE 93

Reference III

[7] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
[8] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
[9] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI, pages 2094–2100, 2016.
[10] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
