
SLIDE 1

Deep Neural Networks and Deep Reinforcement Learning

Deep Learning, Goodfellow, Bengio, and Courville [ch. 6, 7, 8]; AIMA [sect. 21.1-21.3]; Sutton and Barto, Reinforcement Learning: An Introduction, 2nd edition [sect. 5.1-5.3, 6.1-6.3, 6.5]

SLIDE 2

Outline

♦ Neural Networks: intro
♦ Deep Neural Networks
♦ Deep Reinforcement Learning
♦ Deep Q Network
♦ Slides based on a course offered by Prof. Pascal Poupart at Univ. of Waterloo
SLIDE 3

Reinforcement Learning: key points

♦ MDPs and Value Iteration (planning):
  V_{k+1}(s) = max_a Σ_{s′} T(s, a, s′) (R(s, a, s′) + γ V_k(s′))
♦ TD Learning and Q-Learning (reinforcement learning):
  Q(s, a) ← (1 − α) Q(s, a) + α (R(s, a, s′) + γ max_{a′} Q(s′, a′))
♦ Key issue: the number of states and actions may be too large to maintain and update a Q-table
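
To make the tabular update concrete, here is a minimal Python sketch of the Q-learning rule above; the action set, learning rate, and discount factor are illustrative assumptions, not values from the slides.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.99   # learning rate and discount (assumed values)
ACTIONS = [0, 1]           # assumed finite action set
Q = defaultdict(float)     # Q-table: (state, action) -> value, default 0

def q_update(s, a, r, s_next):
    # Q(s,a) <- (1 - alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * target
```

The key issue on this slide is visible in the code: Q keeps one entry per (state, action) pair, which is intractable for the state spaces on the next slides.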

SLIDE 4

Reinforcement Learning: Example of large state spaces

♦ the game of Go: 3^361 possible board configurations (19×19 = 361 points, each empty, black, or white)

SLIDE 5

Reinforcement Learning: Example of large state spaces

♦ cart-pole control problem: the state (x, x′, θ, θ′) is continuous

SLIDE 6

Reinforcement Learning: Example of large state spaces

♦ Atari games: 210×160×3 observations (screen pixels across the 3 RGB channels)

SLIDE 7

Key Idea: function approximation

♦ Which functions are we interested in?
  Policy: π(s) → a
  Value function: V(s) ∈ ℜ
  Q-function: Q(s, a) ∈ ℜ

SLIDE 8

Q-Function approximation

♦ State is a vector of features: s = (x1, x2, ..., xn)^T
  CartPole: s = (x, x′, θ, θ′)^T
  Atari: the values of the pixels
♦ Linear approximation: Q(s, a) ≈ Σ_{i=1}^{n} w_{a,i} x_i
♦ Non-linear (neural network): Q(s, a) ≈ g(x; w)
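
As a concrete illustration, here is a small NumPy sketch of the linear form Q(s, a) ≈ Σ_i w_{a,i} x_i, with one weight vector per action; the dimensions and the feature values are made-up examples.

```python
import numpy as np

n_features, n_actions = 4, 2           # e.g. CartPole features (x, x', theta, theta')
W = np.zeros((n_actions, n_features))  # rows are the per-action weight vectors w_a

def q_linear(s, a):
    return W[a] @ s                    # dot product w_a^T s

s = np.array([0.1, -0.5, 0.02, 0.3])   # made-up CartPole-like state
print([q_linear(s, a) for a in range(n_actions)])
```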

SLIDE 9

Feed-Forward ANN

♦ Network of units (computational neurons)
♦ DAG connecting the units with weighted edges
♦ Each unit computes h(w^T x + b)
  w: weights, x: inputs to the node, b: bias
  h: activation function, usually non-linear

SLIDE 10

One hidden layer ANN

♦ hidden units: z_j = h1(w^(1)_j x + b^(1)_j)
♦ output units: y_k = h2(w^(2)_k z + b^(2)_k)
♦ overall: y_k = h2(Σ_j w^(2)_{kj} h1(Σ_i w^(1)_{ji} x_i + b^(1)_j) + b^(2)_k)

w: weights, x: inputs to the node, b: bias
h: activation function, usually non-linear
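
A minimal NumPy sketch of this one-hidden-layer forward pass may help; the layer sizes and the random weights are illustrative assumptions.

```python
import numpy as np

# Forward pass: z_j = h1(w^(1)_j x + b^(1)_j), y_k = h2(w^(2)_k z + b^(2)_k)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)  # layer 1: 4 inputs -> 3 hidden units
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)  # layer 2: 3 hidden -> 2 outputs

h1 = np.tanh             # hidden activation (non-linear)
h2 = lambda a: a         # identity output activation

def forward(x):
    z = h1(W1 @ x + b1)      # hidden units z_j
    return h2(W2 @ z + b2)   # output units y_k

print(forward(np.array([1.0, 0.5, -0.2, 0.0])))
```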

SLIDE 11

Activation function h

♦ threshold: h(x) = 1 if x ≥ 0, −1 otherwise
♦ sigmoid: h(x) = σ(x) = 1 / (1 + e^(−x))
♦ tanh: h(x) = tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
♦ gaussian: h(x) = e^(−(1/2)((x − µ)/σ)^2)
♦ identity: h(x) = x
♦ rectified linear (ReLU): h(x) = max{0, x}
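
For reference, the activation functions above translate directly into vectorized NumPy one-liners (a sketch; the µ and σ defaults are arbitrary choices):

```python
import numpy as np

threshold = lambda x: np.where(x >= 0, 1.0, -1.0)   # 1 if x >= 0, else -1
sigmoid   = lambda x: 1.0 / (1.0 + np.exp(-x))      # 1 / (1 + e^-x)
tanh      = np.tanh                                 # (e^x - e^-x) / (e^x + e^-x)
gaussian  = lambda x, mu=0.0, sig=1.0: np.exp(-0.5 * ((x - mu) / sig) ** 2)
identity  = lambda x: x
relu      = lambda x: np.maximum(0.0, x)            # max{0, x}
```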

SLIDE 12

Universal approximation property

Theorem (Hornik et al., 1989; Cybenko, 1989): a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function (sigmoid/tanh/gaussian) can approximate any function arbitrarily closely, provided that the network is given enough hidden units.

♦ "any function": any continuous function on a closed and bounded subset of ℜ^n (relationship with Borel measurability)

SLIDE 13

Minimize least squared error

♦ Key idea to optimize the weights: minimize the error of the network output (the loss)

  E(W) = Σ_n E_n(W) = Σ_n |f(x_n; W) − y_n|^2 / 2

♦ Non-convex optimization problem; can train using gradient descent
  Given a sample (x_n, y_n), update the weights as follows:
  w_ji ← w_ji − η ∂E_n/∂w_ji
  The backpropagation algorithm computes the gradient in an ANN
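
A minimal sketch of one stochastic-gradient step, using a single linear unit f(x; w) = w^T x as a stand-in for the full ANN (backpropagation extends this same gradient computation through the layers via the chain rule); η, the dimensions, and the toy sample are assumptions.

```python
import numpy as np

eta = 0.01        # learning rate (assumed)
w = np.zeros(3)   # weights of a single linear unit f(x; w) = w^T x

def sgd_step(x, y):
    """One update w <- w - eta * dE_n/dw for E_n = |f(x; w) - y|^2 / 2."""
    global w
    err = w @ x - y         # f(x; w) - y
    w -= eta * err * x      # dE_n/dw = (f(x; w) - y) x

sgd_step(np.array([1.0, 2.0, 3.0]), 1.0)  # toy sample (x_n, y_n)
```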

SLIDE 14

Deep Neural Networks

♦ Deep NN: an ANN with many hidden layers
♦ Benefit: high expressivity (i.e., compact representation)
♦ Issues: can we train deep NNs in the same way? can we avoid overfitting?

SLIDE 15

Example: Image Classification

♦ ImageNet Large Scale Visual Recognition Challenge

SLIDE 16

Vanishing Gradient

♦ Deep neural networks that use "squashing" activation functions (e.g., sigmoid, tanh) suffer from vanishing gradients

SLIDE 17

Sigmoid and Hyperbolic functions

♦ The derivatives of the sigmoid and tanh are always less than one!
♦ When backpropagating gradients, we multiply several numbers that are less than one

SLIDE 18

Example: vanishing gradient

♦ y = t(w3 t(w2 t(w1 x))), where t(·) is the tanh function
♦ common weight initialization in (−1, 1)
♦ the tanh function and its derivative are less than 1
♦ vanishing gradient (with a1 = w1 x, a2 = w2 t(a1), a3 = w3 t(a2)):

  ∂y/∂w3 = t′(a3) t(a2)
  ∂y/∂w2 = t′(a3) w3 t′(a2) t(a1) ≤ ∂y/∂w3
  ∂y/∂w1 = t′(a3) w3 t′(a2) w2 t′(a1) x ≤ ∂y/∂w2
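
The shrinking factors can be checked numerically; this sketch evaluates the three derivatives above for weights drawn in (−1, 1) (the input value is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)
w1, w2, w3 = rng.uniform(-1, 1, size=3)   # common initialization in (-1, 1)
x = 0.5                                   # arbitrary input

a1 = w1 * x                               # pre-activations of each layer
a2 = w2 * np.tanh(a1)
a3 = w3 * np.tanh(a2)
dt = lambda a: 1.0 - np.tanh(a) ** 2      # derivative of tanh

g3 = dt(a3) * np.tanh(a2)                        # dy/dw3
g2 = dt(a3) * w3 * dt(a2) * np.tanh(a1)          # dy/dw2
g1 = dt(a3) * w3 * dt(a2) * w2 * dt(a1) * x      # dy/dw1
print(abs(g3), abs(g2), abs(g1))  # magnitudes typically shrink toward the input
```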

SLIDE 19

Mitigations for vanishing gradient

♦ typical solutions to mitigate vanishing gradients:
  Pre-training
  Rectified Linear Units (ReLU)
  Batch normalization
  Skip connections

SLIDE 20

Rectified Linear Units (ReLU)

♦ Rectified linear: h(x) = max(0, x)
  Gradient is 0 or 1
  Piecewise linear
  Sparse computation
♦ Soft version (softplus): h(x) = log(1 + e^x)
♦ Softplus does not mitigate vanishing gradients

SLIDE 21

Deep Reinforcement Learning: key points

♦ For many real-world domains we cannot explicitly represent the key functions for RL (π(s), V(s), Q(s, a))
♦ We can try to approximate them:
  Linear approximation
  Neural network approximation → Deep RL
♦ Deep Q Network approximates Q(s, a) with a DNN

SLIDE 22

Gradient Q-Learning

♦ approximate Q(s, a) with a parametrized function Q_w(s, a)
♦ minimize the squared error between estimate and target
  Estimate: Q_w(s, a)
  Target: r(s, a, s′) + γ max_{a′} Q_w(s′, a′)
♦ squared error: Err(w) = (Q_w(s, a) − r(s, a, s′) − γ max_{a′} Q_w(s′, a′))^2
♦ gradient: ∂Err(w)/∂w = 2 (Q_w(s, a) − r(s, a, s′) − γ max_{a′} Q_w(s′, a′)) ∂Q_w(s, a)/∂w
  (the scalar 2 is a constant factor and not important for the update)

SLIDE 23

Gradient Q-Learning Algorithm

Algorithm 1 Gradient Q-Learning
 1: Initialize weights w randomly in [−1, 1]
 2: Initialize s {observe current state}
 3: loop
 4:   Select and execute action a
 5:   Observe new state s′ and receive immediate reward r
 6:   ∂Err(w)/∂w = (Q_w(s, a) − r − γ max_{a′} Q_w(s′, a′)) ∂Q_w(s, a)/∂w
 7:   Update weights: w ← w − α ∂Err(w)/∂w
 8:   Update state: s ← s′
 9: end loop
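
A Python sketch of Algorithm 1, using a linear Q_w(s, a) = w_a^T s so that ∂Q_w(s, a)/∂w is just the feature vector; the Gym-style env interface, the ε-greedy action selection, the terminal-state handling, and all constants are assumptions beyond the slide.

```python
import numpy as np

GAMMA, ALPHA, EPS = 0.99, 0.01, 0.1   # discount, learning rate, exploration (assumed)

def gradient_q_learning(env, n_features, n_actions, steps=10_000):
    w = np.random.uniform(-1, 1, (n_actions, n_features))   # step 1
    s, _ = env.reset()                                      # step 2
    for _ in range(steps):                                  # step 3
        # Step 4: the slide just says "select and execute action a";
        # epsilon-greedy is one common choice (an assumption).
        if np.random.rand() < EPS:
            a = np.random.randint(n_actions)
        else:
            a = int(np.argmax(w @ s))
        s_next, r, done, trunc, _ = env.step(a)             # step 5
        target = r if done else r + GAMMA * np.max(w @ s_next)
        td_err = w[a] @ s - target                          # step 6: Q_w(s,a) - target
        w[a] -= ALPHA * td_err * s                          # step 7: dQ_w/dw_a = s
        s = s_next if not (done or trunc) else env.reset()[0]  # step 8
    return w
```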

SLIDE 24

Convergence of tabular Q-Learning

♦ Q(s, a) ← Q(s, a) + α (r(s, a, s′) + γ max_{a′} Q(s′, a′) − Q(s, a))
♦ Tabular Q-learning converges to the optimal policy:
  if you explore enough
  if you make the learning rate small enough ... but do not decrease it too quickly
  e.g., α = 1/n(s, a), where n(s, a) is the number of visits to (s, a)

SLIDE 25

Convergence of linear gradient Q-Learning

♦ linear approximation of Q(s, a): Q(s, a) ≈ Σ_i w_{a,i} x_i = w^T x
♦ learning rate α_t = 1/t
♦ gradient Q-learning update:
  w ← w − α_t (Q_w(s, a) − r − γ max_{a′} Q_w(s′, a′)) ∂Q_w(s, a)/∂w
♦ under these conditions, linear gradient Q-learning converges

SLIDE 26

Non-Convergence of Non-linear gradient Q-Learning

♦ Non-linear approximation of Q(s, a): Q(s, a) ≈ g(x; w)
♦ Even if α_t = 1/t, gradient Q-learning may not converge
♦ Issue: we update the weights to reduce the error for a specific experience (i.e., a specific (s, a)), but by changing the weights we may change Q(s, a) potentially everywhere
  this is also true for linear approximation, but in that case convergence can still be guaranteed

SLIDE 27

Mitigating divergence

♦ Two main approaches to mitigate divergence:
  1. experience replay
  2. use two different networks:
     Q-network
     Target network

SLIDE 28

Experience replay

♦ Store previous experiences (i.e., (s, a, s′, r)) and reuse them at each step:
  Store each (s, a, s′, r) in a dedicated memory buffer
  At each step, sample a mini-batch from this buffer and use it to update the weights (see the sketch below)
♦ Benefits:
  1. reduces the correlation between successive samples (increases stability)
  2. reduces the number of interactions with the environment (increases data efficiency)
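
A minimal replay buffer matching this description; the capacity and batch size are assumed defaults.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, s', r) tuples and samples uniform mini-batches."""
    def __init__(self, capacity=100_000):   # capacity is an assumption
        self.buf = deque(maxlen=capacity)   # oldest experiences evicted first

    def add(self, s, a, s_next, r):
        self.buf.append((s, a, s_next, r))

    def sample(self, batch_size=32):
        return random.sample(self.buf, min(batch_size, len(self.buf)))
```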

SLIDE 29

Target Network

♦ Maintain a separate target network and update it only periodically (not with every experience):
  Q-network: Q_w(s, a)
  Target network: Q_w̄(s, a)
♦ for every (s, a, s′, r) in the mini-batch, update the Q-network (see the sketch below):
  w ← w − α_t (Q_w(s, a) − r − γ max_{a′} Q_w̄(s′, a′)) ∂Q_w(s, a)/∂w
♦ every c steps, update the target network: w̄ ← w
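
The bookkeeping is simple: w̄ is a periodic copy of w. A sketch, with illustrative array shapes:

```python
import numpy as np

w = np.random.uniform(-1, 1, (2, 4))   # Q-network weights
w_bar = w.copy()                       # target-network weights w̄

def maybe_sync(step, c=1000):
    """Copy w into w̄ every c steps; between copies the target stays fixed."""
    global w_bar
    if step % c == 0:
        w_bar = w.copy()               # w̄ <- w
```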

SLIDE 30

Deep Q Network

♦ Human-level control through deep reinforcement learning (V. Mnih et al., Nature 2015)
♦ Gradient Q-learning with:
  a deep neural network to approximate Q(s, a)
  experience replay and a target network
♦ above human-level performance on many Atari video games

SLIDE 31

Deep Q Network sketch of algorithm

Algorithm 2 DQN
 1: Initialize weights w and w̄ randomly in [−1, 1]
 2: Initialize s {observe current state}
 3: loop
 4:   Select and execute action a
 5:   Observe new state s′ and receive immediate reward r
 6:   Add (s, a, s′, r) to the experience buffer
 7:   Sample a mini-batch MB of experiences from the buffer
 8:   for (ŝ, â, ŝ′, r̂) ∈ MB do
 9:     ∂Err(w)/∂w = (Q_w(ŝ, â) − r̂ − γ max_{â′} Q_w̄(ŝ′, â′)) ∂Q_w(ŝ, â)/∂w
10:     Update weights: w ← w − α ∂Err(w)/∂w
11:   end for
12:   Update state: s ← s′
13:   Every c steps, update the target network: w̄ ← w
14: end loop
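
Putting the pieces together, a Python sketch of Algorithm 2 with a linear Q_w for brevity (the Nature DQN uses a deep convolutional network instead); the Gym-style env, the ε-greedy policy, the done flag in the buffer, and all constants are assumptions beyond the slide.

```python
import random
import numpy as np
from collections import deque

GAMMA, ALPHA, EPS, C, BATCH = 0.99, 0.01, 0.1, 1000, 32   # assumed constants

def dqn(env, n_features, n_actions, steps=50_000):
    w = np.random.uniform(-1, 1, (n_actions, n_features))  # Q-network (step 1)
    w_bar = w.copy()                                       # target network w̄
    buffer = deque(maxlen=100_000)                         # experience buffer
    s, _ = env.reset()                                     # step 2
    for t in range(steps):
        a = (np.random.randint(n_actions) if np.random.rand() < EPS
             else int(np.argmax(w @ s)))                   # step 4 (eps-greedy)
        s_next, r, done, trunc, _ = env.step(a)            # step 5
        buffer.append((s, a, s_next, r, done))             # step 6 (+ done flag)
        batch = random.sample(buffer, min(BATCH, len(buffer)))  # step 7
        for s_b, a_b, sn_b, r_b, d_b in batch:             # steps 8-11
            target = r_b if d_b else r_b + GAMMA * np.max(w_bar @ sn_b)  # uses w̄
            w[a_b] -= ALPHA * (w[a_b] @ s_b - target) * s_b
        s = s_next if not (done or trunc) else env.reset()[0]  # step 12
        if t % C == 0:
            w_bar = w.copy()                               # step 13: w̄ <- w
    return w
```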

SLIDE 32

Deep RL: current trends

♦ Formal verification of DRL models
  ensure the learned model respects safety properties
♦ Transfer of Learning / Curriculum learning
  ToL: train the model in one environment and deploy it in another
  Curriculum learning: learn a difficult task by training on a series of simpler tasks
♦ DRL for robotics
  adaptation to the environment is critical, but interacting with the environment is difficult, expensive, and potentially dangerous
♦ Multi-Agent DRL (MADRL)
  a set of agents/robots that learn at the same time in the same environment

SLIDE 33

Verify DRL models

Provably guarantee safety properties (not just empirical evaluation)
Tools for DNNs: Neurify (Wang et al., NeurIPS 2018)
  ensure the learned model respects safety properties
Formal verification for DRL is very challenging:
  need to compare several outputs
  need to work on large models

SLIDE 34

DRL for robotics

Acting in the real environment is difficult/dangerous:
  train in a synthetic environment
  reduce the number of iterations needed for training (i.e., learn faster)
  use discrete representations and highly optimized approaches
  combine DRL and evolutionary approaches

SLIDE 35

Summary

♦ Q(s, a) cannot be explicitly represented for most real-world problems, but we can approximate it
♦ DNNs can approximate a wide set of functions arbitrarily well
  use gradient descent and backpropagation to learn the model
♦ DRL uses a DNN to represent Q(s, a)
♦ Deep Q-Network is a powerful approach to perform DRL in discrete action spaces
♦ DRL is a popular, vibrant research area in modern AI