Deep Neural Networks and Deep Reinforcement Learning
Readings: Goodfellow, Bengio, and Courville, Deep Learning [chapters 6, 7, 8]; AIMA [sections 21.1-21.3]; Sutton and Barto, Reinforcement Learning: An Introduction
Outline
♦ Neural Networks: intro
♦ Deep Neural Networks
♦ Deep Reinforcement Learning
♦ Deep Q Network
♦ Slides based on the course offered by Prof. Pascal Poupart at the Univ. of Waterloo
Reinforcement Learning: key points
♦ MDPs and Value iteration (planning)
Vk+1(s) = maxa Σs′ T(s, a, s′)(R(s, a, s′) + γVk(s′))
♦ TD Learning and Q-Learning (reinforcement learning)
Q(s, a) ← (1 − α)Q(s, a) + α(R(s, a, s′) + γ maxa′ Q(s′, a′))
♦ Key issue: the numbers of states and actions may be too large to maintain and update a Q-table
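To make the tabular update concrete, here is a minimal Python sketch (not from the slides); the Q-table is a dictionary and the hyperparameter values are illustrative:

from collections import defaultdict

Q = defaultdict(float)      # Q-table: maps (state, action) pairs to values
alpha, gamma = 0.1, 0.99    # learning rate and discount factor (illustrative)

def q_update(s, a, r, s_next, actions):
    # One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target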
Reinforcement Learning: Example of large state spaces
♦ the game of Go: 3^361 board configurations (19×19 intersections, each one empty, black, or white)
Reinforcement Learning: Example of large state spaces
♦ Cart pole control problem: state (x, x′, θ, θ′) is continuous (cart position and velocity, pole angle and angular velocity)
Reinforcement Learning: Example of large state spaces
♦ Atari games: frames of 210×160 pixels with 3 RGB channels
Key Idea: function approximation
♦ Which functions are we interested in?
Policy π(s) → a
Value function V (s) ∈ ℜ
Q-function Q(s, a) ∈ ℜ
Q-Function approximation
♦ State is a set of features: s = (x1, x2, · · · , xn)ᵀ
CartPole: s = (x, x′, θ, θ′)ᵀ
Atari: values of the pixels
♦ Linear approximation: Q(s, a) ≈ Σi=1..n wai xi
♦ Non-linear (neural network): Q(s, a) ≈ g(x; w)
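As a concrete illustration of the linear case, a small Python sketch (names and sizes are illustrative, not from the slides): one weight vector per action, so Q(s, a) is a dot product between weights and state features.

import numpy as np

# Linear Q-function: one weight vector per action (sizes are illustrative).
n_features, n_actions = 4, 2            # e.g., CartPole: s = (x, x′, θ, θ′)
W = np.zeros((n_actions, n_features))   # W[a, i] = w_ai

def q_linear(s, a):
    return W[a] @ s                     # Q(s, a) ≈ Σi w_ai · x_i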
Feed-Forward ANN
♦ Network of units (computational neurons)
♦ DAG connecting functions with weighted edges
♦ Each unit computes h(wᵀx + b)
w: weights, x: inputs to the unit, b: bias
h: activation function, usually non-linear
One hidden layer ANN
♦ hidden units: zj = h1(w(1)j · x + b(1)j)
♦ output units: yk = h2(w(2)k · z + b(2)k)
♦ overall: yk = h2(Σj w(2)kj h1(Σi w(1)ji xi + b(1)j) + b(2)k)
w: weights, x: inputs to the node, b: bias, h: activation function, usually non-linear
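A minimal NumPy sketch of this forward pass, assuming h1 = tanh and h2 = identity (the layer sizes and the example input are arbitrary choices):

import numpy as np

# One-hidden-layer forward pass: z = h1(W1·x + b1), y = h2(W2·z + b2).
rng = np.random.default_rng(0)
W1, b1 = rng.uniform(-1, 1, (3, 2)), np.zeros(3)  # 2 inputs -> 3 hidden units
W2, b2 = rng.uniform(-1, 1, (1, 3)), np.zeros(1)  # 3 hidden units -> 1 output

def forward(x):
    z = np.tanh(W1 @ x + b1)   # hidden units, h1 = tanh
    return W2 @ z + b2         # output units, h2 = identity

print(forward(np.array([0.5, -1.0])))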
Activation function h
♦ threshold: h(x) = 1 if x ≥ 0, −1 otherwise
♦ sigmoid: h(x) = σ(x) = 1/(1 + e−x)
♦ tanh: h(x) = tanh(x) = (ex − e−x)/(ex + e−x)
♦ gaussian: h(x) = e^(−(1/2)((x−µ)/σ)²)
♦ identity: h(x) = x
♦ rectified linear (ReLU): h(x) = max{0, x}
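These transcribe directly into NumPy; a small sketch for reference:

import numpy as np

# The activation functions listed above, as vectorized NumPy functions.
def threshold(x): return np.where(x >= 0, 1.0, -1.0)
def sigmoid(x):   return 1.0 / (1.0 + np.exp(-x))
def gaussian(x, mu=0.0, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)
def identity(x):  return x
def relu(x):      return np.maximum(0.0, x)
# tanh is available directly as np.tanh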
Universal approximation property
Theorem (Hornik et al., 1989; Cybenko, 1989): a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function (sigmoid/tanh/gaussian) can approximate any function arbitrarily closely, provided that the network is given enough hidden units.
♦ any: continuous function on a closed and bounded subset of ℜn (relationship with Borel measurability)
Minimize least squared error
♦ Key idea to optimize the weights: minimize the error (loss) with respect to the output
E(W) = Σn En(W) = Σn (1/2)|f(xn; W) − yn|²
♦ Non-convex optimization problem: can train by gradient descent
Given sample (xn, yn), update the weights as follows: wji ← wji − η ∂En/∂wji
Backpropagation algorithm to compute the gradient in an ANN
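A minimal sketch of one such gradient step for a one-hidden-layer network, assuming h1 = tanh, h2 = identity, and the (1/2)|f − y|² loss above; a real implementation would use an autodiff library, but backpropagation by hand fits in a few lines here:

import numpy as np

# One SGD step on E_n = 0.5*|f(x_n; W) − y_n|^2 for a one-hidden-layer net.
def sgd_step(W1, b1, W2, b2, x, y, eta=0.01):
    z = np.tanh(W1 @ x + b1)                  # forward: hidden activations
    f = W2 @ z + b2                           # forward: network output
    delta2 = f - y                            # dE/df at the output layer
    delta1 = (W2.T @ delta2) * (1 - z ** 2)   # backprop through tanh (tanh' = 1 − tanh²)
    W2 -= eta * np.outer(delta2, z); b2 -= eta * delta2
    W1 -= eta * np.outer(delta1, x); b1 -= eta * delta1
    return W1, b1, W2, b2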
Deep Neural Networks
♦ Deep NN: ANN with many hidden layers
♦ Benefit: high expressivity (i.e., compact representation)
♦ Issues: can we train deep NNs in the same way? can we avoid overfitting?
Example: Image Classification
♦ ImageNet Large Scale Visual Recognition Challenge
Vanishing Gradient
♦ Deep neural networks that use "squashing" activation functions (e.g., sigmoid, tanh) suffer from vanishing gradients
Sigmoid and Hyperbolic functions
♦ The derivatives of the sigmoid and tanh are never greater than one!
♦ when backpropagating gradients we multiply several numbers that are less than one, so the gradient shrinks as it moves toward the early layers
Example: vanishing gradient
♦ y = t(w3t(w2t(w1x))), where t(·) is the tanh function ♦ common weight initialization in (-1,1) ♦ tanh function and its derivative are less than 1 ♦ vanishing gradient
∂y/∂w3 = t′(a3) t(a2)
∂y/∂w2 = t′(a3) w3 t′(a2) t(a1) ≤ ∂y/∂w3
∂y/∂w1 = t′(a3) w3 t′(a2) w2 t′(a1) x ≤ ∂y/∂w2
where a1 = w1x, a2 = w2 t(a1), a3 = w3 t(a2)
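A quick numeric check of this chain (the input and weight values below are arbitrary examples within (−1, 1)):

import numpy as np

# y = t(w3·t(w2·t(w1·x))) with t = tanh; gradients shrink toward early layers.
x, w1, w2, w3 = 1.0, 0.5, 0.5, 0.5
a1 = w1 * x
a2 = w2 * np.tanh(a1)
a3 = w3 * np.tanh(a2)
dt = lambda a: 1 - np.tanh(a) ** 2   # tanh'(a)

dy_dw3 = dt(a3) * np.tanh(a2)
dy_dw2 = dt(a3) * w3 * dt(a2) * np.tanh(a1)
dy_dw1 = dt(a3) * w3 * dt(a2) * w2 * dt(a1) * x
print(dy_dw3, dy_dw2, dy_dw1)        # each successive gradient is smaller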
Mitigations for vanishing gradient
♦ typical solutions to mitigate vanishing gradients:
Pre-training
Rectified linear units
Batch normalization
Skip connections
Rectified Linear Units (ReLU)
♦ Rectified linear: h(x) = max(0, x)
Gradient is 0 or 1
Piecewise linear
Sparse computation
♦ Smooth variant (softplus): h(x) = log(1 + ex)
♦ Softplus does not mitigate vanishing gradients (its derivative is the sigmoid, which is less than one)
Deep Reinforcement Learning: key points
♦ For many real-world domains we cannot explicitly represent the key functions for RL (π(s), V (s), Q(s, a))
♦ We can try to approximate them:
Linear approximation
Neural network approximation
Deep RL
♦ Deep Q Network approximates Q(s, a) with a DNN
Gradient Q-Learning
♦ approximate Q(s, a) with a parametrized function Qw(s, a)
♦ Minimize the squared error between estimate and target
Estimate: Qw(s, a)
Target: r(s, a, s′) + γ maxa′ Qw(s′, a′)
♦ squared error: Err(w) = (Qw(s, a) − r(s, a, s′) − γ maxa′ Qw(s′, a′))²
♦ gradient: ∂Err(w)/∂w = 2(Qw(s, a) − r(s, a, s′) − γ maxa′ Qw(s′, a′)) ∂Qw(s, a)/∂w
(the scalar 2 is a constant factor and not important for the update)
Gradient Q-Learning Algorithm
Algorithm 1 Gradient Q-Learning
1: Initialize weights w randomly in [−1, 1]
2: Initialize s {observe current state}
3: loop
4:   Select and execute action a
5:   Observe new state s′ and receive immediate reward r
6:   ∂Err(w)/∂w = (Qw(s, a) − r − γ maxa′ Qw(s′, a′)) ∂Qw(s, a)/∂w
7:   update weights w ← w − α ∂Err(w)/∂w
8:   update state s ← s′
9: end loop
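A Python sketch of this loop, using the linear Qw(s, a) = wa · s from earlier so that ∂Qw/∂w is just the feature vector; `env` is a hypothetical object with a gym-style reset/step interface, the ε-greedy rule is one possible way to implement line 4, and all hyperparameters are illustrative:

import numpy as np

# Gradient Q-learning (Algorithm 1) with a linear Q-function.
def gradient_q_learning(env, n_features, n_actions,
                        alpha=0.01, gamma=0.99, eps=0.1, n_steps=10000):
    rng = np.random.default_rng(0)
    W = rng.uniform(-1, 1, (n_actions, n_features))  # 1: init weights in [−1, 1]
    s = env.reset()                                  # 2: observe current state
    for _ in range(n_steps):                         # 3: loop
        # 4: select and execute action a (ε-greedy here)
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(W @ s))
        s_next, r, done = env.step(a)                # 5: observe s′ and reward r
        target = r if done else r + gamma * np.max(W @ s_next)
        td_err = (W[a] @ s) - target                 # 6: Qw(s,a) − r − γ maxa′ Qw(s′,a′)
        W[a] -= alpha * td_err * s                   # 7: w ← w − α ∂Err/∂w
        s = env.reset() if done else s_next          # 8: update state
    return W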
Convergence of tabular Q-Learning
♦ Q(s, a) ← Q(s, a) + α(r(s, a, s′) + γ maxa′ Q(s′, a′) − Q(s, a))
♦ Tabular Q-Learning converges to the optimal policy
if you explore enough
if you make the learning rate small enough ... but do not decrease it too quickly
e.g., α = 1/n(s, a), where n(s, a) is the number of visits to (s, a)
Convergence of linear gradient Q-Learning
♦ linear approximation of Q(s, a): Q(s, a) ≈ Σi wai xi = wᵀx
♦ learning rate αt = 1/t
♦ gradient Q-learning update:
w ← w − αt (Qw(s, a) − r − γ maxa′ Qw(s′, a′)) ∂Qw(s, a)/∂w
♦ under these conditions convergence can still be guaranteed (see next slide)
Non-Convergence of Non-linear gradient Q-Learning
♦ Non-linear approximation of Q(s, a): Q(s, a) ≈ g(x; w)
♦ Even if αt = 1/t, gradient Q-Learning may not converge
♦ Issue: we update the weights to reduce the error for a specific experience (i.e., a specific (s, a)), but by changing the weights we may end up changing Q(s, a) potentially everywhere
this is also true for linear approximation, but in that case convergence can still be guaranteed
Mitigating divergence
♦ Two main approaches to mitigate divergence:
1 experience replay
2 use two different networks:
Q-network
Target network
Experience replay
♦ Store previous experiences (i.e., (s, a, s′, r)) and reuse them at each step
Store each (s, a, s′, r) in a dedicated memory buffer
At each step, sample a mini-batch from this buffer and use it to update the weights (a minimal buffer sketch follows this list)
♦ Benefits
1 reduces the correlation between successive samples (increases stability)
2 reduces the number of interactions with the environment (increases data efficiency)
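A minimal replay-buffer sketch in Python (the capacity and batch size are illustrative choices):

import random
from collections import deque

# Fixed-capacity buffer of (s, a, s′, r) tuples; sampling uniformly at random
# breaks the correlation between successive experiences.
class ReplayBuffer:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences evicted first

    def add(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))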
Target Network
♦ Maintain a separate target network and update it only periodically (not with every experience)
Q-network: Qw(s, a)
Target network: Qw̄(s, a)
♦ for every (s, a, s′, r) in the mini-batch, update the Q-network:
w ← w − αt (Qw(s, a) − r − γ maxa′ Qw̄(s′, a′)) ∂Qw(s, a)/∂w
♦ periodically update the target network: w̄ ← w
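The bookkeeping is just a periodic copy of the weights; a sketch, assuming the weights live in a dict of arrays (an illustrative structure):

import copy
import numpy as np

# Q-network weights and a frozen copy used as the target network.
w = {"W1": np.zeros((16, 4)), "W2": np.zeros((2, 16))}  # illustrative shapes
w_bar = copy.deepcopy(w)                                # initialize w̄ ← w

def sync_target(w, w_bar):
    # every c steps: refresh the target network, w̄ ← w
    for name in w:
        w_bar[name] = w[name].copy()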
Deep Q Network
♦ Human-level control through deep reinforcement learning (V. Mnih et al., Nature 2015)
♦ Gradient Q-Learning with:
deep neural networks to approximate Q(s, a)
experience replay and a target network
♦ above human-level performance in many Atari video games
Deep Q Network sketch of algorithm
Algorithm 2 DQN
1: Initialize weights w and w̄ randomly in [−1, 1]
2: Initialize s {observe current state}
3: loop
4:   Select and execute action a
5:   Observe new state s′ and receive immediate reward r
6:   Add (s, a, s′, r) to the experience buffer
7:   Sample a mini-batch MB of experiences from the buffer
8:   for (ŝ, â, ŝ′, r̂) ∈ MB do
9:     ∂Err(w)/∂w = (Qw(ŝ, â) − r̂ − γ maxâ′ Qw̄(ŝ′, â′)) ∂Qw(ŝ, â)/∂w
10:    update weights w ← w − α ∂Err(w)/∂w
11:   end for
12:   update state s ← s′
13:   every c steps, update target: w̄ ← w
14: end loop
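Putting the pieces together, here is a sketch of the inner mini-batch update (lines 8-11), again with the linear Qw stand-in from earlier; in a real DQN, Qw is a deep network and ∂Qw/∂w comes from backpropagation. W is the Q-network, W_bar the target network, and `batch` a list of (s, a, s′, r) tuples as sampled from the replay buffer above:

import numpy as np

# One DQN update over a sampled mini-batch with a linear Q-function.
def dqn_minibatch_update(W, W_bar, batch, alpha=0.001, gamma=0.99):
    for s, a, s_next, r in batch:
        target = r + gamma * np.max(W_bar @ s_next)  # line 9: target uses w̄, not w
        td_err = (W[a] @ s) - target
        W[a] -= alpha * td_err * s                   # line 10: gradient step on w
    return W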
Deep RL: current trends
♦ Formal verification of DRL models
ensure the learned model respects safety properties
♦ Transfer of Learning / Curriculum learning
ToL: train the model in one environment and deploy it in another
Curriculum learning: learn a difficult task by training on a series of simpler tasks
♦ DRL for robotics
adaptation to the environment is critical, but interacting with the environment is difficult, expensive, and potentially dangerous
♦ Multi-Agent DRL (MADRL)
a set of agents/robots that learn at the same time in the same environment
Verify DRL models
♦ Provably guarantee safety properties (not just empirical evaluation)
♦ Tools exist for DNNs, e.g., Neurify (Wang et al., NeurIPS 2018): ensure the learned model respects safety properties
♦ Formal verification for DRL is very challenging:
need to compare several outputs
need to work on large models
DRL for robotics
♦ Acting in the real environment is difficult/dangerous:
train in a synthetic environment
reduce the number of iterations needed for training (i.e., learn faster)
discrete representations, highly optimized approaches
combine DRL and evolutionary approaches