Deep Reinforcement Learning
[Human-Level Control through Deep Reinforcement Learning, Nature 2015]
CS 486/686, University of Waterloo
Lecture 20: July 10, 2017
Outline
- Value Function Approximation
– Linear approximation
– Neural network approximation
- Deep Q-network
Quick recap
- Markov Decision Processes: value iteration
- Reinforcement Learning: Q-Learning
- Complexity depends on the number of states and actions
Large State Spaces
- Computer Go: on the order of $10^{170}$ states
- Inverted pendulum:
– 4-dimensional continuous state space
- Atari: 210x160x3 dimensions (pixel values)
Functions to be Approximated
- Policy: $\pi(s) \rightarrow a$
- Q-function: $Q(s,a)$
- Value function: $V(s)$
Q-function Approximation
- Let $s = (x_1, x_2, \ldots, x_n)^T$ be a feature vector describing the state
- Linear: $Q_w(s,a) = \sum_i w_{ai} \, x_i$
- Non-linear (e.g., neural network): $Q_w(s,a) = g(x; w)$
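A minimal sketch of these two choices in NumPy; the feature size, action count, and the two-layer architecture are illustrative assumptions, not values from the lecture:

```python
import numpy as np

n_features, n_actions = 4, 2

# Linear: Q(s,a) = sum_i w[a,i] * x[i], one weight row per action
W = np.random.uniform(-1, 1, (n_actions, n_features))

def q_linear(x):
    return W @ x  # vector of Q-values, one entry per action

# Non-linear: a tiny two-layer neural network g(x; w)
# (hidden size 16 is an arbitrary illustrative choice)
W1 = np.random.uniform(-1, 1, (16, n_features))
W2 = np.random.uniform(-1, 1, (n_actions, 16))

def q_network(x):
    h = np.tanh(W1 @ x)  # hidden layer
    return W2 @ h        # Q-value per action
```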
Gradient Q-learning
- Minimize the squared error between the Q-value estimate and the target
– Q-value estimate: $Q_w(s,a)$
– Target: $r + \gamma \max_{a'} Q_w(s',a')$
- Squared error: $Err(w) = \frac{1}{2}\left[Q_w(s,a) - r - \gamma \max_{a'} Q_w(s',a')\right]^2$
- Gradient: $\frac{\partial Err}{\partial w} = \left[Q_w(s,a) - r - \gamma \max_{a'} Q_w(s',a')\right] \frac{\partial Q_w(s,a)}{\partial w}$, where the target is treated as fixed when taking the derivative
Gradient Q-learning
Initialize weights $w$ at random in $[-1, 1]$
Observe current state $s$
Loop
  Select action $a$ and execute it
  Receive immediate reward $r$
  Observe new state $s'$
  Gradient: $\frac{\partial Err}{\partial w} = \left[Q_w(s,a) - r - \gamma \max_{a'} Q_w(s',a')\right] \frac{\partial Q_w(s,a)}{\partial w}$
  Update weights: $w \leftarrow w - \alpha \frac{\partial Err}{\partial w}$
  Update state: $s \leftarrow s'$
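For the linear case, $\partial Q_w(s,a) / \partial w$ is just the feature vector, so the loop above can be sketched directly. The environment interface (`reset`/`step`) and the epsilon-greedy exploration are assumptions added for illustration:

```python
import numpy as np

def gradient_q_learning(env, n_features, n_actions,
                        alpha=0.01, gamma=0.99, eps=0.1, steps=10_000):
    # Initialize weights at random in [-1, 1]; Q_w(s,a) = w[a] . x
    w = np.random.uniform(-1, 1, (n_actions, n_features))
    x = env.reset()                       # assumed: returns state features
    for _ in range(steps):
        # select action (epsilon-greedy, an illustrative choice) and execute it
        a = np.random.randint(n_actions) if np.random.rand() < eps \
            else int((w @ x).argmax())
        x_next, r, done = env.step(a)     # assumed env interface
        # TD target, treated as fixed when differentiating
        target = r if done else r + gamma * (w @ x_next).max()
        # gradient of the squared error; for linear Q, dQ/dw[a] = x
        w[a] -= alpha * ((w[a] @ x) - target) * x
        x = env.reset() if done else x_next
    return w
```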
Recap: Convergence of Tabular Q-learning
- Tabular Q-learning converges to the optimal Q-function under the following conditions:
  $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$
- Let $\alpha_{n(s,a)} = \frac{1}{n(s,a)}$
– where $n(s,a)$ is the number of times that $(s,a)$ is visited
- Q-learning update: $Q(s,a) \leftarrow Q(s,a) + \alpha_{n(s,a)}\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$
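As a concrete illustration of this decaying step size, a tabular update with $\alpha = 1/n(s,a)$ might look as follows (the state/action encoding and the `actions` iterable are assumptions):

```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(s, a)], implicitly zero-initialized
n = defaultdict(int)     # n[(s, a)]: visit counts
gamma = 0.99

def tabular_q_update(s, a, r, s_next, actions):
    n[(s, a)] += 1
    alpha = 1.0 / n[(s, a)]  # satisfies sum(alpha) = inf, sum(alpha^2) < inf
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```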
Convergence of Linear Gradient Q-Learning
- Linear gradient Q-learning converges under the same conditions:
  $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$
- Let $\alpha_t = \frac{1}{t}$
- Let $Q_w(s,a) = \sum_i w_i \, f_i(s,a)$ be linear in the weights
- Q-learning update: $w \leftarrow w - \alpha_t \frac{\partial Err}{\partial w}$
Divergence of non-linear Q-learning
- Even when the following conditions hold
  $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$
  non-linear Q-learning may diverge
- Intuition:
– Adjusting $w$ to increase $Q_w(s,a)$ at $(s,a)$ might introduce errors at nearby state-action pairs.
Mitigating divergence
- Two tricks are often used in practice:
- 1. Experience replay
- 2. Use two networks:
– Q-network $Q_w$
– Target network $Q_{\bar{w}}$
Experience Replay
- Idea: store previous experiences $(s, a, r, s')$ in a buffer and sample a mini-batch of previous experiences at each step to learn by Q-learning
- Advantages
– Breaks correlations between successive updates (more stable learning)
– Fewer interactions with the environment are needed to converge (greater data efficiency)
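A minimal replay buffer along these lines; the capacity and batch size are arbitrary illustrative values:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences evicted first

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # uniform sampling breaks correlations between successive transitions
        return random.sample(self.buffer, batch_size)
```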
Target Network
- Idea: use a separate target network $Q_{\bar{w}}$ that is updated only periodically
  repeat for each $(s, a, r, s')$ in mini-batch:
    $w \leftarrow w - \alpha\left[Q_w(s,a) - r - \gamma \max_{a'} Q_{\bar{w}}(s',a')\right] \frac{\partial Q_w(s,a)}{\partial w}$
  $\bar{w} \leftarrow w$  (target update)
- Advantage: mitigates divergence
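In code, the target network is simply a frozen copy of the Q-network's weights, refreshed only at the target-update step; a sketch assuming the weights live in a NumPy array:

```python
import numpy as np

w = np.random.uniform(-1, 1, 128)   # Q-network weights (shape illustrative)
w_bar = w.copy()                    # target network: frozen copy of w

def td_target(r, q_next_values, gamma=0.99):
    # bootstrap from the *target* network, not the network being trained
    return r + gamma * q_next_values.max()

# ... after a round of mini-batch updates to w ...
w_bar = w.copy()                    # target update: w_bar <- w
```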
Target Network
- Similar to value iteration:
  repeat for all $s$:
    $V(s) \leftarrow \max_a R(s,a) + \gamma \sum_{s'} \Pr(s'|s,a) \, V(s')$  (target update)
- repeat for each $(s, a, r, s')$ in mini-batch:
    $w \leftarrow w - \alpha\left[Q_w(s,a) - r - \gamma \max_{a'} Q_{\bar{w}}(s',a')\right] \frac{\partial Q_w(s,a)}{\partial w}$
  $\bar{w} \leftarrow w$  (target update)
Deep Q-network
- Google DeepMind
- Deep Q-network: gradient Q-learning with
– Deep neural networks
– Experience replay
– Target network
- Breakthrough: human-level play in many Atari video games
Deep Q-network
Initialize weights $w$ and $\bar{w}$ at random in $[-1, 1]$
Observe current state $s$
Loop
  Select action $a$ and execute it
  Receive immediate reward $r$
  Observe new state $s'$
  Add $(s, a, r, s')$ to experience buffer
  Sample mini-batch of experiences from buffer
  For each experience $(\hat{s}, \hat{a}, \hat{r}, \hat{s}')$ in mini-batch
    Gradient: $\frac{\partial Err}{\partial w} = \left[Q_w(\hat{s},\hat{a}) - \hat{r} - \gamma \max_{a'} Q_{\bar{w}}(\hat{s}',a')\right] \frac{\partial Q_w(\hat{s},\hat{a})}{\partial w}$
    Update weights: $w \leftarrow w - \alpha \frac{\partial Err}{\partial w}$
  Update state: $s \leftarrow s'$
  Every $c$ steps, update target: $\bar{w} \leftarrow w$
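A sketch of the full loop combining replay and the target network. To stay self-contained, a linear Q-function stands in for the deep network; the environment interface, buffer size, and update period `c` are illustrative assumptions:

```python
import random
import numpy as np
from collections import deque

def dqn(env, n_features, n_actions, alpha=0.001, gamma=0.99,
        eps=0.1, batch_size=32, c=1000, steps=100_000):
    # Initialize Q-network and target weights at random in [-1, 1]
    w = np.random.uniform(-1, 1, (n_actions, n_features))
    w_bar = w.copy()                          # target network
    buffer = deque(maxlen=100_000)            # experience buffer
    x = env.reset()
    for t in range(steps):
        # select action (epsilon-greedy) and execute it
        a = np.random.randint(n_actions) if np.random.rand() < eps \
            else int((w @ x).argmax())
        x_next, r, done = env.step(a)
        buffer.append((x, a, r, x_next, done))   # add experience
        if len(buffer) >= batch_size:
            for xh, ah, rh, xh_next, dh in random.sample(buffer, batch_size):
                # target is computed with the frozen weights w_bar
                target = rh if dh else rh + gamma * (w_bar @ xh_next).max()
                # gradient step; for linear Q, dQ/dw[ah] = xh
                w[ah] -= alpha * ((w[ah] @ xh) - target) * xh
        x = env.reset() if done else x_next
        if t % c == 0:
            w_bar = w.copy()                  # every c steps, update target
    return w
```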
Deep Q-Network for Atari