SLIDE 1

Deep Reinforcement Learning

[Human-Level Control through deep reinforcement learning, Nature 2015]

CS 486/686, University of Waterloo, Lecture 20: July 10, 2017

SLIDE 2

Outline

  • Value Function Approximation

    – Linear approximation
    – Neural network approximation

  • Deep Q-network

SLIDE 3

Quick recap

  • Markov Decision Processes: value iteration
  • Reinforcement Learning: Q-Learning
  • Complexity depends on the number of states and actions

SLIDE 4

Large State Spaces

  • Computer Go: roughly 10^170 states
  • Inverted pendulum:
    – 4-dimensional continuous state space

  • Atari: 210x160x3 dimensions (pixel values)

SLIDE 5

Functions to be Approximated

  • Policy: π : S → A
  • Q-function: Q : S × A → ℝ
  • Value function: V : S → ℝ  (all three signatures are sketched below)
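
As a rough illustration (not from the slides), the three objects can be written as plain function types; the names and the choice of state encoding here are illustrative:

```python
from typing import Callable, Tuple

State = Tuple[float, ...]   # e.g., a feature vector; an illustrative choice
Action = int                # an index into a discrete action set

Policy = Callable[[State], Action]            # pi : S -> A
QFunction = Callable[[State, Action], float]  # Q  : S x A -> R
ValueFunction = Callable[[State], float]      # V  : S -> R
```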

SLIDE 6

Q-function Approximation

  • Let the state be described by a feature vector x = (x_1, ..., x_n)
  • Linear: Q_w(s, a) = Σ_i w_{a,i} x_i
  • Non-linear (e.g., neural network): Q_w(s, a) = g(x; w)  (both variants are sketched below)
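
A minimal numpy sketch of the two approximators. The feature count, hidden width, and function names are illustrative choices, not values from the slides:

```python
import numpy as np

n_features, n_actions = 4, 2

# Linear approximation: one weight vector per action, Q(s,a) = w_a . x
W = np.random.uniform(-1, 1, size=(n_actions, n_features))

def q_linear(x, a):
    return W[a] @ x

# Non-linear approximation: a small one-hidden-layer network
# (W1, W2 stand in for a deep network's parameters)
W1 = np.random.uniform(-1, 1, size=(32, n_features))
W2 = np.random.uniform(-1, 1, size=(n_actions, 32))

def q_network(x):
    h = np.maximum(0.0, W1 @ x)   # ReLU hidden layer
    return W2 @ h                 # one Q-value per action

x = np.random.rand(n_features)    # example feature vector for a state
print(q_linear(x, 0), q_network(x)[0])
```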

SLIDE 7

Gradient Q-learning

  • Minimize squared error between Q-value estimate and target
    – Q-value estimate: Q_w(s, a)
    – Target: r + γ max_{a'} Q_w(s', a')
  • Squared error: Err(w) = ½ [Q_w(s, a) - r - γ max_{a'} Q_w(s', a')]²
  • Gradient, treating the target as fixed:
    ∂Err/∂w = [Q_w(s, a) - r - γ max_{a'} Q_w(s', a')] ∂Q_w(s, a)/∂w
    (a worked numeric step follows below)
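
For the linear case, ∂Q_w(s, a)/∂w_a is just the feature vector x, so one gradient step can be computed directly. A minimal sketch, with all constants and the random transition illustrative:

```python
import numpy as np

gamma, alpha = 0.99, 0.01
n_features, n_actions = 4, 2
W = np.random.uniform(-1, 1, size=(n_actions, n_features))

def q(x, a):
    return W[a] @ x

# One (s, a, r, s') transition, with states already encoded as features
x, a, r, x_next = np.random.rand(n_features), 0, 1.0, np.random.rand(n_features)

target = r + gamma * max(q(x_next, b) for b in range(n_actions))  # held fixed
td_error = q(x, a) - target
grad = td_error * x            # for linear Q, dQ_w(s,a)/dw_a = x
W[a] -= alpha * grad           # one gradient step on the squared error
```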

SLIDE 8

Gradient Q-learning

Initialize weights w at random in [-1, 1]
Observe current state s
Loop
    Select action a and execute it
    Receive immediate reward r
    Observe new state s'
    Gradient: ∂Err/∂w = [Q_w(s, a) - r - γ max_{a'} Q_w(s', a')] ∂Q_w(s, a)/∂w
    Update weights: w ← w - α ∂Err/∂w
    Update state: s ← s'

(a runnable sketch of this loop follows below)
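
A runnable version of the loop above, under stated assumptions: the environment follows the Gymnasium API, actions are chosen epsilon-greedily (the slide leaves the exploration scheme unspecified), and `encode` plus all constants are illustrative:

```python
import numpy as np

def gradient_q_learning(env, n_features, n_actions, encode,
                        alpha=0.01, gamma=0.99, epsilon=0.1, n_steps=10_000):
    """Online gradient Q-learning with a linear Q-function.

    `env` is assumed to follow the Gymnasium API (reset/step) and
    `encode` maps raw observations to feature vectors of length n_features.
    """
    W = np.random.uniform(-1, 1, size=(n_actions, n_features))
    obs, _ = env.reset()
    x = encode(obs)
    for _ in range(n_steps):
        # Epsilon-greedy action selection (one common choice)
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)
        else:
            a = int(np.argmax(W @ x))
        obs, r, terminated, truncated, _ = env.step(a)
        x_next = encode(obs)
        # The target is treated as a constant when differentiating
        target = r if terminated else r + gamma * np.max(W @ x_next)
        W[a] -= alpha * (W[a] @ x - target) * x
        if terminated or truncated:
            obs, _ = env.reset()
            x = encode(obs)
        else:
            x = x_next
    return W
```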

SLIDE 9

Recap: Convergence of Tabular Q-learning

  • Tabular Q-learning converges to the optimal Q-function under the following conditions:
    Σ_t α_t = ∞  and  Σ_t α_t² < ∞
  • Let α_{n(s,a)} = 1/n(s,a)
    – where n(s,a) is the # of times that (s, a) is visited
  • Q-learning update:
    Q(s, a) ← Q(s, a) + α_{n(s,a)} [r + γ max_{a'} Q(s', a') - Q(s, a)]
    (a numeric check of the step-size conditions follows below)
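
As a quick numeric sanity check of the two conditions for the schedule α_n = 1/n (a sketch, not from the slides): the partial sums of α_n grow without bound (harmonic series), while the partial sums of α_n² stay bounded, approaching π²/6:

```python
import numpy as np

n = np.arange(1, 10_000_001, dtype=np.float64)
alpha = 1.0 / n
print(alpha.sum())          # ~16.7 here, and still growing with more terms
print((alpha ** 2).sum())   # ~1.6449, close to pi^2 / 6
```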

SLIDE 10

Convergence of Linear Gradient Q-Learning

  • Linear gradient Q-learning converges under the same conditions:
    Σ_t α_t = ∞  and  Σ_t α_t² < ∞
  • Let α_t = 1/t
  • Let Q_w(s, a) = Σ_i w_{a,i} x_i
  • Q-learning update:
    w ← w + α_t [r + γ max_{a'} Q_w(s', a') - Q_w(s, a)] ∂Q_w(s, a)/∂w

SLIDE 11

Divergence of non-linear Q-learning

  • Even when the following conditions hold,
    Σ_t α_t = ∞  and  Σ_t α_t² < ∞
    non-linear Q-learning may diverge
  • Intuition:
    – Adjusting w to increase Q_w at (s, a) might introduce errors at nearby state-action pairs

SLIDE 12

Mitigating divergence

  • Two tricks are often used in practice:
    1. Experience replay
    2. Use two networks:
       – Q-network
       – Target network

SLIDE 13

Experience Replay

  • Idea: store previous experiences (s, a, r, s') in a buffer and sample a mini-batch of previous experiences at each step to learn by Q-learning (a buffer sketch follows this list)
  • Advantages
    – Breaks correlations between successive updates (more stable learning)
    – Fewer interactions with the environment needed to converge (greater data efficiency)
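
A minimal replay-buffer sketch; the capacity and batch size are illustrative choices, not values from the slides:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s', done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences fall off

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the temporal correlation between
        # successive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```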

SLIDE 14

Target Network

  • Idea: use a separate target network Q_w̄ that is updated only periodically:
        repeat for each (s, a, r, s') in mini-batch:
            w ← w - α [Q_w(s, a) - r - γ max_{a'} Q_w̄(s', a')] ∂Q_w(s, a)/∂w
        w̄ ← w    (target update)
  • Advantage: mitigates divergence (a code sketch follows below)
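
A small sketch of the mini-batch update against a frozen target copy, reusing the linear Q-function stand-in from earlier (all constants illustrative):

```python
import numpy as np

gamma, alpha = 0.99, 0.01
n_features, n_actions = 4, 2

W = np.random.uniform(-1, 1, size=(n_actions, n_features))  # Q-network
W_bar = W.copy()                                            # target network

def update_on_minibatch(batch):
    """One pass over a mini-batch of (x, a, r, x_next, done) transitions,
    with targets computed from the frozen copy W_bar."""
    for x, a, r, x_next, done in batch:
        target = r if done else r + gamma * np.max(W_bar @ x_next)
        W[a] -= alpha * (W[a] @ x - target) * x

# Periodically (e.g., every c gradient steps) refresh the target network:
# W_bar = W.copy()
```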

SLIDE 15

Target Network

  • Similar to value iteration, which backs up against a frozen estimate Q̂:
        repeat for all (s, a):
            Q(s, a) ← R(s, a) + γ Σ_{s'} P(s'|s, a) max_{a'} Q̂(s', a')
        Q̂ ← Q    (target update)
  • Target-network Q-learning mirrors this structure:
        repeat for each (s, a, r, s') in mini-batch:
            w ← w - α [Q_w(s, a) - r - γ max_{a'} Q_w̄(s', a')] ∂Q_w(s, a)/∂w
        w̄ ← w    (target update)

SLIDE 16

Deep Q-network

  • Google DeepMind
  • Deep Q-network: gradient Q-learning with
    – Deep neural networks
    – Experience replay
    – Target network
  • Breakthrough: human-level play in many Atari video games

SLIDE 17

Deep Q-network

Initialize weights w and w̄ at random in [-1, 1]
Observe current state s
Loop
    Select action a and execute it
    Receive immediate reward r
    Observe new state s'
    Add (s, a, r, s') to experience buffer
    Sample mini-batch of experiences from buffer
    For each experience (ŝ, â, r̂, ŝ') in mini-batch:
        Gradient: ∂Err/∂w = [Q_w(ŝ, â) - r̂ - γ max_{a'} Q_w̄(ŝ', a')] ∂Q_w(ŝ, â)/∂w
        Update weights: w ← w - α ∂Err/∂w
    Update state: s ← s'
    Every c steps, update target: w̄ ← w

(a runnable sketch of this loop follows below)
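
Putting the pieces together, a compact sketch of the full loop with a linear Q-function standing in for the deep network (the Nature DQN uses a convolutional network here); `env` is assumed to follow the Gymnasium API, and `encode` plus all hyperparameters are illustrative:

```python
import random
import numpy as np
from collections import deque

def dqn(env, n_features, n_actions, encode, alpha=0.01, gamma=0.99,
        epsilon=0.1, batch_size=32, target_every=1000, n_steps=100_000):
    """Gradient Q-learning + experience replay + target network."""
    W = np.random.uniform(-1, 1, size=(n_actions, n_features))
    W_bar = W.copy()                       # target network
    buffer = deque(maxlen=100_000)         # experience buffer
    obs, _ = env.reset()
    x = encode(obs)
    for step in range(1, n_steps + 1):
        # Epsilon-greedy exploration (an assumption; the slide does not
        # fix the exploration scheme)
        a = (np.random.randint(n_actions) if np.random.rand() < epsilon
             else int(np.argmax(W @ x)))
        obs, r, terminated, truncated, _ = env.step(a)
        x_next = encode(obs)
        buffer.append((x, a, r, x_next, terminated))
        if len(buffer) >= batch_size:
            # Mini-batch update with targets from the frozen copy W_bar
            for xh, ah, rh, xh_next, done in random.sample(buffer, batch_size):
                target = rh if done else rh + gamma * np.max(W_bar @ xh_next)
                W[ah] -= alpha * (W[ah] @ xh - target) * xh
        if terminated or truncated:
            obs, _ = env.reset()
            x = encode(obs)
        else:
            x = x_next
        if step % target_every == 0:
            W_bar = W.copy()               # periodic target update
    return W
```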

SLIDE 18

Deep Q-Network for Atari

SLIDE 19

DQN versus Linear approx.