

SLIDE 1

Deep Recurrent Q-Learning for Partially Observable MDPs

Matthew Hausknecht and Peter Stone

University of Texas at Austin

November 13, 2015

SLIDES 2–7

Motivation

  • Intelligent decision making is the heart of AI
  • We want agents capable of learning to act intelligently in diverse environments
  • Reinforcement Learning provides a general learning framework
  • RL + deep neural networks yields robust controllers that learn from pixels (DQN)
  • DQN lacks mechanisms for handling partial observability
  • Goal: extend DQN to handle Partially Observable Markov Decision Processes (POMDPs)

SLIDE 8

Outline

Motivation · Background · MDP · POMDP · Atari Domain · Deep Q-Network · Deep Recurrent Q-Network · Results · Related Work · Appendix

SLIDES 9–11

Markov Decision Process (MDP)

State st, Action at, Reward rt

  • At each timestep the agent performs action at and receives reward rt and next state st+1 from the environment
  • The Markov property ensures that st+1 depends only on st and at
  • Learning an optimal policy π∗ therefore requires no memory of past states
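Because the Markov property makes the current state a sufficient statistic, the standard Q-learning update looks back only one transition. A minimal tabular sketch in Python (the state/action counts, learning rate, and discount are illustrative assumptions, not values from the talk):

```python
import numpy as np

n_states, n_actions = 16, 4      # illustrative sizes, not from the talk
alpha, gamma = 0.1, 0.99         # assumed learning rate and discount factor
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One Q-learning step: the target uses only (s, a, r, s_next),
    never any earlier history; this is where the Markov property matters."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```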

SLIDES 12–13

Partially Observable Markov Decision Process (POMDP)

Observation ot, Action at, Reward rt

  • The true state of the environment is hidden; observations ot provide only partial information
  • Memory of past observations may help infer the true system state and improve the policy

SLIDE 14

Atari Domain

Action, Observation, Score

  • 160 × 210 game screen → 84 × 84 grayscale
  • 18 discrete actions
  • Rewards clipped to {−1, 0, 1}

Source: www.arcadelearningenvironment.org
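A rough sketch of the preprocessing this slide implies (grayscale conversion, resizing to 84 × 84, reward clipping). DQN's exact resize and crop differ in small details, so treat this as an approximation rather than the authors' pipeline:

```python
import cv2
import numpy as np

def preprocess_frame(rgb_frame: np.ndarray) -> np.ndarray:
    """210 x 160 x 3 RGB game screen -> 84 x 84 grayscale in [0, 1]."""
    gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
    return small.astype(np.float32) / 255.0

def clip_reward(score_change: float) -> float:
    """Clip the raw change in game score into {-1, 0, 1}."""
    return float(np.sign(score_change))
```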

SLIDES 15–17

Atari Domain: MDP or POMDP?

Action, Observation, Score

Depends on the state representation!

  • Single frame ⇒ POMDP
  • Four frames ⇒ MDP
  • Console RAM ⇒ MDP
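A single frame hides velocities (for example, which way the ball is moving), which is why stacking recent frames restores the Markov property for most games. A minimal frame-stacking sketch, assuming frames preprocessed as above:

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keep the last k preprocessed frames and expose them as one (k, 84, 84) state."""
    def __init__(self, k: int = 4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame: np.ndarray) -> np.ndarray:
        for _ in range(self.k):
            self.frames.append(first_frame)
        return np.stack(self.frames)

    def step(self, frame: np.ndarray) -> np.ndarray:
        self.frames.append(frame)
        return np.stack(self.frames)   # four frames are enough to recover velocities
```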

SLIDES 18–19

Deep Q-Network (DQN)

[Architecture diagram: 4 × 84 × 84 input → Conv1 → Conv2 → Conv3 → fully connected IP1 (512 units) → 18 Q-values]

  • Model-free reinforcement learning method using a deep neural network as the Q-value function approximator (Mnih et al., 2015)
  • Takes the last four game screens as input: enough to make most Atari games Markov
  • How well does DQN perform in partially observed domains?
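A sketch of this architecture in PyTorch (not the authors' released code); filter counts, kernel sizes, and strides follow those reported by Mnih et al. (2015):

```python
import torch.nn as nn

class DQN(nn.Module):
    """Sketch of the DQN trunk: three conv layers, the 512-unit fully connected
    layer labeled IP1 in the diagram, and one Q-value per action."""
    def __init__(self, n_actions: int = 18, in_frames: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),   # Conv1
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),          # Conv2
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),          # Conv3
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),                          # IP1
            nn.Linear(512, n_actions),                                      # Q-values
        )

    def forward(self, x):            # x: (batch, 4, 84, 84) stacked frames
        return self.net(x)
```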

SLIDES 20–22

Flickering Atari

Observation ot, Action at, Reward rt

  • Induce partial observability by stochastically obscuring the game screen:
      ot = st with probability p = 1/2, and a blank screen ⟨0, . . . , 0⟩ otherwise
  • Game state must now be inferred from past observations
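A sketch of such an observation-obscuring wrapper. The wrapper class and the gym-style reset/step interface are illustrative assumptions, not the authors' code:

```python
import numpy as np

class FlickerWrapper:
    """With probability p show the true screen, otherwise an all-zero screen."""
    def __init__(self, env, p: float = 0.5, rng=None):
        self.env = env
        self.p = p
        self.rng = rng or np.random.default_rng()

    def _flicker(self, frame: np.ndarray) -> np.ndarray:
        if self.rng.random() < self.p:
            return frame
        return np.zeros_like(frame)          # <0, ..., 0>: the screen is obscured

    def reset(self):
        return self._flicker(self.env.reset())

    def step(self, action):
        frame, reward, done, info = self.env.step(action)
        return self._flicker(frame), reward, done, info
```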

SLIDE 23

DQN Pong

[True game screen vs. perceived game screen]

SLIDE 24

DQN Flickering Pong

[True game screen vs. perceived game screen]

SLIDE 25

Outline

Motivation · Background · MDP · POMDP · Atari Domain · Deep Q-Network · Deep Recurrent Q-Network · Results · Related Work · Appendix

SLIDES 26–29

Deep Recurrent Q-Network (DRQN)

[Architecture diagram: a single 84 × 84 frame per timestep → Conv layers → LSTM (512 units) → 18 Q-values, unrolled over timesteps . . . , t − 1, t]

Long Short-Term Memory: Hochreiter and Schmidhuber (1997)

Identical to DQN except:

  • Replaces DQN’s IP1 with a recurrent LSTM layer of the same dimension
  • Each timestep takes a single frame as input

LSTM provides a selective memory of past game states

Trained end-to-end using BPTT, unrolled over the last 10 timesteps
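A sketch of this change in PyTorch, reusing the conv trunk from the DQN sketch above: a 512-unit LSTM replaces the IP1 layer, each timestep consumes one frame, and 10-frame sequences are fed in for truncated BPTT. Anything beyond what the slide states is an assumption:

```python
import torch.nn as nn

class DRQN(nn.Module):
    """Sketch of DRQN: DQN's conv trunk, but the 512-unit fully connected layer
    is replaced by a 512-unit LSTM and each timestep sees a single frame."""
    def __init__(self, n_actions: int = 18):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(input_size=64 * 7 * 7, hidden_size=512, batch_first=True)
        self.head = nn.Linear(512, n_actions)

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 1, 84, 84); time = 10 when unrolling BPTT as on the slide
        b, t = frames.shape[:2]
        feats = self.conv(frames.reshape(b * t, *frames.shape[2:])).reshape(b, t, -1)
        out, hidden = self.lstm(feats, hidden)       # recurrent replacement for IP1
        return self.head(out), hidden                # Q-values at every timestep
```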

SLIDES 30–32

DRQN Maximal Activations

  • Unit detects the agent missing the ball
  • Unit detects ball reflection on the paddle
  • Unit detects ball reflection on the wall

SLIDE 33

Outline

Motivation · Background · MDP · POMDP · Atari Domain · Deep Q-Network · Deep Recurrent Q-Network · Results · Related Work · Appendix

SLIDE 34

DRQN Flickering Pong

[True game screen vs. perceived game screen]

SLIDE 35

Flickering Pong

SLIDES 36–37

Pong Generalization: POMDP ⇒ MDP

How does DRQN generalize when trained on Flickering Pong and evaluated on standard Pong?

[Plot: Pong score vs. observation probability (0.0–1.0) for DRQN 1-frame, DQN 10-frame, and DQN 4-frame]
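One way to produce a curve like this is to fix a trained policy and sweep the flicker probability at evaluation time. A sketch, assuming the FlickerWrapper defined earlier and a gym-style environment constructor (both assumptions, not the authors' evaluation harness):

```python
import numpy as np

def evaluate(policy, env, episodes: int = 10) -> float:
    """Average undiscounted episode score of a fixed policy."""
    scores = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            total += reward
        scores.append(total)
    return float(np.mean(scores))

def generalization_curve(policy, make_env, probs=np.arange(0.0, 1.01, 0.1)):
    """Score of the same policy as the observation probability varies at test time."""
    return {float(p): evaluate(policy, FlickerWrapper(make_env(), p=p)) for p in probs}
```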

SLIDES 38–40

Performance on Flickering Atari Games

Game           10-frame DRQN (±std)   10-frame DQN (±std)
Pong           12.1 (±2.2)            −9.9 (±3.3)
Beam Rider     618 (±115)             1685.6 (±875)
Asteroids      1032 (±410)            1010 (±535)
Bowling        65.5 (±13)             57.3 (±8)
Centipede      4319.2 (±4378)         5268.1 (±2052)
Chopper Cmd    1330 (±294)            1450 (±787.8)
Double Dunk    −14 (±2.5)             −16.2 (±2.6)
Frostbite      414 (±494)             436 (±462.5)
Ice Hockey     −5.4 (±2.7)            −4.2 (±1.5)
Ms. Pacman     1739 (±942)            1824 (±490)

SLIDES 41–43

Performance on Standard Atari Games

Game           10-frame DRQN (±std)   10-frame DQN (±std)
Double Dunk    −2 (±7.8)              −10 (±3.5)
Frostbite      2875 (±535)            519 (±363)
Beam Rider     3269 (±1167)           6923 (±1027)
Asteroids      1020 (±312)            1070 (±345)
Bowling        62 (±5.9)              72 (±11)
Centipede      3534 (±1601)           3653 (±1903)
Chopper Cmd    2070 (±875)            1460 (±976)
Ice Hockey     −4.4 (±1.6)            −3.5 (±3.5)
Ms. Pacman     2048 (±653)            2363 (±735)

SLIDE 44

Performance on Standard Atari Games

SLIDE 45

DRQN Frostbite

[True game screen vs. perceived game screen]

SLIDES 46–47

Generalization: MDP ⇒ POMDP

How does DRQN generalize when trained on standard Atari and evaluated on Flickering Atari?

[Plot: percentage of original score vs. observation probability (0.1–0.9) for DRQN and DQN]

SLIDE 48

Outline

Motivation · Background · MDP · POMDP · Atari Domain · Deep Q-Network · Deep Recurrent Q-Network · Results · Related Work · Appendix

SLIDE 49

Related Work

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Narasimhan, K., Kulkarni, T., and Barzilay, R. (2015). Language understanding for text-based games using deep reinforcement learning. CoRR, abs/1506.08941.

Wierstra, D., Foerster, A., Peters, J., and Schmidhuber, J. (2007). Solving deep memory POMDPs with recurrent policy gradients.

SLIDE 50

Thanks!

  • LSTM can help deal with partial observability
  • Largest gains in generalization between MDP ⇔ POMDP
  • Future work: understanding why DRQN does better/worse on certain games

Source: https://github.com/mhauskn/dqn/tree/recurrent

Matthew Hausknecht and Peter Stone

SLIDE 51

Outline

Motivation · Background · MDP · POMDP · Atari Domain · Deep Q-Network · Deep Recurrent Q-Network · Results · Related Work · Appendix

SLIDE 52

Computational Efficiency

                 Backwards (ms)            Forwards (ms)
Frames           1       4       10        1      4      10
Baseline         8.82    13.6    26.7      2.0    4.0    9.0
Unroll 1         18.2    22.3    33.7      2.4    4.4    9.4
Unroll 10        77.3    111.3   180.5     2.5    4.4    8.3
Unroll 30        204.5   263.4   491.1     2.5    3.8    9.4

Table: Average milliseconds per backwards/forwards pass. Frames refers to the number of channels in the input image. Baseline is a non-recurrent network (e.g. DQN). Unroll refers to an LSTM network backpropagated through time for 1/10/30 steps.
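For context, a rough sketch of how such per-pass timings could be gathered for the DRQN sketch above (a simple CPU timing loop with a dummy loss; the authors' actual benchmarking setup is not described on the slide, so this is purely illustrative):

```python
import time
import torch

def time_passes(model, batch: int = 32, unroll: int = 10, reps: int = 20):
    """Average milliseconds per forward and backward pass at a given unroll length."""
    x = torch.randn(batch, unroll, 1, 84, 84)
    fwd = bwd = 0.0
    for _ in range(reps):
        t0 = time.perf_counter()
        q, _ = model(x)                       # forward pass
        fwd += time.perf_counter() - t0
        loss = q.pow(2).mean()                # dummy loss, only to time backprop
        t0 = time.perf_counter()
        loss.backward()                       # backward pass through the unrolled LSTM
        bwd += time.perf_counter() - t0
        model.zero_grad()
    return 1000 * fwd / reps, 1000 * bwd / reps
```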