SLIDE 1
Deep Recurrent Q-Learning for Partially Observable MDPs
Matthew Hausknecht and Peter Stone
University of Texas at Austin
November 13, 2015
1
SLIDE 2
SLIDE 3
Motivation
- Intelligent decision making is the heart of AI
- Desire agents capable of learning to act intelligently in diverse environments
2
SLIDE 4
Motivation
- Intelligent decision making is the heart of AI
- Desire agents capable of learning to act intelligently in diverse environments
- Reinforcement Learning provides a general learning framework
2
SLIDE 5
Motivation
- Intelligent decision making is the heart of AI
- Desire agents capable of learning to act intelligently in diverse environments
- Reinforcement Learning provides a general learning framework
- RL + deep neural networks yields robust controllers that learn from pixels (DQN)
2
SLIDE 6
Motivation
- Intelligent decision making is the heart of AI
- Desire agents capable of learning to act intelligently in diverse environments
- Reinforcement Learning provides a general learning framework
- RL + deep neural networks yields robust controllers that learn from pixels (DQN)
- DQN lacks mechanisms for handling partial observability
2
SLIDE 7
Motivation
- Intelligent decision making is the heart of AI
- Desire agents capable of learning to act intelligently in diverse environments
- Reinforcement Learning provides a general learning framework
- RL + deep neural networks yields robust controllers that learn from pixels (DQN)
- DQN lacks mechanisms for handling partial observability
- Extend DQN to handle Partially Observable Markov Decision Processes (POMDPs)
2
SLIDE 8
Outline
- Motivation
- Background: MDP, POMDP, Atari Domain, Deep Q-Network
- Deep Recurrent Q-Network
- Results
- Related Work
- Appendix
3
SLIDE 9
Markov Decision Process (MDP)
[Diagram: agent–environment loop with state s_t, action a_t, reward r_t]
At each timestep, the agent performs action a_t and receives reward r_t and next state s_{t+1} from the environment
4
SLIDE 10
Markov Decision Process (MDP)
[Diagram: agent–environment loop with state s_t, action a_t, reward r_t]
At each timestep, the agent performs action a_t and receives reward r_t and next state s_{t+1} from the environment. The Markov property ensures that s_{t+1} depends only on s_t and a_t
4
SLIDE 11
Markov Decision Process (MDP)
[Diagram: agent–environment loop with state s_t, action a_t, reward r_t]
At each timestep, the agent performs action a_t and receives reward r_t and next state s_{t+1} from the environment. The Markov property ensures that s_{t+1} depends only on s_t and a_t, so learning an optimal policy π∗ requires no memory of past states
4
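As a short formal aside (the standard definition, not taken from the slides), the Markov property can be written as:

```latex
% Markov property: the next-state distribution depends only on the current state and action
P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)
```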
SLIDE 12
Partially Observable Markov Decision Process (POMDP)
[Diagram: agent–environment loop with observation o_t, action a_t, reward r_t]
The true state of the environment is hidden. Observations o_t provide only partial information.
5
SLIDE 13
Partially Observable Markov Decision Process (POMDP)
[Diagram: agent–environment loop with observation o_t, action a_t, reward r_t]
The true state of the environment is hidden. Observations o_t provide only partial information.
Memory of past observations may help infer the true system state and improve the policy
5
SLIDE 14
Atari Domain
[Atari game screen, annotated with observation, action, and score]
- 160 × 210 state space → 84 × 84 grayscale
- 18 discrete actions
- Rewards clipped to {−1, 0, 1}
Source: www.arcadelearningenvironment.org
6
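For concreteness, here is a minimal preprocessing sketch matching these numbers; the OpenCV/NumPy calls are my assumption for illustration, not the authors' pipeline.

```python
import cv2
import numpy as np

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Downscale a raw RGB Atari frame (160 x 210 screen) to 84 x 84 grayscale."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)

def clip_reward(reward: float) -> float:
    """Clip rewards to {-1, 0, 1} by keeping only their sign."""
    return float(np.sign(reward))
```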
SLIDE 15
Atari Domain: MDP or POMDP?
[Atari game screen, annotated with observation, action, and score]
7
SLIDE 16
Atari Domain: MDP or POMDP?
[Atari game screen, annotated with observation, action, and score]
Depends on the state representation!
7
SLIDE 17
Atari Domain: MDP or POMDP?
[Atari game screen, annotated with observation, action, and score]
Depends on the state representation!
- Single Frame ⇒ POMDP
- Four Frames ⇒ MDP
- Console RAM ⇒ MDP
7
SLIDE 18
Deep Q-Network (DQN)
[Architecture: 4 × 84 × 84 input → Conv1 → Conv2 → Conv3 → IP1 (512) → Q-Values (18)]
- Model-free reinforcement learning method using a deep neural network as a Q-value function approximator (Mnih et al., 2015)
- Takes the last four game screens as input: enough to make most Atari games Markov
8
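For readers who want to see the shape of the network, here is a minimal PyTorch sketch of the architecture on this slide. Layer sizes follow Mnih et al. (2015); the PyTorch framing is my assumption, since the authors' released code is Caffe-based.

```python
# Illustrative DQN sketch: 4 stacked 84x84 frames -> Conv1-3 -> IP1 (512) -> 18 Q-values.
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, num_actions: int = 18):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # Conv1
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # Conv2
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # Conv3
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # IP1
            nn.Linear(512, num_actions),            # one Q-value per action
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: batch of 4 stacked 84x84 grayscale screens
        return self.fc(self.conv(frames))

q_net = DQN()
q_values = q_net(torch.zeros(1, 4, 84, 84))  # -> shape (1, 18)
```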
SLIDE 19
Deep Q-Network (DQN)
[Architecture: 4 × 84 × 84 input → Conv1 → Conv2 → Conv3 → IP1 (512) → Q-Values (18)]
- Model-free reinforcement learning method using a deep neural network as a Q-value function approximator (Mnih et al., 2015)
- Takes the last four game screens as input: enough to make most Atari games Markov
How well does DQN perform in partially observed domains?
8
SLIDE 20
Flickering Atari
[Diagram: agent–environment loop with observation o_t, action a_t, reward r_t]
Induce partial observability by stochastically obscuring the game screen
9
SLIDE 21
Flickering Atari
[Diagram: agent–environment loop with observation o_t, action a_t, reward r_t]
Induce partial observability by stochastically obscuring the game screen
o_t = s_t with probability 1/2, and <0, . . . , 0> (an all-zero observation) otherwise
9
SLIDE 22
Flickering Atari
[Diagram: agent–environment loop with observation o_t, action a_t, reward r_t]
Induce partial observability by stochastically obscuring the game screen
o_t = s_t with probability 1/2, and <0, . . . , 0> (an all-zero observation) otherwise
Game state must now be inferred from past observations
9
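A minimal sketch of this observation model; the function name and NumPy usage are illustrative assumptions, not the paper's code.

```python
import numpy as np

_rng = np.random.default_rng()

def flicker(screen: np.ndarray, p: float = 0.5) -> np.ndarray:
    """Return the true game screen with probability p, otherwise an all-zero observation."""
    return screen if _rng.random() < p else np.zeros_like(screen)
```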
SLIDE 23
DQN Pong
[True Game Screen vs. Perceived Game Screen]
10
SLIDE 24
DQN Flickering Pong
[True Game Screen vs. Perceived Game Screen]
11
SLIDE 25
Outline
- Motivation
- Background: MDP, POMDP, Atari Domain, Deep Q-Network
- Deep Recurrent Q-Network
- Results
- Related Work
- Appendix
12
SLIDE 26
Deep Recurrent Q-Network
[Architecture: one 84 × 84 frame per timestep → convolutions → LSTM (512) → Q-Values (18), unrolled over timesteps . . . , t − 1, t]
Long Short-Term Memory: Hochreiter and Schmidhuber (1997)
13
SLIDE 27
Deep Recurrent Q-Network
[Architecture: one 84 × 84 frame per timestep → convolutions → LSTM (512) → Q-Values (18), unrolled over timesteps . . . , t − 1, t]
Long Short-Term Memory: Hochreiter and Schmidhuber (1997)
Identical to DQN except:
- Replaces DQN’s IP1 with a recurrent LSTM layer of the same dimension
- Each timestep takes a single frame as input
13
SLIDE 28
Deep Recurrent Q-Network
[Architecture: one 84 × 84 frame per timestep → convolutions → LSTM (512) → Q-Values (18), unrolled over timesteps . . . , t − 1, t]
Long Short-Term Memory: Hochreiter and Schmidhuber (1997)
Identical to DQN except:
- Replaces DQN’s IP1 with a recurrent LSTM layer of the same dimension
- Each timestep takes a single frame as input
LSTM provides a selective memory of past game states
13
SLIDE 29
Deep Recurrent Q-Network
[Architecture: one 84 × 84 frame per timestep → convolutions → LSTM (512) → Q-Values (18), unrolled over timesteps . . . , t − 1, t]
Long Short-Term Memory: Hochreiter and Schmidhuber (1997)
Identical to DQN except:
- Replaces DQN’s IP1 with a recurrent LSTM layer of the same dimension
- Each timestep takes a single frame as input
LSTM provides a selective memory of past game states
Trained end-to-end using BPTT, unrolled for the last 10 timesteps
13
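A minimal PyTorch sketch of this change, assuming the same convolutional stack as DQN; this is illustrative only, not the authors' Caffe implementation.

```python
# Illustrative DRQN sketch: DQN's conv stack, with the 512-unit IP1 layer replaced by a
# 512-unit LSTM; each timestep sees a single 84x84 frame.
import torch
import torch.nn as nn

class DRQN(nn.Module):
    def __init__(self, num_actions: int = 18, hidden: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(64 * 7 * 7, hidden, batch_first=True)  # replaces IP1
        self.head = nn.Linear(hidden, num_actions)

    def forward(self, frames, state=None):
        # frames: (batch, time, 1, 84, 84) -- one grayscale frame per timestep
        b, t = frames.shape[:2]
        feats = self.conv(frames.reshape(b * t, 1, 84, 84)).reshape(b, t, -1)
        out, state = self.lstm(feats, state)  # selective memory over past frames
        return self.head(out), state          # Q-values at every timestep

# Unrolled over 10 timesteps, mirroring the BPTT setup described on the slide.
q_values, hidden = DRQN()(torch.zeros(2, 10, 1, 84, 84))  # q_values: (2, 10, 18)
```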
SLIDE 30
DRQN Maximal Activations
Unit detects the agent missing the ball
14
SLIDE 31
DRQN Maximal Activations
- Unit detects the agent missing the ball
- Unit detects ball reflection on paddle
14
SLIDE 32
DRQN Maximal Activations
- Unit detects the agent missing the ball
- Unit detects ball reflection on paddle
- Unit detects ball reflection on wall
14
SLIDE 33
Outline
- Motivation
- Background: MDP, POMDP, Atari Domain, Deep Q-Network
- Deep Recurrent Q-Network
- Results
- Related Work
- Appendix
15
SLIDE 34
DRQN Flickering Pong
[True Game Screen vs. Perceived Game Screen]
16
SLIDE 35
Flickering Pong
17
SLIDE 36
Pong Generalization: POMDP ⇒ MDP
How does DRQN generalize when trained on Flickering Pong and evaluated on standard Pong?
18
SLIDE 37
Pong Generalization: POMDP ⇒ MDP
[Plot: Pong score vs. observation probability (0.0–1.0) for DRQN, 1-frame DQN, 10-frame DQN, and 4-frame DQN]
18
SLIDE 38
Performance on Flickering Atari Games
Game          10-frame DRQN (±std)   10-frame DQN (±std)
Pong          12.1 (±2.2)            −9.9 (±3.3)
19
SLIDE 39
Performance on Flickering Atari Games
Game          10-frame DRQN (±std)   10-frame DQN (±std)
Pong          12.1 (±2.2)            −9.9 (±3.3)
Beam Rider    618 (±115)             1685.6 (±875)
19
SLIDE 40
Performance on Flickering Atari Games
Game          10-frame DRQN (±std)   10-frame DQN (±std)
Pong          12.1 (±2.2)            −9.9 (±3.3)
Beam Rider    618 (±115)             1685.6 (±875)
Asteroids     1032 (±410)            1010 (±535)
Bowling       65.5 (±13)             57.3 (±8)
Centipede     4319.2 (±4378)         5268.1 (±2052)
Chopper Cmd   1330 (±294)            1450 (±787.8)
Double Dunk   −14 (±2.5)             −16.2 (±2.6)
Frostbite     414 (±494)             436 (±462.5)
Ice Hockey    −5.4 (±2.7)            −4.2 (±1.5)
Ms. Pacman    1739 (±942)            1824 (±490)
19
SLIDE 41
Performance on Standard Atari Games
Game          10-frame DRQN (±std)   10-frame DQN (±std)
Double Dunk   −2 (±7.8)              −10 (±3.5)
Frostbite     2875 (±535)            519 (±363)
20
SLIDE 42
Performance on Standard Atari Games
Game          10-frame DRQN (±std)   10-frame DQN (±std)
Double Dunk   −2 (±7.8)              −10 (±3.5)
Frostbite     2875 (±535)            519 (±363)
Beam Rider    3269 (±1167)           6923 (±1027)
20
SLIDE 43
Performance on Standard Atari Games
Game          10-frame DRQN (±std)   10-frame DQN (±std)
Double Dunk   −2 (±7.8)              −10 (±3.5)
Frostbite     2875 (±535)            519 (±363)
Beam Rider    3269 (±1167)           6923 (±1027)
Asteroids     1020 (±312)            1070 (±345)
Bowling       62 (±5.9)              72 (±11)
Centipede     3534 (±1601)           3653 (±1903)
Chopper Cmd   2070 (±875)            1460 (±976)
Ice Hockey    −4.4 (±1.6)            −3.5 (±3.5)
Ms. Pacman    2048 (±653)            2363 (±735)
20
SLIDE 44
Performance on Standard Atari Games
21
SLIDE 45
DRQN Frostbite
[True Game Screen vs. Perceived Game Screen]
22
SLIDE 46
Generalization: MDP ⇒ POMDP
How does DRQN generalize when trained on standard Atari and evaluated on flickering Atari?
23
SLIDE 47
Generalization: MDP ⇒ POMDP
[Plot: percentage of original score vs. observation probability (0.1–0.9) for DRQN and DQN]
23
SLIDE 48
Outline
- Motivation
- Background: MDP, POMDP, Atari Domain, Deep Q-Network
- Deep Recurrent Q-Network
- Results
- Related Work
- Appendix
24
SLIDE 49
Related Work
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
Narasimhan, K., Kulkarni, T., and Barzilay, R. (2015). Language understanding for text-based games using deep reinforcement learning. CoRR, abs/1506.08941.
Wierstra, D., Foerster, A., Peters, J., and Schmidhuber, J. (2007). Solving deep memory POMDPs with recurrent policy gradients.
25
SLIDE 50
Thanks!
[Diagram: DRQN architecture, one 84 × 84 frame per timestep with an LSTM, unrolled over timesteps . . . , t − 1, t]
- LSTM can help deal with partial observability
- Largest gains in generalization between MDP ⇔ POMDP
- Future work: understanding why DRQN does better/worse on certain games
Source: https://github.com/mhauskn/dqn/tree/recurrent
Matthew Hausknecht and Peter Stone
26
SLIDE 51
Outline
- Motivation
- Background: MDP, POMDP, Atari Domain, Deep Q-Network
- Deep Recurrent Q-Network
- Results
- Related Work
- Appendix
27
SLIDE 52