SLIDE 1
Deep Recurrent Q-Learning for Partially Observable MDPs
Matthew Hausknecht and Peter Stone
University of Texas at Austin
November 13, 2015
1
SLIDE 2
SLIDE 3
Motivation
- Intelligent decision making is the heart of AI
- Desire agents capable of learning to act intelligently in diverse environments
2
SLIDE 4
Motivation
- Intelligent decision making is the heart of AI
- Desire agents capable of learning to act intelligently in diverse environments
- Reinforcement Learning provides a general learning framework
2
SLIDE 5
Motivation
- Intelligent decision making is the heart of AI
- Desire agents capable of learning to act intelligently in diverse environments
- Reinforcement Learning provides a general learning framework
- RL + deep neural networks yields robust controllers that learn from pixels (DQN)
2
SLIDE 6
Motivation
- Intelligent decision making is the heart of AI
- Desire agents capable of learning to act intelligently in diverse environments
- Reinforcement Learning provides a general learning framework
- RL + deep neural networks yields robust controllers that learn from pixels (DQN)
- DQN lacks mechanisms for handling partial observability
2
SLIDE 7
Motivation
- Intelligent decision making is the heart of AI
- Desire agents capable of learning to act intelligently in diverse environments
- Reinforcement Learning provides a general learning framework
- RL + deep neural networks yields robust controllers that learn from pixels (DQN)
- DQN lacks mechanisms for handling partial observability
- Extend DQN to handle Partially Observable Markov Decision Processes (POMDPs)
2
SLIDE 8
Outline
- Motivation
- Background: MDP, POMDP, Atari Domain, Deep Q-Network
- Deep Recurrent Q-Network
- Results
- Related Work
- Appendix
3
SLIDE 9
Markov Decision Process (MDP)
[Diagram: agent–environment loop with state s_t, action a_t, reward r_t]
At each timestep, the agent performs action a_t and receives reward r_t and next state s_{t+1} from the environment
4
SLIDE 10
Markov Decision Process (MDP)
[Diagram: agent–environment loop with state s_t, action a_t, reward r_t]
At each timestep, the agent performs action a_t and receives reward r_t and next state s_{t+1} from the environment. The Markov property ensures that s_{t+1} depends only on s_t and a_t
4
SLIDE 11
Markov Decision Process (MDP)
[Diagram: agent–environment loop with state s_t, action a_t, reward r_t]
At each timestep, the agent performs action a_t and receives reward r_t and next state s_{t+1} from the environment. The Markov property ensures that s_{t+1} depends only on s_t and a_t, so learning an optimal policy π∗ requires no memory of past states
4
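As a short formal aside (the standard definition, not taken from the slides), the Markov property can be written as:

```latex
% Markov property: the next-state distribution depends only on the current state and action
P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)
```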
SLIDE 12
Partially Observable Markov Decision Process (POMDP)
[Diagram: agent–environment loop with observation o_t, action a_t, reward r_t]
The true state of the environment is hidden. Observations o_t provide only partial information.
5
SLIDE 13
Partially Observable Markov Decision Process (POMDP)
[Diagram: agent–environment loop with observation o_t, action a_t, reward r_t]
The true state of the environment is hidden. Observations o_t provide only partial information.
Memory of past observations may help infer the true system state and improve the policy
5
SLIDE 14
Atari Domain
[Atari game screen, annotated with observation, action, and score]
- 160 × 210 state space → 84 × 84 grayscale
- 18 discrete actions
- Rewards clipped to {−1, 0, 1}
Source: www.arcadelearningenvironment.org
6
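For concreteness, here is a minimal preprocessing sketch matching these numbers; the OpenCV/NumPy calls are my assumption for illustration, not the authors' pipeline.

```python
import cv2
import numpy as np

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Downscale a raw RGB Atari frame (160 x 210 screen) to 84 x 84 grayscale."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)

def clip_reward(reward: float) -> float:
    """Clip rewards to {-1, 0, 1} by keeping only their sign."""
    return float(np.sign(reward))
```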
SLIDE 15
Atari Domain: MDP or POMDP?
[Atari game screen, annotated with observation, action, and score]
7
SLIDE 16
Atari Domain: MDP or POMDP?
[Atari game screen, annotated with observation, action, and score]
Depends on the state representation!
7
SLIDE 17
Atari Domain: MDP or POMDP?
[Atari game screen, annotated with observation, action, and score]
Depends on the state representation!
- Single Frame ⇒ POMDP
- Four Frames ⇒ MDP
- Console RAM ⇒ MDP
7
SLIDE 18
Deep Q-Network (DQN)
[Architecture: 4 × 84 × 84 input → Conv1 → Conv2 → Conv3 → IP1 (512) → Q-Values (18)]
- Model-free reinforcement learning method using a deep neural network as a Q-value function approximator (Mnih et al., 2015)
- Takes the last four game screens as input: enough to make most Atari games Markov
8
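For readers who want to see the shape of the network, here is a minimal PyTorch sketch of the architecture on this slide. Layer sizes follow Mnih et al. (2015); the PyTorch framing is my assumption, since the authors' released code is Caffe-based.

```python
# Illustrative DQN sketch: 4 stacked 84x84 frames -> Conv1-3 -> IP1 (512) -> 18 Q-values.
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, num_actions: int = 18):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # Conv1
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # Conv2
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # Conv3
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # IP1
            nn.Linear(512, num_actions),            # one Q-value per action
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: batch of 4 stacked 84x84 grayscale screens
        return self.fc(self.conv(frames))

q_net = DQN()
q_values = q_net(torch.zeros(1, 4, 84, 84))  # -> shape (1, 18)
```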
SLIDE 19
Deep Q-Network (DQN)
[Architecture: 4 × 84 × 84 input → Conv1 → Conv2 → Conv3 → IP1 (512) → Q-Values (18)]
- Model-free reinforcement learning method using a deep neural network as a Q-value function approximator (Mnih et al., 2015)
- Takes the last four game screens as input: enough to make most Atari games Markov
How well does DQN perform in partially observed domains?
8
SLIDE 20
Flickering Atari
[Diagram: agent–environment loop with observation o_t, action a_t, reward r_t]
Induce partial observability by stochastically obscuring the game screen
9
SLIDE 21
Flickering Atari
[Diagram: agent–environment loop with observation o_t, action a_t, reward r_t]
Induce partial observability by stochastically obscuring the game screen
o_t = s_t with probability 1/2, and <0, . . . , 0> (an all-zero observation) otherwise
9
SLIDE 22
Flickering Atari
[Diagram: agent–environment loop with observation o_t, action a_t, reward r_t]
Induce partial observability by stochastically obscuring the game screen
o_t = s_t with probability 1/2, and <0, . . . , 0> (an all-zero observation) otherwise
Game state must now be inferred from past observations
9
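A minimal sketch of this observation model; the function name and NumPy usage are illustrative assumptions, not the paper's code.

```python
import numpy as np

_rng = np.random.default_rng()

def flicker(screen: np.ndarray, p: float = 0.5) -> np.ndarray:
    """Return the true game screen with probability p, otherwise an all-zero observation."""
    return screen if _rng.random() < p else np.zeros_like(screen)
```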
SLIDE 23
DQN Pong
[True Game Screen vs. Perceived Game Screen]
10
SLIDE 24
DQN Flickering Pong
[True Game Screen vs. Perceived Game Screen]
11
SLIDE 25
Outline
- Motivation
- Background: MDP, POMDP, Atari Domain, Deep Q-Network
- Deep Recurrent Q-Network
- Results
- Related Work
- Appendix
12
SLIDE 26
Deep Recurrent Q-Network
[Architecture: one 84 × 84 frame per timestep → convolutions → LSTM (512) → Q-Values (18), unrolled over timesteps . . . , t − 1, t]
Long Short-Term Memory: Hochreiter and Schmidhuber (1997)
13
SLIDE 27
Deep Recurrent Q-Network
[Architecture: one 84 × 84 frame per timestep → convolutions → LSTM (512) → Q-Values (18), unrolled over timesteps . . . , t − 1, t]
Long Short-Term Memory: Hochreiter and Schmidhuber (1997)
Identical to DQN except:
- Replaces DQN’s IP1 with a recurrent LSTM layer of the same dimension
- Each timestep takes a single frame as input
13
SLIDE 28
Deep Recurrent Q-Network
[Architecture: one 84 × 84 frame per timestep → convolutions → LSTM (512) → Q-Values (18), unrolled over timesteps . . . , t − 1, t]
Long Short-Term Memory: Hochreiter and Schmidhuber (1997)
Identical to DQN except:
- Replaces DQN’s IP1 with a recurrent LSTM layer of the same dimension
- Each timestep takes a single frame as input
LSTM provides a selective memory of past game states
13
SLIDE 29
Deep Recurrent Q-Network
[Architecture: one 84 × 84 frame per timestep → convolutions → LSTM (512) → Q-Values (18), unrolled over timesteps . . . , t − 1, t]
Long Short-Term Memory: Hochreiter and Schmidhuber (1997)
Identical to DQN except:
- Replaces DQN’s IP1 with a recurrent LSTM layer of the same dimension
- Each timestep takes a single frame as input
LSTM provides a selective memory of past game states
Trained end-to-end using BPTT, unrolled for the last 10 timesteps
13
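A minimal PyTorch sketch of this change, assuming the same convolutional stack as DQN; this is illustrative only, not the authors' Caffe implementation.

```python
# Illustrative DRQN sketch: DQN's conv stack, with the 512-unit IP1 layer replaced by a
# 512-unit LSTM; each timestep sees a single 84x84 frame.
import torch
import torch.nn as nn

class DRQN(nn.Module):
    def __init__(self, num_actions: int = 18, hidden: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(64 * 7 * 7, hidden, batch_first=True)  # replaces IP1
        self.head = nn.Linear(hidden, num_actions)

    def forward(self, frames, state=None):
        # frames: (batch, time, 1, 84, 84) -- one grayscale frame per timestep
        b, t = frames.shape[:2]
        feats = self.conv(frames.reshape(b * t, 1, 84, 84)).reshape(b, t, -1)
        out, state = self.lstm(feats, state)  # selective memory over past frames
        return self.head(out), state          # Q-values at every timestep

# Unrolled over 10 timesteps, mirroring the BPTT setup described on the slide.
q_values, hidden = DRQN()(torch.zeros(2, 10, 1, 84, 84))  # q_values: (2, 10, 18)
```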
SLIDE 30
DRQN Maximal Activations
Unit detects the agent missing the ball
14
SLIDE 31
DRQN Maximal Activations
- Unit detects the agent missing the ball
- Unit detects ball reflection on paddle
14
SLIDE 32
DRQN Maximal Activations
- Unit detects the agent missing the ball
- Unit detects ball reflection on paddle
- Unit detects ball reflection on wall
14
SLIDE 33
Outline
- Motivation
- Background: MDP, POMDP, Atari Domain, Deep Q-Network
- Deep Recurrent Q-Network
- Results
- Related Work
- Appendix
15
SLIDE 34
DRQN Flickering Pong
[True Game Screen vs. Perceived Game Screen]
16
SLIDE 35
Flickering Pong
17
SLIDE 36
Pong Generalization: POMDP ⇒ MDP
How does DRQN generalize when trained on Flickering Pong and evaluated on standard Pong?
18
SLIDE 37
Pong Generalization: POMDP ⇒ MDP
[Plot: Pong score vs. observation probability (0.0–1.0) for DRQN, 1-frame DQN, 10-frame DQN, and 4-frame DQN]
18
SLIDE 38
Performance on Flickering Atari Games
Game          10-frame DRQN (±std)   10-frame DQN (±std)
Pong          12.1 (±2.2)            −9.9 (±3.3)
19
SLIDE 39
Performance on Flickering Atari Games
Game          10-frame DRQN (±std)   10-frame DQN (±std)
Pong          12.1 (±2.2)            −9.9 (±3.3)
Beam Rider    618 (±115)             1685.6 (±875)
19
SLIDE 40
Performance on Flickering Atari Games
Game          10-frame DRQN (±std)   10-frame DQN (±std)
Pong          12.1 (±2.2)            −9.9 (±3.3)
Beam Rider    618 (±115)             1685.6 (±875)
Asteroids     1032 (±410)            1010 (±535)
Bowling       65.5 (±13)             57.3 (±8)
Centipede     4319.2 (±4378)         5268.1 (±2052)
Chopper Cmd   1330 (±294)            1450 (±787.8)
Double Dunk   −14 (±2.5)             −16.2 (±2.6)
Frostbite     414 (±494)             436 (±462.5)
Ice Hockey    −5.4 (±2.7)            −4.2 (±1.5)
Ms. Pacman    1739 (±942)            1824 (±490)
19
SLIDE 41
Performance on Standard Atari Games
Game          10-frame DRQN (±std)   10-frame DQN (±std)
Double Dunk   −2 (±7.8)              −10 (±3.5)
Frostbite     2875 (±535)            519 (±363)
20
SLIDE 42
Performance on Standard Atari Games
Game          10-frame DRQN (±std)   10-frame DQN (±std)
Double Dunk   −2 (±7.8)              −10 (±3.5)
Frostbite     2875 (±535)            519 (±363)
Beam Rider    3269 (±1167)           6923 (±1027)
20
SLIDE 43
Performance on Standard Atari Games
Game          10-frame DRQN (±std)   10-frame DQN (±std)
Double Dunk   −2 (±7.8)              −10 (±3.5)
Frostbite     2875 (±535)            519 (±363)
Beam Rider    3269 (±1167)           6923 (±1027)
Asteroids     1020 (±312)            1070 (±345)
Bowling       62 (±5.9)              72 (±11)
Centipede     3534 (±1601)           3653 (±1903)
Chopper Cmd   2070 (±875)            1460 (±976)
Ice Hockey    −4.4 (±1.6)            −3.5 (±3.5)
Ms. Pacman    2048 (±653)            2363 (±735)
20
SLIDE 44
Performance on Standard Atari Games
21
SLIDE 45
DRQN Frostbite
[True Game Screen vs. Perceived Game Screen]
22
SLIDE 46
Generalization: MDP ⇒ POMDP
How does DRQN generalize when trained on standard Atari and evaluated on flickering Atari?
23
SLIDE 47
Generalization: MDP ⇒ POMDP
[Plot: percentage of original score vs. observation probability (0.1–0.9) for DRQN and DQN]
23
SLIDE 48
Outline
- Motivation
- Background: MDP, POMDP, Atari Domain, Deep Q-Network
- Deep Recurrent Q-Network
- Results
- Related Work
- Appendix
24
SLIDE 49
Related Work
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
Narasimhan, K., Kulkarni, T., and Barzilay, R. (2015). Language understanding for text-based games using deep reinforcement learning. CoRR, abs/1506.08941.
Wierstra, D., Foerster, A., Peters, J., and Schmidhuber, J. (2007). Solving deep memory POMDPs with recurrent policy gradients.
25
SLIDE 50
Thanks!
[Diagram: DRQN architecture, one 84 × 84 frame per timestep with an LSTM, unrolled over timesteps . . . , t − 1, t]
- LSTM can help deal with partial observability
- Largest gains in generalization between MDP ⇔ POMDP
- Future work: understanding why DRQN does better/worse on certain games
Source: https://github.com/mhauskn/dqn/tree/recurrent
Matthew Hausknecht and Peter Stone
26
SLIDE 51
Outline
- Motivation
- Background: MDP, POMDP, Atari Domain, Deep Q-Network
- Deep Recurrent Q-Network
- Results
- Related Work
- Appendix
27
SLIDE 52