Neural Map: Structured Memory for Deep Reinforcement Learning - PowerPoint PPT Presentation

CS 885 Reinforcement Learning Neural Map: Structured Memory for Deep Reinforcement Learning Emilio Parisotto and Ruslan Salakhutdinov, ICLR 2018 Presented by Andreas Stöckel July 6, 2018

motivation: navigating partially observable environments (i) 1

motivation: navigating partially observable environments (i) Screenshot of “The Elder Scrolls: Arena”, 1994 1

mandates memory units tend to forget quickly incorporating locality into DNC motivation: navigating partially observable environments (ii) Partially observable environment Reduce interference by interference (DNC) have problems with Difgerentiable Neural Computer External memories such as Long-short-term memory (LSTM) Observations Reinforcement learning setup through arbitrary environments Discrete a t agent reaches goal Sparse reward r , i.e. only when Partially observable s t 2 ▶ State: ▶ Reward: ▶ Actions: ▶ Goal: Find policy π ( a | s ) navigating

units tend to forget quickly incorporating locality into DNC motivation: navigating partially observable environments (ii) Reinforcement learning setup Reduce interference by interference (DNC) have problems with Difgerentiable Neural Computer External memories such as Long-short-term memory (LSTM) mandates memory Observations through arbitrary environments Discrete a t agent reaches goal Sparse reward r , i.e. only when Partially observable s t 2 ▶ State: ▶ Partially observable environment ▶ Reward: ▶ Actions: ▶ Goal: Find policy π ( a | s ) navigating

incorporating locality into DNC motivation: navigating partially observable environments (ii) through arbitrary environments Reduce interference by interference (DNC) have problems with Difgerentiable Neural Computer External memories such as mandates memory Reinforcement learning setup Observations Discrete a t agent reaches goal Sparse reward r , i.e. only when Partially observable s t 2 ▶ State: ▶ Partially observable environment ▶ Reward: ▶ Long-short-term memory (LSTM) units tend to forget quickly ▶ Actions: ▶ Goal: Find policy π ( a | s ) navigating

incorporating locality into DNC motivation: navigating partially observable environments (ii) through arbitrary environments Reduce interference by interference (DNC) have problems with Difgerentiable Neural Computer mandates memory Reinforcement learning setup Observations agent reaches goal Discrete a t Sparse reward r , i.e. only when Partially observable s t 2 ▶ State: ▶ Partially observable environment ▶ Reward: ▶ Long-short-term memory (LSTM) units tend to forget quickly ▶ External memories such as ▶ Actions: ▶ Goal: Find policy π ( a | s ) navigating

motivation: navigating partially observable environments (ii) through arbitrary environments interference (DNC) have problems with Difgerentiable Neural Computer mandates memory Reinforcement learning setup Observations Discrete a t agent reaches goal Sparse reward r , i.e. only when Partially observable s t 2 ▶ State: ▶ Partially observable environment ▶ Reward: ▶ Long-short-term memory (LSTM) units tend to forget quickly ▶ External memories such as ▶ Actions: ▶ Goal: Find policy π ( a | s ) navigating ⇒ Reduce interference by incorporating locality into DNC

overview I Background Computer Actor-Critic II Neural Map Network III Empirical Evaluation IV Summary & Conclusion ▶ Memory Systems ▶ 2D Goal-Search Environment ▶ Difgerentiable Neural ▶ 3D Doom Environment ▶ Asynchronous Advantage

Part I BACKGROUND

3 external memories for deep neural networks EXTERNAL NEURAL NETWORK MEMORIES No write operator Write operator Memory networks Di � erentiable Neural Computer Convolutional network over External memory matrix with past M states di � erentiable access operations

t c R t , t e T t v T differentiable neural computer (dnc): read and write t c W t c W 1 M t 1 M t erase vector e t , value v t Write operation: Given write context c W t M T r t t Given read context vector c R Read operation: 4 ▶ External memory matrix M ∈ R W × H ▶ Associative memory: ▶ Associate context with value c t → v t ▶ Given ˜ c t ≈ c t retrieve r t ≈ v t

differentiable neural computer (dnc): read and write Given read context vector c R t erase vector e t , value v t Given write context c W t t 4 ▶ External memory matrix M ∈ R W × H ▶ Read operation: ▶ Associative memory: r t = M T t c R ▶ Associate context with value ▶ Write operation: c t → v t t , ▶ Given ˜ c t ≈ c t retrieve r t ≈ v t M t + 1 = M t ◦ ( 1 − c W t ) + c W t e T t v T

asynchronous advantage actor-critic (a3c) as baseline (Actor-Critic) 5 ▶ REINFORCE policy gradient descent with value function V π ( s ) ( ) ∇ π log π ( a t | s t ) G t − V π ( s t ) ∞ ∑ G t = γ k R t + k k = 0 ▶ Here: π ( a | s ) is deep neural network

Part II NEURAL MAP NETWORK

neural map: overview (i) 6 W x , y x , y r t W c t SMAX C M t M t+1 READ WRITE f CNN f NN H s t INPUT o t π ( a | s ) SMAX OUTPUT OUT f NN

neural map: overview (ii) r t f OUT Policy o t Output Update Variables Write c t Context t Read Input state Time step Location 7 Neural map Operations t ∈ Z = read ( M t ) ( x t , y t ) ∈ R 2 = context ( M t , s t , r t ) w ( x t , y t ) s t , r t , c t , M ( x t , y t ) ( ) s t ∈ R n = write t + 1 M t , w ( x t , y t ) ( ) M t ∈ R C × H × W = update M t + 1 t + 1 r t , c t , w ( x t , y t ) [ ] = t + 1 ( ) π ( s t | a ) = SoftMax NN ( o t )

neural map: read 8 W x , y x , y r t c t W SMAX C M t+1 M t READ WRITE f CNN f NN H s t o t INPUT π ( a | s ) SMAX OUTPUT OUT f NN ▶ Summarize memory in single “read vector” r t = f READ CNN ( M t ) ∈ R C

exp a x y z w exp a z w a x y q t M x y M x y neural map: context t t t t x y c t x y x y t t C t 9 W x , y x , y r t W c t SMAX C M t+1 M t READ WRITE f CNN f NN H s t o t INPUT π ( a | s ) SMAX OUTPUT OUT f NN ▶ Decode context c t based on input state s t [ ] q t = W ∈ R C s t , r t

exp a x y z w exp a z w M x y neural map: context x y C t t x y x y c t t t t t t 9 W x , y x , y r t W c t SMAX C M t+1 M t READ WRITE f CNN f NN H s t o t INPUT π ( a | s ) SMAX OUTPUT OUT f NN ▶ Decode context c t based on input state s t [ ] q t = W ∈ R C s t , r t a ( x , y ) q t , M ( x , y ) ⟨ ⟩ = ∈ R

M x y neural map: context t C t t x y x y c t t t t t exp 9 W x , y x , y r t W c t SMAX C M t+1 M t READ WRITE f CNN f NN H s t o t INPUT π ( a | s ) SMAX OUTPUT OUT f NN ▶ Decode context c t based on input state s t a ( x , y ) ( ) α ( x , y ) [ ] q t = W ∈ R C = ) ∈ R s t , r t a ( z , w ) ( ∑ z , w exp a ( x , y ) q t , M ( x , y ) ⟨ ⟩ = ∈ R

neural map: context t t t t t t t exp 9 W x , y x , y r t W c t SMAX C M t+1 M t READ WRITE f CNN f NN H s t o t INPUT π ( a | s ) SMAX OUTPUT OUT f NN ▶ Decode context c t based on input state s t a ( x , y ) ( ) α ( x , y ) [ ] q t = W ∈ R C = ) ∈ R s t , r t a ( z , w ) ( ∑ z , w exp a ( x , y ) q t , M ( x , y ) x , y α ( x , y ) M ( x , y ) ∑ ⟨ ⟩ = ∈ R c t = ∈ R C

neural map: write & update NN t 10 W x , y x , y r t W c t SMAX C M t M t+1 READ WRITE f CNN f NN H s t o t INPUT π ( a | s ) SMAX OUTPUT OUT f NN ▶ Compute content of memory for t + 1 at location x , y  w ( x t , y t ) if ( a , b ) = ( x t , y t )  w ( x t , y t ) s t , r t , c t , M ( x t , y t ) M ( a , b ) t + 1 ([ ]) = f WRITE = t + 1 t + 1 M ( a , b ) if ( a , b ) ̸ = ( x t , y t )  t + 1

neural map: output & policy t f OUT 11 W x , y x , y r t W c t SMAX C M t M t+1 READ WRITE f CNN f NN H s t o t INPUT π ( a | s ) SMAX OUTPUT OUT f NN ▶ Compute policy π ( a | s ) c t , r t , M ( x t , y t ) [ ] ( ) o t = π ( a | s ) = SoftMax NN ( o t )

LSTM s t r t c t 1 h t context M t h t neural map: extensions Use GRU equations to disable/weaken memory update LSTM controller for 3D environments W , H too small, confuses network (reads what it just wrote) Use LSTM to remember read/write operations h t 1 c t Egocentric navigation Real-world agents do not know their global x , y location Always write to centre of memory, translate memory on movement 12 ▶ Write operation: Gated Recurrent Units (GRUs)

neural map: extensions Use GRU equations to disable/weaken memory update Egocentric navigation Real-world agents do not know their global x , y location Always write to centre of memory, translate memory on movement 12 ▶ Write operation: Gated Recurrent Units (GRUs) ▶ LSTM controller for 3D environments ▶ W , H too small, confuses network (reads what it just wrote) ⇒ Use LSTM to remember read/write operations h t = LSTM ( s t , r t , c t − 1 , h t − 1 ) c t = context ( M t , h t )

Neural Map: Structured Memory for Deep Reinforcement Learning - PowerPoint PPT Presentation

CS 885 Reinforcement Learning Neural Map: Structured Memory for Deep Reinforcement Learning Emilio Parisotto and Ruslan Salakhutdinov, ICLR 2018 Presented by Andreas Stckel July 6, 2018 motivation: navigating partially observable

Deep Neural Networks and Deep Reinforcement Learning Deep Learning, Goodfellow, Bengio and

A STRUCTURED L IFE A STRUCTURED L IFE A STRUCTURED L IFE A STRUCTURED L IFE A STRUCTURED L IFE

Neural Map Structured Memory for Deep RL Emilio Parisotto eparisot@andrew.cmu.edu PhD Student

Memory Management Chester Rebeiro IIT Madras Memory map of process 1 Process 1 Memory map of

Structured Prediction Introduction What is structured prediction? CS 6355: Structured Prediction

PLAYING ATARI WITH DEEP REINFORCEMENT LEARNING NEURAL NETWORK VISION FOR ROBOT DRIVING ARJUN

1 Deep Reinforcement Learning Qianqian Li, Nayeon Koong, Langtian He What is deep reinforcement

Memory II. Memory improvement III. Problems with memory 3 systems/stages of Memory: memory

Reinforcement Learning Timothy Chou Charlie Tong Vincent Zhuang April 19, 2016 Reinforcement

Deep Reinforcement Learning [Human-Level Control through deep reinforcement learning, Nature

Reinforcement Learning AIMA Chapters: 21.1, 21.2, 21.3. Sutton and Barto, Reinforcement Learning:

Neural Information Retrieval Wassila Lalouani 1 Plan Neural network architectures Neural

Smart Snek A neural map implementation What is neural mapping? Provided that the robot has

map-D map-D data refined map-D data refined map-D A GPU Database for Real-Time Big Data

Deep Reinforcement Learning [Mastering the Game of Go with Deep Reinforcement Learning and Tree

CHAPTER II III I CHAPTER Neural Networks as Neural Networks as Associative Memory

Associative memories 9/25/2014 Memorized associations are ubiquitous Stimulus Response

Laws! George Wilson Data61/CSIRO george.wilson@data61.csiro.au 28th August 2018 class Monoid m

Double Groups and Semigroups Science Atlantic 2014 University of New Brunswick, Saint John

Theory of Computation Chapter 4: Boolean Logic Guan-Shieng Huang Apr. 7, 2003 Feb. 19, 2006

Fast TracKing at ATLAS Why and How Jamie Saxon University of Chicago What is the FTK?

s Sgn w s i ij j i j Synchronous / Asynchronous updating

The Fast (but not furious) TracKer The Fast (but not furious) TracKer for ATLAS: for ATLAS: a

Semantically Far Inspirations Considered Harmful? Accounting For Cognitive States In