Reinforcement Learning: From neural processes modelling to Robotics - PowerPoint PPT Presentation

Reinforcement Learning: From neural processes modelling to Robotics applications (slide 1/141). Mehdi Khamassi (CNRS, ISIR-UPMC, Paris), 30 January 2015. Michèle Sebag's course @ Univ.


  1. The Q-learning model (slide 31/141). How can the agent learn a policy, i.e. learn to perform the right actions? Another solution: learning Q-values (qualities), Q : (S, A) → R, stored in a Q-table (state x action):

          a1: North   a2: South   a3: East   a4: West
      s1    0.92        0.10        0.35       0.05
      s2    0.25        0.52        0.43       0.37
      s3    0.78        0.9         1.0        0.81
      s4    0.0         1.0         0.9        0.9
      ...

  [Figure: grid maze with a value overlaid on each cell.]

  2. The Q-learning model (slide 32/141). Same Q-table as on the previous slide. Actions are drawn with a softmax rule: P(a | s) = exp(β · Q(s,a)) / Σ_b exp(β · Q(s,b)). The β parameter regulates the exploration-exploitation trade-off.
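
  A minimal Python sketch of this softmax (Boltzmann) action-selection rule, using the Q-table values from the slide; the code is illustrative, not taken from the lecture.

```python
# Softmax (Boltzmann) action selection over a Q-table; beta controls the
# exploration-exploitation trade-off (low beta: explore, high beta: exploit).
import numpy as np

# Q-table from the slide: rows = states s1..s4, columns = North, South, East, West.
Q = np.array([[0.92, 0.10, 0.35, 0.05],
              [0.25, 0.52, 0.43, 0.37],
              [0.78, 0.90, 1.00, 0.81],
              [0.00, 1.00, 0.90, 0.90]])

def softmax_policy(q_row, beta):
    """P(a|s) = exp(beta*Q(s,a)) / sum_b exp(beta*Q(s,b))."""
    prefs = beta * q_row
    prefs -= prefs.max()                 # for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

state = 0
for beta in (0.1, 1.0, 10.0):            # from near-uniform to near-greedy
    print(beta, softmax_policy(Q[state], beta).round(3))
```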

  3. Different Temporal-Difference (TD) methods (slide 33/141). Three variants: ACTOR-CRITIC, SARSA, Q-LEARNING. Actor-Critic: state-dependent reward prediction error (independent of the action).

  4. Different Temporal-Difference (TD) methods (slide 34/141). Actor-Critic (continued): the same state-dependent reward prediction error is also used to update the actor: P(a_t | s_t) ← P(a_t | s_t) + α · δ_{t+1}.

  5. Different Temporal-Difference (TD) methods (slide 35/141). SARSA: reward prediction error dependent on the action chosen to be performed next.

  6. Different Temporal-Difference (TD) methods (slide 36/141). Q-learning: reward prediction error dependent on the best action.
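
  For reference, the three prediction errors named above can be written as one-line functions. This is a generic sketch (tabular V and Q assumed), not code from the course.

```python
# The three TD errors, for one observed transition (s, a, r, s_next)
# and, for SARSA, the next action a_next actually chosen.
import numpy as np

def td_error_actor_critic(V, s, r, s_next, gamma=0.9):
    # State-dependent, independent of the action: delta = r + gamma*V(s') - V(s).
    # The same delta also updates the actor: P(a|s) <- P(a|s) + alpha*delta.
    return r + gamma * V[s_next] - V[s]

def td_error_sarsa(Q, s, a, r, s_next, a_next, gamma=0.9):
    # Depends on the action chosen next: delta = r + gamma*Q(s',a') - Q(s,a).
    return r + gamma * Q[s_next, a_next] - Q[s, a]

def td_error_qlearning(Q, s, a, r, s_next, gamma=0.9):
    # Depends on the best action: delta = r + gamma*max_b Q(s',b) - Q(s,a).
    return r + gamma * Q[s_next].max() - Q[s, a]
```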

  7. Links with biology (slide 37/141). Activity of dopaminergic neurons.

  8. CLASSICAL CONDITIONING (slide 38/141). TD-learning explains classical conditioning (predictive learning). Taken from Bernard Balleine's lecture at the Okinawa Computational Neuroscience Course (2005).

  9. REINFORCEMENT LEARNING (slide 39/141). Analogy with dopaminergic neurons' activity: δ_{t+1} = r_{t+1} + γ · V(s_{t+1}) - V(s_t) (r: reward; δ: reinforcement signal). [Figure: stimulus S and reward R on a timeline; δ = +1.] Schultz et al. (1993); Houk et al. (1995); Schultz et al. (1997).

  10. REINFORCEMENT LEARNING (slide 40/141). Analogy with dopaminergic neurons' activity: δ_{t+1} = r_{t+1} + γ · V(s_{t+1}) - V(s_t) (r: reward; δ: reinforcement signal). [Figure: stimulus S and reward R on a timeline; δ = +1.] Schultz et al. (1993); Houk et al. (1995); Schultz et al. (1997).

  11. REINFORCEMENT LEARNING (slide 41/141). Analogy with dopaminergic neurons' activity: δ_{t+1} = r_{t+1} + γ · V(s_{t+1}) - V(s_t) (r: reward; δ: reinforcement signal). [Figure: stimulus S and reward R on a timeline; δ = 0.] Schultz et al. (1993); Houk et al. (1995); Schultz et al. (1997).

  12. REINFORCEMENT LEARNING (slide 42/141). Analogy with dopaminergic neurons' activity: δ_{t+1} = r_{t+1} + γ · V(s_{t+1}) - V(s_t) (r: reward; δ: reinforcement signal). [Figure: stimulus S and reward R on a timeline; δ = -1.] Schultz et al. (1993); Houk et al. (1995); Schultz et al. (1997).
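
  Slides 39-42 show the classic pattern: a positive error for an unpredicted reward, no error once the reward is predicted (the response moves to the cue), and a negative error when a predicted reward is omitted. The small TD(0) simulation below reproduces this pattern; it is illustrative only (one state per post-cue time step, and the inter-trial interval is assumed to carry no reward expectation), not the published models.

```python
# Toy TD(0) simulation of the dopamine-like prediction error delta = r + gamma*V(s') - V(s)
# across repeated cue -> reward trials.
import numpy as np

T = 5                # time steps from cue (state 0) to reward (after state T-1)
alpha, gamma = 0.1, 0.9
V = np.zeros(T)      # one value per post-cue time step; terminal value is 0

def run_trial(V, rewarded=True):
    """One trial; returns the prediction error recorded at the reward time."""
    delta_at_reward = None
    for t in range(T):
        r = 1.0 if (rewarded and t == T - 1) else 0.0
        v_next = V[t + 1] if t + 1 < T else 0.0        # terminal state has value 0
        delta = r + gamma * v_next - V[t]
        if t == T - 1:
            delta_at_reward = delta
        V[t] += alpha * delta                           # V(s) <- V(s) + alpha*delta
    return delta_at_reward

print("unpredicted reward, 1st trial:", run_trial(V))           # ~ +1
for _ in range(5000):                                           # training trials
    run_trial(V)
print("predicted reward, after learning:", run_trial(V))        # ~ 0
# Assuming no reward expectation before the cue (ITI value ~ 0), the positive
# error has moved to cue onset: delta_cue = gamma*V(cue state) - 0.
print("delta at cue after learning:", gamma * V[0])             # > 0
print("omitted reward probe:", run_trial(V, rewarded=False))    # ~ -1
```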

  13. The Actor-Critic model (slide 43/141). [Figure after Houk et al. (1995): actor-critic architecture with the dopaminergic neuron.] Barto (1995); Montague et al. (1996); Schultz et al. (1997); Berns and Sejnowski (1996); Suri and Schultz (1999); Doya (2000); Suri et al. (2001); Baldassarre (2002); see Joel et al. (2002) for a review.

  14. The Actor-Critic model (slide 44/141). Which state space as an input? A temporal-order input, also called a tapped-delay line: [0 0 1 0 0 0 0]. [Figure: dopaminergic neuron.] Montague et al. (1996); Suri & Schultz (2001); Daw (2003); Bertin et al. (2007).

  15. The Actor-Critic model (slide 45/141). Which state space as an input? A temporal-order input [0 0 1 0 0 0 0], or spatial or visual information. [Figure: maze with numbered locations 1-5 and a reward site; dopaminergic neuron.]

  16. (slide 46/141) Wide application of RL models to model-based analyses of behavioral and physiological data during decision-making tasks.

  17. Typical probabilistic decision-making task (slide 47/141). Niv et al. (2006), commentary on the results presented in Morris et al. (2006), Nat Neurosci.

  18. Typical probabilistic decision-making task (slide 48/141). Niv et al. (2006), commentary on the results presented in Morris et al. (2006), Nat Neurosci.

  19. Typical probabilistic decision-making task (slide 49/141). Niv et al. (2006), commentary on the results presented in Morris et al. (2006), Nat Neurosci.

  20. Model-based analysis of brain data (slide 50/141). Sequence of observed trials: Left (Reward); Left (Nothing); Right (Nothing); Left (Reward); ... An RL model fitted to the observed choices provides a trial-by-trial prediction-error regressor for the brain responses recorded in the fMRI scanner. Cf. the work of Mathias Pessiglione (ICM) or Giorgio Coricelli (ENS).

  21. Model-based analysis (slide 51/141). Work by Jean Bellot (PhD student): fitting TD-learning models to the behavior of the animal (high fitting error vs. low fitting error). Bellot, Sigaud, Khamassi (2012), SAB conference.

  22. Model-based analysis (slide 52/141). My post-doc work: analysis of single neurons recorded in the monkey dorsolateral prefrontal cortex and anterior cingulate cortex. Correlates of prediction errors? Action values? Level of control/exploration? Khamassi et al. (2013), Prog Brain Res; Khamassi et al. (in revision).

  23. Model-based analysis (slide 53/141). My post-doc work: multiple regression analysis with bootstrap (regressors: Q, δ, β*). Khamassi et al. (2013), Prog Brain Res; Khamassi et al. (in revision).

  24. (slide 54/141) This works well, but:
      • most experiments are single-step;
      • all these cases are discrete;
      • very small numbers of states and actions;
      • we assumed perfect state identification.

  25. Continuous reinforcement learning (slide 55/141). A TD-learning model applied to spatial navigation learning in a robot performing the bio-inspired plus-maze task. [Figure: robot with sensory input, actions 1-5, and reward.] Khamassi et al. (2005), Adaptive Behavior; Khamassi et al. (2006), Lecture Notes in Computer Science.
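
  Because the robot's sensory input is continuous, the tabular algorithms above no longer apply directly. Below is a generic sketch of a TD critic over radial-basis features of a continuous variable; it is a simplified illustration, not the multi-module network of Khamassi et al. (2005).

```python
# TD-learning with continuous input: a critic V(x) = w . phi(x) over Gaussian
# radial-basis features of a 1-D position x, updated with the TD error.
import numpy as np

centers = np.linspace(0.0, 1.0, 20)          # RBF centers along a corridor
sigma, alpha, gamma = 0.05, 0.1, 0.95
w = np.zeros(len(centers))                   # critic weights

def phi(x):
    """Population code of position x (normalized Gaussian radial basis functions)."""
    f = np.exp(-0.5 * ((x - centers) / sigma) ** 2)
    return f / f.sum()

def critic(x):
    return w @ phi(x)

def td_update(x, r, x_next, terminal=False):
    v_next = 0.0 if terminal else critic(x_next)
    delta = r + gamma * v_next - critic(x)
    w[:] = w + alpha * delta * phi(x)        # the same delta would also train an actor
    return delta

# Toy episodes: the agent moves toward x = 1.0, where reward is delivered.
for episode in range(200):
    x = 0.0
    while x < 1.0:
        x_next = min(x + 0.05, 1.0)
        td_update(x, r=1.0 if x_next >= 1.0 else 0.0, x_next=x_next,
                  terminal=x_next >= 1.0)
        x = x_next
print("V(0.1), V(0.9):", round(critic(0.1), 2), round(critic(0.9), 2))
```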

  26. Continuous reinforcement learning (slide 56/141). An Actor-Critic multi-module neural network, with coordination by a self-organizing map.

  27. Continuous reinforcement learning (slide 57/141). [Figure: performance comparison of hand-tuned, autonomous, and random conditions.]

  28. Continuous reinforcement learning (slide 58/141). Two methods for splitting the task among modules: 1. self-organizing maps (SOMs); 2. specialization based on performance (testing each module's capacity for state prediction) (Baldassarre, 2002; Doya et al., 2002). Within a particular subpart of the maze, only the module with the most accurate reward prediction is trained; each module thus becomes an expert responsible for learning in a given task subset.
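
  A much simplified sketch of the performance-based specialization idea (not the authors' implementation): each module computes its own TD error, and only the currently most accurate module is trained.

```python
# Mixture-of-experts style gating: several critic modules, and only the module
# with the most accurate prediction (smallest absolute TD error) is trained,
# so each module gradually becomes the expert for one region of the task.
import numpy as np

n_modules, n_states = 3, 12
alpha, gamma = 0.1, 0.9
V = np.zeros((n_modules, n_states))              # one value table per module

def expert_update(s, r, s_next, terminal=False):
    v_next = 0.0 if terminal else V[:, s_next]
    deltas = r + gamma * v_next - V[:, s]        # TD error of every module
    k = int(np.argmin(np.abs(deltas)))           # the most accurate module wins...
    V[k, s] += alpha * deltas[k]                 # ...and is the only one trained
    # In the full architecture, a self-organizing map (or noise) ensures that
    # different modules end up winning in different parts of the maze.
    return k, deltas[k]
```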

  29. Continuous reinforcement learning (slide 59/141). [Figure: average learning performance.]

  30. Continuous reinforcement learning (slide 60/141). Number of iterations required (average performance during the second half of the experiment):
      1. hand-tuned: 94
      2. specialization based on performance: 3,500
      3. autonomous categorization (SOM): 404
      4. random robot: 30,000

  31. Continuous reinforcement learning (slide 61/141). Same results table as the previous slide.

  32. Outline (slide 62/141).
      1. Model-free Reinforcement Learning: Temporal-Difference RL algorithm; dopamine activity; wide application to the neuroscience of decision-making.
      2. Model-based Reinforcement Learning: off-line learning / replay during sleep; dual-system RL; online parameter tuning (meta-learning); link with neurobehavioral data; applications to robotics.

  33. (slide 63/141) Off-line learning (model-based RL) and replay of hippocampal and prefrontal cortex activity during sleep.

  34. REINFORCEMENT LEARNING (slide 64/141). δ_{t+1} = r_{t+1} + γ · V(s_{t+1}) - V(s_t); V(s_t) ← V(s_t) + α · δ_{t+1}, with learning rate α = 0.1 and discount factor γ = 0.9. Convergence only after N simulations: very long!

  35. TRAINING DURING SLEEP (slide 65/141). Method in Artificial Intelligence: off-line Dyna-Q-learning (Sutton & Barto, 1998).
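
  A minimal Dyna-Q sketch in the spirit of Sutton & Barto (1998), illustrative only: real experience updates both the Q-values and a stored model, and additional off-line planning steps (the computational analogue of replay during sleep) reuse the memorized transitions.

```python
# Dyna-Q: interleave learning from real experience with planning updates drawn
# from a memory of previously observed transitions.
import random
from collections import defaultdict

alpha, gamma = 0.1, 0.9
Q = defaultdict(float)                 # Q[(s, a)]
model = {}                             # model[(s, a)] = (r, s_next)

def q_update(s, a, r, s_next, actions):
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def dyna_step(s, a, r, s_next, actions, n_planning=10):
    q_update(s, a, r, s_next, actions)          # learn from real experience
    model[(s, a)] = (r, s_next)                 # update the internal model
    for _ in range(n_planning):                 # "off-line" replay of stored transitions
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        q_update(ps, pa, pr, ps_next, actions)
```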

  36. Model-based Reinforcement Learning (slide 66/141). Incrementally learn a model of the transition and reward functions, then plan within this model by updates "in the head of the agent" (Sutton, 1990). S: state space; A: action space. Internal model: transition function T : S × A → S; reward function R : S × A → R.
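
  A sketch of such an internal model for a discrete maze (sizes and names are hypothetical): the transition function is estimated from visit counts and the reward function as the average observed reward.

```python
# Learning the internal model from experience: T(s, a, s') from counts,
# R(s, a) as the mean observed reward.
import numpy as np

n_states, n_actions = 16, 4
counts = np.zeros((n_states, n_actions, n_states))   # N(s, a, s')
reward_sum = np.zeros((n_states, n_actions))
visits = np.zeros((n_states, n_actions))

def update_model(s, a, r, s_next):
    counts[s, a, s_next] += 1
    reward_sum[s, a] += r
    visits[s, a] += 1

def T(s, a):
    """Estimated P(s' | s, a); uniform if (s, a) has never been tried."""
    if visits[s, a] == 0:
        return np.full(n_states, 1.0 / n_states)
    return counts[s, a] / visits[s, a]

def R(s, a):
    return reward_sum[s, a] / max(visits[s, a], 1)
```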

  37. Model-based Reinforcement Learning (slide 67/141). s: state of the agent. [Figure: maze with the agent's current state highlighted.]

  38. Model-based Reinforcement Learning (slide 68/141). s: state of the agent. [Figure: the neighboring states carry maxQ = 0.3, maxQ = 0.9 and maxQ = 0.7.]

  39. Model-based Reinforcement Learning (slide 69/141). s: state of the agent; a: action of the agent (go east). [Figure: neighboring states with maxQ = 0.3, 0.9 and 0.7.]

  40. Model-based Reinforcement Learning (slide 70/141). s: state of the agent; a: action of the agent (go east). Stored transition function T: probabilities 0.9, 0.1 and 0 of reaching the three candidate next states shown in the figure.

  41. Model-based Reinforcement Learning (slide 71/141). s: state of the agent; a: action of the agent (go east). With the stored transition function T (probabilities 0.9, 0.1 and 0), the expected value of the action is computed as 0.9 × 0.7 + 0.1 × 0.9 + 0 × 0.3 + ...

  42. Model-based Reinforcement Learning (slide 72/141). No reward prediction error! Only: estimated Q-values, the transition function and the reward function. This process is called Value Iteration (dynamic programming).
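
  A value-iteration sketch over an estimated model of this kind (the T and R arrays are assumed to come from a learned model like the one above). With the slide's numbers, 0.9 × 0.7 + 0.1 × 0.9 + 0 × 0.3 is exactly the expected next-state value of "go east".

```python
# Value iteration: repeated Bellman backups using only the model, no prediction error.
# Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') * max_b Q(s', b)
import numpy as np

def value_iteration(T, R, gamma=0.9, n_sweeps=100, tol=1e-6):
    """T: (S, A, S) transition probabilities; R: (S, A) expected rewards."""
    n_states, n_actions, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_sweeps):
        V = Q.max(axis=1)                      # max_b Q(s', b) for every state
        Q_new = R + gamma * T.dot(V)           # backup for every (s, a) at once
        if np.abs(Q_new - Q).max() < tol:      # stop when the values no longer change
            return Q_new
        Q = Q_new
    return Q
```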

  43. Model-based Reinforcement Learning (slide 73/141). Links with neurobiological data: activity of hippocampal place neurons.

  44. Hippocampal place cells (slide 74/141). Nakazawa, McHugh, Wilson, Tonegawa (2004), Nature Reviews Neuroscience.

  45. Hippocampal place cells (slide 75/141). Reactivation of hippocampal place cells during sleep (Wilson & McNaughton, 1994, Science).

  46. Hippocampal place cells (slide 76/141). Forward replay of hippocampal place cells during sleep; the sequence is compressed about 7 times (Euston et al., 2007, Science).

  47. Sharp-Wave Ripple (SWR) events (slide 77/141). "Ripple" events are irregular bursts of population activity that give rise to brief but intense high-frequency (100-250 Hz) oscillations in the CA1 pyramidal cell layer.

  48. Selective suppression of SWRs impairs spatial memory (slide 78/141). Girardeau G, Benchenane K, Wiener SI, Buzsáki G, Zugaro MB (2009), Nat Neurosci.

  49. Contribution to decision making (forward planning) and evaluation of transitions (slide 79/141). Johnson & Redish (2007), J Neurosci.

  50. Hippocampal place cells (slide 80/141). SUMMARY OF NEUROSCIENCE DATA: place cells replay their sequential activity during sleep (Foster & Wilson, 2006; Euston et al., 2007; Gupta et al., 2010); performance is impaired if this replay is disrupted (Girardeau, Benchenane et al., 2012; Jadhav et al., 2012); only task-related activity is replayed in PFC (Peyrache et al., 2009); the hippocampus may contribute to model-based navigation strategies and the striatum to model-free navigation strategies (Khamassi & Humphries, 2012).

  51. Applications to robot off-line learning (slide 81/141). Work of Jean-Baptiste Mouret et al. @ ISIR. How to recover from damage without needing to identify the damage?

  52. Applications to robot off-line learning (slide 82/141). Work of Jean-Baptiste Mouret et al. @ ISIR. The reality gap: self-model vs. reality; how to use a simulator? Solution: learn a transferability function (how well does the simulation match reality?) with SVMs or neural networks. Idea: the damage is a large reality gap. Koos, Mouret & Doncieux, IEEE Trans Evolutionary Comput, 2012.
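
  An illustrative sketch of the transferability-function idea, with hypothetical data (this is not the authors' code): a regressor learns to predict, from a controller's behavior in simulation, how well that behavior transfers to the real robot, and candidates predicted to transfer poorly are discarded.

```python
# Learning a transferability function: predict the simulation-to-reality match
# of a controller from descriptors of its simulated behavior.
import numpy as np
from sklearn.svm import SVR

# Hypothetical dataset: each row describes one controller evaluated in simulation
# (e.g. gait features); the target is the measured simulation-vs-reality agreement.
sim_descriptors = np.random.rand(50, 6)           # placeholder features
measured_transferability = np.random.rand(50)     # placeholder sim-vs-real scores

transfer_model = SVR(kernel="rbf").fit(sim_descriptors, measured_transferability)

def transferable_enough(candidate_descriptor, threshold=0.5):
    """Keep a candidate controller only if its predicted transferability is high."""
    return transfer_model.predict(candidate_descriptor.reshape(1, -1))[0] > threshold
```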

  53. Applications to robot off-line learning (slide 83/141). Experiments. Koos, Cully & Mouret, Int J Robot Res, 2013.

  54. META-LEARNING (regulation of decision-making) (slide 84/141). 1. Dual-system RL coordination; 2. online parameter tuning.

  55. Multiple decision systems (slide 85/141). Skinner box (instrumental conditioning): a model-based system and a model-free system (Daw, Niv & Dayan, 2005, Nat Neurosci). Behavior is initially model-based and becomes model-free (habitual) with overtraining.

  56. Habitual vs goal-directed: sensitive to changes in outcome (slide 86/141). Yin et al., 2004; Balleine, 2005; Yin & Knowlton, 2006.

  57. Habitual vs goal-directed: sensitive to changes in outcome (slide 87/141). [Figure: outcome devaluation ("Devalue").] Yin et al., 2004; Balleine, 2005; Yin & Knowlton, 2006.

  58. Habitual vs goal-directed: sensitive to changes in outcome (slide 88/141). Yin et al., 2004; Balleine, 2005; Yin & Knowlton, 2006.

  59. Habitual vs goal-directed: sensitive to changes in outcome (slide 89/141). [Figure: habitual vs. goal-directed responding.] Yin et al., 2004; Balleine, 2005; Yin & Knowlton, 2006.

  60. Model-free vs model-based: outcome sensitivity (slide 90/141). Habitual system: slow to update after a change in R; goal-directed system: fast to update after a change in R. Control switches from goal-directed to habitual with experience, reducing computational load. Daw et al., 2005, Nat Neurosci.

  61. Multiple decision systems (slide 91/141). Keramati et al. (2011): extension of the Daw (2005) model with a speed-accuracy trade-off arbitration criterion.
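
  A toy illustration of a speed-accuracy style arbitration (deliberately simplified, not Keramati et al.'s actual value-of-information criterion): answer with cached model-free values when they clearly separate the actions, and pay the time cost of model-based planning only when deliberation seems likely to change the choice.

```python
# Crude proxy for speed-accuracy arbitration between a fast model-free system
# and a slow but accurate model-based system.
import numpy as np

def choose_system(q_mf, uncertainty, planning_cost=0.05):
    """q_mf: cached action values; uncertainty: current estimation noise on them."""
    sorted_q = np.sort(q_mf)[::-1]
    value_gap = sorted_q[0] - sorted_q[1]                  # how clearly one action dominates
    expected_benefit = max(uncertainty - value_gap, 0.0)   # rough gain from deliberating
    return "model-based" if expected_benefit > planning_cost else "model-free"

print(choose_system(np.array([0.9, 0.2, 0.1]), uncertainty=0.1))     # -> model-free
print(choose_system(np.array([0.51, 0.50, 0.49]), uncertainty=0.3))  # -> model-based
```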

  62. Progressive shift from model-based navigation to model-free navigation (slide 92/141). Khamassi & Humphries (2012), Frontiers in Behavioral Neuroscience.

  63. Model-based and model-free navigation strategies (slide 93/141). [Figure: model-based navigation vs. model-free navigation.] Benoît Girard, 2010 UPMC lecture.

  64. Old behavioral evidence for place-based model-based RL (slide 94/141). The Martinet et al. (2011) model applied to the Tolman maze.

  65. Old behavioral evidence for place-based model-based RL (slide 95/141). The Martinet et al. (2011) model applied to the Tolman maze.

  66. MULTIPLE NAVIGATION STRATEGIES IN THE RAT (slide 96/141). [Figure: maze with compass labels N, O (west), E, S; rats with a lesion of the hippocampus vs. rats with a lesion of the dorsal striatum; 180° rotation; previous platform location.] Packard and Knowlton, 2002; Devan and White, 1999.

  67. MULTIPLE DECISION SYSTEMS IN A NAVIGATION MODEL (slide 97/141). Model-based system (hippocampal place cells) and model-free system (basal ganglia). Work by Laurent Dollé: Dollé et al., 2008, 2010, submitted.

  68. MULTIPLE NAVIGATION STRATEGIES IN A TD-LEARNING MODEL (slide 98/141). Task with a cued platform (visible flag) changing location every 4 trials; task of Pearce et al., 1998. Model: Dollé et al., 2010.

  69. PSIKHARPAX ROBOT (slide 99/141). Work by Ken Caluwaerts (2010), Steve N'Guyen (2010), Mariacarla Staffa (2011), Antoine Favre-Félix (2011). Caluwaerts et al. (2012), Bioinspiration & Biomimetics.

  70. PSIKHARPAX ROBOT (slide 100/141). Planning strategy only vs. planning strategy + taxon strategy. Caluwaerts et al. (2012), Bioinspiration & Biomimetics.
