Reinforcement Learning: From neural processes modelling to Robotics - PowerPoint PPT Presentation

Reinforcement Learning: From neural processes modelling to Robotics applications (slide 1/141). Mehdi Khamassi (CNRS, ISIR-UPMC, Paris), 30 January 2015. Michèle Sebag's course @ Univ.


  1. The Q-learning model (slide 31/141). How can the agent learn a policy, i.e. learn to perform the right actions? Another solution: learning Q-values (qualities), Q : (S, A) → R, stored in a Q-table (state x action):

          a1: North   a2: South   a3: East   a4: West
      s1    0.92        0.10        0.35       0.05
      s2    0.25        0.52        0.43       0.37
      s3    0.78        0.9         1.0        0.81
      s4    0.0         1.0         0.9        0.9
      ...

  [Figure: grid maze with a value overlaid on each cell.]

  2. The Q-learning model (slide 32/141). Same Q-table as on the previous slide. Actions are drawn with a softmax rule: P(a | s) = exp(β · Q(s,a)) / Σ_b exp(β · Q(s,b)). The β parameter regulates the exploration-exploitation trade-off.
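
  A minimal Python sketch of this softmax (Boltzmann) action-selection rule, using the Q-table values from the slide; the code is illustrative, not taken from the lecture.

```python
# Softmax (Boltzmann) action selection over a Q-table; beta controls the
# exploration-exploitation trade-off (low beta: explore, high beta: exploit).
import numpy as np

# Q-table from the slide: rows = states s1..s4, columns = North, South, East, West.
Q = np.array([[0.92, 0.10, 0.35, 0.05],
              [0.25, 0.52, 0.43, 0.37],
              [0.78, 0.90, 1.00, 0.81],
              [0.00, 1.00, 0.90, 0.90]])

def softmax_policy(q_row, beta):
    """P(a|s) = exp(beta*Q(s,a)) / sum_b exp(beta*Q(s,b))."""
    prefs = beta * q_row
    prefs -= prefs.max()                 # for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

state = 0
for beta in (0.1, 1.0, 10.0):            # from near-uniform to near-greedy
    print(beta, softmax_policy(Q[state], beta).round(3))
```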

  3. Different Temporal-Difference (TD) methods (slide 33/141). Three variants: ACTOR-CRITIC, SARSA, Q-LEARNING. Actor-Critic: state-dependent reward prediction error (independent of the action).

  4. Different Temporal-Difference (TD) methods (slide 34/141). Actor-Critic (continued): the same state-dependent reward prediction error is also used to update the actor: P(a_t | s_t) ← P(a_t | s_t) + α · δ_{t+1}.

  5. Different Temporal-Difference (TD) methods (slide 35/141). SARSA: reward prediction error dependent on the action chosen to be performed next.

  6. Different Temporal-Difference (TD) methods (slide 36/141). Q-learning: reward prediction error dependent on the best action.
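
  For reference, the three prediction errors named above can be written as one-line functions. This is a generic sketch (tabular V and Q assumed), not code from the course.

```python
# The three TD errors, for one observed transition (s, a, r, s_next)
# and, for SARSA, the next action a_next actually chosen.
import numpy as np

def td_error_actor_critic(V, s, r, s_next, gamma=0.9):
    # State-dependent, independent of the action: delta = r + gamma*V(s') - V(s).
    # The same delta also updates the actor: P(a|s) <- P(a|s) + alpha*delta.
    return r + gamma * V[s_next] - V[s]

def td_error_sarsa(Q, s, a, r, s_next, a_next, gamma=0.9):
    # Depends on the action chosen next: delta = r + gamma*Q(s',a') - Q(s,a).
    return r + gamma * Q[s_next, a_next] - Q[s, a]

def td_error_qlearning(Q, s, a, r, s_next, gamma=0.9):
    # Depends on the best action: delta = r + gamma*max_b Q(s',b) - Q(s,a).
    return r + gamma * Q[s_next].max() - Q[s, a]
```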

  7. Links with biology (slide 37/141). Activity of dopaminergic neurons.

  8. CLASSICAL CONDITIONING (slide 38/141). TD-learning explains classical conditioning (predictive learning). Taken from Bernard Balleine's lecture at the Okinawa Computational Neuroscience Course (2005).

  9. REINFORCEMENT LEARNING (slide 39/141). Analogy with dopaminergic neurons' activity: δ_{t+1} = r_{t+1} + γ · V(s_{t+1}) - V(s_t) (r: reward; δ: reinforcement signal). [Figure: stimulus S and reward R on a timeline; δ = +1.] Schultz et al. (1993); Houk et al. (1995); Schultz et al. (1997).

  10. REINFORCEMENT LEARNING (slide 40/141). Analogy with dopaminergic neurons' activity: δ_{t+1} = r_{t+1} + γ · V(s_{t+1}) - V(s_t) (r: reward; δ: reinforcement signal). [Figure: stimulus S and reward R on a timeline; δ = +1.] Schultz et al. (1993); Houk et al. (1995); Schultz et al. (1997).

  11. REINFORCEMENT LEARNING (slide 41/141). Analogy with dopaminergic neurons' activity: δ_{t+1} = r_{t+1} + γ · V(s_{t+1}) - V(s_t) (r: reward; δ: reinforcement signal). [Figure: stimulus S and reward R on a timeline; δ = 0.] Schultz et al. (1993); Houk et al. (1995); Schultz et al. (1997).

  12. REINFORCEMENT LEARNING (slide 42/141). Analogy with dopaminergic neurons' activity: δ_{t+1} = r_{t+1} + γ · V(s_{t+1}) - V(s_t) (r: reward; δ: reinforcement signal). [Figure: stimulus S and reward R on a timeline; δ = -1.] Schultz et al. (1993); Houk et al. (1995); Schultz et al. (1997).
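
  Slides 39-42 show the classic pattern: a positive error for an unpredicted reward, no error once the reward is predicted (the response moves to the cue), and a negative error when a predicted reward is omitted. The small TD(0) simulation below reproduces this pattern; it is illustrative only (one state per post-cue time step, and the inter-trial interval is assumed to carry no reward expectation), not the published models.

```python
# Toy TD(0) simulation of the dopamine-like prediction error delta = r + gamma*V(s') - V(s)
# across repeated cue -> reward trials.
import numpy as np

T = 5                # time steps from cue (state 0) to reward (after state T-1)
alpha, gamma = 0.1, 0.9
V = np.zeros(T)      # one value per post-cue time step; terminal value is 0

def run_trial(V, rewarded=True):
    """One trial; returns the prediction error recorded at the reward time."""
    delta_at_reward = None
    for t in range(T):
        r = 1.0 if (rewarded and t == T - 1) else 0.0
        v_next = V[t + 1] if t + 1 < T else 0.0        # terminal state has value 0
        delta = r + gamma * v_next - V[t]
        if t == T - 1:
            delta_at_reward = delta
        V[t] += alpha * delta                           # V(s) <- V(s) + alpha*delta
    return delta_at_reward

print("unpredicted reward, 1st trial:", run_trial(V))           # ~ +1
for _ in range(5000):                                           # training trials
    run_trial(V)
print("predicted reward, after learning:", run_trial(V))        # ~ 0
# Assuming no reward expectation before the cue (ITI value ~ 0), the positive
# error has moved to cue onset: delta_cue = gamma*V(cue state) - 0.
print("delta at cue after learning:", gamma * V[0])             # > 0
print("omitted reward probe:", run_trial(V, rewarded=False))    # ~ -1
```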

  13. The Actor-Critic model (slide 43/141). [Figure after Houk et al. (1995): actor-critic architecture with the dopaminergic neuron.] Barto (1995); Montague et al. (1996); Schultz et al. (1997); Berns and Sejnowski (1996); Suri and Schultz (1999); Doya (2000); Suri et al. (2001); Baldassarre (2002); see Joel et al. (2002) for a review.

  14. The Actor-Critic model (slide 44/141). Which state space as an input? A temporal-order input, also called a tapped-delay line: [0 0 1 0 0 0 0]. [Figure: dopaminergic neuron.] Montague et al. (1996); Suri & Schultz (2001); Daw (2003); Bertin et al. (2007).

  15. The Actor-Critic model (slide 45/141). Which state space as an input? A temporal-order input [0 0 1 0 0 0 0], or spatial or visual information. [Figure: maze with numbered locations 1-5 and a reward site; dopaminergic neuron.]

  16. (slide 46/141) Wide application of RL models to model-based analyses of behavioral and physiological data during decision-making tasks.

  17. Typical probabilistic decision-making task (slide 47/141). Niv et al. (2006), commentary on the results presented in Morris et al. (2006), Nat Neurosci.

  18. Typical probabilistic decision-making task (slide 48/141). Niv et al. (2006), commentary on the results presented in Morris et al. (2006), Nat Neurosci.

  19. Typical probabilistic decision-making task (slide 49/141). Niv et al. (2006), commentary on the results presented in Morris et al. (2006), Nat Neurosci.

  20. Model-based analysis of brain data (slide 50/141). Sequence of observed trials: Left (Reward); Left (Nothing); Right (Nothing); Left (Reward); ... An RL model fitted to the observed choices provides a trial-by-trial prediction-error regressor for the brain responses recorded in the fMRI scanner. Cf. the work of Mathias Pessiglione (ICM) or Giorgio Coricelli (ENS).

  21. Model-based analysis (slide 51/141). Work by Jean Bellot (PhD student): fitting TD-learning models to the behavior of the animal (high fitting error vs. low fitting error). Bellot, Sigaud, Khamassi (2012), SAB conference.

  22. Model-based analysis (slide 52/141). My post-doc work: analysis of single neurons recorded in the monkey dorsolateral prefrontal cortex and anterior cingulate cortex. Correlates of prediction errors? Action values? Level of control/exploration? Khamassi et al. (2013), Prog Brain Res; Khamassi et al. (in revision).

  23. Model-based analysis (slide 53/141). My post-doc work: multiple regression analysis with bootstrap (regressors: Q, δ, β*). Khamassi et al. (2013), Prog Brain Res; Khamassi et al. (in revision).

  24. (slide 54/141) This works well, but:
      • most experiments are single-step;
      • all these cases are discrete;
      • very small numbers of states and actions;
      • we assumed perfect state identification.

  25. Continuous reinforcement learning (slide 55/141). A TD-learning model applied to spatial navigation learning in a robot performing the bio-inspired plus-maze task. [Figure: robot with sensory input, actions 1-5, and reward.] Khamassi et al. (2005), Adaptive Behavior; Khamassi et al. (2006), Lecture Notes in Computer Science.
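
  Because the robot's sensory input is continuous, the tabular algorithms above no longer apply directly. Below is a generic sketch of a TD critic over radial-basis features of a continuous variable; it is a simplified illustration, not the multi-module network of Khamassi et al. (2005).

```python
# TD-learning with continuous input: a critic V(x) = w . phi(x) over Gaussian
# radial-basis features of a 1-D position x, updated with the TD error.
import numpy as np

centers = np.linspace(0.0, 1.0, 20)          # RBF centers along a corridor
sigma, alpha, gamma = 0.05, 0.1, 0.95
w = np.zeros(len(centers))                   # critic weights

def phi(x):
    """Population code of position x (normalized Gaussian radial basis functions)."""
    f = np.exp(-0.5 * ((x - centers) / sigma) ** 2)
    return f / f.sum()

def critic(x):
    return w @ phi(x)

def td_update(x, r, x_next, terminal=False):
    v_next = 0.0 if terminal else critic(x_next)
    delta = r + gamma * v_next - critic(x)
    w[:] = w + alpha * delta * phi(x)        # the same delta would also train an actor
    return delta

# Toy episodes: the agent moves toward x = 1.0, where reward is delivered.
for episode in range(200):
    x = 0.0
    while x < 1.0:
        x_next = min(x + 0.05, 1.0)
        td_update(x, r=1.0 if x_next >= 1.0 else 0.0, x_next=x_next,
                  terminal=x_next >= 1.0)
        x = x_next
print("V(0.1), V(0.9):", round(critic(0.1), 2), round(critic(0.9), 2))
```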

  26. Continuous reinforcement learning (slide 56/141). An Actor-Critic multi-module neural network, with coordination by a self-organizing map.

  27. Continuous reinforcement learning (slide 57/141). [Figure: performance comparison of hand-tuned, autonomous, and random conditions.]

  28. Continuous reinforcement learning (slide 58/141). Two methods for splitting the task among modules: 1. self-organizing maps (SOMs); 2. specialization based on performance (testing each module's capacity for state prediction) (Baldassarre, 2002; Doya et al., 2002). Within a particular subpart of the maze, only the module with the most accurate reward prediction is trained; each module thus becomes an expert responsible for learning in a given task subset.
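
  A much simplified sketch of the performance-based specialization idea (not the authors' implementation): each module computes its own TD error, and only the currently most accurate module is trained.

```python
# Mixture-of-experts style gating: several critic modules, and only the module
# with the most accurate prediction (smallest absolute TD error) is trained,
# so each module gradually becomes the expert for one region of the task.
import numpy as np

n_modules, n_states = 3, 12
alpha, gamma = 0.1, 0.9
V = np.zeros((n_modules, n_states))              # one value table per module

def expert_update(s, r, s_next, terminal=False):
    v_next = 0.0 if terminal else V[:, s_next]
    deltas = r + gamma * v_next - V[:, s]        # TD error of every module
    k = int(np.argmin(np.abs(deltas)))           # the most accurate module wins...
    V[k, s] += alpha * deltas[k]                 # ...and is the only one trained
    # In the full architecture, a self-organizing map (or noise) ensures that
    # different modules end up winning in different parts of the maze.
    return k, deltas[k]
```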

  29. Continuous reinforcement learning (slide 59/141). [Figure: average learning performance.]

  30. Continuous reinforcement learning (slide 60/141). Number of iterations required (average performance during the second half of the experiment):
      1. hand-tuned: 94
      2. specialization based on performance: 3,500
      3. autonomous categorization (SOM): 404
      4. random robot: 30,000

  31. Continuous reinforcement learning (slide 61/141). Same results table as the previous slide.

  32. Outline (slide 62/141).
      1. Model-free Reinforcement Learning: Temporal-Difference RL algorithm; dopamine activity; wide application to the neuroscience of decision-making.
      2. Model-based Reinforcement Learning: off-line learning / replay during sleep; dual-system RL; online parameter tuning (meta-learning); link with neurobehavioral data; applications to robotics.

  33. (slide 63/141) Off-line learning (model-based RL) and replay of hippocampal and prefrontal cortex activity during sleep.

  34. REINFORCEMENT LEARNING (slide 64/141). δ_{t+1} = r_{t+1} + γ · V(s_{t+1}) - V(s_t); V(s_t) ← V(s_t) + α · δ_{t+1}, with learning rate α = 0.1 and discount factor γ = 0.9. Convergence only after N simulations: very long!

  35. TRAINING DURING SLEEP (slide 65/141). Method in Artificial Intelligence: off-line Dyna-Q-learning (Sutton & Barto, 1998).
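
  A minimal Dyna-Q sketch in the spirit of Sutton & Barto (1998), illustrative only: real experience updates both the Q-values and a stored model, and additional off-line planning steps (the computational analogue of replay during sleep) reuse the memorized transitions.

```python
# Dyna-Q: interleave learning from real experience with planning updates drawn
# from a memory of previously observed transitions.
import random
from collections import defaultdict

alpha, gamma = 0.1, 0.9
Q = defaultdict(float)                 # Q[(s, a)]
model = {}                             # model[(s, a)] = (r, s_next)

def q_update(s, a, r, s_next, actions):
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def dyna_step(s, a, r, s_next, actions, n_planning=10):
    q_update(s, a, r, s_next, actions)          # learn from real experience
    model[(s, a)] = (r, s_next)                 # update the internal model
    for _ in range(n_planning):                 # "off-line" replay of stored transitions
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        q_update(ps, pa, pr, ps_next, actions)
```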

  36. Model-based Reinforcement Learning (slide 66/141). Incrementally learn a model of the transition and reward functions, then plan within this model by updates "in the head of the agent" (Sutton, 1990). S: state space; A: action space. Internal model: transition function T : S × A → S; reward function R : S × A → R.
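
  A sketch of such an internal model for a discrete maze (sizes and names are hypothetical): the transition function is estimated from visit counts and the reward function as the average observed reward.

```python
# Learning the internal model from experience: T(s, a, s') from counts,
# R(s, a) as the mean observed reward.
import numpy as np

n_states, n_actions = 16, 4
counts = np.zeros((n_states, n_actions, n_states))   # N(s, a, s')
reward_sum = np.zeros((n_states, n_actions))
visits = np.zeros((n_states, n_actions))

def update_model(s, a, r, s_next):
    counts[s, a, s_next] += 1
    reward_sum[s, a] += r
    visits[s, a] += 1

def T(s, a):
    """Estimated P(s' | s, a); uniform if (s, a) has never been tried."""
    if visits[s, a] == 0:
        return np.full(n_states, 1.0 / n_states)
    return counts[s, a] / visits[s, a]

def R(s, a):
    return reward_sum[s, a] / max(visits[s, a], 1)
```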

  37. Model-based Reinforcement Learning (slide 67/141). s: state of the agent. [Figure: maze with the agent's current state highlighted.]

  38. Model-based Reinforcement Learning (slide 68/141). s: state of the agent. [Figure: the neighboring states carry maxQ = 0.3, maxQ = 0.9 and maxQ = 0.7.]

  39. Model-based Reinforcement Learning (slide 69/141). s: state of the agent; a: action of the agent (go east). [Figure: neighboring states with maxQ = 0.3, 0.9 and 0.7.]

  40. Model-based Reinforcement Learning (slide 70/141). s: state of the agent; a: action of the agent (go east). Stored transition function T: probabilities 0.9, 0.1 and 0 of reaching the three candidate next states shown in the figure.

  41. Model-based Reinforcement Learning (slide 71/141). s: state of the agent; a: action of the agent (go east). With the stored transition function T (probabilities 0.9, 0.1 and 0), the expected value of the action is computed as 0.9 × 0.7 + 0.1 × 0.9 + 0 × 0.3 + ...

  42. Model-based Reinforcement Learning (slide 72/141). No reward prediction error! Only: estimated Q-values, the transition function and the reward function. This process is called Value Iteration (dynamic programming).
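
  A value-iteration sketch over an estimated model of this kind (the T and R arrays are assumed to come from a learned model like the one above). With the slide's numbers, 0.9 × 0.7 + 0.1 × 0.9 + 0 × 0.3 is exactly the expected next-state value of "go east".

```python
# Value iteration: repeated Bellman backups using only the model, no prediction error.
# Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') * max_b Q(s', b)
import numpy as np

def value_iteration(T, R, gamma=0.9, n_sweeps=100, tol=1e-6):
    """T: (S, A, S) transition probabilities; R: (S, A) expected rewards."""
    n_states, n_actions, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_sweeps):
        V = Q.max(axis=1)                      # max_b Q(s', b) for every state
        Q_new = R + gamma * T.dot(V)           # backup for every (s, a) at once
        if np.abs(Q_new - Q).max() < tol:      # stop when the values no longer change
            return Q_new
        Q = Q_new
    return Q
```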

  43. Model-based Reinforcement Learning (slide 73/141). Links with neurobiological data: activity of hippocampal place neurons.

  44. Hippocampal place cells (slide 74/141). Nakazawa, McHugh, Wilson, Tonegawa (2004), Nature Reviews Neuroscience.

  45. Hippocampal place cells (slide 75/141). Reactivation of hippocampal place cells during sleep (Wilson & McNaughton, 1994, Science).

  46. Hippocampal place cells (slide 76/141). Forward replay of hippocampal place cells during sleep; the sequence is compressed about 7 times (Euston et al., 2007, Science).

  47. Sharp-Wave Ripple (SWR) events (slide 77/141). "Ripple" events are irregular bursts of population activity that give rise to brief but intense high-frequency (100-250 Hz) oscillations in the CA1 pyramidal cell layer.

  48. Selective suppression of SWRs impairs spatial memory (slide 78/141). Girardeau G, Benchenane K, Wiener SI, Buzsáki G, Zugaro MB (2009), Nat Neurosci.

  49. Contribution to decision making (forward planning) and evaluation of transitions (slide 79/141). Johnson & Redish (2007), J Neurosci.

  50. Hippocampal place cells (slide 80/141). SUMMARY OF NEUROSCIENCE DATA: place cells replay their sequential activity during sleep (Foster & Wilson, 2006; Euston et al., 2007; Gupta et al., 2010); performance is impaired if this replay is disrupted (Girardeau, Benchenane et al., 2012; Jadhav et al., 2012); only task-related activity is replayed in PFC (Peyrache et al., 2009); the hippocampus may contribute to model-based navigation strategies and the striatum to model-free navigation strategies (Khamassi & Humphries, 2012).

  51. Applications to robot off-line learning (slide 81/141). Work of Jean-Baptiste Mouret et al. @ ISIR. How to recover from damage without needing to identify the damage?

  52. Applications to robot off-line learning (slide 82/141). Work of Jean-Baptiste Mouret et al. @ ISIR. The reality gap: self-model vs. reality; how to use a simulator? Solution: learn a transferability function (how well does the simulation match reality?) with SVMs or neural networks. Idea: the damage is a large reality gap. Koos, Mouret & Doncieux, IEEE Trans Evolutionary Comput, 2012.
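
  An illustrative sketch of the transferability-function idea, with hypothetical data (this is not the authors' code): a regressor learns to predict, from a controller's behavior in simulation, how well that behavior transfers to the real robot, and candidates predicted to transfer poorly are discarded.

```python
# Learning a transferability function: predict the simulation-to-reality match
# of a controller from descriptors of its simulated behavior.
import numpy as np
from sklearn.svm import SVR

# Hypothetical dataset: each row describes one controller evaluated in simulation
# (e.g. gait features); the target is the measured simulation-vs-reality agreement.
sim_descriptors = np.random.rand(50, 6)           # placeholder features
measured_transferability = np.random.rand(50)     # placeholder sim-vs-real scores

transfer_model = SVR(kernel="rbf").fit(sim_descriptors, measured_transferability)

def transferable_enough(candidate_descriptor, threshold=0.5):
    """Keep a candidate controller only if its predicted transferability is high."""
    return transfer_model.predict(candidate_descriptor.reshape(1, -1))[0] > threshold
```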

  53. Applications to robot off-line learning (slide 83/141). Experiments. Koos, Cully & Mouret, Int J Robot Res, 2013.

  54. META-LEARNING (regulation of decision-making) (slide 84/141). 1. Dual-system RL coordination; 2. online parameter tuning.

  55. Multiple decision systems (slide 85/141). Skinner box (instrumental conditioning): a model-based system and a model-free system (Daw, Niv & Dayan, 2005, Nat Neurosci). Behavior is initially model-based and becomes model-free (habitual) with overtraining.

  56. Habitual vs goal-directed: sensitive to changes in outcome (slide 86/141). Yin et al., 2004; Balleine, 2005; Yin & Knowlton, 2006.

  57. Habitual vs goal-directed: sensitive to changes in outcome (slide 87/141). [Figure: outcome devaluation ("Devalue").] Yin et al., 2004; Balleine, 2005; Yin & Knowlton, 2006.

  58. Habitual vs goal-directed: sensitive to changes in outcome (slide 88/141). Yin et al., 2004; Balleine, 2005; Yin & Knowlton, 2006.

  59. Habitual vs goal-directed: sensitive to changes in outcome (slide 89/141). [Figure: habitual vs. goal-directed responding.] Yin et al., 2004; Balleine, 2005; Yin & Knowlton, 2006.

  60. Model-free vs model-based: outcome sensitivity (slide 90/141). Habitual system: slow to update after a change in R; goal-directed system: fast to update after a change in R. Control switches from goal-directed to habitual with experience, reducing computational load. Daw et al., 2005, Nat Neurosci.

  61. Multiple decision systems (slide 91/141). Keramati et al. (2011): extension of the Daw (2005) model with a speed-accuracy trade-off arbitration criterion.
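
  A toy illustration of a speed-accuracy style arbitration (deliberately simplified, not Keramati et al.'s actual value-of-information criterion): answer with cached model-free values when they clearly separate the actions, and pay the time cost of model-based planning only when deliberation seems likely to change the choice.

```python
# Crude proxy for speed-accuracy arbitration between a fast model-free system
# and a slow but accurate model-based system.
import numpy as np

def choose_system(q_mf, uncertainty, planning_cost=0.05):
    """q_mf: cached action values; uncertainty: current estimation noise on them."""
    sorted_q = np.sort(q_mf)[::-1]
    value_gap = sorted_q[0] - sorted_q[1]                  # how clearly one action dominates
    expected_benefit = max(uncertainty - value_gap, 0.0)   # rough gain from deliberating
    return "model-based" if expected_benefit > planning_cost else "model-free"

print(choose_system(np.array([0.9, 0.2, 0.1]), uncertainty=0.1))     # -> model-free
print(choose_system(np.array([0.51, 0.50, 0.49]), uncertainty=0.3))  # -> model-based
```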

  62. Progressive shift from model-based navigation to model-free navigation (slide 92/141). Khamassi & Humphries (2012), Frontiers in Behavioral Neuroscience.

  63. Model-based and model-free navigation strategies (slide 93/141). [Figure: model-based navigation vs. model-free navigation.] Benoît Girard, 2010 UPMC lecture.

  64. Old behavioral evidence for place-based model-based RL (slide 94/141). The Martinet et al. (2011) model applied to the Tolman maze.

  65. Old behavioral evidence for place-based model-based RL (slide 95/141). The Martinet et al. (2011) model applied to the Tolman maze.

  66. MULTIPLE NAVIGATION STRATEGIES IN THE RAT (slide 96/141). [Figure: maze with compass labels N, O (west), E, S; rats with a lesion of the hippocampus vs. rats with a lesion of the dorsal striatum; 180° rotation; previous platform location.] Packard and Knowlton, 2002; Devan and White, 1999.

  67. MULTIPLE DECISION SYSTEMS IN A NAVIGATION MODEL (slide 97/141). Model-based system (hippocampal place cells) and model-free system (basal ganglia). Work by Laurent Dollé: Dollé et al., 2008, 2010, submitted.

  68. MULTIPLE NAVIGATION STRATEGIES IN A TD-LEARNING MODEL (slide 98/141). Task with a cued platform (visible flag) changing location every 4 trials; task of Pearce et al., 1998. Model: Dollé et al., 2010.

  69. PSIKHARPAX ROBOT (slide 99/141). Work by Ken Caluwaerts (2010), Steve N'Guyen (2010), Mariacarla Staffa (2011), Antoine Favre-Félix (2011). Caluwaerts et al. (2012), Bioinspiration & Biomimetics.

  70. PSIKHARPAX ROBOT (slide 100/141). Planning strategy only vs. planning strategy + taxon strategy. Caluwaerts et al. (2012), Bioinspiration & Biomimetics.
