  1. MLSS 2012 in Kyoto: Brain and Reinforcement Learning. Kenji Doya, doya@oist.jp, Neural Computation Unit, Okinawa Institute of Science and Technology.

  2. Location of Okinawa. Okinawa lies within a short flight of major East Asian cities: roughly 1.5 to 3 hours from Tokyo, Seoul, Shanghai, Beijing, Taipei, and Manila.

  3. Okinawa Institute of Science & Technology. Apr. 2004: initial research begins (President: Sydney Brenner). Nov. 2011: becomes a graduate university (President: Jonathan Dorfan). Sept. 2012: Ph.D. course starts, admitting 20 students/year.

  4. Our Research Interests. How to build adaptive, autonomous systems (robot experiments), and how the brain realizes robust, flexible adaptation (neurobiology).

  5. Learning to Walk (Doya & Nakano, 1985). Action: a cycle of 4 postures. Reward: speed sensor output. Problem: the robot learns a long jump followed by a fall, which shows the need for long-term evaluation of actions.

  6. Reinforcement Learning. An agent interacts with an environment: it observes state s, takes action a, and receives reward r. The goal is to learn an action policy s → a that maximizes rewards. Value function: expected future rewards, V(s(t)) = E[r(t) + γ r(t+1) + γ² r(t+2) + γ³ r(t+3) + …], where 0 ≤ γ ≤ 1 is the discount factor. Temporal difference (TD) error: δ(t) = r(t) + γ V(s(t+1)) − V(s(t)).
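
To make the value function and the TD error concrete, here is a minimal tabular TD(0) sketch. It is not from the slides; the chain task and all names are illustrative.

```python
# Minimal sketch of tabular TD(0) value learning (illustrative; not from the slides).
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: delta = r + gamma*V(s') - V(s); V(s) += alpha*delta."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    return delta

# Toy usage: a 5-state chain with reward 1 on reaching the last state.
V = np.zeros(5)
rng = np.random.default_rng(0)
for episode in range(200):
    s = 0
    while s < 4:
        s_next = min(s + rng.integers(1, 3), 4)   # random forward move of 1 or 2 states
        r = 1.0 if s_next == 4 else 0.0
        td0_update(V, s, r, s_next)
        s = s_next
print(V)   # values rise toward the rewarded end of the chain
```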

  7. Example: Grid World. [Figure: a reward field and the resulting value functions for γ = 0.9 and γ = 0.3.]
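
A small sketch of how such value maps can be computed. The grid, reward field, and uniform-random policy here are my own choices and may differ from the figure's setup.

```python
# Sketch: value function of a small grid world under two discount factors
# (grid, reward placement, and policy are illustrative choices).
import numpy as np

def grid_values(reward, gamma, n_iter=200):
    """Evaluate a uniform-random 4-neighbor policy by iterating the Bellman expectation backup."""
    H, W = reward.shape
    V = np.zeros_like(reward, dtype=float)
    for _ in range(n_iter):
        V_new = np.zeros_like(V)
        for i in range(H):
            for j in range(W):
                neighbors = [(max(i - 1, 0), j), (min(i + 1, H - 1), j),
                             (i, max(j - 1, 0)), (i, min(j + 1, W - 1))]
                V_new[i, j] = np.mean([reward[i, j] + gamma * V[ni, nj]
                                       for ni, nj in neighbors])
        V = V_new
    return V

reward = np.zeros((6, 6))
reward[1, 4] = 2.0      # positive reward cell
reward[4, 1] = -2.0     # negative reward cell
for gamma in (0.9, 0.3):
    print(gamma)
    print(np.round(grid_values(reward, gamma), 2))
# A large gamma spreads value far from the rewarded cell; a small gamma keeps it local.
```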

  8. Cart-Pole Swing-Up. Reward: height of the pole. Punishment: collision. The value function is learned over the 4-dimensional state space.

  9. Learning to Stand Up (Morimoto & Doya, 2000). State: joint/head angles and angular velocities. Action: torques to the motors. Reward: head height, minus a penalty for tumbling.

  10. Learning to Survive and Reproduce. Robots catch battery packs (survival) and copy 'genes' through IR ports (reproduction and evolution).

  11. Markov Decision Process (MDP). An MDP consists of states s ∈ S, actions a ∈ A, a policy p(a|s), a reward function r(s,a), and dynamics p(s'|s,a); the agent acts on the environment and receives the next state and reward. Optimal policy: maximize cumulative reward. Finite horizon: E[r(1) + r(2) + r(3) + ... + r(T)]. Infinite horizon: E[r(1) + γ r(2) + γ² r(3) + …], where 0 ≤ γ ≤ 1 is the temporal discount factor. Average reward: E[r(1) + r(2) + ... + r(T)]/T as T → ∞.
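
A minimal container for these ingredients and the three return criteria; the data layout and names are my own.

```python
# Sketch of an MDP container and the three cumulative-reward criteria from the slide.
from dataclasses import dataclass
import numpy as np

@dataclass
class MDP:
    P: np.ndarray       # transition probabilities p(s'|s,a), shape (S, A, S)
    R: np.ndarray       # expected rewards r(s,a), shape (S, A)
    gamma: float = 0.9  # temporal discount factor, 0 <= gamma <= 1

def finite_horizon_return(rewards):
    return float(np.sum(rewards))

def discounted_return(rewards, gamma):
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * np.asarray(rewards)))

def average_reward(rewards):
    return float(np.mean(rewards))
```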

  12. Solving MDPs. Dynamic programming: p(s'|s,a) and r(s,a) are known; solve the Bellman equation V(s) = max_a E[r(s,a) + γ V(s')], where the value function V(s) is the expected reward from state s, and apply the optimal policy a = argmax_a E[r(s,a) + γ V*(s')]; methods include value iteration and policy iteration. Reinforcement learning: p(s'|s,a) and r(s,a) are unknown; learn from actual experience {s,a,r,s,a,r,…}; methods include Monte Carlo, SARSA, Q-learning, Actor-Critic, policy gradient, and model-based RL (learn p(s'|s,a) and r(s,a), then do DP).
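
A value-iteration sketch for the known-model case, using the P and R arrays laid out in the MDP sketch above; a minimal illustration, not from the slides.

```python
# Value iteration: repeat the Bellman optimality backup until convergence,
# then read off the greedy policy (illustrative sketch).
import numpy as np

def value_iteration(P, R, gamma=0.9, n_iter=1000, tol=1e-8):
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(n_iter):
        Q = R + gamma * P @ V       # Q[s,a] = r(s,a) + gamma * sum_s' p(s'|s,a) V(s')
        V_new = Q.max(axis=1)       # V(s) = max_a Q(s,a)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = (R + gamma * P @ V).argmax(axis=1)   # greedy policy w.r.t. the final V
    return V, policy
```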

  13. Actor-Critic and TD Learning. Actor: a parameterized policy P(a|s; w). Critic: learns the value function V(s(t)) = E[r(t) + γ r(t+1) + γ² r(t+2) + …], stored in a table or a neural network. Temporal difference (TD) error: δ(t) = r(t) + γ V(s(t+1)) − V(s(t)). Updates: critic, ΔV(s(t)) = α δ(t); actor, Δw = α δ(t) ∂P(a(t)|s(t);w)/∂w, i.e., reinforce a(t) in proportion to δ(t).
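
A tabular sketch of these updates, assuming a softmax policy over discrete actions and using the log-policy form of the actor gradient (a common variant of the slide's ∂P(a|s;w)/∂w update); all names are illustrative.

```python
# Tabular actor-critic step with a softmax actor (illustrative sketch).
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def actor_critic_step(V, W, s, a, r, s_next, alpha_c=0.1, alpha_a=0.1, gamma=0.9):
    """V: state values, shape (S,); W: policy preferences, shape (S, A)."""
    delta = r + gamma * V[s_next] - V[s]   # TD error
    V[s] += alpha_c * delta                # critic update
    pi = softmax(W[s])
    grad_log = -pi                         # d log pi(a|s) / d W[s, :]
    grad_log[a] += 1.0
    W[s] += alpha_a * delta * grad_log     # actor update: reinforce a(t) by delta(t)
    return delta
```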

  14. SARSA and Q-Learning. Action value function: Q(s,a) = E[r(t) + γ r(t+1) + γ² r(t+2) + … | s(t)=s, a(t)=a]. Action selection: ε-greedy, a = argmax_a Q(s,a) with probability 1−ε; Boltzmann, P(a_i|s) = exp[β Q(s,a_i)] / Σ_j exp[β Q(s,a_j)]. SARSA, an on-policy update: ΔQ(s(t),a(t)) = α {r(t) + γ Q(s(t+1),a(t+1)) − Q(s(t),a(t))}. Q-learning, an off-policy update: ΔQ(s(t),a(t)) = α {r(t) + γ max_a' Q(s(t+1),a') − Q(s(t),a(t))}.
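
The two updates and ε-greedy selection in code, on a tabular Q of shape (S, A); a minimal sketch with illustrative names.

```python
# SARSA (on-policy) and Q-learning (off-policy) updates on a tabular Q (sketch).
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    target = r + gamma * Q[s_next, a_next]   # bootstrap on the action actually taken next
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    target = r + gamma * Q[s_next].max()     # bootstrap on the greedy next action
    Q[s, a] += alpha * (target - Q[s, a])

def epsilon_greedy(Q, s, epsilon=0.1, rng=np.random.default_rng()):
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1])) # explore
    return int(Q[s].argmax())                # exploit
```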

  15. "Lose to Gain" Task. A chain of N states (s₁, s₂, s₃, s₄) with two actions: a₁ yields +r₁ at each step but −r₂ at the end, while a₂ costs −r₁ at each step but yields +r₂ at the end. If r₂ >> r₁, it is better to take a₂ despite the immediate losses.
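
A possible rendering of this chain as a toy environment; the exact placement of ±r₁ and ±r₂ follows my reading of the slide's diagram.

```python
# Toy "lose to gain" chain: action 1 gains +r1 per step but ends with -r2,
# action 2 loses -r1 per step but ends with +r2 (my reading of the diagram).
class LoseToGainChain:
    def __init__(self, n_states=4, r1=1.0, r2=10.0):
        self.n, self.r1, self.r2, self.s = n_states, r1, r2, 0

    def reset(self):
        self.s = 0
        return self.s

    def step(self, action):
        last = self.s == self.n - 1
        if action == 1:
            r = -self.r2 if last else self.r1   # immediate gains, big final loss
        else:
            r = self.r2 if last else -self.r1   # immediate losses, big final gain
        self.s = 0 if last else self.s + 1
        return self.s, r, last                  # (next state, reward, episode done)
```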

  16. Reinforcement Learning, summarized. Predict reward with a value function: V(s) = E[r(t) + γ r(t+1) + γ² r(t+2) + … | s(t)=s] and Q(s,a) = E[r(t) + γ r(t+1) + γ² r(t+2) + … | s(t)=s, a(t)=a]. Select an action: greedy, a = argmax_a Q(s,a); Boltzmann, P(a|s) ∝ exp[β Q(s,a)]. Update the prediction with the TD error: δ(t) = r(t) + γ V(s(t+1)) − V(s(t)), ΔV(s(t)) = α δ(t), ΔQ(s(t),a(t)) = α δ(t). Two questions for the brain: how are these steps implemented, and how are these parameters tuned?
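
The two action-selection rules as code (a minimal sketch; β is the inverse temperature).

```python
# Greedy and Boltzmann (softmax) action selection over Q(s, .) (sketch).
import numpy as np

def greedy(q_values):
    return int(np.argmax(q_values))

def boltzmann(q_values, beta=1.0, rng=np.random.default_rng()):
    prefs = beta * np.asarray(q_values, dtype=float)
    probs = np.exp(prefs - prefs.max())   # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
# Large beta approaches greedy choice; beta -> 0 approaches uniform random choice.
```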

  17. Dopamine Neurons Code the TD Error, δ(t) = r(t) + γ V(s(t+1)) − V(s(t)). [Figure from Schultz et al. 1997: dopamine responses to an unpredicted reward, a fully predicted reward, and an omitted predicted reward; dopamine neurons report rewards according to an error in reward prediction.]
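
A qualitative simulation of the three conditions using the TD-error formula above; the toy cue-reward timeline and value profiles are my own, not the experimental data.

```python
# TD-error traces for unpredicted, predicted, and omitted reward (toy timeline).
import numpy as np

def td_errors(V, rewards, gamma=1.0):
    """delta(t) = r(t) + gamma*V(t+1) - V(t) along a fixed timeline."""
    V = np.append(V, 0.0)
    return np.array([rewards[t] + gamma * V[t + 1] - V[t] for t in range(len(rewards))])

T, cue, rew = 10, 2, 7
r = np.zeros(T)
r[rew] = 1.0
V_naive = np.zeros(T)                     # before learning: reward is unpredicted
V_learned = np.zeros(T)
V_learned[cue:rew + 1] = 1.0              # after learning: value rises at the cue

print(td_errors(V_naive, r))              # positive delta at the reward time
print(td_errors(V_learned, r))            # positive delta moves to cue onset; none at reward
print(td_errors(V_learned, np.zeros(T)))  # negative delta (dip) at the omitted reward time
```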

  18. Basal Ganglia for Reinforcement Learning? (Doya 2000, 2007). Cerebral cortex: state/action coding. Striatum: reward prediction, V(s) and Q(s,a). Pallidum: action selection. Dopamine neurons: TD signal δ. Thalamus closes the loop back to the cortex.

  19. Monkey Free Choice Task (Samejima et al., 2005). [Figure: reward probability for the left choice, P(reward|Left) = Q_L, plotted against that for the right choice, P(reward|Right) = Q_R, with probability pairs such as 90-50, 50-90, 50-50, 50-10, and 10-50 (%). The design dissociates action from reward.]

  20. Action Value Coding in Striatum (Samejima et al., 2005) ! Q L neuron Q L ! Q R ! ! -1 0 sec ! -Q R neuron Q L ! Q R !

  21. Forced and Free Choice Task (Makoto Ito). Trial structure: poke the center hole, a cue tone (0.5-1 s), then a left or right poke (1-2 s), followed by a reward tone and a pellet at the pellet dish, or a no-reward tone. Cue tones and reward probabilities (L, R): left tone (900 Hz), fixed at (50%, 0%); right tone (6500 Hz), fixed at (0%, 50%); free-choice tone (white noise), varied among (90%, 50%), (50%, 90%), (50%, 10%), and (10%, 50%).

  22. Time Course of Choice. [Figure: probability of choosing left, P_L, across trials as the reward-probability pair (P(r|a=L), P(r|a=R)) switches among blocks such as 10-50, 50-10, 90-50, and 50-90; the legend distinguishes left and right choices for tone A and trials with tone B.]

  23. Generalized Q-Learning Model (Ito & Doya, 2009). Action selection: P(a(t)=L) = exp Q_L(t) / (exp Q_L(t) + exp Q_R(t)). Action value update, for i ∈ {L, R}:
      Q_i(t+1) = (1−α₁) Q_i(t) + α₁ κ₁  if a(t)=i and r(t)=1
      Q_i(t+1) = (1−α₁) Q_i(t) − α₁ κ₂  if a(t)=i and r(t)=0
      Q_i(t+1) = (1−α₂) Q_i(t)          if a(t)≠i (whether r(t)=1 or 0)
      Parameters: α₁, learning rate; α₂, forgetting rate; κ₁, reward reinforcement; κ₂, no-reward aversion.
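
The slide's update rules in code, as a direct transcription; only the container layout and function names are mine.

```python
# Generalized Q-learning update and softmax choice rule from the slide (sketch).
import numpy as np

def generalized_q_update(Q, action, reward, alpha1, alpha2, kappa1, kappa2):
    """Q: dict with keys 'L' and 'R'; action in {'L', 'R'}; reward in {0, 1}."""
    for i in Q:
        if i == action:
            if reward == 1:
                Q[i] = (1 - alpha1) * Q[i] + alpha1 * kappa1   # reward reinforcement
            else:
                Q[i] = (1 - alpha1) * Q[i] - alpha1 * kappa2   # no-reward aversion
        else:
            Q[i] = (1 - alpha2) * Q[i]                         # forgetting of the unchosen action
    return Q

def p_choose_left(Q):
    return float(np.exp(Q['L']) / (np.exp(Q['L']) + np.exp(Q['R'])))
```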

  24. Model Fitting by Particle Filter. [Figure: estimated action values Q_L and Q_R and the time-varying parameters α₁ and α₂ across trials, shown with the sequence of left/right choices and reward/no-reward outcomes over blocks (90, 50), (50, 90), and (50, 10).]
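
A minimal bootstrap particle-filter sketch of this kind of trial-by-trial fitting. It is my own simplified illustration that tracks only a drifting α₁ with the other parameters fixed, not the procedure used in the study.

```python
# Bootstrap particle filter tracking a drifting learning rate alpha1 from a
# choice/reward sequence (simplified illustration; alpha2 and kappa1 fixed, kappa2 = 0).
import numpy as np

def particle_filter(choices, rewards, n_particles=1000, drift=0.05, seed=0):
    rng = np.random.default_rng(seed)
    alpha1 = rng.uniform(0.0, 1.0, n_particles)        # per-particle learning rate
    Q = np.zeros((n_particles, 2))                     # per-particle action values [L, R]
    alpha2, kappa1 = 0.1, 1.0
    estimates = []
    for a, r in zip(choices, rewards):                 # a in {0: L, 1: R}, r in {0, 1}
        # 1) weight particles by the likelihood of the observed choice
        p_left = 1.0 / (1.0 + np.exp(Q[:, 1] - Q[:, 0]))
        like = p_left if a == 0 else 1.0 - p_left
        w = like / like.sum()
        estimates.append(float(np.sum(w * alpha1)))    # posterior-mean estimate of alpha1
        # 2) resample in proportion to the weights
        idx = rng.choice(n_particles, n_particles, p=w)
        alpha1, Q = alpha1[idx], Q[idx]
        # 3) propagate: random-walk drift of alpha1, then the model's Q update
        alpha1 = np.clip(alpha1 + drift * rng.standard_normal(n_particles), 0.0, 1.0)
        Q[:, a] = (1 - alpha1) * Q[:, a] + alpha1 * (kappa1 if r == 1 else 0.0)
        Q[:, 1 - a] = (1 - alpha2) * Q[:, 1 - a]
    return np.array(estimates)
```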

  25. Model Comparison by Normalized Likelihood. Candidate models (number of parameters in parentheses): 1st-order Markov model (4), 2nd-order Markov model (16), 3rd-order Markov model (64), 4th-order Markov model (256), local matching law (1), and the generalized Q-learning family with constant or variable parameters: standard Q (2), F-Q (3 const, 2 variable), and DF-Q (4 const, 2 variable). In the generalized model, α₁ is the learning rate, α₂ the forgetting rate, κ₁ reward reinforcement, and κ₂ no-reward aversion; the standard model fixes α₂ = κ₂ = 0 and the forgetting model (F-Q) fixes κ₂ = 0. [Figure: normalized likelihood of each model; * and ** mark significant differences.]

  26. Neural Activity in the Striatum. [Figure: recording sites in the dorsolateral, dorsomedial, and ventral striatum.]

  27. Information of Action and Reward. [Figure: information (bits/sec) about the upcoming action (around exit from the center hole) and about reward (around entry into the choice hole) carried by neurons in the dorsolateral striatum (DL, n = 122), dorsomedial striatum (DM, n = 56), and nucleus accumbens (NA, n = 59), aligned to task events: center poke, tone, left/right poke, and pellet dish.]

  28. Action Value Coded by a DLS Neuron. [Figure: firing rate of a dorsolateral striatum (DLS) neuron during tone presentation across trials, alongside the action value for left estimated by the F-Q learning model.]

  29. State Value Coded by a VS Neuron. [Figure: firing rate of a ventral striatum (VS) neuron during tone presentation across trials, alongside the value estimated by the F-Q learning model.]

  30. Hierarchy in the Cortico-Striatal Network (Voorn et al., 2004). Dorsolateral striatum (motor): early action coding, i.e., what action to take. Dorsomedial striatum (frontal): action value, i.e., in what context. Ventral striatum (limbic): state value, i.e., whether it is worth doing at all.

  31. Specialization by Learning Algorithms (Doya, 1999). Cerebral cortex: unsupervised learning. Basal ganglia: reinforcement learning, driven by the reward signal (SN: substantia nigra), in a loop with the cortex and thalamus. Cerebellum: supervised learning, driven by the error between target and output (IO: inferior olive).
