MLSS 2012 in Kyoto
Brain and Reinforcement Learning
Kenji Doya (doya@oist.jp)
Neural Computation Unit, Okinawa Institute of Science and Technology
Location of Okinawa
! Map: flight times to Okinawa: Tokyo 2.5 h, Seoul 2.5 h, Beijing 3 h, Shanghai 2 h, Taipei 1.5 h, Manila 2.5 h
Okinawa Institute of Science & Technology
! Apr. 2004: Initial research (President: Sydney Brenner)
! Nov. 2011: Graduate university (President: Jonathan Dorfan)
! Sept. 2012: Ph.D. course, 20 students/year
Our Research Interests
! How to build adaptive, autonomous systems → robot experiments
! How the brain realizes robust, flexible adaptation → neurobiology
Learning to Walk
(Doya & Nakano, 1985)
! Action: cycle of 4 postures
! Reward: speed sensor output
! Problem: a long jump followed by a fall → need for long-term evaluation of action
Reinforcement Learning
! Learn an action policy s → a to maximize rewards
! Value function: expected future rewards
  V(s(t)) = E[ r(t) + γr(t+1) + γ²r(t+2) + γ³r(t+3) + … ],  0 ≤ γ ≤ 1: discount factor
! Temporal difference (TD) error:
  δ(t) = r(t) + γV(s(t+1)) − V(s(t))
! Diagram: agent-environment loop (agent observes state s, takes action a, receives reward r)
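A minimal sketch (not from the slides) of tabular TD(0) value learning using the TD error above; the environment interface (env.reset, env.step, env.sample_action) is a hypothetical stand-in for a task such as the grid world on the next slide.

import numpy as np

def td0_value_learning(env, n_states, episodes=200, alpha=0.1, gamma=0.9):
    """Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V = np.zeros(n_states)
    for _ in range(episodes):
        s = env.reset()                      # initial state (integer index)
        done = False
        while not done:
            a = env.sample_action()          # any behavior policy
            s_next, r, done = env.step(a)
            delta = r + gamma * V[s_next] * (not done) - V[s]   # TD error
            V[s] += alpha * delta
            s = s_next
    return V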
Example: Grid World
! Reward field (figure)
! Value function for γ = 0.9 and for γ = 0.3 (figures)
Cart-Pole Swing-Up
! Reward: height of the pole
! Punishment: collision
! Value function in the 4D state space
Learning to Stand Up
(Morimoto & Doya, 2000)
! State: joint/head angles, angular velocity
! Action: torques to motors
! Reward: head height − tumble
Learning to Survive and Reproduce
! Catch battery packs → survival
! Copy 'genes' via IR ports → reproduction, evolution
Markov Decision Process (MDP)
! Markov decision process
  ! state s ∈ S
  ! action a ∈ A
  ! policy p(a|s)
  ! reward r(s,a)
  ! dynamics p(s'|s,a)
! Optimal policy: maximize cumulative reward
  ! finite horizon: E[ r(1) + r(2) + r(3) + ... + r(T) ]
  ! infinite horizon: E[ r(1) + γr(2) + γ²r(3) + … ],  0 ≤ γ ≤ 1: temporal discount factor
  ! average reward: E[ r(1) + r(2) + ... + r(T) ]/T,  T → ∞
! Diagram: agent-environment loop (state s, action a, reward r)
Solving MDPs
Dynamic Programming
! p(s'|s,a) and r(s,a) are known
! Solve the Bellman equation
  V(s) = max_a E[ r(s,a) + γV(s') ]
  ! V(s): value function, the expected reward from state s
! Apply the optimal policy
  a = argmax_a E[ r(s,a) + γV*(s') ]
! Value iteration
! Policy iteration
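A minimal value-iteration sketch (an illustration, not the lecture's code), assuming the MDP is given as transition matrices P[a] and a reward table R; it repeatedly applies the Bellman backup above and then reads off the greedy policy.

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Solve V(s) = max_a E[ r(s,a) + gamma * V(s') ] by repeated backup.
    P[a][s, s']: transition probability, R[s, a]: expected reward."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        Q = np.array([R[:, a] + gamma * P[a] @ V for a in range(n_actions)]).T
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, Q.argmax(axis=1)    # optimal values and greedy policy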
Reinforcement Learning
! p(s'|s,a) and r(s,a) are unknown
! Learn from actual experience {s, a, r, s, a, r, …}
  ! Monte Carlo
  ! SARSA
  ! Q-learning
  ! Actor-Critic
  ! Policy gradient
! Model-based: learn p(s'|s,a) and r(s,a), then apply DP
Actor-Critic and TD learning
! Actor: parameterized policy P(a|s; w)
! Critic: learned value function
  V(s(t)) = E[ r(t) + γr(t+1) + γ²r(t+2) + … ]
  ! stored in a table or a neural network
! Temporal difference (TD) error:
  δ(t) = r(t) + γV(s(t+1)) − V(s(t))
! Update
  ! Critic: ΔV(s(t)) = α δ(t)
  ! Actor: Δw = α δ(t) ∂P(a(t)|s(t);w)/∂w … reinforce a(t) in proportion to δ(t)
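A tabular actor-critic sketch under the same hypothetical environment interface as above; it uses a softmax (Boltzmann) policy and the log-policy gradient, one common concrete form of the actor update on this slide.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def actor_critic(env, n_states, n_actions, episodes=500,
                 alpha_v=0.1, alpha_w=0.1, gamma=0.9):
    """Critic learns V(s); actor adjusts softmax preferences w[s, a]
    in proportion to the TD error delta (reinforcing the taken action)."""
    V = np.zeros(n_states)
    w = np.zeros((n_states, n_actions))           # policy parameters
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            p = softmax(w[s])
            a = np.random.choice(n_actions, p=p)
            s_next, r, done = env.step(a)
            delta = r + gamma * V[s_next] * (not done) - V[s]   # TD error
            V[s] += alpha_v * delta                # critic update
            w[s] -= alpha_w * delta * p            # gradient of log softmax ...
            w[s, a] += alpha_w * delta             # ... with respect to w[s, :]
            s = s_next
    return V, w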
SARSA and Q Learning
! Action value function
  Q(s,a) = E[ r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s, a(t)=a ]
! Action selection
  ! ε-greedy: a = argmax_a Q(s,a) with probability 1−ε
  ! Boltzmann: P(ai|s) = exp[ βQ(s,ai) ] / Σj exp[ βQ(s,aj) ]
! SARSA: on-policy update
  ΔQ(s(t),a(t)) = α{ r(t) + γQ(s(t+1),a(t+1)) − Q(s(t),a(t)) }
! Q learning: off-policy update
  ΔQ(s(t),a(t)) = α{ r(t) + γ max_a' Q(s(t+1),a') − Q(s(t),a(t)) }
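A tabular Q-learning sketch with ε-greedy action selection (same hypothetical environment interface as above). SARSA would replace the max over next actions with the value of the action actually taken on the next step.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Off-policy Q-learning: the target uses max_a' Q(s', a')."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)   # explore
            else:
                a = int(Q[s].argmax())             # exploit
            s_next, r, done = env.step(a)
            target = r + gamma * Q[s_next].max() * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q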
“Lose to Gain” Task
! N states, 2 actions
! If r2 ≫ r1, it is better to take a2
! Figure: state chain s1-s4; action a1 collects small rewards (±r1), action a2 leads toward the large reward (±r2)
Reinforcement Learning
! Predict reward: value function
  V(s) = E[ r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s ]
  Q(s,a) = E[ r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s, a(t)=a ]
! Select action
  ! greedy: a = argmax_a Q(s,a)
  ! Boltzmann: P(a|s) ∝ exp[ βQ(s,a) ]
! Update prediction: TD error
  δ(t) = r(t) + γV(s(t+1)) − V(s(t))
  ΔV(s(t)) = α δ(t)
  ΔQ(s(t),a(t)) = α δ(t)
! How to implement these steps? How to tune these parameters?
Dopamine Neurons Code the TD Error
δ(t) = r(t) + γV(s(t+1)) − V(s(t))
(Schultz et al., 1997)
! Figure: dopamine neuron responses to unpredicted, predicted, and omitted rewards, reporting an error in reward prediction
Basal Ganglia for Reinforcement Learning?
(Doya 2000, 2007)
! Cerebral cortex: state/action coding
! Striatum: reward prediction, V(s) and Q(s,a)
! Pallidum: action selection
! Dopamine neurons: TD signal δ
! Thalamus
Monkey Free Choice Task
(Samejima et al., 2005)
! P(reward | Left) = QL, P(reward | Right) = QR
! Figure: blocks with (left, right) reward probabilities 10-50, 50-10, 50-90, 50-50 (%)
Dissociation of action and reward !
Action Value Coding in Striatum
(Samejima et al., 2005)
! Figure: example QL neuron and −QR neuron (activity from −1 to 0 sec before movement)
Forced and Free Choice Task
Makoto Ito
! Trial sequence: center poke, cue tone (0.5-1 s), left/right choice poke (1-2 s), reward tone or no-reward tone, pellet delivery at the dish
! Cue tone and reward probability (L, R):
  ! Left tone (900 Hz): fixed (50%, 0%)
  ! Right tone (6500 Hz): fixed (0%, 50%)
  ! Free-choice tone (white noise): varied among (90%, 50%), (50%, 90%), (50%, 10%), (10%, 50%)
Time Course of Choice
! Figure: probability of left choice (PL) across trials as the reward probabilities P(r|a=L), P(r|a=R) switch between blocks (90-50, 50-90, 50-10, 10-50)
Generalized Q-learning Model
(Ito & Doya, 2009)
! Action selection
  P(a(t)=L) = exp QL(t) / (exp QL(t) + exp QR(t))
! Action value update, for i ∈ {L, R}:
  Qi(t+1) = (1−α1)Qi(t) + α1κ1   if a(t)=i and r(t)=1
  Qi(t+1) = (1−α1)Qi(t) − α1κ2   if a(t)=i and r(t)=0
  Qi(t+1) = (1−α2)Qi(t)          if a(t)≠i (either outcome)
! Parameters
  ! α1: learning rate
  ! α2: forgetting rate
  ! κ1: reward reinforcement
  ! κ2: no-reward aversion
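A direct transcription of the update rule on this slide into a sketch (illustrative only; the variable names are mine): one trial updates both action values, with forgetting of the unchosen one.

import numpy as np

def generalized_q_update(Q, action, reward, alpha1, alpha2, kappa1, kappa2):
    """One trial of the generalized Q-learning update (Ito & Doya, 2009).
    Q: dict with keys 'L' and 'R'; action in {'L', 'R'}; reward in {0, 1}."""
    Q = dict(Q)
    for i in Q:
        if i == action:
            if reward == 1:
                Q[i] = (1 - alpha1) * Q[i] + alpha1 * kappa1   # reinforcement
            else:
                Q[i] = (1 - alpha1) * Q[i] - alpha1 * kappa2   # no-reward aversion
        else:
            Q[i] = (1 - alpha2) * Q[i]                         # forget unchosen value
    return Q

def choose_action(Q):
    """Softmax choice between left and right."""
    pL = np.exp(Q['L']) / (np.exp(Q['L']) + np.exp(Q['R']))
    return 'L' if np.random.rand() < pL else 'R'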
! Figure: estimated QL and QR across trials, with trial outcomes (left/right choice, reward/no-reward) and block reward probabilities (90, 50), (50, 90), (50, 10)
Model Fitting by Particle Filter
! Figure: trial-by-trial estimates of α1 and α2
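A schematic particle-filter sketch of how time-varying parameters might be tracked trial by trial; this illustrates the general technique, not the authors' implementation, and it fixes κ1 = 1, κ2 = 0 purely for brevity.

import numpy as np

def particle_filter_fit(actions, rewards, n_particles=1000, diffusion=0.02):
    """Track time-varying (alpha1, alpha2) of a forgetting Q-learning model.
    actions: 0 = left, 1 = right; rewards: 0 or 1, one entry per trial."""
    alphas = np.random.uniform(0, 1, size=(n_particles, 2))   # particle parameters
    Q = np.zeros((n_particles, 2))                            # per-particle QL, QR
    weights = np.ones(n_particles) / n_particles
    estimates = []
    for a, r in zip(actions, rewards):
        # 1) random-walk diffusion of the parameters
        alphas = np.clip(alphas + diffusion * np.random.randn(n_particles, 2), 0, 1)
        # 2) reweight by the likelihood of the observed choice
        pL = np.exp(Q[:, 0]) / (np.exp(Q[:, 0]) + np.exp(Q[:, 1]))
        weights *= pL if a == 0 else 1 - pL
        weights /= weights.sum()
        estimates.append((alphas * weights[:, None]).sum(axis=0))
        # 3) each particle updates its Q values with its own parameters
        a1, a2 = alphas[:, 0], alphas[:, 1]
        Q[:, a] = (1 - a1) * Q[:, a] + a1 * r      # chosen action (kappa1 = 1)
        Q[:, 1 - a] = (1 - a2) * Q[:, 1 - a]       # unchosen action forgotten
        # 4) resample when the effective sample size becomes small
        if 1.0 / (weights ** 2).sum() < n_particles / 2:
            idx = np.random.choice(n_particles, n_particles, p=weights)
            alphas, Q = alphas[idx], Q[idx]
            weights = np.ones(n_particles) / n_particles
    return np.array(estimates)     # posterior-mean (alpha1, alpha2) per trial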
Model Fitting
! Generalized Q learning
  ! α1: learning
  ! α2: forgetting
  ! κ1: reinforcement
  ! κ2: aversion
! standard Q: α2 = κ2 = 0
! forgetting Q: κ2 = 0
! Figure: normalized likelihoods of candidate models (number of parameters in parentheses): 1st-4th order Markov models (4, 16, 64, 256), standard Q, F-Q, DF-Q with constant parameters (2, 3, 4), local matching law (1), and standard Q, F-Q, DF-Q with variable parameters (2, 2, 2)
Neural Activity in the Striatum
! Dorsolateral ! Dorsomedial ! Ventral
Information of Action and Reward
! Figure: information (bits/sec) about action (exit from the center hole) and reward (entry into the choice hole) carried by neurons in DL (n = 122), DM (n = 56), and NA (n = 59), aligned to task events (center poke, tone, left/right poke, pellet dish)
Action value coded by a DLS neuron
! Figure: firing rate during tone presentation (blue) across trials, compared with the action value for left estimated by FQ-learning (DLS neuron)
State value coded by a VS neuron
! Figure: firing rate during tone presentation (blue) across trials, compared with the value estimated by FQ-learning (VS neuron)
Hierarchy in Cortico-Striatal Network
! Dorsolateral striatum (motor): early action coding; what action to take?
! Dorsomedial striatum (frontal): action value; in what context?
! Ventral striatum (limbic): state value; whether worth doing?
(Voorn et al., 2004)
Specialization by Learning Algorithms
(Doya, 1999)
! Cerebellum: supervised learning (target/error signals via the inferior olive, IO)
! Basal ganglia: reinforcement learning (reward signal via dopamine neurons, SN)
! Cerebral cortex: unsupervised learning
! Figure: input-output circuits of the cortex, basal ganglia, and cerebellum through the thalamus
Multiple Action Selection Schemes
! Model-free: a = argmax_a Q(s,a)
! Model-based: a = argmax_a [ r + V(f(s,a)) ], with forward model f(s,a)
! Lookup table: a = g(s)
! Diagram: three routes from state s to action a, via Q(s,a), via the forward model f and V, or via the lookup table g
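A small sketch contrasting the first two selection schemes; Q, V, the forward model f, and the reward function r are hypothetical stand-ins (arrays or callables supplied by the user).

import numpy as np

def model_free_action(Q, s):
    """Model-free: pick the action with the highest cached Q(s, a)."""
    return int(np.argmax(Q[s]))

def model_based_action(s, actions, f, r, V):
    """Model-based: simulate each action with the forward model f(s, a)
    and pick a = argmax_a [ r(s, a) + V(f(s, a)) ]."""
    returns = [r(s, a) + V[f(s, a)] for a in actions]
    return actions[int(np.argmax(returns))]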
‘Grid Sailing’ Task
(Alan Fermin)
! Move a cursor to the goal
  ! 100 points for the shortest path
  ! −5 points per excess step
! Key map
  ! only 3 movement directions
  ! non-trivial path planning
! Immediate or delayed start
  ! 4 to 6 sec for planning
  ! timeout in 6 sec
Task Conditions
! Buttons and fingers: index, middle, ring
! Key-maps (KM1-KM3) paired with start-goal positions (SG1-SG5) across training and test sessions
! Test conditions: Cond 1: new key-map; Cond 2: learned key-maps; Cond 3: learned KM-SG combination
! Three groups (6 subjects each) with different KM-SG assignments
Effect of Pre-start Delay Time
! Figure: performance over trials (blocks 1 and 2) for new vs. learned key-maps
Test Block Mean Reward Gain
! Figure: mean reward gain in the test block for Conditions 1-3 (pairwise comparisons marked)
Delay Period Activity
! Condition 2
  ! DLPFC
  ! PMC
  ! parietal cortex
  ! anterior striatum
  ! lateral cerebellum
POMDP by Cortex-Basal Ganglia
Rao (2011)
! Belief state update by the cortex
! Figure: agent-world loop; a belief state estimator (SE) updates the belief state bt from observations and rewards, and actions are selected from bt
! Figure: cortex provides the belief basis; striatum computes action values over belief points; dopamine neurons (SNc/VTA) carry the TD error; the loop runs through GPi/SNr, GPe, STN, and thalamus
! Figure: average TD error of the model (time steps) vs. average dopamine firing rate of the monkey (sp/s), for different reward probabilities
Embodied Evolution (Elfwing et al., 2011)
! Population: robots and virtual agents (15-25)
! Genes
  ! w1, w2, …, wn: weights of the top-layer neural network
  ! v1, v2, …, vn: weights shaping rewards
  ! meta-parameters: α, γ, λ, τk, τ0
Temporal Discount Factor γ
! Large γ: reach for a far reward
! Small γ: go only for a near reward
Temporal Discount Factor γ
! V(t) = E[ r(t) + γr(t+1) + γ²r(t+2) + γ³r(t+3) + … ]
! γ controls the 'character' of an agent
! Example reward sequences over 4 steps:
  ! −20, −20, −20, +100
  ! +50, 0, 0, −100
! Large γ: V = 18.7 for the first sequence (no pain, no gain) and V = −22.9 for the second (stay away from danger)
! Small γ: V = −25.1 for the first sequence (better stay idle) and V = 47.3 for the second (can't resist temptation)
! Depression? Impulsivity?
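A small worked check (my addition): with γ = 0.9 as the 'large' and γ = 0.3 as the 'small' discount factor, the discounted sums of the two reward sequences reproduce the four values quoted above.

def discounted_value(rewards, gamma):
    """V = r(1) + gamma*r(2) + gamma^2*r(3) + ..."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

pain_then_gain = [-20, -20, -20, +100]
gain_then_pain = [+50, 0, 0, -100]

for gamma in (0.9, 0.3):
    print(gamma,
          round(discounted_value(pain_then_gain, gamma), 1),   # 18.7 then -25.1
          round(discounted_value(gain_then_pain, gamma), 1))   # -22.9 then 47.3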
Neuromodulators for Metalearning
(Doya, 2002)
! Metaparameter tuning is critical in RL
! How does the brain tune them?
  ! Dopamine: TD error δ
  ! Acetylcholine: learning rate α
  ! Noradrenaline: exploration β
  ! Serotonin: temporal discount γ
Markov Decision Task
(Tanaka et al., 2004)
! Stimulus and response
! State transition and reward functions
! Figure: trial timing (2 s, 1 s, 1 s, 1 s, 0.5 s, 0.5 s) with actions a1 and a2
! Figure: state transitions over s1, s2, s3 with small rewards (±20 yen) in the SHORT condition and a large delayed gain/loss (+100 / −100 yen) in the LONG condition
Block-Design Analysis
! Different areas for immediate/future reward prediction
! SHORT vs. NO reward (p < 0.001 uncorrected): OFC, insula, striatum, cerebellum
! LONG vs. SHORT (p < 0.0001 uncorrected): DLPFC, VLPFC, IPC, PMd, striatum, cerebellum, dorsal raphe
Model-based Explanatory Variables
! Reward prediction V(t)
! Reward prediction error δ(t)
! Figure: trial-by-trial time courses of V and δ for γ = 0, 0.3, 0.6, 0.8, 0.9, 0.99
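A rough sketch of how such regressors could be generated: run TD(0) over the subject's experienced state/reward sequence once per discount factor, recording V(t) and δ(t). This is an illustration of the idea, not the study's analysis code; the learning rate and state coding are assumptions.

import numpy as np

def value_regressors(states, rewards, gammas=(0, 0.3, 0.6, 0.8, 0.9, 0.99),
                     alpha=0.2, n_states=3):
    """For each gamma, run TD(0) over the experienced sequence and
    record the value V(t) and TD error delta(t) at every step."""
    out = {}
    for gamma in gammas:
        V = np.zeros(n_states)
        v_trace, d_trace = [], []
        for t in range(len(states) - 1):
            s, s_next, r = states[t], states[t + 1], rewards[t]
            delta = r + gamma * V[s_next] - V[s]
            v_trace.append(V[s])
            d_trace.append(delta)
            V[s] += alpha * delta
        out[gamma] = (np.array(v_trace), np.array(d_trace))
    return out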
Regression Analysis
! Reward prediction V(t)
! Reward prediction error δ(t)
! Figure: correlated voxels in mPFC (x = −2 mm), insula (x = −42 mm), and striatum (z = 2)
Tryptophan Depletion/Loading
(Tanaka et al., 2007)
! Tryptophan: precursor of serotonin
! Depletion/loading affect central serotonin levels (e.g. Bjork et al. 2001, Luciana et al. 2001)
! 100 g of amino acid mixture; experiments after 6 hours
! Day 1: Tr− (depletion, no tryptophan); Day 2: Tr0 (control, 2.3 g tryptophan); Day 3: Tr+ (loading, 10.3 g tryptophan)
Behavioral Result
(Schweighofer et al., 2008)
! Extended sessions outside scanner
! Figure: behavior under control, loading, and depletion conditions
Modulation by Tryptophan Levels
Changes in Correlation Coefficient
! ROI (region of interest) analysis
! ROIs: γ = 0.6 at (28, 0, −4); γ = 0.99 at (16, 2, 28)
! Tr− < Tr+: correlation with V at large γ in the dorsal putamen
! Tr− > Tr+: correlation with V at small γ in the ventral putamen
Microdialysis Experiment
(Miyazaki et al., 2011, EJNS)
Dopamine and Serotonin Responses
! Figure: serotonin and dopamine levels (average over 30 minutes)
! Serotonin (n = 10), dopamine (n = 8)
Serotonin vs. Dopamine
! No significant positive or negative correlation
! Conditions: delayed, intermittent, immediate reward
Effect of Serotonin Suppression
(Miyazaki et al., 2012, JNS)
! 5-HT1A agonist injection in the DRN
! Wait errors in the long-delayed reward condition (figure: choice errors and wait errors)
Dorsal Raphe Neuron Recording
(Miyazaki et al., 2011, JNS)
! Putative 5-HT neurons
  ! wider spikes
  ! low firing rate
  ! suppression by 5-HT1A agonist
Delayed Tone-Food-Tone-Water Task
! Tone, then food or water, with 2-20 sec delays
! Figure: food site and water site; tone before food and tone before water
Error Trials in Extended Delay
! Sustained firing during the reward wait
! Firing drops before a wait error
! Figure: firing rates (spikes/s) during the first vs. last 2 s of the reward-wait period (n = 26, n = 23)