SLIDE 1

Summary of part I: prediction and RL

Prediction is important for action selection

  • The problem: prediction of future reward
  • The algorithm: temporal difference learning
  • Neural implementation: dopamine dependent learning in BG

⇒ A precise computational model of learning allows one to look in the brain for “hidden variables” postulated by the model
⇒ Precise (normative!) theory for generation of dopamine firing patterns
⇒ Explains anticipatory dopaminergic responding, second order conditioning
⇒ Compelling account for the role of dopamine in classical conditioning: prediction error acts as signal driving learning in prediction areas
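Since TD learning anchors everything that follows, a minimal TD(0) prediction sketch may help; the chain task, learning rate, and episode count are illustrative assumptions, not from the slides.

```python
import numpy as np

# Minimal TD(0) prediction sketch (task and parameters are assumptions)
n_states, alpha = 5, 0.1
V = np.zeros(n_states + 1)            # V[n_states] is terminal, fixed at 0

for episode in range(1000):
    for s in range(n_states):         # walk a chain that ends in reward
        s_next = s + 1
        r = 1.0 if s_next == n_states else 0.0
        delta = r + V[s_next] - V[s]  # TD error: the dopamine-like signal
        V[s] += alpha * delta
# all states come to predict the terminal reward: V ≈ [1, 1, 1, 1, 1]
```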

SLIDE 2

prediction error hypothesis of dopamine

[Figure: measured firing rate vs. model prediction error]

Bayer & Glimcher (2005)

at end of trial: $\delta_t = r_t - V_t$ (just like R-W)

$V_t = \eta \sum_{i=1}^{t} (1-\eta)^{t-i}\, r_i$
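Unrolling this weighted sum into recursive form makes the “just like R-W” remark explicit (a standard identity, not from the slides):

$V_t = (1-\eta)\,V_{t-1} + \eta\, r_t = V_{t-1} + \eta\,(r_t - V_{t-1})$

i.e. the Rescorla-Wagner / delta rule with learning rate $\eta$.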

SLIDE 3

Global plan

  • Reinforcement learning I:

– prediction
– classical conditioning
– dopamine

  • Reinforcement learning II:

– dynamic programming; action selection
– Pavlovian misbehaviour
– vigor

  • Chapter 9 of Theoretical Neuroscience
SLIDE 4

Action Selection

  • Evolutionary specification
  • Immediate reinforcement:

– leg flexion
– Thorndike puzzle box
– pigeon; rat; human matching

  • Delayed reinforcement:

– these tasks
– mazes
– chess

Bandler; Blanchard

SLIDE 5

Immediate Reinforcement

  • stochastic policy: $P[L; \mathbf{m}] = \sigma\big(\beta(m_L - m_R)\big)$
  • based on action values: $m_L$, $m_R$

SLIDE 6

Indirect Actor

use RW rule: $m_a \to m_a + \varepsilon\,(r - m_a)$ for the chosen action $a$

reward probabilities: $p^r_L = 0.05$, $p^r_R = 0.25$

switch every 100 trials
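A minimal simulation of the indirect actor, assuming the softmax policy of the previous slide; the learning rate and inverse temperature are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, beta = 0.1, 2.0                 # assumed learning rate and inverse temperature
m = np.zeros(2)                      # action values: m[0]=m_L, m[1]=m_R
p_reward = np.array([0.05, 0.25])    # reward probabilities from the slide

for t in range(400):
    if t > 0 and t % 100 == 0:
        p_reward = p_reward[::-1]    # switch contingencies every 100 trials
    p_L = 1.0 / (1.0 + np.exp(-beta * (m[0] - m[1])))   # softmax choice
    a = 0 if rng.random() < p_L else 1
    r = float(rng.random() < p_reward[a])
    m[a] += eps * (r - m[a])         # RW rule: chosen value tracks its reward rate
```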

SLIDE 7

Direct Actor

$E(\mathbf{m}) = P[L]\,\langle r_L \rangle + P[R]\,\langle r_R \rangle$

$\frac{\partial P[L]}{\partial m_L} = \beta\, P[L]\, P[R] = -\frac{\partial P[R]}{\partial m_L}$

$\frac{\partial E}{\partial m_L} = \beta\, P[L]\,\Big(\langle r_L \rangle - \big(P[L]\langle r_L\rangle + P[R]\langle r_R\rangle\big)\Big) = \beta\, P[L]\,\big(\langle r_L \rangle - E(\mathbf{m})\big)$

sampling a single trial gives a stochastic estimate of this gradient; with $P[L] = \sigma\big(\beta(m_L - m_R)\big)$:

$m_L \to m_L + \varepsilon\,(1 - P[L])\,(r - \bar{E})$ if L is chosen
$m_L \to m_L - \varepsilon\,P[L]\,(r - \bar{E})$ if R is chosen
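The same bandit with a direct actor: this is the stochastic-gradient rule above in REINFORCE-with-baseline form (the baseline update rate is an added assumption).

```python
import numpy as np

rng = np.random.default_rng(1)
eps, beta, eta = 0.1, 2.0, 0.05      # step size, inverse temperature, baseline rate (assumed)
m = np.zeros(2)                      # propensities m_L, m_R (no values learned)
E_bar = 0.0                          # running estimate of average reward: the baseline
p_reward = np.array([0.05, 0.25])

for t in range(400):
    p_L = 1.0 / (1.0 + np.exp(-beta * (m[0] - m[1])))
    p = np.array([p_L, 1.0 - p_L])
    a = 0 if rng.random() < p_L else 1
    r = float(rng.random() < p_reward[a])
    for b in (0, 1):                 # chosen action up, other down, scaled by surprise
        m[b] += eps * ((1.0 if b == a else 0.0) - p[b]) * (r - E_bar)
    E_bar += eta * (r - E_bar)
```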

SLIDE 8

Direct Actor


SLIDE 9

Could we Tell?

  • correlate past rewards, actions with present choice

  • indirect actor (separate clocks):
  • direct actor (single clock):
SLIDE 10

Matching: Concurrent VI-VI

Lau, Glimcher, Corrado, Sugrue, Newsome

SLIDE 11

Matching

  • income not return
  • approximately exponential in r
  • alternation choice kernel
SLIDE 12

Action at a (Temporal) Distance

[Figure: three-state chain x=1 → x=2 → x=3]

  • learning an appropriate action at x=1:

– depends on the actions at x=2 and x=3
– gains no immediate feedback

  • idea: use prediction as surrogate feedback
SLIDE 13

Action Selection

start with policy: $P[L; x] = \sigma\big(m_L(x) - m_R(x)\big)$

evaluate it: $V(1),\ V(2),\ V(3)$

improve it: $\Delta m^* \propto \alpha\,\delta$; thus choose R more frequently than L

[Figure: three-state chain x=1, x=2, x=3; improvement magnitudes 0.025, 0.175, 0.125, 0.125]
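A sketch of this evaluate/improve loop as an actor-critic; the chain’s transitions and rewards are invented stand-ins for the maze in the figure.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, eps = 0.2, 0.2
V = np.zeros(4)                      # critic values for states 1..3
m = np.zeros((4, 2))                 # actor propensities: L=0, R=1 per state

def step(x, a):
    # assumed structure: from x=1, L goes to x=2 and R goes to x=3;
    # x=2 ends with no reward, x=3 ends with reward 1 for action R
    if x == 1:
        return (2 if a == 0 else 3), 0.0
    return None, (1.0 if (x == 3 and a == 1) else 0.0)

for episode in range(2000):
    x = 1
    while x is not None:
        p_L = 1.0 / (1.0 + np.exp(-(m[x, 0] - m[x, 1])))
        a = 0 if rng.random() < p_L else 1
        x_next, r = step(x, a)
        v_next = V[x_next] if x_next is not None else 0.0
        delta = r + v_next - V[x]    # TD error evaluates the current policy
        V[x] += alpha * delta        # critic: improve the evaluation
        m[x, a] += eps * delta       # actor: reinforce actions with positive delta
        x = x_next
# the actor comes to choose R more frequently than L at x=1
```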

SLIDE 14

Policy

if $\delta > 0$:

  • value is too pessimistic ⇒ $\Delta V$
  • action is better than average ⇒ $\Delta P$

[Figure: three-state chain x=1, x=2, x=3]

SLIDE 15

actor/critic

[Figure: actor units $m_1, m_2, m_3, \ldots, m_n$]

dopamine signals to both motivational & motor striatum appear, surprisingly, the same
suggestion: training both values & policies

SLIDE 16

Formally: Dynamic Programming

SLIDE 17

Variants: SARSA

$Q^*(1, C) = E\big[r_t + V^*(x_{t+1}) \,\big|\, x_t = 1,\, u_t = C\big]$

$Q(1, C) \to Q(1, C) + \varepsilon\big(r_t + Q(2, u_{\text{actual}}) - Q(1, C)\big)$

Morris et al, 2006

SLIDE 18

Variants: Q learning

$Q^*(1, C) = E\big[r_t + V^*(x_{t+1}) \,\big|\, x_t = 1,\, u_t = C\big]$

$Q(1, C) \to Q(1, C) + \varepsilon\big(r_t + \max_u Q(2, u) - Q(1, C)\big)$

Roesch et al, 2007
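The two variants differ only in the bootstrap target; a side-by-side sketch (Q is a NumPy array indexed by state and action; the function signatures are my own):

```python
def sarsa_update(Q, x, u, r, x_next, u_actual, eps):
    # SARSA: bootstrap on the action actually taken next (on-policy)
    Q[x, u] += eps * (r + Q[x_next, u_actual] - Q[x, u])

def q_learning_update(Q, x, u, r, x_next, eps):
    # Q-learning: bootstrap on the best available next action (off-policy)
    Q[x, u] += eps * (r + Q[x_next].max() - Q[x, u])
```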

SLIDE 19

Summary

  • prediction learning

– Bellman evaluation

  • actor-critic

– asynchronous policy iteration

  • indirect method (Q learning)

– asynchronous value iteration

$V^*(1) = E\big[r_t + V^*(x_{t+1}) \,\big|\, x_t = 1\big]$

$Q^*(1, C) = E\big[r_t + V^*(x_{t+1}) \,\big|\, x_t = 1,\, u_t = C\big]$

SLIDE 20

Impulsivity & Hyperbolic Discounting

  • humans (and animals) show impulsivity in:

– diets
– addiction
– spending, …

  • intertemporal conflict between short and long term choices
  • often explained via hyperbolic discount functions
  • alternative is Pavlovian imperative to an immediate reinforcer

  • framing, trolley dilemmas, etc
SLIDE 21

Direct/Indirect Pathways

  • direct: D1: GO; learn from DA increase
  • indirect: D2: noGO; learn from DA decrease
  • hyperdirect (STN): delay actions given strongly attractive choices

Frank

SLIDE 22

Frank

  • DARPP-32: D1 effect
  • DRD2: D2 effect
SLIDE 23

Three Decision Makers

  • tree search
  • position evaluation
  • situation memory
SLIDE 24

Multiple Systems in RL

  • model-based RL

– build a forward model of the task, outcomes
– search in the forward model (online DP)

  • optimal use of information
  • computationally ruinous
  • cache-based RL

– learn Q values, which summarize future worth

  • computationally trivial
  • bootstrap-based; so statistically inefficient
  • learn both – select according to uncertainty
SLIDE 25

Animal Canary

  • OFC; dlPFC; dorsomedial striatum; BLA?
  • dorsolateral striatum, amygdala
SLIDE 26

Two Systems:

SLIDE 27

Behavioural Effects

SLIDE 28

Effects of Learning

  • distributional value iteration
  • (Bayesian Q learning)
  • fixed additional uncertainty per step
SLIDE 29

One Outcome

shallow tree implies goal-directed control wins

SLIDE 30

Human Canary...

[Figure: task graph with states a, b, c]

  • if a → c and c → £££, then do more of a or b?

– MB: b
– MF: a (or even no effect)

SLIDE 31

Behaviour

  • action values depend on both systems:

$Q_{\text{tot}}(x, u) = Q_{\text{MF}}(x, u) + \beta\, Q_{\text{MB}}(x, u)$

  • expect that $\beta$ will vary by subject (but be fixed)
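A minimal sketch of reading choices out of the combined values; the softmax step and function shape are my assumptions, not from the slides.

```python
import numpy as np

def choice_probabilities(Q_MF, Q_MB, beta):
    # Q_tot = Q_MF + beta * Q_MB, per the slide; beta varies by subject
    Q_tot = Q_MF + beta * Q_MB
    p = np.exp(Q_tot - Q_tot.max())   # softmax over actions (assumed)
    return p / p.sum()
```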

SLIDE 32

Neural Prediction Errors (1→2)

[Figure: prediction-error correlates in R ventral striatum (anatomical definition)]

  • note that MB RL does not use this prediction error – training signal?

SLIDE 33

Neural Prediction Errors (1)

  • right nucleus accumbens

behaviour 1-2, not 1

SLIDE 34

Vigour

  • Two components to choice:

– what:

  • lever pressing
  • direction to run
  • meal to choose

– when/how fast/how vigorous:

  • free operant tasks
  • real-valued DP
SLIDE 35

The model

[Figure: free-operant model schematic. The agent repeatedly chooses an (action, τ) pair, e.g. (LP, τ1) or (LP, τ2), moving through states S0, S1, S2 toward the goal; each choice incurs a unit cost $C_u$ and a vigour cost $C_v/\tau$ (how fast), and costs and rewards (UR, PR) accrue over time]

SLIDE 36

The model

Goal: Choose actions and latencies to maximize the average rate of return (rewards minus costs per time)

[Figure: same schematic as the previous slide; ARL = average reward RL]

SLIDE 37

Compute differential values of actions

Average Reward RL (extension of Schwartz 1993)

Differential value of taking action L with latency τ when in state x; ρ = average rewards minus costs, per unit time:

$Q_{L,\tau}(x) = \text{Rewards} - \text{Costs} + \text{Future Returns} = \langle r \rangle - \Big(C_u + \frac{C_v}{\tau}\Big) - \tau\rho + V(x')$

  • steady state behavior (not learning dynamics)
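A numerical check of this differential value over a grid of latencies (all constants invented) recovers the analytic optimum derived two slides later.

```python
import numpy as np

C_u, C_v, rho, reward, V_next = 0.5, 1.0, 4.0, 2.0, 0.0   # invented constants

tau = np.linspace(0.05, 2.0, 400)
Q = reward - (C_u + C_v / tau) - tau * rho + V_next       # Q_{L,tau}(x)

print(tau[np.argmax(Q)], np.sqrt(C_v / rho))              # both ≈ 0.5 s
```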

SLIDE 38

Average Reward Cost/benefit Tradeoffs

1. Which action to take?

⇒ choose action with largest expected reward minus cost

2. How fast to perform it?

  • slow → less costly (vigour cost)
  • slow → delays (all) rewards
  • net rate of rewards = cost of delay (opportunity cost of time)

⇒ choose rate that balances vigour and opportunity costs

explains faster (irrelevant) actions under hunger, etc

masochism

SLIDE 39

Optimal response rates

[Figure: experimental data and model simulation. Probability and rate per minute of 1st nose pokes and lever presses as a function of seconds since reinforcement (0 to 1.5 s)]

Niv, Dayan, Joel, unpublished

SLIDE 40

Optimal response rates

[Figure: matching behaviour. Model simulation: % responses on lever A vs. % reinforcements on lever A, close to perfect matching. Experimental data (Herrnstein 1961): % responses on key A vs. % reinforcements on key A for Pigeon A and Pigeon B]

More:

  • # responses
  • interval length
  • amount of reward
  • ratio vs. interval
  • breaking point
  • temporal structure
  • etc.
SLIDE 41

Effects of motivation (in the model)

RR25

$Q(x, u, \tau) = p_r \cdot R - C_u - \frac{C_v}{\tau} - \tau\rho_R + V(x')$

$\frac{\partial Q(x, u, \tau)}{\partial \tau} = \frac{C_v}{\tau^2} - \rho_R = 0 \;\Rightarrow\; \tau_{\text{opt}} = \sqrt{C_v / \rho_R}$

[Figure: mean latency of LP and Other under low vs. high utility; the energizing effect: higher utility raises $\rho_R$ and shortens $\tau_{\text{opt}}$]
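Plugging in numbers (invented for illustration) shows the energizing effect: with $C_v = 1$ and $\rho_R = 4\,\mathrm{s^{-1}}$, $\tau_{\text{opt}} = \sqrt{1/4} = 0.5$ s; if motivation quadruples the net reward rate to $\rho_R = 16$, then $\tau_{\text{opt}} = 0.25$ s, so all latencies halve, including those of actions irrelevant to the reward.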

SLIDE 42

Effects of motivation (in the model)

RR25, UR 50%

[Figure: response rate per minute as a function of seconds from reinforcement, for LP and Other, under low vs. high utility. Panel 1: directing effect; panel 2: energizing effect on mean latency]

SLIDE 43

Relation to Dopamine

Phasic dopamine firing = reward prediction error

What about tonic dopamine?

SLIDE 44

Tonic dopamine = Average reward rate

[Figure: # LPs in 30 minutes under increasing ratio requirements, Control vs. DA depleted; experimental data (Aberman and Salamone 1999) alongside model simulation]

  • 1. explains pharmacological manipulations
  • 2. dopamine control of vigour through BG pathways
  • NB. phasic signal RPE for choice/value learning

  • eating time confound
  • context/state dependence (motivation & drugs?)
  • less switching=perseveration
SLIDE 45

Tonic dopamine hypothesis

…also explains effects of phasic dopamine on response times

[Figure: firing rate and reaction time data; Satoh and Kimura 2003; Ljungberg, Apicella and Schultz 1992]

SLIDE 46

Sensory Decisions as Optimal Stopping

  • consider listening to:
  • decision: choose, or sample
SLIDE 47

Optimal Stopping

  • equivalent of state u=1 is $n_1$
  • and of states u=2,3 is $\tfrac{1}{2}(n_1 + n_2)$

$\sigma = 2.5, \quad C = -0.1$

SLIDE 48

Transition Probabilities

SLIDE 49

Computational Neuromodulation

  • dopamine

– phasic: prediction error for reward
– tonic: average reward (vigour)

  • serotonin

– phasic: prediction error for punishment?

  • acetylcholine:

– expected uncertainty?

  • norepinephrine

– unexpected uncertainty; neural interrupt?

SLIDE 50

Conditioning

prediction: of important events
control: in the light of those predictions

  • Ethology

– optimality
– appropriateness

  • Computation

– dynamic programming
– Kalman filtering

  • Psychology

– classical/operant conditioning

  • Algorithm

– TD/delta rules
– simple weights

  • Neurobiology

– neuromodulators; amygdala; OFC; nucleus accumbens; dorsal striatum

SLIDE 51

Markov Decision Process

class of stylized tasks with states, actions & rewards

– at each timestep $t$ the world takes on state $s_t$ and delivers reward $r_t$, and the agent chooses an action $a_t$

SLIDE 52

Markov Decision Process

World: You are in state 34. Your immediate reward is 3. You have 3 actions.
Robot: I’ll take action 2.
World: You are in state 77. Your immediate reward is -7. You have 2 actions.
Robot: I’ll take action 1.
World: You’re in state 34 (again). Your immediate reward is 3. You have 3 actions.

SLIDE 53

Markov Decision Process

Stochastic process defined by:

– reward function: $r_t \sim P(r_t \mid s_t)$
– transition function: $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$

SLIDE 54

Markov Decision Process

Stochastic process defined by:

– reward function: $r_t \sim P(r_t \mid s_t)$
– transition function: $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$

Markov property

– future conditionally independent of the past, given $s_t$
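With the MDP so defined, the optimal values used in the earlier slides follow from value iteration; a minimal sketch on an invented two-state MDP:

```python
import numpy as np

# invented toy MDP: 2 states, 2 actions
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[3.0, -7.0], [0.0, 1.0]])     # R[s, a]: expected immediate reward
gamma = 0.9

V = np.zeros(2)
for _ in range(200):
    Q = R + gamma * (P @ V)                 # Q(s,a) = R(s,a) + γ Σ_s' P(s'|s,a) V(s')
    V = Q.max(axis=1)                       # Bellman optimality backup

policy = Q.argmax(axis=1)                   # a deterministic optimal policy (next slide)
```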

SLIDE 55

The optimal policy

Definition: a policy such that at every state, its expected value is better than (or equal to) that of all other policies.

Theorem: For every MDP there exists (at least) one deterministic optimal policy.

by the way, why is the optimal policy just a mapping from states to actions? couldn’t you earn more reward by choosing a different action depending on the last 2 states?

SLIDE 56

Pavlovian & Instrumental Conditioning

  • Pavlovian

– learning values and predictions
– using TD error

  • Instrumental

– learning actions:

  • by reinforcement (leg flexion)
  • by (TD) critic

– (actually different forms: goal directed & habitual)

SLIDE 57

Pavlovian-Instrumental Interactions

  • synergistic

– conditioned reinforcement
– Pavlovian-instrumental transfer

  • Pavlovian cue predicts the instrumental outcome
  • behavioural inhibition to avoid aversive outcomes
  • neutral

– Pavlovian-instrumental transfer

  • Pavlovian cue predicts outcome with same motivational valence
  • opponent

– Pavlovian-instrumental transfer

  • Pavlovian cue predicts opposite motivational valence

– negative automaintenance

SLIDE 58
–ve Automaintenance in Autoshaping

  • simple choice task

– N: nogo gives reward r=1
– G: go gives reward r=0

  • learn three quantities

– average value
– Q value for N
– Q value for G

  • instrumental propensity is
SLIDE 59
–ve Automaintenance in Autoshaping

  • Pavlovian action

– assert: Pavlovian impetus towards G is v(t)
– weight Pavlovian and instrumental advantages by ω – competitive reliability of Pavlov

  • new propensities
  • new action choice
SLIDE 60
–ve Automaintenance in Autoshaping

  • basic –ve automaintenance effect (µ=5)
  • lines are theoretical asymptotes
  • equilibrium probabilities of action