SLIDE 1

Inter-individual variability in human feedback learning

Financial Education and Investor Behavior Conference

Rio de Janeiro - 7/12/2015

Stefano Palminteri, PhD

stefano.palminteri@gmail.com Institute of Cognitive Science (UCL, London) Laboratoire de Neurosciences Cognitives (ENS, Paris)

SLIDE 2

Thorndike, Skinner, Sutton, Barto etc…

Reinforcement learning (I)

Evolution

  • C. Elegans
  • A. Mellifera
  • M. Musculus
  • H. Sapiens

An evolutionarily pervasive psychological process: learning by trial and error to select actions that maximize the occurrence of pleasant events (rewards) and minimize the occurrence of unpleasant events (punishments)

SLIDE 3

Learning is driven by prediction errors, and choices are made by comparing action values.

Policy: P(A)t=1/(1+exp((V(B)t-V(A)t)/β))

Prediction error: PEt=R-V(A)t

Learning rule: V(A)t+1=V(A)t+αPEt

[Figures: P(A)t as a function of the value difference V(A)t-V(B)t (softmax curve); prediction error PEt and prediction V(A)t across 20 trials]

Reinforcement learning (II)

Q-learning or Rescorla-Wagner model (RW)

SLIDE 4

Fundamental dimensions: positive/negative and exploration/exploitation

Decision rule: P(A)t=1/(1+exp((V(B)t-V(A)t)/β))
Prediction error: PEt=R-V(A)t
Learning rule: V(A)t+1=V(A)t+αPEt

Reinforcement learning (III)

• 1 Positive prediction errors vs. 2 negative prediction errors
• 1 Exploit previous knowledge vs. 2 explore new options
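The decision rule, prediction error and learning rule on this slide fit in a few lines of Python. This is a minimal illustrative sketch, not the authors' code; the function names, parameter values (α=0.3, β=0.1) and two-option bandit setup are assumptions.

```python
import math
import random

def softmax_choice_prob(v_a, v_b, beta):
    """Policy: P(A)t = 1 / (1 + exp((V(B)t - V(A)t) / beta))."""
    return 1.0 / (1.0 + math.exp((v_b - v_a) / beta))

def rescorla_wagner(rewards_a, rewards_b, alpha=0.3, beta=0.1, seed=0):
    """Simulate a two-armed bandit with the Rescorla-Wagner rule."""
    rng = random.Random(seed)
    v = {"A": 0.0, "B": 0.0}
    choices = []
    for r_a, r_b in zip(rewards_a, rewards_b):
        p_a = softmax_choice_prob(v["A"], v["B"], beta)
        choice = "A" if rng.random() < p_a else "B"
        reward = r_a if choice == "A" else r_b
        pe = reward - v[choice]   # prediction error: PEt = R - V(A)t
        v[choice] += alpha * pe   # learning rule: V <- V + alpha * PEt
        choices.append(choice)
    return v, choices
```

With rewards_a always 1 and rewards_b always 0, v["A"] climbs toward 1 while v["B"] stays at 0, and choices increasingly favor A.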

SLIDE 5

The general idea: Can “low level” reinforcement learning biases explain ‘high level’ behavioral biases?

The framework (I)

Erev, Camerer, Schultz, etc.

[Spectrum: “low level” (motor learning) to “high level” (economy)]

Reinforcement learning processes have been shown to operate at different levels of human behavior.
SLIDE 6

[Diagram: agent-environment loop. Context (s1,…sj) and options (a1,…ai) → option values V(sj,ai) → selection → choice probabilities P(sj,ai) → action (a) → obtained outcomes → update (1)]

1 Learning from direct experience (“factual”)

Decision biases → learning biases!

The framework (II)

SLIDE 7

Today special question: good news/bad news effect

Optimism bias (I)

Sharot et al.: revising beliefs as a function of information that is:

  • Better than expected (PE>0)
  • Worse than expected (PE<0)

Insensitivity to negative prediction errors can generate:

  • Inflated likelihood estimates for desired events
  • Reduced likelihood estimates for undesired events

[Figure: Belief(t) + Information(t) → Belief(t+1); data split by PE<0 vs. PE>0]

"It is the peculiar and perpetual error of the human understanding to be more moved and excited by affirmatives than negatives; whereas it ought properly to hold itself indifferently disposed towards both alike" (p. 36).

Francis Bacon (1620)

SLIDE 8

Current hypothesis: asymmetric learning from positive and negative prediction errors as an atomic computational mechanism that generates and sustains optimistic beliefs (low → high level).
Current questions:
1) Is this learning asymmetry specific to abstract beliefs, or does it also apply to rewards?
2) Does this learning asymmetry depend on the stimuli having prior desirability, or does it still hold for neutral stimuli?
3) Is this learning asymmetry specific to fictive (simulated) experience, or does it also hold for actual outcomes?

Optimism bias (II)

SLIDE 9

[Figure: dependent variables: learning performance, motor bias, “conservatism”]

Experimental task, contingencies and dependent variables. Data from Palminteri et al., J Neurosci, 2009; Worbe et al., Arch Gen Psychiatry, 2011.

First study

[Predictions: symmetric option values vs. asymmetric option values, compared with the data]

SLIDE 10

  • The stimuli have no prior desirability
  • The outcomes are not hypothetical but real

First study (N=50)

[Figure: preferred choice rate (conditions 1 & 4) and correct choice rate (conditions 2 & 3), from 0.5 to 1, across 25 trials]

SLIDE 11

Formalism and predictions

Rescorla-Wagner model (RW):
Decision rule: P(A)t=1/(1+exp((V(B)t-V(A)t)/β))
Prediction error: PEt=R-V(A)t
Learning rule: V(A)t+1=V(A)t+αPEt

Asymmetric model (RW±):
V(A)t+1=V(A)t+α+PEt if PEt>0
V(A)t+1=V(A)t+α-PEt if PEt<0

Possible results concerning the learning rates: standard RL (α+=α-), optimistic RL (α+>α-), pessimistic RL (α+<α-).
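The RW± rule, one learning rate per prediction-error sign, can be sketched as follows; the function names and parameter values are illustrative assumptions, not taken from the studies.

```python
def rw_plus_minus_update(v, reward, alpha_pos, alpha_neg):
    """One RW± update: alpha_pos applies when PEt > 0, alpha_neg otherwise."""
    pe = reward - v
    alpha = alpha_pos if pe > 0 else alpha_neg
    return v + alpha * pe

def learned_value(rewards, alpha_pos, alpha_neg, v0=0.0):
    """Final value of a single option after a sequence of outcomes."""
    v = v0
    for r in rewards:
        v = rw_plus_minus_update(v, r, alpha_pos, alpha_neg)
    return v
```

On a strictly alternating 1/0 outcome stream, an optimistic learner (α+=0.4, α-=0.1) settles near 0.78 while a standard learner (α=0.25) settles near 0.43: the optimist overvalues a 50%-rewarded option.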

SLIDE 12

The computational and behavioral results

[Figure: model comparison (which model? Bayes and Popper criteria), parameter estimates, and behaviour (why?)]

Signs of optimistic reinforcement learning → optimism enforces itself.

SLIDE 13

A microscopic analysis of optimistic and realistic behavior

[Figure: typical subjects under each model: preferred choice rate and preferred minus not-preferred value difference]

A computational explanation for developing a “preferred option” even in poorly rewarding environments

SLIDE 14

The robustness of the result

Objections addressed:
  • Minimum outcome: “optimistic RL is expressed because not winning is not that bad, and would disappear with actual monetary punishments”
  • Learning phase: “optimistic RL is an artifact arising from subjects ‘deciding’ which is the best option after the first outcomes”
  • Contingencies: “optimistic RL is an artifact arising from subjects ‘giving up’ in symmetrical low-reward conditions”

SLIDE 15

Hypotheses concerning the learning rates (α+ vs. α-): either contingency reversal suppresses the asymmetry, or the asymmetry is robust across contingency types.

Limitations of the first studies:
  • The task included only stable environments
  • Thus, subjects did not incur big losses by behaving optimistically

Testing the inflexibility of optimistic subjects

SLIDE 16

The second study (N=20)

The task includes a reversal-learning condition (which should promote flexibility).
  • First set: learning is driven by positive prediction errors
  • Second set: learning is driven by negative prediction errors

SLIDE 17

The computational and behavioral results

SLIDE 18

The reversal learning curves

Slower, but flexible vs. quicker, but inflexible.
  • Optimistic learning is confirmed even when there are losses
  • Optimistic learning is confirmed even when it is maladaptive

SLIDE 19

Interim conclusions (I) and new questions

So far:

  • We demonstrated that, even in a simple task involving abstract, neutral items and direct reinforcement, subjects preferentially update their reward expectations following positive, rather than negative, prediction errors.
  • This is true even when this behavior is disadvantageous (reversal learning).
  • However, this tendency was quite variable across subjects.

New questions:
1) Is optimistic reinforcement learning associated with interindividual differences in the optimistic personality trait?
2) Is this interindividual variability associated with specific neuroanatomical and functional brain signatures?
3) Is this computational bias influenced by the individual's socioeconomic environment?

SLIDE 20

The link with optimistic personality trait

[Figure: behavioural and model-based correlations with the Life Orientation Test - Revised (LOT-R)]

External validity (I): Relation to psychometric measures of “optimism”

SLIDE 21

The neural bases of optimistic RL

[Figure: behavioural and model-based correlations with neuroanatomy (VBM) and neurophysiology (fMRI; policy and update signals)]

External validity (II): Relation to brain signatures (Neurocomputational phenotypes)

SLIDE 22

The effect of environmental harshness (preliminary)

[Figure: bias amplitude across different life trajectories]

External validity (III): Sensitivity to socioeconomic status

SLIDE 23

Extending the framework

[Diagram: the agent-environment loop as before (context (s1,…sj), options (a1,…ai), option values V(sj,ai), selection, choice probabilities P(sj,ai), action (a), obtained outcomes, update 1), now extended with forgone outcomes (update 2)]

1 Learning from direct experience (“factual”)
2 Learning from simulated experience (“counterfactual”)

Learning biases!

SLIDE 24

Remaining questions (among others):

  • Is counterfactual learning also biased?
  • Is this reinforcement learning bias a valuation bias or a confirmation bias?

New design:

  • The task includes counterfactual feedback: on each trial subjects see both the obtained outcome (RC) and the forgone outcome (RU)

The third study (N=20)

SLIDE 25

Formalism and predictions

Factual learning (as before):
PECt=RC-V(C)t
V(C)t+1=V(C)t+αC+PECt if PECt>0
V(C)t+1=V(C)t+αC-PECt if PECt<0

Counterfactual learning (new):
PEUt=RU-V(U)t
V(U)t+1=V(U)t+αU+PEUt if PEUt>0
V(U)t+1=V(U)t+αU-PEUt if PEUt<0

We know that αC+>αC- (optimistic, “egocentric” factual learning). Possible results for the counterfactual learning rates:
  • αU+=αU-: counterfactual feedback processing is unbiased
  • αU+>αU-: the bias is choice independent, a valuation bias (“allocentric” optimism)
  • αU+<αU-: the bias is choice dependent, a confirmation bias
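The confirmation-bias alternative can be written as a single-trial update. This is a sketch under assumed parameter values and names: the larger rate is applied to choice-confirming prediction errors (positive on the chosen option, negative on the unchosen one).

```python
def confirmation_update(v_chosen, v_unchosen, r_chosen, r_unchosen,
                        alpha_conf=0.4, alpha_disconf=0.1):
    """One trial of factual + counterfactual learning under a
    confirmation bias: alpha_conf applies to choice-confirming
    prediction errors, alpha_disconf to disconfirming ones."""
    pe_c = r_chosen - v_chosen      # factual PE:        PECt = RC - V(C)t
    pe_u = r_unchosen - v_unchosen  # counterfactual PE: PEUt = RU - V(U)t
    a_c = alpha_conf if pe_c > 0 else alpha_disconf
    a_u = alpha_disconf if pe_u > 0 else alpha_conf
    return v_chosen + a_c * pe_c, v_unchosen + a_u * pe_u
```

When both options pay off equally, the chosen option's value rises faster than the unchosen one's; when both fail, the unchosen option's value falls faster. Either way, the update favors the choice already made.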

SLIDE 26

Results

  • Counterfactual learning is also biased
  • The counterfactual learning bias is choice oriented (as a confirmation bias would be)

SLIDE 27

Final Conclusions

  • The good news/bad news (optimism) effect may be the consequence of a low-level reinforcement learning bias
  • The factual learning bias extends to counterfactual learning in the form of a confirmation bias
  • This computational bias is highly variable in the population, and this variability has external validity in terms of neural bases, personality traits and environmental influences

The bias holds across: abstract cues vs. real life; real experience vs. fictive experience; stable vs. volatile environments; reward omission vs. punishment reception.

SLIDE 28

Optimistic learning and “conservative” behavior

A link between optimism and the exploration/exploitation trade-off: a prerequisite for engaging in exploratory behavior is that the subject be capable of growing unsatisfied with the current state of affairs. An extreme “optimist” cannot. The idea is not new.

“Optimism,” said Cacambo, “what is that?” “Alas!” replied Candide, “it is the obstinacy of maintaining that everything is best when it is worst.”

Voltaire (1759)

SLIDE 29

  • Acknowledge the importance of learning biases (not only decision biases) in understanding normative and/or maladaptive behavior
  • Extend the framework to other learning modules (currently: observational learning, i.e. learning from others' behavior)
  • Develop tools to quantify such biases at the individual level (in terms of both behavioral tests and computational procedures) → computational profiling
  • Explore the predictive power of such computational profiling in terms of long-term clinical, professional and economic outcomes (longitudinal studies)
  • Design naturalistic, reinforcement learning-inspired behavioral interventions

Future directions

SLIDE 30

Thank you for your attention!

Collaborators :

  • Sarah-Jayne Blakemore (University College London, London)
  • Sacha Bourgeois-Gironde (University of Paris 2, Paris)
  • Maël Lebreton (University of Amsterdam, Amsterdam)
  • Germain Lefebvre (Ecole Normale Supérieure, Paris)
  • Florent Meyniel (Neurospin, Gif/Yvette)
  • Lou Safra (Ecole Normale Supérieure, Paris)
  • Coralie Chevallier (Ecole Normale Supérieure, Paris)


SLIDE 31

Supplementary slides

SLIDE 32

Replication sample for counterfactual bias

Data kindly provided by Giorgio Coricelli & Mateus Joffily

SLIDE 33

So far:

  • We demonstrated that optimistic “low level” reinforcement learning, as measured by our task/model, is correlated with positive life orientation, specific brain structure and function, and economic life trajectory
  • The learning bias generalizes from “factual” to “counterfactual” learning

New questions:
1) Why do these biases exist? → (self-preservation against depressive realism)
2) Why this variability?

Interim conclusions (II) and new questions/ideas

SLIDE 34

Working hypothesis:

Multiple near-optimal solutions are maintained in the population to ensure adaptability of the group with respect to changing (and diverse) environments.

Variability is noise: a single statistically “optimal” model, with each subject (1, 2, …, n) a noisy variant of it (Model*, Model**, …).

Variability is adaptive: an “optimal” repertoire of models, with different subjects implementing different models (subject 1: model 1; subject 2: model 2; subject 3: model 3; subject n: model 4).

The good side of variability (preliminary)

SLIDE 35

Moving from a private learning context to a social foraging context (ongoing project), to define which environmental properties may favor a mix of populations:
  • Introduce a cost of exploration (switching patch)
  • Introduce consumption of the resources (i.e. instability)

[Task schematic: on each trial agents choose to switch (Sw) or stay (St); outcomes are reward (R) or nothing (N); rewards deplete a collective reserve that replenishes (+1), so P(reward) depends on consumption, on forgone vs. sampled patches, and on whether the environment is stable or unstable across trials]

More ecological situation

SLIDE 36

Simulation: 1000 generations of 50 individuals. We compare the cumulative payoff of mixed, RW and RW± populations as a function of exploration cost and speed of resource consumption.

  • A mixed population can outperform both “extreme” populations if the environment presents a mixture of stable and unstable phases
  • Ultimately, this mixture of strategies may be adaptive

[Figure: payoff maps with “isopayoff” line]

Simulation results (preliminary)
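A toy version of this kind of comparison can be sketched as follows. Everything here (the two-armed setup, parameter values, reward schedules and the mid-run reversal) is an illustrative assumption, not the actual simulation behind these results.

```python
import math
import random

def run_agent(alpha_pos, alpha_neg, reward_probs, n_trials=200, beta=0.1, seed=1):
    """Cumulative payoff of a two-armed RW(±) agent.
    reward_probs maps a trial index to (p_reward_A, p_reward_B),
    so contingency reversals can be injected."""
    rng = random.Random(seed)
    v = [0.0, 0.0]
    total = 0
    for t in range(n_trials):
        p_a, p_b = reward_probs(t)
        prob_a = 1.0 / (1.0 + math.exp((v[1] - v[0]) / beta))  # softmax decision rule
        c = 0 if rng.random() < prob_a else 1
        r = 1 if rng.random() < (p_a if c == 0 else p_b) else 0
        pe = r - v[c]                                          # prediction error
        v[c] += (alpha_pos if pe > 0 else alpha_neg) * pe      # RW(±) update
        total += r
    return total

stable = lambda t: (0.75, 0.25)                                 # fixed contingencies
volatile = lambda t: (0.75, 0.25) if t < 100 else (0.25, 0.75)  # mid-run reversal
```

Averaging payoffs of RW agents (alpha_pos == alpha_neg) and RW± agents (alpha_pos > alpha_neg) over many seeds, in both environments, gives a rough feel for why neither strategy dominates when stable and unstable phases alternate.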