SLIDE 1

Behavioral Neural Networks

Shaowei Ke (UMich Econ), Chen Zhao (HKU Econ), Zhaoran Wang (Northwestern IEMS), Sung-Lin Hsieh (UMich Econ)

November 2020

SLIDE 2

Machine Learning

Over the last 15 years, machine-learning models have performed well in many decision problems:
• Product recommendation
• Complex games: AlphaGo

2018 Turing Award (Bengio, Hinton, and LeCun): “conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing”

SLIDE 3

Questions

• A statistical model that can predict well is not necessarily a good model of how people make decisions
  • E.g., insights about decision making may be lost in the approximation
• But maybe some of these useful machine-learning models are indeed good models of how people make decisions?
• If so, we may better understand them and incorporate them into economics
• With more choice data, such machine-learning models would very likely outperform our traditional models in prediction, and may even help us identify behavioral phenomena

SLIDE 4

A Good Model of Decision-Making?

E.g., the expected utility model:

1. The model is characterized by reasonable axioms imposed directly on choice behavior
2. The model provides a plausible interpretation/story of how people make choices

SLIDE 5

This Paper

1. We provide an axiomatic foundation for a class of neural-network models applied to decision making under risk, called the neural-network expected utility (NEU) models
   • The independence axiom is relaxed in a novel way consistent with experimental findings
   • The model provides a plausible interpretation of people’s choice behavior
2. We show that simple neural-network structures, referred to as behavioral neurons, can capture behavioral biases intuitively
3. Using these behavioral neurons, we find that some simple NEU models that are easy to interpret perform better than EU and CPT

SLIDE 6

Neural-Network Expected Utility

SLIDE 7

Choice Domain and Primitive

Prizes: Z = {z_1, . . . , z_n}
• Generic prizes: x, y, z
The set of lotteries: L = {p ∈ R^n_+ : ∑_{i=1}^n p_i = 1}
• Generic lotteries: p, q, r, s
• Degenerate lotteries: δ_x
Mixture: for any λ ∈ [0, 1], λp + (1 − λ)q is the lottery with (λp + (1 − λ)q)_i = λp_i + (1 − λ)q_i
• Shorthand: λpq := λp + (1 − λ)q
A decision maker has a binary relation (preference) ⪰ on L
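A quick numeric check of the mixture notation (a made-up example): with n = 3, p = (1, 0, 0), q = (0, 0.5, 0.5), and λ = 0.6, the mixture λpq = 0.6p + 0.4q = (0.6, 0.2, 0.2).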

SLIDE 8

Vector-Valued Affine Function

• t : R^w → R^w̃ is affine if there exist a w̃ × w matrix β and a w̃ × 1 vector γ such that t(a) = βa + γ for any a ∈ R^w
• t = (t^(1), . . . , t^(w̃)) is affine ⇒ the t^(j)’s are affine
• A real-valued function on L is affine if and only if it is an expected utility function
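To see the last point, note that on L we have ∑_i p_i = 1, so an affine U(p) = βp + γ satisfies U(p) = ∑_i β_i p_i + γ ∑_i p_i = ∑_i p_i (β_i + γ): an expected utility function with u(z_i) = β_i + γ. Conversely, p ↦ ∑_i p_i u(z_i) is linear, hence affine.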

SLIDE 9

NEU Representation

A function U : L → R is a NEU function if there exist
• h, w_0, w_1, . . . , w_{h+1} ∈ N with w_0 = n and w_{h+1} = 1,
• q_i : R^{w_i} → R^{w_i}, i = 1, . . . , h, such that for any b ∈ R^{w_i}, q_i(b) = (max{b_1, 0}, . . . , max{b_{w_i}, 0}),
• affine t_i : R^{w_{i−1}} → R^{w_i}, i = 1, . . . , h + 1,
such that U(p) = t_{h+1} ∘ q_h ∘ t_h ∘ · · · ∘ q_2 ∘ t_2 ∘ q_1 ∘ t_1(p).
We say that ⪰ has a NEU representation if it can be represented by a NEU function.
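In code, this definition is just a feedforward ReLU network mapping a lottery p to a scalar utility. A minimal sketch with made-up weights (h = 1 hidden layer, widths w_0 = 3, w_1 = 2, w_2 = 1):

    import numpy as np

    def affine(beta, gamma):
        # Affine map t(a) = beta @ a + gamma, as in the definition above.
        return lambda a: beta @ a + gamma

    def relu(b):
        # Activation q(b) = (max{b_1, 0}, ..., max{b_w, 0}).
        return np.maximum(b, 0.0)

    # Hypothetical NEU with n = 3 prizes: U(p) = t_2(q_1(t_1(p))).
    t1 = affine(np.array([[1.0, 0.5, 0.0],
                          [0.0, 0.2, 1.0]]), np.array([0.0, -0.1]))
    t2 = affine(np.array([[0.7, 0.3]]), np.array([0.05]))

    def U(p):
        return t2(relu(t1(np.asarray(p))))[0]

    print(U([0.2, 0.5, 0.3]))  # utility of the lottery (0.2, 0.5, 0.3)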

SLIDE 10

NEU Representation

[Network diagram: inputs p_1, p_2, p_3 feed first-hidden-layer neurons max{t_1^(1)(·), 0} and max{t_1^(2)(·), 0}; these feed second-hidden-layer neurons max{t_2^(1)(·), 0} and max{t_2^(2)(·), 0}, which are combined into U(p).]

• A NEU function: U(p) = t_3 ∘ q_2 ∘ t_2 ∘ q_1 ∘ t_1(p)
• ith hidden layer: q_i ∘ t_i
• Activation function: max{·, 0}
• Neuron: max{t_i^(j)(·), 0}

SLIDE 11

Interpretation

[Same network diagram as on the previous slide.]

• The decision maker has multiple considerations toward uncertainty (expected utility functions in the first layer)
  • E.g., one for the mean of prizes and one for downside risk
• She considers multiple ways of aggregating those attitudes plausible (affine functions in the second layer)
• Recursively, she may continue to have multiple ways in mind to aggregate the aggregations from the previous layer

SLIDE 12

Axiomatic Characterization

SLIDE 13

Expected Utility Theory

Axiom (Weak Order). ⪰ is complete and transitive.

Axiom (Continuity). For any p, the sets {q : p ⪰ q} and {q : q ⪰ p} are closed.

Axiom (Independence). For any λ ∈ (0, 1), p ⪰ q ⇒ λpr ⪰ λqr, and p ≻ q ⇒ λpr ≻ λqr.

• There are alternative ways to define independence

Axiom (Bi-Independence). For any λ ∈ (0, 1), if p ⪰ q, then r ⪰ s ⇒ λpr ⪰ λqs, and r ≻ s ⇒ λpr ≻ λqs.

• Setting p = q: Bi-Independence ⇒ Independence
• Applying Independence twice yields Bi-Independence: p ⪰ q gives λpr ⪰ λqr, r ⪰ s gives λqr ⪰ λqs, and transitivity chains the two

SLIDE 14

Violations of (Bi-)Independence: The Allais Paradox

First Pair:
    0.13pr: 100% $1M
    0.13qr: 87% $1M, 3% $0, 10% $1.5M
Second Pair:
    0.13ps: 13% $1M, 87% $0
    0.13qs: 10% $1.5M, 90% $0

• p = δ_{1M}, q = (3/13)δ_0 + (10/13)δ_{1.5M}, r = δ_{1M}, s = δ_0

1. Bias toward certainty
2. 0.13qr must look sufficiently different from a risk-free lottery
SLIDE 15

The Allais Paradox in a Nutshell (Literally)

First Pair:
    0.013pr: 100% $1M
    0.013q′r: 98.7% $1M, 0.3% $0.5M, 1% $1.5M
Second Pair:
    0.013ps′: 98.7% $0.5M, 1.3% $1M
    0.013q′s′: 99% $0.5M, 1% $1.5M

• p = δ_{1M}, q′ = (3/13)δ_{0.5M} + (10/13)δ_{1.5M}, r = δ_{1M}, s′ = δ_{0.5M}
• It seems much less likely that we would observe significant violations of (Bi-)Independence here

SLIDE 16

Violations of (Bi-)Independence

The difference between lotteries needs to be large enough for psychological effects to apply to them asymmetrically.
• We want to stick to (Bi-)Independence as much as possible because of its normative appeal
• But if (Bi-)Independence holds locally everywhere, it holds globally
• Is there a (slightly) relaxed version of (Bi-)Independence that can hold locally everywhere but not globally?

SLIDE 17

Relaxing Independence

A subset L′ ⊆ L preserves independence with respect to p (written L′ ⊥ p) if for any q, r ∈ L′ and λ ∈ (0, 1), q ⪰ r ⇒ λpq ⪰ λpr, and q ≻ r ⇒ λpq ≻ λpr.
• L′ need not be convex, and p, λpq, λpr may lie outside L′

SLIDE 18

Relaxing Independence

• A subset L′ preserves independence with respect to p if for any q, r ∈ L′ and λ ∈ (0, 1), q ⪰ r ⇒ λpq ⪰ λpr, and q ≻ r ⇒ λpq ≻ λpr
• A subset L′ ⊆ L preserves independence if for any p, q, r ∈ L′ and λ ∈ (0, 1) such that λpq, λpr ∈ L′, q ⪰ r ⇒ λpq ⪰ λpr, and q ≻ r ⇒ λpq ≻ λpr

SLIDE 19

Relaxing Independence

• A neighborhood of p: an open convex set that contains p

Axiom (Weak Local Independence). Every p ∈ L has a neighborhood L_p such that L_p ⊥ p.

• Weak Local Independence does not mean that “independence holds locally around every p”

SLIDE 20

Weak Local Independence

Axiom (Weak Local Independence). Every p ∈ L has a neighborhood L_p such that L_p ⊥ p.
• Allows the following type of indifference curves [figure not reproduced]

SLIDE 21

Relaxing Bi-Independence

• Weak Local Independence only tells us about the decision maker’s local choice behavior
• Local versions of Bi-Independence can regulate the decision maker’s non-local choice behavior

SLIDE 22

Relaxing Bi-Independence

Axiom (Weak Local Bi-Independence). If p ⪰ q, then p and q have neighborhoods L_p and L_q such that for any r ∈ L_p, s ∈ L_q, and λ ∈ (0, 1), r ⪰ s ⇒ λpr ⪰ λqs, and r ≻ s ⇒ λpr ≻ λqs.
• When p = q, we obtain Weak Local Independence
• Bi-independence is imposed only when mixing with p and q, respectively
• L_p does not have to be the same for different q’s

[Figure: p and q with nearby lotteries r ∈ L_p and s ∈ L_q, and the mixtures λpr and λqs.]

SLIDE 23

Main Theorem

Theorem

⪰ has a NEU representation if and only if it satisfies Weak Order, Continuity, and Weak Local Bi-Independence.
• EU characterizes linear functions on L
• NEU characterizes continuous finite piecewise linear functions on L

SLIDE 24

Behavioral Neurons and Empirical Analysis

SLIDE 25

NEU and the Certainty Effect

The Allais paradox: the decision maker has a bias toward certainty

[Network diagram: inputs p_1, p_2, p_3 feed an expected utility neuron V(p) and certainty-effect neurons max{p_1 − 0.9, 0}, max{p_2 − 0.9, 0}, max{p_3 − 0.9, 0}, which are combined into U(p).]

• V is an expected utility function
• If p_i > 0.98, a neuron that captures the certainty effect will be activated
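A minimal sketch of this structure (the 0.9 threshold follows the diagram; the utility vector and combining weight are made-up illustrations):

    import numpy as np

    def certainty_neurons(p, c=0.9):
        # One neuron per prize: max{p_i - c, 0}; a neuron fires only
        # when that prize's probability is close to 1.
        return np.maximum(np.asarray(p) - c, 0.0)

    def U(p, u=np.array([0.0, 1.0, 1.5]), w=2.0):
        V = np.asarray(p) @ u                      # expected utility neuron
        return V + w * certainty_neurons(p).sum()  # bias toward certainty

    print(U([1.0, 0.0, 0.0]))  # risk-free lottery: a certainty neuron fires
    print(U([0.5, 0.5, 0.0]))  # no neuron fires: plain expected utility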

SLIDE 26

NEU and the Certainty Effect

SLIDE 27

NEU and Reference Dependence

Kahneman and Tversky (1979): prizes are evaluated relative to a reference point; people treat gains and losses differently.
Ert and Erev (2013): the difference becomes insignificant when prizes don’t deviate much from the reference point.
• A neuron for expected utility: V(p)
• A neuron for loss aversion relative to $x with threshold ε: λ · min{∑_i p_i min{z_i − x, 0}, −ε}, where the inner sum is affine in p and the loss-aversion coefficient is λ > 1
• U(p) is the difference between the two neurons’ values
• Violations of expected utility theory in the form of loss aversion only occur when losses (relative to the reference point) are significant
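A sketch of this loss-aversion neuron (all numbers are illustrative, and the sign placement of the threshold is one plausible reading of the slide):

    import numpy as np

    def loss_neuron(p, z, ref=0.0, lam=2.25, eps=0.1):
        # Expected loss relative to the reference point (nonpositive,
        # affine in p), scaled by the loss-aversion coefficient lam > 1.
        # The min with -eps keeps the neuron flat (a constant) until
        # expected losses exceed eps in magnitude.
        expected_loss = np.asarray(p) @ np.minimum(np.asarray(z) - ref, 0.0)
        return lam * min(expected_loss, -eps)

    z = np.array([-10.0, 0.0, 20.0])   # monetary prizes
    p = np.array([0.3, 0.4, 0.3])      # a lottery over them
    V = p @ z                          # a risk-neutral EU neuron, say
    print(V + loss_neuron(p, z))       # EU penalized for significant losses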
SLIDE 28

Empirical Analysis

• Can the NEU model explain and predict decision makers’ choice behavior well?
• Can we do so with a NEU model that is not too complicated to interpret?
• How does the NEU model compare to other economic models?

SLIDE 29

Data

• Plonsky, Apel, Erev, Ert, and Tennenholtz: Choice Prediction Competition 2018
• Experimental data of individual binary choices between monetary gambles
  • Over half a million individual binary-choice data points
  • 270 different binary choice problems in total; each participant answers 30, with each problem repeated 25 times
  • The first 30 problems replicate 14 well-known behavioral findings
• We rule out problems involving ambiguity or correlated realizations
• CPC training data: 210 problems (169 relevant for us)
• CPC testing data: the remaining 60 problems (45 relevant for us)

SLIDE 30

A Typical Binary Choice Problem

Compared to other experiments, the lotteries here are more generic

SLIDE 31

Expected Utility (EU) Benchmark

• EU: U(p) = ∑ p_i u(z_i) with an arbitrary u
• CARA: u(z) = (1 − exp{−az})/a
  • Some prizes are negative, so we choose CARA rather than CRRA
• For every binary choice problem, record the fraction of participants choosing each lottery
• Discrete choice: U(p) + ε_p with the ε_p’s following an i.i.d. Gumbel distribution
• Evaluation measure: mean squared error × 100

             training error    testing error
    EU            1.07             19.74
    CARA          2.28              1.98
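A sketch of this benchmark (parameter values are placeholders): adding i.i.d. Gumbel errors to CARA expected utilities yields the binary logit choice probability, which is compared with the observed choice fraction via MSE × 100:

    import numpy as np

    def cara(z, a=0.01):
        # CARA utility u(z) = (1 - exp(-a z)) / a.
        return (1.0 - np.exp(-a * np.asarray(z))) / a

    def eu(p, z, a=0.01):
        # Expected CARA utility of lottery p over prizes z.
        return np.asarray(p) @ cara(z, a)

    def prob_choose_A(pA, pB, z, a=0.01):
        # U + i.i.d. Gumbel noise implies the binary logit formula.
        return 1.0 / (1.0 + np.exp(-(eu(pA, z, a) - eu(pB, z, a))))

    # Hypothetical problem: A = 60% $2000 vs. B = 45% $2500 (else $0).
    z = [0.0, 2000.0, 2500.0]
    pA, pB = [0.4, 0.6, 0.0], [0.55, 0.0, 0.45]
    observed = 0.55                  # made-up fraction choosing A
    err = 100 * (prob_choose_A(pA, pB, z) - observed) ** 2
    print(err)                       # squared error x 100 for this problem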

SLIDE 32

Overfitting of EU

SLIDE 33

Cumulative Prospect Theory Benchmark

Parameters to be estimated:

1. Reference point
2. Gain region: value function’s concavity + weighting function
3. Loss region: value function’s convexity + weighting function + loss-aversion coefficient

• Risk-neutral EU is a special case, but the other CARA EUs are not
• Overfitting is taken care of by not considering fully general value and weighting functions

                           training error    testing error
    CPT                     2.255 (0.022)    1.996 (0.099)
    TK’s estimated CPT      4.159 (0.004)    3.686 (0.010)
    Risk-neutral EU         2.512 (0.000)    2.793 (0.000)

SLIDE 34

How Do We Parametrize NEU?

• Recall that the first hidden layer’s affine functions are EU functions
• Would requiring the first hidden layer to consist of CARA EU functions help?

SLIDE 35

CARA NEU

• CARA EU testing error = 1.98 (std. dev. ≈ 0)
• CARA NEU: a NEU whose first layer consists of CARA EU functions
  • The first hidden layer’s width: 15, 20, or 25
  • The number of hidden layers above the first: 0, 1, or 2
  • The width of the hidden layers above the first: 15, 20, or 25
• The best testing error across these NEU functions is 1.97

SLIDE 36

The Problem with CARA NEU

• CARA NEU mitigates overfitting but destroys too much of NEU’s flexibility
• What useful flexibility is removed?
  • E.g., neurons that capture the certainty effect and reference dependence
• We could consider other behavioral neurons and select the most useful ones using cross-validation, but these two (the most important ingredients of CPT) are sufficient for our purpose

SLIDE 37

Behavioral NEU

We require t_1 to consist of some (or all) of the following three types of “behavioral neurons” introduced previously:

1. A neuron that is a CARA EU function
   • We can allow for multiple CARA EU functions, but that seems unnecessary
2. Neurons that capture reference dependence
   • We allow for multiple reference points and loss-aversion coefficients
3. Neurons that capture the certainty effect
   • We allow for two thresholds

• We could consider neurons that capture other behavioral models, but these three suffice to illustrate our point
Then t_1 is concatenated with an otherwise standard NEU function (see the sketch below).
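A schematic sketch of this architecture, reusing the eu, loss_neuron, and certainty_neurons helpers from the earlier sketches; the reference points, thresholds, and layer widths below are placeholders rather than the paper’s choices:

    import numpy as np

    def behavioral_t1(p, z):
        # First layer t_1: one CARA EU neuron, reference-dependence
        # neurons for two hypothetical reference points, and one
        # certainty-effect neuron per prize.
        cara_eu = eu(p, z, a=0.01)
        rd = [loss_neuron(p, z, ref=r) for r in (0.0, 10.0)]
        ce = list(certainty_neurons(p, c=0.9))
        return np.array([cara_eu, *rd, *ce])

    def behavioral_neu(p, z, W, b, w_out, b_out):
        # One standard ReLU hidden layer on top of the behavioral layer.
        h = np.maximum(W @ behavioral_t1(p, z) + b, 0.0)
        return w_out @ h + b_out

    rng = np.random.default_rng(0)
    W, b = rng.normal(size=(5, 6)), np.zeros(5)   # widths are placeholders
    w_out, b_out = rng.normal(size=5), 0.0
    print(behavioral_neu([0.3, 0.3, 0.4], [-10.0, 0.0, 20.0],
                         W, b, w_out, b_out))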

SLIDE 38

Do We Need a Complex Neural Network?

• CARA EU testing error = 1.98 (with std. dev. ≈ 0)

                        CARA + RD        CARA + CE        CARA + RD + CE
    1 Hidden Layer      1.971 (0.113)    2.009 (0.006)    1.966 (0.133)
    2 Hidden Layers     1.850 (0.176)    2.030 (0.022)    1.748 (0.221)
    >2 Hidden Layers    2.217 (0.339)    2.130 (0.038)    1.879 (0.315)

• RD alone or CE alone is not sufficiently helpful
• Having both RD and CE but too few or too many layers is also not sufficiently helpful
• Compared to CARA, CARA + RD + CE decreases the training error by 19% and the testing error by 12%

SLIDE 39

Summary

1. Overfitting and predictive power: more general models may “explain” more, but if a model does not predict well, its explanatory power may come from overfitting
2. Domain knowledge (decision theory, behavioral economics, etc.) is useful for improving a machine-learning model’s performance when we do not have lots of data
3. Reasonably complex NEU models with natural interpretations explain and predict better than (i) too-simple or too-complex ones and (ii) EU and CPT

SLIDE 40

Final Remarks

• When the dataset contains more (generic) choice problems, the NEU model will become even more useful: it may help us identify behavioral biases that are unknown to us
• Endogenous selection (e.g., via lasso or ridge) of behavioral neurons
• Similar to mixed logit vs. logit, could we develop a method to estimate mixed NEU/random NEU models?

SLIDE 41

Violations of (Bi-)Independence: Common Ratio Effect

First Pair:
    p: 60% $2000, 40% $0
    q: 45% $2500, 55% $0
Second Pair:
    0.2pδ_0: 12% $2000, 88% $0
    0.2qδ_0: 9% $2500, 91% $0

1. Focal feature: higher winning probability vs. better prize
2. The lotteries must be sufficiently different to switch the focal feature
SLIDE 42

Violations of (Bi-)Independence: Common Ratio Effect

First Pair:
    p: 60% $2000, 40% $0
    q: 45% $2500, 55% $0
Second Pair:
    (14/15)pδ_0: 56% $2000, 44% $0
    (14/15)qδ_0: 42% $2500, 58% $0

• It seems much less likely that we would observe significant violations of (Bi-)Independence here
• The focal feature does not change significantly

SLIDE 43

Necessity of Parametrization

Take EU as an example:
• Some prizes in the testing dataset never appear in the training dataset
• If we exclude binary choice problems that involve prizes that never show up in the training dataset, EU’s testing error is 17.82
• The problem is mainly overfitting
We need to parametrize the models that we estimate, since the dataset is not large enough.

SLIDE 44

Potential Issues of the Data

• Random incentive mechanism under non-expected utility theory
• How the show-up fee is determined is not explained
  • The show-up fee secretly depends on which binary choice is randomly selected to determine payment
  • If a participant understood that she would not walk out of the lab losing money, and noticed that there is no fixed show-up fee, she may realize that the show-up fee depends on the selected choice problem, and hence may view negative prizes differently
• The number of different binary choice problems is small

SLIDE 45

Training and Hyperparameter Selection

• Discrete choice: U(p) + ε_p with the ε_p’s following an i.i.d. Gumbel distribution
  • Can be axiomatized easily based on our axioms
• Adaptive moment estimation (Adam)
• L2-norm regularization
• Hyperparameter optimization (mainly for width and depth)
  • Leave-one-out cross-validation
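A minimal training sketch consistent with this setup (assuming PyTorch; the architecture, learning rate, and regularization strength are placeholders, not the paper’s settings). Note that Adam’s weight_decay argument implements the L2-norm regularization:

    import torch
    import torch.nn as nn

    # Placeholder NEU: lottery (n = 3 probabilities) -> scalar utility.
    model = nn.Sequential(nn.Linear(3, 20), nn.ReLU(), nn.Linear(20, 1))

    # weight_decay adds the L2 penalty to the Adam update.
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

    def train_step(pA, pB, observed_freq):
        # One step: logit choice probability from U(A) - U(B), with an
        # MSE loss against the observed fraction choosing A.
        opt.zero_grad()
        prob_A = torch.sigmoid(model(pA) - model(pB)).squeeze(-1)
        loss = ((prob_A - observed_freq) ** 2).mean()
        loss.backward()
        opt.step()
        return loss.item()

    pA = torch.tensor([[0.4, 0.6, 0.0]])
    pB = torch.tensor([[0.55, 0.0, 0.45]])
    freq = torch.tensor([0.55])          # made-up choice fraction
    for _ in range(100):
        train_step(pA, pB, freq)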