SLIDE 1
Behavioral Neural Networks
Shaowei Ke (UMich Econ), Chen Zhao (HKU Econ), Zhaoran Wang (Northwestern IEMS), Sung-Lin Hsieh (UMich Econ)
November 2020
SLIDE 2
Machine Learning
Over the last 15 years, machine-learning models have performed well in many domains
SLIDE 3
Questions
- A statistical model that predicts well is not necessarily a good model of how people make decisions
  - E.g., insights about decision making may be lost in the approximation
- But perhaps some of these useful machine-learning models are indeed good models of how people make decisions?
  - If so, we may better understand them and incorporate them into economics
  - With more choice data, it is highly likely that such machine-learning models would outperform our traditional models in terms of prediction, and they may even help us identify behavioral phenomena
SLIDE 4
A Good Model of Decision-Making?
E.g. the expected utility model
- 1. The model is characterized by reasonable axioms imposed directly on choice behavior
- 2. The model provides a plausible interpretation/story of how people make choices
SLIDE 5
This Paper
- 1. We provide an axiomatic foundation for a class of neural-network models applied to decision making under risk, called neural-network expected utility (NEU) models
  - The independence axiom is relaxed in a novel way consistent with experimental findings
  - The model provides a plausible interpretation of people's choice behavior
- 2. We show that simple neural-network structures, referred to as behavioral neurons, can capture behavioral biases intuitively
- 3. Using these behavioral neurons, we find that some simple NEU models that are easy to interpret perform better than EU and CPT
SLIDE 6
Neural-Network Expected Utility
SLIDE 7
Choice Domain and Primitive
Prizes: Z = {z_1, …, z_n}
- Generic prizes: x, y, z
The set of lotteries: L = {p ∈ ℝ^n_+ : ∑_{i=1}^n p_i = 1}
- Generic lotteries: p, q, r, s
- Degenerate lotteries: δ_x
Mixture: for any λ ∈ [0, 1], λp + (1 − λ)q is the lottery such that (λp + (1 − λ)q)_i = λp_i + (1 − λ)q_i
- Shorthand: λpq := λp + (1 − λ)q
A decision maker has a binary relation (preference) ⪰ on L
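As a concrete illustration (ours, not from the slides), the mixture operation λpq can be computed componentwise on probability vectors:

```python
import numpy as np

def mix(lam, p, q):
    """Return the mixture lam*p + (1-lam)*q of two lotteries (the slides' lam-pq)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return lam * p + (1 - lam) * q

# Two lotteries over prizes (z_1, z_2, z_3)
p = np.array([1.0, 0.0, 0.0])   # degenerate lottery delta_{z_1}
q = np.array([0.2, 0.3, 0.5])
r = mix(0.5, p, q)              # componentwise: [0.6, 0.15, 0.25]
print(r, r.sum())               # probabilities still sum to 1
```

Mixtures of lotteries are again lotteries, which is why L is convex.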
SLIDE 8
Vector-Valued Affine Function
- τ : ℝ^w → ℝ^w̃ is affine if there exist a w̃ × w matrix b and a w̃ × 1 vector g such that τ(a) = ba + g for any a ∈ ℝ^w
- τ = (τ^(1), …, τ^(w̃)) is affine ⇒ the τ^(j)'s are affine
- A real-valued function on L is affine if and only if it is an expected utility function
SLIDE 9
NEU Representation
A function U : L → ℝ is an NEU function if there exist
- h, w_0, w_1, …, w_{h+1} ∈ ℕ with w_0 = n and w_{h+1} = 1
- θ_i : ℝ^{w_i} → ℝ^{w_i}, i = 1, …, h, such that for any b ∈ ℝ^{w_i}, θ_i(b) = (max{b_1, 0}, …, max{b_{w_i}, 0})
- affine τ_i : ℝ^{w_{i−1}} → ℝ^{w_i}, i = 1, …, h + 1, such that
U(p) = τ_{h+1} ∘ θ_h ∘ τ_h ∘ ⋯ ∘ θ_2 ∘ τ_2 ∘ θ_1 ∘ τ_1(p)
We say that ⪰ has an NEU representation if it can be represented by an NEU function
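A minimal numerical sketch of this definition (the weights below are arbitrary illustrative choices, not estimates from the paper): each hidden layer applies an affine map τ_i followed by the componentwise ReLU θ_i, and the final affine map τ_{h+1} produces U(p).

```python
import numpy as np

def neu(p, layers):
    """Evaluate an NEU function. layers = [(b_1, g_1), ..., (b_{h+1}, g_{h+1})],
    where tau_i(a) = b_i @ a + g_i; the ReLU theta_i follows every tau_i
    except the last."""
    a = np.asarray(p, dtype=float)
    for b, g in layers[:-1]:
        a = np.maximum(b @ a + g, 0.0)   # theta_i composed with tau_i
    b, g = layers[-1]
    return (b @ a + g).item()            # tau_{h+1}: final affine map, width 1

# One hidden layer of width 2 over lotteries on n = 3 prizes (toy weights)
layers = [(np.array([[1.0, 0.5, 0.0],
                     [0.0, 0.5, 1.0]]), np.array([0.0, -0.3])),
          (np.array([[1.0, 2.0]]), np.array([0.1]))]
print(neu([0.2, 0.3, 0.5], layers))      # 1.15
```

Note that U is affine within each region where the pattern of active neurons is fixed, which is the piecewise-linearity behind the main theorem.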
SLIDE 10
NEU Representation
[Network diagram: inputs p_1, p_2, p_3 feed first-hidden-layer neurons max{τ_1^(1)(·), 0} and max{τ_1^(2)(·), 0}, which feed second-hidden-layer neurons max{τ_2^(1)(·), 0} and max{τ_2^(2)(·), 0}, which produce U(p).]
- An NEU function: U(p) = τ_3 ∘ θ_2 ∘ τ_2 ∘ θ_1 ∘ τ_1(p)
- ith hidden layer: θ_i ∘ τ_i
- Activation function: max{·, 0}
- Neuron: max{τ_i^(j)(·), 0}
SLIDE 11
Interpretation
[Same network diagram as on the previous slide.]
- The decision maker has multiple considerations toward uncertainty (expected utility functions in the first layer)
  - E.g., one for the mean of prizes and one for downside risk
- She considers multiple ways of aggregating those attitudes plausible (affine functions in the second layer)
- Recursively, she may continue to have multiple ways in mind to aggregate the aggregations from the previous layer
SLIDE 12
Axiomatic Characterization
SLIDE 13
Expected Utility Theory
Axiom (Weak Order). ⪰ is complete and transitive.
Axiom (Continuity). For any p, {q : p ⪰ q} and {q : q ⪰ p} are closed.
Axiom (Independence). For any λ ∈ (0, 1), p ⪰ q ⇒ λpr ⪰ λqr and p ≻ q ⇒ λpr ≻ λqr.
- There are alternative ways to define independence
Axiom (Bi-Independence). For any λ ∈ (0, 1), if p ⪰ q, then r ⪰ s ⇒ λpr ⪰ λqs and r ≻ s ⇒ λpr ≻ λqs.
- Letting p = q: Bi-Independence ⇒ Independence
- Applying Independence twice yields Bi-Independence
SLIDE 14
Violations of (Bi-)Independence: The Allais Paradox
First Pair:  100% $1M (= 0.13pr)   vs   87% $1M, 10% $1.5M, 3% $0 (= 0.13qr)
Second Pair: 13% $1M, 87% $0 (= 0.13ps)   vs   10% $1.5M, 90% $0 (= 0.13qs)
- p = δ_$1M, q = (3/13)δ_$0 + (10/13)δ_$1.5M, r = δ_$1M, s = δ_$0
- 1. Bias toward certainty
- 2. 0.13qr must look sufficiently different from a risk-free lottery
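As a sanity check (our own, not from the slides), the four mixtures 0.13pr, 0.13qr, 0.13ps, 0.13qs can be verified to reproduce the four lotteries in the table, using probability vectors over the prizes ($0, $1M, $1.5M):

```python
import numpy as np

def mix(lam, a, b):
    """Mixture lam*a + (1-lam)*b of two lotteries."""
    return lam * np.asarray(a, float) + (1 - lam) * np.asarray(b, float)

# Probability vectors over prizes ($0, $1M, $1.5M)
p = np.array([0.0, 1.0, 0.0])      # delta_{$1M}
q = np.array([3/13, 0.0, 10/13])
r = np.array([0.0, 1.0, 0.0])      # delta_{$1M}
s = np.array([1.0, 0.0, 0.0])      # delta_{$0}

print(mix(0.13, p, r))   # -> 100% $1M
print(mix(0.13, q, r))   # -> 3% $0, 87% $1M, 10% $1.5M
print(mix(0.13, p, s))   # -> 87% $0, 13% $1M
print(mix(0.13, q, s))   # -> 90% $0, 10% $1.5M
```

So the Allais pairs are exactly common-consequence mixtures: the two pairs differ only in whether the 0.87 weight is placed on r = δ_$1M or on s = δ_$0.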
SLIDE 15
The Allais Paradox in a Nutshell (Literally)
First Pair:  100% $1M (= 0.013pr)   vs   98.7% $1M, 1% $1.5M, 0.3% $0.5M (= 0.013q′r)
Second Pair: 1.3% $1M, 98.7% $0.5M (= 0.013ps′)   vs   1% $1.5M, 99% $0.5M (= 0.013q′s′)
- p = δ_$1M, q′ = (3/13)δ_$0.5M + (10/13)δ_$1.5M, r = δ_$1M, s′ = δ_$0.5M
- It seems much less likely that we would observe significant violations of (Bi-)Independence here
SLIDE 16
Violations of (Bi-)Independence
The difference between lotteries needs to be large enough for psychological effects to apply to the lotteries asymmetrically
- We want to stick to (Bi-)Independence as much as possible because of its normative appeal
- But if (Bi-)Independence holds locally everywhere, it holds globally
- Is there a (slightly) relaxed version of (Bi-)Independence that can hold locally everywhere but not globally?
SLIDE 17
Relaxing Independence
A subset L̄ preserves independence with respect to p (written L̄ ⊥ p) if for any q, r ∈ L̄ and λ ∈ (0, 1), q ⪰ r ⇒ λpq ⪰ λpr and q ≻ r ⇒ λpq ≻ λpr
- L̄ may not be convex, and p, λpq, λpr may be outside L̄
SLIDE 18
Relaxing Independence
- A subset L̄ preserves independence with respect to p if for any q, r ∈ L̄ and λ ∈ (0, 1), q ⪰ r ⇒ λpq ⪰ λpr and q ≻ r ⇒ λpq ≻ λpr
- A subset L̄ ⊆ L preserves independence if for any p, q, r ∈ L̄ and λ ∈ (0, 1) such that λpr, λqr ∈ L̄, p ⪰ q ⇒ λpr ⪰ λqr and p ≻ q ⇒ λpr ≻ λqr
SLIDE 19
Relaxing Independence
- A neighborhood of p: an open convex set that contains p
Axiom (Weak Local Independence). Every p ∈ L has a neighborhood L_p such that L_p ⊥ p.
- Weak Local Independence does not mean that "independence holds locally around every p"
SLIDE 20
Weak Local Independence
Axiom (Weak Local Independence). Every p ∈ L has a neighborhood L_p such that L_p ⊥ p.
- Allows the following type of indifference curves
SLIDE 21
Relaxing Bi-Independence
- Weak Local Independence only tells us about the decision maker's local choice behavior
- Local versions of Bi-Independence can regulate the decision maker's non-local choice behavior
SLIDE 22
Relaxing Bi-Independence
Axiom (Weak Local Bi-Independence). If p ⪰ q, then p and q have neighborhoods L_p and L_q such that for any r ∈ L_p, s ∈ L_q, and λ ∈ (0, 1), r ⪰ s ⇒ λpr ⪰ λqs and r ≻ s ⇒ λpr ≻ λqs.
- When p = q, we obtain Weak Local Independence
- Bi-independence is imposed only when mixing with p and q respectively
- L_p does not have to be the same for different q's
[Figure: p and q with neighborhoods containing r and s, and the mixtures λpr and λqs.]
SLIDE 23
Main Theorem
Theorem. ⪰ has an NEU representation if and only if it satisfies Weak Order, Continuity, and Weak Local Bi-Independence.
- EU characterizes linear functions on L
- NEU characterizes continuous finite piecewise linear functions on L
SLIDE 24
Behavioral Neurons and Empirical Analysis
SLIDE 25
NEU and the Certainty Effect
The Allais paradox: the decision maker has a bias toward certainty
[Network diagram: inputs p_1, p_2, p_3 feed an expected utility neuron V(p) and certainty neurons max{p_1 − 0.9, 0}, max{p_2 − 0.9, 0}, max{p_3 − 0.9, 0}, which are aggregated into U(p).]
- V is an expected utility function
- If some p_i > 0.9, a neuron that captures the certainty effect is activated
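A numerical sketch of this structure (the prize utilities and the aggregation weight κ below are our own illustrative choices): the certainty neurons add a bonus to U(p) only when some p_i crosses the 0.9 threshold.

```python
import numpy as np

def certainty_neu(p, u, kappa=1.0):
    """EU neuron V(p) plus certainty-effect neurons max{p_i - 0.9, 0}.
    u: vector of prize utilities; kappa: illustrative aggregation weight."""
    p = np.asarray(p, float)
    V = p @ u                                    # expected utility neuron
    certainty = np.maximum(p - 0.9, 0.0).sum()   # fires only for near-certain prizes
    return V + kappa * certainty

u = np.array([0.0, 1.0, 1.4])                # toy utilities for three prizes
print(certainty_neu([0.0, 1.0, 0.0], u))     # sure prize: 1.0 + 0.1 = 1.1
print(certainty_neu([0.2, 0.5, 0.3], u))     # no neuron above 0.9 fires: 0.92
```

For lotteries with all p_i below 0.9, the certainty neurons are silent and the model reduces to plain EU, which is how (Bi-)Independence can hold locally but fail near certainty.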
SLIDE 26
NEU and the Certainty Effect
SLIDE 27
NEU and Reference Dependence
Kahneman and Tversky (1979): prizes are evaluated relative to a reference point; people treat gains and losses differently
Ert and Erev (2013): the difference becomes insignificant when prizes don't deviate from the reference point by much
- A neuron about expected utility: V(p)
- A neuron about loss aversion relative to $x with a threshold ε: λ · min{∑_i p_i · min{z_i − x, 0}, ε}, where the sum is affine in p and λ > 1 is the loss-aversion coefficient
- U(p) is the difference between the two neurons' values
- Violations of expected utility theory in the form of loss aversion only occur when losses (relative to the reference point) are significant
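A numeric sketch of the loss-aversion neuron. The thresholding below is one reading of the slide (losses relative to the reference affect utility only once they exceed ε); the values λ = 2.25, x = 0, ε = 0.1 are our own illustrative assumptions.

```python
import numpy as np

def loss_neuron(p, z, x=0.0, lam=2.25, eps=0.1):
    """Loss-aversion neuron relative to reference x with threshold eps.
    One reading of the slide: losses matter only once they exceed eps."""
    p, z = np.asarray(p, float), np.asarray(z, float)
    expected_loss = p @ np.minimum(z - x, 0.0)   # affine in p, always <= 0
    return lam * min(expected_loss + eps, 0.0)   # 0 unless the loss exceeds eps

z = np.array([-1.0, 0.0, 2.0])               # prizes; reference point x = 0
print(loss_neuron([0.05, 0.55, 0.40], z))    # small expected loss (0.05): neuron = 0
print(loss_neuron([0.50, 0.10, 0.40], z))    # large expected loss (0.50): negative
```

This matches the qualitative claim on the slide: for lotteries whose losses stay within ε of the reference point, the neuron is inactive and the model behaves like plain EU.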
SLIDE 28
Empirical Analysis
- Can the NEU model explain and predict decision makers' choice behavior well?
- Can we do so with an NEU model that is not too complicated to interpret?
- How does the NEU model compare to other economic models?
SLIDE 29
Data
- Plonsky, Apel, Erev, Ert, and Tennenholtz: Choice Prediction Competition 2018
- Experimental data of individual binary choices between monetary gambles
- Over half a million individual binary-choice data points
- There are 270 different binary choice problems in total; each participant answers 30, with each problem repeated 25 times
- The first 30 problems replicate 14 well-known behavioral findings
- We rule out problems involving ambiguity or correlated realizations
- CPC training data: 210 problems (169 relevant for us)
- CPC testing data: the remaining 60 problems (45 relevant for us)
SLIDE 30
A Typical Binary Choice Problem
Compared to other experiments, the lotteries here are more generic
SLIDE 31
Expected Utility (EU) Benchmark
- EU: U(p) = ∑ p_i u(z_i) with an arbitrary u
- CARA: u(z) = (1 − exp{−az})/a
  - Some prizes are negative, so we choose CARA rather than CRRA
- For every binary choice problem, record the fraction of participants choosing each lottery
- Discrete choice: U(p) + ε_p with the ε_p's i.i.d. Gumbel
- Evaluation measure: mean squared error × 100

        training error   testing error
EU      1.07             19.74
CARA    2.28             1.98
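The estimation setup can be sketched as follows (a minimal illustration, not the paper's actual estimation code; the risk-aversion parameter a = 0.5 and the observed fraction are made up). With i.i.d. Gumbel errors, the probability of choosing lottery p over q is the logit of the utility difference:

```python
import numpy as np

def cara_u(z, a=0.5):
    """CARA utility u(z) = (1 - exp(-a z)) / a; well-defined for negative prizes."""
    return (1.0 - np.exp(-a * np.asarray(z, float))) / a

def choice_prob(p, q, z, a=0.5):
    """P(choose p over q) under U + i.i.d. Gumbel noise: the standard logit formula."""
    Up = np.asarray(p, float) @ cara_u(z, a)
    Uq = np.asarray(q, float) @ cara_u(z, a)
    return 1.0 / (1.0 + np.exp(Uq - Up))

z = np.array([0.0, 10.0, 20.0])        # prizes
p = np.array([0.0, 1.0, 0.0])          # $10 for sure
q = np.array([0.5, 0.0, 0.5])          # 50-50 between $0 and $20
print(choice_prob(p, q, z))            # > 0.5: CARA is risk-averse

# Squared error x 100 against a hypothetical observed choice fraction
obs = 0.7
print(100 * (choice_prob(p, q, z) - obs) ** 2)
```

Fitting a then amounts to minimizing the mean of these squared errors over the training problems, which is the evaluation measure reported in the table.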
SLIDE 32
Overfitting of EU
SLIDE 33
Cumulative Prospect Theory Benchmark
- Parameters to be estimated:
- 1. Reference point
- 2. Gain region: value function's concavity + weighting function
- 3. Loss region: value function's convexity + weighting function + loss-aversion coefficient
- Risk-neutral EU is a special case, but the other CARA EUs are not
- Overfitting is mitigated by not considering fully general value and weighting functions

                     training error   testing error
CPT                  2.255 (0.022)    1.996 (0.099)
TK's estimated CPT   4.159 (0.004)    3.686 (0.010)
Risk-neutral EU      2.512 (0.000)    2.793 (0.000)
SLIDE 34
How Do We Parametrize NEU?
- Recall that the first hidden layer's affine functions are EU functions
- Requiring that the first hidden layer consist of CARA EU functions may help?
SLIDE 35
CARA NEU
- CARA EU testing error = 1.98 (std. dev. ≈ 0)
- CARA NEU: an NEU whose first layer consists of CARA EU functions
  - The first hidden layer's width: 15, 20, or 25
  - The number of hidden layers above the first: 0, 1, or 2
  - The width of the hidden layers above the first: 15, 20, or 25
- The best testing error among these NEU functions is 1.97
SLIDE 36
The Problem with CARA NEU
- CARA NEU mitigates overfitting but destroys too much of NEU's flexibility
- What useful flexibility is removed?
  - E.g., neurons that capture the certainty effect and reference dependence
- We could consider other behavioral neurons and select the most useful ones by cross-validation, but these two (the most important ingredients of CPT) are sufficient for our purpose
SLIDE 37
Behavioral NEU
We require τ_1 to consist of some (or all) of the following three types of "behavioral neurons" introduced previously
- 1. A neuron that is a CARA EU function
  - We could allow multiple CARA EU functions, but it seems unnecessary
- 2. Neurons that capture reference dependence
  - We allow multiple reference points and loss-aversion coefficients
- 3. Neurons that capture the certainty effect
  - We allow two thresholds
- We could consider neurons that capture other behavioral models, but these three are sufficient to illustrate our point
Then, τ_1 is concatenated with an otherwise standard NEU function
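Putting the pieces together, the first layer of a behavioral NEU can be sketched as a concatenation of the three neuron types (all parameter values and aggregation weights below are our own illustrative assumptions):

```python
import numpy as np

def behavioral_first_layer(p, z, a=0.5, lam=2.25, x=0.0, eps=0.1):
    """Concatenate one CARA EU neuron, one reference-dependence neuron,
    and certainty-effect neurons (one per prize). All weights illustrative."""
    p, z = np.asarray(p, float), np.asarray(z, float)
    cara = p @ ((1.0 - np.exp(-a * z)) / a)                 # CARA EU neuron
    rd = lam * min(p @ np.minimum(z - x, 0.0) + eps, 0.0)   # loss beyond eps
    ce = np.maximum(p - 0.9, 0.0)                           # certainty neurons
    return np.concatenate([[cara, rd], ce])

# Feed the first layer into a single affine output layer (tau_2), e.g.:
z = np.array([-1.0, 10.0, 20.0])              # prizes; reference point x = 0
h = behavioral_first_layer([0.0, 1.0, 0.0], z)
w = np.array([1.0, 1.0, 0.0, 0.5, 0.0])       # toy aggregation weights
print(w @ h)
```

In the paper's terminology, deeper variants simply stack further affine-plus-ReLU layers on top of this behavioral first layer.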
SLIDE 38
Do We Need a Complex Neural Network?
- CARA EU testing error = 1.98 (std. dev. ≈ 0)

                   CARA + RD        CARA + CE        CARA + RD + CE
1 Hidden Layer     1.971 (0.113)    2.009 (0.006)    1.966 (0.133)
2 Hidden Layers    1.850 (0.176)    2.030 (0.022)    1.748 (0.221)
>2 Hidden Layers   2.217 (0.339)    2.130 (0.038)    1.879 (0.315)

- RD alone or CE alone is not sufficiently helpful
- Having both RD and CE but too few or too many layers is also not sufficiently helpful
- Compared to CARA, CARA+RD+CE decreases the training error by 19% and the testing error by 12%
SLIDE 39
Summary
- 1. Overfitting and predictive power: more general models may "explain" more, but if a model does not predict well, its explanatory power may come from overfitting
- 2. Domain knowledge (decision theory, behavioral economics, etc.) is useful in improving machine-learning models' performance when we do not have lots of data
- 3. Reasonably complex NEU models with a natural interpretation explain and predict better than (i) ones that are too simple or too complex and (ii) EU and CPT
SLIDE 40
Final Remarks
- When the dataset contains more (generic) choice problems, the NEU model will become even more useful: it may help us identify behavioral biases that are unknown to us
- Endogenous selection (e.g., via lasso or ridge) of behavioral neurons
- Similar to mixed logit vs. logit, could we develop a method to estimate mixed NEU/random NEU models?
SLIDE 41
Violations of (Bi-)Independence: Common Ratio Effect
First Pair:  60% $2000, 40% $0 (= p)   vs   45% $2500, 55% $0 (= q)
Second Pair: 12% $2000, 88% $0 (= 0.2pδ_0)   vs   9% $2500, 91% $0 (= 0.2qδ_0)
- 1. Focal feature: higher winning probability vs. better prize
- 2. The lotteries must be sufficiently different to switch the focal feature
SLIDE 42
Violations of (Bi-)Independence: Common Ratio Effect
First Pair:  60% $2000, 40% $0 (= p)   vs   45% $2500, 55% $0 (= q)
Second Pair: 56% $2000, 44% $0 (= (14/15)pδ_0)   vs   42% $2500, 58% $0 (= (14/15)qδ_0)
- It seems much less likely that we would observe significant violations of (Bi-)Independence
- The focal feature does not change significantly
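The common-ratio mixtures can be verified numerically the same way as the Allais mixtures (our own check), over the prizes ($0, $2000, $2500):

```python
import numpy as np

d0 = np.array([1.0, 0.0, 0.0])                 # delta_{$0}
p  = np.array([0.4, 0.6, 0.0])                 # 60% $2000
q  = np.array([0.55, 0.0, 0.45])               # 45% $2500

mix = lambda lam, a, b: lam * a + (1 - lam) * b
print(mix(0.2, p, d0))      # -> 12% $2000
print(mix(0.2, q, d0))      # -> 9% $2500
print(mix(14/15, p, d0))    # -> 56% $2000
print(mix(14/15, q, d0))    # -> 42% $2500
```

Both second pairs scale p and q by a common ratio toward δ_0, but only the 0.2 scaling is drastic enough to switch the focal feature.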
SLIDE 43
Necessity of Parametrization
Take EU as an example:
- Some prizes in the testing dataset never appear in the training dataset
- If we exclude binary choice problems involving prizes that never show up in the training dataset, EU's testing error is 17.82
- The problem is mainly overfitting
We need to parametrize the models we estimate, since the dataset is not large enough
SLIDE 44
Potential Issues of the Data
- Random incentive mechanism under non-expected utility theory
- How the show-up fee is determined is not explained
  - The show-up fee secretly depends on which binary choice is randomly selected to determine payment
  - If someone understood that she would not walk out of the lab losing money, and noticed that there is no fixed show-up fee, she might realize that the show-up fee depends on the selected choice problem, and hence view negative prizes differently
- The number of distinct binary choice problems is small
SLIDE 45