Cumulative Prospect Theory Meets Reinforcement Learning: Prediction and Control
Prashanth L.A.
Joint work with Cheng Jie, Michael Fu, Steve Marcus and Csaba Szepesvári
University of Maryland, College Park
AI that benefits humans
Utility functions u+, u− : R → R+, with u+(x) = 0 when x ≤ 0 and u−(x) = 0 when x ≥ 0
Weight functions w+, w− : [0, 1] → [0, 1], with w(0) = 0 and w(1) = 1
(a)+ = max(a, 0), (a)− = max(−a, 0)
[Figure: utility curves, with u+ over gains and −u− over losses; horizontal axis: losses/gains, vertical axis: utility]
For losses, the disutility −u− is convex; for gains, the utility u+ is concave.
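The shape conditions above can be illustrated with a small sketch. The slides do not give a closed form for u+ and u−, so the power-law form and the parameter values below (exponent 0.88, loss-aversion factor 2.25, as estimated by Tversky and Kahneman in 1992) are assumptions chosen for illustration only.

```python
def u_plus(x: float, alpha: float = 0.88) -> float:
    """Utility of gains: zero for x <= 0, concave for x > 0 (assumed power form)."""
    return max(x, 0.0) ** alpha


def u_minus(x: float, alpha: float = 0.88, lam: float = 2.25) -> float:
    """Disutility of losses: zero for x >= 0; lam > 1 models loss aversion (assumed)."""
    return lam * max(-x, 0.0) ** alpha


# A gain and a loss of the same size are valued asymmetrically.
for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"x = {x:6.1f}  u+(x) = {u_plus(x):.3f}  u-(x) = {u_minus(x):.3f}")
```

With these assumed forms, both functions are non-negative, u+ vanishes on losses and u− vanishes on gains, matching the requirements stated for the utility functions.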
[Figure: weight function w(p) = p^0.69 / (p^0.69 + (1 − p)^0.69)^(1/0.69); horizontal axis: probability p, vertical axis: weight w(p)]
Overweight low probabilities, underweight high probabilities
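A minimal sketch of the plotted weight function, assuming the exponent 0.69 read off the figure, shows the distortion numerically. The function name `weight` and the parameter name `eta` are hypothetical.

```python
def weight(p: float, eta: float = 0.69) -> float:
    """Probability weight w(p) = p^eta / (p^eta + (1 - p)^eta)^(1/eta)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    numerator = p ** eta
    denominator = (p ** eta + (1.0 - p) ** eta) ** (1.0 / eta)
    return numerator / denominator


# Compare each probability with its distorted weight.
for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(f"p = {p:.2f}  w(p) = {weight(p):.3f}")
```

For small p the printed weight exceeds p, and for large p it falls below p, while the endpoints w(0) = 0 and w(1) = 1 are preserved.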
Kahneman & Tversky (1979), “Prospect Theory: An analysis of decision under risk”, was the second most cited paper in economics during the period 1975-2000.
\[
C(X^\theta) := \int_0^{+\infty} w^+\big(P(u^+(X^\theta) > z)\big)\,dz - \int_0^{+\infty} w^-\big(P(u^-(X^\theta) > z)\big)\,dz
\]
Find \(\theta^* = \arg\max_{\theta \in \Theta} C(X^\theta)\).
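As a sanity check on the CPT-value definition, consider the special case u+(x) = (x)+, u−(x) = (x)− with identity weights w+(p) = w−(p) = p. This special case is chosen here for illustration and assumes X is integrable. Using the tail-integral formula for the expectation of a non-negative random variable, the CPT-value reduces to the ordinary expectation:

```latex
\begin{align*}
C(X) &= \int_0^{+\infty} P\big((X)^+ > z\big)\,dz - \int_0^{+\infty} P\big((X)^- > z\big)\,dz \\
     &= E\big[(X)^+\big] - E\big[(X)^-\big] \\
     &= E[X].
\end{align*}
```

Non-identity weights and utilities therefore act as distortions of this baseline.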
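Empirical distribution functions of the utilities, from samples X_1, …, X_n:
\[
\hat F_n^+(x) = \frac{1}{n} \sum_{i=1}^{n} 1\{u^+(X_i) \le x\}, \qquad
\hat F_n^-(x) = \frac{1}{n} \sum_{i=1}^{n} 1\{u^-(X_i) \le x\}
\]
CPT-value estimate:
\[
\overline{C}_n = \int_0^{+\infty} w^+\big(1 - \hat F_n^+(x)\big)\,dx - \int_0^{+\infty} w^-\big(1 - \hat F_n^-(x)\big)\,dx
\]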
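One way to estimate the CPT-value from n samples is to replace the distribution of each utility with its empirical distribution function and integrate the resulting step function exactly. The sketch below does this in plain Python. The function name `cpt_estimate` and its signature are hypothetical, and the caller supplies the utility and weight functions.

```python
import random


def cpt_estimate(samples, u_plus, u_minus, w_plus, w_minus):
    """Plug-in CPT-value estimate built from empirical distribution functions."""
    n = len(samples)

    def distorted_integral(values, w):
        # Integrate w(1 - EDF(x)) over [0, inf). Between consecutive sorted
        # values the EDF is constant: on [y_(i-1), y_(i)) it equals (i - 1) / n.
        total, previous = 0.0, 0.0
        for i, y in enumerate(sorted(values), start=1):
            total += (y - previous) * w((n - i + 1) / n)
            previous = y
        return total

    gains = distorted_integral([u_plus(x) for x in samples], w_plus)
    losses = distorted_integral([u_minus(x) for x in samples], w_minus)
    return gains - losses


# Usage with undistorted utilities and weights: the estimate is the sample mean.
rng = random.Random(0)
data = [rng.gauss(0.5, 1.0) for _ in range(1000)]
identity = lambda p: p
positive_part = lambda x: max(x, 0.0)
negative_part = lambda x: max(-x, 0.0)
print(cpt_estimate(data, positive_part, negative_part, identity, identity))
print(sum(data) / len(data))
```

Sorting makes each evaluation cost O(n log n). With identity weights and the positive and negative parts as utilities, the two printed values agree up to floating-point error.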
Prediction: given the parameter θ, estimate the CPT-value C(X^θ).
Control: find the parameter θ that maximizes the CPT-value.
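\[
\theta_{n+1}^i = \theta_n^i + \gamma_n \, \frac{\overline{C}_n^{\theta_n + \delta_n \Delta_n} - \overline{C}_n^{\theta_n - \delta_n \Delta_n}}{2 \delta_n \Delta_n^i}
\]
Here \(\gamma_n > 0\) is the step size and \(\overline{C}_n^{\theta}\) is the CPT-value estimate at parameter \(\theta\).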
∆n is a vector of independent Rademacher r.v.s and δn > 0 vanishes asymptotically.
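The perturbation scheme described here can be sketched as a simultaneous-perturbation gradient ascent loop. The `estimate` callable is a hypothetical stand-in for any CPT-value estimator, and the schedules a/n for the step size and c/n^0.25 for δn are assumptions made for this sketch, not values taken from the slides.

```python
import random


def spsa_ascent(estimate, theta0, n_iters=200, a=0.2, c=0.5, seed=0):
    """Gradient ascent using two-sided simultaneous-perturbation estimates."""
    rng = random.Random(seed)
    theta = list(theta0)
    for n in range(1, n_iters + 1):
        gamma_n = a / n          # step size (assumed schedule)
        delta_n = c / n ** 0.25  # perturbation size, vanishes as n grows
        # Rademacher perturbation: each coordinate is +1 or -1 with prob. 1/2.
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]
        plus = [t + delta_n * d for t, d in zip(theta, delta)]
        minus = [t - delta_n * d for t, d in zip(theta, delta)]
        # Two evaluations per iteration, whatever the dimension of theta.
        diff = estimate(plus) - estimate(minus)
        theta = [t + gamma_n * diff / (2.0 * delta_n * d)
                 for t, d in zip(theta, delta)]
    return theta


def toy_objective(theta):
    """Hypothetical smooth objective with its maximum at (1, ..., 1)."""
    return -sum((t - 1.0) ** 2 for t in theta)


print(spsa_ascent(toy_objective, [0.0, 3.0]))
```

The toy objective only exercises the loop. In the CPT setting, `estimate` would return a sample-based CPT-value estimate at the perturbed parameter.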
Simulation optimization: input x → measurement oracle → f(x) + ξ, where the noise ξ is zero mean.
CPT-value optimization: input X, ϵ → CPT estimator → C(X) + ϵ, where the bias ϵ is controlled.
[Diagram: from θn, form the perturbed parameters θn + δn∆n and θn − δn∆n. Prediction: estimate the CPT-value at each perturbed parameter from mn samples. Control: update θn by gradient ascent to obtain θn+1.]
light controller as reference
[Figure: three histograms; horizontal axis: bin, vertical axis: frequency]
… means (no utility/weights), EUT uses utilities but no weights, and CPT uses both.