SLIDE 1

Cumulative Prospect Theory Meets Reinforcement Learning: Prediction and Control

Prashanth L.A.

Joint work with Cheng Jie, Michael Fu, Steve Marcus and Csaba Szepesvári

University of Maryland, College Park

SLIDE 2

AI that benefits humans

Reinforcement learning (RL) setting with rewards evaluated by humans

(Diagram: the Agent interacts with the World; the Reward signal is evaluated through CPT)

Cumulative prospect theory (CPT) captures human preferences


SLIDE 4

CPT-value

For a given r.v. X, the CPT-value C(X) is

C(X) := ∫_0^∞ w+(P(u+(X) > z)) dz − ∫_0^∞ w−(P(u−(X) > z)) dz

  • First integral: gains
  • Second integral: losses

Utility functions u+, u− : R → R+, with u+(x) = 0 when x ≤ 0 and u−(x) = 0 when x ≥ 0
Weight functions w+, w− : [0, 1] → [0, 1], with w(0) = 0 and w(1) = 1

Connection to expected value: C(X) = ∫_0^∞ P(X > z) dz − ∫_0^∞ P(−X > z) dz = E[(X)+] − E[(X)−]

(a)+ = max(a, 0), (a)− = max(−a, 0)
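This identity can be sanity-checked numerically: with identity weights w(p) = p and the utilities u+(x) = max(x, 0), u−(x) = max(−x, 0), the two integrals collapse to E[(X)+] and E[(X)−]. A minimal Monte Carlo sketch (NumPy assumed; the Gaussian distribution is just an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=1.0, size=200_000)

# With identity weights w(p) = p and utilities u+(x) = max(x, 0),
# u-(x) = max(-x, 0), the CPT-value collapses to
# E[(X)+] - E[(X)-], which equals E[X].
gains = np.maximum(x, 0.0).mean()    # estimate of E[(X)+]
losses = np.maximum(-x, 0.0).mean()  # estimate of E[(X)-]
cpt_value = gains - losses

print(cpt_value)  # close to the true mean 0.5, and equal to x.mean()
```

With non-trivial weights the two integrals no longer reduce to expectations of X, which is exactly why CPT estimation needs more than sample means.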

SLIDE 5

Utility and weight functions

Utility functions

(Plot: the utility u+ over gains and the disutility −u− over losses)

For losses, the disutility −u− is convex; for gains, the utility u+ is concave

Weight function

(Plot: weight w(p) = p^0.69 / (p^0.69 + (1 − p)^0.69)^{1/0.69} against the probability p)

Overweight low probabilities, underweight high probabilities
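The plotted weight curve is straightforward to reproduce; a minimal sketch of the one-parameter Tversky–Kahneman form, with the exponent 0.69 taken from the slide:

```python
import numpy as np

def tk_weight(p, eta=0.69):
    """Probability weight from the slide:
    w(p) = p^eta / (p^eta + (1 - p)^eta)^(1/eta).

    Satisfies w(0) = 0 and w(1) = 1, as required of a weight function.
    """
    p = np.asarray(p, dtype=float)
    return p**eta / (p**eta + (1.0 - p)**eta) ** (1.0 / eta)

# Low probabilities are overweighted, high probabilities underweighted:
print(tk_weight(0.01))  # noticeably larger than 0.01
print(tk_weight(0.99))  # noticeably smaller than 0.99
```

The inverse-S shape is the point: rare gains and rare losses are given more weight than their true probability, matching observed human behavior.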

SLIDE 6

Prospect Theory

(Photos: Amos Tversky and Daniel Kahneman)

Kahneman & Tversky (1979), “Prospect Theory: An Analysis of Decision under Risk”, is the second most cited paper in economics over the period 1975–2000

SLIDE 7

Our Contributions

C(Xθ) := ∫_0^∞ w+(P(u+(Xθ) > z)) dz − ∫_0^∞ w−(P(u−(Xθ) > z)) dz

Find θ∗ = arg max_{θ∈Θ} C(Xθ)

  • CPT-value estimation using empirical distribution functions
  • SPSA-based policy gradient algorithm
  • Sample complexity bounds for estimation + asymptotic convergence of the policy gradient algorithm
  • Traffic signal control application
SLIDE 8

CPT-value estimation

Problem: Given samples X1, . . . , Xn of X, estimate

C(X) := ∫_0^∞ w+(P(u+(X) > z)) dz − ∫_0^∞ w−(P(u−(X) > z)) dz

Nice to have: sample complexity O(1/ϵ²) for accuracy ϵ

SLIDE 10

Empirical distribution functions (EDFs): Given samples X1, . . . , Xn of X,

F̂+_n(x) = (1/n) Σ_{i=1}^n 1(u+(Xi) ≤ x), and F̂−_n(x) = (1/n) Σ_{i=1}^n 1(u−(Xi) ≤ x)

Using the EDFs, the CPT-value C(X) is estimated by

Cn = ∫_0^∞ w+(1 − F̂+_n(x)) dx − ∫_0^∞ w−(1 − F̂−_n(x)) dx

  • Part (I): the gains integral (first term)
  • Part (II): the losses integral (second term)

Computing Part (I): Let X[1], X[2], . . . , X[n] denote the order statistics. Then

Part (I) = Σ_{i=1}^n u+(X[i]) ( w+((n + 1 − i)/n) − w+((n − i)/n) )
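The order-statistic form lends itself to a direct implementation. A minimal sketch, in which u_plus, u_minus, w_plus, w_minus are caller-supplied vectorized functions, and the Part (II) formula is the mirrored analogue of Part (I) (an assumption here, since only Part (I) is spelled out above):

```python
import numpy as np

def cpt_estimate(samples, u_plus, u_minus, w_plus, w_minus):
    """EDF-based CPT-value estimate Cn from i.i.d. samples.

    Part (I) follows the order-statistic formula on the slide; Part (II)
    uses mirrored weight increments, since u-(X[i]) is non-increasing
    over the sorted samples X[1] <= ... <= X[n].
    """
    x = np.sort(np.asarray(samples, dtype=float))  # order statistics X[1..n]
    n = len(x)
    i = np.arange(1, n + 1)
    # Part (I): gains, weight increments w+((n+1-i)/n) - w+((n-i)/n)
    part1 = np.sum(u_plus(x) * (w_plus((n + 1 - i) / n) - w_plus((n - i) / n)))
    # Part (II): losses, mirrored increments w-(i/n) - w-((i-1)/n)
    part2 = np.sum(u_minus(x) * (w_minus(i / n) - w_minus((i - 1) / n)))
    return part1 - part2
```

As a sanity check, with identity weights and utilities u+(x) = max(x, 0), u−(x) = max(−x, 0), every weight increment equals 1/n and the estimate reduces to the sample mean, matching the expected-value connection earlier.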

SLIDE 12

(A1). The weights w+, w− are Hölder continuous, i.e., |w+(x) − w+(y)| ≤ H|x − y|^α, ∀x, y ∈ [0, 1]

(A2). The utilities u+(X) and u−(X) are bounded above by M < ∞

Sample Complexity: Under (A1) and (A2), for any ϵ, δ > 0, we have

P(|Cn − C(X)| ≤ ϵ) > 1 − δ, ∀n ≥ ln(1/δ) · 4H²M² / ϵ^{2/α}

Special Case: Lipschitz weights (α = 1) give sample complexity O(1/ϵ²) for accuracy ϵ
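The bound is easy to evaluate for concrete constants; a small sketch (the values of H, M below are illustrative placeholders, not constants from the talk):

```python
import math

def sample_bound(eps, delta, H, M, alpha):
    """Smallest integer n satisfying the slide's bound:
    n >= ln(1/delta) * 4 * H^2 * M^2 / eps^(2/alpha)."""
    return math.ceil(math.log(1.0 / delta) * 4.0 * H**2 * M**2 / eps ** (2.0 / alpha))

# Lipschitz weights (alpha = 1) recover the O(1/eps^2) rate:
n1 = sample_bound(eps=0.1, delta=0.05, H=1.0, M=1.0, alpha=1.0)
n2 = sample_bound(eps=0.05, delta=0.05, H=1.0, M=1.0, alpha=1.0)
print(n1, n2)  # halving eps roughly quadruples the required sample size
```

For α < 1 the exponent 2/α exceeds 2, so less regular weight functions make estimation strictly harder.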

SLIDE 13

CPT-value optimization

Find θ∗ = arg max_{θ∈Θ} C(Xθ)

RL application: θ = policy parameter, Xθ = return

(Diagram: Prediction produces the CPT-value Cθ; Control updates the parameter θ)

Two-stage solution:

  • Inner stage: obtain samples of Xθ and estimate C(Xθ)
  • Outer stage: update θ using gradient ascent

Challenge: ∇iC(Xθ) is not given

SLIDE 14

Update rule:

θi_{n+1} = Γi( θi_n + γn ∇̂iC(X^{θn}) ), i = 1, . . . , d

where Γi is a projection operator, γn are step-sizes, and ∇̂iC is the gradient estimate

Challenge: estimating ∇iC(Xθ) given only biased estimates of C(Xθ)

Solution: use SPSA [Spall ’92]:

∇̂iC(X^{θn}) = ( C_n^{θn+δn∆n} − C_n^{θn−δn∆n} ) / (2 δn ∆i_n)

∆n is a vector of independent Rademacher r.v.s and δn > 0 vanishes asymptotically.
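The estimate needs only two evaluations of the inner-stage estimator, regardless of the dimension d. A minimal sketch, with c_hat standing in for the (possibly biased) CPT-value estimator Cn treated as a black box (names are illustrative):

```python
import numpy as np

def spsa_gradient(c_hat, theta, delta, rng):
    """Simultaneous-perturbation gradient estimate [Spall '92].

    Two evaluations of c_hat suffice for all d coordinates, which is
    what makes SPSA attractive when each evaluation costs m_n samples.
    """
    d = len(theta)
    perturb = rng.choice([-1.0, 1.0], size=d)  # Rademacher perturbation Delta_n
    c_plus = c_hat(theta + delta * perturb)
    c_minus = c_hat(theta - delta * perturb)
    # Coordinate i divides the same two-point difference by 2*delta*Delta_n^i.
    return (c_plus - c_minus) / (2.0 * delta * perturb)

# One gradient-ascent step of the outer stage (projection taken as a no-op):
rng = np.random.default_rng(0)
theta = np.zeros(3)
grad = spsa_gradient(lambda t: -np.sum(t**2), theta, delta=0.1, rng=rng)
theta = theta + 0.01 * grad  # step-size gamma_n = 0.01
```

At the maximizer of the toy objective above the two perturbed evaluations coincide, so the estimate is exactly zero; away from it, the estimate is unbiased up to an O(δn²) perturbation term, which is why δn must vanish asymptotically.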

SLIDE 15

(Diagram, simulation optimization: input x → measurement oracle → f(x) + ξ, with zero-mean noise ξ)

(Diagram, CPT-value optimization: input X, ϵ → CPT estimator → C(X) + ϵ, with controlled bias ϵ)

Figure 1: Overall flow of CPT-SPSA. Perturb θn to θn + δn∆n and θn − δn∆n; using mn samples each, estimate C_n^{θn+δn∆n} and C_n^{θn−δn∆n} (prediction); then update θn by gradient ascent (control) to obtain θn+1.

How to choose mn so the estimation bias can be ignored? Ensure 1 / (m_n^{α/2} δn) → 0

SLIDE 16

Application: Traffic signal control

  • For any path i = 1, . . . , M, let Xi be the delay gain, calculated with a pre-timed traffic light controller as reference
  • CPT captures the road users’ evaluation of the delay gain Xi
  • Goal: Maximize CPT(X1, . . . , XM) = Σ_{i=1}^M μi C(Xi), where μi is the proportion of traffic on path i

SLIDE 17
(Histograms: (a) AVG-SPSA, (b) EUT-SPSA, (c) CPT-SPSA; bins of the CPT-value vs. frequency)

Figure 2: Histogram of the CPT-value of the delay gain. AVG uses plain sample means (no utility/weights), EUT uses utilities but no weights, and CPT uses both.

SLIDE 18

Conclusions

  • Want AI to be beneficial to humans
  • CPT: a very popular paradigm for modeling human decisions
  • We lay the foundations for using CPT in an RL setting
  • Prediction: sample means (TD) won’t work, but empirical distributions do!
  • Control: no Bellman equation, but SPSA can be employed

Future directions:

  • Crowdsourcing experiment to validate CPT online
  • Robustness to unknown utility and weight function parameters
SLIDE 20

Thanks! Questions?