

  1. Cumulative Prospect Theory Meets Reinforcement Learning: Prediction and Control. Prashanth L.A., joint work with Cheng Jie, Michael Fu, Steve Marcus and Csaba Szepesvári. University of Maryland, College Park.

  2. AI that benefits humans: a reinforcement learning (RL) setting in which rewards are evaluated by humans; cumulative prospect theory (CPT) captures human preferences. [Diagram: the Agent interacts with the World; the Reward passes through CPT before reaching the Agent.]

  3. CPT-value. For a given r.v. $X$, the CPT-value $C(X)$ is
$$C(X) := \underbrace{\int_0^{+\infty} w^+\big(P(u^+(X) > z)\big)\,dz}_{\text{Gains}} \;-\; \underbrace{\int_0^{+\infty} w^-\big(P(u^-(X) > z)\big)\,dz}_{\text{Losses}}$$
Utility functions $u^+, u^- : \mathbb{R} \to \mathbb{R}_+$, with $u^+(x) = 0$ when $x \le 0$ and $u^-(x) = 0$ when $x \ge 0$. Weight functions $w^+, w^- : [0,1] \to [0,1]$ with $w(0) = 0$, $w(1) = 1$.

  4. CPT-value: connection to expected value. With identity utilities and weights ($u^+(x) = (x)^+$, $u^-(x) = (x)^-$, $w^\pm(p) = p$), the CPT-value reduces to the usual expectation:
$$C(X) = \int_0^{+\infty} P(X > z)\,dz - \int_0^{+\infty} P(-X > z)\,dz = E[(X)^+] - E[(X)^-] = E[X],$$
where $(a)^+ = \max(a, 0)$ and $(a)^- = \max(-a, 0)$.
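To make the definition concrete, here is a minimal numerical sketch in Python (the function name `cpt_value_mc`, the grid, and the test distribution are illustrative assumptions, not from the slides): it replaces each probability by an empirical frequency, the integrals by Riemann sums, and checks that identity utilities and weights recover $E[X]$.

```python
import numpy as np

def cpt_value_mc(samples, u_plus, u_minus, w_plus, w_minus, z_grid):
    # C(X) = int_0^inf w+(P(u+(X) > z)) dz - int_0^inf w-(P(u-(X) > z)) dz,
    # with each P(.) replaced by an empirical frequency and each integral
    # by a Riemann sum over z_grid.
    dz = z_grid[1] - z_grid[0]
    gains = sum(w_plus((u_plus(samples) > z).mean()) for z in z_grid) * dz
    losses = sum(w_minus((u_minus(samples) > z).mean()) for z in z_grid) * dz
    return gains - losses

# Sanity check: identity utilities/weights should recover E[X] (here ~1.0).
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=20_000)
u_plus = lambda v: np.maximum(v, 0.0)    # u+(x) = (x)^+
u_minus = lambda v: np.maximum(-v, 0.0)  # u-(x) = (x)^-
w_id = lambda p: p                       # w(p) = p
print(cpt_value_mc(x, u_plus, u_minus, w_id, w_id, np.linspace(0.0, 20.0, 401)))
```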

  5. Utility and weight functions. Utility functions: for gains, the utility $u^+$ is concave; for losses, the disutility $-u^-$ is convex. Weight function: overweights low probabilities and underweights high probabilities, e.g.
$$w(p) = \frac{p^{0.69}}{\big(p^{0.69} + (1-p)^{0.69}\big)^{1/0.69}}.$$
[Plots: the S-shaped utility curve ($u^+$ for gains, $-u^-$ for losses) and the inverse-S weight curve $w(p)$ against $p$.]
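A small sketch of the functions plotted on this slide. The weight exponent 0.69 is from the slide; the power-utility exponent 0.88 and the loss-aversion factor 2.25 are the standard Tversky-Kahneman (1992) estimates, used here as assumed illustrative values.

```python
import numpy as np

def tk_weight(p, eta=0.69):
    # Inverse-S weight: overweights low probabilities, underweights high ones,
    # e.g. w(0.01) ~ 0.04 > 0.01 while w(0.99) ~ 0.94 < 0.99.
    return p**eta / (p**eta + (1.0 - p)**eta) ** (1.0 / eta)

def u_plus(x, sigma=0.88):
    # Concave utility on gains; zero for x <= 0.
    return np.where(x > 0, np.abs(x)**sigma, 0.0)

def u_minus(x, sigma=0.88, lam=2.25):
    # u- is positive on losses (so the disutility -u- is convex);
    # lam > 1 encodes loss aversion. Zero for x >= 0.
    return np.where(x < 0, lam * np.abs(x)**sigma, 0.0)
```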

  6. Prospect Theory. Kahneman & Tversky (1979), "Prospect Theory: An Analysis of Decision under Risk", is the second most cited paper in economics over the period 1975-2000. [Photos: Amos Tversky, Daniel Kahneman.]

  7. Our Contributions
• CPT-value estimation using empirical distribution functions
• sample complexity bounds for estimation
• SPSA-based policy gradient algorithm
• asymptotic convergence of policy gradient
• traffic signal control application
Objective: find $\theta^* = \arg\max_{\theta \in \Theta} C(X^\theta)$, where
$$C(X^\theta) := \int_0^{+\infty} w^+\big(P(u^+(X^\theta) > z)\big)\,dz - \int_0^{+\infty} w^-\big(P(u^-(X^\theta) > z)\big)\,dz.$$

  8. CPT-value estimation. Problem: given samples $X_1, \ldots, X_n$ of $X$, estimate
$$C(X) := \int_0^{+\infty} w^+\big(P(u^+(X) > z)\big)\,dz - \int_0^{+\infty} w^-\big(P(u^-(X) > z)\big)\,dz.$$
Nice to have: sample complexity $O(1/\epsilon^2)$ for accuracy $\epsilon$.

  9. Empirical distribution functions (EDFs): given samples $X_1, \ldots, X_n$ of $X$,
$$\hat F_n^+(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\big(u^+(X_i) \le x\big), \qquad \hat F_n^-(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\big(u^-(X_i) \le x\big).$$
Using the EDFs, the CPT-value $C(X)$ is estimated by
$$\overline{C}_n = \underbrace{\int_0^{+\infty} w^+\big(1 - \hat F_n^+(x)\big)\,dx}_{\text{Part (I)}} \;-\; \underbrace{\int_0^{+\infty} w^-\big(1 - \hat F_n^-(x)\big)\,dx}_{\text{Part (II)}}.$$

  10. Computing Part (I): let $X_{[1]}, X_{[2]}, \ldots, X_{[n]}$ denote the order statistics. Then
$$\text{Part (I)} = \sum_{i=1}^n u^+(X_{[i]}) \left( w^+\Big(\tfrac{n+1-i}{n}\Big) - w^+\Big(\tfrac{n-i}{n}\Big) \right),$$
and Part (II) is computed analogously from $u^-$ and $w^-$.
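A minimal sketch of this estimator (the function name `cpt_estimate` is mine): sorting the transformed samples gives the order statistics, and the telescoping weight differences implement the formula above for each part.

```python
import numpy as np

def cpt_estimate(samples, u_plus, u_minus, w_plus, w_minus):
    # Part (I)  = sum_i Y_[i] * (w+((n+1-i)/n) - w+((n-i)/n)),
    # where Y_[1] <= ... <= Y_[n] are the sorted values of u+(X_i);
    # Part (II) is the same sum built from u-(X_i) and w-.
    n = len(samples)
    i = np.arange(1, n + 1)

    def part(u, w):
        y = np.sort(u(samples))  # ascending order statistics of u(X_i)
        return np.sum(y * (w((n + 1 - i) / n) - w((n - i) / n)))

    return part(u_plus, w_plus) - part(u_minus, w_minus)
```

Combined with the utility/weight sketches above, a call would look like `cpt_estimate(x, u_plus, u_minus, tk_weight, tk_weight)`.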

  11. Sample complexity.
(A1) Weights $w^+, w^-$ are Hölder continuous, i.e., $|w^+(x) - w^+(y)| \le H|x - y|^\alpha$ for all $x, y \in [0,1]$.
(A2) Utilities $u^+(X)$ and $u^-(X)$ are bounded above by $M < \infty$.
Under (A1) and (A2), for any $\epsilon, \delta > 0$,
$$P\big(\,|\overline{C}_n - C(X)| \le \epsilon\,\big) > 1 - \delta, \quad \forall\, n \ge \ln\Big(\frac{1}{\delta}\Big) \cdot \frac{4H^2M^2}{\epsilon^{2/\alpha}}.$$
Special case: Lipschitz weights ($\alpha = 1$) give sample complexity $O(1/\epsilon^2)$ for accuracy $\epsilon$.
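Read as a recipe, the bound gives the batch size needed for an $(\epsilon, \delta)$-accurate estimate. A small sketch, where the constants $H$ and $M$ are whatever (A1)-(A2) yield for the chosen utilities and weights:

```python
import math

def required_samples(eps, delta, H=1.0, M=1.0, alpha=1.0):
    # Smallest n with n >= ln(1/delta) * 4 H^2 M^2 / eps^(2/alpha),
    # which guarantees P(|C_n - C(X)| <= eps) > 1 - delta under (A1)-(A2).
    return math.ceil(math.log(1.0 / delta) * 4.0 * H**2 * M**2 / eps ** (2.0 / alpha))

print(required_samples(eps=0.1, delta=0.05))             # Lipschitz weights: O(1/eps^2)
print(required_samples(eps=0.1, delta=0.05, alpha=0.5))  # rougher weights cost far more
```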

  13. CPT-value optimization. Find $\theta^* = \arg\max_{\theta \in \Theta} C(X^\theta)$. RL application: $\theta$ = policy parameter, $X^\theta$ = return. Two-stage solution:
• inner stage (prediction): obtain samples of $X^\theta$ and estimate $C(X^\theta)$;
• outer stage (control): update $\theta$ by gradient ascent using $\nabla_i C(X^\theta)$.
Challenge: $\nabla_i C(X^\theta)$ is not given.

  14. Challenge: estimating $\nabla_i C(X^\theta)$ given only biased estimates of $C(X^\theta)$. Solution: use SPSA [Spall '92]. Gradient estimate:
$$\widehat\nabla_i C(X^{\theta_n}) = \frac{\overline{C}^{\theta_n + \delta_n \Delta_n} - \overline{C}^{\theta_n - \delta_n \Delta_n}}{2\,\delta_n \Delta_n^i},$$
where $\Delta_n$ is a vector of independent Rademacher r.v.s and $\delta_n > 0$ vanishes asymptotically. Update rule:
$$\theta_{n+1}^i = \Gamma_i\Big(\theta_n^i + \gamma_n\, \widehat\nabla_i C(X^{\theta_n})\Big), \quad i = 1, \ldots, d,$$
with projection operator $\Gamma_i$ and step-sizes $\gamma_n$.
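A minimal sketch of one CPT-SPSA iteration. The box projection standing in for $\Gamma_i$ and the function `cpt_of_policy` (which would run $m_n$ episodes under $\theta$ and return the estimate $\overline{C}_n(X^\theta)$) are assumptions for illustration:

```python
import numpy as np

def spsa_step(theta, cpt_of_policy, gamma_n, delta_n, rng, lo=-1.0, hi=1.0):
    # Two measurements of the (biased) CPT-value estimate yield all d
    # coordinates of the gradient estimate simultaneously.
    delta = rng.choice([-1.0, 1.0], size=theta.shape)        # Rademacher perturbation
    c_plus = cpt_of_policy(theta + delta_n * delta)
    c_minus = cpt_of_policy(theta - delta_n * delta)
    grad_hat = (c_plus - c_minus) / (2.0 * delta_n * delta)  # per-coordinate estimate
    return np.clip(theta + gamma_n * grad_hat, lo, hi)       # projection Gamma

# e.g., at outer iteration n:
# theta = spsa_step(theta, cpt_of_policy, gamma_n=0.1 / n, delta_n=n ** -0.25,
#                   rng=np.random.default_rng(n))
```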

  15. Simulation optimization vs. CPT-value optimization: a classical simulation-optimization oracle returns $f(x) + \xi$ with zero-mean noise $\xi$, whereas the CPT estimator returns $\overline{C}(X) + \epsilon$ with a controlled bias $\epsilon$. Figure 1 (overall flow of CPT-SPSA): the prediction stage draws $m_n$ samples at each of $\theta_n + \delta_n\Delta_n$ and $\theta_n - \delta_n\Delta_n$ to form the CPT estimates $\overline{C}^{\theta_n \pm \delta_n \Delta_n}$; the control stage (gradient ascent) uses them to update $\theta_n \to \theta_{n+1}$. How to choose $m_n$ so that the estimation bias can be ignored? Ensure $\frac{1}{\delta_n\, m_n^{\alpha/2}} \to 0$.
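One illustrative way to satisfy the bias condition; these particular schedules are my assumptions, the slide only states the requirement $1/(\delta_n m_n^{\alpha/2}) \to 0$:

```python
# With delta_n = n^(-1/4) and m_n = n, the bias term
# 1/(delta_n * m_n^(alpha/2)) = n^(1/4 - alpha/2) -> 0 whenever alpha > 1/2,
# so the inner-stage estimation error vanishes relative to the perturbation.
gamma = lambda n: 1.0 / n        # step-size: sum gamma_n = inf, sum gamma_n^2 < inf
delta = lambda n: n ** -0.25     # SPSA perturbation size, vanishing asymptotically
m = lambda n: n                  # batch size for the inner CPT estimate
```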

  16. Application: traffic signal control.
• For any path $i = 1, \ldots, M$, let $X^i$ be the delay gain, calculated with a pre-timed traffic light controller as reference.
• CPT captures the road users' evaluation of the delay gain $X^i$.
• Goal: maximize $\text{CPT}(X^1, \ldots, X^M) = \sum_{i=1}^M \mu_i\, C(X^i)$, where $\mu_i$ is the proportion of traffic on path $i$.
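A sketch of this objective, reusing `cpt_estimate` from the earlier sketch; the per-path sample layout (`delay_gains` as a list of sample arrays, one per path) is an assumed data format:

```python
def traffic_cpt_objective(delay_gains, mu, u_plus, u_minus, w_plus, w_minus):
    # CPT(X^1, ..., X^M) = sum_i mu_i * C(X^i): per-path CPT-value estimates of
    # the delay gains, weighted by the traffic proportions mu_i.
    # Depends on cpt_estimate() defined in the sketch after slide 10.
    return sum(mu_i * cpt_estimate(x_i, u_plus, u_minus, w_plus, w_minus)
               for mu_i, x_i in zip(mu, delay_gains))
```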

  17. Figure 2: histogram of the CPT-value of the delay gain under three algorithms: (a) AVG-SPSA, (b) EUT-SPSA, (c) CPT-SPSA. AVG uses plain sample means (no utility/weights), EUT uses utilities but no weights, and CPT uses both. [Histogram panels: bin frequencies over CPT-value bins for each algorithm.]

  18. Conclusions
• Want AI to be beneficial to humans.
• CPT is a very popular paradigm for modeling human decisions.
• We lay the foundations for using CPT in an RL setting.
• Prediction: sample means (TD) won't work, but empirical distributions do!
• Control: no Bellman equation, but SPSA can be employed.
Future directions:
• Crowdsourcing experiment to validate CPT online.
• Robustness to unknown utility and weight function parameters.

  20. Thanks! Questions?
