Learning Models of Human Behavior using a Value Directed Approach
Jesse Hoey, Computer Science Department, University of Toronto, http://www.cs.toronto.edu/~jhoey/
IRIS Learning Workshop - June 9, 2004
Motivation: Modeling Human Behaviors
[Figure: the behavior model. The previous world state and the Context ("don't steal cake" / "steal cake") condition a behavior; the behavior selects an Action ("get cake"), whose Outcome ("get caught") carries a utility (hunger).]
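The decision-theoretic reading of this picture is that the agent picks the action with the highest expected utility over outcomes. A toy calculation for the cake example, with made-up probabilities and utilities (not numbers from the talk):

```python
# Illustrative expected-utility choice for the cake example; the
# probability and utility values below are assumptions, not from the talk.
utilities = {"eat cake": 10.0, "get caught": -50.0, "stay hungry": 0.0}
p_caught = 0.3  # assumed chance the stealing attempt ends in "get caught"

eu_steal = (1 - p_caught) * utilities["eat cake"] + p_caught * utilities["get caught"]
eu_dont = utilities["stay hungry"]

best = "steal cake" if eu_steal > eu_dont else "don't steal cake"
print(best)  # with these numbers, the expected penalty outweighs the cake
```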
[Figure: dynamic Bayesian network over behaviors. Nodes: action A_t, behavior A^{b:a}, states S^a_{t-1} and S^a_t, observations O, with parameters Θ^A, Θ^O, Θ^D. Inference identifies the most likely behavior, A^{b:a} = 1.]
[Figure: observation sequences from video. Frames 1187-1189 and 1249-1251, with discrete state chains Z^x, Z^w and variables H, V, W, X; the "smile" behavior is recovered as the most likely, A^{b:a} = 1.]
P(O | A^{b:a}) = Σ_{i,j,k,l} P(I_T | W_{T,i}, A^{b:a}) · P(∇f_T | X_{T,j}, A^{b:a}) · Θ^X_{ijkn} · Θ^W_{jkln} · P(X_{T-1,k}, W_{T-1,l}, {O}_{1,T-1} | A^{b:a})
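This likelihood can be computed with a standard forward recursion over the coupled state chains. A minimal sketch, with hypothetical state sizes and random stand-ins for the image term P(I_T | W_T) and the flow term P(∇f_T | X_T) (not the talk's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: NX pose states, NW dynamics states, T frames.
NX, NW, T = 3, 2, 5

# Transition parameters: theta_X[i, j, k] ~ P(X_t=i | W_t=j, X_{t-1}=k),
# theta_W[j, l] ~ P(W_t=j | W_{t-1}=l), normalized over the first axis.
theta_X = rng.random((NX, NW, NX)); theta_X /= theta_X.sum(axis=0, keepdims=True)
theta_W = rng.random((NW, NW)); theta_W /= theta_W.sum(axis=0, keepdims=True)

# Per-frame observation likelihoods (random stand-ins):
# p_I[t, j] ~ P(I_t | W_{t,j}), p_F[t, i] ~ P(grad f_t | X_{t,i}).
p_I = rng.random((T, NW))
p_F = rng.random((T, NX))

def sequence_likelihood(p_I, p_F, theta_X, theta_W):
    """Forward recursion: alpha[i, j] = P(X_t=i, W_t=j, O_{1..t})."""
    alpha = p_F[0][:, None] * p_I[0][None, :] / (NX * NW)  # uniform prior
    for t in range(1, len(p_I)):
        # Sum over previous pose k and dynamics l, weighted by transitions.
        pred = np.einsum('ijk,jl,kl->ij', theta_X, theta_W, alpha)
        alpha = p_F[t][:, None] * p_I[t][None, :] * pred
    return alpha.sum()  # P(O | A^{b:a}) for one behavior model

print(sequence_likelihood(p_I, p_F, theta_X, theta_W))
```

Running this once per candidate behavior model and comparing the resulting likelihoods is what yields the "most likely A^{b:a}" decision.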
[Figure: the DBN unrolled for learning, with behavior A^{b:a}, actions A^a_{τ-1}, A^a_τ, states S^a_{τ-1}, S^a_τ, S^a_{τ+1}, observations O_{τ-1}, O_τ, and parameters Θ^A, Θ^O, Θ^D.]
Learning vs. Decision Making

[Figure: the landscape of sequential decision problems, ordered by difficulty (from "I can win!" and "Bring it on!" up to "Hardcore!" and "Nightmare!"). Problem classes: discrete observations, continuous observations, unobservable state, and multi-agent systems. Solution methods: the decision-analytic approach, incremental pruning, entropy approximations, factored solvers (SPUDD), EM for POMDPs, MDP approximations, finding equilibria, general POMDP solvers, and Monte Carlo methods.]
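In the unobservable-state (POMDP) setting, the agent acts on a belief state maintained by Bayesian filtering. A minimal two-state sketch (the transition and observation numbers are invented for illustration):

```python
import numpy as np

# Hypothetical 2-state POMDP fragment. Belief update:
# b'(s') proportional to P(o | s') * sum_s P(s' | a, s) b(s).
T = np.array([[0.9, 0.1],   # T[s, s'] = P(s' | a, s) for one fixed action a
              [0.2, 0.8]])
O = np.array([0.7, 0.4])    # O[s'] = P(o | s') for one observed o

def belief_update(b, T, O):
    """One step of Bayesian filtering: predict with T, correct with O."""
    b_new = O * (b @ T)
    return b_new / b_new.sum()   # renormalize to a distribution

b = belief_update(np.array([0.5, 0.5]), T, O)
print(b)
```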
[Figure: the DBN augmented with reward nodes R, linking actions A^a and states S^a_{τ-1}, S^a_τ, S^a_{τ+1} for decision making.]
Value iteration:

V^{n+1}(s) = R(s) + max_{a∈A} Σ_t Pr(t | a, s) · V^n(t),   with V^0 = R

π^n(s) = argmax_{a∈A} Σ_t Pr(t | a, s) · V^n(t)
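This backup, in its finite-horizon form, is a few lines of code. A sketch on a made-up three-state, two-action MDP (not the talk's model):

```python
import numpy as np

# Hypothetical MDP: P[a, s, t] = Pr(t | a, s), R[s] = reward in state s.
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]])
R = np.array([0.0, 1.0, 10.0])

def value_iteration(P, R, horizon=10):
    """Iterate V^{n+1}(s) = R(s) + max_a sum_t Pr(t|a,s) V^n(t), from V^0 = R."""
    V = R.copy()
    for _ in range(horizon):
        Q = R[None, :] + P @ V     # Q[a, s]: backup for each action
        V = Q.max(axis=0)          # greedy max over actions
    return V, Q.argmax(axis=0)     # values and policy pi(s)

V, pi = value_iteration(P, R)
print(V, pi)
```

The policy is read off as the maximizing action at each state, matching the argmax line above.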
[Figure: robot control task. The human issues a control command A_t (a behavior A^{b:a}); the robot's actions are {"go left", "stop", "go right", "forwards"}, and the human's feedback observations O_t are {"good robot", "bad robot"}, which carry the reward R. Sample video frames 40-42 and 51-53 show commands being performed, with discrete state chains below each frame.]
[Figure: learned six-cluster model. Commands A^com map to clusters: left → d1, right → d2, stop → d3, forward → d4, {right, forward, stop} → d5, {left, forward, stop} → d6. Each executed action A^act carries a pair of expected values for bad/good feedback: 0.50/1.50, 0.54/1.54, 0.60/1.60, and 0.65/1.65.]

[Figure: learned four-cluster model, with left → d1, right → d2, stop → d3, forward → d4 and the same bad/good value pairs.]
[Figure: the assisted handwashing task. The previous world state and the caregiver's behavior form the Context; the Action is a prompt; the Outcome is hands washed, which carries the utility (reward).]
self-occlusion