Recognizers
A study in learning how to model temporally extended behaviors
Jordan Frank (jordan.frank@cs.mcgill.ca)
Reasoning and Learning Lab, McGill University
http://www.cs.mcgill.ca/~jfrank8/
Joint work with Doina Precup
NIPS’07 Workshop on Hierarchical Organization of Behavior
Background
Temporally extended behaviors can be modeled as options [Sutton, Precup & Singh, 1999].
Recognizers give a way of specifying target policies for off-policy learning [Precup et al. 2006].
Off-policy temporal-difference learning with function approximation handles continuous state and action spaces [Precup, Sutton & Dasgupta 2001].
An option ⟨I, β, π⟩ consists of an initiation set I, a termination condition β, and a policy π.
A recognizer specifies a class of policies that we are interested in learning about.
Off-policy learning: we learn about a target policy π by observing an agent whose behavior is governed by a different (possibly unknown) policy b.
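To make the option triple concrete, here is a minimal Python sketch (the type names are ours, not from the deck):

```python
from dataclasses import dataclass
from typing import Any, Callable

State = Any
Action = Any

@dataclass
class Option:
    """An option <I, beta, pi> [Sutton, Precup & Singh, 1999]."""
    initiation: Callable[[State], bool]       # I: states where the option may start
    termination: Callable[[State], float]     # beta(s): probability of terminating in s
    policy: Callable[[State, Action], float]  # pi(s, a): action probabilities while active
```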
[Figure: the off-policy learning setup, in which the behavior policy is unknown.]
We have an MDP ⟨S, A, P, R⟩. At each time step t, the agent receives a state s_t ∈ S and chooses an action a_t ∈ A.
A behavior policy b : S × A → [0, 1] is used to generate actions.
A recognizer c : S × A → [0, 1] indicates to what extent the recognizer allows action a in state s.
Together, b and c induce a target policy

    π(s, a) = b(s, a) c(s, a) / µ(s),

where µ(s) = Σ_a b(s, a) c(s, a) is the recognition probability at s.
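As a sketch of how b and c combine (hypothetical Python over a finite action set; the deck gives only the formula):

```python
import numpy as np

def target_policy(b_s: np.ndarray, c_s: np.ndarray) -> np.ndarray:
    """Given b(s, .) and c(s, .) for one state s, return pi(s, .).

    pi(s, a) = b(s, a) c(s, a) / mu(s), with mu(s) = sum_a b(s, a) c(s, a).
    """
    mu_s = np.dot(b_s, c_s)   # recognition probability at s
    return b_s * c_s / mu_s   # renormalized: pi(s, .) sums to 1

# Example: behavior is uniform over 4 actions; the recognizer fully allows
# actions 0 and 1, partially allows action 2, and rejects action 3.
b_s = np.array([0.25, 0.25, 0.25, 0.25])
c_s = np.array([1.0, 1.0, 0.5, 0.0])
print(target_policy(b_s, c_s))  # -> [0.4, 0.4, 0.2, 0.0]
```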
Importance sampling: our samples are drawn from the behavior policy b, and so we just need to calculate the appropriate weights:

    E_π{x} = ∫ x π(x) dx = ∫ x [π(x)/b(x)] b(x) dx = E_b{x π(x)/b(x)}

For a recognizer, the importance sampling ratio is

    ρ(s, a) = π(s, a)/b(s, a) = c(s, a)/µ(s),

which depends only on c and µ, not on b itself.
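A minimal sketch of the resulting estimator: given transitions generated by b, we reweight by ρ = c/µ without ever evaluating b (one-step case shown; for multi-step returns the ρ’s multiply along the trajectory):

```python
import numpy as np

def off_policy_estimate(returns, c_vals, mu_vals):
    """Estimate an expectation under pi from samples drawn under b.

    returns: observed one-step returns under the behavior policy
    c_vals:  c(s_t, a_t) for the actions actually taken
    mu_vals: mu(s_t), or estimates mu_hat(s_t)
    """
    rho = np.asarray(c_vals) / np.asarray(mu_vals)  # rho = c/mu; b never appears
    return float(np.mean(rho * np.asarray(returns)))
```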
The behavior policy b is unknown, so we cannot compute µ(s) directly; we must estimate it.
We use regression with the same set of features in order to estimate µ̂ : S → [0, 1].
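One way to realize this, sketched under a linear function approximation assumption (the names are ours): because actions are drawn from b, the observed c(s_t, a_t) is an unbiased sample of µ(s_t) = Σ_a b(s_t, a) c(s_t, a), so µ̂ can be fit by online regression of the c-values on the state features:

```python
import numpy as np

def update_mu_hat(w_mu, phi_s, c_sa, alpha=0.01):
    """One online least-squares step toward mu_hat(s) = w_mu . phi(s).

    phi_s: feature vector for s_t (the same features used for values)
    c_sa:  c(s_t, a_t), an unbiased sample of mu(s_t) since a_t ~ b
    """
    pred = float(np.dot(w_mu, phi_s))            # current estimate mu_hat(s_t)
    w_mu = w_mu + alpha * (c_sa - pred) * phi_s  # gradient step on squared error
    return w_mu
```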
Experiment: puddle world. The state space is continuous and the actions are noisy.
The agent receives a negative reward for entering the puddle (-10 at the middle).
The target policy is to pick actions that lead directly towards the goal state.
The recognizer accepts actions within 45° of the direction directly towards the goal state (see the sketch below).
Recognizer episodes can be initiated everywhere, and terminate when either the goal state or the puddle is entered.
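A sketch of such a recognizer (hypothetical Python; the deck gives only the 45° rule): it accepts an action whose direction lies within 45° of the straight line to the goal:

```python
import numpy as np

def recognizer(state_xy, goal_xy, action_dir):
    """c(s, a): 1.0 if the action heads within 45 degrees of the goal, else 0.0.

    state_xy, goal_xy: 2D positions; action_dir: unit vector of the move.
    """
    to_goal = np.asarray(goal_xy, dtype=float) - np.asarray(state_xy, dtype=float)
    to_goal /= np.linalg.norm(to_goal)
    cos_angle = float(np.dot(to_goal, np.asarray(action_dir, dtype=float)))
    return 1.0 if cos_angle >= np.cos(np.pi / 4) else 0.0
```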
Learned Reward Model
[Figure: the learned reward model over the state space.]
Moving towards the goal is good unless you are below and to the left of the puddle.
[Figures: “State Value Estimates” (expected reward vs. number of steps ×10⁴, with estimated vs. exact recognition probability) and “Recognition Probability Estimates” (estimated vs. true value vs. number of steps ×10⁴).]
The recognition probability estimate converges to the correct value, and estimating this value as we do our learning does not bias our state value estimates.
Experiment: ship steering. The task has continuous actions (throttle and rudder angle).
The state includes the ship’s heading and velocity.
We learn multiple recognizers from one stream of experience.
Each recognizer recognizes behaviors that bring the ship towards the desired heading.
The recognizers range from those that make small, smooth adjustments to the rudder, to those that make huge, sharp adjustments (a sketch of the parallel updates follows).
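A sketch of the “many recognizers, one stream” idea (our construction, not code from the deck): every behavior transition updates each recognizer’s value weights with its own ratio ρᵢ = cᵢ(s, a)/µ̂ᵢ(s), alongside the regression update for µ̂ᵢ from the earlier slide:

```python
import numpy as np

def update_all(recognizers, phi_s, phi_s_next, c_vals, reward, gamma=0.99, alpha=0.01):
    """One shared behavior transition updates every recognizer in parallel.

    recognizers: list of dicts with 'w' (value weights) and 'w_mu' (mu_hat weights)
    c_vals:      c_i(s_t, a_t) for each recognizer i, for the action actually taken
    """
    for rec, c_sa in zip(recognizers, c_vals):
        mu_hat = max(float(np.dot(rec["w_mu"], phi_s)), 1e-6)  # guard against zero
        rho = c_sa / mu_hat                                    # rho_i = c_i / mu_hat_i
        # Importance-weighted TD(0) step on the value weights.
        td_err = (reward + gamma * float(np.dot(rec["w"], phi_s_next))
                  - float(np.dot(rec["w"], phi_s)))
        rec["w"] = rec["w"] + alpha * rho * td_err * phi_s
        # Regression step on mu_hat, as above.
        rec["w_mu"] = rec["w_mu"] + alpha * (c_sa - float(np.dot(rec["w_mu"], phi_s))) * phi_s
```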
[Figures: expected reward vs. number of steps (×10⁸) under two conditions, “Way off course and moving slowly” and “Way off course and moving quickly”; curves compare recognizers for small, medium, large, and huge adjustments, and one that recognizes all actions.]
Recognizers that make small adjustments outperform those that make large adjustments.
Conclusions: recognizers allow off-policy learning even when we cannot control, or do not know, the behavior policy.
We still need to work on convergence proofs for function approximation, but empirical results are promising.
References
Precup, D., Sutton, R. S., & Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proc. 18th International Conf. on Machine Learning, pages 417-424.
Precup, D., Sutton, R. S., Paduraru, C., Koop, A., & Singh, S. (2006). Off-policy learning with recognizers. In Advances in Neural Information Processing Systems 18 (NIPS*05).
Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181-211.