SLIDE 1

Recognizers

A study in learning how to model temporally extended behaviors

Jordan Frank (jordan.frank@cs.mcgill.ca)
Reasoning and Learning Lab, McGill University
http://www.cs.mcgill.ca/~jfrank8/
Joint work with Doina Precup

SLIDE 2

Recognizers NIPS’07 Workshop on Hierarchical Organization of Behavior

Background and Motivation

  • Want a flexible way to represent hierarchical knowledge (Options [Sutton, Precup & Singh, 1999]).
  • Want an efficient way to learn about these hierarchies (Recognizers [Precup et al., 2006]).
  • Concerned with off-policy learning in environments with continuous state and action spaces [Precup, Sutton & Dasgupta, 2001].

SLIDE 3

Terminology

  • Option: A tuple ⟨I, β, π⟩, where I is a set of initiation states, β is a termination condition, and π is a policy.
  • Recognizer: A filter on actions. A recognizer specifies a class of policies that we are interested in learning about (both structures are sketched below).
  • Off-policy learning: We are interested in learning about a target policy π by observing an agent whose behavior is governed by a different (possibly unknown) policy b.
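
As a rough illustration only (not from the talk), the option tuple and a recognizer can be written as small containers of functions; the class names and type choices below are assumptions made for this sketch.

# Minimal sketch of the structures defined above; names and types are
# illustrative assumptions, not part of the original formulation.
from dataclasses import dataclass
from typing import Any, Callable

State = Any
Action = Any

@dataclass
class Option:
    initiation: Callable[[State], bool]       # I: may the option start in s?
    termination: Callable[[State], float]     # beta(s): probability of stopping in s
    policy: Callable[[State, Action], float]  # pi(s, a): the option's policy

@dataclass
class Recognizer:
    recognize: Callable[[State, Action], float]  # c(s, a) in [0, 1]: filter on actions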

SLIDE 4

Example Problem

  • PuddleWorld [RL-Glue]
  • Continuous state space
  • Continuous action space
  • Goal is to do off-policy learning. Behavior policy is unknown.

SLIDE 8

Recognizers: Formally

  • MDP is a tuple ⟨S, A, P, R⟩. At time step t, an agent receives a state s_t ∈ S and chooses an action a_t ∈ A.
  • Fixed (unknown) behavior policy b : S × A → [0, 1], used to generate actions.
  • Recognizer is a function c : S × A → [0, 1], where c(s, a) indicates to what extent the recognizer allows action a in state s.
  • Target policy π generated by b and c: π(s, a) = b(s, a)c(s, a) / μ(s), where μ(s) = Σ_x b(s, x)c(s, x) is the recognition probability at s (see the sketch below).
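
A rough sketch of how b and c combine, assuming a finite set of candidate actions so the sum over x is a plain sum (in the continuous-action setting of the talk it is an integral); the function names are illustrative.

# Sketch: combining a behavior policy b(s, a) and a recognizer c(s, a)
# into the target policy pi(s, a). Assumes a finite list of candidate
# actions; in continuous action spaces the sum becomes an integral.

def recognition_probability(s, actions, b, c):
    # mu(s) = sum_x b(s, x) c(s, x)
    return sum(b(s, x) * c(s, x) for x in actions)

def target_policy(s, a, actions, b, c):
    # pi(s, a) = b(s, a) c(s, a) / mu(s)
    mu = recognition_probability(s, actions, b, c)
    return b(s, a) * c(s, a) / mu if mu > 0 else 0.0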

SLIDE 9

Importance Sampling

  • Based on the following observation:
    E_π{x} = ∫ x π(x) dx = ∫ x (π(x) / b(x)) b(x) dx = E_b{x π(x) / b(x)}
  • We are trying to learn about a target policy π using samples drawn from a behavior policy b, and so we just need to calculate the appropriate weights.
  • Weights (also called corrections) given by ρ(s, a) = π(s, a) / b(s, a) = c(s, a) / μ(s) (sketched below).
  • Full details of the algorithm are given in Precup et al. (2006).
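
A minimal sketch of the correction and of an importance-weighted average over transitions gathered under b; the helper names and the plain averaging are assumptions for illustration and are not the full algorithm of Precup et al. (2006).

# Sketch: importance-sampling corrections for a recognizer-defined target
# policy. Note that rho depends only on c and mu, not directly on b.

def correction(s, a, c, mu):
    # rho(s, a) = pi(s, a) / b(s, a) = c(s, a) / mu(s)
    return c(s, a) / mu(s)

def corrected_average(samples, c, mu):
    # samples: iterable of (s, a, x) observed while following b.
    # Returns (1/N) * sum of rho(s, a) * x, an estimate of E_pi{x}.
    total, n = 0.0, 0
    for s, a, x in samples:
        total += correction(s, a, c, mu) * x
        n += 1
    return total / n if n else 0.0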

SLIDE 10

Importance Sampling Correction

  • μ(s) depends on b.
  • If b is unknown, we can use a maximum likelihood estimate μ̂ : S → [0, 1].
  • For linear function approximation, we can use logistic regression with the same set of features in order to estimate μ (see the sketch below).
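
A sketch of one way to form the maximum-likelihood estimate μ̂, assuming a binary recognizer so each visited state-action pair yields a 0/1 "recognized" label over the same feature vector φ(s); the class name and step size are illustrative assumptions.

# Sketch: online logistic regression estimate of the recognition probability,
# mu_hat(s) = sigmoid(w . phi(s)), trained on 0/1 "was the taken action
# recognized?" labels. Assumes a binary recognizer.
import numpy as np

class RecognitionProbabilityEstimator:
    def __init__(self, num_features, step_size=0.01):
        self.w = np.zeros(num_features)
        self.alpha = step_size

    def predict(self, phi):
        # mu_hat(s) in [0, 1]
        return 1.0 / (1.0 + np.exp(-self.w @ phi))

    def update(self, phi, recognized):
        # One stochastic gradient step on the log-likelihood, label in {0, 1}.
        self.w += self.alpha * (recognized - self.predict(phi)) * phi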

SLIDE 11

Experiment 1: Puddle World [RL-Glue]

  • Continuous state space, continuous actions. Movement is

noisy.

  • Positive reward for reaching goal (10), negative reward for

entering puddle (-10 at middle).

  • Start state chosen randomly in small square in lower left
  • corner. Reaching goal moves agent back to start state

SLIDE 12

Experiment 1: Setup

  • Standard tile coding function approximation for the state space.
  • Behavior policy picks actions uniformly at random; target policy is to pick actions that lead directly towards the goal state.
  • Binary recognizer, recognizes actions in a 45° cone facing directly towards the goal state (sketched below). Recognizer episode can be initiated everywhere, and terminates when either the goal state or the puddle is entered.
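
A sketch of the binary cone recognizer, assuming actions are displacement vectors and reading the 45° as the full cone width (22.5° to either side of the line towards the goal); the goal coordinates and that convention are assumptions.

# Sketch: binary recognizer that accepts actions pointing roughly at the goal.
# Assumes states and actions are (x, y) pairs; GOAL and the half-angle
# convention are illustrative assumptions.
import math

GOAL = (1.0, 1.0)  # assumed goal location

def cone_recognizer(state, action, half_angle_deg=22.5):
    to_goal = math.atan2(GOAL[1] - state[1], GOAL[0] - state[0])
    heading = math.atan2(action[1], action[0])
    # smallest absolute angular difference, wrapped to [-pi, pi]
    diff = abs((heading - to_goal + math.pi) % (2 * math.pi) - math.pi)
    return 1.0 if diff <= math.radians(half_angle_deg) else 0.0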

SLIDE 13

Experiment 1: Results

[Figure: Learned Reward Model]

  • This matches our intuition that moving directly towards the goal is good unless you are below and to the left of the puddle.

SLIDE 14

Experiment 1: Results

[Figure: left, state value estimates (expected reward vs. number of steps ×10^4) with estimated and with exact recognition probability; right, recognition probability estimates (estimated vs. true value) vs. number of steps ×10^4]

  • We observe that the recognition probability estimate converges to the correct value, and estimating this value as we do our learning does not bias our state value estimates.

SLIDE 15

Experiment 2: Ship Steering [RL-Glue]

  • Stochastic environment. 3D continuous state space, 2D continuous actions (throttle and rudder angle).
  • Goal is to keep a ship on a desired heading with a high velocity.

SLIDE 16

Experiment 2: Setup

  • Goal is to demonstrate that we can learn multiple recognizers from one stream of experience.
  • Behavior policy picks a rudder orientation randomly to bring the ship towards the desired heading.
  • 4 recognizers recognize different ranges of motion, from small, smooth adjustments to the rudder to huge, sharp adjustments (sketched below).
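
A sketch of how several such recognizers might be defined over the one action stream, keyed on the size of the rudder adjustment; the threshold values are invented for illustration and are not from the talk.

# Sketch: four recognizers over the same experience stream, each accepting a
# different range of rudder-adjustment magnitudes. Threshold values are
# illustrative assumptions.

def make_range_recognizer(low, high):
    def c(state, action):
        throttle, rudder_change = action
        return 1.0 if low <= abs(rudder_change) < high else 0.0
    return c

recognizers = {
    "small":  make_range_recognizer(0.0, 0.1),
    "medium": make_range_recognizer(0.1, 0.3),
    "large":  make_range_recognizer(0.3, 0.6),
    "huge":   make_range_recognizer(0.6, float("inf")),
}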

SLIDE 17

Experiment 2: Results

[Figure: expected reward vs. number of steps ×10^8 for each recognizer (small, medium, large, and huge adjustments, plus all actions recognized); left panel: way off course and moving slowly, right panel: way off course and moving quickly]

  • We can see that policies that make smaller rudder adjustments outperform those that make large adjustments.

SLIDE 18

Conclusion and Future work

  • Recognizers are useful for learning about options when we cannot control, or do not know, the behavior policy.
  • Convergence has been shown for state aggregation; we still need to work on proofs for function approximation, but empirical results are promising.
  • More experiments.

SLIDE 19

Questions?

  • RL-Glue, University of Alberta, http://rlai.cs.ualberta.ca/RLBB/top.html
  • Precup, D., Sutton, R.S., and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proc. 18th International Conf. on Machine Learning, pages 417-424.
  • Precup, D., Sutton, R.S., Paduraru, C., Koop, A., and Singh, S. (2006). Off-policy learning with recognizers. In Advances in Neural Information Processing Systems 18 (NIPS*05).
  • Sutton, R.S., Precup, D., and Singh, S.P. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2): 181-211.
