
SLIDE 1

Learning Predictive State Representations Using Non-Blind Policies

Michael Bowling, Peter McCracken, Michael James, James Neufeld, Dana Wilkinson

University of Alberta, Toyota Technical Center, University of Waterloo

ICML 2006

Bowling et al. PSRs and Non-Blind Policies ICML 2006 1 / 18

SLIDE 2

Outline

1. What is a PSR? (very brief tutorial)
2. Extracting PSRs from data
3. Prediction estimators: problem and solution (short punchline)
4. Non-blind exploration (bonus)


SLIDE 3

Decision Process

Actions and observations alternate: $a_1, o_1, a_2, o_2, \ldots, a_n, o_n$

General Form

$\Pr(o_{n+1} \mid a_1, o_1, \ldots, a_n, o_n, a_{n+1})$


SLIDE 4

Decision Process

Actions and observations alternate: $a_1, o_1, a_2, o_2, \ldots, a_n, o_n$

Markov Decision Process

$\Pr(o_{n+1} \mid a_1, o_1, \ldots, a_n, o_n, a_{n+1}) = \Pr(o_{n+1} \mid o_n, a_{n+1})$



SLIDE 6

Histories, Tests, and Predictions

Notation

History ($h$): $a_1, o_1, a_2, o_2, \ldots, a_n, o_n$
Test ($t$): $a_1, o_1, a_2, o_2, \ldots, a_n, o_n$ (but in the future)
Prediction $p(t \mid h)$:

$p(a_1, o_1, \ldots, a_n, o_n \mid h) \equiv \prod_{i=1}^{n} \Pr(o_i \mid h a_1 o_1 \ldots a_i)$

$\pi(a_1, o_1, \ldots, a_n, o_n \mid h) \equiv \prod_{i=1}^{n} \Pr(a_i \mid h a_1 o_1 \ldots a_{i-1} o_{i-1})$

$\Pr(t \mid h) = p(t \mid h)\, \pi(t \mid h)$
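A minimal sketch of these definitions, assuming a hypothetical two-state POMDP and a hypothetical non-blind policy (none of the numbers or names below come from the talk). The function `prediction` uses only the environment to compute p(t|h), `policy_weight` uses only the policy to compute π(t|h), and their product is Pr(t|h).

```python
from fractions import Fraction as F

# Hypothetical toy POMDP (illustration only, not from the talk).
STATES = (0, 1)
B0 = (F(1, 2), F(1, 2))  # initial belief over the two states
# T[a][s][s2]: transition probability; Z[a][s2][o]: observation probability.
T = {"stay": ((F(1), F(0)), (F(0), F(1))),
     "flip": ((F(0), F(1)), (F(1), F(0)))}
Z = {"stay": ((F(3, 4), F(1, 4)), (F(1, 4), F(3, 4))),
     "flip": ((F(1, 2), F(1, 2)), (F(1, 2), F(1, 2)))}


def seq_prob(seq):
    """Product over i of Pr(o_i | a_1 o_1 ... a_i), from the empty history."""
    w = B0
    for a, o in seq:
        # Unnormalised forward step: move with T, then weight by Z.
        w = tuple(sum(w[s] * T[a][s][s2] for s in STATES) * Z[a][s2][o]
                  for s2 in STATES)
    return sum(w)


def prediction(test, history):
    """p(t|h): environment part only, via the chain rule p(ht) / p(h)."""
    return seq_prob(history + test) / seq_prob(history)


def policy(history, a):
    """Hypothetical non-blind policy: prefers 'flip' after observing 1."""
    if history and history[-1][1] == 1:
        return F(3, 4) if a == "flip" else F(1, 4)
    return F(1, 2)


def policy_weight(test, history):
    """pi(t|h): product over i of Pr(a_i | h a_1 o_1 ... a_{i-1} o_{i-1})."""
    w, h = F(1), list(history)
    for a, o in test:
        w *= policy(h, a)
        h.append((a, o))
    return w


h = [("stay", 0)]
t = [("stay", 0), ("flip", 1)]
p, pi = prediction(t, h), policy_weight(t, h)
print(p, pi, p * pi)  # p(t|h), pi(t|h), Pr(t|h): 5/16 1/4 5/64
```

Exact fractions are used so the factorisation can be checked without rounding error.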


SLIDE 7

System Dynamics Matrix

Countable number of tests and histories.
Infinite matrix of all predictions.

[Figure: matrix with one row per history $h$ and one column per test $t$; entry $(h, t)$ is $p(t \mid h)$.]



SLIDE 12

POMDPs

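Underlying states.
Histories correspond to belief states.
History row is a linear combination of state rows.
$\therefore \operatorname{rank}(\mathrm{SDM}) \le |S|$

[Figure: the system dynamics matrix (histories by tests) beside a states-by-tests matrix with rows $s_1, \ldots, s_4$; a history row is the belief weights $b_1, \ldots, b_4$ applied to the state rows.]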
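The rank bound can be checked on a finite block of the system dynamics matrix. The sketch below assumes a hypothetical two-state POMDP (its numbers are made up for illustration) and computes the exact rank of the block formed by all histories of length at most 2 and all tests of length 1 or 2.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical toy POMDP (illustration only, not from the talk).
ACTIONS, OBS, STATES = ("stay", "flip"), (0, 1), (0, 1)
B0 = (F(1, 2), F(1, 2))
T = {"stay": ((F(1), F(0)), (F(0), F(1))),
     "flip": ((F(0), F(1)), (F(1), F(0)))}
Z = {"stay": ((F(3, 4), F(1, 4)), (F(1, 4), F(3, 4))),
     "flip": ((F(1, 2), F(1, 2)), (F(1, 2), F(1, 2)))}


def seq_prob(seq):
    """Probability of the observations in seq given its actions."""
    w = B0
    for a, o in seq:
        w = tuple(sum(w[s] * T[a][s][s2] for s in STATES) * Z[a][s2][o]
                  for s2 in STATES)
    return sum(w)


def sequences(min_len, max_len):
    """All action-observation sequences with min_len..max_len steps."""
    steps = list(product(ACTIONS, OBS))
    return [list(s) for n in range(min_len, max_len + 1)
            for s in product(steps, repeat=n)]


def rank(rows):
    """Exact matrix rank by Gaussian elimination over Fractions."""
    rows, r = [list(row) for row in rows], 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            f = rows[i][c] / rows[r][c]
            rows[i] = [x - f * y for x, y in zip(rows[i], rows[r])]
        r += 1
    return r


histories = sequences(0, 2)  # every history here has positive probability
tests = sequences(1, 2)
sdm_block = [[seq_prob(h + t) / seq_prob(h) for t in tests] for h in histories]
print(rank(sdm_block), "<=", len(STATES))  # 2 <= 2
```

The block has 21 rows and 20 columns, yet its rank is 2, matching the number of underlying states in this toy model.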



SLIDE 15

Predictive State Representations

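Find linearly independent tests: the “core tests” $Q$.
Any test is a linear combination of core tests: $p(t \mid h) = p(Q \mid h)\, m_t$

[Figure: system dynamics matrix with core-test columns $q_1, q_2, q_3$; the column for test $t$ is obtained from the core columns via the weight vector $m_t$.]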


SLIDE 16

Predictive State Representations

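Find linearly independent tests: the “core tests” $Q$.
Update predictions: $p(Q \mid hao) = \frac{p(aoQ \mid h)}{p(ao \mid h)} = \frac{p(Q \mid h)\, M_{aoQ}}{p(Q \mid h)\, m_{ao}}$

[Figure: system dynamics matrix with core-test columns $q_1, q_2, q_3$ and the weight vector $m_t$ for a test $t$.]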
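The update rule can be sketched in a few lines. The example below assumes a hypothetical two-state POMDP and a hand-picked pair of core tests, and derives the weight vectors from that model only so that the update has concrete parameters to work with. A learned PSR would estimate them from data instead.

```python
from fractions import Fraction as F

# Hypothetical toy POMDP, used only to generate PSR parameters.
STATES = (0, 1)
T = {"stay": ((F(1), F(0)), (F(0), F(1))),
     "flip": ((F(0), F(1)), (F(1), F(0)))}
Z = {"stay": ((F(3, 4), F(1, 4)), (F(1, 4), F(3, 4))),
     "flip": ((F(1, 2), F(1, 2)), (F(1, 2), F(1, 2)))}
CORE = [[("stay", 0)], [("stay", 1)]]  # assumed core tests q1, q2


def outcome(test):
    """u_t[s]: probability that test t succeeds when started in state s."""
    u = [F(1)] * len(STATES)
    for a, o in reversed(test):
        u = [sum(T[a][s][s2] * Z[a][s2][o] * u[s2] for s2 in STATES)
             for s in STATES]
    return u


def weights(test):
    """m_t with p(t|h) = p(Q|h) m_t, solved from the 2x2 system A m = u_t."""
    A = [[outcome(q)[s] for q in CORE] for s in STATES]
    u = outcome(test)
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(u[0] * A[1][1] - A[0][1] * u[1]) / det,
            (A[0][0] * u[1] - A[1][0] * u[0]) / det]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def psr_update(pq, a, o):
    """p(Q|hao) = p(Q|h) M_aoQ / (p(Q|h) m_ao)."""
    denom = dot(pq, weights([(a, o)]))
    return [dot(pq, weights([(a, o)] + q)) / denom for q in CORE]


pq = [F(1, 2), F(1, 2)]          # p(Q | empty history) for a uniform start
pq = psr_update(pq, "stay", 0)   # [5/8, 3/8]
pq = psr_update(pq, "flip", 1)   # [3/8, 5/8]
print(pq)
```

The state of the PSR is just the vector of core-test predictions; no belief over hidden states is tracked during the update.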


SLIDE 17

Extracting PSRs from Data



SLIDE 20

What Data?

$a_1, o_1, a_2, o_2, \ldots, a_n, o_n$

How are actions chosen?

Unknown policy.
Known policy.
Controlled policy.

Note

Existing algorithms require a particular control policy. Either:
exhaustively trying history-test pairs, or
random actions.



SLIDE 24

Extracting PSRs from Data

(James & Singh, 2004) (Rosencrantz et al., 2004) (Wolfe et al., 2005) (Wiewiora, 2005) (McCracken & Bowling, 2006)

The common formula:

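Find core tests.
Find update parameters.
Estimate part of the system dynamics matrix.
Estimate a subset of predictions.

$\hat{p}_\bullet(t \mid h) = \frac{\#(h a_1 o_1 \ldots a_n o_n)}{\#(h a_1 \ldots a_n)}$

[Figure: system dynamics matrix with the estimated entry $\hat{p}(t \mid h)$ marked at row $h$, column $t$.]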
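A minimal sketch of this counting estimator, assuming every trajectory starts from a reset so that a history is a trajectory prefix. The function name and the tiny dataset are hypothetical.

```python
def naive_estimate(trajectories, history, test):
    """#(h a1 o1 ... an on) / #(h a1 ... an), with h matched as a prefix."""
    n_h = len(history)
    actions = [a for a, _ in test]
    num = den = 0
    for traj in trajectories:
        window = traj[n_h:n_h + len(test)]
        if traj[:n_h] != history or len(window) < len(test):
            continue
        if [a for a, _ in window] == actions:
            den += 1                     # the test's actions were taken after h
            num += int(window == test)   # and its observations also occurred
    return num / den if den else None


# Hypothetical data: each trajectory is a list of (action, observation) pairs.
data = [
    [("stay", 0), ("flip", 1)],
    [("stay", 0), ("flip", 0)],
    [("stay", 0), ("stay", 0)],
    [("stay", 1), ("flip", 1)],
]
print(naive_estimate(data, [("stay", 0)], [("flip", 1)]))  # 0.5
```

The denominator ignores which observations accompanied the test's actions, which is what makes this estimator sensitive to how the actions were chosen.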


SLIDE 25

Problem

$E[\hat{p}_\bullet(t \mid h)] = p(t \mid h)\, \frac{\prod_{i=1}^{n} \Pr(a_i \mid h a_1 o_1 \ldots a_{i-1} o_{i-1})}{\prod_{i=1}^{n} \Pr(a_i \mid h a_1 \ldots a_{i-1})}$

Definition

A policy is blind if actions are selected independently of preceding observations, i.e., $\Pr(a_n \mid a_1, o_1, \ldots, a_{n-1}, o_{n-1}) = \Pr(a_n \mid a_1, \ldots, a_{n-1})$.

Observation

$\hat{p}_\bullet(t \mid h)$ is an unbiased estimator of $p(t \mid h)$ only if $\pi$ is blind.
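The bias can be seen exactly on a small example. The sketch below assumes a hypothetical two-state POMDP and two hypothetical policies, and uses the large-sample limit of the count ratio (a ratio of exact probabilities) as a stand-in for the expectation. It compares that limit with the true prediction for the empty history.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical toy POMDP (illustration only, not from the talk).
OBS, STATES = (0, 1), (0, 1)
B0 = (F(1, 2), F(1, 2))
T = {"stay": ((F(1), F(0)), (F(0), F(1))),
     "flip": ((F(0), F(1)), (F(1), F(0)))}
Z = {"stay": ((F(3, 4), F(1, 4)), (F(1, 4), F(3, 4))),
     "flip": ((F(1, 2), F(1, 2)), (F(1, 2), F(1, 2)))}


def seq_prob(seq):
    """p(seq): probability of the observations given the actions."""
    w = B0
    for a, o in seq:
        w = tuple(sum(w[s] * T[a][s][s2] for s in STATES) * Z[a][s2][o]
                  for s2 in STATES)
    return sum(w)


def blind(history, a):
    """Uniform random actions, ignoring observations."""
    return F(1, 2)


def nonblind(history, a):
    """Prefers 'flip' after observing 1 (hypothetical non-blind policy)."""
    if history and history[-1][1] == 1:
        return F(3, 4) if a == "flip" else F(1, 4)
    return F(1, 2)


def joint(seq, policy):
    """Pr(seq) = p(seq) * pi(seq) when actions are drawn from `policy`."""
    w = seq_prob(seq)
    for i, (a, _) in enumerate(seq):
        w *= policy(seq[:i], a)
    return w


def naive_limit(test, policy):
    """Limit of #(a1 o1 ... an on) / #(a1 ... an) from the empty history."""
    actions = [a for a, _ in test]
    same_actions = (list(zip(actions, obs))
                    for obs in product(OBS, repeat=len(test)))
    return joint(test, policy) / sum(joint(s, policy) for s in same_actions)


t = [("stay", 1), ("flip", 0)]
print(seq_prob(t), naive_limit(t, blind), naive_limit(t, nonblind))
# true p(t) = 1/4, blind limit = 1/4, non-blind limit = 3/10
```

Under the non-blind policy the limit is 3/10 rather than 1/4, because sequences in which "flip" follows observation 1 are over-represented in the data.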


SLIDE 26

What Data?

$a_1, o_1, a_2, o_2, \ldots, a_n, o_n$

How are actions chosen?

Unknown policy.
Known policy.
Controlled policy.


SLIDE 27

Prediction Estimators

Policy is Known

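$\hat{p}_\pi(t \mid h) = \frac{\#ht}{\#h} \cdot \frac{1}{\pi(t \mid h)}$

Policy is Not Known

$\hat{p}_{\pi\times}(t \mid h) = \prod_{i=1}^{n} \frac{\#(h a_1 o_1 \ldots a_i o_i)}{\#(h a_1 o_1 \ldots a_i)}$

Theorem

$\hat{p}_\pi(t \mid h)$ and $\hat{p}_{\pi\times}(t \mid h)$ are unbiased estimators of $p(t \mid h)$.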
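Both estimators reduce to prefix counts. The sketch below assumes reset-style data in which every trajectory is long enough to contain the history followed by the test, and in which all counts in the denominators are nonzero. The function names and the hand-made dataset are hypothetical.

```python
from fractions import Fraction as F


def count(trajectories, prefix, last_action=None):
    """Trajectories starting with `prefix`, optionally then `last_action`."""
    n = len(prefix)
    return sum(1 for tr in trajectories
               if tr[:n] == prefix
               and (last_action is None
                    or (len(tr) > n and tr[n][0] == last_action)))


def known_policy_estimate(trajectories, history, test, pi_t_given_h):
    """(#ht / #h) * 1 / pi(t|h), for a known policy weight pi(t|h)."""
    ratio = F(count(trajectories, history + test), count(trajectories, history))
    return ratio / pi_t_given_h


def unknown_policy_estimate(trajectories, history, test):
    """Product over i of #(h a1 o1 ... ai oi) / #(h a1 o1 ... ai)."""
    est, prefix = F(1), list(history)
    for a, o in test:
        est *= F(count(trajectories, prefix + [(a, o)]),
                 count(trajectories, prefix, a))
        prefix.append((a, o))
    return est


# Hypothetical counts, shaped like data from a policy that prefers
# "flip" after observing 1 (32 trajectories of two steps each).
data = ([[("stay", 1), ("flip", 0)]] * 3 + [[("stay", 1), ("flip", 1)]] * 3
        + [[("stay", 1), ("stay", 1)]] * 2 + [[("stay", 0), ("flip", 0)]] * 2
        + [[("stay", 0), ("flip", 1)]] * 2 + [[("stay", 0), ("stay", 0)]] * 4
        + [[("flip", 0), ("stay", 0)]] * 8 + [[("flip", 1), ("flip", 1)]] * 8)

t = [("stay", 1), ("flip", 0)]
print(known_policy_estimate(data, [], t, F(3, 8)))  # 1/4
print(unknown_policy_estimate(data, [], t))         # 1/4
```

On the same counts, the blind-policy ratio #(stay 1 flip 0) / #(stay · flip ·) would be 3/10, so the two corrected estimators agree with each other while the uncorrected one does not.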


SLIDE 28

Exploration

Goal

Choose actions to reduce error in the estimated system dynamics matrix.

Approach

Add intelligent exploration to James & Singh’s “reset” algorithm.
Since $\hat{p}_\pi(t \mid h)$ is an unbiased estimator, we want to take actions to reduce the variance.
Solve as an optimization problem.


SLIDE 29

Estimator Variance

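$V\left[\hat{p}_\pi(t \mid h) \,\middle|\, \#h = n\right] = \frac{p(t \mid h)}{n\, \pi(t \mid h)} - \frac{p(t \mid h)^2}{n} \le \frac{1}{4 n\, \pi(t \mid h)^2}$

$E\left[\, V\left[\hat{p}_\pi(t \mid h) \,\middle|\, \#h = n\right] \,\middle|\, k \text{ trajectories} \right] \le \frac{1}{4 k\, p(h)\, \pi(h)\, \pi(t \mid h)^2}$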
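The first line can be checked numerically. The sketch below assumes that, given #h = n, the count #ht is binomial with success probability p(t|h)π(t|h), and compares the exact variance of the known-policy estimator with the closed form and the bound. The values of p, π and n are hypothetical.

```python
from fractions import Fraction as F
from math import comb


def exact_moments(p, pi, n):
    """Mean and variance of (#ht / n) / pi when #ht ~ Binomial(n, p * pi)."""
    q = p * pi
    pmf = [comb(n, k) * q**k * (1 - q)**(n - k) for k in range(n + 1)]
    est = [F(k, n) / pi for k in range(n + 1)]
    mean = sum(w * x for w, x in zip(pmf, est))
    var = sum(w * (x - mean)**2 for w, x in zip(pmf, est))
    return mean, var


p, pi, n = F(1, 4), F(3, 8), 10  # hypothetical p(t|h), pi(t|h), #h
mean, var = exact_moments(p, pi, n)
closed_form = p / (n * pi) - p**2 / n
bound = 1 / (4 * n * pi**2)
print(mean, var, closed_form, bound)  # 1/4 29/480 29/480 8/45
```

The bound follows from q(1 - q) ≤ 1/4 with q = p(t|h)π(t|h), so it is loosest when q is far from 1/2.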


SLIDE 30

Exploration

Intuition

Find the policy that maximizes the worst-case (over all predictions) bound on the root expected inverse variance.

Optimization Problem

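Maximize: $\min_{h,t} \left[ \sqrt{v_{i-1}(h, t)^{-1}} + 2 \sqrt{k_i\, p(h)}\; \pi(ht) \right]$

Subject to: sequence form constraints on $\pi(ht)$:

1. $\pi(\phi) = 1$,
2. $\forall h,\ o \in O:\ \pi(h) = \sum_{a} \pi(hao)$, and
3. $\forall h,\ a \in A,\ \{o, o'\} \subseteq O:\ \pi(hao) = \pi(hao')$.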
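The three sequence form constraints can be illustrated without solving the optimization itself. The sketch below assumes a hypothetical non-blind policy over two actions and two observations, builds its sequence weights π(h), and checks the constraints for all histories up to a small depth. It is only a feasibility check, not the talk's exploration algorithm.

```python
from fractions import Fraction as F
from itertools import product

ACTIONS, OBS = ("stay", "flip"), (0, 1)  # hypothetical action/observation sets


def nonblind(history, a):
    """Hypothetical policy: prefers 'flip' after observing 1."""
    if history and history[-1][1] == 1:
        return F(3, 4) if a == "flip" else F(1, 4)
    return F(1, 2)


def seq_weight(h):
    """pi(h): product of the policy's action probabilities along h."""
    w = F(1)
    for i, (a, _) in enumerate(h):
        w *= nonblind(h[:i], a)
    return w


def satisfies_sequence_form(pi, depth):
    """Check constraints 1-3 for every history shorter than `depth`."""
    steps = list(product(ACTIONS, OBS))
    if pi(()) != 1:                                    # constraint 1
        return False
    for n in range(depth):
        for h in product(steps, repeat=n):
            for o in OBS:                              # constraint 2
                if pi(h) != sum(pi(h + ((a, o),)) for a in ACTIONS):
                    return False
            for a in ACTIONS:                          # constraint 3
                if len({pi(h + ((a, o),)) for o in OBS}) != 1:
                    return False
    return True


def bad(h):
    """Not a valid policy: the weight depends on the final observation."""
    if not h:
        return F(1)
    return F(1, 2) if h[-1][1] == 0 else F(1, 4)


print(satisfies_sequence_form(seq_weight, 2), satisfies_sequence_form(bad, 2))
# True False
```

Because the constraints are linear in the sequence weights, any objective that is also linear in π(ht) keeps the problem within reach of standard linear programming tools.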


SLIDE 31

Results

[Figure: three panels (Paint, Tiger, Float-reset), each plotting testing error (log scale, 1e-04 to 0.1) against sample size for Random and Non-blind exploration.]


SLIDE 32

Summary

Contributions

Unbiased prediction estimators for non-blind policies.
Variance analysis in the case of a known policy.
Estimators used in “intelligent” exploration, which the results show can speed learning.

Future Work

Better objective functions for exploration.
Investigate when non-blind exploration proves helpful.

Questions?
