Model Estimation Within Planning and Learning
Alborz Geramifard (PowerPoint presentation)


SLIDE 1

Model Estimation Within Planning and Learning

Alborz Geramifard
ICML Workshop, June 2011
agf@mit.edu

SLIDE 2

Joint Work

Joshua Joseph, Joshua Redding, Jonathan How

SLIDE 3

Rewards: +1, -1, -0.01
20% noise
Policies: Conservative, Aggressive, Optimal
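The 20% noise in this example means the intended action is occasionally replaced by a random one. A minimal sketch of such a noisy gridworld transition (the grid size and action set here are illustrative placeholders, not from the slides):

```python
import random

# Four compass moves as (row, col) offsets on a small grid.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def noisy_step(state, action, rows=4, cols=5, noise=0.2, rng=random):
    """With probability `noise`, a random action replaces the intended one;
    moves off the grid leave the agent in place."""
    if rng.random() < noise:
        action = rng.choice(list(ACTIONS))
    dr, dc = ACTIONS[action]
    r = min(max(state[0] + dr, 0), rows - 1)
    c = min(max(state[1] + dc, 0), cols - 1)
    return (r, c)

noisy_step((0, 0), "right")  # usually (0, 1); sometimes a random slip
```

Under this dynamics, a conservative policy detours around the -1 states while an aggressive one risks them for the shorter path, which is the trade-off the slide's three policies illustrate.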


SLIDE 7

Big Picture

Planner: fast, safe, sub-optimal solution.
Model: estimator of the true model using a parametric form.
Learner: a reinforcement learning algorithm running online.

SLIDE 8

Question

A framework to integrate the planner, model, and learner.
Goal: explore safely, reduce sample complexity, and reach the optimal solution asymptotically.

SLIDE 9

Existing Gap

Overly restrictive [Heger 1994]
Lack of analytical convergence [Geibel et al. 2005]
No safety guarantees [Abbeel et al. 2005]
Requires the planner's value function [Knox et al. 2010]

SLIDE 10

Contributions

Extended our previous framework to support adaptive modeling.
Empirically verified the advantage of the new approach.
Discussed the limitations of our approach and provided two potential solutions.

SLIDE 11

Approach

intelligent Cooperative Control Architecture (iCCA): the planner initializes the policy and regulates the exploration of the learner.

Exploit & Explore

SLIDE 12

Previous Work [ACC 2011]

Offline: the planner computes a policy πp from a static model.

Online, at each step:
Suggest action? The learner applies an Rmax-type knownness test.
No: execute the planner's action, a ∼ πp.
Yes: the learner proposes a ∼ πl, and the planner asks "Safe action?"
No: fall back to a ∼ πp.
Yes: execute the learner's action.
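The action-selection logic in the flowchart above can be sketched as follows. This is not the authors' code: `knownness`, `pi_l`, `pi_p`, `is_safe`, and the threshold are hypothetical stand-ins for the learner's Rmax-style knownness test, the learner and planner policies, and the planner's safety check.

```python
def choose_action(state, pi_l, pi_p, knownness, is_safe, threshold=0.9):
    """Follow the learner only where its policy is both known and safe."""
    if knownness(state) >= threshold:   # "Suggest action?" (Rmax-type test)
        a = pi_l(state)                 # learner suggests a ~ pi_l
        if is_safe(state, a):           # planner's "Safe action?" check
            return a
    return pi_p(state)                  # otherwise fall back to a ~ pi_p
```

For example, with `knownness` returning 1.0 and `is_safe` always true, the learner's action is executed; lowering knownness below the threshold, or failing the safety check, defers to πp.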

SLIDE 23

New Approach

The same architecture, but the static model is replaced with an adaptive model: online experience from the learner updates the model, and the planner replans against the updated model to refresh πp. The Rmax-type suggestion test and the safety check are unchanged.
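One simple way to realize the adaptive model for this gridworld is to estimate its single noise parameter from observed transitions. The slides do not give the estimator; the counting scheme below is an illustrative assumption.

```python
class NoiseModel:
    """Estimate transition noise as the empirical fraction of steps
    whose observed outcome differs from the deterministic prediction."""

    def __init__(self):
        self.n_steps = 0
        self.n_slips = 0

    def update(self, predicted_next, observed_next):
        """Record one transition: did the intended action's outcome occur?"""
        self.n_steps += 1
        if observed_next != predicted_next:
            self.n_slips += 1

    @property
    def noise(self):
        return self.n_slips / self.n_steps if self.n_steps else 0.0
```

The planner can then replan whenever this estimate changes enough to alter which of its policies is appropriate.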

SLIDE 24

Empirical Results

100 learning trials with the Gridworld; ε-greedy policy.

Sarsa, iCCA: noise = 40%, planner's policy = conservative.
AM-iCCA: initial noise estimate = 40%; if the estimated noise ≤ 25%, the planner's policy = aggressive, else conservative.
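The AM-iCCA switching rule in this setup reduces to a one-line threshold test (the function name is illustrative):

```python
def planner_policy(noise_estimate, threshold=0.25):
    """AM-iCCA rule from the slide: aggressive when estimated noise <= 25%,
    conservative otherwise."""
    return "aggressive" if noise_estimate <= threshold else "conservative"
```

Starting from the 40% initial estimate, AM-iCCA begins conservative and switches to the aggressive planner policy once its noise estimate drops to 25% or below.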

SLIDE 25

Empirical Results

[Figure: Return vs. Steps (0 to 10,000) for the Aggressive Policy, AM-iCCA, Sarsa, iCCA, and the Conservative Policy.]

SLIDE 26

Extensions

What if the parametric form of the model cannot represent the true model? Two potential solutions:
When knownness is high, ignore the safety check.
Estimate the value of planner policies by reflecting back on the past data.
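One concrete way to "reflect back on the past data" is off-policy evaluation with per-decision importance sampling, estimating a planner policy's value from trajectories gathered under a different behavior policy. The slides do not name an estimator; this sketch is an assumption, with illustrative names throughout.

```python
def importance_sampling_return(trajectory, pi_eval, pi_behavior, gamma=1.0):
    """Per-decision importance-sampling estimate of the evaluation
    policy's return from one trajectory of (state, action, reward)
    tuples collected under the behavior policy. `pi_eval(a, s)` and
    `pi_behavior(a, s)` return action probabilities."""
    weight, ret, discount = 1.0, 0.0, 1.0
    for s, a, r in trajectory:
        weight *= pi_eval(a, s) / pi_behavior(a, s)  # cumulative ratio
        ret += discount * weight * r                  # reweighted reward
        discount *= gamma
    return ret
```

When `pi_eval` equals `pi_behavior`, every ratio is 1 and the estimate reduces to the plain discounted return, which is a useful sanity check.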

SLIDE 27

Contributions

Extended our previous framework to support adaptive modeling.
Empirically verified the advantage of the new approach.
Discussed the limitations of our approach and provided two potential solutions.