Model Estimation Within Planning and Learning
Alborz Geramifard
ICML Workshop - June 2011
agf@mit.edu
Joint Work: Joshua Redding, Joshua Joseph, Jonathan How
[Figure: labels +1, -1, -0.01, 20% noise. Policies shown: Conservative, Aggressive, Optimal]
Exploit & Explore
[ACC 2011]
Offline: Static Model, Planner
Online: Learner
Suggest action? (Rmax type)
No: a ∼ πp
Yes: a ∼ πl, then Safe Action? (No / Yes)
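The decision flow on this slide (Suggest action?, then Safe Action?) can be sketched as a single action-selection step. This is a minimal illustration, not the authors' implementation: the function name `select_action` and the callables `learner_suggests` and `is_safe` are hypothetical, and the fallback to the planner's policy when the learner's action is judged unsafe is an assumption, since the slide only shows the "No" branch without stating what follows it.

```python
def select_action(state, planner_policy, learner_policy, learner_suggests, is_safe):
    """Pick one action following the two questions on the slide.

    All arguments except `state` are hypothetical callables:
      planner_policy(state)   -> action sampled from pi_p
      learner_policy(state)   -> action sampled from pi_l
      learner_suggests(state) -> bool, the "Suggest action?" check
      is_safe(state, action)  -> bool, the "Safe Action?" check
    Returns (action, source), where source is "planner" or "learner".
    """
    # "Suggest action?" answered No: a ~ pi_p
    if not learner_suggests(state):
        return planner_policy(state), "planner"

    # "Suggest action?" answered Yes: a ~ pi_l
    action = learner_policy(state)

    # "Safe Action?" answered Yes: keep the learner's action
    if is_safe(state, action):
        return action, "learner"

    # "Safe Action?" answered No. ASSUMPTION: fall back to the planner's policy.
    return planner_policy(state), "planner"
```

Keeping both checks as injected callables leaves the sketch independent of how the learner's confidence or the safety test is actually computed.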
Online: Adaptive Model, Planner, Learner (labeled 1 and 2 on the slide)
Suggest action? (Rmax type)
No: a ∼ πp
Yes: a ∼ πl, then Safe Action? (No / Yes)
Sarsa
iCCA: noise = 40%; planner's policy = conservative
AM-iCCA: initial noise = 40%; if noise ≤ 25%, planner's policy = aggressive, else planner's policy = conservative
ε-greedy policy
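The AM-iCCA rule above (switch the planner's policy once the noise estimate drops to 25% or below) can be sketched as follows. The 40% initial noise and the 25% threshold come from the slide. Everything else is an assumption made for illustration: the function names are hypothetical, and estimating noise as the fraction of observed transitions that deviated from the intended move is a stand-in, since the slide does not say how the adaptive model updates its estimate.

```python
def estimate_noise(intended_count, deviated_count, initial_noise=0.40):
    """Return a noise estimate from transition counts.

    ASSUMPTION: noise is estimated as the fraction of transitions that did
    not follow the intended move. With no data, the initial noise from the
    slide (40%) is returned.
    """
    total = intended_count + deviated_count
    if total == 0:
        return initial_noise
    return deviated_count / total


def choose_planner_policy(noise, threshold=0.25):
    """Apply the slide's rule: noise <= 25% gives aggressive, else conservative."""
    return "aggressive" if noise <= threshold else "conservative"


# Before any data, the 40% initial noise keeps the planner conservative.
print(choose_planner_policy(estimate_noise(0, 0)))
# After 80 intended and 20 deviated transitions, the estimate is 20%.
print(choose_planner_policy(estimate_noise(80, 20)))
```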
Adaptive Model
[Figure: Return versus Steps (x-axis ticks 2000 to 10000, y-axis ticks −3.5 to 1). Curves: Aggressive Policy, AM-iCCA, Sarsa, iCCA, Conservative Policy]