Online Learning within Cooperative Planning
Alborz Geramifard, September 2010, agf@mit.edu
Joint work: Finale Doshi, Josh Redding, Nicholas Roy, Jonathan How
Supported by: AFOSR
Problem
[Figure: mission scenario with waypoints, obstacles, and a base]

Why is this a hard problem?
[Figure: comparison of candidate approaches, with entries marked ✕ or Limited]
[Figure sequence: the proposed architecture couples a Planner with a Learner; the Learner interacts with the environment through its policy (π), sending actions a and receiving rewards r]
Representation: incremental Feature Dependency Discovery (iFDD)
Starting from base features ϕ1, ϕ2, ϕ3, iFDD grows the representation online: pairs of features that are active together accumulate credit from the temporal-difference error, and once a pair's credit crosses a threshold its conjunction (e.g., ϕ1∧ϕ2, then ϕ2∧ϕ3) is added as a new feature; repeating the process yields higher-order conjunctions such as ϕ1∧ϕ2∧ϕ3.
[Figure: animation of the feature lattice growing from ϕ1, ϕ2, ϕ3 to ϕ1∧ϕ2, ϕ2∧ϕ3, and ϕ1∧ϕ2∧ϕ3]
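To make the discovery mechanism concrete, here is a minimal Python sketch (not from the slides) of how candidate conjunctions arise: features are represented as frozensets of base-feature indices, and every pair of co-active features proposes their union as a candidate.

from itertools import combinations

# Active base features at the current state: phi1, phi2, phi3 are all on.
active = [frozenset({1}), frozenset({2}), frozenset({3})]

# Candidate conjunctions are the unions of co-active pairs, i.e.
# phi1^phi2, phi1^phi3, phi2^phi3; Discover (Algorithm 1, later in the
# deck) decides which of them accumulate enough TD error to be promoted.
candidates = {g | h for g, h in combinations(active, 2)}
print(sorted(map(sorted, candidates)))  # [[1, 2], [1, 3], [2, 3]]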
UAV mission planning scenario
[Figure: a team of UAVs (each with fuel = 10) servicing targets; vehicle tasks include Maintenance, Refuel, Communication, and Target; available actions are Advance, Retreat, and Loiter]
[Figure: Return vs. Steps (up to 10 x 10^4) on the UAV mission domain, comparing the Initial, ATC, SDM, Tabular, and Initial+iFDD representations]
[Figure: an eight-node stochastic domain with rewards of +1, +5, and +10 and transition success probabilities of 0.8 and 0.5]
[Figure: Return vs. Steps (up to 10 x 10^4) on this domain, comparing the Initial+iFDD, ATC, Initial, Tabular, and SDM representations]
Planner + Learner
intelligent Cooperative Control Architecture (iCCA)
[Figure: the Cooperative Planner, Learning Algorithm, and Performance Analysis modules interact with the Agent/Vehicle and the World]
[Redding, Geramifard, How, ACC 2010]
Cooperative Learner
[Figure: iCCA instantiation in which the Cooperative Planner is the Consensus-Based Bundle Algorithm (CBBA) and the Learner is paired with a Risk Analysis module, acting on the Agent/Vehicle in the World]
[Figure: Return vs. Steps (0 to 10,000), comparing Optimal, NAC, CNAC, and the Planner]
[Figure: Return vs. Steps (0 to 10,000), comparing Optimal, Sarsa, CSarsa, and the Planner]
[Figure: an eight-node UAV scenario with stochastic transitions (success probabilities 0.5, 0.6, 0.7), rewards of +100, +200, and +300, and target visit windows [2,3] and [3,4]]
[Figure: Optimality (40-100) vs. P(Crash) (0%-100%) for the Planner alone, the Learner alone, and the combined Planner + Learner]
Algorithm 1: Discover
Input: φ(s), δt, ξ, F, ψ
Output: F, ψ
foreach (g, h) ∈ {(i, j) | φi(s) φj(s) = 1} do
    f ← g ∧ h
    if f ∉ F then
        ψf ← ψf + |δt|
        if ψf > ξ then
            F ← F ∪ {f}
        end
    end
end
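A minimal Python sketch of Discover under assumptions the slide leaves open: features are frozensets of base-feature indices, F is the set of promoted features, and psi is a dictionary of accumulated TD-error credit; the names are illustrative, not the original implementation.

from itertools import combinations

def discover(active, delta_t, xi, F, psi):
    # Credit |delta_t| to every candidate conjunction of co-active
    # features; promote a candidate once its credit psi exceeds xi.
    for g, h in combinations(active, 2):
        f = g | h                          # conjunction = union of index sets
        if f not in F:
            psi[f] = psi.get(f, 0.0) + abs(delta_t)
            if psi[f] > xi:
                F.add(f)                   # promote candidate to a feature
    return F, psi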
Algorithm 2: Activate Features
Input: φ0(s), F
Output: φ(s)
φ(s) ← 0̄
activeInitialFeatures ← {i | φ0_i(s) = 1}
Candidates ← ℘(activeInitialFeatures)   (* sorted by set size *)
while activeInitialFeatures ≠ ∅ do
    f ← Candidates.next()
    if f ∈ F then
        activeInitialFeatures ← activeInitialFeatures − f
        φf(s) ← 1
    end
end
return φ(s)
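A hedged Python sketch of the activation step. One assumption is made explicit here: candidates are visited from largest to smallest set size, so a learned conjunction claims the base features it covers before any smaller subset can; the subset guard plays the role of removing indices from activeInitialFeatures.

from itertools import combinations

def activate_features(phi0_active, F):
    # phi0_active: set of active base-feature indices for state s.
    # F: set of frozensets, the learned features (including singletons).
    remaining = set(phi0_active)
    active = set()
    for size in range(len(phi0_active), 0, -1):    # largest subsets first
        for cand in combinations(sorted(phi0_active), size):
            f = frozenset(cand)
            if f in F and f <= remaining:
                active.add(f)                      # feature turns on
                remaining -= f                     # its members are claimed
            if not remaining:
                return active
    return active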
[Figure: balancing task, Balancing Steps vs. Steps (up to 10 x 10^4), comparing the Initial, Tabular, Gaussian, ATC, and Initial+iFDD representations]
[Figure: Return vs. Steps (up to 10 x 10^4), comparing the Tabular, Initial, ATC, and Initial+iFDD representations]
[Figure: the Agent couples the CBBA Planner with an Actor-Critic learner and a Risk Analysis controller (iCCA), interacting with an MDP]
Algorithm 1: Cooperative Natural Actor-Critic (CNAC)
Input: πp, ξ
Output: a
a ∼ πAC(s, ·)
if not safe(s, a) then
    P(s, a) ← P(s, a) − ξ
    a ← πp(s)
end
Q(s, a) ← Q(s, a) + α δt(Q)
P(s, a) ← P(s, a) + α Q(s, a)
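A minimal Python sketch of one CNAC step, with simplifying assumptions not in the slide: a tabular softmax actor over preferences P, dictionaries for P and Q, a caller-supplied td_error(s, a) for the critic, and pi_p / safe standing in for the planner policy and Algorithm 2.

import math
import random

def cnac_step(s, actions, pi_p, safe, td_error, P, Q, alpha, xi):
    # Sample an action from the softmax (Gibbs) actor over preferences P.
    weights = [math.exp(P.get((s, a), 0.0)) for a in actions]
    a = random.choices(actions, weights=weights)[0]

    if not safe(s, a):
        # Penalize the risky action's preference and defer to the planner.
        P[(s, a)] = P.get((s, a), 0.0) - xi
        a = pi_p(s)

    # Critic update with the TD error, then actor update toward Q.
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error(s, a)
    P[(s, a)] = P.get((s, a), 0.0) + alpha * Q[(s, a)]
    return a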
Algorithm 2: safe
Input: s, a
Output: isSafe
risk ← 0
for i ← 1 to M do
    t ← 1
    st ∼ T^p(s, a)
    while not constrained(st) and not isTerminal(st) and t < H do
        st+1 ∼ T^p(st, πp(st))
        t ← t + 1
    end
    risk ← risk + (1/i) (constrained(st) − risk)
end
isSafe ← (risk < ψ)
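A sketch of the Monte Carlo risk check in Python, assuming a generative planner model sample_next(s, a) for T^p and indicator predicates constrained and is_terminal; M, H, and the threshold psi are illustrative values.

def safe(s, a, pi_p, sample_next, constrained, is_terminal,
         M=100, H=50, psi=0.1):
    risk = 0.0
    for i in range(1, M + 1):
        t = 1
        st = sample_next(s, a)            # one rollout under the model T^p
        while not constrained(st) and not is_terminal(st) and t < H:
            st = sample_next(st, pi_p(st))
            t += 1
        # Incremental mean of the constraint-violation indicator.
        risk += (float(constrained(st)) - risk) / i
    return risk < psi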
Algorithm 3: Cooperative Learning
Input: N, πp, s, learner
Output: a
a ← πp(s)
πl ← learner.π
knownness ← min{1, count(s, a)/N}
if rand() < knownness then
    a′ ∼ πl(s, a)
    if safe(s, a′) then
        a ← a′
    end
else
    count(s, a) ← count(s, a) + 1
end
learner.update()
return a
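A sketch of the knownness-based blending in Python; the learner interface (sample_action, update) and the shared visit counter are assumptions, since the slide leaves them unspecified.

import random
from collections import defaultdict

def cooperative_learning_step(s, N, pi_p, safe, learner, count):
    a = pi_p(s)                              # start from the planner's action
    knownness = min(1.0, count[(s, a)] / N)  # trust grows with experience
    if random.random() < knownness:
        a_prime = learner.sample_action(s)   # learner proposes an override
        if safe(s, a_prime):
            a = a_prime                      # accept only if deemed safe
    else:
        count[(s, a)] += 1                   # otherwise credit the pair
    learner.update()                         # learner trains on the outcome
    return a

Callers would share count = defaultdict(int) across steps so knownness rises as (s, a) pairs are revisited.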
[Figure: the eight-node UAV scenario with stochastic transitions (success probabilities 0.5, 0.6, 0.7), rewards of +100, +200, and +300, and visit windows [2,3] and [3,4]]
(a) Step-based performance: [Figure: Return vs. Steps (up to 10 x 10^4), comparing Actor-Critic, CBBA, Optimal, and iCCA]
(b) Optimality after training: [Figure: Optimality (30-100) for iCCA, CBBA, and Actor-Critic]
References
[...] "methods," in Symposium on Abstraction, Reformulation, and Approximation (SARA), 2005.
J. Z. Kolter and A. Y. Ng, "Regularization and feature selection in least-squares temporal difference learning," in ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning. New York, NY, USA: ACM, 2009, pp. 521-528.
S. Whiteson, M. E. Taylor, and P. Stone, "Adaptive tile coding for value function approximation," University of Texas at Austin, Tech. Rep. AI-TR-07-339, 2007.
[...] "networks," in Proceedings of the Twentieth International Conference on Machine Learning. AAAI Press, 2003, pp. 632-639.
P. Abbeel and A. Y. Ng, "Exploration and apprenticeship learning in reinforcement learning," in Proceedings of the 22nd International Conference on Machine Learning (ICML), 2005.
P. Geibel and F. Wysotzki, "Risk-sensitive reinforcement learning applied to chance constrained control," JAIR, vol. 24, 2005.
[...] in Proceedings of the 11th International Conference, 1994, pp. 105-111.
[...] in Proceedings of the Tenth International Conference on Machine Learning. Morgan Kaufmann, 1993, pp. 314-321.
[...] "Adaptive resolution model-free reinforcement learning: Decision boundary partitioning," in Proceedings of the 17th International Conference on Machine Learning. Morgan Kaufmann, 2000, pp. 783-790.
[...] "reinforcement learning," in ECML, ser. Lecture Notes in Computer Science, J. F. Boulicaut, F. Esposito, F. Giannotti, and D. Pedreschi, Eds., vol. 3201. Springer, 2004.
[...] "difference prediction learning with eligibility traces," in Proceedings of the Third Conference on Artificial General Intelligence (AGI-10), Lugano, Switzerland, 2010.
W. B. Knox and P. Stone, "Combining manual feedback with subsequent MDP reward signals for reinforcement learning," in Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2010), May 2010.
Publications
[...] J. How, "Incremental Feature Dependency Discovery," International Conference on Machine Learning, 2011 (submitted).
[...] J. How, "UAV Cooperative Control with Stochastic Risk Model," American Control Conference, 2011 (submitted).
[...] J. How, "Actor-critic policy learning in cooperative planning," in AIAA Guidance, Navigation, and Control Conference (GNC), 2010.
[...] "Actor-critic policy learning in cooperative planning," in AAAI Spring Symposium Series, 2010.
[...] "An intelligent cooperative control architecture," in American Control Conference, 2010.
[...] "Vehicle to Track and Avoid Adversaries," International Journal of Robotics Research (IJRR), 2008.
[...] "Coordinated Planning Under Uncertainty with Air and Ground Vehicles," in Proceedings of the 11th International Symposium on Experimental Robotics (ISER), 2008.
[...] "with Linear Function Approximation and Prioritized Sweeping," in Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 528-536, 2008. [28% acceptance]
[...] in Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), pp. 379-386, 2008. [22% acceptance]
[...] "Traces & Convergence Analysis," in B. Schölkopf, J. C. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19 (NIPS), pp. 440-448, 2007. [24% acceptance]
[...] "Difference Learning," in Proceedings of the 21st Conference of the American Association for Artificial Intelligence (AAAI), pp. 356-361, 2006. [30% acceptance]
[...] Chubak, V. Bulitko, "Biased Cost Pathfinding," in Proceedings of the Second Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE), 2006. [73% acceptance]
[...] Nayei, R. Zamaninasab, J. Habibi, "A Hybrid Three Layer Architecture for Fire Agent Management in Rescue Simulation Environment," International Journal of Advanced Robotic Systems, vol. [...]
A. Nouri, R. Zamani-Nasab, J. Habibi, A. Geramifard, "Task Allocation in Complex Multiagent Systems with Parallel Scheduling," W[...] & its Disciplines, Kish Island, Iran, February 2004.
[...] "Agents: A Set of Implemented Agents for RoboCup Rescue Simulation Environment," in Proceedings of the RoboCup Symposium, Padova, Italy, 2003.