
Actor-Critic Policy Learning in Cooperative Planning



  1. Actor-Critic Policy Learning in Cooperative Planning
     Josh Redding, Alborz Geramifard, Han-Lim Choi, and Jonathan P. How
     Aerospace Controls Lab, MIT
     August 22, 2011

  2. Cooperative Planning / Introduction: Motivating Example
     [Figure: motivating example, credit A. Whitten, 2010]

  3. Cooperative Planning / Introduction: Challenges of Cooperative Planning
     1. Cooperative planning uses models
        • E.g. vehicle dynamics, fuel use, rules of engagement, embedded strategies, desired behaviors, etc.
        • Models enable anticipation of likely events and prediction of resulting behavior
     2. Models are approximated
        • Planning with stochastic models is time consuming → model simplification
        • Un-modeled uncertainties, parameter uncertainties
     3. Result is sub-optimal planner output
        • Sub-optimalities range from ε to catastrophic
        • Mismatch between actual and expected performance

  4. Cooperative Planning / Introduction: Open Questions
     1. How can current multi-agent planners better balance robustness and performance?
     2. How should learning algorithms be formulated to best address the errors and uncertainties present in the multi-agent planning problem?
     3. How can a learning algorithm be formulated to enable a more intelligent planner response, given stochastic models?

  5. Cooperative Planning / Introduction: Research Objectives
     Focus
       ◮ How can a learning algorithm be formulated to enable a more intelligent planner response, given stochastic models?
     Objectives
       ◮ Increase model fidelity to narrow the gap between expected and actual performance
       ◮ Increase cooperative planner performance over time

  6. Planning + Learning / Framework for Cooperative Planning and Learning: Two Worlds
     ◮ Cooperative Control
       • Provides fast solutions
       • Sub-optimal
     ◮ Online Learning Techniques
       • Handle stochastic systems and unknown models
       • High sample complexity
       • Might crash the plane to learn!
     ◮ Can we take the best of both worlds?

  7. Planning + Learning / Framework for Cooperative Planning and Learning: Best of Both Worlds
     ◮ A cooperative control scheme that learns over time
       • Learning → improve sub-optimal solutions
       • Fast planning → reduce sample complexity
       • Fast planning → avoid catastrophic plans

  8. Planning + Learning / Framework for Cooperative Planning and Learning: A Framework for Planning + Learning
     [Diagram: iCCA block diagram; a cooperative planner, learning algorithm, and performance analysis wrap around the agent/vehicle, which interacts with the world under disturbances, noise, and observations]
     ◮ Template architecture for multi-agent planning and learning (a minimal loop is sketched below)
     ◮ A cooperative planner coupled with learning and analysis algorithms to improve future plans
       • Distinct elements cut the combinatorial complexity of full integration and enable decentralized planning and learning
     ◮ Intelligent cooperative control architecture (iCCA)
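The slide describes the iCCA wiring only at the block-diagram level; the sketch below is one minimal way to read that loop, under stated assumptions. The interfaces (planner, learner, risk_check, world) are illustrative names, not objects defined in the presentation.

```python
# A minimal sketch of the iCCA loop: plan -> seed the policy -> act -> filter risky
# actions -> learn from observed rewards. All object interfaces are assumptions.

def icca_episode(planner, learner, risk_check, world, horizon=100):
    plan = planner.solve(world.initial_state())      # fast, possibly sub-optimal cooperative plan
    learner.seed_policy(plan)                        # bias the learned policy toward the plan
    state = world.reset()
    for _ in range(horizon):
        action = learner.propose_action(state)       # exploratory action proposed by the learner
        if not risk_check(state, action):            # performance analysis rejects risky proposals
            action = plan.safe_action(state)         # fall back to the planned (safe) action
        next_state, reward = world.step(action)
        learner.update(state, action, reward, next_state)
        state = next_state
    return learner
```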

  9. Planning + Learning / Framework for Cooperative Planning and Learning: Merging Point
     ◮ Deterministic → Stochastic
       • Plan (trajectory) → Policy (behavior)
     ◮ Import a plan into a policy
       • Bias the policy toward the states on the planned trajectory
       • Need a method to explicitly represent the policy
     ◮ Avoid taking actions with unsustainable outcomes
       • Override with the safe (planned) action
       • Provide virtual negative feedback
     (A sketch of this biasing and override is given below.)
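As a concrete reading of this "merging point", the sketch below assumes the planner output is a list of (state, action) pairs and the learner's policy is a softmax over preferences P(s, a), as on the later actor-critic slides. PLAN_BIAS and VIRTUAL_PENALTY are illustrative constants, not values from the presentation.

```python
PLAN_BIAS = 5.0          # initial boost given to actions on the planned trajectory (assumed value)
VIRTUAL_PENALTY = -10.0  # virtual negative feedback applied to overridden (risky) actions (assumed value)

def seed_policy_from_plan(preferences, plan):
    """Bias the softmax policy toward the planned trajectory."""
    for state, action in plan:
        preferences[(state, action)] = preferences.get((state, action), 0.0) + PLAN_BIAS

def override_risky_action(preferences, state, risky_action, safe_action):
    """Substitute the planned (safe) action and discourage the risky one in the future."""
    preferences[(state, risky_action)] = preferences.get((state, risky_action), 0.0) + VIRTUAL_PENALTY
    return safe_action
```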

  10. Problem Description / Scenario: Stochastic Weapon-Target Assignment
     [Figure: scenario map with fuel-limited UAVs (triangles) and numbered targets (circles); each target is annotated with its reward (+100 to +300), the probability of receiving it (0.5 to 0.7, shown in the nearest cloud), and its visit-time window (e.g. [2,3], [3,4])]
     ◮ Scenario: a small team of fuel-limited UAVs (triangles) in a simple, uncertain world cooperate to visit a set of targets (circles) with stochastic rewards
     ◮ Objective: maximize collective reward
     ◮ Key features:
       • Stochastic target rewards (probability shown in the nearest cloud)
       • Specific windows for target visit-times
     (One possible encoding of such targets is sketched below.)
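One possible encoding of these stochastic, time-windowed targets is sketched below. The specific rewards, probabilities, and windows are placeholders; the actual target layout belongs to the original figure and is not recoverable from the text.

```python
from dataclasses import dataclass

@dataclass
class Target:
    reward: float                  # reward collected if the target pays out
    probability: float             # chance the reward is actually received on a visit
    visit_window: tuple[int, int]  # allowed visit times (start, end)

# Placeholder values only; not the scenario's actual target parameters.
targets = {
    "T1": Target(reward=100.0, probability=0.5, visit_window=(2, 3)),
    "T2": Target(reward=200.0, probability=0.7, visit_window=(3, 4)),
}

def expected_reward(t: Target) -> float:
    """Expected payoff of a single (feasible, in-window) visit."""
    return t.probability * t.reward
```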

  11. Problem Description / Scenario: Stochastic WTA Formulation under iCCA
     [Diagram: iCCA template with cooperative planner, learning algorithm, and performance analysis blocks around the agent/vehicle and world]
     ◮ Apply the iCCA template [Redding et al., 2010]
     ◮ Cooperative Planner ← Consensus-Based Bundle Algorithm (CBBA)
     ◮ Learning Algorithm ← Actor-Critic Reinforcement Learning
     ◮ Performance Analysis ← Risk Assessment

  12. Problem Description / Scenario: Stochastic WTA Formulation under iCCA
     [Diagram: instantiated iCCA loop with CBBA, actor-critic RL, and risk analysis; CBBA provides an initial policy π0, candidate policies πa(x) and πb(x) are filtered into the executed policy π(x), and states x and rewards r(x) are fed back from the world]
     ◮ Apply the iCCA template [Redding et al., 2010]
     ◮ Cooperative Planner ← Consensus-Based Bundle Algorithm (CBBA)
     ◮ Learning Algorithm ← Actor-Critic Reinforcement Learning
     ◮ Performance Analysis ← Risk Assessment

  13. Problem Description / Cooperative Planner: Stochastic WTA Formulation under iCCA
     [Diagram: same instantiated iCCA loop, highlighting the CBBA planner block]
     ◮ Consensus-Based Bundle Algorithm (CBBA)
       • CBBA is a deterministic planner
       • Applying CBBA to a stochastic problem introduces sub-optimalities
       • CBBA provides a "plan", which seeds an initial policy π0
       • π0 provides contingency actions

  14. Problem Description / Cooperative Planner: Consensus-Based Bundle Algorithm
     ◮ Current approach is inspired by the Consensus-Based Bundle Algorithm (CBBA) [Choi, Brunet, How, TRO 2009]
       • Key new idea: focus on agreement of plans; combines an auction mechanism for decentralized task selection with a consensus protocol for resolving conflicted selections
       • Note: auction without an auctioneer
     ◮ Consensus on information and on winning bids / winning agents
       • Situational awareness used to improve score estimates
       • Best bid for each task used to allocate tasks without conflicts
         y_i(j) = what agent i thinks is the maximum bid on task j
         z_i(j) = who agent i thinks bid the maximum value on task j
     ◮ Distributed algorithm, but also provides a fast centralized solution
     (A simplified sketch of this bid bookkeeping follows below.)
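The sketch below shows only the "adopt the larger bid" intuition behind the y_i(j) and z_i(j) tables defined above. Real CBBA resolves conflicts with a more elaborate decision table (including message timestamps); this simplified version is an assumption for illustration.

```python
def merge_neighbor_bids(y_i, z_i, y_k, z_k):
    """Update agent i's bid tables from neighbor k's tables.

    y_i[j]: highest bid agent i knows of for task j
    z_i[j]: agent that i believes placed that bid
    """
    for task, neighbor_bid in y_k.items():
        if neighbor_bid > y_i.get(task, 0.0):
            y_i[task] = neighbor_bid   # neighbor knows of a higher bid on this task
            z_i[task] = z_k[task]      # and of who placed it
    return y_i, z_i
```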

  15. Problem Description / Cooperative Planner: Consensus-Based Bundle Algorithm
     ◮ Distributed multi-task assignment algorithm: CBBA
       • Each agent carries a single bundle of tasks, populated by a greedy task selection process
       • Consensus on the marginal score of each task, not the overall bundle score ⇒ suboptimal, but avoids bundle enumeration
     [Diagram: Phase 1 / Phase 2 loop, iterating until all tasks are assigned]
     ◮ Phase 1: Bundle construction
       • Add the task that gives the largest marginal score improvement
       • Populate the bundle to its full length L_t (or to feasibility)
     ◮ Phase 2: Conflict resolution; locally exchange y, z, t_i
       • A sophisticated decision map is needed to account for the dependency of marginal scores on previous selections
       • If an agent is outbid for a task in its bundle, it releases all tasks in the bundle following that task
     (An illustrative sketch of both phases appears below.)
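The following is an illustrative sketch of Phase 1 (greedy bundle construction) and the release rule from Phase 2. The helper marginal_score(bundle, task) is an assumption standing in for the scenario's scoring function; the real algorithm also checks timing feasibility and exchanges (y, z, t) with neighbors.

```python
def build_bundle(tasks, marginal_score, max_length):
    """Phase 1: greedily append the task with the largest marginal score improvement."""
    bundle = []
    remaining = set(tasks)
    while len(bundle) < max_length and remaining:
        best = max(remaining, key=lambda t: marginal_score(bundle, t))
        if marginal_score(bundle, best) <= 0:
            break                      # no remaining task adds value; stop early
        bundle.append(best)
        remaining.remove(best)
    return bundle

def release_after_outbid(bundle, outbid_task):
    """Phase 2 release rule: drop the outbid task and every task added after it,
    since their marginal scores depended on the earlier selection."""
    if outbid_task in bundle:
        return bundle[:bundle.index(outbid_task)]
    return bundle
```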

  16. Problem Description / Learning Algorithm: Reinforcement Learning
     ◮ Value function:
         Q^π(s, a) = E_π[ Σ_{t=0}^{∞} γ^(t−1) r_t | s_0 = s, a_0 = a ]
     ◮ Temporal Difference (TD) learning:
         Q^π(s_t, a_t) ← Q^π(s_t, a_t) + α δ_t(Q^π)
         δ_t(Q^π) = r_t + γ Q^π(s_{t+1}, a_{t+1}) − Q^π(s_t, a_t)
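A tabular sketch of this TD update follows (SARSA-style, since the target uses the action actually taken at t+1). The dictionary representation of Q and the default step size and discount are assumptions for illustration.

```python
def td_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    """One temporal-difference update on a tabular action-value function Q[(s, a)]."""
    q_sa = Q.get((s, a), 0.0)
    delta = r + gamma * Q.get((s_next, a_next), 0.0) - q_sa   # TD error δ_t
    Q[(s, a)] = q_sa + alpha * delta
    return delta
```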

  17. Problem Description / Learning Algorithm: Stochastic WTA Formulation under iCCA
     [Diagram: same instantiated iCCA loop, highlighting the actor-critic RL block]
     ◮ Actor-Critic Reinforcement Learning
       • Combination of two popular RL thrusts: policy search methods (actor) and value-based techniques (critic)
       • Reduced variance of the policy gradient estimate
       • Natural Actor-Critic [Bhatnagar et al., 2007] reduces variance further
       • Convergence guarantees

  18. Problem Description / Learning Algorithm: Actor-Critic Reinforcement Learning
     ◮ Explore parts of the world likely to lead to better system performance
     ◮ Actor-critic learning: π(s) (actor) and Q(s, a) (critic)
     ◮ The actor handles the policy:
         π(a|s) = e^(P(s,a)/τ) / Σ_b e^(P(s,b)/τ)
     ◮ P(s, a): preference of taking action a from state s
     ◮ τ ∈ [0, ∞) acts as a temperature (from greedy to random action selection)
     ◮ P(s, a) ← P(s, a) + α Q(s, a)
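A small sketch of this Gibbs/softmax actor is given below: action probabilities are proportional to exp(P(s, a)/τ), and preferences are nudged toward actions the critic values, mirroring the P(s, a) update on the slide. The table data structures and default parameters are assumptions.

```python
import math
import random

def softmax_policy(P, state, actions, tau=1.0):
    """Sample an action with probability proportional to exp(P(s, a) / tau)."""
    prefs = [P.get((state, a), 0.0) / tau for a in actions]
    m = max(prefs)                                   # subtract the max for numerical stability
    weights = [math.exp(p - m) for p in prefs]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(actions, weights=probs)[0]

def actor_update(P, Q, state, action, alpha=0.1):
    """Move the preference for (state, action) in the direction of the critic's value."""
    P[(state, action)] = P.get((state, action), 0.0) + alpha * Q.get((state, action), 0.0)
```

Small τ makes the policy nearly greedy with respect to the preferences, while large τ approaches uniform random action selection, matching the temperature role described on the slide.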
