  1. A Composable Specification Language for Reinforcement Learning Tasks
     Kishor Jothimurugan, Rajeev Alur, Osbert Bastani

  2. Control System
     - Continuous states and actions: the controller observes a state s ∈ S and issues a control input a ∈ A
     - The system can be probabilistic
     - Discrete-time system
     - Finite time horizon T
     S = set of system states, A = set of control inputs

  3. Reinforcement Learning
     - Use a neural network to map states s ∈ S to actions a ∈ A
     - Design a reward function R mapping runs to rewards
     - Learn the network parameters optimizing the expected reward
     S = set of system states, A = set of control inputs
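
     The loop described here can be made concrete with a minimal sketch. Everything in it is an assumption for illustration: a hypothetical 2D point-mass system, a hand-picked goal-distance reward over runs, and a crude hill-climbing search standing in for a real RL algorithm.

```python
import numpy as np

GOAL_Q = np.array([5.0, 0.0])   # hypothetical goal location "q"

def rollout(params, rng, horizon=20, noise=0.05):
    """Simulate one run of a hypothetical 2D point-mass system under the policy."""
    W1, b1, W2, b2 = params
    s = np.zeros(2)
    run = [s]
    for _ in range(horizon):
        h = np.tanh(W1 @ s + b1)        # neural network maps the state ...
        a = np.tanh(W2 @ h + b2)        # ... to a bounded action
        s = s + a + noise * rng.standard_normal(2)
        run.append(s)
    return run

def reward(run):
    """Reward function R over whole runs: how close the run gets to the goal q."""
    return -min(np.linalg.norm(s - GOAL_Q) for s in run)

def init_params(rng):
    return [rng.standard_normal((8, 2)), np.zeros(8),
            rng.standard_normal((2, 8)), np.zeros(2)]

# Learn the network parameters by (crudely) maximizing the estimated expected reward.
rng = np.random.default_rng(0)
best = init_params(rng)
best_r = np.mean([reward(rollout(best, rng)) for _ in range(10)])
for _ in range(200):
    cand = [p + 0.1 * rng.standard_normal(p.shape) for p in best]
    r = np.mean([reward(rollout(cand, rng)) for _ in range(10)])
    if r > best_r:
        best, best_r = cand, r
print("estimated expected reward:", best_r)
```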

  4. Reward Functions
     - Too low-level compared to a logical specification
     - No obvious way to compose rewards
     R₁: reward function for "Reach q"
     R₂: reward function for "Reach p"
     What is the reward function for "Reach q and then reach p"?

  5. Need to generate a reward function from a given logical specification

  6. Need for Memory
     - Specification: Reach q, then reach p, then reach r
     - The controller maps states to actions
     - The action at p depends on the history of the run
     Solution: add an additional state component to s indicating whether q has already been visited (see the sketch below)
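
     A minimal sketch of this fix, with an assumed goal location and a hypothetical "near" predicate: the policy's input becomes the system state augmented with a bit recording whether q has been visited.

```python
import numpy as np

Q = np.array([5.0, 0.0])        # hypothetical goal "q"

def near(s, goal, tol=1.0):
    return np.linalg.norm(s - goal) < tol

def augment(s, visited_q):
    """Extended state: the system state plus a memory bit for 'q already visited'."""
    return np.concatenate([s, [1.0 if visited_q else 0.0]])

def update_memory(s, visited_q):
    """Update the extra component after observing the new system state."""
    return visited_q or near(s, Q)

# The controller is now a function of augment(s, visited_q), so the same
# physical state near p can map to different actions before and after visiting q.
visited = False
for s in [np.array([5.2, 0.1]), np.array([0.0, 5.0])]:
    visited = update_memory(s, visited)
    extended = augment(s, visited)
```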

  7. Need to generate a reward function from a given logical specification
     Need to automatically infer the additional state components from the specification

  8. Our Framework
     - System: an MDP (S, A, P, T, s₀), where P(s, a, s′) = Pr(s′ | s, a) is given as a black-box forward simulator
     - Specification φ given in our task specification language
     The framework synthesizes a control policy π* such that π* ∈ argmax_π Pr[ζ ⊨ φ], where ζ is the run generated by the policy
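
     The objective Pr[ζ ⊨ φ] can be estimated by Monte Carlo sampling through the black-box simulator. A hedged sketch, where simulate and satisfies are assumed helpers rather than anything provided by the framework:

```python
def estimate_satisfaction_prob(policy, simulate, satisfies, spec, n_samples=1000):
    """Monte Carlo estimate of Pr[zeta |= phi] for a given control policy.

    simulate(policy) -> run       (black-box forward simulator, assumed)
    satisfies(run, spec) -> bool  (checks zeta |= phi, assumed)
    """
    hits = sum(satisfies(simulate(policy), spec) for _ in range(n_samples))
    return hits / n_samples
```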

  9. Our Framework (overview)
     The specification is compiled into a nondeterministic task monitor; combining the monitor with the system yields a product MDP and a monitor-derived reward function, which are passed to a reinforcement learning algorithm to produce a control policy.

  10. Task Specification Language
     φ ::= achieve b | φ₁ ensuring b | φ₁; φ₂ | φ₁ or φ₂
     Example base predicates:
     o Near_q is satisfied if and only if the distance to q is less than 1
     o Away_O is satisfied if and only if the distance to O is positive
     Specification for the navigation example: achieve Near_q; achieve Near_p ensuring Away_O
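
     One way to picture the grammar is as a small abstract syntax tree. The sketch below is illustrative only and is not the SPECTRL tool's actual API; the constructor and predicate names are assumptions.

```python
from dataclasses import dataclass

class Spec:
    pass

@dataclass
class Achieve(Spec):          # achieve b
    predicate: str

@dataclass
class Ensuring(Spec):         # phi ensuring b
    spec: Spec
    predicate: str

@dataclass
class Seq(Spec):              # phi1 ; phi2
    first: Spec
    second: Spec

@dataclass
class Or(Spec):               # phi1 or phi2
    left: Spec
    right: Spec

# Navigation example: achieve Near_q; achieve Near_p ensuring Away_O
nav_spec = Seq(Achieve("Near_q"), Ensuring(Achieve("Near_p"), "Away_O"))
```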

  11. Quantitative Semantics
     - Assume each base predicate b is associated with a quantitative semantics ⟦b⟧ : S → ℝ such that s ⊨ b if and only if ⟦b⟧(s) > 0
       o ⟦Near_q⟧(s) = 1 − dist(s, q)
       o ⟦Away_O⟧(s) = dist(s, O)
     - Extend to positive Boolean combinations by
       o ⟦b₁ ∨ b₂⟧ = max(⟦b₁⟧, ⟦b₂⟧)
       o ⟦b₁ ∧ b₂⟧ = min(⟦b₁⟧, ⟦b₂⟧)
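
     A minimal sketch of these semantics for the example predicates (the helper names and the sample points are assumptions):

```python
import numpy as np

def near_q(s, q):
    """[[Near_q]](s) = 1 - dist(s, q); positive iff s is within distance 1 of q."""
    return 1.0 - np.linalg.norm(s - q)

def away_O(s, O):
    """[[Away_O]](s) = dist(s, O); positive iff s keeps a positive distance to O."""
    return np.linalg.norm(s - O)

def sem_or(v1, v2):
    """[[b1 or b2]] = max([[b1]], [[b2]])"""
    return max(v1, v2)

def sem_and(v1, v2):
    """[[b1 and b2]] = min([[b1]], [[b2]])"""
    return min(v1, v2)

# s |= b  iff  [[b]](s) > 0
s, q, O = np.array([0.5, 0.0]), np.array([0.0, 0.0]), np.array([3.0, 3.0])
assert near_q(s, q) > 0 and away_O(s, O) > 0
assert sem_and(near_q(s, q), away_O(s, O)) > 0   # s satisfies Near_q and Away_O
```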

  12. Task Monitor
     - Finite state machine
     - Registers that store quantitative information
     - Compilation is similar to NFA construction from regular expressions
     [Figure: task monitor for φ = achieve b]

  13. Task Monitor
     [Figure: task monitor for achieve Near_q; achieve Near_p ensuring Away_O, annotated with transition predicates, registers, and register updates]

  14. Extended Policy
     - Map each monitor state q to its own neural network
     - The network for q takes the system state and the register values and produces the system action and the next monitor transition
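
     A hedged sketch of this structure, with an assumed monitor interface: the extended policy keeps one network per monitor state and dispatches on q.

```python
class TaskMonitorStub:
    """Stand-in for the compiled task monitor (assumed interface)."""
    def step(self, q, s, v):
        return q, v            # placeholder: stay in the same monitor state

class ExtendedPolicy:
    """One network per monitor state q (hypothetical structure): the policy for
    monitor state q maps (system state, register values) to a system action."""
    def __init__(self, networks, monitor):
        self.networks = networks   # dict: monitor state q -> callable (s, v) -> action
        self.monitor = monitor     # provides step(q, s, v) -> (next q, next registers)

    def act(self, q, s, v):
        a = self.networks[q](s, v)               # network selected by the monitor state
        q_next, v_next = self.monitor.step(q, s, v)
        return a, q_next, v_next

# Usage with two monitor states and trivial per-state "networks":
policy = ExtendedPolicy({"q0": lambda s, v: [1, 0], "q1": lambda s, v: [0, 1]},
                        TaskMonitorStub())
action, q, v = policy.act("q0", s=[0.0, 0.0], v=[])
```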

  15. Assigning Rewards
     Given a sequence of extended system states ζ = (q₀, s₀, v₀) → ⋯ → (q_T, s_T, v_T), what should its reward be?
     - Case 1 (q_T is a final monitor state): the reward is given by the monitor
     - Case 2 (q_T is not a final state): not all tasks have been completed
       o Suggestion 1: R(ζ) = −∞
       o Suggestion 2: find a reward function R′ that preserves the ordering of runs w.r.t. R, i.e., R(ζ) > R(ζ′) implies R′(ζ) > R′(ζ′)

  16. Reward Shaping
     Given ζ = (q₀, s₀, v₀) → ⋯ → (q_T, s_T, v_T) with q_T non-final, define
         R″(q_T)(s, v) = C_l + 2·C_u·(d_{q_T} − D) + max_i ⟦σ_i⟧(s, v)
         R′(ζ) = max_{t : q_t = q_T} R″(q_T)(s_t, v_t)
     - The second term gives a higher reward for monitor states farther from the start
     - The third term prefers runs that get close to satisfying some predicate σ_i on the transitions out of q_T that make progress (non-self-loop transitions)
     - d_q: length of the longest path from q₀ to q without using self loops
     - D: the maximum of d_q over the monitor states
     - C_l: lower bound on the possible reward in any final state
     - C_u: upper bound on the third term over all q

  17. Experiments
     - Implemented our approach in a tool called SPECTRL (SPECifying Tasks for RL)
     - Case study in a 2D navigation setting:
       o S = ℝ² and A = ℝ²
       o Transitions given by s_{t+1} = s_t + a_t + noise, where the noise is a small Gaussian
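
     A minimal sketch of these dynamics (the noise scale and the fixed action are assumptions):

```python
import numpy as np

def nav2d_step(s, a, rng, noise_std=0.05):
    """2D navigation dynamics: s_{t+1} = s_t + a_t + noise, small Gaussian noise."""
    return s + a + noise_std * rng.standard_normal(2)

rng = np.random.default_rng(0)
s = np.zeros(2)                       # S = R^2
for _ in range(10):
    a = np.array([0.5, 0.0])          # A = R^2 (hypothetical constant action)
    s = nav2d_step(s, a, rng)
print(s)
```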

  18. 2D Navigation Tasks
     [Figure: learning curves for different tasks, comparing SPECTRL, TLTL, and CCE]

  19. 2D Navigation Tasks
     [Figure: sample complexity curve; the y-axis is the number of sample trajectories needed to learn, the x-axis is the number of nested goals]

  20. Cartpole
     Spec: go to the right and return to the start position without letting the pole fall
     [Figure: learning curve for Cartpole]

  21. THANK YOU!
