

SLIDE 1

A Composable Specification Language for Reinforcement Learning Tasks

Kishor Jothimurugan, Rajeev Alur, Osbert Bastani

SLIDE 2

Control System

• Continuous states and actions
• System can be probabilistic
• Discrete time
• Finite time horizon T

Diagram: a feedback loop in which the controller picks a control input a ∈ A and the system produces the next state s ∈ S, where S is the set of system states and A is the set of control inputs.

SLIDE 3

Reinforcement Learning

• Use neural networks to map states to actions
• Design a reward function R mapping runs to rewards
• Learn the neural network parameters that maximize the expected reward

Diagram: the same controller-system loop as before, with the controller implemented by a neural network.

SLIDE 4

Reward Functions

• Too low-level compared to a logical specification
• No obvious way to compose rewards

R₁ : reward function for "Reach q"
R₂ : reward function for "Reach p"
Reward function for "Reach q and then reach p"?

SLIDE 5

Need to generate a reward function from a given logical specification.

SLIDE 6

Need For Memory

• Specification: reach q, then reach p, then reach r
• The controller maps states to actions
• The action at p depends on the history of the run

Solution: add an additional state component indicating whether q has already been visited.

SLIDE 7

Need to generate a reward function from a given logical specification.
Need to automatically infer the additional state components from the specification.

SLIDE 8

Our Framework

• The system is an MDP (S, A, P, T, s₀), where P(s, a, s′) = Pr(s′ ∣ s, a) is given as a black-box forward simulator
• The specification φ is given in our task specification language

Goal: synthesize a control policy π* such that π* ∈ argmax_π Pr[ζ ⊨ φ], where ζ is a rollout sampled using the policy.

SLIDE 9

Our Framework

Pipeline: the specification is compiled into a nondeterministic task monitor, which yields a reward function; the monitor is combined with the system to form a product MDP; a reinforcement learning algorithm is run on the product MDP to produce a control policy.

SLIDE 10

Task Specification Language

φ ::= achieve b | φ₁ ensuring b | φ₁; φ₂ | φ₁ or φ₂

• Example base predicates:
  • Near_q is satisfied if and only if the distance to q is less than 1
  • Away_O is satisfied if and only if the distance to the obstacle O is positive
• Specification for the navigation example: achieve Near_q; achieve Near_p ensuring Away_O
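These four constructors can be viewed as an abstract syntax tree over base predicates. A rough Python sketch of such an encoding (a hypothetical representation for illustration, not SPECTRL's actual API):

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Pred:
    """A base predicate with quantitative semantics: sem(s) > 0 iff s satisfies it."""
    name: str
    sem: Callable[[object], float]

@dataclass
class Achieve:       # achieve b
    b: Pred

@dataclass
class Ensuring:      # phi ensuring b
    phi: "Spec"
    b: Pred

@dataclass
class Seq:           # phi1; phi2
    phi1: "Spec"
    phi2: "Spec"

@dataclass
class Choose:        # phi1 or phi2
    phi1: "Spec"
    phi2: "Spec"

Spec = Union[Achieve, Ensuring, Seq, Choose]

# The navigation example, assuming near_q, near_p, away_O are Pred instances:
# spec = Seq(Achieve(near_q), Ensuring(Achieve(near_p), away_O))
```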

SLIDE 11

Quantitative Semantics

• Assume each base predicate b ∈ P is associated with a quantitative semantics ⟦b⟧ : S → ℝ such that s ⊨ b if and only if ⟦b⟧(s) > 0
  • ⟦Near_q⟧(s) = 1 − dist(s, q)
  • ⟦Away_O⟧(s) = dist(s, O)
• Extend to positive Boolean combinations by:
  • ⟦b₁ ∨ b₂⟧ = max(⟦b₁⟧, ⟦b₂⟧)
  • ⟦b₁ ∧ b₂⟧ = min(⟦b₁⟧, ⟦b₂⟧)
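A minimal Python sketch of these semantics, assuming Euclidean states (the helper names `dist`, `near`, `away` are illustrative):

```python
import numpy as np

def dist(s, p):
    # Euclidean distance between state s and point p.
    return float(np.linalg.norm(np.asarray(s) - np.asarray(p)))

def near(p):
    # [[Near_p]](s) = 1 - dist(s, p): positive iff the distance to p is below 1.
    return lambda s: 1.0 - dist(s, p)

def away(obstacle):
    # [[Away_O]](s) = dist(s, O): positive iff s keeps a positive distance to O.
    return lambda s: dist(s, obstacle)

def disj(b1, b2):
    # [[b1 or b2]] = max([[b1]], [[b2]])
    return lambda s: max(b1(s), b2(s))

def conj(b1, b2):
    # [[b1 and b2]] = min([[b1]], [[b2]])
    return lambda s: min(b1(s), b2(s))

# s |= b iff [[b]](s) > 0:
near_q = near((5.0, 0.0))
assert near_q((4.5, 0.0)) > 0   # dist = 0.5 < 1, so Near_q holds
```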

SLIDE 12

Task Monitor

• A finite state machine
• Registers that store quantitative information
• Compilation is similar to the NFA construction from regular expressions

Figure: the task monitor for φ = achieve b.
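As a rough illustration of the FSM-plus-registers idea, here is a hand-written monitor for achieve b (a deterministic sketch; the paper's compiled monitors are nondeterministic in general, and the reward convention here is an assumption):

```python
class AchieveMonitor:
    """Two-state monitor for "achieve b": move to the final state once b holds."""

    def __init__(self, b):
        self.b = b               # quantitative semantics of the base predicate
        self.q = 0               # monitor state: 0 = initial, 1 = final
        self.v = float("-inf")   # register: best value of [[b]] seen so far

    def step(self, s):
        x = self.b(s)
        self.v = max(self.v, x)  # register update on every transition
        if self.q == 0 and x > 0:
            self.q = 1           # predicate satisfied: take the accepting transition

    def is_final(self):
        return self.q == 1

    def reward(self):
        # The reward of a completed run is read off the register.
        return self.v
```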

SLIDE 13

Task Monitor

Figure: the task monitor for achieve Near_q; achieve Near_p ensuring Away_O. Each transition is labeled with a predicate and register updates; the registers v carry quantitative information along the run.

SLIDE 14

Extended Policy

• The policy acts on extended states: the system state, the monitor state q, and the register values
• It outputs a system action and the next monitor transition
• Each monitor state q is mapped to its own neural network
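A minimal PyTorch sketch of this per-monitor-state policy (the layer sizes and the dispatch-on-q structure are assumptions for illustration, not the tool's actual architecture):

```python
import torch
import torch.nn as nn

class ExtendedPolicy(nn.Module):
    """One small network per monitor state q; dispatch on q at run time."""

    def __init__(self, n_monitor_states, state_dim, reg_dim, action_dim):
        super().__init__()
        self.nets = nn.ModuleList(
            nn.Sequential(
                nn.Linear(state_dim + reg_dim, 64),
                nn.Tanh(),
                nn.Linear(64, action_dim),
            )
            for _ in range(n_monitor_states)
        )

    def forward(self, s, q, v):
        # s: system state, q: monitor state index, v: register values
        x = torch.cat([s, v], dim=-1)
        return self.nets[q](x)

# policy = ExtendedPolicy(n_monitor_states=3, state_dim=2, reg_dim=2, action_dim=2)
# a = policy(torch.zeros(2), 0, torch.zeros(2))
```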

SLIDE 15

Assigning Rewards

Given a sequence of extended states ζ = (q₀, s₀, v₀) → ⋯ → (q_t, s_t, v_t), what should its reward be?

• Case 1: q_t is a final state. The reward is given by the monitor.
• Case 2: q_t is not a final state. Not all tasks have been completed.
  • Suggestion 1: R(ζ) = −∞
  • Suggestion 2: find a reward function R′ that preserves the ordering of runs w.r.t. R, i.e., R(ζ) > R(ζ′) implies R′(ζ) > R′(ζ′)

SLIDE 16

Reward Shaping

Given ζ = (q₀, s₀, v₀) → ⋯ → (q_t, s_t, v_t) with q_t non-final,

R″(q)(s, v) = C_l + 2·C_u·(d_q − D) + max_k ⟦τ_k⟧(s, v)

R′(ζ) = max_{i : q_i = q_t} R″(q_t)(s_i, v_i)

• τ_k ranges over the predicates on the non-self-loop transitions out of q
• d_q : length of the longest path from q₀ to q without using self-loops
• D : the maximum of d_q over all monitor states
• C_l : lower bound on the possible reward in any final state
• C_u : upper bound on the third term for all q

Intuition: give higher reward to states farther from the start, and prefer runs that get close to satisfying some predicate on a transition that makes progress.
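A sketch of how this shaped reward might be computed for a rollout; the `monitor` object, with a longest-path table `d` and a method `out_preds(q)` returning the quantitative predicates on non-self-loop transitions out of q, is hypothetical:

```python
def shaped_reward(rollout, monitor, C_l, C_u, D):
    """rollout: list of (q, s, v) extended states ending in a non-final q_t."""
    q_t = rollout[-1][0]  # last monitor state reached

    def r2(s, v):
        # Third term: how close the run is to taking some progressing transition.
        progress = max(tau(s, v) for tau in monitor.out_preds(q_t))
        return C_l + 2 * C_u * (monitor.d[q_t] - D) + progress

    # Score the run at the best moment it spent in the last monitor state.
    return max(r2(s, v) for (q, s, v) in rollout if q == q_t)
```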

SLIDE 17

Experiments

• Implemented our approach in a tool called SPECTRL (SPECifying Tasks for RL)
• Case study in the 2D navigation setting:
  • S = ℝ² and A = ℝ²
  • Transitions given by s_{t+1} = s_t + a_t + ε, where ε is small Gaussian noise
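A minimal simulator for these dynamics (the noise scale 0.05 is an assumed value, not taken from the slides):

```python
import numpy as np

def step(s, a, rng, noise_std=0.05):
    # s_{t+1} = s_t + a_t + eps, with eps drawn from a small Gaussian.
    return s + a + rng.normal(scale=noise_std, size=2)

rng = np.random.default_rng(0)
s = np.zeros(2)                  # start at the origin
for _ in range(10):
    a = np.array([0.1, 0.0])     # example action: move right
    s = step(s, a, rng)
```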

SLIDE 18

2D Navigation Tasks

Figure: learning curves for different tasks, comparing SPECTRL, TLTL, and CCE.

SLIDE 19

2D Navigation Tasks

Figure: sample complexity curve. The y-axis denotes the number of sample trajectories needed to learn; the x-axis denotes the number of nested goals.

SLIDE 20

Cartpole

Figure: learning curve for cartpole. Specification: go to the right and return to the start position without letting the pole fall.

SLIDE 21


THANK YOU!