A Composable Specification Language for Reinforcement Learning Tasks
Kishor Jothimurugan, Rajeev Alur, Osbert Bastani

Control System
[Diagram: the Controller sends a control input a ∈ A to the System, and observes the resulting state s ∈ S.]
S = set of system states, A = set of control inputs

- Continuous states and actions
- The system can be probabilistic
- Discrete time
- Finite time horizon T
Reinforcement Learning

[Diagram: the same control loop, now with a neural network as the controller mapping states s ∈ S to actions a ∈ A.]

- Use neural networks to map states to actions
- Design a reward function R mapping runs to rewards
- Learn the network parameters optimizing the expected reward (a sketch of one such update follows)
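To make the last bullet concrete, here is a minimal sketch of one parameter update, assuming only a black-box rollout(params) that runs an episode and returns its reward. The random-search-style update and all hyperparameters are illustrative, not necessarily the algorithm used in the paper.

    import numpy as np

    def random_search_step(params, rollout, n_dirs=8, step=0.05, noise=0.1):
        """One random-search-style update that moves the policy parameters
        toward higher estimated reward (a finite-difference gradient)."""
        grad = np.zeros_like(params)
        for _ in range(n_dirs):
            delta = noise * np.random.randn(*params.shape)
            r_plus = rollout(params + delta)   # reward of the perturbed policy
            r_minus = rollout(params - delta)  # reward of the opposite perturbation
            grad += (r_plus - r_minus) * delta
        return params + step * grad / (n_dirs * noise)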
Reward Functions

- Too low-level compared to a logical specification
- No obvious way to compose rewards

Example: given R1, a reward function for "Reach q", and R2, a reward function for "Reach p", what is the reward function for "Reach q and then reach p"?
Need to generate a reward function from a given logical specification.

Need For Memory

- Specification: reach q, then reach p, then reach r
- The controller maps states to actions, but the action at p depends on the history of the run
- Solution: add an additional state component indicating whether q has already been visited (a sketch follows)
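A minimal sketch of that solution, assuming hypothetical helpers env_step (the system dynamics) and near_q (a predicate test): the controller is fed the pair (s, visited_q) instead of s alone, so its action at p can depend on whether q has already been reached.

    def step_extended(ext_state, action, env_step, near_q):
        """Advance the system and the one-bit memory together."""
        s, visited_q = ext_state
        s_next = env_step(s, action)                  # ordinary system transition
        return (s_next, visited_q or near_q(s_next))  # remember once q is reached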
- Need to generate a reward function from a given logical specification
- Need to automatically infer the additional state components from the specification
Our Framework

- System: an MDP M = (S, A, P, T, s₀), where P(s, a, s′) = Pr(s′ | s, a) is given as a black-box forward simulator
- Specification: φ, given in our task specification language
- Synthesizes a control policy π* such that π* ∈ argmax_π Pr[ζ_π ⊨ φ], where ζ_π is the run generated by π
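One way to package these inputs, as a sketch (class and method names are illustrative): the learner only ever calls the simulator, never inspects P directly.

    class BlackBoxMDP:
        """System MDP (S, A, P, T, s0) exposed only through forward simulation."""

        def __init__(self, s0, sample_transition, horizon):
            self.s0 = s0                      # initial state
            self.sample = sample_transition   # (s, a) -> s' drawn from P(. | s, a)
            self.T = horizon                  # finite time horizon

        def rollout(self, policy):
            s, run = self.s0, [self.s0]
            for _ in range(self.T):
                s = self.sample(s, policy(s))
                run.append(s)
            return run                        # a length-T run to be scored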
Our Framework

[Pipeline: System + Specification → Nondeterministic Task Monitor → Product MDP → Reinforcement Learning Algorithm → Control Policy. The task monitor yields both the monitor automaton and the reward function.]
Task Specification Language

- Example base predicates:
  - near_q is satisfied if and only if the distance to q is less than 1
  - avoid_O is satisfied if and only if there is a positive distance to O
- Specification for the navigation example: achieve near_q; achieve near_p ensuring avoid_O

Grammar:

φ ::= achieve b | φ ensuring b | φ₁; φ₂ | φ₁ or φ₂
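The grammar transcribes directly into a small AST, one constructor per production; this is a sketch, not the SPECTRL implementation.

    from dataclasses import dataclass

    @dataclass
    class Achieve:        # achieve b
        b: object

    @dataclass
    class Ensuring:       # phi ensuring b
        phi: object
        b: object

    @dataclass
    class Seq:            # phi1; phi2
        phi1: object
        phi2: object

    @dataclass
    class Choose:         # phi1 or phi2
        phi1: object
        phi2: object

    # The navigation example:  achieve near_q; achieve near_p ensuring avoid_O
    # spec = Seq(Achieve(near_q), Ensuring(Achieve(near_p), avoid_O))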
Quantitative Semantics

- Assume each base predicate p ∈ P is associated with a quantitative semantics [[p]] : S → ℝ such that s ⊨ p if and only if [[p]](s) > 0
  - [[near_q]](s) = 1 − dist(s, q)
  - [[avoid_O]](s) = dist(s, O)
- Extend to positive Boolean combinations by:
  - [[b₁ ∨ b₂]] = max([[b₁]], [[b₂]])
  - [[b₁ ∧ b₂]] = min([[b₁]], [[b₂]])
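Written out for the navigation example, as a sketch (taking dist to be Euclidean distance is an assumption):

    import numpy as np

    def dist(s, x):
        return float(np.linalg.norm(np.asarray(s) - np.asarray(x)))

    def near(x):          # positive (i.e. satisfied) iff dist(s, x) < 1
        return lambda s: 1.0 - dist(s, x)

    def avoid(obstacle):  # positive iff the distance to the obstacle is positive
        return lambda s: dist(s, obstacle)

    def b_or(b1, b2):     # [[b1 or b2]] = max([[b1]], [[b2]])
        return lambda s: max(b1(s), b2(s))

    def b_and(b1, b2):    # [[b1 and b2]] = min([[b1]], [[b2]])
        return lambda s: min(b1(s), b2(s))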
Task Monitor

- Finite state machine
- Registers that store quantitative information
- Compilation similar to the NFA construction from regular expressions

[Figure: task monitor for φ = achieve b.]
[Figure: task monitor for achieve near_q; achieve near_p ensuring avoid_O, annotated with registers, transition predicates, and register updates.]
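A possible encoding of such a monitor, as a sketch (field names are illustrative): a finite set of states, a vector of registers, and transitions carrying a predicate over the extended state plus a register update.

    from dataclasses import dataclass

    @dataclass
    class Transition:
        src: int        # source monitor state
        dst: int        # destination monitor state
        guard: object   # predicate over (system state, registers)
        update: object  # (system state, registers) -> new register values

    @dataclass
    class TaskMonitor:
        n_states: int
        n_registers: int
        transitions: list
        initial: int = 0
        final: frozenset = frozenset()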
Task Monitor

[Figure: a step of the product MDP, combining the system state, the system action, the monitor state q, the register values, and the next monitor transition.]

Extended Policy

- Map each monitor state q to its own neural network: the network for state q chooses actions while the monitor is in q (a sketch follows)
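A sketch of the extended policy, with a hypothetical make_net constructor for a network mapping (s, v) to an action: one network per monitor state, so task progress selects which network acts.

    class ExtendedPolicy:
        def __init__(self, monitor, make_net):
            # one neural network per monitor state q
            self.nets = {q: make_net() for q in range(monitor.n_states)}

        def act(self, q, s, v):
            # the current monitor state picks the network; the network sees
            # the system state together with the register values
            return self.nets[q](s, v)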
Assigning Rewards

Given a sequence of extended system states ζ = (q₀, s₀, v₀) → ⋯ → (q_T, s_T, v_T), what should its reward be?

- Case 1 (q_T is a final state): the reward is given by the monitor
- Case 2 (q_T is not a final state): not all tasks have been completed
  - Suggestion 1: R(ζ) = −∞
  - Suggestion 2: find a reward function R′ that preserves the ordering of runs w.r.t. R: R(ζ) > R(ζ′) implies R′(ζ) > R′(ζ′)
Reward Shaping

Given ζ = (q₀, s₀, v₀) → ⋯ → (q_T, s_T, v_T) with q_T non-final, define

R″(q_T)(s, v) = C_l + 2C_u(d_{q_T} − D) + max_σ [[b_σ]](s, v)

R′(ζ) = max_{t : q_t = q_T} R″(q_T)(s_t, v_t)

where the inner max ranges over the predicates b_σ labeling the non-self-loop transitions out of q_T, and:

- d_q: length of the longest path from q₀ to q without using self-loops
- C_l: lower bound on the possible reward in any final state
- C_u: upper bound on the third term, for all q

Intuition:
- Higher reward for monitor states farther from the start
- Prefer runs that get close to satisfying some predicate on transitions that make progress
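A sketch of computing the shaped reward of a non-final run, following the formula above; depth[q] plays the role of d_q, progress_preds[q] collects the predicates b_σ on q's non-self-loop transitions, and all names are illustrative.

    def shaped_reward(run, C_l, C_u, D, depth, progress_preds):
        """run = [(q_0, s_0, v_0), ..., (q_T, s_T, v_T)] with q_T non-final."""
        q_T = run[-1][0]

        def R2(q, s, v):
            # base term rewards monitor states farther from the start; the
            # last term rewards nearly enabling a progress transition
            closeness = max(b(s, v) for b in progress_preds[q])
            return C_l + 2 * C_u * (depth[q] - D) + closeness

        # best value achieved while the run sat at its final monitor state
        return max(R2(q, s, v) for (q, s, v) in run if q == q_T)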
Experiments

- Implemented our approach in a tool called SPECTRL (SPECifying Tasks for RL)
- Case study in the 2D navigation setting:
  - S = ℝ² and A = ℝ²
  - Transitions given by s_{t+1} = s_t + a_t + ε, where ε is small Gaussian noise
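The case-study dynamics in code, as a sketch (the noise scale sigma is an assumption; the slides only say the Gaussian noise is small):

    import numpy as np

    def nav_step(s, a, sigma=0.05):
        """2D navigation: s_{t+1} = s_t + a_t + eps with eps ~ N(0, sigma^2 I)."""
        return s + a + sigma * np.random.randn(2)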
2D Navigation Tasks

[Figure: learning curves for different tasks, comparing SPECTRL, TLTL, and CCE.]

2D Navigation Tasks

[Figure: sample complexity curve. The y-axis denotes the number of sample trajectories needed to learn; the x-axis denotes the number of nested goals.]

Cartpole

[Figure: learning curve for Cartpole. Specification: go to the right and return to the start position without letting the pole fall.]