Structured Representations for Knowledge Transfer in Reinforcement Learning - Benjamin Rosman - PowerPoint PPT Presentation



SLIDE 1

Structured Representations for Knowledge Transfer in Reinforcement Learning

Benjamin Rosman

Mobile Intelligent Autonomous Systems Council for Scientific and Industrial Research & School of Computer Science and Applied Maths University of the Witwatersrand South Africa

SLIDE 2

Robots solving complex tasks
  • Large, high-dimensional action and state spaces
  • Many different task instances

SLIDE 3
  • Reinforcement learning (RL)

Behaviour learning

(Diagram: agent-environment loop with action a, state s, reward r)

SLIDE 4
  • M = ⟨S, A, T, R⟩

Markov decision process (MDP)

(Diagram: example MDP over states s0, s1, s2 and actions a0, a1, with transition probabilities)

Learn optimal policy: Ο€* : S β†’ A

SLIDE 5


  • Can’t just rely on immediate rewards
  • Define value functions:
  • π‘ŠπœŒ 𝑑 = 𝐹𝜌 𝑆𝑒 𝑑𝑒 = 𝑑}
  • π‘…πœŒ 𝑑, 𝑏 = 𝐹𝜌 𝑆𝑒 𝑑𝑒 = 𝑑, 𝑏𝑒 = 𝑏}
  • V* (Q*) is a proxy for Ο€*


Looking into the future
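To make the definitions above concrete, here is a minimal sketch (not from the slides) of evaluating V^Ο€ by iterating the Bellman expectation backup on a small hand-made MDP; the two-state transition table, rewards and discount Ξ³ = 0.9 are illustrative assumptions.

```python
# Minimal sketch: policy evaluation for a tiny, hypothetical MDP.
import numpy as np

# P[s][a] = list of (probability, next_state, reward)
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.7, 1, 1.0), (0.3, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
policy = {0: 1, 1: 1}          # a deterministic policy pi: S -> A
gamma = 0.9

V = np.zeros(len(P))
for _ in range(1000):          # sweep Bellman expectation backups until converged
    V_new = np.zeros_like(V)
    for s in P:
        a = policy[s]
        V_new[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

# Q^pi(s, a) follows directly from V^pi
Q = {(s, a): sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
     for s in P for a in P[s]}
print(V, Q)
```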

SLIDE 6


  • Random policy:
  • Optimal policy:

Value functions example

SLIDE 7


  • So: solve a large system of nonlinear value function equations (Bellman equations)
  • Optimal control problem
  • But: transitions P & rewards R aren’t known!
  • RL is trial-and-error learning to find an optimal policy from experience

  • Exploration vs exploitation

RL algorithms

SLIDE 8

(Figure: grid cells labelled 100, 99, 98, 97, 96)

Exploring

SLIDE 9


Learned value function

SLIDE 10


  • Initialise Q(s, a) arbitrarily
  • Repeat (for each episode):
  • Initialise s
  • Repeat (for each step of episode):

1. Choose a from s (Ξ΅-greedy policy from Q)
  • a ← argmax_a Q(s, a) with probability 1 βˆ’ Ξ΅ (exploit), a random action with probability Ξ΅ (explore)
2. Take action a, observe r, s'
3. Update estimate of Q (learn)
  • Q(s, a) ← Q(s, a) + Ξ± [ r + Ξ³ max_{a'} Q(s', a') βˆ’ Q(s, a) ]
  • (r is the immediate reward; Ξ³ max_{a'} Q(s', a') is the estimated future reward)
  • s ← s'
  • Until s is terminal

An algorithm: Q-learning
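The pseudocode above translates almost directly into a tabular implementation. The following sketch is illustrative rather than the talk's own code; the `env` interface (`reset()` returning a state, `step(a)` returning `(s', r, done)`) and the hyperparameter values are assumptions.

```python
# A minimal tabular Q-learning sketch following the slide's pseudocode.
import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)                      # Q(s, a), initialised arbitrarily (here: 0)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection: explore w.p. eps, exploit otherwise
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda b: Q[(s, b)])
            s2, r, done = env.step(a)           # take action, observe r and s'
            # learn: move Q(s, a) towards r + gamma * max_a' Q(s', a')
            target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in range(n_actions)))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```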

SLIDE 11

Solving tasks

SLIDE 12
  • How does this help us solve other problems?

Generalising solutions?

SLIDE 13
  • Sub-behaviours: options o = ⟨I_o, Ο€_o, Ξ²_o⟩
  • Policy + initiation and termination conditions
  • Abstract away low level actions
  • Does not affect the state space

Hierarchical RL

I_o βŠ† S        Ξ²_o : S β†’ [0, 1]        Ο€_o : S β†’ A
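As a rough illustration of the option triple ⟨I_o, Ο€_o, Ξ²_o⟩, the sketch below represents an option as three callables and executes it until its termination condition fires; the class layout and the `env.step` interface are assumptions, not code from the talk.

```python
# Illustrative sketch of an option o = <I_o, pi_o, beta_o>.
import random
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Option:
    initiation: Callable[[Any], bool]     # I_o subset of S: where the option may start
    policy: Callable[[Any], Any]          # pi_o: S -> A
    termination: Callable[[Any], float]   # beta_o: S -> [0, 1], probability of stopping in s

def run_option(env, s, option):
    """Execute an option until its termination condition fires."""
    assert option.initiation(s), "option not applicable in this state"
    total_reward, done = 0.0, False
    while not done and random.random() >= option.termination(s):
        s, r, done = env.step(option.policy(s))   # low-level actions are abstracted away
        total_reward += r
    return s, total_reward, done
```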

SLIDE 14
  • Aim: learn an abstract representation of the environment
  • Use with task-level planners
  • Based on agent behaviours (skills / options)
  • General: don’t need to be relearned for every new task

Abstracting states

Steven James (in collaboration with George Konidaris)

  • S. James, B. Rosman, G. Konidaris. Learning to Plan with Portable Symbols. ICML/IJCAI/AAMAS 2018 Workshop on Planning and Learning, July 2018.
  • S. James, B. Rosman, G. Konidaris. Learning Portable Abstract Representations for High-Level Planning. Under review.

SLIDE 15
  • Learn the preconditions
  • Classification problem:
  • P(can execute skill? | current_state)
  • Learn the effects
  • Density estimation:
  • P(next_state | current_state, skill)
  • Possible if options are subgoal, i.e. P(next_state | current_state, skill) = P(next_state | skill)

β€œSYMBOLS”

Requirements: planning with skills

SLIDE 16
  • P(next_state | current_state, skill) = P(next_state | skill)
  • Partition skills to ensure property holds
  • e.g. β€œwalk to nearest door”

Subgoal options

SLIDE 17

Generating symbols from skills

[Konidaris, 2018]

  • Results in abstract MDP/propositional PPDDL
  • But P(s ∈ I_o) and P(s' | o) are distributions/symbols over a state space particular to the current task
  • e.g. grounded in a specific set of xy-coordinates
SLIDE 18
  • Need a representation that facilitates transfer
  • Assume agent has sensors which provide it with (lossy) observations
  • Augment the state space with action-centric observations
  • Agent space
  • e.g. robot navigating a building
  • State space: xy-coordinates
  • Agent space: video camera

Towards portability

SLIDE 19
  • Learning symbols in agent space
  • Portable!
  • But: non-Markov and insufficient for planning
  • Add the subgoal partition labels to rules
  • General abstract symbols + grounding β†’ portable rules


Portable symbols

SLIDE 20
  • Learn abstract symbols
  • Learning linking functions:
  • Mapping partition numbers from options to their effects
  • This gives us a factored MDP or a PPDDL representation
  • Provably sufficient for planning

Grounding symbols

SLIDE 21

(Panels: using agent-space data; using state-space data)

Learning grounded symbols

SLIDE 22

The treasure game

SLIDE 23

Agent and problem space

  • State space: xy-position of agent, key and treasure, angle of levers and state of lock

  • Agent space: 9 adjacent cells about the agent

SLIDE 24

Skills

  • Options:
  • GoLeft, GoRight
  • JumpLeft, JumpRight
  • DownRight, DownLeft
  • Interact
  • ClimbLadder, DescendLadder

SLIDE 25

Learning portable rules

  • Cluster to create subgoal agent-space options
  • Use SVM and KDE to estimate preconditions and effects (a minimal sketch follows below)

  • Learned rules can be transferred between tasks

(Learned rules shown: DescendLadder rule and Interact1 rule)
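A minimal sketch of the estimators mentioned above, assuming scikit-learn and placeholder agent-space data: an SVM gives the precondition classifier P(can execute skill | observation), and a kernel density estimate gives the effect model P(next observation | skill). The array shapes and names are illustrative only, not the authors' pipeline.

```python
# Illustrative precondition (SVM) and effect (KDE) estimators for one skill.
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_tried = rng.normal(size=(200, 9))            # agent-space observations where the skill was attempted
succeeded = rng.integers(0, 2, size=200)       # 1 if the skill could execute
X_effects = rng.normal(size=(80, 9))           # observations after successful executions

# Precondition: a classifier P(can execute skill | observation)
precondition = SVC(probability=True).fit(X_tried, succeeded)

# Effect: a density estimate P(next observation | skill)
effect = KernelDensity(bandwidth=0.5).fit(X_effects)

x = rng.normal(size=(1, 9))
print(precondition.predict_proba(x)[0, 1])     # probability the skill is executable in x
print(effect.score_samples(x))                 # log-density of x under the effect model
```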

SLIDE 26

Grounding rules

  • Partition options in state space to get partition numbers
  • Learn grounded rule instances: linking

(Figure: abstract rule + state-space partitions labelled 1, 2, 3)

SLIDE 27
SLIDE 28

(Rules shown for Interact1 and Interact3: precondition, negative effect, positive effect)

Partitioned rules

SLIDE 29

Experiments

  • Require fewer samples in subsequent tasks
SLIDE 30
  • Learn abstract rules and their groundings
  • Transfer between domain instances
  • Just by learning linking functions
  • But what if there is additional structure?
  • In particular, what if there are many rule instances (objects of interest)?

Ofir Marom

Ofir Marom and Benjamin Rosman. Zero-Shot Transfer with Deictic Object-Oriented Representation in Reinforcement Learning. NIPS, 2018.

Portable rules

SLIDE 31

Example: Sokoban

SLIDE 32

Sokoban (legal move)

SLIDE 33

Sokoban (legal move)

SLIDE 34

Sokoban (illegal move)

SLIDE 35

Sokoban (goal)

SLIDE 36
  • Poor scalability
  • 100s of boxes?
  • Transferability?
  • Effects of actions depend on interactions further away, complicating a mapping to agent space

Representations

s = (agentx = 3, agenty = 4, box1x = 4, box1y = 4, box2x = 3, box2y = 2)

SLIDE 37
  • Consider objects explicitly
  • Object classes have attributes
  • Relationships based on formal logic:

Object-oriented representations

SLIDE 38
  • Describe transition rules using schemas
  • Propositional Object-Oriented MDPs
  • Provably efficient to learn (KWIK bounds)

Propositional OO-MDPs

East ∧ TouchEast(Person, Wall) β‡’ Person.x ← Person.x + 0

[Diuk, 2010]
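To make the schema above concrete, here is a small illustrative sketch of how such a propositional rule could be evaluated for the East action in a Sokoban-like grid. The state layout, the `touch_east` helper and the extra "free space" branch are assumptions for illustration, not the OO-MDP formalism itself.

```python
# Illustrative evaluation of a propositional OO-MDP schema for the East action.

def touch_east(a, b):
    """True if object b occupies the cell directly east of object a."""
    return b["x"] == a["x"] + 1 and b["y"] == a["y"]

def east_schema(state):
    """East action: Person.x += 0 if TouchEast(Person, Wall), else += 1 (assumed branch)."""
    person, walls = state["person"], state["walls"]
    if any(touch_east(person, w) for w in walls):   # East and TouchEast(Person, Wall)
        return 0                                    # blocked: Person.x <- Person.x + 0
    return 1                                        # assumed extra branch: move east

state = {"person": {"x": 2, "y": 1},
         "walls": [{"x": 3, "y": 1}, {"x": 0, "y": 0}]}
state["person"]["x"] += east_schema(state)
print(state["person"])   # {'x': 2, 'y': 1}: the wall immediately east blocks the move
```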

SLIDE 39
  • Propositional OO-MDPs
  • Compact representation
  • Efficient learning of rules

Benefits

SLIDE 40
  • Propositional OO-MDPs are efficient, but restrictive

Limitations

East ∧ TouchWest(Box, Person) ∧ TouchEast(Box, Wall) β‡’ Box.x ← ?

SLIDE 41
  • Propositional OO-MDPs are efficient, but restrictive
  • Restriction that preconditions are propositional
  • Can’t refer to the same box

Limitations

East ∧ TouchWest(Box, Person) ∧ TouchEast(Box, Wall) β‡’ Box.x ← ?

SLIDE 42
  • Propositional OO-MDPs are efficient, but restrictive
  • Restriction that preconditions are propositional
  • Can’t refer to the same box

Limitations

East ∧ TouchWest(Box, Person) ∧ TouchEast(Box, Wall) β‡’ Box.x ← ?

Ground instances! But then relearn dynamics for box1, box2, etc.

SLIDE 43
  • Deictic predicates instead of propositions
  • Grounded only with respect to a central deictic object (β€œme” or β€œthis”)
  • Relates to other non-grounded objects
  • Transition dynamics of Box.x depend on the grounded box object

  • Also provably efficient

Deictic OO-MDPs

East ∧ TouchWest(box, Person) ∧ TouchEast(box, Wall) β‡’ box.x ← box.x + 0
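The sketch below illustrates what the deictic rule buys: the condition and the effect both refer to a single grounded box ("this box"), so the same function applies unchanged to box1, box2, and so on. The grid representation, helper names and the extra branches are illustrative assumptions, not the authors' code.

```python
# Illustrative deictic schema: condition and effect are relative to one grounded box.

def touch_west(a, b):
    """True if object b occupies the cell directly west of object a."""
    return b["x"] == a["x"] - 1 and b["y"] == a["y"]

def touch_east(a, b):
    """True if object b occupies the cell directly east of object a."""
    return b["x"] == a["x"] + 1 and b["y"] == a["y"]

def east_effect_on_box(box, person, walls):
    """East, TouchWest(box, Person), TouchEast(box, Wall) => box.x <- box.x + 0."""
    pushed = touch_west(box, person)                 # the agent presses on this box's west side
    blocked = any(touch_east(box, w) for w in walls)
    if pushed and not blocked:
        return 1                                     # assumed extra branch: box slides east
    return 0                                         # blocked or not pushed: no change

person = {"x": 1, "y": 1}
walls = [{"x": 3, "y": 1}]
boxes = [{"x": 2, "y": 1}, {"x": 5, "y": 5}]         # the same rule grounds on each box in turn
for box in boxes:
    box["x"] += east_effect_on_box(box, person, walls)
print(boxes)   # first box is pushed into the wall (+0); second box is not pushed (+0)
```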

SLIDE 44
  • Learning from experience:
  • For each action, how do attributes change?
  • KWIK framework
  • Propositional OO-MDPs: DOORMAX algorithm
  • Transition dynamics for each attribute and action must be representable as a binary tree

  • Effects at the leaf nodes
  • Each possible effect can occur at most at one leaf, except for a failure condition (globally nothing changes)

Learning the dynamics

SLIDE 45

Learning the dynamics

π‘ž1 𝜚 π‘ž2 π‘ž3 𝜚 𝑠

1

𝑠

2

SLIDE 46

Example: action = North

π‘’π‘π‘£π‘‘β„Žπ‘‚π‘π‘ π‘’β„Ž(π‘„π‘“π‘ π‘‘π‘π‘œ, π‘‹π‘π‘šπ‘š) π‘„π‘“π‘ π‘‘π‘π‘œ. 𝑧 ← π‘„π‘“π‘ π‘‘π‘π‘œ. 𝑧 + 1 𝜚

  • Given:
  • Each effect can only occur once on the tree
  • Global failure condition
  • Deterministic effects
  • Learn from common elements in state propositions (experience); see the sketch below
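One way to read "common elements in state propositions" is as an intersection: for each observed effect of an action, keep only the terms shared by every condition under which that effect occurred. The sketch below illustrates just that step; it is a simplification of DOORMAX with hypothetical propositions, not the full algorithm.

```python
# Illustrative "common elements" step: intersect observed conditions per effect.

def learn_condition(observations):
    """observations: list of (condition, effect) pairs; a condition maps proposition -> bool."""
    learned = {}   # effect -> set of (proposition, value) terms common to every observation of it
    for condition, effect in observations:
        terms = set(condition.items())
        learned[effect] = terms if effect not in learned else learned[effect] & terms
    return learned

# Hypothetical experience for the North action
obs = [
    ({"touchNorthWall": False, "onLadder": False}, "y+1"),
    ({"touchNorthWall": False, "onLadder": True},  "y+1"),
    ({"touchNorthWall": True,  "onLadder": False}, "no-change"),
]
rules = learn_condition(obs)
print(rules["y+1"])        # {('touchNorthWall', False)}: y+1 whenever no wall to the north
print(rules["no-change"])  # both terms survive so far; more experience would prune 'onLadder'
```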

SLIDE 47
  • We adapt the DOORMAX algorithm to deictic OO-MDPs

  • Remove global failure condition
  • Bound the number of times a condition can occur
  • Can still be learned efficiently

π‘’π‘π‘£π‘‘β„Žπ‘‚π‘π‘ π‘’β„Ž(π‘„π‘“π‘ π‘‘π‘π‘œ, π‘‹π‘π‘šπ‘š) π‘„π‘“π‘ π‘‘π‘π‘œ. 𝑧 + 1 𝜚 π‘’π‘π‘£π‘‘β„Žπ‘‚π‘π‘ π‘’β„Ž(π‘žπ‘“π‘ π‘‘π‘π‘œ, 𝑑𝑒𝑏𝑒𝑓) π‘žπ‘“π‘ π‘‘π‘π‘œ. 𝑧 + 1 π‘žπ‘“π‘ π‘‘π‘π‘œ. 𝑧 + 0

DOORMAX for deictic OO-MDPs

SLIDE 48

Experiments

  • Zero-shot transfer: one run of value iteration

~8k states        ~1M states

SLIDE 49

Experiments

  • Taxi domain
  • Multiple passengers
  • Only one in the taxi at a time
  • On executing a pickup action:
  • Change the in_taxi attribute of the correct passenger

SLIDE 50

Experiments

(Figure panels: 1 passenger, 2 passengers, 3 passengers, 4 passengers)

SLIDE 51

Take away thoughts

  • Reinforcement learning gives us a powerful tool for learning behaviours, but extra work is required for generalisation

  • Reasoning in an agent-centric manner:
  • Symbol-based view on skills
  • Enable knowledge reuse
  • Reasoning in an object-centric manner:
  • Learn models of local object interactions
  • Efficient learning and transfer
SLIDE 52

Thank you!

www.benjaminrosman.com – www.raillab.org

Funded by