Structured Representations for Knowledge Transfer in Reinforcement Learning - Benjamin Rosman (PowerPoint PPT Presentation)


  1. Structured Representations for Knowledge Transfer in Reinforcement Learning Benjamin Rosman Mobile Intelligent Autonomous Systems Council for Scientific and Industrial Research & School of Computer Science and Applied Maths University of the Witwatersrand South Africa

  2. Robots solving complex tasks
  • Large, high-dimensional action and state spaces
  • Many different task instances

  3. Behaviour learning
  • Reinforcement learning (RL): the agent takes an action a, receives a reward r, and observes the resulting state s

  4. Markov decision process (MDP)
  • M = ⟨S, A, T, R⟩
  • Learn optimal policy: π* : S → A
  • [Figure: example MDP with states, actions, transition probabilities and rewards]

  5. Looking into the future
  • Can't just rely on immediate rewards
  • Define value functions:
    • V^π(s) = E_π{ R_t | s_t = s }
    • Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a }
  • V* (Q*) is a proxy for π*

  6. Value functions example
  • Random policy: [value function figure]
  • Optimal policy: [value function figure]

  7. RL algorithms
  • So: solve a large system of nonlinear value function equations (Bellman equations)
  • Optimal control problem
  • But: transitions P & rewards R aren't known!
  • RL is trial-and-error learning: find an optimal policy from experience
  • Exploration vs exploitation
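For reference, a standard statement of the Bellman optimality equation these slides allude to (textbook form, not transcribed from the deck), using the MDP ⟨S, A, T, R⟩ from slide 4 and the discount factor γ that appears in the Q-learning update on slide 10; the reward signature R(s, a, s′) is an assumption about notation:

```latex
% Bellman optimality equation for the action-value function Q*
Q^{*}(s, a) \;=\; \sum_{s' \in S} T(s' \mid s, a)\,\Big[ R(s, a, s') + \gamma \max_{a' \in A} Q^{*}(s', a') \Big]
```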

  8. Exploring
  • [Figure: exploration example with per-state counts]

  9. Learned value function

  10. An algorithm: Q-learning
  • Initialise Q(s, a) arbitrarily
  • Repeat (for each episode):
    • Initialise s
    • Repeat (for each step of episode):
      1. Choose a from s (ε-greedy policy from Q):
           a ← argmax_a Q(s, a)   w.p. 1 − ε   (exploit)
           a ← random action       w.p. ε        (explore)
      2. Take action a, observe r, s′
      3. Update estimate of Q (learn):
           Q(s, a) ← Q(s, a) + α [ r + γ max_a′ Q(s′, a′) − Q(s, a) ]
           (estimated reward = immediate reward + discounted future reward)
      • s ← s′
    • Until s is terminal
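A minimal tabular sketch of the Q-learning loop on this slide. The toy chain environment, its reward, and the hyperparameter values are illustrative assumptions, not taken from the deck:

```python
import random
from collections import defaultdict

# Toy deterministic chain environment (illustrative assumption):
# states 0..4, actions 0 = left, 1 = right; reaching state 4 gives reward +1 and ends the episode.
N_STATES = 5
ACTIONS = (0, 1)

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

alpha, gamma, epsilon = 0.1, 0.95, 0.1   # learning rate, discount factor, exploration rate
Q = defaultdict(float)                    # Q(s, a), initialised to 0 ("arbitrarily")

for episode in range(500):
    s, done = 0, False
    while not done:
        # 1. epsilon-greedy action selection: explore w.p. epsilon, otherwise exploit
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        # 2. take the action, observe reward and next state
        s_next, r, done = step(s, a)
        # 3. Q-learning update: Q(s,a) <- Q(s,a) + alpha [ r + gamma max_a' Q(s',a') - Q(s,a) ]
        best_next = 0.0 if done else max(Q[(s_next, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

# After training, the greedy policy moves right everywhere on this chain.
print({s: max(Q[(s, a)] for a in ACTIONS) for s in range(N_STATES)})
```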

  11. Solving tasks

  12. Generalising solutions?
  • How does this help us solve other problems?

  13. Hierarchical RL
  • Sub-behaviours: options o = ⟨I_o, π_o, β_o⟩
  • Policy + initiation and termination conditions:
      π_o : S → A    β_o : S → [0, 1]    I_o ⊆ S
  • Abstract away low-level actions
  • Does not affect the state space
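A minimal sketch of the option structure ⟨I_o, π_o, β_o⟩ as a data type; the field names and the "walk to nearest door" stub (used again on slide 16) are illustrative assumptions, not the authors' implementation:

```python
from dataclasses import dataclass
from typing import Any, Callable

State, Action = Any, Any   # placeholders for whatever the underlying MDP uses

@dataclass
class Option:
    """An option o = <I_o, pi_o, beta_o>: a temporally extended sub-behaviour."""
    initiation: Callable[[State], bool]     # I_o ⊆ S : states where the option may start
    policy: Callable[[State], Action]       # pi_o : S -> A : action to take while the option runs
    termination: Callable[[State], float]   # beta_o : S -> [0, 1] : probability of stopping in s

# Hypothetical "walk to nearest door" option for a grid agent; all details are made up.
walk_to_door = Option(
    initiation=lambda s: True,                       # can start anywhere
    policy=lambda s: "step_towards_nearest_door",    # stand-in for a learned low-level policy
    termination=lambda s: 1.0 if s == "at_door" else 0.0,
)
```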

  14. Abstracting states
  • Aim: learn an abstract representation of the environment
  • Use with task-level planners
  • Based on agent behaviours (skills / options)
  • General: don't need to be relearned for every new task
  Steven James (in collaboration with George Konidaris)
  S. James, B. Rosman, G. Konidaris. Learning to Plan with Portable Symbols. ICML/IJCAI/AAMAS 2018 Workshop on Planning and Learning, July 2018.
  S. James, B. Rosman, G. Konidaris. Learning Portable Abstract Representations for High-Level Planning. Under review.

  15. Requirements: planning with skills
  • Learn the preconditions
    • Classification problem: P(can execute skill? | current_state)   "SYMBOLS"
  • Learn the effects
    • Density estimation: P(next_state | current_state, skill)
  • Possible if options are subgoal, i.e. P(next_state | current_state, skill) = P(next_state | skill)
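A minimal sketch of these two learning problems using off-the-shelf estimators (an SVM classifier for preconditions and kernel density estimation for effects, the tools slide 25 names); the toy data, shapes, and variable names are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KernelDensity

# Illustrative data (assumption): states are 2-D feature vectors.
rng = np.random.default_rng(0)
states_tried = rng.uniform(size=(200, 2))                  # states where the skill was attempted
executed = (states_tried[:, 0] > 0.5).astype(int)          # toy ground-truth precondition
effect_states = rng.normal(loc=[0.9, 0.5], scale=0.05, size=(80, 2))  # states where the skill ended

# Precondition: classification, P(can execute skill | current_state)
precondition = SVC(probability=True).fit(states_tried, executed)

# Effect: density estimation over terminating states, P(next_state | skill)
effect = KernelDensity(bandwidth=0.05).fit(effect_states)

s = np.array([[0.8, 0.4]])
print("P(can execute | s) ≈", precondition.predict_proba(s)[0, 1])
print("log p(s' | skill)  ≈", effect.score_samples(s)[0])
```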

  16. Subgoal options
  • P(next_state | current_state, skill)
  • Partition skills to ensure the property holds
  • e.g. "walk to nearest door"

  17. Generating symbols from skills [Konidaris, 2018]
  • Results in an abstract MDP / propositional PPDDL
  • But P(s ∈ I_o) and P(s′ | o) are distributions/symbols over the state space, particular to the current task
  • e.g. grounded in a specific set of xy-coordinates

  18. Towards portability
  • Need a representation that facilitates transfer
  • Assume the agent has sensors which provide it with (lossy) observations
  • Augment the state space with agent-centric observations: the agent space
  • e.g. robot navigating a building
    • State space: xy-coordinates
    • Agent space: video camera

  19. Portable symbols
  • Learning symbols in agent space
  • Portable!
  • But: non-Markov and insufficient for planning
  • Add the subgoal partition labels to rules
  • General abstract symbols + grounding → portable rules

  20. Grounding symbols
  • Learn abstract symbols
  • Learn linking functions:
    • Mapping partition numbers from options to their effects
  • This gives us a factored MDP or a PPDDL representation
  • Provably sufficient for planning
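One way to picture the linking function on this slide is as a lookup from an option's state-space partition label to the grounded effect it produces. This tiny dictionary-based sketch, including every option, partition, and effect name in it, is an illustrative assumption rather than the authors' implementation:

```python
from collections import defaultdict

# Grounded experience: (option name, state-space partition label, observed grounded effect symbol).
transitions = [
    ("Interact", 1, "lock_open"),
    ("Interact", 1, "lock_open"),
    ("Interact", 3, "lever_flipped"),
    ("DescendLadder", 2, "on_lower_floor"),
]

# Linking function: for each (option, partition), count which grounded effect symbols occur.
linking = defaultdict(lambda: defaultdict(int))
for option, partition, effect in transitions:
    linking[(option, partition)][effect] += 1

# Most likely grounded effect for a given partitioned rule, e.g. Interact with partition 1.
rule = linking[("Interact", 1)]
print(max(rule, key=rule.get))   # -> "lock_open"
```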

  21. Learning grounded symbols
  • [Figure: learned symbols using agent-space data vs. using state-space data]

  22. The treasure game

  23. Agent and problem space
  • State space: xy-position of agent, key and treasure, angle of levers and state of lock
  • Agent space: the 9 adjacent cells about the agent

  24. Skills
  • Options:
    • GoLeft, GoRight
    • JumpLeft, JumpRight
    • DownRight, DownLeft
    • Interact
    • ClimbLadder, DescendLadder

  25. Learning portable rules
  • Cluster to create subgoal agent-space options
  • Use an SVM and KDE to estimate preconditions and effects
  • Learned rules can be transferred between tasks
  • [Figure: learned Interact1 rule and DescendLadder rule]
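A rough sketch of the first bullet: clustering an option's observed terminating observations so that each cluster can be treated as its own (approximately subgoal) option. The choice of DBSCAN and the toy data are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative data (assumption): agent-space observations where one option terminated.
# Two blobs -> the option reaches two qualitatively different outcomes.
rng = np.random.default_rng(1)
end_obs = np.vstack([
    rng.normal(loc=[0.2, 0.8], scale=0.03, size=(60, 2)),
    rng.normal(loc=[0.7, 0.1], scale=0.03, size=(60, 2)),
])

# Each cluster label becomes a partitioned, subgoal-style option, so that the effect
# distribution within a partition no longer depends on where the option was started.
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(end_obs)
for k in sorted(set(labels) - {-1}):
    members = end_obs[labels == k]
    print(f"partition {k}: {len(members)} samples, mean effect {members.mean(axis=0)}")
```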

  26. Grounding rules
  • Partition options in state space to get partition numbers
  • Learn grounded rule instances: linking

  27. Partitioned rules
  • [Figure: precondition, negative effect and positive effect for the Interact1 and Interact3 rules]

  28. Experiments
  • Require fewer samples in subsequent tasks

  29. Portable rules
  • Learn abstract rules and their groundings
  • Transfer between domain instances
    • Just by learning linking functions
  • But what if there is additional structure?
    • In particular, what if there are many rule instances (objects of interest)?
  Ofir Marom
  Ofir Marom and Benjamin Rosman. Zero-Shot Transfer with Deictic Object-Oriented Representation in Reinforcement Learning. NIPS, 2018.

  30. Example: Sokoban

  31. Sokoban (legal move)

  32. Sokoban (legal move)

  33. Sokoban (illegal move)

  34. Sokoban (goal)

  35. Representations
  s = (agent_x = 3, agent_y = 4, box1_x = 4, box1_y = 4, box2_x = 3, box2_y = 2)
  • Poor scalability
    • 100s of boxes?
  • Transferability?
    • Effects of actions depend on interactions further away, complicating a mapping to agent space

  36. Object-oriented representations
  • Consider objects explicitly
  • Object classes have attributes
  • Relationships based on formal logic

  37. Propositional OO-MDPs [Diuk, 2010]
  • Describe transition rules using schemas
  • Propositional Object-Oriented MDPs
  • Provably efficient to learn (KWIK bounds)
  • East ∧ Touch_East(Person, Wall) ⇒ Person.x ← Person.x + 0
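A minimal sketch of what a propositional transition schema like the one above could look like in code: a proposition over the whole state guards a single attribute update. The data layout, names, and toy wall position are illustrative assumptions, not the OO-MDP formalism's actual machinery:

```python
from dataclasses import dataclass
from typing import Callable, Dict

State = Dict[str, Dict[str, int]]   # e.g. {"Person": {"x": 3, "y": 4}}

@dataclass
class PropositionalRule:
    action: str
    precondition: Callable[[State], bool]   # proposition, e.g. Touch_East(Person, Wall)
    effect: Callable[[State], None]         # attribute rewrite, e.g. Person.x <- Person.x + 0

def touch_east_person_wall(s: State) -> bool:
    return s["Person"]["x"] + 1 == 4        # toy geometry (assumption): wall at x = 4

rule = PropositionalRule(
    action="East",
    precondition=touch_east_person_wall,
    effect=lambda s: s["Person"].update(x=s["Person"]["x"] + 0),   # blocked: no change
)

s = {"Person": {"x": 3, "y": 4}}
if rule.precondition(s):
    rule.effect(s)
print(s)   # Person.x unchanged: moving East into the wall has no effect
```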

  38. Benefits
  • Propositional OO-MDPs
    • Compact representation
    • Efficient learning of rules

  39. Limitations
  • Propositional OO-MDPs are efficient, but restrictive
  • East ∧ Touch_West(Box, Person) ∧ Touch_East(Box, Wall) ⇒ Box.x ← ?

  40. Limitations
  • Propositional OO-MDPs are efficient, but restrictive
  • Restriction that preconditions are propositional
    • Can't refer to the same box
  • East ∧ Touch_West(Box, Person) ∧ Touch_East(Box, Wall) ⇒ Box.x ← ?

  41. Limitations
  • Propositional OO-MDPs are efficient, but restrictive
  • Restriction that preconditions are propositional
    • Can't refer to the same box
  • East ∧ Touch_West(Box, Person) ∧ Touch_East(Box, Wall) ⇒ Box.x ← ?
  • Ground instances! But then relearn dynamics for box1, box2, etc.

  42. Deictic OO-MDPs
  • Deictic predicates instead of propositions
  • Grounded only with respect to a central deictic object ("me" or "this")
  • Relates to other non-grounded objects
  • Transition dynamics of Box.x depend on the grounded box object
  • East ∧ Touch_West(box, Person) ∧ Touch_East(box, Wall) ⇒ box.x ← box.x + 0
  • Also provably efficient
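A rough sketch of the deictic idea: the precondition is evaluated relative to one grounded central object (here "this box"), so the same rule applies to box1, box2, and so on without relearning their dynamics. All names and the toy geometry are illustrative assumptions, not the paper's implementation:

```python
from typing import Dict

State = Dict[str, Dict[str, int]]   # object name -> attributes, e.g. {"box1": {"x": 4, "y": 4}}

def touch_west(state: State, obj: str, other_cls: str) -> bool:
    # Toy deictic predicate (assumption): some object of class `other_cls` sits directly west of `obj`.
    return any(name.startswith(other_cls.lower())
               and attrs["x"] + 1 == state[obj]["x"] and attrs["y"] == state[obj]["y"]
               for name, attrs in state.items() if name != obj)

def touch_east_wall(state: State, obj: str) -> bool:
    return state[obj]["x"] + 1 == 5            # toy wall at x = 5 (assumption)

def east_rule_for_box(state: State, box: str) -> None:
    # Deictic rule, grounded only in the central object `box`:
    # East ∧ Touch_West(box, Person) ∧ Touch_East(box, Wall) ⇒ box.x ← box.x + 0
    if touch_west(state, box, "person") and touch_east_wall(state, box):
        state[box]["x"] += 0                    # the box stays put when pushed into a wall

s = {"person": {"x": 3, "y": 4}, "box1": {"x": 4, "y": 4}, "box2": {"x": 1, "y": 2}}
for box in ("box1", "box2"):                    # the same rule covers every box instance
    east_rule_for_box(s, box)
print(s)
```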

  43. Learning the dynamics
  • Learning from experience:
    • For each action, how do attributes change?
  • KWIK framework
  • Propositional OO-MDPs: DOORMAX algorithm
    • Transition dynamics for each attribute and action must be representable as a binary tree
    • Effects at the leaf nodes
    • Each possible effect can occur at most at one leaf, except for a failure condition (globally nothing changes)
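A minimal illustration of the tree-structured dynamics described here: internal nodes test predicates, leaves hold effects for one attribute under one action. The particular predicate and effects are assumptions for illustration; this is not the DOORMAX learning algorithm itself:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

State = Dict[str, int]

@dataclass
class Leaf:
    effect: Callable[[State], None]      # effect on the attribute, e.g. x <- x + 1
    name: str

@dataclass
class Node:
    predicate: Callable[[State], bool]   # binary test, e.g. Touch_East(Person, Wall)
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]

def apply_tree(tree: Union[Node, Leaf], s: State) -> str:
    """Walk the tree for one (attribute, action) pair and apply the effect at the reached leaf."""
    while isinstance(tree, Node):
        tree = tree.if_true if tree.predicate(s) else tree.if_false
    tree.effect(s)
    return tree.name

# Toy tree (assumption) for Person.x under the East action:
# blocked by a wall -> no change; otherwise -> x increases by 1.
tree = Node(
    predicate=lambda s: s["x"] + 1 == 5,                          # Touch_East(Person, Wall)?
    if_true=Leaf(effect=lambda s: None, name="no change"),
    if_false=Leaf(effect=lambda s: s.update(x=s["x"] + 1), name="x <- x + 1"),
)

s = {"x": 3, "y": 4}
print(apply_tree(tree, s), s)   # "x <- x + 1" {'x': 4, 'y': 4}
```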

  44. Learning the dynamics
  • [Figure: example of the learned tree-structured dynamics]
