SLIDE 1

Generating Adjacency-Constrained Subgoals in Hierarchical Reinforcement Learning

Tianren Tang, Shangqi Guo, Tian Tan, Xiaolin Hu, Feng Chen

SLIDE 2

Background

  • Goal-conditioned HRL
  • The high-level policy suffers from a non-stationarity problem
  • From the MARL perspective, each agent's policy is influenced by the other agents
  • Another perspective:
  • The action space of the high-level policy is usually too large, so its action, which serves as the subgoal for the low-level policy, is often unreachable
  • Intuitively, this suggests action-space reduction or action elimination
  • Drawbacks:
  • no similar literature shows how to perform such space reduction
  • reduction or elimination may cause sub-optimality
SLIDE 3

Intuition

  • Restrict the subgoal space to the k-step adjacent region of the current state
SLIDE 4

Theoretical Analysis

  • Shortest transition time: the minimum number of environment steps needed to reach one state from another
  • For the optimal policy 𝜋∗
  • where 𝜑⁻¹: 𝐺 → 𝑆 is a mapping from a goal to a state s
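The shortest transition time above can be approximated, in a small deterministic environment, by a breadth-first search over the transition graph. A minimal sketch, assuming a discrete state space and a `neighbors` function (the gridworld and function names are illustrative, not from the paper):

```python
from collections import deque

def shortest_transition_time(start, goal, neighbors):
    """BFS estimate of the shortest transition time d_st(start, goal):
    the fewest environment steps any policy needs to reach `goal`
    from `start` in a deterministic, discrete-state environment."""
    if start == goal:
        return 0
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        state, dist = frontier.popleft()
        for nxt in neighbors(state):
            if nxt == goal:
                return dist + 1  # BFS expands level by level, so this is minimal
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return float("inf")  # goal unreachable from start

# Toy 4-connected 5x5 gridworld as the transition graph.
def grid_neighbors(s):
    x, y = s
    return [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < 5 and 0 <= y + dy < 5]
```

In an obstacle-free grid this reduces to Manhattan distance, e.g. `shortest_transition_time((0, 0), (2, 3), grid_neighbors)` returns 5.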
SLIDE 5

Theoretical Analysis

  • The k-step adjacent region of s is defined as the set of goals whose corresponding states are reachable from s within k steps
  • Theorem 1:
  • there is always a surrogate goal 𝑔′ ∈ 𝐺_𝐴 such that 𝜋∗(𝑎∗|𝑠, 𝑔′) = 𝜋∗(𝑎∗|𝑠, 𝑔)
  • Theorem 2:
  • for 𝑔′ ∈ 𝐺_𝐴, 𝑄∗(𝑠, 𝑔′) = 𝑄∗(𝑠, 𝑔)
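In standard goal-conditioned notation (the symbol names are my own reading of the slide), the k-step adjacent region is the set of goals whose mapped states lie within shortest transition time k of s. A minimal sketch with hypothetical `phi_inv` (goal-to-state mapping) and `d_st` (shortest transition time) arguments:

```python
def k_step_adjacent_region(s, k, goals, phi_inv, d_st):
    """G_A(s, k): the subgoals whose corresponding states are reachable
    from state s within k environment steps, where d_st is the shortest
    transition time and phi_inv maps a goal back to a state."""
    return {g for g in goals if d_st(s, phi_inv(g)) <= k}

# Toy 1-D chain: goal i corresponds to state i, and moving between
# states i and j takes |i - j| steps.
region = k_step_adjacent_region(3, 2, range(10),
                                lambda g: g,
                                lambda a, b: abs(a - b))  # {1, 2, 3, 4, 5}
```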
SLIDE 6

Theoretical Optimizations

  • Original optimization objective

where 𝜏∗ = (𝑠₀, …, 𝑠_{𝑇𝐾}) is the optimal state trajectory and 𝜎∗ = (𝑔₀, …, 𝑔_{(𝑇−1)𝐾}) is the subgoal sequence

  • Relax the equations above:
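Reconstructed in standard notation (my own symbols, following the de-garbled slide; the exact form in the paper may differ), the relaxed problem keeps the hierarchical return objective and adds the k-step adjacency requirement on each generated subgoal as a constraint:

```latex
\max_{\pi^{hi},\,\pi^{lo}} \; \mathbb{E}\Big[\sum_{t=0}^{TK-1} r_t\Big]
\quad \text{s.t.} \quad
d_{st}\big(s_{iK},\, \varphi^{-1}(g_{iK})\big) \le k,
\qquad i = 0, \dots, T-1,
```

where $K$ is the high-level action interval, $T$ the number of high-level steps, and $d_{st}$ the shortest transition time.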
SLIDE 7

HRL with Adjacency Constraint

  • Adjacency matrix approximation
  • Contrastive loss
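The contrastive loss mentioned above can be sketched as a hinge-style objective over a learned state embedding ψ: pull embeddings of adjacent state pairs within some radius, push non-adjacent pairs beyond it. A minimal NumPy sketch; the function name, `eps_k`, and `delta` margin are my own illustrative choices, not the paper's exact hyperparameters:

```python
import numpy as np

def adjacency_contrastive_loss(emb1, emb2, adjacent, eps_k=1.0, delta=0.2):
    """Hinge-style contrastive loss on embedded state pairs.
    `adjacent` is a 0/1 array of adjacency labels (from the approximate
    adjacency matrix): adjacent pairs are pulled within radius eps_k,
    non-adjacent pairs are pushed beyond eps_k + delta."""
    dist = np.linalg.norm(emb1 - emb2, axis=-1)
    pos = adjacent * np.maximum(dist - eps_k, 0.0)                # adjacent but too far
    neg = (1 - adjacent) * np.maximum(eps_k + delta - dist, 0.0)  # non-adjacent but too close
    return float(np.mean(pos + neg))
```

For example, an adjacent pair at distance 3 and a non-adjacent pair at distance 0 both incur a penalty, while correctly placed pairs contribute zero loss.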
SLIDE 8

Final Optimization Objective

  • With a learned adjacency network
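With the adjacency network in hand, the adjacency constraint can be folded into the high-level policy's training signal as a penalty term. A minimal sketch of one such penalty; the weighting `eta` and all names are illustrative, and this is only the extra term added on top of the usual RL loss, not the full objective:

```python
import numpy as np

def adjacency_penalty(psi_s, psi_goal, eps_k=1.0, eta=0.1):
    """Penalty added to the high-level policy loss: a generated subgoal
    whose embedding (under the learned adjacency network psi) falls
    outside the eps_k ball around the current state's embedding is
    treated as outside the k-step adjacent region and penalized.
    eta balances the penalty against the RL objective."""
    dist = np.linalg.norm(psi_s - psi_goal)
    return eta * max(dist - eps_k, 0.0)
```

Subgoals inside the learned k-step region incur zero penalty, so the constraint only bites when the high-level policy proposes distant, likely unreachable subgoals.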
SLIDE 9

Algorithm

SLIDE 10

Experiment Environment

  • Discrete & Continuous
  • Result
SLIDE 11

Ablation Study

  • Differences:
  • HRAC-O: HRAC with the perfect adjacency matrix obtained from the environment
  • NegReward: relabel rewards as negative and bound the critic function
SLIDE 12

Visualization

SLIDE 13

Summary

  • Although the intuition is simple, the paper is strong overall.