SLIDE 1

Option Discovery in the Absence of Rewards with Manifold Analysis

Amitay Bar, Ronen Talmon and Ron Meir

Viterbi Faculty of Electrical Engineering Technion - Israel Institute of Technology

SLIDE 2

Option Discovery

  • We address the problem of option discovery
  • Options (a.k.a. skills) are predefined sequences of primitive actions [Sutton et al. '99]
  • Options were shown to improve both learning and exploration
  • Setting: the options are not associated with any specific task and are acquired without receiving any reward
  • This is an important and challenging problem in RL
SLIDE 3

Contribution

  • A new approach to option discovery with a theoretical foundation
  • Based on manifold analysis
  • The analysis includes novel results in manifold learning
  • We propose an algorithm for option discovery
  • The discovered options outperform competing options
SLIDE 4

Graph Based Approach

  • The finite domain is represented by a graph [Mahadevan '07]
  • Nodes - the states (S is the set of states)
  • Edges - according to the states' connectivity
  • The graph is a discrete representation of a manifold

[Figure: a 7-node example graph (state = node), its adjacency matrix M, and its degree matrix D = diag(2, 2, 3, 4, 2, 2, 1).]

SLIDE 5

The Proposed Algorithm

  • 1. Compute the random walk matrix X = (1/2)(I + M D⁻¹)
  • 2. Apply EVD to X and obtain its left and right eigenvectors ϕ_j, ϕ̃_j, and its eigenvalues ω_j
  • 3. Construct the score function f^t : S → ℝ, f^t(s) = Σ_{j≥2} ω_j^{2t} ϕ_j(s)² ‖ϕ̃_j‖² (to be motivated later)
  • 4. Find the local maxima of f^t(s), denoted {s_j} ⊂ S
  • 5. For each local maximum s_j, build an option leading to it

f^t allows the identification of goal states
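The five steps above can be sketched in code. This is an illustrative reconstruction on a toy graph of my own choosing, not the authors' implementation; the path graph, variable names, and eigen-decomposition details are assumptions:

```python
import numpy as np

def diffusion_score(M, t):
    """Sketch of steps 1-3: f^t(s) = sum_{j>=2} w_j^{2t} phi_j(s)^2 ||phi_tilde_j||^2."""
    D = np.diag(M.sum(axis=0))
    # Step 1: the lazy random walk matrix X = (I + M D^-1) / 2
    X = 0.5 * (np.eye(len(M)) + M @ np.linalg.inv(D))
    # Step 2: EVD of X; columns of V are right eigenvectors, rows of
    # V^-1 are the matching (biorthogonal) left eigenvectors
    w, V = np.linalg.eig(X)
    left = np.linalg.inv(V)
    order = np.argsort(-w.real)          # sort so j = 1 is the trivial pair
    w = w[order].real
    right = V[:, order].real
    left = left[order, :].real
    # Step 3: the score function, skipping the trivial j = 1 component
    return sum(w[j] ** (2 * t)
               * left[j, :] ** 2
               * np.linalg.norm(right[:, j]) ** 2
               for j in range(1, len(w)))

# Toy 7-state path graph: states 0..6, edges between neighbours
M = np.zeros((7, 7))
for s in range(6):
    M[s, s + 1] = M[s + 1, s] = 1.0

f = diffusion_score(M, t=4)
# Step 4: on a path, the local maxima of f^t sit at the endpoints, the
# states "farthest" from all others; step 5 builds an option to each.
print(int(np.argmax(f)) in (0, 6))
```

On this toy graph the endpoints score highest, matching the intuition that the score peaks at states far from all others.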

SLIDE 6

Demonstrating the Score Function

  • 4Rooms domain [Sutton et al. '99]
  • The local maxima of f^t(s) = Σ_{j≥2} ω_j^{2t} ϕ_j(s)² ‖ϕ̃_j‖² are at states that are "far away" from all other states
  • Corner states and bottleneck states

[Figure: f^4(s) and f^13(s) over the 4Rooms grid - increasing t has a low-pass filter effect.]

SLIDE 7

Experimental Results - Learning

[Figure: normalized visitation during learning - diffusion options (t=4) vs. eigenoptions vs. a random walk.]

*Further results in the paper

  • Q-learning [Watkins and Dayan '92]
  • Eigenoptions [Machado et al. '17]

SLIDE 8

Experimental Results - Exploration

  • Exploration measure: the median number of steps between every two states [Machado et al. '17]

[Figure: exploration comparison with eigenoptions [Machado et al. '17] and cover options [Jinnai et al. '19].]

SLIDE 9

Theoretical Analysis

  • We use manifold learning results and concepts
  • Diffusion distance [Coifman and Lafon '06]
  • New concept - considering the entire spectrum [Cheng and Mishne '18]
  • Comparison to existing work - eigenoptions [Machado et al. '17] and cover options [Jinnai et al. '19], which:
  • Use only the principal components instead of all/many
  • Consider only one eigenvector at a time, instead of incorporating them together

SLIDE 10

Diffusion Distance

  • Consider the columns of X^t = [x_1^t ⋯ x_m^t ⋯], where X = (1/2)(I + M D⁻¹)
  • The diffusion distance is the Euclidean distance between columns of X^t:

    E^t(s, s') = ‖x_s^t − x_{s'}^t‖
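As a concrete sketch of this definition (my toy path-graph example, not from the slides), the diffusion distance can be read off the columns of the t-th matrix power:

```python
import numpy as np

# Diffusion distance as the Euclidean distance between columns of X^t,
# on an assumed 7-state path graph (states 0..6, edges between neighbours).
n = 7
M = np.zeros((n, n))
for s in range(n - 1):
    M[s, s + 1] = M[s + 1, s] = 1.0
D = np.diag(M.sum(axis=0))
X = 0.5 * (np.eye(n) + M @ np.linalg.inv(D))  # X = (I + M D^-1) / 2

t = 4
Xt = np.linalg.matrix_power(X, t)             # column s of X^t is x_s^t

def E(s, s_prime):
    """E^t(s, s') = || x_s^t - x_{s'}^t ||."""
    return np.linalg.norm(Xt[:, s] - Xt[:, s_prime])

# States linked by many short paths are close in diffusion distance:
# the two endpoints are farther apart than two adjacent middle states.
print(E(0, 6) > E(2, 3))
```

Each column of X^t is the state distribution of a t-step lazy random walk, so the distance compares where mass has spread after t steps.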

SLIDE 11

Properties of the Score Function

Proposition 1

The function f^t : S → ℝ can be expressed as

    f^t(s) = avg_{s'∈S} E^t(s, s')² + const

  • avg_{s'∈S} E^t(s, s')² is the average squared diffusion distance between state s and all other states

*See ICML paper for the proof
SLIDE 12

Properties of the Score Function

Proposition 1

The function f^t : S → ℝ can be expressed as

    f^t(s) = avg_{s'∈S} E^t(s, s')² + const

  • Option discovery: max_s f^t(s) = max_s avg_{s'∈S} E^t(s, s')²

Exploration benefits

  • The agent visits different regions
  • Avoids the dithering effect of a random walk

*See ICML paper for the proof

SLIDE 13

Properties of the Score Function

Proposition 2

Relates f^t(s) to π_0, the stationary distribution of the graph: with p_s^t the state distribution after t steps starting from s,

    f^t(s) = ‖p_s^t − π_0‖²,    f^t(s) ≤ ω_2^{2t} (1/π_0(s) − 1)

  • PageRank algorithm [Page et al. '99, Kleinberg '99]

Exploration benefits

  • Diffusion options lead to states for which π_0(s) is small
  • Such states are rarely visited by an uninformed random walk

*See ICML paper for the proof
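To make Proposition 2 concrete, here is a small numerical check on an assumed toy path graph (my example, not the authors' code): the stationary distribution of the lazy random walk is proportional to the node degrees, and ‖p_s^t − π_0‖² stays below the ω_2^{2t}(1/π_0(s) − 1) bound:

```python
import numpy as np

# Stationary distribution and Proposition 2's bound on a 7-state path graph.
n = 7
M = np.zeros((n, n))
for s in range(n - 1):
    M[s, s + 1] = M[s + 1, s] = 1.0
deg = M.sum(axis=0)
pi0 = deg / deg.sum()                          # stationary distribution ~ degrees
X = 0.5 * (np.eye(n) + M @ np.diag(1.0 / deg))
assert np.allclose(X @ pi0, pi0)               # pi_0 is fixed by X

t = 4
w = np.sort(np.linalg.eigvals(X).real)[::-1]
w2 = w[1]                                      # second-largest eigenvalue
Xt = np.linalg.matrix_power(X, t)
for s in range(n):
    p_st = Xt[:, s]                            # distribution after t steps from s
    f = np.linalg.norm(p_st - pi0) ** 2
    # low-stationary-probability states admit a looser bound, which is
    # where the score (and hence the discovered options) can be large
    assert f <= w2 ** (2 * t) * (1.0 / pi0[s] - 1.0)
print("bound holds for all states")
```

The endpoints have the smallest π_0(s), so the bound is loosest exactly at the rarely visited states the diffusion options target.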

SLIDE 14

Extensions and Scaling Up

  • Extending diffusion options to stochastic domains
  • Stochastic domains → can lead to asymmetric matrices
  • We use polar decomposition on the graph Laplacian [Mhaskar '18]
  • Scaling up to large-scale domains / the function approximation case
  • [Wu et al. '19], [Jinnai et al. '20]
  • See ICML paper for further discussion and results
SLIDE 15

Summary

  • We introduced theoretically motivated options
  • Analysis based on concepts from manifold learning
  • Diffusion options encourage exploration
  • They lead to distant states in terms of diffusion distance
  • They compensate for low stationary distribution values
  • Empirically demonstrated improved performance
  • In both learning and exploration
SLIDE 16

Thank you

"Option Discovery in the Absence of Rewards with Manifold Analysis",

  • A. Bar, R. Talmon and R. Meir, ICML 2020