

  1. Option Discovery in the Absence of Rewards with Manifold Analysis
  Amitay Bar, Ronen Talmon and Ron Meir
  Viterbi Faculty of Electrical Engineering, Technion - Israel Institute of Technology

  2. Option Discovery
  • We address the problem of option discovery
  • Options (a.k.a. skills) are predefined sequences of primitive actions [Sutton et al. '99]
  • Options were shown to improve both learning and exploration
  • Setting: the options are not associated with any specific task and are acquired without receiving any reward
  • An important and challenging problem in RL
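To make the options framework concrete, here is a minimal Python sketch of an option as a triple of initiation set, internal policy, and termination condition, following Sutton et al. '99. The class layout and the `env_step` callback are illustrative names of ours, not code from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    """Minimal option abstraction in the sense of Sutton et al. '99:
    an initiation set, an internal policy, and a termination condition."""
    initiation_set: Set[int]             # states where the option may start
    policy: Callable[[int], int]         # maps state -> primitive action
    termination: Callable[[int], bool]   # True where the option terminates

def run_option(env_step, state, option, max_steps=100):
    """Follow the option's policy until it terminates (or a step cap is hit).
    `env_step(state, action)` is a hypothetical environment transition."""
    assert state in option.initiation_set
    for _ in range(max_steps):
        state = env_step(state, option.policy(state))
        if option.termination(state):
            break
    return state
```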

  3. Contribution
  • A new approach to option discovery with a theoretical foundation
  • Based on manifold analysis
  • The analysis includes novel results in manifold learning
  • We propose an algorithm for option discovery
  • It outperforms competing option-discovery methods

  4. Graph-Based Approach
  • The finite domain is represented by a graph [Mahadevan '07]; state = node
  • Nodes are the states (𝕊 is the set of states)
  • Edges connect states according to their connectivity
  • The graph is a discrete representation of a manifold
  • M is the adjacency matrix; D is the diagonal degree matrix
  [Figure: a 7-state example graph with its adjacency matrix M and degree matrix D]
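As a concrete construction, the sketch below builds M and D for a small grid world with 4-connectivity. The helper name and layout are ours; the paper's grid domains (e.g. 4Rooms) follow the same recipe.

```python
import numpy as np

def grid_graph(rows, cols, walls=frozenset()):
    """Build the state graph of a grid world: one node per free cell,
    an edge between cells that are one step apart (4-connectivity).
    Returns the adjacency matrix M, the diagonal degree matrix D,
    and the cell -> node-index map."""
    cells = [(r, c) for r in range(rows) for c in range(cols)
             if (r, c) not in walls]
    index = {cell: i for i, cell in enumerate(cells)}
    M = np.zeros((len(cells), len(cells)))
    for (r, c), i in index.items():
        for nb in ((r + 1, c), (r, c + 1)):   # each edge added once, symmetrically
            if nb in index:
                M[i, index[nb]] = M[index[nb], i] = 1
    D = np.diag(M.sum(axis=1))
    return M, D, index
```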

  5. The Proposed Algorithm
  1. Compute the lazy random walk matrix W = ½(I + D⁻¹M)
  2. Apply EVD to W and obtain its left and right eigenvectors φ_j, ψ̃_j, and its eigenvalues λ_j
  3. Construct the score function f_t : 𝕊 → ℝ, f_t(s) = Σ_{j≥2} λ_j^{2t} ψ̃_j²(s) (to be motivated later)
  4. Find the local maxima of f_t(s), denoted {s_p} ⊂ 𝕊
  5. For each local maximum s_p, build an option leading to it
  • f_t allows the identification of goal states
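A compact sketch of steps 1-4 under the reconstruction above. We take the EVD on the symmetrized walk matrix for numerical stability; the eigenvector normalization (and a state-independent scale factor, which does not move the maxima) is our assumption rather than the paper's exact convention.

```python
import numpy as np

def diffusion_score(M, D, t):
    """Steps 1-3: score f_t(s) = sum_{j>=2} lambda_j^(2t) * psi_j(s)^2.
    The EVD is taken on the symmetric matrix similar to W = (I + D^-1 M)/2,
    which shares its eigenvalues; psi_j are the right eigenvectors of W.
    A state-independent normalization constant is dropped here."""
    d = np.diag(D)
    S = 0.5 * (np.eye(len(d)) + (M / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :])
    lam, V = np.linalg.eigh(S)            # eigenvalues in ascending order
    lam, V = lam[::-1], V[:, ::-1]        # descending: lam[0] = 1
    psi = V / np.sqrt(d)[:, None]         # right eigenvectors of W
    return (lam[1:] ** (2 * t) * psi[:, 1:] ** 2).sum(axis=1)

def local_maxima(score, M):
    """Step 4: states whose score is >= that of all their graph neighbors."""
    return [s for s in range(len(score))
            if all(score[s] >= score[u] for u in np.flatnonzero(M[s]))]
```

Step 5, building an option that navigates to each maximum (e.g. by following shortest paths on the graph), is domain-specific and omitted from this sketch.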

  6. Demonstrating the Score Function
  f_t(s) = Σ_{j≥2} λ_j^{2t} ψ̃_j²(s)
  • 4Rooms domain [Sutton et al. '99]
  • The local maxima of f_t(s) are at states that are "far away" from all other states
  • These are corner states and bottleneck states
  • Raising the eigenvalues to the power 2t has a low-pass-filter effect
  [Figure: f_4(s) and f_13(s) over the 4Rooms grid]
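Continuing the sketches above, a toy demonstration on a small two-room grid (a rough stand-in for 4Rooms; the wall layout is ours, not the paper's map):

```python
# Score the states of a small two-room grid and list the discovered
# goal states; corners and the doorway region should stand out.
walls = {(2, c) for c in range(5) if c != 2}      # one wall with a doorway
M, D, index = grid_graph(5, 5, walls=walls)
scores = diffusion_score(M, D, t=4)
cells = {i: cell for cell, i in index.items()}
print([cells[s] for s in local_maxima(scores, M)])
```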

  7. Experimental Results - Learning
  • Q-learning [Watkins and Dayan '92]
  • Baselines: eigenoptions [Machado et al. '17] and a random walk
  [Figure: normalized visitation during learning, for diffusion options (t=4), eigenoptions, and a random walk]
  * Further results in the paper
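For reference, the tabular Q-learning update of Watkins and Dayan '92 used as the learning algorithm; the step-size and discount values are illustrative, and in the options setting the action set is augmented with the discovered options.

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s_next, a') - Q(s,a))."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```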

  8. Experimental Results - Exploration
  • Exploration is measured by the median number of steps between every two states [Machado et al. '17]
  • Baselines: eigenoptions [Machado et al. '17] and cover options [Jinnai et al. '19]
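A Monte Carlo sketch of this exploration metric as we read it (the paper's exact protocol may differ); `step_fn` is a hypothetical callback performing one move of the agent's random behavior, with or without options.

```python
import numpy as np
from itertools import combinations

def median_hitting_steps(step_fn, states, n_runs=20, max_steps=10_000, seed=0):
    """Monte Carlo estimate of the median number of steps an exploring agent
    needs between pairs of states. `step_fn(state, rng)` performs one move of
    the agent's (random) behavior policy."""
    rng = np.random.default_rng(seed)
    lengths = []
    for s0, goal in combinations(states, 2):
        for _ in range(n_runs):
            s, steps = s0, 0
            while s != goal and steps < max_steps:
                s = step_fn(s, rng)
                steps += 1
            lengths.append(steps)
    return float(np.median(lengths))
```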

  9. Theoretical Analysis
  • We use manifold learning results and concepts
  • Diffusion distance [Coifman and Lafon '06]
  • A new ingredient: considering the entire spectrum [Cheng and Mishne '18]
  • Comparison to existing work, eigenoptions [Machado et al. '17] and cover options [Jinnai et al. '19], which:
  • Use only the principal eigenvectors instead of all/many of them
  • Consider only one eigenvector at a time, instead of incorporating them together

  10. Diffusion Distance
  • Consider Wᵗ, the lazy random walk matrix W = ½(I + D⁻¹M) raised to the power t, with rows w_sᵗ
  • The diffusion distance is the (weighted) distance between the rows of Wᵗ:
  D_t(s, s') = ‖w_sᵗ − w_{s'}ᵗ‖
  [Figure: Euclidean distance vs. diffusion distance]
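A direct implementation of this definition. We weight the coordinates by 1/π₀, the standard convention of Coifman and Lafon '06; the paper's exact weighting is our assumption.

```python
import numpy as np
from numpy.linalg import matrix_power

def diffusion_distance(M, D, t, s, s_prime):
    """Diffusion distance between states s and s': the distance between the
    corresponding rows of W^t, with coordinates weighted by 1/pi0 as in the
    standard diffusion-maps convention."""
    d = np.diag(D)
    W = 0.5 * (np.eye(len(d)) + M / d[:, None])   # lazy walk, rows sum to 1
    Wt = matrix_power(W, t)
    pi0 = d / d.sum()                             # stationary distribution
    diff = Wt[s] - Wt[s_prime]
    return np.sqrt((diff ** 2 / pi0).sum())
```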

  11. Properties of the Score Function
  Proposition 1. The function f_t : 𝕊 → ℝ can be expressed as
  f_t(s) = E_{s'∈𝕊}[D_t²(s, s')] + const
  • E_{s'∈𝕊}[D_t²(s, s')] is the average squared diffusion distance between state s and all other states
  *See the ICML paper for the proof
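Continuing the toy example, a numerical sanity check of Proposition 1. Averaging under the stationary distribution π₀ (which makes the identity exact in this sketch) and the constant rescaling of the score are our assumptions.

```python
# Check: the average squared diffusion distance minus the (rescaled) score
# is constant across states, as Proposition 1 asserts.
t, n = 4, M.shape[0]
d = np.diag(D)
pi0 = d / d.sum()
avg_sq = np.array([sum(pi0[sp] * diffusion_distance(M, D, t, s, sp) ** 2
                       for sp in range(n)) for s in range(n)])
f = diffusion_score(M, D, t)
gap = avg_sq - d.sum() * f       # the score sketch dropped a factor of sum(d)
print(np.allclose(gap, gap[0]))  # True: the gap is an s-independent constant
```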

  12. Properties of the Score Function
  Proposition 1. The function f_t : 𝕊 → ℝ can be expressed as
  f_t(s) = E_{s'∈𝕊}[D_t²(s, s')] + const
  • Option discovery: max_s f_t(s) = max_s E_{s'∈𝕊}[D_t²(s, s')]
  Exploration benefits:
  • The agent visits different regions
  • The dithering effect of a random walk is avoided
  *See the ICML paper for the proof

  13. Properties of the Score Function
  Proposition 2. Relates f_t(s) to π₀, the stationary distribution of the graph:
  f_t(s) = ‖p_sᵗ − π₀‖²_{1/π₀} ≤ λ₂^{2t} (1/π₀(s) − 1)
  where p_sᵗ is the state distribution after t steps of the random walk started at s
  • Reminiscent of the PageRank algorithm [Page et al. '99, Kleinberg '99]
  Exploration benefits:
  • Diffusion options lead to states for which π₀(s) is small
  • Such states are rarely visited by an uninformed random walk
  *See the ICML paper for the proof
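A continuation of the numerical check above, for Proposition 2 (same caveats about the normalization we assumed):

```python
# Check: the (rescaled) score equals ||p_s^t - pi0||^2 weighted by 1/pi0,
# and the bound via the second eigenvalue holds at every state.
from numpy.linalg import matrix_power

W = 0.5 * (np.eye(n) + M / d[:, None])             # lazy random walk
dist_sq = (((matrix_power(W, t) - pi0) ** 2) / pi0).sum(axis=1)
print(np.allclose(dist_sq, d.sum() * f))           # equality in Proposition 2
S = 0.5 * (np.eye(n) + (M / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :])
lam2 = np.sort(np.linalg.eigvalsh(S))[-2]          # second-largest eigenvalue
print(np.all(dist_sq <= lam2 ** (2 * t) * (1 / pi0 - 1) + 1e-12))  # bound holds
```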

  14. Extensions and Scaling Up
  • Extending diffusion options to stochastic domains
  • Stochastic domains can lead to asymmetric matrices
  • We use the polar decomposition of the graph Laplacian [Mhaskar '18]
  • Scaling up to large-scale domains / the function-approximation case
  • [Wu et al. '19], [Jinnai et al. '20]
  • See the ICML paper for further discussion and results
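A minimal sketch of the polar-decomposition step, assuming a generic asymmetric matrix as a stand-in for the Laplacian arising in a stochastic domain; how the symmetric factor is plugged back into the pipeline is described in the paper.

```python
import numpy as np
from scipy.linalg import polar

# The polar decomposition A = U @ P splits an asymmetric matrix into a
# unitary part U and a symmetric positive-semidefinite part P, which again
# admits a real eigendecomposition.
rng = np.random.default_rng(0)
A = rng.random((6, 6))                  # stand-in for an asymmetric matrix
U, P = polar(A)                         # A = U @ P, with P symmetric PSD
assert np.allclose(A, U @ P) and np.allclose(P, P.T)
eigvals = np.linalg.eigvalsh(P)         # real spectrum of the symmetric factor
```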

  15. Summary
  • We introduced theoretically motivated options
  • The analysis is based on concepts from manifold learning
  • Diffusion options encourage exploration
  • They lead to distant states in terms of diffusion distance
  • They compensate for low stationary-distribution values
  • Empirically demonstrated improved performance, in both learning and exploration

  16. Thank you
  "Option Discovery in the Absence of Rewards with Manifold Analysis", A. Bar, R. Talmon and R. Meir, ICML 2020
