Reward Shaping in Episodic Reinforcement Learning


  1. Reward Shaping in Episodic Reinforcement Learning. Marek Grześ, Canterbury, UK. AAMAS 2017, São Paulo, May 8–12

  2. Motivating Reward Shaping

  3. Reinforcement Learning [Figure: the agent–environment loop from [Sutt 98]; the agent observes state s_t and reward r_t, emits action a_t, and the environment returns s_{t+1} and r_{t+1}] Temporal credit assignment problem

  4. Deep Reinforcement Learning [Figure: agent–environment loop labelled with state s_{t+1}, action a_t, reward r_{t+1}]

  5. Challenges [Figure: as on the previous slide] ◮ Temporal credit assignment problem ◮ In games, we can just generate more data for reinforcement learning ◮ However, ‘more learning’ in neural networks can be a challenge ... (see next slide)

  6. Contradictory Objectives [Figure from http://www.deeplearningbook.org] ◮ Easy to overfit ◮ Early stopping is a potential regulariser, but we need a lot of training to address the temporal credit assignment problem ◮ Conclusion: it can be useful to mitigate the temporal credit assignment problem using reward shaping!

  7. Reward Shaping ◮ The agent observes transitions ⟨s_t, a_t, s_{t+1}, r_{t+1}⟩ ◮ r_{t+1} goes to Q-learning, SARSA, R-max, etc. ◮ Shaping replaces it with r_{t+1} + F(s_t, a_t, s_{t+1}) ◮ where F(s_t, a_t, s_{t+1}) = γΦ(s_{t+1}) − Φ(s_t)
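
Below is a minimal sketch of how the shaped reward slots into a tabular Q-learning update. The env interface (reset/step/actions) and the potential function phi are assumptions for illustration; only the line computing F follows the definition above.

    import random
    from collections import defaultdict

    def q_learning_with_shaping(env, phi, episodes=500,
                                alpha=0.1, gamma=0.99, epsilon=0.1):
        """Tabular Q-learning on a hypothetical `env`, with
        potential-based shaping F(s,a,s') = gamma*phi(s') - phi(s)."""
        Q = defaultdict(float)                   # Q[(state, action)]
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                acts = env.actions(s)            # assumed: list of actions in s
                a = (random.choice(acts) if random.random() < epsilon
                     else max(acts, key=lambda b: Q[(s, b)]))
                s2, r, done = env.step(a)        # assumed: (next state, reward, done)
                F = gamma * phi(s2) - phi(s)     # the shaping term
                target = r + F                   # shaped reward replaces r
                if not done:
                    target += gamma * max(Q[(s2, b)] for b in env.actions(s2))
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s2
        return Q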

  8. Policy Invariance under Reward Transformations Potential-based reward shaping is necessary and sufficient to guarantee policy invariance [Ng 99]. This is straightforward to show in infinite-horizon MDPs [Asmu 08]; investigating episodic learning leads to new insights.
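
For intuition, a sketch of the telescoping argument behind the invariance result, stated here for the discounted infinite-horizon case following [Ng 99]:

    % The shaped return along any trajectory telescopes:
    \sum_{t=0}^{\infty} \gamma^t \big( r_{t+1} + \gamma\Phi(s_{t+1}) - \Phi(s_t) \big)
        = \sum_{t=0}^{\infty} \gamma^t r_{t+1} \; - \; \Phi(s_0)
    % Every state-action value is therefore shifted by an
    % action-independent constant, Q'(s,a) = Q(s,a) - \Phi(s),
    % so the greedy policy is unchanged.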

  9. Problematic Example in Single-agent RL [Figure: from state s_i with Φ(s_i) = 0, action a_1 reaches goal g_1 with reward r = 0 and Φ(g_1) = 1000, while action a_2 reaches goal g_2 with reward r = 100 and Φ(g_2) = 10] [Grze 10] ◮ F(s, goal) = 0 in my PhD thesis ◮ [Ng 99] required F(goal, ·) = 0 ◮ Φ(goal) = 0 is what is necessary
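
Plugging in the numbers from the figure (taking γ = 1 for the episodic case) makes the failure concrete:

    % Shaped one-step returns from s_i:
    a_1: \quad 0   + \Phi(g_1) - \Phi(s_i) = 0   + 1000 - 0 = 1000
    a_2: \quad 100 + \Phi(g_2) - \Phi(s_i) = 100 + 10   - 0 = 110
    % The shaped agent prefers the worse goal g_1; with \Phi(goal) = 0
    % the comparison reverts to the true rewards, 0 vs. 100.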

  10. Multi-agent Learning and Nash Equilibria [Figure: Boutilier's coordination game. From the start state x_1, agent 1's action a leads to x_2 and action b leads to x_3. From x_2, the coordinated joint actions (a,a; b,b) reach x_4 with reward +10, while the miscoordinated ones (a,b; b,a) reach x_5 with reward −10. From x_3, any joint action (*,*) reaches x_6 with reward +9.] [Bout 99, Devl 11]

  11. Multi-agent Learning and Nash Equilibria [Figure: the same game with potentials Φ(x_1) = Φ(x_2) = Φ(x_3) = Φ(x_4) = Φ(x_6) = 0 and Φ(x_5) = M] When M is sufficiently large, we have a new Nash equilibrium (see the worked sums below).
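
One way to see this, assuming γ = 1 and that episodes end in x_4, x_5, and x_6 so that terminal potentials are not telescoped away: the shaped returns of the two pure strategies through x_2 are

    % Coordinate via x_2 and reach x_4:
    (\Phi(x_2) - \Phi(x_1)) + 10 + (\Phi(x_4) - \Phi(x_2)) = 10
    % Miscoordinate via x_2 and reach x_5:
    (\Phi(x_2) - \Phi(x_1)) - 10 + (\Phi(x_5) - \Phi(x_2)) = M - 10
    % For M > 20, miscoordination pays more than coordination,
    % so an equilibrium absent from the original game appears.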

  12. PAC-MDP Reinforcement Learning and R-max Optimism in AI and Optimisation ◮ A* ◮ Branch-and-Bound ◮ R-max and optimistic potential functions [Asmu 08]

  13. PAC-MDP Reinforcement Learning and R-max Optimism in AI and Optimisation ◮ A* ◮ Branch-and-Bound ◮ R-max and optimistic potential functions [Asmu 08] Sufficient conditions for R-max ◮ ∀s ∈ Goals: Φ(s) = 0 ◮ ∀s ∈ Known: Φ(s) = C, where C is an arbitrary number ◮ ∀s ∈ Unknown: Φ(s) ≥ 0 ◮ where Goals ∩ Known ∩ Unknown = ∅
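
As an illustration only, these conditions are easy to test mechanically for a candidate Φ. The function below is hypothetical (the state sets are Python sets, and the slide's disjointness requirement is read as pairwise disjointness):

    def satisfies_rmax_conditions(phi, goals, known, unknown):
        """Check the sufficient conditions on a potential function phi
        for R-max given on this slide."""
        assert goals.isdisjoint(known) and goals.isdisjoint(unknown) \
            and known.isdisjoint(unknown)
        if any(phi(s) != 0 for s in goals):     # forall s in Goals: phi(s) = 0
            return False
        if len({phi(s) for s in known}) > 1:    # forall s in Known: phi(s) = C
            return False
        if any(phi(s) < 0 for s in unknown):    # forall s in Unknown: phi(s) >= 0
            return False
        return True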

  14. MDP Planning: Infinite-horizon ◮ MDP solution methods: linear programming ◮ F(s, a, s′) = γΦ(s′) − Φ(s) ◮ The impact of reward shaping on the LP objective: Σ_{s,a,s′} λ(s,a) T(s,a,s′) F(s,a,s′) = −Σ_{s′} Φ(s′) μ(s′)
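
Assuming λ(s, a) denotes the occupation measure of the dual LP and μ the initial-state distribution, the identity follows directly from the dual flow constraint (a sketch):

    % Dual flow constraint:
    % \sum_a \lambda(s',a) - \gamma \sum_{s,a} \lambda(s,a) T(s,a,s') = \mu(s')
    \sum_{s,a,s'} \lambda(s,a)\, T(s,a,s') \big( \gamma\Phi(s') - \Phi(s) \big)
      = \sum_{s'} \Phi(s') \Big( \sum_a \lambda(s',a) - \mu(s') \Big)
        - \sum_{s} \Phi(s) \sum_a \lambda(s,a)
      = -\sum_{s'} \Phi(s')\, \mu(s')
    % Shaping adds a policy-independent constant to the LP objective,
    % so the optimal occupation measure (and policy) is unchanged.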

  15. MDP Planning: Finite-Horizon Σ_{a∈A} Σ_{s∈S\G} Σ_{s′∈S} λ(s,a) T(s,a,s′) F(s,a,s′) = Σ_{s′∈G} Φ(s′) [ Σ_{s∈S\G} Σ_{a∈A} λ(s,a) T(s,a,s′) ]; that is, the shaping contribution depends only on the potentials of the goal states s′ ∈ G, so it vanishes when Φ(s) = 0 for all s ∈ G.

  16. References I
  [Asmu 08] J. Asmuth, M. L. Littman, and R. Zinkov. “Potential-based Shaping in Model-based Reinforcement Learning”. In: Proceedings of AAAI, 2008.
  [Bout 99] C. Boutilier. “Sequential Optimality and Coordination in Multiagent Systems”. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 478–485, 1999.
  [Devl 11] S. Devlin and D. Kudenko. “Theoretical Considerations of Potential-Based Reward Shaping for Multi-Agent Systems”. In: Proceedings of AAMAS, 2011.
  [Grze 10] M. Grześ. Improving Exploration in Reinforcement Learning through Domain Knowledge and Parameter Analysis. PhD thesis, University of York, 2010.
  [Ng 99] A. Y. Ng, D. Harada, and S. J. Russell. “Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping”. In: Proceedings of the 16th International Conference on Machine Learning, pp. 278–287, 1999.
  [Sutt 98] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
