Reward Shaping in Episodic Reinforcement Learning
Marek Grześ
Canterbury, UK
AAMAS 2017, São Paulo, May 8–12
Motivating Reward Shaping
Reinforcement Learning
[Figure: the agent-environment loop: at each step t the agent in state s_t takes action a_t, and the environment returns reward r_{t+1} and next state s_{t+1}]
[Sutt 98]
◮ Temporal credit assignment problem
Deep Reinforcement Learning
[Figure: the same agent-environment loop, with a deep neural network as the agent]
Challenges
◮ Temporal credit assignment problem
◮ In games, we can just generate more data for reinforcement learning
◮ However, 'more learning' in neural networks can be a challenge... (see next slide)
Contradictory Objectives
http://www.deeplearningbook.org
◮ Easy to overfit
◮ Early stopping is a potential regulariser, but we need a lot of training to address the temporal credit assignment problem
◮ Conclusion: it can be useful to mitigate the temporal credit assignment problem using reward shaping!
Reward Shaping
◮ The agent observes the transition (s_t, a_t, s_{t+1}, r_{t+1})
◮ r_{t+1} goes to Q-learning, SARSA, R-max, etc.
◮ With shaping, the learner instead receives r_{t+1} + F(s_t, a_t, s_{t+1})
◮ where F(s_t, a_t, s_{t+1}) = γΦ(s_{t+1}) − Φ(s_t)
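As a concrete illustration, here is a minimal sketch of potential-based shaping wrapped around a tabular Q-learning update (the potential table, hyperparameters, and state encoding are illustrative assumptions, not from the talk):

  import numpy as np

  def shaping_term(phi, s, s_next, gamma):
      # F(s, a, s') = gamma * Phi(s') - Phi(s); note it does not depend on a
      return gamma * phi[s_next] - phi[s]

  def q_learning_update(Q, phi, s, a, r, s_next, alpha=0.1, gamma=0.99):
      # One tabular Q-learning step on the shaped reward r + F(s, a, s')
      shaped_r = r + shaping_term(phi, s, s_next, gamma)
      td_target = shaped_r + gamma * np.max(Q[s_next])
      Q[s, a] += alpha * (td_target - Q[s, a])

  # Illustrative usage: 5 states, 2 actions, a hand-picked potential function
  Q = np.zeros((5, 2))
  phi = np.array([0.0, 1.0, 2.0, 3.0, 0.0])  # potential of the goal state is 0 (see later slides)
  q_learning_update(Q, phi, s=0, a=1, r=0.0, s_next=1)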
Policy Invariance under Reward Transformations
◮ Potential-based reward shaping is necessary and sufficient to guarantee policy invariance [Ng 99]
◮ Straightforward to show in infinite-horizon MDPs [Asmu 08]
◮ Investigating episodic learning leads to new insights
Problematic Example in Single-agent RL
[Figure: initial state s_i with two actions: a1 reaches goal g1 with r = 0, a2 reaches goal g2 with r = 100; the potentials annotated in the figure are Φ = 10 and Φ = 1000 at the goals, and Φ = 0 at s_i]
[Grze 10]
◮ F(s, goal) = 0 in my PhD thesis
◮ [Ng 99] required F(goal, ·) = 0
◮ Φ(goal) = 0 is what is necessary
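To see why only the goal potential matters, note that along an episode s_0, s_1, ..., s_T the discounted shaping terms telescope (a standard argument, stated here for completeness):

∑_{t=0}^{T−1} γ^t F(s_t, a_t, s_{t+1}) = ∑_{t=0}^{T−1} (γ^{t+1} Φ(s_{t+1}) − γ^t Φ(s_t)) = γ^T Φ(s_T) − Φ(s_0)

Φ(s_0) is the same for every policy started from s_0, so the only policy-dependent piece is γ^T Φ(s_T). Unless Φ vanishes at the goals, shaping can change which goal looks best, which is why Φ(goal) = 0 is the right condition.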
Multi-agent Learning and Nash Equilibria
[Figure: Boutilier's coordination game: from the start state x1, joint actions (a,·) lead to x2 and (b,·) to x3; from x2, matching actions (a,a or b,b) reach x4 with reward +10, while mismatching actions (a,b or b,a) reach x5 with −10; from x3 every joint action (·,·) reaches x6 with +9]
[Bout 99, Devl 11]
Multi-agent Learning and Nash Equilibria
Φ(x1) = 0, Φ(x2) = 0, Φ(x3) = 0, Φ(x4) = 0, Φ(x5) = M, Φ(x6) = 0
◮ When M is sufficiently large, we have a new Nash equilibrium.
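A quick check of why this happens, assuming γ = 1 and taking M = 100 for illustration (the specific value of M is an assumption; any sufficiently large value works):

reach x4 (coordinate):      +10 + γΦ(x4) − Φ(x2) = 10
reach x5 (miscoordinate):   −10 + γΦ(x5) − Φ(x2) = −10 + M = 90
reach x6 (safe branch):      +9 + γΦ(x6) − Φ(x3) = 9

Jointly miscoordinating now pays 90 in the shaped game, and a unilateral deviation drops a player to 10, so deliberate miscoordination at x2 becomes a Nash equilibrium that the original game does not have.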
PAC-MDP Reinforcement Learning and R-max
Optimism in AI and Optimisation
◮ A*
◮ Branch-and-Bound
◮ R-max and optimistic potential functions [Asmu 08]
Sufficient conditions for R-max
◮ ∀s ∈ Goals: Φ(s) = 0
◮ ∀s ∈ Known: Φ(s) = C, where C is an arbitrary number
◮ ∀s ∈ Unknown: Φ(s) ≥ 0
◮ where Goals ∩ Known ∩ Unknown = ∅
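A minimal sketch of a potential satisfying these conditions, assuming the learner can query its Goals and Known sets as R-max maintains them (the function name and the optimistic value R_max/(1 − γ) for unknown states are illustrative assumptions, not from the talk):

  def optimistic_potential(s, goals, known, r_max, gamma, C=0.0):
      # Sufficient conditions from the slide:
      #   Phi(s) = 0  for goal states,
      #   Phi(s) = C  for known states (C arbitrary),
      #   Phi(s) >= 0 for unknown states.
      if s in goals:
          return 0.0
      if s in known:
          return C
      # One optimistic choice for unknown states: an upper bound on the
      # achievable discounted return (an assumption, in the spirit of [Asmu 08]).
      return r_max / (1.0 - gamma)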
MDP Planning: Infinite-horizon
◮ MDP solution methods: linear programming
◮ F(s, a, s′) = γΦ(s′) − Φ(s)
◮ The impact of reward shaping:

∑_{s,a,s′} λ(s, a) T(s, a, s′) F(s, a, s′) = −∑_{s′} Φ(s′) µ(s′)

◮ Here λ(s, a) is the occupation measure from the LP dual and µ is the initial-state distribution, so the total shaping contribution is a policy-independent constant: shaping cannot change the optimal policy.
MDP Planning: Finite-Horizon
∑_{s∈S\G} ∑_{a∈A} ∑_{s′∈S} λ(s, a) T(s, a, s′) F(s, a, s′) = ∑_{s′∈G} Φ(s′) ∑_{s∈S\G} ∑_{a∈A} λ(s, a) T(s, a, s′)

◮ The right-hand side weighs each goal's potential by the probability of terminating there, a quantity that depends on the policy unless Φ(s′) = 0 for every goal s′ ∈ G: this is the episodic insight.
References I
[Asmu 08] J. Asmuth, M. L. Littman, and R. Zinkov. "Potential-based Shaping in Model-based Reinforcement Learning". In: Proceedings of AAAI, 2008.
[Bout 99] C. Boutilier. "Sequential Optimality and Coordination in Multiagent Systems". In: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 478–485, 1999.
[Devl 11] S. Devlin and D. Kudenko. "Theoretical Considerations of Potential-Based Reward Shaping for Multi-Agent Systems". In: Proceedings of AAMAS, 2011.
[Grze 10] M. Grzes. Improving Exploration in Reinforcement Learning through Domain Knowledge and Parameter Analysis. PhD thesis, University of York, 2010.
[Ng 99] A. Y. Ng, D. Harada, and S. J. Russell. "Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping". In: Proceedings of the 16th International Conference on Machine Learning, pp. 278–287, 1999.
[Sutt 98] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.