SLIDE 1

Reward Shaping in Episodic Reinforcement Learning

Marek Grześ

Canterbury, UK. AAMAS 2017, São Paulo, May 8–12

SLIDE 2

Motivating Reward Shaping

SLIDE 3

Reinforcement Learning

[Figure: the agent-environment interaction loop [Sutt 98]: the agent in state st takes action at; the environment returns reward rt+1 and next state st+1]

Temporal credit assignment problem

SLIDE 4

Deep Reinforcement Learning

[Figure: the same interaction loop (action at, reward rt+1, state st+1), here for a deep RL agent]

SLIDE 5

Challenges


◮ Temporal credit assignment problem
◮ In games, we can just generate more data for reinforcement learning
◮ However, ‘more learning’ in neural networks can be a challenge ... (see next slide)

SLIDE 6

Contradictory Objectives

http://www.deeplearningbook.org

◮ Easy to overfit
◮ Early stopping is a potential regulariser, but we need a lot of training to address the temporal credit assignment problem
◮ Conclusion: it can be useful to mitigate the temporal credit assignment problem using reward shaping!

SLIDE 7

Reward Shaping

◮ st, at, st+1, rt+1
◮ rt+1 goes to Q-learning, SARSA, R-max, etc.
◮ Shaped: rt+1 + F(st, at, st+1)
◮ where F(st, at, st+1) = γΦ(st+1) − Φ(st)
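As a concrete sketch, the shaping term simply augments the environment reward inside an otherwise unchanged tabular Q-learning update. This is an illustration, not code from the talk; the potential function `phi` and the dict-based Q-table are assumptions.

    from collections import defaultdict

    GAMMA, ALPHA = 0.99, 0.1
    Q = defaultdict(float)  # tabular Q-values, keyed by (state, action)

    def shaped_q_update(s, a, r, s_next, actions, phi):
        # Potential-based shaping: F(s, a, s') = gamma * Phi(s') - Phi(s)
        F = GAMMA * phi(s_next) - phi(s)
        # Standard Q-learning target, with the shaping reward added to r
        target = r + F + GAMMA * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])

    # Example with a trivial potential; in episodic tasks, remember that
    # Phi should be 0 at goal states (see the later slides).
    shaped_q_update("s0", "right", 0.0, "s1", ["left", "right"], phi=lambda s: 0.0)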

SLIDE 8

Policy Invariance under Reward Transformations

Potential-based reward shaping is necessary and sufficient to guarantee policy invariance [Ng 99].
Straightforward to show in infinite-horizon MDPs [Asmu 08].
Investigating episodic learning leads to new insights.

SLIDE 9

Problematic Example in Single-agent RL

[Figure: initial state si (Φ = 0) with two actions, a1 to goal g1 and a2 to goal g2; the goal rewards are r = 0 and r = 100, and the goal potentials shown are Φ = 1000 and Φ = 10]

[Grze 10]

◮ F(s, goal) = 0 in my PhD thesis
◮ [Ng 99] required F(goal, ·) = 0
◮ Φ(goal) = 0 is what is necessary
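A small worked sketch of why a non-zero goal potential breaks policy invariance in episodic tasks. The assignment of the potentials to the goals (Φ = 1000 on the r = 0 goal g1, Φ = 10 on the r = 100 goal g2) is an assumption read off the figure:

    # With gamma = 1, the shaped return of an episode ending in goal g
    # telescopes to  R + Phi(g) - Phi(si).  The potential-to-goal mapping
    # below is an assumption taken from the figure, not stated in the text.
    GAMMA = 1.0
    phi = {"si": 0.0, "g1": 1000.0, "g2": 10.0}
    reward = {"g1": 0.0, "g2": 100.0}

    def shaped_return(goal):
        # single-step episode: si -> goal
        return reward[goal] + GAMMA * phi[goal] - phi["si"]

    print(shaped_return("g1"))  # 1000.0: shaping now prefers the r = 0 goal
    print(shaped_return("g2"))  # 110.0
    # Forcing Phi(g) = 0 at every goal restores the unshaped preference.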

SLIDE 10

Multi-agent Learning and Nash Equilibria

[Figure: a two-agent coordination game: from the start state x1, agent 1's action a leads to x2 and b leads to x3 (labels a,* and b,*); from x2, matching joint actions (a,a; b,b) reach x4 with reward +10, while mismatching ones (a,b; b,a) reach x5 with reward −10; from x3, any joint action (*,*) reaches x6 with reward +9]

[Bout 99, Devl 11]

SLIDE 11

Multi-agent Learning and Nash Equilibria

[Figure: the same game as the previous slide, now annotated with potentials]

Φ(x1) = 0, Φ(x2) = 0, Φ(x3) = 0, Φ(x4) = 0, Φ(x5) = M, Φ(x6) = 0
When M is sufficiently large, we have a new Nash equilibrium.
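A sketch of the equilibrium shift, assuming γ = 1 and the game structure reconstructed above. Because all intermediate potentials are zero, an episode ending in x5 collects total shaping M, so the joint-action subgame at x2 has shaped payoffs 10 for matching and M − 10 for mismatching; the brute-force equilibrium check below is illustrative only:

    # Pure-strategy Nash equilibria of the (identical-payoff) subgame at x2
    # under shaped returns. Hedged sketch: assumes gamma = 1 and the game
    # structure as reconstructed from the figure.
    def x2_payoff(a1, a2, M):
        return 10.0 if a1 == a2 else -10.0 + M  # reach x4, or x5 with Phi = M

    def pure_nash(M):
        acts = ("a", "b")
        eqs = []
        for a1 in acts:
            for a2 in acts:
                u = x2_payoff(a1, a2, M)
                # equilibrium: no agent gains by deviating unilaterally
                if all(x2_payoff(d, a2, M) <= u for d in acts) and \
                   all(x2_payoff(a1, d, M) <= u for d in acts):
                    eqs.append((a1, a2))
        return eqs

    print(pure_nash(M=0))   # [('a', 'a'), ('b', 'b')]: the original equilibria
    print(pure_nash(M=30))  # [('a', 'b'), ('b', 'a')]: new equilibria appear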

SLIDE 12

PAC-MDP Reinforcement Learning and R-max

Optimism in AI and Optimisation

◮ A*
◮ Branch-and-Bound
◮ R-max and optimistic potential functions [Asmu 08]

SLIDE 13

PAC-MDP Reinforcement Learning and R-max

Optimism in AI and Optimisation

◮ A*
◮ Branch-and-Bound
◮ R-max and optimistic potential functions [Asmu 08]

Sufficient conditions for R-max

◮ ∀s ∈ Goals: Φ(s) = 0
◮ ∀s ∈ Known: Φ(s) = C, where C is an arbitrary number
◮ ∀s ∈ Unknown: Φ(s) ≥ 0
◮ where Goals, Known, and Unknown are pairwise disjoint
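A minimal sketch of a potential function meeting these conditions; the set-based state partition and the optimistic bound `v_max` are illustrative assumptions, not an interface from [Asmu 08]:

    # Build a potential satisfying the sufficient conditions above:
    # zero on goals, a constant C on known states, optimistic elsewhere.
    def make_potential(goals, known, v_max, C=0.0):
        assert v_max >= 0.0  # unknown states need Phi(s) >= 0
        def phi(s):
            if s in goals:
                return 0.0    # ∀s ∈ Goals: Φ(s) = 0
            if s in known:
                return C      # ∀s ∈ Known: Φ(s) = C
            return v_max      # ∀s ∈ Unknown: optimistic value
        return phi

    phi = make_potential(goals={"g"}, known={"s0", "s1"}, v_max=100.0)
    print(phi("g"), phi("s0"), phi("s2"))  # 0.0 0.0 100.0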

SLIDE 14

MDP Planning: Infinite-Horizon

◮ MDP solution methods: linear programming
◮ F(s, a, s′) = γΦ(s′) − Φ(s)
◮ The impact of reward shaping:

∑_{s,a,s′} λ(s, a) T(s, a, s′) F(s, a, s′) = −∑_{s′} Φ(s′) µ(s′)

where λ(s, a) is the state-action occupancy measure (the dual-LP variable) and µ is the initial-state distribution.
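A sketch of the derivation, assuming the standard dual-LP flow constraint ∑_a λ(s′, a) = µ(s′) + γ ∑_{s,a} λ(s, a) T(s, a, s′) and using ∑_{s′} T(s, a, s′) = 1:

    \begin{aligned}
    \sum_{s,a,s'} \lambda(s,a)\,T(s,a,s')\,\bigl(\gamma\Phi(s') - \Phi(s)\bigr)
      &= \gamma\sum_{s'}\Phi(s')\sum_{s,a}\lambda(s,a)\,T(s,a,s')
         - \sum_{s}\Phi(s)\sum_{a}\lambda(s,a) \\
      &= \sum_{s'}\Phi(s')\Bigl(\sum_{a}\lambda(s',a) - \mu(s')\Bigr)
         - \sum_{s}\Phi(s)\sum_{a}\lambda(s,a) \\
      &= -\sum_{s'}\Phi(s')\,\mu(s').
    \end{aligned}

The right-hand side does not depend on λ, so in the infinite-horizon LP shaping shifts the objective by a policy-independent constant.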

SLIDE 15

MDP Planning: Finite-Horizon

∑_{s∈S\G} ∑_{a∈A} ∑_{s′∈S} λ(s, a) T(s, a, s′) F(s, a, s′) = ∑_{s′∈G} Φ(s′) ∑_{s∈S\G} ∑_{a∈A} λ(s, a) T(s, a, s′)
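The same point at the level of a single episode, as a sketch assuming γ = 1 and termination in a goal state s_T ∈ G, is that the shaping rewards telescope:

    \sum_{t=0}^{T-1} F(s_t, a_t, s_{t+1})
      = \sum_{t=0}^{T-1} \bigl(\Phi(s_{t+1}) - \Phi(s_t)\bigr)
      = \Phi(s_T) - \Phi(s_0).

Unlike the infinite-horizon case, the total shaping contribution now depends on which goal the episode reaches, so it is policy-dependent unless Φ(s′) = 0 for every s′ ∈ G.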

SLIDE 16

References I

[Asmu 08] J. Asmuth, M. L. Littman, and R. Zinkov. "Potential-based Shaping in Model-based Reinforcement Learning". In: Proceedings of AAAI, 2008.

[Bout 99] C. Boutilier. "Sequential Optimality and Coordination in Multiagent Systems". In: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 478–485, 1999.

[Devl 11] S. Devlin and D. Kudenko. "Theoretical Considerations of Potential-Based Reward Shaping for Multi-Agent Systems". In: Proceedings of AAMAS, 2011.

[Grze 10] M. Grzes. Improving Exploration in Reinforcement Learning through Domain Knowledge and Parameter Analysis. PhD thesis, University of York, 2010.

[Ng 99] A. Y. Ng, D. Harada, and S. J. Russell. "Policy Invariance under Reward Transformations: Theory and Application to Reward Shaping". In: Proceedings of the 16th International Conference on Machine Learning, pp. 278–287, 1999.

[Sutt 98] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.