Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes

Ronan Fruit, Matteo Pirotta, Alessandro Lazaric

SequeL, INRIA Lille; FAIR, Facebook Paris

NeurIPS 2018, Montreal, December 5th


SLIDE 2

NeurIPS 2018, Montreal, December 5th

Exploration–exploitation in RL with Misspecified State Space

Ronan Fruit† Matteo Pirotta∗ Alessandro Lazaric∗

†SequeL – INRIA Lille ∗FAIR – Facebook Paris

SLIDE 9

TUCRL

Misspecified states: Examples

1. Breakout [Mnih et al., 2015]

Intuitive state space: the set of plausible configurations of wall, ball and paddle.

  • Initial state s1.
  • A plausible state after some time...
  • A plausible-looking configuration that is not reachable from s1: it cannot be observed!

Misspecified state space = there exist states that cannot be observed from the initial state, and they are difficult to exclude explicitly from the state space.

Exploration–exploitation in RL with Misspecified State Space - R. Fruit

SequeL - 1/5
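To make the definition concrete, here is a minimal sketch that computes which states can be reached from an initial state. It assumes the transition structure is known and encoded as a dictionary of successor sets; the names `reachable_states`, `transitions` and the state `"ghost"` are hypothetical and not taken from the slides. A learning agent does not have this information, which is exactly why such states are hard to exclude in advance.

```python
from collections import deque


def reachable_states(transitions, start):
    """Return the set of states reachable from `start`.

    `transitions[s]` is the set of states that some action can lead to
    from s with positive probability (a hypothetical encoding).
    """
    seen = {start}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for nxt in transitions.get(s, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Toy state space: s1, s2 and s3 are linked to each other, while "ghost"
# looks like a valid state but no transition leads to it from s1.
transitions = {
    "s1": {"s2"},
    "s2": {"s1", "s3"},
    "s3": {"s2"},
    "ghost": {"s1"},
}
reachable = reachable_states(transitions, "s1")
unreachable = set(transitions) - reachable
print(sorted(reachable), sorted(unreachable))
```

In this toy model the state space handed to the learner has four states, but only three of them can ever be observed when starting from s1.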

SLIDE 13

TUCRL

Why is exploration more challenging with a misspecified state space?

All existing methods known to efficiently balance exploration and exploitation in RL with theoretical guarantees rely on the optimism in the face of uncertainty principle.

All such methods fail to learn when the state space is misspecified.

Example 1 of Ortner [2008]: a single state s with two actions, a0 with reward r0 = 0 and a1 with reward r1 = 1/2.

With the single state s: Optimism (UCB, etc.) = Optimal Strategy
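The single-state case is a two-armed bandit, so the claim that optimism behaves well can be illustrated with the standard UCB1 index. This is only a sketch under stated assumptions: rewards are taken to be deterministic, the exploration bonus is the textbook UCB1 bonus, and the function name `ucb1` and the horizon are choices made here, not details from the slides.

```python
import math


def ucb1(rewards, horizon):
    """Run UCB1 on a bandit with deterministic rewards; return pull counts."""
    n_arms = len(rewards)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # play each arm once to initialise the estimates
        else:
            # Optimistic index: empirical mean plus an exploration bonus.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        counts[arm] += 1
        sums[arm] += rewards[arm]
    return counts


horizon = 10_000
counts = ucb1([0.0, 0.5], horizon)  # r0 = 0 and r1 = 1/2, as on the slide
regret = 0.5 * counts[0]            # every pull of a0 loses 1/2
print(counts, regret)
```

The suboptimal action a0 is pulled only a logarithmic number of times, so the regret stays small compared with the horizon.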

SLIDE 22

TUCRL

Why is exploration more challenging with a misspecified state space?

Example 1 of Ortner [2008], with a second state s′ in the state space.

Diagram labels:

  • States s and s′; s′ is not reachable from s.
  • In s: actions a0 and a1, with rewards r0 = 0 and r1 = 1/2.
  • Optimism: a0 with optimistic reward $r_0^+ = 1 = r_{\max}$; two "?" marks sit on the diagram.

Problem: the action played keeps changing: it is a0 half of the time and a1 the other half ⟹ linear regret!

Why not drop optimism and simply ignore s′? Because that gives linear regret if s′ is in fact reachable.
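The "linear regret" conclusion can be checked with one line of arithmetic. As a sketch, assume that the best achievable average reward from s is r1 = 1/2 (playing a1 forever, since s′ cannot be reached) and that a0 is played on about T/2 of the first T steps. The symbol $\Delta(T)$ for the regret and $a_t$ for the action at step t are notation introduced here, not taken from the slides.

```latex
\Delta(T) = \sum_{t=1}^{T} \bigl( r_1 - r_{a_t} \bigr)
          \approx \frac{T}{2}\,(r_1 - r_0) + \frac{T}{2}\,(r_1 - r_1)
          = \frac{T}{2} \cdot \frac{1}{2}
          = \frac{T}{4}.
```

The regret grows proportionally to T, whereas a learning algorithm should achieve regret that is sublinear in T.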

SLIDE 23

TUCRL

Our work

Regret of existing methods: $\widetilde{O}(D S \sqrt{AT})$, where D is the diameter and S is the total number of states.

Misspecified state space ⟺ D = +∞ (infinite diameter).

TUCRL: first algorithm able to adapt to the reachable part of the MDP.

Regret: $\widetilde{O}(D^{c} S^{c} \sqrt{AT})$, where $D^{c}$ is the reachable diameter and $S^{c}$ is the number of reachable states.
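The equivalence between a misspecified state space and an infinite diameter can be illustrated on a toy model. The sketch below assumes deterministic transitions, in which case the diameter reduces to the longest shortest path between two states; the function names, the `successors` encoding and the state `"ghost"` are hypothetical and only serve the illustration.

```python
import math
from collections import deque


def shortest_steps(successors, source):
    """BFS distances (in steps) from `source` in a deterministic MDP."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        s = queue.popleft()
        for nxt in successors.get(s, ()):
            if nxt not in dist:
                dist[nxt] = dist[s] + 1
                queue.append(nxt)
    return dist


def diameter(successors, states):
    """Largest shortest-path length between distinct states in `states`.

    Returns math.inf as soon as one state cannot be reached from another.
    """
    worst = 0
    for s in states:
        dist = shortest_steps(successors, s)
        for target in states:
            if target != s:
                worst = max(worst, dist.get(target, math.inf))
    return worst


# "ghost" belongs to the declared state space but cannot be reached from s1.
successors = {
    "s1": {"s2"},
    "s2": {"s1", "s3"},
    "s3": {"s2"},
    "ghost": {"s1"},
}
all_states = set(successors)
reachable = set(shortest_steps(successors, "s1"))

full_diameter = diameter(successors, all_states)
reachable_diameter = diameter(successors, reachable)
print(full_diameter, reachable_diameter)
```

On this toy model the diameter over the full state space is infinite, so a bound that scales with D is vacuous, while the diameter restricted to the reachable states is finite.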

SLIDE 24

TUCRL

Come see our poster #161!