NeurIPS 2018, Montreal, December 5th
Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes
Ronan Fruit† Matteo Pirotta∗ Alessandro Lazaric∗
†SequeL – INRIA Lille ∗FAIR – Facebook Paris
TUCRL
1 Breakout [Mnih et al., 2015]
Figure: the initial state s1; a plausible state after some time...; a state that is not reachable from s1, so it can never be observed!
Exploration–exploitation in RL with Misspecified State Space - R. Fruit
SequeL - 1/5
TUCRL
All existing methods known to efficiently balance exploration and exploitation in RL with theoretical guarantees rely on the optimism-in-the-face-of-uncertainty principle. All such methods fail to learn when the state space is misspecified.
Example 1 of Ortner [2008]: a state s with two actions, a0 with reward r0 = 0 and a1 with reward r1 = 1/2, plus a state s′ that is not reachable from s. Since s′ is never observed, optimism keeps the upper confidence bound on a0 at r0+ = 1 = rmax, via a hypothetical transition from s to s′.
Why not simply ignore s′? Because ignoring it leads to linear regret if s′ is in fact reachable.
Problem: The action played keeps changing: it is a0 half of the time and a1 the other half ⇒ linear regret!
Exploration–exploitation in RL with Misspecified State Space - R. Fruit
SequeL - 2/5
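To make the failure concrete, here is a minimal Python sketch of Ortner's example with an optimistic learner. This is not UCRL2's actual episode mechanics; the confidence-bound constant and all variable names are illustrative assumptions. The point it shows: any positive upper bound on the unobserved transition probability p(s′ | s, a0) lets the optimistic model reach an rmax-rewarding state, so a0 is never abandoned.

```python
import math

# Toy replay of Example 1 of Ortner [2008] with an optimistic learner.
# State s has actions a0 (true reward 0) and a1 (true reward 1/2).
# State s' is NOT reachable from s, but the learner cannot know that:
# the upper confidence bound on p(s' | s, a0) stays positive forever,
# and in the optimistic model reaching s' yields average reward rmax = 1.

T = 10_000
rmax = 1.0
n0 = 0              # number of times a0 has been played
total_reward = 0.0

for t in range(1, T + 1):
    # Hoeffding-style upper bound on the never-observed transition
    # probability (illustrative constant, not the one in UCRL2/TUCRL).
    p_up = min(1.0, math.sqrt(math.log(t + 1) / max(n0, 1)))
    # Optimistic average reward of a0: any chance of reaching the
    # rmax-rewarding state s' drives the optimistic gain up to rmax.
    gain0 = rmax if p_up > 0 else 0.0
    gain1 = 0.5     # known reward of a1
    if gain0 > gain1:
        n0 += 1                 # play a0, collect nothing
    else:
        total_reward += 0.5     # play a1

regret = 0.5 * T - total_reward
print(n0, regret / T)   # a0 is played every step -> regret grows linearly
```

The sketch collapses UCRL2's episodic updates, under which the played action alternates between a0 and a1 as the slide describes; either way a constant fraction of the steps is wasted on a0, which is exactly what makes the regret linear.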
TUCRL
Regret of existing methods: scales with the total number of states (and the diameter D).
Misspecified state space ⇐⇒ D = +∞ (infinite diameter), so these guarantees become vacuous.
TUCRL: the first algorithm able to adapt to the reachable part of the MDP.
Regret: scales with the number of reachable states only.
Exploration–exploitation in RL with Misspecified State Space - R. Fruit
SequeL - 3/5
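The contrast on this slide can be written out as follows, in a simplified form that drops constants and lower-order terms (the exact TUCRL bound in the paper has a finer dependence; S^C and D^C below denote the number of states and the diameter of the part of the MDP reachable from the initial state):

```latex
\underbrace{\widetilde{O}\!\left(D\,S\sqrt{A\,T}\right)}_{\substack{\text{existing methods (e.g., UCRL2):}\\ \text{total number of states } S,\ \text{diameter } D}}
\qquad \text{vs.} \qquad
\underbrace{\widetilde{O}\!\left(D^{\mathcal{C}}\,S^{\mathcal{C}}\sqrt{A\,T}\right)}_{\substack{\text{TUCRL:}\\ \text{reachable states } S^{\mathcal{C}},\ \text{restricted diameter } D^{\mathcal{C}}}}
```

When the state space is misspecified, D = +∞ and the left-hand bound is vacuous, while D^C and S^C remain finite, so TUCRL's guarantee survives.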