NeurIPS 2018, Montreal, December 5th
Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes
Ronan Fruit† Matteo Pirotta∗ Alessandro Lazaric∗
†SequeL – INRIA Lille ∗FAIR – Facebook Paris
TUCRL
1 Breakout [Mnih et al., 2015]
Figure: the initial state s1; a plausible state after some time...; a state that is not reachable from s1, so it can never be observed!
Exploration–exploitation in RL with Misspecified State Space - R. Fruit
SequeL - 1/5
TUCRL
All existing methods known to efficiently balance exploration and exploitation in RL with theoretical guarantees rely on the optimism-in-the-face-of-uncertainty principle. All such methods fail to learn when the state space is misspecified.
Example 1 of Ortner [2008]: a state s with two actions, a0 with reward r0 = 0 and a1 with reward r1 = 1/2, plus a state s′ that is not reachable from s. Since s′ is never observed, optimism keeps the upper confidence bound on a0 at r0+ = 1 = rmax, via a hypothetical transition from s to s′.
Why not simply ignore s′? Because ignoring it leads to linear regret if s′ is in fact reachable.
Problem: The action played keeps changing: it is a0 half of the time and a1 the other half ⇒ linear regret!
Exploration–exploitation in RL with Misspecified State Space - R. Fruit
SequeL - 2/5
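To make the failure concrete, here is a minimal Python sketch of Ortner's example with an optimistic learner. This is not UCRL2's actual episode mechanics; the confidence-bound constant and all variable names are illustrative assumptions. The point it shows: any positive upper bound on the unobserved transition probability p(s′ | s, a0) lets the optimistic model reach an rmax-rewarding state, so a0 is never abandoned.

```python
import math

# Toy replay of Example 1 of Ortner [2008] with an optimistic learner.
# State s has actions a0 (true reward 0) and a1 (true reward 1/2).
# State s' is NOT reachable from s, but the learner cannot know that:
# the upper confidence bound on p(s' | s, a0) stays positive forever,
# and in the optimistic model reaching s' yields average reward rmax = 1.

T = 10_000
rmax = 1.0
n0 = 0              # number of times a0 has been played
total_reward = 0.0

for t in range(1, T + 1):
    # Hoeffding-style upper bound on the never-observed transition
    # probability (illustrative constant, not the one in UCRL2/TUCRL).
    p_up = min(1.0, math.sqrt(math.log(t + 1) / max(n0, 1)))
    # Optimistic average reward of a0: any chance of reaching the
    # rmax-rewarding state s' drives the optimistic gain up to rmax.
    gain0 = rmax if p_up > 0 else 0.0
    gain1 = 0.5     # known reward of a1
    if gain0 > gain1:
        n0 += 1                 # play a0, collect nothing
    else:
        total_reward += 0.5     # play a1

regret = 0.5 * T - total_reward
print(n0, regret / T)   # a0 is played every step -> regret grows linearly
```

The sketch collapses UCRL2's episodic updates, under which the played action alternates between a0 and a1 as the slide describes; either way a constant fraction of the steps is wasted on a0, which is exactly what makes the regret linear.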
TUCRL
Regret of existing methods: scales with the total number of states (and the diameter D).
Misspecified state space ⇐⇒ D = +∞ (infinite diameter), so these guarantees become vacuous.
TUCRL: the first algorithm able to adapt to the reachable part of the MDP.
Regret: scales with the number of reachable states only.
Exploration–exploitation in RL with Misspecified State Space - R. Fruit
SequeL - 3/5
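The contrast on this slide can be written out as follows, in a simplified form that drops constants and lower-order terms (the exact TUCRL bound in the paper has a finer dependence; S^C and D^C below denote the number of states and the diameter of the part of the MDP reachable from the initial state):

```latex
\underbrace{\widetilde{O}\!\left(D\,S\sqrt{A\,T}\right)}_{\substack{\text{existing methods (e.g., UCRL2):}\\ \text{total number of states } S,\ \text{diameter } D}}
\qquad \text{vs.} \qquad
\underbrace{\widetilde{O}\!\left(D^{\mathcal{C}}\,S^{\mathcal{C}}\sqrt{A\,T}\right)}_{\substack{\text{TUCRL:}\\ \text{reachable states } S^{\mathcal{C}},\ \text{restricted diameter } D^{\mathcal{C}}}}
```

When the state space is misspecified, D = +∞ and the left-hand bound is vacuous, while D^C and S^C remain finite, so TUCRL's guarantee survives.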