Multiple-Step Greedy Policies in Online and Approximate Reinforcement Learning - PowerPoint PPT Presentation



SLIDE 1

Multiple-Step Greedy Policies in Online and Approximate Reinforcement Learning

Neural Information Processing Systems, December '18
Yonathan Efroni 1, Gal Dalal 1, Bruno Scherrer 2, Shie Mannor 1

1 Department of Electrical Engineering, Technion, Israel
2 INRIA, Villers-lès-Nancy, France

SLIDE 2

Motivation: Impressive Empirical Success

Multiple-step lookahead policies in RL give state-of-the-art performance.

SLIDE 3

Motivation: Impressive Empirical Success

Multiple-step lookahead policies in RL give state-of-the-art performance.

◮ Model Predictive Control (MPC) in RL: Negenborn et al. (2005); Ernst et al. (2009); Zhang et al. (2016); Tamar et al. (2017); Nagabandi et al. (2018), and many more...

SLIDE 4

Motivation: Impressive Empirical Success

Multiple-step lookahead policies in RL give state-of-the-art performance.

◮ Model Predictive Control (MPC) in RL: Negenborn et al. (2005); Ernst et al. (2009); Zhang et al. (2016); Tamar et al. (2017); Nagabandi et al. (2018), and many more...
◮ Monte Carlo Tree Search (MCTS) in RL: Tesauro and Galperin (1997); Baxter et al. (1999); Sheppard (2002); Veness et al. (2009); Lai (2015); Silver et al. (2017); Amos et al. (2018), and many more...

SLIDE 5

Motivation: Despite the Impressive Empirical Success...

SLIDE 6

Motivation: Despite the Impressive Empirical Success...

Theory on how to combine multiple-step lookahead policies with RL is scarce.

SLIDE 7

Motivation: Despite the Impressive Empirical Success...

Theory on how to combine multiple-step lookahead policies with RL is scarce.

Bertsekas and Tsitsiklis (1995); Efroni et al. (2018): Multiple-step greedy policies at the improvement stage of Policy Iteration.

SLIDE 8

Motivation: Despite the Impressive Empirical Success...

Theory on how to combine multiple-step lookahead policies with RL is scarce.

Bertsekas and Tsitsiklis (1995); Efroni et al. (2018): Multiple-step greedy policies at the improvement stage of Policy Iteration.

Here: Extend to online and approximate RL.

SLIDE 9

Multiple-Step Greedy Policies: h-Greedy Policy

h-Greedy Policy w.r.t. vπ:

SLIDE 10

Multiple-Step Greedy Policies: h-Greedy Policy

h-Greedy Policy w.r.t. vπ: Optimal first action in the h-horizon, γ-discounted Markov Decision Process with total reward $\sum_{t=0}^{h-1} \gamma^t r(s_t, \pi_t(s_t)) + \gamma^h v^{\pi}(s_h)$.

SLIDE 11

Multiple-Step Greedy Policies: h-Greedy Policy

h-Greedy Policy w.r.t. vπ: Optimal first action in the h-horizon, γ-discounted Markov Decision Process with total reward $\sum_{t=0}^{h-1} \gamma^t r(s_t, \pi_t(s_t)) + \gamma^h v^{\pi}(s_h)$.

[Figure: the h = 2-greedy policy as a tree search from s0, with edge rewards r(s0, π0(s0)), γ r(s1, π1(s1)) and leaf value γ² vπ(s2).]

SLIDE 12

Multiple-Step Greedy Policies: h-Greedy Policy

h-Greedy Policy w.r.t. vπ: Optimal first action in the h-horizon, γ-discounted Markov Decision Process with total reward $\sum_{t=0}^{h-1} \gamma^t r(s_t, \pi_t(s_t)) + \gamma^h v^{\pi}(s_h)$.

[Figure: the h = 2-greedy policy as a tree search from s0; the path with maximal total reward r(s0, π0(s0)) + γ r(s1, π1(s1)) + γ² vπ(s2) is highlighted.]

SLIDE 13

Multiple-Step Greedy Policies: h-Greedy Policy

h-Greedy Policy w.r.t. vπ: Optimal first action in the h-horizon, γ-discounted Markov Decision Process with total reward $\sum_{t=0}^{h-1} \gamma^t r(s_t, \pi_t(s_t)) + \gamma^h v^{\pi}(s_h)$.

[Figure: the h = 2-greedy policy as a tree search from s0; the path with maximal total reward is highlighted, so the h-greedy policy selects the left action.]
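To make the definition concrete, here is a minimal sketch of an h-greedy action computed by backward induction over a depth-h lookahead tree on a small tabular MDP. This is an illustration, not the authors' code; the data structures P (with P[s][a] a list of (probability, next state, reward) triples) and v_pi (a value estimate for the current policy π) are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical tabular MDP: P[s][a] = list of (prob, next_state, reward) triples;
# v_pi[s] = value estimate of the current policy pi.

def lookahead_value(P, v_pi, s, h, gamma):
    """Optimal h-step return from s (backward induction), using gamma^h * v_pi
    as the terminal value: max of sum_{t<h} gamma^t r_t + gamma^h v_pi(s_h)."""
    if h == 0:
        return v_pi[s]
    return max(
        sum(p * (r + gamma * lookahead_value(P, v_pi, s2, h - 1, gamma))
            for p, s2, r in P[s][a])
        for a in range(len(P[s]))
    )

def h_greedy_action(P, v_pi, s, h, gamma):
    """First action of the h-greedy policy w.r.t. v_pi at state s."""
    q = [sum(p * (r + gamma * lookahead_value(P, v_pi, s2, h - 1, gamma))
             for p, s2, r in P[s][a])
         for a in range(len(P[s]))]
    return int(np.argmax(q))
```

With h = 1 this reduces to the usual 1-step greedy action; larger h trades more computation per decision for a stronger improvement step.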

SLIDE 14

Multiple-Step Greedy Policies: κ-Greedy Policy

κ-Greedy Policy w.r.t. vπ: Optimal action when Pr(solve the h-horizon MDP) = (1 − κ)κ^{h−1}.

SLIDE 15

Multiple-Step Greedy Policies: κ-Greedy Policy

κ-Greedy Policy w.r.t. vπ: Optimal action when Pr(solve the h-horizon MDP) = (1 − κ)κ^{h−1}.

[Figure: the κ-greedy policy as a geometrically weighted combination of lookahead trees, with Pr(h = 1) = (1 − κ), Pr(h = 2) = (1 − κ)κ, Pr(h = 3) = (1 − κ)κ², ...]
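A sketch of the κ-greedy action, reusing the hypothetical P and v_pi structures from the previous sketch. It uses the equivalence shown in Efroni et al. (2018) between this geometric-horizon definition and a surrogate MDP with shaped reward r(s, a) + (1 − κ)γ E[vπ(s′)] and discount κγ; setting κ = 0 recovers the 1-step greedy action, while κ → 1 approaches the fully greedy (optimal) action.

```python
import numpy as np

def kappa_greedy_action(P, v_pi, s, kappa, gamma, sweeps=200):
    """kappa-greedy action w.r.t. v_pi at state s: value iteration on the
    surrogate MDP with reward r + (1 - kappa) * gamma * v_pi(s') and
    discount kappa * gamma, then the greedy first action at s."""
    n_states = len(P)

    def q(st, a, u):
        return sum(p * (r + (1 - kappa) * gamma * v_pi[s2] + kappa * gamma * u[s2])
                   for p, s2, r in P[st][a])

    u = np.array(v_pi, dtype=float)
    for _ in range(sweeps):  # value iteration with effective discount kappa * gamma
        u = np.array([max(q(st, a, u) for a in range(len(P[st])))
                      for st in range(n_states)])
    return int(np.argmax([q(s, a, u) for a in range(len(P[s]))]))
```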

SLIDE 16

1-Step Greedy Policies and Soft Updates

Soft update using a 1-step greedy policy improves the policy.

SLIDE 17

1-Step Greedy Policies and Soft Updates

Soft update using a 1-step greedy policy improves the policy.

A bit formally,
◮ Let π be a policy,

SLIDE 18

1-Step Greedy Policies and Soft Updates

Soft update using a 1-step greedy policy improves the policy.

A bit formally,
◮ Let π be a policy,
◮ Let πG1 be the 1-step greedy policy w.r.t. vπ.

SLIDE 19

1-Step Greedy Policies and Soft Updates

Soft update using a 1-step greedy policy improves the policy.

A bit formally,
◮ Let π be a policy,
◮ Let πG1 be the 1-step greedy policy w.r.t. vπ.
Then, ∀α ∈ [0, 1], (1 − α)π + απG1 is always better than π.

SLIDE 20

1-Step Greedy Policies and Soft Updates

Soft update using a 1-step greedy policy improves the policy.

A bit formally,
◮ Let π be a policy,
◮ Let πG1 be the 1-step greedy policy w.r.t. vπ.
Then, ∀α ∈ [0, 1], (1 − α)π + απG1 is always better than π.

This fact is important in: two-timescale online PI (Konda and Borkar, 1999), Conservative PI (Kakade and Langford, 2002), TRPO (Schulman et al., 2015), and many more...
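A minimal sketch of such a soft update for tabular stochastic policies stored as arrays of action probabilities (the representation pi[s, a] and the names below are hypothetical; this is not the cited algorithms themselves):

```python
import numpy as np

def one_step_greedy_policy(P, v_pi, gamma, n_actions):
    """Deterministic 1-step greedy policy w.r.t. v_pi, encoded one-hot."""
    n_states = len(P)
    pi_g1 = np.zeros((n_states, n_actions))
    for s in range(n_states):
        q = [sum(p * (r + gamma * v_pi[s2]) for p, s2, r in P[s][a])
             for a in range(len(P[s]))]
        pi_g1[s, int(np.argmax(q))] = 1.0
    return pi_g1

def soft_update(pi, pi_g1, alpha):
    """Mixture policy (1 - alpha) * pi + alpha * pi_g1; with the 1-step greedy
    policy this is an improvement over pi for every alpha in [0, 1]."""
    return (1.0 - alpha) * pi + alpha * pi_g1
```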

SLIDE 21

Negative Result on Multiple-Step Greedy Policies

Soft update using a multiple-step greedy policy does not necessarily improve the policy.

SLIDE 22

Negative Result on Multiple-Step Greedy Policies

Soft update using a multiple-step greedy policy does not necessarily improve the policy.

Necessary and sufficient condition: α is large enough.

SLIDE 23

Negative Result on Multiple-Step Greedy Policies

Soft update using a multiple-step greedy policy does not necessarily improve the policy.

Necessary and sufficient condition: α is large enough.

Theorem 1

Let πGh and πGκ be the h-greedy and κ-greedy policies w.r.t. vπ. Then:

SLIDE 24

Negative Result on Multiple-Step Greedy Policies

Soft update using a multiple-step greedy policy does not necessarily improve the policy.

Necessary and sufficient condition: α is large enough.

Theorem 1

Let πGh and πGκ be the h-greedy and κ-greedy policies w.r.t. vπ. Then:
◮ For h > 1, (1 − α)π + απGh is always better than π iff α = 1.

SLIDE 25

Negative Result on Multiple-Step Greedy Policies

Soft update using a multiple-step greedy policy does not necessarily improve the policy.

Necessary and sufficient condition: α is large enough.

Theorem 1

Let πGh and πGκ be the h-greedy and κ-greedy policies w.r.t. vπ. Then:
◮ For h > 1, (1 − α)π + απGh is always better than π iff α = 1.
◮ (1 − α)π + απGκ is always better than π iff α ≥ κ.
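Purely as an illustration of the condition in Theorem 1, a soft update with a κ-greedy policy might guard the step size as follows (same hypothetical policy arrays as in the earlier sketch):

```python
def soft_kappa_update(pi, pi_g_kappa, alpha, kappa):
    """Soft update with a kappa-greedy policy; by Theorem 1, improvement is
    guaranteed only when alpha >= kappa, so that condition is enforced."""
    assert alpha >= kappa, "Theorem 1: need alpha >= kappa for guaranteed improvement"
    return (1.0 - alpha) * pi + alpha * pi_g_kappa
```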

SLIDE 26

How to Circumvent the Problem? (and have Theoretical Guarantees)

SLIDE 27

How to Circumvent the Problem? (and have Theoretical Guarantees)

Give ‘natural’ solutions to the problem with theoretical guarantees:

SLIDE 28

How to Circumvent the Problem? (and have Theoretical Guarantees)

Give ‘natural’ solutions to the problem with theoretical guarantees:
◮ Two-timescale, online, multiple-step PI.

SLIDE 29

How to Circumvent the Problem? (and have Theoretical Guarantees)

Give ‘natural’ solutions to the problem with theoretical guarantees:
◮ Two-timescale, online, multiple-step PI.
◮ Approximate multiple-step PI methods.

SLIDE 30

How to Circumvent the Problem? (and have Theoretical Guarantees)

Give ‘natural’ solutions to the problem with theoretical guarantees:
◮ Two-timescale, online, multiple-step PI.
◮ Approximate multiple-step PI methods.

Open Problem: More techniques to circumvent the problem.
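For context, the basic loop these methods build on is multiple-step policy iteration: evaluate the current policy, then replace it with the κ-greedy policy w.r.t. that estimate (a hard, α = 1 update). The sketch below is schematic only; it performs exact tabular evaluation where the talk's approximate methods would use function approximation or online estimates, and it reuses the hypothetical kappa_greedy_action from the earlier sketch.

```python
import numpy as np

def evaluate_policy(P, pi, gamma, n_states):
    """Exact tabular policy evaluation: solve (I - gamma * P_pi) v = r_pi."""
    P_pi = np.zeros((n_states, n_states))
    r_pi = np.zeros(n_states)
    for s in range(n_states):
        for a in range(len(P[s])):
            for p, s2, r in P[s][a]:
                P_pi[s, s2] += pi[s, a] * p
                r_pi[s] += pi[s, a] * p * r
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def kappa_policy_iteration(P, gamma, kappa, n_states, n_actions, iters=50):
    """Schematic multiple-step (kappa) PI: evaluation followed by a hard
    update to the kappa-greedy policy w.r.t. the current value estimate."""
    pi = np.full((n_states, n_actions), 1.0 / n_actions)
    for _ in range(iters):
        v_pi = evaluate_policy(P, pi, gamma, n_states)
        new_pi = np.zeros_like(pi)
        for s in range(n_states):
            new_pi[s, kappa_greedy_action(P, v_pi, s, kappa, gamma)] = 1.0
        pi = new_pi
    return pi
```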

SLIDE 31

Take Home Messages

◮ Important difference between multiple- and 1-step greedy methods.

SLIDE 32

Take Home Messages

◮ Important difference between multiple- and 1-step greedy methods.
◮ Multiple-step PI has theoretical benefits (more discussion at the poster session).

SLIDE 33

Take Home Messages

◮ Important difference between multiple- and 1-step greedy methods.
◮ Multiple-step PI has theoretical benefits (more discussion at the poster session).
◮ Further study should be devoted to multiple-step greedy methods.

SLIDE 34

Amos, B., Dario Jimenez Rodriguez, I., Sacks, J., Boots, B., and Kolter, Z. (2018). Differentiable MPC for end-to-end planning and control. Advances in Neural Information Processing Systems.

Baxter, J., Tridgell, A., and Weaver, L. (1999). TDLeaf(lambda): Combining temporal difference learning with game-tree search. arXiv preprint cs/9901001.

Bertsekas, D. P. and Tsitsiklis, J. N. (1995). Neuro-dynamic programming: an overview. In Decision and Control, 1995, Proceedings of the 34th IEEE Conference on, volume 1. IEEE.

Efroni, Y., Dalal, G., Scherrer, B., and Mannor, S. (2018). Beyond the one-step greedy approach in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, pages 1386–1395.

Ernst, D., Glavic, M., Capitanescu, F., and Wehenkel, L. (2009). Reinforcement learning versus model predictive control: a comparison on a power system problem. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2):517–529.

Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, pages 267–274.

SLIDE 35

Konda, V. R. and Borkar, V. S. (1999). Actor-critic-type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 38(1):94–123.

Lai, M. (2015). Giraffe: Using deep reinforcement learning to play chess. arXiv preprint arXiv:1509.01549.

Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. (2018). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE.

Negenborn, R. R., De Schutter, B., Wiering, M. A., and Hellendoorn, H. (2005). Learning-based model predictive control for Markov decision processes. IFAC Proceedings Volumes, 38(1):354–359.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897.

Sheppard, B. (2002). World-championship-caliber Scrabble. Artificial Intelligence, 134(1-2):241–275.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676):354.

SLIDE 36

Tamar, A., Thomas, G., Zhang, T., Levine, S., and Abbeel, P. (2017). Learning from the hindsight plan: episodic MPC improvement. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 336–343. IEEE.

Tesauro, G. and Galperin, G. R. (1997). On-line policy improvement using Monte-Carlo search. In Advances in Neural Information Processing Systems, pages 1068–1074.

Veness, J., Silver, D., Blair, A., and Uther, W. (2009). Bootstrapping from game tree search. In Advances in Neural Information Processing Systems, pages 1937–1945.

Zhang, T., Kahn, G., Levine, S., and Abbeel, P. (2016). Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 528–535. IEEE.
