Sample Complexity of Asynchronous Q-Learning: Sharper Non-Asymptotic Analysis and Variance Reduction

Yuxin Chen (EE, Princeton University)
Gen Li (Tsinghua EE), Yuting Wei (CMU Statistics), Yuejie Chi (CMU ECE), Yuantao Gu (Tsinghua EE)
“Sample complexity of asynchronous Q-learning: sharper analysis and variance reduction,” G. Li, Y. Wei, Y. Chi, Y. Gu, Y. Chen, NeurIPS 2020
Reinforcement learning (RL)
RL challenges
- Unknown or changing environments
- Delayed rewards
- Enormous state and action space
Sample efficiency

Collecting data samples might be expensive or time-consuming:
- clinical trials
- online ads

Calls for an in-depth understanding of the sample efficiency of RL algorithms
This talk: a classical example — Q-learning
Background: Markov decision processes
Markov decision process (MDP)

- S: state space
- A: action space
- r(s, a) ∈ [0, 1]: immediate reward
- π(·|s): policy (or action selection rule)
- P(·|s, a): unknown transition probabilities

Example (a maze):
- state space S: positions in the maze
- action space A: up, down, left, right
- immediate reward r(s, a): cheese, electricity shocks, cats
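As a concrete reference point for this notation, here is a minimal sketch of how such a tabular MDP might be stored as plain arrays; the sizes, random rewards, and uniform behavior policy are purely illustrative and are reused by the sketches later in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 5, 4                              # |S| states, |A| actions (toy sizes)

P = rng.random((S, A, S))                # P[s, a, s']: transition probabilities
P /= P.sum(axis=2, keepdims=True)        # normalize each P(·|s, a)
r = rng.random((S, A))                   # r[s, a] ∈ [0, 1]: immediate reward
gamma = 0.9                              # discount factor γ

pi_b = np.full((S, A), 1.0 / A)          # a policy π(·|s): uniform over actions
```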
Value function

Value of policy π: long-term discounted reward

∀s ∈ S:  V^π(s) := E[ Σ_{t=0}^∞ γ^t r_t | s_0 = s ]

- γ ∈ [0, 1): discount factor
- (a_0, s_1, a_1, s_2, a_2, · · · ): generated under policy π
Action-value function (a.k.a. Q-function)

Q-function of policy π:

∀(s, a) ∈ S × A:  Q^π(s, a) := E[ Σ_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a ]

- (s_1, a_1, s_2, a_2, · · · ): generated under policy π (a_0 is now fixed, hence dropped from the trajectory)
Optimal policy and optimal value

- optimal policy π⋆: maximizing the value
- optimal value / Q-function: V⋆ := V^{π⋆}, Q⋆ := Q^{π⋆}
Need to learn optimal value / policy from data samples
Markovian samples and behavior policy

Observed: {s_t, a_t, r_t}_{t≥0}
- a Markovian trajectory generated by the behavior policy π_b

Goal: learn the optimal values V⋆ and Q⋆ based on this sample trajectory

Key quantities of the sample trajectory
- minimum state-action occupancy probability: µ_min := min_{s,a} µ_{π_b}(s, a), where µ_{π_b} is the stationary distribution of the trajectory
- mixing time: t_mix
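To make these two quantities concrete, here is a minimal numerical sketch, assuming a tabular MDP stored as arrays P (shape (S, A, S)) and a behavior policy pi_b (shape (S, A)) as in the earlier sketch; the function names and the total-variation mixing-time proxy are illustrative, not from the paper.

```python
import numpy as np

def stationary_occupancy(P, pi_b):
    """Stationary distribution mu[s, a] of the state-action chain induced by
    the behavior policy pi_b, and mu_min = min_{s,a} mu[s, a]."""
    S, A, _ = P.shape
    # Chain on state-action pairs: (s, a) -> (s', a') w.p. P[s, a, s'] * pi_b[s', a'].
    T = (P[:, :, :, None] * pi_b[None, None, :, :]).reshape(S * A, S * A)
    evals, evecs = np.linalg.eig(T.T)            # left eigenvector with eigenvalue 1
    mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
    mu = mu / mu.sum()
    return mu.reshape(S, A), mu.min()

def mixing_time(P, pi_b, tol=0.25, max_steps=10_000):
    """Crude proxy for t_mix: smallest t such that every row of T^t is within
    `tol` of the stationary distribution in total-variation distance."""
    S, A, _ = P.shape
    T = (P[:, :, :, None] * pi_b[None, None, :, :]).reshape(S * A, S * A)
    mu, _ = stationary_occupancy(P, pi_b)
    mu = mu.reshape(-1)
    Tt = np.eye(S * A)
    for t in range(1, max_steps + 1):
        Tt = Tt @ T
        if 0.5 * np.abs(Tt - mu).sum(axis=1).max() <= tol:
            return t
    return max_steps
```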
Asynchronous Q-learning (on Markovian samples)
Model-based vs. model-free RL

Model-based approach ("plug-in")
1. build an empirical estimate P̂ of P
2. planning based on the empirical P̂

Model-free approach: learning without modeling and estimating the environment explicitly
Q-learning: a classical model-free algorithm (Watkins & Dayan)

Stochastic approximation (Robbins & Monro '51) for solving the Bellman equation Q = T(Q)
Aside: Bellman optimality principle

Bellman operator (one-step look-ahead):

T(Q)(s, a) := r(s, a) + γ E_{s′∼P(·|s,a)}[ max_{a′∈A} Q(s′, a′) ]

- r(s, a): immediate reward
- E_{s′}[ max_{a′} Q(s′, a′) ]: next state's value

Bellman equation: Q⋆ is the unique solution to T(Q⋆) = Q⋆   (Richard Bellman)
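In code, the Bellman operator for a tabular MDP is a one-liner, and iterating it to its fixed point recovers Q⋆ (plain value iteration). This is a generic textbook sketch using the array layout from earlier, not the paper's algorithm.

```python
import numpy as np

def bellman_operator(Q, r, P, gamma):
    """T(Q)(s, a) = r(s, a) + gamma * E_{s'~P(·|s,a)}[ max_{a'} Q(s', a') ].
    Q, r have shape (S, A); P has shape (S, A, S)."""
    next_value = Q.max(axis=1)           # max_{a'} Q(s', a') for every next state s'
    return r + gamma * P @ next_value    # (S, A, S) @ (S,) -> (S, A)

def value_iteration(r, P, gamma, tol=1e-8):
    """Q* is the unique fixed point of T, so repeatedly apply T until convergence."""
    Q = np.zeros_like(r, dtype=float)
    while True:
        Q_next = bellman_operator(Q, r, P, gamma)
        if np.abs(Q_next - Q).max() < tol:
            return Q_next
        Q = Q_next
```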
Q-learning: a classical model-free algorithm

Stochastic approximation for solving the Bellman equation Q = T(Q):

Q_{t+1}(s_t, a_t) = (1 − η_t) Q_t(s_t, a_t) + η_t T_t(Q_t)(s_t, a_t),   t ≥ 0

- only the (s_t, a_t)-th entry is updated in iteration t

Here T_t is the empirical Bellman operator built from the observed transition:

T_t(Q)(s_t, a_t) = r(s_t, a_t) + γ max_{a′} Q(s_{t+1}, a′)

versus the population operator T(Q)(s, a) = r(s, a) + γ E_{s′∼P(·|s,a)}[ max_{a′} Q(s′, a′) ]
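A minimal sketch of this asynchronous update loop, assuming a hypothetical env.step(s, a) interface that returns (reward, next_state) and a behavior policy array pi_b as before; only the visited (s_t, a_t) entry is touched at each step.

```python
import numpy as np

def async_q_learning(env, pi_b, num_steps, eta, gamma, rng):
    """Run Q_{t+1}(s_t, a_t) = (1 - eta) Q_t(s_t, a_t) + eta * T_t(Q_t)(s_t, a_t)
    along a single Markovian trajectory generated by pi_b (constant stepsize)."""
    S, A = pi_b.shape
    Q = np.zeros((S, A))
    s = rng.integers(S)                        # arbitrary initial state
    for _ in range(num_steps):
        a = rng.choice(A, p=pi_b[s])           # draw a_t from the behavior policy
        reward, s_next = env.step(s, a)        # hypothetical environment interface
        # Empirical Bellman target: r(s_t, a_t) + gamma * max_{a'} Q(s_{t+1}, a')
        target = reward + gamma * Q[s_next].max()
        Q[s, a] = (1 - eta) * Q[s, a] + eta * target
        s = s_next
    return Q
```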
Q-learning on Markovian samples

- asynchronous: only a single entry is updated in each iteration
- resembles Markov-chain coordinate descent
- off-policy: target policy π⋆ ≠ behavior policy π_b
A highly incomplete list of prior work
- Watkins, Dayan ’92
- Tsitsiklis ’94
- Jaakkola, Jordan, Singh ’94
- Szepesvári ’98
- Kearns, Singh ’99
- Borkar, Meyn ’00
- Even-Dar, Mansour ’03
- Beck, Srikant ’12
- Chi, Zhu, Bubeck, Jordan ’18
- Shah, Xie ’18
- Lee, He ’18
- Wainwright ’19
- Chen, Zhang, Doan, Maguluri, Clarke ’19
- Yang, Wang ’19
- Du, Lee, Mahajan, Wang ’20
- Chen, Maguluri, Shakkottai, Shanmugam ’20
- Qu, Wierman ’20
- Devraj, Meyn ’20
- Weng, Gupta, He, Ying, Srikant ’20
- ...
What is sample complexity of (async) Q-learning?
Prior art: async Q-learning

Question: how many samples are needed to ensure ‖Q − Q⋆‖_∞ ≤ ε?

paper (learning rate): sample complexity
- Even-Dar & Mansour '03 (linear: η_t = 1/t):  (t_cover)^{1/(1−γ)} / ((1−γ)^4 ε^2)
- Even-Dar & Mansour '03 (polynomial: η_t = 1/t^ω, ω ∈ (1/2, 1)):  ( t_cover^{1+3ω} / ((1−γ)^4 ε^2) )^{1/ω} + ( t_cover / (1−γ) )^{1/(1−ω)}
- Beck & Srikant '12 (constant):  t_cover^3 |S||A| / ((1−γ)^5 ε^2)
- Qu & Wierman '20 (rescaled linear):  t_mix / ( µ_min^2 (1−γ)^5 ε^2 )

If we take µ_min ≍ 1/(|S||A|) and t_cover ≍ t_mix / µ_min:
all prior results require a sample size of at least t_mix |S|^2 |A|^2!
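For concreteness, the last claim can be checked by plugging µ_min ≍ 1/(|S||A|) into, e.g., the Qu & Wierman '20 bound (the other rows fare no better in |S||A| once t_cover ≍ t_mix/µ_min):

```latex
\frac{t_{\mathrm{mix}}}{\mu_{\min}^{2}\,(1-\gamma)^{5}\varepsilon^{2}}
  \;\asymp\;
\frac{t_{\mathrm{mix}}\,|S|^{2}|A|^{2}}{(1-\gamma)^{5}\varepsilon^{2}}
  \;\gtrsim\; t_{\mathrm{mix}}\,|S|^{2}|A|^{2}
\qquad\text{when }\ \mu_{\min}\asymp \frac{1}{|S||A|}.
```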
Main result: ℓ∞-based sample complexity

Theorem 1 (Li, Wei, Chi, Gu, Chen '20). For any 0 < ε ≤ 1/(1−γ), the sample complexity of async Q-learning to yield ‖Q − Q⋆‖_∞ ≤ ε is at most (up to some log factor)

1 / ( µ_min (1−γ)^5 ε^2 ) + t_mix / ( µ_min (1−γ) )

- improves upon prior art by at least a factor of |S||A|!
  — prior art: t_mix / ( µ_min^2 (1−γ)^5 ε^2 ) (Qu & Wierman '20)
Effect of mixing time on sample complexity

1 / ( µ_min (1−γ)^5 ε^2 ) + t_mix / ( µ_min (1−γ) )

- the t_mix term reflects the cost taken to reach the steady state
- it is a one-time expense (almost independent of ε) — it becomes amortized as the algorithm runs
  — prior art: t_mix / ( µ_min^2 (1−γ)^5 ε^2 ) (Qu & Wierman '20)
Learning rates

- our choice: constant stepsize η_t ≡ min{ (1−γ)^4 ε^2 / γ^2, 1/t_mix }
- Qu & Wierman '20: rescaled linear η_t = ( 1 / (µ_min(1−γ)) ) / ( t + max{ 1/(µ_min(1−γ)), t_mix } )
- Beck & Srikant '12: constant η_t ≡ (1−γ)^4 ε^2 / ( |S||A| t_cover^2 ) (too conservative)
- Even-Dar & Mansour '03: polynomial η_t = t^{−ω}, ω ∈ (1/2, 1]
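A small helper sketch of the first two schedules exactly as written on this slide (log factors and absolute constants omitted; illustrative only):

```python
def constant_stepsize(eps, gamma, t_mix):
    """Constant stepsize used in Theorem 1: min{(1-γ)^4 ε² / γ², 1/t_mix}."""
    return min((1 - gamma) ** 4 * eps ** 2 / gamma ** 2, 1.0 / t_mix)

def rescaled_linear_stepsize(t, mu_min, gamma, t_mix):
    """Rescaled linear stepsize of Qu & Wierman '20, as stated above."""
    c = 1.0 / (mu_min * (1 - gamma))
    return c / (t + max(c, t_mix))
```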
Minimax lower bound

- minimax lower bound (Azar et al. '13):  1 / ( µ_min (1−γ)^3 ε^2 )
- async Q-learning (ignoring dependency on t_mix):  1 / ( µ_min (1−γ)^5 ε^2 )

Can we improve the dependency on the discount complexity 1/(1−γ)?
One strategy: variance reduction
— inspired by Johnson & Zhang '13, Wainwright '19

Variance-reduced Q-learning updates the (s_t, a_t)-th entry via

Q_t(s_t, a_t) = (1 − η) Q_{t−1}(s_t, a_t) + η ( T_t(Q_{t−1}) − T_t(Q̄) + T̃(Q̄) )(s_t, a_t)

- Q̄: some reference Q-estimate, used to help reduce variability
- T̃: empirical Bellman operator (constructed using a batch of samples)
Variance-reduced Q-learning
— inspired by Johnson & Zhang '13, Wainwright '19

For each epoch:
1. update the reference Q̄ and the batch estimate T̃(Q̄)
2. run the variance-reduced Q-learning updates
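A minimal sketch of this epoch structure, reusing the hypothetical env.step(s, a) interface and behavior policy array from the earlier sketch; the batch construction and the fallback for entries unvisited in the batch are illustrative simplifications, not the paper's exact procedure.

```python
import numpy as np

def vr_async_q_learning(env, pi_b, num_epochs, batch_size, steps_per_epoch,
                        eta, gamma, rng):
    """Epoch-based variance-reduced async Q-learning: on the visited entry,
    Q_t = (1 - eta) Q_{t-1} + eta * (T_t(Q_{t-1}) - T_t(Q_bar) + T_tilde(Q_bar))."""
    S, A = pi_b.shape
    Q = np.zeros((S, A))
    s = rng.integers(S)

    for _ in range(num_epochs):
        Q_bar = Q.copy()                                  # 1. freeze reference Q̄

        # Batch estimate T̃(Q̄): average empirical Bellman targets over a batch.
        total, count = np.zeros((S, A)), np.zeros((S, A))
        for _ in range(batch_size):
            a = rng.choice(A, p=pi_b[s])
            reward, s_next = env.step(s, a)
            total[s, a] += reward + gamma * Q_bar[s_next].max()
            count[s, a] += 1
            s = s_next
        # Fall back to Q̄ on entries never visited in the batch (crude choice).
        T_tilde = np.where(count > 0, total / np.maximum(count, 1), Q_bar)

        # 2. Variance-reduced updates for the remainder of the epoch.
        for _ in range(steps_per_epoch):
            a = rng.choice(A, p=pi_b[s])
            reward, s_next = env.step(s, a)
            t_q = reward + gamma * Q[s_next].max()         # T_t(Q_{t-1})(s_t, a_t)
            t_qbar = reward + gamma * Q_bar[s_next].max()  # T_t(Q̄)(s_t, a_t)
            Q[s, a] = (1 - eta) * Q[s, a] + eta * (t_q - t_qbar + T_tilde[s, a])
            s = s_next
    return Q
```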
Main result: ℓ∞-based sample complexity

Theorem 2 (Li, Wei, Chi, Gu, Chen '20). For any 0 < ε ≤ 1, the sample complexity for (async) variance-reduced Q-learning to yield ‖Q − Q⋆‖_∞ ≤ ε is at most on the order of

1 / ( µ_min (1−γ)^3 ε^2 ) + t_mix / ( µ_min (1−γ) )

- more aggressive learning rates: the (1−γ)^4 factor in the constant stepsize improves to (1−γ)^2, i.e. η_t ≡ min{ (1−γ)^2 ε^2 / γ^2, 1/t_mix }
- minimax-optimal for 0 < ε ≤ 1
Concluding remarks

Understanding RL requires modern statistics and optimization.

Future directions:
- function approximation
- finite-horizon episodic MDPs
- on-policy algorithms like SARSA
- general Markov-chain-based optimization algorithms