Demystifying the efficiency of reinforcement learning: A few recent stories
Yuxin Chen, EE, Princeton University

RL challenges: in RL, an agent learns by interacting with an environment.


  1. Proof ideas

Elementary decomposition:
$$V^\star - V^{\widehat{\pi}^\star} = \big(V^{\pi^\star} - \widehat{V}^{\pi^\star}\big) + \big(\widehat{V}^{\pi^\star} - \widehat{V}^{\widehat{\pi}^\star}\big) + \big(\widehat{V}^{\widehat{\pi}^\star} - V^{\widehat{\pi}^\star}\big) \;\le\; \big(V^{\pi^\star} - \widehat{V}^{\pi^\star}\big) + 0 + \big(\widehat{V}^{\widehat{\pi}^\star} - V^{\widehat{\pi}^\star}\big)$$
• Step 1: control $V^{\pi} - \widehat{V}^{\pi}$ for a fixed $\pi$ (Bernstein inequality + high-order decomposition)
• Step 2: extend it to control $\widehat{V}^{\widehat{\pi}^\star} - V^{\widehat{\pi}^\star}$ (decouple statistical dependence)

  2. Step 1: improved theory for policy evaluation

Theorem 2 (Li, Wei, Chi, Gu, Chen '20). Fix any policy $\pi$. For any $0 < \varepsilon \le \frac{1}{1-\gamma}$, the plug-in estimator $\widehat{V}^{\pi}$ obeys $\|\widehat{V}^{\pi} - V^{\pi}\|_\infty \le \varepsilon$ with sample complexity at most
$$\widetilde{O}\left(\frac{|\mathcal{S}|}{(1-\gamma)^3 \varepsilon^2}\right)$$
• key idea 1: high-order decomposition of $\widehat{V}^{\pi} - V^{\pi}$
• minimax optimal (Azar et al. '13, Pananjady & Wainwright '19)
• breaks the sample size barrier $\frac{|\mathcal{S}|}{(1-\gamma)^2}$ in prior work (Agarwal et al. '19, Pananjady & Wainwright '19, Khamaru et al. '20)

  3. Step 2: controlling $\widehat{V}^{\widehat{\pi}^\star} - V^{\widehat{\pi}^\star}$

key idea 2: a leave-one-out argument to decouple the statistical dependency — inspired by Agarwal et al. '19 but different . . .

Caveat: requires the optimal policy to stand out from the other policies

  4. Step 2: controlling $\widehat{V}^{\widehat{\pi}^\star} - V^{\widehat{\pi}^\star}$ (continued)

key idea 3: tie-breaking via perturbation
• perturb the rewards $r$ by a tiny bit $\Longrightarrow$ the perturbed optimal policy $\widehat{\pi}^\star_{\mathrm{p}}$ stands out
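
To make the tie-breaking idea concrete, here is a minimal sketch of reward perturbation for a tabular reward table; the noise scale and the choice of uniform noise are illustrative assumptions, not the exact scheme analyzed in the talk.

```python
import numpy as np

def perturb_rewards(r, scale=1e-9, rng=None):
    """Return a slightly perturbed copy of the reward table r (shape [S, A]).

    With probability 1 the perturbed MDP has a unique optimal action in every
    state, so the optimal policy "stands out", while every value function
    moves by at most scale / (1 - gamma) in sup-norm.
    """
    rng = np.random.default_rng() if rng is None else rng
    return r + scale * rng.uniform(0.0, 1.0, size=r.shape)
```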

  5. Summary

Model-based RL is minimax optimal and does not suffer from a sample size barrier!

Future directions:
• finite-horizon episodic MDPs
• Markov games

  6. Story 2: sample complexity of (asynchronous) Q-learning on Markovian samples

Gen Li (Tsinghua EE), Yuantao Gu (Tsinghua EE), Yuting Wei (CMU Stats), Yuejie Chi (CMU ECE)

  7. Model-based vs. model-free RL

Model-based approach ("plug-in"):
1. build an empirical estimate $\widehat{P}$ of $P$
2. plan based on the empirical model $\widehat{P}$

Model-free approach: learning without modeling and estimating the environment explicitly
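
To make the plug-in recipe concrete, here is a minimal sketch assuming access to a generative model `sample_next_state(s, a)` that draws a next state from $P(\cdot \mid s, a)$; the function names and the use of value iteration as the planner are illustrative choices, not prescribed by the talk.

```python
import numpy as np

def plug_in_planning(sample_next_state, r, S, A, gamma, n_samples, n_iters=1000):
    # Step 1: build the empirical transition model P_hat from samples.
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(n_samples):
                P_hat[s, a, sample_next_state(s, a)] += 1.0 / n_samples

    # Step 2: plan on the empirical MDP, here via value iteration on the
    # Q-function (any planner applied to P_hat would do).
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        V = Q.max(axis=1)
        Q = r + gamma * P_hat @ V   # empirical Bellman operator
    policy = Q.argmax(axis=1)
    return Q, policy
```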

  8. A classical example: Q-learning on Markovian samples

  9. Markovian samples and behavior policy

Observed: a Markovian trajectory $\{s_t, a_t, r_t\}_{t \ge 0}$ generated by the behavior policy $\pi_{\mathrm{b}}$

Goal: learn the optimal values $V^\star$ and $Q^\star$ from this sample trajectory

  10. Markovian samples and behavior policy (continued)

Key quantities of the sample trajectory:
• minimum state-action occupancy probability $\mu_{\min} := \min_{s,a} \mu_{\pi_{\mathrm{b}}}(s, a)$, where $\mu_{\pi_{\mathrm{b}}}$ is the stationary distribution of the trajectory
• mixing time $t_{\mathrm{mix}}$
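
For intuition only, both quantities can be computed exactly when the kernel $P$ and the behavior policy $\pi_{\mathrm{b}}$ are known (which they are not in the learning setting); the sketch below builds the state-action chain induced by $\pi_{\mathrm{b}}$, reads off its stationary distribution and $\mu_{\min}$, and brute-forces a mixing time. Array names and shapes are assumptions for illustration.

```python
import numpy as np

def occupancy_and_mixing(P, pi_b, t_max=10_000):
    """P: transition kernel [S, A, S]; pi_b: behavior policy [S, A]."""
    S, A, _ = P.shape
    # Transition matrix of the state-action chain (s, a) -> (s', a').
    T = np.einsum('sap,pb->sapb', P, pi_b).reshape(S * A, S * A)

    # Stationary distribution mu: leading left eigenvector of T.
    w, V = np.linalg.eig(T.T)
    mu = np.real(V[:, np.argmax(np.real(w))])
    mu = mu / mu.sum()
    mu_min = mu.min()                     # minimum state-action occupancy

    # Mixing time: first t with max_x TV( T^t(x, .), mu ) <= 1/4.
    Tt, t_mix = np.eye(S * A), 0
    while 0.5 * np.abs(Tt - mu).sum(axis=1).max() > 0.25 and t_mix < t_max:
        Tt, t_mix = Tt @ T, t_mix + 1
    return mu.reshape(S, A), mu_min, t_mix
```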

  11. Q-learning: a classical model-free algorithm (Chris Watkins, Peter Dayan)

Stochastic approximation (Robbins & Monro '51) for solving the Bellman equation $Q = \mathcal{T}(Q)$

  12. Aside: Bellman optimality principle

Bellman operator:
$$\mathcal{T}(Q)(s, a) := \underbrace{r(s, a)}_{\text{immediate reward}} + \gamma\, \underbrace{\mathbb{E}_{s' \sim P(\cdot \mid s, a)}\Big[\max_{a' \in \mathcal{A}} Q(s', a')\Big]}_{\text{next state's value}}$$
• one-step look-ahead

Bellman equation (Richard Bellman): $Q^\star$ is the unique solution to $\mathcal{T}(Q^\star) = Q^\star$
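
A minimal sketch of this operator for a tabular MDP with known $r$ and $P$ (array shapes are illustrative assumptions); iterating it recovers the fixed point $Q^\star$ because $\mathcal{T}$ is a $\gamma$-contraction in the sup-norm.

```python
import numpy as np

def bellman_operator(Q, r, P, gamma):
    # T(Q)(s, a) = r(s, a) + gamma * E_{s' ~ P(.|s,a)} [ max_a' Q(s', a') ]
    # r has shape [S, A], P has shape [S, A, S], Q has shape [S, A].
    return r + gamma * P @ Q.max(axis=1)

def solve_bellman(r, P, gamma, tol=1e-8):
    # Fixed-point iteration converges to the unique Q* with T(Q*) = Q*.
    Q = np.zeros(r.shape)
    while True:
        Q_next = bellman_operator(Q, r, P, gamma)
        if np.abs(Q_next - Q).max() < tol:
            return Q_next
        Q = Q_next
```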

  13. Q-learning: a classical model-free algorithm (continued)

Q-learning update rule (only the $(s_t, a_t)$-th entry is updated):
$$Q_{t+1}(s_t, a_t) = (1 - \eta_t)\, Q_t(s_t, a_t) + \eta_t\, \mathcal{T}_t(Q_t)(s_t, a_t), \qquad t \ge 0,$$
where $\mathcal{T}_t$ is the empirical Bellman operator built from the $t$-th transition:
$$\mathcal{T}_t(Q)(s_t, a_t) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')
\qquad\text{vs.}\qquad
\mathcal{T}(Q)(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\Big[\max_{a'} Q(s', a')\Big]$$
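
Putting the update rule into code: a minimal sketch of asynchronous Q-learning driven by a single Markovian trajectory. The helpers `env_step(s, a) -> (r, s_next)` and `behavior_policy(s) -> a` are illustrative stand-ins, and a constant stepsize is used for simplicity.

```python
import numpy as np

def async_q_learning(env_step, behavior_policy, S, A, gamma, eta, T, s0=0):
    Q, s = np.zeros((S, A)), s0
    for t in range(T):
        a = behavior_policy(s)              # off-policy: actions from pi_b
        r, s_next = env_step(s, a)
        # Empirical Bellman operator at (s_t, a_t): uses only this transition.
        target = r + gamma * Q[s_next].max()
        # Asynchronous update: only the (s_t, a_t)-th entry changes.
        Q[s, a] = (1 - eta) * Q[s, a] + eta * target
        s = s_next
    return Q
```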

  14. Q-learning on Markovian samples

• asynchronous: only a single entry is updated in each iteration
  ◦ resembles Markov-chain coordinate descent
• off-policy: target policy $\pi^\star \ne$ behavior policy $\pi_{\mathrm{b}}$

  15. A highly incomplete list of prior work • Watkins, Dayan ’92 • Tsitsiklis ’94 • Jaakkola, Jordan, Singh ’94 • Szepesvári ’98 • Kearns, Singh ’99 • Borkar, Meyn ’00 • Even-Dar, Mansour ’03 • Beck, Srikant ’12 • Chi, Zhu, Bubeck, Jordan ’18 • Shah, Xie ’18 • Lee, He ’18 • Wainwright ’19 • Chen, Zhang, Doan, Maguluri, Clarke ’19 • Yang, Wang ’19 • Du, Lee, Mahajan, Wang ’20 • Chen, Maguluri, Shakkottai, Shanmugam ’20 • Qu, Wierman ’20 • Devraj, Meyn ’20 • Weng, Gupta, He, Ying, Srikant ’20 • ...

  16. What is the sample complexity of (async) Q-learning?

  17. Prior art: async Q-learning

Question: how many samples are needed to ensure $\|\widehat{Q} - Q^\star\|_\infty \le \varepsilon$?

• Even-Dar & Mansour '03 (linear learning rate $\frac{1}{t}$): $\frac{(t_{\mathrm{cover}})^{\frac{1}{1-\gamma}}}{(1-\gamma)^4 \varepsilon^2}$
• Even-Dar & Mansour '03 (polynomial learning rate $\frac{1}{t^{\omega}}$, $\omega \in (\frac{1}{2}, 1)$): $\Big(\frac{t_{\mathrm{cover}}^{1+3\omega}}{(1-\gamma)^4 \varepsilon^2}\Big)^{\frac{1}{\omega}} + \Big(\frac{t_{\mathrm{cover}}}{1-\gamma}\Big)^{\frac{1}{1-\omega}}$
• Beck & Srikant '12 (constant learning rate): $\frac{|\mathcal{S}||\mathcal{A}|\, t_{\mathrm{cover}}^3}{(1-\gamma)^5 \varepsilon^2}$
• Qu & Wierman '20 (rescaled linear learning rate): $\frac{t_{\mathrm{mix}}}{\mu_{\min}^2 (1-\gamma)^5 \varepsilon^2}$

If we take $\mu_{\min} \asymp \frac{1}{|\mathcal{S}||\mathcal{A}|}$ and $t_{\mathrm{cover}} \asymp \frac{t_{\mathrm{mix}}}{\mu_{\min}}$, then all prior results require a sample size of at least $t_{\mathrm{mix}} |\mathcal{S}|^2 |\mathcal{A}|^2$!

  18. Main result: $\ell_\infty$-based sample complexity

Theorem 3 (Li, Wei, Chi, Gu, Chen '20). For any $0 < \varepsilon \le \frac{1}{1-\gamma}$, the sample complexity of async Q-learning to yield $\|\widehat{Q} - Q^\star\|_\infty \le \varepsilon$ is at most (up to some log factor)
$$\frac{1}{\mu_{\min}(1-\gamma)^5 \varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$$
• improves upon prior art by at least a factor of $|\mathcal{S}||\mathcal{A}|$!
  — prior art: $\frac{t_{\mathrm{mix}}}{\mu_{\min}^2 (1-\gamma)^5 \varepsilon^2}$ (Qu & Wierman '20)

  19. Effect of mixing time on sample complexity
$$\frac{1}{\mu_{\min}(1-\gamma)^5 \varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$$
• the mixing term reflects the cost of reaching the steady state
• it is a one-time expense (almost independent of $\varepsilon$) — it becomes amortized as the algorithm runs
  — in contrast, prior art: $\frac{t_{\mathrm{mix}}}{\mu_{\min}^2 (1-\gamma)^5 \varepsilon^2}$ (Qu & Wierman '20)

  20. Learning rates

Our choice: constant stepsize $\eta_t \equiv \min\Big\{\frac{(1-\gamma)^4 \varepsilon^2}{\gamma^2},\, \frac{1}{t_{\mathrm{mix}}}\Big\}$
• Qu & Wierman '20: rescaled linear $\eta_t = \frac{1}{t + \max\{\frac{1}{\mu_{\min}(1-\gamma)},\, t_{\mathrm{mix}}\}}$
• Beck & Srikant '12: constant $\eta_t \equiv \frac{(1-\gamma)^4 \varepsilon^2}{|\mathcal{S}||\mathcal{A}|\, t_{\mathrm{cover}}^2}$ (too conservative)
• Even-Dar & Mansour '03: polynomial $\eta_t = t^{-\omega}$ ($\omega \in (\frac{1}{2}, 1]$)

  21. Minimax lower bound

• minimax lower bound (Azar et al. '13): $\frac{1}{\mu_{\min}(1-\gamma)^3 \varepsilon^2}$
• async Q-learning (ignoring the dependency on $t_{\mathrm{mix}}$): $\frac{1}{\mu_{\min}(1-\gamma)^5 \varepsilon^2}$

Can we improve the dependency on the discount complexity $\frac{1}{1-\gamma}$?

  22. One strategy: variance reduction — inspired by Johnson & Zhang '13, Wainwright '19

Variance-reduced Q-learning updates:
$$Q_t(s_t, a_t) = (1 - \eta)\, Q_{t-1}(s_t, a_t) + \eta\, \big(\mathcal{T}_t(Q_{t-1}) - \mathcal{T}_t(\overline{Q}) + \widetilde{\mathcal{T}}(\overline{Q})\big)(s_t, a_t)$$
(the last two terms use $\overline{Q}$ to help reduce variability)
• $\overline{Q}$: some reference Q-estimate
• $\widetilde{\mathcal{T}}$: empirical Bellman operator (computed using a batch of samples)

  23. Variance-reduced Q-learning — inspired by Johnson & Zhang '13, Sidford et al. '18, Wainwright '19

For each epoch:
1. update the reference $\overline{Q}$ and $\widetilde{\mathcal{T}}(\overline{Q})$
2. run the variance-reduced Q-learning updates
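
A minimal sketch of this epoch-based scheme, under the same illustrative `env_step` / `behavior_policy` interface as before; the batch size, epoch length, and stepsize below are placeholders rather than the tuned choices behind the theory.

```python
import numpy as np

def vr_q_learning(env_step, behavior_policy, S, A, gamma, eta,
                  n_epochs=10, batch=10_000, epoch_len=10_000, s0=0):
    Q, s = np.zeros((S, A)), s0

    def empirical_bellman(Q_ref, s, a, s_next, r):
        return r + gamma * Q_ref[s_next].max()

    for _ in range(n_epochs):
        # 1. Recompute the reference Q_bar and a batch estimate of T(Q_bar)
        #    by averaging the empirical Bellman operator over `batch` samples.
        Q_bar = Q.copy()
        T_bar, counts = np.zeros((S, A)), np.zeros((S, A))
        for _ in range(batch):
            a = behavior_policy(s)
            r, s_next = env_step(s, a)
            T_bar[s, a] += empirical_bellman(Q_bar, s, a, s_next, r)
            counts[s, a] += 1
            s = s_next
        T_bar = np.divide(T_bar, np.maximum(counts, 1))

        # 2. Run variance-reduced updates: the recentered target
        #    T_t(Q) - T_t(Q_bar) + T_tilde(Q_bar) has much lower variance.
        for _ in range(epoch_len):
            a = behavior_policy(s)
            r, s_next = env_step(s, a)
            target = (empirical_bellman(Q, s, a, s_next, r)
                      - empirical_bellman(Q_bar, s, a, s_next, r)
                      + T_bar[s, a])
            Q[s, a] = (1 - eta) * Q[s, a] + eta * target
            s = s_next
    return Q
```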

  24. Main result: $\ell_\infty$-based sample complexity

Theorem 4 (Li, Wei, Chi, Gu, Chen '20). For any $0 < \varepsilon \le 1$, the sample complexity for (async) variance-reduced Q-learning to yield $\|\widehat{Q} - Q^\star\|_\infty \le \varepsilon$ is at most on the order of
$$\frac{1}{\mu_{\min}(1-\gamma)^3 \varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$$
• more aggressive learning rates: $\eta_t \equiv \min\Big\{\frac{(1-\gamma)^2}{\gamma^2},\, \frac{1}{t_{\mathrm{mix}}}\Big\}$ (the earlier $(1-\gamma)^4$ factor improves to $(1-\gamma)^2$)
• minimax-optimal for $0 < \varepsilon \le 1$

  25. Summary

Sharpens the finite-sample understanding of Q-learning on Markovian data

Future directions:
• function approximation
• on-policy algorithms such as SARSA
• general Markov-chain-based optimization algorithms

  26. Story 3: fast global convergence of entropy-regularized natural policy gradient (NPG) methods

Shicong Cen (CMU ECE), Chen Cheng (Stanford Stats), Yuejie Chi (CMU ECE), Yuting Wei (CMU Stats)

  27. Policy optimization: a major contributor to these successes

  28. Policy gradient (PG) methods

Given an initial state distribution $\rho$:
$$\text{maximize}_{\pi} \quad V^{\pi}(\rho) := \mathbb{E}_{s \sim \rho}\big[V^{\pi}(s)\big]$$
Softmax parameterization:
$$\pi_\theta(a \mid s) = \frac{\exp(\theta(s, a))}{\sum_{a'} \exp(\theta(s, a'))},$$
so the problem becomes
$$\text{maximize}_{\theta} \quad V^{\pi_\theta}(\rho) := \mathbb{E}_{s \sim \rho}\big[V^{\pi_\theta}(s)\big]$$
PG method (Sutton et al. '00):
$$\theta^{(t+1)} = \theta^{(t)} + \eta\, \nabla_\theta V^{\pi_{\theta^{(t)}}}(\rho), \qquad t = 0, 1, \cdots$$
• $\eta$: learning rate
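
A minimal sketch of one exact PG step for the tabular softmax class, assuming $P$, $r$, and $\rho$ are known so the gradient can be evaluated in closed form; the gradient expression used below, $\partial V^{\pi_\theta}(\rho)/\partial \theta(s,a) = \frac{1}{1-\gamma}\, d_\rho^{\pi_\theta}(s)\, \pi_\theta(a \mid s)\, A^{\pi_\theta}(s,a)$, is the standard softmax policy-gradient formula and is not spelled out on the slide.

```python
import numpy as np

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def policy_values(pi, P, r, gamma):
    # Exact policy evaluation for a tabular MDP: P [S, A, S], r [S, A].
    S, A = r.shape
    P_pi = np.einsum('sap,sa->sp', P, pi)           # state chain under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * r).sum(axis=1))
    Q = r + gamma * P @ V
    return V, Q, P_pi

def pg_step(theta, P, r, rho, gamma, eta):
    pi = softmax_policy(theta)
    V, Q, P_pi = policy_values(pi, P, r, gamma)
    A = Q - V[:, None]                              # advantage function
    S = r.shape[0]
    # Discounted state visitation distribution d_rho under pi.
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
    grad = d[:, None] * pi * A / (1 - gamma)        # exact softmax PG
    return theta + eta * grad
```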

  29. Booster 1: natural policy gradient (NPG)

Precondition the gradients to improve the search directions ...

NPG method (Kakade '02):
$$\theta^{(t+1)} = \theta^{(t)} + \eta\, \big(\mathcal{F}^{\theta}_{\rho}\big)^{\dagger}\, \nabla_\theta V^{\pi_{\theta^{(t)}}}(\rho), \qquad t = 0, 1, \cdots$$
• $\mathcal{F}^{\theta}_{\rho}$: Fisher information matrix, $\mathcal{F}^{\theta}_{\rho} := \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, \big(\nabla_\theta \log \pi_\theta(a \mid s)\big)^{\top}\big]$
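
For the tabular softmax class, Kakade's NPG update is known to reduce to a multiplicative update on the policy itself, $\pi^{(t+1)}(a \mid s) \propto \pi^{(t)}(a \mid s) \exp\big(\eta\, Q^{(t)}(s, a)/(1-\gamma)\big)$ (Agarwal et al. '19). The sketch below implements one such step assuming known $P$ and $r$; it illustrates the unregularized NPG update, not the entropy-regularized variant studied in this story.

```python
import numpy as np

def npg_step(pi, P, r, gamma, eta):
    # Evaluate Q^pi exactly for the tabular MDP (known P [S, A, S], r [S, A]).
    S, A = r.shape
    P_pi = np.einsum('sap,sa->sp', P, pi)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * r).sum(axis=1))
    Q = r + gamma * P @ V
    # Multiplicative-weights form of the NPG update over the simplex.
    logits = np.log(pi) + eta * Q / (1 - gamma)
    new_pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    return new_pi / new_pi.sum(axis=1, keepdims=True)
```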
