 
Breaking the sample size barrier in reinforcement learning via model-based (“plug-in”) methods
Yuxin Chen, EE, Princeton University
Gen Li (Tsinghua EE), Yuejie Chi (CMU ECE), Yuantao Gu (Tsinghua EE), Yuting Wei (CMU Statistics; Berkeley Stat Ph.D.)
“Breaking the sample size barrier in model-based reinforcement learning with a generative model,” G. Li, Y. Wei, Y. Chi, Y. Gu, Y. Chen, arXiv:2005.12900, 2020
Reinforcement learning (RL)
RL challenges
In RL, an agent learns by interacting with an environment
• unknown or changing environments
• delayed rewards or feedback
• enormous state and action space
• nonconvexity
Sample efficiency
Collecting data samples might be expensive or time-consuming (e.g. clinical trials, online ads)
Calls for design of sample-efficient RL algorithms!
Background: Markov decision processes
Markov decision process (MDP)
• $\mathcal{S}$: state space
• $\mathcal{A}$: action space
• $r(s, a) \in [0, 1]$: immediate reward
• $\pi(\cdot \mid s)$: policy (or action selection rule)
• $P(\cdot \mid s, a)$: unknown transition probabilities
Value function
Value of policy $\pi$: long-term discounted reward
$$\forall s \in \mathcal{S}: \quad V^{\pi}(s) := \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\Big|\, s_0 = s \right]$$
• $(a_0, s_1, a_1, s_2, a_2, \cdots)$: generated under policy $\pi$
• $\gamma \in [0, 1)$: discount factor
  ◦ take $\gamma \to 1$ to approximate long-horizon MDPs
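As a rough illustration of the definition above, the value of a fixed policy in a small tabular MDP can be approximated by repeatedly applying the one-step recursion $V \leftarrow r_\pi + \gamma P_\pi V$. The sketch below is only illustrative: the function name `evaluate_policy`, the nested-list layout `P[s][a][s2]`, and the two-state toy MDP are assumptions made for this example and do not come from the slides.

```python
from typing import List


def evaluate_policy(
    P: List[List[List[float]]],   # P[s][a][s2]: probability of moving s -> s2 under action a
    r: List[List[float]],         # r[s][a]: immediate reward in [0, 1]
    policy: List[List[float]],    # policy[s][a]: probability of choosing a in state s
    gamma: float,                 # discount factor in [0, 1)
    tol: float = 1e-10,
) -> List[float]:
    """Approximate V^pi by fixed-point iteration on V <- r_pi + gamma * P_pi V."""
    n_states = len(P)
    V = [0.0] * n_states
    while True:
        V_new = []
        for s in range(n_states):
            total = 0.0
            for a, prob_a in enumerate(policy[s]):
                expected_next = sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
                total += prob_a * (r[s][a] + gamma * expected_next)
            V_new.append(total)
        # Stop once successive iterates agree up to the tolerance.
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return V_new
        V = V_new


# Hypothetical toy MDP: state 1 is absorbing with reward 1;
# in state 0, action 0 stays put and action 1 moves to state 1 (both with reward 0).
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
r = [[0.0, 0.0], [1.0, 1.0]]
always_move = [[0.0, 1.0], [0.0, 1.0]]  # always pick action 1

V = evaluate_policy(P, r, always_move, gamma=0.9)
print(V)  # approximately [9.0, 10.0]
```

With rewards in $[0, 1]$, every value is at most $1/(1-\gamma)$, which is 10 for $\gamma = 0.9$ and matches the value of the absorbing state here.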
Optimal policy and optimal values
• Optimal policy $\pi^{\star}$: maximizing the value function
• Optimal values: $V^{\star} := V^{\pi^{\star}}$
When the model is known ...
[Figure: MDP specification (truth: $P$, $r$) → planning oracle (e.g. policy iteration) → $\pi^{\star}$]
Planning: computing the optimal policy $\pi^{\star}$ given the MDP specification
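To make the planning step concrete, here is a minimal sketch of a planning oracle for a known tabular MDP. Note that it uses value iteration rather than the policy iteration named on the slide, simply because it is shorter to write down; the function name `value_iteration`, the data layout, and the toy MDP are assumptions for illustration only.

```python
from typing import List, Tuple


def value_iteration(
    P: List[List[List[float]]],  # P[s][a][s2]: known transition probabilities
    r: List[List[float]],        # r[s][a]: known rewards in [0, 1]
    gamma: float,                # discount factor in [0, 1)
    tol: float = 1e-10,
) -> Tuple[List[float], List[int]]:
    """Return approximate optimal values and a greedy deterministic policy."""
    n_states = len(P)
    n_actions = len(P[0])
    V = [0.0] * n_states
    while True:
        # One Bellman optimality backup: Q(s,a) = r(s,a) + gamma * E[V(s')].
        Q = [[r[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
        V_new = [max(row) for row in Q]
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            # Greedy policy: first maximizing action in each state.
            greedy = [row.index(max(row)) for row in Q]
            return V_new, greedy
        V = V_new


# Hypothetical toy MDP: state 1 is absorbing with reward 1;
# in state 0, action 0 stays put and action 1 moves to state 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
r = [[0.0, 0.0], [1.0, 1.0]]

V_star, pi_star = value_iteration(P, r, gamma=0.9)
print(pi_star[0])  # 1: moving to the rewarding state is optimal in state 0
```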
When the model is unknown ...
Need to learn the optimal policy from samples, without a model specification
This talk: RL with a generative model / simulator (Kearns, Singh ’99)
For each state-action pair $(s, a)$, collect $N$ samples $\{(s, a, s'_{(i)})\}_{1 \le i \le N}$
Question: how many samples are sufficient to learn an $\varepsilon$-optimal policy $\hat{\pi}$, i.e. one satisfying $\forall s: V^{\hat{\pi}}(s) \ge V^{\star}(s) - \varepsilon$?
An incomplete list of prior art
• Kearns & Singh ’99
• Kakade ’03
• Kearns, Mansour & Ng ’02
• Azar, Munos & Kappen ’12
• Azar, Munos, Ghavamzadeh & Kappen ’13
• Sidford, Wang, Wu, Yang & Ye ’18
• Sidford, Wang, Wu & Ye ’18
• Wang ’17
• Agarwal, Kakade & Yang ’19
• Wainwright ’19a
• Wainwright ’19b
• Pananjady & Wainwright ’20
• Yang & Wang ’19
• Khamaru, Pananjady, Ruan, Wainwright & Jordan ’20
• Mou, Li, Wainwright, Bartlett & Jordan ’20
• ...
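An even shorter list of prior art
(for each algorithm: sample size range; sample complexity; $\varepsilon$-range)
• empirical QVI (Azar et al. ’13): sample size range $\big[\frac{|\mathcal{S}|^2|\mathcal{A}|}{(1-\gamma)^2}, \infty\big)$; sample complexity $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2}$; $\varepsilon$-range $\big(0, \frac{1}{\sqrt{(1-\gamma)|\mathcal{S}|}}\big]$
• sublinear randomized VI (Sidford et al. ’18a): sample size range $\big[\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}, \infty\big)$; sample complexity $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$; $\varepsilon$-range $\big(0, \frac{1}{1-\gamma}\big]$
• variance-reduced QVI (Sidford et al. ’18b): sample size range $\big[\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3}, \infty\big)$; sample complexity $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2}$; $\varepsilon$-range $(0, 1]$
• empirical MDP + planning (Agarwal et al. ’19): sample size range $\big[\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}, \infty\big)$; sample complexity $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2}$; $\varepsilon$-range $\big(0, \frac{1}{\sqrt{1-\gamma}}\big]$
See also Wainwright ’19 (for estimating optimal values)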
All prior theory requires sample size $> \frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$ (the sample size barrier)
Is it possible to close the gap?
Two approaches
Model-based approach (“plug-in”)
1. build an empirical estimate $\hat{P}$ for $P$
2. planning based on the empirical $\hat{P}$
Model-free approach (e.g. Q-learning, SARSA): learning without estimating the model explicitly
Model estimation
Sampling: for each $(s, a)$, collect $N$ independent samples $\{(s, a, s'_{(i)})\}_{1 \le i \le N}$
Empirical estimates: estimate $P(s' \mid s, a)$ by the empirical frequency
$$\hat{P}(s' \mid s, a) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{ s'_{(i)} = s' \}$$
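The empirical-frequency estimate above amounts to counting how often each next state shows up among the $N$ draws for a given $(s, a)$. A minimal sketch follows; the helper names `sample_next_states` and `empirical_transition`, and the use of Python's `random.Random.choices` as a stand-in for the generative model, are assumptions made for illustration.

```python
import random
from collections import Counter
from typing import List, Sequence


def sample_next_states(P_row: Sequence[float], n_samples: int,
                       rng: random.Random) -> List[int]:
    """Stand-in for the generative model: draw N next states for one fixed (s, a)."""
    return rng.choices(range(len(P_row)), weights=P_row, k=n_samples)


def empirical_transition(samples: Sequence[int], n_states: int) -> List[float]:
    """Empirical frequency of each next state among the samples for one (s, a)."""
    counts = Counter(samples)
    n = len(samples)
    return [counts[s2] / n for s2 in range(n_states)]


# Hypothetical true row P(. | s, a) over three next states.
true_row = [0.2, 0.5, 0.3]
rng = random.Random(0)
draws = sample_next_states(true_row, n_samples=1000, rng=rng)
P_hat_row = empirical_transition(draws, n_states=3)
print(P_hat_row)  # close to [0.2, 0.5, 0.3] for large N, but not exact
```

Repeating this for every state-action pair fills in the full $|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}|$ matrix $\hat{P}$.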
Model-based (plug-in) estimator (Azar et al. ’13, Agarwal et al. ’19, Pananjady et al. ’20)
[Figure: empirical MDP (empirical $\hat{P}$, $r$) → planning oracle (e.g. policy iteration) → $\hat{\pi}^{\star}$]
Run planning algorithms based on the empirical MDP
Our method: plug-in estimator + perturbation (Li, Wei, Chi, Gu, Chen ’20)
[Figure: empirical MDP with perturbed rewards (empirical $\hat{P}$, $r_p$) → planning oracle (e.g. policy iteration) → $\hat{\pi}^{\star}_p$]
Planning based on the empirical MDP with slightly perturbed rewards
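The plug-in-plus-perturbation pipeline can be sketched end to end as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the slides only say the rewards are slightly perturbed, so the i.i.d. uniform noise on $[0, \xi]$, the parameter name `xi`, the function name `perturbed_plug_in_policy`, and the use of value iteration as the planner are all choices made for this example.

```python
import random
from typing import List


def perturbed_plug_in_policy(
    samples: List[List[List[int]]],  # samples[s][a]: list of N sampled next states
    r: List[List[float]],            # r[s][a]: known rewards in [0, 1]
    gamma: float,
    xi: float,                       # ASSUMPTION: size of the reward perturbation
    rng: random.Random,
    tol: float = 1e-10,
) -> List[int]:
    n_states = len(samples)
    n_actions = len(samples[0])
    # Step 1: empirical transition frequencies P_hat(s2 | s, a).
    P_hat = [[[samples[s][a].count(s2) / len(samples[s][a])
               for s2 in range(n_states)]
              for a in range(n_actions)]
             for s in range(n_states)]
    # Step 2: slightly perturbed rewards (ASSUMPTION: i.i.d. uniform noise on [0, xi]).
    r_p = [[r[s][a] + rng.uniform(0.0, xi) for a in range(n_actions)]
           for s in range(n_states)]
    # Step 3: plan on the perturbed empirical MDP (value iteration used here).
    V = [0.0] * n_states
    while True:
        Q = [[r_p[s][a] + gamma * sum(P_hat[s][a][s2] * V[s2]
                                      for s2 in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
        V_new = [max(row) for row in Q]
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return [row.index(max(row)) for row in Q]
        V = V_new


# Hypothetical toy data: deterministic transitions, so every sample agrees.
N = 50
samples = [[[0] * N, [1] * N],   # state 0: action 0 stays, action 1 moves to state 1
           [[1] * N, [1] * N]]   # state 1: absorbing
r = [[0.0, 0.0], [1.0, 1.0]]
policy = perturbed_plug_in_policy(samples, r, gamma=0.9, xi=1e-3,
                                  rng=random.Random(0))
print(policy[0])  # 1: the tiny perturbation does not change the clear best action
```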
Challenges in the sample-starved regime
truth: $P \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}|}$; empirical estimate: $\hat{P}$
• Can’t recover $P$ faithfully if sample size $\ll |\mathcal{S}|^2|\mathcal{A}|$!
• Can we trust our policy estimate when reliable model estimation is infeasible?
Main result
Theorem 1 (Li, Wei, Chi, Gu, Chen ’20)
For any $0 < \varepsilon \le \frac{1}{1-\gamma}$, the optimal policy $\hat{\pi}^{\star}_p$ of the perturbed empirical MDP achieves
$$\| V^{\hat{\pi}^{\star}_p} - V^{\star} \|_{\infty} \le \varepsilon$$
with sample complexity at most
$$\widetilde{O}\left( \frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3 \varepsilon^2} \right)$$
• $\hat{\pi}^{\star}_p$: obtained by empirical QVI or PI within $\widetilde{O}\big(\frac{1}{1-\gamma}\big)$ iterations
• Minimax lower bound: $\widetilde{\Omega}\big( \frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3 \varepsilon^2} \big)$ (Azar et al. ’13)
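A quick back-of-the-envelope comparison shows why the wider $\varepsilon$-range matters. The sketch below evaluates the leading terms only, dropping the constants and logarithmic factors hidden in $\widetilde{O}(\cdot)$, so the numbers are orders of magnitude rather than actual sample requirements. The function names and the example values of $|\mathcal{S}|$, $|\mathcal{A}|$, $\gamma$ and $\varepsilon$ are assumptions for illustration.

```python
def plug_in_sample_bound(n_states: int, n_actions: int,
                         gamma: float, eps: float) -> float:
    """Leading term |S||A| / ((1 - gamma)^3 eps^2), constants and logs dropped."""
    return n_states * n_actions / ((1.0 - gamma) ** 3 * eps ** 2)


def prior_barrier(n_states: int, n_actions: int, gamma: float) -> float:
    """Leading term |S||A| / (1 - gamma)^2 of the earlier sample size barrier."""
    return n_states * n_actions / (1.0 - gamma) ** 2


n_states, n_actions, gamma = 10, 5, 0.99

# Small eps: the bound sits far above the barrier, as in earlier analyses.
print(plug_in_sample_bound(n_states, n_actions, gamma, eps=0.1) >
      prior_barrier(n_states, n_actions, gamma))  # True

# Coarse accuracy (eps = 20 <= 1 / (1 - gamma)): the bound drops below the barrier.
print(plug_in_sample_bound(n_states, n_actions, gamma, eps=20.0) <
      prior_barrier(n_states, n_actions, gamma))  # True
```

The ratio of the two quantities is $\frac{1}{(1-\gamma)\varepsilon^2}$, so the bound falls below the barrier exactly when $\varepsilon > \frac{1}{\sqrt{1-\gamma}}$, which lies inside the range $\big(0, \frac{1}{1-\gamma}\big]$ covered by the theorem.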
Analysis
Notation and Bellman equation
• $V^{\pi}$: true value function under policy $\pi$
  ◦ Bellman equation: $V^{\pi} = (I - \gamma P_{\pi})^{-1} r$
• $\hat{V}^{\pi}$: estimate of value function under policy $\pi$
  ◦ Bellman equation: $\hat{V}^{\pi} = (I - \gamma \hat{P}_{\pi})^{-1} r$
• $\pi^{\star}$: optimal policy w.r.t. true value function
• $\hat{\pi}^{\star}$: optimal policy w.r.t. empirical value function