 
Breaking the Sample Size Barrier in Model-Based Reinforcement Learning
Yuting Wei, Carnegie Mellon University
Nov 2020
Gen Li (Tsinghua EE), Yuejie Chi (CMU ECE), Yuantao Gu (Tsinghua EE), Yuxin Chen (Princeton EE)
Reinforcement learning (RL)
RL challenges
• Unknown or changing environment
• Credit assignment problem
• Enormous state and action space
Provable efficiency
• Collecting samples might be expensive or impossible: sample efficiency
• Training deep RL algorithms might take a long time: computational efficiency
This talk
Question: can we design sample- and computation-efficient RL algorithms?
Inspired by numerous prior works [Kearns and Singh, 1999, Sidford et al., 2018a, Agarwal et al., 2019]...
Background: Markov decision processes
Markov decision process (MDP)
• $\mathcal{S}$: state space
• $\mathcal{A}$: action space
• $r(s, a) \in [0, 1]$: immediate reward
• $\pi(\cdot \mid s)$: policy (or action selection rule)
• $P(\cdot \mid s, a)$: unknown transition probabilities
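To make the ingredients concrete, here is a minimal sketch of a tabular MDP container. The class name `TabularMDP` and its field names are hypothetical and chosen only for illustration; the slides do not prescribe any data structure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TabularMDP:
    """Hypothetical container: rewards[s][a] in [0, 1], transitions[s][a][s_next]."""
    rewards: List[List[float]]
    transitions: List[List[List[float]]]

    def validate(self) -> None:
        """Check that rewards lie in [0, 1] and each P(.|s, a) is a distribution."""
        for s, row in enumerate(self.rewards):
            for a, r in enumerate(row):
                if not 0.0 <= r <= 1.0:
                    raise ValueError(f"reward out of [0, 1] at (s={s}, a={a})")
                probs = self.transitions[s][a]
                if min(probs) < 0.0 or abs(sum(probs) - 1.0) > 1e-9:
                    raise ValueError(f"invalid distribution at (s={s}, a={a})")


# Toy instance with 2 states and 2 actions (numbers are made up).
toy = TabularMDP(
    rewards=[[0.0, 1.0], [0.5, 0.0]],
    transitions=[[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.0, 1.0]]],
)
toy.validate()
```

In the setting of this talk the `transitions` field is exactly what the learner does not know and must infer from samples.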
Help the mouse!
• state space $\mathcal{S}$: positions in the maze
• action space $\mathcal{A}$: up, down, left, right
• immediate reward $r$: cheese, electricity shocks, cats
• policy $\pi(\cdot \mid s)$: the way to find cheese
Value function
Value function of policy $\pi$: long-term discounted reward
$$\forall s \in \mathcal{S}: \quad V^{\pi}(s) := \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\Big|\, s_{0} = s \right]$$
• $\gamma \in [0, 1)$: discount factor
• $(a_0, s_1, a_1, s_2, a_2, \cdots)$: generated under policy $\pi$
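As a small illustration of the quantity inside the expectation, the sketch below computes the discounted sum for one finite reward sequence. The function name `discounted_return` is hypothetical. Because rewards lie in $[0, 1]$, every such sum is at most $1/(1-\gamma)$, which is why accuracy levels in this talk are measured on the scale of $1/(1-\gamma)$.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma**t * rewards[t] for a finite (truncated) trajectory."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("discount factor must lie in [0, 1)")
    total = 0.0
    # Horner-style accumulation from the last reward backwards.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25.
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Averaging this quantity over many trajectories generated under $\pi$ from $s_0 = s$ would give a Monte Carlo estimate of the value at $s$, up to truncation error.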
Action-value function (a.k.a. Q-function)
Q-function of policy $\pi$
$$\forall (s, a) \in \mathcal{S} \times \mathcal{A}: \quad Q^{\pi}(s, a) := \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\Big|\, s_{0} = s, a_{0} = a \right]$$
• $(s_1, a_1, s_2, a_2, \cdots)$: generated under policy $\pi$ (the first action $a_0 = a$ is fixed)
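The two definitions are linked by a one-step expansion. The identities below are standard facts about discounted MDPs with a deterministic reward $r(s, a)$; they are not stated on the slide and are included only as a reminder.

```latex
\begin{align*}
Q^{\pi}(s, a) &= r(s, a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a)} \big[ V^{\pi}(s') \big], \\
V^{\pi}(s) &= \mathbb{E}_{a \sim \pi(\cdot \mid s)} \big[ Q^{\pi}(s, a) \big].
\end{align*}
```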
Optimal policy
• optimal policy $\pi^{\star}$: maximizing value function
• optimal value / Q function: $V^{\star} := V^{\pi^{\star}}$; $Q^{\star} := Q^{\pi^{\star}}$
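Once an optimal Q-function is available, an optimal policy can be read off by acting greedily, a standard fact for discounted MDPs. The sketch below shows that extraction step; `greedy_policy` is a hypothetical helper name and the table of Q-values is made up.

```python
from typing import List, Sequence


def greedy_policy(q_values: Sequence[Sequence[float]]) -> List[int]:
    """For each state s, return an action index maximizing q_values[s][a]."""
    return [max(range(len(row)), key=row.__getitem__) for row in q_values]


# Made-up Q-table with 2 states and 2 actions.
policy = greedy_policy([[0.1, 0.7], [0.9, 0.2]])
```

Ties are broken in favor of the smallest action index, which is one arbitrary but deterministic choice.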
Practically, learn the optimal policy from data samples . . .
This talk: sampling from a generative model
For each state-action pair $(s, a)$, collect $N$ samples $\{(s, a, s'_{(i)})\}_{1 \le i \le N}$
How many samples are sufficient to learn an $\varepsilon$-optimal policy?
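A generative model can be simulated as follows. This is a minimal sketch that assumes the true kernel is available to the simulator as nested lists `transitions[s][a][s_next]`; the function name and the dictionary layout of the output are hypothetical. The total number of samples drawn is $N |\mathcal{S}| |\mathcal{A}|$.

```python
import random
from typing import Dict, List, Sequence, Tuple


def sample_generative_model(
    transitions: Sequence[Sequence[Sequence[float]]],
    n_samples: int,
    seed: int = 0,
) -> Dict[Tuple[int, int], List[int]]:
    """Draw n_samples independent next states for every (s, a) pair."""
    rng = random.Random(seed)
    samples: Dict[Tuple[int, int], List[int]] = {}
    for s, per_action in enumerate(transitions):
        for a, probs in enumerate(per_action):
            next_states = range(len(probs))
            # random.choices samples with replacement according to the weights.
            samples[(s, a)] = rng.choices(next_states, weights=probs, k=n_samples)
    return samples


# Toy kernel with 2 states and 2 actions (numbers are made up).
toy_kernel = [[[0.0, 1.0], [0.5, 0.5]], [[1.0, 0.0], [0.3, 0.7]]]
samples = sample_generative_model(toy_kernel, n_samples=50)
```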
An incomplete list of prior art
• [Kearns and Singh, 1999] • [Kakade, 2003] • [Kearns et al., 2002] • [Azar et al., 2012] • [Azar et al., 2013] • [Sidford et al., 2018a] • [Sidford et al., 2018b] • [Wang, 2019] • [Agarwal et al., 2019] • [Wainwright, 2019a] • [Wainwright, 2019b] • [Pananjady and Wainwright, 2019] • [Yang and Wang, 2019] • [Khamaru et al., 2020] • [Mou et al., 2020] • ...
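An even shorter list of prior art

| algorithm | sample size range | sample complexity | $\varepsilon$-range |
|---|---|---|---|
| Empirical QVI [Azar et al., 2013] | $\big[\frac{\lvert\mathcal{S}\rvert^{2}\lvert\mathcal{A}\rvert}{(1-\gamma)^{2}}, \infty\big)$ | $\frac{\lvert\mathcal{S}\rvert\lvert\mathcal{A}\rvert}{(1-\gamma)^{3}\varepsilon^{2}}$ | $\big(0, \frac{1}{\sqrt{(1-\gamma)\lvert\mathcal{S}\rvert}}\big]$ |
| Sublinear randomized VI [Sidford et al., 2018b] | $\big[\frac{\lvert\mathcal{S}\rvert\lvert\mathcal{A}\rvert}{(1-\gamma)^{2}}, \infty\big)$ | $\frac{\lvert\mathcal{S}\rvert\lvert\mathcal{A}\rvert}{(1-\gamma)^{4}\varepsilon^{2}}$ | $\big(0, \frac{1}{1-\gamma}\big]$ |
| Variance-reduced QVI [Sidford et al., 2018a] | $\big[\frac{\lvert\mathcal{S}\rvert\lvert\mathcal{A}\rvert}{(1-\gamma)^{3}}, \infty\big)$ | $\frac{\lvert\mathcal{S}\rvert\lvert\mathcal{A}\rvert}{(1-\gamma)^{3}\varepsilon^{2}}$ | $(0, 1]$ |
| Randomized primal-dual [Wang, 2019] | $\big[\frac{\lvert\mathcal{S}\rvert\lvert\mathcal{A}\rvert}{(1-\gamma)^{2}}, \infty\big)$ | $\frac{\lvert\mathcal{S}\rvert\lvert\mathcal{A}\rvert}{(1-\gamma)^{4}\varepsilon^{2}}$ | $\big(0, \frac{1}{1-\gamma}\big]$ |
| Empirical MDP + planning [Agarwal et al., 2019] | $\big[\frac{\lvert\mathcal{S}\rvert\lvert\mathcal{A}\rvert}{(1-\gamma)^{2}}, \infty\big)$ | $\frac{\lvert\mathcal{S}\rvert\lvert\mathcal{A}\rvert}{(1-\gamma)^{3}\varepsilon^{2}}$ | $\big(0, \frac{1}{\sqrt{1-\gamma}}\big]$ |

important parameters:
• # states $|\mathcal{S}|$, # actions $|\mathcal{A}|$
• the discounted complexity $\frac{1}{1-\gamma}$
• approximation error $\varepsilon \in \big(0, \frac{1}{1-\gamma}\big]$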
All prior theory requires sample size $> \frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{2}}$ (the sample size barrier)
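The arithmetic behind the barrier can be checked numerically. The sketch below compares the barrier $|\mathcal{S}||\mathcal{A}|/(1-\gamma)^{2}$ with the best sample complexity order in the table above, $|\mathcal{S}||\mathcal{A}|/((1-\gamma)^{3}\varepsilon^{2})$, ignoring constants and logarithmic factors. The function names and the numbers in the example are assumptions made for illustration. The two expressions cross at $\varepsilon = 1/\sqrt{1-\gamma}$: for coarser accuracy targets, the order-optimal sample size falls below the barrier, which is the regime prior theory did not cover.

```python
import math


def sample_size_barrier(n_states: int, n_actions: int, gamma: float) -> float:
    """|S||A| / (1 - gamma)^2, the threshold required by prior theory."""
    return n_states * n_actions / (1.0 - gamma) ** 2


def order_optimal_sample_size(
    n_states: int, n_actions: int, gamma: float, eps: float
) -> float:
    """|S||A| / ((1 - gamma)^3 eps^2), constants and log factors dropped."""
    return n_states * n_actions / ((1.0 - gamma) ** 3 * eps ** 2)


def crossing_accuracy(gamma: float) -> float:
    """Accuracy level at which the two quantities above coincide."""
    return 1.0 / math.sqrt(1.0 - gamma)


# Made-up instance: 10 states, 10 actions, gamma = 0.99.
barrier = sample_size_barrier(10, 10, 0.99)
coarse = order_optimal_sample_size(10, 10, 0.99, eps=20.0)  # eps above crossing
fine = order_optimal_sample_size(10, 10, 0.99, eps=5.0)     # eps below crossing
```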
This talk: break the sample complexity barrier
Two approaches
Model-based approach ("plug-in")
1. build empirical estimate $\widehat{P}$ for $P$
2. planning based on empirical $\widehat{P}$
Model-free approach: learning without constructing a model explicitly
Model estimation
Sampling: for each $(s, a)$, collect $N$ independent samples $\{(s, a, s'_{(i)})\}_{1 \le i \le N}$
Empirical estimates: estimate $\widehat{P}(s' \mid s, a)$ by the empirical frequency
$$\widehat{P}(s' \mid s, a) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{ s'_{(i)} = s' \}$$
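The empirical-frequency estimate amounts to counting. Below is a minimal sketch, assuming the samples are stored as a dictionary mapping each pair `(s, a)` to its list of observed next states; that layout and the name `empirical_kernel` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def empirical_kernel(
    samples: Dict[Tuple[int, int], List[int]], n_states: int
) -> Dict[Tuple[int, int], List[float]]:
    """Estimate P(s_next | s, a) by the fraction of samples that landed in s_next."""
    p_hat: Dict[Tuple[int, int], List[float]] = {}
    for (s, a), next_states in samples.items():
        counts = Counter(next_states)
        n = len(next_states)
        # Counter returns 0 for next states that were never observed.
        p_hat[(s, a)] = [counts[s_next] / n for s_next in range(n_states)]
    return p_hat


# Four made-up samples for a single state-action pair.
p_hat = empirical_kernel({(0, 0): [0, 1, 1, 1]}, n_states=2)
```

With small $N$, many entries of the estimate are exactly zero even when the true probability is positive, which is one way to see why the estimate cannot be accurate entry by entry in the sample-starved regime.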
Our method: plug-in estimator + perturbation
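The slide's figure did not survive extraction, so the form of the perturbation is not stated here. The sketch below assumes that it consists of adding small i.i.d. uniform noise to the rewards before planning on the empirical MDP; that choice, the noise level `xi`, the iteration count, and all names are assumptions made for illustration, not the authors' exact procedure. Planning is done with plain Q-value iteration.

```python
import random
from typing import Dict, List, Sequence, Tuple


def perturbed_plugin_policy(
    p_hat: Dict[Tuple[int, int], Sequence[float]],
    rewards: Sequence[Sequence[float]],
    gamma: float,
    xi: float = 1e-3,      # assumed perturbation size
    n_iters: int = 500,    # assumed planning budget
    seed: int = 0,
) -> List[int]:
    """Plan on the empirical MDP with randomly perturbed rewards."""
    rng = random.Random(seed)
    n_states, n_actions = len(rewards), len(rewards[0])
    # Assumed perturbation: r_p(s, a) = r(s, a) + Uniform(0, xi) noise.
    r_p = [[rewards[s][a] + rng.uniform(0.0, xi) for a in range(n_actions)]
           for s in range(n_states)]
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        v = [max(row) for row in q]
        # Bellman optimality update under the empirical kernel p_hat.
        q = [[r_p[s][a] + gamma * sum(p * x for p, x in zip(p_hat[(s, a)], v))
              for a in range(n_actions)] for s in range(n_states)]
    # Greedy policy with respect to the final Q-estimate.
    return [row.index(max(row)) for row in q]


# Made-up empirical kernel: state 1 is absorbing and pays reward 1 under action 0.
p_hat = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0],
         (1, 0): [0.0, 1.0], (1, 1): [0.0, 1.0]}
policy = perturbed_plugin_policy(p_hat, [[0.0, 0.0], [1.0, 0.0]], gamma=0.9)
```

One plausible reading of the design is that continuous noise makes exact ties between actions a probability-zero event, so the greedy policy of the perturbed empirical MDP is well separated from its competitors.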
Challenges in the sample-starved regime
truth: $P \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}|}$; empirical estimate: $\widehat{P}$
• can't recover $P$ faithfully if sample size $\ll |\mathcal{S}|^{2}|\mathcal{A}|$!
Can we trust our policy estimate when reliable model estimation is infeasible?
Main result
Theorem (Li, Wei, Chi, Gu, Chen '20). For every $0 < \varepsilon \le \frac{1}{1-\gamma}$, the policy $\widehat{\pi}^{\star}_{\mathrm{p}}$ of the perturbed empirical MDP achieves
$$\big\| V^{\widehat{\pi}^{\star}_{\mathrm{p}}} - V^{\star} \big\|_{\infty} \le \varepsilon \quad \text{and} \quad \big\| Q^{\widehat{\pi}^{\star}_{\mathrm{p}}} - Q^{\star} \big\|_{\infty} \le \gamma \varepsilon$$
with sample complexity at most
$$\widetilde{O}\left( \frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{3} \varepsilon^{2}} \right).$$
• $\widehat{\pi}^{\star}_{\mathrm{p}}$: obtained by empirical QVI or PI within $\widetilde{O}\big(\frac{1}{1-\gamma}\big)$ iterations
• minimax lower bound: $\widetilde{\Omega}\big( \frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{3} \varepsilon^{2}} \big)$ [Azar et al., 2013]
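The iteration count for Q-value iteration can be motivated by the usual contraction argument. The derivation below is a standard sketch, not taken from the slides; it assumes rewards in $[0, 1]$, initialization $Q_0 = 0$, and a target planning accuracy $\delta$ (a symbol introduced here for illustration).

```latex
\begin{align*}
\|Q_{k} - Q^{\star}\|_{\infty}
  &\le \gamma^{k} \, \|Q_{0} - Q^{\star}\|_{\infty}
   \le \frac{\gamma^{k}}{1-\gamma}
   && \text{(Bellman operator is a $\gamma$-contraction)} \\
\frac{\gamma^{k}}{1-\gamma} \le \delta
  &\iff k \ge \frac{\log \frac{1}{(1-\gamma)\delta}}{\log \frac{1}{\gamma}} \\
k = \frac{1}{1-\gamma} \log \frac{1}{(1-\gamma)\delta}
  &\text{ suffices, since } \log \tfrac{1}{\gamma} \ge 1-\gamma .
\end{align*}
```

Up to the logarithmic factor, this is the $\frac{1}{1-\gamma}$ scaling quoted for the number of iterations.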
A sketch of the main proof ingredients
Notation and Bellman equation
• $V^{\pi}$: true value function under policy $\pi$
  ▸ Bellman equation: $V^{\pi} = (I - \gamma P_{\pi})^{-1} r$ [Sutton and Barto, 2018]
• $\widehat{V}^{\pi}$: estimate of value function under policy $\pi$
  ▸ Bellman equation: $\widehat{V}^{\pi} = (I - \gamma \widehat{P}_{\pi})^{-1} r$
• $\pi^{\star}$: optimal policy w.r.t. true value function
• $\widehat{\pi}^{\star}$: optimal policy w.r.t. empirical value function
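The matrix form of the Bellman equation can be checked numerically. The sketch below uses fixed-point iteration on $V \leftarrow r + \gamma P_{\pi} V$ rather than an explicit matrix inverse; its limit solves $(I - \gamma P_{\pi}) V = r$. The function name, tolerance, and toy numbers are assumptions made for illustration. Passing an empirical kernel in place of the true one would give the plug-in value estimate.

```python
from typing import List, Sequence


def evaluate_policy(
    p_pi: Sequence[Sequence[float]],
    r_pi: Sequence[float],
    gamma: float,
    tol: float = 1e-12,
    max_iters: int = 100_000,
) -> List[float]:
    """Iterate V <- r + gamma * P_pi V until successive iterates differ by < tol."""
    n = len(r_pi)
    v = [0.0] * n
    for _ in range(max_iters):
        v_new = [r_pi[s] + gamma * sum(p_pi[s][t] * v[t] for t in range(n))
                 for s in range(n)]
        if max(abs(x - y) for x, y in zip(v_new, v)) < tol:
            return v_new
        v = v_new
    return v


# Made-up 2-state chain under a fixed policy: state 1 is absorbing with reward 1.
p_pi = [[0.5, 0.5], [0.0, 1.0]]
r_pi = [0.0, 1.0]
v = evaluate_policy(p_pi, r_pi, gamma=0.5)
# Residual of the linear system (I - gamma * P_pi) V = r, row by row.
residual = [v[s] - 0.5 * sum(p_pi[s][t] * v[t] for t in range(2)) - r_pi[s]
            for s in range(2)]
```

For this toy chain the linear system can be solved by hand, giving $V(1) = 2$ and $V(0) = 2/3$.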