Breaking the Sample Size Barrier in Model-Based Reinforcement Learning

Yuting Wei, Carnegie Mellon University, Nov. 2020

Joint work with Gen Li (Tsinghua EE), Yuejie Chi (CMU ECE), Yuantao Gu (Tsinghua EE), Yuxin Chen (Princeton EE)

Reinforcement learning (RL)
RL challenges
- Unknown or changing environment
- Credit assignment problem
- Enormous state and action space
Provable efficiency
- Collecting samples might be expensive or impossible: need sample efficiency
- Training deep RL algorithms might take a long time: need computational efficiency
This talk
Question: can we design sample- and computation-efficient RL algorithms?
inspired by a large body of prior work [Kearns and Singh, 1999, Sidford et al., 2018a, Agarwal et al., 2019] ...
Background: Markov decision processes
Markov decision process (MDP)

- $\mathcal{S}$: state space
- $\mathcal{A}$: action space
- $r(s, a) \in [0, 1]$: immediate reward
- $\pi(\cdot \mid s)$: policy (or action selection rule)
- $P(\cdot \mid s, a)$: unknown transition probabilities
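To make these objects concrete, here is a minimal sketch of a tabular MDP represented as NumPy arrays. Everything in it (the sizes, the randomly drawn model, and the names n_states, n_actions, P, r, pi) is an illustrative placeholder rather than anything from the talk.

```python
import numpy as np

# A toy tabular MDP: |S| states, |A| actions, rewards r(s, a) in [0, 1],
# and a row-stochastic transition tensor P[s, a, :] = P(. | s, a).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 20, 4, 0.9

r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))               # r(s, a)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))    # P(s' | s, a)

# A stationary (here: random) policy pi(. | s): one action distribution per state.
pi = rng.dirichlet(np.ones(n_actions), size=n_states)
```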
Help the mouse!

- state space $\mathcal{S}$: positions in the maze
- action space $\mathcal{A}$: up, down, left, right
- immediate reward $r$: cheese, electricity shocks, cats
- policy $\pi(\cdot \mid s)$: the way to find the cheese
Value function

Value function of policy $\pi$: long-term discounted reward
$$\forall s \in \mathcal{S}: \quad V^\pi(s) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s\right]$$

- $\gamma \in [0, 1)$: discount factor
- $(a_0, s_1, a_1, s_2, a_2, \cdots)$: generated under policy $\pi$
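As a concrete reading of this definition, the sketch below estimates $V^\pi(s)$ by truncated Monte Carlo rollouts on the toy MDP from the earlier snippet; the horizon and rollout counts are arbitrary choices, not from the talk.

```python
import numpy as np

def rollout_value(P, r, pi, gamma, s0, horizon=200, n_rollouts=2000, seed=1):
    """Monte Carlo estimate of V^pi(s0) = E[ sum_t gamma^t r_t | s_0 = s0 ]."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = r.shape
    total = 0.0
    for _ in range(n_rollouts):
        s, discount, ret = s0, 1.0, 0.0
        for _ in range(horizon):                   # truncate the infinite sum
            a = rng.choice(n_actions, p=pi[s])     # a_t ~ pi(. | s_t)
            ret += discount * r[s, a]              # accumulate gamma^t * r_t
            s = rng.choice(n_states, p=P[s, a])    # s_{t+1} ~ P(. | s_t, a_t)
            discount *= gamma
        total += ret
    return total / n_rollouts
```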
Action-value function (a.k.a. Q-function)

Q-function of policy $\pi$:
$$\forall (s, a) \in \mathcal{S} \times \mathcal{A}: \quad Q^\pi(s, a) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s,\, a_0 = a\right]$$

- $(s_1, a_1, s_2, a_2, \cdots)$: generated under policy $\pi$
Optimal policy

- optimal policy $\pi^\star$: maximizes the value function
- optimal value / Q-function: $V^\star := V^{\pi^\star}$, $Q^\star := Q^{\pi^\star}$
In practice, one must learn the optimal policy from data samples ...
This talk: sampling from a generative model

For each state-action pair $(s, a)$, collect $N$ samples $\{(s, a, s'^{(i)})\}_{1 \le i \le N}$.

How many samples are sufficient to learn an $\varepsilon$-optimal policy?
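A generative model in this sense is just an oracle that, for any $(s, a)$, returns i.i.d. draws from $P(\cdot \mid s, a)$. A minimal sketch, simulated here with the toy transition tensor P from the earlier snippets (in practice the true P is of course unknown):

```python
import numpy as np

def sample_generative_model(P, N, seed=2):
    """For every (s, a), draw N i.i.d. next states s'^(1), ..., s'^(N) ~ P(. | s, a)."""
    rng = np.random.default_rng(seed)
    n_states, n_actions, _ = P.shape
    samples = np.empty((n_states, n_actions, N), dtype=np.int64)
    for s in range(n_states):
        for a in range(n_actions):
            samples[s, a] = rng.choice(n_states, size=N, p=P[s, a])
    return samples
```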
An incomplete list of prior art
- [Kearns and Singh, 1999]
- [Kakade, 2003]
- [Kearns et al., 2002]
- [Azar et al., 2012]
- [Azar et al., 2013]
- [Sidford et al., 2018a]
- [Sidford et al., 2018b]
- [Wang, 2019]
- [Agarwal et al., 2019]
- [Wainwright, 2019a]
- [Wainwright, 2019b]
- [Pananjady and Wainwright, 2019]
- [Yang and Wang, 2019]
- [Khamaru et al., 2020]
- [Mou et al., 2020]
- . . .
An even shorter list of prior art

| algorithm | sample size range | sample complexity | $\varepsilon$-range |
| --- | --- | --- | --- |
| Empirical QVI [Azar et al., 2013] | $\big[\tfrac{|\mathcal{S}|^2|\mathcal{A}|}{(1-\gamma)^2}, \infty\big)$ | $\tfrac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2}$ | $\big(0, \tfrac{1}{\sqrt{(1-\gamma)|\mathcal{S}|}}\big]$ |
| Sublinear randomized VI [Sidford et al., 2018b] | $\big[\tfrac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}, \infty\big)$ | $\tfrac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$ | $\big(0, \tfrac{1}{1-\gamma}\big]$ |
| Variance-reduced QVI [Sidford et al., 2018a] | $\big[\tfrac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3}, \infty\big)$ | $\tfrac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2}$ | $(0, 1]$ |
| Randomized primal-dual [Wang, 2019] | $\big[\tfrac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}, \infty\big)$ | $\tfrac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$ | $\big(0, \tfrac{1}{1-\gamma}\big]$ |
| Empirical MDP + planning [Agarwal et al., 2019] | $\big[\tfrac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}, \infty\big)$ | $\tfrac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2}$ | $\big(0, \tfrac{1}{\sqrt{1-\gamma}}\big]$ |

Important parameters:
- # states $|\mathcal{S}|$, # actions $|\mathcal{A}|$
- the discount complexity $\frac{1}{1-\gamma}$
- approximation error $\varepsilon \in \big(0, \frac{1}{1-\gamma}\big]$
All prior theory requires a sample size $\gtrsim \frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$ (the "sample size barrier").
This talk: break the sample complexity barrier
Two approaches

Model-based approach ("plug-in"):
1. build an empirical estimate $\hat{P}$ of $P$
2. plan based on the empirical $\hat{P}$

Model-free approach: learning without constructing a model explicitly
Model estimation

Sampling: for each $(s, a)$, collect $N$ independent samples $\{(s, a, s'^{(i)})\}_{1 \le i \le N}$.

Empirical estimate (empirical frequency):
$$\hat{P}(s' \mid s, a) \;=\; \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{s'^{(i)} = s'\}$$
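Turning the sampled next states into these empirical frequencies takes a few lines; a sketch continuing the toy example (the function name is mine):

```python
import numpy as np

def estimate_model(samples, n_states):
    """Empirical frequency estimate: P_hat(s' | s, a) = (1/N) * #{i : s'^(i) = s'}."""
    n_s, n_a, N = samples.shape
    P_hat = np.zeros((n_s, n_a, n_states))
    for s in range(n_s):
        for a in range(n_a):
            counts = np.bincount(samples[s, a], minlength=n_states)
            P_hat[s, a] = counts / N
    return P_hat
```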
Our method: plug-in estimator + perturbation
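For the planning step, the talk mentions empirical QVI / PI; as a stand-in, here is a plain Q-value-iteration sketch run on the empirical model $\hat{P}$. The iteration count is an arbitrary placeholder, and the perturbation step is sketched separately later.

```python
import numpy as np

def plan_value_iteration(P_hat, r, gamma, n_iters=1000):
    """Plan in the empirical MDP (P_hat, r): standard Q-value iteration,
    then act greedily. Returns the greedy policy and its Q estimate."""
    n_states, n_actions, _ = P_hat.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        V = Q.max(axis=1)                 # V(s) = max_a Q(s, a)
        Q = r + gamma * (P_hat @ V)       # Bellman optimality update
    return Q.argmax(axis=1), Q            # greedy (deterministic) policy
```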
Challenges in the sample-starved regime

Truth: $P \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}|}$; empirical estimate: $\hat{P}$.
- cannot recover $P$ faithfully if the sample size $\ll |\mathcal{S}|^2|\mathcal{A}|$!

Can we trust our policy estimate when reliable model estimation is infeasible?
Main result

Theorem (Li, Wei, Chi, Gu, Chen '20). For every $0 < \varepsilon \le \frac{1}{1-\gamma}$, the policy $\hat{\pi}^\star_p$ of the perturbed empirical MDP achieves
$$\|V^{\hat{\pi}^\star_p} - V^\star\|_\infty \le \varepsilon \quad \text{and} \quad \|Q^{\hat{\pi}^\star_p} - Q^\star\|_\infty \le \varepsilon$$
with sample complexity at most $\widetilde{O}\Big(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3 \varepsilon^2}\Big)$.

- $\hat{\pi}^\star_p$: obtained by running empirical QVI or PI for $\widetilde{O}\big(\frac{1}{1-\gamma}\big)$ iterations
- matches the minimax lower bound $\widetilde{\Omega}\big(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3 \varepsilon^2}\big)$ [Azar et al., 2013]
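As a purely illustrative usage note, chaining the earlier sketches reproduces the plug-in pipeline the theorem refers to, minus the reward perturbation discussed below; the value of N here is an arbitrary toy choice, not the theoretical requirement.

```python
# Assumes the earlier sketches (P, r, gamma, n_states, sample_generative_model,
# estimate_model, plan_value_iteration) are in scope.
N = 500
samples = sample_generative_model(P, N)
P_hat = estimate_model(samples, n_states)
pi_hat, Q_hat = plan_value_iteration(P_hat, r, gamma)
```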
A sketch of the main proof ingredients
Notation and Bellman equation

- $V^\pi$: true value function under policy $\pi$
  - Bellman equation: $V^\pi = (I - \gamma P_\pi)^{-1} r$ [Sutton and Barto, 2018]
- $\hat{V}^\pi$: estimate of the value function under policy $\pi$
  - Bellman equation: $\hat{V}^\pi = (I - \gamma \hat{P}_\pi)^{-1} r$
- $\pi^\star$: optimal policy w.r.t. the true value function
- $\hat{\pi}^\star$: optimal policy w.r.t. the empirical value function
- $V^\star := V^{\pi^\star}$: optimal values under the true model
- $\hat{V}^\star := \hat{V}^{\hat{\pi}^\star}$: optimal values under the empirical model
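The matrix form of the Bellman equation translates directly into code: for a fixed policy, $V^\pi$ solves a linear system. A small sketch in the toy setup, where $P_\pi$ and $r_\pi$ denote the policy-induced transition matrix and reward vector:

```python
import numpy as np

def evaluate_policy(P, r, pi, gamma):
    """Exact policy evaluation via the Bellman equation V^pi = (I - gamma P_pi)^{-1} r_pi."""
    n_states = r.shape[0]
    # Policy-induced quantities:
    #   P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a),   r_pi[s] = sum_a pi(a|s) r(s, a)
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, r)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```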
Proof ideas (cont.)

Elementary decomposition:
$$V^\star - V^{\hat{\pi}^\star} = \big(V^\star - \hat{V}^{\pi^\star}\big) + \big(\hat{V}^{\pi^\star} - \hat{V}^{\hat{\pi}^\star}\big) + \big(\hat{V}^{\hat{\pi}^\star} - V^{\hat{\pi}^\star}\big) \;\le\; \big(V^\star - \hat{V}^{\pi^\star}\big) + 0 + \big(\hat{V}^{\hat{\pi}^\star} - V^{\hat{\pi}^\star}\big)$$
(the middle term is at most $0$ since $\hat{\pi}^\star$ is optimal for the empirical MDP)

- Step 1: control $V^\pi - \hat{V}^\pi$ for a fixed $\pi$ (Bernstein's inequality + high-order decomposition)
- Step 2: control $\hat{V}^{\hat{\pi}^\star} - V^{\hat{\pi}^\star}$ (decouple statistical dependence)
Step 1: high-order decomposition

Bellman equation: $V^\pi = (I - \gamma P_\pi)^{-1} r$

[Agarwal et al., 2019]:
$$\hat{V}^\pi - V^\pi = \gamma (I - \gamma P_\pi)^{-1} \big(\hat{P}_\pi - P_\pi\big) \hat{V}^\pi \qquad (\star)$$

[ours]:
$$\hat{V}^\pi - V^\pi = \gamma (I - \gamma P_\pi)^{-1} \big(\hat{P}_\pi - P_\pi\big) V^\pi + \gamma^2 \Big((I - \gamma P_\pi)^{-1} \big(\hat{P}_\pi - P_\pi\big)\Big)^2 V^\pi + \gamma^3 \Big((I - \gamma P_\pi)^{-1} \big(\hat{P}_\pi - P_\pi\big)\Big)^3 V^\pi + \ldots$$

Bernstein's inequality:
$$\big|\big(\hat{P}_\pi - P_\pi\big) V^\pi\big| \;\lesssim\; \sqrt{\frac{\mathrm{Var}[V^\pi]}{N}} + \frac{\|V^\pi\|_\infty}{N}$$
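Both decompositions start from the first-order identity $(\star)$, which follows by subtracting the two Bellman equations. Below is a quick numeric sanity check of $(\star)$ on the toy model and its empirical estimate; it is an illustration only, not part of the proof.

```python
import numpy as np

def check_first_order_identity(P, P_hat, r, pi, gamma):
    """Numerically verify (star):
    V_hat^pi - V^pi = gamma * (I - gamma P_pi)^{-1} (P_hat_pi - P_pi) V_hat^pi."""
    n_states = r.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)
    P_hat_pi = np.einsum("sa,sat->st", pi, P_hat)
    r_pi = np.einsum("sa,sa->s", pi, r)
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)          # V^pi
    V_hat = np.linalg.solve(np.eye(n_states) - gamma * P_hat_pi, r_pi)  # V_hat^pi
    lhs = V_hat - V
    rhs = gamma * np.linalg.solve(np.eye(n_states) - gamma * P_pi,
                                  (P_hat_pi - P_pi) @ V_hat)
    return np.max(np.abs(lhs - rhs))   # ~ 0 up to floating-point error
```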
Byproduct: policy evaluation

Theorem (Li, Wei, Chi, Gu, Chen '20). Fix any policy $\pi$. For every $0 < \varepsilon \le \frac{1}{1-\gamma}$, the plug-in estimator $\hat{V}^\pi$ obeys $\|\hat{V}^\pi - V^\pi\|_\infty \le \varepsilon$ with sample complexity at most $\widetilde{O}\Big(\frac{|\mathcal{S}|}{(1-\gamma)^3 \varepsilon^2}\Big)$.

- matches the minimax lower bound [Azar et al., 2013, Pananjady and Wainwright, 2019]
- tackles the sample size barrier: prior work requires a sample size $\gtrsim \frac{|\mathcal{S}|}{(1-\gamma)^2}$ [Agarwal et al., 2019, Pananjady and Wainwright, 2019, Khamaru et al., 2020]
Step 2: controlling $\hat{V}^{\hat{\pi}^\star} - V^{\hat{\pi}^\star}$

A natural idea: apply our policy evaluation theory + a union bound over all policies
- highly suboptimal (there are $|\mathcal{A}|^{|\mathcal{S}|}$ deterministic policies)!

Key idea 2: a leave-one-out argument to decouple the statistical dependency between $\hat{\pi}^\star$ and the samples
- inspired by [Agarwal et al., 2019] but quite different ...
Key idea 2: leave-one-out argument

- state-action absorbing MDP for each $(s, a)$: $(\mathcal{S}, \mathcal{A}, \hat{P}^{(s,a)}, r, \gamma)$
- $(\hat{P} - P)_{s,a}\, \hat{V}^{\hat{\pi}^\star} = (\hat{P} - P)_{s,a}\, \hat{V}^{\hat{\pi}^\star_{s,a}}$, where $\hat{\pi}^\star_{s,a}$ is the optimal policy of the new MDP

Caveat: requires $\hat{\pi}^\star$ to stand out from other policies.
Key idea 3: tie-breaking via perturbation

- How to ensure the optimal policy stands out from the other policies?
$$\forall s \in \mathcal{S}: \quad \hat{Q}^\star\big(s, \hat{\pi}^\star(s)\big) > \max_{a \neq \hat{\pi}^\star(s)} \hat{Q}^\star(s, a)\,!$$
- Solution: slightly perturb the rewards $r \;\Longrightarrow\; \hat{\pi}^\star_p$
  - ensures the uniqueness of $\hat{\pi}^\star_p$
  - $V^{\hat{\pi}^\star_p} \approx V^{\hat{\pi}^\star}$
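The tie-breaking step itself is simple to implement. A minimal sketch in the toy pipeline, reusing plan_value_iteration from above; the perturbation scale xi is an illustrative placeholder, not the value prescribed by the analysis.

```python
import numpy as np

def perturb_and_plan(P_hat, r, gamma, xi=1e-6, seed=3):
    """Randomly perturb rewards to break ties, then plan in the perturbed empirical MDP.
    The resulting greedy policy is unique with probability 1, and perturbing rewards by
    at most xi changes values by at most xi / (1 - gamma)."""
    rng = np.random.default_rng(seed)
    r_p = r + xi * rng.uniform(size=r.shape)        # slightly perturbed rewards
    pi_hat_p, Q_hat_p = plan_value_iteration(P_hat, r_p, gamma)
    return pi_hat_p, Q_hat_p
```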
Concluding remarks

Understanding RL requires modern statistics and optimization.

Future directions:
- beyond the tabular setting [Feng et al., 2020, Jin et al., 2019, Duan and Wang, 2020]
- finite-horizon episodic MDPs [Dann and Brunskill, 2015, Jiang and Agarwal, 2018, Wang et al., 2020]
Paper: "Breaking the sample size barrier in model-based reinforcement learning with a generative model," G. Li, Y. Wei, Y. Chi, Y. Gu, Y. Chen, arXiv:2005.12900, 2020.