 
Finite-Sample Analysis in Reinforcement Learning
Mohammad Ghavamzadeh
INRIA Lille – Nord Europe, Team SequeL
Outline
1. Introduction to RL and DP
2. Approximate Dynamic Programming (AVI & API)
3. How does Statistical Learning Theory come into the picture?
4. Error Propagation (AVI & API Error Propagation)
5. An AVI Algorithm (Fitted Q-Iteration)
   - FQI: error at each iteration
   - Final performance bound of FQI
6. An API Algorithm (Least-Squares Policy Iteration)
   - Error at each iteration (LSTD error)
   - Final performance bound of LSPI
7. Discussion
Sequential Decision-Making under Uncertainty
- Move around in the physical world (e.g. driving, navigation)
- Play and win a game
- Retrieve information over the web
- Medical diagnosis and treatment
- Maximize the throughput of a factory
- Optimize the performance of a rescue team
Reinforcement Learning (RL)
RL: A class of learning problems in which an agent interacts with a dynamic, stochastic, and incompletely known environment.
Goal: Learn an action-selection strategy, or policy, to optimize some measure of its long-term performance.
Interaction: Modeled as an MDP or a POMDP.
Markov Decision Process (MDP)
An MDP $M$ is a tuple $\langle \mathcal{X}, \mathcal{A}, r, p, \gamma \rangle$.
- The state space $\mathcal{X}$ is a bounded closed subset of $\mathbb{R}^d$.
- The set of actions $\mathcal{A}$ is finite ($|\mathcal{A}| < \infty$).
- The reward function $r : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is bounded by $R_{\max}$.
- The transition model $p(\cdot \mid x, a)$ is a distribution over $\mathcal{X}$.
- $\gamma \in (0, 1)$ is a discount factor.
Policy: a mapping from states to actions, $\pi(x) \in \mathcal{A}$.
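Value Function
For a policy $\pi$:
Value function $V^\pi : \mathcal{X} \to \mathbb{R}$
$$V^\pi(x) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \, r\big(X_t, \pi(X_t)\big) \,\middle|\, X_0 = x \right]$$
Action-value function $Q^\pi : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$
$$Q^\pi(x, a) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \, r(X_t, A_t) \,\middle|\, X_0 = x,\ A_0 = a \right]$$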
Notation: Bellman Operator
Bellman operator for policy $\pi$: $\mathcal{T}^\pi : B_V(\mathcal{X}; V_{\max}) \to B_V(\mathcal{X}; V_{\max})$
$V^\pi$ is the unique fixed point of the Bellman operator
$$(\mathcal{T}^\pi V)(x) = r\big(x, \pi(x)\big) + \gamma \int_{\mathcal{X}} p\big(dy \mid x, \pi(x)\big)\, V(y)$$
The action-value function $Q^\pi$ is defined as
$$Q^\pi(x, a) = r(x, a) + \gamma \int_{\mathcal{X}} p(dy \mid x, a)\, V^\pi(y)$$
Optimal Value Function and Optimal Policy
Optimal value function: $V^*(x) = \sup_{\pi} V^\pi(x) \quad \forall x \in \mathcal{X}$
Optimal action-value function: $Q^*(x, a) = \sup_{\pi} Q^\pi(x, a) \quad \forall x \in \mathcal{X},\ \forall a \in \mathcal{A}$
A policy $\pi$ is optimal if $V^\pi(x) = V^*(x) \quad \forall x \in \mathcal{X}$.
Notation: Bellman Optimality Operator
Bellman optimality operator: $\mathcal{T} : B_V(\mathcal{X}; V_{\max}) \to B_V(\mathcal{X}; V_{\max})$
$V^*$ is the unique fixed point of the Bellman optimality operator
$$(\mathcal{T} V)(x) = \max_{a \in \mathcal{A}} \left[ r(x, a) + \gamma \int_{\mathcal{X}} p(dy \mid x, a)\, V(y) \right]$$
The optimal action-value function $Q^*$ is defined as
$$Q^*(x, a) = r(x, a) + \gamma \int_{\mathcal{X}} p(dy \mid x, a)\, V^*(y)$$
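Properties of Bellman Operators
For all $V_1, V_2 \in B_V(\mathcal{X}; V_{\max})$:
Monotonicity: if $V_1 \le V_2$ component-wise, then
$$\mathcal{T}^\pi V_1 \le \mathcal{T}^\pi V_2 \quad \text{and} \quad \mathcal{T} V_1 \le \mathcal{T} V_2$$
Max-norm contraction:
$$\|\mathcal{T}^\pi V_1 - \mathcal{T}^\pi V_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$$
$$\|\mathcal{T} V_1 - \mathcal{T} V_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$$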
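The contraction property can be checked numerically. The sketch below assumes a small finite MDP (3 states, 2 actions, made-up rewards and transition probabilities that are not from the slides), so that the integral over the state space becomes a sum. It draws random pairs of value functions and records the largest observed ratio between the distance after and before applying the Bellman optimality operator, which should not exceed the discount factor.

```python
import random

# Hypothetical 3-state, 2-action MDP (illustrative numbers, not from the slides).
GAMMA = 0.9
R = [[0.0, 1.0], [0.5, 0.0], [1.0, 0.2]]      # R[x][a]: reward
P = [                                         # P[x][a][y]: transition probability
    [[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
    [[0.0, 0.9, 0.1], [0.5, 0.0, 0.5]],
    [[0.3, 0.3, 0.4], [0.0, 0.0, 1.0]],
]
N_STATES, N_ACTIONS = len(R), len(R[0])


def bellman_optimality(V):
    """Apply (T V)(x) = max_a [ r(x,a) + gamma * sum_y p(y|x,a) V(y) ]."""
    return [
        max(
            R[x][a] + GAMMA * sum(P[x][a][y] * V[y] for y in range(N_STATES))
            for a in range(N_ACTIONS)
        )
        for x in range(N_STATES)
    ]


def sup_norm(U, V):
    """Max-norm distance between two value functions."""
    return max(abs(u - v) for u, v in zip(U, V))


rng = random.Random(0)
worst_ratio = 0.0
for _ in range(1000):
    V1 = [rng.uniform(-10, 10) for _ in range(N_STATES)]
    V2 = [rng.uniform(-10, 10) for _ in range(N_STATES)]
    ratio = sup_norm(bellman_optimality(V1), bellman_optimality(V2)) / sup_norm(V1, V2)
    worst_ratio = max(worst_ratio, ratio)

print("largest observed ratio:", worst_ratio, "discount factor:", GAMMA)
```

A random search like this only illustrates the inequality; it is not a proof of it.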
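Dynamic Programming Algorithms: Value Iteration
- Start with an arbitrary action-value function $Q_0$.
- At each iteration $k$: $Q_{k+1} = \mathcal{T} Q_k$.
Convergence: $\lim_{k \to \infty} V_k = V^*$ (stated here for the value-function iterates $V_{k+1} = \mathcal{T} V_k$; the same argument applies to $Q_k$).
$$\|V^* - V_{k+1}\|_\infty = \|\mathcal{T} V^* - \mathcal{T} V_k\|_\infty \le \gamma \|V^* - V_k\|_\infty \le \gamma^{k+1} \|V^* - V_0\|_\infty \xrightarrow{k \to \infty} 0$$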
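A minimal sketch of value iteration on action-value functions follows. It assumes a hypothetical finite MDP (the same kind of made-up numbers as any toy example, not taken from the slides) and writes the Bellman optimality operator on Q as $(\mathcal{T} Q)(x,a) = r(x,a) + \gamma \sum_y p(y \mid x,a) \max_b Q(y,b)$, which is the standard Q-form of the operator defined above for V.

```python
# Hypothetical 3-state, 2-action MDP (illustrative numbers, not from the slides).
GAMMA = 0.9
R = [[0.0, 1.0], [0.5, 0.0], [1.0, 0.2]]      # R[x][a]: reward
P = [                                         # P[x][a][y]: transition probability
    [[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
    [[0.0, 0.9, 0.1], [0.5, 0.0, 0.5]],
    [[0.3, 0.3, 0.4], [0.0, 0.0, 1.0]],
]
N_STATES, N_ACTIONS = len(R), len(R[0])


def bellman_optimality_q(Q):
    """(T Q)(x,a) = r(x,a) + gamma * sum_y p(y|x,a) * max_b Q(y,b)."""
    return [
        [
            R[x][a] + GAMMA * sum(P[x][a][y] * max(Q[y]) for y in range(N_STATES))
            for a in range(N_ACTIONS)
        ]
        for x in range(N_STATES)
    ]


def q_distance(Q1, Q2):
    """Max-norm distance between two action-value tables."""
    return max(abs(u - v) for r1, r2 in zip(Q1, Q2) for u, v in zip(r1, r2))


def value_iteration(tol=1e-10, max_iter=10_000):
    """Iterate Q_{k+1} = T Q_k from Q_0 = 0 until successive iterates are tol-close."""
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # arbitrary Q_0
    for k in range(max_iter):
        Q_next = bellman_optimality_q(Q)
        gap = q_distance(Q_next, Q)
        Q = Q_next
        if gap < tol:
            break
    return Q, k + 1


Q_star, n_iter = value_iteration()
print("iterations:", n_iter)
```

The stopping rule is a design choice for the sketch: because the operator is a $\gamma$-contraction, a small gap between successive iterates implies a small distance to the fixed point.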
Dynamic Programming Algorithms: Policy Iteration
- Start with an arbitrary policy $\pi_0$.
- At each iteration $k$:
  - Policy Evaluation: compute $Q^{\pi_k}$.
  - Policy Improvement: compute the greedy policy w.r.t. $Q^{\pi_k}$:
$$\pi_{k+1}(x) = (\mathcal{G} \pi_k)(x) = \arg\max_{a \in \mathcal{A}} Q^{\pi_k}(x, a)$$
Convergence: PI generates a sequence of policies with increasing performance ($V^{\pi_{k+1}} \ge V^{\pi_k}$) and stops after a finite number of iterations with the optimal policy $\pi^*$.
$$V^{\pi_k} = \mathcal{T}^{\pi_k} V^{\pi_k} \le \mathcal{T} V^{\pi_k} = \mathcal{T}^{\pi_{k+1}} V^{\pi_k} \le \lim_{n \to \infty} (\mathcal{T}^{\pi_{k+1}})^n V^{\pi_k} = V^{\pi_{k+1}}$$
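The two steps of policy iteration can be sketched on a hypothetical finite MDP (made-up numbers, not from the slides). As an assumption of this sketch, policy evaluation is done by iterating $\mathcal{T}^{\pi}$ to numerical convergence rather than by solving the linear system exactly, and the greedy step works from $V^{\pi_k}$ through a one-step backup, which is equivalent to maximizing $Q^{\pi_k}(x, a)$.

```python
# Hypothetical 3-state, 2-action MDP (illustrative numbers, not from the slides).
GAMMA = 0.9
R = [[0.0, 1.0], [0.5, 0.0], [1.0, 0.2]]      # R[x][a]: reward
P = [                                         # P[x][a][y]: transition probability
    [[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
    [[0.0, 0.9, 0.1], [0.5, 0.0, 0.5]],
    [[0.3, 0.3, 0.4], [0.0, 0.0, 1.0]],
]
N_STATES, N_ACTIONS = len(R), len(R[0])


def backup(V, x, a):
    """One-step lookahead r(x,a) + gamma * sum_y p(y|x,a) V(y)."""
    return R[x][a] + GAMMA * sum(P[x][a][y] * V[y] for y in range(N_STATES))


def evaluate(policy, tol=1e-12):
    """Policy evaluation: approximate V^pi by iterating T^pi from V = 0."""
    V = [0.0] * N_STATES
    while True:
        V_next = [backup(V, x, policy[x]) for x in range(N_STATES)]
        if max(abs(u - v) for u, v in zip(V_next, V)) < tol:
            return V_next
        V = V_next


def greedy(V):
    """Policy improvement: pick argmax_a of the one-step backup in every state."""
    policy = []
    for x in range(N_STATES):
        q = [backup(V, x, a) for a in range(N_ACTIONS)]
        policy.append(q.index(max(q)))
    return policy


def policy_iteration(max_iter=100):
    """Alternate evaluation and improvement until the policy stops changing."""
    policy = [0] * N_STATES                   # arbitrary pi_0
    values = []                               # V^{pi_0}, V^{pi_1}, ...
    for _ in range(max_iter):
        V = evaluate(policy)
        values.append(V)
        improved = greedy(V)
        if improved == policy:
            break
        policy = improved
    return policy, values


pi_star, values = policy_iteration()
print("final policy:", pi_star, "after", len(values), "evaluations")
```

The list `values` makes the monotone improvement visible: each entry should dominate the previous one state by state, up to the evaluation tolerance.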
Approximate Dynamic Programming
Approximate Dynamic Programming Algorithms: Value Iteration
- Start with an arbitrary action-value function $Q_0$.
- At each iteration $k$: $Q_{k+1} = \mathcal{T} Q_k$.
What if $Q_{k+1} \approx \mathcal{T} Q_k$?
$$\|Q^* - Q_{k+1}\| \overset{?}{\le} \gamma \|Q^* - Q_k\|$$
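A short worked derivation shows what replaces the exact contraction bound. It is a sketch under two assumptions that are not on the slide: the error is measured in the max-norm, and $\varepsilon_k$ is introduced here as a name for the per-iteration approximation error $\|Q_{k+1} - \mathcal{T} Q_k\|_\infty$. It uses only the triangle inequality, the fixed-point property $Q^* = \mathcal{T} Q^*$, and the $\gamma$-contraction of $\mathcal{T}$.

```latex
\begin{align*}
\|Q^* - Q_{k+1}\|_\infty
  &\le \|\mathcal{T} Q^* - \mathcal{T} Q_k\|_\infty + \|\mathcal{T} Q_k - Q_{k+1}\|_\infty \\
  &\le \gamma \|Q^* - Q_k\|_\infty + \varepsilon_k,
  \qquad \varepsilon_k := \|Q_{k+1} - \mathcal{T} Q_k\|_\infty . \\
\intertext{Unrolling the recursion over $K$ iterations gives}
\|Q^* - Q_K\|_\infty
  &\le \gamma^K \|Q^* - Q_0\|_\infty + \sum_{k=0}^{K-1} \gamma^{K-1-k}\, \varepsilon_k .
\end{align*}
```

So the pure contraction no longer holds, but the error stays controlled as long as each $\varepsilon_k$ is small, with recent errors weighted most heavily.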
Approximate Dynamic Programming Algorithms: Policy Iteration
- Start with an arbitrary policy $\pi_0$.
- At each iteration $k$:
  - Policy Evaluation: compute $Q^{\pi_k}$.
  - Policy Improvement: compute the greedy policy w.r.t. $Q^{\pi_k}$:
$$\pi_{k+1}(x) = (\mathcal{G} \pi_k)(x) = \arg\max_{a \in \mathcal{A}} Q^{\pi_k}(x, a)$$
What if we cannot compute $Q^{\pi_k}$ exactly? (Compute $\widehat{Q}^{\pi_k} \approx Q^{\pi_k}$ instead.)
$$\pi_{k+1}(x) = \arg\max_{a \in \mathcal{A}} \widehat{Q}^{\pi_k}(x, a) \ne (\mathcal{G} \pi_k)(x) \quad \longrightarrow \quad V^{\pi_{k+1}} \overset{?}{\ge} V^{\pi_k}$$
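Statistical Learning Theory in RL & ADP
Approximate Value Iteration (AVI): $Q_{k+1} \approx \mathcal{T} Q_k$
- Goal: find a function that best approximates $\mathcal{T} Q_k$: $Q = \arg\min_f \|f - \mathcal{T} Q_k\|_\mu$
- Only noisy observations $\widehat{\mathcal{T}} Q_k$ of $\mathcal{T} Q_k$ are available (target function $= \mathcal{T} Q_k$, noisy observation $= \widehat{\mathcal{T}} Q_k$).
- We minimize the empirical error, $Q_{k+1} = \widehat{Q} = \arg\min_f \|f - \widehat{\mathcal{T}} Q_k\|_{\widehat{\mu}}$, with the target of minimizing the true error, $Q = \arg\min_f \|f - \mathcal{T} Q_k\|_\mu$.
Objective: keep the following bound small, which is a regression problem:
$$\|\widehat{Q} - \mathcal{T} Q_k\|_\mu \le \underbrace{\|\widehat{Q} - Q\|_\mu}_{\text{estimation error}} + \underbrace{\|Q - \mathcal{T} Q_k\|_\mu}_{\text{approximation error}}$$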
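The empirical minimization can be made concrete with a toy regression step. Everything specific in the sketch below is an assumption for illustration and does not come from the slides: the one-dimensional state space $[0, 1]$, the dynamics and reward in `sample_transition`, the uniform sampling distribution standing in for $\mu$, and a piecewise-constant function class with `N_BINS` cells per action. For that class the least-squares minimizer is simply the mean of the noisy targets $r_i + \gamma \max_b Q_k(x'_i, b)$ in each cell.

```python
import random

GAMMA = 0.9
N_BINS, ACTIONS = 10, (0, 1)


def bin_of(x):
    """Index of the piecewise-constant cell containing state x in [0, 1]."""
    return min(int(x * N_BINS), N_BINS - 1)


def sample_transition(rng):
    """Hypothetical generative model on X = [0, 1] (not from the slides)."""
    x = rng.random()                          # x ~ mu, taken uniform here
    a = rng.choice(ACTIONS)
    drift = 0.1 if a == 1 else -0.1
    x_next = min(1.0, max(0.0, x + drift + rng.gauss(0.0, 0.05)))
    reward = x                                # r(x, a) in [0, 1]
    return x, a, reward, x_next


def fitted_step(Q_k, samples):
    """Empirical least-squares fit of the noisy targets over the binned class.

    Each target y_i = r_i + gamma * max_b Q_k(x'_i, b) is a noisy observation
    of (T Q_k)(x_i, a_i); with indicator features the minimizer of the
    empirical squared error is the per-cell mean of the targets.
    """
    sums = {a: [0.0] * N_BINS for a in ACTIONS}
    counts = {a: [0] * N_BINS for a in ACTIONS}
    for x, a, r, x_next in samples:
        y = r + GAMMA * max(Q_k[b][bin_of(x_next)] for b in ACTIONS)
        sums[a][bin_of(x)] += y
        counts[a][bin_of(x)] += 1
    # Cells with no data keep their previous value.
    return {
        a: [
            sums[a][j] / counts[a][j] if counts[a][j] else Q_k[a][j]
            for j in range(N_BINS)
        ]
        for a in ACTIONS
    }


rng = random.Random(0)
samples = [sample_transition(rng) for _ in range(5000)]
Q = {a: [0.0] * N_BINS for a in ACTIONS}      # Q_0
for _ in range(50):                           # repeat the regression step
    Q = fitted_step(Q, samples)
print([round(q, 2) for q in Q[1]])
```

In this toy setting the estimation error shrinks as the number of samples per cell grows, while the approximation error is governed by how coarse the cells are, which mirrors the two terms in the bound above.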
Statistical Learning Theory in RL & ADP
Approximate Policy Iteration (API), policy evaluation
- Goal: find a function that best approximates $Q^{\pi_k}$: $Q = \arg\min_f \|f - Q^{\pi_k}\|_\mu$
- Only noisy observations $\widehat{Q}^{\pi_k}$ of $Q^{\pi_k}$ are available (target function $= Q^{\pi_k}$, noisy observation $= \widehat{Q}^{\pi_k}$).
- We minimize the empirical error, $\widehat{Q} = \arg\min_f \|f - \widehat{Q}^{\pi_k}\|_{\widehat{\mu}}$, with the target of minimizing the true error, $Q = \arg\min_f \|f - Q^{\pi_k}\|_\mu$.
Objective: keep the following bound small, which is a regression problem:
$$\|\widehat{Q} - Q^{\pi_k}\|_\mu \le \underbrace{\|\widehat{Q} - Q\|_\mu}_{\text{estimation error}} + \underbrace{\|Q - Q^{\pi_k}\|_\mu}_{\text{approximation error}}$$