SLIDE 1

MVA-RL Course

Reinforcement Learning Algorithms

  • A. LAZARIC (SequeL Team @INRIA-Lille)

ENS Cachan - Master 2 MVA

SequeL – INRIA Lille

SLIDE 2

In This Lecture

◮ How do we solve an MDP online?

⇒ RL Algorithms

SLIDE 3

In This Lecture

◮ Dynamic programming algorithms require an explicit definition of
  ◮ the transition probabilities p(·|x, a)
  ◮ the reward function r(x, a)

◮ This knowledge is often unavailable (e.g., wind intensity, human-computer interaction).

◮ Can we relax this assumption?

SLIDE 4

In This Lecture

◮ Learning with a generative model. A black-box simulator f of the environment is available: given (x, a), it returns f(x, a) = (y, r) with y ∼ p(·|x, a) and r = r(x, a).

◮ Episodic learning. Multiple trajectories can be repeatedly generated from the same state x, each terminating when a reset condition is met: (x_0^i = x, x_1^i, …, x_{T_i}^i)_{i=1}^n.

◮ Online learning. At each time t the agent is in state x_t, takes action a_t, observes a transition to state x_{t+1}, and receives a reward r_t. We assume that x_{t+1} ∼ p(·|x_t, a_t) and r_t = r(x_t, a_t) (i.e., the MDP assumption).

SLIDE 5

Mathematical Tools

Outline

Mathematical Tools
The Monte-Carlo Algorithm
The TD(1) Algorithm
The TD(0) Algorithm
The TD(λ) Algorithm
The Q-learning Algorithm

SLIDE 6

Mathematical Tools

Concentration Inequalities

Let X be a random variable and {X_n}_{n∈N} a sequence of r.v.

◮ {X_n} converges to X almost surely, X_n →^{a.s.} X, if
    P( lim_{n→∞} X_n = X ) = 1,

◮ {X_n} converges to X in probability, X_n →^{P} X, if for any ε > 0,
    lim_{n→∞} P[ |X_n − X| > ε ] = 0,

◮ {X_n} converges to X in law (or in distribution), X_n →^{D} X, if for any bounded continuous function f,
    lim_{n→∞} E[f(X_n)] = E[f(X)].

Remark: X_n →^{a.s.} X ⇒ X_n →^{P} X ⇒ X_n →^{D} X.

SLIDE 7

Mathematical Tools

Concentration Inequalities

Proposition (Markov Inequality)

Let X be a positive random variable. Then for any a > 0,
    P(X ≥ a) ≤ E[X]/a.

Proof. P(X ≥ a) = E[I{X ≥ a}] = E[I{X/a ≥ 1}] ≤ E[X/a].

SLIDE 8

Mathematical Tools

Concentration Inequalities

Proposition (Hoeffding Inequality)

Let X be a centered random variable bounded in [a, b]. Then for any s ∈ R,
    E[e^{sX}] ≤ e^{s²(b−a)²/8}.

SLIDE 9

Mathematical Tools

Concentration Inequalities

Proof. By convexity of the exponential function, for any a ≤ x ≤ b,
    e^{sx} ≤ (x−a)/(b−a) · e^{sb} + (b−x)/(b−a) · e^{sa}.
Let p = −a/(b−a). Then (recall that E[X] = 0)
    E[e^{sX}] ≤ b/(b−a) · e^{sa} − a/(b−a) · e^{sb} = (1 − p + p·e^{s(b−a)}) e^{−ps(b−a)} = e^{φ(u)},
with u = s(b−a) and φ(u) = −pu + log(1 − p + p·e^u), whose derivative is
    φ′(u) = −p + p / (p + (1−p)e^{−u}),
so that φ(0) = φ′(0) = 0 and
    φ″(u) = p(1−p)e^{−u} / (p + (1−p)e^{−u})² ≤ 1/4.
Thus by Taylor's theorem there exists θ ∈ [0, u] such that
    φ(u) = φ(0) + u·φ′(0) + (u²/2)·φ″(θ) ≤ u²/8 = s²(b−a)²/8.

SLIDE 10

Mathematical Tools

Concentration Inequalities

Proposition (Chernoff-Hoeffding Inequality)

Let X_i ∈ [a_i, b_i] be n independent r.v. with means μ_i = E[X_i]. Then
    P( | Σ_{i=1}^n (X_i − μ_i) | ≥ ε ) ≤ 2 exp( −2ε² / Σ_{i=1}^n (b_i − a_i)² ).
SLIDE 11

Mathematical Tools

Concentration Inequalities

Proof.
    P( Σ_{i=1}^n (X_i − μ_i) ≥ ε )
      = P( e^{s Σ_{i=1}^n (X_i−μ_i)} ≥ e^{sε} )
      ≤ e^{−sε} E[ e^{s Σ_{i=1}^n (X_i−μ_i)} ]        (Markov inequality)
      = e^{−sε} Π_{i=1}^n E[ e^{s(X_i−μ_i)} ]         (independent random variables)
      ≤ e^{−sε} Π_{i=1}^n e^{s²(b_i−a_i)²/8}          (Hoeffding inequality)
      = e^{ −sε + s² Σ_{i=1}^n (b_i−a_i)²/8 }.
If we choose s = 4ε / Σ_{i=1}^n (b_i − a_i)², the result follows.
A similar argument holds for P( Σ_{i=1}^n (X_i − μ_i) ≤ −ε ).
SLIDE 12

Mathematical Tools

Monte-Carlo Approximation of a Mean

Definition

Let X be a random variable with mean μ = E[X] and variance σ² = V[X], and let x_1, …, x_n be n i.i.d. realizations of X. The Monte-Carlo approximation of the mean (i.e., the empirical mean) built on the n realizations is
    μ_n = (1/n) Σ_{i=1}^n x_i.

SLIDE 13

Mathematical Tools

Monte-Carlo Approximation of a Mean

◮ Unbiased estimator: E[μ_n] = μ (and V[μ_n] = V[X]/n).
◮ Weak law of large numbers: μ_n →^{P} μ.
◮ Strong law of large numbers: μ_n →^{a.s.} μ.
◮ Central limit theorem (CLT): √n(μ_n − μ) →^{D} N(0, V[X]).
◮ Finite-sample guarantee:
    P( | (1/n) Σ_{t=1}^n X_t − E[X_1] | > ε ) ≤ 2 exp( −2nε² / (b−a)² ),
  where the left-hand side is the probability of a deviation, ε is the accuracy, and the right-hand side is the confidence.
SLIDE 14

Mathematical Tools

Monte-Carlo Approximation of a Mean

◮ Unbiased estimator: E[μ_n] = μ (and V[μ_n] = V[X]/n).
◮ Weak law of large numbers: μ_n →^{P} μ.
◮ Strong law of large numbers: μ_n →^{a.s.} μ.
◮ Central limit theorem (CLT): √n(μ_n − μ) →^{D} N(0, V[X]).
◮ Finite-sample guarantee:
    P( | (1/n) Σ_{t=1}^n X_t − E[X_1] | > (b−a) √( log(2/δ) / (2n) ) ) ≤ δ.
SLIDE 15

Mathematical Tools

Monte-Carlo Approximation of a Mean

◮ Unbiased estimator: E[μ_n] = μ (and V[μ_n] = V[X]/n).
◮ Weak law of large numbers: μ_n →^{P} μ.
◮ Strong law of large numbers: μ_n →^{a.s.} μ.
◮ Central limit theorem (CLT): √n(μ_n − μ) →^{D} N(0, V[X]).
◮ Finite-sample guarantee:
    P( | (1/n) Σ_{t=1}^n X_t − E[X_1] | > ε ) ≤ δ   if   n ≥ (b−a)² log(2/δ) / (2ε²).

SLIDE 16

Mathematical Tools

Exercise

Simulate n Bernoulli random variables with parameter p and verify the correctness and the accuracy of the Chernoff-Hoeffding bound.
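
A minimal sketch of this exercise (the values of p, n, and ε and the use of NumPy are illustrative assumptions, not from the slides). Bernoulli variables are bounded in [0, 1], so C-H gives P(|μ_n − p| > ε) ≤ 2e^{−2nε²}:

import numpy as np

# Assumed experiment parameters (not from the slides).
p, n, eps, runs = 0.3, 100, 0.1, 10_000
rng = np.random.default_rng(0)

# Empirical means of `runs` independent batches of n Bernoulli(p) samples.
mu_n = rng.binomial(1, p, size=(runs, n)).mean(axis=1)

empirical = np.mean(np.abs(mu_n - p) > eps)   # observed deviation frequency
bound = 2 * np.exp(-2 * n * eps**2)           # Chernoff-Hoeffding bound

print(f"empirical P(|mu_n - p| > {eps}): {empirical:.4f}")
print(f"Chernoff-Hoeffding bound:        {bound:.4f}")

The printed bound should dominate the empirical frequency, typically by a wide margin, which illustrates that C-H is correct but conservative.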

SLIDE 17

Mathematical Tools

Stochastic Approximation of a Mean

Definition

Let X be a random variable bounded in [0, 1] with mean μ = E[X], and let x_1, …, x_n be n i.i.d. realizations of X. The stochastic approximation of the mean is
    μ_n = (1 − η_n)μ_{n−1} + η_n x_n,
with μ_1 = x_1 and where (η_n) is a sequence of learning steps.

Remark: when η_n = 1/n this is the recursive definition of the empirical mean.
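
A minimal sketch of this recursion (the sampled distribution and the step choice η_n = n^{−α} are assumptions; the exponent anticipates the analysis below):

import numpy as np

def stochastic_mean(xs, alpha):
    # mu_n = (1 - eta_n) mu_{n-1} + eta_n x_n  with  eta_n = n^-alpha
    mu = xs[0]                                # mu_1 = x_1
    for n, x in enumerate(xs[1:], start=2):
        eta = n ** -alpha
        mu = (1 - eta) * mu + eta * x
    return mu

rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, size=100_000)          # bounded in [0, 1], mean 0.5
print(stochastic_mean(xs, alpha=0.75))        # consistent: close to 0.5
print(stochastic_mean(xs, alpha=1.0))         # eta_n = 1/n: the empirical mean

With α = 1 the recursion reproduces the empirical mean exactly, matching the remark above.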

SLIDE 18

Mathematical Tools

Stochastic Approximation of a Mean

Proposition (Borel-Cantelli)

Let (E_n)_{n≥1} be a sequence of events such that Σ_{n≥1} P(E_n) < ∞. Then the probability that infinitely many of the E_n occur is 0. More formally,
    P( limsup_{n→∞} E_n ) = P( ∩_{n=1}^∞ ∪_{k=n}^∞ E_k ) = 0.
SLIDE 19

Mathematical Tools

Stochastic Approximation of a Mean

Proposition

If the learning steps η_n ≥ 0 are such that
    Σ_{n≥0} η_n = ∞;    Σ_{n≥0} η_n² < ∞,
then μ_n →^{a.s.} μ, and we say that μ_n is a consistent estimator.

SLIDE 20

Mathematical Tools

Stochastic Approximation of a Mean

Proof. We focus on the case η_n = n^{−α}. In order to satisfy the two conditions we need 1/2 < α ≤ 1. Indeed, for instance,
    α = 2 ⇒ Σ_{n≥1} η_n = Σ_{n≥1} 1/n² = π²/6 < ∞ (the Basel problem), so the first condition fails;
    α = 1/2 ⇒ Σ_{n≥1} η_n² = Σ_{n≥1} (1/√n)² = Σ_{n≥1} 1/n = ∞ (the harmonic series), so the second condition fails.

SLIDE 21

Mathematical Tools

Stochastic Approximation of a Mean

Proof (cont'd). Case α = 1. Let (ε_k)_k be a sequence such that ε_k → 0. Almost sure convergence corresponds to
    P( lim_{n→∞} μ_n = μ ) = P( ∀k, ∃n_k, ∀n ≥ n_k, |μ_n − μ| ≤ ε_k ) = 1.
From the Chernoff-Hoeffding inequality, for any fixed n,
    P( |μ_n − μ| ≥ ε ) ≤ 2e^{−2nε²}.    (1)
Let {E_n} be the sequence of events E_n = { |μ_n − μ| ≥ ε }. From C-H, Σ_{n≥1} P(E_n) < ∞, and from the Borel-Cantelli lemma we obtain that with probability 1 there is only a finite number of n values such that |μ_n − μ| ≥ ε.
SLIDE 22

Mathematical Tools

Stochastic Approximation of a Mean

Proof (cont'd). Case α = 1. Then for any ε_k there is only a finite number of instants where |μ_n − μ| ≥ ε_k, which corresponds to: there exists n_k such that
    P( ∀n ≥ n_k, |μ_n − μ| ≤ ε_k ) = 1.
Repeating the argument for all ε_k in the sequence leads to the statement.

Remark: when α = 1, μ_n is the Monte-Carlo estimate and this corresponds to the strong law of large numbers. A more precise and rigorous proof is here:
http://terrytao.wordpress.com/2008/06/18/the-strong-law-of-large-numbers/

SLIDE 23

Mathematical Tools

Stochastic Approximation of a Mean

Proof (cont'd). Case 1/2 < α < 1. The stochastic approximation μ_n unfolds as
    μ_1 = x_1
    μ_2 = (1 − η_2)μ_1 + η_2 x_2 = (1 − η_2)x_1 + η_2 x_2
    μ_3 = (1 − η_3)μ_2 + η_3 x_3 = (1 − η_2)(1 − η_3)x_1 + η_2(1 − η_3)x_2 + η_3 x_3
    …
    μ_n = Σ_{i=1}^n λ_i x_i,   with λ_i = η_i Π_{j=i+1}^n (1 − η_j), such that Σ_{i=1}^n λ_i = 1.
By the C-H inequality,
    P( | Σ_{i=1}^n λ_i x_i − Σ_{i=1}^n λ_i E[x_i] | ≥ ε ) = P( |μ_n − μ| ≥ ε ) ≤ 2 e^{ −2ε² / Σ_{i=1}^n λ_i² }.

SLIDE 24

Mathematical Tools

Stochastic Approximation of a Mean

Proof (cont'd). Case 1/2 < α < 1. From the definition of λ_i,
    log λ_i = log η_i + Σ_{j=i+1}^n log(1 − η_j) ≤ log η_i − Σ_{j=i+1}^n η_j,
since log(1 − x) < −x. Thus λ_i ≤ η_i e^{ −Σ_{j=i+1}^n η_j }, and for any 1 ≤ m ≤ n,
    Σ_{i=1}^n λ_i² ≤ Σ_{i=1}^n η_i² e^{ −2 Σ_{j=i+1}^n η_j }
      ≤(a) Σ_{i=1}^m e^{ −2 Σ_{j=i+1}^n η_j } + Σ_{i=m+1}^n η_i²
      ≤(b) m e^{ −2(n−m)η_n } + (n − m) η_m²
      =(c) m e^{ −2(n−m)n^{−α} } + (n − m) m^{−2α}.

SLIDE 25

Mathematical Tools

Stochastic Approximation of a Mean

Proof (cont'd). Case 1/2 < α < 1. Let m = n^β with β = (1 + 1/(2α))/2 (so that 1 − 2αβ = 1/2 − α):
    Σ_{i=1}^n λ_i² ≤ n e^{ −2(1 − n^{β−1}) n^{1−α} } + n^{1/2−α} ≤ 2 n^{1/2−α}
for n big enough, which leads to
    P( |μ_n − μ| ≥ ε ) ≤ e^{ −ε² / n^{1/2−α} }.
From this point we follow the same steps as for α = 1 (application of the Borel-Cantelli lemma) and obtain the convergence result for μ_n.

SLIDE 26

Mathematical Tools

Stochastic Approximation of a Fixed Point

Definition

Let T : R^N → R^N be a contraction in some norm ||·|| with fixed point V. For any function W and state x, a noisy observation T̂W(x) = TW(x) + b(x) is available. For any x ∈ X = {1, …, N}, we define the stochastic approximation
    V_{n+1}(x) = (1 − η_n(x)) V_n(x) + η_n(x) T̂V_n(x) = (1 − η_n(x)) V_n(x) + η_n(x) ( TV_n(x) + b_n(x) ),
where η_n is a sequence of learning steps.

SLIDE 27

Mathematical Tools

Stochastic Approximation of a Fixed Point

Proposition

Let F_n = {V_0, …, V_n, b_0, …, b_{n−1}, η_0, …, η_n} be the filtration of the algorithm, and assume that
    E[ b_n(x) | F_n ] = 0   and   E[ b_n²(x) | F_n ] ≤ c (1 + ||V_n||²)
for some constant c. If the learning rates η_n(x) are positive and satisfy the stochastic approximation conditions
    Σ_{n≥0} η_n = ∞,    Σ_{n≥0} η_n² < ∞,
then for any x ∈ X, V_n(x) →^{a.s.} V(x).

SLIDE 28

Mathematical Tools

Stochastic Approximation of a Zero

Robbins-Monro (1951) algorithm. Given a noisy function f, find x* such that f(x*) = 0. At each x_n, observe y_n = f(x_n) + b_n (with b_n a zero-mean independent noise) and compute
    x_{n+1} = x_n − η_n y_n.
If f is an increasing function, then under the same assumptions on the learning steps, x_n →^{a.s.} x*.
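
A minimal sketch of the recursion (the target function f(x) = x − 2, the Gaussian noise, and the step exponent are illustrative assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
x = 0.0
for n in range(1, 100_001):
    y = (x - 2.0) + rng.normal()   # noisy observation y_n = f(x_n) + b_n
    eta = n ** -0.75               # sum eta_n = inf, sum eta_n^2 < inf
    x = x - eta * y                # x_{n+1} = x_n - eta_n y_n
print(x)                           # close to the root x* = 2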

SLIDE 29

Mathematical Tools

Stochastic Approximation of a Minimum

Kiefer-Wolfowitz (1952) algorithm. Given a function f and noisy observations of its gradient, find x* = arg min f(x). At each x_n, observe g_n = ∇f(x_n) + b_n (with b_n a zero-mean independent noise) and compute
    x_{n+1} = x_n − η_n g_n.
If the Hessian ∇²f is positive definite, then under the same assumptions on the learning steps, x_n →^{a.s.} x*.

Remark: this is often referred to as the stochastic gradient algorithm.
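
A minimal sketch of the recursion (the objective f(x) = (x − 3)², the Gaussian noise, and the step exponent are illustrative assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
x = 0.0
for n in range(1, 100_001):
    g = 2.0 * (x - 3.0) + rng.normal()   # noisy gradient g_n = f'(x_n) + b_n
    x = x - (n ** -0.75) * g             # x_{n+1} = x_n - eta_n g_n
print(x)                                 # close to the minimizer x* = 3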

SLIDE 30

The Monte-Carlo Algorithm

Outline

Mathematical Tools
The Monte-Carlo Algorithm
The TD(1) Algorithm
The TD(0) Algorithm
The TD(λ) Algorithm
The Q-learning Algorithm

SLIDE 31

The Monte-Carlo Algorithm

Policy Evaluation

We consider the problem of evaluating the performance of a policy π in the undiscounted infinite-horizon setting. For any (proper) policy π the value function is
    V^π(x) = E[ Σ_{t=0}^{T−1} r^π(x_t) | x_0 = x; π ],
where r^π(x_t) = r(x_t, π(x_t)) and T is the random time when the terminal state is reached.

SLIDE 32

The Monte-Carlo Algorithm

Question

How can we estimate the value function if an episodic interaction with the environment is possible? ⇒ Monte-Carlo approximation of a mean!

SLIDE 33

The Monte-Carlo Algorithm

The Monte-Carlo Algorithm

Algorithm Definition (Monte-Carlo)

Let (x_0^i = x, x_1^i, …, x_{T_i}^i = 0)_{i≤n} be a set of n independent trajectories starting from x and terminating after T_i steps. For any t < T_i, we denote by
    R̂_i(x_t^i) = r^π(x_t^i) + r^π(x_{t+1}^i) + · · · + r^π(x_{T_i−1}^i)
the return of the i-th trajectory at state x_t^i. Then the Monte-Carlo estimator of V^π(x) is
    V_n(x) = (1/n) Σ_{i=1}^n [ r^π(x_0^i) + r^π(x_1^i) + · · · + r^π(x_{T_i−1}^i) ] = (1/n) Σ_{i=1}^n R̂_i(x).
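
A minimal sketch of this estimator (the env_step interface, the policy callable, and the terminal-state convention x = 0 are illustrative assumptions, not from the slides):

import numpy as np

def mc_estimate(env_step, policy, x0, n_traj):
    # env_step(x, a) -> (next_state, reward); state 0 is assumed terminal.
    returns = []
    for _ in range(n_traj):
        x, ret = x0, 0.0
        while x != 0:                 # run until the reset condition
            x, r = env_step(x, policy(x))
            ret += r                  # undiscounted sum of rewards
        returns.append(ret)           # the return R_i(x0)
    return float(np.mean(returns))    # V_n(x0), converging to V^pi(x0) a.s.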
SLIDE 34

The Monte-Carlo Algorithm

The Monte-Carlo Algorithm

All the returns are unbiased estimators of V^π(x), since
    E[ R̂_i(x) ] = E[ r^π(x_0^i) + r^π(x_1^i) + · · · + r^π(x_{T_i−1}^i) ] = V^π(x),
and then V_n(x) →^{a.s.} V^π(x).

SLIDE 35

The Monte-Carlo Algorithm

First-visit and Every-Visit Monte-Carlo

Remark: any trajectory (x_0, x_1, x_2, …, x_T) also contains the sub-trajectory (x_t, x_{t+1}, …, x_T), whose return R̂(x_t) = r^π(x_t) + · · · + r^π(x_{T−1}) could be used to build an estimator of V^π(x_t).

◮ First-visit MC. For each state x we only consider the sub-trajectory starting when x is first reached. Unbiased estimator, only one sample per trajectory.

◮ Every-visit MC. Given a trajectory (x_0 = x, x_1, x_2, …, x_T), we list all the m sub-trajectories starting from x up to x_T and average them all to obtain an estimate. More than one sample per trajectory, biased estimator.

SLIDE 36

The Monte-Carlo Algorithm

Question

More samples or no bias? ⇒ Sometimes a biased estimator is preferable if consistent!

SLIDE 37

The Monte-Carlo Algorithm

First-visit vs Every-Visit Monte-Carlo

Example: 2-state Markov chain

[Figure: state 1 loops on itself with probability 1 − p and moves to the terminal state 0 with probability p.]

The reward is 1 while in state 1 (and 0 in the terminal state). All trajectories have the form (x_0 = 1, x_1 = 1, …, x_T = 0). By the Bellman equation, since V(0) = 0,
    V(1) = 1 + (1 − p)·V(1) + p·0 = 1/p.

SLIDE 38

The Monte-Carlo Algorithm

First-visit vs Every-Visit Monte-Carlo

We measure the mean squared error (MSE) of V̂ w.r.t. V:
    E[ (V̂ − V)² ] = ( E[V̂] − V )²  +  E[ (V̂ − E[V̂])² ]
                     (squared bias)     (variance)
SLIDE 39

The Monte-Carlo Algorithm

First-visit vs Every-Visit Monte-Carlo

First-visit Monte-Carlo. All the trajectories start from state 1, so the return over one single trajectory is exactly T, i.e., V̂ = T. The time-to-end T is a geometric r.v. with expectation E[V̂] = E[T] = 1/p = V^π(1) ⇒ unbiased estimator. Thus the MSE of V̂ coincides with the variance of T, which is
    E[ (T − 1/p)² ] = 1/p² − 1/p.

SLIDE 40

The Monte-Carlo Algorithm

First-visit vs Every-Visit Monte-Carlo

Every-visit Monte-Carlo. Given one trajectory, we can construct T sub-trajectories (the number of times state 1 is visited), where the sub-trajectory starting at time t has return T − t. Averaging them,
    V̂ = (1/T) Σ_{t=0}^{T−1} (T − t) = (1/T) Σ_{t′=1}^{T} t′ = (T + 1)/2.
The corresponding expectation is
    E[ (T + 1)/2 ] = (1 + p)/(2p) ≠ V^π(1) ⇒ biased estimator.

SLIDE 41

The Monte-Carlo Algorithm

First-visit vs Every-Visit Monte-Carlo

Let's consider n independent trajectories, each of length T_i. The total number of samples is Σ_{i=1}^n T_i, and the estimator V̂_n is
    V̂_n = [ Σ_{i=1}^n Σ_{t=0}^{T_i−1} (T_i − t) ] / [ Σ_{i=1}^n T_i ]
        = [ Σ_{i=1}^n T_i(T_i + 1)/2 ] / [ Σ_{i=1}^n T_i ]
        = [ (1/n) Σ_{i=1}^n T_i(T_i + 1) ] / [ (2/n) Σ_{i=1}^n T_i ]
        →^{a.s.} ( E[T²] + E[T] ) / ( 2E[T] ) = 1/p = V^π(1) ⇒ consistent estimator.
For a single trajectory, the MSE of the estimator is
    E[ ( (T + 1)/2 − 1/p )² ] = 1/(2p²) − 3/(4p) + 1/4 ≤ 1/p² − 1/p.
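
A minimal simulation of this comparison (the value of p and the number of runs are illustrative assumptions); it uses that on this chain a single trajectory yields the first-visit estimate T and the every-visit estimate (T + 1)/2:

import numpy as np

rng = np.random.default_rng(0)
p, runs = 0.25, 100_000
v_true = 1.0 / p                      # V(1) = 1/p

T = rng.geometric(p, size=runs)       # time-to-end of each trajectory
first_visit = T.astype(float)         # return of the full trajectory
every_visit = (T + 1) / 2.0           # average over the T sub-trajectories

print("first-visit MSE:", np.mean((first_visit - v_true) ** 2))  # ~ 1/p^2 - 1/p
print("every-visit MSE:", np.mean((every_visit - v_true) ** 2))  # ~ 1/(2p^2) - 3/(4p) + 1/4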

SLIDE 42

The Monte-Carlo Algorithm

First-visit vs Every-Visit Monte-Carlo

In general:
◮ Every-visit MC: biased but consistent estimator.
◮ First-visit MC: unbiased estimator, but with potentially bigger MSE.

Remark: when the state space is large, the probability of visiting the same state multiple times is low, and the performance of the two methods tends to be the same.

SLIDE 43

The TD(1) Algorithm

Outline

Mathematical Tools
The Monte-Carlo Algorithm
The TD(1) Algorithm
The TD(0) Algorithm
The TD(λ) Algorithm
The Q-learning Algorithm

SLIDE 44

The TD(1) Algorithm

Policy Evaluation

We consider the problem of evaluating the performance of a policy π in the undiscounted infinite-horizon setting. For any (proper) policy π the value function is
    V^π(x) = E[ Σ_{t=0}^{T−1} r^π(x_t) | x_0 = x; π ],
where r^π(x_t) = r(x_t, π(x_t)) and T is the random time when the terminal state is reached.

SLIDE 45

The TD(1) Algorithm

Question

MC requires the full trajectories to be available before updating; can we update the estimator online? ⇒ TD(1)!

SLIDE 46

The TD(1) Algorithm

The TD(1) Algorithm

Algorithm Definition (TD(1))

Let (x_0^n = x, x_1^n, …, x_{T_n}^n) be the n-th trajectory and R̂_n the corresponding return. For all x_t^n with t ≤ T_n − 1 observed along the trajectory, we update the value function estimate as
    V_n(x_t^n) = (1 − η_n(x_t^n)) V_{n−1}(x_t^n) + η_n(x_t^n) R̂_n(x_t^n).
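
A minimal sketch of one TD(1) update after a finished trajectory (the dictionary-based tabular representation and the step η_n(x) = 1/N(x) are assumptions; this is the every-visit variant):

def td1_update(V, counts, states, rewards):
    # states[t], rewards[t] observed along one trajectory (terminal state excluded).
    T = len(states)
    ret, returns = 0.0, [0.0] * T
    for t in range(T - 1, -1, -1):       # returns R(x_t) computed backwards
        ret += rewards[t]
        returns[t] = ret
    for t, x in enumerate(states):
        counts[x] = counts.get(x, 0) + 1
        eta = 1.0 / counts[x]            # eta_n(x) = 1/n meets both conditions
        V[x] = (1 - eta) * V.get(x, 0.0) + eta * returns[t]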

SLIDE 47

The TD(1) Algorithm

The TD(1) Algorithm

Each sample is an unbiased estimator of the value function,
    E[ r^π(x_t) + r^π(x_{t+1}) + · · · + r^π(x_{T−1}) | x_t ] = V^π(x_t),
so the convergence result for the stochastic approximation of a mean applies: if all the states are visited in an infinite number of trajectories and for all x ∈ X
    Σ_n η_n(x) = ∞,    Σ_n η_n(x)² < ∞,
then V_n(x) →^{a.s.} V^π(x).

SLIDE 48

The TD(0) Algorithm

Outline

Mathematical Tools
The Monte-Carlo Algorithm
The TD(1) Algorithm
The TD(0) Algorithm
The TD(λ) Algorithm
The Q-learning Algorithm

SLIDE 49

The TD(0) Algorithm

Policy Evaluation

We consider the problem of evaluating the performance of a policy π in the undiscounted infinite-horizon setting. For any (proper) policy π the value function satisfies the Bellman equation
    V^π(x) = r(x, π(x)) + Σ_{y∈X} p(y|x, π(x)) V^π(y) = T^π V^π(x).
⇒ use the stochastic approximation of a fixed point.

SLIDE 50

The TD(0) Algorithm

The TD(0) Algorithm

◮ Noisy observation of the operator T^π:
    T̂^π V(x_t) = r^π(x_t) + V(x_{t+1}), with x_t = x.

◮ Unbiased estimator of T^π V(x), since
    E[ T̂^π V(x_t) | x_t = x ] = E[ r^π(x_t) + V(x_{t+1}) | x_t = x ] = r(x, π(x)) + Σ_y p(y|x, π(x)) V(y) = T^π V(x).

◮ Bounded noise, since
    | T̂^π V(x) − T^π V(x) | ≤ 2||V||_∞.

SLIDE 51

The TD(0) Algorithm

The TD(0) Algorithm

Algorithm Definition (TD(0))

Let (x_0^n = x, x_1^n, …, x_{T_n}^n) be the n-th trajectory, and { T̂^π V_{n−1}(x_t^n) }_t the noisy observations of the operator T^π. For all x_t^n with t ≤ T_n − 1, we update the value function estimate as
    V_n(x_t^n) = (1 − η_n(x_t^n)) V_{n−1}(x_t^n) + η_n(x_t^n) T̂^π V_{n−1}(x_t^n)
               = (1 − η_n(x_t^n)) V_{n−1}(x_t^n) + η_n(x_t^n) [ r^π(x_t^n) + V_{n−1}(x_{t+1}^n) ].
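
A minimal sketch of one TD(0) update along a single transition (the tabular dictionaries and the step η_n(x) = 1/N(x) are assumptions):

def td0_update(V, counts, x, r, x_next):
    counts[x] = counts.get(x, 0) + 1
    eta = 1.0 / counts[x]                 # eta_n(x) = 1/n
    target = r + V.get(x_next, 0.0)       # noisy observation of T^pi V(x)
    V[x] = (1 - eta) * V.get(x, 0.0) + eta * target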
SLIDE 52

The TD(0) Algorithm

The TD(0) Algorithm

If all the states are visited in an infinite number of trajectories and for all x ∈ X
    Σ_n η_n(x) = ∞,    Σ_n η_n(x)² < ∞,
then V_n(x) →^{a.s.} V^π(x).

SLIDE 53

The TD(0) Algorithm

The TD(0) Algorithm

Definition

At iteration n, given the estimator V_{n−1} and a transition from state x_t to state x_{t+1}, we define the temporal difference
    d_t = [ r^π(x_t) + V_{n−1}(x_{t+1}) ] − V_{n−1}(x_t).

Remark: recalling the Bellman equation for the state value function, the temporal difference d_t^n provides a measure of the coherence of the estimator V_{n−1} w.r.t. the transition x_t → x_{t+1}.

SLIDE 54

The TD(0) Algorithm

The TD(0) Algorithm

Algorithm Definition (TD(0))

Let (x_0^n = x, x_1^n, …, x_{T_n}^n) be the n-th trajectory, and {d_t^n}_t the temporal differences. For all x_t^n with t ≤ T_n − 1, we update the value function estimate as
    V_n(x_t^n) = V_{n−1}(x_t^n) + η_n(x_t^n) d_t^n.

SLIDE 55

The TD(λ) Algorithm

Outline

Mathematical Tools
The Monte-Carlo Algorithm
The TD(1) Algorithm
The TD(0) Algorithm
The TD(λ) Algorithm
The Q-learning Algorithm

SLIDE 56

The TD(λ) Algorithm

Comparison between TD(1) and TD(0)

◮ TD(1):
    V_n(x_t) = V_{n−1}(x_t) + η_n(x_t) [ d_t^n + d_{t+1}^n + · · · + d_{T−1}^n ].
◮ TD(0):
    V_n(x_t^n) = V_{n−1}(x_t^n) + η_n(x_t^n) d_t^n.

SLIDE 57

The TD(λ) Algorithm

Question

Is it possible to take the best of both? ⇒ TD(λ)!

SLIDE 58

The TD(λ) Algorithm

The T^π_λ Bellman operator

Definition

Given λ < 1, the Bellman operator T^π_λ is
    T^π_λ = (1 − λ) Σ_{m≥0} λ^m (T^π)^{m+1}.

Remark: a convex combination of the m-step Bellman operators (T^π)^m, weighted by a sequence of coefficients defined as a function of λ.

SLIDE 59

The TD(λ) Algorithm

The TD(λ) Algorithm

Proposition

If π is a proper policy and T^π is a β-contraction in the L_{μ,∞}-norm, then T^π_λ is a contraction of factor
    (1 − λ)β / (1 − βλ) ∈ [0, β].

SLIDE 60

The TD(λ) Algorithm

The TD(λ) Algorithm

Proof. Let P^π be the transition matrix of the Markov chain. Then
    T^π_λ V = (1 − λ) Σ_{m≥0} λ^m [ Σ_{i=0}^m (P^π)^i r^π + (P^π)^{m+1} V ]
            = Σ_{m≥0} λ^m (P^π)^m r^π + (1 − λ) Σ_{m≥0} λ^m (P^π)^{m+1} V
            = (I − λP^π)^{−1} r^π + (1 − λ) Σ_{m≥0} λ^m (P^π)^{m+1} V.
Since T^π is a β-contraction, ||(P^π)^m V||_μ ≤ β^m ||V||_μ. Thus
    || (1 − λ) Σ_{m≥0} λ^m (P^π)^{m+1} V ||_μ ≤ (1 − λ) Σ_{m≥0} λ^m ||(P^π)^{m+1} V||_μ ≤ (1 − λ)β / (1 − βλ) · ||V||_μ,
which implies that T^π_λ is a contraction in L_{μ,∞} as well.

SLIDE 61

The TD(λ) Algorithm

The TD(λ) Algorithm

Algorithm Definition (Sutton, 1988)

Let (x_0^n = x, x_1^n, …, x_{T_n}^n) be the n-th trajectory, and {d_t^n}_t the temporal differences. For all x_t^n with t ≤ T_n − 1, we update the value function estimate as
    V_n(x_t^n) = V_{n−1}(x_t^n) + η_n(x_t^n) Σ_{s=t}^{T_n−1} λ^{s−t} d_s^n.
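
A minimal sketch of this (forward-view) update after a finished trajectory (the tabular dictionaries and the step η_n(x) = 1/N(x) are assumptions; the temporal differences are computed with the frozen estimate V_{n−1}):

def td_lambda_update(V, counts, states, rewards, lam):
    # states[t], rewards[t] from one trajectory; V(terminal) = 0 by convention.
    T = len(states)
    d = [rewards[t]
         + (V.get(states[t + 1], 0.0) if t + 1 < T else 0.0)
         - V.get(states[t], 0.0)
         for t in range(T)]                        # temporal differences d_t
    for t, x in enumerate(states):
        counts[x] = counts.get(x, 0) + 1
        eta = 1.0 / counts[x]
        V[x] += eta * sum(lam ** (s - t) * d[s] for s in range(t, T))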

SLIDE 62

The TD(λ) Algorithm

The TD(λ) Algorithm

We need to show that the temporal difference samples are unbiased estimators. For any s ≥ t,
    E[ d_s | x_t = x ] = E[ r^π(x_s) + V_{n−1}(x_{s+1}) − V_{n−1}(x_s) | x_t = x ]
      = E[ Σ_{i=t}^{s} r^π(x_i) + V_{n−1}(x_{s+1}) | x_t = x ] − E[ Σ_{i=t}^{s−1} r^π(x_i) + V_{n−1}(x_s) | x_t = x ]
      = (T^π)^{s−t+1} V_{n−1}(x) − (T^π)^{s−t} V_{n−1}(x).
SLIDE 63

The TD(λ) Algorithm

The TD(λ) Algorithm

E[ Σ_{s=t}^{T−1} λ^{s−t} d_s | x_t = x ]
  = Σ_{s=t}^{T−1} λ^{s−t} [ (T^π)^{s−t+1} V_{n−1}(x) − (T^π)^{s−t} V_{n−1}(x) ]
  = Σ_{m≥0} λ^m [ (T^π)^{m+1} V_{n−1}(x) − (T^π)^m V_{n−1}(x) ]    (the sum extends to infinity since d_s = 0 after termination)
  = Σ_{m≥0} λ^m (T^π)^{m+1} V_{n−1}(x) − [ V_{n−1}(x) + Σ_{m>0} λ^m (T^π)^m V_{n−1}(x) ]
  = Σ_{m≥0} λ^m (T^π)^{m+1} V_{n−1}(x) − [ V_{n−1}(x) + λ Σ_{m>0} λ^{m−1} (T^π)^m V_{n−1}(x) ]
  = Σ_{m≥0} λ^m (T^π)^{m+1} V_{n−1}(x) − [ V_{n−1}(x) + λ Σ_{m≥0} λ^m (T^π)^{m+1} V_{n−1}(x) ]
  = (1 − λ) Σ_{m≥0} λ^m (T^π)^{m+1} V_{n−1}(x) − V_{n−1}(x)
  = T^π_λ V_{n−1}(x) − V_{n−1}(x).

Then V_n →^{a.s.} V^π.

SLIDE 64

The TD(λ) Algorithm

Sensitivity to λ

Linear chain example

[Figure: a 5-state linear chain with terminal rewards −1 and 1 at the two ends.]

[Figure: the MSE of V_n w.r.t. V^π after n = 100 trajectories, plotted as a function of λ ∈ [0, 1].]

SLIDE 65

The TD(λ) Algorithm

Sensitivity to λ

◮ λ < 1: smaller variance w.r.t. λ = 1 (MC/TD(1)).
◮ λ > 0: faster propagation of rewards w.r.t. λ = 0.

SLIDE 66

The TD(λ) Algorithm

Question

Is it possible to update the V estimate at each step? ⇒ Online implementation!

SLIDE 67

The TD(λ) Algorithm

Online Implementation of TD algorithm: Eligibility Traces

Remark: since the update occurs at each step, we now drop the dependency on n.

◮ Eligibility traces z ∈ R^N.
◮ For every transition x_t → x_{t+1}:
  1. Compute the temporal difference
        d_t = r^π(x_t) + V(x_{t+1}) − V(x_t).
  2. Update the eligibility traces
        z(x) = 1 + λz(x) if x = x_t,   z(x) = λz(x) otherwise
     (reset the traces when x_t is the terminal state, x_t = 0).
  3. For all states x ∈ X,
        V(x) ← V(x) + η_t(x) z(x) d_t.
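
A minimal sketch of this online implementation (the episodic loop, the start state, and the constant step η are illustrative assumptions, not from the slides):

import numpy as np

def online_td_lambda(transition, n_states, lam, eta=0.05, episodes=10_000):
    # transition(x) -> (x_next, reward); state 0 is assumed terminal.
    V = np.zeros(n_states)
    for _ in range(episodes):
        z = np.zeros(n_states)        # traces reset at the start of each episode
        x = n_states - 1              # assumed start state
        while x != 0:
            x_next, r = transition(x)
            d = r + V[x_next] - V[x]  # 1. temporal difference
            z *= lam                  # 2. decay all traces...
            z[x] += 1.0               #    ...and reinforce the visited state
            V += eta * z * d          # 3. update every state at once
            x = x_next
    return V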

SLIDE 68

The TD(λ) Algorithm

TD(λ) in discounted reward MDPs

The Bellman operator T^π_λ is defined as
    T^π_λ V(x_0) = (1 − λ) E[ Σ_{t≥0} λ^t ( Σ_{i=0}^t γ^i r^π(x_i) + γ^{t+1} V(x_{t+1}) ) ]
      = E[ (1 − λ) Σ_{i≥0} γ^i r^π(x_i) Σ_{t≥i} λ^t + Σ_{t≥0} γ^{t+1} V(x_{t+1}) (λ^t − λ^{t+1}) ]
      = E[ Σ_{i≥0} λ^i ( γ^i r^π(x_i) + γ^{i+1} V(x_{i+1}) − γ^i V(x_i) ) ] + V(x_0)
      = E[ Σ_{i≥0} (γλ)^i d_i ] + V(x_0),
with the temporal difference d_i = r^π(x_i) + γV(x_{i+1}) − V(x_i). The corresponding TD(λ) algorithm becomes
    V_{n+1}(x_t) = V_n(x_t) + η_n(x_t) Σ_{s≥t} (γλ)^{s−t} d_s.

SLIDE 69

The Q-learning Algorithm

Outline

Mathematical Tools
The Monte-Carlo Algorithm
The TD(1) Algorithm
The TD(0) Algorithm
The TD(λ) Algorithm
The Q-learning Algorithm

SLIDE 70

The Q-learning Algorithm

Question

How do we compute the optimal policy online? ⇒ Q-learning!

SLIDE 71

The Q-learning Algorithm

Q-learning

Remark: if we use TD algorithms to compute V_n ≈ V^{π_k}, then we could compute the greedy policy as
    π_{k+1}(x) ∈ arg max_a [ r(x, a) + Σ_y p(y|x, a) V_n(y) ].

Problem: the transition probabilities p are unknown!

Solution: use Q-functions and compute π_{k+1}(x) ∈ arg max_a Q_n(x, a).

SLIDE 72

The Q-learning Algorithm

Q-learning

Algorithm Definition (Watkins, 1989)

We build a sequence {Q_n} in such a way that for every observed transition (x, a, y, r),
    Q_{n+1}(x, a) = (1 − η_n(x, a)) Q_n(x, a) + η_n(x, a) [ r + max_{b∈A} Q_n(y, b) ].
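
A minimal sketch of tabular Q-learning in this episodic setting (the step interface, the ε-greedy exploration, the start state, and the step η_n(x, a) = 1/N(x, a) are illustrative assumptions, not from the slides):

import numpy as np

def q_learning(step, n_states, n_actions, episodes=5000, eps=0.1):
    # step(x, a) -> (y, r); state 0 is assumed terminal.
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))   # visit counts, eta_n(x, a) = 1/N(x, a)
    for _ in range(episodes):
        x = n_states - 1                  # assumed start state
        while x != 0:
            # eps-greedy exploration keeps visiting all state-action pairs
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[x].argmax())
            y, r = step(x, a)
            N[x, a] += 1
            eta = 1.0 / N[x, a]
            Q[x, a] = (1 - eta) * Q[x, a] + eta * (r + Q[y].max())
            x = y
    return Q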
SLIDE 73

The Q-learning Algorithm

Q-learning

Proposition (Watkins and Dayan, 1992)

Assume that all the policies π are proper and that all the state-action pairs are visited infinitely often. If
    Σ_{n≥0} η_n(x, a) = ∞,    Σ_{n≥0} η_n²(x, a) < ∞,
then for any x ∈ X, a ∈ A, Q_n(x, a) →^{a.s.} Q*(x, a).

SLIDE 74

The Q-learning Algorithm

Q-learning

Proof. Consider the optimal Bellman operator T,
    T W(x, a) = r(x, a) + Σ_y p(y|x, a) max_{b∈A} W(y, b),
with unique fixed point Q*. Since all the policies are proper, T is a contraction in the L_{μ,∞}-norm. Q-learning can be written as
    Q_{n+1}(x, a) = (1 − η_n(x, a)) Q_n(x, a) + η_n(x, a) [ T Q_n(x, a) + b_n(x, a) ],
where b_n(x, a) is a zero-mean random variable such that
    E[ b_n²(x, a) ] ≤ c (1 + max_{y,b} Q_n²(y, b)).
The statement follows from the convergence of the stochastic approximation of fixed-point operators.


SLIDE 76

The Q-learning Algorithm

Reinforcement Learning

Alessandro Lazaric
alessandro.lazaric@inria.fr
sequel.lille.inria.fr