 
Mathematical Tools: Monte-Carlo Approximation of a Mean
◮ Unbiased estimator: $\mathbb{E}[\mu_n] = \mu$ (and $\mathbb{V}[\mu_n] = \mathbb{V}[X]/n$).
◮ Weak law of large numbers: $\mu_n \xrightarrow{P} \mu$.
◮ Strong law of large numbers: $\mu_n \xrightarrow{a.s.} \mu$.
◮ Central limit theorem (CLT): $\sqrt{n}\,(\mu_n - \mu) \xrightarrow{D} \mathcal{N}(0, \mathbb{V}[X])$.
◮ Finite sample guarantee:
\[ \mathbb{P}\Big[ \Big| \frac{1}{n}\sum_{t=1}^{n} X_t - \mathbb{E}[X_1] \Big| > \epsilon \Big] \le 2\exp\Big( -\frac{2n\epsilon^2}{(b-a)^2} \Big), \]
where the quantity inside the absolute value is the deviation, $\epsilon$ is the accuracy, and the right-hand side is the confidence.
A. LAZARIC – Reinforcement Learning Algorithms Oct 15th, 2013

Mathematical Tools: Monte-Carlo Approximation of a Mean (cont'd)
◮ Finite sample guarantee (for a fixed confidence $\delta$):
\[ \mathbb{P}\Big[ \Big| \frac{1}{n}\sum_{t=1}^{n} X_t - \mathbb{E}[X_1] \Big| > (b-a)\sqrt{\frac{\log(2/\delta)}{2n}} \Big] \le \delta. \]

Mathematical Tools: Monte-Carlo Approximation of a Mean (cont'd)
◮ Finite sample guarantee (for a fixed accuracy $\epsilon$ and confidence $\delta$):
\[ \mathbb{P}\Big[ \Big| \frac{1}{n}\sum_{t=1}^{n} X_t - \mathbb{E}[X_1] \Big| > \epsilon \Big] \le \delta \quad \text{if} \quad n \ge \frac{(b-a)^2 \log(2/\delta)}{2\epsilon^2}. \]

Mathematical Tools: Exercise
Simulate $n$ Bernoulli random variables with parameter $p$ and verify the correctness and the accuracy of the Chernoff-Hoeffding (C-H) bounds.

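One possible way to approach the exercise is sketched below. It is only an illustration: the function names, the number of repetitions `runs`, and the values of `n`, `p` and `eps` are assumptions chosen for the example, not part of the slides. The sketch compares the empirical frequency of a deviation larger than `eps` with the C-H bound $2\exp(-2n\epsilon^2/(b-a)^2)$ for variables in $[a,b] = [0,1]$.

```python
import math
import random

def hoeffding_bound(n, eps, a=0.0, b=1.0):
    """Chernoff-Hoeffding upper bound on P(|mean - mu| > eps) for variables in [a, b]."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2 / (b - a) ** 2)

def empirical_deviation_prob(n, p, eps, runs=2000, seed=0):
    """Fraction of runs where the mean of n Bernoulli(p) draws is more than eps away from p."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(runs):
        mean = sum(1 for _ in range(n) if rng.random() < p) / n
        if abs(mean - p) > eps:
            bad += 1
    return bad / runs

if __name__ == "__main__":
    n, p, eps = 200, 0.3, 0.1  # illustrative values
    print("empirical frequency:", empirical_deviation_prob(n, p, eps))
    print("C-H bound:          ", hoeffding_bound(n, eps))
```

The empirical frequency should stay below the bound (correctness); the gap between the two indicates how loose the bound is for a given $p$ (accuracy).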
Mathematical Tools: Stochastic Approximation of a Mean
Definition. Let $X$ be a random variable bounded in $[0,1]$ with mean $\mu = \mathbb{E}[X]$, and let $x_n \sim X$ be $n$ i.i.d. realizations of $X$. The stochastic approximation of the mean is
\[ \mu_n = (1-\eta_n)\,\mu_{n-1} + \eta_n x_n \]
with $\mu_1 = x_1$, where $(\eta_n)$ is a sequence of learning steps.
Remark: when $\eta_n = 1/n$ this is the recursive definition of the empirical mean.

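The recursion in the definition can be sketched in a few lines. The function name `stochastic_mean` and the `step` parameter are hypothetical names introduced for this illustration; with the default `step(n) = 1/n` the result should coincide with the empirical mean, as stated in the remark.

```python
import random

def stochastic_mean(samples, step=lambda n: 1.0 / n):
    """Run mu_n = (1 - eta_n) * mu_{n-1} + eta_n * x_n with mu_1 = x_1."""
    mu = None
    for n, x in enumerate(samples, start=1):
        if n == 1:
            mu = x  # initialisation mu_1 = x_1
        else:
            eta = step(n)
            mu = (1.0 - eta) * mu + eta * x
    return mu

if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.random() for _ in range(1000)]   # samples bounded in [0, 1]
    print(stochastic_mean(xs), sum(xs) / len(xs))              # eta_n = 1/n
    print(stochastic_mean(xs, step=lambda n: n ** -0.75))      # eta_n = n^(-3/4)
```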
Mathematical Tools: Stochastic Approximation of a Mean
Proposition (Borel-Cantelli). Let $(E_n)_{n \ge 1}$ be a sequence of events such that $\sum_{n \ge 1} \mathbb{P}(E_n) < \infty$. Then the probability that infinitely many of the events occur is 0. More formally,
\[ \mathbb{P}\Big( \limsup_{n \to \infty} E_n \Big) = \mathbb{P}\Big( \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} E_k \Big) = 0. \]

Mathematical Tools: Stochastic Approximation of a Mean
Proposition. If $\eta_n \ge 0$ for all $n$ and
\[ \sum_{n \ge 0} \eta_n = \infty; \qquad \sum_{n \ge 0} \eta_n^2 < \infty, \]
then $\mu_n \xrightarrow{a.s.} \mu$, and we say that $\mu_n$ is a consistent estimator.

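Mathematical Tools: Stochastic Approximation of a Mean
Proof. We focus on the case $\eta_n = n^{-\alpha}$. In order to satisfy the two conditions we need $1/2 < \alpha \le 1$. Outside this range one of the two conditions fails, for instance:
\[ \alpha = 2 \;\Rightarrow\; \sum_{n \ge 1} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty \quad \text{(see the Basel problem), so } \textstyle\sum_n \eta_n = \infty \text{ does not hold;} \]
\[ \alpha = 1/2 \;\Rightarrow\; \sum_{n \ge 1} \Big( \frac{1}{\sqrt{n}} \Big)^2 = \sum_{n \ge 1} \frac{1}{n} = \infty \quad \text{(harmonic series), so } \textstyle\sum_n \eta_n^2 < \infty \text{ does not hold.} \]
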
Mathematical Tools: Stochastic Approximation of a Mean
Proof (cont'd). Case $\alpha = 1$.
Let $(\epsilon_k)_k$ be a sequence such that $\epsilon_k \to 0$. Almost sure convergence corresponds to
\[ \mathbb{P}\Big( \lim_{n \to \infty} \mu_n = \mu \Big) = \mathbb{P}\big( \forall k, \exists n_k, \forall n \ge n_k, \; |\mu_n - \mu| \le \epsilon_k \big) = 1. \]
From the Chernoff-Hoeffding inequality, for any fixed $n$,
\[ \mathbb{P}\big( |\mu_n - \mu| \ge \epsilon \big) \le 2e^{-2n\epsilon^2}. \qquad (1) \]
Let $\{E_n\}$ be the sequence of events $E_n = \{ |\mu_n - \mu| \ge \epsilon \}$. From C-H,
\[ \sum_{n \ge 1} \mathbb{P}(E_n) < \infty, \]
and from the Borel-Cantelli lemma we obtain that, with probability 1, there exist only a finite number of values of $n$ such that $|\mu_n - \mu| \ge \epsilon$.

Mathematical Tools: Stochastic Approximation of a Mean
Proof (cont'd). Case $\alpha = 1$.
Then for any $\epsilon_k$ there exist only a finite number of instants where $|\mu_n - \mu| \ge \epsilon_k$, which means that there exists $n_k$ such that
\[ \mathbb{P}\big( \forall n \ge n_k, \; |\mu_n - \mu| \le \epsilon_k \big) = 1. \]
Repeating for all $\epsilon_k$ in the sequence leads to the statement.
Remark: when $\alpha = 1$, $\mu_n$ is the Monte-Carlo estimate and this corresponds to the strong law of large numbers. A more precise and accurate proof is here: http://terrytao.wordpress.com/2008/06/18/the-strong-law-of-large-numbers/

Mathematical Tools: Stochastic Approximation of a Mean
Proof (cont'd). Case $1/2 < \alpha < 1$.
The stochastic approximation $\mu_n$ is
\[ \mu_1 = x_1 \]
\[ \mu_2 = (1-\eta_2)\mu_1 + \eta_2 x_2 = (1-\eta_2)x_1 + \eta_2 x_2 \]
\[ \mu_3 = (1-\eta_3)\mu_2 + \eta_3 x_3 = (1-\eta_2)(1-\eta_3)x_1 + \eta_2(1-\eta_3)x_2 + \eta_3 x_3 \]
\[ \vdots \]
\[ \mu_n = \sum_{i=1}^{n} \lambda_i x_i, \]
with $\lambda_i = \eta_i \prod_{j=i+1}^{n} (1-\eta_j)$ such that $\sum_{i=1}^{n} \lambda_i = 1$. By the C-H inequality,
\[ \mathbb{P}\Big( \Big| \sum_{i=1}^{n} \lambda_i x_i - \sum_{i=1}^{n} \lambda_i \mathbb{E}[x_i] \Big| \ge \epsilon \Big) = \mathbb{P}\big( |\mu_n - \mu| \ge \epsilon \big) \le e^{-\frac{2\epsilon^2}{\sum_{i=1}^{n} \lambda_i^2}}. \]

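The unrolling of the recursion into weights $\lambda_i$ can be checked numerically. The sketch below assumes $\eta_1 = 1$ (which holds for $\eta_n = n^{-\alpha}$ and is consistent with $\mu_1 = x_1$); the function name `sa_weights` is hypothetical.

```python
def sa_weights(etas):
    """Given etas[i-1] = eta_i for i = 1..n (with eta_1 = 1), return
    lambda_i = eta_i * prod_{j=i+1..n} (1 - eta_j) for i = 1..n."""
    n = len(etas)
    weights = [0.0] * n
    tail = 1.0  # running product of (1 - eta_j) over j > i
    for i in range(n - 1, -1, -1):
        weights[i] = etas[i] * tail
        tail *= 1.0 - etas[i]
    return weights

if __name__ == "__main__":
    alpha = 0.75
    etas = [k ** -alpha for k in range(1, 101)]
    lam = sa_weights(etas)
    print("sum of weights:        ", sum(lam))
    print("sum of squared weights:", sum(w * w for w in lam))
```

The sum of squared weights is the quantity that enters the exponent of the C-H bound above.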
Mathematical Tools: Stochastic Approximation of a Mean
Proof (cont'd). Case $1/2 < \alpha < 1$.
From the definition of $\lambda_i$,
\[ \log \lambda_i = \log \eta_i + \sum_{j=i+1}^{n} \log(1-\eta_j) \le \log \eta_i - \sum_{j=i+1}^{n} \eta_j, \]
since $\log(1-x) < -x$. Thus $\lambda_i \le \eta_i e^{-\sum_{j=i+1}^{n} \eta_j}$ and, for any $1 \le m \le n$,
\[ \sum_{i=1}^{n} \lambda_i^2 \le \sum_{i=1}^{n} \eta_i^2 e^{-2\sum_{j=i+1}^{n} \eta_j} \]
\[ \overset{(a)}{\le} \sum_{i=1}^{m} e^{-2\sum_{j=i+1}^{n} \eta_j} + \sum_{i=m+1}^{n} \eta_i^2 \]
\[ \overset{(b)}{\le} m e^{-2(n-m)\eta_n} + (n-m)\eta_m^2 \]
\[ \overset{(c)}{=} m e^{-2(n-m)n^{-\alpha}} + (n-m)m^{-2\alpha}. \]

Mathematical Tools: Stochastic Approximation of a Mean
Proof (cont'd). Case $1/2 < \alpha < 1$.
Let $m = n^{\beta}$ with $\beta = (1 + \alpha/2)/2$ (i.e. $1 - 2\alpha\beta = 1/2 - \alpha$):
\[ \sum_{i=1}^{n} \lambda_i^2 \le n e^{-2(1 - n^{-1/4})n^{1-\alpha}} + n^{1/2-\alpha} \le 2n^{1/2-\alpha} \]
for $n$ large enough, which leads to
\[ \mathbb{P}\big( |\mu_n - \mu| \ge \epsilon \big) \le e^{-\frac{\epsilon^2}{n^{1/2-\alpha}}}. \]
From this point we follow the same steps as for $\alpha = 1$ (application of the Borel-Cantelli lemma) and obtain the convergence result for $\mu_n$.

Mathematical Tools: Stochastic Approximation of a Fixed Point
Definition. Let $T : \mathbb{R}^N \to \mathbb{R}^N$ be a contraction in some norm $\|\cdot\|$ with fixed point $V$. For any function $W$ and state $x$, a noisy observation $\widehat{T}W(x) = TW(x) + b(x)$ is available. For any $x \in X = \{1, \dots, N\}$, we define the stochastic approximation
\[ V_{n+1}(x) = (1 - \eta_n(x))V_n(x) + \eta_n(x)\,\widehat{T}V_n(x) = (1 - \eta_n(x))V_n(x) + \eta_n(x)\big( TV_n(x) + b_n(x) \big), \]
where $\eta_n$ is a sequence of learning steps.

Mathematical Tools: Stochastic Approximation of a Fixed Point
Proposition. Let $\mathcal{F}_n = \{V_0, \dots, V_n, b_0, \dots, b_{n-1}, \eta_0, \dots, \eta_n\}$ be the filtration of the algorithm and assume that
\[ \mathbb{E}[b_n(x) \mid \mathcal{F}_n] = 0 \quad \text{and} \quad \mathbb{E}[b_n^2(x) \mid \mathcal{F}_n] \le c\,(1 + \|V_n\|^2) \]
for a constant $c$. If the learning rates $\eta_n(x)$ are positive and satisfy the stochastic approximation conditions
\[ \sum_{n \ge 0} \eta_n = \infty, \qquad \sum_{n \ge 0} \eta_n^2 < \infty, \]
then for any $x \in X$
\[ V_n(x) \xrightarrow{a.s.} V(x). \]

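A minimal sketch of this scheme, under assumptions that are not in the slides: the contraction is taken to be $TV = r + \gamma P V$ for a hypothetical two-state chain (`P`, `R`, `GAMMA`), the noise is Gaussian, and all states are updated at every step with $\eta_n = n^{-3/4}$.

```python
import random

GAMMA = 0.5
P = [[0.5, 0.5], [0.2, 0.8]]  # hypothetical transition matrix
R = [1.0, 0.0]                # hypothetical rewards

def T(V):
    """(T V)(x) = r(x) + gamma * sum_y P[x][y] * V(y), a gamma-contraction in the sup norm."""
    n = len(V)
    return [R[x] + GAMMA * sum(P[x][y] * V[y] for y in range(n)) for x in range(n)]

def stochastic_fixed_point(n_steps, alpha=0.75, sigma=1.0, seed=0):
    """Iterate V <- (1 - eta) V + eta (T V + b) with zero-mean Gaussian noise b."""
    rng = random.Random(seed)
    V = [0.0, 0.0]
    for n in range(1, n_steps + 1):
        eta = n ** -alpha
        TV = T(V)
        V = [(1 - eta) * V[x] + eta * (TV[x] + rng.gauss(0.0, sigma)) for x in range(2)]
    return V

if __name__ == "__main__":
    exact = [0.0, 0.0]
    for _ in range(200):  # noiseless fixed-point iteration, for reference
        exact = T(exact)
    print("noisy estimate:", stochastic_fixed_point(20000))
    print("fixed point:   ", exact)
```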
Mathematical Tools: Stochastic Approximation of a Zero
Robbins-Monro (1951) algorithm. Given a noisy function $f$, find $x^*$ such that $f(x^*) = 0$. In each $x_n$, observe $y_n = f(x_n) + b_n$ (with $b_n$ a zero-mean independent noise) and compute
\[ x_{n+1} = x_n - \eta_n y_n. \]
If $f$ is an increasing function, then under the same assumptions on the learning step
\[ x_n \xrightarrow{a.s.} x^*. \]

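As an illustration of the Robbins-Monro iteration, the sketch below uses $\eta_n = 1/n$ and Gaussian noise; the function name, the noise model and the example function $f(x) = 2(x - 3)$ are assumptions made for the example.

```python
import random

def robbins_monro(f, x0, n_steps, sigma=1.0, seed=0):
    """Iterate x_{n+1} = x_n - eta_n * y_n with y_n = f(x_n) + noise and eta_n = 1/n."""
    rng = random.Random(seed)
    x = x0
    for n in range(1, n_steps + 1):
        y = f(x) + rng.gauss(0.0, sigma)  # noisy observation of f at x_n
        x = x - y / n
    return x

if __name__ == "__main__":
    # Hypothetical increasing function with zero at x* = 3.
    print(robbins_monro(lambda x: 2.0 * (x - 3.0), x0=0.0, n_steps=10000))
```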
Mathematical Tools: Stochastic Approximation of a Minimum
Kiefer-Wolfowitz (1952) algorithm. Given a function $f$ and noisy observations of its gradient, find $x^* = \arg\min f(x)$. In each $x_n$, observe $g_n = \nabla f(x_n) + b_n$ (with $b_n$ a zero-mean independent noise) and compute
\[ x_{n+1} = x_n - \eta_n g_n. \]
If the Hessian $\nabla^2 f$ is positive definite, then under the same assumptions on the learning step
\[ x_n \xrightarrow{a.s.} x^*. \]
Remark: this is often referred to as the stochastic gradient algorithm.

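The same iteration applied to a noisy gradient can be sketched in two dimensions. The quadratic objective, the Gaussian noise and the function names below are hypothetical choices for illustration; the Hessian of the chosen objective is diagonal with entries 1 and 2, hence positive definite.

```python
import random

def noisy_gradient(x, rng, sigma=1.0):
    """Gradient of the hypothetical f(x) = 0.5 * (x[0] - 1)^2 + (x[1] + 2)^2 plus zero-mean noise."""
    return [
        (x[0] - 1.0) + rng.gauss(0.0, sigma),
        2.0 * (x[1] + 2.0) + rng.gauss(0.0, sigma),
    ]

def stochastic_gradient(x0, n_steps, seed=0):
    """Iterate x_{n+1} = x_n - eta_n * g_n with eta_n = 1/n."""
    rng = random.Random(seed)
    x = list(x0)
    for n in range(1, n_steps + 1):
        g = noisy_gradient(x, rng)
        x = [xi - gi / n for xi, gi in zip(x, g)]
    return x

if __name__ == "__main__":
    print(stochastic_gradient([0.0, 0.0], 10000))  # minimiser of f is (1, -2)
```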
The Monte-Carlo Algorithm: How to solve an RL problem incrementally
Reinforcement Learning Algorithms: Tools, Policy Evaluation, Policy Learning

The Monte-Carlo Algorithm: The RL Interaction Protocol
For $i = 1, \dots, n$
  1. Set $t = 0$
  2. Set initial state $x_0$ [possibly random]
  3. While ($x_t$ not terminal) [execute one trajectory]
     3.1 Take action $a_t$
     3.2 Observe next state $x_{t+1}$ and reward $r_t$
     3.3 Set $t = t + 1$
     EndWhile
EndFor

The Monte-Carlo Algorithm: The RL Interaction Protocol
[Figure: $n$ trajectories $x_0, x_1^{(i)}, x_2^{(i)}, \dots, x_{T^{(i)}}^{(i)}$, $i = 1, \dots, n$, all starting from the initial state $x_0$.]

The Monte-Carlo Algorithm: Policy Evaluation
Objective: given a policy $\pi$, evaluate its quality at the (fixed) initial state $x_0$.
For $i = 1, \dots, n$
  1. Set $t = 0$
  2. Set initial state $x_0$ [possibly random]
  3. While ($x_t$ not terminal) [execute one trajectory]
     3.1 Take action $a_t = \pi(x_t)$
     3.2 Observe next state $x_{t+1}$ and reward $r_t = r^{\pi}(x_t)$
     3.3 Set $t = t + 1$
     EndWhile
EndFor

The Monte-Carlo Algorithm: The RL Interaction Protocol
[Figure: the same $n$ trajectories starting from $x_0$, with each visited state $x_t^{(i)}$ annotated with its reward $r^{\pi}(x_t^{(i)})$, up to the final state $x_{T^{(i)}}^{(i)}$.]

The Monte-Carlo Algorithm: State Value Function
◮ Infinite time horizon with terminal state: the problem never terminates but the agent will eventually reach a termination state.
\[ V^{\pi}(x) = \mathbb{E}\Big[ \sum_{t=0}^{T} \gamma^t r(x_t, \pi(x_t)) \,\Big|\, x_0 = x; \pi \Big], \]
where $T$ is the first (random) time when the termination state is reached.

The Monte-Carlo Algorithm: Monte-Carlo Approximation
Idea: we can approximate an expectation by an average!
◮ Return of trajectory $i$:
\[ \widehat{R}_i(x_0) = \sum_{t=0}^{T^{(i)}} \gamma^t r^{\pi}(x_t^{(i)}) \]
◮ Estimated value function:
\[ \widehat{V}_n^{\pi}(x_0) = \frac{1}{n} \sum_{i=1}^{n} \widehat{R}_i(x_0) \]

The Monte-Carlo Algorithm: Monte-Carlo Approximation
For $i = 1, \dots, n$
  1. Set $t = 0$
  2. Set initial state $x_0$ [possibly random]
  3. While ($x_t$ not terminal) [execute one trajectory]
     3.1 Take action $a_t = \pi(x_t)$
     3.2 Observe next state $x_{t+1}$ and reward $r_t = r^{\pi}(x_t)$
     3.3 Set $t = t + 1$
     EndWhile
EndFor
Collect trajectories and compute $\widehat{V}_n^{\pi}(x_0)$ using MC approximation

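The loop above might be implemented along the following lines. This is a sketch under assumptions: the environment is represented by a hypothetical function `step(x, rng)` that already follows the policy $\pi$ and returns `(next_state, reward, done)`, and `toy_step` is an invented chain (reward 1 per step, termination with probability `p`) used only to have something to run.

```python
import random

def run_episode(step, x0, gamma, rng):
    """Follow the fixed policy from x0 until termination and return the discounted return."""
    x, ret, discount, done = x0, 0.0, 1.0, False
    while not done:
        x, r, done = step(x, rng)
        ret += discount * r
        discount *= gamma
    return ret

def mc_evaluate(step, x0, gamma, n, seed=0):
    """Monte-Carlo estimate of V^pi(x0): the average return over n trajectories."""
    rng = random.Random(seed)
    return sum(run_episode(step, x0, gamma, rng) for _ in range(n)) / n

def toy_step(x, rng, p=0.5):
    """Hypothetical chain: reward 1 per step, terminate (move to state 0) with probability p."""
    done = rng.random() < p
    return (0 if done else 1), 1.0, done

if __name__ == "__main__":
    gamma, p = 0.9, 0.5
    print("MC estimate:", mc_evaluate(toy_step, 1, gamma, 20000))
    print("exact value:", 1.0 / (1.0 - gamma * (1.0 - p)))  # from V = 1 + gamma * (1 - p) * V
```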
The Monte-Carlo Algorithm: Monte-Carlo Approximation: Properties
◮ All returns are unbiased estimators of $V^{\pi}(x_0)$:
\[ \mathbb{E}[\widehat{R}_i(x_0)] = \mathbb{E}\Big[ r^{\pi}(x_0^{(i)}) + \gamma r^{\pi}(x_1^{(i)}) + \cdots + \gamma^{T^{(i)}} r^{\pi}(x_{T^{(i)}}^{(i)}) \Big] = V^{\pi}(x_0) \]
◮ Thus
\[ \widehat{V}_n^{\pi}(x_0) \xrightarrow{a.s.} V^{\pi}(x_0). \]
◮ Finite-sample guarantees are also possible

The Monte-Carlo Algorithm: Monte-Carlo Approximation: Extensions
Non-episodic problems
◮ Interrupt trajectories after $H$ steps:
\[ \widehat{R}_i(x_0) = \sum_{t=0}^{H} \gamma^t r^{\pi}(x_t^{(i)}) \]
◮ Loss in accuracy limited to $\gamma^H \dfrac{r_{\max}}{1-\gamma}$

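The truncation bound can be turned into a rule for choosing $H$. The helper names below are hypothetical; the sketch simply inverts $\gamma^H r_{\max}/(1-\gamma) \le \epsilon$ for the smallest integer $H$.

```python
import math

def truncation_error_bound(gamma, horizon, r_max):
    """Bound gamma^H * r_max / (1 - gamma) on the loss due to truncating at H steps."""
    return gamma ** horizon * r_max / (1.0 - gamma)

def horizon_for_accuracy(gamma, r_max, eps):
    """Smallest integer H such that gamma^H * r_max / (1 - gamma) <= eps."""
    if eps >= r_max / (1.0 - gamma):
        return 0
    return math.ceil(math.log(eps * (1.0 - gamma) / r_max) / math.log(gamma))

if __name__ == "__main__":
    H = horizon_for_accuracy(gamma=0.9, r_max=1.0, eps=0.01)
    print(H, truncation_error_bound(0.9, H, 1.0))
```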
The Monte-Carlo Algorithm: Monte-Carlo Approximation: Extensions
Multiple subtrajectories
[Figure: the $n$ trajectories starting from $x_0$, annotated with rewards $r^{\pi}(x_t^{(i)})$; some trajectories pass through the same state $x$ (e.g. $x_1^{(i)} = x$ and $x_2^{(n)} = x$).]
All subtrajectories starting with $x$ can be used to estimate $V^{\pi}(x)$

The Monte-Carlo Algorithm: First-visit and Every-Visit Monte-Carlo
Remark: any trajectory $(x_0, x_1, x_2, \dots, x_T)$ also contains the sub-trajectory $(x_t, x_{t+1}, \dots, x_T)$, whose return $\widehat{R}(x_t) = r^{\pi}(x_t) + \cdots + r^{\pi}(x_{T-1})$ could be used to build an estimator of $V^{\pi}(x_t)$.
◮ First-visit MC. For each state $x$ we only consider the sub-trajectory starting when $x$ is first reached. Unbiased estimator, only one sample per trajectory.
◮ Every-visit MC. Given a trajectory $(x_0 = x, x_1, x_2, \dots, x_T)$, we list all the $m$ sub-trajectories starting from $x$ up to $x_T$ and we average them all to obtain an estimate. More than one sample per trajectory, biased estimator.

The Monte-Carlo Algorithm: Question
More samples or no bias?
⇒ Sometimes a biased estimator is preferable if consistent!

The Monte-Carlo Algorithm: First-visit vs Every-Visit Monte-Carlo
Example: 2-state Markov chain
[Figure: two states, 1 and 0. From state 1 the chain stays in 1 with probability $1-p$ and moves to 0 with probability $p$; state 0 is terminal (self-loop with probability 1).]
The reward is 1 while in state 1 (and 0 in the terminal state). All trajectories are $(x_0 = 1, x_1 = 1, \dots, x_T = 0)$. By the Bellman equations,
\[ V(1) = 1 + (1-p)V(1) + 0 \cdot p \quad \Rightarrow \quad V(1) = \frac{1}{p}, \]
since $V(0) = 0$.

The Monte-Carlo Algorithm: First-visit vs Every-Visit Monte-Carlo
We measure the mean squared error (MSE) of $\widehat{V}$ w.r.t. $V$:
\[ \mathbb{E}\big[ (\widehat{V} - V)^2 \big] = \underbrace{\big( \mathbb{E}[\widehat{V}] - V \big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\Big[ \big( \widehat{V} - \mathbb{E}[\widehat{V}] \big)^2 \Big]}_{\text{Variance}} \]

The Monte-Carlo Algorithm: First-visit vs Every-Visit Monte-Carlo
First-visit Monte-Carlo. All the trajectories start from state 1, so the return over one single trajectory is exactly $T$, i.e., $\widehat{V} = T$. The time-to-end $T$ is a geometric r.v. with expectation
\[ \mathbb{E}[\widehat{V}] = \mathbb{E}[T] = \frac{1}{p} = V^{\pi}(1) \;\Rightarrow\; \text{unbiased estimator.} \]
Thus the MSE of $\widehat{V}$ coincides with the variance of $T$, which is
\[ \mathbb{E}\Big[ \Big( T - \frac{1}{p} \Big)^2 \Big] = \frac{1}{p^2} - \frac{1}{p}. \]

The Monte-Carlo Algorithm: First-visit vs Every-Visit Monte-Carlo
Every-visit Monte-Carlo. Given one trajectory, we can construct $T$ sub-trajectories, indexed by $t = 0, \dots, T-1$ (the number of times state 1 is visited), where the $t$-th sub-trajectory has return $T - t$:
\[ \widehat{V} = \frac{1}{T} \sum_{t=0}^{T-1} (T - t) = \frac{1}{T} \sum_{t'=1}^{T} t' = \frac{T+1}{2}. \]
The corresponding expectation is
\[ \mathbb{E}\Big[ \frac{T+1}{2} \Big] = \frac{1+p}{2p} \ne V^{\pi}(1) \;\Rightarrow\; \text{biased estimator.} \]

The Monte-Carlo Algorithm: First-visit vs Every-Visit Monte-Carlo
Let's consider $n$ independent trajectories, each of length $T_i$. The total number of samples is $\sum_{i=1}^{n} T_i$ and the estimator $\widehat{V}_n$ is
\[ \widehat{V}_n = \frac{\sum_{i=1}^{n} \sum_{t=0}^{T_i-1} (T_i - t)}{\sum_{i=1}^{n} T_i} = \frac{\sum_{i=1}^{n} T_i(T_i+1)/2}{\sum_{i=1}^{n} T_i} = \frac{\frac{1}{n}\sum_{i=1}^{n} T_i(T_i+1)}{\frac{2}{n}\sum_{i=1}^{n} T_i} \xrightarrow{a.s.} \frac{\mathbb{E}[T^2] + \mathbb{E}[T]}{2\,\mathbb{E}[T]} = \frac{1}{p} = V^{\pi}(1) \;\Rightarrow\; \text{consistent estimator.} \]
The MSE of the (single-trajectory) estimator is
\[ \mathbb{E}\Big[ \Big( \frac{T+1}{2} - \frac{1}{p} \Big)^2 \Big] = \frac{1}{2p^2} - \frac{3}{4p} + \frac{1}{4} \le \frac{1}{p^2} - \frac{1}{p}. \]

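The closed-form means and MSEs of the two single-trajectory estimators in the 2-state example can be checked by simulation. The sketch below is illustrative; the function names and the number of runs are assumptions. For $p = 0.5$ the formulas above give a first-visit mean of $1/p = 2$ with MSE $1/p^2 - 1/p = 2$, and an every-visit mean of $(1+p)/(2p) = 1.5$ with MSE $1/(2p^2) - 3/(4p) + 1/4 = 0.75$.

```python
import random

def sample_T(p, rng):
    """Number of steps spent in state 1 before reaching the terminal state (geometric, mean 1/p)."""
    t = 1
    while rng.random() >= p:
        t += 1
    return t

def simulate(p, runs=100000, seed=0):
    """Empirical mean and MSE of the first-visit (T) and every-visit ((T+1)/2) estimators."""
    rng = random.Random(seed)
    v = 1.0 / p  # true value V(1)
    fv_sum = ev_sum = fv_se = ev_se = 0.0
    for _ in range(runs):
        T = sample_T(p, rng)
        fv, ev = float(T), (T + 1) / 2.0
        fv_sum += fv
        ev_sum += ev
        fv_se += (fv - v) ** 2
        ev_se += (ev - v) ** 2
    return {
        "first_visit_mean": fv_sum / runs,
        "every_visit_mean": ev_sum / runs,
        "first_visit_mse": fv_se / runs,
        "every_visit_mse": ev_se / runs,
    }

if __name__ == "__main__":
    for name, value in simulate(0.5).items():
        print(name, round(value, 3))
```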
The Monte-Carlo Algorithm: First-visit vs Every-Visit Monte-Carlo
In general
◮ Every-visit MC: biased but consistent estimator.
◮ First-visit MC: unbiased estimator with potentially bigger MSE.
Remark: when the state space is large, the probability of visiting the same state multiple times is low, so the performance of the two methods tends to be the same.

The Monte-Carlo Algorithm: Monte-Carlo Approximation: Extensions
Full estimate of $V^{\pi}$ over any $x \in X$
◮ Use subtrajectories
◮ Restart from random states over $X$

The Monte-Carlo Algorithm: Monte-Carlo Approximation: Limitations
◮ The estimate $\widehat{V}^{\pi}(x_0)$ is computed only once all trajectories have terminated

The Monte-Carlo Algorithm: Temporal Difference TD(1)
Idea: we can approximate an expectation by an incremental average!
◮ Return of trajectory $i$:
\[ \widehat{R}_i(x_0) = \sum_{t=0}^{T^{(i)}} \gamma^t r^{\pi}(x_t^{(i)}) \]
◮ Estimated value function after trajectory $i$:
\[ \widehat{V}_i^{\pi}(x_0) = (1 - \alpha_i)\,\widehat{V}_{i-1}^{\pi}(x_0) + \alpha_i \widehat{R}_i(x_0) \]

The Monte-Carlo Algorithm: Temporal Difference TD(1)
For $i = 1, \dots, n$
  1. Set $t = 0$
  2. Set initial state $x_0$ [possibly random]
  3. While ($x_t$ not terminal) [execute one trajectory]
     3.1 Take action $a_t = \pi(x_t)$
     3.2 Observe next state $x_{t+1}$ and reward $r_t = r^{\pi}(x_t)$
     3.3 Set $t = t + 1$
     EndWhile
  4. Update $\widehat{V}_i^{\pi}(x_0)$ using TD(1) approximation
EndFor

The Monte-Carlo Algorithm: Temporal Difference TD(1): Properties
◮ If $\alpha_i = 1/i$, then TD(1) is just the incremental version of the empirical mean:
\[ \widehat{V}_i^{\pi}(x_0) = \frac{i-1}{i}\,\widehat{V}_{i-1}^{\pi}(x_0) + \frac{1}{i}\,\widehat{R}_i(x_0) \]
◮ Using a generic step-size (learning rate) $\alpha_i$ gives flexibility to the algorithm

The Monte-Carlo Algorithm: Temporal Difference TD(1): Properties
Proposition. If the learning rate satisfies the Robbins-Monro conditions
\[ \sum_{i=0}^{\infty} \alpha_i = \infty, \qquad \sum_{i=0}^{\infty} \alpha_i^2 < \infty, \]
then
\[ \widehat{V}_n^{\pi}(x_0) \xrightarrow{a.s.} V^{\pi}(x_0). \]

The Monte-Carlo Algorithm: Temporal Difference TD(1): Extensions
◮ Non-episodic problems: truncated trajectories
◮ Multiple sub-trajectories
  ◮ Updates of all the states using sub-trajectories
  ◮ State-dependent learning rate $\alpha_i(x)$
  ◮ $i$ is the index of the number of updates in that specific state

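The sub-trajectory extension with a state-dependent learning rate could look like the sketch below. It is an every-visit variant with $\alpha_i(x) = 1/i$, where $i$ counts the updates of state $x$; the function name and the representation of a trajectory as a list of `(state, reward)` pairs are assumptions made for the example.

```python
from collections import defaultdict

def td1_all_states(episodes, gamma):
    """TD(1) updates of every visited state from the returns of the sub-trajectories.

    episodes: list of trajectories, each a list of (state, reward) pairs.
    """
    V = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        ret = 0.0
        # Walk backwards so that ret is the discounted return of the
        # sub-trajectory starting at the current step.
        for state, reward in reversed(episode):
            ret = reward + gamma * ret
            counts[state] += 1
            alpha = 1.0 / counts[state]  # state-dependent learning rate
            V[state] = (1.0 - alpha) * V[state] + alpha * ret
    return dict(V)

if __name__ == "__main__":
    episodes = [[("a", 1.0), ("b", 2.0)], [("b", 4.0)]]
    print(td1_all_states(episodes, gamma=0.5))
```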
The Monte-Carlo Algorithm: Temporal Difference TD(1): Limitations
◮ The estimate $\widehat{V}^{\pi}(x_0)$ is updated only once the trajectory has completely terminated

The Monte-Carlo Algorithm: The Bellman Equation
Proposition. For any stationary policy $\pi = (\pi, \pi, \dots)$, the state value function at a state $x \in X$ satisfies the Bellman equation:
\[ V^{\pi}(x) = r(x, \pi(x)) + \gamma \sum_{y} p(y \mid x, \pi(x))\, V^{\pi}(y). \]

The Monte-Carlo Algorithm: Temporal Difference TD(0)
Idea: we can approximate $V^{\pi}$ by estimating the Bellman error
◮ Bellman error of a function $V$ in a state $x$:
\[ B^{\pi}(V; x) = r^{\pi}(x) + \gamma \sum_{y} p(y \mid x, \pi(x))\, V(y) - V(x). \]
◮ Temporal difference of a function $\widehat{V}^{\pi}$ for a transition $\langle x_t, r_t, x_{t+1} \rangle$:
\[ \delta_t = r_t + \gamma \widehat{V}^{\pi}(x_{t+1}) - \widehat{V}^{\pi}(x_t) \]
◮ Estimated value function after transition $\langle x_t, r_t, x_{t+1} \rangle$:
\[ \widehat{V}^{\pi}(x_t) = \big( 1 - \alpha_i(x_t) \big) \widehat{V}^{\pi}(x_t) + \alpha_i(x_t) \big( r_t + \gamma \widehat{V}^{\pi}(x_{t+1}) \big) = \widehat{V}^{\pi}(x_t) + \alpha_i(x_t)\,\delta_t \]

The Monte-Carlo Algorithm: Temporal Difference TD(0)
For $i = 1, \dots, n$
  1. Set $t = 0$
  2. Set initial state $x_0$ [possibly random]
  3. While ($x_t$ not terminal) [execute one trajectory]
     3.1 Take action $a_t = \pi(x_t)$
     3.2 Observe next state $x_{t+1}$ and reward $r_t = r^{\pi}(x_t)$
     3.3 Update $\widehat{V}^{\pi}(x_t)$ using TD(0) approximation
     3.4 Set $t = t + 1$
     EndWhile
EndFor

The Monte-Carlo Algorithm: Temporal Difference TD(0): Properties
◮ The update rule
\[ \widehat{V}^{\pi}(x_t) = \big( 1 - \alpha_i(x_t) \big) \widehat{V}^{\pi}(x_t) + \alpha_i(x_t) \big( r_t + \gamma \widehat{V}^{\pi}(x_{t+1}) \big) \]
bootstraps on the current estimate $\widehat{V}^{\pi}$ at other states.
◮ The temporal difference is an unbiased sample of the Bellman error:
\[ \mathbb{E}[\delta_t] = \mathbb{E}\big[ r_t + \gamma \widehat{V}^{\pi}(x_{t+1}) - \widehat{V}^{\pi}(x_t) \big] = T^{\pi}\widehat{V}^{\pi}(x_t) - \widehat{V}^{\pi}(x_t), \]
where the expectation is over the transition from $x_t$.

The Monte-Carlo Algorithm: Temporal Difference TD(0): Properties
Proposition. If the learning rate satisfies the Robbins-Monro conditions in all states $x \in X$,
\[ \sum_{i=0}^{\infty} \alpha_i(x) = \infty, \qquad \sum_{i=0}^{\infty} \alpha_i^2(x) < \infty, \]
and all states are visited infinitely often, then for all $x \in X$
\[ \widehat{V}^{\pi}(x) \xrightarrow{a.s.} V^{\pi}(x). \]

The Monte-Carlo Algorithm: Temporal Difference TD(0)
Set $\widehat{V}^{\pi}(x) = 0, \; \forall x \in X$
For $i = 1, \dots, n$
  1. Set $t = 0$
  2. Set initial state $x_0$
  3. While ($x_t$ not terminal)
     3.1 Take action $a_t = \pi(x_t)$
     3.2 Observe next state $x_{t+1}$ and reward $r_t = r^{\pi}(x_t)$
     3.3 Compute the TD $\delta_t = r_t + \gamma \widehat{V}^{\pi}(x_{t+1}) - \widehat{V}^{\pi}(x_t)$
     3.4 Update the value function estimate in $x_t$ as $\widehat{V}^{\pi}(x_t) = \widehat{V}^{\pi}(x_t) + \alpha_i(x_t)\,\delta_t$
     3.5 Update the learning rate, e.g., $\alpha(x_t) = \dfrac{1}{\#\text{visits}(x_t)}$
     3.6 Set $t = t + 1$
     EndWhile
EndFor

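A runnable sketch of the TD(0) loop is given below, under assumptions that are not in the slides: the environment is a hypothetical function `step(x, rng)` that follows $\pi$ and returns `(next_state, reward, done)`, the value of the terminal state is taken to be 0, and `chain_step` is an invented three-state chain used to compare the estimate with the value obtained from the Bellman equation.

```python
import random
from collections import defaultdict

def td0(step, x0, gamma, n_episodes, seed=0):
    """Tabular TD(0) with learning rate alpha(x) = 1 / #visits(x)."""
    rng = random.Random(seed)
    V = defaultdict(float)      # V(x) = 0 for all x, set once before the first trajectory
    visits = defaultdict(int)
    for _ in range(n_episodes):
        x, done = x0, False
        while not done:
            x_next, r, done = step(x, rng)
            v_next = 0.0 if done else V[x_next]   # terminal state has value 0
            delta = r + gamma * v_next - V[x]     # temporal difference
            visits[x] += 1
            V[x] += delta / visits[x]
            x = x_next
    return dict(V)

def chain_step(x, rng, p=0.5):
    """Hypothetical chain: from x > 0 move to x - 1 w.p. p, else stay; reward 1; state 0 is terminal."""
    x_next = x - 1 if rng.random() < p else x
    return x_next, 1.0, x_next == 0

if __name__ == "__main__":
    gamma, p = 0.9, 0.5
    V = td0(chain_step, 2, gamma, 20000)
    v1 = 1.0 / (1.0 - gamma * (1.0 - p))                      # Bellman equation at state 1
    v2 = (1.0 + gamma * p * v1) / (1.0 - gamma * (1.0 - p))   # Bellman equation at state 2
    print("TD(0):", V[1], V[2])
    print("exact:", v1, v2)
```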
The Monte-Carlo Algorithm: Comparison between TD(1) and TD(0)
TD(1)
◮ Update rule:
\[ \widehat{V}^{\pi}(x_t) = \widehat{V}^{\pi}(x_t) + \alpha(x_t) \big[ \delta_t + \gamma \delta_{t+1} + \cdots + \gamma^{T-t} \delta_T \big]. \]
◮ No bias, large variance
TD(0)
◮ Update rule:
\[ \widehat{V}^{\pi}(x_t) = \widehat{V}^{\pi}(x_t) + \alpha(x_t)\,\delta_t. \]
◮ Potential bias, small variance
⇒ TD($\lambda$) performs intermediate updates!

The Monte-Carlo Algorithm: The $T_{\lambda}^{\pi}$ Bellman operator
Definition. Given $\lambda < 1$, the Bellman operator $T_{\lambda}^{\pi}$ is
\[ T_{\lambda}^{\pi} = (1 - \lambda) \sum_{m \ge 0} \lambda^m (T^{\pi})^{m+1}. \]
Remark: this is a convex combination of the $m$-step Bellman operators $(T^{\pi})^m$, weighted by a sequence of coefficients defined as a function of $\lambda$.

The Monte-Carlo Algorithm: Temporal Difference TD($\lambda$)
Idea: use the whole series of temporal differences to update $\widehat{V}^{\pi}$
◮ Temporal difference of a function $\widehat{V}^{\pi}$ for a transition $\langle x_t, r_t, x_{t+1} \rangle$:
\[ \delta_t = r_t + \gamma \widehat{V}^{\pi}(x_{t+1}) - \widehat{V}^{\pi}(x_t) \]
◮ Estimated value function:
\[ \widehat{V}^{\pi}(x_t) = \widehat{V}^{\pi}(x_t) + \alpha_i(x_t) \sum_{s=t}^{T} (\gamma\lambda)^{s-t} \delta_s \]
⇒ Still requires the whole trajectory before updating...

The Monte-Carlo Algorithm: Temporal Difference TD($\lambda$): Eligibility Traces
◮ Eligibility traces $z \in \mathbb{R}^N$
◮ For every transition $x_t \to x_{t+1}$
  1. Compute the temporal difference
     \[ \delta_t = r^{\pi}(x_t) + \gamma \widehat{V}^{\pi}(x_{t+1}) - \widehat{V}^{\pi}(x_t) \]
  2. Update the eligibility traces
     \[ z(x) = \begin{cases} \lambda z(x) & \text{if } x \ne x_t \\ 1 + \lambda z(x) & \text{if } x = x_t \\ 0 & \text{if } x_t = x_0 \text{ (reset the traces)} \end{cases} \]
  3. For all states $x \in X$
     \[ \widehat{V}^{\pi}(x) \leftarrow \widehat{V}^{\pi}(x) + \alpha(x)\, z(x)\, \delta_t. \]

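The three steps can be sketched as an online algorithm. Several choices below are assumptions rather than something stated on the slides: the traces are decayed by $\gamma\lambda$ (which matches the $(\gamma\lambda)^{s-t}$ weights of the trajectory-based update, whereas the rule above writes the decay as $\lambda$), the traces are reset at the start of each trajectory, the learning rate is $\alpha(x) = 1/\#\text{visits}(x)$, and `step` and `chain_step` are hypothetical helpers.

```python
import random
from collections import defaultdict

def td_lambda(step, x0, gamma, lam, n_episodes, seed=0):
    """Tabular TD(lambda) with accumulating eligibility traces."""
    rng = random.Random(seed)
    V = defaultdict(float)
    visits = defaultdict(int)
    for _ in range(n_episodes):
        z = defaultdict(float)                    # traces reset at the start of each trajectory
        x, done = x0, False
        while not done:
            x_next, r, done = step(x, rng)
            v_next = 0.0 if done else V[x_next]   # terminal state has value 0
            delta = r + gamma * v_next - V[x]     # 1. temporal difference
            for s in list(z):                     # 2. decay all traces (assumed factor gamma * lam)...
                z[s] *= gamma * lam
            z[x] += 1.0                           #    ...and reinforce the current state
            visits[x] += 1
            for s, trace in z.items():            # 3. update every state with a non-zero trace
                V[s] += trace * delta / visits[s]
            x = x_next
    return dict(V)

def chain_step(x, rng, p=0.5):
    """Hypothetical chain: from x > 0 move to x - 1 w.p. p, else stay; reward 1; state 0 is terminal."""
    x_next = x - 1 if rng.random() < p else x
    return x_next, 1.0, x_next == 0

if __name__ == "__main__":
    gamma, p = 0.9, 0.5
    V = td_lambda(chain_step, 2, gamma, lam=0.5, n_episodes=20000)
    v1 = 1.0 / (1.0 - gamma * (1.0 - p))
    v2 = (1.0 + gamma * p * v1) / (1.0 - gamma * (1.0 - p))
    print("TD(lambda):", V[1], V[2])
    print("exact:     ", v1, v2)
```

Because every state with a non-zero trace is updated at each transition, the estimate no longer has to wait for the end of the trajectory.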