

slide-1
SLIDE 1

"It is hard to predict, especially about the future." (Niels Bohr)

"You are what you pretend to be, so be careful what you pretend to be." (Kurt Vonnegut)

Prashanth L A Convergence rate of TD(0) March 27, 2015 1 / 84

slide-2
SLIDE 2

Convergence rate of TD(0) with function approximation

Prashanth L A†

Joint work with Nathaniel Korda♯ and Rémi Munos∗

†Indian Institute of Science, ♯MLRG, Oxford University, ∗Google DeepMind

March 27, 2015


slide-3
SLIDE 3

Background


slide-4
SLIDE 4

Background

Markov Decision Processes (MDPs)

MDP: set of states S, set of actions A, rewards r(s, a).

Transition probability: p(s, a, s′) = Pr{s_{t+1} = s′ | s_t = s, a_t = a}.

Figure: a timeline of states, actions and rewards (s_t, a_t, r_{t+1}, s_{t+1}).

slide-5
SLIDE 5

Background

The Controlled Markov Property

Controlled Markov property: ∀ i₀, i₁, . . . , s, s′ and b₀, b₁, . . . , a,

P(s_{t+1} = s′ | s_t = s, a_t = a, . . . , s₀ = i₀, a₀ = b₀) = p(s, a, s′)

Figure: The Controlled Markov Behaviour

slide-6
SLIDE 6

Background

Value function

Vπ(s) = E[ Σ_{t=0}^∞ β^t r(s_t, π(s_t)) | s₀ = s, π ]

(the expected discounted reward under policy π, starting from state s)

Vπ is the fixed point of the Bellman operator Tπ:

Tπ(V)(s) := r(s, π(s)) + β Σ_{s′} p(s, π(s), s′) V(s′)
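The fixed-point property can be sketched numerically; the 2-state chain below (transition matrix, rewards, discount) is a made-up toy, not from the talk:

```python
import numpy as np

# Hypothetical 2-state chain under a fixed policy pi (all numbers illustrative).
P = np.array([[0.9, 0.1],    # p(s, pi(s), s') for s = 0
              [0.2, 0.8]])   # and for s = 1
r = np.array([1.0, 0.0])     # r(s, pi(s))
beta = 0.9                   # discount factor

def bellman(V):
    # T_pi(V)(s) = r(s, pi(s)) + beta * sum_{s'} p(s, pi(s), s') V(s')
    return r + beta * P @ V

# Iterating T_pi converges to its fixed point V_pi (T_pi is a beta-contraction).
V = np.zeros(2)
for _ in range(1000):
    V = bellman(V)

# Closed form for comparison: V_pi = (I - beta P)^{-1} r.
V_exact = np.linalg.solve(np.eye(2) - beta * P, r)
assert np.allclose(V, V_exact, atol=1e-6)
```

Since Tπ is a β-contraction, the iteration converges geometrically at rate β.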


slide-11
SLIDE 11

Background

Policy evaluation using TD

Temporal difference learning
Problem: estimate the value function of a given policy π.
Solution: use TD(0):

V_{t+1}(s_t) = V_t(s_t) + α_t ( r_{t+1} + β V_t(s_{t+1}) − V_t(s_t) )

Why TD(0)?
  • Simulation-based, like Monte Carlo (no model necessary!)
  • Updates a guess based on another guess (like DP)
  • Guaranteed convergence to the value function Vπ(s) under standard assumptions
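The tabular TD(0) update can be sketched on the same kind of toy chain (transition probabilities, rewards, and the step-size schedule are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1], [0.2, 0.8]])  # p(s, pi(s), s'), illustrative
r = np.array([1.0, 0.0])                # r(s, pi(s))
beta = 0.9                              # discount

V = np.zeros(2)
s = 0
for t in range(200_000):
    alpha_t = 1000.0 / (1000.0 + t)     # diminishing (Robbins-Monro) step-size
    s_next = rng.choice(2, p=P[s])
    # TD(0): move the guess V(s_t) toward another guess r_{t+1} + beta V(s_{t+1}).
    V[s] += alpha_t * (r[s] + beta * V[s_next] - V[s])
    s = s_next

V_exact = np.linalg.solve(np.eye(2) - beta * P, r)
assert np.linalg.norm(V - V_exact) < 1.0  # stochastic estimate, loose tolerance
```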

slide-12
SLIDE 12

Background

Policy evaluation using TD

Temporal difference learning Problem: estimate the value function for a given policy π Solution: Use TD(0) Vt+1(st) = Vt(st) + αt (rt+1 + γVt(st+1) − Vt(st)) . Why TD(0)? Simulation based algorithms like Monte-Carlo (no model necessary!) Update a guess based on another guess (like DP) Guaranteed convergence to value function Vπ(s) under standard assumptions

Prashanth L A Convergence rate of TD(0) March 27, 2015 7 / 84

slide-13
SLIDE 13

Background

TD with Function Approximation

Linear function approximation: Vπ(s) ≈ θᵀφ(s)

  • Parameter θ ∈ R^d, feature φ(s) ∈ R^d; note d ≪ |S|
  • Feature matrix Φ, with rows φ(s)ᵀ, ∀s ∈ S

TD fixed point: Φθ∗ = Π Tπ(Φθ∗), where Π is the orthogonal projection onto B = {Φθ | θ ∈ R^d}
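The projection Π in the fixed-point equation can be sketched as follows (the feature matrix and state weights below are hypothetical, chosen only to illustrate Π):

```python
import numpy as np

# Hypothetical feature matrix Phi (rows phi(s)^T), with d = 2 << |S| = 5.
Phi = np.array([[1.0, 0.0],
                [0.8, 0.2],
                [0.5, 0.5],
                [0.2, 0.8],
                [0.0, 1.0]])
psi = np.full(5, 0.2)        # stationary weights (uniform here, illustrative)
D = np.diag(psi)

# Orthogonal projection (w.r.t. the Psi-weighted norm) onto B = {Phi theta}.
Proj = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)

V = np.array([1.0, 2.0, 0.5, -1.0, 3.0])  # an arbitrary value vector
assert np.allclose(Proj @ (Proj @ V), Proj @ V)        # Pi is idempotent
theta = np.array([3.0, -1.0])
assert np.allclose(Proj @ (Phi @ theta), Phi @ theta)  # Pi fixes elements of B
```

The solve uses ΦᵀDΦ, which is invertible exactly when Φ has full column rank, the linear-independence assumption stated later in the deck.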


slide-15
SLIDE 15

Background

TD(0) with function approximation

θ_{n+1} = θ_n + γ_n ( r(s_n, π(s_n)) + β θ_nᵀφ(s_{n+1}) − θ_nᵀφ(s_n) ) φ(s_n)

A fixed-point iteration with step-sizes γ_n.

  • Tsitsiklis and Van Roy (1997)¹ show that θ_n → θ∗ a.s., where Aθ∗ = b, with A = ΦᵀΨ(I − βP)Φ and b = ΦᵀΨr.

¹ J. N. Tsitsiklis and B. Van Roy (1997). "An analysis of temporal-difference learning with function approximation." In: IEEE Transactions on Automatic Control.
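A sketch of this update on a toy chain with hypothetical features (all constants are illustrative; the comparison point θ∗ solves Aθ∗ = b as on the slide):

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.9, 0.1], [0.2, 0.8]])    # illustrative chain
r = np.array([1.0, 0.0])
beta = 0.9
Phi = np.array([[1.0, 0.0], [0.5, 1.0]])  # hypothetical features, full column rank

theta = np.zeros(2)
s = 0
for n in range(200_000):
    gamma_n = 0.5 * 1000.0 / (1000.0 + n)   # illustrative diminishing step-size
    s_next = rng.choice(2, p=P[s])
    td = r[s] + beta * Phi[s_next] @ theta - Phi[s] @ theta
    theta += gamma_n * td * Phi[s]
    s = s_next

# Tsitsiklis-Van Roy limit: A theta* = b, A = Phi^T Psi (I - beta P) Phi, b = Phi^T Psi r.
psi = np.array([2 / 3, 1 / 3])              # stationary distribution of P
A = Phi.T @ np.diag(psi) @ (np.eye(2) - beta * P) @ Phi
b = Phi.T @ np.diag(psi) @ r
theta_star = np.linalg.solve(A, b)
assert np.linalg.norm(theta - theta_star) < 1.0
```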

slide-16
SLIDE 16

Background

Assumptions

  • Ergodicity: the Markov chain induced by the policy π is irreducible and aperiodic. Moreover, there exists a stationary distribution Ψ (= Ψπ) for this Markov chain.
  • Linear independence: the feature matrix Φ has full column rank ⇒ λ_min(ΦᵀΨΦ) ≥ µ > 0.
  • Bounded rewards: |r(s, π(s))| ≤ 1, for all s ∈ S.
  • Bounded features: ‖φ(s)‖₂ ≤ 1, for all s ∈ S.

slide-17
SLIDE 17

Background

Assumptions (contd.)

  • Step-sizes satisfy Σ_n γ_n = ∞ and Σ_n γ_n² < ∞.
  • Bounded mixing time: ∃ a non-negative function B(·) such that, ∀s₀ ∈ S and m ≥ 0,

Σ_{τ=0}^∞ ‖E(φ(s_τ) | s₀) − E_Ψ(φ(s_τ))‖ ≤ B(s₀),
Σ_{τ=0}^∞ ‖E[φ(s_τ)φ(s_{τ+m})ᵀ | s₀] − E_Ψ[φ(s_τ)φ(s_{τ+m})ᵀ]‖ ≤ B(s₀),

where B(·) satisfies: for any q > 1, there exists a K_q < ∞ such that E[B^q(s) | s₀] ≤ K_q B^q(s₀).

slide-18
SLIDE 18

Concentration bounds: Non-averaged case

"In the long run we are all dead." (John Maynard Keynes)

Question: what happens in a short run of TD(0) with function approximation?

slide-19
SLIDE 19

Concentration bounds: Non-averaged case

Concentration Bounds: Non-averaged TD(0)


slide-20
SLIDE 20

Concentration bounds: Non-averaged case

Non-averaged case: Bound in expectation

Step-size choice: γ_n = c / (2(c + n)), with (1 − β)²µc > 1/2.

Bound in expectation:

E‖θ_n − θ∗‖₂ ≤ K₁(n) / √(n + c), where

K₁(n) = 2√c ‖θ₀ − θ∗‖₂ / (n + c)^{2(1−β)²µc − 1/2} + c(1 − β)(3 + 6H) B(s₀) / ( 2(1 − β)²µc − 1 )

H is an upper bound on ‖θ_n‖₂, for all n.


slide-22
SLIDE 22

Concentration bounds: Non-averaged case

Non-averaged case: High-probability bound

Step-size choice: γ_n = c / (2(c + n)), with ( µ(1 − β)/2 + 3B(s₀) ) c > 1.

High-probability bound:

P( ‖θ_n − θ∗‖₂ ≤ K₂(n) / √(n + c) ) ≥ 1 − δ, where

K₂(n) := (1 − β)c √(ln(1/δ)) (1 + 9B(s₀)²) / ( (µ(1 − β)/2 + 3B(s₀)²)c − 1 ) + K₁(n)

K₁(n) and K₂(n) above are O(1).


slide-25
SLIDE 25

Concentration bounds: Non-averaged case

Why are these bounds problematic?

Obtaining the optimal rate O(1/√n) with a step-size γ_n = c/(c + n):

  • In expectation: requires c chosen such that (1 − β)²µc ∈ (1/2, ∞).
  • In high probability: c should satisfy ( µ(1 − β)/2 + 3B(s₀) ) c > 1.

The optimal rate therefore requires knowledge of the mixing bound B(s₀). Even in finite state-space settings, B(s₀) is a constant, albeit one that depends on the transition dynamics!

Solution: iterate averaging.


slide-27
SLIDE 27

Concentration bounds: Non-averaged case

Proof Outline

Let z_n = θ_n − θ∗ and Γ_n := Σ_{k=1}^n γ_k. We first bound the deviation of this error from its mean:

P( ‖z_n‖₂ − E‖z_n‖₂ ≥ ε ) ≤ exp( −ε² / ( 2 Σ_{i=1}^n L_i² ) ), ∀ε > 0,

and then bound the size of the mean itself:

E‖z_n‖₂ ≤ [ 2 exp(−(1 − β)µΓ_n) ‖z₀‖₂²   (initial error)
  + Σ_{k=1}^{n−1} (3 + 6H)² B(s₀)² γ_{k+1}² exp(−2(1 − β)µ(Γ_n − Γ_{k+1}))   (sampling and mixing error) ]^{1/2}

Note that L_i := γ_i [ Π_{j=i+1}^n ( 1 − 2γ_j ( µ(1 − β) − γ_j/2 ) ) + (1 + β(3 − β)) B(s₀) ]^{1/2}.


slide-29
SLIDE 29

Concentration bounds: Non-averaged case

Proof Outline: Bound in Expectation

Let f_{X_n}(θ) := [ r(s_n, π(s_n)) + β θᵀφ(s_{n+1}) − θᵀφ(s_n) ] φ(s_n). Then the TD update is equivalent to

θ_{n+1} = θ_n + γ_n [ E_Ψ(f_{X_n}(θ_n)) + ε_n + ΔM_n ]   (1)

  • Mixing error: ε_n := E(f_{X_n}(θ_n) | s₀) − E_Ψ(f_{X_n}(θ_n))
  • Martingale sequence: ΔM_n := f_{X_n}(θ_n) − E(f_{X_n}(θ_n) | s₀)

Unrolling (1), we obtain:

z_{n+1} = (I − γ_n A) z_n + γ_n (ε_n + ΔM_n) = Π_n z₀ + Σ_{k=1}^n γ_k Π_n Π_k^{−1} (ε_k + ΔM_k)

Here A := ΦᵀΨ(I − βP)Φ and Π_n := Π_{k=1}^n (I − γ_k A).


slide-31
SLIDE 31

Concentration bounds: Non-averaged case

Proof Outline: Bound in Expectation (contd.)

z_{n+1} = (I − γ_n A) z_n + γ_n (ε_n + ΔM_n) = Π_n z₀ + Σ_{k=1}^n γ_k Π_n Π_k^{−1} (ε_k + ΔM_k)

By Jensen's inequality, we obtain

E(‖z_n‖₂ | s₀) ≤ ( E(⟨z_n, z_n⟩ | s₀) )^{1/2}
  ≤ [ 2 ‖Π_n z₀‖₂² + 3 Σ_{k=1}^n γ_k² ‖Π_n Π_k^{−1}‖₂² E( ‖ε_k‖₂² | s₀ ) + 2 Σ_{k=1}^n γ_k² ‖Π_n Π_k^{−1}‖₂² E( ‖ΔM_k‖₂² | s₀ ) ]^{1/2}

The rest of the proof amounts to bounding each of the terms on the RHS above.

slide-32
SLIDE 32

Concentration bounds: Non-averaged case

Proof Outline: High-Probability Bound

Recall z_n = θ_n − θ∗.

Step 1: (Error decomposition)

‖z_n‖₂ − E‖z_n‖₂ = Σ_{i=1}^n ( g_i − E[g_i | F_{i−1}] ) = Σ_{i=1}^n D_i,

where D_i := g_i − E[g_i | F_{i−1}], g_i := E[‖z_n‖₂ | θ_i], and F_i = σ(θ₁, . . . , θ_i).

Step 2: (Lipschitz continuity)

The functions g_i are Lipschitz continuous with Lipschitz constants L_i.

Step 3: (Concentration inequality)

P( ‖z_n‖₂ − E‖z_n‖₂ ≥ ε ) = P( Σ_{i=1}^n D_i ≥ ε ) ≤ exp(−λε) exp( (αλ²/2) Σ_{i=1}^n L_i² ).


slide-35
SLIDE 35

Concentration bounds: Iterate Averaging

Concentration Bounds: Iterate Averaged TD(0)


slide-36
SLIDE 36

Concentration bounds: Iterate Averaging

Polyak-Ruppert averaging: Bound in expectation

Bigger step-size + averaging:

γ_n := ((1 − β)/2) (c/(c + n))^α,  θ̄_{n+1} := (θ₁ + . . . + θ_n)/n,

with α ∈ (1/2, 1) and c > 0.

Bound in expectation:

E‖θ̄_n − θ∗‖₂ ≤ K₁^IA(n) / (n + c)^{α/2}, where

K₁^IA(n) := (1 + 9B(s₀)²) ‖θ₀ − θ∗‖₂ / (n + c)^{(1−α)/2} + 2β(1 − β)c^α H B(s₀) / ( µ c^α (1 − β)² )^{(1+2α)/(2(1−α))}
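The averaging scheme can be sketched on a generic linear stochastic-approximation problem standing in for the TD iterates (the matrix A, the noise level, and the constants c, α are illustrative assumptions, not the talk's setting):

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[1.0, 0.2], [0.0, 0.5]])   # positive-stable system matrix (illustrative)
b = np.array([1.0, 1.0])
theta_star = np.linalg.solve(A, b)       # target of the iteration

alpha, c = 0.75, 100.0                   # bigger step-size: gamma_n = (c/(c+n))**alpha
theta = np.zeros(2)
theta_bar = np.zeros(2)                  # Polyak-Ruppert average of the iterates
for n in range(1, 100_001):
    gamma_n = (c / (c + n)) ** alpha
    noise = rng.normal(scale=0.1, size=2)
    theta = theta + gamma_n * (b - A @ theta + noise)  # noisy fixed-point step
    theta_bar += (theta - theta_bar) / n               # running average

# Despite the slowly decaying step-size, the average is close to theta_star.
assert np.linalg.norm(theta_bar - theta_star) < 0.1
```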


slide-38
SLIDE 38

Concentration bounds: Iterate Averaging

Iterate averaging: High-probability bound

Bigger step-size + averaging: γ_n := ((1 − β)/2)(c/(c + n))^α, θ̄_{n+1} := (θ₁ + . . . + θ_n)/n.

High-probability bound:

P( ‖θ̄_n − θ∗‖₂ ≤ K₂^IA(n) / (n + c)^{α/2} ) ≥ 1 − δ, where

K₂^IA(n) := (1 + 9B(s₀)²) [ ( µ(1 − β)/2 + B(s₀) ) c^α + 2(3α)^α ] / ( µ (1/2 + B(s₀)) (1 − β) n^{(1−α)/2} ) + K₁(n)


slide-40
SLIDE 40

Concentration bounds: Iterate Averaging

Iterate averaging: High-probability bound (contd.)

With the same bigger step-size γ_n := ((1 − β)/2)(c/(c + n))^α and averaging θ̄_{n+1} := (θ₁ + . . . + θ_n)/n:

α can be chosen arbitrarily close to 1, resulting in a rate O(1/√n).

slide-41
SLIDE 41

Concentration bounds: Iterate Averaging

Proof Outline

Let θ̄_{n+1} := (θ₁ + . . . + θ_n)/n and z_n = θ̄_{n+1} − θ∗. Then

P( ‖z_n‖₂ − E‖z_n‖₂ ≥ ε ) ≤ exp( −ε² / ( 2 Σ_{i=1}^n L_i² ) ), ∀ε > 0,

where L_i := (γ_i/n) [ 1 + Σ_{l=i+1}^{n−1} Π_{j=i}^{l} ( 1 − 2γ_j ( µ(1 − β) − γ_j/2 ) ) + (1 + β(3 − β)) B(s₀) ].

With γ_n = (1 − β)(c/(c + n))^α, we obtain

Σ_{i=1}^n L_i² ≤ ( 2α / ( µ((1 − β)/2 + B(s₀)) ) + 5α )^α · 2 / ( µ² (1/2 + B(s₀))² (1 − β)² ) × 1/n,

so that, in particular, Σ_{i=1}^n L_i² = O(1/n).


slide-43
SLIDE 43

Concentration bounds: Iterate Averaging

Proof outline: Bound in expectation

To bound the expected error we directly average the errors of the non-averaged iterates:

E‖θ̄_{n+1} − θ∗‖₂ ≤ (1/n) Σ_{k=1}^n E‖θ_k − θ∗‖₂,

and then specialise to the choice of step-size γ_n = (1 − β)(c/(c + n))^α:

E‖θ̄_{n+1} − θ∗‖₂ ≤ (1 + 9B(s₀)) [ (1/n) Σ_{k=1}^∞ exp(−µc(k + c)^{1−α}) ‖θ₀ − θ∗‖₂ + 2βHc^α(1 − β) ( µc^α(1 − β) )^{−(1+2α)/(2(1−α))} (n + c)^{−α/2} ]

slide-44
SLIDE 44

Centered TD(0)

Centered TD (CTD)


slide-45
SLIDE 45

Centered TD(0)

The Variance Problem

Why does iterate averaging work?

  • In TD(0), each iterate introduces high variance, which must be controlled by the step-size choice.
  • Averaging the iterates reduces the variance of the final estimator.
  • The reduced variance allows for more exploration within the iterates through larger step-sizes.

slide-46
SLIDE 46

Centered TD(0)

A Control Variate Solution

Centering: another approach to variance reduction

  • Instead of averaging the iterates, one can use an average to guide the iterates.
  • Now all iterates are informed by their history.
  • Constructing this average in epochs allows a constant step-size choice.

slide-47
SLIDE 47

Centered TD(0)

Centering: The Idea

Recall that for TD(0),

θ_{n+1} = θ_n + γ_n ( r(s_n, π(s_n)) + β θ_nᵀφ(s_{n+1}) − θ_nᵀφ(s_n) ) φ(s_n) =: θ_n + γ_n f_n(θ_n),

and that θ_n → θ∗, the solution of F(θ) := Π Tπ(Φθ) − Φθ = 0.

Centering each iterate:

θ_{n+1} = θ_n + γ [ f_n(θ_n) − f_n(θ̄_n) + F(θ̄_n) ]   (*)

slide-48
SLIDE 48

Centered TD(0)

Centering: The Idea (contd.)

θ_{n+1} = θ_n + γ [ f_n(θ_n) − f_n(θ̄_n) + F(θ̄_n) ]   (*)

Why centering helps:

  • No updates after hitting θ∗.
  • An average guides the updates, resulting in low variance of the bracketed term (*).
  • Allows using a (large) constant step-size.
  • O(d) update, the same as TD(0).
  • Working with epochs ⇒ need to store only the averaged iterate θ̄_n and an estimate F̂(θ̄_n).

slide-49
SLIDE 49

Centered TD(0)

Centering: The Idea (contd.)

Centered update: θ_{n+1} = θ_n + γ [ f_n(θ_n) − f_n(θ̄_n) + F(θ̄_n) ]

Challenges compared to gradient descent with an accessible cost function:

  • F is unknown and inaccessible in our setting.
  • To prove convergence bounds, one has to cope with the error due to incomplete mixing.


slide-51
SLIDE 51

Centered TD(0)

Figure: the CTD loop. Simulation (take action π(s_n)), fixed-point update of θ_n via (2), and epoch-wise centering producing θ̄^(m), F̂^(m)(θ̄^(m)), then θ̄^(m+1), F̂^(m+1)(θ̄^(m+1)).

At the beginning of each epoch, an iterate θ̄^(m) is chosen uniformly at random from the previous epoch.

Epoch run: set θ_{mM} := θ̄^(m), and, for n = mM, . . . , (m + 1)M − 1,

θ_{n+1} = θ_n + γ [ f_{X_{i_n}}(θ_n) − f_{X_{i_n}}(θ̄^(m)) + F̂^(m)(θ̄^(m)) ],  where F̂^(m)(θ) := (1/M) Σ_{i=(m−1)M}^{mM} f_{X_i}(θ)   (2)
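A minimal sketch of the centering idea in a batch setting. This is an SVRG-style simplification: F̂ is computed exactly as the empirical average over a fixed dataset rather than estimated from the previous epoch's stream, and the last epoch iterate is kept instead of a uniformly sampled one; the chain, features, and constants are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
beta = 0.5
Phi = np.array([[1.0, 0.0], [0.5, 1.0]])        # hypothetical features
P = np.array([[0.5, 0.5], [0.5, 0.5]])          # fast-mixing illustrative chain
r_vec = np.array([1.0, 0.0])

# Fixed dataset of transitions (states drawn uniformly).
N = 5000
S = rng.integers(2, size=N)
S_next = (rng.random(N) < P[S, 1]).astype(int)

def f(theta, s, s_next):
    # f_X(theta) = (r + beta theta^T phi(s') - theta^T phi(s)) phi(s)
    return (r_vec[s] + beta * Phi[s_next] @ theta - Phi[s] @ theta) * Phi[s]

def F_full(theta):
    # Full empirical average of f over the dataset (simplification of F_hat^(m)).
    return np.mean([f(theta, s, sn) for s, sn in zip(S, S_next)], axis=0)

M, gamma = 500, 0.2                             # epoch length, constant step-size
theta_bar = np.zeros(2)
for m in range(10):
    F_bar = F_full(theta_bar)
    theta = theta_bar.copy()
    for _ in range(M):
        i = rng.integers(N)
        # Centered update: the bracketed term has low variance when theta ~ theta_bar.
        theta += gamma * (f(theta, S[i], S_next[i]) - f(theta_bar, S[i], S_next[i]) + F_bar)
    theta_bar = theta

# Compare with the LSTD solution on the same dataset.
A_hat = Phi[S].T @ (Phi[S] - beta * Phi[S_next]) / N
b_hat = Phi[S].T @ r_vec[S] / N
assert np.linalg.norm(theta_bar - np.linalg.solve(A_hat, b_hat)) < 0.5
```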


slide-53
SLIDE 53

Centered TD(0)

Centering: Results

Epoch length and step-size choice: choose M and γ such that C₁ < 1, where

C₁ := 1 / ( 2µγM((1 − β) − d²γ) ) + γd² / ( 2((1 − β) − d²γ) )

Error bound:

‖Φ(θ̄^(m) − θ∗)‖²_Ψ ≤ C₁^m ‖Φ(θ̄^(0) − θ∗)‖²_Ψ + C₂H(5γ + 4) Σ_{k=1}^{m−1} C₁^{(m−2)−k} B^{kM}_{(k−1)M}(s₀),

where C₂ = γ / ( 2M((1 − β) − d²γ) ) and B^{kM}_{(k−1)M} is an upper bound on the partial sums

Σ_{i=(k−1)M}^{kM} ( E(φ(s_i) | s₀) − E_Ψ(φ(s_i)) ) and Σ_{i=(k−1)M}^{kM} ( E(φ(s_i)φ(s_{i+l})ᵀ | s₀) − E_Ψ(φ(s_i)φ(s_{i+l})ᵀ) ), for l = 0, 1.


slide-55
SLIDE 55

Centered TD(0)

Centering: Results (contd.)

The effect of mixing error: if the Markov chain underlying policy π satisfies

|P(s_t = s | s₀) − ψ(s)| ≤ Cρ^{t/M},

then

‖Φ(θ̄^(m) − θ∗)‖²_Ψ ≤ C₁^m ‖Φ(θ̄^(0) − θ∗)‖²_Ψ + CMC₂H(5γ + 4) max{C₁, ρ^M}^{m−1}

  • When the MDP mixes exponentially fast (e.g. finite state-space MDPs), we get the exponential convergence rate (in the first term).
  • Otherwise the decay of the error is dominated by the mixing rate.


slide-58
SLIDE 58

Centered TD(0)

Proof Outline

Let f̄_{X_{i_n}}(θ_n) := f_{X_{i_n}}(θ_n) − f_{X_{i_n}}(θ̄^(m)) + E_Ψ( f_{X_{i_n}}(θ̄^(m)) ).

Step 1: (Rewriting the CTD update)

θ_{n+1} = θ_n + γ ( f̄_{X_{i_n}}(θ_n) + ε_n ), where ε_n := E( f_{X_{i_n}}(θ̄^(m)) | F_{mM} ) − E_Ψ( f_{X_{i_n}}(θ̄^(m)) )

Step 2: (Bounding the variance of centered updates)

‖ f̄_{X_{i_n}}(θ_n) ‖₂² ≤ d² ( ‖Φ(θ_n − θ∗)‖²_Ψ + ‖Φ(θ̄^(m) − θ∗)‖²_Ψ )


slide-60
SLIDE 60

Centered TD(0)

Proof Outline (contd.)

Step 3: (Analysis for a particular epoch)

E_{θ_n}‖θ_{n+1} − θ∗‖₂² ≤ ‖θ_n − θ∗‖₂² + γ² E_{θ_n}‖ε_n‖₂² + 2γ (θ_n − θ∗)ᵀ E_{θ_n}( f̄_{X_{i_n}}(θ_n) ) + γ² E_{θ_n}‖ f̄_{X_{i_n}}(θ_n) ‖₂²
  ≤ ‖θ_n − θ∗‖₂² − 2γ((1 − β) − d²γ) ‖Φ(θ_n − θ∗)‖²_Ψ + γ²d² ‖Φ(θ̄^(m) − θ∗)‖²_Ψ + γ² E_{θ_n}‖ε_n‖₂²

Summing the above inequality over an epoch, noting that E_{Ψ,θ_n}‖θ_{n+1} − θ∗‖₂² ≥ 0 and

(θ̄^(m) − θ∗)ᵀ I (θ̄^(m) − θ∗) ≤ (1/µ) (θ̄^(m) − θ∗)ᵀ ΦᵀΨΦ (θ̄^(m) − θ∗),

we obtain the following by setting θ₀ = θ̄^(m):

2γM((1 − β) − d²γ) ‖Φ(θ̄^(m+1) − θ∗)‖²_Ψ ≤ ( 1/µ + γ²Md² ) ‖Φ(θ̄^(m) − θ∗)‖²_Ψ + γ² Σ_{i=(m−1)M}^{mM} E_{θ_i}‖ε_i‖₂²

The final step is to unroll (across epochs) the final recursion above to obtain the rate for CTD.

slide-61
SLIDE 61

Centered TD(0)

TD(0) on a batch


slide-62
SLIDE 62

Centered TD(0)

Dilbert’s boss on big data!


slide-63
SLIDE 63

fast LSTD

LSTD - A Batch Algorithm

Given a dataset D := {(s_i, r_i, s′_i), i = 1, . . . , T},

LSTD approximates the TD fixed point by θ̂_T = Ā_T^{−1} b̄_T (O(d²T) complexity), where

Ā_T = (1/T) Σ_{i=1}^T φ(s_i)(φ(s_i) − βφ(s′_i))ᵀ,  b̄_T = (1/T) Σ_{i=1}^T r_i φ(s_i).
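A sketch of the LSTD computation (the dataset is generated from an illustrative chain with hypothetical features; with states drawn uniformly, θ̂_T should approach the corresponding population solution):

```python
import numpy as np

rng = np.random.default_rng(4)
beta = 0.9
Phi = np.array([[1.0, 0.0], [0.5, 1.0]])   # hypothetical features
P = np.array([[0.9, 0.1], [0.2, 0.8]])     # illustrative chain
r_vec = np.array([1.0, 0.0])

# Dataset D = {(s_i, r_i, s'_i)}, states drawn uniformly for simplicity.
T = 200_000
s = rng.integers(2, size=T)
s_next = (rng.random(T) < P[s, 1]).astype(int)
rewards = r_vec[s]

# LSTD: theta_hat = A_bar^{-1} b_bar, an O(d^2 T) computation.
A_bar = Phi[s].T @ (Phi[s] - beta * Phi[s_next]) / T
b_bar = Phi[s].T @ rewards / T
theta_hat = np.linalg.solve(A_bar, b_bar)

# Population counterpart under uniform state weights Psi:
# Phi^T Psi (I - beta P) Phi theta = Phi^T Psi r.
D = 0.5 * np.eye(2)
theta_pop = np.linalg.solve(Phi.T @ D @ (np.eye(2) - beta * P) @ Phi, Phi.T @ D @ r_vec)
assert np.linalg.norm(theta_hat - theta_pop) < 1.0
```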


slide-65
SLIDE 65

fast LSTD

Complexity of LSTD [1]

Figure: LSPI, a batch-mode RL algorithm for control (policy evaluation produces the Q-value Qπ; policy improvement produces the policy π).

LSTD complexity: O(d²T) using the Sherman-Morrison lemma, or O(d^2.807) using Strassen's algorithm, or O(d^2.375) using the Coppersmith-Winograd algorithm.


slide-67
SLIDE 67

fast LSTD

Complexity of LSTD [2]

Problem: practical applications involve high-dimensional features (e.g. computer Go: d ∼ 10⁶) ⇒ solving LSTD is computationally intensive. Related works: GTD¹, GTD2², iLSTD³.

Solution: use stochastic approximation (SA).

  • Complexity: O(dT), a factor-d reduction in complexity.
  • Theory: the SA variant of LSTD does not impact the overall rate of convergence.
  • Experiments: on a traffic control application, the performance of SA-based LSTD is comparable to LSTD, while gaining in runtime!

¹ Sutton et al. (2009). A convergent O(n) algorithm for off-policy temporal-difference learning. In: NIPS.
² Sutton et al. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: ICML.
³ Geramifard et al. (2007). iLSTD: Eligibility traces and convergence analysis. In: NIPS.


slide-69
SLIDE 69

fast LSTD

Fast LSTD using Stochastic Approximation

Pick i_n uniformly at random in {1, . . . , T} (random sampling), then update θ_n using the transition (s_{i_n}, r_{i_n}, s′_{i_n}):

θ_n = θ_{n−1} + γ_n ( r_{i_n} + β θ_{n−1}ᵀφ(s′_{i_n}) − θ_{n−1}ᵀφ(s_{i_n}) ) φ(s_{i_n})

A fixed-point iteration with step-sizes γ_n; complexity O(d) per iteration.
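A sketch of the SA iteration next to the exact LSTD solution on the same dataset (the data and constants are illustrative; the step-size follows the c/(c + n) shape from the slides with an assumed c):

```python
import numpy as np

rng = np.random.default_rng(5)
beta = 0.9
Phi = np.array([[1.0, 0.0], [0.5, 1.0]])   # hypothetical features
P = np.array([[0.9, 0.1], [0.2, 0.8]])     # illustrative chain
r_vec = np.array([1.0, 0.0])

# Dataset (states drawn uniformly for simplicity).
T = 20_000
s = rng.integers(2, size=T)
s_next = (rng.random(T) < P[s, 1]).astype(int)
rewards = r_vec[s]

# Exact LSTD solution on this dataset: the target of the SA iterates.
A_bar = Phi[s].T @ (Phi[s] - beta * Phi[s_next]) / T
b_bar = Phi[s].T @ rewards / T
theta_hat = np.linalg.solve(A_bar, b_bar)

# Fast LSTD: O(d) updates on uniformly sampled transitions.
theta = np.zeros(2)
c = 500.0
for n in range(1, 200_001):
    gamma_n = (1 - beta) * c / (2 * (c + n))
    i = rng.integers(T)
    td = rewards[i] + beta * Phi[s_next[i]] @ theta - Phi[s[i]] @ theta
    theta += gamma_n * td * Phi[s[i]]

assert np.linalg.norm(theta - theta_hat) < 1.0
```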


slide-71
SLIDE 71

fast LSTD

Assumptions

Setting: given a dataset D := {(s_i, r_i, s′_i), i = 1, . . . , T}.

(A1) ‖φ(s_i)‖₂ ≤ 1 (bounded features)
(A2) |r_i| ≤ R_max < ∞ (bounded rewards)
(A3) λ_min( (1/T) Σ_{i=1}^T φ(s_i)φ(s_i)ᵀ ) ≥ µ (the covariance matrix has a positive minimum eigenvalue)

slide-72
SLIDE 72

fast LSTD

Assumptions

Setting: Given dataset D := {(si, ri, s′

i), i = 1, . . . , T)}

(A1) φ(si)2 ≤ 1 (A2) |ri| ≤ Rmax < ∞ (A3) λmin

  • 1

T

T

  • i=1

φ(si)φ(si)

T

  • ≥ µ.

Bounded features Bounded rewards Co-variance matrix has a min-eigenvalue

Prashanth L A Convergence rate of TD(0) March 27, 2015 44 / 84

slide-73
SLIDE 73

fast LSTD

Assumptions

Setting: Given dataset D := {(si, ri, s′

i), i = 1, . . . , T)}

(A1) φ(si)2 ≤ 1 (A2) |ri| ≤ Rmax < ∞ (A3) λmin

  • 1

T

T

  • i=1

φ(si)φ(si)

T

  • ≥ µ.

Bounded features Bounded rewards Co-variance matrix has a min-eigenvalue

Prashanth L A Convergence rate of TD(0) March 27, 2015 44 / 84

slide-74
SLIDE 74

fast LSTD

Assumptions

Setting: Given dataset D := {(si, ri, s′

i), i = 1, . . . , T)}

(A1) φ(si)2 ≤ 1 (A2) |ri| ≤ Rmax < ∞ (A3) λmin

  • 1

T

T

  • i=1

φ(si)φ(si)

T

  • ≥ µ.

Bounded features Bounded rewards Co-variance matrix has a min-eigenvalue

Prashanth L A Convergence rate of TD(0) March 27, 2015 44 / 84
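Assumption (A3) can be checked numerically on a given dataset. A small sketch, assuming features are stored row-wise in a (T, d) array (the helper name is illustrative):

```python
import numpy as np

def min_cov_eigenvalue(phi):
    """Smallest eigenvalue of the empirical feature covariance
    (1/T) * sum_i phi(s_i) phi(s_i)^T, i.e. the mu of (A3)."""
    T = phi.shape[0]
    cov = phi.T @ phi / T               # d x d empirical covariance
    return np.linalg.eigvalsh(cov)[0]   # eigvalsh returns eigenvalues in ascending order
```

If this returns a value close to zero, the step-size constants below degrade, which is one motivation for the iterate-averaging variant whose constants do not depend on µ.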

slide-75
SLIDE 75

fast LSTD

Convergence Rate

Step-size choice: γ_n = (1 − β)c / (2(c + n)), with (1 − β)^2 µc ∈ (1.33, 2)

Bound in expectation: E ‖θ_n − ˆθ_T‖_2 ≤ K_1(n) / √(n + c)

High-probability bound: P( ‖θ_n − ˆθ_T‖_2 ≤ K_2(n) / √(n + c) ) ≥ 1 − δ

By iterate averaging, the dependency of c on µ can be removed.

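The admissible step-size choice above is easy to operationalize: pick c so that (1 − β)^2 µc lands inside (1.33, 2), then generate the schedule. A sketch, assuming β and µ are known; `target` is an illustrative knob standing in for any admissible value of (1 − β)^2 µc:

```python
def td0_step_sizes(beta, mu, n_steps, target=1.5):
    """Choose c so that (1 - beta)^2 * mu * c == target (any value in
    (1.33, 2) is admissible per the slides), then return the schedule
    gamma_n = (1 - beta) * c / (2 * (c + n)) for n = 1..n_steps."""
    c = target / ((1.0 - beta) ** 2 * mu)
    gammas = [(1.0 - beta) * c / (2.0 * (c + n)) for n in range(1, n_steps + 1)]
    return c, gammas
```

Note how a small µ or a β close to 1 forces a large c, i.e., an aggressive early step size; removing that dependence is what the iterate-averaging slide addresses.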

slide-79
SLIDE 79

fast LSTD

The constants

K_1(n) = √c ‖θ_0 − ˆθ_T‖_2 / n^{((1−β)^2 µc − 1)/2} + (1 − β) c h^2(n) / 2,

K_2(n) = (1 − β) c √(log δ^{−1} / 2) · ( 4 / (3(1 − β)^2 µc − 1) ) + K_1(n),

where h(n) := (1 + R_max + β)^2 max{ (‖θ_0 − ˆθ_T‖_2 + ln n + ‖ˆθ_T‖_2) / 4, 1 }.

Both K_1(n) and K_2(n) are O(1).


slide-80
SLIDE 80

fast LSTD

Iterate Averaging

Bigger step-size + averaging:

γ_n := ((1 − β)/2) (c / (c + n))^α,   ¯θ_{n+1} := (θ_1 + . . . + θ_n)/n

Bound in expectation: E ‖¯θ_n − ˆθ_T‖_2 ≤ K_1^{IA}(n) / (n + c)^{α/2}

High-probability bound: P( ‖¯θ_n − ˆθ_T‖_2 ≤ K_2^{IA}(n) / (n + c)^{α/2} ) ≥ 1 − δ

The dependency of c on µ is removed, at the cost of (1 − α)/2 in the rate.

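The iterate-averaging variant above amounts to running the same fLSTD-SA loop with the larger step size and keeping a running Polyak-Ruppert average. A minimal sketch under the same illustrative data layout as before:

```python
import numpy as np

def flstd_sa_averaged(phi, rewards, phi_next, beta, n_iters,
                      c=1.0, alpha=0.6, seed=0):
    """fLSTD-SA with the bigger step size
    gamma_n = ((1 - beta)/2) * (c / (c + n))^alpha and
    Polyak-Ruppert averaging of the iterates theta_1..theta_n."""
    rng = np.random.default_rng(seed)
    T, d = phi.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for n in range(1, n_iters + 1):
        i = rng.integers(T)
        gamma = 0.5 * (1.0 - beta) * (c / (c + n)) ** alpha
        td_err = rewards[i] + beta * theta @ phi_next[i] - theta @ phi[i]
        theta = theta + gamma * td_err * phi[i]
        theta_bar += (theta - theta_bar) / n   # running average of theta_1..theta_n
    return theta_bar
```

The step size here does not involve µ, matching the slide's point that averaging removes the dependency of c on µ.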

slide-84
SLIDE 84

fast LSTD

The constants

K_1^{IA}(n) := C ‖θ_0 − ˆθ_T‖_2 / (n + c)^{(1−α)/2} + ( h(n) c^α (1 − β) / (µ c^α (1 − β)^2)^α )^{(1+2α)/(2(1−α))},

and K_2^{IA}(n) := ( √(log δ^{−1}) / (µ(1 − β)) ) ( 3α + (µ c^α (1 − β)^2 + 2^α α)/2 ) · 1/(n + c)^{(1−α)/2} + K_1^{IA}(n).

As before, both K_1^{IA}(n) and K_2^{IA}(n) are O(1).


slide-85
SLIDE 85

fast LSTD

Performance bounds

True value function v; approximate value function ˜v_n := Φθ_n.

‖v − ˜v_n‖_T ≤ ‖v − Πv‖_T / √(1 − β^2)   [approximation error]
  + O( √( d / ((1 − β)^2 µT) ) )   [estimation error]
  + O( √( (1 / ((1 − β)^2 µ^2 n)) ln(1/δ) ) )   [computational error]

1 ‖f‖_T^2 := T^{−1} Σ_{i=1}^T f(s_i)^2, for any function f.
2 Lazaric, A., Ghavamzadeh, M., Munos, R. (2012) Finite-sample analysis of least-squares policy iteration. In: JMLR

slide-86
SLIDE 86

fast LSTD

Performance bounds

‖v − ˜v_n‖_T ≤ ‖v − Πv‖_T / √(1 − β^2)   [approximation error]
  + O( √( d / ((1 − β)^2 µT) ) )   [estimation error]
  + O( √( (1 / ((1 − β)^2 µ^2 n)) ln(1/δ) ) )   [computational error]

The approximation and estimation errors are artifacts of function approximation and least-squares methods; the computational error is a consequence of using SA for LSTD. Setting n = ln(1/δ) T / (dµ), the convergence rate is unaffected!

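The budget n = ln(1/δ) T / (dµ) that keeps the rate unaffected is a one-line computation; a sketch with an illustrative helper name:

```python
import math

def sa_iteration_budget(T, d, mu, delta):
    """Number of fLSTD-SA iterations n = ln(1/delta) * T / (d * mu)
    for which the computational error is dominated by the estimation
    error, per the performance bound on the slide."""
    return math.ceil(math.log(1.0 / delta) * T / (d * mu))
```

For example, with a large sample set but high-dimensional features, the required number of O(d) SA iterations can be far cheaper than one O(d^2) or O(d^3) batch LSTD solve.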

slide-89
SLIDE 89

Fast LSPI using SA

LSPI - A Quick Recap

Alternate policy evaluation (compute the Q-value Q^π) and policy improvement (compute a new policy π′):

Q^π(s, a) = E[ Σ_{t=0}^∞ β^t r(s_t, π(s_t)) | s_0 = s, a_0 = a ]

π′(s) = arg max_{a∈A} θ^T φ(s, a)


slide-91
SLIDE 91

Fast LSPI using SA

Policy Evaluation: LSTDQ and its SA variant

Given a set of samples D := {(s_i, a_i, r_i, s′_i), i = 1, . . . , T}

LSTDQ approximates Q^π by ˆθ_T = ¯A_T^{−1} ¯b_T, where

¯A_T = (1/T) Σ_{i=1}^T φ(s_i, a_i) (φ(s_i, a_i) − β φ(s′_i, π(s′_i)))^T, and ¯b_T = T^{−1} Σ_{i=1}^T r_i φ(s_i, a_i).

Fast LSTDQ using SA:

θ_k = θ_{k−1} + γ_k ( r_{i_k} + β θ_{k−1}^T φ(s′_{i_k}, π(s′_{i_k})) − θ_{k−1}^T φ(s_{i_k}, a_{i_k}) ) φ(s_{i_k}, a_{i_k})

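For contrast with the SA variant, the batch LSTDQ solve above is a direct linear-algebra computation. A minimal sketch, assuming the state-action features φ(s_i, a_i) and the next-state policy features φ(s′_i, π(s′_i)) are precomputed as row-wise arrays (names illustrative):

```python
import numpy as np

def lstdq(phi_sa, rewards, phi_next_pi, beta):
    """Batch LSTDQ: theta_hat = A_T^{-1} b_T with
    A_T = (1/T) sum_i phi_i (phi_i - beta * phi'_i)^T and
    b_T = (1/T) sum_i r_i phi_i, where phi_i = phi(s_i, a_i)
    and phi'_i = phi(s'_i, pi(s'_i))."""
    T = phi_sa.shape[0]
    A = phi_sa.T @ (phi_sa - beta * phi_next_pi) / T   # d x d
    b = phi_sa.T @ rewards / T                          # d
    return np.linalg.solve(A, b)                        # O(d^3) solve
```

The O(d^3) solve (or O(d^2) per-sample Sherman-Morrison update) is exactly the cost that fast LSTDQ-SA avoids.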

slide-93
SLIDE 93

Fast LSPI using SA

Fast LSPI using SA (fLSPI-SA)

Input: sample set D := {(s_i, a_i, r_i, s′_i)}_{i=1}^T

repeat
  Policy evaluation: for k = 1 to τ
    • draw a random sample index i_k ∼ U({1, . . . , T})
    • update the fLSTD-SA iterate θ_k
  θ′ ← θ_τ, ∆ = ‖θ − θ′‖_2
  Policy improvement: obtain the greedy policy π′(s) = arg max_{a∈A} θ′^T φ(s, a)
  θ ← θ′, π ← π′
until ∆ < ε

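The fLSPI-SA loop above can be sketched end to end. This is an illustrative skeleton, not the authors' implementation: `feat(s, a)` and the transition tuples are assumed inputs, and a `max_rounds` cap is added so the sketch always terminates.

```python
import numpy as np

def flspi_sa(feat, data, n_actions, beta, tau,
             eps=1e-3, c=1.0, seed=0, max_rounds=100):
    """fLSPI-SA: alternate fLSTDQ-SA policy evaluation with greedy
    policy improvement until the parameter change drops below eps.
    feat(s, a) -> feature vector; data = list of (s, a, r, s_next)."""
    rng = np.random.default_rng(seed)
    d = feat(*data[0][:2]).shape[0]
    theta = np.zeros(d)

    def greedy(s, th):
        # policy improvement step: pi(s) = argmax_a th^T feat(s, a)
        return max(range(n_actions), key=lambda a: th @ feat(s, a))

    for _ in range(max_rounds):
        theta_k = theta.copy()
        for k in range(1, tau + 1):                    # policy evaluation
            s, a, r, s_next = data[rng.integers(len(data))]
            gamma = (1.0 - beta) * c / (2.0 * (c + k))
            a_next = greedy(s_next, theta)             # pi(s') under the current policy
            td_err = (r + beta * theta_k @ feat(s_next, a_next)
                      - theta_k @ feat(s, a))
            theta_k = theta_k + gamma * td_err * feat(s, a)
        delta = np.linalg.norm(theta - theta_k)
        theta = theta_k
        if delta < eps:                                # until Delta < eps
            break
    return theta
```

Each outer round costs O(τ d) plus the greedy maximizations, versus a fresh O(d^2)-per-sample LSTDQ solve in standard LSPI.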

slide-95
SLIDE 95

Experiments - Traffic Signal Control

The traffic control problem


slide-96
SLIDE 96

Experiments - Traffic Signal Control

Simulation Results on 7x9-grid network

Tracking error: ‖θ_k − ˆθ_T‖_2 versus step k of fLSTD-SA (figure omitted).

Throughput (TAR): total arrived road users over time steps, LSPI vs. fLSPI-SA (figure omitted).


slide-97
SLIDE 97

Experiments - Traffic Signal Control

Runtime Performance on three road networks

Network (d)          LSPI runtime (ms)   fLSPI-SA runtime (ms)
7x9-Grid (d = 504)   4,917               66
14x9-Grid (d = 1008) 30,144              159
14x18-Grid (d = 2016) 1.91 · 10^5        287


slide-98
SLIDE 98

Experiments - Traffic Signal Control

SGD in Linear Bandits


slide-99
SLIDE 99

Experiments - Traffic Signal Control

Complacs News Recommendation Platform

NOAM database: 17 million articles from 2010. Task: find the best among 2000 news feeds. Reward: relevancy score of the article. Feature dimension: approximately 80,000.

1 In collaboration with Nello Cristianini and Tom Welfare at the University of Bristol


slide-103
SLIDE 103

Experiments - Traffic Signal Control

More on relevancy score

Problem: Find the best news feed for Crime stories Sample scores:

• Five dead in Finnish mall shooting (Score: 1.93)
• Holidays provide more opportunities to drink (Score: −0.48)
• Russia raises price of vodka (Score: 2.67)
• Why Obama Care Must Be Defeated (Score: 0.43)
• University closure due to weather (Score: −1.06)


slide-108
SLIDE 108

Experiments - Traffic Signal Control

A linear bandit algorithm

Loop: choose x_n → observe y_n → estimate the UCBs.

x_n := arg max_{x∈D} UCB(x), with rewards y_n such that E[y_n | x_n] = x_n^T θ∗

Regression is used to compute UCB(x) := x^T ˆθ_n + α √( x^T A_n^{−1} x )
Prashanth L A Convergence rate of TD(0) March 27, 2015 60 / 84

slide-109
SLIDE 109

Experiments - Traffic Signal Control

A linear bandit algorithm

Choose xn Observe yn Estimate UCBs xn := arg max

x∈D

UCB(x) Rewards yn s.t. E[yn | xn] = xT

nθ∗

Regression used to compute UCB(x) := xTˆ θn + α

  • xTA−1

n x

Prashanth L A Convergence rate of TD(0) March 27, 2015 60 / 84

slide-110
SLIDE 110

Experiments - Traffic Signal Control

A linear bandit algorithm

Choose xn Observe yn Estimate UCBs xn := arg max

x∈D

UCB(x) Rewards yn s.t. E[yn | xn] = xT

nθ∗

Regression used to compute UCB(x) := xTˆ θn + α

  • xTA−1

n x

Prashanth L A Convergence rate of TD(0) March 27, 2015 60 / 84

slide-111
SLIDE 111

Experiments - Traffic Signal Control

A linear bandit algorithm

Choose xn Observe yn Estimate UCBs xn := arg max

x∈D

UCB(x) Rewards yn s.t. E[yn | xn] = xT

nθ∗

Regression used to compute UCB(x) := xTˆ θn + α

  • xTA−1

n x

Prashanth L A Convergence rate of TD(0) March 27, 2015 60 / 84
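The choose/observe/estimate loop above can be sketched as a small LinUCB-style routine. This is a textbook sketch, not the Complacs implementation: the exploration weight `alpha`, ridge term `reg`, and the explicit matrix inverse (exactly the cost that fast GD later avoids) are illustrative choices.

```python
import numpy as np

def linucb(arms, reward_fn, T, alpha=1.0, reg=1.0):
    """LinUCB sketch: maintain A_n = reg*I + sum x x^T and
    b_n = sum y x; play the arm maximizing
    x^T theta_hat + alpha * sqrt(x^T A_n^{-1} x)."""
    d = arms.shape[1]
    A = reg * np.eye(d)
    b = np.zeros(d)
    total = 0.0
    for _ in range(T):
        A_inv = np.linalg.inv(A)          # O(d^3): the bottleneck at d ~ 10^5
        theta_hat = A_inv @ b             # ridge-regression estimate
        widths = np.sqrt(np.einsum('ij,jk,ik->i', arms, A_inv, arms))
        x = arms[np.argmax(arms @ theta_hat + alpha * widths)]
        y = reward_fn(x)                  # observe the reward
        A += np.outer(x, x)
        b += y * x
        total += y
    return theta_hat, total
```

With d around 10^5 as on the news-feed platform, the per-round inverse is prohibitive, which motivates replacing the regression step with an O(d) GD update.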

slide-112
SLIDE 112

Experiments - Traffic Signal Control

UCB values

UCB(x) = ˆµ(x) + α ˆσ(x): a mean-reward estimate plus a confidence width. Analogy: at each round t, select a tap, and optimize the quality of the n selected beers.


slide-115
SLIDE 115

Experiments - Traffic Signal Control

UCB values

Linearity ⇒ no need to estimate the mean reward of every arm; estimating θ∗ is enough.

Regression: ˆθ_n = A_n^{−1} b_n

UCB(x) = ˆµ(x) + α ˆσ(x), where ˆσ(x) = √( x^T A_n^{−1} x ) is the Mahalanobis distance of x with respect to A_n.

Optimize the beer you drink, before you get drunk.


slide-118
SLIDE 118

Experiments - Traffic Signal Control

Performance measure

Best arm: x∗ = arg min_{x∈D} {x^T θ∗}. Regret: R_T = Σ_{i=1}^T (x_i − x∗)^T θ∗

Goal: ensure that R_T grows sub-linearly with T. Linear bandit algorithms ensure sub-linear regret!


slide-120
SLIDE 120

Experiments - Traffic Signal Control

Complexity of Least Squares Regression

Loop: choose x_n → observe y_n → estimate ˆθ_n (a typical ML algorithm using regression).

Regression complexity: O(d^2) per update using the Sherman-Morrison lemma, or O(d^2.807) using Strassen's algorithm, or O(d^2.376) using the Coppersmith-Winograd algorithm.

Problem: the Complacs news feed platform has high-dimensional features (d ∼ 10^5) ⇒ solving OLS is computationally costly.


slide-122
SLIDE 122

Experiments - Traffic Signal Control

Fast GD for Regression

Random sampling: pick i_n uniformly in {1, . . . , n} and update θ_n using (x_{i_n}, y_{i_n}) (a GD update yielding θ_{n+1}).

Solution: use fast (online) gradient descent (GD)
• Efficient, with a complexity of only O(d) per iteration (well known)
• High-probability bounds with explicit constants can be derived (not fully known previously)

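The fast GD regression step above is a one-line update per round. A minimal sketch, using the step-size shape γ_n = c/(4(c + n)) from the error-bound slides that follow; the data layout is illustrative:

```python
import numpy as np

def fast_gd_regression(X, y, n_iters, c=1.0, seed=0):
    """Fast (online) GD for least squares: sample a data point uniformly
    and take the O(d) gradient step
    theta <- theta + gamma_n * (y_i - theta^T x_i) * x_i."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    theta = np.zeros(d)
    for n in range(1, n_iters + 1):
        i = rng.integers(N)                 # pick i_n uniformly in {1, ..., n_data}
        gamma = c / (4.0 * (c + n))         # step size as in the later error bound
        theta = theta + gamma * (y[i] - theta @ X[i]) * X[i]
    return theta
```

Each step touches one row of X, so the per-round cost is O(d) instead of the O(d^2)-or-worse regression updates listed on the previous slide.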

slide-123
SLIDE 123

Experiments - Traffic Signal Control

Bandits+GD for News Recommendation

LinUCB: a well-known contextual bandit algorithm that employs regression in each iteration. Fast GD provides a good approximation to regression, at low computational cost.

Strongly convex bandits: no loss in regret except log factors. Proved! Non-strongly-convex bandits: encouraging empirical results for LinUCB + fast GD on two news feed platforms.


slide-125
SLIDE 125

Strongly convex bandits

fast GD

Random sampling: pick i_n uniformly in {1, . . . , n} and update θ_n using (x_{i_n}, y_{i_n}).

GD update (with step-sizes γ_n, using the sample gradient):

θ_n = θ_{n−1} + γ_n ( y_{i_n} − θ_{n−1}^T x_{i_n} ) x_{i_n}


slide-128
SLIDE 128

Strongly convex bandits

Assumptions

Setting: y_n = x_n^T θ∗ + ξ_n, where the ξ_n are i.i.d. zero-mean.

(A1) sup_n ‖x_n‖_2 ≤ 1 (bounded features)
(A2) |ξ_n| ≤ 1, ∀n (bounded noise)
(A3) λ_min( (1/n) Σ_{i=1}^{n−1} x_i x_i^T ) ≥ µ (strongly convex covariance matrix, for each n!)


slide-132
SLIDE 132

Strongly convex bandits

Why is deriving error bounds difficult?

θ_n − ˆθ_n = θ_n − ˆθ_{n−1} + ˆθ_{n−1} − ˆθ_n
          = θ_{n−1} − ˆθ_{n−1} + ˆθ_{n−1} − ˆθ_n + γ_n ( y_{i_n} − θ_{n−1}^T x_{i_n} ) x_{i_n}
          = Π_n (θ_0 − θ∗)   [initial error]
            + Σ_{k=1}^n γ_k Π_n Π_k^{−1} ∆˜M_k   [sampling error]
            − Σ_{k=1}^n Π_n Π_k^{−1} ( ˆθ_k − ˆθ_{k−1} )   [drift error]

The initial and sampling errors are present in earlier SGD works and can be handled easily; the drift error is a consequence of the changing target and is hard to control!

Note: ¯A_n = (1/n) Σ_{i=1}^n x_i x_i^T, Π_n := Π_{k=1}^n ( I − γ_k ¯A_k ), and ∆˜M_k is a martingale difference.


slide-135
SLIDE 135

Strongly convex bandits

Handling Drift Error

Note F_n(θ) := (1/2) Σ_{i=1}^n ( y_i − θ^T x_i )^2 and ¯A_n = (1/n) Σ_{i=1}^n x_i x_i^T. Also, E[y_n | x_n] = x_n^T θ∗.

To control the drift error, we observe that

∇F_n(ˆθ_n) = 0 = ∇F_{n−1}(ˆθ_{n−1})  ⇒  ˆθ_{n−1} − ˆθ_n = ξ_n A_{n−1}^{−1} x_n − ( x_n^T ( ˆθ_n − θ∗ ) ) A_{n−1}^{−1} x_n.

Thus, the drift is controlled by the convergence of ˆθ_n to θ∗. Key: the confidence-ball result 1.

1 Dani, V., Hayes, T.P., and Kakade, S.M. (2008) Stochastic Linear Optimization under Bandit Feedback. In: COLT


slide-138
SLIDE 138

Strongly convex bandits

Error bound

With γ_n = c/(4(c + n)) and µc/4 ∈ (2/3, 1), we have:

High-probability bound: for any δ > 0,

P( ‖θ_n − ˆθ_n‖_2 ≤ √( (K_{µ,c}/n) log(1/δ) ) + h_1(n)/√n ) ≥ 1 − δ.   Optimal rate O(n^{−1/2})

Bound in expectation:

E ‖θ_n − ˆθ_n‖_2 ≤ ‖θ_0 − ˆθ_n‖_2 / n^{µc}   [initial error]   + h_2(n)/√n   [sampling error]

1 K_{µ,c} is a constant depending on µ and c, and h_1(n), h_2(n) hide log factors. 2 By iterate averaging, the dependency of c on µ can be removed.


slide-141
SLIDE 141

Strongly convex bandits

PEGE Algorithm1

Input: a basis {b_1, . . . , b_d} ⊂ D for R^d.

For each cycle m = 1, 2, . . . do
  Exploration phase: for i = 1 to d
    • choose arm b_i
    • observe y_i(m)
  Pull each of the d basis arms once and, using the losses, compute the OLS estimate:

  ˆθ_{md} = (1/m) ( Σ_{i=1}^d b_i b_i^T )^{−1} Σ_{j=1}^m Σ_{i=1}^d b_i y_i(j)

  Exploitation phase: use the OLS estimate to compute a greedy decision, x = arg min_{x∈D} { ˆθ_{md}^T x }, and choose arm x m times consecutively.

1 P. Rusmevichientong and J.N. Tsitsiklis (2010) Linearly Parameterized Bandits. In: Math. Oper. Res.

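The explore/exploit cycle of PEGE can be sketched compactly. An illustrative sketch only: it uses the standard basis as the exploration arms and a noiseless-or-Gaussian reward model, neither of which is specified by the slide.

```python
import numpy as np

def pege(arms, theta_star, n_cycles, noise=0.0, seed=0):
    """PEGE sketch: in cycle m, pull each basis arm once, recompute the
    OLS estimate over all m cycles, then pull the greedy (minimum-loss)
    arm m times consecutively. Returns the final estimate and the
    sequence of observed losses."""
    rng = np.random.default_rng(seed)
    d = len(theta_star)
    basis = np.eye(d)                 # assumed exploration basis {b_1, ..., b_d}
    B = basis.T @ basis               # sum_i b_i b_i^T (identity here)
    acc = np.zeros(d)                 # accumulates sum_j sum_i b_i y_i(j)
    losses = []
    for m in range(1, n_cycles + 1):
        for i in range(d):                            # exploration phase
            y = basis[i] @ theta_star + noise * rng.normal()
            acc += basis[i] * y
            losses.append(y)
        theta_hat = np.linalg.solve(B, acc) / m       # OLS over m cycles
        x = arms[np.argmin(arms @ theta_hat)]         # exploitation phase
        for _ in range(m):
            losses.append(x @ theta_star + noise * rng.normal())
    return theta_hat, losses
```

The growing exploitation phase is what drives the √T regret: exploration rounds grow only linearly in the cycle index while exploitation rounds grow quadratically in total.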

slide-145
SLIDE 145

Strongly convex bandits

PEGE Algorithm with fast GD

Input: a basis {b_1, . . . , b_d} ⊂ D for R^d.

For each cycle m = 1, 2, . . . do
  Exploration phase: for i = 1 to d
    • choose arm b_i
    • observe y_i(m)
  Pull each of the d basis arms once and, using the losses, update the fast GD iterate θ_{md}.
  Exploitation phase: use the fast GD iterate to compute a greedy decision, x = arg min_{x∈D} { θ_{md}^T x }, and choose arm x m times consecutively.


slide-149
SLIDE 149

Strongly convex bandits

Regret bound for PEGE+fast GD

(Strongly Convex Arms) (A3): The function G : θ ↦ arg min_{x∈D} {θᵀx} is J-Lipschitz.

Theorem

Under (A1)–(A3), the regret R_T := Σ_{i=1}^{T} ( x_iᵀθ∗ − min_{x∈D} xᵀθ∗ ) satisfies

  R_T ≤ C K₁(n) 2^{d−1} ( ‖θ∗‖₂ + ‖θ∗‖₂⁻¹ ) √T.

The bound is worse than that for PEGE by only a factor of O(log⁴(n)).

Prashanth L A Convergence rate of TD(0) March 27, 2015 74 / 84

slide-150
SLIDE 150

Non-strongly convex bandits

Fast linUCB

In each round n: choose arm x_n and observe reward y_n, where E[y_n | x_n] = x_nᵀθ∗.

Fast GD maintains an iterate θ_n that is used in place of the exact least-squares estimate θ̂_n when computing the index:

  UCB(x) := xᵀθ_n + α √( xᵀ φ_n(x) ),   x_n := arg max_{x∈D} UCB(x).

Prashanth L A Convergence rate of TD(0) March 27, 2015 75 / 84
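The fast LinUCB loop can be sketched as follows. Note the assumptions: the confidence width α √(xᵀ A_n⁻¹ x), with A_n the λ-regularized design matrix, is an assumed stand-in for the talk's φ_n(x), and the noise level and constant step size are illustrative choices.

```python
import numpy as np

def fast_linucb(D, theta_star, horizon=500, alpha=1.0, lam=1.0, step=0.2, seed=0):
    """Sketch of LinUCB with a GD-tracked estimate in place of an exact solve.

    Assumed width: alpha * sqrt(x^T A_n^{-1} x) with A_n the regularized
    design matrix; the talk's phi_n(x) may differ.
    """
    rng = np.random.default_rng(seed)
    d = D.shape[1]
    theta = np.zeros(d)        # GD iterate tracking the LS estimate
    A = lam * np.eye(d)        # regularized design matrix
    for n in range(1, horizon + 1):
        A_inv = np.linalg.inv(A)
        widths = np.sqrt(np.einsum('kd,de,ke->k', D, A_inv, D))
        x = D[np.argmax(D @ theta + alpha * widths)]      # x_n := arg max UCB(x)
        y = x @ theta_star + 0.1 * rng.standard_normal()  # E[y_n | x_n] = x_n^T theta*
        A += np.outer(x, x)
        # one GD step on the squared error, instead of recomputing theta_hat
        theta += step * (y - theta @ x) * x
    return theta
```

The single O(d) GD step per round is what replaces the per-round least-squares solve of vanilla LinUCB.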

slide-153
SLIDE 153

Non-strongly convex bandits

Adaptive regularization

Problem: In many settings, λ_min( (1/n) Σ_{i=1}^{n−1} x_i x_iᵀ ) ≥ µ may not hold.

Solution: Adaptively regularize with λ_n:

  θ̃_n := arg min_θ { (1/2n) Σ_{i=1}^{n} (y_i − θᵀx_i)² + λ_n ‖θ‖² }.

To track θ̃_n, combine random sampling with a GD step: pick i_n uniformly at random in {1, . . . , n}, then update θ_n using the sample (x_{i_n}, y_{i_n}).

GD update: θ_n = θ_{n−1} + γ_n ( (y_{i_n} − θ_{n−1}ᵀ x_{i_n}) x_{i_n} − λ_n θ_{n−1} ).

Prashanth L A Convergence rate of TD(0) March 27, 2015 76 / 84
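The randomly-sampled GD update above can be sketched directly. Assumptions: the schedules γ_n = n^{−α} with λ_n = n^{−(1−α)} and α = 0.5 are illustrative choices, and the random index i_n is drawn from a fixed batch rather than the growing stream of the bandit setting.

```python
import numpy as np

def adapt_reg_gd(X, Y, n_steps=5000, alpha=0.5, seed=0):
    """Sketch of the randomly-sampled GD update with adaptive regularization.

    Assumed schedules: gamma_n = n^{-alpha}, lambda_n = n^{-(1-alpha)}.
    i_n is sampled from a fixed batch here, for simplicity.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    theta = np.zeros(d)
    for n in range(1, n_steps + 1):
        gamma_n = n ** (-alpha)
        lam_n = n ** (-(1.0 - alpha))
        i = rng.integers(N)               # pick i_n uniformly at random
        x, y = X[i], Y[i]
        # theta_n = theta_{n-1} + gamma_n((y - theta^T x) x - lambda_n theta)
        theta = theta + gamma_n * ((y - theta @ x) * x - lam_n * theta)
    return theta
```

Since λ_n → 0, the regularization bias vanishes and the iterate approaches the least-squares solution.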

slide-157
SLIDE 157

Non-strongly convex bandits

Why is deriving error bounds “really” difficult here?

θ_n − θ̃_n = Π_n(θ_0 − θ∗)   [Initial Error]
      − Σ_{k=1}^{n} Π_n Π_k⁻¹ (θ̃_k − θ̃_{k−1})   [Drift Error]
      + Σ_{k=1}^{n} γ_k Π_n Π_k⁻¹ ΔM̃_k   [Sampling Error].   (3)

Need Σ_{k=1}^{n} γ_k λ_k → ∞ to bound the initial error, so set γ_n = O(n^{−α}) (forcing λ_n = Ω(n^{−(1−α)})).

Bad news: this choice, when plugged into (3), results in only a constant error bound!

Note: Π_n := ∏_{k=1}^{n} ( I − γ_k(Ā_k + λ_k I) ), and θ̃_{n−1} − θ̃_n = Ω(n⁻¹) whenever α ∈ (0, 1).

Prashanth L A Convergence rate of TD(0) March 27, 2015 77 / 84
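The step-size tension can be checked numerically: with γ_k = k^{−α} and λ_k = k^{−(1−α)}, the product γ_k λ_k is exactly 1/k, so the required sum is the harmonic sum, which diverges (but only logarithmically). The choice α = 0.6 below is an arbitrary illustration.

```python
import numpy as np

# With gamma_k = k^{-alpha} and lambda_k = k^{-(1-alpha)}, the product
# gamma_k * lambda_k equals 1/k, so sum_{k<=n} gamma_k lambda_k is the
# harmonic sum: it grows like log(n) and hence diverges, as required for
# bounding the initial error.
def step_sum(n, alpha=0.6):
    k = np.arange(1, n + 1)
    return float(np.sum(k ** (-alpha) * k ** (-(1.0 - alpha))))

print(step_sum(10**3))  # harmonic number H_1000, approx 7.49
print(step_sum(10**6))  # approx 14.39: still growing, without bound
```

The slow (logarithmic) divergence is precisely why the initial error is hard to kill while keeping the drift and sampling errors small.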

slide-161
SLIDE 161

News recommendation application

Dilbert’s boss on news recommendation (and ML)

Prashanth L A Convergence rate of TD(0) March 27, 2015 78 / 84

slide-162
SLIDE 162

News recommendation application

Preliminary Results on Complacs News Feed Platform

Figure: Cumulative reward vs. iteration on the Complacs news feed platform, comparing LinUCB and LinUCB-GD.

Prashanth L A Convergence rate of TD(0) March 27, 2015 79 / 84

slide-163
SLIDE 163

News recommendation application

Experiments on Yahoo! Dataset 1

Figure: The Featured tab in Yahoo! Today module

1Yahoo User-Click Log Dataset given under the Webscope program (2011) Prashanth L A Convergence rate of TD(0) March 27, 2015 80 / 84

slide-164
SLIDE 164

News recommendation application

Tracking Error

Figure: Tracking error ‖θ_n − θ̃_n‖ vs. iteration n, for flinUCB-GD (SGD), flinUCB-SVRG (SVRG¹) and flinUCB-SAG (SAG²).

  1 Johnson, R. and Zhang, T. (2013), “Accelerating stochastic gradient descent using predictive variance reduction”. NIPS.
  2 Le Roux, N., Schmidt, M. and Bach, F. (2012), “A stochastic gradient method with an exponential convergence rate for finite training sets”. arXiv:1202.6258.

Prashanth L A Convergence rate of TD(0) March 27, 2015 81 / 84

slide-165
SLIDE 165

News recommendation application

Runtime Performance on two days of the Yahoo! dataset

Runtime (ms):

  Algorithm      Day-2        Day-4
  LinUCB         1.37 · 10⁶   1.72 · 10⁶
  fLinUCB-GD     4,933        6,474
  fLinUCB-SVRG   81,818       1.07 · 10⁵
  fLinUCB-SAG    44,504       55,630

Prashanth L A Convergence rate of TD(0) March 27, 2015 82 / 84

slide-166
SLIDE 166

For Further Reading

For Further Reading I

  • Nathaniel Korda and Prashanth L.A., On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence. arXiv:1411.3224, 2014.
  • Prashanth L.A., Nathaniel Korda and Rémi Munos, Fast LSTD using stochastic approximation: Finite time analysis and application to traffic control. ECML, 2014.
  • Nathaniel Korda, Prashanth L.A. and Rémi Munos, Fast gradient descent for least squares regression: Non-asymptotic bounds and application to bandits. AAAI, 2015.

Prashanth L A Convergence rate of TD(0) March 27, 2015 83 / 84

slide-167
SLIDE 167

For Further Reading

Dilbert’s boss (again) on big data!

Prashanth L A Convergence rate of TD(0) March 27, 2015 84 / 84