SLIDE 1 Q-LEARNING WITHOUT STOCHASTIC APPROXIMATION
Vivek S. Borkar, IIT Bombay∗†
- Mar. 23, 2015, IIT, Chennai
∗Joint work with Dileep Kalathil (Uni. of California,
Berkeley), Rahul Jain (Uni. of Southern California)
†Work supported in part by the Department of Science
and Technology
SLIDE 2 OUTLINE
- 1. Markov Decision Processes (Discounted cost)
- 2. Value/Q-value iteration algorithms
- 3. Classical Q-learning
- 4. Main results
SLIDE 3 {Xn, n ≥ 0} a controlled Markov chain with:
- a finite state space S = {1, 2, · · · , s},
- a finite action space A = {a1, · · · , ad},
- an A-valued control process {Zn, n ≥ 0},
SLIDE 4
- a controlled transition probability function p(j|i, u), i, j ∈ S, u ∈ A, such that
P(Xn+1 = i | Xm, Zm, m ≤ n) = p(i|Xn, Zn) ∀n,
i.e., the probability of going from Xn = j (say) to i under action Zn = u (say) is p(i|j, u).
SLIDE 5 Say that {Zn} is:
- admissible if the above holds,
- randomized stationary Markov if
P(Zn = u|Fn−1, Xn = x) = (ϕ(x))(u) ∀n for some ϕ : S → P(A),
- stationary Markov if Zn = v(Xn) ∀n for some
v : S → A.
SLIDE 6 With abuse of terminology, the last two are identified with ϕ, v resp. Objective: Minimize the discounted cost
Ji({Zn}) := E[ Σ_{m=0}^{∞} β^m c(Xm, Zm) | X0 = i ],
where
- c : S × A → R is a prescribed ‘running cost’ function,
- β ∈ (0, 1) is the discount factor.
SLIDE 7 Dynamic Programming
Define the ‘value function’ V : S → R by
V(i) = inf_{Zn} Ji({Zn}).
Then by the ‘dynamic programming principle’,
V(i) = min_u [ c(i, u) + β Σ_j p(j|i, u) V(j) ], i ∈ S.
This is the associated dynamic programming equation. Furthermore, if the minimum on the right is attained at u = v∗(i), then the stationary Markov policy v∗(·) is optimal. The converse also holds.
SLIDE 8 The DP equation is a fixed point equation: V = F(V) for F(x) = [F1(x), · · · , Fs(x)]^T, where
Fi(x) := min_u [ c(i, u) + β Σ_j p(j|i, u) xj ].
Then ‖F(x) − F(y)‖∞ ≤ β‖x − y‖∞, i.e., F is a ‖·‖∞-contraction ⇒ V is the unique solution to the DP equation, and the ‘value iteration scheme’
V^{n+1}(i) = min_u [ c(i, u) + β Σ_j p(j|i, u) V^n(j) ], n ≥ 0,
converges exponentially to V.
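A minimal sketch of this value iteration scheme in Python (the arrays c, p, the discount beta, and the function name are illustrative placeholders, not from the talk):

    import numpy as np

    def value_iteration(c, p, beta, n_iters=200):
        # c: (s, d) array with c[i, u] = running cost c(i, u).
        # p: (s, d, s) array with p[i, u, j] = p(j|i, u).
        # beta: discount factor in (0, 1).
        s, d = c.shape
        V = np.zeros(s)
        for _ in range(n_iters):
            # V_{n+1}(i) = min_u [ c(i,u) + beta * sum_j p(j|i,u) V_n(j) ]
            V = np.min(c + beta * (p @ V), axis=1)
        return V

Since F is a β-contraction in ‖·‖∞, the error after n sweeps is at most β^n times the initial error.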
SLIDE 9 Other schemes: policy iteration, linear programming (primal/dual) Problematic if:
- (i) p(·|·, ·) unknown, or,
- (ii) p(·|·, ·) known, but too complex (e.g., extremely
large state space).
SLIDE 10
Sometimes simulation of the system is ‘easy’, e.g., when the system is composed of a large number of interconnected simple components whose individual transitions are easy to simulate (e.g., queuing networks, robots). This has motivated simulation based schemes for approximate dynamic programming, based on stochastic approximation versions of classical iterative schemes. (‘reinforcement learning’, ‘approximate dynamic programming’, ‘neurodynamic programming’)
SLIDE 11 Q-learning: a simulation based scheme for approximate dynamic programming due to C. J. C. H. Watkins (1992). Define Q-values
Q(i, u) := c(i, u) + β Σ_j p(j|i, u) V(j), i ∈ S, u ∈ A.
Then V(i) = min_u Q(i, u), and
Q(i, u) = c(i, u) + β Σ_j p(j|i, u) min_a Q(j, a).
This is the ‘DP equation’ for Q-values.
SLIDE 12 Again, the last equation is of the form Q = G(Q), where ‖G(x) − G(y)‖∞ ≤ β‖x − y‖∞. Thus we have the ‘Q-value iteration’
Q^{n+1}(i, u) = c(i, u) + β Σ_j p(j|i, u) min_a Q^n(j, a), n ≥ 0.
Then Q^n → the unique solution to the Q-DP equation. Furthermore, v∗(i) ∈ Argmin Q(i, ·), i ∈ S, yields an optimal stationary Markov policy v∗.
Note V^n ∈ R^s, Q^n ∈ R^{s×d} ⇒ no motivation to do Q-value iteration.
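For comparison, a sketch of Q-value iteration in the same hypothetical setup; note the minimization now sits inside the averaging over j:

    import numpy as np

    def q_value_iteration(c, p, beta, n_iters=200):
        # Q_{n+1}(i,u) = c(i,u) + beta * sum_j p(j|i,u) * min_a Q_n(j,a)
        s, d = c.shape
        Q = np.zeros((s, d))
        for _ in range(n_iters):
            Q = c + beta * (p @ Q.min(axis=1))
        v_star = Q.argmin(axis=1)   # v*(i) in Argmin_u Q(i, .)
        return Q, v_star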
SLIDE 13
However, one big change from value iteration: the nonlinearity (minimization over A) is now inside the averaging ⇒ can use an incremental method based on stochastic approximation.
Advantage: can be based upon simulation, low computation per iterate.
Disadvantage: slow convergence.
SLIDE 14 Stochastic Approximation
Robbins-Monro scheme: x(n + 1) = x(n) + a(n)[h(x(n)) + M(n + 1)]. Here, for Fn := σ(x(0), M(k), k ≤ n) (i.e., the ‘history till time n’),
- a(n) > 0 with Σ_n a(n) = ∞, Σ_n a(n)² < ∞, and,
- {M(n)} a martingale difference sequence:
E[M(n + 1)|Fn] = 0 ∀n.
SLIDE 15
Need: h Lipschitz and E[‖M(n + 1)‖²|Fn] ≤ K(1 + ‖x(n)‖²). Typically,
x(n + 1) = x(n) + a(n) f(x(n), ξ(n + 1)),
with {ξ(n)} IID. Then set h(x) = E[f(x, ξ(n))], M(n + 1) = f(x(n), ξ(n + 1)) − h(x(n)).
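A toy Robbins-Monro run (a hypothetical example, with f(x, ξ) = θ − x + ξ, so that h(x) = θ − x and the scheme tracks the root x∗ = θ):

    import numpy as np

    rng = np.random.default_rng(0)
    theta = 2.0                 # unknown root x* of h(x) = theta - x
    x = 0.0
    for n in range(1, 10001):
        a_n = 1.0 / n           # a(n) > 0, sum a(n) = inf, sum a(n)^2 < inf
        xi = rng.normal()       # IID noise; M(n+1) = f(x(n), xi) - h(x(n)) = xi
        x += a_n * (theta - x + xi)
    print(x)                    # close to theta = 2.0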
SLIDE 16
‘ODE’ approach (Derevitskii-Fradkov, Ljung): Treat the iteration as a noisy discretization of the ODE
ẋ(t) = h(x(t)).
If this has x∗ as its unique asymptotically stable equilibrium, then
sup_n ‖x(n)‖ < ∞ ⇒ x(n) → x∗ a.s.
(The LHS needs separate ‘stability’ tests.)
SLIDE 17 Idea of proof:
Treat the iteration as a noisy discretization of the ODE. Specifically,
- define the interpolated trajectory x̄(t), t ≥ 0, by x̄(Σ_{m=0}^{n−1} a(m)) := x(n), with linear interpolation in between,
- compare x̄(s), t ≤ s ≤ t + T, with the ODE trajectory on the same time interval with the same initial condition,
SLIDE 18
- Gronwall inequality yields a bound in terms of the discretization error and the error due to noise,
- verify that these errors go to zero asymptotically (the latter follows by martingale arguments, using square-summability of {a(n)}),
- use either a Liapunov function argument (when available) or a characterization of the limit set (Benaim) to conclude.
SLIDE 19 Synchronous Q-learning:
- 1. Replace the conditional average Σ_j p(j|i, u) min_a Q^n(j, a) by evaluation at an actual simulated sample: min_a Q^n(ξ_{i,u}(n + 1), a), where ξ_{i,u}(n + 1) ≈ p(·|i, u).
- 2. Replace the ‘full move’ by an incremental move, i.e., a convex combination of the previous iterate and the correction term due to the new observation.
SLIDE 20
The algorithm is:
Q^{n+1}(i, u) = (1 − a(n))Q^n(i, u) + a(n)[c(i, u) + β min_{u′} Q^n(ξ_{i,u}(n + 1), u′)]
= Q^n(i, u) + a(n)[c(i, u) + β min_{u′} Q^n(ξ_{i,u}(n + 1), u′) − Q^n(i, u)].
The limiting ODE ẋ(t) = G(x(t)) − x(t) has the desired Q as its globally asymptotically stable equilibrium (‖x − Q‖∞ works as a Liapunov function) ⇒ a.s. convergence to Q (stability is separately proved).
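A sketch of the synchronous scheme, assuming a hypothetical simulator sample(i, u) that draws a next state ≈ p(·|i, u):

    import numpy as np

    def synchronous_q_learning(c, sample, beta, n_iters=10000):
        s, d = c.shape
        Q = np.zeros((s, d))
        for n in range(1, n_iters + 1):
            a_n = 1.0 / n
            Q_new = np.empty_like(Q)
            for i in range(s):
                for u in range(d):
                    j = sample(i, u)   # simulated sample xi_{i,u}(n+1) ~ p(.|i,u)
                    Q_new[i, u] = (1 - a_n) * Q[i, u] + a_n * (c[i, u] + beta * Q[j].min())
            Q = Q_new   # all (i, u) pairs updated simultaneously
        return Q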
SLIDE 21
Asynchronous version (single simulation case):
Q^{n+1}(i, u) = Q^n(i, u) + a(n)I{Xn = i, Zn = u} × [c(i, u) + β min_{u′} Q^n(Xn+1, u′) − Q^n(i, u)].
Limiting ODE: ẋ(t) = Λ(t)(G(x(t)) − x(t)), Λ(·) diagonal, non-negative (‘relative frequency’ of visits). Convergence to Q if the diagonal elements of Λ(·) are bounded away from zero ⇐⇒ all pairs (i, u) are sampled comparably often. (‘infinitely often’ suffices (Yu-Bertsekas))
Problem: slow!
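A sketch of the asynchronous version along a single simulated trajectory, with a hypothetical simulator step(i, u) (draws X_{n+1} ≈ p(·|i, u)) and an exploring policy behavior(i); a per-pair step size is used here, a common variant:

    import numpy as np

    def asynchronous_q_learning(c, step, behavior, beta, n_iters=100000):
        s, d = c.shape
        Q = np.zeros((s, d))
        visits = np.zeros((s, d), dtype=int)
        x = 0                             # arbitrary initial state
        for n in range(n_iters):
            u = behavior(x)               # must sample all (i, u) comparably often
            x_next = step(x, u)           # X_{n+1} ~ p(.|x, u)
            visits[x, u] += 1
            a_n = 1.0 / visits[x, u]      # local step size for the visited pair
            Q[x, u] += a_n * (c[x, u] + beta * Q[x_next].min() - Q[x, u])
            x = x_next
        return Q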
SLIDE 22 Non-incremental Q-learning
Fix N := number of samples per stage. The algorithm is:
Q^{n+1}(i, u) = c(i, u) + β (1/N) Σ_{m=1}^{N} min_a Q^n(ξ^m_{i,u}(n + 1), a),
where:
- {ξ^m_{i,u}(n)} are IID ≈ p(·|i, u) for each (i, u), and,
- {ξ^m_{i,u}(n)}_{i,u,m,n} are independent.
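A sketch of the non-incremental scheme with the same hypothetical simulator sample(i, u); each sweep replaces the conditional average by an empirical average over N fresh samples:

    import numpy as np

    def non_incremental_q_learning(c, sample, beta, N=20, n_iters=200):
        # Q_{n+1}(i,u) = c(i,u) + beta * (1/N) * sum_m min_a Q_n(xi^m_{i,u}, a)
        s, d = c.shape
        Q = np.zeros((s, d))
        for _ in range(n_iters):
            Q_new = np.empty_like(Q)
            for i in range(s):
                for u in range(d):
                    js = [sample(i, u) for _ in range(N)]   # xi^m_{i,u}, m = 1..N, IID
                    Q_new[i, u] = c[i, u] + beta * np.mean([Q[j].min() for j in js])
            Q = Q_new
        return Q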
SLIDE 23 This is equivalent to
Q^{n+1}(i, u) = c(i, u) + β Σ_j p̃^{(n)}(j|i, u) min_a Q^n(j, a),
where p̃^{(n)}(·|i, u) are the empirical transition probabilities given by
p̃^{(n)}(j|i, u) := (1/N) Σ_{m=1}^{N} I{ξ^m_{i,u}(n + 1) = j}.
For a fixed sample run, we can view this as ‘quenched’ randomness, leading to a time-dependent sequence of transition matrices.
SLIDE 24
Claim: Q^n → Q a.s.! Empirical observation: convergence is extremely fast initially to a ‘ballpark’ estimate, then very slow ⇒ one can consider hybrid schemes where one switches to stochastic approximation after the initial period.
SLIDE 25 Idea of proof
Consider a controlled Markov chain {Xn} governed by the time-inhomogeneous transition probabilities p̃^{(n)}(j|i, u), n ≥ 0. V^n in value iteration (always) has the interpretation of being the optimal finite-horizon cost with ‘terminal cost’ V^0, i.e.,
V^n(i) = min_{Zn} E[ Σ_{m=0}^{n−1} β^m c(Xm, Zm) + β^n V^0(Xn) | X0 = i ].
SLIDE 26 Thus
V^n(i) = E[ Σ_{m=0}^{n−1} β^m c(X∗_m, v∗(m, X∗_m)) + β^n V^0(X∗_n) | X∗_0 = i ],
where (X∗_n, v∗(n, X∗_n)) is the optimal state-control process, defined consistently because the function v∗(n, ·) depends on the remaining time horizon. Similarly,
Q^n(i, u) = E[ Σ_{m=0}^{n−1} β^m c(X∗_m, Z∗_m) + β^n min_a Q^0(X∗_n, a) | X∗_0 = i ],
where Z∗_0 = u and Z∗_n = v∗(n, X∗_n) thereafter.
SLIDE 27 Consider the time-reversed version of this:
Q^n(i, u) = E[ Σ_{m=−n}^{−1} β^{n+m} c(X∗_m, Z∗_m) + β^n min_a Q^0(X∗_0, a) | X∗_{−n} = i ].
For each i, u, −n, generate a chain from i, u. Consider the iterates Q^m, Q̌^m, m ≥ −n, initiated at (i, u), (i′, u′) resp., and the associated state-control processes (X∗_n, Z∗_n), (X̂∗_n, Ẑ∗_n).
Fact: As n ↑ ∞, X∗_n, X̂∗_n couple a.s.
(Needs a suitable irreducibility & aperiodicity hypothesis.)
SLIDE 28 Fact: As n ↑ ∞, X∗_n, X̂∗_n couple a.s.
(Recall the Propp-Wilson scheme for exact sampling according to the stationary distribution of a Markov chain through backward coupling.)
⇒ Q^n(i, u) − Q̌^n(i′, u′) converges a.s. But
Q^n(i, u) = c(i, u) + β Σ_j p̃^{(n)}(j|i, u)(min_a Q^n(j, a) − Q̌^n(i, u)) + β Q̌^n(i, u)
SLIDE 29 Iterating, one gets
Q^n(i, u) = c(i, u) + β Σ_j p̃^{(n)}(j|i, u)(min_a (Q^n(j, a) − Q̌^n(i, u)))
+ β(c(i, u) + β Σ_j p̃^{(n−1)}(j|i, u)(min_a (Q^{n−1}(j, a) − Q̌^{n−1}(i, u)))) + · · ·
= c(i, u) Σ_{m=0}^{n} β^m + β Σ_{m=0}^{n} β^m ( Σ_j p̃^{(n−m)}(j|i, u)(min_a (Q^{n−m}(j, a) − Q̌^{n−m}(i, u))) ) + β^{n+1} Q^0(i, u).
By the coupling argument, the second term on the right converges a.s. Hence Q^n(i, u) → Q∗(i, u) a.s.
SLIDE 30 Blackwell-Dubins lemma: {Yn} bounded, Yn → Y a.s., {Fn} nested and either ↑ or ↓ F. Then a.s., E[Yn|Fn] → E[Y|F].
Thus Q^{n+1} − Q^n → 0 a.s.
⇒ E[Q^{n+1}|Fn] − Q^n → 0 a.s.
⇒ E[c(i, u) + β Σ_j p̃^{(n)}(j|i, u) min_b Q^n(j, b) | Fn] − Q^n(i, u) → 0 a.s.
SLIDE 31
⇒ c(i, u) + β Σ_j p(j|i, u) min_b Q^n(j, b) − Q^n(i, u) → 0 a.s.
⇒ Q∗ satisfies the DP equation ⇒ Q∗ = Q.
Trade-off: larger N ⇒ faster convergence and smaller fluctuations, but higher computation per iterate.
SLIDE 32 Future work:
- asynchronous version
- sample complexity
(some progress achieved in both)
SLIDE 33
- more general state spaces
- function approximation
SLIDE 34 “With every mistake we must surely be learning, still my guitar gently weeps.”