Chapter 12. Dynamic Programming
Neural Networks and Learning Machines (Haykin)
Lecture Notes on
Self-learning Neural Algorithms
Byoung-Tak Zhang, School of Computer Science and Engineering, Seoul National University
Version: 20171011
Contents
12.1 Introduction
12.2 Markov Decision Process
12.3 Bellman's Optimality Criterion
12.4 Policy Iteration
12.5 Value Iteration
12.6 Approximate DP: Direct Methods
12.7 Temporal Difference Learning
12.8 Q-Learning
12.9 Approximate DP: Indirect Methods
12.10 Least Squares Policy Evaluation
12.11 Approximate Policy Iteration
Summary and Discussion
1. Learning with a teacher: supervised learning
2. Learning without a teacher: unsupervised learning, reinforcement learning, semi-supervised learning

Reinforcement learning:
1. Behavioral learning (action, sequential decision making)
2. Interaction between an agent and its environment
3. Achieving a specific goal under uncertainty

Two approaches to reinforcement learning:
1. Classical approach: punishment and reward (classical conditioning), highly skilled behavior
2. Modern approach: dynamic programming, planning
Dynamic programming (DP) deals with situations in which decisions are made in stages, with the outcome of each decision being predictable to some extent before the next decision is made. Decisions cannot be made in isolation: the desire for a low cost at present must be balanced against the undesirability of a high cost in the future (the credit-assignment problem).
Key question: How can an agent improve its long-term performance in a stochastic environment when the attainment of this improvement may require having to sacrifice short-term performance?
Right balance between
(theoretical)
Markov decision process (MDP):
The state of the environment is a summary of the entire past experience of an agent gained from its interaction with the environment, such that the information necessary for the agent to predict the future behavior of the environment is contained in that summary.
Figure 12.1 An agent interacting with its environment.
MDP = the sequence of states $\{X_n,\ n = 0, 1, 2, \dots\}$, i.e., a Markov chain with transition probabilities $p_{ij}(\mu(i))$ for actions $\mu(i)$.
States: $i, j \in X$
Actions: $A_i = \{a_{ik}\}$, the set of admissible actions in state $i$
Policy: $\pi = \{\mu_0, \mu_1, \dots\}$, a mapping from states $X$ to actions $A$, with $\mu_n(i) \in A_i$ for all states $i$
  Nonstationary policy: $\pi = \{\mu_0, \mu_1, \dots\}$
  Stationary policy: $\pi = \{\mu, \mu, \dots\}$
Transition probability: $p_{ij}(a) = P(X_{n+1} = j \mid X_n = i, A_n = a)$
  1. $p_{ij}(a) \ge 0$ for all $i$ and $j$
  2. $\sum_j p_{ij}(a) = 1$ for all $i$
Cost function: $g(i, a_{ik}, j)$
Discount factor: $\gamma$; discounted cost: $\gamma^n g(i, a_{ik}, j)$
Figure 12.2 Illustration of two possible transitions: The transition from state to state is probabilistic, but the transition from state to is deterministic.
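To make the notation concrete, the following minimal Python sketch lays out the ingredients of a small finite MDP in the form used throughout this chapter. All names and numerical values (two states, two actions, the transition probabilities, and the costs) are invented placeholders, not taken from the slides.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP used only to illustrate the notation.
# States i, j in {0, 1}; actions a in A_i = {0, 1} for every state.
N = 2
gamma = 0.9                       # discount factor

# p[a, i, j] = P(X_{n+1} = j | X_n = i, A_n = a); each row sums to 1
p = np.array([[[0.9, 0.1],
               [0.4, 0.6]],
              [[0.2, 0.8],
               [0.7, 0.3]]])

# g[a, i, j] = cost incurred on the transition i -> j under action a
g = np.array([[[1.0, 2.0],
               [0.5, 1.5]],
              [[2.0, 0.5],
               [1.0, 3.0]]])

# Sanity check: every p[a, i, :] is a probability distribution over next states
assert np.allclose(p.sum(axis=2), 1.0)

# Immediate expected cost c(i, a) = sum_j p_ij(a) g(i, a, j)
c = (p * g).sum(axis=2)           # shape (actions, states)
print("c(i, a):\n", c.T)
```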
Dynamic programming (DP) problem
Cost-to-go function (total expected cost):
$$J^\pi(i) = E\left[\sum_{n=0}^{\infty} \gamma^n\, g(X_n, \mu_n(X_n), X_{n+1}) \,\Big|\, X_0 = i\right]$$
where $g(X_n, \mu_n(X_n), X_{n+1})$ is the observed cost.
Optimal value: $J^*(i) = \min_\pi J^\pi(i)$ (for a stationary policy $\pi = \{\mu, \mu, \dots\}$ we write $J^\mu(i)$; for an optimal stationary policy, $J^\mu(i) = J^*(i)$).
Basic problem in DP: Given a stationary MDP, find a stationary policy $\mu$ that minimizes the cost-to-go function $J^\mu(i)$ for all initial states $i$.
Notation: cost function $J(i)$ ↔ value function $V(s)$; cost $g(\cdot)$ ↔ reward $r(\cdot)$.
Principle of optimality
An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy starting from the state resulting from the first decision.
Consider a finite-horizon problem for which the cost-to-go function is
$$J_0(X_0) = E\left[g_K(X_K) + \sum_{n=0}^{K-1} g_n(X_n, \mu_n(X_n), X_{n+1})\right].$$
Suppose we wish to minimize the cost-to-go function
$$J_n(X_n) = E\left[g_K(X_K) + \sum_{k=n}^{K-1} g_k(X_k, \mu_k(X_k), X_{k+1})\right].$$
Then the truncated policy $\{\mu_n^*, \mu_{n+1}^*, \dots, \mu_{K-1}^*\}$ is optimal for the subproblem.
Dynamic programming algorithm
For every initial state $X_0$, the optimal cost $J^*(X_0)$ of the basic finite-horizon problem is equal to $J_0(X_0)$, where the function $J_0$ is obtained from the last step of the algorithm
$$J_n(X_n) = \min_{\mu_n}\, E_{X_{n+1}}\big[g_n(X_n, \mu_n(X_n), X_{n+1}) + J_{n+1}(X_{n+1})\big] \qquad (12.13)$$
which runs backward in time, with
$$J_K(X_K) = g_K(X_K).$$
Furthermore, if $\mu_n^*$ minimizes the right-hand side of Eq. (12.13) for each $X_n$ and $n$, then the policy $\pi^* = \{\mu_0^*, \mu_1^*, \dots, \mu_{K-1}^*\}$ is optimal.
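As a complement to the statement above, here is a minimal sketch of the backward recursion in Eq. (12.13) for a finite-horizon problem. The state-space size, horizon, transition probabilities, and stage costs are random placeholders generated only for illustration.

```python
import numpy as np

# Backward recursion J_n(x) = min_a E[ g_n(x, a, X_{n+1}) + J_{n+1}(X_{n+1}) ],
# run from n = K-1 down to n = 0, starting from the terminal cost J_K = g_K.
N_states, K = 3, 5
rng = np.random.default_rng(0)
p = {a: rng.dirichlet(np.ones(N_states), size=N_states) for a in range(2)}  # p[a][i, j]
g = {a: rng.uniform(0.0, 2.0, size=(N_states, N_states)) for a in range(2)} # g[a][i, j]
g_K = np.zeros(N_states)            # terminal cost g_K(X_K)

J = g_K.copy()                      # J_K
policy = np.zeros((K, N_states), dtype=int)
for n in reversed(range(K)):        # n = K-1, ..., 0
    # expected one-stage cost plus cost-to-go, for every (state, action)
    Q = np.stack([(p[a] * (g[a] + J[None, :])).sum(axis=1) for a in range(2)], axis=1)
    policy[n] = Q.argmin(axis=1)    # mu_n*
    J = Q.min(axis=1)               # J_n
print("J_0 =", J)
```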
Bellman's optimality equation
$$J^*(i) = \min_{\mu}\left(c(i,\mu(i)) + \gamma\sum_{j=1}^{N} p_{ij}(\mu)\,J^*(j)\right), \qquad i = 1, 2, \dots, N$$
Immediate expected cost:
$$c(i,\mu(i)) = E_{X_1}\big[g(i,\mu(i),X_1)\big] = \sum_{j=1}^{N} p_{ij}\,g(i,\mu(i),j)$$
Two methods for computing an optimal policy:
- Policy iteration
- Value iteration
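One way to solve Bellman's optimality equation numerically is successive approximation (value iteration), sketched below for an invented random MDP; the containers p[a, i, j] and g[a, i, j], the state/action counts, and the tolerance are assumptions for illustration only.

```python
import numpy as np

# Value iteration: J(i) <- min_a [ c(i, a) + gamma * sum_j p_ij(a) J(j) ]
rng = np.random.default_rng(1)
N, num_actions, gamma = 4, 2, 0.9
p = rng.dirichlet(np.ones(N), size=(num_actions, N))   # p[a, i, j]
g = rng.uniform(0.0, 1.0, size=(num_actions, N, N))    # g[a, i, j]

J = np.zeros(N)
for _ in range(1000):
    # c(i, a) + gamma * sum_j p_ij(a) J(j), for every (action, state)
    Q = (p * (g + gamma * J[None, None, :])).sum(axis=2)   # shape (num_actions, N)
    J_new = Q.min(axis=0)
    if np.max(np.abs(J_new - J)) < 1e-10:
        J = J_new
        break
    J = J_new
greedy_policy = Q.argmin(axis=0)
print("J* ~", J, "greedy policy:", greedy_policy)
```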
Figure 12.3 Policy iteration algorithm.
1. Policy evaluation step: the cost-to-go function for the current policy and the corresponding Q-factors are computed for all states and actions.
2. Policy improvement step: the current policy is updated so as to be greedy with respect to the cost-to-go function computed in step 1.
$$J^{\mu_n}(i) = c(i,\mu_n(i)) + \gamma\sum_{j=1}^{N} p_{ij}(\mu_n(i))\,J^{\mu_n}(j), \qquad i = 1, 2, \dots, N$$
$$Q^{\mu_n}(i,a) = c(i,a) + \gamma\sum_{j=1}^{N} p_{ij}(a)\,J^{\mu_n}(j), \qquad a \in A_i \ \text{and} \ i = 1, 2, \dots, N$$
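A compact sketch of the two steps above: exact policy evaluation by solving the linear Bellman system for $J^{\mu_n}$, followed by greedy improvement via the Q-factors. The random MDP below is an invented placeholder.

```python
import numpy as np

# Policy iteration: evaluate the current policy exactly, then improve it greedily.
rng = np.random.default_rng(2)
N, num_actions, gamma = 4, 2, 0.9
p = rng.dirichlet(np.ones(N), size=(num_actions, N))   # p[a, i, j]
g = rng.uniform(0.0, 1.0, size=(num_actions, N, N))    # g[a, i, j]
c = (p * g).sum(axis=2)                                 # c[a, i]: immediate expected cost

mu = np.zeros(N, dtype=int)                             # initial stationary policy
for _ in range(100):                                    # converges in a finite number of steps
    # 1. Policy evaluation: J = c_mu + gamma * P_mu J  =>  (I - gamma P_mu) J = c_mu
    P_mu = p[mu, np.arange(N), :]                       # (N, N)
    c_mu = c[mu, np.arange(N)]                          # (N,)
    J = np.linalg.solve(np.eye(N) - gamma * P_mu, c_mu)
    # 2. Policy improvement: greedy with respect to the Q-factors
    Q = c + gamma * (p @ J)                             # Q[a, i]
    mu_new = Q.argmin(axis=0)
    if np.array_equal(mu_new, mu):
        break
    mu = mu_new
print("policy:", mu, "cost-to-go:", J)
```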
Figure 12.4 Illustrative backup diagrams for (a) policy iteration and (b) value iteration.
Figure 12.5 Flow graph for stagecoach problem.
Figure 12.6 Steps involved in calculating the Q-factors for the stagecoach problem.
Sequence of states: $\{i_n\}_{n=0}^{N}$
Cost-to-go function for the Bellman equation:
$$J^\mu(i_n) = E\big[g(i_n, i_{n+1}) + J^\mu(i_{n+1})\big], \qquad n = 0, 1, \dots, N-1$$
Applying the Robbins–Monro stochastic approximation
$$r^+ = (1-\eta)\,r + \eta\,g(r, v),$$
we have
$$J^+(i_n) = (1-\eta)\,J(i_n) + \eta\big[g(i_n, i_{n+1}) + J(i_{n+1})\big] = J(i_n) + \eta\big[g(i_n, i_{n+1}) + J(i_{n+1}) - J(i_n)\big].$$
Temporal difference (TD):
$$d_n = g(i_n, i_{n+1}) + J(i_{n+1}) - J(i_n), \qquad n = 0, 1, \dots, N-1$$
TD learning:
$$J^+(i_n) = J(i_n) + \eta\,d_n$$
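The TD update above can be simulated directly. The sketch below evaluates a fixed policy from a single long simulated trajectory; the chain, the costs, and the constant step size η are invented placeholders, and a discount factor is included for boundedness even though the derivation above is written without one.

```python
import numpy as np

# TD learning for policy evaluation: J(i_n) <- J(i_n) + eta * d_n, with
# d_n = g(i_n, i_{n+1}) + gamma * J(i_{n+1}) - J(i_n).
rng = np.random.default_rng(3)
N, gamma, eta = 5, 0.95, 0.05
P = rng.dirichlet(np.ones(N), size=N)         # p_ij under the fixed policy (placeholder)
g = rng.uniform(0.0, 1.0, size=(N, N))        # g(i, j) transition costs (placeholder)

J = np.zeros(N)
i = 0
for _ in range(100_000):
    j = rng.choice(N, p=P[i])                 # simulate one transition i -> j
    d = g[i, j] + gamma * J[j] - J[i]         # temporal difference d_n
    J[i] += eta * d                           # TD update
    i = j
print("TD estimate of J^mu:", J)
```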
Monte Carlo simulation algorithm
$$J^\mu(i_n) = E\left[\sum_{k=0}^{N-n-1} g(i_{n+k}, i_{n+k+1})\right], \qquad n = 0, 1, \dots, N-1$$
Robbins–Monro stochastic approximation:
$$\begin{aligned}
J^+(i_n) &= J(i_n) + \eta_n\left(\sum_{k=0}^{N-n-1} g(i_{n+k}, i_{n+k+1}) - J(i_n)\right) \\
&= J(i_n) + \eta_n\big[\,g(i_n, i_{n+1}) + J(i_{n+1}) - J(i_n) \\
&\qquad\qquad + g(i_{n+1}, i_{n+2}) + J(i_{n+2}) - J(i_{n+1}) \\
&\qquad\qquad + \cdots \\
&\qquad\qquad + g(i_{N-2}, i_{N-1}) + J(i_{N-1}) - J(i_{N-2}) \\
&\qquad\qquad + g(i_{N-1}, i_{N}) + J(i_{N}) - J(i_{N-1})\,\big]
\end{aligned}$$
Using the temporal difference $d_n$:
$$J^+(i_n) = J(i_n) + \eta_n \sum_{k=0}^{N-n-1} d_{n+k}$$
To see that this is an iterative implementation of Monte Carlo estimation, define the total cost of the trajectory $\{i_n, i_{n+1}, \dots, i_N\}$:
$$c(i_n) = \sum_{k=0}^{N-n-1} g(i_{n+k}, i_{n+k+1}), \qquad n = 0, \dots, N-1$$
Cost-to-go after visiting $i_n$, averaged over $T$ simulations:
$$J(i_n) = \frac{1}{T}\sum_{t=1}^{T} c_t(i_n)$$
Ensemble-averaged cost-to-go: $J^\mu(i_n) = E[c(i_n)]$ for all $n$.
Iterative formula:
$$J^+(i_n) = J(i_n) + \eta_n\big(c(i_n) - J(i_n)\big)$$
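For comparison, the Monte Carlo estimate can be sketched as follows: each visit to a state contributes the total remaining trajectory cost c(i_n) as a sample, applied through the iterative formula above. The chain, costs, trajectory length, and step size are invented placeholders.

```python
import numpy as np

# Monte Carlo policy evaluation: J(i_n) <- J(i_n) + eta * (c(i_n) - J(i_n)),
# where c(i_n) is the remaining cost along the simulated trajectory.
rng = np.random.default_rng(4)
N, N_steps, T, eta = 5, 20, 2000, 0.02
P = rng.dirichlet(np.ones(N), size=N)            # p_ij under the fixed policy
g = rng.uniform(0.0, 1.0, size=(N, N))           # g(i, j)

J = np.zeros(N)
for _ in range(T):
    # simulate one trajectory i_0, ..., i_N
    traj = [rng.integers(N)]
    for _ in range(N_steps):
        traj.append(rng.choice(N, p=P[traj[-1]]))
    # total remaining cost c(i_n) for each visited state (backward accumulation)
    remaining = 0.0
    costs = []
    for n in range(N_steps - 1, -1, -1):
        remaining += g[traj[n], traj[n + 1]]
        costs.append((traj[n], remaining))
    # iterative update with the sampled costs
    for i, c_i in costs:
        J[i] += eta * (c_i - J[i])
print("Monte Carlo estimate of J^mu:", J)
```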
Two-step version of Bellman's optimality equation:
$$Q^*(i,a) = \sum_{j=1}^{N} p_{ij}(a)\left(g(i,a,j) + \gamma \min_{b\in A_j} Q^*(j,b)\right) \quad \text{for all } (i,a)$$
The value iteration algorithm applied to the Q-factors:
$$Q_{n+1}(i,a) = \sum_{j=1}^{N} p_{ij}(a)\left(g(i,a,j) + \gamma \min_{b\in A_j} Q_n(j,b)\right) \quad \text{for all } (i,a)$$
Small step-size version:
$$Q_{n+1}(i,a) = (1-\eta)\,Q_n(i,a) + \eta \sum_{j=1}^{N} p_{ij}(a)\left(g(i,a,j) + \gamma \min_{b\in A_j} Q_n(j,b)\right) \quad \text{for all } (i,a)$$
Stochastic version based on a single sample (where $j$ is the next state observed after taking action $a$ in state $i$):
$$Q_{n+1}(i,a) = \big(1-\eta_n(i,a)\big)\,Q_n(i,a) + \eta_n(i,a)\big[g(i,a,j) + \gamma J_n(j)\big] \quad \text{for } (i,a) = (i_n, a_n),$$
$$J_n(j) = \min_{b\in A_j} Q_n(j,b), \qquad Q_{n+1}(i,a) = Q_n(i,a) \ \text{for all } (i,a) \neq (i_n, a_n).$$
Q-learning algorithm:
$$Q_{n+1}(i,a) = Q_n(i,a) + \eta_n(i,a)\Big[g(i,a,j) + \gamma \min_{b\in A_j} Q_n(j,b) - Q_n(i,a)\Big]$$
Is there any on-line procedure for learning an optimal control policy through experience that is gained solely from interaction with the environment? Q-learning is such a procedure.
Convergence theorem
Suppose that the learning-rate parameter $\eta_n(i,a)$ satisfies the conditions
$$\sum_{n=0}^{\infty} \eta_n(i,a) = \infty \quad \text{and} \quad \sum_{n=0}^{\infty} \eta_n^2(i,a) < \infty \quad \text{for all } (i,a).$$
Then the sequence of Q-factors $\{Q_n(i,a)\}$ generated by the Q-learning algorithm converges with probability 1 to the optimal value $Q^*(i,a)$ for all state–action pairs $(i,a)$ as the number of iterations $n$ approaches infinity, provided that all state–action pairs are visited infinitely often.
The Q-learning algorithm may be viewed in one of two equivalent ways.
Compromise between two conflicting objectives in reinforcement learning: exploration (needed to satisfy the Q-learning convergence theorem, which requires every state–action pair to be visited infinitely often) and exploitation (acting greedily with respect to the current Q-factors).
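A minimal Q-learning sketch follows, using an ε-greedy behavior policy as one simple way to keep visiting all state–action pairs (as the convergence theorem requires) while still exploiting the current Q-factors. The simulated MDP and the step-size schedule η_n(i,a) = 1/(number of visits) are assumptions made for illustration.

```python
import numpy as np

# Q-learning with epsilon-greedy exploration; costs are minimized, so "greedy" = argmin.
rng = np.random.default_rng(5)
N, num_actions, gamma, epsilon = 4, 2, 0.9, 0.1
p = rng.dirichlet(np.ones(N), size=(num_actions, N))   # p[a, i, j] (simulator only)
g = rng.uniform(0.0, 1.0, size=(num_actions, N, N))    # g[a, i, j]

Q = np.zeros((N, num_actions))
visits = np.zeros((N, num_actions))
i = 0
for _ in range(100_000):
    # epsilon-greedy action selection
    a = rng.integers(num_actions) if rng.random() < epsilon else int(Q[i].argmin())
    j = rng.choice(N, p=p[a, i])                        # simulated next state
    visits[i, a] += 1
    eta = 1.0 / visits[i, a]                            # eta_n(i,a): sum diverges, squared sum is finite
    # Q_{n+1}(i,a) = Q_n(i,a) + eta * [ g(i,a,j) + gamma * min_b Q_n(j,b) - Q_n(i,a) ]
    Q[i, a] += eta * (g[a, i, j] + gamma * Q[j].min() - Q[i, a])
    i = j
print("greedy policy from learned Q-factors:", Q.argmin(axis=1))
```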
Figure 12.7 The time slots pertaining to the auxiliary and original control processes.
Mixed nonstationary policy: switches between an auxiliary Markov process and the original Markov process controlled by a stationary greedy policy determined by Q-learning, with
$$n_k = m_{k-1} + L, \quad k = 1, 2, \dots, \ \text{and } m_0 = 1$$
$$m_k = n_k + kL, \quad k = 1, 2, \dots$$
Indirect approach to approximate DP:
- Rather than explicitly estimating the transition probabilities and the associated transition costs, use Monte Carlo simulation to generate one or more system trajectories, so as to approximate the cost-to-go function of a given policy, or even the optimal cost-to-go function, and then fit a parameterized approximating function $\tilde{J}(i,w)$ to the simulated costs.
- Thus, having abandoned the notion of optimality, we may capture the goal of the indirect approach to approximate dynamic programming in the following simple statement: Do as well as possible, and not more.
- Performance optimality is traded off for computational optimality. This strategy is precisely what the human brain does on a daily basis: given a difficult decision-making problem, the brain provides a suboptimal solution that is the "best" in terms of reliability and available resource allocation.
Goal of approximate DP: Find a function $\tilde{J}(i,w)$ that approximates the optimal cost-to-go function $J^*(i)$ for state $i$, such that the cost difference $J^*(i) - \tilde{J}(i,w)$ is minimized according to some statistical criterion.
Two basic questions:
Question 1: How do we choose the approximation function $\tilde{J}(i,w)$ in the first place?
Question 2: Having chosen an appropriate approximation function $\tilde{J}(i,w)$, how do we adapt the weight vector $w$ so as to provide the "best fit" to Bellman's equation of optimality?
Figure 12.8 Architectural layout of the linear approach to approximate dynamic programming.
Approximate DP
1. Linear approach:
$$\tilde{J}(i,w) = \sum_j \varphi_{ij}\, w_j = \phi_i^T w \quad \text{for all } i$$
2. Nonlinear approach:
   - Recurrent multilayer perceptrons (deep architectures)
   - Supervised training of a recurrent multilayer perceptron by a nonlinear sequential-state estimation algorithm that is derivative-free
$$J(i) \approx \tilde{J}(i,w) = \phi^T(i)\,w$$
Cost-to-go function of the policy under evaluation:
$$J^\mu(i) = E\left[\sum_{n=0}^{\infty} \gamma^n\, g(i_n, i_{n+1}) \,\Big|\, i_0 = i\right]$$
Assumption 1: The Markov chain has steady-state probabilities that are all positive,
$$\pi_j = \lim_{n\to\infty} \frac{1}{n}\sum_{k=1}^{n} P(X_k = j \mid X_0 = i) > 0 \quad \text{for all } i, j.$$
The implication of this assumption is that the Markov chain has a single recurrent class with no transient states.
Assumption 2: The rank of the feature matrix $\Phi$ (defined below) is $s$. The implication of this second assumption is that the columns of the feature matrix $\Phi$, and therefore the basis functions they represent, are linearly independent.
Perform value iteration within a lower-dimensional subspace spanned by a set of basis functions:
$$\Phi = \begin{bmatrix} \phi_1^T \\ \phi_2^T \\ \vdots \\ \phi_N^T \end{bmatrix}$$
where the $i$th row $\phi_i^T = \phi^T(i)$ is the feature vector of state $i$.
Bellman's equation for the policy under evaluation, in scalar and matrix form:
$$J(i) = \sum_{j=1}^{N} p_{ij}\big(g(i,j) + \gamma J(j)\big), \qquad i = 1, 2, \dots, N$$
$$\mathbf{g} = \begin{bmatrix} \sum_j p_{1j}\, g(1,j) \\ \sum_j p_{2j}\, g(2,j) \\ \vdots \\ \sum_j p_{Nj}\, g(N,j) \end{bmatrix}, \qquad
\mathbf{P} = \begin{bmatrix} p_{11} & \cdots & p_{1N} \\ \vdots & \ddots & \vdots \\ p_{N1} & \cdots & p_{NN} \end{bmatrix}, \qquad
\mathbf{J} = \begin{bmatrix} J(1) \\ J(2) \\ \vdots \\ J(N) \end{bmatrix} \approx \Phi\mathbf{w}$$
$$\mathbf{J} = \mathbf{g} + \gamma\mathbf{P}\mathbf{J}$$
Value iteration with the mapping $T(\mathbf{J}) = \mathbf{g} + \gamma\mathbf{P}\mathbf{J}$:
$$\mathbf{J}_{n+1} = T(\mathbf{J}_n)$$
Projected value iteration:
$$\Phi w_{n+1} = \Pi T(\Phi w_n), \qquad n = 0, 1, 2, \dots$$
where $\Pi$ denotes projection onto the subspace $S$.
Figure 12.9 Projected value iteration (PVI) method.
Projected value iteration (PVI) for policy evaluation: At iteration $n$, the current iterate $\Phi w_n$ is operated on by the mapping $T$, and the new vector $T(\Phi w_n)$ is projected onto the subspace $S$, thereby yielding the updated iterate $\Phi w_{n+1}$.
From projected value iteration to least-squares policy evaluation (LSPE)
Least-squares minimization for the projection $\Pi$, i.e., for $\Phi w_{n+1} = \Pi T(\Phi w_n)$:
$$w_{n+1} = \arg\min_{w} \big\| \Phi w - T(\Phi w_n) \big\|_{\pi}^{2}$$
Least-squares version of the PVI algorithm:
$$w_{n+1} = \arg\min_{w} \sum_{i=1}^{N} \pi_i \left( \phi^T(i)\,w - \sum_{j=1}^{N} p_{ij}\big(g(i,j) + \gamma\,\phi^T(j)\,w_n\big) \right)^{2}$$
Use Monte Carlo simulation: generate a trajectory $(i_0, i_1, i_2, \dots)$ and update $w_n$ after each transition $(i_n, i_{n+1})$:
$$w_{n+1} = \arg\min_{w} \sum_{k=0}^{n} \left( \phi^T(i_k)\,w - g(i_k, i_{k+1}) - \gamma\,\phi^T(i_{k+1})\,w_n \right)^{2}$$
(the least-squares policy evaluation, or LSPE, algorithm). LSPE converges to the same fixed point as PVI:
$$\Phi w^* = \Pi T(\Phi w^*)$$
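The trajectory-based least-squares update above can be sketched as follows; the Markov chain, transition costs, and feature matrix Φ are random placeholders, and, for simplicity, each LSPE iteration refits w by ordinary least squares over one fixed pre-simulated trajectory rather than growing the sum one transition at a time.

```python
import numpy as np

# LSPE: fit phi(i_k)^T w to the one-step targets g(i_k,i_{k+1}) + gamma * phi(i_{k+1})^T w_n.
rng = np.random.default_rng(6)
N, s, gamma = 10, 3, 0.9
P = rng.dirichlet(np.ones(N), size=N)                 # p_ij under the fixed policy
g = rng.uniform(0.0, 1.0, size=(N, N))                # g(i, j)
Phi = rng.normal(size=(N, s))                         # feature matrix, rows phi(i)^T

# simulate a single long trajectory i_0, i_1, ..., i_L
L = 5000
traj = [0]
for _ in range(L):
    traj.append(rng.choice(N, p=P[traj[-1]]))
traj = np.array(traj)

w = np.zeros(s)
for _ in range(50):                                   # LSPE iterations
    X = Phi[traj[:-1]]                                # phi(i_k)^T for k = 0..L-1
    targets = g[traj[:-1], traj[1:]] + gamma * Phi[traj[1:]] @ w
    w, *_ = np.linalg.lstsq(X, targets, rcond=None)   # least-squares fit
print("LSPE weights:", w)
```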
Figure 12.10 Illustration of the least-squares policy evaluation (LSPE) as a stochastic version of the projected value iteration (PVI).
At iteration $n+1$ of the LSPE(λ) algorithm, the updated weight vector $w_{n+1}$ is computed as the particular value of the weight vector $w$ that minimizes the least-squares difference between the following two quantities:
1. the inner product $\phi^T(i_k)\,w$, approximating the cost function $J(i_k)$;
2. the sum $\phi^T(i_k)\,w_n + \sum_{m=k}^{n} (\gamma\lambda)^{m-k}\, d_n(i_m, i_{m+1})$, which is extracted from a single simulated trajectory for $k = 0, 1, \dots, n$.
LSPE(λ):
$$d_n(i_k, i_{k+1}) = g(i_k, i_{k+1}) + \gamma\,\phi^T(i_{k+1})\,w_n - \phi^T(i_k)\,w_n$$
$$w_{n+1} = \arg\min_{w} \sum_{k=0}^{n} \left( \phi^T(i_k)\,w - \phi^T(i_k)\,w_n - \sum_{m=k}^{n} (\gamma\lambda)^{m-k}\, d_n(i_m, i_{m+1}) \right)^{2}$$
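The previous sketch extends to LSPE(λ) by adding the (γλ)-weighted sums of future temporal differences to the targets. Again, the chain, costs, features, and the value of λ are invented placeholders, and the weighted sums are computed with a backward recursion over one fixed simulated trajectory.

```python
import numpy as np

# LSPE(lambda): targets are phi(i_k)^T w_n + sum_{m>=k} (gamma*lambda)^{m-k} d_n(i_m, i_{m+1}).
rng = np.random.default_rng(7)
N, s, gamma, lam = 10, 3, 0.9, 0.7
P = rng.dirichlet(np.ones(N), size=N)
g = rng.uniform(0.0, 1.0, size=(N, N))
Phi = rng.normal(size=(N, s))

L = 5000
traj = [0]
for _ in range(L):
    traj.append(rng.choice(N, p=P[traj[-1]]))
traj = np.array(traj)

w = np.zeros(s)
for _ in range(50):
    # temporal differences d_n(i_k, i_{k+1}) along the trajectory
    d = g[traj[:-1], traj[1:]] + gamma * Phi[traj[1:]] @ w - Phi[traj[:-1]] @ w
    # lambda-weighted sums of future TDs, via a backward recursion
    z = np.zeros(L)
    acc = 0.0
    for k in range(L - 1, -1, -1):
        acc = d[k] + gamma * lam * acc
        z[k] = acc
    X = Phi[traj[:-1]]
    targets = Phi[traj[:-1]] @ w + z
    w, *_ = np.linalg.lstsq(X, targets, rcond=None)
print("LSPE(lambda) weights:", w)
```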
1. Policy evaluation: Given the current policy $\mu$, we compute a cost-to-go function $\tilde{J}^\mu(i,w)$ approximating the actual cost-to-go function $J^\mu(i)$ for all states $i$. The vector $w$ is the weight vector of the neural network used to perform the approximation.
2. Policy improvement: Given the approximate cost-to-go function $\tilde{J}^\mu(i,w)$, we generate an improved policy. This new policy is designed to be greedy with respect to $\tilde{J}^\mu(i,w)$ for all $i$.
Figure 12.11 Approximate policy iteration algorithm.
$$\varepsilon(w) = \sum_{i \in X} \sum_{m=1}^{M(i)} \big( k(i,m) - \tilde{J}^\mu(i,w) \big)^{2}$$
where $k(i,m)$ is the $m$th simulated sample of the cost-to-go for state $i$ and $M(i)$ is the number of such samples.
Figure 12.12 Block diagram of the approximate policy iteration algorithm.
Q-factor computed from the approximate cost-to-go function:
$$Q(i,a,w) = \sum_{j \in X} p_{ij}(a)\big( g(i,a,j) + \gamma\, \tilde{J}^\mu(j,w) \big)$$
The improved policy is obtained by choosing, in each state $i$, the action $a \in A_i$ that minimizes $Q(i,a,w)$, i.e., by acting greedily with respect to these Q-factors.
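A short sketch of the policy-improvement half of the loop: given weights w fitted for the current policy, form the Q-factors from the approximate cost-to-go Φw and pick the minimizing action in each state. The model, features, and weights are invented placeholders (in practice, w would come from the least-squares fit of the preceding evaluation step).

```python
import numpy as np

# One policy-improvement step of approximate policy iteration.
rng = np.random.default_rng(8)
N, num_actions, s, gamma = 6, 2, 3, 0.9
p = rng.dirichlet(np.ones(N), size=(num_actions, N))   # p[a, i, j]
g = rng.uniform(0.0, 1.0, size=(num_actions, N, N))    # g[a, i, j]
Phi = rng.normal(size=(N, s))                           # features phi(j)^T
w = rng.normal(size=s)                                  # weights for the current policy

J_tilde = Phi @ w                                       # J~(j, w) for all j
# Q(i, a, w) = sum_j p_ij(a) * ( g(i, a, j) + gamma * J~(j, w) )
Q = (p * (g + gamma * J_tilde[None, None, :])).sum(axis=2)   # shape (num_actions, N)
improved_policy = Q.argmin(axis=0)                      # greedy w.r.t. the Q-factors
print("improved policy:", improved_policy)
```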
(c) 2017 Biointelligence Lab, SNU 32
(c) 2017 Biointelligence Lab, SNU 33