Chapter 12. Dynamic Programming
Neural Networks and Learning Machines (Haykin)
Lecture Notes on
Self-learning Neural Algorithms
Byoung-Tak Zhang, School of Computer Science and Engineering, Seoul National University
Version: 20171011
Contents
12.1 Introduction
12.2 Markov Decision Process
12.3 Bellman's Optimality Criterion
12.4 Policy Iteration
12.5 Value Iteration
12.6 Approximate DP: Direct Methods
12.7 Temporal Difference Learning
12.8 Q-Learning
12.9 Approximate DP: Indirect Methods
12.10 Least Squares Policy Evaluation
12.11 Approximate Policy Iteration
Summary and Discussion
1. Learning with a teacher: supervised learning
2. Learning without a teacher: unsupervised learning, reinforcement learning, semi-supervised learning

Reinforcement learning:
1. Behavioral learning (action, sequential decision making)
2. Interaction between an agent and its environment
3. Achieving a specific goal under uncertainty

Two approaches to reinforcement learning:
1. Classical approach: punishment and reward (classical conditioning), highly skilled behavior
2. Modern approach: dynamic programming, planning
Dynamic programming (DP) deals with situations in which decisions are made in stages, with the outcome of each decision being predictable to some extent before the next decision is made. Decisions cannot be made in isolation: the desire for a low cost at present must be balanced against the undesirability of a high cost in the future (the credit-assignment problem).
Key question: How can an agent improve its long-term performance in a stochastic environment when the attainment of this improvement may require having to sacrifice short-term performance?
Right balance between
(theoretical)
Markov decision process (MDP):
The state of the environment is a summary of the entire past experience of an agent gained from its interaction with the environment, such that the information necessary for the agent to predict the future behavior of the environment is contained in that summary.
Figure 12.1 An agent interacting with its environment.
MDP = the sequence of states $\{X_n,\ n = 0, 1, 2, \dots\}$, i.e., a Markov chain with transition probabilities $p_{ij}(\mu(i))$ for actions $\mu(i)$.
States: $i, j \in X$
Actions: $A_i = \{a_{ik}\}$, the set of admissible actions in state $i$
Policy: $\pi = \{\mu_0, \mu_1, \dots\}$, a mapping from states $X$ to actions $A$, with $\mu_n(i) \in A_i$ for all states $i$
  Nonstationary policy: $\pi = \{\mu_0, \mu_1, \dots\}$
  Stationary policy: $\pi = \{\mu, \mu, \dots\}$
Transition probability: $p_{ij}(a) = P(X_{n+1} = j \mid X_n = i, A_n = a)$
  1. $p_{ij}(a) \ge 0$ for all $i$ and $j$
  2. $\sum_j p_{ij}(a) = 1$ for all $i$
Cost function: $g(i, a_{ik}, j)$
Discount factor: $\gamma$; discounted cost: $\gamma^n g(i, a_{ik}, j)$
Figure 12.2 Illustration of two possible transitions: The transition from state to state is probabilistic, but the transition from state to is deterministic.
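To make the notation concrete, the following minimal Python sketch lays out the ingredients of a small finite MDP in the form used throughout this chapter. All names and numerical values (two states, two actions, the transition probabilities, and the costs) are invented placeholders, not taken from the slides.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP used only to illustrate the notation.
# States i, j in {0, 1}; actions a in A_i = {0, 1} for every state.
N = 2
gamma = 0.9                       # discount factor

# p[a, i, j] = P(X_{n+1} = j | X_n = i, A_n = a); each row sums to 1
p = np.array([[[0.9, 0.1],
               [0.4, 0.6]],
              [[0.2, 0.8],
               [0.7, 0.3]]])

# g[a, i, j] = cost incurred on the transition i -> j under action a
g = np.array([[[1.0, 2.0],
               [0.5, 1.5]],
              [[2.0, 0.5],
               [1.0, 3.0]]])

# Sanity check: every p[a, i, :] is a probability distribution over next states
assert np.allclose(p.sum(axis=2), 1.0)

# Immediate expected cost c(i, a) = sum_j p_ij(a) g(i, a, j)
c = (p * g).sum(axis=2)           # shape (actions, states)
print("c(i, a):\n", c.T)
```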
Dynamic programming (DP) problem
Cost-to-go function (total expected cost):
$$J^\pi(i) = E\left[\sum_{n=0}^{\infty} \gamma^n\, g(X_n, \mu_n(X_n), X_{n+1}) \,\Big|\, X_0 = i\right]$$
where $g(X_n, \mu_n(X_n), X_{n+1})$ is the observed cost.
Optimal value: $J^*(i) = \min_\pi J^\pi(i)$ (for a stationary policy $\pi = \{\mu, \mu, \dots\}$ we write $J^\mu(i)$; for an optimal stationary policy, $J^\mu(i) = J^*(i)$).
Basic problem in DP: Given a stationary MDP, find a stationary policy $\mu$ that minimizes the cost-to-go function $J^\mu(i)$ for all initial states $i$.
Notation: cost function $J(i)$ ↔ value function $V(s)$; cost $g(\cdot)$ ↔ reward $r(\cdot)$.
Principle of optimality
An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy starting from the state resulting from the first decision.
Consider a finite-horizon problem for which the cost-to-go function is
$$J_0(X_0) = E\left[g_K(X_K) + \sum_{n=0}^{K-1} g_n(X_n, \mu_n(X_n), X_{n+1})\right].$$
Suppose we wish to minimize the cost-to-go function
$$J_n(X_n) = E\left[g_K(X_K) + \sum_{k=n}^{K-1} g_k(X_k, \mu_k(X_k), X_{k+1})\right].$$
Then the truncated policy $\{\mu_n^*, \mu_{n+1}^*, \dots, \mu_{K-1}^*\}$ is optimal for the subproblem.
Dynamic programming algorithm
For every initial state $X_0$, the optimal cost $J^*(X_0)$ of the basic finite-horizon problem is equal to $J_0(X_0)$, where the function $J_0$ is obtained from the last step of the algorithm
$$J_n(X_n) = \min_{\mu_n}\, E_{X_{n+1}}\big[g_n(X_n, \mu_n(X_n), X_{n+1}) + J_{n+1}(X_{n+1})\big] \qquad (12.13)$$
which runs backward in time, with
$$J_K(X_K) = g_K(X_K).$$
Furthermore, if $\mu_n^*$ minimizes the right-hand side of Eq. (12.13) for each $X_n$ and $n$, then the policy $\pi^* = \{\mu_0^*, \mu_1^*, \dots, \mu_{K-1}^*\}$ is optimal.
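As a complement to the statement above, here is a minimal sketch of the backward recursion in Eq. (12.13) for a finite-horizon problem. The state-space size, horizon, transition probabilities, and stage costs are random placeholders generated only for illustration.

```python
import numpy as np

# Backward recursion J_n(x) = min_a E[ g_n(x, a, X_{n+1}) + J_{n+1}(X_{n+1}) ],
# run from n = K-1 down to n = 0, starting from the terminal cost J_K = g_K.
N_states, K = 3, 5
rng = np.random.default_rng(0)
p = {a: rng.dirichlet(np.ones(N_states), size=N_states) for a in range(2)}  # p[a][i, j]
g = {a: rng.uniform(0.0, 2.0, size=(N_states, N_states)) for a in range(2)} # g[a][i, j]
g_K = np.zeros(N_states)            # terminal cost g_K(X_K)

J = g_K.copy()                      # J_K
policy = np.zeros((K, N_states), dtype=int)
for n in reversed(range(K)):        # n = K-1, ..., 0
    # expected one-stage cost plus cost-to-go, for every (state, action)
    Q = np.stack([(p[a] * (g[a] + J[None, :])).sum(axis=1) for a in range(2)], axis=1)
    policy[n] = Q.argmin(axis=1)    # mu_n*
    J = Q.min(axis=1)               # J_n
print("J_0 =", J)
```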
Bellman's optimality equation
$$J^*(i) = \min_{\mu}\left(c(i,\mu(i)) + \gamma\sum_{j=1}^{N} p_{ij}(\mu)\,J^*(j)\right), \qquad i = 1, 2, \dots, N$$
Immediate expected cost:
$$c(i,\mu(i)) = E_{X_1}\big[g(i,\mu(i),X_1)\big] = \sum_{j=1}^{N} p_{ij}\,g(i,\mu(i),j)$$
Two methods for computing an optimal policy:
- Policy iteration
- Value iteration
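One way to solve Bellman's optimality equation numerically is successive approximation (value iteration), sketched below for an invented random MDP; the containers p[a, i, j] and g[a, i, j], the state/action counts, and the tolerance are assumptions for illustration only.

```python
import numpy as np

# Value iteration: J(i) <- min_a [ c(i, a) + gamma * sum_j p_ij(a) J(j) ]
rng = np.random.default_rng(1)
N, num_actions, gamma = 4, 2, 0.9
p = rng.dirichlet(np.ones(N), size=(num_actions, N))   # p[a, i, j]
g = rng.uniform(0.0, 1.0, size=(num_actions, N, N))    # g[a, i, j]

J = np.zeros(N)
for _ in range(1000):
    # c(i, a) + gamma * sum_j p_ij(a) J(j), for every (action, state)
    Q = (p * (g + gamma * J[None, None, :])).sum(axis=2)   # shape (num_actions, N)
    J_new = Q.min(axis=0)
    if np.max(np.abs(J_new - J)) < 1e-10:
        J = J_new
        break
    J = J_new
greedy_policy = Q.argmin(axis=0)
print("J* ~", J, "greedy policy:", greedy_policy)
```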
Figure 12.3 Policy iteration algorithm.
1. Policy evaluation step: the cost-to-go function for the current policy and the corresponding Q-factors are computed for all states and actions.
2. Policy improvement step: the current policy is updated so as to be greedy with respect to the cost-to-go function computed in step 1.
$$J^{\mu_n}(i) = c(i,\mu_n(i)) + \gamma\sum_{j=1}^{N} p_{ij}(\mu_n(i))\,J^{\mu_n}(j), \qquad i = 1, 2, \dots, N$$
$$Q^{\mu_n}(i,a) = c(i,a) + \gamma\sum_{j=1}^{N} p_{ij}(a)\,J^{\mu_n}(j), \qquad a \in A_i \ \text{and} \ i = 1, 2, \dots, N$$
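A compact sketch of the two steps above: exact policy evaluation by solving the linear Bellman system for $J^{\mu_n}$, followed by greedy improvement via the Q-factors. The random MDP below is an invented placeholder.

```python
import numpy as np

# Policy iteration: evaluate the current policy exactly, then improve it greedily.
rng = np.random.default_rng(2)
N, num_actions, gamma = 4, 2, 0.9
p = rng.dirichlet(np.ones(N), size=(num_actions, N))   # p[a, i, j]
g = rng.uniform(0.0, 1.0, size=(num_actions, N, N))    # g[a, i, j]
c = (p * g).sum(axis=2)                                 # c[a, i]: immediate expected cost

mu = np.zeros(N, dtype=int)                             # initial stationary policy
for _ in range(100):                                    # converges in a finite number of steps
    # 1. Policy evaluation: J = c_mu + gamma * P_mu J  =>  (I - gamma P_mu) J = c_mu
    P_mu = p[mu, np.arange(N), :]                       # (N, N)
    c_mu = c[mu, np.arange(N)]                          # (N,)
    J = np.linalg.solve(np.eye(N) - gamma * P_mu, c_mu)
    # 2. Policy improvement: greedy with respect to the Q-factors
    Q = c + gamma * (p @ J)                             # Q[a, i]
    mu_new = Q.argmin(axis=0)
    if np.array_equal(mu_new, mu):
        break
    mu = mu_new
print("policy:", mu, "cost-to-go:", J)
```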
Figure 12.4 Illustrative backup diagrams for (a) policy iteration and (b) value iteration.
Figure 12.5 Flow graph for stagecoach problem.
Figure 12.6 Steps involved in calculating the Q-factors for the stagecoach problem.
Sequence of states: $\{i_n\}_{n=0}^{N}$
Cost-to-go function for the Bellman equation:
$$J^\mu(i_n) = E\big[g(i_n, i_{n+1}) + J^\mu(i_{n+1})\big], \qquad n = 0, 1, \dots, N-1$$
Applying the Robbins–Monro stochastic approximation
$$r^+ = (1-\eta)\,r + \eta\,g(r, v),$$
we have
$$J^+(i_n) = (1-\eta)\,J(i_n) + \eta\big[g(i_n, i_{n+1}) + J(i_{n+1})\big] = J(i_n) + \eta\big[g(i_n, i_{n+1}) + J(i_{n+1}) - J(i_n)\big].$$
Temporal difference (TD):
$$d_n = g(i_n, i_{n+1}) + J(i_{n+1}) - J(i_n), \qquad n = 0, 1, \dots, N-1$$
TD learning:
$$J^+(i_n) = J(i_n) + \eta\,d_n$$
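The TD update above can be simulated directly. The sketch below evaluates a fixed policy from a single long simulated trajectory; the chain, the costs, and the constant step size η are invented placeholders, and a discount factor is included for boundedness even though the derivation above is written without one.

```python
import numpy as np

# TD learning for policy evaluation: J(i_n) <- J(i_n) + eta * d_n, with
# d_n = g(i_n, i_{n+1}) + gamma * J(i_{n+1}) - J(i_n).
rng = np.random.default_rng(3)
N, gamma, eta = 5, 0.95, 0.05
P = rng.dirichlet(np.ones(N), size=N)         # p_ij under the fixed policy (placeholder)
g = rng.uniform(0.0, 1.0, size=(N, N))        # g(i, j) transition costs (placeholder)

J = np.zeros(N)
i = 0
for _ in range(100_000):
    j = rng.choice(N, p=P[i])                 # simulate one transition i -> j
    d = g[i, j] + gamma * J[j] - J[i]         # temporal difference d_n
    J[i] += eta * d                           # TD update
    i = j
print("TD estimate of J^mu:", J)
```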
Monte Carlo simulation algorithm
$$J^\mu(i_n) = E\left[\sum_{k=0}^{N-n-1} g(i_{n+k}, i_{n+k+1})\right], \qquad n = 0, 1, \dots, N-1$$
Robbins–Monro stochastic approximation:
$$\begin{aligned}
J^+(i_n) &= J(i_n) + \eta_n\left(\sum_{k=0}^{N-n-1} g(i_{n+k}, i_{n+k+1}) - J(i_n)\right) \\
&= J(i_n) + \eta_n\big[\,g(i_n, i_{n+1}) + J(i_{n+1}) - J(i_n) \\
&\qquad\qquad + g(i_{n+1}, i_{n+2}) + J(i_{n+2}) - J(i_{n+1}) \\
&\qquad\qquad + \cdots \\
&\qquad\qquad + g(i_{N-2}, i_{N-1}) + J(i_{N-1}) - J(i_{N-2}) \\
&\qquad\qquad + g(i_{N-1}, i_{N}) + J(i_{N}) - J(i_{N-1})\,\big]
\end{aligned}$$
Using the temporal difference $d_n$:
$$J^+(i_n) = J(i_n) + \eta_n \sum_{k=0}^{N-n-1} d_{n+k}$$
To see that this is an iterative implementation of Monte Carlo estimation, define the total cost of the trajectory $\{i_n, i_{n+1}, \dots, i_N\}$:
$$c(i_n) = \sum_{k=0}^{N-n-1} g(i_{n+k}, i_{n+k+1}), \qquad n = 0, \dots, N-1$$
Cost-to-go after visiting $i_n$, averaged over $T$ simulations:
$$J(i_n) = \frac{1}{T}\sum_{t=1}^{T} c_t(i_n)$$
Ensemble-averaged cost-to-go: $J^\mu(i_n) = E[c(i_n)]$ for all $n$.
Iterative formula:
$$J^+(i_n) = J(i_n) + \eta_n\big(c(i_n) - J(i_n)\big)$$
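For comparison, the Monte Carlo estimate can be sketched as follows: each visit to a state contributes the total remaining trajectory cost c(i_n) as a sample, applied through the iterative formula above. The chain, costs, trajectory length, and step size are invented placeholders.

```python
import numpy as np

# Monte Carlo policy evaluation: J(i_n) <- J(i_n) + eta * (c(i_n) - J(i_n)),
# where c(i_n) is the remaining cost along the simulated trajectory.
rng = np.random.default_rng(4)
N, N_steps, T, eta = 5, 20, 2000, 0.02
P = rng.dirichlet(np.ones(N), size=N)            # p_ij under the fixed policy
g = rng.uniform(0.0, 1.0, size=(N, N))           # g(i, j)

J = np.zeros(N)
for _ in range(T):
    # simulate one trajectory i_0, ..., i_N
    traj = [rng.integers(N)]
    for _ in range(N_steps):
        traj.append(rng.choice(N, p=P[traj[-1]]))
    # total remaining cost c(i_n) for each visited state (backward accumulation)
    remaining = 0.0
    costs = []
    for n in range(N_steps - 1, -1, -1):
        remaining += g[traj[n], traj[n + 1]]
        costs.append((traj[n], remaining))
    # iterative update with the sampled costs
    for i, c_i in costs:
        J[i] += eta * (c_i - J[i])
print("Monte Carlo estimate of J^mu:", J)
```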
Two-step version of Bellman's optimality equation:
$$Q^*(i,a) = \sum_{j=1}^{N} p_{ij}(a)\left(g(i,a,j) + \gamma \min_{b\in A_j} Q^*(j,b)\right) \quad \text{for all } (i,a)$$
The value iteration algorithm applied to the Q-factors:
$$Q_{n+1}(i,a) = \sum_{j=1}^{N} p_{ij}(a)\left(g(i,a,j) + \gamma \min_{b\in A_j} Q_n(j,b)\right) \quad \text{for all } (i,a)$$
Small step-size version:
$$Q_{n+1}(i,a) = (1-\eta)\,Q_n(i,a) + \eta \sum_{j=1}^{N} p_{ij}(a)\left(g(i,a,j) + \gamma \min_{b\in A_j} Q_n(j,b)\right) \quad \text{for all } (i,a)$$
Stochastic version based on a single sample (where $j$ is the next state observed after taking action $a$ in state $i$):
$$Q_{n+1}(i,a) = \big(1-\eta_n(i,a)\big)\,Q_n(i,a) + \eta_n(i,a)\big[g(i,a,j) + \gamma J_n(j)\big] \quad \text{for } (i,a) = (i_n, a_n),$$
$$J_n(j) = \min_{b\in A_j} Q_n(j,b), \qquad Q_{n+1}(i,a) = Q_n(i,a) \ \text{for all } (i,a) \neq (i_n, a_n).$$
Q-learning algorithm:
$$Q_{n+1}(i,a) = Q_n(i,a) + \eta_n(i,a)\Big[g(i,a,j) + \gamma \min_{b\in A_j} Q_n(j,b) - Q_n(i,a)\Big]$$
Is there any on-line procedure for learning an optimal control policy through experience that is gained solely from interaction with the environment? Q-learning is such a procedure.
Convergence theorem
Suppose that the learning-rate parameter $\eta_n(i,a)$ satisfies the conditions
$$\sum_{n=0}^{\infty} \eta_n(i,a) = \infty \quad \text{and} \quad \sum_{n=0}^{\infty} \eta_n^2(i,a) < \infty \quad \text{for all } (i,a).$$
Then the sequence of Q-factors $\{Q_n(i,a)\}$ generated by the Q-learning algorithm converges with probability 1 to the optimal value $Q^*(i,a)$ for all state–action pairs $(i,a)$ as the number of iterations $n$ approaches infinity, provided that all state–action pairs are visited infinitely often.
The Q-learning algorithm may be viewed in one of two equivalent ways.
Compromise between two conflicting objectives in reinforcement learning: exploration (needed to satisfy the Q-learning convergence theorem, which requires every state–action pair to be visited infinitely often) and exploitation (acting greedily with respect to the current Q-factors).
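A minimal Q-learning sketch follows, using an ε-greedy behavior policy as one simple way to keep visiting all state–action pairs (as the convergence theorem requires) while still exploiting the current Q-factors. The simulated MDP and the step-size schedule η_n(i,a) = 1/(number of visits) are assumptions made for illustration.

```python
import numpy as np

# Q-learning with epsilon-greedy exploration; costs are minimized, so "greedy" = argmin.
rng = np.random.default_rng(5)
N, num_actions, gamma, epsilon = 4, 2, 0.9, 0.1
p = rng.dirichlet(np.ones(N), size=(num_actions, N))   # p[a, i, j] (simulator only)
g = rng.uniform(0.0, 1.0, size=(num_actions, N, N))    # g[a, i, j]

Q = np.zeros((N, num_actions))
visits = np.zeros((N, num_actions))
i = 0
for _ in range(100_000):
    # epsilon-greedy action selection
    a = rng.integers(num_actions) if rng.random() < epsilon else int(Q[i].argmin())
    j = rng.choice(N, p=p[a, i])                        # simulated next state
    visits[i, a] += 1
    eta = 1.0 / visits[i, a]                            # eta_n(i,a): sum diverges, squared sum is finite
    # Q_{n+1}(i,a) = Q_n(i,a) + eta * [ g(i,a,j) + gamma * min_b Q_n(j,b) - Q_n(i,a) ]
    Q[i, a] += eta * (g[a, i, j] + gamma * Q[j].min() - Q[i, a])
    i = j
print("greedy policy from learned Q-factors:", Q.argmin(axis=1))
```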
Figure 12.7 The time slots pertaining to the auxiliary and original control processes.
Mixed nonstationary policy: switches between an auxiliary Markov process and the original Markov process controlled by a stationary greedy policy determined by Q-learning, with
$$n_k = m_{k-1} + L, \quad k = 1, 2, \dots, \ \text{and } m_0 = 1$$
$$m_k = n_k + kL, \quad k = 1, 2, \dots$$
Indirect approach to approximate DP:
- Rather than explicitly estimating the transition probabilities and the associated transition costs, use Monte Carlo simulation to generate one or more system trajectories, so as to approximate the cost-to-go function of a given policy, or even the optimal cost-to-go function, and then fit a parameterized approximating function $\tilde{J}(i,w)$ to the simulated costs.
- Thus, having abandoned the notion of optimality, we may capture the goal of the indirect approach to approximate dynamic programming in the following simple statement: Do as well as possible, and not more.
- Performance optimality is traded off for computational optimality. This strategy is precisely what the human brain does on a daily basis: given a difficult decision-making problem, the brain provides a suboptimal solution that is the "best" in terms of reliability and available resource allocation.
Goal of approximate DP: Find a function $\tilde{J}(i,w)$ that approximates the optimal cost-to-go function $J^*(i)$ for state $i$, such that the cost difference $J^*(i) - \tilde{J}(i,w)$ is minimized according to some statistical criterion.
Two basic questions:
Question 1: How do we choose the approximation function $\tilde{J}(i,w)$ in the first place?
Question 2: Having chosen an appropriate approximation function $\tilde{J}(i,w)$, how do we adapt the weight vector $w$ so as to provide the "best fit" to Bellman's equation of optimality?
Figure 12.8 Architectural layout of the linear approach to approximate dynamic programming.
Approximate DP
1. Linear approach:
$$\tilde{J}(i,w) = \sum_j \varphi_{ij}\, w_j = \phi_i^T w \quad \text{for all } i$$
2. Nonlinear approach:
   - Recurrent multilayer perceptrons (deep architectures)
   - Supervised training of a recurrent multilayer perceptron by a nonlinear sequential-state estimation algorithm that is derivative-free
$$J(i) \approx \tilde{J}(i,w) = \phi^T(i)\,w$$
Cost-to-go function of the policy under evaluation:
$$J^\mu(i) = E\left[\sum_{n=0}^{\infty} \gamma^n\, g(i_n, i_{n+1}) \,\Big|\, i_0 = i\right]$$
Assumption 1: The Markov chain has steady-state probabilities that are all positive,
$$\pi_j = \lim_{n\to\infty} \frac{1}{n}\sum_{k=1}^{n} P(X_k = j \mid X_0 = i) > 0 \quad \text{for all } i, j.$$
The implication of this assumption is that the Markov chain has a single recurrent class with no transient states.
Assumption 2: The rank of the feature matrix $\Phi$ (defined below) is $s$. The implication of this second assumption is that the columns of the feature matrix $\Phi$, and therefore the basis functions they represent, are linearly independent.
Perform value iteration within a lower-dimensional subspace spanned by a set of basis functions:
$$\Phi = \begin{bmatrix} \phi_1^T \\ \phi_2^T \\ \vdots \\ \phi_N^T \end{bmatrix}$$
where the $i$th row $\phi_i^T = \phi^T(i)$ is the feature vector of state $i$.
Bellman's equation for the policy under evaluation, in scalar and matrix form:
$$J(i) = \sum_{j=1}^{N} p_{ij}\big(g(i,j) + \gamma J(j)\big), \qquad i = 1, 2, \dots, N$$
$$\mathbf{g} = \begin{bmatrix} \sum_j p_{1j}\, g(1,j) \\ \sum_j p_{2j}\, g(2,j) \\ \vdots \\ \sum_j p_{Nj}\, g(N,j) \end{bmatrix}, \qquad
\mathbf{P} = \begin{bmatrix} p_{11} & \cdots & p_{1N} \\ \vdots & \ddots & \vdots \\ p_{N1} & \cdots & p_{NN} \end{bmatrix}, \qquad
\mathbf{J} = \begin{bmatrix} J(1) \\ J(2) \\ \vdots \\ J(N) \end{bmatrix} \approx \Phi\mathbf{w}$$
$$\mathbf{J} = \mathbf{g} + \gamma\mathbf{P}\mathbf{J}$$
Value iteration with the mapping $T(\mathbf{J}) = \mathbf{g} + \gamma\mathbf{P}\mathbf{J}$:
$$\mathbf{J}_{n+1} = T(\mathbf{J}_n)$$
Projected value iteration:
$$\Phi w_{n+1} = \Pi T(\Phi w_n), \qquad n = 0, 1, 2, \dots$$
where $\Pi$ denotes projection onto the subspace $S$.
Figure 12.9 Projected value iteration (PVI) method.
Projected value iteration (PVI) for policy evaluation: At iteration $n$, the current iterate $\Phi w_n$ is operated on by the mapping $T$, and the new vector $T(\Phi w_n)$ is projected onto the subspace $S$, thereby yielding the updated iterate $\Phi w_{n+1}$.
From projected value iteration to least-squares policy evaluation (LSPE)
Least-squares minimization for the projection $\Pi$, i.e., for $\Phi w_{n+1} = \Pi T(\Phi w_n)$:
$$w_{n+1} = \arg\min_{w} \big\| \Phi w - T(\Phi w_n) \big\|_{\pi}^{2}$$
Least-squares version of the PVI algorithm:
$$w_{n+1} = \arg\min_{w} \sum_{i=1}^{N} \pi_i \left( \phi^T(i)\,w - \sum_{j=1}^{N} p_{ij}\big(g(i,j) + \gamma\,\phi^T(j)\,w_n\big) \right)^{2}$$
Use Monte Carlo simulation: generate a trajectory $(i_0, i_1, i_2, \dots)$ and update $w_n$ after each transition $(i_n, i_{n+1})$:
$$w_{n+1} = \arg\min_{w} \sum_{k=0}^{n} \left( \phi^T(i_k)\,w - g(i_k, i_{k+1}) - \gamma\,\phi^T(i_{k+1})\,w_n \right)^{2}$$
(the least-squares policy evaluation, or LSPE, algorithm). LSPE converges to the same fixed point as PVI:
$$\Phi w^* = \Pi T(\Phi w^*)$$
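The trajectory-based least-squares update above can be sketched as follows; the Markov chain, transition costs, and feature matrix Φ are random placeholders, and, for simplicity, each LSPE iteration refits w by ordinary least squares over one fixed pre-simulated trajectory rather than growing the sum one transition at a time.

```python
import numpy as np

# LSPE: fit phi(i_k)^T w to the one-step targets g(i_k,i_{k+1}) + gamma * phi(i_{k+1})^T w_n.
rng = np.random.default_rng(6)
N, s, gamma = 10, 3, 0.9
P = rng.dirichlet(np.ones(N), size=N)                 # p_ij under the fixed policy
g = rng.uniform(0.0, 1.0, size=(N, N))                # g(i, j)
Phi = rng.normal(size=(N, s))                         # feature matrix, rows phi(i)^T

# simulate a single long trajectory i_0, i_1, ..., i_L
L = 5000
traj = [0]
for _ in range(L):
    traj.append(rng.choice(N, p=P[traj[-1]]))
traj = np.array(traj)

w = np.zeros(s)
for _ in range(50):                                   # LSPE iterations
    X = Phi[traj[:-1]]                                # phi(i_k)^T for k = 0..L-1
    targets = g[traj[:-1], traj[1:]] + gamma * Phi[traj[1:]] @ w
    w, *_ = np.linalg.lstsq(X, targets, rcond=None)   # least-squares fit
print("LSPE weights:", w)
```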
Figure 12.10 Illustration of the least-squares policy evaluation (LSPE) as a stochastic version of the projected value iteration (PVI).
At iteration $n+1$ of the LSPE(λ) algorithm, the updated weight vector $w_{n+1}$ is computed as the particular value of the weight vector $w$ that minimizes the least-squares difference between the following two quantities:
1. the inner product $\phi^T(i_k)\,w$, approximating the cost function $J(i_k)$;
2. the sum $\phi^T(i_k)\,w_n + \sum_{m=k}^{n} (\gamma\lambda)^{m-k}\, d_n(i_m, i_{m+1})$, which is extracted from a single simulated trajectory for $k = 0, 1, \dots, n$.
LSPE(λ):
$$d_n(i_k, i_{k+1}) = g(i_k, i_{k+1}) + \gamma\,\phi^T(i_{k+1})\,w_n - \phi^T(i_k)\,w_n$$
$$w_{n+1} = \arg\min_{w} \sum_{k=0}^{n} \left( \phi^T(i_k)\,w - \phi^T(i_k)\,w_n - \sum_{m=k}^{n} (\gamma\lambda)^{m-k}\, d_n(i_m, i_{m+1}) \right)^{2}$$
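The previous sketch extends to LSPE(λ) by adding the (γλ)-weighted sums of future temporal differences to the targets. Again, the chain, costs, features, and the value of λ are invented placeholders, and the weighted sums are computed with a backward recursion over one fixed simulated trajectory.

```python
import numpy as np

# LSPE(lambda): targets are phi(i_k)^T w_n + sum_{m>=k} (gamma*lambda)^{m-k} d_n(i_m, i_{m+1}).
rng = np.random.default_rng(7)
N, s, gamma, lam = 10, 3, 0.9, 0.7
P = rng.dirichlet(np.ones(N), size=N)
g = rng.uniform(0.0, 1.0, size=(N, N))
Phi = rng.normal(size=(N, s))

L = 5000
traj = [0]
for _ in range(L):
    traj.append(rng.choice(N, p=P[traj[-1]]))
traj = np.array(traj)

w = np.zeros(s)
for _ in range(50):
    # temporal differences d_n(i_k, i_{k+1}) along the trajectory
    d = g[traj[:-1], traj[1:]] + gamma * Phi[traj[1:]] @ w - Phi[traj[:-1]] @ w
    # lambda-weighted sums of future TDs, via a backward recursion
    z = np.zeros(L)
    acc = 0.0
    for k in range(L - 1, -1, -1):
        acc = d[k] + gamma * lam * acc
        z[k] = acc
    X = Phi[traj[:-1]]
    targets = Phi[traj[:-1]] @ w + z
    w, *_ = np.linalg.lstsq(X, targets, rcond=None)
print("LSPE(lambda) weights:", w)
```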
1. Policy evaluation: Given the current policy $\mu$, we compute a cost-to-go function $\tilde{J}^\mu(i,w)$ approximating the actual cost-to-go function $J^\mu(i)$ for all states $i$. The vector $w$ is the weight vector of the neural network used to perform the approximation.
2. Policy improvement: Given the approximate cost-to-go function $\tilde{J}^\mu(i,w)$, we generate an improved policy. This new policy is designed to be greedy with respect to $\tilde{J}^\mu(i,w)$ for all $i$.
Figure 12.11 Approximate policy iteration algorithm.
$$\varepsilon(w) = \sum_{i \in X} \sum_{m=1}^{M(i)} \big( k(i,m) - \tilde{J}^\mu(i,w) \big)^{2}$$
where $k(i,m)$ is the $m$th simulated sample of the cost-to-go for state $i$ and $M(i)$ is the number of such samples.
Figure 12.12 Block diagram of the approximate policy iteration algorithm.
Q-factor computed from the approximate cost-to-go function:
$$Q(i,a,w) = \sum_{j \in X} p_{ij}(a)\big( g(i,a,j) + \gamma\, \tilde{J}^\mu(j,w) \big)$$
The improved policy is obtained by choosing, in each state $i$, the action $a \in A_i$ that minimizes $Q(i,a,w)$, i.e., by acting greedily with respect to these Q-factors.
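A short sketch of the policy-improvement half of the loop: given weights w fitted for the current policy, form the Q-factors from the approximate cost-to-go Φw and pick the minimizing action in each state. The model, features, and weights are invented placeholders (in practice, w would come from the least-squares fit of the preceding evaluation step).

```python
import numpy as np

# One policy-improvement step of approximate policy iteration.
rng = np.random.default_rng(8)
N, num_actions, s, gamma = 6, 2, 3, 0.9
p = rng.dirichlet(np.ones(N), size=(num_actions, N))   # p[a, i, j]
g = rng.uniform(0.0, 1.0, size=(num_actions, N, N))    # g[a, i, j]
Phi = rng.normal(size=(N, s))                           # features phi(j)^T
w = rng.normal(size=s)                                  # weights for the current policy

J_tilde = Phi @ w                                       # J~(j, w) for all j
# Q(i, a, w) = sum_j p_ij(a) * ( g(i, a, j) + gamma * J~(j, w) )
Q = (p * (g + gamma * J_tilde[None, None, :])).sum(axis=2)   # shape (num_actions, N)
improved_policy = Q.argmin(axis=0)                      # greedy w.r.t. the Q-factors
print("improved policy:", improved_policy)
```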
(c) 2017 Biointelligence Lab, SNU 32
(c) 2017 Biointelligence Lab, SNU 33