SLIDE 1 Finite Markov Decision Processes (MDP)
2020/3/20
SLIDE 2 Markov Decision Process (MDP)
https://en.wikipedia.org/wiki/Markov_decision_process
SLIDE 3 Markov Property
- The current state captures all relevant information from past states
- i.e. memoryless
- Let bygones be bygones
SLIDE 4 Markov Process
- A Markov process is a memoryless random process, i.e. a sequence of
random states S1, S2, … with Markov property
- Transition probability P(s, s’) is the probability of moving from state s
to state s’
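The Markov process above can be sketched in a few lines of Python. The two-state weather chain and its transition probabilities P(s, s') below are illustrative, not from the slides:

```python
import random

# Illustrative transition probabilities P(s, s') for a toy chain.
P = {
    "Sunny": {"Sunny": 0.8, "Rainy": 0.2},
    "Rainy": {"Sunny": 0.4, "Rainy": 0.6},
}

def step(s):
    """Sample the next state from P(s, .)."""
    states = list(P[s])
    probs = [P[s][s2] for s2 in states]
    return random.choices(states, weights=probs)[0]

def sample_episode(s0, n):
    """Roll out n transitions; by the Markov property the next state
    depends only on the current state, not on the full history."""
    episode = [s0]
    for _ in range(n):
        episode.append(step(episode[-1]))
    return episode

episode = sample_episode("Sunny", 5)
```

The episodes on the next slides are exactly this kind of rollout, sampled from the Student Markov Chain instead of this toy chain.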
SLIDE 5
Student Markov Chain
SLIDE 6
Student Markov Chain Episodes
SLIDE 7
Example: Student Markov Chain Transition Matrix
SLIDE 8 Adding Reward to Markov Process
- A Markov reward process is a Markov chain with values.
SLIDE 9
Student MRP
SLIDE 10 Discounted Future Return Gt
- The discount γ ∈ [0,1] is the present value of future rewards
− γ close to 0 leads to “short-sighted” evaluation
− γ close to 1 leads to “far-sighted” evaluation
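The discounted return Gt = Rt+1 + γ·Rt+2 + γ²·Rt+3 + … can be computed by folding backwards over a reward sequence; the rewards here are illustrative:

```python
# Discounted return G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
# computed backwards over a finite reward sequence.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

discounted_return([1.0, 1.0, 1.0], 0.5)  # 1 + 0.5 + 0.25 = 1.75
```

With gamma = 0 only the immediate reward counts (short-sighted); with gamma = 1 all rewards count equally (far-sighted).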
SLIDE 11 Why add a discount factor γ?
- Uncertainty about the future
- Avoids infinite returns in cyclic Markov processes
- Animal/human behaviour shows preference for immediate reward
SLIDE 12 Value Function
- The value function v(s) gives the long-term value of state s: the expected return starting from s, v(s) = E[Gt | St = s]
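Since v(s) is an expected return, it can be estimated by averaging sampled returns, as the Student MRP returns on the next slide illustrate. A minimal Monte Carlo sketch, where the two-state chain, rewards, and γ are illustrative:

```python
import random

# Illustrative MRP: transitions P, per-state rewards R, discount gamma.
P = {"A": {"A": 0.5, "B": 0.5}, "B": {"A": 0.2, "B": 0.8}}
R = {"A": 1.0, "B": 0.0}
gamma = 0.9

def rollout_return(s, horizon=50):
    """Sample one truncated discounted return starting from s."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        g += discount * R[s]
        discount *= gamma
        s = random.choices(list(P[s]), weights=list(P[s].values()))[0]
    return g

def estimate_value(s, episodes=2000):
    """v(s) ~ average discounted return over many episodes."""
    return sum(rollout_return(s) for _ in range(episodes)) / episodes
```

With rewards bounded by 1 and gamma = 0.9, every estimate lies between 0 and 1/(1 − 0.9) = 10.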
SLIDE 13 Student MRP Returns
SLIDE 14
State-Value Function for Student MRP (1)
SLIDE 15
State-Value Function for Student MRP (2)
SLIDE 16
State-Value Function for Student MRP (3)
SLIDE 17 Bellman Equation for MRPs
- The value function can be decomposed into two parts:
− immediate reward Rt+1
− discounted value of the next state γ v(St+1)
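Because the Bellman equation v = R + γPv is linear, a small MRP can be solved for v directly as v = (I − γP)⁻¹R. A sketch, where the two-state chain and rewards are illustrative (not the Student MRP):

```python
import numpy as np

# Illustrative MRP: transition matrix P, reward vector R, discount gamma.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
R = np.array([1.0, -1.0])
gamma = 0.9

# Solve the linear Bellman equation (I - gamma*P) v = R for v.
v = np.linalg.solve(np.eye(2) - gamma * P, R)
# v now satisfies v = R + gamma * P @ v
```

The direct solve is O(n³) in the number of states, so for large MRPs iterative methods are used instead.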
SLIDE 18
Backup Diagram for Bellman Equation
SLIDE 19
Calculating the Student MRP using the Bellman Equation
SLIDE 20 Markov Decision Process
- A Markov decision process (MDP) is a Markov reward process with decisions.
SLIDE 21
Student MDP with Actions
SLIDE 22 Policy
- MDP policies depend only on the current state, i.e. they are stationary (time-independent)
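A stationary policy π(a|s) is just a per-state distribution over actions, so sampling an action needs only the current state. The states, actions, and probabilities below are illustrative:

```python
import random

# Illustrative stationary policy pi(a|s): the action distribution
# depends only on the current state.
policy = {
    "Class": {"Study": 0.7, "Facebook": 0.3},
    "Pub":   {"Drink": 0.5, "Leave": 0.5},
}

def sample_action(pi, s):
    """Sample an action from pi(. | s)."""
    actions = list(pi[s])
    probs = [pi[s][a] for a in actions]
    return random.choices(actions, weights=probs)[0]
```

Fixing a policy π reduces the MDP back to an MRP, which is why the MRP machinery above carries over.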
SLIDE 23
Policies
SLIDE 24
Value Function
SLIDE 25
State-Value Function for Student MDP
SLIDE 26
Backup Diagram for vπ and qπ
SLIDE 27
SLIDE 28
SLIDE 29
Bellman Expectation Equation for Student MDP
SLIDE 30
Optimal Value Function
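The optimal value function satisfies the Bellman optimality equation v*(s) = maxₐ [R(s,a) + γ Σ_s' P(s'|s,a) v*(s')], which can be computed by value iteration. A sketch on a tiny illustrative MDP (2 states, 2 actions, made-up dynamics):

```python
import numpy as np

# Illustrative MDP: P[a][s][s'] transition tensor, R[a][s] rewards.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

v = np.zeros(2)
for _ in range(500):
    # Bellman optimality backup: max over actions of R(s,a) + gamma * E[v(s')]
    v = np.max(R + gamma * P @ v, axis=0)
```

Unlike the expectation equation, the max makes this system non-linear, so there is no closed-form solve; iteration converges because the backup is a γ-contraction.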
SLIDE 31
Optimal Value Function for Student MDP
SLIDE 32
Optimal Action-Value Function for Student MDP
SLIDE 33 Reference
- David Silver, Lecture 2: Markov Decision Processes, Reinforcement Learning
(https://www.youtube.com/watch?v=lfHX2hHRMVQ&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ&index=2)
- Chapter 3, Richard S. Sutton and Andrew G. Barto, “Reinforcement Learning: An
Introduction,” 2nd edition, Nov. 2018