SLIDE 1

Finite Markov Decision Processes (MDP)

  • Prof. Kuan-Ting Lai

2020/3/20

SLIDE 2

Markov Decision Process (MDP)

https://en.wikipedia.org/wiki/Markov_decision_process

SLIDE 3

Markov Property

  • The current state captures all relevant information from past states
  • i.e. the process is memoryless
  • Let bygones be bygones (formalized in the equation below)
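
In symbols (the standard statement of the property, added here for completeness): a state St is Markov if and only if

    P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]
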
SLIDE 4

Markov Process

  • A Markov process is a memoryless random process, i.e. a sequence of random states S1, S2, … with the Markov property
  • The transition probability P(s, s’) is the probability of moving from state s to state s’
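
Written out (standard notation): for a current state s and a successor state s',

    P_{ss'} = P[S_{t+1} = s' | S_t = s]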

SLIDE 5

Student Markov Chain

SLIDE 6

Student Markov Chain Episodes
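
A minimal Python sketch of sampling episodes from a Markov chain. The transition probabilities below follow the student chain from Silver's Lecture 2, which this deck is based on; the state names and the sample_episode helper are my own illustration:

    import random

    # Student Markov chain (transition probabilities from Silver's Lecture 2)
    transitions = {
        "Class1":   [("Class2", 0.5), ("Facebook", 0.5)],
        "Class2":   [("Class3", 0.8), ("Sleep", 0.2)],
        "Class3":   [("Pass", 0.6), ("Pub", 0.4)],
        "Facebook": [("Facebook", 0.9), ("Class1", 0.1)],
        "Pass":     [("Sleep", 1.0)],
        "Pub":      [("Class1", 0.2), ("Class2", 0.4), ("Class3", 0.4)],
        "Sleep":    [],  # terminal state: no outgoing transitions
    }

    def sample_episode(start="Class1"):
        """Follow the chain from `start` until a terminal state is reached."""
        state, episode = start, [start]
        while transitions[state]:
            successors, probs = zip(*transitions[state])
            state = random.choices(successors, weights=probs)[0]
            episode.append(state)
        return episode

    print(sample_episode())  # e.g. ['Class1', 'Class2', 'Class3', 'Pass', 'Sleep']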

SLIDE 7

Example: Student Markov Chain Transition Matrix
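
For reference (this slide is a figure in the original deck): a state transition matrix collects the probabilities P_{ss'} for every pair of states, so each row is a probability distribution:

    P = [ P_11 ... P_1n ]
        [  :         :  ]
        [ P_n1 ... P_nn ],   with Σ_{s'} P_{ss'} = 1 for every row s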

SLIDE 8

Adding Reward to Markov Process

  • A Markov reward process is a Markov chain with values.
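
Formally (the standard definition): a Markov reward process is a tuple ⟨S, P, R, γ⟩, where S is the state set, P the transition matrix, R the reward function, and γ the discount factor:

    R_s = E[R_{t+1} | S_t = s],    γ ∈ [0, 1]
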
SLIDE 9

Student MRP

SLIDE 10

Discounted Future Return Gt

  • The discount γ ∈ [0, 1] is the present value of future rewards
    − γ close to 0 leads to “short-sighted” evaluation
    − γ close to 1 leads to “far-sighted” evaluation
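
The return Gt itself (standard definition) is the total discounted reward from time step t onward:

    G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}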

SLIDE 11

Why add a discount factor γ?

  • Uncertainty about the future
  • Avoids infinite returns in cyclic Markov processes
  • Animal/human behaviour shows preference for immediate reward
SLIDE 12

Value Function

  • The value function v(s) estimates the long-term value of state s
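
In equation form (standard MRP definition): the value of state s is the expected return starting from s:

    v(s) = E[G_t | S_t = s]
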
SLIDE 13

Student MRP Returns

  • γ = 1/2
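
As a worked example with γ = 1/2, take the episode Class1, Class2, Class3, Pass, Sleep (one of the sample episodes on Silver's slides), with rewards −2, −2, −2, +10:

    G_1 = −2 + (1/2)(−2) + (1/4)(−2) + (1/8)(10)
        = −2 − 1 − 0.5 + 1.25 = −2.25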

SLIDE 14

State-Value Function for Student MRP (1)

SLIDE 15

State-Value Function for Student MRP (2)

SLIDE 16

State-Value Function for Student MRP (3)

SLIDE 17

Bellman Equation for MRPs

  • The value function can be decomposed into two parts:
    − the immediate reward Rt+1
    − the discounted value of the successor state γ v(St+1)
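
Putting the two parts together gives the Bellman equation for MRPs:

    v(s) = E[R_{t+1} + γ v(S_{t+1}) | S_t = s]
         = R_s + γ Σ_{s'} P_{ss'} v(s')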

SLIDE 18

Backup Diagram for Bellman Equation

SLIDE 19

Calculating the Student MRP Using the Bellman Equation
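
Because the Bellman equation is linear, v = R + γPv can be solved in closed form as v = (I − γP)⁻¹R. A short NumPy sketch for the student MRP; the transition probabilities and rewards follow Silver's example, while the state ordering and γ = 0.9 are my own choices for the illustration:

    import numpy as np

    # State order: C1, C2, C3, Pass, Pub, Facebook, Sleep
    P = np.array([
        [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],  # Class1
        [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],  # Class2
        [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],  # Class3
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Pass
        [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],  # Pub
        [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],  # Facebook
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Sleep (terminal, modeled as a self-loop)
    ])
    R = np.array([-2.0, -2.0, -2.0, 10.0, 1.0, -1.0, 0.0])
    gamma = 0.9

    # Bellman equation in matrix form: v = R + gamma * P v  =>  (I - gamma*P) v = R
    v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
    for name, value in zip(["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"], v):
        print(f"{name:>5}: {value:6.2f}")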

SLIDE 20

Markov Decision Process

  • A Markov decision process (MDP) is a Markov reward process with decisions.
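
Formally (standard definition): an MDP is a tuple ⟨S, A, P, R, γ⟩, where the transitions and rewards now also depend on the chosen action a:

    P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]
    R^a_s = E[R_{t+1} | S_t = s, A_t = a]
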
SLIDE 21

Student MDP with Actions

SLIDE 22

Policy

  • MDP policies depend only on the current state, i.e. they are stationary
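
A policy π is a distribution over actions given states (standard definition):

    π(a|s) = P[A_t = a | S_t = s]
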
SLIDE 23

Policies

SLIDE 24

Value Function
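
Two value functions are defined for a fixed policy π (standard definitions): the state-value function vπ(s) is the expected return from s when following π, and the action-value function qπ(s, a) additionally fixes the first action:

    v_π(s) = E_π[G_t | S_t = s]
    q_π(s, a) = E_π[G_t | S_t = s, A_t = a]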

SLIDE 25

State-Value Function for Student MDP

SLIDE 26

Backup Diagram for vπ and qπ
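
These backup diagrams encode the one-step look-ahead relations between vπ and qπ, i.e. the Bellman expectation equations:

    v_π(s) = Σ_a π(a|s) q_π(s, a)
    q_π(s, a) = R^a_s + γ Σ_{s'} P^a_{ss'} v_π(s')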

SLIDE 27

SLIDE 28

SLIDE 29

Bellman Expectation Equation for Student MDP

SLIDE 30

Optimal Value Function
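
In symbols (standard definitions): the optimal value functions maximize over all policies, and v* satisfies the Bellman optimality equation:

    v_*(s) = max_π v_π(s),    q_*(s, a) = max_π q_π(s, a)
    v_*(s) = max_a ( R^a_s + γ Σ_{s'} P^a_{ss'} v_*(s') )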

SLIDE 31

Optimal Value Function for Student MDP

SLIDE 32

Optimal Action-Value Function for Student MDP

SLIDE 33

Reference

  • David Silver, “Lecture 2: Markov Decision Processes,” Reinforcement Learning (https://www.youtube.com/watch?v=lfHX2hHRMVQ&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ&index=2)
  • Richard S. Sutton and Andrew G. Barto, “Reinforcement Learning: An Introduction,” 2nd edition, Nov. 2018, Chapter 3