SLIDE 1

Reinforcement Learning and Dynamic Programming

Talk 5 by Daniela and Christoph

SLIDE 2

Content

Reinforcement Learning Problem

  • Agent-Environment Interface
  • Markov Decision Processes
  • Value Functions
  • Bellman equations

Dynamic Programming

  • Policy Evaluation, Improvement and Iteration
  • Asynchronous DP
  • Generalized Policy Iteration
SLIDE 3

Reinforcement Learning Problem

  • Learning from interactions
  • Achieving a goal
SLIDE 4

Example robot

[Figure: grid of states 1-16 and the available actions]

Reward is -1 for all transitions, except for the last transition. The reward for the last transition is 2.

SLIDE 5

Agent-Environment Interface

Agent

  • Learner
  • Decision maker

Environment

  • Everything outside of the agent

[Figure: agent-environment interaction loop and the grid of states 1-16]
SLIDE 6

Interaction

  • State: $S_t \in \mathcal{S}$
  • Reward: $R_t \in \mathbb{R}$
  • Action: $A_t \in \mathcal{A}(S_t)$

Discrete time steps

  • $t = 0, 1, 2, 3, \dots$

[Figure: agent-environment loop with the signals $S_t$, $R_t$, $A_t$; rewards are -1 or 2]
SLIDE 7

Example Robot

[Figure: agent-environment loop; grid of states 1-6]

$S_0 = 1$
SLIDE 8

Example Robot

[Figure: agent-environment loop; the transition yields reward -1]

$S_1 = 2$
SLIDE 9

Example Robot

[Figure: agent-environment loop; the transition yields reward -1]

$S_2 = 5$
SLIDE 10

Example Robot

[Figure: agent-environment loop; the transition yields reward -1]

$S_3 = 5$
SLIDE 11

Example Robot

[Figure: agent-environment loop; the final transition into the terminal state yields reward 2]

$S_4 = 6$
SLIDE 12

Policy

  • In each state, the agent can choose between different actions. The probability distribution according to which the agent selects actions is called the policy.
  • $\pi_t(a \mid s)$: probability that $A_t = a$ if $S_t = s$
  • In reinforcement learning, the agent changes the policy as a result of its experience

$\pi_t(\text{up} \mid s_i) = 0.25 \qquad \pi_t(\text{left} \mid s_i) = 0.25$
$\pi_t(\text{down} \mid s_i) = 0.25 \qquad \pi_t(\text{right} \mid s_i) = 0.25$
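A minimal sketch, not from the slides, of how such an equiprobable random policy could be represented in code (the function and action names are illustrative assumptions):

```python
# Equiprobable random policy: in every state, each of the four actions
# is selected with probability 0.25.
ACTIONS = ["up", "down", "left", "right"]

def random_policy(state):
    """Return {action: probability} for the given state."""
    return {a: 1.0 / len(ACTIONS) for a in ACTIONS}

print(random_policy(1))  # {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25}
```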

SLIDE 13

Example Robot: Diagram

[Figure: state-transition diagram of the 6-state robot under the random policy, with transition probabilities 0.25 and 0.5]
SLIDE 14

Reward signal

  • Goal: Maximizing the total amount of

cumulative reward over the long run

[Figure: state-transition diagram of the 6-state robot, annotated with rewards: -1 for every transition, 2 for the transition into the terminal state]
SLIDE 15

Return

Sum of the rewards

  • $G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T$, where $T$ is a final time step

Maximize the expected return

Example episodes starting at $t = 0$ in state 1:

  • $G_0 = -1 - 1 + 2 = 0$
  • $G_0 = -1 - 1 - 1 - 1 + 2 = -2$
SLIDE 16

Discounting

  • If the task is a continuing task, a discount rate for the return is

needed

Discount rate determines the present value of the future rewards in a continuing task

  • $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$,

    where $\gamma$ is called the discount rate: $0 \le \gamma \le 1$

Unified notation (covers episodic and continuing tasks): $G_t = \sum_{k=0}^{T} \gamma^k R_{t+k+1}$, with either $T = \infty$ or $\gamma = 1$ (but not both).
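As a small illustration (not from the slides), a discounted return can be computed directly from a list of rewards:

```python
def discounted_return(rewards, gamma=1.0):
    """G_t = sum_k gamma**k * R_{t+k+1} for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Episode from the robot example: two -1 steps, then the final reward 2.
print(discounted_return([-1, -1, 2], gamma=1.0))  # 0
print(discounted_return([-1, -1, 2], gamma=0.9))  # -0.28
```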

SLIDE 17

The Markov Property

  • $\Pr\{R_{t+1} = r, S_{t+1} = s' \mid S_0, A_0, R_1, \dots, S_{t-1}, A_{t-1}, R_t, S_t, A_t\} = \Pr\{R_{t+1} = r, S_{t+1} = s' \mid S_t, A_t\}$

  • The state signal summarizes past sensations compactly such that all relevant information is retained
  • Decisions are assumed to be a function of the current state only
SLIDE 18

Markov Decision Processes

A task that satisfies the Markov property is called a Markov decision process (MDP).

  • If the state and action spaces are finite, it is called a finite Markov decision process

  • Given any state and action, $s$ and $a$, the probability of each possible next state and reward, $s'$, $r$, is:

    $p(s', r \mid s, a) = \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a\}$
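A minimal sketch of one possible way to store the dynamics $p(s', r \mid s, a)$ of a small finite MDP in Python; the entries shown are taken from the robot example on the following slides, while the data layout itself is an assumption:

```python
# p[(s, a)] lists (probability, next_state, reward) triples.
p = {
    (1, "right"): [(1.0, 2, -1)],
    (1, "down"):  [(1.0, 4, -1)],
    (5, "right"): [(1.0, 6,  2)],   # transition into the terminal state
    # ... remaining (state, action) pairs of the example
}

def transition_prob(s_next, r, s, a):
    """Return p(s', r | s, a)."""
    return sum(prob for prob, sn, rew in p.get((s, a), [])
               if sn == s_next and rew == r)

print(transition_prob(2, -1, 1, "right"))  # 1.0
```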

SLIDE 19

Example robot

$p(s', r \mid s, a) = \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a\}$

$p(2, -1 \mid 1, \text{right}) = 1$
$p(4, -1 \mid 1, \text{down}) = 1$
$p(4, -1 \mid 1, \text{up}) = 0$

[Figure: transition diagram of the 6-state robot with rewards -1 and 2]
SLIDE 20

Markov Decision Processes

  • Given any current state and action, $s$ and $a$, together with any next state, $s'$, the expected value of the next reward is:

    $r(s, a, s') = \mathbb{E}\big[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'\big]$
SLIDE 21

Example robot

$r(s, a, s') = \mathbb{E}\big[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'\big]$

$r(1, \text{right}, 2) = -1$
$r(1, \text{down}, 4) = -1$
$r(5, \text{right}, 6) = 2$

[Figure: transition diagram of the 6-state robot with rewards -1 and 2]
SLIDE 22

Value functions

  • Value functions estimate how good it is for the agent to be in

a given state (state-value function) or how good it is to perform a certain action in a given state (action-value function)

  • Value functions are defined with respect to particular policies
  • The value of a state $s$ under a policy $\pi$ is the expected return when starting in $s$ and following $\pi$ thereafter:

    $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$

  • $v_\pi$ is called the state-value function for policy $\pi$
SLIDE 23

State-value function

SLIDE 24

Property of state-value function

  • Bellman equation for vπ
  • Expresses a relationship between the value of a state and

the value of its successor states

$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v_\pi(s')\big]$
SLIDE 25

Example state-value function

$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v_\pi(s')\big]$

[Figure: 3-state robot: from state 1, three actions lead back to 1 and one to state 2; from state 2, two actions lead back to 2, one to 1, and one (reward 2) to the terminal state 3]

With $\gamma = 1$:

$v_\pi(1) = 3 \cdot 0.25 \cdot \big({-1} + v_\pi(1)\big) + 0.25 \cdot \big({-1} + v_\pi(2)\big)$
$v_\pi(2) = 2 \cdot 0.25 \cdot \big({-1} + v_\pi(2)\big) + 0.25 \cdot \big({-1} + v_\pi(1)\big) + 0.25 \cdot \big(2 + v_\pi(3)\big)$
$v_\pi(3) = 0$

$v_\pi(1) = -9, \quad v_\pi(2) = -5, \quad v_\pi(3) = 0$
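The two linear equations can be checked mechanically. A minimal sketch (not part of the slides) that rearranges them and solves the system with NumPy:

```python
import numpy as np

# With gamma = 1 and v_pi(3) = 0, the two equations rearrange to
#    0.25*v1 - 0.25*v2 = -1.00
#   -0.25*v1 + 0.50*v2 = -0.25
A = np.array([[ 0.25, -0.25],
              [-0.25,  0.50]])
b = np.array([-1.00, -0.25])
v1, v2 = np.linalg.solve(A, b)
print(v1, v2)  # -9.0 -5.0
```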

SLIDE 26

Action-value function

  • The expected return when taking action $a$ in state $s$ and thereafter following policy $\pi$

  • $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
  • $q_\pi$ is called the action-value function for policy $\pi$
SLIDE 27

Optimal policy

A policy $\pi$ is better than or equal to a policy $\pi'$ if its state-value function is greater than or equal to that of $\pi'$ in every state:

  • $\pi \ge \pi'$ if and only if $v_\pi(s) \ge v_{\pi'}(s)$ for all $s \in S$

Optimal state-value function

  • $v_*(s) = \max_\pi v_\pi(s)$, for all $s \in S$

Optimal action-value function

  • $q_*(s, a) = \max_\pi q_\pi(s, a)$, for all $s \in S$ and $a \in A(s)$
SLIDE 28

Bellman optimality equation

  • Without a reference to any specific policy

Bellman optimality equation for v*

  • $v_*(s) = \max_{a \in A(s)} \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v_*(s')\big]$
SLIDE 29

Bellman optimality equation for v*

$v_*(s) = \max_{a \in A(s)} \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v_*(s')\big]$

[Figure: the 3-state robot example again, states 1, 2, 3 with state 3 terminal]

With $\gamma = 1$ (the terms inside each max correspond to the actions up, down, left, right):

$v_*(1) = \max\big\{\,1 \cdot ({-1} + v_*(1)),\; 1 \cdot ({-1} + v_*(1)),\; 1 \cdot ({-1} + v_*(1)),\; 1 \cdot ({-1} + v_*(2))\,\big\}$
$v_*(2) = \max\big\{\,1 \cdot ({-1} + v_*(2)),\; 1 \cdot ({-1} + v_*(2)),\; 1 \cdot ({-1} + v_*(1)),\; 1 \cdot (2 + v_*(3))\,\big\}$
$v_*(3) = 0$

$v_*(1) = {?}, \quad v_*(2) = {?}$
SLIDE 30

Bellman optimality equation

Bellman optimality equation for q*

  • $q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma \max_{a'} q_*(s', a')\big]$
SLIDE 31

Bellman optimality equation

  • System of nonlinear equations, one for each state
  • N states: there are N equations and N unknowns
  • If we know $p(s', r \mid s, a)$ and $r(s, a, s')$, then in principle one can solve this system of equations
  • If we have $v_*$, it is relatively easy to determine an optimal policy (see the sketch below)

[Figure: optimal state values $v_*$ and a corresponding optimal policy $\pi_*$ for a small gridworld]
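To make the last point concrete, here is a minimal sketch (not from the slides) of extracting a greedy, and hence optimal, policy from $v_*$ by one-step lookahead; the dynamics `p` are assumed to be stored as (probability, next_state, reward) triples as in the earlier sketch:

```python
def greedy_policy(v, p, states, actions, gamma=1.0):
    """Pick, in each state, the action maximizing sum_{s',r} p * [r + gamma * v(s')]."""
    policy = {}
    for s in states:
        def backup(a):
            return sum(prob * (r + gamma * v.get(s_next, 0.0))
                       for prob, s_next, r in p.get((s, a), []))
        policy[s] = max(actions, key=backup)
    return policy
```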

SLIDE 32

Assumptions for solving the Bellman optimality equation

  • Markov property
  • We know the dynamics of the environment
  • We have enough computational resources to complete the

computation of the solution

  • Problem: Long computational time
  • Solution: Dynamic programming
SLIDE 33

Dynamic Programming

SLIDE 34

Dynamic Programming

Collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process

Problem of classic DP algorithms: they are only of limited utility in reinforcement learning:

  • Assumption of perfect model
  • Great computational expense
SLIDE 35

Key Idea of Dynamic Programming

Goal: Find an optimal policy

Problem: Solve the Bellman optimality equation

$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v_*(s')\big]$

Solution methods:

  • Direct search
  • Linear programming
  • Dynamic programming
SLIDE 36

Key Idea of Dynamic Programming

Key idea of DP (and of reinforcement learning in general): use value functions to organize and structure the search for good policies.

Dynamic programming approach: introduce two concepts:

  • Policy evaluation
  • Policy improvement

Use those concepts to get an optimal policy

SLIDE 37

Assumptions

We always assume that the environment is a finite MDP, i.e:

  • The state, action, and reward sets $S$, $A(s)$, and $R$, for $s \in S$, are finite
  • The dynamics are given by a set of probabilities $p(s', r \mid s, a)$, for all $s \in S$, $a \in A(s)$, $r \in R$, and $s' \in S^+$ ($S^+$ is $S$ plus a terminal state if the problem is episodic)

SLIDE 38

Policy Evaluation

How to compute the state-value function $v_\pi$ for an arbitrary policy $\pi$? Recall the Bellman equation:

$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v_\pi(s')\big]$

Existence and uniqueness of $v_\pi$ are guaranteed if:

  • Either $\gamma < 1$, or
  • Eventual termination is guaranteed from all states under policy $\pi$
SLIDE 39

Iterative Policy Evaluation

Consider iterative solution methods for the Bellman equation: take a sequence of approximate value functions $v_0, v_1, v_2, \dots$, each mapping $S^+$ to $\mathbb{R}$. The initial approximation, $v_0$, is chosen arbitrarily (except that terminal states, if any, must be given value 0). Subsequently, use the Bellman equation for $v_\pi$ as an update rule:

$v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v_k(s')\big]$

for all $s \in S$.
SLIDE 40

Iterative Policy Evaluation

$v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v_k(s')\big]$

Convergence: one can show that the sequence $\{v_k\}$ converges to $v_\pi$ as $k \to \infty$ under the same conditions that guarantee the existence of $v_\pi$, i.e.

  • Either $\gamma < 1$, or
  • Eventual termination is guaranteed from all states under policy $\pi$
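A minimal Python sketch of this update rule (not from the slides), using synchronous sweeps and the same assumed data layout as before: `policy(s)` returns action probabilities, `p[(s, a)]` returns (probability, next_state, reward) triples.

```python
def policy_evaluation(policy, p, states, gamma=1.0, theta=1e-6):
    """Iterative policy evaluation; stops when the largest update is below theta.

    Terminal states are simply left out of `states`, so their value stays 0.
    """
    v = {s: 0.0 for s in states}
    while True:
        v_new = {}
        for s in states:
            v_new[s] = sum(pi_a * prob * (r + gamma * v.get(s_next, 0.0))
                           for a, pi_a in policy(s).items()
                           for prob, s_next, r in p.get((s, a), []))
        delta = max(abs(v_new[s] - v[s]) for s in states)
        v = v_new
        if delta < theta:
            return v
```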

SLIDE 41

Consider the robot example. Goal: reach the top left or the bottom right corner (the nonterminal states are S = {2, 3, ..., 15}).

[Figure: 4x4 grid of states 1-16, with the four actions up/down/left/right]

Reward is -1 for all transitions.

Example: Iterative Policy Evaluation

SLIDE 42

Example: Iterative Policy Evaluation

Recall: we can choose the initial approximation arbitrarily (except for the terminal states). Choose $v_0(s) = 0$ for all states $s \in S^+ = \{1, 2, \dots, 16\}$.

$v_0$ for the random policy:

     0.0   0.0   0.0   0.0
     0.0   0.0   0.0   0.0
     0.0   0.0   0.0   0.0
     0.0   0.0   0.0   0.0
SLIDE 43

Example: Iterative Policy Evaluation

Let's calculate $v_1$:

[Figure: 4x4 grid of states 1-16]
SLIDE 44

Example: Iterative Policy Evaluation

Let's calculate $v_1$ for $s = 6$:

$v_1(6) = \sum_{a \in \{u,d,l,r\}} \pi(a \mid 6) \sum_{s'} p(s' \mid 6, a)\,\big[r + \gamma v_0(s')\big]$

With $\pi(a \mid 6) = 0.25$ for all $a$, $r = -1$, and $v_0(s') = 0$ for all $s'$:

$v_1(6) = 0.25 \cdot \big[{-p(2 \mid 6, u)} - p(10 \mid 6, d) - p(5 \mid 6, l) - p(7 \mid 6, r)\big] = 0.25 \cdot ({-1} - 1 - 1 - 1) = -1$

$v_1(6) = -1$. Analogously for all non-terminal states $s \in S$: $v_1(s) = -1$.
SLIDE 45

Example: Iterative Policy Evaluation

For the terminal states 1 and 16 the process terminates, i.e. for $s \in \{1, 16\}$: $p(s' \mid s, a) = 0$ for all $s' \in S$ and $a \in \{u, d, l, r\}$, so $v_k(1) = v_k(16) = 0$ for all $k$.

$v_1$ for the random policy:

     0.0  -1.0  -1.0  -1.0
    -1.0  -1.0  -1.0  -1.0
    -1.0  -1.0  -1.0  -1.0
    -1.0  -1.0  -1.0   0.0
SLIDE 46

Example: Iterative Policy Evaluation

Let's calculate $v_2$ for $s = 6$ (with $\gamma = 1$):

$v_2(6) = \sum_{a \in \{u,d,l,r\}} \pi(a \mid 6) \sum_{s'} p(s' \mid 6, a)\,\big[r + \gamma v_1(s')\big]$

With $\pi(a \mid 6) = 0.25$, $r = -1$, and $v_1(s') = -1$ for $s' \in S$ (0 for the terminal states):

$v_2(6) = 0.25 \cdot \big[p(2 \mid 6, u)({-1} - \gamma) + p(10 \mid 6, d)({-1} - \gamma) + p(5 \mid 6, l)({-1} - \gamma) + p(7 \mid 6, r)({-1} - \gamma)\big] = 0.25 \cdot ({-2} - 2 - 2 - 2) = -2$

$v_2(6) = -2$
SLIDE 47

Example: Iterative Policy Evaluation

[Figure: 4x4 grid with the relevant states highlighted in red]

Analogously, we get $v_2(s) = -2$ for all red colored states $s$.
SLIDE 48

Example: Iterative Policy Evaluation

Let's calculate $v_2$:

[Figure: 4x4 grid of states 1-16]
SLIDE 49

Example: Iterative Policy Evaluation

Let's calculate $v_2$ for $s = 2$ (with $\gamma = 1$):

$v_2(2) = \sum_{a \in \{u,d,l,r\}} \pi(a \mid 2) \sum_{s'} p(s' \mid 2, a)\,\big[r + \gamma v_1(s')\big]$

$= 0.25 \cdot \big[p(2 \mid 2, u)({-1} - \gamma) + p(6 \mid 2, d)({-1} - \gamma) + p(1 \mid 2, l)({-1} - \gamma \cdot 0) + p(3 \mid 2, r)({-1} - \gamma)\big] = 0.25 \cdot ({-2} - 2 - 1 - 2) = -1.75$

$v_2(2) = -1.75$
SLIDE 50

Example: Iterative Policy Evaluation

[Figure: 4x4 grid with the relevant states highlighted in blue]

Analogously, we get $v_2(s) = -1.75$ for all blue colored states $s$.
SLIDE 51

Example: Iterative Policy Evaluation

$v_2$ for the random policy:

     0.0  -1.7  -2.0  -2.0
    -1.7  -2.0  -2.0  -2.0
    -2.0  -2.0  -2.0  -1.7
    -2.0  -2.0  -1.7   0.0
SLIDE 52

Example: Iterative Policy Evaluation

$v_k$ for the random policy:

$k = 0$:

     0.0   0.0   0.0   0.0
     0.0   0.0   0.0   0.0
     0.0   0.0   0.0   0.0
     0.0   0.0   0.0   0.0

$k = 1$:

     0.0  -1.0  -1.0  -1.0
    -1.0  -1.0  -1.0  -1.0
    -1.0  -1.0  -1.0  -1.0
    -1.0  -1.0  -1.0   0.0

$k = 2$:

     0.0  -1.7  -2.0  -2.0
    -1.7  -2.0  -2.0  -2.0
    -2.0  -2.0  -2.0  -1.7
    -2.0  -2.0  -1.7   0.0

$k = 3$:

     0.0  -2.4  -2.9  -3.0
    -2.4  -2.9  -3.0  -2.9
    -2.9  -3.0  -2.9  -2.4
    -3.0  -2.9  -2.4   0.0

$k = 10$:

     0.0  -6.1  -8.4  -9.0
    -6.1  -7.7  -8.4  -8.4
    -8.4  -8.4  -7.7  -6.1
    -9.0  -8.4  -6.1   0.0

$k = \infty$ (this is $v_\pi$):

     0.0  -14.  -20.  -22.
    -14.  -18.  -20.  -20.
    -20.  -20.  -18.  -14.
    -22.  -20.  -14.   0.0
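The whole table can be reproduced with a short script. This is a sketch under the stated assumptions (4x4 grid, states 1-16, terminal states 1 and 16, reward -1 per step, equiprobable random policy, gamma = 1); it is not part of the original slides:

```python
# Iterative policy evaluation on the 4x4 gridworld.
GRID = 4
TERMINALS = {1, 16}
STATES = [s for s in range(1, GRID * GRID + 1) if s not in TERMINALS]

def step(s, a):
    """Deterministic next state; moves off the grid leave the agent in place."""
    row, col = divmod(s - 1, GRID)
    if a == "up":    row = max(row - 1, 0)
    if a == "down":  row = min(row + 1, GRID - 1)
    if a == "left":  col = max(col - 1, 0)
    if a == "right": col = min(col + 1, GRID - 1)
    return row * GRID + col + 1

v = {s: 0.0 for s in range(1, GRID * GRID + 1)}
for sweep in range(1000):                       # plenty of sweeps to converge
    v = {s: (sum(0.25 * (-1 + v[step(s, a)])
                 for a in ("up", "down", "left", "right"))
             if s in STATES else 0.0)
         for s in v}

print([round(v[s]) for s in range(1, 17)])
# [0, -14, -20, -22, -14, -18, -20, -20, -20, -20, -18, -14, -22, -20, -14, 0]
```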

SLIDE 53

Policy Evaluation

Reason for computing the value function $v_\pi$ for a policy $\pi$: finding better policies → policy improvement

SLIDE 54
  • Suppose we have determined the value function $v_\pi$ for an arbitrary deterministic policy $\pi$
  • Should we change the policy to deterministically choose an action $a \ne \pi(s)$ for some state $s$?
  • What we know: how good it is to follow the current policy from $s$, namely $v_\pi(s)$
  • What we want to know: would it be better or worse to change to the new policy?

Policy Improvement

SLIDE 55

Would it be better or worse to change to the new policy? (New policy: for some $s$ choose action $a \ne \pi(s)$.)

Consider selecting $a$ in $s$ and thereafter following the existing policy $\pi$. The value of this way of behaving is:

$q_\pi(s, a) = \mathbb{E}_\pi\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a\big] = \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v_\pi(s')\big]$

If this is greater than $v_\pi(s)$, i.e., if it is better to select $a$ once in $s$ and thereafter follow $\pi$ than it would be to follow $\pi$ all the time, then we would expect the new policy to be better than $\pi$.

Policy Improvement

SLIDE 56

Policy Improvement Theorem

Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all $s \in S$,

$q_\pi(s, \pi'(s)) \ge v_\pi(s)$.   (1)

Then the policy $\pi'$ must be as good as, or better than, $\pi$. That is, it must obtain greater or equal expected return from all states $s \in S$:

$v_{\pi'}(s) \ge v_\pi(s)$.   (2)

Moreover, if there is strict inequality in (1) at any state, then there must be strict inequality in (2) at at least one state.

SLIDE 57

For the situation before:

  • Suppose we have a deterministic policy $\pi$, and a new policy $\pi'$ that equals $\pi$ except for one state $s$, for which $\pi'(s) = a \ne \pi(s)$
  • Suppose $q_\pi(s, a) \ge v_\pi(s)$, i.e. (1) is satisfied

By the policy improvement theorem, $\pi'$ is as good as, or better than, $\pi$.

Policy Improvement

SLIDE 58

Claim: $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ for all $s$ (1) implies $v_{\pi'}(s) \ge v_\pi(s)$ for all $s$.

Policy Improvement Theorem: Proof

SLIDE 59

Proof:

$v_\pi(s) \le q_\pi(s, \pi'(s))$
$= \mathbb{E}_\pi\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = \pi'(s)\big]$
$= \mathbb{E}_{\pi'}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\big]$
$\le \mathbb{E}_{\pi'}\big[R_{t+1} + \gamma\, q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s\big]$   (by (1))
$= \mathbb{E}_{\pi'}\big[R_{t+1} + \gamma\, \mathbb{E}_{\pi'}[R_{t+2} + \gamma v_\pi(S_{t+2})] \mid S_t = s\big]$
$= \mathbb{E}_{\pi'}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t = s\big]$
$\le \mathbb{E}_{\pi'}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, \pi'(S_{t+2})) \mid S_t = s\big]$   (by (1))
$= \mathbb{E}_{\pi'}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 v_\pi(S_{t+3}) \mid S_t = s\big]$
$\;\;\vdots$
$\le \mathbb{E}_{\pi'}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \dots \mid S_t = s\big] = \mathbb{E}_{\pi'}[G_t \mid S_t = s] = v_{\pi'}(s)$

SLIDE 60
  • What we have seen: given a (deterministic) policy and its value function, we can easily evaluate a change in the policy at a single state
  • What if we allow changes at all states?

For a given (deterministic) policy $\pi$, select at each state $s \in S$ the action that appears best according to $q_\pi(s, a)$, i.e., consider the new greedy policy $\pi'$, given by

$\pi'(s) = \arg\max_a q_\pi(s, a)$   (3)

This takes the action that looks best in the short term, after one step of lookahead, according to $v_\pi$.

Policy Improvement
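A minimal sketch of this greedy improvement step (not from the slides), again over the assumed `p` dynamics dictionary; it also reports whether the policy actually changed, which is useful for the policy iteration loop later:

```python
def policy_improvement(policy, v, p, states, actions, gamma=1.0):
    """Make the policy greedy with respect to v; report whether it is unchanged."""
    new_policy = {}
    policy_stable = True
    for s in states:
        q = {a: sum(prob * (r + gamma * v.get(s_next, 0.0))
                    for prob, s_next, r in p.get((s, a), []))
             for a in actions}
        new_policy[s] = max(q, key=q.get)
        if new_policy[s] != policy.get(s):
            policy_stable = False
    return new_policy, policy_stable
```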

SLIDE 61

By construction, the greedy policy $\pi'$ fulfills condition (1), $q_\pi(s, \pi'(s)) \ge v_\pi(s)$, so by the policy improvement theorem the policy $\pi'$ is as good as, or better than, the original policy.

Policy Improvement

SLIDE 62

The process of making a new policy that improves on an original policy, by making it greedy with respect to the value function of the original policy, is called policy improvement.

Policy Improvement

SLIDE 63

Suppose the new greedy policy, $\pi'$, is as good as, but not better than, the old policy $\pi$. Then $v_\pi = v_{\pi'}$, and from (3), $\pi'(s) = \arg\max_a q_\pi(s, a)$, it follows that for all $s \in S$:

$v_{\pi'}(s) = \max_a \mathbb{E}\big[R_{t+1} + \gamma v_{\pi'}(S_{t+1}) \mid S_t = s, A_t = a\big] = \max_a \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v_{\pi'}(s')\big].$

This is the Bellman optimality equation, and therefore $v_{\pi'}$ must be $v_*$, and both $\pi$ and $\pi'$ must be optimal policies.

Policy Improvement

SLIDE 64
  • All the ideas of policy improvement can be extended to stochastic policies. (A stochastic policy $\pi$ specifies probabilities $\pi(a \mid s)$ for taking each action $a$ in each state $s$.)
  • In particular, the policy improvement theorem also holds for stochastic policies, under the natural definition:

$q_\pi(s, \pi'(s)) = \sum_a \pi'(a \mid s)\, q_\pi(s, a)$

Policy Improvement

SLIDE 65

Example: Policy Improvement

Random policy $\pi$ and its value function $v_\pi$:

     0.0  -14.  -20.  -22.
    -14.  -18.  -20.  -20.
    -20.  -20.  -18.  -14.
    -22.  -20.  -14.   0.0

[Figure: policy improvement step, greedy actions with respect to $v_\pi$ drawn state by state]
SLIDE 66

Example: Policy Improvement

[Figure: the same value function $v_\pi$, with greedy actions drawn for further states (policy improvement, step 2)]
SLIDE 67

Example: Policy Improvement

[Figure: the same value function $v_\pi$, with greedy actions drawn for further states (policy improvement, step 3)]
SLIDE 68

Example: Policy Improvement

[Figure: the same value function $v_\pi$, with greedy actions drawn for further states (policy improvement, step 4)]
SLIDE 69

Example: Policy Improvement

[Figure: the same value function $v_\pi$, with greedy actions drawn for all states]
SLIDE 70

Example: Policy Improvement

New policy $\pi'$:

[Figure: 4x4 grid with the greedy actions of $\pi'$ drawn as arrows]
SLIDE 71

Example: Policy Improvement

New policy $\pi'$:

[Figure: 4x4 grid with the greedy actions of $\pi'$ drawn as arrows]

Is $\pi'$ a better policy than the random policy $\pi$?
SLIDE 72

Example: Policy Improvement

Is $\pi'$ a better policy than the random policy $\pi$? Let's calculate $v_{\pi'}$.

In state 2 the new policy takes only the action left, with $p(1 \mid 2, \text{left}) = 1$:

$v_{\pi'}(2) = \sum_{a \in \{l\}} \pi'(a \mid 2) \sum_{s'} p(s' \mid 2, a)\,\big[{-1} + v_{\pi'}(s')\big] = -1 + v_{\pi'}(1) = -1$
SLIDE 73

Example: Policy Improvement

In state 3 the new policy takes only the action left, with $p(2 \mid 3, \text{left}) = 1$:

$v_{\pi'}(3) = \sum_{a \in \{l\}} \pi'(a \mid 3) \sum_{s'} p(s' \mid 3, a)\,\big[{-1} + v_{\pi'}(s')\big] = -1 + v_{\pi'}(2) = -1 - 1 = -2$
SLIDE 74

Example: Policy Improvement

In state 6 the new policy splits equally between left and up, $\pi'(\text{left} \mid 6) = \pi'(\text{up} \mid 6) = 0.5$, and $v_{\pi'}(5) = v_{\pi'}(2) = -1$:

$v_{\pi'}(6) = 0.5 \cdot \big\{p(5 \mid 6, \text{left})\,[{-1} + v_{\pi'}(5)] + p(2 \mid 6, \text{up})\,[{-1} + v_{\pi'}(2)]\big\} = 0.5 \cdot ({-2} - 2) = -2$
SLIDE 75

Example: Policy Improvement

In state 4 the new policy splits equally between left and down, and $v_{\pi'}(3) = v_{\pi'}(8) = -2$:

$v_{\pi'}(4) = 0.5 \cdot \big\{p(3 \mid 4, \text{left})\,[{-1} + v_{\pi'}(3)] + p(8 \mid 4, \text{down})\,[{-1} + v_{\pi'}(8)]\big\} = 0.5 \cdot ({-3} - 3) = -3$
SLIDE 76

Example: Policy Improvement

New policy $\pi'$:

[Figure: 4x4 grid with the greedy actions of $\pi'$ drawn as arrows]

Value function $v_{\pi'}$:

     0.0  -1.0  -2.0  -3.0
    -1.0  -2.0  -3.0  -2.0
    -2.0  -3.0  -2.0  -1.0
    -3.0  -2.0  -1.0   0.0

Value function $v_\pi$:

     0.0  -14.  -20.  -22.
    -14.  -18.  -20.  -20.
    -20.  -20.  -18.  -14.
    -22.  -20.  -14.   0.0

Since $v_\pi(s) \le -14$ for all non-terminal states, and $v_{\pi'}(s) \ge -3$ for all non-terminal states, clearly $v_{\pi'}(s) \ge v_\pi(s)$ for all $s \in S$: $\pi'$ is better than $\pi$.
SLIDE 77

Policy Iteration

Policy iteration is a way of finding an optimal policy: once a policy $\pi$ has been improved using $v_\pi$ to yield a better policy $\pi'$, we can then compute $v_{\pi'}$ and improve it again to yield an even better policy $\pi''$. Thus, we can obtain a sequence of monotonically improving policies and value functions:

$\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \dots \xrightarrow{I} \pi_* \xrightarrow{E} v_*$

where $\xrightarrow{E}$ denotes a policy evaluation and $\xrightarrow{I}$ denotes a policy improvement.
SLIDE 78

Policy Iteration

$\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \dots \xrightarrow{I} \pi_* \xrightarrow{E} v_*$

Because a finite MDP has only a finite number of policies, policy iteration has to converge to an optimal policy and an optimal value function in a finite number of iterations.
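A minimal sketch of the full loop (not from the slides), combining the `policy_evaluation` and `policy_improvement` sketches from above:

```python
def policy_iteration(p, states, actions, gamma=1.0):
    """Alternate evaluation and improvement until the policy no longer changes."""
    policy = {s: actions[0] for s in states}   # arbitrary initial deterministic policy
    while True:
        # Evaluation: a deterministic policy puts all probability on one action.
        # (With gamma = 1 this assumes every policy eventually reaches a terminal state.)
        v = policy_evaluation(lambda s: {policy[s]: 1.0}, p, states, gamma)
        # Improvement: make the policy greedy with respect to v.
        policy, stable = policy_improvement(policy, v, p, states, actions, gamma)
        if stable:
            return policy, v
```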

SLIDE 79

Policy iteration often converges in surprisingly few iterations:

Example: Policy Iteration

[Figure: 4x4 grid of states 1-16]

Take the random policy as $\pi_0$.
SLIDE 80

Policy iteration often converges in surprisingly few iterations:

Example: Policy Iteration

[Figure: 4x4 grid of states 1-16]

$\pi_0 \xrightarrow{E} v_{\pi_0}$:

     0.0  -14.  -20.  -22.
    -14.  -18.  -20.  -20.
    -20.  -20.  -18.  -14.
    -22.  -20.  -14.   0.0

$\xrightarrow{I} \pi_1$

[Figure: the improved policy $\pi_1$ drawn as arrows on the grid]
SLIDE 81

Each of its iterations involves policy evaluation, which is itself an iterative process that may require multiple sweeps through the state set.

Policy Iteration: Drawback

SLIDE 82

In iterative policy evaluation, exact convergence to $v_\pi$ occurs only in the limit. Do we really need exact convergence? No. Value iteration: stop policy evaluation after just one sweep of the state set.

Policy Iteration: Drawback

SLIDE 83

Value iteration can be written as a simple backup operation that combines the policy improvement and truncated policy evaluation steps:

$v_{k+1}(s) = \max_a \mathbb{E}\big[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a\big] = \max_a \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v_k(s')\big]$   (4)

for all $s \in S$. The sequence $\{v_k\}$ converges to $v_*$ under the same assumptions that guarantee the existence of $v_*$, i.e.

  • Either $\gamma < 1$, or
  • Eventual termination is guaranteed from all states under the optimal policy

Value Iteration
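A minimal sketch of the value iteration backup (not from the slides), over the same assumed `p` dictionary; every non-terminal state is expected to have an entry for each action:

```python
def value_iteration(p, states, actions, gamma=1.0, theta=1e-6):
    """Combine a single sweep of evaluation with the max over actions."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(sum(prob * (r + gamma * v.get(s_next, 0.0))
                            for prob, s_next, r in p.get((s, a), []))
                        for a in actions)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v
```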

SLIDE 84

Asynchronous Dynamic Programming

Major drawback of the DP methods discussed so far: they require sweeps over the whole state set. If the state set is very large, a single sweep is already prohibitively expensive. "Solution": asynchronous DP algorithms.

SLIDE 85

Asynchronous DP algorithms:

  • Are iterative DP algorithms that are not organized in terms of

systematic sweeps of the state set.

  • Back up the values of states in any order whatsoever, using whatever values of other states happen to be available.

  • Must continue to back up the values of all the states to

converge correctly (can’t ignore any state after some point in the computation).

Asynchronous Dynamic Programming

SLIDE 86

A version of asynchronous value iteration backs up, on each step $k$, the value of only one state $s_k$, using the value iteration backup (4):

$v_{k+1}(s_k) = \max_a \mathbb{E}\big[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s_k, A_t = a\big]$

If $0 \le \gamma < 1$, convergence to $v_*$ is guaranteed given only that all states occur in the sequence $\{s_k\}$ infinitely often.

Asynchronous Dynamic Programming
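A minimal sketch of such single-state backups (not from the slides); here the states are picked at random, but the order could just as well come from the agent's actual experience:

```python
import random

def async_value_iteration(p, states, actions, gamma=0.9, n_backups=100_000):
    """Back up one state at a time, in any order that visits every state repeatedly."""
    v = {s: 0.0 for s in states}
    for _ in range(n_backups):
        s = random.choice(states)
        v[s] = max(sum(prob * (r + gamma * v.get(s_next, 0.0))
                       for prob, s_next, r in p.get((s, a), []))
                   for a in actions)
    return v
```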

SLIDE 87

Asynchronous algorithms make it easier to intermix computation with real-time interaction: to solve a given MDP, we can run an iterative DP algorithm at the same time that an agent is actually experiencing the MDP. The experience can be used to determine the states to which the DP algorithm applies its backups. At the same time, the latest value and policy information from the algorithm can guide the agent's decision-making.

Asynchronous Dynamic Programming

SLIDE 88

Generalized Policy Iteration

Policy iteration consists of two interacting processes:

  • Policy evaluation: making value function consistent with the

current policy

  • Policy improvement: making the policy greedy w.r.t. the

current value function

SLIDE 89

Generalized policy iteration (GPI) refers to the general idea of letting policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes.

Generalized Policy Iteration

SLIDE 90

Interacting processes: Policy evaluation & policy improvement

  • In policy iteration, these two processes alternate, each completing before the other begins.

  • In value iteration, only one iteration of policy evaluation is

performed in between each policy improvement.

  • In asynchronous DP methods, the evaluation and improvement processes are interleaved at an even finer grain. As long as both processes continue to update all states, the ultimate result is typically the same: convergence to the optimal value function and an optimal policy.

Generalized Policy Iteration

SLIDE 91

Almost all reinforcement learning methods are well described as GPI:

Generalized Policy Iteration

SLIDE 92

It is easy to see that if both the evaluation process and the improvement process stabilize, then the value function and policy must be optimal:

  • Value function stabilizes only when it is consistent with

current policy

  • The policy stabilizes only when it is greedy w.r.t. the current value function

→ Both processes stabilize only when a policy has been found that is greedy w.r.t. its own value function → the Bellman optimality equation holds → the policy and value function are optimal

Generalized Policy Iteration

SLIDE 93

Evaluation and improvement processes in GPI: Both competing and cooperating

Generalized Policy Iteration

Pull in opposing directions Interact to find optimal solution

SLIDE 94

Efficiency of Dynamic Programming

A DP method is guaranteed to find an optimal policy in polynomial time, even though the total number of (deterministic) policies is $m^n$

  • $n$ = number of states
  • $m$ = number of actions

→ DP is exponentially faster than any direct search in policy space could be

SLIDE 95
  • In practice, DP methods can be used with today’s computers

to solve MDPs with millions of states.

  • Both policy and value iteration are widely used, and it is not

clear which, if either, is better in general.

  • In practice, these methods usually converge much faster than

their theoretical worst-case run times.

  • On problems with large state spaces, asynchronous DP

methods are often preferred.

Efficiency of Dynamic Programming

SLIDE 96

Tic Tac Toe with Dynamic Programming
