Reinforcement Learning and Dynamic Programming
Talk 5 by Daniela and Christoph
Content
Reinforcement Learning Problem
- Agent-Environment Interface
- Markov Decision Processes
- Value Functions
- Bellman equations
Dynamic Programming
- Policy Evaluation, Improvement and Iteration
- Asynchronous DP
- Generalized Policy Iteration
Reinforcement Learning Problem
- Learning from interactions
- Achieving a goal
Example robot
(Diagram: 4x4 grid of states 1-16 with the four movement actions)
The reward is -1 for every transition, except the last transition, whose reward is 2.
Agent-Environment Interface
Agent
- Learner
- Decision maker
Environment
- Everything outside of the agent
Interaction
- State: S_t ∈ S
- Reward: R_t ∈ ℝ
- Action: A_t ∈ A(S_t)
Discrete time steps
- t = 0, 1, 2, 3, ...
Example Robot
A sample episode on the six-state grid:
S_0 = 1 → (reward -1) → S_1 = 2 → (reward -1) → S_2 = 5 → (reward -1) → S_3 = 5 → (reward 2) → S_4 = 6
Policy
- In each state, the agent can choose between different actions. The probability that the agent selects a possible action is called the policy.
- π_t(a|s): probability that A_t = a if S_t = s
- In reinforcement learning: the agent changes the
policy as a result of the experience
π_t(up|s_i) = 0.25, π_t(down|s_i) = 0.25, π_t(left|s_i) = 0.25, π_t(right|s_i) = 0.25
Example Robot: Diagram
(Diagram: transition graph over states 1-6 with action probabilities 0.25 / 0.5 on the edges)
Reward signal
- Goal: maximize the total amount of cumulative reward over the long run
(Diagram: transition graph over states 1-6; every transition yields reward -1, except transitions into state 6, which yield reward 2)
Return
Sum of the rewards
- G_t = R_{t+1} + R_{t+2} + R_{t+3} + ... + R_T, where T is the final time step
Maximize the expected return
Example (t = 0): a three-step episode gives G_0 = -1 - 1 + 2 = 0; a five-step episode gives G_0 = -1 - 1 - 1 - 1 + 2 = -2.
Discounting
- If the task is a continuing task, a discount rate for the return is
needed
Discount rate determines the present value of the future rewards in a continuing task
- G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}
  where γ is called the discount rate: 0 ≤ γ ≤ 1
Unified notation: G_t = Σ_{k=0}^{T-t-1} γ^k R_{t+k+1} (allowing T = ∞ or γ = 1, but not both)
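The two return definitions can be checked with a small sketch (the reward sequence is taken from the return example above; the backward recursion G_t = R_{t+1} + γ·G_{t+1} is an implementation choice of this sketch, not from the slides):

```python
def discounted_return(rewards, gamma=1.0):
    """G_t = R_{t+1} + gamma*R_{t+2} + ..., computed backwards
    via the recursion G_t = R_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Episode from the return example: rewards -1, -1, 2.
print(discounted_return([-1, -1, 2], gamma=1.0))  # 0.0 (undiscounted)
print(discounted_return([-1, -1, 2], gamma=0.9))  # about -0.28
```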
The Markov Property
- Pr{R_{t+1} = r, S_{t+1} = s′ | S_0, A_0, R_1, ..., S_{t-1}, A_{t-1}, R_t, S_t, A_t} = Pr{R_{t+1} = r, S_{t+1} = s′ | S_t, A_t}
- State signal summarizes past sensations compactly such that
all relevant information is retained
- Decisions are assumed to be a function of the current state only
Markov Decision Processes
Task has to satisfy the Markov Property
- If the state and action spaces are finite, then it is called a
finite Markov decision process
- Given any state and action, s and a, the probability of each possible next state and reward, s′ and r, is: p(s′, r | s, a) = Pr{S_{t+1} = s′, R_{t+1} = r | S_t = s, A_t = a}
Example robot
p(s′, r | s, a) = Pr{S_{t+1} = s′, R_{t+1} = r | S_t = s, A_t = a}
p(2, -1 | 1, right) = 1
p(4, -1 | 1, down) = 1
p(4, -1 | 1, up) = 0
(Diagram: transition graph over states 1-6 with action probabilities and rewards)
Markov Decision Processes
- Given any current state and action, s and a, together with any next state, s′, the expected value of the next reward is: r(s, a, s′) = E[R_{t+1} | S_t = s, A_t = a, S_{t+1} = s′]
Example robot
r(s, a, s′) = E[R_{t+1} | S_t = s, A_t = a, S_{t+1} = s′]
r(1, right, 2) = -1
r(1, down, 4) = -1
r(5, right, 6) = 2
(Diagram: transition graph over states 1-6 with action probabilities and rewards)
Value functions
- Value functions estimate how good it is for the agent to be in
a given state (state-value function) or how good it is to perform a certain action in a given state (action-value function)
- Value functions are defined with respect to particular policies
- The value of a state s under a policy π is the expected return when starting in s and following π thereafter: v_π(s) = E_π[G_t | S_t = s]
- v_π is called the state-value function for policy π
State-value function
Property of state-value function
- Bellman equation for vπ
- Expresses a relationship between the value of a state and
the value of its successor states
v_π(s) = E_π[G_t | S_t = s] = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [r + γ v_π(s′)]
Example state-value function
v_π(s) = E_π[G_t | S_t = s] = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [r + γ v_π(s′)]
(Diagram: three states 1, 2, 3; state 3 is terminal; γ = 1)
v_π(1) = 3 * 0.25 * (-1 + v_π(1)) + 0.25 * (-1 + v_π(2))
v_π(2) = 2 * 0.25 * (-1 + v_π(2)) + 0.25 * (-1 + v_π(1)) + 0.25 * (2 + v_π(3))
v_π(3) = 0
Solving this system gives v_π(1) = -9, v_π(2) = -5, v_π(3) = 0.
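With γ = 1 and the terminal value v_π(3) = 0, the two remaining Bellman equations form a linear system. A minimal NumPy sketch (the matrix rearrangement is an assumption of this sketch, not shown on the slides):

```python
import numpy as np

# Bellman equations of the three-state example (gamma = 1, v(3) = 0):
#   v1 = 0.75*(-1 + v1) + 0.25*(-1 + v2)
#   v2 = 0.50*(-1 + v2) + 0.25*(-1 + v1) + 0.25*(2 + 0)
# Moving the unknowns to the left-hand side gives A @ [v1, v2] = b:
A = np.array([[0.25, -0.25],
              [-0.25, 0.50]])
b = np.array([-1.0, -0.25])
v1, v2 = np.linalg.solve(A, b)
print(v1, v2)  # -9.0 -5.0
```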
Action-value function
- The expected return when taking action a in state s and thereafter following policy π
- q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
- qπ is called the action-value function for policy π
Optimal policy
A policy π is better than or equal to a policy π′ if its state-value function is greater than or equal to that of π′ for all states:
- π ≥ π′ if and only if v_π(s) ≥ v_π′(s) for all s ∈ S
Optimal state-value function
- v*(s) = max_π v_π(s), for all s ∈ S
Optimal action-value function
- q*(s, a) = max_π q_π(s, a), for all s ∈ S and a ∈ A(s)
Bellman optimality equation
- Without a reference to any specific policy
Bellman optimality equation for v*
- v*(s) = max_{a ∈ A(s)} Σ_{s′,r} p(s′, r | s, a) [r + γ v*(s′)]
Bellman optimality equation for v*
v*(s) = max_{a ∈ A(s)} Σ_{s′,r} p(s′, r | s, a) [r + γ v*(s′)]
(Diagram: three states 1, 2, 3; state 3 is terminal; γ = 1; actions up, down, left, right)
v*(1) = max{ -1 + v*(1), -1 + v*(1), -1 + v*(1), -1 + v*(2) }
v*(2) = max{ -1 + v*(2), -1 + v*(2), -1 + v*(1), 2 + v*(3) }
v*(3) = 0
v*(1) = ?  v*(2) = ?
Bellman optimality equation
Bellman optimality equation for q*
- q*(s, a) = Σ_{s′,r} p(s′, r | s, a) [r + γ max_{a′} q*(s′, a′)]
Bellman optimality equation
- System of nonlinear equations, one for each state
- N states: there are N equations and N unknowns
- If we know p(s′, r | s, a) and r(s, a, s′), then in principle one can solve this system of equations
- If we have v*, it is relatively easy to determine an optimal policy
(Diagram: example grid of v* values and the corresponding optimal policy π*)
Assumptions for solving the Bellman optimality equation
- Markov property
- We know the dynamics of the environment
- We have enough computational resources to complete the
computation of the solution
- Problem: Long computational time
- Solution: Dynamic programming
Dynamic Programming
Collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process.
Problem of classic DP algorithms: they are of limited utility in reinforcement learning:
- Assumption of perfect model
- Great computational expense
Key Idea of Dynamic Programming
Goal: find an optimal policy. Problem: solve the Bellman optimality equation
v*(s) = max_a Σ_{s′,r} p(s′, r | s, a) [r + γ v*(s′)]
Solution methods:
- Direct search
- Linear programming
- Dynamic programming
Key Idea of Dynamic Programming
Key idea of DP (and of reinforcement learning in general): Use of value functions to organize and structure the search for good policies Dynamic programming approach: Introduce two concepts:
- Policy evaluation
- Policy improvement
Use those concepts to get an optimal policy
Assumptions
We always assume that the environment is a finite MDP, i.e:
- State, action, and reward sets S, A(s), and R, for s ∈ S, are finite
- Dynamics are given by a set of probabilities p(s′, r | s, a), for all s ∈ S, a ∈ A(s), r ∈ R, and s′ ∈ S⁺ (S⁺ is S plus a terminal state if the problem is episodic)
Policy Evaluation
How to compute the state-value function v_π for an arbitrary policy π. Recall the Bellman equation:
v_π(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [r + γ v_π(s′)]
Existence and uniqueness of v_π are guaranteed if:
- Either γ < 1 or
- Eventual termination is guaranteed from all states under policy π
Iterative Policy Evaluation
Consider iterative solution methods for the Bellman equation. Consider a sequence of approximate value functions v_0, v_1, v_2, ..., each mapping S⁺ to ℝ. The initial approximation v_0 is chosen arbitrarily (except that terminal states, if any, must be given value 0). Subsequently, use the Bellman equation for v_π as an update rule:
v_{k+1}(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [r + γ v_k(s′)]
for all s ∈ S.
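The update rule can be sketched on the 4x4 gridworld used in the following example. A sketch under these assumptions (not stated on the slides in this form): states are 0-indexed 0-15 rather than 1-16, the terminals are the corner states 0 and 15, every transition yields reward -1, and moves that would leave the grid keep the agent in place:

```python
import numpy as np

N, TERMINALS = 16, {0, 15}          # 4x4 grid, terminal corners
ACTIONS = ("up", "down", "left", "right")

def step(s, a):
    """Deterministic move; walking off the grid leaves the state unchanged."""
    r, c = divmod(s, 4)
    if a == "up":    r = max(r - 1, 0)
    if a == "down":  r = min(r + 1, 3)
    if a == "left":  c = max(c - 1, 0)
    if a == "right": c = min(c + 1, 3)
    return 4 * r + c

def policy_evaluation(policy, gamma=1.0, theta=1e-6):
    """Sweep v(s) <- sum_a pi(a|s) * [r + gamma * v(s')] until stable."""
    v = np.zeros(N)
    while True:
        delta = 0.0
        for s in range(N):
            if s in TERMINALS:
                continue
            new_v = sum(p * (-1.0 + gamma * v[step(s, a)])
                        for a, p in policy(s).items())
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

random_policy = lambda s: {a: 0.25 for a in ACTIONS}
v = policy_evaluation(random_policy)
print(np.round(v).reshape(4, 4))    # matches the v_pi grid from the slides (0, -14, -20, -22, ...)
```

Note the in-place updates: each sweep already uses the freshest available values, which typically converges faster than keeping two separate arrays.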
Iterative Policy Evaluation
v_{k+1}(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [r + γ v_k(s′)]
Convergence: one can show that the sequence {v_k} converges to v_π as k → ∞ under the same conditions that guarantee the existence of v_π, i.e.
- Either γ < 1 or
- Eventual termination is guaranteed from all states under policy π
Consider the robot example. Goal: reach the top-left or bottom-right corner (nonterminal states are S = {2, 3, ..., 15}).
(Diagram: 4x4 grid of states 1-16 with the four movement actions)
The reward is -1 for all transitions.
Example: Iterative Policy Evaluation
Recall: the initial approximation can be chosen arbitrarily (except for terminal states) → choose v_0(s) = 0 for all states s ∈ S⁺ = {1, 2, ..., 16}.
v_0 for the random policy:
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
Example: Iterative Policy Evaluation
Let's calculate v_1 for s = 6:
v_1(6) = Σ_{a ∈ {u,d,l,r}} π(a|6) Σ_{s′} p(s′ | 6, a) [r + γ v_0(s′)]
       = 0.25 * {(-1 + v_0(2)) + (-1 + v_0(10)) + (-1 + v_0(5)) + (-1 + v_0(7))}
       = 0.25 * {-1 - 1 - 1 - 1} = -1
(using π(a|6) = 0.25 for all a, r = -1, and v_0(s′) = 0 for all s′)
Analogously, for all nonterminal states s ∈ S: v_1(s) = -1.
Example: Iterative Policy Evaluation
For the terminal states 1 and 16 the process terminates, i.e. v_k(1) = v_k(16) = 0 for all k.
v_1 for the random policy:
 0.0 -1.0 -1.0 -1.0
-1.0 -1.0 -1.0 -1.0
-1.0 -1.0 -1.0 -1.0
-1.0 -1.0 -1.0  0.0
Example: Iterative Policy Evaluation
Let's calculate v_2 for s = 6 (γ = 1):
v_2(6) = Σ_{a ∈ {u,d,l,r}} π(a|6) Σ_{s′} p(s′ | 6, a) [r + γ v_1(s′)]
       = 0.25 * {(-1 - γ) + (-1 - γ) + (-1 - γ) + (-1 - γ)}
       = 0.25 * {-2 - 2 - 2 - 2} = -2
Example: Iterative Policy Evaluation
(Diagram: 4x4 grid with the states away from the terminals highlighted)
Analogously, we get v_2(s) = -2 for all highlighted states s.
Example: Iterative Policy Evaluation
Let's calculate v_2 for s = 2 (γ = 1):
v_2(2) = Σ_{a ∈ {u,d,l,r}} π(a|2) Σ_{s′} p(s′ | 2, a) [r + γ v_1(s′)]
       = 0.25 * {(-1 + v_1(2)) + (-1 + v_1(6)) + (-1 + v_1(1)) + (-1 + v_1(3))}
       = 0.25 * {-2 - 2 - 1 - 2} = -1.75
Example: Iterative Policy Evaluation
(Diagram: 4x4 grid with the states adjacent to a terminal highlighted)
Analogously, we get v_2(s) = -1.75 for all highlighted states s.
Example: Iterative Policy Evaluation
v_2 for the random policy:
 0.0 -1.7 -2.0 -2.0
-1.7 -2.0 -2.0 -2.0
-2.0 -2.0 -2.0 -1.7
-2.0 -2.0 -1.7  0.0
Example: Iterative Policy Evaluation
v_k for the random policy:

k = 0:
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0

k = 1:
 0.0 -1.0 -1.0 -1.0
-1.0 -1.0 -1.0 -1.0
-1.0 -1.0 -1.0 -1.0
-1.0 -1.0 -1.0  0.0

k = 2:
 0.0 -1.7 -2.0 -2.0
-1.7 -2.0 -2.0 -2.0
-2.0 -2.0 -2.0 -1.7
-2.0 -2.0 -1.7  0.0

k = 3:
 0.0 -2.4 -2.9 -3.0
-2.4 -2.9 -3.0 -2.9
-2.9 -3.0 -2.9 -2.4
-3.0 -2.9 -2.4  0.0

...

k = 10:
 0.0 -6.1 -8.4 -9.0
-6.1 -7.7 -8.4 -8.4
-8.4 -8.4 -7.7 -6.1
-9.0 -8.4 -6.1  0.0

...

k → ∞ (v_π):
 0.0 -14. -20. -22.
-14. -18. -20. -20.
-20. -20. -18. -14.
-22. -20. -14.  0.0
Policy Evaluation
Reason for computing the value function v_π for a policy π: → finding better policies → policy improvement
- Suppose we have determined the value function v_π for an arbitrary deterministic policy π
- Should we change the policy to deterministically choose an action a ≠ π(s) for some state s?
- What we know: how good it is to follow the current policy from s: v_π(s)
- What we want to know: would it be better or worse to change to the new policy?
Policy Improvement
Would it be better or worse to change to the new policy? (new policy: for some s choose action a ≠ π(s)) → Consider selecting a in s and thereafter following the existing policy π. The value of this way of behaving is:
q_π(s, a) = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t = a] = Σ_{s′,r} p(s′, r | s, a) [r + γ v_π(s′)]
→ If this is greater than v_π(s), i.e. if it is better to select a once in s and thereafter follow π than it would be to follow π all the time, then we would expect the new policy to be better than π.
Policy Improvement
Policy Improvement Theorem
Let π and π′ be any pair of deterministic policies such that, for all s ∈ S, q_π(s, π′(s)) ≥ v_π(s) (1). Then the policy π′ must be as good as, or better than, π. That is, it must obtain greater or equal expected return from all states s ∈ S: v_π′(s) ≥ v_π(s) (2). Moreover, if there is strict inequality of (1) at any state, then there must be strict inequality of (2) at at least one state.
For the situation before:
- Suppose we have a deterministic policy π, and a new policy π′ that equals π except for one state s, for which π′(s) = a ≠ π(s)
- Suppose q_π(s, a) ≥ v_π(s), i.e. (1) is satisfied
→ By the policy improvement theorem, π′ is as good as, or better than, π
Policy Improvement
Claim: q_π(s, π′(s)) ≥ v_π(s) (1) ⟹ v_π′(s) ≥ v_π(s)
Policy Improvement Theorem: Proof
Proof:
v_π(s) ≤ q_π(s, π′(s))
       = E[R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t = π′(s)]
       = E_π′[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]
       ≤ E_π′[R_{t+1} + γ q_π(S_{t+1}, π′(S_{t+1})) | S_t = s]                      (by (1))
       = E_π′[R_{t+1} + γ E_π′[R_{t+2} + γ v_π(S_{t+2})] | S_t = s]
       = E_π′[R_{t+1} + γ R_{t+2} + γ² v_π(S_{t+2}) | S_t = s]
       ≤ E_π′[R_{t+1} + γ R_{t+2} + γ² q_π(S_{t+2}, π′(S_{t+2})) | S_t = s]          (by (1))
       = E_π′[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + γ³ v_π(S_{t+3}) | S_t = s]
       ⋮
       ≤ E_π′[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + γ³ R_{t+4} + ⋯ | S_t = s] = E_π′[G_t | S_t = s]
       = v_π′(s)
- What we have seen: given a (deterministic) policy and its value function, we can easily evaluate a change in the policy at a single state
- What if we allow changes at all states?
→ For a given (deterministic) policy π, select at each state s ∈ S the action that appears best according to q_π(s, a)
→ i.e., consider the new greedy policy π′, given by π′(s) = argmax_a q_π(s, a)   (3)
→ take the action that looks best in the short term, after one step of lookahead, according to v_π
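The greedy selection (3) is a one-step lookahead, which can be sketched on the gridworld (assumptions of this sketch: 0-indexed states 0-15, terminal corners 0 and 15, reward -1, moves off the grid leave the state unchanged; the v values below are the converged random-policy values from the example):

```python
# Greedy policy improvement on the 4x4 gridworld (states 0-15, terminals 0 and 15).
N, TERMINALS = 16, {0, 15}
ACTIONS = ("up", "down", "left", "right")

def step(s, a):
    """Deterministic move; walking off the grid leaves the state unchanged."""
    r, c = divmod(s, 4)
    if a == "up":    r = max(r - 1, 0)
    if a == "down":  r = min(r + 1, 3)
    if a == "left":  c = max(c - 1, 0)
    if a == "right": c = min(c + 1, 3)
    return 4 * r + c

def greedy(v, s, gamma=1.0):
    """All actions maximizing the one-step lookahead q(s, a) = r + gamma*v(s')."""
    q = {a: -1.0 + gamma * v[step(s, a)] for a in ACTIONS}
    best = max(q.values())
    return {a for a, qa in q.items() if qa == best}

# v_pi of the random policy (values from the slides, row by row):
v = [0, -14, -20, -22,
     -14, -18, -20, -20,
     -20, -20, -18, -14,
     -22, -20, -14, 0]
print(greedy(v, 1))   # {'left'}: head straight for the terminal corner
print(greedy(v, 5))   # both 'up' and 'left' tie (order may vary)
```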
Policy Improvement
By construction, the greedy policy π′ fulfills the condition q_π(s, π′(s)) ≥ v_π(s) → by the policy improvement theorem, the policy π′ is as good as, or better than, the original policy.
Policy Improvement
The process of making a new policy that improves on an original policy, by making it greedy with respect to the value function of the original policy, is called policy improvement.
Policy Improvement
Suppose the new greedy policy π′ is as good as, but not better than, the old policy π. Then v_π = v_π′, and from π′(s) = argmax_a q_π(s, a) it follows that for all s ∈ S:
v_π′(s) = max_a E[R_{t+1} + γ v_π′(S_{t+1}) | S_t = s, A_t = a] = max_a Σ_{s′,r} p(s′, r | s, a) [r + γ v_π′(s′)].
This is the Bellman optimality equation, and therefore v_π′ must be v*, and both π and π′ must be optimal policies.
Policy Improvement
- All the ideas of policy improvement extend to stochastic policies. (A stochastic policy π specifies probabilities π(a|s) for taking each action a in each state s.)
- In particular, the policy improvement theorem also holds for stochastic policies, under the natural definition: q_π(s, π′(s)) = Σ_a π′(a|s) q_π(s, a)
Example: Policy Improvement
Value function v_π of the random policy:
 0.0 -14. -20. -22.
-14. -18. -20. -20.
-20. -20. -18. -14.
-22. -20. -14.  0.0
Policy improvement: make the policy greedy with respect to v_π in every state.
Example: Policy Improvement
(Diagram: 4x4 grid showing the new greedy policy π′ as arrows)
New policy π′: is π′ a better policy than the random policy π?
Example: Policy Improvement
Let's calculate v_π′. For state 2, π′ deterministically chooses "left", into terminal state 1:
v_π′(2) = Σ_{a ∈ {left}} π′(a|2) Σ_{s′} p(s′ | 2, a) [-1 + v_π′(s′)]
        = -1 + v_π′(1) = -1
Example: Policy Improvement
For state 3, π′ chooses "left":
v_π′(3) = -1 + v_π′(2) = -1 - 1 = -2
Example: Policy Improvement
For state 6, π′ splits between "left" and "up" (probability 0.5 each):
v_π′(6) = 0.5 * {(-1 + v_π′(5)) + (-1 + v_π′(2))} = 0.5 * {-2 - 2} = -2
Example: Policy Improvement
For state 4, π′ splits between "left" and "down" (probability 0.5 each):
v_π′(4) = 0.5 * {(-1 + v_π′(3)) + (-1 + v_π′(8))} = 0.5 * {-3 - 3} = -3
Example: Policy Improvement
Value function v_π′ of the new policy:
 0.0 -1.0 -2.0 -3.0
-1.0 -2.0 -3.0 -2.0
-2.0 -3.0 -2.0 -1.0
-3.0 -2.0 -1.0  0.0
Value function v_π of the random policy:
 0.0 -14. -20. -22.
-14. -18. -20. -20.
-20. -20. -18. -14.
-22. -20. -14.  0.0
Since v_π(s) ≤ -14 for all nonterminal states and v_π′(s) ≥ -3 for all nonterminal states, clearly v_π′(s) ≥ v_π(s) for all s ∈ S → π′ is better than π.
Policy Iteration
Policy iteration is a way of finding an optimal policy: once a policy π has been improved using v_π to yield a better policy π′, we can compute v_π′ and improve it again to yield an even better policy π′′. Thus, we can obtain a sequence of monotonically improving policies and value functions:
π_0 →E v_π0 →I π_1 →E v_π1 →I π_2 →E ... →I π* →E v*
where →E denotes a policy evaluation and →I denotes a policy improvement.
Policy Iteration
π_0 →E v_π0 →I π_1 →E v_π1 →I π_2 →E ... →I π* →E v*
Because a finite MDP has only a finite number of policies, policy iteration has to converge to an optimal policy and optimal value function in a finite number of iterations.
Policy iteration often converges in surprisingly few iterations:
Example: Policy Iteration
Take the random policy as π_0 on the 4x4 grid (states 1-16).
π_0 →E v_π0:
 0.0 -14. -20. -22.
-14. -18. -20. -20.
-20. -20. -18. -14.
-22. -20. -14.  0.0
→I π_1 (Diagram: the improved greedy policy π_1 shown as arrows on the grid)
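The whole evaluation-improvement loop can be sketched on the gridworld (same assumptions as the earlier sketches: 0-indexed states, terminal corners, reward -1; γ = 0.9 is chosen here so that evaluating an arbitrary initial deterministic policy converges, and the strict-improvement test guarantees termination):

```python
import numpy as np

# Policy iteration on the 4x4 gridworld (states 0-15, terminals 0 and 15).
N, TERMINALS = 16, {0, 15}
ACTIONS = ("up", "down", "left", "right")

def step(s, a):
    """Deterministic move; walking off the grid leaves the state unchanged."""
    r, c = divmod(s, 4)
    if a == "up":    r = max(r - 1, 0)
    if a == "down":  r = min(r + 1, 3)
    if a == "left":  c = max(c - 1, 0)
    if a == "right": c = min(c + 1, 3)
    return 4 * r + c

def evaluate(policy, gamma, theta=1e-6):
    """Iterative policy evaluation of a deterministic policy (dict s -> a)."""
    v = np.zeros(N)
    while True:
        delta = 0.0
        for s in range(N):
            if s in TERMINALS:
                continue
            new_v = -1.0 + gamma * v[step(s, policy[s])]
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

def policy_iteration(gamma=0.9):
    policy = {s: "up" for s in range(N) if s not in TERMINALS}
    while True:
        v = evaluate(policy, gamma)                      # policy evaluation
        stable = True
        for s in policy:                                 # policy improvement
            best = max(ACTIONS, key=lambda a: -1.0 + gamma * v[step(s, a)])
            if -1.0 + gamma * v[step(s, best)] > -1.0 + gamma * v[step(s, policy[s])]:
                policy[s] = best
                stable = False
        if stable:
            return policy, v

policy, v = policy_iteration()
print(np.round(v, 2).reshape(4, 4))   # optimal values under gamma = 0.9
```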
Each of its iterations involves policy evaluation, which is itself an iterative process that may require multiple sweeps through the state set.
Policy Iteration: Drawback
Exact convergence to v_π occurs only in the limit of iterative policy evaluation. Do we really need exact convergence? → No. Value iteration: stop policy evaluation after just one sweep of the state set.
Value Iteration
Value iteration can be written as a simple backup operation that combines the policy improvement and truncated policy evaluation steps:
v_{k+1}(s) = max_a E[R_{t+1} + γ v_k(S_{t+1}) | S_t = s, A_t = a] = max_a Σ_{s′,r} p(s′, r | s, a) [r + γ v_k(s′)]   (4)
for all s ∈ S. The sequence {v_k} converges to v* under the same assumptions that guarantee the existence of v*, i.e.
- Either γ < 1 or
- Eventual termination is guaranteed from all states under the optimal policy
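A sketch of the backup (4) on the 4x4 gridworld (assumptions as in the earlier sketches: 0-indexed states, terminal corners 0 and 15, reward -1; γ = 1 is safe here because termination is guaranteed under the optimal policy):

```python
import numpy as np

# Value iteration on the 4x4 gridworld (states 0-15, terminals 0 and 15).
N, TERMINALS = 16, {0, 15}
ACTIONS = ("up", "down", "left", "right")

def step(s, a):
    """Deterministic move; walking off the grid leaves the state unchanged."""
    r, c = divmod(s, 4)
    if a == "up":    r = max(r - 1, 0)
    if a == "down":  r = min(r + 1, 3)
    if a == "left":  c = max(c - 1, 0)
    if a == "right": c = min(c + 1, 3)
    return 4 * r + c

def value_iteration(gamma=1.0, theta=1e-6):
    """v(s) <- max_a [r + gamma * v(s')], swept until stable."""
    v = np.zeros(N)
    while True:
        delta = 0.0
        for s in range(N):
            if s in TERMINALS:
                continue
            new_v = max(-1.0 + gamma * v[step(s, a)] for a in ACTIONS)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

v = value_iteration()
print(v.reshape(4, 4))
# v*(s) is minus the number of steps to the nearest terminal corner:
#  0 -1 -2 -3
# -1 -2 -3 -2
# -2 -3 -2 -1
# -3 -2 -1  0
```

Compared with policy iteration, no explicit policy is maintained: the max over actions plays the role of the improvement step inside every evaluation sweep.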
Asynchronous Dynamic Programming
Major drawback of the DP methods discussed so far: → they require sweeps over the whole state set → if the state set is very large, a single sweep is already prohibitively expensive. «Solution»: asynchronous DP algorithms.
Asynchronous DP algorithms:
- Are iterative DP algorithms that are not organized in terms of
systematic sweeps of the state set.
- Back up the values of states in any order whatsoever, using whatever values of other states happen to be available.
- Must continue to back up the values of all the states to
converge correctly (can’t ignore any state after some point in the computation).
Asynchronous Dynamic Programming
Version of asynchronous value iteration: on each step k it backs up the value of only one state s_k, using the value iteration backup:
v_{k+1}(s_k) = max_a E[R_{t+1} + γ v_k(S_{t+1}) | S_t = s_k, A_t = a]
If 0 ≤ γ < 1, convergence to v* is guaranteed given only that all states occur in the sequence {s_k} infinitely often.
Asynchronous algorithms make it easier to intermix computation with real-time interaction: to solve a given MDP, we can run an iterative DP algorithm at the same time that an agent is actually experiencing the MDP → experience can be used to determine the states to which the DP algorithm applies its backups. At the same time, the latest value and policy information from the algorithm can guide the agent's decision-making.
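A sketch of the single-state backup, choosing the state to back up at random on each step (gridworld assumptions as before; the slides' guarantee assumes 0 ≤ γ < 1, but γ = 1 still works here because the task is episodic under the greedy policy):

```python
import random
import numpy as np

# Asynchronous value iteration on the 4x4 gridworld: back up one
# randomly chosen state per step instead of sweeping the whole state set.
N, TERMINALS = 16, {0, 15}
ACTIONS = ("up", "down", "left", "right")

def step(s, a):
    """Deterministic move; walking off the grid leaves the state unchanged."""
    r, c = divmod(s, 4)
    if a == "up":    r = max(r - 1, 0)
    if a == "down":  r = min(r + 1, 3)
    if a == "left":  c = max(c - 1, 0)
    if a == "right": c = min(c + 1, 3)
    return 4 * r + c

random.seed(0)
v = np.zeros(N)
nonterminal = [s for s in range(N) if s not in TERMINALS]
for _ in range(5000):                  # every state keeps being revisited
    s = random.choice(nonterminal)     # any order of states is allowed
    v[s] = max(-1.0 + v[step(s, a)] for a in ACTIONS)   # gamma = 1 here

print(v.reshape(4, 4))   # converges to the same v* as synchronous value iteration
```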
Generalized Policy Iteration
Policy iteration consists of two interacting processes:
- Policy evaluation: making value function consistent with the
current policy
- Policy improvement: making the policy greedy w.r.t. the
current value function
Generalized policy iteration (GPI) refers to the general idea of letting policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes.
Generalized Policy Iteration
Interacting processes: Policy evaluation & policy improvement
- In policy iteration, these two processes alternate, each completing before the other begins.
- In value iteration, only one iteration of policy evaluation is
performed in between each policy improvement.
- In asynchronous DP methods, the evaluation and
improvement processes are interleaved at an even finer grain. As long as both processes continue to update all states, the ultimate result is typically the same: convergence to optimal value function and an optimal policy.
Generalized Policy Iteration
Almost all reinforcement learning methods are well described as GPI.
Generalized Policy Iteration
It is easy to see that if both the evaluation process and the improvement process stabilize, then the value function and policy must be optimal:
- The value function stabilizes only when it is consistent with the current policy
- The policy stabilizes only when it is greedy w.r.t. the current value function
→ Both processes stabilize only when a policy has been found that is greedy w.r.t. its own value function → the Bellman optimality equation holds → the policy and value function are optimal
Generalized Policy Iteration
Evaluation and improvement processes in GPI: Both competing and cooperating
Generalized Policy Iteration
The two processes pull in opposing directions, yet interact to find the optimal solution.
Efficiency of Dynamic Programming
A DP method is guaranteed to find an optimal policy in polynomial time, even though the total number of (deterministic) policies is m^n
- n = number of states
- m = number of actions
→ DP is exponentially faster than any direct search in policy space could be
- In practice, DP methods can be used with today’s computers
to solve MDPs with millions of states.
- Both policy and value iteration are widely used, and it is not
clear which, if either, is better in general.
- In practice, these methods usually converge much faster than
their theoretical worst-case run times.
- On problems with large state spaces, asynchronous DP methods are often preferred.