SLIDE 1

CS287 Fall 2019 – Lecture 2 Markov Decision Processes and Exact Solution Methods

Pieter Abbeel UC Berkeley EECS

SLIDE 2

Outline for Today's Lecture

• Markov Decision Processes (MDPs)
• Exact Solution Methods
  • Value Iteration
  • Policy Iteration
  • Linear Programming
• Maximum Entropy Formulation
  • Entropy
  • Max-ent Formulation
  • Intermezzo on Constrained Optimization
  • Max-Ent Value Iteration

SLIDE 3

[Drawing from Sutton and Barto, Reinforcement Learning: An Introduction, 1998]

Markov Decision Process

Assumption: agent gets to observe the state

SLIDE 4

Markov Decision Process (S, A, T, R, γ, H)

Given:
• S: set of states
• A: set of actions
• T: S × A × S × {0, 1, …, H} → [0, 1], where T_t(s, a, s') = P(s_{t+1} = s' | s_t = s, a_t = a)
• R: S × A × S × {0, 1, …, H} → ℝ, where R_t(s, a, s') = reward for the transition (s_t = s, a_t = a, s_{t+1} = s')
• γ ∈ (0, 1]: discount factor
• H: horizon over which the agent will act

Goal:
• Find π*: S × {0, 1, …, H} → A that maximizes the expected sum of rewards, i.e.,
  π* = argmax_π E[ Σ_{t=0}^{H} γ^t R_t(s_t, a_t, s_{t+1}) | π ]
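A minimal sketch (not from the slides) of how the (S, A, T, R, γ, H) tuple could be stored for the tabular algorithms that follow; the class name `TabularMDP`, the dense-array layout, and the assumption that T and R are time-invariant are all illustrative choices.

```python
import numpy as np

class TabularMDP:
    """Finite MDP stored as dense arrays (illustrative; time-invariant T and R).

    T[s, a, s'] = P(s_{t+1} = s' | s_t = s, a_t = a)
    R[s, a, s'] = reward for the transition (s, a, s')
    """
    def __init__(self, T, R, gamma=0.9, H=100):
        self.T = np.asarray(T, dtype=float)   # shape (|S|, |A|, |S|)
        self.R = np.asarray(R, dtype=float)   # shape (|S|, |A|, |S|)
        self.gamma = gamma                    # discount factor in (0, 1]
        self.H = H                            # horizon
        self.nS, self.nA, _ = self.T.shape
```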

SLIDE 5

MDP (S, A, T, R, γ, H), goal: maximize expected sum of rewards (as defined above)

Examples
• Cleaning robot
• Walking robot
• Pole balancing
• Games: Tetris, backgammon
• Server management
• Shortest path problems
• Model for animals, people

SLIDE 6

Canonical Example: Grid World

• The agent lives in a grid
• Walls block the agent's path
• The agent's actions do not always go as planned:
  • 80% of the time, the action North takes the agent North (if there is no wall there)
  • 10% of the time, North takes the agent West; 10% East
  • If there is a wall in the direction the agent would have been taken, the agent stays put
• Big rewards come at the end

SLIDE 7

Solving MDPs

• In an MDP, we want to find an optimal policy π*: S × {0, …, H} → A
• A policy π gives an action for each state for each time step
• An optimal policy maximizes the expected sum of rewards
• Contrast: if the environment were deterministic, we would just need an optimal plan, i.e., a sequence of actions from the start to a goal

[Figure: time steps t = 0, 1, 2, 3, 4, 5 = H]

SLIDE 8

Outline for Today's Lecture (recap; see Slide 2)

For now: discrete state-action spaces as they are simpler to get the main concepts across. We will consider continuous spaces next lecture!

SLIDE 9

SLIDE 10

Value Iteration

Algorithm:
• Start with V*_0(s) = 0 for all s.
• For i = 1, …, H:
  For all states s in S:
    V*_i(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*_{i-1}(s') ]
    π*_i(s) ← argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*_{i-1}(s') ]
• This is called a value update or Bellman update/back-up.

V*_i(s) = expected sum of rewards accumulated starting from state s, acting optimally for i steps
π*_i(s) = optimal action when in state s and getting to act for i steps
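A sketch of the back-up in NumPy, assuming the dense (|S|, |A|, |S|) arrays T and R from the `TabularMDP` sketch above (time-invariant dynamics); it returns the value functions and greedy policies for every number of remaining steps.

```python
import numpy as np

def value_iteration(T, R, gamma, H):
    """Finite-horizon value iteration via repeated Bellman back-ups."""
    nS, nA, _ = T.shape
    V = np.zeros((H + 1, nS))            # V[i, s] = optimal value with i steps to go
    pi = np.zeros((H, nS), dtype=int)    # pi[i-1, s] = optimal action with i steps to go
    for i in range(1, H + 1):
        # Q[s, a] = sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * V_{i-1}(s'))
        Q = np.einsum("sap,sap->sa", T, R + gamma * V[i - 1][None, None, :])
        V[i] = Q.max(axis=1)             # Bellman back-up
        pi[i - 1] = Q.argmax(axis=1)
    return V, pi
```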

SLIDES 11–17

Value Iteration in Gridworld

noise = 0.2, γ = 0.9, two terminal states with R = +1 and -1 (the slides show the value estimates after successive iterations)

SLIDE 18

Value Iteration Convergence

• Now we know how to act for an infinite horizon with discounted rewards!
• Run value iteration till convergence.
• This produces V*, which in turn tells us how to act, namely by following:
  π*(s) = argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]
• Note: the infinite-horizon optimal policy is stationary, i.e., the optimal action at a state s is the same at all times. (Efficient to store!)

• Theorem. Value iteration converges. At convergence, we have found the optimal value function V* for the discounted infinite-horizon problem, which satisfies the Bellman equations:
  V*(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]  for all s
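The same back-up run to (approximate) convergence, as a sketch; the tolerance `tol` and the greedy policy extraction at the end are standard choices, not anything specific from the slide.

```python
import numpy as np

def value_iteration_to_convergence(T, R, gamma, tol=1e-8):
    """Run Bellman back-ups until the max-norm change drops below tol."""
    nS, nA, _ = T.shape
    V = np.zeros(nS)
    while True:
        Q = np.einsum("sap,sap->sa", T, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:   # small update => close to converged
            break
        V = V_new
    pi = Q.argmax(axis=1)                     # stationary greedy (optimal) policy
    return V_new, pi
```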

SLIDE 19

Convergence: Intuition

• V*(s) = expected sum of rewards accumulated starting from state s, acting optimally for an infinite number of steps
• V*_H(s) = expected sum of rewards accumulated starting from state s, acting optimally for H steps
• The additional reward collected over time steps H+1, H+2, … goes to zero as H goes to infinity:
  γ^{H+1} R(s_{H+1}) + γ^{H+2} R(s_{H+2}) + … ≤ γ^{H+1} R_max + γ^{H+2} R_max + … = (γ^{H+1} / (1 − γ)) R_max
  Hence V*_H → V* as H → ∞.

For simplicity of notation it was assumed above that rewards are always greater than or equal to zero. If rewards can be negative, a similar argument holds, using max |R| and bounding from both sides.

SLIDE 20

Convergence and Contractions

• Definition (max-norm): ‖U‖ = max_s |U(s)|
• Definition: An update operation is a γ-contraction in max-norm if and only if for all U_i, V_i: ‖U_{i+1} − V_{i+1}‖ ≤ γ ‖U_i − V_i‖
• Theorem: A contraction converges to a unique fixed point, no matter the initialization.
• Fact: the value iteration update is a γ-contraction in max-norm.
• Corollary: value iteration converges to a unique fixed point.
• Additional fact: if ‖V_{i+1} − V_i‖ < ε, then ‖V_{i+1} − V*‖ < 2εγ / (1 − γ)
  I.e. once the update is small, it must also be close to converged.
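A short sketch (not on the slide) of why the Bellman back-up B is a γ-contraction in max-norm, using the fact that |max_a f(a) − max_a g(a)| ≤ max_a |f(a) − g(a)|:

```latex
\begin{aligned}
|(BU)(s) - (BV)(s)|
&\le \max_a \Big| \sum_{s'} T(s,a,s')\,\gamma\,\big[U(s') - V(s')\big] \Big| \\
&\le \gamma \max_{s'} |U(s') - V(s')|
 \;=\; \gamma \,\|U - V\|_\infty \qquad \text{for every } s,
\end{aligned}
```

so taking the max over s gives ‖BU − BV‖ ≤ γ ‖U − V‖.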

SLIDE 21

Exercise 1: Effect of Discount and Noise

(a) Prefer the close exit (+1), risking the cliff (-10)
(b) Prefer the close exit (+1), but avoiding the cliff (-10)
(c) Prefer the distant exit (+10), risking the cliff (-10)
(d) Prefer the distant exit (+10), avoiding the cliff (-10)

(1) γ = 0.1, noise = 0.5
(2) γ = 0.99, noise = 0
(3) γ = 0.99, noise = 0.5
(4) γ = 0.1, noise = 0

SLIDE 22

(a) Prefer close exit (+1), risking the cliff (-10)

  • (4) γ = 0.1, noise = 0

Exercise 1 Solution

SLIDE 23

(b) Prefer close exit (+1), avoiding the cliff (-10)

  • (1) γ = 0.1, noise = 0.5

Exercise 1 Solution

SLIDE 24

(c) Prefer distant exit (+10), risking the cliff (-10)

  • (2) γ = 0.99, noise = 0

Exercise 1 Solution

SLIDE 25

(d) Prefer distant exit (+10), avoiding the cliff (-10)

  • (3) γ = 0.99, noise = 0.5

Exercise 1 Solution

SLIDE 26

Outline for Today's Lecture (recap; see Slide 2)

SLIDE 27

SLIDE 28

Policy Evaluation

• Recall value iteration iterates:
  V_i(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_{i-1}(s') ]
• Policy evaluation for a fixed policy π iterates:
  V^π_i(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π_{i-1}(s') ]
• At convergence:
  V^π(s) = Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π(s') ]  for all s
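A sketch of iterative policy evaluation for a fixed deterministic policy, with the same array conventions as before; `tol` is an arbitrary stopping threshold.

```python
import numpy as np

def policy_evaluation(T, R, gamma, pi, tol=1e-8):
    """Iterate the fixed-policy back-up until convergence.

    pi: integer array of shape (|S|,), the action taken in each state.
    """
    nS = T.shape[0]
    T_pi = T[np.arange(nS), pi]   # (|S|, |S|): P(s' | s, pi(s))
    R_pi = R[np.arange(nS), pi]   # (|S|, |S|): R(s, pi(s), s')
    V = np.zeros(nS)
    while True:
        V_new = np.sum(T_pi * (R_pi + gamma * V[None, :]), axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```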

SLIDE 29

Exercise 2

SLIDE 30

Policy Iteration

One iteration of policy iteration:
• Policy evaluation: for the current policy π_k, iterate until convergence
  V^{π_k}_{i+1}(s) ← Σ_{s'} T(s, π_k(s), s') [ R(s, π_k(s), s') + γ V^{π_k}_i(s') ]
• Policy improvement: find the best action according to a one-step look-ahead
  π_{k+1}(s) ← argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V^{π_k}(s') ]

• Repeat until the policy converges.
• At convergence: optimal policy; and it converges faster than value iteration under some conditions.
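A sketch of the full loop, reusing the `policy_evaluation` helper from the previous slide (an assumption of this sketch) and the greedy improvement step above.

```python
import numpy as np

def policy_iteration(T, R, gamma):
    """Alternate policy evaluation and greedy policy improvement."""
    nS, nA, _ = T.shape
    pi = np.zeros(nS, dtype=int)            # arbitrary initial policy
    while True:
        V = policy_evaluation(T, R, gamma, pi)
        Q = np.einsum("sap,sap->sa", T, R + gamma * V[None, None, :])
        pi_new = Q.argmax(axis=1)           # policy improvement
        if np.array_equal(pi_new, pi):      # policy unchanged => converged
            return pi, V
        pi = pi_new
```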

SLIDE 31

Policy Evaluation Revisited

• Idea 1: modify the Bellman updates to follow the fixed policy π:
  V^π_{i+1}(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π_i(s') ]
• Idea 2: it is just a linear system, solve with Matlab (or whatever)
  variables: V^π(s); constants: T, R
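Idea 2 as a sketch in NumPy rather than Matlab: V^π = r^π + γ T^π V^π is linear in V^π, so (I − γ T^π) V^π = r^π can be solved directly.

```python
import numpy as np

def policy_evaluation_exact(T, R, gamma, pi):
    """Solve the policy-evaluation linear system exactly."""
    nS = T.shape[0]
    T_pi = T[np.arange(nS), pi]                           # (|S|, |S|)
    r_pi = np.sum(T_pi * R[np.arange(nS), pi], axis=1)    # expected one-step reward
    return np.linalg.solve(np.eye(nS) - gamma * T_pi, r_pi)
```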

SLIDE 32

Policy Iteration Guarantees

Policy Iteration iterates over:
• Policy evaluation of the current policy π_k
• Policy improvement: π_{k+1}(s) ← argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V^{π_k}(s') ]

• Theorem. Policy iteration is guaranteed to converge, and at convergence, the current policy and its value function are the optimal policy and the optimal value function!

Proof sketch:
(1) Guaranteed to converge: In every step the policy improves. This means that a given policy can be encountered at most once. This means that after we have iterated as many times as there are different policies, i.e., (number of actions)^(number of states), we must be done and hence have converged.
(2) Optimal at convergence: by definition of convergence, at convergence π_{k+1}(s) = π_k(s) for all states s. This means V^{π_k}(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V^{π_k}(s') ] for all s. Hence V^{π_k} satisfies the Bellman equation, which means V^{π_k} is equal to the optimal value function V*.

SLIDE 33

Outline for Today's Lecture (recap; see Slide 2)

SLIDE 34

Obstacles Gridworld

• What if the optimal path becomes blocked? The optimal policy fails.
• Is there any way to solve for a distribution over solutions rather than a single solution? → more robust

SLIDE 35

What if we could find a “set of solutions”?

SLIDE 36

Entropy

• Entropy = measure of uncertainty over a random variable X:
  H(X) = − Σ_x p(x) log₂ p(x)
• = the number of bits required to encode X (on average)

SLIDE 37

E.g. binary random variable

Entropy
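A small sketch illustrating the binary random variable example: the entropy of a Bernoulli(p) variable, in bits, peaks at 1 bit for p = 0.5 and drops to 0 as p approaches 0 or 1.

```python
import numpy as np

def entropy_bits(p):
    """H(X) = -sum_x p(x) log2 p(x), in bits; 0 * log(0) is treated as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy_bits([0.5, 0.5]))   # 1.0 bit  (maximum uncertainty)
print(entropy_bits([0.9, 0.1]))   # ~0.47 bits
print(entropy_bits([1.0, 0.0]))   # 0.0 bits (no uncertainty)
```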

SLIDE 38

Entropy

SLIDE 39

Maximum Entropy MDP

• Regular formulation: max_π E[ Σ_t R(s_t, a_t) ]
• Max-ent formulation: max_π E[ Σ_t R(s_t, a_t) + β H(π(· | s_t)) ], where β is a temperature weighting the entropy bonus (written T on the gridworld slides below)

SLIDE 40

Max-ent Value Iteration

• But first we need an intermezzo on constrained optimization…

SLIDE 41

SLIDE 42

Constrained Optimization

• Original problem: max_x f(x) subject to g(x) = 0
• Lagrangian: L(x, λ) = f(x) + λ g(x)
• At optimum: ∂L/∂x = 0 and ∂L/∂λ = 0, i.e., ∇f(x) + λ ∇g(x) = 0 and g(x) = 0
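A tiny worked example of the recipe (not from the slide): maximize f(x, y) = x + y subject to g(x, y) = x² + y² − 1 = 0.

```latex
\mathcal{L}(x, y, \lambda) = x + y + \lambda\,(x^2 + y^2 - 1), \qquad
\frac{\partial \mathcal{L}}{\partial x} = 1 + 2\lambda x = 0, \quad
\frac{\partial \mathcal{L}}{\partial y} = 1 + 2\lambda y = 0, \quad
\frac{\partial \mathcal{L}}{\partial \lambda} = x^2 + y^2 - 1 = 0,
```

which gives x = y = 1/√2 (with λ = −1/√2) as the constrained maximizer.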

SLIDE 43

SLIDE 44

SLIDE 45

Max-ent for 1-step problem

SLIDE 46

Max-ent for 1-step problem

• Solving the 1-step max-ent problem gives a policy proportional to the exponentiated rewards; the resulting optimal value is a softmax of the rewards.
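A sketch of that derivation via the Lagrangian, writing the temperature as β (the exact symbol on the slide may differ): maximize Σ_a π(a) r(a) + β H(π) subject to Σ_a π(a) = 1.

```latex
\mathcal{L}(\pi, \lambda) = \sum_a \pi(a)\, r(a) \;-\; \beta \sum_a \pi(a)\log \pi(a)
  \;+\; \lambda\Big(\sum_a \pi(a) - 1\Big),
\qquad
\frac{\partial \mathcal{L}}{\partial \pi(a)} = r(a) - \beta\big(\log \pi(a) + 1\big) + \lambda = 0,
```

so π(a) ∝ exp(r(a)/β), and substituting back in gives the optimal value β log Σ_a exp(r(a)/β), i.e., a softmax of the rewards.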

SLIDE 47

SLIDE 48

Max-ent Value Iteration

• Each back-up is a 1-step max-ent problem (with Q instead of r), so we can directly transcribe the solution:
  π(a | s) ∝ exp(Q(s, a)/β),  V(s) = β log Σ_a exp(Q(s, a)/β)
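A hedged sketch of the resulting soft (max-ent) value iteration in NumPy; the temperature `beta` plays the role of T on the gridworld slides that follow, and the log-sum-exp back-up comes directly from the 1-step solution above.

```python
import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(T, R, gamma, beta, H):
    """Max-ent value iteration: V(s) = beta * log sum_a exp(Q(s,a)/beta). Assumes H >= 1."""
    nS, nA, _ = T.shape
    V = np.zeros(nS)
    for _ in range(H):
        Q = np.einsum("sap,sap->sa", T, R + gamma * V[None, None, :])
        V = beta * logsumexp(Q / beta, axis=1)     # soft back-up
    pi = np.exp((Q - V[:, None]) / beta)           # softmax policy over actions
    return V, pi
```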

SLIDE 49

Maxent in Our Obstacles Gridworld (T=1)

SLIDE 50

Maxent in Our Obstacles Gridworld (T=1e-2)

SLIDE 51

Maxent in Our Obstacles Gridworld (T=0)

SLIDE 52

Outline for Today's Lecture (recap; see Slide 2)

SLIDE 53

SLIDE 54

Infinite Horizon Linear Program

• Recall, at value iteration convergence we have:
  V*(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]  for all s
• LP formulation to find V*:
  min_V Σ_s μ₀(s) V(s)
  subject to V(s) ≥ Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V(s') ]  for all s, a

μ₀ is a probability distribution over S, with μ₀(s) > 0 for all s in S.

• Theorem. V* is the solution to the above LP.
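A sketch of this LP in SciPy; the uniform choice of μ₀ and the rearrangement of the constraints into `A_ub @ V <= b_ub` form are implementation choices, not from the slide.

```python
import numpy as np
from scipy.optimize import linprog

def lp_value(T, R, gamma):
    """min_V sum_s mu0(s) V(s)  s.t.  V(s) >= sum_{s'} T(s,a,s')[R(s,a,s') + gamma V(s')]."""
    nS, nA, _ = T.shape
    mu0 = np.full(nS, 1.0 / nS)        # any distribution with mu0(s) > 0 works
    A_ub, b_ub = [], []
    for s in range(nS):
        for a in range(nA):
            row = gamma * T[s, a].copy()
            row[s] -= 1.0                               # gamma * T V - V(s) <= -E[R]
            A_ub.append(row)
            b_ub.append(-np.dot(T[s, a], R[s, a]))
    res = linprog(c=mu0, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * nS)           # V is unbounded in sign
    return res.x                                        # = V*
```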
SLIDE 55

Theorem Proof

SLIDE 56

Exercise 3

• How about:

SLIDE 57

SLIDE 58

Dual Linear Program

• Interpretation: λ(s, a) = expected discounted number of times action a is taken in state s
• Equation 2: ensures that λ has the above meaning
• Equation 1: maximize expected discounted sum of rewards
• Optimal policy: π*(a | s) = λ(s, a) / Σ_{a'} λ(s, a')
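For reference, a hedged sketch of the standard form of this dual (with λ(s, a) the discounted state-action visitation frequencies and μ₀ the initial state distribution); the slide's Equation 1 and Equation 2 refer to the objective and constraint below.

```latex
\begin{aligned}
\max_{\lambda \ge 0}\quad & \sum_{s,a,s'} \lambda(s,a)\, T(s,a,s')\, R(s,a,s')
  && \text{(Equation 1)}\\
\text{s.t.}\quad & \sum_{a'} \lambda(s',a') \;=\; \mu_0(s') + \gamma \sum_{s,a} \lambda(s,a)\, T(s,a,s')
  \quad \forall s' && \text{(Equation 2)}
\end{aligned}
```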

SLIDE 59

Outline for Today's Lecture (recap; see Slide 2)

SLIDE 60

Today and Forthcoming Lectures

• Optimal control: provides a general computational approach to tackle control problems.
• Dynamic programming / Value iteration
  • Discrete state spaces – exact methods
  • Continuous state spaces – approximate solutions through discretization
  • Large state spaces – approximate solutions through function approximation
• Linear systems – closed-form exact solution with LQR
• Nonlinear systems – how to extend the exact solutions for linear systems:
  • Local linearization
  • iLQR, Differential Dynamic Programming
• Optimal Control through Nonlinear Optimization
  • Shooting vs. collocation formulations
  • Model Predictive Control (MPC)
• Examples