Markov Decision Processes (Mausam, CSE 515)


SLIDE 1

Markov Decision Processes Mausam CSE 515

SLIDE 2

Markov Decision Process

Fields that study MDPs: Operations Research, Artificial Intelligence, Machine Learning, Graph Theory, Robotics, Neuroscience/Psychology, Control Theory, Economics.

MDPs model the sequential decision making of a rational agent.

SLIDE 3

A Statistician’s View of MDPs

Markov Chain:
  • sequential process
  • models state transitions
  • autonomous process

One-step Decision Theory:
  • one-step process
  • models choice
  • maximizes utility

Markov Decision Process = Markov chain + choice = decision theory + sequentiality:
  • sequential process
  • models state transitions
  • models choice
  • maximizes utility


SLIDE 4

A Planning View

What action next?

Percepts Actions

Environment

Static vs. Dynamic; Fully vs. Partially Observable; Perfect vs. Noisy; Deterministic vs. Stochastic; Instantaneous vs. Durative; Predictable vs. Unpredictable

SLIDE 5

Classical Planning

What action next?

Percepts Actions

Environment

Static, Fully Observable, Perfect, Predictable, Instantaneous, Deterministic

SLIDE 6

Deterministic, fully observable

SLIDE 7

Stochastic Planning: MDPs

What action next?

Percepts Actions

Environment

Static, Fully Observable, Perfect, Stochastic, Instantaneous, Unpredictable

SLIDE 8

Stochastic, Fully Observable

SLIDE 9

Markov Decision Process (MDP)

  • S: A set of states
  • A: A set of actions
  • Pr(s’|s,a): transition model
  • C(s,a,s’): cost model
  • G: set of goals
  • s0: start state
  • γ: discount factor
  • R(s,a,s’): reward model

(A factored state representation gives a Factored MDP; goals may be absorbing or non-absorbing.)
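To make the tuple concrete, here is a minimal sketch (illustrative names, not from the slides) of how such an MDP could be represented in Python; costs C(s,a,s') can be stored as negated rewards in the same structure:

    from dataclasses import dataclass

    @dataclass
    class MDP:
        # Illustrative container for the tuple <S, A, Pr, R (or C), G, s0, gamma>.
        states: list          # S: set of states
        actions: list         # A: set of actions
        transitions: dict     # Pr(s'|s,a) as {(s, a): {s': probability}}
        rewards: dict         # R(s,a,s') as {(s, a, s'): value}; costs can be negated rewards
        goals: set            # G: goal states (may be empty)
        start: object         # s0: start state
        gamma: float = 1.0    # discount factor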

SLIDE 10

Objective of an MDP

  • Find a policy π: S → A which optimizes
    • minimizes expected cost to reach a goal
    • maximizes expected reward
    • maximizes expected (reward - cost)
  • given a ____ horizon
    • finite
    • infinite
    • indefinite
  • assuming full observability

(The objective may be discounted or undiscounted.)

SLIDE 11

Role of Discount Factor ()

  • Keep the total reward/total cost finite
  • useful for infinite horizon problems
  • Intuition (economics):
  • Money today is worth more than money tomorrow.
  • Total reward: r1 + γ r2 + γ² r3 + …
  • Total cost: c1 + γ c2 + γ² c3 + …
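As a quick illustration of the discounting intuition, here is a tiny hypothetical helper that sums a discounted reward sequence:

    def discounted_return(rewards, gamma=0.9):
        # r1 + gamma*r2 + gamma^2*r3 + ... for a finite list of rewards
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # With gamma < 1 even an infinite constant-reward stream stays finite (value 1/(1-gamma)).
    print(discounted_return([1, 1, 1, 1], gamma=0.9))   # 1 + 0.9 + 0.81 + 0.729 = 3.439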
SLIDE 12

Examples of MDPs

  • Goal-directed, Indefinite Horizon, Cost Minimization MDP
  • <S, A, Pr, C, G, s0>
  • Most often studied in planning, graph theory communities
  • Infinite Horizon, Discounted Reward Maximization MDP
  • <S, A, Pr, R, γ>  (most popular)
  • Most often studied in machine learning, economics, operations research communities

  • Goal-directed, Finite Horizon, Prob. Maximization MDP
  • <S, A, Pr, G, s0, T>
  • Also studied in planning community
  • Oversubscription Planning: Non-absorbing goals, Reward Max. MDP
  • <S, A, Pr, G, R, s0>
  • Relatively recent model


SLIDE 13

Bellman Equations for MDP1

  • <S, A, Pr, C, G, s0>
  • Define J*(s) {optimal cost} as the minimum expected cost to reach a goal from this state.

  • J* should satisfy the following equation:
    J*(s) = 0, if s ∈ G
    J*(s) = min_{a ∈ Ap(s)} Σ_{s'} Pr(s'|s,a) [ C(s,a,s') + J*(s') ], otherwise
SLIDE 14

Bellman Equations for MDP2

  • <S, A, Pr, R, s0, γ>
  • Define V*(s) {optimal value} as the maximum expected discounted reward from this state.

  • V* should satisfy the following equation:
    V*(s) = max_{a ∈ Ap(s)} Σ_{s'} Pr(s'|s,a) [ R(s,a,s') + γ V*(s') ]
SLIDE 15

Bellman Equations for MDP3

  • <S, A, Pr, G, s0, T>
  • Define P*(s,t) {optimal prob} as the maximum probability of reaching a goal from this state, starting at the t-th timestep.

  • P* should satisfy the following equation:
    P*(s,t) = 1, if s ∈ G
    P*(s,t) = 0, if t = T and s ∉ G
    P*(s,t) = max_{a ∈ Ap(s)} Σ_{s'} Pr(s'|s,a) P*(s', t+1), otherwise
SLIDE 16

Bellman Backup (MDP2)

  • Given an estimate of V* function (say Vn)
  • Backup Vn function at state s
  • calculate a new estimate (Vn+1) :
  • Qn+1(s,a) : value/cost of the strategy:
  • execute action a in s, execute πn subsequently
  • πn = argmax_{a ∈ Ap(s)} Qn(s,a)

Qn+1(s,a) = Σ_{s'} Pr(s'|s,a) [ R(s,a,s') + γ Vn(s') ]
Vn+1(s) = max_{a ∈ Ap(s)} Qn+1(s,a)
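A minimal sketch of this backup in Python, assuming the illustrative MDP container from the earlier slide (transitions as {(s,a): {s': prob}}, rewards as {(s,a,s'): value}) and that every action is applicable in every state:

    def bellman_backup(mdp, V, s):
        # Returns (Vn+1(s), greedy action) given current value estimates V: {state: value}.
        q = {}
        for a in mdp.actions:
            q[a] = sum(p * (mdp.rewards[(s, a, s2)] + mdp.gamma * V[s2])
                       for s2, p in mdp.transitions[(s, a)].items())
        best = max(q, key=q.get)
        return q[best], best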

SLIDE 17

Bellman Backup

Example (with γ ≈ 1): state s0 has actions a1, a2, a3 whose successors currently have values V0 = 0, V0 = 1, and V0 = 2.
Q1(s0,a1) = 2 + γ · 0
Q1(s0,a2) = 5 + γ (0.9 × 1 + 0.1 × 2)
Q1(s0,a3) = 4.5 + γ · 2
V1(s0) = max_a Q1(s0,a) = 6.5, so the greedy action is a_greedy = a3.

SLIDE 18

Value iteration [Bellman’57]

  • assign an arbitrary value V0(s) to each state
  • repeat
    • for all states s
      • compute Vn+1(s) by Bellman backup at s
  • until maxs |Vn+1(s) – Vn(s)| < ε

Residual(s) = |Vn+1(s) – Vn(s)| at iteration n+1; the stopping test is called ε-convergence.
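Putting the pieces together, here is a short illustrative value-iteration loop (discounted-reward formulation), reusing the bellman_backup sketch above:

    def value_iteration(mdp, epsilon=1e-6):
        V = {s: 0.0 for s in mdp.states}            # arbitrary initial assignment V0
        while True:
            V_new = {s: bellman_backup(mdp, V, s)[0] for s in mdp.states}
            residual = max(abs(V_new[s] - V[s]) for s in mdp.states)
            V = V_new
            if residual < epsilon:                  # epsilon-convergence
                return V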

SLIDE 19

Comments

  • Decision-theoretic Algorithm
  • Dynamic Programming
  • Fixed Point Computation
  • Probabilistic version of Bellman-Ford Algorithm
  • for shortest path computation
  • MDP1 : Stochastic Shortest Path Problem
  • Time Complexity
  • one iteration: O(|S|2|A|)
  • number of iterations: poly(|S|, |A|, 1/(1-γ))
  • Space Complexity: O(|S|)
  • Factored MDPs
  • exponential space, exponential time
SLIDE 20

Convergence Properties

  • Vn → V* in the limit as n → ∞
  • ε-convergence: Vn function is within ε of V*
  • Optimality: current policy is within 2εγ/(1-γ) of optimal
  • Monotonicity
  • V0 ≤p V* ⇒ Vn ≤p V* (Vn monotonic from below)
  • V0 ≥p V* ⇒ Vn ≥p V* (Vn monotonic from above)
  • otherwise Vn non-monotonic
SLIDE 21

Policy Computation Optimal policy is stationary and time-independent.

  • for infinite/indefinite horizon problems

Policy computation: π*(s) = argmax_{a ∈ Ap(s)} Σ_{s'} Pr(s'|s,a) [ R(s,a,s') + γ V*(s') ]

Policy Evaluation: a system of linear equations in |S| variables,
Vπ(s) = Σ_{s'} Pr(s'|s,π(s)) [ R(s,π(s),s') + γ Vπ(s') ]
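Because policy evaluation is a linear system, it can be solved directly. A sketch with NumPy under the same assumed representation (γ < 1 keeps the matrix invertible in general):

    import numpy as np

    def policy_evaluation(mdp, policy):
        # Solve V = R_pi + gamma * P_pi V, a linear system in |S| variables.
        idx = {s: i for i, s in enumerate(mdp.states)}
        n = len(mdp.states)
        P, r = np.zeros((n, n)), np.zeros(n)
        for s in mdp.states:
            a = policy[s]
            for s2, p in mdp.transitions[(s, a)].items():
                P[idx[s], idx[s2]] = p
                r[idx[s]] += p * mdp.rewards[(s, a, s2)]
        V = np.linalg.solve(np.eye(n) - mdp.gamma * P, r)
        return {s: float(V[idx[s]]) for s in mdp.states}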

SLIDE 22

Changing the Search Space

  • Value Iteration
  • Search in value space
  • Compute the resulting policy
  • Policy Iteration
  • Search in policy space
  • Compute the resulting value
SLIDE 23

Policy iteration [Howard’60]

  • assign an arbitrary assignment of 0 to each state.
  • repeat
  • Policy Evaluation: compute Vn+1: the evaluation of n
  • Policy Improvement: for all states s
  • compute n+1(s): argmaxa2 Ap(s)Qn+1(s,a)
  • until n+1 = n

Advantage

  • searching in a finite (policy) space as opposed to an uncountably infinite (value) space ⇒ faster convergence

  • all other properties follow!

Policy evaluation is costly: O(n³); approximating it by value iteration with a fixed policy gives Modified Policy Iteration.
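A compact sketch of policy iteration, reusing the hypothetical policy_evaluation above (illustrative only, not the course's reference code):

    def policy_iteration(mdp):
        policy = {s: mdp.actions[0] for s in mdp.states}       # arbitrary initial policy pi_0
        while True:
            V = policy_evaluation(mdp, policy)                 # policy evaluation
            improved = {}
            for s in mdp.states:                               # greedy policy improvement
                improved[s] = max(mdp.actions,
                                  key=lambda a: sum(p * (mdp.rewards[(s, a, s2)] + mdp.gamma * V[s2])
                                                    for s2, p in mdp.transitions[(s, a)].items()))
            if improved == policy:                             # pi_{n+1} == pi_n
                return policy, V
            policy = improved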

SLIDE 24

Modified Policy iteration

  • assign an arbitrary assignment of 0 to each state.
  • repeat
  • Policy Evaluation: compute Vn+1 the approx. evaluation of n
  • Policy Improvement: for all states s
  • compute n+1(s): argmaxa2 Ap(s)Qn+1(s,a)
  • until n+1 = n

Advantage

  • probably the most competitive synchronous dynamic

programming algorithm.
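The approximate evaluation step can be sketched as a few backups under the fixed policy; swapping this in for the exact policy_evaluation in the policy-iteration sketch above gives a modified-policy-iteration-style algorithm (k is an assumed tuning parameter):

    def approximate_policy_evaluation(mdp, policy, V, k=20):
        # k sweeps of fixed-policy backups instead of an exact linear solve.
        for _ in range(k):
            V = {s: sum(p * (mdp.rewards[(s, policy[s], s2)] + mdp.gamma * V[s2])
                        for s2, p in mdp.transitions[(s, policy[s])].items())
                 for s in mdp.states}
        return V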

SLIDE 25

Asynchronous Value Iteration

  • States may be backed up in any order
  • instead of sweeping iteration by iteration
  • as long as all states are backed up infinitely often,
  • Asynchronous Value Iteration converges to optimal
SLIDE 26

Asynch VI: Prioritized Sweeping

  • Why back up a state if the values of its successors are unchanged?
  • Prefer backing up a state
    • whose successors had the most change
  • Priority queue of (state, expected change in value)
  • Back up states in order of priority
  • After backing up a state, update the priority queue
    • for all its predecessors (see the sketch below)
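A rough sketch of the prioritized-sweeping loop; the predecessors map is assumed to be precomputed, and real implementations differ in how priorities are defined and deduplicated:

    import heapq, itertools

    def prioritized_sweeping(mdp, predecessors, max_backups=10000, theta=1e-3):
        # predecessors: {state: set of states that can transition into it}
        V = {s: 0.0 for s in mdp.states}
        tie = itertools.count()                               # tie-breaker for the heap
        pq = [(-1.0, next(tie), s) for s in mdp.states]       # seed the queue with every state
        heapq.heapify(pq)
        for _ in range(max_backups):
            if not pq:
                break
            _, _, s = heapq.heappop(pq)
            new_v, _ = bellman_backup(mdp, V, s)
            change = abs(new_v - V[s])                        # how much this state's value moved
            V[s] = new_v
            if change > theta:
                for p in predecessors[s]:                     # reprioritize the predecessors
                    heapq.heappush(pq, (-change, next(tie), p))
        return V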
SLIDE 27

Asynch VI: Real Time Dynamic Programming

[Barto, Bradtke, Singh’95]

  • Trial: simulate the greedy policy starting from the start state; perform a Bellman backup on each visited state

  • RTDP: repeat Trials until value function converges
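An illustrative RTDP sketch, reusing bellman_backup and assuming the MDP container carries a start state and a goal set:

    import random

    def rtdp_trial(mdp, V, max_steps=1000):
        # One trial: follow the greedy policy from s0, backing up each visited state.
        s = mdp.start
        for _ in range(max_steps):
            if s in mdp.goals:
                break
            V[s], a = bellman_backup(mdp, V, s)                  # backup, then act greedily
            succs, probs = zip(*mdp.transitions[(s, a)].items())
            s = random.choices(succs, weights=probs, k=1)[0]     # simulate the outcome

    def rtdp(mdp, trials=1000):
        V = {s: 0.0 for s in mdp.states}
        for _ in range(trials):
            rtdp_trial(mdp, V)
        return V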
SLIDE 28

RTDP Trial (figure): starting at s0, compute Qn+1(s0,a) for the available actions a1, a2, a3 from the current Vn estimates of their successors, set Vn+1(s0) by the backup, take the greedy action (here a_greedy = a2), sample an outcome, and continue toward the Goal.

SLIDE 29

Comments

  • Properties
  • if all states are visited infinitely often then Vn → V*
  • Advantages
  • Anytime: more probable states explored quickly
  • Disadvantages
  • complete convergence can be slow!
SLIDE 30

Reinforcement Learning

SLIDE 31

Reinforcement Learning

  • Still have an MDP
  • Still looking for policy π
  • New twist: don’t know Pr and/or R
  • i.e. don’t know which states are good
  • and what actions do
  • Must actually try out actions to learn
SLIDE 32

Model based methods

  • Visit different states, perform different actions
  • Estimate Pr and R
  • Once the model is built, do planning using V.I. or other methods (see the sketch below)
  • Con: requires _huge_ amounts of data
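A tiny sketch of the estimation step: build an empirical Pr(s'|s,a) and a mean R(s,a,s') from logged transitions, then plan with, e.g., the value-iteration sketch above (names are illustrative):

    from collections import defaultdict

    def estimate_model(experience):
        # experience: list of (s, a, r, s') tuples gathered by acting in the environment
        counts = defaultdict(lambda: defaultdict(int))
        reward_sum = defaultdict(float)
        for s, a, r, s2 in experience:
            counts[(s, a)][s2] += 1
            reward_sum[(s, a, s2)] += r
        transitions, rewards = {}, {}
        for (s, a), succ in counts.items():
            total = sum(succ.values())
            transitions[(s, a)] = {s2: c / total for s2, c in succ.items()}   # empirical Pr(s'|s,a)
            for s2, c in succ.items():
                rewards[(s, a, s2)] = reward_sum[(s, a, s2)] / c              # mean observed reward
        return transitions, rewards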
SLIDE 33

Model free methods

  • Directly learn Q*(s,a) values
  • sample = R(s,a,s') + γ max_{a'} Qn(s',a')
  • Nudge the old estimate towards the new sample
  • Qn+1(s,a) ← (1-α) Qn(s,a) + α [sample]
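A one-function sketch of this update; Q is assumed to be a dict keyed by (state, action), with missing entries treated as 0:

    def q_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
        # sample = R(s,a,s') + gamma * max_a' Q(s',a'); nudge Q(s,a) towards it with step size alpha
        sample = r + gamma * max(Q.get((s2, a2), 0.0) for a2 in actions)
        Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample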
SLIDE 34

Properties

  • Converges to optimal if
  • If you explore enough
  • If you make the learning rate (α) small enough
  • but do not decrease it too quickly
  • ∑_i α(s,a,i) = ∞
  • ∑_i α²(s,a,i) < ∞

where i is the number of visits to (s,a)

SLIDE 35

Model based vs. Model Free RL

  • Model based
  • estimate O(|S|2|A|) parameters
  • requires relatively larger data for learning
  • can make use of background knowledge easily
  • Model free
  • estimate O(|S||A|) parameters
  • requires relatively less data for learning
SLIDE 36

Exploration vs. Exploitation

  • Exploration: choose actions that visit new states in order to obtain more data for better learning.
  • Exploitation: choose actions that maximize the reward given the current learnt model.

  • ε-greedy
    • Each time step flip a coin
    • With prob ε, take an action randomly
    • With prob 1-ε, take the current greedy action
  • Lower ε over time
  • increase exploitation as more learning has happened
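A minimal sketch of ε-greedy action selection over such a Q table (illustrative helper, not from the slides):

    import random

    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        # Explore with probability epsilon, otherwise exploit the current greedy action.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((s, a), 0.0))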
SLIDE 37

Q-learning

  • Problems
  • Too many states to visit during learning
  • Q(s,a) is still a BIG table
  • We want to generalize from small set of training examples
  • Techniques
  • Value function approximators
  • Policy approximators
  • Hierarchical Reinforcement Learning
SLIDE 38

Task Hierarchy: MAXQ Decomposition [Dietterich’00]

Figure (task hierarchy): Root at the top, with subtasks such as Fetch, Deliver, Take, Give, and Navigate(loc), bottoming out in primitive actions Extend-arm, Grab, Release, Move_e, Move_w, Move_s, Move_n. Children of a task are unordered.

SLIDE 39

Partially Observable Markov Decision Processes

SLIDE 40

Partially Observable MDPs

What action next?

Percepts Actions

Environment

Static, Partially Observable, Noisy, Stochastic, Instantaneous, Unpredictable

SLIDE 41

Stochastic, Fully Observable

SLIDE 42

Stochastic, Partially Observable

SLIDE 43

POMDPs

  • In POMDPs we apply the very same idea as in MDPs.
  • Since the state is not observable, the agent has to make its decisions based on the belief state, which is a posterior distribution over states.

  • Let b be the belief of the agent about the current state
  • POMDPs compute a value function over belief space:

V(b) = max_a [ r(b,a) + γ Σ_{b'} Pr(b' | b, a) V(b') ]
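The belief b itself is maintained by a Bayes filter. A small sketch of the discrete belief update, assuming a transition model Pr(s'|s,a) and an observation model Pr(o|s') with illustrative names:

    def belief_update(belief, a, o, transitions, obs_model, states):
        # b'(s') is proportional to Pr(o|s') * sum_s Pr(s'|s,a) * b(s)
        new_b = {}
        for s2 in states:
            predicted = sum(transitions[(s, a)].get(s2, 0.0) * belief.get(s, 0.0) for s in states)
            new_b[s2] = obs_model[(o, s2)] * predicted
        z = sum(new_b.values())                      # normalizer = Pr(o | b, a)
        return {s2: v / z for s2, v in new_b.items()} if z > 0 else new_b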

SLIDE 44

POMDPs

  • Each belief is a probability distribution, so the value function is a function of an entire probability distribution.
  • This is problematic, since probability distributions are continuous.
  • We also have to deal with the huge complexity of belief spaces.
  • For finite worlds with finite state, action, and observation spaces and finite horizons,
  • we can represent the value functions by piecewise linear functions.

SLIDE 45

Applications

  • Robotic control
  • helicopter maneuvering, autonomous vehicles
  • Mars rover - path planning, oversubscription planning
  • elevator planning
  • Game playing - backgammon, tetris, checkers
  • Neuroscience
  • Computational Finance, Sequential Auctions
  • Assisting elderly in simple tasks
  • Spoken dialog management
  • Communication Networks – switching, routing, flow control
  • War planning, evacuation planning