Planning and Optimization F1. Markov Decision Processes (PowerPoint PPT presentation)



SLIDE 1

Planning and Optimization

  • F1. Markov Decision Processes

Malte Helmert and Thomas Keller

Universität Basel

November 27, 2019

SLIDE 2

Motivation Markov Decision Process Policy Summary

Content of this Course

Planning
  • Classical: Foundations, Logic, Heuristics, Constraints
  • Probabilistic: Explicit MDPs, Factored MDPs

SLIDE 3

Content of this Course: Explicit MDPs

Explicit MDPs: Foundations, Linear Programming, Policy Iteration, Value Iteration

SLIDE 4

Motivation

SLIDE 5

Limitations of Classical Planning

timetable for astronauts on ISS

SLIDE 6

Generalization of Classical Planning: Temporal Planning

timetable for astronauts on ISS

  • concurrency required for some experiments
  • optimize makespan

SLIDE 7

Limitations of Classical Planning

kinematics of robotic arm

SLIDE 8

Generalization of Classical Planning: Numeric Planning

kinematics of robotic arm
  • state space is continuous
  • preconditions and effects described by complex functions

SLIDE 9

Limitations of Classical Planning

(figure: 5×5 grid of surface patches)

satellite takes images of patches on earth

SLIDE 10

Generalization of Classical Planning: MDPs

(figure: 5×5 grid of surface patches)

satellite takes images of patches on earth
  • weather forecast is uncertain
  • find solution with lowest expected cost

SLIDE 11

Limitations of Classical Planning

Chess

SLIDE 12

Generalization of Classical Planning: Multiplayer Games

Chess: there is an opponent with a contradictory objective

SLIDE 13

Limitations of Classical Planning

Solitaire

SLIDE 14

Generalization of Classical Planning: POMDPs

Solitaire
  • some state information cannot be observed
  • must reason over belief for good behaviour

SLIDE 15

Limitations of Classical Planning

  • many applications are combinations of these
  • all of these are active research areas
  • we focus on one of them: probabilistic planning with Markov decision processes
  • MDPs are closely related to games (Why?)

SLIDE 16

Markov Decision Process

SLIDE 17

Markov Decision Processes

  • Markov decision processes (MDPs) studied since the 1950s
  • work up to the 1980s mostly on theory and basic algorithms for small to medium-sized MDPs (Part F)
  • today, focus on large, factored MDPs (Part G)
  • fundamental data structure for reinforcement learning (not covered in this course) and for probabilistic planning
  • different variants exist

SLIDE 18

Reminder: Transition Systems

Definition (Transition System)
A transition system is a 6-tuple T = ⟨S, L, c, T, s0, S⋆⟩ where
  • S is a finite set of states,
  • L is a finite set of (transition) labels,
  • c : L → ℝ⁺₀ is a label cost function,
  • T ⊆ S × L × S is the transition relation,
  • s0 ∈ S is the initial state, and
  • S⋆ ⊆ S is the set of goal states.
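The 6-tuple above can be written down directly as a data structure. A minimal sketch in Python (the field names and encodings are our own illustrative choice, not part of the formal notation):

```python
from dataclasses import dataclass

# Sketch of the 6-tuple T = (S, L, c, T, s0, S*).
@dataclass(frozen=True)
class TransitionSystem:
    states: frozenset        # S: finite set of states
    labels: frozenset        # L: finite set of transition labels
    cost: dict               # c: label -> non-negative cost
    transitions: frozenset   # T: set of (s, label, s') triples
    initial: str             # s0: initial state
    goals: frozenset         # S*: set of goal states

# Tiny instance: one truck that can move from L to R.
ts = TransitionSystem(
    states=frozenset({"LL", "LR"}),
    labels=frozenset({"move-R"}),
    cost={"move-R": 1.0},
    transitions=frozenset({("LL", "move-R", "LR")}),
    initial="LL",
    goals=frozenset({"LR"}),
)
```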

SLIDE 19

Reminder: Transition System Example

(figure: transition system with states LR, LL, TL, RL, TR, RR)

Logistics problem with one package, one truck, two locations:
  • location of package: {L, R, T}
  • location of truck: {L, R}

SLIDE 20

Stochastic Shortest Path Problem

Definition (Stochastic Shortest Path Problem)
A stochastic shortest path problem (SSP) is a 6-tuple T = ⟨S, L, c, T, s0, S⋆⟩, where
  • S is a finite set of states,
  • L is a finite set of (transition) labels (or actions),
  • c : L → ℝ⁺₀ is a label cost function,
  • T : S × L × S → [0, 1] is the transition function,
  • s0 ∈ S is the initial state, and
  • S⋆ ⊆ S is the set of goal states.
For all s ∈ S and ℓ ∈ L with T(s, ℓ, s′) > 0 for some s′ ∈ S, we require Σ_{s′∈S} T(s, ℓ, s′) = 1.

Note: An SSP is the probabilistic pendant of a transition system.

SLIDE 21

Reminder: Transition System Example

(figure: transition system with states LR, LL, TL, RL, TR, RR; probabilistic transitions labeled .8 and .2)

Logistics problem with one package, one truck, two locations:
  • location of package: {L, R, T}
  • location of truck: {L, R}
  • if truck moves with package, 20% chance of losing package

SLIDE 22

Markov Decision Process

Definition (Markov Decision Process)
A (discounted reward) Markov decision process (MDP) is a 6-tuple T = ⟨S, L, R, T, s0, γ⟩, where
  • S is a finite set of states,
  • L is a finite set of (transition) labels (or actions),
  • R : S × L → ℝ is the reward function,
  • T : S × L × S → [0, 1] is the transition function,
  • s0 ∈ S is the initial state, and
  • γ ∈ (0, 1) is the discount factor.
For all s ∈ S and ℓ ∈ L with T(s, ℓ, s′) > 0 for some s′ ∈ S, we require Σ_{s′∈S} T(s, ℓ, s′) = 1.
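Compared to the SSP tuple, the cost function c and the goal set S⋆ are replaced by a reward function R and a discount factor γ. As a sketch (field names and dict-based encodings are again our own illustrative choice):

```python
from dataclasses import dataclass

# Sketch of the MDP 6-tuple T = (S, L, R, T, s0, gamma).
@dataclass(frozen=True)
class MDP:
    states: frozenset   # S: finite set of states
    labels: frozenset   # L: finite set of labels (actions)
    reward: dict        # R: (state, label) -> real-valued reward
    trans: dict         # T: (state, label) -> {successor: probability}
    initial: str        # s0: initial state
    gamma: float        # discount factor

    def __post_init__(self):
        # the definition requires gamma strictly between 0 and 1
        assert 0.0 < self.gamma < 1.0
```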

SLIDE 23

Example: Grid World

(figure: 4×3 grid world with initial state s0, reward −1 at (4,2) and +1 at (4,3))

  • moving north goes east with probability 0.4
  • only applicable action in (4,2) and (4,3) is collect, which
    • sets position back to (1,1)
    • gives reward of +1 in (4,3)
    • gives reward of −1 in (4,2)
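The noisy "north" action can be sketched as a transition distribution. The 0.6/0.4 split follows the slide; what happens at the grid border is not specified there, so the stay-in-place behaviour below is an assumption:

```python
# Hypothetical encoding of the grid world's "north" action: with
# probability 0.6 the agent moves north, with 0.4 it drifts east.
# Assumption: bumping into a wall leaves the position unchanged.
def north(pos, width=4, height=3):
    x, y = pos
    outcomes = {}
    for nxt, p in (((x, y + 1), 0.6), ((x + 1, y), 0.4)):
        nx, ny = nxt
        if not (1 <= nx <= width and 1 <= ny <= height):
            nxt = (x, y)  # blocked by the border: stay in place
        outcomes[nxt] = outcomes.get(nxt, 0.0) + p
    return outcomes
```

Note that the two outcomes must be accumulated (not just assigned), so the probabilities still sum to 1 when both moves are blocked in a corner.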

SLIDE 24

Terminology (1)

  • If p := T(s, ℓ, s′) > 0, we write s −p:ℓ→ s′, or s −p→ s′ if not interested in ℓ.
  • If T(s, ℓ, s′) = 1, we also write s −ℓ→ s′, or s → s′ if not interested in ℓ.
  • If T(s, ℓ, s′) > 0 for some s′, we say that ℓ is applicable in s.
  • The set of applicable actions in s is L(s). We assume that L(s) ≠ ∅ for all s ∈ S.

SLIDE 25

Terminology (2)

  • the successor set of s and ℓ is succ(s, ℓ) = {s′ ∈ S | T(s, ℓ, s′) > 0}
  • s′ is a successor of s if s′ ∈ succ(s, ℓ) for some ℓ
  • with s′ ∼ succ(s, ℓ) we denote that a successor s′ ∈ succ(s, ℓ) of s and ℓ is sampled according to the probability distribution T
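Both notions translate directly into code. A sketch, again assuming the transition function is stored as a dict from (state, label) to a successor distribution:

```python
import random

# succ(s, l) = {s' | T(s, l, s') > 0}
def succ(T, s, l):
    return {s2 for s2, p in T.get((s, l), {}).items() if p > 0}

# s' ~ succ(s, l): sample a successor according to T
def sample_successor(T, s, l, rng=random):
    successors, probs = zip(*T[(s, l)].items())
    return rng.choices(successors, weights=probs, k=1)[0]
```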

SLIDE 26

Terminology (3)

  • s′ is reachable from s if there exists a sequence of transitions s0 −p1:ℓ1→ s1, …, s_{n−1} −pn:ℓn→ s_n s.t. s0 = s and s_n = s′
    Note: n = 0 possible; then s = s′
  • s0, …, s_n is called (state) path from s to s′
  • ℓ1, …, ℓn is called (action) path from s to s′
  • length of the path is n
  • cost of the path in an SSP is Σ_{i=1}^{n} c(ℓ_i)
  • reward of the path in an MDP is Σ_{i=1}^{n} γ^{i−1} R(s_{i−1}, ℓ_i)
  • s′ is reached from s through this path with probability Π_{i=1}^{n} p_i
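The three path quantities can be written as executable formulas. A sketch, with the cost function c, the reward function R, and the transition probabilities passed in explicitly as dicts and lists (an assumed encoding):

```python
# cost of a path in an SSP: sum_{i=1}^{n} c(l_i)
def path_cost(labels, c):
    return sum(c[l] for l in labels)

# reward of a path in an MDP: sum_{i=1}^{n} gamma^(i-1) * R(s_{i-1}, l_i)
# states = [s_0, ..., s_n], labels = [l_1, ..., l_n]
def path_reward(states, labels, R, gamma):
    return sum(gamma ** i * R[(states[i], labels[i])]
               for i in range(len(labels)))

# probability of reaching s' via the path: p_1 * ... * p_n
def path_probability(probs):
    result = 1.0
    for p in probs:
        result *= p
    return result
```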

SLIDE 27

Policy

SLIDE 28

Solutions in SSPs

(figure: transition system with states LR, LL, TL, RL, TR, RR)

move-L, pickup, move-R, drop

  • solution in deterministic transition systems is a plan, i.e., a goal path from s0 to some s⋆ ∈ S⋆
  • cheapest plan is optimal solution
  • deterministic agent that executes plan will reach goal

SLIDE 29

Solutions in SSPs

(figure: the same transition system with probabilistic transitions .8/.2; after losing the package, the agent can't drop!)

move-L, pickup, move-R, drop

  • probabilistic agent will not reach goal or cannot execute plan
  • non-determinism can lead to different outcome than anticipated in plan
  • require a more general solution: a policy

SLIDE 30

Solutions in SSPs

(figure: policy on the transition system: move-L in LR, pickup in LL, move-R in TL, drop in TR; probabilistic outcomes .8/.2)

  • policy must be allowed to be cyclic
  • policy must be able to branch over outcomes
  • policy assigns applicable actions to states

SLIDE 31

Policy for SSPs

Definition (Policy for SSPs)
Let T = ⟨S, L, c, T, s0, S⋆⟩ be an SSP. A policy for T is a mapping π : S → L ∪ {⊥} such that π(s) ∈ L(s) ∪ {⊥} for all s.
The set of reachable states Sπ(s) from s under π is defined recursively as the smallest set satisfying the rules
  • s ∈ Sπ(s), and
  • succ(s′, π(s′)) ⊆ Sπ(s) for all s′ ∈ Sπ(s) \ S⋆ where π(s′) ≠ ⊥.
If π(s′) ≠ ⊥ for all s′ ∈ Sπ(s) \ S⋆, then π is executable in s.
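Since Sπ(s) is the smallest set closed under the two rules, it can be computed as a simple fixpoint. A sketch, assuming T maps (state, label) to a successor distribution, the policy maps states to a label or None (playing the role of ⊥), and goals is S⋆:

```python
# Compute Sπ(s): start from {s} and add successors of every reachable
# non-goal state that has an assigned action, until nothing changes.
def reachable_states(T, policy, goals, s):
    seen, frontier = {s}, [s]
    while frontier:
        cur = frontier.pop()
        if cur in goals or policy.get(cur) is None:
            continue  # goal states and ⊥-states are not expanded
        for nxt in T[(cur, policy[cur])]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

# π is executable in s if every reachable non-goal state has an action.
def is_executable(T, policy, goals, s):
    return all(policy.get(s2) is not None
               for s2 in reachable_states(T, policy, goals, s)
               if s2 not in goals)
```

On the logistics example, the policy that moves left, picks up, moves right, and drops reaches exactly the states on (and branching off) that path, and is executable because each of them has an assigned action.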

SLIDE 32

Policy Representation

  • size of explicit representation of executable policy π is |Sπ(s0)|
  • often, |Sπ(s0)| is similar to |S|
  • compact policy representation, e.g. via value function approximation or neural networks, is an active research area ⇒ not covered in this course
  • instead, we consider small state spaces for basic algorithms, or online planning where planning for the current state s0 is interleaved with execution of π(s0)

SLIDE 33

Policy for MDPs

Definition (Policy for MDPs)
Let T = ⟨S, L, R, T, s0, γ⟩ be an MDP. A policy for T is a mapping π : S → L ∪ {⊥} such that π(s) ∈ L(s) ∪ {⊥} for all s.
The set of reachable states Sπ(s) from s under π is defined recursively as the smallest set satisfying the rules
  • s ∈ Sπ(s), and
  • succ(s′, π(s′)) ⊆ Sπ(s) for all s′ ∈ Sπ(s) where π(s′) ≠ ⊥.
If π(s′) ≠ ⊥ for all s′ ∈ Sπ(s), then π is executable in s.

SLIDE 34

Summary

SLIDE 35

Summary

  • Many planning scenarios beyond classical planning
  • Parts F and G are on probabilistic planning
  • SSPs are classical planning + probabilistic transition function
  • MDPs allow state-dependent rewards that are discounted over an infinite horizon
  • Solutions of SSPs and MDPs are policies
  • Policies consider branching and cycles