SLIDE 1

CSE-571 AI-based Mobile Robotics

Planning and Control: Markov Decision Processes

SLIDE 2

Planning

What action next?

Percepts Actions

Environment

  • Static vs. Dynamic
  • Full vs. Partial satisfaction
  • Fully vs. Partially Observable
  • Perfect vs. Noisy
  • Deterministic vs. Stochastic
  • Discrete vs. Continuous
  • Outcomes: Predictable vs. Unpredictable

SLIDE 3

Classical Planning

What action next?

Percepts Actions

Environment

Static, Full, Fully Observable, Perfect, Predictable, Discrete, Deterministic

SLIDE 4

Stochastic Planning

What action next?

Percepts Actions

Environment

Static, Full, Fully Observable, Perfect, Stochastic, Unpredictable, Discrete

SLIDE 5

Deterministic, fully observable

SLIDE 6

Stochastic, Fully Observable

SLIDE 7

Stochastic, Partially Observable

SLIDE 8

Markov Decision Process (MDP)

  • S: A set of states
  • A: A set of actions
  • Pr(s’|s,a): transition model
  • C(s,a,s’): cost model
  • G: set of goals
  • s0: start state
  • γ: discount factor
  • R(s,a,s’): reward model
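
For concreteness, the tuple above could be held in a small container; a minimal Python sketch (the field names and dictionary layout are illustrative assumptions, not from the slides):

```python
# Minimal sketch of the MDP tuple <S, A, Pr, C/R, G, s0, gamma>.
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

@dataclass
class MDP:
    states: List[str]                                     # S
    actions: List[str]                                     # A
    transition: Dict[Tuple[str, str], Dict[str, float]]    # Pr(s'|s,a) as {(s, a): {s': p}}
    reward: Dict[Tuple[str, str, str], float]              # R(s,a,s') (or a cost model C)
    goals: Set[str] = field(default_factory=set)           # G
    start: Optional[str] = None                            # s0
    gamma: float = 0.95                                    # discount factor
```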
SLIDE 9

Role of Discount Factor (γ)

  • Keep the total reward/total cost finite
  • useful for infinite horizon problems
  • sometimes indefinite horizon: if there are deadends
  • Intuition (economics):
  • Money today is worth more than money tomorrow.
  • Total reward: r1 + γ r2 + γ² r3 + …
  • Total cost: c1 + γ c2 + γ² c3 + …
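
A tiny Python illustration (the numbers are made up) of how the discount keeps a long reward stream finite:

```python
# Total discounted reward r1 + gamma*r2 + gamma^2*r3 + ...
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A long stream of reward 1 stays bounded by 1 / (1 - gamma) = 10 for gamma = 0.9:
print(discounted_return([1.0] * 1000, gamma=0.9))   # ~10.0
```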
SLIDE 10

Objective of a Fully Observable MDP

  • Find a policy π: S → A which optimises
  • minimises expected cost to reach a goal, or
  • maximises expected reward, or
  • maximises expected (reward - cost)
  • given a ____ horizon (finite, infinite, or indefinite)
  • with rewards/costs discounted or undiscounted
  • assuming full observability

SLIDE 11

Examples of MDPs

  • Goal-directed, Indefinite Horizon, Cost Minimisation MDP
  • <S, A, Pr, C, G, s0>
  • Infinite Horizon, Discounted Reward Maximisation MDP
  • <S, A, Pr, R, γ>
  • Reward = Σt γ^t rt

  • Goal-directed, Finite Horizon, Prob. Maximisation MDP
  • <S, A, Pr, G, s0, T>
SLIDE 12
Bellman Equations for MDP1

  • <S, A, Pr, C, G, s0>
  • Define J*(s) (optimal cost) as the minimum expected cost to reach a goal from this state.
  • J* should satisfy the following equation, where Q*(s,a) denotes the inner action-value term:
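
In its standard form (for a goal-directed, cost-minimisation MDP):

J*(s) = 0 for s ∈ G
J*(s) = min_{a ∈ A} Q*(s,a),   Q*(s,a) = Σ_{s' ∈ S} Pr(s'|s,a) [ C(s,a,s') + J*(s') ]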

SLIDE 13
Bellman Equations for MDP2

  • <S, A, Pr, R, s0, γ>
  • Define V*(s) (optimal value) as the maximum expected discounted reward from this state.
  • V* should satisfy the following equation:
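
In its standard form (infinite-horizon, discounted reward maximisation):

V*(s) = max_{a ∈ A} Σ_{s' ∈ S} Pr(s'|s,a) [ R(s,a,s') + γ V*(s') ]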

SLIDE 14
Bellman Backup

  • Given an estimate of the V* function (say Vn)
  • Backup the Vn function at state s
  • calculate a new estimate (Vn+1):
  • Qn+1(s,a): value/cost of the strategy:
  • execute action a in s, then execute πn subsequently
  • πn = argmax_{a ∈ Ap(s)} Qn(s,a) (greedy action)

SLIDE 15

Bellman Backup

(Diagram: state s0 with actions a1, a2, a3 leading to successor states s1, s2, s3 with current estimates V0 = 20, V0 = 2, V0 = 3.)

Q1(s,a1) = 20 + 5
Q1(s,a2) = 20 + 0.9 × 2 + 0.1 × 3
Q1(s,a3) = 4 + 3

V1 = max = 25,   a_greedy = a1

SLIDE 16

Value iteration [Bellman’57]

  • assign an arbitrary value V0 to each non-goal state.
  • repeat
  • for all states s

compute Vn+1(s) by Bellman backup at s.

  • until maxs |Vn+1(s) – Vn(s)| < ε

(Residual(s) = |Vn+1(s) – Vn(s)|; the stopping test is ε-convergence of iteration n+1.)
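
A minimal Python sketch of this loop, reusing the illustrative MDP container from above (goal states are pinned at 0; eps is the ε threshold):

```python
# Value iteration: repeatedly apply Bellman backups until the residual < eps.
def value_iteration(mdp, eps=1e-6):
    V = {s: 0.0 for s in mdp.states}                  # arbitrary initial estimate V0
    while True:
        V_new = {}
        for s in mdp.states:
            if s in mdp.goals:
                V_new[s] = 0.0
                continue
            backups = [
                sum(p * (mdp.reward.get((s, a, s2), 0.0) + mdp.gamma * V[s2])
                    for s2, p in mdp.transition[(s, a)].items())
                for a in mdp.actions if (s, a) in mdp.transition
            ]
            V_new[s] = max(backups) if backups else V[s]   # Bellman backup at s
        residual = max(abs(V_new[s] - V[s]) for s in mdp.states)
        V = V_new
        if residual < eps:                             # eps-convergence
            return V
```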

SLIDE 17

Complexity of value iteration

  • One iteration takes O(|A||S|²) time.
  • Number of iterations required
  • poly(|S|,|A|,1/(1-γ))
  • Overall:
  • the algorithm is polynomial in state space
  • thus exponential in number of state variables.
SLIDE 18

Policy Computation

  • Optimal policy is stationary and time-independent
  • for infinite/indefinite horizon problems

Policy Evaluation

  • A system of linear equations in |S| variables.
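
Policy evaluation solves V_pi = R_pi + γ P_pi V_pi, i.e. (I - γ P_pi) V_pi = R_pi. A sketch with NumPy, assuming the |S|×|S| matrix P_pi and vector R_pi have already been assembled for a fixed policy:

```python
import numpy as np

# Solve (I - gamma * P_pi) V_pi = R_pi for a fixed policy pi.
def evaluate_policy(P_pi, R_pi, gamma=0.95):
    n = len(R_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
```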

SLIDE 19

Markov Decision Process (MDP)

(Diagram: a five-state example with states s1 … s5, stochastic transitions (0.7/0.3, 0.9/0.1, 0.3/0.3/0.4, 0.99/0.01, 0.2/0.8) and rewards r = -10, 20, 0, 1, 0.)

SLIDE 20

Value Function and Policy

  • Value residual and policy residual
SLIDE 21

Changing the Search Space

  • Value Iteration
  • Search in value space
  • Compute the resulting policy
  • Policy Iteration [Howard’60]
  • Search in policy space
  • Compute the resulting value
SLIDE 22

Policy iteration [Howard’60]

  • assign an arbitrary policy π0 to each state.
  • repeat
  • compute Vn+1: the evaluation of πn
  • for all states s

compute πn+1(s): argmax_{a ∈ Ap(s)} Qn+1(s,a)

  • until πn+1 = πn

Advantage

  • searching in a finite (policy) space as opposed to

uncountably infinite (value) space ⇒ convergence faster.

  • all other properties follow!

(Policy evaluation is costly: O(n³). Modified Policy Iteration approximates it by a few value-iteration sweeps under the fixed policy.)
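
A compact sketch of the loop above, reusing evaluate_policy from the previous sketch; build_P_R is an assumed helper that assembles P_pi and R_pi for the current policy, with rows ordered like mdp.states:

```python
# Policy iteration: evaluate the current policy, then improve it greedily.
def policy_iteration(mdp, build_P_R, gamma=0.95):
    pi = {s: mdp.actions[0] for s in mdp.states}          # arbitrary pi_0
    idx = {s: i for i, s in enumerate(mdp.states)}
    while True:
        P_pi, R_pi = build_P_R(mdp, pi)                    # assumed helper
        V = evaluate_policy(P_pi, R_pi, gamma)
        new_pi = {
            s: max(mdp.actions,
                   key=lambda a: sum(
                       p * (mdp.reward.get((s, a, s2), 0.0) + gamma * V[idx[s2]])
                       for s2, p in mdp.transition.get((s, a), {}).items()))
            for s in mdp.states
        }
        if new_pi == pi:                                   # until pi_{n+1} = pi_n
            return pi, V
        pi = new_pi
```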

SLIDE 23

LP Formulation

minimise Σ_{s ∈ S} V*(s)

under constraints: for every s, a:

V*(s) ≥ R(s) + γ Σ_{s' ∈ S} Pr(s'|a,s) V*(s')

A big LP. So other tricks used to solve it!
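
A sketch of this LP with SciPy's linprog (illustrative data layout: P maps (state index, action index) to a row of transition probabilities, R is a reward vector):

```python
import numpy as np
from scipy.optimize import linprog

# minimise sum_s V(s)  s.t.  V(s) >= R(s) + gamma * sum_s' Pr(s'|a,s) V(s')  for all (s, a)
def solve_mdp_lp(P, R, n_states, gamma=0.95):
    c = np.ones(n_states)
    A_ub, b_ub = [], []
    for (s, a), row in P.items():
        coeff = gamma * np.asarray(row)     # gamma * Pr(.|a,s) . V  -  V(s)  <=  -R(s)
        coeff[s] -= 1.0
        A_ub.append(coeff)
        b_ub.append(-R[s])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n_states)
    return res.x                            # V*
```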

SLIDE 24

Hybrid MDPs

Hybrid Markov decision process:

Markov state = (n, x), where n is the discrete component (a set of fluents) and x is the continuous component.

Bellman’s equation:

V^{t+1}_n(x) = max_{a ∈ A} Σ_{n' ∈ N} Pr(n' | n, x, a) ∫_{x' ∈ X} Pr(x' | n, x, a, n') [ R_{n'}(x') + V^t_{n'}(x') ] dx'

SLIDE 26
Convolutions

  • discrete ⊗ discrete
  • constant ⊗ discrete [Feng et al. ’04]
  • constant ⊗ constant [Li & Littman ’05]

SLIDE 27

Result of convolutions

Value function (rows) convolved with probability density function (columns):

             discrete    constant    linear
  discrete   discrete    constant    linear
  constant   constant    linear      quadratic
  linear     linear      quadratic   cubic

SLIDE 28

Value Iteration for Motion Planning

(assumes knowledge of robot’s location)

SLIDE 29

Frontier-based Exploration

  • Every unknown location is a target point.
SLIDE 30

Manipulator Control Arm with two joints Configuration space

SLIDE 31

Manipulator Control Path State space Configuration space

SLIDE 32

Manipulator Control Path State space Configuration space

SLIDE 33

Collision Avoidance via Planning

  • Potential field methods have local minima
  • Perform efficient path planning in the local perceptual

space

  • Path costs depend on length and closeness to obstacles

[Konolige, Gradient method]

SLIDE 34

Paths and Costs

  • Path is list of points P={p1, p2,… pk}
  • pk is only point in goal set
  • Cost of path is separable into intrinsic cost at each point

along with adjacency cost of moving from one point to the next

  • Adjacency cost typically Euclidean distance
  • Intrinsic cost typically occupancy, distance to obstacle

F(P) = Σ_i [ I(p_i) + A(p_i, p_{i+1}) ]

SLIDE 35

Navigation Function

  • Assignment of potential field value to every

element in configuration space [Latombe, 91].

  • Goal set is always downhill, no local minima.
  • Navigation function of a point is cost of minimal

cost path that starts at that point.

N_k = min_{P_k} F(P_k)

SLIDE 36

Computation of Navigation Function

  • Initialization
  • Points in goal set → 0 cost
  • All other points → infinite cost
  • Active list → goal set
  • Repeat
  • Take point from active list and update neighbors
  • If cost changes, add the point to the active list
  • Until active list is empty
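
A sketch of this wavefront computation with a priority-queue active list; neighbors(p) and intrinsic_cost(p) are assumed helpers for the grid:

```python
import heapq

# Navigation function: cost of the cheapest path from each cell to the goal set.
def navigation_function(goal_cells, neighbors, intrinsic_cost):
    N = {g: 0.0 for g in goal_cells}                 # goal set -> 0 cost, rest implicitly infinite
    active = [(0.0, g) for g in goal_cells]          # active list <- goal set
    heapq.heapify(active)
    while active:                                    # repeat until the active list is empty
        cost, p = heapq.heappop(active)
        if cost > N.get(p, float("inf")):
            continue                                 # stale entry
        for q, step in neighbors(p):                 # take point from active list, update neighbours
            new_cost = cost + step + intrinsic_cost(q)
            if new_cost < N.get(q, float("inf")):    # cost changed -> add q to the active list
                N[q] = new_cost
                heapq.heappush(active, (new_cost, q))
    return N
```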
SLIDE 37

Challenges

  • Where do we get the state space from?
  • Where do we get the model from?
  • What happens when the world is slightly

different?

  • Where does reward come from?
  • Continuous state variables
  • Continuous action space

SLIDE 38

How to solve larger problems?

  • If deterministic problem
  • Use Dijkstra’s algorithm
  • If no back-edge
  • Use backward Bellman updates
  • Prioritize Bellman updates
  • to maximize information flow
  • If known initial state
  • Use dynamic programming + heuristic search
  • LAO*, RTDP and variants
  • Divide an MDP into sub-MDPs and solve the hierarchy
  • Aggregate states with similar values
  • Relational MDPs
SLIDE 39

Approximations: n-step lookahead

  • n = 1: greedy
  • π1(s) = argmaxa R(s,a)
  • n-step lookahead
  • πn(s) = argmaxa Qn(s,a)
SLIDE 40

Approximation: Incremental approaches

(Loop: deterministic relaxation → deterministic planner → plan → stochastic simulation → identify weakness → solve/merge.)

SLIDE 41

Approximations: Planning and Replanning

(Loop: deterministic relaxation → deterministic planner → plan → execute the action → send the state reached back to the planner.)

SLIDE 42


CSE-571 AI-based Mobile Robotics

Planning and Control: (1) Reinforcement Learning (2) Partially Observable Markov Decision Processes

SLIDE 43

Reinforcement Learning

  • Still have an MDP
  • Still looking for policy π
  • New twist: don’t know Pr and/or R
  • i.e. don’t know which states are good
  • And what actions do
  • Must actually try actions and states out to learn
SLIDE 44

Model based methods

  • Visit different states, perform different actions
  • Estimate Pr and R
  • Once the model is built, do planning using V.I. or other methods
  • Cons: require _huge_ amounts of data
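
A sketch of the estimation step from logged experience, assuming transitions come as (s, a, r, s') tuples:

```python
from collections import defaultdict

# Maximum-likelihood estimates of Pr(s'|s,a) and R(s,a,s') from visit counts.
def estimate_model(transitions):
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    for s, a, r, s2 in transitions:
        counts[(s, a)][s2] += 1
        reward_sum[(s, a, s2)] += r
    Pr, R = {}, {}
    for (s, a), nexts in counts.items():
        total = sum(nexts.values())
        for s2, n in nexts.items():
            Pr[(s, a, s2)] = n / total
            R[(s, a, s2)] = reward_sum[(s, a, s2)] / n
    return Pr, R
```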
SLIDE 45

Model free methods

  • TD learning
  • Directly learn Q*(s,a) values
  • sample = R(s,a,s’) + γ maxa’ Qn(s’,a’)
  • Nudge the old estimate towards the new sample
  • Qn+1(s,a) ← (1-α) Qn(s,a) + α [sample]
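
The update as a small Python sketch (Q kept as a dict keyed by (state, action); alpha is the learning rate):

```python
# One TD / Q-learning update after observing (s, a, r, s').
def q_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    sample = r + gamma * max(Q.get((s2, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
```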
SLIDE 46

Properties

  • Converges to optimal if
  • If you explore enough
  • If you make the learning rate (α) small enough
  • But not decrease it too quickly
SLIDE 47

Exploration vs. Exploitation

  • ε-greedy
  • Each time step flip a coin
  • With prob ε, act randomly
  • With prob 1-ε, take the current greedy action
  • Lower ε over time to increase exploitation as

more learning has happened
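
A minimal sketch of the coin flip described above:

```python
import random

# epsilon-greedy action selection over a dict Q keyed by (state, action).
def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))      # exploit (greedy)
```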

SLIDE 48

Q-learning

  • Problems
  • Too many states to visit during learning
  • Q(s,a) is a BIG table
  • We want to generalize from small set of training

examples

  • Solutions
  • Value function approximators
  • Policy approximators
  • Hierarchical Reinforcement Learning
SLIDE 49

Task Hierarchy: MAXQ Decomposition [Dietterich’00]

(Task hierarchy diagram: Root → {Fetch, Deliver}; these use Navigate(loc), Take, and Give; Take/Give use Extend-arm, Grab, Release; Navigate(loc) → {MoveE, MoveW, MoveS, MoveN}. Children of a task are unordered.)

SLIDE 50

MAXQ Decomposition

  • Augment the state s by adding the subtask i: [s,i].
  • Define C([s,i],j) as the reward received in i after j

finishes.

  • Q([s,Fetch],Navigate(prr)) =

V([s,Navigate(prr)])+C([s,Fetch],Navigate(prr))

  • Express V in terms of C
  • Learn C, instead of learning Q

(V([s,Navigate(prr)]): reward received while navigating; C([s,Fetch],Navigate(prr)): reward received after navigation.)

SLIDE 51

MAXQ Decomposition (contd)

  • State Abstraction
  • Finding irrelevant actions
  • Finding funnel actions
SLIDE 52

POMDPs: Recall example

SLIDE 53

Partially Observable Markov Decision Processes

SLIDE 54

POMDPs

 In POMDPs we apply the very same idea as in

MDPs.

 Since the state is not observable, the agent has

to make its decisions based on the belief state which is a posterior distribution over states.

 Let b be the belief of the agent about the state

under consideration.

 POMDPs compute a value function over belief

space:

SLIDE 55

Problems

 Each belief is a probability distribution, thus,

each value in a POMDP is a function of an entire probability distribution.

 This is problematic, since probability

distributions are continuous.

 Additionally, we have to deal with the huge

complexity of belief spaces.

 For finite worlds with finite state, action, and

measurement spaces and finite horizons, however, we can effectively represent the value functions by piecewise linear functions.

SLIDE 56

An Illustrative Example

(Payoff/measurement diagram: two states x1, x2; terminal actions u1, u2; sensing action u3.
Payoffs: r(x1,u1) = -100, r(x2,u1) = +100, r(x1,u2) = +100, r(x2,u2) = -50.
Measurements: p(z1|x1) = 0.7, p(z2|x1) = 0.3, p(z1|x2) = 0.3, p(z2|x2) = 0.7.
Transitions under u3: 0.2 / 0.8 from x1 and 0.8 / 0.2 from x2.)

SLIDE 57

The Parameters of the Example

 The actions u1 and u2 are terminal actions.  The action u3 is a sensing action that potentially

leads to a state transition.

 The horizon is finite and γ = 1.

SLIDE 58

Payoff in POMDPs

 In MDPs, the payoff (or return)

depended on the state of the system.

 In POMDPs, however, the true state

is not exactly known.

 Therefore, we compute the

expected payoff by integrating

over all states:
SLIDE 59

Payoffs in Our Example (1)

 If we are totally certain that we are in state x1 and

execute action u1, we receive a reward of -100

 If, on the other hand, we definitely know that we

are in x2 and execute u1, the reward is +100.

 In between it is the linear combination of the

extreme values weighted by the probabilities

SLIDE 60

Payoffs in Our Example (2)

SLIDE 61

The Resulting Policy for T=1

 Given we have a finite POMDP with

T=1, we would use V1(b) to determine the optimal policy.

 In our example, the optimal policy

for T=1 is

 This is the upper thick graph in the

diagram.

SLIDE 62

Piecewise Linearity, Convexity

 The resulting value function V1(b) is

the maximum of the three functions at each point

 It is piecewise linear and convex.

SLIDE 63

Pruning

 If we carefully consider V1(b), we see

that only the first two components contribute.

 The third component can therefore

safely be pruned away from V1(b).

SLIDE 64

Increasing the Time Horizon

 Assume the robot can make an observation before

deciding on an action.

V1(b)

SLIDE 65

Increasing the Time Horizon

 Assume the robot can make an observation before

deciding on an action.

 Suppose the robot perceives z1 for which

p(z1 | x1)=0.7 and p(z1| x2)=0.3.

 Given the observation z1 we update the belief using

Bayes rule.

p1' = 0.7 p1 / p(z1),   p2' = 0.3 (1 - p1) / p(z1),   where p(z1) = 0.7 p1 + 0.3 (1 - p1) = 0.4 p1 + 0.3

SLIDE 66

Value Function

(Plot: V1(b) and V1(b|z1) over the updated belief b’(b|z1).)

SLIDE 67

Increasing the Time Horizon

 Assume the robot can make an observation before

deciding on an action.

 Suppose the robot perceives z1 for which

p(z1 | x1)=0.7 and p(z1| x2)=0.3.

 Given the observation z1 we update the belief using

Bayes rule.

 Thus V1(b | z1) is given by

SLIDE 68

Expected Value after Measuring

 Since we do not know in advance

what the next measurement will be, we have to compute the expected belief:

V̄1(b) = E_z[ V1(b | z) ] = Σ_{i=1,2} p(zi) V1(b | zi) = Σ_{i=1,2} V1( p(zi | x1) p1 )

SLIDE 69

Expected Value after Measuring

 Since we do not know in advance

what the next measurement will be, we have to compute the expected belief

SLIDE 70

Resulting Value Function

 The four possible combinations yield the

following function which then can be simplified and pruned.

SLIDE 71

Value Function

(Plot: p(z1) V1(b|z1), p(z2) V1(b|z2), and the expected value V̄1(b) over the updated belief b’(b|z1).)

SLIDE 72

State Transitions (Prediction)

 When the agent selects u3 its state

potentially changes.

 When computing the value

function, we have to take these potential state changes into account.

SLIDE 73

Resulting Value Function after executing u3

 Taking the state transitions into account,

we finally obtain.

SLIDE 74

Value Function after executing u3

(Plot: V̄1(b) and V̄1(b|u3).)

SLIDE 75

Value Function for T=2

 Taking into account that the agent can

either directly perform u1 or u2 or first u3 and then u1 or u2, we obtain (after pruning)

SLIDE 76

Graphical Representation of V2(b)

  • Regions of the belief space: u1 optimal, u2 optimal, unclear
  • The outcome of measuring is important here

SLIDE 77

Deep Horizons and Pruning

 We have now completed a full backup

in belief space.

 This process can be applied

recursively.

 The value functions for T=10 and

T=20 are

SLIDE 78

Deep Horizons and Pruning

SLIDE 79

SLIDE 80

Why Pruning is Essential

 Each update introduces additional linear

components to V.

 Each measurement squares the number of

linear components.

 Thus, an unpruned value function for T=20

includes more than 10^547,864 linear functions.

 At T=30 we have 10^561,012,337 linear functions.  The pruned value functions at T=20, in

comparison, contains only 12 linear components.

 The combinatorial explosion of linear components

in the value function is the major reason why POMDPs are impractical for most applications.

SLIDE 81

POMDP Summary

 POMDPs compute the optimal action in

partially observable, stochastic domains.

 For finite horizon problems, the resulting

value functions are piecewise linear and convex.

 In each iteration the number of linear

constraints grows exponentially.

 POMDPs so far have only been applied

successfully to very small state spaces with small numbers of possible

observations and actions.
SLIDE 82

POMDP Approximations

 Point-based value iteration
 QMDPs
 AMDPs

SLIDE 83

Point-based Value Iteration

 Maintains a set of example beliefs  Only considers constraints that

maximize value function for at least

one of the examples
SLIDE 84

Point-based Value Iteration

(Plots: exact value function vs. PBVI approximation; value functions for T=30.)

SLIDE 85

Example Application

SLIDE 86

Example Application

SLIDE 87

QMDPs

 QMDPs only consider state

uncertainty in the first step

 After that, the world becomes fully

observable.
SLIDE 88

Q(xi, u) = r(xi, u) + Σ_{j=1..N} V(xj) p(xj | u, xi)

Action selection: argmax_u Σ_{i=1..N} pi Q(xi, u)
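
A sketch of QMDP action selection, assuming Q(x,u) was obtained by solving the underlying fully observable MDP:

```python
# QMDP: state uncertainty enters only through the belief-weighted argmax.
def qmdp_action(belief, Q, actions):
    # belief: {state: probability}, Q: {(state, action): value}
    return max(actions, key=lambda u: sum(p * Q[(x, u)] for x, p in belief.items()))
```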

SLIDE 89

Augmented MDPs

 Augmentation adds uncertainty

component to state space, e.g.

 Planning is performed by MDP in

augmented state space

 Transition, observation and payoff

models have to be learned

 =         = dx x b x b x H x H x b b

b b x

) ( log ) ( ) ( , ) ( ) ( max arg