CSE-571 AI-based Mobile Robotics
Planning and Control: Markov Decision Processes
Planning
(Agent-environment loop: the agent receives percepts from the environment and must decide what action to take next.)
- Static vs. Dynamic environment
- Fully vs. Partially Observable
- Perfect vs. Noisy percepts
- Deterministic vs. Stochastic actions
- Discrete vs. Continuous
- Predictable vs. Unpredictable outcomes
- Full vs. Partial satisfaction
Classical Planning
- Static
- Fully Observable
- Perfect percepts
- Deterministic actions
- Discrete
- Predictable outcomes
- Full satisfaction
Stochastic Planning
- Static
- Fully Observable
- Perfect percepts
- Stochastic actions
- Discrete
- Unpredictable outcomes
- Full satisfaction
Deterministic, fully observable
Stochastic, fully observable
Stochastic, partially observable
Markov Decision Process (MDP)
- S: A set of states
- A: A set of actions
- Pr(s’|s,a): transition model
- C(s,a,s’): cost model
- G: set of goals
- s0: start state
- γ: discount factor
- R(s,a,s’): reward model
Role of Discount Factor (γ)
- Keeps the total reward/total cost finite
  - useful for infinite horizon problems
  - sometimes for indefinite horizon: if there are dead ends
- Intuition (economics):
  - Money today is worth more than money tomorrow.
- Total reward: r1 + γr2 + γ²r3 + …
- Total cost: c1 + γc2 + γ²c3 + …
Objective of a Fully Observable MDP
- Find a policy π: S → A which optimises:
  - minimises expected cost to reach a goal
  - maximises expected (discounted or undiscounted) reward
  - maximises expected (reward − cost)
- given a ____ horizon
  - finite
  - infinite
  - indefinite
- assuming full observability
Examples of MDPs
- Goal-directed, Indefinite Horizon, Cost Minimisation MDP
- <S, A, Pr, C, G, s0>
- Infinite Horizon, Discounted Reward Maximisation MDP
- <S, A, Pr, R, γ>
- Reward = Σ_t γ^t r_t
- Goal-directed, Finite Horizon, Prob. Maximisation MDP
- <S, A, Pr, G, s0, T>
Bellman Equations for MDP1
- <S, A, Pr, C, G, s0>
- Define J*(s) {optimal cost} as the minimum expected cost to reach a goal from this state.
- J* should satisfy the following equation:
  J*(s) = 0 if s ∈ G
  J*(s) = min_a Q*(s,a) otherwise, where
  Q*(s,a) = Σ_{s'} Pr(s'|s,a) [ C(s,a,s') + J*(s') ]
Bellman Equations for MDP2
- <S, A, Pr, R, s0, γ>
- Define V*(s) {optimal value} as the maximum expected discounted reward from this state.
- V* should satisfy the following equation:
  V*(s) = max_a Σ_{s'} Pr(s'|s,a) [ R(s,a,s') + γ V*(s') ]
Bellman Backup
- Given an estimate of the V* function (say Vn)
- Backup the Vn function at state s to calculate a new estimate (Vn+1):
  Vn+1(s) = max_{a∈Ap(s)} Qn+1(s,a)
- Qn+1(s,a): value/cost of the strategy:
  - execute action a in s, execute πn subsequently
  - πn = argmax_{a∈Ap(s)} Qn(s,a) (greedy action)
Bellman Backup (example)
(Figure: backup at state s0 with actions a1, a2, a3 and successors s1, s2, s3 with V0 = 20, 2, 3.)
Q1(s,a1) = 20 + 5
Q1(s,a2) = 20 + 0.9 × 2 + 0.1 × 3
Q1(s,a3) = 4 + 3
V1 = max = 25
a_greedy = a1
Value iteration [Bellman’57]
- assign an arbitrary assignment of V0 to each non-goal state.
- repeat
- for all states s
compute Vn+1(s) by Bellman backup at s.
- until maxs |Vn+1(s) − Vn(s)| < ε
(Residual(s) = |Vn+1(s) − Vn(s)|; stopping when the maximum residual over iteration n+1 falls below ε is called ε-convergence.)
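A minimal Python sketch of this loop for a discounted reward-maximisation MDP. The containers P[s][a] (a list of (probability, successor) pairs) and R[s][a][s'] are hypothetical stand-ins for the transition and reward models.

import itertools

# Value iteration: repeated Bellman backups until epsilon-convergence.
def value_iteration(S, A, P, R, gamma=0.9, eps=1e-6):
    V = {s: 0.0 for s in S}                      # arbitrary initial estimate V0
    while True:
        V_new = {}
        for s in S:
            # Bellman backup: V_{n+1}(s) = max_a sum_s' Pr(s'|s,a)[R + gamma*V_n(s')]
            V_new[s] = max(
                sum(p * (R[s][a][s2] + gamma * V[s2]) for p, s2 in P[s][a])
                for a in A
            )
        residual = max(abs(V_new[s] - V[s]) for s in S)
        V = V_new
        if residual < eps:                       # epsilon-convergence reached
            return V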
Complexity of value iteration
- One iteration takes O(|A||S|²) time.
- Number of iterations required
- poly(|S|,|A|,1/(1-γ))
- Overall:
- the algorithm is polynomial in the size of the state space
- thus exponential in the number of state variables
Policy Computation
- The optimal policy is stationary and time-independent for infinite/indefinite horizon problems.
Policy Evaluation
- Evaluating a fixed policy π is a system of linear equations in |S| variables:
  Vπ(s) = Σ_{s'} Pr(s'|s,π(s)) [ R(s,π(s),s') + γ Vπ(s') ]
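As a minimal sketch, this linear system can be solved directly with numpy; P_pi and R_pi are hypothetical arrays holding the transition matrix and expected immediate rewards under the fixed policy π, with states indexed 0..N−1.

import numpy as np

def evaluate_policy(P_pi, R_pi, gamma=0.9):
    N = P_pi.shape[0]
    # Solve (I - gamma * P_pi) V = R_pi: the |S| linear Bellman equations.
    return np.linalg.solve(np.eye(N) - gamma * P_pi, R_pi)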
Markov Decision Process (MDP)
(Figure: example MDP over states s1…s5 with stochastic transitions, probabilities 0.7/0.3, 0.9/0.1, 0.3/0.3/0.4, 0.99/0.01, 0.2/0.8, and rewards r = −10, 20, 0, 1, 0.)
Value Function and Policy
- Value residual and policy residual
Changing the Search Space
- Value Iteration
- Search in value space
- Compute the resulting policy
- Policy Iteration [Howard’60]
- Search in policy space
- Compute the resulting value
Policy iteration [Howard’60]
- assign an arbitrary assignment of π0 to each state.
- repeat
  - compute Vn+1: the evaluation of πn
  - for all states s
    compute πn+1(s): argmax_{a∈Ap(s)} Qn+1(s,a)
- until πn+1 = πn
Advantage
- searching in a finite (policy) space as opposed to an uncountably infinite (value) space ⇒ faster convergence.
- all other properties follow!
Modified Policy Iteration
- exact policy evaluation is costly: O(n³)
- approximate it by value iteration using the fixed policy (see the sketch below)
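A minimal sketch of (modified) policy iteration, using a few value-iteration sweeps under the fixed policy as the approximate evaluation step; P and R are the same hypothetical model containers as in the value-iteration sketch above.

def policy_iteration(S, A, P, R, gamma=0.9, eval_sweeps=50):
    pi = {s: A[0] for s in S}                  # arbitrary initial policy pi_0
    V = {s: 0.0 for s in S}
    while True:
        # Approximate policy evaluation: backups under the *fixed* policy pi.
        for _ in range(eval_sweeps):
            V = {s: sum(p * (R[s][pi[s]][s2] + gamma * V[s2])
                        for p, s2 in P[s][pi[s]]) for s in S}
        # Policy improvement: greedy action w.r.t. the evaluated V.
        pi_new = {s: max(A, key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                                              for p, s2 in P[s][a]))
                  for s in S}
        if pi_new == pi:                       # until pi_{n+1} = pi_n
            return pi, V
        pi = pi_new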
LP Formulation
minimise Σ_{s∈S} V*(s)
under constraints: for every s, a:
  V*(s) ≥ R(s) + γ Σ_{s'∈S} Pr(s'|a,s) V*(s')
A big LP. So other tricks are used to solve it!
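A minimal sketch of this LP using scipy's linprog, assuming small dense arrays P[a] (the |S|×|S| transition matrix per action) and R (state rewards); a real solver would exploit sparsity.

import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, R, gamma=0.9):
    n_actions, n_states = P.shape[0], P.shape[1]
    c = np.ones(n_states)                       # minimise sum_s V(s)
    # V(s) >= R(s) + gamma * sum_s' P[a][s,s'] V(s'), rewritten as
    # (gamma*P[a] - I) V <= -R for linprog's A_ub x <= b_ub form.
    A_ub = np.vstack([gamma * P[a] - np.eye(n_states) for a in range(n_actions)])
    b_ub = np.tile(-R, n_actions)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n_states)
    return res.x                                # V*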
Hybrid MDPs
- Hybrid Markov decision process:
  Markov state = (n, x), where n is the discrete component (a set of fluents) and x is the continuous component.
- Bellman's equation:
  V_n^t(x) = max_{a∈A} Σ_{n'∈N} Pr(n'|n,a,x) ∫_X Pr(x'|n,a,x,n') [ R(x') + V_{n'}^{t−1}(x') ] dx'
Hybrid MDPs
- discrete-discrete
- constant-discrete [Feng et al. '04]
- constant-constant [Li & Littman '05]
Convolutions
Result of convolving a value function with a probability density function:
  value function \ pdf:  discrete   constant    linear
  discrete               discrete   constant    linear
  constant               constant   linear      quadratic
  linear                 linear     quadratic   cubic
Value Iteration for Motion Planning
(assumes knowledge of robot’s location)
Frontier-based Exploration
- Every unknown location is a target point.
Manipulator Control
(Figure: arm with two joints; configuration space.)
Manipulator Control: Path
(Figure: path in state space vs. configuration space.)
Collision Avoidance via Planning
- Potential field methods have local minima
- Perform efficient path planning in the local perceptual space
- Path costs depend on length and closeness to obstacles
[Konolige, Gradient method]
Paths and Costs
- Path is a list of points P = {p1, p2, …, pk}
  - pk is the only point in the goal set
- Cost of a path is separable into an intrinsic cost at each point plus an adjacency cost of moving from one point to the next:
  F(P) = Σᵢ [ I(pᵢ) + A(pᵢ, pᵢ₊₁) ]
- Adjacency cost is typically Euclidean distance
- Intrinsic cost is typically occupancy, distance to obstacle
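A minimal sketch of this separable cost, with Euclidean adjacency cost and a hypothetical intrinsic_cost callback (e.g. an occupancy or distance-to-obstacle lookup):

import math

def path_cost(P, intrinsic_cost):
    # F(P) = sum_i I(p_i) + A(p_i, p_{i+1})
    cost = sum(intrinsic_cost(p) for p in P)
    cost += sum(math.dist(P[i], P[i + 1]) for i in range(len(P) - 1))
    return cost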
Navigation Function
- Assignment of a potential field value to every element in configuration space [Latombe, 91].
- Goal set is always downhill, no local minima.
- Navigation function of a point k is the cost of the minimal-cost path that starts at that point:
  N_k = min_{P_k} F(P_k)
Computation of Navigation Function
- Initialization
  - Points in goal set: 0 cost
  - All other points: infinite cost
  - Active list ← goal set
- Repeat
  - Take a point from the active list and update its neighbors
  - If a cost changes, add that point to the active list
- Until active list is empty
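A minimal sketch of this computation as a Dijkstra-style wavefront over a grid; goal_cells, neighbors(c) and step_cost(c, n) (intrinsic + adjacency) are hypothetical inputs supplied by the grid representation.

import heapq

def navigation_function(goal_cells, neighbors, step_cost):
    N = {c: 0.0 for c in goal_cells}             # goal set: zero cost
    active = [(0.0, c) for c in goal_cells]      # all other points implicitly infinite
    heapq.heapify(active)
    while active:
        cost, c = heapq.heappop(active)
        if cost > N.get(c, float("inf")):
            continue                             # stale entry, skip
        for n in neighbors(c):
            new_cost = cost + step_cost(c, n)
            if new_cost < N.get(n, float("inf")):
                N[n] = new_cost                  # cost changed: re-activate point
                heapq.heappush(active, (new_cost, n))
    return N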
Challenges
- Where do we get the state space from?
- Where do we get the model from?
- What happens when the world is slightly different?
- Where does reward come from?
- Continuous state variables
- Continuous action space
How to solve larger problems?
- If deterministic problem
- Use Dijkstra's algorithm
- If no back-edge
- Use backward Bellman updates
- Prioritize Bellman updates
- to maximize information flow
- If known initial state
- Use dynamic programming + heuristic search
- LAO*, RTDP and variants
- Divide an MDP into sub-MDPs and solve the hierarchy
- Aggregate states with similar values
- Relational MDPs
Approximations: n-step lookahead
- n=1: greedy
  - π1(s) = argmax_a R(s,a)
- n-step lookahead
  - πn(s) = argmax_a Qn(s,a)
Approximations: Incremental approaches
(Loop: deterministic relaxation → deterministic planner → plan → stochastic simulation → identify weakness → solve/merge.)
Approximations: Planning and Replanning
(Loop: deterministic relaxation → deterministic planner → plan → execute the action → send the state reached back to the planner.)
CSE-571 AI-based Mobile Robotics
Planning and Control: (1) Reinforcement Learning (2) Partially Observable Markov Decision Processes
Reinforcement Learning
- Still have an MDP
- Still looking for a policy
- New twist: don't know Pr and/or R
  - i.e. don't know which states are good and what the actions do
- Must actually try out actions and states to learn
Model based methods
- Visit different states, perform different actions
- Estimate Pr and R
- Once the model is built, do planning using V.I. or other methods
- Cons: require _huge_ amounts of data
Model free methods
- TD learning
- Directly learn Q*(s,a) values
- sample = R(s,a,s') + γ max_{a'} Qn(s',a')
- Nudge the old estimate towards the new sample:
  Qn+1(s,a) ← (1−α) Qn(s,a) + α·[sample]
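A minimal sketch of this update with a dict-backed Q table (hypothetical setup):

from collections import defaultdict

Q = defaultdict(float)          # Q[(s, a)] -> current value estimate

def td_update(s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    # sample = R(s,a,s') + gamma * max_a' Q_n(s', a')
    sample = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    # nudge the old estimate towards the new sample
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample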
Properties
- Converges to optimal
  - if you explore enough
  - if you make the learning rate (α) small enough
  - but do not decrease it too quickly
Exploration vs. Exploitation
- ε-greedy
  - Each time step, flip a coin
  - With prob ε, act randomly
  - With prob 1−ε, take the current greedy action
- Lower ε over time to increase exploitation as more learning has happened
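A minimal sketch of ε-greedy action selection over a Q table like the one in the TD-learning sketch above:

import random

def epsilon_greedy(Q, s, actions, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)               # explore: random action
    return max(actions, key=lambda a: Q[(s, a)])    # exploit: greedy action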
Q-learning
- Problems
- Too many states to visit during learning
- Q(s,a) is a BIG table
- We want to generalize from a small set of training examples
- Solutions
- Value function approximators
- Policy approximators
- Hierarchical Reinforcement Learning
Task Hierarchy: MAXQ Decomposition [Dietterich’00]
(Task graph: Root with subtasks such as Deliver, Fetch, Take, Give, Navigate(loc), and primitive actions Extend-arm, Grab, Release, MoveE, MoveW, MoveS, MoveN. Children of a task are unordered.)
MAXQ Decomposition
- Augment the state s by adding the subtask i: [s,i].
- Define C([s,i],j) as the reward received in i after j finishes.
- Q([s,Fetch], Navigate(prr)) = V([s,Navigate(prr)]) + C([s,Fetch], Navigate(prr))
  (V: reward received while navigating; C: reward received after navigation)
- Express V in terms of C
- Learn C, instead of learning Q
MAXQ Decomposition (contd)
- State Abstraction
- Finding irrelevant actions
- Finding funnel actions
POMDPs: Recall example
Partially Observable Markov Decision Processes
POMDPs
In POMDPs we apply the very same idea as in MDPs.
Since the state is not observable, the agent has to make its decisions based on the belief state, which is a posterior distribution over states.
Let b be the belief of the agent about the state under consideration.
POMDPs compute a value function over belief space:
  V_T(b) = max_u [ r(b,u) + γ ∫ V_{T−1}(b') p(b'|u,b) db' ]
Problems
Each belief is a probability distribution; thus, each value in a POMDP is a function of an entire probability distribution.
This is problematic, since probability distributions are continuous.
Additionally, we have to deal with the huge complexity of belief spaces.
For finite worlds with finite state, action, and measurement spaces and finite horizons, however, we can effectively represent the value functions by piecewise linear functions.
An Illustrative Example
(Figure: a two-state example. States x1, x2; terminal actions u1, u2 with payoffs r(x1,u1) = −100, r(x2,u1) = +100, r(x1,u2) = +100, r(x2,u2) = −50; sensing action u3 with state-transition probabilities 0.8/0.2; measurements z1, z2 with p(z1|x1) = 0.7, p(z1|x2) = 0.3.)
The Parameters of the Example
The actions u1 and u2 are terminal actions.
The action u3 is a sensing action that potentially leads to a state transition.
The horizon is finite and γ = 1.
Payoff in POMDPs
In MDPs, the payoff (or return) depended on the state of the system.
In POMDPs, however, the true state is not exactly known.
Therefore, we compute the expected payoff by integrating over all states:
  r(b,u) = E_x[ r(x,u) ] = ∫ r(x,u) b(x) dx
Payoffs in Our Example (1)
If we are totally certain that we are in state x1 and execute action u1, we receive a reward of −100.
If, on the other hand, we definitely know that we are in x2 and execute u1, the reward is +100.
In between, it is the linear combination of the extreme values weighted by the probabilities:
  r(b,u1) = −100 p1 + 100 (1 − p1), with p1 = b(x1)
Payoffs in Our Example (2)
The Resulting Policy for T=1
Given we have a finite POMDP with T=1, we would use V1(b) to determine the optimal policy.
In our example, the optimal policy for T=1 picks, at each belief, the action whose payoff line is highest.
This is the upper thick graph in the diagram.
Piecewise Linearity, Convexity
The resulting value function V1(b) is the maximum of the three functions at each point.
It is piecewise linear and convex.
Pruning
If we carefully consider V1(b), we see that only the first two components contribute.
The third component can therefore safely be pruned away from V1(b).
Increasing the Time Horizon
Assume the robot can make an observation before deciding on an action.
(Figure: V1(b).)
Increasing the Time Horizon
Assume the robot can make an observation before deciding on an action.
Suppose the robot perceives z1, for which p(z1|x1) = 0.7 and p(z1|x2) = 0.3.
Given the observation z1 we update the belief using Bayes rule:
  p1' = 0.7 p1 / p(z1)
  p2' = 0.3 (1 − p1) / p(z1)
  with p(z1) = 0.7 p1 + 0.3 (1 − p1) = 0.4 p1 + 0.3
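A minimal sketch of this Bayes update for the two-state example, with p1 = b(x1) and the measurement model from the slides:

def belief_update_z1(p1, pz1_x1=0.7, pz1_x2=0.3):
    pz1 = pz1_x1 * p1 + pz1_x2 * (1 - p1)   # p(z1) = 0.4*p1 + 0.3
    return pz1_x1 * p1 / pz1                # posterior p1' = b'(x1 | z1)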
Value Function
(Figure: b'(b|z1), V1(b), V1(b|z1).)
Increasing the Time Horizon
Assume the robot can make an observation before deciding on an action.
Suppose the robot perceives z1, for which p(z1|x1) = 0.7 and p(z1|x2) = 0.3.
Given the observation z1 we update the belief using Bayes rule.
Thus V1(b|z1) is given by evaluating V1 at the updated belief.
Expected Value after Measuring
Since we do not know in advance what the next measurement will be, we have to compute the expected belief:
  \bar{V}1(b) = E_z[ V1(b|z) ] = Σ_{i=1}^{2} p(zi) V1(b|zi)
Expected Value after Measuring
Since we do not know in advance what the next measurement will be, we have to compute the expected belief.
Resulting Value Function
The four possible combinations yield the following function, which can then be simplified and pruned.
Value Function
(Figure: b'(b|z1), p(z1) V1(b|z1), p(z2) V1(b|z2), \bar{V}1(b).)
State Transitions (Prediction)
When the agent selects u3, its state potentially changes.
When computing the value function, we have to take these potential state changes into account.
Resulting Value Function after executing u3
Taking the state transitions into account, we finally obtain the value function after executing u3.
Value Function after executing u3
(Figure: \bar{V}1(b) and \bar{V}1(b|u3).)
Value Function for T=2
Taking into account that the agent can either directly perform u1 or u2, or first u3 and then u1 or u2, we obtain V2(b) (after pruning).
Graphical Representation of V2(b)
- u1 optimal
- u2 optimal
- unclear: the outcome of measuring is important here
Deep Horizons and Pruning
We have now completed a full backup in belief space.
This process can be applied recursively.
The value functions for T=10 and T=20 are shown on the next slide.
Deep Horizons and Pruning
(Figure: value functions for T=10 and T=20.)
Why Pruning is Essential
Each update introduces additional linear components to V.
Each measurement squares the number of linear components.
Thus, an unpruned value function for T=20 includes more than 10^547,864 linear functions; at T=30 we have 10^561,012,337 linear functions.
The pruned value function at T=20, in comparison, contains only 12 linear components.
The combinatorial explosion of linear components in the value function is the major reason why POMDPs are impractical for most applications.
POMDP Summary
POMDPs compute the optimal action in partially observable, stochastic domains.
For finite horizon problems, the resulting value functions are piecewise linear and convex.
In each iteration the number of linear constraints grows exponentially.
POMDPs have so far only been applied successfully to very small state spaces with small numbers of possible observations and actions.
POMDP Approximations
- Point-based value iteration
- QMDPs
- AMDPs
Point-based Value Iteration
Maintains a set of example beliefs.
Only considers constraints that maximize the value function for at least one of the examples.
Point-based Value Iteration
(Figure: exact value function vs. PBVI; value functions for T=30.)
Example Application
QMDPs
QMDPs only consider state uncertainty in the first step.
After that, the world is assumed to become fully observable.
Q(x_i, u) = r(x_i, u) + Σ_{j=1}^{N} V(x_j) p(x_j | u, x_i)

u* = argmax_u Σ_{i=1}^{N} p_i Q(x_i, u)
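A minimal sketch of QMDP action selection, assuming dense numpy arrays V (MDP value function over states), P[u] (transition matrix per action), R (payoffs, states × actions) and belief vector b:

import numpy as np

def qmdp_action(b, V, P, R):
    n_actions = R.shape[1]
    # Q(x, u) = r(x, u) + sum_x' p(x'|u, x) V(x')
    Q = np.stack([R[:, u] + P[u] @ V for u in range(n_actions)], axis=1)
    # Choose the action maximising the belief-weighted Q values.
    return int(np.argmax(b @ Q))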
Augmented MDPs
Augmentation adds an uncertainty component to the state space, e.g.
  \bar{b} = ( argmax_x b(x), H_b(x) ),  with entropy H_b(x) = −∫ b(x) log b(x) dx
Planning is performed by an MDP in the augmented state space.
Transition, observation and payoff models have to be learned.
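A minimal sketch of this compression of a belief vector into the augmented state (most likely state, belief entropy):

import numpy as np

def augmented_state(b):
    b = np.asarray(b, dtype=float)
    entropy = -np.sum(b[b > 0] * np.log(b[b > 0]))   # H_b = -sum b log b
    return int(np.argmax(b)), entropy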