CSE-571 AI-based Mobile Robotics
Planning and Control: Markov Decision Processes
Planning
(Agent-environment loop: the agent receives percepts from the environment and must decide what action to take next.)
- Static vs. Dynamic environment
- Fully vs. Partially Observable
- Perfect vs. Noisy percepts
- Deterministic vs. Stochastic actions
- Discrete vs. Continuous
- Predictable vs. Unpredictable outcomes
- Full vs. Partial satisfaction
Classical Planning
- Static
- Fully Observable
- Perfect percepts
- Deterministic actions
- Discrete
- Predictable outcomes
- Full satisfaction
Stochastic Planning
- Static
- Fully Observable
- Perfect percepts
- Stochastic actions
- Discrete
- Unpredictable outcomes
- Full satisfaction
Deterministic, fully observable
Stochastic, fully observable
Stochastic, partially observable
Markov Decision Process (MDP)
- S: A set of states
- A: A set of actions
- Pr(s’|s,a): transition model
- C(s,a,s’): cost model
- G: set of goals
- s0: start state
- γ: discount factor
- R(s,a,s’): reward model
Role of Discount Factor (γ)
- Keeps the total reward/total cost finite
  - useful for infinite horizon problems
  - sometimes for indefinite horizon: if there are dead ends
- Intuition (economics):
  - Money today is worth more than money tomorrow.
- Total reward: r1 + γr2 + γ²r3 + …
- Total cost: c1 + γc2 + γ²c3 + …
Objective of a Fully Observable MDP
- Find a policy π: S → A which optimises:
  - minimises expected cost to reach a goal
  - maximises expected (discounted or undiscounted) reward
  - maximises expected (reward − cost)
- given a ____ horizon
  - finite
  - infinite
  - indefinite
- assuming full observability
Examples of MDPs
- Goal-directed, Indefinite Horizon, Cost Minimisation MDP
- <S, A, Pr, C, G, s0>
- Infinite Horizon, Discounted Reward Maximisation MDP
- <S, A, Pr, R, γ>
- Reward = Σ_t γ^t r_t
- Goal-directed, Finite Horizon, Prob. Maximisation MDP
- <S, A, Pr, G, s0, T>
Bellman Equations for MDP1
- <S, A, Pr, C, G, s0>
- Define J*(s) {optimal cost} as the minimum expected cost to reach a goal from this state.
- J* should satisfy the following equation:
  J*(s) = 0 if s ∈ G
  J*(s) = min_a Q*(s,a) otherwise, where
  Q*(s,a) = Σ_{s'} Pr(s'|s,a) [ C(s,a,s') + J*(s') ]
Bellman Equations for MDP2
- <S, A, Pr, R, s0, γ>
- Define V*(s) {optimal value} as the maximum expected discounted reward from this state.
- V* should satisfy the following equation:
  V*(s) = max_a Σ_{s'} Pr(s'|s,a) [ R(s,a,s') + γ V*(s') ]
Bellman Backup
- Given an estimate of the V* function (say Vn)
- Backup the Vn function at state s to calculate a new estimate (Vn+1):
  Vn+1(s) = max_{a∈Ap(s)} Qn+1(s,a)
- Qn+1(s,a): value/cost of the strategy:
  - execute action a in s, execute πn subsequently
  - πn = argmax_{a∈Ap(s)} Qn(s,a) (greedy action)
Bellman Backup (example)
(Figure: backup at state s0 with actions a1, a2, a3 and successors s1, s2, s3 with V0 = 20, 2, 3.)
Q1(s,a1) = 20 + 5
Q1(s,a2) = 20 + 0.9 × 2 + 0.1 × 3
Q1(s,a3) = 4 + 3
V1 = max = 25
a_greedy = a1
Value iteration [Bellman’57]
- assign an arbitrary assignment of V0 to each non-goal state.
- repeat
- for all states s
compute Vn+1(s) by Bellman backup at s.
- until maxs |Vn+1(s) − Vn(s)| < ε
(Residual(s) = |Vn+1(s) − Vn(s)|; stopping when the maximum residual over iteration n+1 falls below ε is called ε-convergence.)
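A minimal Python sketch of this loop for a discounted reward-maximisation MDP. The containers P[s][a] (a list of (probability, successor) pairs) and R[s][a][s'] are hypothetical stand-ins for the transition and reward models.

import itertools

# Value iteration: repeated Bellman backups until epsilon-convergence.
def value_iteration(S, A, P, R, gamma=0.9, eps=1e-6):
    V = {s: 0.0 for s in S}                      # arbitrary initial estimate V0
    while True:
        V_new = {}
        for s in S:
            # Bellman backup: V_{n+1}(s) = max_a sum_s' Pr(s'|s,a)[R + gamma*V_n(s')]
            V_new[s] = max(
                sum(p * (R[s][a][s2] + gamma * V[s2]) for p, s2 in P[s][a])
                for a in A
            )
        residual = max(abs(V_new[s] - V[s]) for s in S)
        V = V_new
        if residual < eps:                       # epsilon-convergence reached
            return V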
Complexity of value iteration
- One iteration takes O(|A||S|²) time.
- Number of iterations required
- poly(|S|,|A|,1/(1-γ))
- Overall:
- the algorithm is polynomial in the size of the state space
- thus exponential in the number of state variables
Policy Computation
- The optimal policy is stationary and time-independent for infinite/indefinite horizon problems.
Policy Evaluation
- Evaluating a fixed policy π is a system of linear equations in |S| variables:
  Vπ(s) = Σ_{s'} Pr(s'|s,π(s)) [ R(s,π(s),s') + γ Vπ(s') ]
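As a minimal sketch, this linear system can be solved directly with numpy; P_pi and R_pi are hypothetical arrays holding the transition matrix and expected immediate rewards under the fixed policy π, with states indexed 0..N−1.

import numpy as np

def evaluate_policy(P_pi, R_pi, gamma=0.9):
    N = P_pi.shape[0]
    # Solve (I - gamma * P_pi) V = R_pi: the |S| linear Bellman equations.
    return np.linalg.solve(np.eye(N) - gamma * P_pi, R_pi)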
Markov Decision Process (MDP)
(Figure: example MDP over states s1…s5 with stochastic transitions, probabilities 0.7/0.3, 0.9/0.1, 0.3/0.3/0.4, 0.99/0.01, 0.2/0.8, and rewards r = −10, 20, 0, 1, 0.)
Value Function and Policy
- Value residual and policy residual
Changing the Search Space
- Value Iteration
- Search in value space
- Compute the resulting policy
- Policy Iteration [Howard’60]
- Search in policy space
- Compute the resulting value
Policy iteration [Howard’60]
- assign an arbitrary assignment of π0 to each state.
- repeat
  - compute Vn+1: the evaluation of πn
  - for all states s
    compute πn+1(s): argmax_{a∈Ap(s)} Qn+1(s,a)
- until πn+1 = πn
Advantage
- searching in a finite (policy) space as opposed to an uncountably infinite (value) space ⇒ faster convergence.
- all other properties follow!
Modified Policy Iteration
- exact policy evaluation is costly: O(n³)
- approximate it by value iteration using the fixed policy (see the sketch below)
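A minimal sketch of (modified) policy iteration, using a few value-iteration sweeps under the fixed policy as the approximate evaluation step; P and R are the same hypothetical model containers as in the value-iteration sketch above.

def policy_iteration(S, A, P, R, gamma=0.9, eval_sweeps=50):
    pi = {s: A[0] for s in S}                  # arbitrary initial policy pi_0
    V = {s: 0.0 for s in S}
    while True:
        # Approximate policy evaluation: backups under the *fixed* policy pi.
        for _ in range(eval_sweeps):
            V = {s: sum(p * (R[s][pi[s]][s2] + gamma * V[s2])
                        for p, s2 in P[s][pi[s]]) for s in S}
        # Policy improvement: greedy action w.r.t. the evaluated V.
        pi_new = {s: max(A, key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                                              for p, s2 in P[s][a]))
                  for s in S}
        if pi_new == pi:                       # until pi_{n+1} = pi_n
            return pi, V
        pi = pi_new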
LP Formulation
minimise Σ_{s∈S} V*(s)
under constraints: for every s, a:
  V*(s) ≥ R(s) + γ Σ_{s'∈S} Pr(s'|a,s) V*(s')
A big LP. So other tricks are used to solve it!
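A minimal sketch of this LP using scipy's linprog, assuming small dense arrays P[a] (the |S|×|S| transition matrix per action) and R (state rewards); a real solver would exploit sparsity.

import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, R, gamma=0.9):
    n_actions, n_states = P.shape[0], P.shape[1]
    c = np.ones(n_states)                       # minimise sum_s V(s)
    # V(s) >= R(s) + gamma * sum_s' P[a][s,s'] V(s'), rewritten as
    # (gamma*P[a] - I) V <= -R for linprog's A_ub x <= b_ub form.
    A_ub = np.vstack([gamma * P[a] - np.eye(n_states) for a in range(n_actions)])
    b_ub = np.tile(-R, n_actions)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n_states)
    return res.x                                # V*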
Hybrid MDPs
- Hybrid Markov decision process:
  Markov state = (n, x), where n is the discrete component (a set of fluents) and x is the continuous component.
- Bellman's equation:
  V_n^t(x) = max_{a∈A} Σ_{n'∈N} Pr(n'|n,a,x) ∫_X Pr(x'|n,a,x,n') [ R(x') + V_{n'}^{t−1}(x') ] dx'
Hybrid MDPs
- discrete-discrete
- constant-discrete [Feng et al. '04]
- constant-constant [Li & Littman '05]
Convolutions
Result of convolving a value function with a probability density function:
  value function \ pdf:  discrete   constant    linear
  discrete               discrete   constant    linear
  constant               constant   linear      quadratic
  linear                 linear     quadratic   cubic
Value Iteration for Motion Planning
(assumes knowledge of robot’s location)
Frontier-based Exploration
- Every unknown location is a target point.
Manipulator Control
(Figure: arm with two joints; configuration space.)
Manipulator Control: Path
(Figure: path in state space vs. configuration space.)
Collision Avoidance via Planning
- Potential field methods have local minima
- Perform efficient path planning in the local perceptual space
- Path costs depend on length and closeness to obstacles
[Konolige, Gradient method]
Paths and Costs
- Path is a list of points P = {p1, p2, …, pk}
  - pk is the only point in the goal set
- Cost of a path is separable into an intrinsic cost at each point plus an adjacency cost of moving from one point to the next:
  F(P) = Σᵢ [ I(pᵢ) + A(pᵢ, pᵢ₊₁) ]
- Adjacency cost is typically Euclidean distance
- Intrinsic cost is typically occupancy, distance to obstacle
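A minimal sketch of this separable cost, with Euclidean adjacency cost and a hypothetical intrinsic_cost callback (e.g. an occupancy or distance-to-obstacle lookup):

import math

def path_cost(P, intrinsic_cost):
    # F(P) = sum_i I(p_i) + A(p_i, p_{i+1})
    cost = sum(intrinsic_cost(p) for p in P)
    cost += sum(math.dist(P[i], P[i + 1]) for i in range(len(P) - 1))
    return cost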
Navigation Function
- Assignment of a potential field value to every element in configuration space [Latombe, 91].
- Goal set is always downhill, no local minima.
- Navigation function of a point k is the cost of the minimal-cost path that starts at that point:
  N_k = min_{P_k} F(P_k)
Computation of Navigation Function
- Initialization
  - Points in goal set: 0 cost
  - All other points: infinite cost
  - Active list ← goal set
- Repeat
  - Take a point from the active list and update its neighbors
  - If a cost changes, add that point to the active list
- Until active list is empty
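A minimal sketch of this computation as a Dijkstra-style wavefront over a grid; goal_cells, neighbors(c) and step_cost(c, n) (intrinsic + adjacency) are hypothetical inputs supplied by the grid representation.

import heapq

def navigation_function(goal_cells, neighbors, step_cost):
    N = {c: 0.0 for c in goal_cells}             # goal set: zero cost
    active = [(0.0, c) for c in goal_cells]      # all other points implicitly infinite
    heapq.heapify(active)
    while active:
        cost, c = heapq.heappop(active)
        if cost > N.get(c, float("inf")):
            continue                             # stale entry, skip
        for n in neighbors(c):
            new_cost = cost + step_cost(c, n)
            if new_cost < N.get(n, float("inf")):
                N[n] = new_cost                  # cost changed: re-activate point
                heapq.heappush(active, (new_cost, n))
    return N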
Challenges
- Where do we get the state space from?
- Where do we get the model from?
- What happens when the world is slightly different?
- Where does reward come from?
- Continuous state variables
- Continuous action space
How to solve larger problems?
- If deterministic problem
- Use Dijkstra's algorithm
- If no back-edge
- Use backward Bellman updates
- Prioritize Bellman updates
- to maximize information flow
- If known initial state
- Use dynamic programming + heuristic search
- LAO*, RTDP and variants
- Divide an MDP into sub-MDPs and solve the hierarchy
- Aggregate states with similar values
- Relational MDPs
Approximations: n-step lookahead
- n=1: greedy
  - π1(s) = argmax_a R(s,a)
- n-step lookahead
  - πn(s) = argmax_a Qn(s,a)
Approximations: Incremental approaches
(Loop: deterministic relaxation → deterministic planner → plan → stochastic simulation → identify weakness → solve/merge.)
Approximations: Planning and Replanning
(Loop: deterministic relaxation → deterministic planner → plan → execute the action → send the state reached back to the planner.)
CSE-571 AI-based Mobile Robotics
Planning and Control: (1) Reinforcement Learning (2) Partially Observable Markov Decision Processes
Reinforcement Learning
- Still have an MDP
- Still looking for a policy
- New twist: don't know Pr and/or R
  - i.e. don't know which states are good and what the actions do
- Must actually try out actions and states to learn
Model based methods
- Visit different states, perform different actions
- Estimate Pr and R
- Once the model is built, do planning using V.I. or other methods
- Cons: require _huge_ amounts of data
Model free methods
- TD learning
- Directly learn Q*(s,a) values
- sample = R(s,a,s') + γ max_{a'} Qn(s',a')
- Nudge the old estimate towards the new sample:
  Qn+1(s,a) ← (1−α) Qn(s,a) + α·[sample]
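A minimal sketch of this update with a dict-backed Q table (hypothetical setup):

from collections import defaultdict

Q = defaultdict(float)          # Q[(s, a)] -> current value estimate

def td_update(s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    # sample = R(s,a,s') + gamma * max_a' Q_n(s', a')
    sample = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    # nudge the old estimate towards the new sample
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample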
Properties
- Converges to optimal
  - if you explore enough
  - if you make the learning rate (α) small enough
  - but do not decrease it too quickly
Exploration vs. Exploitation
- ε-greedy
  - Each time step, flip a coin
  - With prob ε, act randomly
  - With prob 1−ε, take the current greedy action
- Lower ε over time to increase exploitation as more learning has happened
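A minimal sketch of ε-greedy action selection over a Q table like the one in the TD-learning sketch above:

import random

def epsilon_greedy(Q, s, actions, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)               # explore: random action
    return max(actions, key=lambda a: Q[(s, a)])    # exploit: greedy action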
Q-learning
- Problems
- Too many states to visit during learning
- Q(s,a) is a BIG table
- We want to generalize from a small set of training examples
- Solutions
- Value function approximators
- Policy approximators
- Hierarchical Reinforcement Learning
Task Hierarchy: MAXQ Decomposition [Dietterich’00]
(Task graph: Root with subtasks such as Deliver, Fetch, Take, Give, Navigate(loc), and primitive actions Extend-arm, Grab, Release, MoveE, MoveW, MoveS, MoveN. Children of a task are unordered.)
MAXQ Decomposition
- Augment the state s by adding the subtask i: [s,i].
- Define C([s,i],j) as the reward received in i after j finishes.
- Q([s,Fetch], Navigate(prr)) = V([s,Navigate(prr)]) + C([s,Fetch], Navigate(prr))
  (V: reward received while navigating; C: reward received after navigation)
- Express V in terms of C
- Learn C, instead of learning Q
MAXQ Decomposition (contd)
- State Abstraction
- Finding irrelevant actions
- Finding funnel actions
POMDPs: Recall example
Partially Observable Markov Decision Processes
POMDPs
In POMDPs we apply the very same idea as in MDPs.
Since the state is not observable, the agent has to make its decisions based on the belief state, which is a posterior distribution over states.
Let b be the belief of the agent about the state under consideration.
POMDPs compute a value function over belief space:
  V_T(b) = max_u [ r(b,u) + γ ∫ V_{T−1}(b') p(b'|u,b) db' ]
Problems
Each belief is a probability distribution; thus, each value in a POMDP is a function of an entire probability distribution.
This is problematic, since probability distributions are continuous.
Additionally, we have to deal with the huge complexity of belief spaces.
For finite worlds with finite state, action, and measurement spaces and finite horizons, however, we can effectively represent the value functions by piecewise linear functions.
An Illustrative Example
(Figure: a two-state example. States x1, x2; terminal actions u1, u2 with payoffs r(x1,u1) = −100, r(x2,u1) = +100, r(x1,u2) = +100, r(x2,u2) = −50; sensing action u3 with state-transition probabilities 0.8/0.2; measurements z1, z2 with p(z1|x1) = 0.7, p(z1|x2) = 0.3.)
The Parameters of the Example
The actions u1 and u2 are terminal actions.
The action u3 is a sensing action that potentially leads to a state transition.
The horizon is finite and γ = 1.
Payoff in POMDPs
In MDPs, the payoff (or return) depended on the state of the system.
In POMDPs, however, the true state is not exactly known.
Therefore, we compute the expected payoff by integrating over all states:
  r(b,u) = E_x[ r(x,u) ] = ∫ r(x,u) b(x) dx
Payoffs in Our Example (1)
If we are totally certain that we are in state x1 and execute action u1, we receive a reward of −100.
If, on the other hand, we definitely know that we are in x2 and execute u1, the reward is +100.
In between, it is the linear combination of the extreme values weighted by the probabilities:
  r(b,u1) = −100 p1 + 100 (1 − p1), with p1 = b(x1)
Payoffs in Our Example (2)
The Resulting Policy for T=1
Given we have a finite POMDP with T=1, we would use V1(b) to determine the optimal policy.
In our example, the optimal policy for T=1 picks, at each belief, the action whose payoff line is highest.
This is the upper thick graph in the diagram.
Piecewise Linearity, Convexity
The resulting value function V1(b) is the maximum of the three functions at each point.
It is piecewise linear and convex.
Pruning
If we carefully consider V1(b), we see that only the first two components contribute.
The third component can therefore safely be pruned away from V1(b).
Increasing the Time Horizon
Assume the robot can make an observation before deciding on an action.
(Figure: V1(b).)
Increasing the Time Horizon
Assume the robot can make an observation before deciding on an action.
Suppose the robot perceives z1, for which p(z1|x1) = 0.7 and p(z1|x2) = 0.3.
Given the observation z1 we update the belief using Bayes rule:
  p1' = 0.7 p1 / p(z1)
  p2' = 0.3 (1 − p1) / p(z1)
  with p(z1) = 0.7 p1 + 0.3 (1 − p1) = 0.4 p1 + 0.3
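A minimal sketch of this Bayes update for the two-state example, with p1 = b(x1) and the measurement model from the slides:

def belief_update_z1(p1, pz1_x1=0.7, pz1_x2=0.3):
    pz1 = pz1_x1 * p1 + pz1_x2 * (1 - p1)   # p(z1) = 0.4*p1 + 0.3
    return pz1_x1 * p1 / pz1                # posterior p1' = b'(x1 | z1)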
Value Function
(Figure: b'(b|z1), V1(b), V1(b|z1).)
Increasing the Time Horizon
Assume the robot can make an observation before deciding on an action.
Suppose the robot perceives z1, for which p(z1|x1) = 0.7 and p(z1|x2) = 0.3.
Given the observation z1 we update the belief using Bayes rule.
Thus V1(b|z1) is given by evaluating V1 at the updated belief.
Expected Value after Measuring
Since we do not know in advance what the next measurement will be, we have to compute the expected belief:
  \bar{V}1(b) = E_z[ V1(b|z) ] = Σ_{i=1}^{2} p(zi) V1(b|zi)
Expected Value after Measuring
Since we do not know in advance what the next measurement will be, we have to compute the expected belief.
Resulting Value Function
The four possible combinations yield the following function, which can then be simplified and pruned.
Value Function
(Figure: b'(b|z1), p(z1) V1(b|z1), p(z2) V1(b|z2), \bar{V}1(b).)
State Transitions (Prediction)
When the agent selects u3, its state potentially changes.
When computing the value function, we have to take these potential state changes into account.
Resulting Value Function after executing u3
Taking the state transitions into account, we finally obtain the value function after executing u3.
Value Function after executing u3
(Figure: \bar{V}1(b) and \bar{V}1(b|u3).)
Value Function for T=2
Taking into account that the agent can either directly perform u1 or u2, or first u3 and then u1 or u2, we obtain V2(b) (after pruning).
Graphical Representation of V2(b)
- u1 optimal
- u2 optimal
- unclear: the outcome of measuring is important here
Deep Horizons and Pruning
We have now completed a full backup in belief space.
This process can be applied recursively.
The value functions for T=10 and T=20 are shown on the next slide.
Deep Horizons and Pruning
(Figure: value functions for T=10 and T=20.)
Why Pruning is Essential
Each update introduces additional linear components to V.
Each measurement squares the number of linear components.
Thus, an unpruned value function for T=20 includes more than 10^547,864 linear functions; at T=30 we have 10^561,012,337 linear functions.
The pruned value function at T=20, in comparison, contains only 12 linear components.
The combinatorial explosion of linear components in the value function is the major reason why POMDPs are impractical for most applications.
POMDP Summary
POMDPs compute the optimal action in partially observable, stochastic domains.
For finite horizon problems, the resulting value functions are piecewise linear and convex.
In each iteration the number of linear constraints grows exponentially.
POMDPs have so far only been applied successfully to very small state spaces with small numbers of possible observations and actions.
POMDP Approximations
- Point-based value iteration
- QMDPs
- AMDPs
Point-based Value Iteration
Maintains a set of example beliefs.
Only considers constraints that maximize the value function for at least one of the examples.
Point-based Value Iteration
(Figure: exact value function vs. PBVI; value functions for T=30.)
Example Application
QMDPs
QMDPs only consider state uncertainty in the first step.
After that, the world is assumed to become fully observable.
Q(x_i, u) = r(x_i, u) + Σ_{j=1}^{N} V(x_j) p(x_j | u, x_i)

u* = argmax_u Σ_{i=1}^{N} p_i Q(x_i, u)
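A minimal sketch of QMDP action selection, assuming dense numpy arrays V (MDP value function over states), P[u] (transition matrix per action), R (payoffs, states × actions) and belief vector b:

import numpy as np

def qmdp_action(b, V, P, R):
    n_actions = R.shape[1]
    # Q(x, u) = r(x, u) + sum_x' p(x'|u, x) V(x')
    Q = np.stack([R[:, u] + P[u] @ V for u in range(n_actions)], axis=1)
    # Choose the action maximising the belief-weighted Q values.
    return int(np.argmax(b @ Q))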
Augmented MDPs
Augmentation adds an uncertainty component to the state space, e.g.
  \bar{b} = ( argmax_x b(x), H_b(x) ),  with entropy H_b(x) = −∫ b(x) log b(x) dx
Planning is performed by an MDP in the augmented state space.
Transition, observation and payoff models have to be learned.
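A minimal sketch of this compression of a belief vector into the augmented state (most likely state, belief entropy):

import numpy as np

def augmented_state(b):
    b = np.asarray(b, dtype=float)
    entropy = -np.sum(b[b > 0] * np.log(b[b > 0]))   # H_b = -sum b log b
    return int(np.argmax(b)), entropy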