

SLIDE 1

Statistical Filtering and Control for AI and Robotics

Alessandro Farinelli

Exploration and information gathering

SLIDE 2

Outline

  • POMDPs

– The POMDP model – Finite world POMDP algorithm – Point based value iteration

  • Exploration

– Information gain – Exploration in occupancy grid maps – Extension to MRS

  • Acknowledgment: material based on

– Thrun, Burgard, Fox; Probabilistic Robotics

SLIDE 3

POMDPs

  • In POMDPs we apply the same idea as in MDPs.
  • Since the state is not observable, the agent has to make its

decisions based on the belief state which is a posterior distribution over states.

  • Let b be the belief of the agent about the state under

consideration.

  • POMDPs compute a value function over belief space:

V_T(b) = γ max_u [ r(b, u) + ∫ V_{T−1}(b′) p(b′ | u, b) db′ ]

SLIDE 4

Problems

  • Each belief is a probability distribution, thus, each value in a

POMDP is a function of an entire probability distribution.

  • This is problematic, since probability distributions are

continuous.

  • Additionally, we have to deal with the huge complexity of

belief spaces.

  • For finite worlds with finite state, action, and measurement

spaces and finite horizons, however, we can effectively represent the value functions by piecewise linear functions.

– Possible because expectation is a linear operator

SLIDE 5

Example

[Figure: the two-state example]
States: x1, x2
Actions u1, u2 (terminal), payoffs: r(x1, u1) = −100, r(x2, u1) = +100, r(x1, u2) = +100, r(x2, u2) = −50
Action u3: flips the state with probability 0.8, keeps it with probability 0.2
Measurements: p(z1 | x1) = 0.7, p(z2 | x1) = 0.3, p(z1 | x2) = 0.3, p(z2 | x2) = 0.7

SLIDE 6

Discussion on the example

  • The two states have different optimal actions

– u2 in x1 and u1 in x2

  • Action u3 is non-deterministic: it flips the state and

acquires knowledge with a small cost

– z1 increases confidence of being in x1 – z2 increases confidence of being in x2 – cost is -1 (see later)

  • Two states: belief is p1 = p(x1)

– p(x2) = 1 − p1 – the policy maps the belief p1 ∈ [0, 1] to an action u

SLIDE 7

Payoff in POMDPs

  • In MDPs, the payoff (or reward) depends on the state of the system.
  • In POMDPs the true state is not exactly known.
  • Therefore, we compute the expected payoff by integrating over all states:

r(b, u) = E_x[ r(x, u) ] = ∫ r(x, u) p(x) dx = p1 r(x1, u) + p2 r(x2, u)

SLIDE 8

Payoffs in the example I

  • If we are in x1 and execute u1 we receive −100
  • If we are in x2 and execute u1 we receive +100
  • When we are not certain of the state we have a linear combination weighted with the probabilities:

r(b, u1) = −100 p1 + 100 (1 − p1)
r(b, u2) = 100 p1 − 50 (1 − p1)
r(b, u3) = −1
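A minimal sketch in Python of this linear combination, using the payoff values of the example (the dictionary layout and function name are illustrative):

```python
# Expected payoff r(b, u) for the two-state example.
# The belief is represented by p1 = p(x1); p(x2) = 1 - p1.
R = {  # r(state, action), values from the example
    ("x1", "u1"): -100, ("x2", "u1"): 100,
    ("x1", "u2"):  100, ("x2", "u2"): -50,
    ("x1", "u3"):   -1, ("x2", "u3"):  -1,
}

def expected_payoff(p1: float, u: str) -> float:
    """r(b, u) = p1 * r(x1, u) + (1 - p1) * r(x2, u)."""
    return p1 * R[("x1", u)] + (1.0 - p1) * R[("x2", u)]

print(expected_payoff(0.4, "u1"))  # -100*0.4 + 100*0.6 = 20
```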

SLIDE 9

Payoffs in the example II

SLIDE 10
The resulting policy for T=1

  • Finite POMDP with T=1: use V1(b) to determine the optimal policy

– Choose the best next action among u1, u2, u3

  • In our example, the optimal policy for T=1 is

π1(b) = u1  if p1 ≤ 3/7
        u2  if p1 > 3/7

  • This is the upper thick graph in the diagram.

SLIDE 11

Piecewise linearity and convexity

  • The resulting value function V1(b) is the maximum of

the three functions at each point

  • It is piecewise linear and convex.

V1(b) = max { −100 p1 + 100 (1 − p1),
              100 p1 − 50 (1 − p1),
              −1 }

SLIDE 12

Pruning

  • Only the first two components contribute.
  • The third component can be pruned away from V1(b).
  • Pruning is crucial to have an efficient solution approach

V1(b) = max { −100 p1 + 100 (1 − p1),
              100 p1 − 50 (1 − p1) }

SLIDE 13

Increasing the time horizon

  • Assume the robot can make an observation before acting
  • Sensing will provide a better belief; how much better?

[Figure: V1(b)]

SLIDE 14

Sensing

  • Suppose the robot perceives z1.
  • Recall:

– p(z1 | x1)=0.7 and p(z1| x2)=0.3.

  • Given the observation z1 we update the belief using Bayes rule.

p′1 = p(x1 | z1) = p(z1 | x1) p(x1) / p(z1) = 0.7 p1 / p(z1)
p′2 = p(x2 | z1) = p(z1 | x2) p(x2) / p(z1) = 0.3 (1 − p1) / p(z1)

with p(z1) = 0.7 p1 + 0.3 (1 − p1)
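A minimal sketch (Python) of this Bayes update, using the measurement model of the example; the table P_Z and the function name are illustrative:

```python
# Bayesian belief update after perceiving a measurement (two-state example).
P_Z = {("z1", "x1"): 0.7, ("z1", "x2"): 0.3,   # p(z | x)
       ("z2", "x1"): 0.3, ("z2", "x2"): 0.7}

def update_belief(p1: float, z: str) -> tuple[float, float]:
    """Return (p'1, p(z)): the posterior p(x1 | z) and the measurement likelihood."""
    pz = P_Z[(z, "x1")] * p1 + P_Z[(z, "x2")] * (1.0 - p1)
    return P_Z[(z, "x1")] * p1 / pz, pz

print(update_belief(0.5, "z1"))  # (0.7, 0.5)
```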

SLIDE 15

Value Function considering z1

[Figure: V1(b) projected onto the updated belief b′ = p(x1 | z1), giving V1(b | z1)]

SLIDE 16

Computing the new value function

  • Suppose the robot perceives z1.
  • We update the belief using Bayes rule
  • We can compute V1(b | z1) by replacing p1 with p’1:

V1(b | z1) = max { −100 (0.7 p1) / p(z1) + 100 (0.3 (1 − p1)) / p(z1),
                   100 (0.7 p1) / p(z1) − 50 (0.3 (1 − p1)) / p(z1) }
           = (1 / p(z1)) max { −70 p1 + 30 (1 − p1),
                               70 p1 − 15 (1 − p1) }

SLIDE 17

Expected value after measuring

  • We do not know in advance what the next measurement will be
  • Need to compute the expectation

V̄1(b) = E_z[ V1(b | z) ] = Σ_{i=1,2} p(z_i) V1(b | z_i)
      = p(z1) V1( p(z1 | x1) p1 / p(z1) ) + p(z2) V1( p(z2 | x1) p1 / p(z2) )

SLIDE 18

Expected value after measuring

  • We do not know in advance what the next measurement will be
  • Need to compute the expectation

V̄1(b) = Σ_{i=1,2} p(z_i) V1(b | z_i)
      = p(z1) (1 / p(z1)) max { −70 p1 + 30 (1 − p1), 70 p1 − 15 (1 − p1) }
      + p(z2) (1 / p(z2)) max { −30 p1 + 70 (1 − p1), 30 p1 − 35 (1 − p1) }
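Continuing the earlier sketch (and reusing the illustrative P_Z table defined there), the expectation over measurements can be written as:

```python
def V1(p1: float) -> float:
    """Pruned horizon-1 value function of the example (two linear pieces)."""
    return max(-100 * p1 + 100 * (1 - p1), 100 * p1 - 50 * (1 - p1))

def expected_value_after_sensing(p1: float) -> float:
    """V-bar_1(b) = sum_z p(z) V1(b | z) for the two-state example."""
    total = 0.0
    for z in ("z1", "z2"):
        pz = P_Z[(z, "x1")] * p1 + P_Z[(z, "x2")] * (1 - p1)  # measurement likelihood
        p1_post = P_Z[(z, "x1")] * p1 / pz                    # Bayes-updated belief
        total += pz * V1(p1_post)
    return total

print(expected_value_after_sensing(0.5))  # 47.5
```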

SLIDE 19

Resulting value function

  • Need to consider the four possible combinations and find the max
  • As before we can perform pruning

V̄1(b) = max { (−70 p1 + 30 (1 − p1)) + (−30 p1 + 70 (1 − p1)),
              (−70 p1 + 30 (1 − p1)) + (30 p1 − 35 (1 − p1)),
              (70 p1 − 15 (1 − p1)) + (−30 p1 + 70 (1 − p1)),
              (70 p1 − 15 (1 − p1)) + (30 p1 − 35 (1 − p1)) }
      = max { −100 p1 + 100 (1 − p1),
              40 p1 + 55 (1 − p1),
              100 p1 − 50 (1 − p1) }

SLIDE 20

Value Function considering sensing

[Figure: p(z1) V1(b | z1) and p(z2) V1(b | z2); regions where u1 or u2 is optimal, with an unclear region in between]

SLIDE 21

State transition

  • Need to consider how actions affect the state
  • In our case u1 and u2 lead to final states and are deterministic
  • u3 has a non-deterministic effect on the state

p′1 = E_x[ p(x′1 | x, u3) ]
    = p(x′1 | x1, u3) p1 + p(x′1 | x2, u3) (1 − p1)
    = 0.2 p1 + 0.8 (1 − p1) = 0.8 − 0.6 p1
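A minimal sketch (Python) of pushing the belief through the non-deterministic action u3 (transition values from the example; the table name is illustrative):

```python
# p(x' | x, u3) for the two-state example, keyed by (x, x').
P_TRANS_U3 = {("x1", "x1"): 0.2, ("x1", "x2"): 0.8,
              ("x2", "x1"): 0.8, ("x2", "x2"): 0.2}

def propagate_u3(p1: float) -> float:
    """p'1 = p(x'1 | x1, u3) p1 + p(x'1 | x2, u3) (1 - p1) = 0.8 - 0.6 p1."""
    return P_TRANS_U3[("x1", "x1")] * p1 + P_TRANS_U3[("x2", "x1")] * (1.0 - p1)

print(propagate_u3(0.5))  # 0.5: the uniform belief is a fixed point of the flip
```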

SLIDE 22

State transition

p′1 = E_x[ p(x′1 | x, u3) ] = 0.8 − 0.6 p1

[Figure: mapping from the belief p1 to the propagated belief p′1]

SLIDE 23

Resulting value function after u3

  • Considering the state transition we can compute V̄1(b | u3)
  • Substitute p′1 = 0.8 − 0.6 p1 (hence 1 − p′1 = 0.2 + 0.6 p1) into V̄1:

V̄1(b | u3) = max { −100 p′1 + 100 (1 − p′1),
                   40 p′1 + 55 (1 − p′1),
                   100 p′1 − 50 (1 − p′1) }
           = max { 60 p1 − 60 (1 − p1),
                   52 p1 + 43 (1 − p1),
                   −20 p1 + 70 (1 − p1) }

SLIDE 24

Value Function considering u3

[Figure: V̄1(b | u3), obtained by projecting V̄1(b) through the u3 state transition; the u1 / unclear / u2 regions appear flipped after the projection]

SLIDE 25

Resulting value function for T=2

  • The robot can execute any of the three actions u1, u2, u3
  • Need to account for the −1 cost when u3 is executed

V2(b) = max { −100 p1 + 100 (1 − p1),
              100 p1 − 50 (1 − p1),
              59 p1 − 61 (1 − p1),
              51 p1 + 42 (1 − p1),
              −21 p1 + 69 (1 − p1) }
      = max { −100 p1 + 100 (1 − p1),
              100 p1 − 50 (1 − p1),
              51 p1 + 42 (1 − p1) }

SLIDE 26

Graphical representation for V2(b)

  • Outcome of the measurement is important here

[Figure: V2(b), with regions where u1 is optimal, u2 is optimal, and an unclear region]

SLIDE 27

Deep horizons and pruning

  • We have now completed a full backup in belief space.
  • This process can be applied recursively.
  • The value functions for T=10 and T=20 are shown in the following figures.
SLIDE 28

Importance of pruning

[Figure: V1(b), V̄1(b) and V2(b)]

SLIDE 29
SLIDE 30

Why pruning is essential

  • Each update introduces additional linear components to V.
  • Each measurement squares the number of linear components.
  • Thus, an un-pruned value function for T=20 includes more than 10^547,864 linear functions.
  • At T=30 we have 10^561,012,337 linear functions.
  • The pruned value function at T=20, in comparison, contains only 13 linear components.
  • The combinatorial explosion of linear components in the value function is the major reason why POMDPs are impractical for most applications.

  • Can use approximations

– Exploiting the structure of the domain
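A minimal sketch (Python) of approximate pruning for a one-dimensional belief such as this example: a linear component (a1, a0), meaning a1·p1 + a0·(1 − p1), is kept only if it attains the maximum at some sampled belief point. This grid-based test is an illustrative simplification, not the exact LP-based pruning.

```python
# Approximate pruning of a piecewise-linear value function over a 1-D belief.
def prune(components, n_samples=1001):
    """components: list of (a1, a0) pairs representing a1*p1 + a0*(1 - p1)."""
    kept = set()
    for k in range(n_samples):
        p1 = k / (n_samples - 1)
        values = [a1 * p1 + a0 * (1 - p1) for (a1, a0) in components]
        kept.add(max(range(len(components)), key=lambda i: values[i]))
    return [components[i] for i in sorted(kept)]

# V1 of the example: the constant -1 component is dominated and gets pruned.
print(prune([(-100, 100), (100, -50), (-1, -1)]))  # [(-100, 100), (100, -50)]
```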

SLIDE 31

Point based value iteration

  • One of many approaches to approximate POMDPs
  • PBVI: maintains a set of example beliefs

– Belief points

  • Only considers constraints that maximize value

function for at least one of the examples

– V contains only constraints that are supported by belief points
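A compact sketch (Python/NumPy, with illustrative data structures) of the point-based backup PBVI performs: for every belief point only the best α-vector is kept, so the value function never grows beyond one linear component per belief point and backup.

```python
import numpy as np

def pbvi_backup(V, B, T, Z, R, gamma):
    """One point-based backup.
    V: list of alpha-vectors (arrays over states) from the previous horizon
    B: list of belief points (arrays over states)
    T[a][s, s']: transition model, Z[a][s', o]: observation model
    R[a][s]: immediate reward, gamma: discount factor."""
    n_actions, n_obs = len(T), Z[0].shape[1]
    new_V = []
    for b in B:
        best_alpha, best_val = None, -np.inf
        for a in range(n_actions):
            alpha_ab = R[a].astype(float)
            for o in range(n_obs):
                # alpha^{a,o}(s) = sum_{s'} alpha(s') Z[a][s', o] T[a][s, s']
                candidates = [T[a] @ (alpha * Z[a][:, o]) for alpha in V]
                alpha_ab = alpha_ab + gamma * max(candidates, key=lambda g: float(b @ g))
            val = float(b @ alpha_ab)
            if val > best_val:
                best_alpha, best_val = alpha_ab, val
        new_V.append(best_alpha)   # keep only the maximizing vector for this belief point
    return new_V
```

Starting from V = [np.zeros(n_states)] and repeating the backup T times gives the point-based approximation of the horizon-T value function.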

SLIDE 32

Point based value iteration: Example

  • Same domain with two states but deterministic transition probabilities
  • PBVI: simple point set B = {p1 = 0.0, p1 = 0.1, …, p1 = 1.0}

[Figures: value iteration with pruning, T=30 vs. PBVI, T=30]

SLIDE 33

Example Application for PBVI

  • Intrusion detection: robot is well localized, intruder

position uncertain (particle filter)

  • Fairly easy to define a reasonable set of belief points
SLIDE 34

PBVI: policy I

SLIDE 35

PBVI: policy II

SLIDE 36

PBVI: policy III

Time to clear the room is (with high likelihood) not sufficient for the intruder to pass through the corridor. The POMDP solution finds the best policy to detect the intruder, considering the uncertainty over the state space.

SLIDE 37

Exploration

  • Exploration is a crucial task for robotics
  • Exploration: information gathering

– Find an intruder – Active localization – Acquire a map of a static environment

  • POMDPs naturally consider information gathering

– Just need to build an appropriate reward function (e.g. reduction in entropy) – Not practical for most realistic applications

  • We will consider practical algorithms for exploration

– Most of them are greedy

SLIDE 38

Information gain

  • Entropy: expected information of a probability

distribution

  • Maximum for uniform distributions
  • Minimum for point-mass

H_p(x) = E_p[ −log p(x) ]
       = −∫ p(x) log p(x) dx        (continuous case)
       = −Σ_x p(x) log p(x)         (discrete case)
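A minimal sketch (Python) of the discrete entropy, illustrating the two bullet points above:

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), with 0 log 0 := 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

print(entropy([0.5, 0.5]))  # maximal for a uniform distribution (log 2)
print(entropy([1.0, 0.0]))  # minimal (0) for a point-mass distribution
```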

SLIDE 39

Conditional Entropy

  • Need to consider information after executing actions

and acquiring measurements

  • Denote the belief resulting from executing u and

acquiring measurement z under the belief b

  • Conditional entropy

B(b, z, u)(x′) = p(x′ | z, u, b)

H_b(x′ | z, u) = −∫ B(b, z, u)(x′) log B(b, z, u)(x′) dx′

SLIDE 40

Conditional Entropy over the control

  • We can not choose the measurement, only the

control action: need to integrate z out to obtain

  • This is done by exploiting the structure of the

application domain

  • Information gain: reduction in entropy

H_b(x′ | u) = ∫ H_b(x′ | z, u) p(z | u, b) dz

I_b(u) = H_b(x) − H_b(x′ | u)

SLIDE 41

Greedy Techniques

  • Exploration as a decision-theoretic problem:

– Choose action that maximizes the expected utility

  • Expected utility for action u:

– information gain minus cost – must find a tradeoff between cost and gain

π(b) = argmax_u [ α ( H_b(x) − H_b(x′ | u) ) − ∫ r(x, u) b(x) dx ]

(first term: expected information gain, weighted by α; second term: expected cost, with r(x, u) the cost of executing u in x)
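A minimal sketch (Python) of this greedy action selection; the gain and cost models are assumed to be supplied as callbacks (illustrative names):

```python
def greedy_exploration_action(belief, actions, info_gain, expected_cost, alpha=1.0):
    """Return the action maximizing alpha * I_b(u) - E_b[cost(x, u)].
    info_gain(belief, u) and expected_cost(belief, u) are user-supplied models."""
    return max(actions, key=lambda u: alpha * info_gain(belief, u) - expected_cost(belief, u))
```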

SLIDE 42

Why Greedy ?

Long action sequences might not be executable

SLIDE 43

Monte Carlo Exploration

SLIDE 44

Issues with Monte Carlo Exploration

  • Sampling the measurement z is not practical
  • Most domains exhibit a huge number of possible observations
  • Need to exploit domain structure to overcome this
SLIDE 45

Exploration for learning occupancy grid maps

  • Exploration applied to mapping
  • Considering occupancy grid maps
SLIDE 46

Occupancy grid maps

  • Introduced by Moravec and Elfes in 1985
  • Represent environment by a grid

  • Estimate the probability that a location is occupied by

an obstacle .

  • Key assumptions

– Occupancy values of individual cells are independent – Positions are known, map is static

m = { m_i } ,   p(m_i) ∈ [0, 1]

p(m | z_1:t, x_1:t) = Π_i p(m_i | z_1:t, x_1:t)

SLIDE 47

Updating occupancy grid maps: example

SLIDE 48

Occupancy grid maps: example

  • CAD map
  • Occupancy grid map
SLIDE 49

Exploring occupancy grid maps

  • Grey areas are not explored
  • Greedy technique: go to the closest unexplored location, where the information gain is maximal
  • Compute the gain per grid cell (not per robot action!) using:

– Entropy – Expected information gain – Binary gain
SLIDE 50

Entropy to compute gain

[Figures: occupancy map and corresponding entropy map; the brighter a location, the higher the entropy]

H_p(m_i) = −p_i log p_i − (1 − p_i) log(1 − p_i)

SLIDE 51

Information gain

  • Entropy does not consider the information a robot

would acquire when close to a cell

  • Recall that the information gain is:
  • In our case this reduces to the difference between the entropy before measuring and the expected entropy after acquiring the possible measurement

I_b(u) = H_b(x) − H_b(x′ | u)

I(m_i) = H_p(m_i) − E_z[ H_p′(m_i) ]

SLIDE 52

Computing the entropy of the posterior

  • Probability of correct sensing: p_t
  • Probability of measuring occupied: p(z = occ) = p_t p_i + (1 − p_t)(1 − p_i)
  • Posterior for the occupancy update:

p′_i = p_t p_i / ( p_t p_i + (1 − p_t)(1 − p_i) )

H_p′(m_i) = −p′_i log p′_i − (1 − p′_i) log(1 − p′_i)

SLIDE 53
Computing the information gain

  • We can compute the entropy of the posterior for measuring free: H_p′(m_i)
  • The expected entropy is then

E_z[ H_p′(m_i) ] = p(z = occ) H_p′|occ(m_i) + p(z = free) H_p′|free(m_i)

  • We can then compute the gain

I(m_i) = H_p(m_i) − E_z[ H_p′(m_i) ]
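A minimal sketch (Python) of the per-cell expected information gain under this simple sensor model; the correct-sensing probability p_t = 0.9 is an illustrative assumption:

```python
import math

def cell_entropy(p: float) -> float:
    """H(p) = -p log p - (1 - p) log(1 - p), with 0 log 0 := 0."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)

def expected_info_gain(p_i: float, p_t: float = 0.9) -> float:
    """Expected entropy reduction of one cell after a single measurement."""
    p_occ = p_t * p_i + (1 - p_t) * (1 - p_i)        # p(z = occupied)
    post_occ = p_t * p_i / p_occ                      # posterior if z = occupied
    post_free = (1 - p_t) * p_i / (1 - p_occ)         # posterior if z = free
    expected_H = p_occ * cell_entropy(post_occ) + (1 - p_occ) * cell_entropy(post_free)
    return cell_entropy(p_i) - expected_H

print(expected_info_gain(0.5))  # largest for a completely unknown cell
```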

SLIDE 54

Difference between gain and entropy

  • Entropy is very similar to information gain
  • Usually entropy is good enough for exploration
SLIDE 55

Binary gain

  • Extremely simple, extremely popular
  • Divide cells in two classes:

– Explored: updated at least once – Unexplored: never updated

  • Frontier based exploration
SLIDE 56

Using the information maps

  • Need to build a navigation function to drive

the robot based on information maps

  • Exploration action:

– Move to loc. (x,y) – Acquire info in a small radius

  • Binary gain and value iteration
  • r encodes the cost

V_T(m_i) = { I(m_i)                                            if I(m_i) > 0
           { max_{j ∈ adj(i)} [ r(m_i, m_j) + V_{T−1}(m_j) ]   otherwise
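A minimal sketch (Python) of this value-iteration navigation function on an information map; the 4-connected grid, the unit step cost and the iteration count are illustrative assumptions:

```python
STEP_COST = -1.0  # r(m_i, m_j): cost of moving to an adjacent cell

def build_value_map(I, n_iters=200):
    """I: dict {(x, y): gain}. Returns a value map dict {(x, y): value}."""
    V = {c: (g if g > 0 else float("-inf")) for c, g in I.items()}
    for _ in range(n_iters):
        new_V = {}
        for (x, y), g in I.items():
            if g > 0:                     # unexplored cell: its value is the gain itself
                new_V[(x, y)] = g
                continue
            neighbours = [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]
            vals = [STEP_COST + V[n] for n in neighbours if n in V]
            new_V[(x, y)] = max(vals) if vals else float("-inf")
        V = new_V
    return V
```

The robot then simply moves to the neighbouring cell with the highest value, which drives it towards the closest high-gain region.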

SLIDE 57

Value function: example

  • Value function at convergence for binary gain
  • Very crude approximation but works well in practice
SLIDE 58

Exploration path: example

SLIDE 59

Extension to MRS

  • K robots can explore more than K times faster
  • Need to coordinate: avoid conflicts and maximize gain
  • Simple approach: greedy task allocation

– assign frontiers to different robots – greedily maximize exploration effect
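A minimal sketch (Python) of the greedy task allocation described above; the gain and cost models are assumed callbacks, and removing an assigned frontier is the simplest way of ruling it out for the other robots:

```python
def greedy_allocate(robots, frontiers, gain, cost):
    """robots: list of robot ids, frontiers: list of frontier cells.
    gain(f): expected information gain of frontier f,
    cost(r, f): travel cost from robot r to frontier f."""
    remaining = list(frontiers)
    assignment = {}
    for r in robots:
        if not remaining:
            break
        best = max(remaining, key=lambda f: gain(f) - cost(r, f))
        assignment[r] = best
        remaining.remove(best)   # rule this target out for the other robots
    return assignment
```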

SLIDE 60

Greedy coordinated exploration

SLIDE 61

MRS exploration without coordination

  • Both robots choose the same target location

– They are at the same distance and do not coordinate

SLIDE 62

MRS coordinated exploration

  • The first robot chooses its goal and rules out that location for the second robot

– Joint exploration will be much more effective

SLIDE 63

Extension for MRS coordination

  • Greedy task allocation can easily fail

– Consider swapping the order of execution for robots

  • Very restrictive assumptions

– i.e., robots share the same map

  • Several extensions:

– Use optimal task assignment (e.g., Hungarian method) – Negotiation over tasks during execution (e.g., auctions) – Do not share maps continuously (e.g., plan for meetings) – …

SLIDE 64

Summary

  • POMDPs

– provide optimal policy considering belief states – are extremely hard to solve – effective for finite worlds and low dimensions

  • Exploration

– POMDPs can represent the exploration problem – In most practical applications we need to exploit domain knowledge to obtain tractable algorithms – Entropy to guide the search – Very often simple approaches (e.g., binary gain) are very effective and extremely efficient – Interesting extensions for MRS

SLIDE 65

References and Further Readings

Material for the slides

  • Thrun, Burgard, Fox; Probabilistic Robotics (Chapters 15.1–15.3, 15.5, 17.1, 17.2, 17.4)

Further readings

  • Chapter 16, Approximate POMDP techniques
  • Fielded POMDPs (Pineau et al 2003; Roy et al. 2000)
  • Policy search (Ng et al. 2003)
  • Learning for POMDPs (Littman et al. 2001)
  • Frontier based exploration (Yamauchi et al. 1999)
  • Next best view point (Whaite and Ferrie 1997)
  • Cooperative exploration (Burgard et al. 2004)