

SLIDE 1

Bayesian Reinforcement Learning in Continuous POMDPs

Stéphane Ross¹, Brahim Chaib-draa², and Joelle Pineau¹

¹School of Computer Science, McGill University, Canada
²Department of Computer Science, Laval University, Canada

December 19th, 2007


SLIDE 2

Motivation

How should robots make decisions, so as to maximize expected long-term rewards, when:
- the environment is partially observable and continuous;
- the model of sensors and actuators is poor;
- parts of the model must be learned entirely during execution (e.g. users' preferences/behavior)?

Typical examples:

[Figure: example application domains; image credit: Rottmann]

Solution: Bayesian Reinforcement Learning!


SLIDE 3

Partially Observable Markov Decision Processes

POMDP: (S, A, T, R, Z, O, γ, b0)

- S: set of states
- A: set of actions
- T(s, a, s′) = Pr(s′|s, a): transition probabilities
- R(s, a) ∈ ℝ: immediate rewards
- Z: set of observations
- O(s′, a, z) = Pr(z|s′, a): observation probabilities
- γ: discount factor
- b0: initial state distribution

Belief monitoring via Bayes' rule:

$$b_t(s') = \eta\, O(s', a_{t-1}, z_t) \sum_{s \in S} T(s, a_{t-1}, s')\, b_{t-1}(s)$$

Value function:

$$V^*(b) = \max_{a \in A} \Big[ R(b, a) + \gamma \sum_{z \in Z} \Pr(z \mid b, a)\, V^*(\tau(b, a, z)) \Big]$$
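For concreteness, here is a minimal sketch of this belief update in Python, assuming tabular T and O arrays (the function and variable names are ours, not from the slides):

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """One step of POMDP belief monitoring via Bayes' rule.

    b: belief over states, shape (|S|,)
    T: transition probabilities T[s, a, s'], shape (|S|, |A|, |S|)
    O: observation probabilities O[s', a, z], shape (|S|, |A|, |Z|)
    """
    # Predict: sum_s T(s, a, s') b(s), indexed by s' ...
    predicted = T[:, a, :].T @ b
    # ... then correct: weight each s' by O(s', a, z) and normalize (eta).
    b_next = O[:, a, z] * predicted
    return b_next / b_next.sum()
```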


SLIDE 4

Bayesian Reinforcement Learning

General idea:
- Define prior distributions over all unknown parameters.
- Maintain posteriors via Bayes' rule as experience is acquired.
- Plan with respect to the posterior distribution over models.

This allows us to:
- Learn the model while efficiently performing the task.
- Optimally trade off exploration and exploitation.
- Account for model uncertainty during planning.
- Include prior knowledge explicitly.


SLIDE 5

Bayesian RL in Finite MDPs

In finite MDPs (T unknown): ([Dearden 99], [Duff 02], [Poupart 06])

To learn $T$: maintain counts $\phi^a_{ss'}$ of the number of times the transition $s \xrightarrow{a} s'$ is observed, starting from a prior $\phi_0$. The counts define a Dirichlet prior/posterior over $T$.

Planning according to $\phi$ is itself an MDP problem:

- $S'$: physical state ($s \in S$) + information state ($\phi$)

$$T'(s, \phi, a, s', \phi') = \Pr(s', \phi' \mid s, \phi, a) = \Pr(s' \mid s, \phi, a)\,\Pr(\phi' \mid \phi, s, a, s') = \frac{\phi^a_{ss'}}{\sum_{s'' \in S} \phi^a_{ss''}}\; I(\phi',\, \phi + \delta^a_{ss'})$$

$$V^*(s, \phi) = \max_{a \in A} \Big[ R(s, a) + \gamma \sum_{s' \in S} \frac{\phi^a_{ss'}}{\sum_{s'' \in S} \phi^a_{ss''}}\, V^*(s', \phi + \delta^a_{ss'}) \Big]$$
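A minimal sketch of the count bookkeeping behind the Dirichlet posterior, assuming a small tabular MDP (the class and method names are illustrative, not from the slides):

```python
import numpy as np

class DirichletTransitionModel:
    """Counts phi[s, a, s'] defining a Dirichlet posterior over T."""

    def __init__(self, n_states, n_actions, prior_count=1.0):
        # phi_0: uniform pseudo-counts as the Dirichlet prior.
        self.phi = np.full((n_states, n_actions, n_states), prior_count)

    def update(self, s, a, s_next):
        """Observed one transition s --a--> s': increment its count."""
        self.phi[s, a, s_next] += 1.0

    def expected_T(self, s, a):
        """Posterior mean of T(s, a, .): exactly the count ratio
        appearing in the value function above."""
        return self.phi[s, a] / self.phi[s, a].sum()
```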


SLIDE 6

Bayesian RL in Finite POMDPs

In finite POMDPs (T, O unknown): ([Ross 07])

Let:
- $\phi^a_{ss'}$: number of times $s \xrightarrow{a} s'$ is observed.
- $\psi^a_{sz}$: number of times $z$ is observed in $s$ after doing $a$.

Given the action-observation sequence, use Bayes' rule to maintain a belief over $(s, \phi, \psi)$.

⇒ Decision making under partial observability of $(s, \phi, \psi)$ is itself a POMDP:

- $S'$: physical state ($s \in S$) + information state ($\phi, \psi$)

$$P'(s, \phi, \psi, a, s', \phi', \psi', z) = \Pr(s', \phi', \psi', z \mid s, \phi, \psi, a) = \Pr(s' \mid s, \phi, a)\,\Pr(z \mid \psi, s', a)\,\Pr(\phi' \mid \phi, s, a, s')\,\Pr(\psi' \mid \psi, a, s', z)$$

$$= \frac{\phi^a_{ss'}}{\sum_{s'' \in S} \phi^a_{ss''}} \cdot \frac{\psi^a_{s'z}}{\sum_{z' \in Z} \psi^a_{s'z'}}\; I(\phi', \phi + \delta^a_{ss'})\, I(\psi', \psi + \delta^a_{s'z})$$
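A sketch of exact belief monitoring over hyperstates $(s, \phi, \psi)$, assuming the belief is stored as a dictionary from hyperstates to probabilities (all representation choices here are ours):

```python
import numpy as np
from collections import defaultdict

def totuple(a):
    """Nested-tuple view of an array, so counts can be dict keys."""
    return tuple(totuple(x) for x in a) if a.ndim > 1 else tuple(a)

def bapomdp_belief_update(belief, a, z, n_states):
    """Exact belief update over hyperstates (s, phi, psi).

    belief: dict mapping (s, phi, psi) -> probability, where phi and
    psi are nested tuples of counts phi[a][s][s'] and psi[a][s][z].
    """
    new_belief = defaultdict(float)
    for (s, phi, psi), p in belief.items():
        phi_arr, psi_arr = np.array(phi, float), np.array(psi, float)
        for s2 in range(n_states):
            pr_s = phi_arr[a, s, s2] / phi_arr[a, s].sum()
            pr_z = psi_arr[a, s2, z] / psi_arr[a, s2].sum()
            if pr_s * pr_z == 0.0:
                continue
            phi2 = phi_arr.copy(); phi2[a, s, s2] += 1   # phi + delta
            psi2 = psi_arr.copy(); psi2[a, s2, z] += 1   # psi + delta
            new_belief[(s2, totuple(phi2), totuple(psi2))] += p * pr_s * pr_z
    eta = sum(new_belief.values())
    return {k: v / eta for k, v in new_belief.items()}
```

Note that the support of the belief grows with every step, which is one reason the later slides turn to Monte Carlo approximations.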


SLIDE 7

Example

Tiger domain with unknown sensor accuracy. Suppose the prior is ψ0 = (5, 3) and b0 = (0.5, 0.5), and the action-observation sequence is {(Listen, l), (Listen, l), (Listen, l), (Right, −)}.

$$\begin{aligned}
b_0 &: \Pr(L, \langle 5,3\rangle) = \tfrac{1}{2}, \quad \Pr(R, \langle 5,3\rangle) = \tfrac{1}{2} \\
b_1 &: \Pr(L, \langle 6,3\rangle) = \tfrac{5}{8}, \quad \Pr(R, \langle 5,4\rangle) = \tfrac{3}{8} \\
b_3 &: \Pr(L, \langle 8,3\rangle) = \tfrac{7}{9}, \quad \Pr(R, \langle 5,6\rangle) = \tfrac{2}{9} \\
b_4 &: \Pr(L, \langle 8,3\rangle) = \tfrac{7}{18}, \quad \Pr(L, \langle 5,6\rangle) = \tfrac{2}{18}, \quad \Pr(R, \langle 8,3\rangle) = \tfrac{7}{18}, \quad \Pr(R, \langle 5,6\rangle) = \tfrac{2}{18}
\end{aligned}$$
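These fractions can be checked mechanically; a small verification sketch in exact rational arithmetic (our own code, not from the slides):

```python
from fractions import Fraction

# Prior psi0 = (5, 3): pseudo-counts of (correct, incorrect) listen
# observations, so the expected sensor accuracy is 5/8.
b = {("L", (5, 3)): Fraction(1, 2), ("R", (5, 3)): Fraction(1, 2)}

def listen_update(b, obs):
    """Belief update after (Listen, obs): if the observation matches
    the tiger's side, the 'correct' count is incremented, else the
    'incorrect' count."""
    new_b = {}
    for (s, (c, w)), p in b.items():
        acc = Fraction(c, c + w)              # expected accuracy
        match = (s.lower() == obs)
        pr_obs = acc if match else 1 - acc
        counts = (c + 1, w) if match else (c, w + 1)
        key = (s, counts)
        new_b[key] = new_b.get(key, Fraction(0)) + p * pr_obs
    eta = sum(new_b.values())
    return {k: v / eta for k, v in new_b.items()}

b = listen_update(b, "l")
print(b)  # b1: Pr(L, (6,3)) = 5/8, Pr(R, (5,4)) = 3/8, as on the slide
```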

[Figure: posterior densities over sensor accuracy for b0(L,·), b3(L,·), b3(R,·), and b4(L,·).]


SLIDE 8

Continuous Domains

In robotics, continuous domains are common (continuous states, actions, and observations). We could discretize the problem and apply our current method, but this leads to:
- combinatorial explosion or poor precision;
- a need for lots of training data (every small cell must be visited).

Can we extend Bayesian RL to continuous domains?


SLIDE 9

Bayesian RL in Continuous Domains?

We can't use counts (Dirichlet distributions) to learn the model, so we assume a parametric form for the transition and observation models. For instance, in the Gaussian case, with $S \subset \mathbb{R}^m$, $A \subset \mathbb{R}^n$, $Z \subset \mathbb{R}^p$:

$$s_{t+1} = g_T(s_t, a_t, X_t), \qquad z_{t+1} = g_O(s_{t+1}, a_t, Y_t)$$

where $X_t \sim N(\mu_X, \Sigma_X)$, $Y_t \sim N(\mu_Y, \Sigma_Y)$, and $g_T$, $g_O$ are arbitrary (possibly non-linear) functions.

We assume $g_T$ and $g_O$ are known, but that the parameters $\mu_X$, $\Sigma_X$, $\mu_Y$, $\Sigma_Y$ are unknown. The relevant statistics to maintain depend on the parametric form.
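A sketch of one generative step under this parametric model, assuming vector-valued states and the Gaussian noise above (names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def generative_step(s, a, g_T, g_O, mu_X, Sigma_X, mu_Y, Sigma_Y):
    """One step of the Gaussian-noise parametric model: draw the noise
    terms X and Y, then push them through the (possibly non-linear,
    but known) functions g_T and g_O."""
    X = rng.multivariate_normal(mu_X, Sigma_X)
    s_next = g_T(s, a, X)
    Y = rng.multivariate_normal(mu_Y, Sigma_Y)
    z_next = g_O(s_next, a, Y)
    return s_next, z_next
```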


SLIDE 10

Bayesian RL in Continuous Domains?

$\mu$ and $\Sigma$ can be learned by maintaining the sample mean $\hat\mu$ and sample covariance $\hat\Sigma$. These define a Normal-Wishart posterior over $(\mu, \Sigma)$:

$$\mu \mid (\Sigma = R) \sim N\!\left(\hat\mu, \tfrac{R}{\nu}\right), \qquad \Sigma^{-1} \sim \mathrm{Wishart}(\alpha, \tau^{-1})$$

where:
- $\nu$: number of observations for $\hat\mu$
- $\alpha$: degrees of freedom of $\hat\Sigma$
- $\tau = \alpha \hat\Sigma$

These can be updated easily after observing $X = x$:

$$\hat\mu' = \frac{\nu \hat\mu + x}{\nu + 1}, \qquad \nu' = \nu + 1, \qquad \alpha' = \alpha + 1, \qquad \tau' = \tau + \frac{\nu}{\nu + 1}\,(\hat\mu - x)(\hat\mu - x)^\top$$
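A direct transcription of these update equations, as a sketch:

```python
import numpy as np

def nw_update(mu_hat, nu, alpha, tau, x):
    """Normal-Wishart hyperparameter update after observing X = x,
    following the slide's equations."""
    diff = (mu_hat - x).reshape(-1, 1)
    tau_next = tau + (nu / (nu + 1.0)) * (diff @ diff.T)
    mu_next = (nu * mu_hat + x) / (nu + 1.0)
    return mu_next, nu + 1.0, alpha + 1.0, tau_next
```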


SLIDE 11

Bayesian RL in Continuous POMDP

Let's define:
- $\phi = (\hat\mu_X, \nu_X, \alpha_X, \tau_X)$: the posterior over $(\mu_X, \Sigma_X)$
- $\psi = (\hat\mu_Y, \nu_Y, \alpha_Y, \tau_Y)$: the posterior over $(\mu_Y, \Sigma_Y)$
- $U$: the update function for $\phi, \psi$, i.e. $U(\phi, x) = \phi'$ and $U(\psi, y) = \psi'$

Bayes-Adaptive Continuous POMDP: $(S', A', Z', P', R')$
- $S' = S \times \mathbb{R}^{|X|+|X|^2+2} \times \mathbb{R}^{|Y|+|Y|^2+2}$
- $A' = A$
- $Z' = Z$
- $R'(s, \phi, \psi, a) = R(s, a)$

$$P'(s, \phi, \psi, a, s', \phi', \psi', z) = I(g_T(s, a, x), s')\, I(g_O(s', a, y), z)\, I(\phi', U(\phi, x))\, I(\psi', U(\psi, y))\, f_{X|\phi}(x)\, f_{Y|\psi}(y)$$

where $x = (\nu_X + 1)\hat\mu'_X - \nu_X \hat\mu_X$ and $y = (\nu_Y + 1)\hat\mu'_Y - \nu_Y \hat\mu_Y$; these simply invert the mean update $\hat\mu' = \frac{\nu\hat\mu + x}{\nu + 1}$, recovering the noise sample implied by the change in hyperparameters.
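A sketch of the augmented ("hyper") state this defines, under our own representation choices:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class NWPosterior:
    """Normal-Wishart statistics: (mu_hat, nu, alpha, tau)."""
    mu_hat: np.ndarray
    nu: float
    alpha: float
    tau: np.ndarray

@dataclass
class HyperState:
    """A state of the Bayes-Adaptive continuous POMDP: the physical
    state s augmented with the information state (phi, psi)."""
    s: np.ndarray
    phi: NWPosterior   # posterior over (mu_X, Sigma_X)
    psi: NWPosterior   # posterior over (mu_Y, Sigma_Y)
```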


SLIDE 12

Bayesian RL in Continuous POMDP

Monte Carlo belief monitoring (one extra assumption: $g_O(s, a, \cdot)$ is a one-to-one transformation of $Y$):

1. Sample $(s, \phi, \psi) \sim b_t$
2. Sample $(\mu_X, \Sigma_X) \sim NW(\phi)$
3. Sample $X \sim N(\mu_X, \Sigma_X)$
4. Compute $s' = g_T(s, a_t, X)$
5. Find the unique $Y$ s.t. $z_{t+1} = g_O(s', a_t, Y)$
6. Compute $\phi' = U(\phi, X)$, $\psi' = U(\psi, Y)$
7. Sample $(\mu_Y, \Sigma_Y) \sim NW(\psi)$
8. Add weight $f(Y \mid \mu_Y, \Sigma_Y)$ to particle $b_{t+1}(s', \phi', \psi')$
9. Repeat until $K$ particles are in $b_{t+1}$
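A sketch of this particle-based belief update, reusing `nw_update` from above and assuming a hypothetical inverse observation function `g_O_inv` (the interface is ours, not the authors'):

```python
import numpy as np
from scipy.stats import wishart, multivariate_normal

rng = np.random.default_rng(0)

def sample_nw(mu_hat, nu, alpha, tau):
    """Draw (mu, Sigma) from a Normal-Wishart posterior."""
    precision = wishart(df=alpha, scale=np.linalg.inv(tau)).rvs(random_state=rng)
    Sigma = np.linalg.inv(precision)
    mu = rng.multivariate_normal(mu_hat, Sigma / nu)
    return mu, Sigma

def mc_belief_update(particles, weights, a, z, g_T, g_O_inv, K):
    """Steps 1-9 from the slide. `particles` is a list of hyperstates
    (s, phi, psi) with probabilities `weights`; `g_O_inv(s2, a, z)`
    returns the unique Y with g_O(s2, a, Y) = z."""
    new_particles, new_weights = [], []
    for _ in range(K):
        i = rng.choice(len(particles), p=weights)
        s, phi, psi = particles[i]                              # step 1
        mu_X, Sigma_X = sample_nw(*phi)                         # step 2
        X = rng.multivariate_normal(mu_X, Sigma_X)              # step 3
        s2 = g_T(s, a, X)                                       # step 4
        Y = g_O_inv(s2, a, z)                                   # step 5
        phi2, psi2 = nw_update(*phi, X), nw_update(*psi, Y)     # step 6
        mu_Y, Sigma_Y = sample_nw(*psi)                         # step 7
        w = multivariate_normal(mu_Y, Sigma_Y).pdf(Y)           # step 8
        new_particles.append((s2, phi2, psi2))
        new_weights.append(w)
    w = np.array(new_weights, dtype=float)
    return new_particles, w / w.sum()                           # step 9
```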


SLIDE 13

Bayesian RL in Continuous POMDP

Monte Carlo Online Planning (Receding Horizon Control):

[Figure: lookahead search tree expanding the current belief b0 with candidate actions a1, a2, ..., an and sampled successor beliefs b1, b2, b3, ...]
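A sketch of the receding-horizon idea: estimate action values by sampled lookahead from the current belief, execute the best action, then replan at the next step (the simulator interface below is our own assumption, not the authors' implementation):

```python
import numpy as np

GAMMA = 0.85  # discount factor used in the experiments

def mc_plan(belief, depth, actions, simulate, rollout_value, n_samples=4):
    """Depth-limited Monte Carlo lookahead (receding horizon control).

    Estimates Q(b, a) for each candidate action by sampling a few
    (reward, observation, next belief) outcomes and recursing, then
    returns the best (value, action) pair.
    """
    if depth == 0:
        return rollout_value(belief), None
    best_q, best_a = -np.inf, None
    for a in actions:
        q = 0.0
        for _ in range(n_samples):
            # simulate(b, a) samples z ~ Pr(.|b, a) and returns the
            # expected reward, the observation, and the updated belief.
            r, z, next_belief = simulate(belief, a)
            v, _ = mc_plan(next_belief, depth - 1, actions,
                           simulate, rollout_value, n_samples)
            q += (r + GAMMA * v) / n_samples
        if q > best_q:
            best_q, best_a = q, a
    return best_q, best_a
```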


SLIDE 14

Experiments

Simple robot navigation task:
- $S$: $(x, y)$ position
- $A$: $(v, \theta)$, with velocity $v \in [0, 1]$ and angle $\theta \in [0, 2\pi]$
- $Z$: noisy $(x, y)$ position

$$g_T(s, a, X) = s + v \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} X, \qquad g_O(s', a, Y) = s' + Y$$

- $R(s, a) = I(\lVert s - s_{\mathrm{GOAL}} \rVert_2 < 0.25)$
- $\gamma = 0.85$
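A direct transcription of the task dynamics (variable names are ours):

```python
import numpy as np

def g_T(s, a, X):
    """Robot dynamics: displace by the noise vector X, rotated by the
    commanded angle theta and scaled by the commanded velocity v."""
    v, theta = a
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return s + v * (rot @ X)

def g_O(s_next, a, Y):
    """Observation model: noisy reading of the (x, y) position."""
    return s_next + Y

def reward(s, s_goal):
    return float(np.linalg.norm(s - s_goal) < 0.25)
```

Note that $g_O$ is one-to-one in $Y$ (simply $Y = z - s'$), which is exactly the assumption needed for the Monte Carlo belief monitoring above.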


SLIDE 15

Robot Navigation Task

We choose the exact parameters:

$$\mu_X = \begin{pmatrix} 0.8 \\ 0.3 \end{pmatrix}, \qquad \Sigma_X = \begin{pmatrix} 0.04 & -0.01 \\ -0.01 & 0.01 \end{pmatrix}, \qquad \Sigma_Y = \begin{pmatrix} 0.01 & 0 \\ 0 & 0.01 \end{pmatrix}$$

and start with a prior based on 10 "artificial" samples, with prior means $\hat\mu_X$, $\hat\mu_Y$ and prior covariances

$$\hat\Sigma_X = \begin{pmatrix} 0.04 & -0.01 \\ -0.01 & 0.16 \end{pmatrix}, \qquad \hat\Sigma_Y = \begin{pmatrix} 0.16 & 0 \\ 0 & 0.16 \end{pmatrix}$$

such that $\phi_0 = (\hat\mu_X, 10, 9, 9\hat\Sigma_X)$ and $\psi_0 = (\hat\mu_Y, 10, 9, 9\hat\Sigma_Y)$.

Each time the robot reaches the goal, a new goal is chosen randomly at 5 distance units from the previous goal.
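A sketch constructing such a prior; the covariances are taken from the slide, but the prior means below are placeholders, since the slide's exact values are not reproduced here:

```python
import numpy as np

# Prior covariances from the slide; the means are NOT from the slide.
Sigma_X_hat = np.array([[0.04, -0.01],
                        [-0.01, 0.16]])
Sigma_Y_hat = np.diag([0.16, 0.16])
mu_X_hat = np.array([1.0, 0.0])   # hypothetical placeholder
mu_Y_hat = np.zeros(2)            # hypothetical placeholder

def make_prior(mu_hat, Sigma_hat, n=10):
    """Normal-Wishart prior equivalent to n 'artificial' samples:
    (mu_hat, nu, alpha, tau) with nu = n, alpha = n - 1, and
    tau = alpha * Sigma_hat, matching phi0 = (mu_X_hat, 10, 9, 9*Sigma_X_hat)."""
    nu, alpha = float(n), float(n - 1)
    return mu_hat, nu, alpha, alpha * Sigma_hat

phi0 = make_prior(mu_X_hat, Sigma_X_hat)  # prior over (mu_X, Sigma_X)
psi0 = make_prior(mu_Y_hat, Sigma_Y_hat)  # prior over (mu_Y, Sigma_Y)
```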


SLIDE 16

Robot Navigation Task

Average evolution of the return over time:

[Figure: average return vs. training steps (50-250), comparing the learning agent against planning with the prior model and with the exact model.]

SLIDE 17

Robot Navigation Task

Average accuracy of the model over time:

[Figure: weighted L1 model error (WL1) vs. training steps (50-250).]

Model accuracy is measured as follows:

$$WL1(b) = \sum_{(s, \phi, \psi)} b(s, \phi, \psi)\,\Big[ \lVert \mu_\phi - \mu_X \rVert_1 + \lVert \Sigma_\phi - \Sigma_X \rVert_1 + \lVert \mu_\psi - \mu_Y \rVert_1 + \lVert \Sigma_\psi - \Sigma_Y \rVert_1 \Big]$$
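A sketch of this metric over a particle belief, where each particle carries its posterior-mean parameters (the interface is assumed by us):

```python
import numpy as np

def wl1(particles, weights, mu_X, Sigma_X, mu_Y, Sigma_Y):
    """Belief-weighted L1 distance between each particle's posterior-mean
    parameters and the true parameters. Each particle carries
    (mu_phi, Sigma_phi, mu_psi, Sigma_psi)."""
    err = 0.0
    for (mu_p, Sig_p, mu_q, Sig_q), w in zip(particles, weights):
        err += w * (np.abs(mu_p - mu_X).sum()
                    + np.abs(Sig_p - Sigma_X).sum()
                    + np.abs(mu_q - mu_Y).sum()
                    + np.abs(Sig_q - Sigma_Y).sum())
    return err
```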


SLIDE 18

Conclusion

We presented a new framework for planning in partially observable and continuous domains with uncertain model parameters. The optimal policy maximizes the expected long-term return given the prior over model parameters. Monte Carlo methods can be used for more tractable approximate belief monitoring and planning. Interesting future applications include human-computer interaction and robotics.
