SLIDE 1

Practical Open-Loop Optimistic Planning

Edouard Leurent¹,², Odalric-Ambrym Maillard¹

¹ SequeL, Inria Lille – Nord Europe   ² Renault Group

ECML PKDD 2019, Würzburg, September 2019

SLIDE 2–7

Motivation — Sequential Decision Making

The agent picks an action; the environment returns a state and a reward.

Markov Decision Processes

1. Observe the state s ∈ S;
2. Pick a discrete action a ∈ A;
3. Transition to a next state s′ ∼ P(s′ | s, a);
4. Receive a bounded reward r ∈ [0, 1] drawn from P(r | s, a).

Objective: maximise the expected discounted return
$$V = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$
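
Below is a minimal Python sketch (ours, not the authors' code) of this setting: a toy MDP exposing a generative model `sample(s, a)` with rewards in [0, 1], and a one-rollout Monte-Carlo estimate of the discounted return. The chain dynamics and the 10% slip probability are illustrative assumptions.

```python
import random

class ChainMDP:
    """Toy 2-action chain MDP: move left/right with a 10% slip,
    reward 1 only in the rightmost state (hypothetical example)."""
    def __init__(self, n_states=5):
        self.n_states = n_states

    def sample(self, state, action):
        """Draw s', r ~ P(s', r | s, a)."""
        step = 1 if action == 1 else -1
        if random.random() < 0.1:      # stochastic transition
            step = -step
        next_state = min(max(state + step, 0), self.n_states - 1)
        reward = 1.0 if next_state == self.n_states - 1 else 0.0
        return next_state, reward

def discounted_return(rewards, gamma=0.9):
    """One-rollout estimate of V = E[sum_t gamma^t r_t]."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```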

SLIDE 8

Motivation — Example

The highway-env environment.

We want to handle stochasticity.

SLIDE 9–14

Motivation — How to solve MDPs?

Online Planning

◮ We have access to a generative model: when queried, it yields samples s′, r ∼ P(s′, r | s, a).

[Diagram: the Agent receives the state from the Environment and passes it to the Planner; the Planner queries the generative model with actions and receives states and rewards; it returns an action recommendation, which the Agent executes, observing the next state and reward.]
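
The generative-model interface can be written down compactly; this typed Python sketch uses our own naming, not an API from the paper.

```python
from typing import Protocol, Tuple, TypeVar

S = TypeVar("S")  # state type
A = TypeVar("A")  # action type

class GenerativeModel(Protocol[S, A]):
    """Anything that can be queried for one transition sample."""
    def sample(self, state: S, action: A) -> Tuple[S, float]:
        """Draw s', r ~ P(s', r | s, a)."""
        ...
```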

SLIDE 15

Motivation — How to solve MDPs?

Online Planning

◮ Fixed budget: the model can only be queried n times.

Objective: minimise the simple regret, the value gap of the recommended action a_n:
$$r_n = V^\star - V(a_n), \qquad \text{minimise } \mathbb{E}\, r_n$$

An exploration-exploitation problem.
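
As a concrete baseline, here is a hedged sketch of fixed-budget planning with the interface above: it spends the n queries uniformly across first actions using random rollouts, then recommends the best empirical action. `uniform_plan`, the horizon, and the rollout policy are our illustrative choices, not the paper's algorithm.

```python
import random

def uniform_plan(model, state, actions, budget, gamma=0.9, horizon=10):
    """Uniform baseline: split the query budget evenly over first actions,
    estimate each action's return by random rollouts, recommend the best."""
    rollouts = budget // (len(actions) * horizon)   # one rollout = `horizon` queries
    values = {}
    for a in actions:
        total = 0.0
        for _ in range(rollouts):
            s, act, ret, disc = state, a, 0.0, 1.0
            for _ in range(horizon):
                s, r = model.sample(s, act)
                ret += disc * r
                disc *= gamma
                act = random.choice(actions)        # act randomly after the first step
            total += ret
        values[a] = total / max(rollouts, 1)
    # Simple regret of this recommendation: r_n = V* - V(argmax).
    return max(values, key=values.get)
```

For instance, `uniform_plan(ChainMDP(), 0, [0, 1], budget=200)` recommends a first action for the toy chain above.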

SLIDE 16–19

Optimistic Planning

Optimism in the Face of Uncertainty

Given a set of options a ∈ A with uncertain outcomes, try the one with the highest possible outcome.

◮ Either you performed well;
◮ or you learned something.

Instances

◮ Monte-Carlo Tree Search (MCTS) [Coulom 2006]: CrazyStone.
◮ Reframed in the bandit setting as UCT [Kocsis and Szepesvári 2006]; still very popular (e.g. AlphaGo).
◮ Asymptotic consistency was proved, but no regret bound.

SLIDE 20

Analysis of UCT

UCT was analysed in [Coquelin and Munos 2007]: its sample complexity is lower-bounded by Ω(exp(exp(D))), where D is the depth of the tree.

SLIDE 21

Failing cases of UCT

Not just a theoretical counter-example.

SLIDE 22–23

Can we get better guarantees?

OPD: Optimistic Planning for Deterministic systems

◮ Introduced by [Hren and Munos 2008]
◮ Another optimistic algorithm
◮ Only for deterministic MDPs (see the code sketch after this slide)

Theorem (OPD sample complexity)
$$\mathbb{E}\, r_n = O\left(n^{-\frac{\log 1/\gamma}{\log \kappa}}\right), \quad \text{if } \kappa > 1,$$
where κ ∈ [1, K] is the branching factor of near-optimal sequences.

OLOP: Open-Loop Optimistic Planning

◮ Introduced by [Bubeck and Munos 2010]
◮ Extends OPD to the stochastic setting
◮ Only considers open-loop policies, i.e. sequences of actions
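
To make OPD concrete, here is a compact sketch under assumptions of ours: a deterministic `model.step(s, a)` (the stochastic `sample` above degenerates to this in a deterministic MDP) and rewards in [0, 1]. It repeatedly expands the leaf with the highest optimistic bound (b-value), then recommends the first action of the best sequence found; a simplification, not the authors' implementation.

```python
def opd(model, root_state, actions, n_expansions, gamma=0.9):
    """Optimistic Planning for Deterministic systems [Hren and Munos 2008], sketched.
    Leaf = (u, depth, state, first_action), with u = sum_{t<=depth} gamma^t r_t."""
    leaves = [(0.0, 0, root_state, None)]
    for _ in range(n_expansions):
        # b-value: u + gamma^(d+1)/(1-gamma) upper-bounds any continuation.
        leaf = max(leaves, key=lambda lf: lf[0] + gamma ** (lf[1] + 1) / (1 - gamma))
        leaves.remove(leaf)
        u, d, s, first = leaf
        for a in actions:                       # expand the optimistic leaf
            s2, r = model.step(s, a)
            leaves.append((u + gamma ** (d + 1) * r, d + 1, s2,
                           a if first is None else first))
    # Recommend the first action of the sequence with the highest u-value.
    return max(leaves, key=lambda lf: lf[0])[3]
```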

SLIDE 24–26

The idea behind OLOP

A direct application of Optimism in the Face of Uncertainty:

1. We want $\max_a V(a)$;
2. Form upper confidence bounds on sequence values: $V(a) \le U_a$ w.h.p.;
3. Sample the sequence with the highest UCB: $\arg\max_a U_a$.

SLIDE 27–28

The idea behind OLOP

[Two illustration slides; the figures are not recoverable from the transcript.]

SLIDE 29–30

Under the hood

Upper-bounding the value of sequences:
$$V(a) = \underbrace{\sum_{t=1}^{h} \gamma^t \overbrace{\mu_{a_{1:t}}}^{\le U^\mu}}_{\text{follow the sequence}} + \underbrace{\sum_{t \ge h+1} \gamma^t \overbrace{\mu_{a^\star_{1:t}}}^{\le 1}}_{\text{act optimally}}$$

SLIDE 31–33

Under the hood

OLOP's main tool: the Chernoff-Hoeffding deviation inequality,
$$U^\mu_a(m) \overset{\text{def}}{=} \underbrace{\hat\mu_a(m)}_{\text{empirical mean}} + \underbrace{\sqrt{\frac{2\log M}{T_a(m)}}}_{\text{confidence interval}}$$

As in OPD, upper-bound all the future rewards by 1:
$$U_a(m) \overset{\text{def}}{=} \underbrace{\sum_{t=1}^{h} \gamma^t U^\mu_{a_{1:t}}(m)}_{\text{past rewards}} + \underbrace{\frac{\gamma^{h+1}}{1-\gamma}}_{\text{future rewards}}$$

Bounds sharpening:
$$B_a(m) \overset{\text{def}}{=} \inf_{1 \le t \le L} U_{a_{1:t}}(m)$$
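
These three quantities translate directly into code. A hedged sketch follows, with a data layout of our choosing: `stats` maps each sequence prefix (a tuple of actions) to its visit count T_a(m) and empirical mean reward.

```python
import math

def hoeffding_ucb(mean, count, M):
    """U^mu_a(m) = mu_hat_a(m) + sqrt(2 log M / T_a(m))."""
    if count == 0:
        return float("inf")             # unvisited prefix: vacuous bound
    return mean + math.sqrt(2 * math.log(M) / count)

def u_value(seq, stats, M, gamma):
    """U_a(m): discounted per-step reward UCBs, plus gamma^(h+1)/(1-gamma)
    as an optimistic bound on all rewards beyond the sequence."""
    total = 0.0
    for t in range(1, len(seq) + 1):
        count, mean = stats.get(seq[:t], (0, 0.0))
        total += gamma ** t * hoeffding_ucb(mean, count, M)
    return total + gamma ** (len(seq) + 1) / (1 - gamma)

def b_value(seq, stats, M, gamma):
    """Bound sharpening: B_a(m) = inf over prefixes of U_{a_{1:t}}(m)."""
    return min(u_value(seq[:t], stats, M, gamma)
               for t in range(1, len(seq) + 1))
```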

SLIDE 34

OLOP guarantees

Theorem (OLOP sample complexity)
OLOP satisfies:
$$\mathbb{E}\, r_n = \begin{cases} O\left(n^{-\frac{\log 1/\gamma}{\log \kappa'}}\right), & \text{if } \gamma\sqrt{\kappa'} > 1 \\[4pt] O\left(n^{-\frac{1}{2}}\right), & \text{if } \gamma\sqrt{\kappa'} \le 1 \end{cases}$$

"Remarkably, in the case κγ² > 1, we obtain the same rate for the simple regret as Hren and Munos (2008). Thus, in this case, we can say that planning in stochastic environments is not harder than planning in deterministic environments."

SLIDE 35

Does it work?

Our objective: understand and bridge this gap. Make OLOP practical.

SLIDE 36–38

What's wrong with OLOP?

Explanation: inconsistency

◮ Unintended behaviour happens when U^μ_a(m) > 1 for all a:
$$U^\mu_a(m) = \underbrace{\hat\mu_a(m)}_{\in [0,1]} + \underbrace{\sqrt{\frac{2\log M}{T_a(m)}}}_{> 0}$$

◮ Then the sequence $(U_{a_{1:t}}(m))_t$ is increasing:
$$U_{a_{1:1}}(m) = \gamma U^\mu_{a_1}(m) + \gamma^2 \cdot 1 + \gamma^3 \cdot 1 + \dots$$
$$U_{a_{1:2}}(m) = \gamma U^\mu_{a_1}(m) + \gamma^2 \underbrace{U^\mu_{a_2}}_{>1} + \gamma^3 \cdot 1 + \dots$$

◮ Then $B_a(m) = U_{a_{1:1}}(m)$: the sharpened bound collapses to the depth-one bound, as the numeric check below illustrates.
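
A tiny numeric check of this failure mode, with toy numbers of our choosing:

```python
import math

gamma, M = 0.9, 100
mu_hat, T = 0.8, 4                    # empirical mean 0.8 after 4 visits
U_mu = mu_hat + math.sqrt(2 * math.log(M) / T)   # ~2.32 > 1
U1 = gamma * U_mu + gamma ** 2 / (1 - gamma)     # stop bounding at depth 1
U2 = gamma * U_mu + gamma ** 2 * U_mu + gamma ** 3 / (1 - gamma)  # depth 2
assert U2 > U1   # extending the sequence RAISES the bound, so
                 # B_a(m) = inf_t U_{a_{1:t}}(m) = U_{a_{1:1}}(m)
```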

SLIDE 39

What's wrong with OLOP?

What we were promised

SLIDE 40

What's wrong with OLOP?

What we actually get

OLOP behaves as uniform planning!

SLIDE 41–42

Our contribution: Kullback-Leibler OLOP

We summon the upper confidence bound from kl-UCB [Cappé et al. 2013]:
$$U^\mu_a(m) \overset{\text{def}}{=} \max\{q \in I : T_a(m)\, d(\hat\mu_a(m), q) \le f(m)\}$$

Algorithm | Interval I | Divergence d | f(m)
OLOP      | ℝ          | d_QUAD       | 4 log M
KL-OLOP   | [0, 1]     | d_BER        | 2 log M + 2 log log M

where
$$d_{\text{QUAD}}(p, q) \overset{\text{def}}{=} 2(p - q)^2, \qquad d_{\text{BER}}(p, q) \overset{\text{def}}{=} p \log\frac{p}{q} + (1 - p)\log\frac{1-p}{1-q}$$
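
This bound is easy to compute numerically: d_BER(μ̂, ·) is convex and increasing on [μ̂, 1], so the maximum can be found by bisection. A hedged Python sketch (our tolerances and edge-case handling, not the authors' code; assumes M > 1):

```python
import math

def d_ber(p, q, eps=1e-12):
    """Bernoulli KL divergence d_BER(p, q), clipped away from {0, 1}."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb(mu_hat, count, M, iters=50):
    """U^mu_a(m) = max{q in [0, 1] : T_a(m) d_BER(mu_hat_a(m), q) <= f(m)},
    with f(m) = 2 log M + 2 log log M as in the KL-OLOP column above."""
    if count == 0:
        return 1.0                      # unvisited: the bound stays at 1
    f_m = 2 * math.log(M) + 2 * math.log(math.log(M))
    lo, hi = mu_hat, 1.0
    for _ in range(iters):              # bisection on the increasing branch
        mid = (lo + hi) / 2
        if count * d_ber(mu_hat, mid) <= f_m:
            lo = mid
        else:
            hi = mid
    return lo

# Unlike the Hoeffding bound above, kl_ucb(0.8, 4, 100) stays inside [0, 1].
```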

SLIDE 43–46

Our contribution: Kullback-Leibler OLOP

[Figure: the convex curve q ↦ d_BER(μ̂_a, q), with the confidence level f(m)/T_a(m) drawn as a horizontal line; its crossings define the lower and upper bounds L^μ_a and U^μ_a, both inside [0, 1].]

And now,

◮ U^μ_a(m) ∈ I = [0, 1], ∀a.
◮ The sequence (U_{a_{1:t}}(m))_t is non-increasing.
◮ B_a(m) = U_a(m): the bound-sharpening step is superfluous.

SLIDE 47

Sample complexity

Theorem (KL-OLOP sample complexity)
KL-OLOP enjoys the same regret bounds as OLOP. More precisely, it satisfies:
$$\mathbb{E}\, r_n = \begin{cases} O\left(n^{-\frac{\log 1/\gamma}{\log \kappa'}}\right), & \text{if } \gamma\sqrt{\kappa'} > 1 \\[4pt] O\left(n^{-\frac{1}{2}}\right), & \text{if } \gamma\sqrt{\kappa'} \le 1 \end{cases}$$

SLIDE 48

Time complexity

Original KL-OLOP: compute B_a(m − 1) for every sequence a ∈ A^L, i.e. over the whole tree of depth L.

Lazy KL-OLOP: restrict the computation to the subtree that has actually been explored.

Property (time and memory complexity)
$$\frac{C(\text{Lazy KL-OLOP})}{C(\text{KL-OLOP})} = \frac{nK}{K^L}$$
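
For intuition, an illustrative computation with numbers of our choosing, assuming the ratio above:

```python
K, L, n = 2, 20, 1000     # actions, planning horizon, sample budget
print(n * K / K ** L)     # ~0.0019: the lazy version touches only a tiny
                          # fraction of the full depth-L tree
```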

SLIDE 49–51

Experiments — Expanded Trees

[Three figure slides comparing the trees expanded by the planners; the plots are not recoverable from the transcript.]

SLIDE 52

Experiments — Highway

SLIDE 53

Experiments — Gridworld

SLIDE 54

Experiments — Stochastic Gridworld

SLIDE 55

References

Sébastien Bubeck and Rémi Munos. "Open Loop Optimistic Planning". In: Proc. of COLT. 2010.

Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. "Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation". In: The Annals of Statistics 41.3 (2013), pp. 1516–1541.

Pierre-Arnaud Coquelin and Rémi Munos. "Bandit Algorithms for Tree Search". In: Proc. of UAI. 2007.

Rémi Coulom. "Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search". In: Proc. of International Conference on Computers and Games. 2006.

Jean-François Hren and Rémi Munos. "Optimistic planning of deterministic systems". In: Lecture Notes in Computer Science (2008).

Levente Kocsis and Csaba Szepesvári. "Bandit Based Monte-Carlo Planning". In: Proc. of ECML PKDD. 2006.

SLIDE 56

Thank You.