SLIDE 1

Monte Carlo Tree Search guided by Symbolic Advice for MDPs

Damien Busatto-Gaston, Debraj Chakraborty and Jean-François Raskin

Université Libre de Bruxelles

September 16, 2020 (HIGHLIGHTS 2020)



SLIDE 3

Markov Decision Process

[Figure: an MDP with states s0, s1, s2 and actions a1, a2, a3, a4, with transition probabilities 2/3, 1/3, 1, 1/2, 1/2, 3/4 and 1/4.]

Path of length 2: s0 −(a1, 2/3)→ s1 −(a3, 1/2)→ s2

Finite-horizon total reward (horizon H):

Val(s0) = sup_{σ : Paths → A} E[Reward(p)], where p is a random variable over Paths_H(s0, σ)

Link with the infinite-horizon average reward, for H large enough.
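The finite-horizon value above can be computed by backward induction over the horizon. A minimal sketch in Python, on a toy MDP loosely inspired by the slide's example; the transition probabilities and the rewards are made up for illustration:

```python
# Toy MDP, hypothetical numbers: MDP[state][action] is a distribution over
# successor states, REWARD gives the immediate reward of each (state, action).
MDP = {
    "s0": {"a1": {"s1": 2 / 3, "s0": 1 / 3}, "a2": {"s2": 1.0}},
    "s1": {"a3": {"s2": 1 / 2, "s0": 1 / 2}},
    "s2": {"a4": {"s2": 1.0}},
}
REWARD = {("s0", "a1"): 0, ("s0", "a2"): 0, ("s1", "a3"): 1, ("s2", "a4"): 2}

def value(state, horizon):
    """Val(state) for the finite-horizon total reward: backward induction,
    maximizing immediate reward plus the expected value of the remainder."""
    if horizon == 0:
        return 0.0
    return max(
        REWARD[(state, a)] + sum(p * value(t, horizon - 1) for t, p in succ.items())
        for a, succ in MDP[state].items()
    )
```

For instance, value("s2", 1) is 2.0 here, since a4 loops on s2 with reward 2.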


SLIDE 6

Monte Carlo tree search (MCTS)

[Figure: a sparse search tree rooted at s0, with value estimates v1, v2, v4 attached to the actions a1, a2, a4.]

Iterative construction of a sparse tree with value estimates:
- selection of a new node
- simulation
- update of the estimates

SLIDE 7

Monte Carlo tree search (MCTS)

With UCT (Kocsis & Szepesvári, 2006) as the selection strategy:
- after a given number of iterations n, MCTS outputs the best action
- the probability of choosing a suboptimal action converges to zero
- vi converges to the real value of ai at a speed of (log n)/n
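The UCT selection rule referred to above balances the value estimates vi against an exploration bonus. A hedged sketch of the standard UCB1 formula it is based on (the constant c and the bookkeeping of visit counts are simplified placeholders):

```python
import math

def uct_select(children, c=math.sqrt(2)):
    """Pick the action maximizing v_i + c * sqrt(ln(n) / n_i), where
    `children` maps an action to (value_estimate, visit_count) and n is the
    total number of visits. Unvisited actions are always tried first."""
    n = sum(visits for _, visits in children.values())

    def ucb(item):
        _action, (value, visits) = item
        if visits == 0:
            return float("inf")  # force exploration of unvisited actions
        return value + c * math.sqrt(math.log(n) / visits)

    return max(children.items(), key=ucb)[0]
```

With estimates {"a1": (0.9, 50), "a2": (0.5, 5)} the rarely visited a2 is selected despite its lower estimate; this forced exploration is what drives the (log n)/n convergence mentioned above.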

SLIDE 8

Symbolic advice


SLIDE 11

Symbolic advice

[Figure: the unfolding of the MDP up to depth H, with the paths excluded by the advice crossed out.]

An advice is a subset of Paths_H(s0), defined symbolically as a logical formula ϕ (a reachability or safety property, an LTL formula over finite traces, a regular expression, ...)

ϕ defines a pruning of the unfolded MDP.
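Concretely, an advice can be viewed as a predicate that keeps or discards length-H paths of the unfolding. A small illustration with a hypothetical transition structure (probabilities are omitted, since the pruning only depends on the support):

```python
# Successor sets of a hypothetical MDP: (state, action) -> possible successors.
TRANS = {
    ("s0", "a1"): {"s0", "s1"}, ("s0", "a2"): {"s2"},
    ("s1", "a3"): {"s0", "s2"}, ("s2", "a4"): {"s2"},
}

def paths(state, horizon):
    """All alternating state/action sequences of `horizon` steps from `state`."""
    if horizon == 0:
        return [(state,)]
    out = []
    for (s, a), succs in TRANS.items():
        if s == state:
            for t in sorted(succs):
                out.extend((state, a) + rest for rest in paths(t, horizon - 1))
    return out

# A safety advice: "never come back to s0 after the first step".
def phi(path):
    return "s0" not in path[2:]

pruned = [p for p in paths("s0", 2) if phi(p)]
```

Here phi discards 4 of the 6 paths of length 2, so MCTS would only ever explore the two remaining branches of the unfolding.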

SLIDE 12

Symbolic advice

[Figure: the pruned unfolding of depth H under a strongly enforceable advice.]

Strongly enforceable advice:
- can be enforced by the controller if the MDP is seen as a game
- does not partially prune stochastic transitions
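The "does not partially prune stochastic transitions" condition can be checked directly on an advice given as a set of paths. A sketch over a hypothetical transition structure; this illustrates the condition, not the paper's exact definition:

```python
# Supports of a hypothetical MDP's stochastic transitions.
SUPPORT = {
    ("s0", "a1"): {"s0", "s1"}, ("s0", "a2"): {"s2"},
    ("s1", "a3"): {"s0", "s2"}, ("s2", "a4"): {"s2"},
}

def no_partial_pruning(advice_paths):
    """True iff whenever the advice allows a prefix ending in an action, it
    also allows every successor in that transition's support."""
    for p in advice_paths:
        for i in range(2, len(p), 2):  # p[:i] ends with ..., state, action
            s, a = p[i - 2], p[i - 1]
            seen = {q[i] for q in advice_paths if q[:i] == p[:i]}
            if seen != SUPPORT[(s, a)]:
                return False
    return True

ALL = [
    ("s0", "a1", "s0", "a1", "s0"), ("s0", "a1", "s0", "a1", "s1"),
    ("s0", "a1", "s0", "a2", "s2"), ("s0", "a1", "s1", "a3", "s0"),
    ("s0", "a1", "s1", "a3", "s2"), ("s0", "a2", "s2", "a4", "s2"),
]
SAFE = [p for p in ALL if "s0" not in p[2:]]  # a safety advice
```

The unrestricted advice ALL trivially satisfies the condition, while the safety advice SAFE keeps only one successor of the stochastic transition (s0, a1), so it is not strongly enforceable as is: a strongly enforceable version would have to remove a1 from s0 altogether.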



SLIDE 17

Boolean Solvers

The advice ϕ can be encoded as a Boolean formula ψ.

QBF solver:
- a first action a0 is compatible with ϕ iff ∀s1 ∃a1 ∀s2 ..., s0 a0 s1 a1 s2 ... |= ψ
- this gives an inductive way of constructing paths that satisfy the strongly enforceable advice ϕ

Weighted sampling:
- simulation of safe paths according to ψ
- weighted SAT sampling (Chakraborty, Fremont, Meel, Seshia, & Vardi, 2014)
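The ∀∃ alternation behind the QBF encoding can be spelled out on a finite unfolding: the controller picks actions, the environment picks successor states, and ψ is checked on the completed path. A hedged sketch over a hypothetical transition structure, recursing explicitly instead of calling an actual QBF solver:

```python
TRANS = {
    ("s0", "a1"): {"s0", "s1"}, ("s0", "a2"): {"s2"},
    ("s1", "a3"): {"s0", "s2"}, ("s2", "a4"): {"s2"},
}
ACTIONS = {}
for (s, a) in TRANS:
    ACTIONS.setdefault(s, []).append(a)

def enforceable(path, horizon, psi):
    """∃a ∀s' ∃a' ... : however the environment resolves the stochastic
    transitions, the completed path satisfies psi."""
    if horizon == 0:
        return psi(path)
    s = path[-1]
    return any(
        all(enforceable(path + (a, t), horizon - 1, psi) for t in TRANS[(s, a)])
        for a in ACTIONS.get(s, [])
    )

def compatible(state, action, horizon, psi):
    """A first action is compatible with the advice iff psi can be enforced
    against every successor the environment may pick."""
    return all(
        enforceable((state, action, t), horizon - 1, psi)
        for t in TRANS[(state, action)]
    )

def psi(p):
    """Safety formula: never come back to s0 after the first step."""
    return "s0" not in p[2:]
```

Under this psi, a1 is not compatible from s0 (the environment can send the play back to s0), while a2 is.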

SLIDE 18

MCTS under advice

SLIDE 19

MCTS under advice

[Figure: the MCTS tree, where the branches pruned by the advice are crossed out.]

- Select actions in the unfolding pruned by a selection advice ϕ
- Simulation is restricted according to a simulation advice ψ
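A simulation advice can be plugged into the rollout phase by filtering the sampled actions. A minimal sketch with a hypothetical MDP; a real implementation would sample successors according to their probabilities and restrict to paths that can still satisfy ψ:

```python
import random

TRANS = {
    ("s0", "a1"): {"s0", "s1"}, ("s0", "a2"): {"s2"},
    ("s1", "a3"): {"s0", "s2"}, ("s2", "a4"): {"s2"},
}
ACTIONS = {}
for (s, a) in TRANS:
    ACTIONS.setdefault(s, []).append(a)

def rollout(state, horizon, allowed, rng=None):
    """Simulate one path, choosing uniformly among the actions that the
    simulation advice `allowed` keeps at the current prefix."""
    rng = rng or random.Random(0)
    path = (state,)
    for _ in range(horizon):
        s = path[-1]
        candidates = [a for a in ACTIONS.get(s, []) if allowed(path, a)]
        if not candidates:
            break  # the advice prunes every action: abort this simulation
        a = rng.choice(candidates)
        path += (a, rng.choice(sorted(TRANS[(s, a)])))
    return path
```

With an advice that forbids a1, every rollout from s0 is forced through a2 into the loop on s2.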


SLIDE 22

MCTS under advice: Convergence properties

With UCT (Kocsis & Szepesvári, 2006) as the selection strategy:
- the probability of choosing a suboptimal action converges to zero
- vi converges to the real value of ai at a speed of (log n)/n

These convergence properties are maintained:
- for all simulation advice
- for all selection advice which are strongly enforceable advice and satisfy an optimality assumption: the advice does not prune all optimal actions

SLIDE 23

Experimental results

SLIDE 24

Experimental results

Figure: 9x21 maze, 4 random ghosts

Algorithm                  | % of win | % of loss | % of no result* | % of food eaten
MCTS                       |    17    |    59     |       24        |       67
MCTS + selection advice    |    25    |    54     |       21        |       71
MCTS + simulation advice   |    71    |    29     |        0        |       88
MCTS + both advices        |    85    |    15     |        0        |       94
Human                      |    44    |    56     |        0        |       75

* after 300 steps

SLIDE 25

Future works

- A compiler from LTL to symbolic advice
- Study interactions with reinforcement learning techniques (and neural networks)
- Weighted advice

SLIDE 26

Thank You