Bandit-based Search for Constraint Programming, Manuel Loth (PowerPoint presentation)



SLIDE 1

Bandit-based Search for Constraint Programming

Manuel Loth (1,2,4), Michèle Sebag (2,4,1), Youssef Hamadi (3,1), Marc Schoenauer (4,2,1), Christian Schulte (5)

(1) Microsoft-INRIA joint centre, (2) LRI, Univ. Paris-Sud and CNRS, (3) Microsoft Research Cambridge, (4) INRIA Saclay, (5) KTH, Stockholm

Review AERES, Nov. 2013


1 / 23

SLIDE 2

Search/Optimization and Machine Learning

Different Learning contexts

◮ Supervised (from examples) vs reinforcement (from reward)
◮ Off-line (static) vs on-line (while searching)

Here: use on-line reinforcement learning (MCTS) to improve CP search

SLIDE 3

Main idea

Constraint Programming

◮ Explore a search tree
◮ Heuristics: (learn to) order variables & values

Monte-Carlo Tree Search
◮ A tree-search method
◮ Breakthrough for games and planning

Hybridizing MCTS and CP: Bandit-based Search for Constraint Programming

SLIDE 4

Overview

◮ MCTS
◮ BaSCoP
◮ Experimental validation
◮ Conclusions and Perspectives

SLIDE 5

The Multi-Armed Bandit problem

Lai & Robbins, 1985

In a casino, one wants to maximize one's gains while playing.

Lifelong learning: the Exploration vs Exploitation dilemma

◮ Play the best arm so far? (exploitation)
◮ But there might exist better arms... (exploration)

SLIDE 6

The Multi-Armed Bandit problem (2)

◮ K arms; arm i yields reward 1 with probability µ_i, 0 otherwise
◮ At each time t, one selects an arm i*_t and gets a reward r_t

n_{i,t} = number of times arm i has been selected in [0, t]
µ̂_{i,t} = average reward of arm i in [0, t]

Upper Confidence Bound (Auer et al., 2002)

Be optimistic when facing the unknown: select

    argmax_i { µ̂_{i,t} + C · sqrt( log( Σ_j n_{j,t} ) / n_{i,t} ) }

ε-greedy
◮ with probability 1 − ε, select argmax_i { µ̂_{i,t} } (exploitation)
◮ else select an arm uniformly at random (exploration)
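The two selection rules above can be sketched in a few lines of Python (an illustration; the function names and the counts/means bookkeeping are our own, not from the talk):

```python
import math
import random

def ucb_select(counts, means, C=1.0):
    """Pick the arm maximizing mean_i + C * sqrt(log(t) / n_i), t = total plays.

    Arms never tried yet get an infinite bonus, so they are played first."""
    t = sum(counts)
    best, best_score = 0, float("-inf")
    for i, (n, mu) in enumerate(zip(counts, means)):
        score = float("inf") if n == 0 else mu + C * math.sqrt(math.log(t) / n)
        if score > best_score:
            best, best_score = i, score
    return best

def epsilon_greedy_select(counts, means, eps=0.1):
    """With probability 1 - eps exploit the best empirical mean, else explore uniformly."""
    if random.random() < eps:
        return random.randrange(len(means))
    return max(range(len(means)), key=lambda i: means[i])
```

After each play, the caller would update `counts[i]` and the running average `means[i]` with the observed reward; larger `C` (or `eps`) shifts the balance toward exploration.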

SLIDE 7

Monte-Carlo Tree Search

Kocsis & Szepesvári, 2006

UCT == UCB for Trees: gradually grow the search tree

◮ Iterate tree-walks
◮ Building blocks:
  ◮ Select next action (bandit phase)
  ◮ Add a node (grow a leaf of the search tree)
  ◮ Select next action bis (random phase, roll-out)
  ◮ Compute instant reward (evaluate)
  ◮ Update information in visited nodes of the search tree (propagate)
◮ Returned solution: path visited most often

[Figure: the explored tree within the full search tree]
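The tree-walk loop above can be written out as follows (a generic MCTS sketch, not the BaSCoP code; the `state` interface with `actions()`, `apply()`, `is_terminal()`, `reward()` is a hypothetical stand-in):

```python
import math
import random

class Node:
    """One node of the gradually grown search tree."""
    def __init__(self):
        self.children = {}  # action -> Node
        self.visits = 0
        self.value = 0.0    # mean reward of the tree-walks through this node

def ucb(parent, child, C):
    """UCB score of a child, as used in the bandit phase."""
    if child.visits == 0:
        return float("inf")
    return child.value + C * math.sqrt(math.log(parent.visits) / child.visits)

def tree_walk(root, state, C=1.4):
    """One MCTS iteration: bandit descent, leaf growth, roll-out, propagation."""
    path, node = [root], root
    # 1. Bandit phase: descend while every action already has a child node
    while not state.is_terminal() and len(node.children) == len(state.actions()):
        a = max(state.actions(), key=lambda a: ucb(node, node.children[a], C))
        state.apply(a)
        node = node.children[a]
        path.append(node)
    # 2. Grow one leaf of the search tree
    if not state.is_terminal():
        a = random.choice([a for a in state.actions() if a not in node.children])
        node.children[a] = Node()
        state.apply(a)
        node = node.children[a]
        path.append(node)
    # 3. Random phase (roll-out) down to a terminal state
    while not state.is_terminal():
        state.apply(random.choice(state.actions()))
    # 4. Evaluate, then propagate along the visited nodes
    r = state.reward()
    for n in path:
        n.visits += 1
        n.value += (r - n.value) / n.visits
    return r
```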

SLIDE 22

Overview

◮ MCTS
◮ BaSCoP
◮ Experimental validation
◮ Conclusions and Perspectives

SLIDE 23

Adaptation

Main issues

◮ Which default policy? (random phase)
◮ Which reward?
◮ Which selection rule?

Desired

◮ As problem-independent as possible
◮ Compatible with multiple restarts
◮ (Some) guarantees of completeness

SLIDE 24

Default policy: Depth-first search (DFS)

◮ Enforces completeness
◮ Accounts for priors about values (some are better than others; neighborhood of the last best solution)
◮ Limited memory required: under each MCTS leaf node, store the current DFS path (assignments on the left of the DFS path are closed)

SLIDE 25

Reward

◮ With multiple restarts, rewards cannot be attached to tree nodes
◮ → rewards are attached to elementary assignments, i.e. (variable = value)

Guiding principles
◮ Variables: fail first (existing heuristics perform well)
◮ Values: fail deep
  → reward(var = val) = 1 if the failure is deeper than the (local) average, 0 otherwise

Discussion

◮ Compatible with multiple restarts
◮ Noise: var might occur at different depths
◮ But noise equally affects all values of var
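The "fail deep" reward can be sketched as a small bookkeeping class (illustrative only; the class and field names are our own, and the actual BaSCoP code may aggregate failure depths differently):

```python
class FailDeepReward:
    """Per-assignment 'fail deep' reward: 1 if the current failure is deeper
    than the running average failure depth seen for that (variable, value)
    assignment, 0 otherwise."""
    def __init__(self):
        self.count = {}      # (var, val) -> number of observed failures
        self.avg_depth = {}  # (var, val) -> running average failure depth

    def reward(self, var, val, fail_depth):
        key = (var, val)
        n = self.count.get(key, 0)
        avg = self.avg_depth.get(key, 0.0)
        # 1 iff this failure is strictly deeper than the local average so far
        r = 1.0 if n > 0 and fail_depth > avg else 0.0
        # incremental update of the running average
        self.count[key] = n + 1
        self.avg_depth[key] = avg + (fail_depth - avg) / (n + 1)
        return r
```

Keeping statistics per assignment rather than per tree node is what makes the reward survive restarts: the (variable = value) pair is meaningful in every restarted tree.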

SLIDE 26

Selection rules

L-value: left value (0); R-value: right value (1)

Baselines (non-adaptive)
◮ Uniform
◮ ε-left: with probability 1 − ε select the L-value, otherwise the R-value

Adaptive selection rules

◮ UCB: select the value val maximizing

    reward(var = val) + C · sqrt( log( Σ_v n(var = v) ) / n(var = val) )

◮ UCB-left: same, but C_left = ρ · C_right, with ρ > 1
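As a sketch, the UCB-left rule for a binary (L-value/R-value) choice might look like this (the `stats` dict is hypothetical bookkeeping, not the Gecode/BaSCoP data structures):

```python
import math

def ucb_left_select(stats, var, C_right=0.1, rho=2.0):
    """Choose the value (0 = left, 1 = right) for `var` maximizing
    mean_reward + C * sqrt(log(total) / n), with a larger constant on the
    left branch (C_left = rho * C_right) to bias exploration leftward.

    stats[(var, val)] = (n, mean_reward): selection count and average reward."""
    total = sum(stats.get((var, v), (0, 0.0))[0] for v in (0, 1))
    best_val, best_score = 0, float("-inf")
    for val, C in ((0, rho * C_right), (1, C_right)):
        n, mean = stats.get((var, val), (0, 0.0))
        # untried values get priority; otherwise standard UCB score
        score = float("inf") if n == 0 else mean + C * math.sqrt(math.log(total) / n)
        if score > best_score:
            best_val, best_score = val, score
    return best_val
```

With equal empirical rewards the larger left constant breaks the tie toward the L-value, which mirrors the ε-left baseline's positional bias while still adapting to observed rewards.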

SLIDE 27

Overview

◮ MCTS
◮ BaSCoP
◮ Experimental validation
◮ Conclusions and Perspectives

SLIDE 28

Goal of experiments

Compare BaSCoP with baselines

◮ DFS alone
◮ Adaptive and non-adaptive selection rules

Genericity

◮ Robustness wrt multiple restarts
◮ Sensitivity analysis wrt parameters

SLIDE 29

Experimental setting

Algorithmic framework: Gecode

http://gecode.org

Top policies:
  non-adaptive: Uniform, ε-Left
  adaptive: UCB, UCB-Left
Bottom policy: Depth-First Search

Parameters
◮ ε ∈ {.05, .1, .15, .2} (ε-Left)
◮ C ∈ {.05, .1, .2, .5} (UCB)
◮ ρ ∈ {1, 2, 4, 8} (UCB-Left)

SLIDE 30

Benchmark problems

Job-shop scheduling

◮ 40 Taillard instances

http://mistic.heig-vd.ch/taillard/

◮ Multiple restarts (Luby sequence), neighborhood search
◮ Performance: mean relative error (to best known results)

Car-sequencing

◮ 70 instances, circa 200 n-ary variables
◮ Performance: number of violations
◮ No restart

All results averaged over 11 runs
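The Luby restart sequence mentioned above (1, 1, 2, 1, 1, 2, 4, ...) can be generated with the standard recurrence (a common textbook formulation, independent of the talk's implementation):

```python
def luby(i):
    """i-th term (1-based) of the Luby restart sequence: 1, 1, 2, 1, 1, 2, 4, ...

    If i == 2^k - 1, the term is 2^(k-1); otherwise recurse on the position
    within the current block."""
    k = 1
    while (1 << k) - 1 < i:
        k += 1
    if (1 << k) - 1 == i:
        return 1 << (k - 1)
    return luby(i - (1 << (k - 1)) + 1)
```

In restart-based search the i-th restart is granted a cutoff proportional to `luby(i)`, which balances many short runs against occasional long ones.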

SLIDE 31

Structures of visited trees

[Figure: typical tree shapes under Uniform, UCB, ε-Left, and UCB-Left on a JSP Taillard 15×20 instance]

SLIDE 32

Experimental Results

State-of-the-art results on several instances (200 000 tree-walks)

[Figure: mean relative error to the best-known solution vs. number of tree-walks (20,000 to 100,000), for DFS, Balanced, ε-left (ε = 0.15), UCB (C = 0.1), and UCB-left (C_l = 0.2, C_r = 0.1)]

Sample result: mean relative error on Taillard 11-20

SLIDE 33

Car Sequencing

[Figure: car assembly line with option stations and capacities (1/2, 2/3, 2/5)]

◮ Car assembly line, different options on the ordered cars
◮ Stalls can handle a given number of cars
◮ Arrange the car sequence so as not to exceed any capacity
  → minimize the number of empty stalls

n-ary variables, no restart, no positional bias of values

SLIDE 34

Car Sequencing

[Figure: number of empty stalls per instance (5 to 35), DFS vs. UCB with C ∈ {0.05, 0.1, 0.2, 0.5}]

BaSCoP is modestly but significantly better than DFS... but both are significantly worse than ad hoc heuristics

SLIDE 35

Overview

◮ MCTS
◮ BaSCoP
◮ Experimental validation
◮ Conclusions and Perspectives

SLIDE 36

Conclusion

BaSCoP integrated in the Gecode framework

◮ Generic heuristics for value ordering
◮ Compatible with multiple restarts
◮ DFS as rollout policy provides completeness guarantees
◮ Improves on DFS on 2 of 3 benchmark families
◮ State-of-the-art CP results on JSP without any ad hoc heuristics

SLIDE 37

Perspectives

Extensions

◮ Rank-based reward for values (for n-ary contexts)
◮ When there is no restart, full MCTS (rewards attached to partial assignments)

◮ Rewards for variable ordering
◮ Control of the parallelization scheme (adaptive work stealing)
