Learning to Branch
Ellen Vitercik Joint work with Nina Balcan, Travis Dick, and Tuomas Sandholm Published in ICML 2018
Integer Programs (IPs): maximize c · z subject to Az ≤ b, z ∈ {0,1}^n
Facility location problems can be formulated as IPs.
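For concreteness, one standard way to write facility location as an IP (a sketch; the symbols f_j, c_ij, x_ij, y_j are illustrative, not from the talk):

```latex
% Uncapacitated facility location as an IP: y_j = 1 iff facility j is opened
% (cost f_j); x_{ij} = 1 iff client i is served by facility j (cost c_{ij}).
% All symbols here are illustrative, not taken from the talk.
\begin{align*}
\min \;& \sum_j f_j y_j + \sum_{i,j} c_{ij} x_{ij} \\
\text{s.t.}\;& \sum_j x_{ij} = 1 && \forall i \\
& x_{ij} \le y_j && \forall i, j \\
& x_{ij}, y_j \in \{0,1\} && \forall i, j
\end{align*}
```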
Clustering problems can be formulated as IPs.
Binary classification problems can be formulated as IPs.
maximize c · z subject to Az ≤ b, z ∈ {0,1}^n (NP-hard)
"You may need to experiment."
Delivery company routes trucks daily
Use integer programming to select routes
Demand changes every day
Solve hundreds of similar optimizations
Using this set of typical problems… can we learn the best parameters?
Application-specific distribution → Algorithm designer → B&B parameters
Samples: (A^(1), b^(1), c^(1)), …, (A^(m), b^(m), c^(m))
How can we use samples to find the best B&B parameters for my domain?
Model has been studied in applied communities [Hutter et al. '09].
Model has been studied from a theoretical perspective [Gupta and Roughgarden '16; Balcan et al. '17].
"Best" could mean smallest search tree, for example.
Samples: (A^(1), b^(1), c^(1)), (A^(2), b^(2), c^(2)), …
How do we find parameters that are best on average over the samples (A^(1), b^(1), c^(1)), (A^(2), b^(2), c^(2)), …? Will those parameters have high performance in expectation on a new instance (A, b, c)?
max (40, 60, 10, 10, 3, 20, 60) · z
s.t. (40, 50, 30, 10, 10, 40, 30) · z ≤ 100, z ∈ {0,1}^7

Branch-and-bound (B&B) builds a search tree over this IP:

The root's LP relaxation has solution (1/2, 1, 0, 0, 0, 0, 1) with objective value 140.

Branch on y1:
y1 = 0: LP solution (0, 1, 0, 1, 0, 1/4, 1), value 135
y1 = 1: LP solution (1, 3/5, 0, 0, 0, 0, 1), value 136

Under y1 = 1, branch on y2:
y2 = 0: LP solution (1, 0, 0, 1, 0, 1/2, 1), value 120
y2 = 1: LP solution (1, 1, 0, 0, 0, 0, 1/3), value 120

Under y1 = 0, branch on y6:
y6 = 0: LP solution (0, 1, 1/3, 1, 0, 0, 1), value ≈ 133.3
y6 = 1: LP solution (0, 3/5, 0, 0, 0, 1, 1), value 116

Under y1 = 0, y6 = 0, branch on y3:
y3 = 0: LP solution (0, 1, 0, 1, 1, 0, 1), value 133 (integral)
y3 = 1: LP solution (0, 4/5, 1, 0, 0, 0, 1), value 118

A leaf is fathomed (pruned) when:
i. its LP relaxation solution is integral,
ii. its LP relaxation is infeasible, or
iii. its LP relaxation's value isn't better than the best-known integral solution.

Here (0, 1, 0, 1, 1, 0, 1), with value 133, is integral and becomes the best-known integral solution. Every other leaf's LP bound (118, 116, 120, 120) is no better than 133, so all remaining leaves are fathomed and B&B returns the optimal solution, of value 133.
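The walkthrough above can be reproduced with a minimal branch-and-bound sketch. This is a simplification, not the talk's method: the knapsack LP relaxation is solved greedily, and the code branches on the LP solution's fractional variable rather than using a learned scoring rule.

```python
# Minimal B&B sketch for the knapsack example above (a simplification:
# the knapsack LP relaxation has a greedy solution, and we branch on the
# fractional variable instead of using a scoring rule).
values = [40, 60, 10, 10, 3, 20, 60]
weights = [40, 50, 30, 10, 10, 40, 30]
CAP = 100

def lp_relaxation(fixed):
    """Greedy fractional-knapsack solution respecting fixed variables.
    Returns (objective, index of the fractional variable or None),
    or None if the fixed variables already exceed the capacity."""
    cap = CAP - sum(weights[i] for i, v in fixed.items() if v == 1)
    if cap < 0:
        return None
    obj = sum(values[i] for i, v in fixed.items() if v == 1)
    frac_var = None
    free = [i for i in range(len(values)) if i not in fixed]
    for i in sorted(free, key=lambda i: values[i] / weights[i], reverse=True):
        if weights[i] <= cap:
            cap -= weights[i]
            obj += values[i]
        elif cap > 0:
            obj += values[i] * cap / weights[i]  # take a fraction of item i
            frac_var, cap = i, 0
    return obj, frac_var

def branch_and_bound():
    best, nodes = 0, [dict()]  # depth-first search over partial assignments
    while nodes:
        fixed = nodes.pop()
        lp = lp_relaxation(fixed)
        if lp is None:          # fathom: LP relaxation infeasible
            continue
        obj, frac_var = lp
        if obj <= best:         # fathom: bound no better than incumbent
            continue
        if frac_var is None:    # fathom: LP solution is integral
            best = obj
            continue
        for v in (0, 1):        # branch on the fractional variable
            nodes.append({**fixed, frac_var: v})
    return best

print(branch_and_bound())  # 133
```

Because the branching order differs from the slides, this sketch explores a different tree, but it fathoms nodes with the same three rules and recovers the optimal value 133.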
This talk: How to choose which variable?
(Assume every other aspect of B&B is fixed.)
Variable selection policies can have a huge effect on tree size
Score-based variable selection policy (VSP): at leaf Q, branch on the variable y_i maximizing score(Q, i). Many options! Little is known about which rule to use when.

For example, at the node with LP solution (1, 3/5, 0, 0, 0, 0, 1) and value 136, branching on y2 gives children with LP solutions (1, 0, 0, 1, 0, 1/2, 1) (y2 = 0, value 120) and (1, 1, 0, 0, 0, 0, 1/3) (y2 = 1, value 120).
For an IP instance Q:
Let Q_i^- be Q with y_i set to 0, and let Q_i^+ be Q with y_i set to 1.
Let c̃_Q denote the objective value of the LP relaxation of Q.

Example. For max (40, 60, 10, 10, 3, 20, 60) · z s.t. (40, 50, 30, 10, 10, 40, 30) · z ≤ 100, z ∈ {0,1}^7, the root's LP relaxation solution is (1/2, 1, 0, 0, 0, 0, 1), so c̃_Q = 140. Branching on y1 gives children with LP solutions (0, 1, 0, 1, 0, 1/4, 1) and (1, 3/5, 0, 0, 0, 0, 1), so c̃_{Q_1^-} = 135 and c̃_{Q_1^+} = 136.

The linear rule (parameterized by μ) [Linderoth & Savelsbergh, 1999]: branch on the variable y_i maximizing
score(Q, i) = μ · min(c̃_Q − c̃_{Q_i^-}, c̃_Q − c̃_{Q_i^+}) + (1 − μ) · max(c̃_Q − c̃_{Q_i^-}, c̃_Q − c̃_{Q_i^+})
The (simplified) product rule [Achterberg, 2009]: branch on the variable y_i maximizing
score(Q, i) = (c̃_Q − c̃_{Q_i^-}) · (c̃_Q − c̃_{Q_i^+})

The linear rule (parameterized by μ) [Linderoth & Savelsbergh, 1999]: branch on the variable y_i maximizing
score(Q, i) = μ · min(c̃_Q − c̃_{Q_i^-}, c̃_Q − c̃_{Q_i^+}) + (1 − μ) · max(c̃_Q − c̃_{Q_i^-}, c̃_Q − c̃_{Q_i^+})

And many more…
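Both rules are simple functions of the three LP bounds. A small sketch (the function names are mine; the example numbers are the root of the knapsack example, where c̃_Q = 140, c̃_{Q_1^-} = 135, c̃_{Q_1^+} = 136):

```python
def linear_score(mu, c_q, c_minus, c_plus):
    """Linear rule: convex combination of the smaller and larger
    LP-bound decreases caused by the two branches."""
    lo = min(c_q - c_minus, c_q - c_plus)
    hi = max(c_q - c_minus, c_q - c_plus)
    return mu * lo + (1 - mu) * hi

def product_score(c_q, c_minus, c_plus):
    """(Simplified) product rule: product of the two bound decreases."""
    return (c_q - c_minus) * (c_q - c_plus)

print(linear_score(0.5, 140, 135, 136))  # 4.5
print(product_score(140, 135, 136))      # 20
```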
Our parameterized rule. Given d scoring rules score1, …, score_d, the goal is to learn the best convex combination μ1 · score1 + ⋯ + μd · score_d. Branch on the variable y_i maximizing score(Q, i) = μ1 · score1(Q, i) + ⋯ + μd · score_d(Q, i).
Application-specific distribution → Algorithm designer → B&B parameters
Samples: (A^(1), b^(1), c^(1)), …, (A^(m), b^(m), c^(m))
How can we use samples to find the best B&B parameters for my domain?
Application-specific distribution → Algorithm designer → B&B parameters μ1, …, μd
Samples: (A^(1), b^(1), c^(1)), …, (A^(m), b^(m), c^(m))
How can we use samples to find the best B&B parameters μ1, …, μd for my domain?
[Plot: average tree size as a function of the parameter μ.]
This has been prior work's approach [e.g., Achterberg (2009)]: discretize the parameter μ and pick the value with the smallest average tree size. [Plot: average tree size at discretized values of μ.]
[Plot: average tree size at the discretized parameter values.]
[Plot: average tree size as a function of μ, where the best parameters fall between the discretized values.] This can actually happen!
Theorem [informal]. For any discretization, there exists a problem instance distribution D inducing this behavior.

Proof ideas:
D's support consists of infeasible IPs with "easy out" variables.
B&B takes exponential time unless it branches on the "easy out" variables.
B&B only finds the "easy outs" if it uses parameters from a specific range.

[Plot: expected tree size as a function of μ.]
i. Single-parameter settings ii. Multi-parameter settings
There exists a κ upper bounding the size of the largest tree we are willing to build (a common assumption).
μ ∈ [0,1]

Lemma: For any two scoring rules and any IP Q, O((# variables)^(κ+2)) intervals partition [0,1] such that for any interval [a, b], B&B builds the same tree across all μ ∈ [a, b]. (Much smaller in our experiments!)
Proof idea. At the root, each variable y_i's score μ · score1(Q, i) + (1 − μ) · score2(Q, i) is a linear function of μ, and B&B branches on the variable whose line is highest. The crossing points of these lines partition [0,1] into intervals on which the root's branching variable is fixed (e.g., any μ in one interval branches on y2). Recursing on the children Q_2^- and Q_2^+ subdivides each interval further (branch on y2 then y3 in one subinterval, on y2 then y1 in another). Within each resulting interval, the variable selection order is fixed, so B&B builds the same tree.
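The root-level step of this argument can be sketched directly: each variable's score line is linear in μ, so the candidate interval endpoints are the pairwise crossings of those lines (a sketch; the (score1, score2) pairs below are illustrative, not from the talk):

```python
def crossings(lines):
    """Given per-variable pairs (a_i, b_i) so that variable i's root score
    is f_i(mu) = mu*a_i + (1-mu)*b_i, return the crossing points in (0, 1):
    the branching choice at the root is constant between consecutive ones."""
    pts = set()
    for j, (a1, b1) in enumerate(lines):
        for (a2, b2) in lines[j + 1:]:
            denom = (a1 - b1) - (a2 - b2)
            if denom != 0:  # parallel score lines never cross
                mu = (b2 - b1) / denom
                if 0 < mu < 1:
                    pts.add(mu)
    return sorted(pts)

# Two variables whose score lines f1 = 5 - mu and f2 = 2 + 4*mu cross once:
print(crossings([(4, 5), (6, 2)]))  # [0.6]
```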
Input: a set of IPs sampled from a distribution D.
For each IP, set μ = 0. While μ < 1: find the largest μ' such that B&B with the rule μ'' · score1 + (1 − μ'') · score2 builds the same tree T for every μ'' ∈ [μ, μ'] (takes a little bookkeeping); record T and set μ = μ'.
Return: any μ̂ from the interval minimizing average tree size.
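Assuming each sample IP has already been reduced to a piecewise-constant map from μ to B&B tree size (the interval data below is made up for illustration), the final selection step looks like this sketch:

```python
def best_mu(piecewise_sizes):
    """Pick a mu minimizing average tree size over the samples, where each
    sample is a list of (left, right, size) intervals covering [0, 1]."""
    # Collect every breakpoint from every sample.
    points = sorted({p for pw in piecewise_sizes
                     for (l, r, _) in pw for p in (l, r)})
    def size_at(pw, mu):
        return next(s for (l, r, s) in pw if l <= mu <= r)
    best = None
    for l, r in zip(points, points[1:]):
        mid = (l + r) / 2  # any mu in an elementary interval behaves the same
        avg = sum(size_at(pw, mid) for pw in piecewise_sizes) / len(piecewise_sizes)
        if best is None or avg < best[0]:
            best = (avg, mid)
    return best[1]

samples = [
    [(0.0, 0.3, 50), (0.3, 1.0, 10)],  # IP 1: small trees for mu > 0.3
    [(0.0, 0.6, 15), (0.6, 1.0, 40)],  # IP 2: small trees for mu < 0.6
]
print(best_mu(samples))  # a mu between 0.3 and 0.6, where both trees are small
```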
Theorem. Let μ̂ be the algorithm's output given Õ(κ³/ε² · ln(# variables)) samples. W.h.p.,
E_{Q~D}[tree-size(Q, μ̂)] − min_{μ ∈ [0,1]} E_{Q~D}[tree-size(Q, μ)] < ε.

Proof intuition: bound the algorithm class's intrinsic complexity (IC); learning theory allows us to translate IC to sample complexity.
i. Single-parameter settings ii. Multi-parameter settings
Lemma: For any d scoring rules and any IP, a set H of O((# variables)^(κ+2)) hyperplanes partitions [0,1]^d such that for any connected component C of [0,1]^d \ H, B&B builds the same tree across all μ ∈ C.
Fix d scoring rules and draw samples Q_1, …, Q_m ~ D. If m = Õ(κ³/ε² · ln(d · # variables)), then w.h.p., for all μ ∈ [0,1]^d,
|1/m · Σ_{j=1}^m tree-size(Q_j, μ) − E_{Q~D}[tree-size(Q, μ)]| < ε.
Average tree size generalizes to expected tree size.
Let:
score1(Q, i) = min(c̃_Q − c̃_{Q_i^-}, c̃_Q − c̃_{Q_i^+})
score2(Q, i) = max(c̃_Q − c̃_{Q_i^-}, c̃_Q − c̃_{Q_i^+})

Our parameterized rule: branch on the variable y_i maximizing score(Q, i) = μ · score1(Q, i) + (1 − μ) · score2(Q, i). This is the linear rule [Linderoth & Savelsbergh, 1999].
Leyton-Brown, Pearson, and Shoham. Towards a universal test suite for combinatorial auction algorithms.
"Regions" generator: 400 bids, 200 goods, 100 instances. "Arbitrary" generator: 200 bids, 100 goods, 100 instances.
Facility location: 70 facilities, 70 customers, 500 instances. Clustering: 5 clusters, 35 nodes, 500 instances. Agnostically learning linear separators: 50 points in ℝ², 500 instances.
How can we train faster?
What other tree-building applications can we apply our techniques to?
How can we attack other learning problems in B&B?
Thank you! Questions?