SLIDE 1

Benchmarking Solvers, SAT-style

Martin Nyx Brain, James H. Davenport & Alberto Griggio [1]
University of Oxford, University of Bath, Fondazione Bruno Kessler
Martin.Brain@cs.ox.ac.uk, J.H.Davenport@bath.ac.uk, Griggio@FBK.eu

29 July 2017

[1] Thanks to EU H2020-FETOPEN-2016-2017-CSA project SC² (712689) and the many partners on that project: www.sc-square.org

Davenport Benchmarking Solvers, SAT-style

SLIDE 4

(Caricature of) Attitudes

Guess which is which?

SC I want to win the next competition, which will have a mixture of hard and easy problems, and be judged on time-to-solve.

SC I want to submit a paper with timings that make my algorithm look good (on hard problems, ideally ones other people can’t solve).

SLIDE 5

Thesis

The SAT community, and hence the SMT community, have substantial experience in benchmarking solvers against each other on large sample sets, and publishing summaries, whereas the computer algebra community tends to time solvers on a small set of problems, and to publish individual times, with, at best, selective comparison.

SLIDE 6

Survivor plot

[Figure: survivor plot of # of instances against log-accumulated time. Axis ticks: 500 to 2000 instances; 0.01 to 10000 for time. Series: base-newrw-strict-tan-msat, base-newrw-strict-tan-cvc4, base-newrw-strict-tan-yices, base-newrw-strict-tan-z3, base-newrw-strict-tan-best.]

SLIDE 13

Methodology

1. For each method separately:
   1. Solve each problem $p_i$, noting the time $t_i$ (up to some threshold $T$).
   2. Sort the $t_i$ into increasing order (discarding the time-out ones).
   3. Plot the points $(t_1, 1)$, $(t_1 + t_2, 2)$ etc., and in general $\left(\sum_{i=1}^{k} t_i, k\right)$.
2. Place all the plots on the same axes, optionally (as we did) using a logarithmic scale for time.

N.B. There is therefore no guarantee that the same problems were used to produce time results from different solvers.
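
The steps above can be sketched in Python as follows. This is only an illustrative sketch: the function name `survivor_points`, the use of `None` to mark a time-out, and the sample timings are assumptions made for the example, not something taken from the talk.

```python
from itertools import accumulate

def survivor_points(times, threshold):
    """Return survivor-plot points (cumulative time, k) for one solver.

    `times` holds one entry per problem; None (or anything above
    `threshold`) is treated as a time-out and discarded.
    """
    solved = sorted(t for t in times if t is not None and t <= threshold)
    # k-th point: (t_1 + ... + t_k, k), with the t_i in increasing order.
    return [(total, k) for k, total in enumerate(accumulate(solved), start=1)]

# Hypothetical timings in seconds for one solver, with threshold T = 100.
points = survivor_points([4.0, None, 0.5, 250.0, 1.5], threshold=100.0)
print(points)  # [(0.5, 1), (2.0, 2), (6.0, 3)]
```

Calling the function once per solver and drawing each list on shared axes (with a logarithmic time axis if desired) gives the combined plot.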

SLIDE 14

Cactus Plots [BH15]

[Figure: cactus plot titled "HWMCC'15 Cactus SINGLE Track SAT+UNSAT". Axis ticks: 100 to 300 and 1000 to 3000. Solvers: abcsimple, abcsuprove, nuxmv, pdtravthrd, avy, iimc, tip2014, v3s, blimc, aigbmc, shiftbmc, tip2014bmc, nuxmvbmc, iproverhc, pdtravdeep, ricecnu, iproverdeephc, iproverdeep, iprover.]

SLIDE 23

Methodology

1. For each method separately:
   1. Solve each problem $p_i$, noting the time $t_i$ (up to some threshold $T$).
   2. Sort the $t_i$ into increasing order (discarding the time-out ones).
   3. Plot the points $(1, t_1)$, $(2, t_1 + t_2)$ etc., and in general $\left(k, \sum_{i=1}^{k} t_i\right)$.
      Or plot the points $(1, t_1)$, $(2, t_2)$ etc., and in general $(k, t_k)$.
2. Place all the plots on the same axes.
   * Again, logarithmic time is possible.

N.B. There is therefore no guarantee that the same problems were used to produce time results from different solvers.
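
The two cactus-plot variants differ only in whether the sorted times are accumulated. A minimal sketch, assuming time-outs are recorded as `None`; the function name `cactus_points`, the `cumulative` flag and the data are hypothetical:

```python
from itertools import accumulate

def cactus_points(times, threshold, cumulative=True):
    """Return cactus-plot points (k, time) for one solver.

    cumulative=True gives (k, t_1 + ... + t_k); cumulative=False gives
    the alternative (k, t_k). Time-outs (None or > threshold) are dropped.
    """
    solved = sorted(t for t in times if t is not None and t <= threshold)
    ys = accumulate(solved) if cumulative else solved
    return list(enumerate(ys, start=1))

timings = [4.0, None, 0.5, 1.5]  # hypothetical data, in seconds
print(cactus_points(timings, 100.0))                    # [(1, 0.5), (2, 2.0), (3, 6.0)]
print(cactus_points(timings, 100.0, cumulative=False))  # [(1, 0.5), (2, 1.5), (3, 4.0)]
```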

SLIDE 24

Cumulative Density [XHHLB08]

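[Figure: cumulative distribution of solver runtimes. Axes: Runtime [CPU sec], with logarithmic ticks from 10^-1 to 10^3, and a scale from 10 to 100. Labels: Pre-solving, AvgFeature. Series: Oracle(S), SATzilla07(S,D_h), March_dl04, Minisat2.0, Vallst.]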

SLIDE 25

Example from ISSAC (Brown)

[Figure: "Number of Examples Completed as a Function of Time". Axes: Examples Completed; log base 2 of time in ms. Series: cad441, cad442, cad443, cad541, cad542, cad543, cad641, cad642, nucad441, nucad442, nucad443, nucad541, nucad542, nucad543, nucad641, nucad642, nucad643.]

SLIDE 29

Virtual Best Solver/“Oracle”

The SAT competition has taken to including a “virtual best solver” (VBS), which is synthesised from the other results by taking, for each benchmark, the minimum time taken across all solvers tested. Thus the VBS time is always equal to the time of some real solver, but which one varies from benchmark to benchmark. The VBS can be added to the survivor/cactus plot, or indeed the CDF, to get a feeling for the variability between solvers.

We often count how often each solver is the VBS, which is itself an interesting metric. A variation on counting is provided by [JLMS16], who measure how often a solver is within one second of being the VBS. Their justification is “The constant of one second was chosen since we consider a smaller difference as insignificant, especially in the context of 800 second time-out”.
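
As a rough illustration, the VBS and both counting metrics can be computed from a table of per-benchmark times. The sketch below assumes results are held as a dictionary of dictionaries with `None` for a time-out; the function names, the tie-handling rule and the data are assumptions for the example, and the one-second check is only one reading of the [JLMS16] criterion.

```python
from collections import Counter

def virtual_best(results):
    """results maps solver name -> {benchmark: time, or None for a time-out}.

    Returns (vbs_times, wins): the minimum time per benchmark over all
    solvers, and how often each solver achieves that minimum.
    """
    benchmarks = set().union(*(r.keys() for r in results.values()))
    vbs_times, wins = {}, Counter()
    for b in sorted(benchmarks):
        solved = {s: r[b] for s, r in results.items() if r.get(b) is not None}
        if not solved:
            continue  # no solver finished, so the VBS times out as well
        best = min(solved.values())
        vbs_times[b] = best
        # Assumption: ties are credited to every solver attaining the minimum.
        wins.update(s for s, t in solved.items() if t == best)
    return vbs_times, wins

def near_best_counts(results, vbs_times, slack=1.0):
    """Count how often each solver is within `slack` seconds of the VBS."""
    return {
        s: sum(1 for b, best in vbs_times.items()
               if r.get(b) is not None and r[b] - best <= slack)
        for s, r in results.items()
    }

# Hypothetical results for two solvers on three benchmarks.
results = {
    "A": {"p1": 2.0, "p2": 30.0, "p3": None},
    "B": {"p1": 2.5, "p2": 10.0, "p3": None},
}
vbs_times, wins = virtual_best(results)
near = near_best_counts(results, vbs_times)
print(vbs_times, dict(wins), near)
```

In this made-up data each solver is the VBS once, but solver B is within one second of the VBS on both solved benchmarks, which is the kind of distinction the [JLMS16] variant is meant to expose.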

SLIDE 32

Multiple Copies

One of the effects of having a solution process whose running time is widely variable is that one may well not be best served by just running the process to termination. In the case of a single processor, this issue was considered by [LSZ93], who suggested running the process up to certain time limits and then starting afresh (and indeed proved this almost optimal), where the limits were of the form

T, T, 2T, T, T, 2T, 4T, T, T, 2T, T, T, 2T, 4T, 8T, ...,

where T is some arbitrary unit. This is in fact the default behaviour in MiniSAT 2.2.0, where it is known as Luby (though T is in fact measured in terms of conflicts rather than time, and it’s not a complete restart that is performed, as certain learned clauses are kept).
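
The sequence of multipliers can be generated recursively. A minimal sketch follows; the function name `luby` and the 1-based indexing are choices made for this example, and this is not MiniSAT's own code.

```python
def luby(i):
    """Return the i-th multiplier (1-based) of the sequence 1, 1, 2, 1, 1, 2, 4, ..."""
    k = 1
    while (1 << k) - 1 < i:      # find the smallest k with i <= 2**k - 1
        k += 1
    if i == (1 << k) - 1:        # end of a block: the multiplier doubles
        return 1 << (k - 1)
    return luby(i - (1 << (k - 1)) + 1)   # otherwise repeat the earlier prefix

multipliers = [luby(i) for i in range(1, 16)]
print(multipliers)  # [1, 1, 2, 1, 1, 2, 4, 1, 1, 2, 1, 1, 2, 4, 8]
```

Multiplying each entry by the chosen unit T gives the successive run limits.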

SLIDE 35

Parallel running

These days, with processors getting more numerous rather than faster, we might consider running multiple copies in parallel. To see how this might help, consider the trivial case of a process whose running time is $1$, $K$ or $K^2$ with equal probability. Then the average time to solution is $\frac{1}{3}(1 + K + K^2) = 37$ when $K = 10$.

Running two copies and aborting the other when one finds the solution has an average time to solution of $\frac{1}{9}(5 + 3K + K^2) = 15$ when $K = 10$, so the CPU cost is 30 units, still less than the sequential cost. Similarly, three copies gives $\frac{1}{27}(19 + 7K + K^2) = 7$ when $K = 10$, so the CPU cost is 21 units, even better. For $K = 10$, the minimum CPU cost is achieved at 8-fold parallelism, with time-to-solution 1.36 units, and a CPU cost of 10.9 units.

The break-even point for two-fold parallel running is $K = \frac{1}{2}\left(3 + \sqrt{37}\right) \approx 4.5$, and for three-fold running it is $K = 4$. It is worth noting, though, that a single Luby process with $T = \frac{1}{3}$ (to avoid $T = 1$ getting lucky) achieves an average time to solution (and cost) of $\approx 9$.
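
The figures for n parallel copies can be checked with exact rational arithmetic. This is a sketch under the stated model (each copy independently takes 1, K or K² with probability 1/3, and all copies stop when the first finishes); the function name `expected_time` is introduced here for illustration.

```python
from fractions import Fraction

def expected_time(n, K):
    """Expected time-to-solution for n independent copies, stopping at the first finisher."""
    p_none_fast = Fraction(2, 3) ** n   # no copy finishes in time 1
    p_all_slow = Fraction(1, 3) ** n    # every copy takes K**2
    return (1 - p_none_fast) + (p_none_fast - p_all_slow) * K + p_all_slow * K**2

K = 10
for n in (1, 2, 3, 8):
    t = expected_time(n, K)
    print(n, float(t), float(n * t))    # copies, elapsed time, CPU cost

# Number of copies (searched up to 19) that minimises CPU cost n * E[time].
cheapest = min(range(1, 20), key=lambda n: n * expected_time(n, K))
print(cheapest)  # 8
```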

SLIDE 36

Normal Distributions: plot

SLIDE 37

Normal Distributions: log time plot

SLIDE 39

Normal Distributions: compared

Note that we get very different conclusions from the two.

SLIDE 41

Uniform in log(t)

It seems that running twice and running thrice were very similar, and in fact that running twice took almost half the time of running once, meaning that they were almost equivalent in cost.

SLIDE 42

Uniform in log(t): Analysis

In fact, this model is susceptible to algebraic treatment, and the formulae (running from 1 to $B$ seconds, with numeric values for $B = 10$, and $\log$ denoting the natural logarithm) are as follows:

$$\text{once} = \frac{B - 1}{\log B} \approx 3.9087$$

$$\text{twice} = \frac{2}{(\log B)^2}\left(B - (\log B + 1)\right) \approx 2.5264$$

$$\text{thrice} = \frac{6}{(\log B)^3}\left(B - \left(\tfrac{1}{2}(\log B)^2 + \log B + 1\right)\right) \approx 1.9887$$

Hence in fact the “running thrice” number is approximately correct, at one-half the elapsed time of running once.
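
The closed forms can be cross-checked numerically. The sketch below assumes the model is a running time of B**U with U uniform on [0, 1], so that the minimum of n runs has density n(1 - u)^(n - 1) in u; the function name `mean_min_time` and the midpoint-rule integration are choices made for this example.

```python
import math

def mean_min_time(B, n, steps=100_000):
    """Midpoint-rule estimate of the expected minimum of n runs, each taking B**U."""
    h = 1.0 / steps
    return sum(B ** ((j + 0.5) * h) * n * (1 - (j + 0.5) * h) ** (n - 1)
               for j in range(steps)) * h

B = 10.0
L = math.log(B)  # natural logarithm
closed = {
    1: (B - 1) / L,
    2: 2 / L**2 * (B - (L + 1)),
    3: 6 / L**3 * (B - (L**2 / 2 + L + 1)),
}
for n, value in closed.items():
    # number of copies, closed form, numerical estimate
    print(n, round(value, 4), round(mean_min_time(B, n), 4))
```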

SLIDE 43

Notes

1. Brown was using (similar) random examples in each line.
2. [BH15] had over 300 examples, and this is not uncommon:
   * Indeed a talk today had > 5000 examples.
3. My slide had a misleading conclusion from only 20 samples.
4. Plotting time and log(time) gives very different graphs.

SLIDE 44

Questions?

SLIDE 45

Bibliography I

[BH15] A. Biere and K. Heljanko. Hardware Model Checking Competition Report HWMCC’15. http://fmv.jku.at/hwmcc15/Biere-HWMCC15-talk.pdf, 2015.

[JLMS16] M. Janota, I. Lynce, and J. Marques-Silva. Algorithms for computing backbones of propositional formulae. AI Communications, 28:161–177, 2016.

[LSZ93] M. Luby, A. Sinclair, and D. Zuckerman. Optimal Speedup of Las Vegas algorithms. Information Processing Letters, 47:173–180, 1993.

SLIDE 46

Bibliography II

[XHHLB08] L. Xu, F. Hutter, H.H. Hoos, and K. Leyton-Brown. SATzilla: Portfolio-based Algorithm Selection for SAT. Journal of Artificial Intelligence Research, 32:565–606, 2008.
