

SLIDE 1

Performance of Clause Selection Heuristics for Saturation-Based Theorem Proving

Stephan Schulz · Martin Möhrmann


SLIDE 2

Agenda

◮ Introduction
◮ Heuristics for saturating theorem proving
◮ Saturation with the given-clause algorithm
◮ Clause selection heuristics
◮ Experimental setup
◮ Results and analysis
◮ Comparison of heuristics
◮ Potential for improvement: how good are we?
◮ Conclusion

SLIDE 3

Introduction

◮ Heuristics are crucial for first-order theorem provers
◮ Practical experience is clear
◮ Proof search happens in an infinite search space
◮ Proofs are rare
◮ A lot of collected developer experience (folklore)
◮ . . . but no (published) systematic evaluation
◮ . . . and no (published) recent evaluation at all

SLIDE 4

Saturating Theorem Proving

◮ Search state is a set of first-order clauses
◮ Inferences add new clauses
◮ Existing clauses are premises
◮ Inference generates a new clause
◮ If the clause set is unsatisfiable, the empty clause can eventually be derived
◮ Redundancy elimination (rewriting, subsumption, . . . ) simplifies the search state
◮ Inference rules try to minimize necessary consequences
◮ Restricted by term orderings
◮ Restricted by literal orderings
◮ Question: In which order do we compute potential consequences?
◮ Given-clause algorithm
◮ Controlled by clause selection heuristic

SLIDES 5-8

The Given-Clause Algorithm

[Diagram, built up over four slides: the given-clause loop between U (unprocessed clauses) and P (processed clauses), with boxes Generate, Simplifiable?, Cheap Simplify, Simplify, and the termination test g = ☐.]

◮ Aim: Move everything from U to P
◮ Invariant: All generating inferences with premises from P have been performed
◮ Invariant: P is interreduced
◮ Clauses added to U are simplified with respect to P
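The loop above can be captured in a few lines. The sketch below is illustrative, not E's implementation: the clause representation and the `select`, `generate`, and `simplify` functions are placeholders supplied by the caller, and `select` is exactly the clause selection choice point discussed on the following slides.

```python
def given_clause_loop(initial_clauses, select, generate, simplify):
    """Illustrative sketch of the given-clause algorithm.

    U: unprocessed clauses, P: processed clauses.
    `select`   -- the clause selection heuristic (the choice point)
    `generate` -- all generating inferences between g and P
    `simplify` -- reduces a clause w.r.t. P; returns None if redundant
    """
    U = set(initial_clauses)
    P = set()
    while U:
        g = select(U)              # the clause selection choice point
        U.remove(g)
        g = simplify(g, P)
        if g is None:              # redundant: discard
            continue
        if len(g) == 0:            # empty clause derived: proof found
            return "unsatisfiable"
        P.add(g)
        # maintain invariant: all generating inferences with P performed
        for c in generate(g, P):
            c = simplify(c, P)
            if c is not None:
                U.add(c)
    return "satisfiable"           # saturated without the empty clause
```

With clauses represented as frozensets of literals and `generate` performing resolution, the loop derives the empty clause exactly when the input set is unsatisfiable.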

SLIDES 9-12

Choice Point: Clause Selection

[Diagram, built up over four slides: the same given-clause loop between U (unprocessed clauses) and P (processed clauses); the highlighted choice point is the selection of the given clause g from U.]

◮ Aim: Move everything from U to P
◮ Without generation: Only choice point!
◮ With generation: Still the major dynamic choice point!
◮ With simplification: Still the major dynamic choice point!

SLIDES 13-15

The Size of the Problem

[Diagram: the given-clause loop, with U (unprocessed clauses) drawn much larger than P to show where clauses accumulate.]

◮ |U| ∼ |P|²
◮ |U| ≈ 3 · 10⁷ after 300 s

How do we make the best choice among millions?
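The quadratic relationship can be sanity-checked with a toy model. Under the simplifying (and purely illustrative) assumption that every pair of processed clauses yields a fixed number of generated clauses, |U| grows as |P|(|P|−1)/2:

```python
def unprocessed_estimate(processed, inferences_per_pair=1):
    """Toy model of search-state growth: if every pair of processed
    clauses yields `inferences_per_pair` generated clauses, the
    unprocessed set grows quadratically in |P|."""
    return inferences_per_pair * processed * (processed - 1) // 2
```

Under this model, |U| ≈ 3 · 10⁷ already corresponds to fewer than 8000 processed clauses, which is why the selection heuristic has to pick from millions of candidates.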

SLIDE 16

Basic Clause Selection Heuristics

◮ Basic idea: Clauses ordered by heuristic evaluation
◮ The heuristic assigns a numerical value to each clause
◮ Clauses with smaller (better) evaluations are processed first
◮ Example: Evaluation by symbol counting
◮ |{f(X) = a, P(a) = $true, g(Y) = f(a)}| = 10
◮ Motivation: Small clauses are general; the empty clause ☐ has 0 symbols
◮ Best-first search
◮ Example: FIFO evaluation
◮ Clause evaluation based on generation time (always prefer older clauses)
◮ Motivation: Simulate breadth-first search, find shortest proofs
◮ Combine best-first/breadth-first search
◮ E.g. pick 4 out of every 5 clauses according to size, the last according to age
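Both basic evaluations and their interleaving are easy to sketch. The representation below (terms as nested tuples, every function symbol, constant, and variable counting as one symbol) and all helper names are assumptions for illustration, not E's code:

```python
def symbol_count(term):
    """Symbols in a term given as nested tuples, e.g. ('f', 'X') for f(X);
    plain strings are variables or constants and count as one symbol."""
    if isinstance(term, str):
        return 1
    _head, *args = term
    return 1 + sum(symbol_count(a) for a in args)

def clause_weight(clause):
    """Symbol-counting evaluation of a clause given as a list of
    (lhs, rhs) equations. Smaller is better; the empty clause weighs 0."""
    return sum(symbol_count(l) + symbol_count(r) for l, r in clause)

def selection_order(clauses, pick_given_ratio=4):
    """Interleave best-first and breadth-first selection: of every
    pick_given_ratio + 1 picks, the first pick_given_ratio go by weight,
    the last by age (list index = generation order)."""
    by_weight = iter(sorted(range(len(clauses)),
                            key=lambda i: (clause_weight(clauses[i]), i)))
    by_age = iter(range(len(clauses)))
    done, order, step = set(), [], 0
    while len(order) < len(clauses):
        fifo_turn = step % (pick_given_ratio + 1) == pick_given_ratio
        queue = by_age if fifo_turn else by_weight
        for i in queue:               # skip clauses the other queue already took
            if i not in done:
                done.add(i)
                order.append(i)
                break
        step += 1
    return order
```

With `pick_given_ratio=4` this reproduces the slide's "4 by size, 1 by age" scheme; setting it to 0 degenerates to pure FIFO.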

SLIDES 17-18

Clause Selection Heuristics in E

◮ Many symbol-counting variants
◮ E.g. assign different weights to symbol classes (predicates, functions, variables)
◮ E.g. goal-directed: lower weight for symbols occurring in the original conjecture
◮ E.g. ordering-aware/calculus-aware: higher weight for symbols in inference terms
◮ Arbitrary combinations of base evaluation functions
◮ E.g. 5 priority queues ordered by different evaluation functions, weighted round-robin selection

E can simulate nearly all other approaches to clause selection!
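The multi-queue scheme can be approximated with one heap per evaluation function and a weighted round-robin over them. This is an illustrative sketch under assumed representations, not E's actual data structures:

```python
import heapq

def multi_queue_order(clauses, weighted_evals):
    """Select clauses via several priority queues in weighted round-robin.

    `weighted_evals` is a list of (picks, eval_fn) pairs: each round takes
    `picks` best clauses according to eval_fn, skipping clauses that were
    already selected from another queue."""
    heaps = []
    for _, evaluate in weighted_evals:
        heap = [(evaluate(c), i) for i, c in enumerate(clauses)]
        heapq.heapify(heap)
        heaps.append(heap)
    done, order = set(), []
    while len(order) < len(clauses):
        for (picks, _), heap in zip(weighted_evals, heaps):
            for _ in range(picks):
                while heap and heap[0][1] in done:   # drop stale entries
                    heapq.heappop(heap)
                if heap:
                    _, i = heapq.heappop(heap)
                    done.add(i)
                    order.append(i)
    return order
```

Interleaving a symbol-counting queue with a constant-valued (i.e. FIFO) queue recovers the pick-given scheme from the previous slide; adding further queues with goal-directed or calculus-aware evaluations gives the kind of combination E uses.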

SLIDES 19-20

Folklore on Clause Selection/Evaluation

◮ FIFO is obviously fair, but awful – Everybody
◮ Preferring small clauses is good – Everybody
◮ Interleaving best-first (small) and breadth-first (FIFO) is better
◮ “The optimal pick-given ratio is 5” – Otter
◮ Processing all initial clauses early is good – Waldmeister
◮ Preferring clauses with orientable equations is good – DISCOUNT
◮ Goal-direction is good – E

Can we confirm or refute these claims?

SLIDE 21

Experimental setup

◮ Prover: E 1.9.1-pre
◮ 14 different heuristics
◮ 13 selected to test folklore claims (interleave 1 or 2 evaluations)
◮ Plus a modern evolved heuristic (interleaves 5 evaluations)
◮ TPTP release 6.3.0
◮ Only (assumed) provable first-order problems
◮ 13774 problems: 7082 FOF and 6692 CNF
◮ Compute environment
◮ StarExec cluster: single-threaded runs on Xeon E5-2609 (2.4 GHz)
◮ 300 second time limit, no memory limit (≥ 64 GB/core physical)

SLIDE 22

Meet the Heuristics

Heuristic        Rank   Total successes   Unique   Within 1 s   Within 1 s (%)
FIFO              14    4930 (35.8%)        17       3941          79.9%
SC12              13    4972 (36.1%)         5       4155          83.6%
SC11               9    5340 (38.8%)         —       4285          80.2%
SC21              10    5326 (38.7%)        17       4194          78.7%
RW212             11    5254 (38.1%)        13       5764          79.8%
2SC11/FIFO         7    7220 (52.4%)        24       5846          79.7%
5SC11/FIFO         5    7331 (53.2%)         3       5781          78.3%
10SC11/FIFO        3    7385 (53.6%)         1       5656          77.6%
15SC11/FIFO        6    7287 (52.9%)         6       5006          82.5%
GD                12    4998 (36.3%)        12       5856          78.4%
5GD/FIFO           4    7379 (53.6%)        62       4213          80.2%
SC11-PI            8    6071 (44.1%)        13       4313          86.3%
10SC11/FIFO-PI     2    7467 (54.2%)        31       5934          80.4%
Evolved            1    8423 (61.2%)       593       6406          76.1%

SLIDE 23

Successes Over Time

[Plot: number of successes (y-axis, 4000–9000) over time (x-axis, 0–250 s) for Evolved, 10SC11/FIFO-PI, 10SC11/FIFO, 15SC11/FIFO, 5SC11/FIFO, 2SC11/FIFO, SC11-PI, SC11, SC21, SC12, and FIFO.]

SLIDES 24-28

Folklore put to the Test

◮ FIFO is awful, preferring small clauses is good – mostly confirmed
◮ In general, only a modest advantage for symbol counting (36% for FIFO vs. 39% for the best SC)
◮ Exception: UEQ (32% vs. 63%)
◮ Interleaving best-first/breadth-first is better – confirmed
◮ 54% for interleaving vs. 39% for the best SC
◮ Influence of different pick-given ratios is surprisingly small
◮ UEQ is again an outlier (60% for 2:1 vs. 70% for 15:1)
◮ The optimal pick-given ratio is 10 (for E)
◮ Processing all initial clauses early is good – confirmed
◮ Effect is less pronounced for interleaved heuristics
◮ Preferring clauses with orientable equations is good – not confirmed
◮ There is no evidence in our data, not even for UEQ
◮ Goal-direction is good – partially confirmed
◮ GD on its own performs similar to SC
◮ GD shines in combination with FIFO

SLIDES 29-32

Selected Results

◮ Good heuristics do make a difference
◮ 71% more solutions with Evolved vs. FIFO
◮ 58% more solutions with Evolved vs. the best SC
◮ Success comes early
◮ ≈ 80% of all proofs found in less than 1 s
◮ . . . with little variation between strategies (spread: 76%–84%)
◮ Cooperation beats portfolio/strategy scheduling
◮ SC11 solves 5340 problems
◮ FIFO solves 4930 problems
◮ The union of the previous two contains 6329 problems
◮ . . . but 10SC11/FIFO solves 7385
◮ Evolving Evolved paid off
◮ Significantly better than the best “naive” heuristic
◮ 10× more unique solutions than the second-best

SLIDE 33

Measuring Absolute Performance

◮ Definition: Given-clause utilization
◮ A useful given clause appears in the proof object
◮ A useless given clause does not contribute to the proof
◮ The given-clause utilization ratio (GCUR) for a proof search is the ratio of useful given clauses to all given clauses
◮ The given-clause utilization ratio measures heuristic quality
◮ A perfect heuristic has a GCUR of 1
◮ A failed heuristic has a GCUR of 0
◮ A better heuristic will pick fewer useless clauses
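The definition translates directly into code. In this sketch, the representation of the search trace (a list of selected given clauses) and of the proof object (a set of clauses) are assumptions for illustration:

```python
def given_clause_utilization(given_clauses, proof_clauses):
    """Given-clause utilization ratio (GCUR): the fraction of selected
    given clauses that actually appear in the proof object.
    1.0 for a perfect heuristic; 0.0 for a failed proof search."""
    if not given_clauses or not proof_clauses:
        return 0.0
    useful = sum(1 for g in given_clauses if g in proof_clauses)
    return useful / len(given_clauses)
```

For example, a search that selected four given clauses of which two appear in the proof has a GCUR of 0.5.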

SLIDES 34-35

Given Clause Utilization

[Plot: GCUR (y-axis, 0.0–1.0) per problem (x-axis, ≈ 500–1500 problems) for Evolved, 10SC11/FIFO, SC11, and FIFO.]

◮ ≈ 2000 non-trivial problems solved by all 4 heuristics
◮ GCUR rank corresponds to global performance
◮ GCUR rates are low even for these easy problems

Significant potential for improvement!

SLIDE 36

Future Work

◮ Evolve/develop better individual heuristics and collections of heuristics
◮ (Even) more complex heuristics?
◮ Evolve for diversity/swarm success
◮ Learn better heuristics from proofs/proof searches
◮ Feature-based learning?
◮ Pattern-based learning
◮ Deep learning?
◮ Evaluate how results transfer to other situations
◮ Otter-loop?
◮ AVATAR?

SLIDES 37-39

Conclusion

◮ First-order proof search is a hard problem that critically depends on complex heuristics
◮ Developer folklore is useful
◮ . . . but needs to be documented
◮ . . . but needs to be re-verified
◮ Given-clause utilization is useful to gauge the quality of heuristics
◮ Given-clause utilization is low for current heuristics
◮ . . . especially for harder problems
◮ Huge potential for further improvements

Questions?

Thank you!