Hints for AVATAR (and some more), Martin Suda, Czech Technical University in Prague

slide-1
SLIDE 1

1/17

Hints for AVATAR (and some more)

Martin Suda

Czech Technical University in Prague, Czech Republic

PIWo 2019, Prague, October 2019

slide-5
SLIDE 5

1/17

“Interactive Theorem Proving” with ATPs

Some people actually use ATPs to do math!

  • e.g., Bob Veroff and Michael Kinyon
  • using Otter, Prover9, Mace4
  • questions from algebra: axiom bases for Boolean algebras, ortho-lattices, loop theory
  • targeting open problems (e.g. the AIM conjecture)

In what sense interactive?

  • a single proof attempt (ATP call) usually does not solve it
  • trying different formulations / axiomatizations
  • trying various additional assumptions and learning from them

➥ By the way, these attempts may run for weeks!

slide-11
SLIDE 11

2/17

Hints

What is a hint?

  • a clause supplied by the user as part of the input
  • whenever a newly derived clause C subsumes a hint clause, this C is prioritized for selection

➥ Hints are a means for steering the proof search!

Where do hints come from?

  • the (expert) user just thinks of some
  • more realistically: clauses from proofs of similar theorems or of the same theorem but under different assumptions

➥ Hope that similar theorems can be proved using similar intermediate steps.

How to come up with hints automatically?
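The matching step described above can be sketched as follows. This is a minimal illustration restricted to ground clauses, where "C subsumes H" reduces to a subset test on literals; real first-order subsumption also has to find a substitution θ with Cθ ⊆ H. The names `subsumes` and `matches_hint` and the string encoding of literals are assumptions made for this example, not part of any prover.

```python
from typing import FrozenSet, Iterable

Literal = str                  # hypothetical encoding, e.g. "p(a)" or "~q(b)"
Clause = FrozenSet[Literal]    # a ground clause as a set of literals


def subsumes(c: Clause, d: Clause) -> bool:
    """Ground subsumption: c subsumes d when every literal of c occurs in d."""
    return c <= d


def matches_hint(derived: Clause, hints: Iterable[Clause]) -> bool:
    """True if the newly derived clause subsumes at least one hint clause."""
    return any(subsumes(derived, h) for h in hints)


hints = [frozenset({"p(a)", "q(b)"})]
print(matches_hint(frozenset({"p(a)"}), hints))          # True: at least as general as the hint
print(matches_hint(frozenset({"p(a)", "r(c)"}), hints))  # False: r(c) is not in the hint
```

A prover would use the result of such a test only to raise the clause's priority in the selection queue; the clause itself is derived as usual.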

slide-12
SLIDE 12

3/17

AVATAR: a reminder

AVATAR [Voronkov’14]

  • modern architecture of first-order theorem provers
  • integrates saturation with a SAT solver (or an SMT solver)
  • efficient realization of the clause splitting rule
  • instead of one monolithic proof search, a sequence of proof searches on (much) smaller sub-problems
  • implemented in the theorem prover Vampire
  • shown highly successful in practice

slide-13
SLIDE 13

4/17

AVATAR architecture overview

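[Diagram: FO solver, Splitting Interface, Base (SAT or SMT) solver]

FO solver to Splitting Interface:

  • New splittable clause: C1 ∨ . . . ∨ Cn
  • New contradiction: ⊥ ← [C1], . . . , [Cn]

Splitting Interface to FO solver:

  • Assert C ← [C]
  • Remove component C

Splitting Interface to Base solver:

  • Solve
  • Insert split clause [C1] ∨ . . . ∨ [Cn]
  • Insert contradiction clause ¬[C1] ∨ . . . ∨ ¬[Cn]

Base solver to Splitting Interface:

  • Model or Unsatisfiable

Splitting Interface:

  • Update model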

slide-15
SLIDE 15

5/17

Boosting AVATAR with hints

Instead of waiting for the user to supply hints for problem P . . .

. . . attempt P using AVATAR and collect as hints the first-order parts of the clauses appearing in the sub-proofs of the contradiction clauses derived so far

DEMO!
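One way to picture the collection step is a traversal of the derivation graph that starts at the contradiction clauses found so far. The sketch below is an illustration under assumptions: the `AvatarClause` record, its field names, and the decision to skip empty first-order parts are hypothetical and do not reflect Vampire's internal data structures.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class AvatarClause:
    """Hypothetical record for a clause with split assertions."""
    fo_part: FrozenSet[str]                    # first-order literals
    assertions: FrozenSet[str]                 # split components it depends on
    parents: Tuple["AvatarClause", ...] = ()   # premises of the inference


def collect_hints(contradictions: List[AvatarClause]) -> Set[FrozenSet[str]]:
    """Gather first-order parts of all clauses in the given sub-proofs."""
    hints: Set[FrozenSet[str]] = set()
    visited: Set[int] = set()
    stack = list(contradictions)
    while stack:
        clause = stack.pop()
        if id(clause) in visited:
            continue
        visited.add(id(clause))
        if clause.fo_part:          # an empty first-order part gives no guidance
            hints.add(clause.fo_part)
        stack.extend(clause.parents)
    return hints


a = AvatarClause(frozenset({"p(X)"}), frozenset({"[C1]"}))
b = AvatarClause(frozenset({"~p(a)"}), frozenset({"[C2]"}))
bottom = AvatarClause(frozenset(), frozenset({"[C1]", "[C2]"}), (a, b))
print(collect_hints([bottom]))
```

Dropping the assertion sets is what turns a conditional clause into a plain first-order hint that can be matched in any later branch of the search.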

slide-16
SLIDE 16

6/17

Outline

1 Hints for AVATAR
2 An Experiment
3 What is a Significant Improvement?

slide-20
SLIDE 20

8/17

Experimental setup

Vampire setup:

  • --saturation_algorithm discount (for stability)
  • --age_weight_ratio 1:10 (works well with discount)
  • --time_limit 10 (reasonable time to finish)

Computers: either Starexec or CTU’s (slurm) cluster

The benchmark: TPTP v7.2.0

  • 17573 eligible first-order problems
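For reference, the setup above corresponds to a command line along the following lines. This is a sketch under assumptions: the long options are written in double-dash form, and the binary name `vampire` and the problem path are placeholders for whatever the local installation uses.

```python
from typing import List


def vampire_command(problem_path: str, binary: str = "vampire") -> List[str]:
    """Build the argument list for one proof attempt with the setup above."""
    return [
        binary,                                # assumed name of the executable
        "--saturation_algorithm", "discount",
        "--age_weight_ratio", "1:10",
        "--time_limit", "10",                  # seconds
        problem_path,
    ]


print(" ".join(vampire_command("problem.p")))
```

The list form can be passed directly to `subprocess.run`, which avoids shell quoting issues when problem paths contain unusual characters.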

slide-22
SLIDE 22

9/17

Results

(on Starexec)

configuration   solved   uniques   additional
base              7914                   7914
base+hints        7882         2           62
sac               8100        13          299
sac+hints         8106        13           23

base = -sa discount -awr 10 -t 10

sac = --split_at_activation on

Experimented with AVATAR flushing; also not very interesting

slide-26
SLIDE 26

10/17

Let’s try a different benchmark . . .

MIZAR bushy “small”

  • 57 880 problems translated from the MIZAR library
  • (base: -sa discount -awr 10 -t 10 -sac on)

Results

configuration   solved   uniques
base             14843       184
base+hints       14873       214

(30 problems is approx. 0.05% of the benchmark size)
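The figures in this table can be cross-checked with a few lines of arithmetic. The sketch assumes that, with only two configurations, a "unique" is a problem solved by one configuration and not by the other; the variable names are made up for the example.

```python
benchmark_size = 57880
solved = {"base": 14843, "base+hints": 14873}
uniques = {"base": 184, "base+hints": 214}

# Problems solved by both configurations, computed from either row.
common = {name: solved[name] - uniques[name] for name in solved}
assert common["base"] == common["base+hints"]

gain = solved["base+hints"] - solved["base"]
union = common["base"] + uniques["base"] + uniques["base+hints"]

print(f"solved by both: {common['base']}")              # 14659
print(f"net gain: {gain} problems "
      f"({100 * gain / benchmark_size:.2f}% of the benchmark)")  # 30 problems (0.05%)
print(f"solved by at least one: {union}")               # 15057
```

Both rows give the same overlap, so the two uniques counts are consistent with the solved totals.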

slide-33
SLIDE 33

11/17

So, should we be sad and abandon the idea?

Maybe, but . . .

  • maybe it only gets interesting with really hard problems!
  • maybe we should have a smarter notion of similarity!
      • demodulate hints?
  • maybe we need restarts to prevent the prover from choking
  • we should also try strengthening the theory with reasonable additional assumptions, as routinely done by Veroff et al.

➥ Ongoing and future work!

slide-34
SLIDE 34

12/17

Outline

1 Hints for AVATAR
2 An Experiment
3 What is a Significant Improvement?

slide-41
SLIDE 41

13/17

A Methodology Question

When should we get excited about a new technique?

1 The idea looks clever and sophisticated

➥ Could aim for a pure theory paper at CADE!

2 Solves more problems than baseline

➥ Obviously, this gives us more power!

3 The solution set differs enough from baseline

➥ To have a chance to improve strategy schedule . . .

slide-47
SLIDE 47

14/17

Focusing on the Third Point

Proof search is known to be very fragile. Even a small change will “stir” it and create a different solution set.

What does it mean to differ enough from baseline?

Keep a database of problems known to be solvable by some strategy and compare against that.

➥ Computationally expensive, open ended, but YES!

While the database is being built . . .

Let’s have some random fun!
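The database comparison suggested above amounts to a set difference. The following sketch assumes the database is a mapping from strategy names to the sets of problems each strategy is known to solve; the function name, the strategy names, and the problem identifiers are placeholders.

```python
from typing import Dict, Set


def novel_problems(new_solved: Set[str], database: Dict[str, Set[str]]) -> Set[str]:
    """Problems solved by the new technique that no recorded strategy solves."""
    known: Set[str] = set().union(*database.values())
    return new_solved - known


database = {
    "strategy_a": {"P1", "P2"},
    "strategy_b": {"P2", "P3"},
}
print(novel_problems({"P2", "P4"}, database))   # {'P4'}
```

A technique whose solution set merely differs from the baseline may still return an empty set here, which is the distinction the slide is after.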

slide-51
SLIDE 51

15/17

Randomly permuting the input problem

Use tptp4X -trandomize from the TPTP toolset to:

  • randomize the order of commutative logical operations
  • randomize the order of formulas

Can we solve more problems?

configuration   solved   uniques   additional
straight          8612        53         8612
shuffled1         8773        60          345
shuffled2         8788        85          128
shuffled3         8775        48           48

(now on the CTU cluster)

➥ recalling randoCoP (Raths, Otten; 2008)
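The slides do not define the table columns. One reading that fits the numbers above (first row: additional equals solved; last row: additional equals uniques) is that "uniques" counts problems solved by this run and by no other run in the table, while "additional" counts problems not solved by any earlier row. The sketch below computes the columns under that assumption; the function name and the toy problem sets are invented for illustration.

```python
from typing import List, Set, Tuple


def summarize(runs: List[Tuple[str, Set[str]]]) -> List[Tuple[str, int, int, int]]:
    """Return (name, solved, uniques, additional) per run, in table order."""
    rows = []
    seen_so_far: Set[str] = set()
    for i, (name, solved) in enumerate(runs):
        others: Set[str] = set().union(
            *(s for j, (_, s) in enumerate(runs) if j != i)
        )
        uniques = solved - others          # solved by this run only
        additional = solved - seen_so_far  # new with respect to earlier rows
        rows.append((name, len(solved), len(uniques), len(additional)))
        seen_so_far |= solved
    return rows


runs = [
    ("straight", {"A", "B", "C"}),
    ("shuffled1", {"B", "C", "D"}),
    ("shuffled2", {"C", "E"}),
]
for row in summarize(runs):
    print(row)
```

Under this reading the "additional" column sums to the number of problems solved by at least one run, which for the table above is 8612 + 345 + 128 + 48 = 9133.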

slide-54
SLIDE 54

16/17

One more experiment with randomness

Clause Selection and Age-weight Ratio

Vampire alternates between selecting the next given clause by age (old first) and by weight (light first) under a given ratio.

Normally, this alternation is regular. What if we change it to probabilistic?

configuration   solved   uniques   additional
base              8725        12         8725
rnd1              8747         8           91
rnd2              8744        16           37
rnd3              8768        23           37
rnd4              8735        14           21
rnd5              8741        16           16

base = -sa discount -awr 1:1 -t 10
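The difference between regular and probabilistic alternation can be illustrated with two small generators. This is a sketch of the idea only: the function names are hypothetical, and Vampire's actual bookkeeping for the age-weight ratio may be implemented differently.

```python
import random
from itertools import cycle, islice
from typing import Iterator


def regular_choices(age: int, weight: int) -> Iterator[str]:
    """Fixed pattern: `age` selections by age, then `weight` by weight, repeated."""
    return cycle(["age"] * age + ["weight"] * weight)


def probabilistic_choices(age: int, weight: int, seed: int) -> Iterator[str]:
    """Each selection is by age with probability age / (age + weight)."""
    rng = random.Random(seed)        # explicit seed keeps a run reproducible
    p_age = age / (age + weight)
    while True:
        yield "age" if rng.random() < p_age else "weight"


print(list(islice(regular_choices(1, 2), 6)))
print(list(islice(probabilistic_choices(1, 2, seed=1), 6)))
```

With a fixed seed each randomized run is repeatable, so configurations such as rnd1 to rnd5 can be thought of as the same strategy run under different seeds.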

slide-57
SLIDE 57

17/17

Summary

Empirical research with an ATP:

1 have a new idea
2 implement (and debug)
3 conduct experiments

When are the results significant?

  • improving overall performance (high total solved)
  • solving hard problems (“the uniques”)

Why don’t we use (carefully seeded) randomness to prove more theorems (without much actual extra thinking)?