Reinforcement Learning for Interactive Theorem Proving in HOL4


1. Reinforcement Learning for Interactive Theorem Proving in HOL4
   Minchao Wu (1), Michael Norrish (1,2), Christian Walder (1,2), Amir Dezfouli (2)
   (1) Research School of Computer Science, Australian National University
   (2) Data61, CSIRO
   September 14, 2020

2-5. Overview
   ◮ Interface: HOL4 as an RL environment
     ◮ Enables interaction with HOL4.
     ◮ Monitor proof states on the Python side.
   ◮ Reinforcement learning settings
     ◮ Policies for choosing proof states, tactics, and theorems or terms as arguments.
   ◮ Learning: policy gradient

6. Environment
   ◮ An environment can be created by specifying an initial goal: e = HolEnv(GOAL)
   ◮ An environment can be reset by providing a new goal: e.reset(GOAL2)
   ◮ The basic function is querying HOL4 about tactic applications: e.query("∀l. NULL l ⇒ l = []", "strip_tac")

7. Environment
   The e.step(action) function applies the action to the current state and generates the new state. It returns the immediate reward received and a Boolean value indicating whether the proof attempt has finished.
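   Put together, these calls support the usual agent loop from the Python side. Below is a minimal usage sketch assuming only the interface shown on these slides; the import path hol_env and the exact return shape of step are assumptions, and the action format (fringe index, goal index, tactic) is the one defined later in the deck.

     # Illustrative only: the import path and step()'s return shape are assumptions
     # based on these slides, not a documented API.
     from hol_env import HolEnv          # hypothetical module exposing the interface above

     GOAL = "∀l. NULL l ⇒ l = []"        # the goal used in the query example above
     e = HolEnv(GOAL)                    # create an environment from an initial goal
     e.query(GOAL, "strip_tac")          # ask HOL4 what strip_tac does to this goal

     action = (0, 0, "strip_tac")        # (fringe index, goal index, tactic)
     reward, done = e.step(action)       # apply the action; get reward and done flag

     e.reset("p ∧ q ⇒ p ∧ q")            # start a fresh attempt on another goal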

  8. Demo ◮ A quick demo.

9-15. RL Formalization
   ◮ A goal g ∈ G is a HOL4 proposition.
   ◮ A fringe is a finite set of goals.
   ◮ A fringe consists of all the remaining goals.
   ◮ The main goal is proved if everything in any one fringe is discharged.
   ◮ A state s is a finite sequence of fringes.
   ◮ A fringe can be referred to by its index i, i.e., s(i).
   ◮ A reward is a real number r ∈ ℝ.
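   For concreteness, this formalization can be mirrored by very simple Python-side types. This is my own illustration, not the authors' data structures:

     from typing import List

     Goal = str                # g ∈ G: a HOL4 proposition
     Fringe = List[Goal]       # a finite set of remaining goals
     State = List[Fringe]      # s: a finite sequence of fringes; state[i] is the i-th fringe

     def proved(state: State) -> bool:
         # The main goal is proved once some fringe has no remaining goals.
         return any(len(fringe) == 0 for fringe in state)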

16. Examples
   Fringe 0:  0: p ∧ q ⇒ p ∧ q
   Fringe 1:  0: p ⇒ q ⇒ p    1: p ⇒ q ⇒ q
   Figure: Example fringes and states

17-22. RL Formalization
   ◮ An action is a triple (i, j, t) : ℕ × ℕ × tactic.
   ◮ i selects the i-th fringe in a state s.
   ◮ j selects the j-th goal within fringe s(i).
   ◮ t is a HOL4 tactic.
   ◮ Example: (0, 0, fs[listTheory.MEM])
   ◮ Rewards
     ◮ Successful application: 0.1
     ◮ Discharges the current goal completely: 0.2
     ◮ Main goal proved: 5
     ◮ Otherwise: -0.1
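   As a worked restatement of this reward scheme, a small function like the following could assign the per-step reward; the outcome labels are my own names for the four cases listed above.

     def reward(outcome: str) -> float:
         return {
             "applied": 0.1,       # tactic applied successfully
             "goal_closed": 0.2,   # the selected goal was discharged completely
             "proved": 5.0,        # the main goal was proved
         }.get(outcome, -0.1)      # anything else, e.g. a failed tactic application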

23-27. Example
   Fringe 0:  0: p ∧ q ⇒ p ∧ q
     (0,0,strip_tac) produces Fringe 1
   Fringe 1:  0: p ⇒ q ⇒ p    1: p ⇒ q ⇒ q
     (1,0,simp[]) produces Fringe 2; (1,0,Induct_on `p`) produces Fringe 3
   Fringe 2:  0: p ⇒ q ⇒ q
     (2,0,simp[]) produces Fringe 4
   Fringe 3:  0: p ⇒ q ⇒ q    1: F ⇒ q ⇒ F    2: T ⇒ q ⇒ T
   Fringe 4:  empty, QED
   Figure: Example proof search

28-31. Choosing fringes
   An action is a triple (i, j, t). Given a state s:
   ◮ A value network V_goal : G → ℝ.
   ◮ The value v_i of fringe s(i) is defined by v_i = Σ_{g ∈ s(i)} V_goal(g).
   ◮ Sample the fringe index from the distribution π_fringe(s) = Softmax(v_1, ..., v_|s|).
   ◮ By default, j is fixed to 0; that is, we always deal with the first goal in a fringe.
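   A minimal sketch of this fringe policy in PyTorch, assuming each goal has already been scored by some V_goal network (the goal encoder itself is not shown here):

     import torch

     def sample_fringe(fringe_scores):
         """fringe_scores[i] is a 1-D tensor of V_goal values for the goals in fringe i."""
         v = torch.stack([scores.sum() for scores in fringe_scores])  # v_i = Σ_g V_goal(g)
         dist = torch.distributions.Categorical(logits=v)             # π_fringe(s) = Softmax(v_1, ..., v_|s|)
         i = dist.sample()
         return i.item(), dist.log_prob(i)                            # chosen fringe index and its log-probability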

32-36. Generating tactics
   Suppose we are dealing with goal g.
   ◮ A tactic is either
     ◮ a tactic name followed by a list of theorem names, or
     ◮ a tactic name followed by a list of terms.
   ◮ A value network V_tactic : G → ℝ^D, where D is the total number of tactic names allowed.
   ◮ Sample from the distribution π_tactic(g) = Softmax(V_tactic(g)).
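   A sketch of the tactic policy, assuming a fixed list of allowed tactic names and some vector encoding of the goal g; the linear head standing in for V_tactic and the particular tactic list are assumptions of mine.

     import torch
     import torch.nn as nn

     TACTICS = ["strip_tac", "simp", "fs", "metis_tac", "Induct_on"]  # illustrative subset of the D allowed names

     class TacticPolicy(nn.Module):
         def __init__(self, goal_dim, n_tactics=len(TACTICS)):
             super().__init__()
             self.v_tactic = nn.Linear(goal_dim, n_tactics)           # stand-in for V_tactic : G → ℝ^D

         def forward(self, goal_encoding):
             logits = self.v_tactic(goal_encoding)
             dist = torch.distributions.Categorical(logits=logits)    # π_tactic(g) = Softmax(V_tactic(g))
             idx = dist.sample()
             return TACTICS[idx.item()], dist.log_prob(idx)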

37. Argument policy
   Figure: Generation of arguments. x_i denotes the candidate theorems, h_i a hidden variable, a_i a chosen argument, and v_i the values computed by the policy at step i. Each theorem is represented by an N-dimensional tensor based on its tokenized expression in Polish notation; if we have M candidate theorems, then the shape of x_i is M × N. The representations are computed by a separately trained transformer.

38. Generating arguments
   Given a chosen goal g. Each theorem is represented by an N-dimensional tensor based on its tokenized expression; suppose we have M candidate theorems.
   Input: the chosen tactic or theorem t ∈ ℝ^N, the candidate theorems X ∈ ℝ^(M×N), and a hidden variable h ∈ ℝ^N.
   Policy: V_arg : ℝ^N × ℝ^(M×N) × ℝ^N → ℝ^N × ℝ^M
   Initialize the hidden variable h to t and set l ← [t].
   Loop for the allowed length of arguments (e.g., 5):
     h, v ← V_arg(t, X, h)
     t ← sample from π_arg(g) = Softmax(v)
     l ← l.append(t)
   Return l and the associated (log) probabilities.
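   A PyTorch sketch of this loop. The slides only give the signature of V_arg, so the concrete ArgPolicy network below (a linear hidden-state update plus a per-candidate scorer) is my own placeholder, as are all layer choices.

     import torch
     import torch.nn as nn

     class ArgPolicy(nn.Module):
         """Hypothetical V_arg: returns an updated hidden state and one score per candidate theorem."""
         def __init__(self, n_dim):
             super().__init__()
             self.update = nn.Linear(2 * n_dim, n_dim)   # produces the next hidden variable h
             self.score = nn.Linear(2 * n_dim, 1)        # scores each candidate theorem against h

         def forward(self, t, X, h):
             h = torch.tanh(self.update(torch.cat([t, h], dim=-1)))              # new hidden state in ℝ^N
             v = self.score(torch.cat([X, h.expand_as(X)], dim=-1)).squeeze(-1)  # candidate scores in ℝ^M
             return h, v

     def generate_arguments(policy, t, X, max_args=5):
         """Autoregressively sample an argument list starting from the chosen tactic encoding t."""
         h = t.clone()                     # initialize the hidden variable h to t
         args, log_probs = [t], []
         for _ in range(max_args):         # loop for the allowed length of arguments
             h, v = policy(t, X, h)
             dist = torch.distributions.Categorical(logits=v)   # π_arg = Softmax(v)
             idx = dist.sample()
             log_probs.append(dist.log_prob(idx))
             t = X[idx]                    # the sampled theorem becomes the next input
             args.append(t)
         return args, log_probs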

39-40. Generating actions
   Given a state s, we now have some (log) probabilities:
   ◮ p(f | s), given by π_fringe.
   ◮ p(t | s, f), given by π_tactic.
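   The transcript is cut off at this point. One natural way to combine these terms for the policy-gradient learning named in the overview is to sum the log-probabilities of the sampled components and use a REINFORCE-style loss; the sketch below assumes exactly that and is not taken from the deck.

     import torch

     def action_log_prob(logp_fringe, logp_tactic, logp_args):
         # logp_args is the list returned by the argument policy; it may be empty
         return logp_fringe + logp_tactic + sum(logp_args, torch.tensor(0.0))

     def reinforce_loss(log_probs, returns):
         """log_probs: list of per-step action log-probabilities; returns: matching list of returns."""
         return -(torch.stack(log_probs) * torch.tensor(returns)).sum()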
