Weighted Automata Extraction from Recurrent Neural Networks via Regression on State Spaces



SLIDE 1

Weighted Automata Extraction from Recurrent Neural Networks via Regression on State Spaces

Takamasa Okudono, Masaki Waga, Taro Sekiyama, Ichiro Hasuo
SOKENDAI (The Graduate University for Advanced Studies), Japan / National Institute of Informatics, Japan
LearnAut 2019, Vancouver, Canada, 23 June 2019

SLIDE 2

RNN

An RNN (recurrent neural network) is a neural network equipped with an internal state.

Drawing by FranΓ§ois Deloche (CC BY-SA 4.0)

SLIDE 3

Goal

Input: RNN 𝑆 whose output is in ℝ (defining 𝑔_𝑆: Ξ£βˆ— β†’ ℝ)

Output: WFA 𝐡(𝑆) (defining 𝑔_𝐡(𝑆): Ξ£βˆ— β†’ ℝ) such that 𝑔_𝐡(𝑆) ≃ 𝑔_𝑆

[Table: an RNN has an initial state, a final function, and a transition function; correspondingly, a WFA has an initial vector, a final vector, and a transition matrix.]

SLIDE 4

Motivation

  β€’ Obtain a lighter (faster to infer) model of an RNN
  β€’ because inference with RNNs is sometimes heavy
  β€’ Investigate the behavior of an RNN 𝑆 via the extracted WFA 𝐡(𝑆)
  β€’ WFAs support many operations, which may lead to model checking
  β€’ In the research line of RNN⇔DFA conversion as an acceptor
  β€’ Ours is a quantitative extension

SLIDE 5

Contribution

  β€’ Proposed a method that applies Balle and Mohri's algorithm to the extraction
  β€’ The key is checking whether 𝑔_𝑆 ≃ 𝑔_𝐡 by using regression
  β€’ Our method extracts models that are 7% more accurate than the baseline
  β€’ The extracted WFAs are about 1,000 times faster to run than the target RNNs

SLIDE 6
  β€’ Def. of RNN (mathematically, in this work)

An RNN 𝑆 (of alphabet Ξ£ and dimension 𝑒) consists of
  β€’ 𝛽 ∈ ℝ^𝑒: initial state
  β€’ 𝛾: ℝ^𝑒 β†’ ℝ: final function
  β€’ β„Ž_𝑆: ℝ^𝑒 Γ— Ξ£ β†’ ℝ^𝑒: transition function (need not be linear)
  β€’ β„Ž_𝑆: ℝ^𝑒 Γ— Ξ£βˆ— β†’ ℝ^𝑒 is induced recursively

𝑔_𝑆: Ξ£βˆ— β†’ ℝ is induced by 𝑔_𝑆(π‘₯1 … π‘₯π‘˜) = 𝛾(β„Ž_𝑆(𝛽, π‘₯1 … π‘₯π‘˜)).

The configuration ("internal state") for π‘₯1 … π‘₯π‘˜ is defined by πœ€_𝑆(π‘₯1 … π‘₯π‘˜) = β„Ž_𝑆(𝛽, π‘₯1 … π‘₯π‘˜).
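The definition above can be made concrete with a minimal sketch. The tanh transition, the summing final function, and the random weights are our assumptions purely for illustration; the definition only requires some maps Ξ², Ξ³, β„Ž_𝑆 of the stated types.

```python
import numpy as np

# A minimal sketch of the RNN definition above (not the paper's actual
# architecture): dimension u = 3, alphabet Ξ£ = {0, 1}, tanh transition.
rng = np.random.default_rng(0)
u, sigma = 3, (0, 1)
beta = rng.standard_normal(u)                        # initial state Ξ² ∈ ℝ^u
W = {a: rng.standard_normal((u, u)) for a in sigma}  # per-symbol weights

def h_S(state, word):
    """h_S extended to words: apply the one-step transition recursively."""
    for a in word:
        state = np.tanh(W[a] @ state)                # need not be linear
    return state

def gamma(state):
    """Final function Ξ³: ℝ^u β†’ ℝ (a toy choice)."""
    return float(state.sum())

def g_S(word):
    """g_S(x1…xk) = Ξ³(h_S(Ξ², x1…xk))."""
    return gamma(h_S(beta, word))

def eps_S(word):
    """Configuration Ξ΅_S(x1…xk) = h_S(Ξ², x1…xk)."""
    return h_S(beta, word)
```

On the empty word the state is never updated, so πœ€_𝑆(πœ€) = 𝛽 and 𝑔_𝑆(πœ€) = 𝛾(𝛽).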

SLIDE 7
  β€’ Def. of Weighted Finite Automaton (WFA)

A WFA 𝐡 (of size π‘œ and alphabet Ξ£) consists of
  β€’ 𝛽 ∈ ℝ^π‘œ: initial vector
  β€’ 𝛾 ∈ ℝ^π‘œ: final vector
  β€’ 𝐡_𝜏 ∈ ℝ^(π‘œΓ—π‘œ): transition matrix (one for each 𝜏 ∈ Ξ£)

A WFA 𝐡 is a formalism to define 𝑔_𝐡: Ξ£βˆ— β†’ ℝ. (Cf. a DFA is a formalism to define 𝑔: Ξ£βˆ— β†’ 2.) A WFA is an extension of a DFA via the matrix representation.

SLIDE 8
  β€’ Def. of WFA
  β€’ A WFA 𝐡 induces the function 𝑔_𝐡: Ξ£βˆ— β†’ ℝ as

𝑔_𝐡(π‘₯1 … π‘₯π‘˜) = 𝛽 𝐡_π‘₯1 … 𝐡_π‘₯π‘˜ 𝛾

  β€’ The configuration ("internal state") of WFA 𝐡 is

πœ€_𝐡(π‘₯1 … π‘₯π‘˜) = 𝛽 𝐡_π‘₯1 … 𝐡_π‘₯π‘˜ ∈ ℝ^π‘œ

For example:
  β€’ Ξ£ = {0, 1}, 𝛽 = (0.8, 0.2), 𝛾 = (0.9, 0.7)α΅€, 𝐡_0 = [[0, 1], [1, 0]], 𝐡_1 = [[0.9, 0.1], [0.5, 0.5]]
  β€’ 𝑔_𝐡(10) = (0.8, 0.2) 𝐡_1 𝐡_0 (0.9, 0.7)α΅€ = 0.736
  β€’ πœ€_𝐡(10) = (0.8, 0.2) 𝐡_1 𝐡_0 = (0.18, 0.82)
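The example above can be checked directly; a sketch with numpy, using the row-vector convention of the slide (Ξ² multiplies from the left):

```python
import numpy as np

# The example WFA from this slide: Ξ£ = {0, 1}, size o = 2.
beta  = np.array([0.8, 0.2])                  # initial (row) vector
gamma = np.array([0.9, 0.7])                  # final vector
B = {0: np.array([[0.0, 1.0], [1.0, 0.0]]),
     1: np.array([[0.9, 0.1], [0.5, 0.5]])}

def eps_B(word):
    """Configuration Ξ΅_B(x1…xk) = Ξ² B_x1 … B_xk."""
    v = beta
    for a in word:
        v = v @ B[a]
    return v

def g_B(word):
    """g_B(x1…xk) = Ξ² B_x1 … B_xk Ξ³."""
    return float(eps_B(word) @ gamma)

print(eps_B([1, 0]))   # β‰ˆ [0.18, 0.82]
print(g_B([1, 0]))     # β‰ˆ 0.736
```

This reproduces the numbers on the slide: Ξ² 𝐡_1 = (0.82, 0.18), applying 𝐡_0 swaps the coordinates to (0.18, 0.82), and the dot product with Ξ³ gives 0.736.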

SLIDE 9

RNN and WFA

An RNN 𝑆 (of alphabet Ξ£ and dimension 𝑒) consists of
  β€’ 𝛽 ∈ ℝ^𝑒: initial state
  β€’ 𝛾: ℝ^𝑒 β†’ ℝ: final function
  β€’ β„Ž_𝑆: ℝ^𝑒 Γ— Ξ£ β†’ ℝ^𝑒: transition function

A WFA 𝐡 (of alphabet Ξ£ and size π‘œ) consists of
  β€’ 𝛽 ∈ ℝ^π‘œ: initial vector
  β€’ 𝛾 ∈ ℝ^π‘œ: final vector
  β€’ 𝐡_𝜏 ∈ ℝ^(π‘œΓ—π‘œ): transition matrix (one for each 𝜏 ∈ Ξ£)

Similar formalisms! Can we approximate an RNN by a WFA?

SLIDE 10

Goal and Our Approach

Input: RNN 𝑆 whose output is in ℝ (defining 𝑔_𝑆: Ξ£βˆ— β†’ ℝ)
Output: WFA 𝐡(𝑆) (defining 𝑔_𝐡(𝑆): Ξ£βˆ— β†’ ℝ) such that 𝑔_𝐡(𝑆) ≃ 𝑔_𝑆

Approach: use Balle and Mohri's algorithm.
  β€’ The challenge is to give a procedure that checks whether 𝑔_𝐡 ≃ 𝑔_𝑆 for a candidate WFA 𝐡

SLIDE 11

Balle and Mohri’s Algorithm

An extension of Angluin's L* algorithm to WFAs

  β€’ Input:
  β€’ a membership query procedure π‘š: Ξ£βˆ— β†’ ℝ
  β€’ an equivalence query procedure 𝑒: WFAs β†’ {Equivalent} βŠ” Ξ£βˆ—
  β€’ Output:
  β€’ a minimal WFA 𝐡′
  β€’ Property: given a WFA 𝐡, if π‘š = 𝑔_𝐡 and

𝑒(𝐡̃) = Equivalent, if 𝑔_𝐡 = 𝑔_𝐡̃; otherwise π‘₯, where 𝑔_𝐡(π‘₯) β‰  𝑔_𝐡̃(π‘₯)

(such an π‘₯ is called a "counterexample"), then the algorithm terminates after polynomially many calls to π‘š and 𝑒, and 𝑔_𝐡 = 𝑔_𝐡′.
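The equivalence-query interface 𝑒: WFAs β†’ {Equivalent} βŠ” Ξ£βˆ— can be illustrated with a toy implementation. For illustration only, this version works on plain functions over words rather than WFAs and compares them by brute-force enumeration up to a bounded length; the point of the rest of the talk is precisely that an RNN cannot be checked this way, so all names here are ours.

```python
from itertools import product

SIGMA = (0, 1)  # a toy alphabet (assumption for this sketch)

def make_equivalence_query(g_target, max_len=6):
    """Return a brute-force equivalence query e: it compares the
    hypothesis with the target on every word of length <= max_len,
    answering "Equivalent" or the first counterexample found."""
    def e(g_hyp):
        for k in range(max_len + 1):
            for w in product(SIGMA, repeat=k):   # words in length order
                if g_target(w) != g_hyp(w):
                    return w                     # a "counterexample"
        return "Equivalent"
    return e

# Usage: a hypothesis that agrees everywhere except on the word (1, 1)
g_target = lambda w: float(sum(w))
g_bad    = lambda w: -1.0 if w == (1, 1) else float(sum(w))
e = make_equivalence_query(g_target)
print(e(g_target))   # "Equivalent"
print(e(g_bad))      # (1, 1)
```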

SLIDE 12

Idea of Overall Architecture (Detailed)

Implement
  β€’ the membership query π‘š as the RNN's induced function 𝑔_𝑆
  β€’ the equivalence query 𝑒 as

𝑒(𝐡̃) = Equivalent, if 𝑔_𝑆 ≃ 𝑔_𝐡̃; otherwise π‘₯, where 𝑔_𝑆(π‘₯) β‰  𝑔_𝐡̃(π‘₯)

(generally it cannot be "=")

Then we would get a WFA 𝐡̃ such that 𝑔_𝑆 ≃ 𝑔_𝐡̃!

But how can we implement such an equivalence query 𝑒?

SLIDE 13

How do we know 𝑔_𝑆 ≃ 𝑔_𝐡?

𝑔_𝑆(π‘₯) ≃ 𝑔_𝐡(π‘₯)
⇔ 𝛾_𝑆(πœ€_𝑆(π‘₯1 … π‘₯π‘˜)) ≃ πœ€_𝐡(π‘₯1 … π‘₯π‘˜) 𝛾_𝐡

Both sides are computed via the configurations ("internal states"). If there is a "good" relation between πœ€_𝑆 and πœ€_𝐡, then 𝐡 and 𝑆 should behave similarly.

SLIDE 14

β€œGood” relation between πœ€π‘† and πœ€π΅

  • This work views π‘ž: ℝ𝑒 β†’ β„π‘œ satisfying the following property

as a good relation:

βˆ€π‘₯ ∈ Ξ£βˆ—. p πœ€π‘† w ≃ πœ€π΅(π‘₯)

SLIDE 15

Equivalence Query by Approximating 𝑝

Let's approximate a configuration translator 𝑝: ℝ^𝑒 β†’ ℝ^π‘œ such that βˆ€π‘₯ ∈ Ξ£βˆ—. 𝑝(πœ€_𝑆(π‘₯)) ≃ πœ€_𝐡(π‘₯) by applying regression to sampled data. The data are sampled by traversing Ξ£βˆ— in breadth-first search (BFS).
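The regression step can be sketched as follows. A linear translator fitted by least squares is our assumption for illustration (the talk does not fix the regression model here), and the "RNN" configurations are synthetic stand-ins:

```python
import numpy as np

# Sketch: fit a linear configuration translator p(s) = A s + b by least
# squares on pairs (Ξ΅_S(x), Ξ΅_B(x)) collected during the BFS.
rng = np.random.default_rng(0)
u, o, samples = 4, 2, 50
eps_S = rng.standard_normal((samples, u))     # Ξ΅_S(x) for sampled words x
A_true = rng.standard_normal((o, u))
eps_B = eps_S @ A_true.T + 1.0                # Ξ΅_B(x) (noise-free toy data)

# Least squares on augmented inputs [Ξ΅_S(x); 1] to absorb the offset b.
X = np.hstack([eps_S, np.ones((samples, 1))])
coef, *_ = np.linalg.lstsq(X, eps_B, rcond=None)

def p(s):
    """Approximate configuration translator p: ℝ^u β†’ ℝ^o."""
    return np.append(s, 1.0) @ coef
```

On this noise-free toy data the fit is exact, so 𝑝(πœ€_𝑆(π‘₯)) recovers πœ€_𝐡(π‘₯) for the sampled words.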

SLIDE 16

Relation π‘ž between 𝑆 and 𝐡

  • config. space of 𝑆 (ℝ𝑒)
  • config. space of 𝐡 (β„π‘œ)

・𝛽𝑆 ・𝛽𝐡

SLIDE 18

Relation π‘ž between 𝑆 and 𝐡

  • config. space of 𝑆 (ℝ𝑒)
  • config. space of 𝐡 (β„π‘œ)

・𝛽𝑆 ・𝛽𝐡 γƒ»πœ€π‘†(0) γƒ»πœ€π΅(0)

SLIDE 20

Relation π‘ž between 𝑆 and 𝐡

  • config. space of 𝑆 (ℝ𝑒)
  • config. space of 𝐡 (β„π‘œ)

・𝛽𝑆 ・𝛽𝐡 γƒ»πœ€π‘†(0) γƒ»πœ€π΅(0) γƒ»πœ€π‘†(1) γƒ»πœ€π΅(1)

SLIDE 22

Relation π‘ž between 𝑆 and 𝐡

  • config. space of 𝑆 (ℝ𝑒)
  • config. space of 𝐡 (β„π‘œ)

・𝛽𝑆 ・𝛽𝐡 γƒ»πœ€π‘†(0) γƒ»πœ€π΅(0) γƒ»πœ€π‘†(1) γƒ»πœ€π΅(1) γƒ»πœ€π‘†(00) γƒ»πœ€π΅(00)

SLIDE 24

Relation π‘ž between 𝑆 and 𝐡

  • config. space of 𝑆 (ℝ𝑒)
  • config. space of 𝐡 (β„π‘œ)

・𝛽𝑆 ・𝛽𝐡 γƒ»πœ€π‘†(0) γƒ»πœ€π΅ 0 ≃ πœ€π΅(01) γƒ»πœ€π‘†(1) γƒ»πœ€π΅(1) γƒ»πœ€π‘†(00) γƒ»πœ€π΅(00) γƒ»πœ€π‘†(01)

SLIDE 26

BFS-based Equivalence Query

The equivalence query proceeds by breadth-first search: pop a word 𝑀 from the queue, then add 𝑀's successor words to the queue.

SLIDE 27

Maintaining π‘ž

Pop w from queue Add w’s next words to queue Check if π‘ž should be refined Refine π‘ž NO YES We want it to satisfy βˆ€π‘₯ ∈ 𝑋. p πœ€π‘† w ≃ πœ€π΅(π‘₯)

SLIDE 28

Check if π‘ž should be refined

  • config. space of 𝑆 (ℝ𝑒)
  • config. space of 𝐡 (β„π‘œ)

γƒ»πœ€π‘† π‘₯β€² γƒ»πœ€π΅ π‘₯β€² = π‘ž(πœ€π΅ π‘₯β€² ) π‘ž π‘₯β€²: a word already visited in the BFS loop π‘₯: a word just popped

SLIDE 29

Check if π‘ž should be refined

  • config. space of 𝑆 (ℝ𝑒)
  • config. space of 𝐡 (β„π‘œ)

γƒ»πœ€π‘† π‘₯β€² γƒ»πœ€π΅ π‘₯β€² = π‘ž πœ€π΅ π‘₯β€² = πœ€π΅(π‘₯) γƒ»πœ€π‘†(π‘₯) π‘ž π‘₯β€²: a word already visited in the BFS loop π‘₯: a word just popped

SLIDE 30

Check if π‘ž should be refined

  • config. space of 𝑆 (ℝ𝑒)
  • config. space of 𝐡 (β„π‘œ)

γƒ»πœ€π‘† π‘₯β€² γƒ»πœ€π΅ π‘₯β€² = π‘ž πœ€π΅ π‘₯β€² = πœ€π΅ π‘₯ = π‘ž(πœ€π‘† π‘₯ ) γƒ»πœ€π‘†(π‘₯)

  • config. space of 𝑆 (ℝ𝑒)
  • config. space of 𝐡 (β„π‘œ)

γƒ»πœ€π‘† π‘₯β€² γƒ»πœ€π΅ π‘₯β€² = π‘ž πœ€π΅ π‘₯β€² = πœ€π΅(π‘₯) γƒ»πœ€π‘†(π‘₯) π‘ž π‘ž π‘ž γƒ»π‘ž(πœ€π‘† π‘₯ ) π‘ž π‘₯β€²: a word already visited in the BFS loop π‘₯: a word just popped

SLIDE 31

Check if π‘ž should be refined

[Figure, the two cases again: in the bottom case 𝑝(πœ€_𝑆(π‘₯)) lands away from πœ€_𝐡(π‘₯), which violates 𝑝(πœ€_𝑆(π‘₯)) ≃ πœ€_𝐡(π‘₯), so 𝑝 must be refined. Here π‘₯β€² is a word already visited in the BFS loop and π‘₯ is the word just popped.]

SLIDE 32

Maintaining π‘ž

Pop w from queue Add w’s next words to queue Check if π‘ž should be refined Refine π‘ž NO YES We want it to satisfy βˆ€π‘₯ ∈ 𝑋. p πœ€π‘† w ≃ πœ€π΅(π‘₯)

SLIDE 33

Check if π‘ž should be refined Refine π‘ž

Finding Counterexample

Pop w from queue Add w’s next words to queue NO YES Check if 𝑔

𝑆 π‘₯

= 𝑔

𝐡(π‘₯)

If 𝑔

𝑆 π‘₯ β‰  𝑔 𝐡(π‘₯), returns π‘₯ as a counterexample of the

equivalence query. 𝑓 𝐡 = α‰Š Equivalent ; 𝑔

𝑆 ≃ 𝑔 𝐡

π‘₯β€²β€² ; 𝑔

𝑆 π‘₯β€²β€² β‰  𝑔 𝐡(π‘₯β€²β€²)

SLIDE 34

Check if π‘ž should be refined Refine π‘ž

Returning β€œEquivalent”

Pop w from queue Add w’s next words to queue NO YES Check if 𝑔

𝑆 π‘₯

= 𝑔

𝐡(π‘₯)

If there are many (𝑁 = 5 times) visited words π‘₯β€² ∈ {Visited words} π‘ž ∘ πœ€π‘† π‘₯ = π‘ž ∘ πœ€π‘† π‘₯β€² }, the next words of π‘₯ is not added (Pruning the subtree under π‘₯ in BFS)

SLIDE 35

Check if π‘ž should be refined Refine π‘ž Add w’s next words to queue NO YES

Returning β€œEquivalent”

Pop w from queue Check if 𝑔

𝑆 π‘₯

= 𝑔

𝐡(π‘₯)

When the queue is empty, all the trees are pruned and it returns β€œEquivalent”. 𝑓 𝐡 = α‰Š Equivalent ; 𝑔

𝑆 ≃ 𝑔 𝐡

π‘₯β€²β€² ; 𝑔

𝑆 π‘₯β€²β€² β‰  𝑔 𝐡(π‘₯β€²β€²)

SLIDE 36

Experiments (Target RNNs)

The 90 target RNNs used to evaluate our algorithm are made as follows:

  β€’ 1. Generate a random WFA 𝐡 of size π‘œ ∈ {10, 20, 30} over an alphabet Ξ£ of size π‘Ž ∈ {10, 15, 20, 30, 40, 50}.
  β€’ 2. Train an RNN 𝑆(𝐡) from 𝐡.
  β€’ 3. Repeat steps 1-2 five times for each pair (π‘œ, π‘Ž).

The RNNs consist of two stacked LSTM layers with 50 cells.

SLIDE 37

Experiments (Settings)

Methods
  β€’ our algorithm with 𝑁 = 5
  β€’ the baseline algorithm (described later)

Evaluation
  β€’ time to extract (timeout = 10,000 sec.)
  β€’ accuracy
  β€’ a word π‘₯ counts as "correct" if |𝑔_𝑆(π‘₯) βˆ’ 𝑔_𝐡(𝑆)(π‘₯)| < 0.05
  β€’ calculated over 1,000 randomly generated words
  β€’ time to infer the words with 𝑆(𝐡) and 𝐡(𝑆(𝐡))
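The accuracy metric can be sketched as follows. The function name and the word-sampling scheme (uniform symbols, uniform lengths) are our assumptions; the slide only fixes the threshold 0.05 and the sample size 1,000.

```python
import random

def accuracy(g_S, g_B, sigma=(0, 1), n_words=1000, max_len=20, thr=0.05, seed=0):
    """Fraction of randomly generated words x with |g_S(x) - g_B(x)| < thr,
    mirroring the "correct" criterion on this slide."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_words):
        k = rng.randint(0, max_len)
        x = tuple(rng.choice(sigma) for _ in range(k))
        if abs(g_S(x) - g_B(x)) < thr:
            correct += 1
    return correct / n_words

# Usage: a model within tolerance everywhere scores 1.0
g = lambda w: float(len(w))
print(accuracy(g, lambda w: g(w) + 0.01))   # 1.0
```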

SLIDE 38

Experiments (Baseline Algorithm)

The baseline runs the same BFS loop (pop 𝑀 from the queue; add 𝑀's successor words) but without the translator 𝑝, checking only whether 𝑔_𝑆(π‘₯) = 𝑔_𝐡(π‘₯).

If 𝑔_𝑆(π‘₯) β‰  𝑔_𝐡(π‘₯), it returns π‘₯ as a counterexample of the equivalence query:

𝑒(𝐡) = Equivalent, if 𝑔_𝑆 ≃ 𝑔_𝐡; otherwise π‘₯β€²β€², where 𝑔_𝑆(π‘₯β€²β€²) β‰  𝑔_𝐡(π‘₯β€²β€²)

SLIDE 39

Experiments (Baseline Algorithm)

If 𝑔_𝑆(π‘₯) = 𝑔_𝐡(π‘₯) holds many times in a row (1,000 times), the baseline returns "Equivalent" instead of a counterexample.

(When this happens, the queue is preserved for the next invocation of the equivalence query.)

SLIDE 40

Result (Overall)

[Figure: difference in accuracy and extraction time between ours and the baseline.]

SLIDE 41

Result (Overall)

Average (std)     Ours (𝑁 = 5)         Baseline
Accuracy [%]      81.9 (std = 18.8)    74.1 (std = 22.9)
Time [s]          8805 (std = 2220)    6277 (std = 2966)

  β€’ The accuracy of "Ours (𝑁 = 5)" exceeded that of "Baseline" in 59 tasks.
  β€’ The extraction time of "Ours (𝑁 = 5)" was longer than that of "Baseline" in 80 tasks.
  β€’ (90 tasks in total)

SLIDE 42

Result (WFA size π‘œ = 10)

[Figure: difference in accuracy and extraction time between ours and the baseline.]

SLIDE 43

Result (alphabet size π‘Ž = 10)

[Figure: difference in accuracy and extraction time between ours and the baseline.]

SLIDE 44

Time to Infer a Value from a Word

  β€’ To test that our motivation, "getting a lighter (faster to infer) model of an RNN", is feasible,
  β€’ we compared the time to compute 𝑔_𝑆(π‘₯) and 𝑔_𝐡(𝑆)(π‘₯) for 1,000 words of length ≀ 20.

Average
Time on RNN 𝑆 [s]       32.0 (std = 2.0)
Time on WFA 𝐡(𝑆) [s]    0.028 (std = 0.007)

SLIDE 45

Conclusion

  β€’ Proposed a method to extract a WFA 𝐡(𝑆) from a given RNN 𝑆 such that 𝑔_𝐡(𝑆) ≃ 𝑔_𝑆.
  β€’ Compared our method with the baseline algorithm in accuracy and time:
  β€’ our algorithm achieved higher accuracy and took more time than the baseline.
  β€’ The extracted WFA 𝐡(𝑆) took far less time to infer values than the original RNN 𝑆.

SLIDE 46

Future Work

  β€’ Additional experiments
  β€’ to reveal the overall tendency more clearly
  β€’ to reveal what happens when the accuracy is quite low
  β€’ Adding the idea of bisimulation to 𝑝
  β€’ rethink the questionable parts of the loop
  β€’ refining 𝑝 at different timings could be better
  β€’ Modifying Balle and Mohri's algorithm to generate probabilistic WFAs
  β€’ Finding a good hyperparameter 𝑁 experimentally or theoretically

SLIDE 47

β€œChecking if π‘ž is OK” could be like this?

  • config. space of 𝑆 (ℝ𝑒)
  • config. space of 𝐡 (β„π‘œ)

γƒ»πœ€π‘† β„Žβ€² γƒ»πœ€π΅ β„Žβ€² = π‘ž(πœ€π΅ β„Žβ€² ) γƒ»πœ€π‘†(β„Ž) γƒ»πœ€π΅ β„Ž = πœ€π‘† β„Ž

  • config. space of 𝑆 (ℝ𝑒)
  • config. space of 𝐡 (β„π‘œ)

γƒ»πœ€π‘† β„Žβ€² γƒ»πœ€π΅ β„Žβ€² = π‘ž(πœ€π΅ β„Žβ€² ) γƒ»πœ€π‘†(β„Ž) γƒ»πœ€π΅(β„Ž) π‘ž π‘ž π‘ž γƒ»π‘ž(πœ€π‘†(β„Ž))

↓This Violates π‘ž πœ€π‘† β„Ž = πœ€π΅(β„Ž)

π‘ž

SLIDE 48
  β€’ Def. of WFA
  β€’ A WFA 𝐡 is probabilistic if
  β€’ 𝛽 β‹… 𝟏 = 1 (the entries of 𝛽 sum to 1)
  β€’ for every 𝜏 ∈ Ξ£, each row of 𝐡_𝜏 sums to 1
  β€’ 0 ≀ 𝛾 ≀ 1 (entrywise)

For example:
  β€’ Ξ£ = {0, 1}, 𝛽 = (0.8, 0.2), 𝛾 = (0.9, 0.7)α΅€, 𝐡_0 = [[0, 1], [1, 0]], 𝐡_1 = [[0.9, 0.1], [0.5, 0.5]]