

SLIDE 1

Towards Interpretable Deep Learning for Natural Language Processing

Roy Schwartz

University of Washington & Allen Institute for Artificial Intelligence

December 2018

SLIDE 2

(Deep-Learning-Based) AI Today


SLIDE 4

Deep Learning
❯ backpropagation
❯ stochastic gradient descent
❯ PyTorch, TensorFlow, AllenNLP
❯ state-of-the-art

SLIDE 5

Deep Learning
❯ backpropagation
❯ stochastic gradient descent
❯ PyTorch, TensorFlow, AllenNLP
❯ state-of-the-art
❉ architecture engineering

SLIDE 6

[WFSA diagram: s0 → s1]

Deep Learning
❯ backpropagation
❯ stochastic gradient descent
❯ PyTorch, TensorFlow, AllenNLP
❯ state-of-the-art
❉ architecture engineering

SLIDE 7

[WFSA diagram: s0 → s1]

Deep Learning
❯ backpropagation
❯ stochastic gradient descent
❯ PyTorch, TensorFlow, AllenNLP
❯ state-of-the-art
❉ architecture engineering

Weighted Finite-State Automata
❯ widely studied
❯ understandable
❯ interpretable
❯ informed model development

SLIDE 8

[WFSA diagram: s0 → s1]

Deep Learning
❯ backpropagation
❯ stochastic gradient descent
❯ PyTorch, TensorFlow, AllenNLP
❯ state-of-the-art
❉ architecture engineering

Weighted Finite-State Automata
❯ widely studied
❯ understandable
❯ interpretable
❯ informed model development
❉ low performance


SLIDE 10

Deep Learning Models for NLP: Overview

Case Study: Sentiment Analysis

"I saw such a great talk today" → v1 v2 v3 v4 v5 v6 v7 → v1:7 → Classify (❯/❉)

input (words) → word embeddings → sequence encoders → output (prediction)

SLIDE 11

Deep Learning Models for NLP: Overview

Case Study: Sentiment Analysis

"I saw such a great talk today" → v1 v2 v3 v4 v5 v6 v7 → v1:7 → Classify (❯/❉)

input (words) → word embeddings → sequence encoders → output (prediction)

Main component in:
◮ Machine translation
◮ Question answering
◮ Text summarization
◮ Sentiment analysis
◮ Information extraction
◮ ...


SLIDE 13

Overview

◮ Background: Weighted Finite-State Automata
◮ Neural Weighted Finite-State Automata
◮ Existing Deep Models as Weighted Finite-State Automata
◮ Case Study: Convolutional neural networks


SLIDE 15

Background: Finite-State Automata

Regular Expressions (Patterns)

s0 →such→ s1 →a→ s2 →great→ s3 →talk→ s4

SLIDE 16

Background: Finite-State Automata

Regular Expressions (Patterns)

s0 →such→ s1 →a→ s2 →great→ s3 →talk→ s4

Pattern: such a great talk

SLIDE 17

Background: Weighted Finite-State Automata (WFSA)

Each Transition Defines a Weight Function

s0 →such/0.7→ s1 →a/1.3→ s2 →great/0.3→ s3 →talk/0.4→ s4

◮ (Weighted) pattern: such a great talk
◮ Weights are typically pre-specified

SLIDE 18

Background: Weighted Finite-State Automata (WFSA)

Each Transition Defines a Weight Function

s0 →such/0.7→ s1 →a/1.3→ s2 →great/0.3→ s3 →talk/0.4→ s4

◮ (Weighted) pattern: such a great talk
◮ Weights are typically pre-specified
◮ The score of a sequence is the sum of transition scores

SLIDE 19

Background: Weighted Finite-State Automata (WFSA)

Each Transition Defines a Weight Function

s0 →such/0.7→ s1 [0.7] →a/1.3→ s2 →great/0.3→ s3 →talk/0.4→ s4

◮ (Weighted) pattern: such a great talk
◮ Weights are typically pre-specified
◮ The score of a sequence is the sum of transition scores

SLIDE 20

Background: Weighted Finite-State Automata (WFSA)

Each Transition Defines a Weight Function

s0 →such/0.7→ s1 [0.7] →a/1.3→ s2 [2.0] →great/0.3→ s3 →talk/0.4→ s4

◮ (Weighted) pattern: such a great talk
◮ Weights are typically pre-specified
◮ The score of a sequence is the sum of transition scores

SLIDE 21

Background: Weighted Finite-State Automata (WFSA)

Each Transition Defines a Weight Function

s0 →such/0.7→ s1 [0.7] →a/1.3→ s2 [2.0] →great/0.3→ s3 [2.3] →talk/0.4→ s4

◮ (Weighted) pattern: such a great talk
◮ Weights are typically pre-specified
◮ The score of a sequence is the sum of transition scores

SLIDE 22

Background: Weighted Finite-State Automata (WFSA)

Each Transition Defines a Weight Function

s0 →such/0.7→ s1 [0.7] →a/1.3→ s2 [2.0] →great/0.3→ s3 [2.3] →talk/0.4→ s4 [2.7]

◮ (Weighted) pattern: such a great talk
◮ Weights are typically pre-specified
◮ The score of a sequence is the sum of transition scores
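The score computation on slides 17–22 fits in a few lines of Python. A minimal sketch, using the transition weights from the diagram; the dictionary encoding and the score_sequence helper are illustrative choices, not from the talk:

```python
# Minimal sketch of the WFSA above: each transition carries a pre-specified
# weight, and a sequence's score is the sum of its transition scores.
transitions = {  # (state, word) -> (next_state, weight)
    (0, "such"):  (1, 0.7),
    (1, "a"):     (2, 1.3),
    (2, "great"): (3, 0.3),
    (3, "talk"):  (4, 0.4),
}

def score_sequence(words):
    """Return the total score if the WFSA accepts the sequence, else None."""
    state, total = 0, 0.0
    for word in words:
        if (state, word) not in transitions:
            return None  # hard pattern: any unknown word rejects the sequence
        state, weight = transitions[(state, word)]
        total += weight
    return total if state == 4 else None

print(score_sequence("such a great talk".split()))  # 0.7+1.3+0.3+0.4 = 2.7
```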

SLIDE 23

Overview

◮ Background: Weighted Finite-State Automata
◮ Neural Weighted Finite-State Automata
◮ Existing Deep Models as Weighted Finite-State Automata
◮ Case Study: Convolutional neural networks


SLIDE 25

Motivation: Soft Pattern Matching

◮ such a great talk
◮ such a wonderful talk, such a lovely talk

SLIDE 26

Motivation: Soft Pattern Matching

◮ such a great talk
◮ such a wonderful talk, such a lovely talk
◮ Naive solution:

s0 →such/0.7→ s1 →a/1→ s2 →wonderful/0.3, lovely/0.25, great/0.3→ s3 →talk/0.4→ s4

SLIDE 27

Motivation: Soft Pattern Matching

◮ such a great talk
◮ such a wonderful talk, such a lovely talk
◮ Naive solution:

s0 →such/0.7→ s1 →a/1→ s2 →wonderful/0.3, lovely/0.25, great/0.3→ s3 →talk/0.4→ s4

◮ Problem: not scalable
  ◮ what a great talk, such an awesome talk

SLIDE 28

Solution: Neural Transitions

Schwartz et al., ACL 2018

s0 →great/0.3→ s1   ⇒   s0 → s1

SLIDE 29

Solution: Neural Transitions

Schwartz et al., ACL 2018

s0 →great/0.3→ s1   ⇒   s0 →v→ s1

◮ Step 1: word → R^d
  ◮ Word embeddings
  ◮ Similar words are encoded in similar vectors

SLIDE 30

Solution: Neural Transitions

Schwartz et al., ACL 2018

s0 →great/0.3→ s1   ⇒   s0 →∀v→ s1

◮ Step 1: word → R^d
  ◮ Word embeddings
  ◮ Similar words are encoded in similar vectors
◮ Step 2: Accept all word vectors

SLIDE 31

Solution: Neural Transitions

Schwartz et al., ACL 2018

s0 →great/0.3→ s1   ⇒   s0 →∀v / fθ(v)→ s1

◮ Step 1: word → R^d
  ◮ Word embeddings
  ◮ Similar words are encoded in similar vectors
◮ Step 2: Accept all word vectors
◮ Step 3: Weights: fθ : R^d → R
  ◮ These functions favor specific words
  ◮ θ parameters are learned

SLIDE 32

Solution: Neural Transitions

Schwartz et al., ACL 2018

s0 →∀v / fθ(v)→ s1    (θ: learnable parameters; v: word vector)

◮ Neural transitions accept all words,
◮ but favor specific words

SLIDE 33

Solution: Neural Transitions

Schwartz et al., ACL 2018

s0 →∀v / fθ(v)→ s1    (θ: learnable parameters; v: word vector)

◮ Neural transitions accept all words,
◮ but favor specific words
◮ Example 1: great
  ◮ high score: great, awesome, good
  ◮ low score: bad, child, three

SLIDE 34

Solution: Neural Transitions

Schwartz et al., ACL 2018

s0 →∀v / fθ(v)→ s1    (θ: learnable parameters; v: word vector)

◮ Neural transitions accept all words,
◮ but favor specific words
◮ Example 1: great
  ◮ high score: great, awesome, good
  ◮ low score: bad, child, three
◮ Example 2: the
  ◮ high score: the, a, an
  ◮ low score: car, love, well
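A minimal sketch of a single neural transition, assuming the simple linear score fθ(v) = θ · v that the ConvNet correspondence later in the talk uses; the paper's exact parameterization may differ, and all data here are toy stand-ins:

```python
import numpy as np

# Sketch of one neural transition f_theta: R^d -> R. It accepts every word
# vector, but assigns higher scores to words that theta "prefers".
d = 4                       # toy embedding size
rng = np.random.default_rng(0)
theta = rng.normal(size=d)  # learnable parameters of this transition

def f(v, theta):
    """Linear transition score for word vector v."""
    return float(theta @ v)

v_great = rng.normal(size=d)                     # stand-in word embedding
v_awesome = v_great + 0.1 * rng.normal(size=d)   # similar words ~ similar vectors
print(f(v_great, theta), f(v_awesome, theta))    # hence similar scores
```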

SLIDE 35

Neural Weighted Finite-State Automata

Schwartz et al., ACL 2018

s0 →fθ0(v)→ s1 →fθ1(v)→ s2 →fθ2(v)→ s3 →fθ3(v)→ s4

v: word vectors; θ = (θ0, θ1, θ2, θ3): learned parameters

◮ Neural WFSAs accept any sequence,¹ but prefer certain sequences

¹ Pending length constraints

SLIDE 36

Neural Weighted Finite-State Automata

Schwartz et al., ACL 2018

s0 →fθ0(v)→ s1 →fθ1(v)→ s2 →fθ2(v)→ s3 →fθ3(v)→ s4

v: word vectors; θ = (θ0, θ1, θ2, θ3): learned parameters

◮ Neural WFSAs accept any sequence,¹ but prefer certain sequences
◮ Example 1: such a great talk
  ◮ high score: what a great talk, such an awesome talk
  ◮ low score: such a horrible talk, such a black cat, john went to school

¹ Pending length constraints

SLIDE 37

Neural Weighted Finite-State Automata

Schwartz et al., ACL 2018

s0 →fθ0(v)→ s1 →fθ1(v)→ s2 →fθ2(v)→ s3 →fθ3(v)→ s4

v: word vectors; θ = (θ0, θ1, θ2, θ3): learned parameters

◮ Neural WFSAs accept any sequence,¹ but prefer certain sequences
◮ Example 1: such a great talk
  ◮ high score: what a great talk, such an awesome talk
  ◮ low score: such a horrible talk, such a black cat, john went to school
◮ Example 2: is not very exciting
  ◮ high score: is not particularly exciting, are not very inspiring

¹ Pending length constraints

SLIDE 38

Training Procedure

Formally

End-to-end training:
◮ Input
  ◮ s0 →fθ0(v)→ s1 →fθ1(v)→ s2 →fθ2(v)→ s3 →fθ3(v)→ s4
  ◮ Word embeddings: word → R^d
  ◮ Training data: <document, sentiment label> pairs
◮ Output
  ◮ Parameter values: θ

SLIDE 39

Training Procedure

Formally

End-to-end training:
◮ Input
  ◮ s0 →fθ0(v)→ s1 →fθ1(v)→ s2 →fθ2(v)→ s3 →fθ3(v)→ s4
  ◮ Word embeddings: word → R^d
  ◮ Training data: <document, sentiment label> pairs
◮ Output
  ◮ Parameter values: θ

Test:
◮ Input
  ◮ s0 →fθ0(v)→ s1 →fθ1(v)→ s2 →fθ2(v)→ s3 →fθ3(v)→ s4
  ◮ Word embeddings: word → R^d
  ◮ Learned parameters: θ
  ◮ New data: <document>
◮ Output
  ◮ Prediction: <sentiment label>

SLIDE 40

Training Procedure

Formally

End-to-end training:
◮ Input
  ◮ s0 →fθ0(v)→ s1 →fθ1(v)→ s2 →fθ2(v)→ s3 →fθ3(v)→ s4
  ◮ Word embeddings: word → R^d
  ◮ Training data: <document, sentiment label> pairs
◮ Output
  ◮ Parameter values: θ

Test:
◮ Input
  ◮ s0 →fθ0(v)→ s1 →fθ1(v)→ s2 →fθ2(v)→ s3 →fθ3(v)→ s4
  ◮ Word embeddings: word → R^d
  ◮ Learned parameters: θ
  ◮ New data: <document>
◮ Output
  ◮ Prediction: <sentiment label>

◮ Standard training procedure
  ◮ Backpropagation
  ◮ Stochastic gradient descent
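A toy PyTorch sketch of this end-to-end setup, assuming the linear transition scores and the max-pooling over windows described elsewhere in the talk; the shapes, data, and hyperparameters are stand-ins, not the paper's:

```python
import torch

# Sketch: the WFSA's transition parameters theta are trained with
# backpropagation + SGD from <document, sentiment label> pairs.
d, pattern_len, n_docs, doc_len = 50, 4, 32, 20
theta = torch.randn(pattern_len, d, requires_grad=True)  # one vector per transition
docs = torch.randn(n_docs, doc_len, d)                   # pre-embedded documents
labels = torch.randint(0, 2, (n_docs,)).float()          # toy sentiment labels

opt = torch.optim.SGD([theta], lr=0.1)
for _ in range(5):  # a few SGD epochs
    # score every length-4 window: sum_j f_{theta_j}(v_{i+j}), then max-pool
    windows = docs.unfold(1, pattern_len, 1)             # (docs, windows, d, 4)
    scores = torch.einsum("nwdj,jd->nwj", windows, theta).sum(-1)
    doc_score = scores.max(dim=1).values                 # best match per document
    loss = torch.nn.functional.binary_cross_entropy_with_logits(doc_score, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```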

SLIDE 41

Benefits of Neural WFSAs 1: Informed Model Development

s0 → s1 → s2 → s3 → s4
Fixed length: such a great talk

SLIDE 42

Benefits of Neural WFSAs 1: Informed Model Development

s0 → s1 → s2 → s3 → s4
Fixed length: such a great talk

s0 → s1 → s2 → s3 → s4 (with self-loops)
Self loops: such a great, wonderful, funny talk

SLIDE 43

Benefits of Neural WFSAs 1: Informed Model Development

s0 → s1 → s2 → s3 → s4
Fixed length: such a great talk

s0 → s1 → s2 → s3 → s4 (with self-loops)
Self loops: such a great, wonderful, funny talk

s0 → s1 → s2 → s3 → s4 (with ε-transitions: fθε())
Epsilon transitions: such great shoes

SLIDE 44

Benefits of Neural WFSAs 1: Informed Model Development

s0 → s1 → s2 → s3 → s4
Fixed length: such a great talk

s0 → s1 → s2 → s3 → s4 (with self-loops)
Self loops: such a great, wonderful, funny talk

s0 → s1 → s2 → s3 → s4 (with ε-transitions: fθε())
Epsilon transitions: such great shoes

s0 → s1 → s2 → s3 → s4   s0 → s1 → s2 → s3 → s4
...

SLIDE 45

Benefits of Neural WFSAs 2:

◮ They are neural
  ◮ Backpropagation
  ◮ Stochastic gradient descent
  ◮ PyTorch, TensorFlow, AllenNLP

SLIDE 46

Benefits of Neural WFSAs 2:

◮ They are neural
  ◮ Backpropagation
  ◮ Stochastic gradient descent
  ◮ PyTorch, TensorFlow, AllenNLP
◮ Coming up:
  ◮ Many deep models are mathematically equivalent to neural WFSAs
  ◮ A (new) joint framework
  ◮ Allows extension of these models

SLIDE 47

Overview

◮ Background: Weighted Finite-State Automata
◮ Neural Weighted Finite-State Automata
◮ Existing Deep Models as Weighted Finite-State Automata
◮ Case Study: Convolutional neural networks


SLIDE 49

Case Study: Convolutional Neural Networks (ConvNets)

A Linear-Kernel Filter with Max-Pooling

v1 v2 v3 v4 v5 v6 v7

SLIDE 50

Case Study: Convolutional Neural Networks (ConvNets)

A Linear-Kernel Filter with Max-Pooling

v1 v2 v3 v4 v5 v6 v7

Sθ(v1:v4) = Σ_{j=1..4} θj · vj    (θ: learnable parameters; v: word vectors)

SLIDE 51

Proposition 1: ConvNet Filters are Computing WFSA Scores

Schwartz et al., ACL 2018

s0 → s1 → s2 → s3 → s4

SLIDE 52

Proposition 1: ConvNet Filters are Computing WFSA Scores

Schwartz et al., ACL 2018

s0 → s1 → s2 → s3 → s4

◮ fθj(v) = θj · v

SLIDE 53

Proposition 1: ConvNet Filters are Computing WFSA Scores

Schwartz et al., ACL 2018

s0 → s1 → s2 → s3 → s4

◮ fθj(v) = θj · v
◮ sθ(v1:v4) = Σ_{j=1..4} fθj(vj) = Σ_{j=1..4} (θj · vj)

SLIDE 54

ConvNets are (Implicitly) Computing WFSA Scores!

ConvNet:     Sθ(v1:vd) = Σ_{j=1..d} (θj · vj)    (1)
Neural WFSA: sθ(v1:vd) = Σ_{j=1..d} (θj · vj)    (2)

SLIDE 55

ConvNets are (Implicitly) Computing WFSA Scores!

ConvNet:     Sθ(v1:vd) = Σ_{j=1..d} (θj · vj)    (1)
Neural WFSA: sθ(v1:vd) = Σ_{j=1..d} (θj · vj)    (2)

Benefits:
❉ Interpret ConvNets
❉ Improve ConvNets
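The proposition can be checked numerically. A small sketch with toy data, in which theta doubles as the filter weights and the transition parameters:

```python
import numpy as np

# Sketch: a linear-kernel ConvNet filter score and the neural-WFSA path score
# are the same computation, per equations (1) and (2) above.
rng = np.random.default_rng(0)
d, width = 5, 4
theta = rng.normal(size=(width, d))   # filter weights == transition parameters
v = rng.normal(size=(width, d))       # one window of word vectors v_1..v_4

convnet_score = np.sum(theta * v)                        # sum_j theta_j . v_j
wfsa_score = sum(theta[j] @ v[j] for j in range(width))  # sum of transition scores
assert np.isclose(convnet_score, wfsa_score)
```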

SLIDE 56

A ConvNet Learns a Fixed-Length Soft-Pattern!

Schwartz et al., ACL 2018

s0 → s1 → s2 → s3 → s4

◮ E.g., "such a great talk"
  ◮ what a great song
  ◮ such an awesome movie

SLIDE 57

Improving ConvNets: SoPa (Soft Patterns)

Schwartz et al., ACL 2018

◮ Language patterns are often of flexible length
  ◮ such a great talk
  ◮ such a great, funny, interesting talk
  ◮ such great shoes

SLIDE 58

Improving ConvNets: SoPa (Soft Patterns)

Schwartz et al., ACL 2018

◮ Language patterns are often of flexible length
  ◮ such a great talk
  ◮ such a great, funny, interesting talk
  ◮ such great shoes

Convolutional Neural Network: Sθ(v1:vd) = Σ_{j=1..d} (θj · vj)

SLIDE 59

Improving ConvNets: SoPa (Soft Patterns)

Schwartz et al., ACL 2018

◮ Language patterns are often of flexible length
  ◮ such a great talk
  ◮ such a great, funny, interesting talk
  ◮ such great shoes

Weighted Finite-State Automaton:

s0 →such→ s1 →a→ s2 →great→ s3 →talk→ s4

SLIDE 60

Improving ConvNets: SoPa (Soft Patterns)

Schwartz et al., ACL 2018

◮ Language patterns are often of flexible length
  ◮ such a great talk
  ◮ such a great, funny, interesting talk
  ◮ such great shoes

Weighted Finite-State Automaton:

s0 →such→ s1 →a→ s2 →great→ s3 (self-loop: funny, interesting) →talk→ s4

SLIDE 61

Improving ConvNets: SoPa (Soft Patterns)

Schwartz et al., ACL 2018

◮ Language patterns are often of flexible length
  ◮ such a great talk
  ◮ such a great, funny, interesting talk
  ◮ such great shoes

Weighted Finite-State Automaton:

s0 →such→ s1 →a / ε→ s2 →great→ s3 (self-loop: funny, interesting) →talk→ s4


SLIDE 63

Sentiment Analysis Experiments

"I saw such a great talk today" → v1 v2 v3 v4 v5 v6 v7 → v1:7 → Classify (❯/❉)

SLIDE 64

Sentiment Analysis Experiments

"I saw such a great talk today" → v1 v2 v3 v4 v5 v6 v7 → v1:7 → Classify (❯/❉)

Sequence encoders:
◮ SoPa (ours)
◮ ConvNet

SLIDE 65

Sentiment Analysis Experiments

"I saw such a great talk today" → v1 v2 v3 v4 v5 v6 v7 → v1:7 → Classify (❯/❉)

Sequence encoders:
◮ SoPa (ours)
◮ ConvNet
◮ LSTM

SLIDE 66

Sentiment Analysis Results

Schwartz et al., ACL 2018

[Plots: classification accuracy (y-axis) vs. number of training samples, 100–10,000 (x-axis), on SST and Amazon]

SLIDE 67

Sentiment Analysis Results

Schwartz et al., ACL 2018

[Plots: classification accuracy (y-axis) vs. number of training samples, 100–10,000 (x-axis), on SST and Amazon; curves: SoPa (ours), ConvNet, LSTM]


SLIDE 69

Interpreting SoPa

Soft Patterns!

◮ For each learned pattern, extract the 4 top-scoring phrases in the training set

SLIDE 70

Interpreting SoPa

Soft Patterns!

◮ For each learned pattern, extract the 4 top-scoring phrases in the training set

Highest-Scoring Phrases, Patt. 1:
  mesmerizing portrait of a
  engrossing portrait of a
  clear-eyed portrait of an
  fascinating portrait of a

Pattern 1: s0 → s1 →portrait→ s2 →of→ s3 →a→ s4

SLIDE 71

Interpreting SoPa

Soft Patterns!

◮ For each learned pattern, extract the 4 top-scoring phrases in the training set

Highest-Scoring Phrases, Patt. 1:
  mesmerizing portrait of a
  engrossing portrait of a
  clear-eyed portrait of an
  fascinating portrait of a

Highest-Scoring Phrases, Patt. 2:
  honest , and enjoyable
  forceful , and beautifully
  energetic , and surprisingly

Pattern 1: s0 → s1 →portrait→ s2 →of→ s3 →a→ s4
Pattern 2: s0 → s1 →,→ s2 →and→ s3 → s4
SLIDE 72

Interpreting SoPa

Soft Patterns!

◮ For each learned pattern, extract the 4 top-scoring phrases in the training set

Highest-Scoring Phrases, Patt. 1:
  mesmerizing portrait of a
  engrossing portrait of a
  clear-eyed portrait of an
  fascinating portrait of a

Highest-Scoring Phrases, Patt. 2:
  honest , and enjoyable
  forceful , and beautifully
  energetic , and surprisingly
  unpretentious , charming^SL , quirky

Pattern 1: s0 → s1 →portrait→ s2 →of→ s3 →a→ s4
Pattern 2: s0 → s1 →,→ s2 →and→ s3 → s4
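A sketch of this interpretation recipe under the fixed-length, linear-score view of patterns used earlier; the patterns, vocabulary, and sentences are toy stand-ins, not learned values:

```python
import numpy as np

# Sketch: for each learned pattern, score every length-4 span in the training
# set and keep the 4 top-scoring phrases.
rng = np.random.default_rng(0)
d, width = 5, 4
patterns = [rng.normal(size=(width, d)) for _ in range(2)]
vocab = ["mesmerizing", "portrait", "of", "a", "great", "talk"]
emb = {w: rng.normal(size=d) for w in vocab}
sentences = [["a", "mesmerizing", "portrait", "of", "a", "great", "talk"]]

for p, theta in enumerate(patterns):
    spans = []
    for sent in sentences:
        for i in range(len(sent) - width + 1):
            span = sent[i:i + width]
            score = sum(theta[j] @ emb[w] for j, w in enumerate(span))
            spans.append((score, " ".join(span)))
    top4 = sorted(spans, reverse=True)[:4]
    print(f"Pattern {p + 1}:", [phrase for _, phrase in top4])
```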

SLIDE 74

s0 → s1 → s2 → s3 → s4

SLIDE 75

s0 → s1 → s2 → s3 → s4   ⇒   s0 → s1 → s2 → s3 → s4 (self-loops, ε-transitions: fθε())

More expressive WFSA

SLIDE 76

s0 → s1 → s2 → s3 → s4   ⇒   s0 → s1 → s2 → s3 → s4 (self-loops, ε-transitions: fθε())

More expressive WFSA ⇒ an interpretable, more robust convolutional neural network

SLIDE 77

Many Existing Deep Models are Neural WFSAs!

Peng, Schwartz et al., EMNLP 2018

◮ Mikolov et al., arXiv 2014
◮ Balduzzi and Ghifary, ICML 2016
◮ Bradbury et al., ICLR 2017
◮ Lei et al., EMNLP 2018
◮ Lei et al., NAACL 2016
◮ Foerster et al., ICML 2017

SLIDE 78

Many Existing Deep Models are Neural WFSAs!

Peng, Schwartz et al., EMNLP 2018

s0 → s1: Mikolov et al., arXiv 2014; Balduzzi and Ghifary, ICML 2016; Bradbury et al., ICLR 2017; Lei et al., EMNLP 2018
s0 → s1 → s2: Lei et al., NAACL 2016
s0 → s1 → s2 → s3: Foerster et al., ICML 2017

SLIDE 79

Many Existing Deep Models are Neural WFSAs!

Peng, Schwartz et al., EMNLP 2018

s0 → s1: Mikolov et al., arXiv 2014; Balduzzi and Ghifary, ICML 2016; Bradbury et al., ICLR 2017; Lei et al., EMNLP 2018
s0 → s1 → s2: Lei et al., NAACL 2016
s0 → s1 → s2 → s3: Foerster et al., ICML 2017

◮ Six recent recurrent neural network (RNN) models are also implicitly computing WFSA scores

SLIDE 80

Developing more Robust WFSA Models

S2: s0 → s1 — Mikolov et al. (2014); Balduzzi and Ghifary (2016); Bradbury et al. (2017); Lei et al. (2018)
S3: s0 → s1 → s2 — Lei et al. (2016)

SLIDE 81

Developing more Robust WFSA Models

S2: s0 → s1 — Mikolov et al. (2014); Balduzzi and Ghifary (2016); Bradbury et al. (2017); Lei et al. (2018)
S3: s0 → s1 → s2 — Lei et al. (2016)
S2,3: s0 → s1 → s2 — Peng, Schwartz et al. (2018)

SLIDE 82

Sentiment Analysis Results

Peng, Schwartz et al., EMNLP 2018

[Results figure]

SLIDE 83

Language Modeling Results

Peng, Schwartz et al., EMNLP 2018

[Results figure; lower is better]

SLIDE 84

[WFSA diagram: s0 → s1]

Deep Learning
❯ backpropagation
❯ stochastic gradient descent
❯ PyTorch, TensorFlow, AllenNLP
❯ state-of-the-art
❉ architecture engineering

Weighted Finite-State Automata
❯ widely studied
❯ understandable
❯ interpretable
❯ informed model development
❉ low performance

SLIDE 85

Work in Progress 1: Are All Deep Models for NLP Equivalent to WFSAs?

◮ Elman RNN: hi = σ(W hi−1 + U vi + b)
◮ The interaction between hi and hi−1 is via affine transformations followed by nonlinearities
◮ Same for LSTM
◮ Most probably not equivalent to a WFSA
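For concreteness, a sketch of the Elman update; note that the nonlinearity σ wraps the affine map of hi−1, which is what blocks the sum-of-products (WFSA-style) unrolling used for the rational models above. Shapes and data are toy stand-ins:

```python
import numpy as np

# Sketch of the Elman update on this slide: h_i = sigma(W h_{i-1} + U v_i + b).
rng = np.random.default_rng(0)
d_h, d_v = 3, 4
W = rng.normal(size=(d_h, d_h))
U = rng.normal(size=(d_h, d_v))
b = rng.normal(size=d_h)

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def elman_step(h_prev, v):
    # h_{i-1} passes through an affine map and then a nonlinearity
    return sigma(W @ h_prev + U @ v + b)

h = np.zeros(d_h)
for v in rng.normal(size=(5, d_v)):  # run over a 5-token toy sequence
    h = elman_step(h, v)
print(h)
```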

SLIDE 86

Work in Progress 2: Automatic Model Development

s0 → s1 → s2 → s3 → s4   s0 → s1 → s2 → s3 → s4

SLIDE 87

Work in Progress 2: Automatic Model Development

s0 → s1 → s2 → s3 → s4   s0 → s1 → s2 → s3 → s4
s0 → s1 → s2 → s3 → s4   s0 → s1 → s2 → s3 → s4

SLIDE 88

Work in Progress 2: Automatic Model Development

s0 → s1 → s2 → s3 → s4   s0 → s1 → s2 → s3 → s4
s0 → s1 → s2 → s3 → s4   s0 → s1 → s2 → s3 → s4

Deep learning: model engineering

SLIDE 89

Work in Progress 2: Automatic Model Development

s0 → s1 → s2 → s3 → s4   s0 → s1 → s2 → s3 → s4
s0 → s1 → s2 → s3 → s4   s0 → s1 → s2 → s3 → s4

Deep learning: model engineering
SoPa: informed model development

SLIDE 90

Work in Progress 2: Automatic Model Development

s0 → s1 → s2 → s3 → s4   s0 → s1 → s2 → s3 → s4
s0 → s1 → s2 → s3 → s4   s0 → s1 → s2 → s3 → s4

Deep learning: model engineering
SoPa: informed model development
New: automatic model development

SLIDE 91

Other Projects

"I saw such a great talk today" → v1 v2 v3 v4 v5 v6 v7 → v1:7 → Classify (❯/❉)

◮ input (words)
◮ word embeddings
◮ sequence encoders
◮ output (prediction)

SLIDE 92

Other Projects

"I saw such a great talk today" → v1 v2 v3 v4 v5 v6 v7 → v1:7 → Classify (❯/❉)

◮ input (words): Schwartz et al., EMNLP 2013; Schwartz et al., COLING 2014
◮ word embeddings
◮ sequence encoders
◮ output (prediction)

SLIDE 93

Other Projects

"I saw such a great talk today" → v1 v2 v3 v4 v5 v6 v7 → v1:7 → Classify (❯/❉)

◮ input (words): Schwartz et al., EMNLP 2013; Schwartz et al., COLING 2014
◮ word embeddings: Schwartz et al., CoNLL 2015; Rubinstein et al., ACL 2015; Schwartz et al., NAACL 2016; Vulić et al., CoNLL 2017; Peters et al., 2018
◮ sequence encoders
◮ output (prediction)

SLIDE 94

Other Projects

"I saw such a great talk today" → v1 v2 v3 v4 v5 v6 v7 → v1:7 → Classify (❯/❉)

◮ input (words): Schwartz et al., EMNLP 2013; Schwartz et al., COLING 2014
◮ word embeddings: Schwartz et al., CoNLL 2015; Rubinstein et al., ACL 2015; Schwartz et al., NAACL 2016; Vulić et al., CoNLL 2017; Peters et al., 2018
◮ sequence encoders: Schwartz et al., ACL 2018; Peng et al., EMNLP 2018; Liu et al., RepL4NLP 2018 *best paper award*
◮ output (prediction)

SLIDE 95

Other Projects

"I saw such a great talk today" → v1 v2 v3 v4 v5 v6 v7 → v1:7 → Classify (❯/❉)

◮ input (words): Schwartz et al., EMNLP 2013; Schwartz et al., COLING 2014
◮ word embeddings: Schwartz et al., CoNLL 2015; Rubinstein et al., ACL 2015; Schwartz et al., NAACL 2016; Vulić et al., CoNLL 2017; Peters et al., 2018
◮ sequence encoders: Schwartz et al., ACL 2018; Peng et al., EMNLP 2018; Liu et al., RepL4NLP 2018 *best paper award*
◮ output (prediction)
◮ Labeled datasets (<sentence, label> pairs): Schwartz et al., ACL 2011; Schwartz et al., COLING 2012; Schwartz et al., CoNLL 2017; Gururangan et al., NAACL 2018; Kang et al., NAACL 2018; Zellers et al., EMNLP 2018

SLIDE 96

Annotation Artifacts in NLP Datasets

Schwartz et al., CoNLL 2017; Gururangan, Swayamdipta, Levy, Schwartz et al., NAACL 2018

Premise: A person is running on the beach
Hypothesis: The person is sleeping

Textual Entailment (state-of-the-art ∼90% accuracy)

SLIDE 97

Annotation Artifacts in NLP Datasets

Schwartz et al., CoNLL 2017; Gururangan, Swayamdipta, Levy, Schwartz et al., NAACL 2018

Premise: A person is running on the beach
Hypothesis: The person is sleeping
Label: entailment? contradiction? neutral?

Textual Entailment (state-of-the-art ∼90% accuracy)

SLIDE 98

Annotation Artifacts in NLP Datasets

Schwartz et al., CoNLL 2017; Gururangan, Swayamdipta, Levy, Schwartz et al., NAACL 2018

Premise: A person is running on the beach
Hypothesis: The person is sleeping
Label: entailment? contradiction? neutral?

Textual Entailment (state-of-the-art ∼90% accuracy)

AllenNLP Demo!

SLIDE 99

Annotation Artifacts in NLP Datasets

Schwartz et al., CoNLL 2017; Gururangan, Swayamdipta, Levy, Schwartz et al., NAACL 2018

Premise: A person is running on the beach
Hypothesis: The person is sleeping
Label: entailment? contradiction? neutral?

Textual Entailment (state-of-the-art ∼90% accuracy)

◮ The word "sleeping" is over-represented in the training data with the contradiction label
  ◮ an annotation artifact
◮ State-of-the-art models focus on this word rather than understanding the text

SLIDE 100

Annotation Artifacts in NLP Datasets

Schwartz et al., CoNLL 2017; Gururangan, Swayamdipta, Levy, Schwartz et al., NAACL 2018

Premise: A person is running on the beach
Hypothesis: The person is sleeping
Label: entailment? contradiction? neutral?

Textual Entailment (state-of-the-art ∼90% accuracy)

◮ The word "sleeping" is over-represented in the training data with the contradiction label
  ◮ an annotation artifact
◮ State-of-the-art models focus on this word rather than understanding the text
◮ Models are not as strong as we think they are

SLIDE 101

Long Term Vision

SLIDE 102

Long Term Vision

◮ Explainable models
◮ Unbiased models

SLIDE 103

Special Thanks to...

Li Zilles, Dana Rubinstein, Effi Levi


SLIDE 106

s0 → s1 → s2 → s3 → s4   ⇒   s0 → s1 → s2 → s3 → s4 (self-loops, ε-transitions: fθε())

More expressive WFSA ⇒ an interpretable, more robust convolutional neural network

Thank you!

Roy Schwartz
homes.cs.washington.edu/~roysch/
roysch@cs.washington.edu

SLIDE 107

Neural WFSAs as Sequence Encoders

"I saw such a great talk today" → v1 v2 v3 v4 v5 v6 v7 → v1:7 → Classify (❯/❉)

SLIDE 108

Neural WFSAs as Sequence Encoders

"I saw such a great talk today" → v1 v2 v3 v4 v5 v6 v7

s(v1:v4, θ(1)):  s0 →fθ(1)_0(v1) "such"→ s1 →fθ(1)_1(v2) "a"→ s2 →fθ(1)_2(v3) "great"→ s3 →fθ(1)_3(v4) "talk"→ s4

v1:7 → Classify (❯/❉)

SLIDE 109

Neural WFSAs as Sequence Encoders

"I saw such a great talk today" → v1 v2 v3 v4 v5 v6 v7

s(v2:v5, θ(1)):  s0 →fθ(1)_0(v2) "such"→ s1 →fθ(1)_1(v3) "a"→ s2 →fθ(1)_2(v4) "great"→ s3 →fθ(1)_3(v5) "talk"→ s4

v1:7 → Classify (❯/❉)

SLIDE 110

Neural WFSAs as Sequence Encoders

"I saw such a great talk today" → v1 v2 v3 v4 v5 v6 v7

s(v3:v6, θ(1)):  s0 →fθ(1)_0(v3) "such"→ s1 →fθ(1)_1(v4) "a"→ s2 →fθ(1)_2(v5) "great"→ s3 →fθ(1)_3(v6) "talk"→ s4

v1:7 → Classify (❯/❉)

SLIDE 111

Neural WFSAs as Sequence Encoders

"I saw such a great talk today" → v1 v2 v3 v4 v5 v6 v7

s(v4:v7, θ(1)):  s0 →fθ(1)_0(v4) "such"→ s1 →fθ(1)_1(v5) "a"→ s2 →fθ(1)_2(v6) "great"→ s3 →fθ(1)_3(v7) "talk"→ s4

v1:7 → Classify (❯/❉)

SLIDE 112

Neural WFSAs as Sequence Encoders

"I saw such a great talk today" → v1 v2 v3 v4 v5 v6 v7

max_i s(vi:vi+3, θ(1)):  s0 →fθ(1)_0(v4) "such"→ s1 →fθ(1)_1(v5) "a"→ s2 →fθ(1)_2(v6) "great"→ s3 →fθ(1)_3(v7) "talk"→ s4

v1:7 → Classify (❯/❉)

SLIDE 113

Neural WFSAs as Sequence Encoders

"I saw such a great talk today" → v1 v2 v3 v4 v5 v6 v7

max_i s(vi:vi+3, θ(1)), θ = θ(1):  s0 →fθ(1)_0(v4) "such"→ s1 →fθ(1)_1(v5) "a"→ s2 →fθ(1)_2(v6) "great"→ s3 →fθ(1)_3(v7) "talk"→ s4

v1:7 → Classify (❯/❉)

SLIDE 114

Neural WFSAs as Sequence Encoders

"I saw such a great talk today" → v1 v2 v3 v4 v5 v6 v7

max_i s(vi:vi+3, θ(2)), θ = θ(2):  s0 →fθ(2)_0(v4) "is"→ s1 →fθ(2)_1(v5) "remarkably"→ s2 →fθ(2)_2(v6) "dull"→ s3 →fθ(2)_3(v7) "!"→ s4

v1:7 → Classify (❯/❉)

SLIDE 115

Neural WFSAs as Sequence Encoders

"I saw such a great talk today" → v1 v2 v3 v4 v5 v6 v7

max_i s(vi:vi+3, θ(k)), θ = θ(k):  s0 →fθ(k)_0(v4) "gorgeous"→ s1 →fθ(k)_1(v5) "and"→ s2 →fθ(k)_2(v6) "witty"→ s3 →fθ(k)_3(v7) "movie"→ s4

v1:7 → Classify (❯/❉)
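Putting slides 107–115 together, a sketch of the encoder: one max-pooled score per pattern, concatenated into the document representation fed to the classifier. All names and data here are toy stand-ins:

```python
import numpy as np

# Sketch: each pattern k yields one feature max_i s(v_{i:i+3}, theta^(k)).
rng = np.random.default_rng(0)
n_patterns, width, d, n_tokens = 3, 4, 5, 7
thetas = rng.normal(size=(n_patterns, width, d))   # theta^(1) .. theta^(k)
v = rng.normal(size=(n_tokens, d))                 # v_1 .. v_7

def encode(v, thetas):
    feats = []
    for theta in thetas:
        scores = [sum(theta[j] @ v[i + j] for j in range(width))
                  for i in range(len(v) - width + 1)]
        feats.append(max(scores))                  # max-pool over windows
    return np.array(feats)

print(encode(v, thetas))  # one score per pattern; classify on this vector
```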

SLIDE 116

SoPa Complexity

◮ Running the Viterbi (1967) algorithm on a sequence of n tokens and a WFSA of d states typically takes O(d³ + d²n)
◮ We only allow zero or one ε-transition at a time ⇒ O(d²n)
◮ We only allow self-loop and main-path transitions ⇒ O(dn)
◮ Scores for all patterns can be computed in parallel
◮ GPU optimization further reduces the observed runtime to be sublinear in d
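A sketch of the O(dn) case, assuming only self-loop and main-path transitions (no ε), additive scores, and max-sum (Viterbi-style) semantics over a path that reads the whole sequence; the per-token score arrays are toy stand-ins:

```python
import numpy as np

NEG_INF = float("-inf")

rng = np.random.default_rng(0)
d, n = 5, 7                         # d states, n tokens
selfloop = rng.normal(size=(d, n))  # score of staying in state s on token i
main = rng.normal(size=(d - 1, n))  # score of moving from state s to s+1 on token i

def best_score(selfloop, main):
    """Max-scoring path from state 0 reading all n tokens, ending anywhere."""
    d, n = selfloop.shape
    score = np.full(d, NEG_INF)
    score[0] = 0.0
    for i in range(n):                               # O(d) work per token
        new = np.full(d, NEG_INF)
        new[0] = score[0] + selfloop[0, i]           # stay in start state
        for s in range(1, d):
            new[s] = max(score[s] + selfloop[s, i],      # self-loop
                         score[s - 1] + main[s - 1, i])  # advance one state
        score = new
    return score.max()

print(best_score(selfloop, main))
```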

SLIDE 117

Interpreting SoPa

Visualizing Sentiment Predictions

◮ Leave-one-out method on all patterns
◮ Visualize the spans with the largest positive and negative contributions

Analyzed documents:
  it's dumb, but more importantly, it's just not scary
  While its careful pace and seemingly opaque story may not satisfy every moviegoer's appetite, the film's final scene is soaringly, transparently moving

SLIDE 118

LSTMs Exploit Linguistic Attributes of Data

Liu, Levy, Schwartz et al., RepL4NLP 2018, best paper award

◮ Non-linguistic task

SLIDE 119

LSTMs Exploit Linguistic Attributes of Data

Liu, Levy, Schwartz et al., RepL4NLP 2018, best paper award

◮ Non-linguistic task
◮ Although they weren't designed that way, LSTMs do much better when trained on language data

SLIDE 120

Case Study 2: Recurrent Neural Networks (RNN)

Interpretable, more robust RNNs

SLIDE 121

Case Study 2: Recurrent Neural Networks (RNN)

s0 → s1

Interpretable, more robust RNNs

SLIDE 122

Case Study 2: Recurrent Neural Networks (RNN)

s0 → s1   ⇒   s0 → s1 → s2
More expressive WFSA

Interpretable, more robust RNNs


SLIDE 124

Recurrent Neural Networks: Hidden States

"I saw such a great talk today" → (v1, h1) (v2, h2) (v3, h3) (v4, h4) (v5, h5) (v6, h6) (v7, h7) → v1:7

SLIDE 125

Recurrent Neural Networks: Hidden States

"I saw such a great talk today" → (v1, h1) (v2, h2) (v3, h3) (v4, h4) (v5, h5) (v6, h6) (v7, h7) → v1:7

Recurrent function: hi = f(hi−1, vi)

SLIDE 126

Multiple Variants of Recurrent Neural Networks

◮ Elman (1990)
◮ LSTM (Hochreiter and Schmidhuber, 1997)
◮ GRU (Cho et al., 2014)
◮ SGU (Gao and Glowacka, 2016)
◮ RAN (Lee et al., 2017)
◮ SCRN (Mikolov et al., 2014)
◮ T-RNN (Balduzzi and Ghifary, 2016)
◮ RCNN (Lei et al., 2016)
◮ Q-RNN (Bradbury et al., 2017)
◮ ISAN (Foerster et al., 2017)
◮ SoPa (Schwartz et al., 2018)
◮ SRU (Lei et al., 2018)

SLIDE 127

Multiple Variants of Recurrent Neural Networks

◮ Elman (1990)
◮ LSTM (Hochreiter and Schmidhuber, 1997)
◮ GRU (Cho et al., 2014)
◮ SGU (Gao and Glowacka, 2016)
◮ RAN (Lee et al., 2017)
◮ SCRN (Mikolov et al., 2014)
◮ T-RNN (Balduzzi and Ghifary, 2016)
◮ RCNN (Lei et al., 2016)
◮ Q-RNN (Bradbury et al., 2017)
◮ ISAN (Foerster et al., 2017)
◮ SoPa (Schwartz et al., 2018)
◮ SRU (Lei et al., 2018)

◮ What do different RNN variants have in common?
◮ What are they learning?
◮ Can we improve them?

SLIDE 128

Example: Strongly-Typed Recurrent Neural Networks

Balduzzi and Ghifary (2016)

◮ A simple, competitive RNN
◮ Draws inspiration from physics and functional programming
◮ hi = zi · hi−1 + ui
◮ zi, ui are non-linear parameterized functions of vi

SLIDE 129

Example: Strongly-Typed Recurrent Neural Networks

Balduzzi and Ghifary (2016)

◮ A simple, competitive RNN
◮ Draws inspiration from physics and functional programming
◮ hi = zi · hi−1 + ui
◮ zi, ui are non-linear parameterized functions of vi
◮ Let xi = [xi]k (a single coordinate):

  hn = zn · hn−1 + un

SLIDE 130

Example: Strongly-Typed Recurrent Neural Networks

Balduzzi and Ghifary (2016)

◮ A simple, competitive RNN
◮ Draws inspiration from physics and functional programming
◮ hi = zi · hi−1 + ui
◮ zi, ui are non-linear parameterized functions of vi
◮ Let xi = [xi]k (a single coordinate):

  hn = zn · hn−1 + un
     = zn · (zn−1 · hn−2 + un−1) + un

SLIDE 131

Example: Strongly-Typed Recurrent Neural Networks

Balduzzi and Ghifary (2016)

◮ A simple, competitive RNN
◮ Draws inspiration from physics and functional programming
◮ hi = zi · hi−1 + ui
◮ zi, ui are non-linear parameterized functions of vi
◮ Let xi = [xi]k (a single coordinate):

  hn = zn · hn−1 + un
     = zn · (zn−1 · hn−2 + un−1) + un
     = zn · (zn−1 · (zn−2 · hn−3 + un−2) + un−1) + un

SLIDE 132

Example: Strongly-Typed Recurrent Neural Networks

Balduzzi and Ghifary (2016)

◮ A simple, competitive RNN
◮ Draws inspiration from physics and functional programming
◮ hi = zi · hi−1 + ui
◮ zi, ui are non-linear parameterized functions of vi
◮ Let xi = [xi]k (a single coordinate):

  hn = zn · hn−1 + un
     = zn · (zn−1 · hn−2 + un−1) + un
     = zn · (zn−1 · (zn−2 · hn−3 + un−2) + un−1) + un
     = ...
     = Σ_{i=1..n−1} ( ui · Π_{j=i+1..n} zj ) + un
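The unrolled form can be sanity-checked numerically (elementwise, per the single-coordinate view above); a minimal sketch with random gates and updates:

```python
import numpy as np

# Sketch: h_n from the recurrence h_i = z_i * h_{i-1} + u_i equals
# sum_{i<n} u_i * prod_{j>i} z_j + u_n (with h_0 = 0).
rng = np.random.default_rng(0)
n, dim = 6, 3
z = rng.uniform(0, 1, size=(n + 1, dim))   # z_1..z_n (index 0 unused)
u = rng.normal(size=(n + 1, dim))          # u_1..u_n (index 0 unused)

h = np.zeros(dim)
for i in range(1, n + 1):                  # run the recurrence
    h = z[i] * h + u[i]

closed = u[n].copy()                       # sum-of-products form
for i in range(1, n):
    closed += u[i] * np.prod(z[i + 1:n + 1], axis=0)

assert np.allclose(h, closed)
```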

SLIDE 133

Weighted Finite-State Automata!

s0 →f0→1(v, θ)→ s1, with self-loops: 1 on s0, f1→1(v, θ) on s1

SLIDE 134

Weighted Finite-State Automata!

s0 →f0→1(v, θ)→ s1, with self-loops: 1 on s0, f1→1(v, θ) on s1

◮ Soft Pattern: W
◮ Ignore the self-loops for simplicity

SLIDE 135

Weighted Finite-State Automata!

s0 →f0→1(v, θ)→ s1, with self-loops: 1 on s0, f1→1(v, θ) on s1

◮ Soft Pattern: W
◮ Ignore the self-loops for simplicity
◮ S2(v1:vn) = Σ_{i=1..n−1} ( f0→1(vi, θ) · Π_{j=i+1..n} f1→1(vj, θ) ) + f0→1(vn, θ)

SLIDE 136

Strongly-Typed RNNs are Rational!

Can Be Computed Using a Set of WFSAs

hn = Σ_{i=1..n−1} ( ui · Π_{j=i+1..n} zj ) + un

S2(v1:vn) = Σ_{i=1..n−1} ( f0→1(vi, θ) · Π_{j=i+1..n} f1→1(vj, θ) ) + f0→1(vn, θ)

SLIDE 137

Work in Progress 3: Make your own Deep Model!

Deep Model

s0 → s1 → s2 → s3 → s4

SLIDE 138

Work in Progress 3: Make your own Deep Model!

Deep Model

s0 → s1 → s2 → s3 → s4
s0 → s1 → s2 → s3 → s4
s0 → s1 → s2 → s3 → s4

SLIDE 139

David Balduzzi and Muhammad Ghifary. 2016. Strongly-typed recurrent neural networks. In Proc. of ICML.

James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. 2017. Quasi-recurrent neural networks. In Proc. of ICLR.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Proc. of SSST.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Jakob N. Foerster, Justin Gilmer, Jan Chorowski, Jascha Sohl-Dickstein, and David Sussillo. 2017. Intelligible language modeling with input switched affine networks. In Proc. of ICML.

Yuan Gao and Dorota Glowacka. 2016. Deep gate recurrent neural network. In Proc. of ACML, pages 350–365.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Kenton Lee, Omer Levy, and Luke Zettlemoyer. 2017. Recurrent additive networks. arXiv:1705.07393.

SLIDE 140

Tao Lei, Hrishikesh Joshi, Regina Barzilay, Tommi Jaakkola, Kateryna Tymoshenko, Alessandro Moschitti, and Lluís Màrquez. 2016. Semi-supervised question retrieval with gated convolutions. In Proc. of NAACL.

Tao Lei, Yu Zhang, Sida I. Wang, Hui Dai, and Yoav Artzi. 2018. Simple recurrent units for highly parallelizable recurrence. In Proc. of EMNLP.

Tomas Mikolov, Armand Joulin, Sumit Chopra, Michaël Mathieu, and Marc'Aurelio Ranzato. 2014. Learning longer memory in recurrent neural networks. arXiv:1412.7753.

Marcel Paul Schützenberger. 1961. On the definition of a family of automata. Information and Control, 4(2–3):245–270.

Roy Schwartz, Sam Thomson, and Noah A. Smith. 2018. SoPa: Bridging CNNs, RNNs, and weighted finite-state machines. In Proc. of ACL.

A. Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269.