Towards Interpretable Deep Learning for Natural Language Processing
Roy Schwartz
University of Washington & Allen Institute for Artificial Intelligence
December 2018
1 / 37
(Deep-Learning-Based) AI Today
2 / 37
Deep Learning
3 / 37
Case Study: Sentiment Analysis
input: words → word embeddings → sequence encoders → prediction
Main component in
◮ Machine translation
◮ Question answering
◮ Text summarization
◮ Sentiment analysis
◮ Information extraction
◮ . . .
4 / 37
◮ Background: Weighted Finite-State Automata
◮ Neural Weighted Finite-State Automata
◮ Existing Deep Models as Weighted Finite-State Automata
◮ Case Study: Convolutional neural networks
5 / 37
Regular Expressions (Patterns)
6 / 37
Each Transition Defines a Weight Function
◮ (Weighted) pattern: such a great talk
◮ Weights are typically pre-specified
◮ The score of a sequence is the sum of transition scores (see the sketch below)
7 / 37
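To make this concrete, here is a minimal sketch of such a pre-specified weighted pattern. The word lists and all weights below are illustrative inventions, not values from the talk:

```python
# A minimal sketch of a classical weighted pattern: each of the four
# transitions assigns a pre-specified weight to the word it consumes.
TRANSITION_WEIGHTS = [
    {"such": 1.0, "what": 0.8},          # transition s0 -> s1
    {"a": 1.0, "an": 0.9},               # transition s1 -> s2
    {"great": 1.0, "wonderful": 0.9},    # transition s2 -> s3
    {"talk": 1.0},                       # transition s3 -> s4
]

def pattern_score(phrase):
    """Score a 4-word phrase as the sum of its transition scores.

    Unlisted words get weight 0 here; a hard pattern would reject them.
    """
    words = phrase.split()
    assert len(words) == len(TRANSITION_WEIGHTS)
    return sum(w.get(word, 0.0) for w, word in zip(TRANSITION_WEIGHTS, words))

print(pattern_score("such a great talk"))      # 4.0
print(pattern_score("what a wonderful talk"))  # 3.7
```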
◮ Background: Weighted Finite-State Automata
◮ Neural Weighted Finite-State Automata
◮ Existing Deep Models as Weighted Finite-State Automata
◮ Case Study: Convolutional neural networks
8 / 37
◮ such a great talk
◮ such a wonderful talk, such a lovely talk
◮ Naive solution: add a separate pattern for each variant
◮ Problem: not scalable
  ◮ what a great talk, such an awesome talk
9 / 37
Schwartz et al., ACL 2018
◮ Step 1: word → R^d
  ◮ Word embeddings
  ◮ Similar words are encoded as similar vectors
◮ Step 2: Accept all word vectors
◮ Step 3: Weights: fθ : R^d → R (see the sketch below)
  ◮ These functions favor specific words
  ◮ θ parameters are learned
10 / 37
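A minimal sketch of Steps 1 and 3, with invented 3-dimensional embeddings and a fixed θ (in the real model θ is learned and the embeddings are much larger):

```python
import numpy as np

# Sketch of one neural transition: a learned function f_theta(v) = theta . v
# that maps a word vector to a score. Toy 3-d embeddings; all values invented.
EMB = {
    "great":   np.array([ 0.9,  0.1, 0.0]),
    "awesome": np.array([ 0.8,  0.2, 0.1]),
    "bad":     np.array([-0.7,  0.1, 0.2]),
}

theta = np.array([1.0, 0.0, -0.5])  # learned in practice; fixed here for illustration

def f_theta(v, theta):
    """Transition weight: favors words whose vectors align with theta."""
    return float(theta @ v)

for word, v in EMB.items():
    print(word, round(f_theta(v, theta), 2))
# "great" and "awesome" score high; "bad" scores low.
```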
Schwartz et al., ACL 2018
◮ Neural transitions accept all words, but favor specific words
◮ Example 1: great
  ◮ high score: great, awesome, good
  ◮ low score: bad, child, three
◮ Example 2: the
  ◮ high score: the, a, an
  ◮ low score: car, love, well
11 / 37
Schwartz et al., ACL 2018
v – word vectors; θ = (θ0, θ1, θ2, θ3) – learned parameters
◮ Neural WFSAs accept any sequence,1 but prefer certain sequences (see the sketch below)
◮ Example 1: such a great talk
  ◮ high score: what a great talk, such an awesome talk
  ◮ low score: such a horrible talk, such a black cat, john went to school
◮ Example 2: is not very exciting
  ◮ high score: is not particularly exciting, are not very inspiring
1Pending length constraints
12 / 37
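A sketch of how a full neural WFSA scores a 4-word window as a sum of per-transition scores θj · vj. All sizes and values are toy stand-ins:

```python
import numpy as np

d = 3  # toy embedding size; real models use hundreds of dimensions

# One theta vector per transition, theta = (theta_0, ..., theta_3).
# Randomly initialized here; learned from data in the actual model.
rng = np.random.default_rng(0)
thetas = rng.normal(size=(4, d))

def wfsa_score(word_vectors, thetas):
    """Score of a 4-word window: sum of per-transition scores theta_j . v_j."""
    return sum(float(theta @ v) for theta, v in zip(thetas, word_vectors))

phrase = rng.normal(size=(4, d))  # stand-in for embeddings of e.g. "such a great talk"
print(wfsa_score(phrase, thetas))
```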
Formally
◮ Training input
  ◮ Pattern: s0 →fθ0(v)→ s1 →fθ1(v)→ s2 →fθ2(v)→ s3 →fθ3(v)→ s4
  ◮ Word embeddings: word → R^d
  ◮ Training data: <document, sentiment label> pairs
◮ Training output
  ◮ Parameter values: θ
◮ Test input
  ◮ Word embeddings: word → R^d
  ◮ Learned parameters: θ
  ◮ New data: <document>
◮ Test output
  ◮ Prediction: <sentiment label>
◮ Standard training procedure (sketched below)
  ◮ Backpropagation
  ◮ Stochastic gradient descent
13 / 37
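A minimal PyTorch sketch of this setup, assuming a single pattern and a linear output layer; the sizes, names, and toy data are illustrative, not the paper's code:

```python
import torch
import torch.nn as nn

d, pattern_len = 50, 4  # toy embedding size and pattern length

class OnePatternClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # theta = (theta_0, ..., theta_3), one vector per transition
        self.thetas = nn.Parameter(0.1 * torch.randn(pattern_len, d))
        self.out = nn.Linear(1, 2)  # pattern score -> 2 sentiment classes

    def forward(self, doc):  # doc: (num_words, d) word-embedding matrix
        # Score each length-4 window as sum_j theta_j . v_{i+j}; max-pool.
        scores = torch.stack([
            (self.thetas * doc[i:i + pattern_len]).sum()
            for i in range(doc.size(0) - pattern_len + 1)
        ])
        return self.out(scores.max().view(1, 1))  # logits, shape (1, 2)

model = OnePatternClassifier()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

doc, label = torch.randn(12, d), torch.tensor([1])  # toy <document, label> pair
for _ in range(5):
    opt.zero_grad()
    loss = loss_fn(model(doc), label)  # forward pass
    loss.backward()                    # backpropagation
    opt.step()                         # SGD update
```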
Informed Model Development
[Figure: a sequence of WFSA variants derived from the base pattern s0–s4, e.g., shorter patterns, ε-transitions fθε(), and self-loops]
14 / 37
◮ They are neural
  ◮ Backpropagation
  ◮ Stochastic gradient descent
  ◮ PyTorch, TensorFlow, AllenNLP
◮ Coming up:
  ◮ Many deep models are mathematically equivalent to neural WFSAs
  ◮ A (new) joint framework
  ◮ Allows extension of these models
15 / 37
◮ Background: Weighted Finite-State Automata
◮ Neural Weighted Finite-State Automata
◮ Existing Deep Models as Weighted Finite-State Automata
◮ Case Study: Convolutional neural networks
16 / 37
A Linear-Kernel Filter with Max-Pooling
Word vectors: v1 v2 v3 v4 v5 v6 v7
Sθ(v1:v4) = Σ_{j=1..4} θj · vj
(θ – learnable parameters; v – word vectors)
17 / 37
Schwartz et al., ACL 2018
◮ fθj(v) = θj · v
◮ sθ(v1:v4) = Σ_{j=1..4} fθj(vj)
◮ The filter thus corresponds to the WFSA s0 → s1 → s2 → s3 → s4 (equivalence checked in the sketch below)
18 / 37
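A quick numerical check of this equivalence, with toy sizes and random parameters: the CNN view (slide the filter, then max-pool) and the WFSA view (max over per-window sums of transition scores) give the same number:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 9
thetas = rng.normal(size=(4, d))   # filter weights = transition parameters
V = rng.normal(size=(n, d))        # word vectors v_1 ... v_n

# CNN view: slide the width-4 linear-kernel filter, then max-pool.
conv = [np.sum(thetas * V[i:i + 4]) for i in range(n - 4 + 1)]
cnn_out = max(conv)

# WFSA view: per window, sum the transition scores f_{theta_j}(v) = theta_j . v.
wfsa = [sum(thetas[j] @ V[i + j] for j in range(4)) for i in range(n - 4 + 1)]
wfsa_out = max(wfsa)

assert np.isclose(cnn_out, wfsa_out)  # the two views agree
```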
Schwartz et al., ACL 2018
[WFSA diagram: s0 s1 s2 s3 s4]
◮ E.g., “such a great talk”
  ◮ what a great song
  ◮ such an awesome movie
20 / 37
Schwartz et al., ACL 2018
◮ Language patterns are often flexible-length
  ◮ such a great talk
  ◮ such a great, funny, interesting talk
  ◮ such great shoes
◮ [WFSA diagram: s0 → s1 → s2 → s3 → s4 over “such a great talk”, with a self-loop for “funny, interesting” and an ε-transition for the skipped word]
◮ A dynamic-programming sketch of this flexible matching follows.
21 / 37
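The sketch below implements flexible-length matching with self-loops and ε-transitions as a Viterbi-style dynamic program in the max-sum semiring. The parameterization (dot-product transition scores, one score vector per self-loop, scalar ε scores) is an assumption for illustration; the paper's exact formulation may differ:

```python
import numpy as np

def best_match(V, theta_main, theta_self, eps):
    """V: (n, d) word vectors. theta_main, theta_self: (k, d) transition
    parameters; eps: (k,) epsilon-transition scores. Returns the best
    path score from state 0 to state k over any span of V."""
    n = V.shape[0]
    k = theta_main.shape[0]
    state = np.full(k + 1, -np.inf)    # state[j]: best score of being in state j
    state[0] = 0.0
    best = -np.inf
    for i in range(n):
        new = np.full(k + 1, -np.inf)
        new[0] = 0.0                   # the pattern may start at any position
        for j in range(k):
            advance = state[j] + float(theta_main[j] @ V[i])   # consume word, move on
            loop = state[j + 1] + float(theta_self[j] @ V[i])  # consume word, stay
            new[j + 1] = max(advance, loop)
        for j in range(k):             # epsilon: advance without consuming a word
            new[j + 1] = max(new[j + 1], new[j] + eps[j])
        state = new
        best = max(best, state[k])     # the pattern may end at any position
    return best

rng = np.random.default_rng(1)
V = rng.normal(size=(8, 4))            # a toy 8-word "document"
print(best_match(V, rng.normal(size=(4, 4)),
                 rng.normal(size=(4, 4)), rng.normal(size=4)))
```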
[Results chart: SoPa (ours) vs. ConvNet]
22 / 37
Schwartz et al., ACL 2018
[Learning curves: accuracy (%) vs. training-set size (100–10,000 examples) for SoPa (ours), ConvNet, and LSTM]
23 / 37
Soft Patterns!
◮ For each learned pattern, extract the 4 top-scoring phrases in the training set

Pattern 1 [WFSA s0–s4, with a transition labeled “a”] – Highest Scoring Phrases:
  mesmerizing portrait of a
  engrossing portrait of a
  clear-eyed portrait of an
  fascinating portrait of a

Pattern 2 [WFSA s0–s4, with a transition labeled “and”] – Highest Scoring Phrases:
  honest , and enjoyable
  forceful , and beautifully
  energetic , and surprisingly
  unpretentious , charming (SL) , quirky   (SL: word matched by a self-loop)
24 / 37
[Diagram: from a fixed linear-kernel pattern (s0–s4) to a general neural WFSA with arbitrary transition functions f()]
25 / 37
Peng, Schwartz et al., EMNLP 2018
◮ Mikolov et al., arXiv 2014 – [WFSA: s0 s1]
◮ Balduzzi and Ghifary, ICML 2016; Bradbury et al., ICLR 2017; Lei et al., EMNLP 2018; Lei et al., NAACL 2016 – [WFSA: s0 s1 s2]
◮ Foerster et al., ICML 2017 – [WFSA: s0 s1 s2 s3]
◮ Six recent recurrent neural network (RNN) models are also implicitly computing weighted finite-state automata
26 / 37
◮ s0 s1 – Mikolov et al. (2014); Balduzzi and Ghifary (2016); Bradbury et al. (2017); Lei et al. (2018)
◮ s0 s1 s2 – Lei et al. (2016)
◮ s0 s1 s2 – Peng, Schwartz et al. (2018)
27 / 37
Peng, Schwartz et al., EMNLP 2018
28 / 37
Peng, Schwartz et al., EMNLP 2018
29 / 37
◮ Elman RNN: hi = σ(W hi−1 + U vi + b)
◮ The interaction between hi and hi−1 is via affine transformations followed by a non-linearity
◮ Same for LSTM
◮ Most probably not equivalent to a WFSA (see the contrast sketched below)
31 / 37
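A toy side-by-side of the two recurrences (random parameters, invented sizes): the Elman cell passes hi−1 through an affine map and a non-linearity, while a T-RNN-style cell hi = zi ⊙ hi−1 + ui keeps hi−1 linear, which is what makes the WFSA view possible:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W, U, b = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)

def elman_step(h_prev, v):
    # affine transformation of h_{i-1} and v_i, then a squashing non-linearity
    return np.tanh(W @ h_prev + U @ v + b)

def trnn_step(h_prev, v):
    # gates z_i, u_i depend only on v_i; h_{i-1} enters linearly
    z = 1.0 / (1.0 + np.exp(-(W @ v)))   # sigmoid gate
    u = np.tanh(U @ v)
    return z * h_prev + u

h_e = h_t = np.zeros(d)
for v in rng.normal(size=(6, d)):        # a toy 6-word sequence
    h_e, h_t = elman_step(h_e, v), trnn_step(h_t, v)
print(h_e, h_t)
```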
[Summary diagram: the WFSA patterns (s0–s4) underlying the models discussed]
32 / 37
input: words
  Schwartz et al., EMNLP 2013; Schwartz et al., COLING 2014
word embeddings
  Schwartz et al., CoNLL 2015; Rubinstein et al., ACL 2015; Schwartz et al., NAACL 2016; Vulić et al., CoNLL 2017; Peters et al., 2018
sequence encoders
  Schwartz et al., ACL 2018; Peng et al., EMNLP 2018; Liu et al., RepL4NLP 2018 *best paper award*
prediction
Labeled datasets: <sentence, label> pairs
  Schwartz et al., ACL 2011; Schwartz et al., COLING 2012; Schwartz et al., CoNLL 2017; Gururangan et al., NAACL 2018; Kang et al., NAACL 2018; Zellers et al., EMNLP 2018
33 / 37
Schwartz et al., CoNLL 2017; Gururangan, Swayamdipta, Levy, Schwartz et al., NAACL 2018
Textual Entailment (state of the art: ∼90% accuracy)
  Premise: A person is running on the beach
  Hypothesis: The person is sleeping
  entailment? contradiction? neutral?
AllenNLP Demo!
◮ The word “sleeping” is over-represented in training examples labeled contradiction
  ◮ an annotation artifact
◮ State-of-the-art models latch onto this word rather than understanding the text
◮ Models are not as strong as we think they are
34 / 37
◮ Explainable models
◮ Unbiased models
35 / 37
Li Zilles
Dana Rubinstein
Effi Levi
36 / 37
[Summary diagram: neural WFSAs with transition functions f()]
Roy Schwartz
homes.cs.washington.edu/~roysch/
roysch@cs.washington.edu
37 / 37
back to main
A single pattern slides over the sequence, scoring each 4-word window:
  s(v1:v4, θ(1)):  s0 →fθ(1)0(v1)→ s1 →fθ(1)1(v2)→ s2 →fθ(1)2(v3)→ s3 →fθ(1)3(v4)→ s4   (“such a great talk”)
  s(v2:v5, θ(1)), s(v3:v6, θ(1)), s(v4:v7, θ(1)) – the same pattern shifted by one word each time
Max-pooling keeps the best window per pattern:
  maxi s(vi:vi+3, θ(1))   – e.g., “such a great talk”
  maxi s(vi:vi+3, θ(2))   – e.g., “is remarkably dull !”
  maxi s(vi:vi+3, θ(k))   – e.g., “gorgeous and witty movie”
1 / 11
◮ Running the Viterbi (1967) algorithm on a sequence of n tokens and a WFSA of d states:
  ◮ We only allow zero or one ε-transition at a time ⇒ O(d²n)
  ◮ We only allow self-loop and main-path transitions ⇒ O(dn)
  ◮ Scores of all patterns can be computed in parallel (sketched below)
◮ GPU optimization further reduces the observed runtime to be sublinear in d
2 / 11
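A sketch of the parallelism point, under the self-loop + main-path restriction: one vectorized Viterbi step updates all k patterns and all d states at once, so each token costs a few (k, d)-shaped array operations. The shapes and dot-product parameterization are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, n, dim = 8, 4, 20, 16           # patterns, states, tokens, embedding size
Tm = rng.normal(size=(k, d, dim))     # main-path transition parameters
Ts = rng.normal(size=(k, d, dim))     # self-loop parameters

state = np.full((k, d + 1), -np.inf)  # state[p, j]: best score for pattern p in state j
state[:, 0] = 0.0
best = np.full(k, -np.inf)
for v in rng.normal(size=(n, dim)):   # one pass over the (toy) sequence
    main = Tm @ v                     # (k, d): scores for advancing a state
    self_ = Ts @ v                    # (k, d): scores for looping in place
    new = np.maximum(state[:, :-1] + main, state[:, 1:] + self_)
    # re-open state 0 so each pattern may start at any position
    state = np.concatenate([np.zeros((k, 1)), new], axis=1)
    best = np.maximum(best, state[:, -1])
print(best)                           # max-pooled score per pattern
```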
Visualizing Sentiment Predictions
◮ Leave-one-out method on all patterns (a sketch follows)
◮ Visualize the spans with the largest positive and negative contributions
Analyzed documents:
  it’s dumb , but more importantly , it’s just not scary
  While its careful pace and seemingly opaque story may not satisfy every moviegoer’s appetite, the film’s final scene is soaringly, transparently moving
3 / 11
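A sketch of the leave-one-out idea. The "model" here is just an additive sum of per-pattern max window scores, a stand-in for the real classifier: the span a pattern matches best is the span to visualize, and the score drop when the pattern is left out is its contribution:

```python
import numpy as np

rng = np.random.default_rng(2)
k, d, n, width = 5, 8, 12, 4               # toy sizes
thetas = rng.normal(size=(k, width, d))    # k learned patterns (stand-ins)
V = rng.normal(size=(n, d))                # toy document embeddings

def window_scores(theta, V):
    """Score of each width-4 window under one pattern."""
    return [float(np.sum(theta * V[i:i + width])) for i in range(n - width + 1)]

per_pattern = [window_scores(t, V) for t in thetas]
maxes = [max(s) for s in per_pattern]
full = sum(maxes)                          # "model" score with all patterns

for p in range(k):
    without_p = full - maxes[p]            # score with pattern p left out
    contribution = full - without_p        # = maxes[p] for this additive stand-in
    span_start = int(np.argmax(per_pattern[p]))
    print(f"pattern {p}: span [{span_start}, {span_start + width}), "
          f"contribution {contribution:+.2f}")
```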
Liu, Levy, Schwartz et al., RepL4NLP 2018, best paper award
◮ Non-linguistic task
◮ Although they weren’t designed that way, LSTMs do much better when trained on language data
4 / 11
[WFSA diagram: s0 s1 s2]
5 / 11
Recurrent function: hi = f(hi−1, vi)
6 / 11
◮ Elman (1990)
◮ LSTM (Hochreiter and Schmidhuber, 1997)
◮ GRU (Cho et al., 2014)
◮ SGU (Gao and Glowacka, 2016)
◮ RAN (Lee et al., 2017)
◮ SCRN (Mikolov et al., 2014)
◮ T-RNN (Balduzzi and Ghifary, 2016)
◮ RCNN (Lei et al., 2016)
◮ Q-RNN (Bradbury et al., 2017)
◮ ISAN (Foerster et al., 2017)
◮ SoPa (Schwartz et al., 2018)
◮ SRU (Lei et al., 2018)
◮ What do different RNN variants have in common?
◮ What are they learning?
◮ Can we improve them?
7 / 11
Balduzzi and Ghifary (2016)
◮ A simple, competitive RNN
◮ Draws inspiration from physics and functional programming
◮ hi = zi ⊙ hi−1 + ui
  ◮ zi, ui are non-linear parameterized functions of vi
◮ Consider one dimension k and write xi for [xi]k; unrolling the recurrence gives (checked numerically below)
  hn = Σ_{i=1..n−1} ( Π_{j=i+1..n} zj ) · ui + un
8 / 11
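A quick numerical check of this unrolling with toy scalar values: iterating the recurrence and evaluating the closed-form sum give the same hn:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
z = rng.uniform(0.1, 0.9, size=n)   # toy gate values z_1 ... z_n
u = rng.normal(size=n)              # toy update values u_1 ... u_n

h = 0.0
for i in range(n):                  # the recurrence h_i = z_i * h_{i-1} + u_i
    h = z[i] * h + u[i]

# closed form: sum_i u_i * prod_{j > i} z_j (the i = n term is just u_n)
closed = sum(u[i] * np.prod(z[i + 1:]) for i in range(n))
assert np.isclose(h, closed)
print(h, closed)
```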
◮ Soft Pattern: W
◮ Ignore the self-loops for simplicity
◮ S2(v1:vn) = Σ_{i=1..n−1} f1(vi, θ) · ( Π_{j=i+1..n} f2(vj, θ) ) + f1(vn, θ)
9 / 11
Can Be Computed Using a Set of WFSAs
hn = Σ_{i=1..n−1} ( Π_{j=i+1..n} zj ) · ui + un
   = Σ_{i=1..n−1} f1(vi, θ) · ( Π_{j=i+1..n} f2(vj, θ) ) + f1(vn, θ)
10 / 11
Deep Model
[Diagram: stacking WFSA layers – the output of one layer of patterns (s0–s4) feeds the next]
11 / 11