Rational Recurrences for Empirical Natural Language Processing
Noah Smith
University of Washington & Allen Institute for Artificial Intelligence
nasmith@cs.washington.edu · noah@allenai.org · @nlpnoah
A Bit of History
Rule-based NLP (1980s and before)
- E.g., lexicons and regular expression pattern matching
- Information extraction
Statistical NLP (1990s-2000s)
- Probabilistic models over features derived from rule-based NLP
- Sentiment/opinion analysis, machine translation
Neural NLP (2010s)
- Vectors, matrices, tensors, and lots of nonlinearities
Interpretability? Guarantees?
Outline
- 1. An interpretable neural network inspired by rule-based NLP: SoPa
“Bridging CNNs, RNNs, and weighted finite-state machines,” Schwartz et al., ACL 2018
- 2. A restricted class of RNNs that includes SoPa: rational recurrences
“Rational recurrences,” Peng et al., EMNLP 2018
- 3. More compact rational RNNs using sparse regularization
work under review
- 4. A few parting shots
Patterns
- Lexical semantics
(Hearst, 1992; Lin et al., 2003; Snow et al., 2006; Turney, 2008; Schwartz et al., 2015)
- Information extraction
(Etzioni et al., 2005)
- Document classification
(Tsur et al., 2010; Davidov et al., 2010; Schwartz et al., 2013)
- Text generation
(Araki et al., 2016)
good fun, good action, good acting, good dialogue, good pace, good cinematography.
flat, misguided comedy. long before it’s over, you’ll be thinking of 51 ways to leave this loser.
Patterns from Lexicons and Regular Expressions
[Pattern automaton: q0 → q1 on {mesmerizing, engrossing, clear-eyed, fascinating, self-assured, …} with a wildcard (*) self-loop; q1 → q2 on “portrait”; q2 → q3 on “of”; q3 → q4 on {a, an, the, …, *, ε}.]
Weighted Patterns
[Same pattern automaton, now with weighted transitions: q0 → q1 on mesmerizing : 2.0, engrossing : 1.8, clear-eyed : 1.6, fascinating : 1.4, self-assured : 1.3, …, * : 1; q1 → q2 on portrait : 1.0; q2 → q3 on of : 1.0; q3 → q4 on a : 1.1, an : 1.1, the : 1.1, …, * : 1, ε : 1.]
Example scores:
- a mesmerizing portrait of an engineer : 1 × 2.0 × 1 × 1 × 1.1 × 1 = 2.2
- the most fascinating portrait of students : 1 × 1 × 1.4 × 1 × 1 × 1.1 × 1 = 1.5
- a clear-eyed picture of the modern : 0
- flat, misguided comedy : 0
Soft Patterns (SoPa)
Score word vectors instead of keeping a separate weight for each word.
Each transition qi → qj has parameters wi→j and bi→j, and scores word x as ti,j(x) = σ(wi→j · vx + bi→j), where vx is your favorite embedding for word x.
Soft Patterns (SoPa)
Flexible-length patterns: l + 1 states with self-loops
[Pattern automaton with l + 1 states q0, …, ql: main-path transitions x ↦ t0,1(x), x ↦ t1,2(x), …, x ↦ tl-1,l(x); self-loops x ↦ t1,1(x), x ↦ t2,2(x), … on the interior states; and weight-1 self-loops (x ↦ 1) on q0 and ql.]
Soft Patterns (SoPa)
T(x) = the (l + 1) × (l + 1) band matrix with the self-loop scores on its diagonal (1, t1,1(x), t2,2(x), t3,3(x), …, 1) and the forward-transition scores on its superdiagonal (t0,1(x), t1,2(x), t2,3(x), …, tl-1,l(x)); all other entries are 0.
Transition matrix has O(l) parameters
SoPa Sequence-Scoring: Matrix Multiplication
matchScore(“flat , misguided comedy .”) = wstart⊤ T(flat) T(,) T(misguided) T(comedy) T(.) wend
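To make the matrix form concrete, here is a minimal Python/NumPy sketch of this scoring, assuming one weight vector and bias per scored transition as on the previous slides; the parameter layout, names, and the corner entries fixed to 1 are illustrative assumptions, not the released soft_patterns implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def transition_matrix(v_x, W, b, num_states):
    """Build the (l+1) x (l+1) band matrix T(x) for one word vector v_x.

    W and b hold one weight vector / bias per scored transition: main-path
    transitions t_{i,i+1}(x) and interior self-loops t_{i,i}(x) are scored
    with sigmoid(w . v_x + b); the corner entries are fixed to 1 as in the
    T(x) matrix on the slide. Shapes and layout are illustrative only.
    """
    T = np.zeros((num_states, num_states))
    T[0, 0] = 1.0                      # fixed corner entry
    T[-1, -1] = 1.0                    # fixed corner entry
    for i in range(num_states - 1):
        T[i, i + 1] = sigmoid(W["main"][i] @ v_x + b["main"][i])
        if i > 0:                      # self-loops on interior states
            T[i, i] = sigmoid(W["self"][i] @ v_x + b["self"][i])
    return T

def match_score(word_vectors, W, b, num_states):
    """matchScore = w_start^T T(x_1) ... T(x_n) w_end, computed left to right."""
    h = np.zeros(num_states)
    h[0] = 1.0                         # w_start: all mass on the initial state
    for v_x in word_vectors:
        h = h @ transition_matrix(v_x, W, b, num_states)
    return h[-1]                       # w_end: read off the final state
```

Because each T(x) depends only on its own word, the transition scores for all positions and all patterns can be computed in parallel; only the short running product over a span is sequential.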
Two-SoPa Recurrent Neural Network
[Diagram: the sentence “Fielding’s funniest and most likeable book in years” is read by two patterns in parallel; each word vector updates each pattern’s state scores from START to END, and the END-state scores are max-pooled over positions.]
Experiments
- 200 SoPas, each with 2–6 states
- Text input is fed to all 200 patterns in parallel
- Pattern match scores fed to an MLP, with end-to-end training
- Datasets:
- Amazon electronic product reviews (20K), binarized (McAuley & Leskovec, 2013)
- Stanford sentiment treebank (7K): movie review sentences, binarized (Socher et al., 2013)
- ROCStories (3K): story cloze, only right/wrong ending, no story prefix (i.e., style) (Mostafazadeh et al., 2016)
- Baselines:
- LR with hard patterns (Davidov & Rappoport, 2008; Tsur et al., 2010)
- one-layer CNN with max-pooling (Kim, 2014)
- deep averaging network (Iyyer et al., 2015)
- one-layer biLSTM (Zhou et al., 2016)
- Hyperparameters tuned for all models by random search; see the paper’s appendix
Results: hard, CNN, DAN, biLSTM, SoPa
[Plot: accuracy (60–100) vs. number of parameters (log scale, 10^3–10^7) on ROC, SST, and Amazon.]
Results: hard, CNN, DAN, biLSTM, SoPa
[Plot: Amazon accuracy (65–90) vs. number of training instances (100–10,000).]
Notes
- We also include ε-transitions.
- We can replace addition operations with max, so that the recurrence equates to the Viterbi algorithm for WFSAs (a sketch follows these notes).
- Without self-loops, ε-transitions, and the sigmoid, SoPa becomes a convolutional neural network (LeCun, 1998).
Lots more experiments and details in the paper!
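As a sketch of the max variant mentioned above: replacing the sum in the matrix-vector update with a max turns the forward score into the score of the single best path, i.e. Viterbi decoding for the pattern WFSA. This reuses the illustrative transition_matrix sketch from earlier; it is not code from the paper.

```python
import numpy as np

def viterbi_match_score(word_vectors, W, b, num_states, transition_matrix):
    """Same recurrence as match_score, but in the (max, x) semiring:
    the result is the score of the single best path through the pattern."""
    h = np.zeros(num_states)
    h[0] = 1.0
    for v_x in word_vectors:
        T = transition_matrix(v_x, W, b, num_states)
        # h_new[j] = max_i h[i] * T[i, j]  (max replaces the sum)
        h = (h[:, None] * T).max(axis=0)
    return h[-1]
```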
Interpretability (Negative Patterns)
- it’s dumb, but more importantly, it’s just not scary
- though moonlight mile is replete with acclaimed actors and actresses and tackles a subject that’s potentially moving, the movie is too predictable and too self-conscious to reach a level of high drama
- While its careful pace and seemingly opaque story may not satisfy every moviegoer’s appetite, the film’s final scene is soaringly, transparently moving
- the band’s courage in the face of official repression is inspiring, especially for aging hippies (this one included).
Interpretability (Positive Patterns)
(Same four example sentences as on the previous slide; here the highlighted spans are those matched by positive patterns.)
Interpretability (One SoPa)
Summary So Far
- SoPa: an RNN that
- equates to WFSAs that score sequences of word vectors
- calculates those scores in parallel
- works well for text classification tasks
- RNNs don’t have to be inscrutable and disrespectful of theory.
https://github.com/Noahs-ARK/soft_patterns
Rational Recurrences
A recurrent network is rational* if its hidden state can be calculated by an array of weighted FSAs over some semiring whose operations take constant time and space.
*We are using standard terminology: “rational” is to weighted FSAs as “regular” is to (unweighted) FSAs (e.g., “rational series,” Sakarovitch, 2009; “rational kernels,” Cortes et al., 2004).
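The definition can be made concrete in a few lines: the hidden state is an array of WFSA forward scores, and each token applies one semiring matrix-vector update. The sketch below is a generic illustration of that update, not code from any of the cited papers.

```python
import numpy as np

def rational_step(h_prev, T_x, plus=np.add, times=np.multiply):
    """One step of a rational recurrence: the hidden state is the forward-score
    vector of a WFSA, updated as h_t[j] = (+)_i  h_{t-1}[i] (x) T(x_t)[i, j]
    over some semiring. With (plus, times) = (np.add, np.multiply) this is the
    usual sum-product update; (np.maximum, np.multiply) or (np.maximum, np.add)
    give max-times / max-plus variants. Constant time and space per state."""
    return plus.reduce(times(h_prev[:, None], T_x), axis=0)
```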
Simple Recurrent Unit (Lei et al., 2017)
[Two-state WFSA: a weight-1 self-loop on q0, a q0 → q1 transition weighted (1 – f(x))⊙z(x), and a self-loop on q1 weighted f(x).]
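Reading the forward score of state q1 off this two-state machine gives the SRU-style cell recurrence ct = f(xt)⊙ct-1 + (1 – f(xt))⊙z(xt). A minimal NumPy sketch follows; parameter shapes and names are illustrative, and the full SRU of Lei et al. (2017) adds highway/output components on top of this recurrence.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_cell_states(x_seq, W_z, W_f, b_f):
    """Cell-state recurrence read off the two-state WFSA above:
    q0 has a self-loop of weight 1, the q0->q1 transition is weighted
    (1 - f(x)) * z(x), and q1 has a self-loop weighted f(x), so the
    forward score at q1 is c_t = f(x_t)*c_{t-1} + (1 - f(x_t))*z(x_t)."""
    c = np.zeros(W_z.shape[0])
    cells = []
    for x in x_seq:
        z = W_z @ x                      # candidate value
        f = sigmoid(W_f @ x + b_f)       # forget gate
        c = f * c + (1.0 - f) * z        # forward score at q1, per dimension
        cells.append(c)
    return cells
```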
Some Rational Recurrences
- SoPa (Schwartz et al., 2018)
- Simple recurrent unit (Lei et al., 2017)
- Input switched affine network (Foerster et al., 2017)
- Structurally constrained (Mikolov et al., 2014)
- Strongly-typed (Balduzzi and Ghifary, 2016)
- Recurrent convolution (Lei et al., 2016)
- Quasi-recurrent (Bradbury et al., 2017)
- New models!
Rational Recurrences and Others
[Venn diagram over functions mapping strings to real vectors: FSAs sit inside WFSAs / rational recurrences, which include convolutional neural nets (Schwartz et al., 2018); conjecture: Elman-style networks, LSTMs, and GRUs lie outside the rational class.]
(this morning, Ariadna talked about the connection between WFSAs and linear Elman networks)
“Unigram” and “Bigram” Models
- Unigram: at least one transition from the initial state to the final state. (“Example 6” in the paper; close to SRU, T-RNN, and SCRN.)
- Bigram: at least two transitions from the initial state to the final state.
[Diagram annotations: weighted sum; interpolation.]
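A rough sketch of how such a machine runs as an RNN cell: each hidden dimension tracks the forward scores of one pattern’s states, with self-loop and main-transition scores computed from the current word. The names and the exact gating below are illustrative assumptions, not the parameterization of Peng et al. (2018).

```python
import numpy as np

def ngram_rnn_step(h_prev, u, f):
    """One step of an n-gram-style rational recurrence.

    h_prev[k] is the forward score of pattern state q_{k+1} (q_0 is pinned to 1);
    u[k] and f[k] are the self-loop and main-transition scores for the current
    token (e.g., gated functions of its word vector). The update
        h[k] = u[k] * h_prev[k] + f[k] * (h_prev[k-1] if k > 0 else 1)
    with one main transition is the "unigram" machine; with two it is the
    "bigram" machine."""
    h = np.empty_like(h_prev, dtype=float)
    prev_from_left = 1.0                 # q_0 carries weight 1 (self-loop of 1)
    for k in range(len(h_prev)):
        h[k] = u[k] * h_prev[k] + f[k] * prev_from_left
        prev_from_left = h_prev[k]
    return h
```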
Experiments
- Datasets: PTB (language modeling);
Amazon, SST, Subjectivity, Customer Reviews (text classification)
- Baseline:
- LSTM reported by Lei et al. (2017)
- Hyperparameters follow Lei et al. for language modeling; tuned for text classification models by random search; see the paper’s appendix
Results: Language Modeling (PTB)
[Bar chart: perplexity (lower is better; axis 60–75) for the LSTM of Lei et al. (2017) (24M parameters), “Unigram,” and “Bigram” models, in 10M-parameter/2-layer and 24M-parameter/3-layer configurations.]
Results: Text Classification
(Average of Amazon, SST, Subjectivity, Customer Reviews)
[Bar chart: accuracy (axis 86–92) for LSTM, “Unigram,” and “Bigram” models.]
Summary So Far
- Many RNNs are arrays of WFSAs.
- Reduced capacity/expressive power can be beneficial.
- Theory is about one-layer RNNs; in practice 2+ layers work better.
https://github.com/Noahs-ARK/rational-recurrences
Increased Automation
- Original SoPa experiments: “200 SoPas, each with 2–6 states”
- Can we learn how many states each pattern needs?
- Relatedly, can we learn smaller, more compact models?
Sparse regularization lets us do this during parameter learning!
Sparsity and Structured Sparsity
- In linear models, the lasso (Tibshirani, 1996) penalizes the weight/parameter vector by its L1 norm: Σi |wi|
- Classic use in NLP: Kazama and Tsujii (EMNLP 2003)
- A generalization is the group lasso (Bakin, 1999; Yuan and Lin, 2006), which penalizes each group’s L2 norm: Σg λg ‖wg‖2, where wg is the subvector of parameters in group g
- If every parameter is in its own group, equivalent to lasso
- If all parameters are in one group, equivalent to ridge
[Diagram: the lasso and group-lasso penalty balls in the (w1, w2) plane.]
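In code, the group-lasso penalty from the slide above is just a sum of weighted group norms added to the task loss. A minimal NumPy sketch (group definitions and λ values are whatever the model designer chooses):

```python
import numpy as np

def group_lasso_penalty(groups, lams):
    """Sum over groups g of lambda_g * ||w_g||_2.

    With one parameter per group this reduces to a weighted L1 (lasso)
    penalty; with all parameters in one group it behaves like a ridge-style
    L2 penalty. The total training objective is task_loss + this penalty."""
    return sum(lam * np.linalg.norm(w) for w, lam in zip(groups, lams))
```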
Benefit of Sparse Lasso
- With appropriate hyperparameter assignments, many groups are driven to zero.
- E.g., we grouped weights by feature template.
- Can this work for neural models?
[Plot: Arabic dependency parsing, UAS (%) (76.5–78.5) vs. number of features (2–12 million), comparing Group-Lasso, Group-Lasso (C2F), Lasso, and filter-based (IG) selection (Martins et al., EMNLP 2011).]
Procedure
- 1. Train the model with group lasso, one group per state.
- 2. Eliminate states whose weights are close to zero.
- 3. Finetune the remaining model by minimizing unregularized loss.
[Pattern WFSA with states q0, …, q4: main-path transitions scored x ↦ f(1)(x), …, x ↦ f(4)(x); self-loops scored x ↦ u(1)(x), …, x ↦ u(4)(x); and a weight-1 self-loop (x ↦ 1) on q0. Each state’s parameters form one group for the group lasso.]
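A minimal sketch of step 2 of the procedure: after training with the group-lasso penalty (one group per pattern state), keep only the states whose parameter group has not been driven to (near) zero, then finetune the surviving states without the regularizer. The threshold and grouping below are illustrative assumptions.

```python
import numpy as np

def prune_states(state_param_groups, threshold=1e-3):
    """state_param_groups[k] is the list of parameter arrays tied to state k.
    Returns the indices of states whose group norm survives the threshold;
    the rest are eliminated before finetuning on the unregularized loss."""
    keep = []
    for k, group in enumerate(state_param_groups):
        flat = np.concatenate([p.ravel() for p in group])
        if np.linalg.norm(flat) > threshold:
            keep.append(k)
    return keep
```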
Baselines
baseline      embeddings   unigrams   bigrams   trigrams   4-grams
baseline 1    GloVe        24         –         –          –
baseline 2    GloVe        –          24        –          –
baseline 3    GloVe        –          –         24         –
baseline 4    GloVe        –          –         –          24
baseline 5    GloVe        6          6         6          6
baseline 6    BERT         12         –         –          –
baseline 7    BERT         –          12        –          –
baseline 8    BERT         –          –         12         –
baseline 9    BERT         –          –         –          12
baseline 10   BERT         3          3         3          3
Classification Accuracy vs. # Transitions
[Plots: classification accuracy vs. number of transitions; our method in orange, baselines in blue.]
Visualization
A four-pattern model for the Amazon kitchen dataset (3300 training examples). It achieves 92.0% accuracy; the best baseline was 90.8%.
[Table: for each of the four patterns, the highest- and lowest-scoring matched phrases, shown transition by transition (BERT word pieces; SL marks self-loop steps). For example, Pattern 1’s top matches include “are perfect,” “definitely recommend,” “excellent product,” and “highly recommend,” while its bottom matches include “very disappointing,” “was defective,” and “would not.”]
Summary
- Regularization techniques from pre-neural times can be applied to
increase automation/speed and decrease footprint.
Parting Shots
- Interpretability matters!
- NLP isn’t just for researchers anymore.
- It’s hard to improve a model you don’t understand.
- Constrained model families may lead to …
- better generalization (inductive bias)
- guarantees (but not today)
- Computational cost matters!
- Reducing energy footprint
- Inclusiveness in research
Thanks!
- Drivers of this work:
- Jesse Dodge (CMU LTI)
- Hao Peng (UW CSE)
- Roy Schwartz (UW CSE/AI2 → Hebrew University)
- Sam Thomson (CMU LTI → Semantic Machines)
- Sponsors:
- NSF IIS-1562364 and REU supplement
- UW Innovation award
- NVIDIA (GPU)