

SLIDE 1

Shortcut-Stacked Sentence Encoders for Multi-Domain Inference

Yixin Nie & Mohit Bansal


SLIDE 2

Task and Motivation

[https://repeval2017.github.io/shared/], [https://nlp.stanford.edu/projects/snli/]

Premise | Label | Hypothesis | Genre
The Old One always comforted Ca'daan, except today. | neutral | Ca'daan knew the Old One very well. | Fiction
Your gift is appreciated by each and every student who will benefit from your generosity. | neutral | Hundreds of students will benefit from your generosity. | Letters
yes now you know if if everybody like in August when everybody's on vacation or something we can dress a little more casual or | contradiction | August is a black out month for vacations in the company. | Telephone Speech
At the other end of Pennsylvania Avenue, people began to line up for a White House tour. | entailment | People formed a line at the end of Pennsylvania Avenue. | 9/11 Report
A black race car starts up in front of a crowd of people. | contradiction | A man is driving down a lonely road. | SNLI

Only encoding-based models are eligible for the RepEval 2017 Shared Task.

SLIDE 3

Motivation of Encoding-based Models

Encoding-based model: a model that transforms each sentence into a fixed-length vector representation and reasons using only those representations, without cross-attention between the two sentences


SLIDE 4

Motivation of Encoding-based Models

A portable neural model that transforms the source sentence into a sentence-level meaning representation

  • A plug-and-play module
  • A sentence-level knowledge unit


SLIDE 5

Existing Encoding-based Model Results

[https://repeval2017.github.io/shared/], [https://nlp.stanford.edu/projects/snli/]

300D NSE encoders (Munkhdalai & Yu 2016)

84.6% on SNLI

BiLSTM Encoder (Williams et al., 2017)

67.5%/67.1% on MultiNLI (Matched/Mismatched)

There is still much scope for improvement.


SLIDE 6

Typical Architecture of Encoding-based Model

[Architecture diagram] The premise and hypothesis are each passed through an encoder with the same structure, producing vectors v and u. The matching layer builds the feature vector [v, u, v ⊗ u, |v − u|], which an MLP maps to the prediction. The encoder is the key component; let's zoom in.
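The matching layer can be sketched in a few lines (a minimal NumPy sketch; the toy vectors are illustrative, and v ⊗ u is taken as the element-wise product):

```python
import numpy as np

def match_features(v: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Matching features [v, u, v * u, |v - u|], fed into the MLP classifier."""
    return np.concatenate([v, u, v * u, np.abs(v - u)])

# Toy encoded premise/hypothesis vectors (illustrative values only).
v = np.array([0.5, -1.0, 2.0])
u = np.array([0.5,  1.0, 0.0])
feats = match_features(v, u)   # shape (4 * 3,) = (12,)
```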

SLIDE 7

Encoder (Starting Point)

One-layer biLSTM with max pooling:

[Diagram] Fine-tuned word embeddings for the source sentence w1 … wn feed a single biLSTM layer; row max pooling over the hidden states yields the final fixed-length vector representation.

[Conneau et al., 2017]
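The pooling step can be sketched as follows (a NumPy sketch; H is a hypothetical matrix of biLSTM hidden states, not output from a real recurrent layer):

```python
import numpy as np

# H: hypothetical biLSTM outputs, shape (hidden_dim, sentence_length);
# column t is the concatenated forward/backward state for word w_t.
H = np.array([[0.1, 0.9, 0.3],
              [0.7, 0.2, 0.4]])

# Row max pooling: take the max over time for each hidden dimension,
# giving a fixed-length sentence vector regardless of sentence length.
v = H.max(axis=1)
```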

SLIDE 8

Encoder (Stacking bi-LSTM)

[Diagram] Fine-tuned word embeddings for w1 … wn feed a stack of biLSTM layers, each layer feeding the next; row max pooling over the top layer's hidden states yields the final vector representation.

By stacking biLSTM layers, the model can learn high-level semantic features that are useful for the natural language inference task.

[Simonyan et al., 2016]

SLIDE 9

Encoder (Shortcut-connection)

[Diagram] As in the stacked encoder, but each biLSTM layer's input is the word embeddings concatenated with the outputs of all lower layers (shortcut connections); row max pooling over the top layer's hidden states yields the final vector representation.

Shortcut connections help the sparse gradients from max pooling flow into the lower layers.

[Hashimoto et al., 2016]
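The shortcut wiring can be sketched as follows (a simplified sketch: `fake_bilstm` is a random-projection stand-in for a real biLSTM layer so the example runs; the layer sizes match the best ablation configuration, though a real biLSTM would output 2× its hidden size):

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_bilstm(x, out_dim):
    """Stand-in for a biLSTM layer: any map from (n, d_in) to (n, out_dim)."""
    w = rng.standard_normal((x.shape[1], out_dim))
    return np.tanh(x @ w)

def shortcut_stacked_encode(emb, layer_dims):
    """Each layer's input is the word embeddings plus ALL lower layers' outputs."""
    x, outputs = emb, []
    for d in layer_dims:
        h = fake_bilstm(x, d)                        # (n, d)
        outputs.append(h)
        x = np.concatenate([emb] + outputs, axis=1)  # shortcut connections
    return outputs[-1].max(axis=0)                   # row max pooling over time

emb = rng.standard_normal((5, 300))                  # 5 words, 300-d embeddings
v = shortcut_stacked_encode(emb, [512, 1024, 2048])  # fixed-length sentence vector
```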

SLIDE 10

Shared Task Competition Results

[Nangia et al., 2017]

Team Name | Authors | Matched | Mismatched | Model Details
alpha (ensemble) | Chen et al. | 74.9% | 74.9% | STACK, CHAR, ATTN, POOL, PRODDIFF
YixinNie-UNC-NLP | Nie and Bansal | 74.5% | 73.5% | STACK, POOL, PRODDIFF, SNLI
alpha | Chen et al. | 73.5% | 73.6% | STACK, CHAR, ATTN, POOL, PRODDIFF
Rivercorners (ensemble) | Balazs et al. | 72.2% | 72.8% | ATTN, POOL, PRODDIFF, SNLI
Rivercorners | Balazs et al. | 72.1% | 72.1% | ATTN, POOL, PRODDIFF, SNLI
LCT-MALTA | Vu et al. | 70.7% | 70.8% | CHAR, ENHEMB, PRODDIFF, POOL
TALP-UPC | Yang et al. | 67.9% | 68.2% | CHAR, ATTN, SNLI
BiLSTM baseline | Williams et al. | 67.0% | 67.6% | POOL, PRODDIFF, SNLI

RepEval 2017 shared task competition results

SLIDE 11

Ablation Analysis

Layers and Dimensions

#layers | bilstm-dim | Matched | Mismatched
1 | 512 | 72.5 | 72.9
2 | 512 + 512 | 73.4 | 73.6
1 | 1024 | 72.9 | 72.9
2 | 512 + 1024 | 73.7 | 74.2
1 | 2048 | 73.0 | 73.5
2 | 512 + 2048 | 73.7 | 74.2
2 | 1024 + 2048 | 73.8 | 74.4
2 | 2048 + 2048 | 74.0 | 74.6
3 | 512 + 1024 + 2048 | 74.2 | 74.7

Results for models with different numbers of biLSTM layers and hidden-state dimensions

Natural language inference does require some high-level features, which can be learned by applying multiple bi-RNN layers in sequence

SLIDE 12

Ablation Analysis

Shortcut connections | Matched | Mismatched
without any shortcut connection | 72.6 | 73.4
only word shortcut connection | 74.2 | 74.6
full shortcut connection | 74.2 | 74.7

Results with and without shortcut connections.

The main performance gain from the shortcut property comes from the shortcut connection for the word embeddings

SLIDE 13

Ablation Analysis

# of MLP layers | Activation | Matched | Mismatched
1 | tanh | 73.7 | 74.1
2 | tanh | 73.5 | 73.6
1 | relu | 74.1 | 74.7
2 | relu | 74.2 | 74.7

Results for different MLP classifiers

The rectified linear unit works better than the hyperbolic tangent in this task

SLIDE 14

Results on SNLI and MultiNLI

Model | SNLI | Multi-NLI Matched | Multi-NLI Mismatched
CBOW (Williams et al., 2017) | 80.6 | 65.2 | 64.6
biLSTM Encoder (Williams et al., 2017) | 81.5 | 67.5 | 67.1
300D Tree-CNN Encoder (Mou et al., 2015) | 82.1 | – | –
300D SPINN-PI Encoder (Bowman et al., 2016) | 83.2 | – | –
300D NSE Encoder (Munkhdalai and Yu, 2016) | 84.6 | – | –
biLSTM-Max Encoder (Conneau et al., 2017) | 84.5 | – | –
Our biLSTM-Max Encoder | 85.2 | 71.7 | 71.2
Our Shortcut-Stacked Encoder | 86.1 | 74.6 | 73.6

Test Results on SNLI and Multi-NLI datasets

Our model achieves a new state of the art on SNLI among encoding-based models

SLIDE 15

Thoughts about Max-pooling

[Diagram] biLSTM hidden states for w1 … w8, before row max pooling.

Each column in the final vector representation corresponds to a word in the source sentence and its surrounding context

SLIDE 16

Thoughts about Max-pooling

Column-wise matching between the final vector representations of the two sentences corresponds to word-level matching between the two sentences, which is similar to attention between two sentences

[Diagram] Two biLSTM encoders over w1 … w8, one per sentence, with column-wise matching between their representations.

SLIDE 17

Thoughts about Max-pooling

[Diagram] Word-level alignment between the example sentences "I do not like research ." and "I like research ."

SLIDE 18

Max-pooling vs. Attention

Both selectively combine information from each item of the source into a compact representation:

  • Max-pooling
  • Soft attention: e_i = f(w_i, ...), a = softmax(e), v = Σ_i a_i h_i
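The two pooling schemes can be sketched side by side (a NumPy sketch; the hidden states H, the query q, and the dot-product scoring function are illustrative assumptions, not the paper's exact choices):

```python
import numpy as np

def softmax(e):
    z = np.exp(e - e.max())
    return z / z.sum()

# Three hypothetical hidden states h_i (rows), hidden dimension 2.
H = np.array([[0.1, 0.9],
              [0.7, 0.2],
              [0.3, 0.4]])

# Max-pooling: hard, per-dimension selection across timesteps.
v_max = H.max(axis=0)

# Soft attention: e_i = f(w_i, ...), a = softmax(e), v = sum_i a_i * h_i.
q = np.array([1.0, 0.0])   # hypothetical scoring query for f
e = H @ q                  # scores e_i
a = softmax(e)             # attention weights over timesteps
v_att = a @ H              # soft weighted combination
```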


We are currently exploring better, more advanced max-pooling methods.

SLIDE 19

Vector Rep (1-NN Genre Accuracy)

  • Learned representations are not genre-agnostic
  • Potential ability to handle genre classification task

The table shows the percentage of times the first nearest neighbor belongs to the same genre as the sample sentence.

Authors | 1-NN Genre Accuracy
Chen et al. | 67.3%
Nie and Bansal | 74.0%
Balazs et al. | 69.2%
Vu et al. | 67.0%
Yang et al. | 54.7%

[Nangia et al., 2017]


SLIDE 20

Vector Rep (Heatmap)

A heatmap showing the cosine similarity between sentence vectors. Sentences tend to be more similar to one another when they have more structural features in common.

[Nangia et al., 2017]


SLIDE 21

Thanks

Yixin Nie | yixin1@cs.unc.edu | www.cs.unc.edu/~yixin1
Mohit Bansal | mbansal@cs.unc.edu | www.cs.unc.edu/~mbansal
