SLIDE 1

CS11-747 Neural Networks for NLP

Structured Prediction with Local Dependencies

Xuezhe Ma (Max)

Site https://phontron.com/class/nn4nlp2017/

SLIDE 2

An Example Structured Prediction Problem:

Sequence Labeling

SLIDE 3

Sequence Labeling

  • One tag per word
  • e.g. part-of-speech tagging

I/PRP hate/VBP this/DT movie/NN

  • e.g. Named entity recognition

The/O movie/O featured/O Keanu/B-PER Reeves/I-PER

SLIDE 4

Sequence Labeling as Independent Classification

[Figure: the sentence "I hate this movie" (padded with <s> on both sides); an independent classifier at each position predicts its tag: PRP VBP DT NN.]

SLIDE 5

Locally Normalized Models

[Figure: the same sentence "I hate this movie" (padded with <s>), but the classifiers are chained left to right, each conditioning on the previously predicted tags when producing PRP VBP DT NN.]

SLIDE 6

Summary

  • Independent classification models
    • Strong independence assumption:
      $P(Y \mid X) = \prod_{i=1}^{L} P(y_i \mid X)$
    • No guarantee of valid (consistent) structured outputs
      • e.g. the BIO tagging scheme in NER
  • Locally normalized models (e.g. history-based RNN, seq2seq)
    • Decisions are made in a fixed, pre-specified order:
      $P(Y \mid X) = \prod_{i=1}^{L} P(y_i \mid X, y_{<i})$
    • Approximate decoding (see the sketch after this list)
      • Greedy search
      • Beam search
    • Label bias
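
To make the two approximate decoding strategies concrete, here is a minimal sketch. `local_log_prob` is a hypothetical stand-in for the model's per-step distribution $\log P(y_i \mid X, y_{<i})$; any history-based tagger could be plugged in.

```python
import math

TAGS = ["PRP", "VBP", "DT", "NN"]

def local_log_prob(tag, prev_tags, words, i):
    # Toy stand-in (uniform); replace with an RNN / seq2seq scorer.
    return -math.log(len(TAGS))

def greedy_decode(words):
    """Commit to the single best tag at each step."""
    tags = []
    for i in range(len(words)):
        tags.append(max(TAGS, key=lambda t: local_log_prob(t, tags, words, i)))
    return tags

def beam_decode(words, beam_size=3):
    """Keep the beam_size best partial sequences at each step."""
    beam = [([], 0.0)]  # (partial tags, cumulative log-prob)
    for i in range(len(words)):
        candidates = [
            (tags + [t], score + local_log_prob(t, tags, words, i))
            for tags, score in beam for t in TAGS
        ]
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beam[0][0]
```
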
SLIDE 7

Globally normalized models?

  • An independence assumption that is not too strong (only local dependencies)
  • Exact (optimal) decoding
SLIDE 8

Globally normalized models?

  • An independence assumption that is not too strong (only local dependencies)
  • Exact (optimal) decoding

Conditional Random Fields (CRFs)

SLIDE 9

Globally Normalized Models

  • Each output sequence has a score, which is not normalized over any particular decision:

$P(Y \mid X) = \dfrac{\exp\big(S(Y, X)\big)}{\sum_{Y'} \exp\big(S(Y', X)\big)} = \dfrac{\psi(Y, X)}{\sum_{Y'} \psi(Y', X)}$

where $\psi(Y, X)$ are potential functions.

SLIDE 10

Conditional Random Fields

General form of globally normalized model:

[Figure: labels y1, y2, y3, …, yn, all interacting, conditioned on the input x.]

$P(Y \mid X) = \dfrac{\psi(Y, X)}{\sum_{Y'} \psi(Y', X)}$

First-order linear-chain CRF:

[Figure: a linear chain over labels y1, y2, y3, …, yn-1, yn, each conditioned on the input x.]

$P(Y \mid X) = \dfrac{\prod_{i=1}^{L} \psi_i(y_{i-1}, y_i, X)}{\sum_{Y'} \prod_{i=1}^{L} \psi_i(y'_{i-1}, y'_i, X)}$

SLIDE 11

Potential Functions

  • πœ”π‘— π‘§π‘—βˆ’1, 𝑧𝑗, π‘Œ = exp π‘‹π‘ˆπ‘ˆ π‘§π‘—βˆ’1, 𝑧𝑗, π‘Œ, 𝑗 +π‘‰π‘ˆ 𝑇 𝑧𝑗, π‘Œ, 𝑗 + π‘π‘§π‘—βˆ’1,𝑧𝑗
  • Using neural features in DNN:

πœ”π‘— π‘§π‘—βˆ’1, 𝑧𝑗, π‘Œ = exp 𝑋

π‘§π‘—βˆ’1,𝑧𝑗 π‘ˆ

𝐺 π‘Œ, 𝑗 +𝑉𝑧𝑗

π‘ˆ 𝐺 π‘Œ, 𝑗 + π‘π‘§π‘—βˆ’1,𝑧𝑗

  • Number of parameters: 𝑃( 𝑍 2𝑒𝐺)
  • Simpler version:

πœ”π‘— π‘§π‘—βˆ’1, 𝑧𝑗, π‘Œ = exp 𝑋

π‘§π‘—βˆ’1,𝑧𝑗 + 𝑉𝑧𝑗 π‘ˆ 𝐺 π‘Œ, 𝑗 + π‘π‘§π‘—βˆ’1,𝑧𝑗

  • Number of parameters: 𝑃( 𝑍 2 + |𝑍|𝑒𝐺)
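
As a sketch of how the simpler version is wired to a network, the following computes all log potentials from assumed BiLSTM features `F` (the slide's $F(X, i)$), emission weights `U`, transition scores `W`, and biases `b`; all shapes are illustrative.

```python
import numpy as np

def log_potentials(F, U, W, b):
    """Log potentials log psi_i(y_prev, y, X) for the simpler version.

    F: (L, d) neural features, one row per position (e.g. BiLSTM states)
    U: (K, d) per-tag emission weights
    W: (K, K) tag-transition scores W[y_prev, y]
    b: (K, K) tag-pair biases b[y_prev, y]
    Returns (L, K, K): out[i, y_prev, y] = W[y_prev, y]
                       + U[y] @ F[i] + b[y_prev, y]
    """
    emit = F @ U.T                         # (L, K): U_y^T F(X, i)
    return (W + b)[None, :, :] + emit[:, None, :]
```
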
SLIDE 12

BiLSTM-CRF for Sequence Labeling

[Figure: a BiLSTM reads "I hate this movie" (padded with <s>), and a CRF layer on top outputs PRP VBP DT NN.]

SLIDE 13

Training & Decoding of CRFs: Viterbi Algorithm

SLIDE 14

CRF Training & Decoding

  • 𝑄 𝑍 π‘Œ =

𝑗=1

𝑀

πœ”π‘—(π‘§π‘—βˆ’1,𝑧𝑗,π‘Œ) 𝑍′ 𝑗=1

𝑀

πœ”π‘—(π‘§β€²π‘—βˆ’1,𝑧′𝑗,π‘Œ) = 𝑗=1

𝑀

πœ”π‘—(π‘§π‘—βˆ’1,𝑧𝑗,π‘Œ) π‘Ž(π‘Œ)

  • Training: computing the partition function Z(X)

π‘Ž π‘Œ =

𝑍 𝑗=1 𝑀

πœ”π‘—(π‘§π‘—βˆ’1, 𝑧𝑗, π‘Œ)

  • Decoding

π‘§βˆ— = 𝑏𝑠𝑕𝑛𝑏𝑦𝑍𝑄(𝑍|π‘Œ) Go through the output space of Y which grows exponentially with the length of the input sequence.

SLIDE 15

Interactions

  • Each label depends on the input and on nearby labels
  • But given the adjacent labels, the others do not matter
  • If we knew the score of every sequence $y_1, \ldots, y_{n-1}$, we could easily compute the score of the sequence $y_1, \ldots, y_{n-1}, y_n$
  • So we really only need to know the score of all the sequences ending in each $y_{n-1}$
  • Think of that as some "precalculation" that happens before we think about $y_n$

$Z(X) = \sum_{Y} \prod_{i=1}^{L} \psi_i(y_{i-1}, y_i, X)$

SLIDE 16

Viterbi Algorithm

  • πœŒπ‘’(𝑧|X) is the partition of sequence with length equal to 𝑒 and end with label 𝑧:

πœŒπ‘’ 𝑧 π‘Œ =

𝑧𝑗,…,π‘§π‘’βˆ’1 𝑗=1 π‘’βˆ’1

πœ”π‘— π‘§π‘—βˆ’1, 𝑧𝑗, π‘Œ πœ”π‘’(π‘§π‘’βˆ’1, 𝑧𝑒 = 𝑧, π‘Œ) =

π‘§π‘’βˆ’1

πœ”π‘’(π‘§π‘’βˆ’1, 𝑧𝑒 = 𝑧, π‘Œ)

𝑧𝑗,…,π‘§π‘’βˆ’2 𝑗=1 π‘’βˆ’2

πœ”π‘— π‘§π‘—βˆ’1, 𝑧𝑗, π‘Œ πœ”π‘’βˆ’1(π‘§π‘’βˆ’2, π‘§π‘’βˆ’1, π‘Œ) =

π‘§π‘’βˆ’1

πœ”π‘’(π‘§π‘’βˆ’1, 𝑧𝑒 = 𝑧, π‘Œ)πœŒπ‘’βˆ’1 π‘§π‘’βˆ’1 π‘Œ

  • Computing partition function π‘Ž π‘Œ = 𝑧 πœŒπ‘€(𝑧|π‘Œ)
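
A log-space sketch of this recursion (a minimal version, not the slides' code): `emit` and `trans` are assumed arrays of log potentials, with `emit[t, y]` playing the role of the state score and `trans[y_prev, y]` the transition score. Replacing `logaddexp` with `max` turns the same loop into Viterbi decoding.

```python
import numpy as np

def log_partition(emit, trans):
    """Forward algorithm: computes log Z(X) in O(L * K^2) time
    instead of enumerating all K^L label sequences.

    emit:  (L, K) log state potentials, emit[t, y]
    trans: (K, K) log transition potentials, trans[y_prev, y]
    """
    log_pi = emit[0]                                   # log pi_1(y | X)
    for t in range(1, emit.shape[0]):
        # log pi_t(y) = logsumexp_{y_prev}(log pi_{t-1}(y_prev)
        #               + trans[y_prev, y]) + emit[t, y]
        log_pi = np.logaddexp.reduce(log_pi[:, None] + trans, axis=0) + emit[t]
    return np.logaddexp.reduce(log_pi)                 # log Z(X)
```
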
SLIDE 17

Step: Initial Part

 First, calculate transition from <S> and emission of the

first word for every POS 1:NN 1:JJ 1:VB

1:LRB 1:RRB

…

0:<S>

natural

score[β€œ1 NN”] = T(NN|<S>) + S(natural | NN) score[β€œ1 JJ”] = T(JJ|<S>) + S(natural | JJ) score[β€œ1 VB”] = T(VB|<S>) + S(natural | VB) score[β€œ1 LRB”] = T(LRB|<S>) + S(natural | LRB) score[β€œ1 RRB”] = T(RRB|<S>) + S(natural | RRB)

SLIDE 18

Step: Middle Parts

 For middle words, calculate the scores for all possible previous POS tags

1:NN 1:JJ 1:VB

1:LRB 1:RRB

… natural

score[β€œ2 NN”] = log_sum_exp( score[β€œ1 NN”] + T(NN|NN) + S(language | NN), score[β€œ1 JJ”] + T(NN|JJ) + S(language | NN), score[β€œ1 VB”] + T(NN|VB) + S(language | NN), score[β€œ1 LRB”] + T(NN|LRB) + S(language | NN), score[β€œ1 RRB”] + T(NN|RRB) + S(language | NN), ...)

2:NN 2:JJ 2:VB

2:LRB 2:RRB

… language

score[β€œ2 JJ”] = log_sum_exp( score[β€œ1 NN”] + T(JJ|NN) + S(language | JJ), score[β€œ1 JJ”] + T(JJ|JJ) + S(language | JJ), score[β€œ1 VB”] + T(JJ|VB) + S(language | JJ), ...

π‘šπ‘π‘• 𝑑𝑣𝑛 π‘“π‘¦π‘ž(𝑦, 𝑧) = log(exp 𝑦 + exp 𝑧 )

SLIDE 19

Forward Step: Final Part

 Finish up the sentence with the sentence final symbol

I:NN I:JJ I:VB

I:LRB I:RRB

… science

score[β€œI+1 </S>”] = log_sum_exp( score[β€œI NN”] + T(</S>|NN), score[β€œI JJ”] + T(</S>|JJ), score[β€œI VB”] + T(</S>|VB), score[β€œI LRB”] + T(</S>|LRB), score[β€œI NN”] + T(</S>|RRB), ... )

I+1:</S>

SLIDE 20

Viterbi Algorithm

  • Decoding is performed with a similar dynamic programming algorithm (max in place of sum)
  • Calculating the gradient of $\mathrm{loss}(X, Y; \theta) = -\log P(Y \mid X; \theta)$:

$\dfrac{\partial\, \mathrm{loss}(X, Y; \theta)}{\partial \theta} = \mathbb{E}_{P(Y' \mid X; \theta)}\big[f(Y', X)\big] - f(Y, X)$

  • Classically computed with the forward-backward algorithm (Sutton and McCallum, 2010)
    • Both $P(Y \mid X; \theta)$ and $f(Y, X)$ decompose over positions
    • Need to compute the marginal distributions:

$P(y_{i-1} = y', y_i = y \mid X; \theta) = \dfrac{\alpha_{i-1}(y' \mid X)\, \psi_i(y', y, X)\, \beta_i(y \mid X)}{Z(X)}$

  • Not necessary when using a DNN framework (auto-grad); see the sketch below
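
With auto-grad, the forward-backward bookkeeping disappears: it suffices to compute $-\log P(Y \mid X; \theta)$ and back-propagate. A minimal PyTorch sketch under the same assumed `emit`/`trans` log-potential arrays as before:

```python
import torch

def crf_nll(emit, trans, tags):
    """Negative log-likelihood -log P(Y | X) of one tagged sentence.

    emit:  (L, K) tensor of log state potentials (e.g. BiLSTM outputs)
    trans: (K, K) tensor of log transition potentials
    tags:  (L,)  tensor of gold tag ids
    Back-propagating through this loss implicitly computes the same
    quantities as the forward-backward algorithm.
    """
    # Score of the gold path: its state scores plus transition scores.
    gold = emit[0, tags[0]] + sum(
        trans[tags[t - 1], tags[t]] + emit[t, tags[t]]
        for t in range(1, emit.size(0)))
    # log Z(X) via the forward algorithm in log space.
    log_pi = emit[0]
    for t in range(1, emit.size(0)):
        log_pi = torch.logsumexp(log_pi[:, None] + trans, dim=0) + emit[t]
    return torch.logsumexp(log_pi, dim=0) - gold
```
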
SLIDE 21

Case Study

BiLSTM-CNN-CRF for Sequence Labeling

SLIDE 22

Case Study: BiLSTM-CNN-CRF for Sequence Labeling (Ma et al., 2016)

  • Goal: Build a truly end-to-end neural model for sequence labeling tasks, requiring no feature engineering or data pre-processing.

  • Two levels of representations
  • Character-level representation: CNN
  • Word-level representation: Bi-directional LSTM
SLIDE 23

CNN for Character-level representation

  • We used a CNN to extract morphological information, such as the prefix or suffix of a word

SLIDE 24

Bi-LSTM-CNN-CRF

  • We used a Bi-LSTM to model word-level information.
  • A CRF layer on top of the Bi-LSTM captures the correlation between labels (see the sketch below).
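
A compact PyTorch sketch of this architecture. The dimensions and layer details are illustrative assumptions, not the paper's exact configuration; the returned per-position scores would serve as the emission part of the CRF potentials sketched earlier.

```python
import torch
import torch.nn as nn

class BiLSTMCNN(nn.Module):
    """Sketch of the BiLSTM-CNN encoder; emits (L, n_tags) scores."""

    def __init__(self, n_words, n_chars, n_tags,
                 word_dim=100, char_dim=30, char_filters=30, hidden=200):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # Character-level CNN over each word's character sequence.
        self.char_cnn = nn.Conv1d(char_dim, char_filters,
                                  kernel_size=3, padding=1)
        self.lstm = nn.LSTM(word_dim + char_filters, hidden,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, words, chars):
        # words: (L,) word ids; chars: (L, max_word_len) char ids
        c = self.char_emb(chars).transpose(1, 2)   # (L, char_dim, len)
        c = torch.relu(self.char_cnn(c)).max(dim=2).values  # max pool
        x = torch.cat([self.word_emb(words), c], dim=-1)
        h, _ = self.lstm(x.unsqueeze(1))           # (L, 1, 2 * hidden)
        return self.out(h.squeeze(1))              # (L, n_tags)
```
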

SLIDE 25

Training Details

  • Optimization algorithm (see the configuration sketch after this list):
    • SGD with momentum (0.9)
    • Learning rate decays with rate 0.05 after each epoch
  • Dropout training:
    • Dropout applied to regularize the model, with a fixed dropout rate of 0.5
  • Parameter initialization:
    • Parameters: Glorot and Bengio (2010)
    • Word embeddings: Stanford's GloVe 100-dimensional embeddings
    • Character embeddings: uniformly sampled from $[-\sqrt{3/\mathrm{dim}},\, +\sqrt{3/\mathrm{dim}}]$, where $\mathrm{dim} = 30$
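
A concrete rendering of these settings as a PyTorch training skeleton, reusing the `BiLSTMCNN` sketch above. The schedule `lr = lr0 / (1 + 0.05 * epoch)` is one common reading of "decays with rate 0.05 after each epoch", and the initial learning rate is an assumption.

```python
import torch

model = BiLSTMCNN(n_words=10000, n_chars=100, n_tags=45)  # sizes assumed
# Character embeddings: uniform in [-sqrt(3/dim), +sqrt(3/dim)], dim = 30.
torch.nn.init.uniform_(model.char_emb.weight,
                       -(3 / 30) ** 0.5, (3 / 30) ** 0.5)

lr0 = 0.01  # assumed initial learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=lr0, momentum=0.9)
dropout = torch.nn.Dropout(p=0.5)  # applied to embeddings / LSTM outputs

for epoch in range(50):
    for group in optimizer.param_groups:
        group["lr"] = lr0 / (1.0 + 0.05 * epoch)  # per-epoch decay
    # ... one pass over the training data goes here ...
```
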

SLIDE 26

Experiments

Model          POS Dev Acc.  POS Test Acc.  NER Dev (P / R / F1)   NER Test (P / R / F1)
BRNN           96.56         96.76          92.04 / 89.13 / 90.56  87.05 / 83.88 / 85.44
BLSTM          96.88         96.93          92.31 / 90.85 / 91.57  87.77 / 86.23 / 87.00
BLSTM-CNN      97.34         97.33          92.52 / 93.64 / 93.07  88.53 / 90.21 / 89.36
BLSTM-CNN-CRF  97.46         97.55          94.85 / 94.63 / 94.74  91.35 / 91.06 / 91.21

SLIDE 27

Considering Rewards during Training

SLIDE 28

Reward Functions in Structured Prediction

  • POS tagging: token-level accuracy
  • NER: F1 score
  • Dependency parsing: labeled attachment score
  • Machine translation: corpus-level BLEU

Do different reward functions impact our decisions?

SLIDE 29
  • Data 1: $(X, Y) \sim P$
  • Task 1: predict $Y$ given $X$, i.e. $h_1(X)$
  • Reward 1: $R_1(h_1(X), Y)$
  • Data 2: $(X, Y) \sim P$
  • Task 2: predict $Y$ given $X$, i.e. $h_2(X)$
  • Reward 2: $R_2(h_2(X), Y)$

Is $h_1(X) = h_2(X)$?

SLIDE 30

[Figure: a predictor must choose between two closed boxes holding $0 and $1M. Reward: the amount of money we get.]

SLIDE 31

[Figure: the $0 vs. $1M choice, continued, with the predictor's pick shown.]

SLIDE 32

[Figure: the same game, but now the boxes hold $0 and $1B. Reward: the amount of money we get.]

SLIDE 33

[Figure: the $0 vs. $1B choice, continued, with the predictor's pick shown.]

SLIDE 34

Considering Rewards during Training

  • Max-Margin (Taskar et al., 2004)
  • Similar to cost-augmented hinge loss (last class)
  • Does not rely on a probabilistic model (only a decoding algorithm is required)
  • Minimum Risk Training (Shen et al., 2016)
  • Reward-augmented Maximum Likelihood (Norouzi et al., 2016)
SLIDE 35

Minimum Risk Training

π‘šπ‘π‘†π‘ˆ 𝑦, 𝑧; πœ„ = 𝐹𝑄(𝑍|π‘Œ=𝑦; πœ„)[βˆ’π‘† 𝑍, 𝑧 ]

  • Pros:
  • Direct optimization w.r.t. evaluation metrics
  • Similar to the globally normalized model in (Andor et al, 2016), but with task-

specific reward R

  • Applicable to arbitrary risk functions: R is not necessarily differentiable
  • Cons:
  • Intractable computation of expectation w.r.t. 𝑄(𝑍|π‘Œ; πœ„)
  • Sampling from a sub-space
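
A minimal sketch of the sampled approximation: scores and rewards for `N` sampled outputs are assumed inputs (the reward could be sentence-level F1 or BLEU against the gold output), and the expectation is renormalized over the sampled sub-space, as in Shen et al. (2016).

```python
import torch

def mrt_loss(log_probs, rewards):
    """Sampled minimum risk training loss (sketch).

    log_probs: (N,) log P(y_k | x; theta) for N sampled outputs
    rewards:   (N,) task reward R(y_k, y_gold) for each sample
    """
    q = torch.softmax(log_probs, dim=0)  # distribution over the samples
    return -(q * rewards).sum()          # expected negative reward
```
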
SLIDE 36

Reward-augmented Maximum Likelihood

  • Reward-augmented Maximum Likelihood (RAML)
  • Basic idea: randomly sample "incorrect" outputs from the exponentiated payoff distribution $q$, then train on them with maximum likelihood:

$q(y \mid y^{*}; \tau) = \dfrac{\exp\big(R(y, y^{*}) / \tau\big)}{\sum_{y'} \exp\big(R(y', y^{*}) / \tau\big)}$

  • Can be shown to approximately maximize the reward (Norouzi et al., 2016; Ma et al., 2017)

[Figure: for "I hate this movie", MLE trains on the gold tags PRP VBP DT NN, while RAML also trains on a sampled near-miss such as PRP NN DT NN.]
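
A sketch of RAML sampling for tagging. Everything here is an illustrative assumption: the reward is taken to be the negative Hamming distance to the gold tags, and candidates are drawn by randomly corrupting the gold sequence before reweighting by the exponentiated payoff.

```python
import math
import random

def raml_sample(gold_tags, tag_set, tau=1.0, n_candidates=100):
    """Draw one training sequence from (an approximation of) q(y | y*; tau)."""
    candidates = [
        [t if random.random() > 0.2 else random.choice(tag_set)
         for t in gold_tags]
        for _ in range(n_candidates)
    ]
    # exp(R(y, y*) / tau), with R = negative Hamming distance to the gold tags.
    weights = [
        math.exp(-sum(a != b for a, b in zip(y, gold_tags)) / tau)
        for y in candidates
    ]
    return random.choices(candidates, weights=weights, k=1)[0]
```
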

SLIDE 37

Questions?