SLIDE 1

Sequence Models

CMPT 825: Natural Language Processing, Spring 2020
Adapted from slides by Danqi Chen and Karthik Narasimhan (Princeton COS 484)

SFU NatLangLab
SLIDE 2

Overview

  • Hidden Markov models (HMM)
  • Viterbi algorithm
  • Maximum entropy Markov models (MEMM)
SLIDE 3

Sequence Tagging

SLIDE 4

What are POS tags

  • Word classes or syntactic categories
  • Reveal useful information about a word (and its neighbors!)

The/DT old/NN man/VB the/DT boat/NN
The/DT cat/NN sat/VBD on/IN the/DT mat/NN
British/NNP left/NN waffles/NNS on/IN Falkland/NNP Islands/NNP
SLIDE 5

Parts of Speech

  • Different words have different functions
  • Closed class: fixed membership, function words
      • e.g. prepositions (in, on, of), determiners (the, a)
  • Open class: new words get added frequently
      • e.g. nouns (Twitter, Facebook), verbs (google), adjectives, adverbs
SLIDE 6

Penn Treebank tagset

(Marcus et al., 1993): 45 tags. Other corpora: Brown, WSJ, Switchboard.
SLIDE 7

Part of Speech Tagging

  • Disambiguation task: each word might have different senses/functions
  • The/DT man/NN bought/VBD a/DT boat/NN
  • The/DT old/NN man/VB the/DT boat/NN
SLIDE 8

Part of Speech Tagging

  • Disambiguation task: each word might have different senses/functions
  • The/DT man/NN bought/VBD a/DT boat/NN
  • The/DT old/NN man/VB the/DT boat/NN

Some words have many functions!
SLIDE 9

A simple baseline

  • Many words might be easy to disambiguate
  • Most frequent class: Assign each token (word) to the class it occurred in most often in the training set (e.g. man/NN); see the sketch after this list
  • Accurately tags 92.34% of word tokens on Wall Street Journal (WSJ)!
  • State of the art ~ 97%
  • Average English sentence ~ 14 words
  • Sentence-level accuracies: 0.92^14 ≈ 31% vs 0.97^14 ≈ 65%
  • POS tagging not solved yet!
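A minimal sketch of the most-frequent-class baseline, assuming training data as lists of (word, tag) pairs (the data format and function names are illustrative, not from the slides):

```python
from collections import Counter, defaultdict

def train_most_frequent_class(tagged_sentences):
    """Map each word to the tag it co-occurred with most often in training."""
    word_tag_counts = defaultdict(Counter)
    tag_counts = Counter()
    for sent in tagged_sentences:
        for word, tag in sent:
            word_tag_counts[word][tag] += 1
            tag_counts[tag] += 1
    most_frequent = {w: c.most_common(1)[0][0] for w, c in word_tag_counts.items()}
    default_tag = tag_counts.most_common(1)[0][0]  # back-off tag for unseen words
    return most_frequent, default_tag

def tag_sentence(words, most_frequent, default_tag):
    return [most_frequent.get(w, default_tag) for w in words]
```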
SLIDE 10

Hidden Markov Models

SLIDE 11

Some observations

  • The function (or POS) of a word depends on its context
  • The/DT old/NN man/VB the/DT boat/NN
  • The/DT old/JJ man/NN bought/VBD the/DT boat/NN
  • Certain POS combinations are extremely unlikely
  • <JJ, DT> or <DT, IN>
  • Better to make decisions on entire sequences instead of individual words (sequence modeling!)
SLIDE 12

Markov chains

  • Model probabilities of sequences of variables
  • Each state can take one of K values ({1, 2, ..., K} for simplicity)
  • Markov assumption: P(st|s<t) ≈ P(st|st−1)

Where have we seen this before?

[Diagram: Markov chain s1 → s2 → s3 → s4]
SLIDE 13

Markov chains

The/?? cat/?? sat/?? on/?? the/?? mat/??

  • We don’t observe POS tags at test time

[Diagram: Markov chain over tags s1 → s2 → s3 → s4]
SLIDE 14

Hidden Markov Model (HMM)

The/?? cat/?? sat/?? on/?? the/?? mat/??

  • We don’t observe POS tags at test time
  • But we do observe the words!
  • HMM allows us to jointly reason over both hidden and observed events.

[Diagram: hidden tag states s1 … s4 (Tags) emitting observed words "the cat sat on" (Words)]
SLIDE 15

Components of an HMM

[Diagram: HMM with hidden tag states s1 … s4 (Tags) emitting observations o1 … o4 (Words)]

  • 1. Set of states S = {1, 2, ..., K} and observations O
  • 2. Initial state probability distribution π(s1)
  • 3. Transition probabilities P(st+1|st)
  • 4. Emission probabilities P(ot|st)
SLIDE 16

Assumptions

[Diagram: HMM with hidden tag states s1 … s4 (Tags) and observations o1 … o4 (Words)]

  • 1. Markov assumption: P(st+1|s1, . . . , st) = P(st+1|st)
  • 2. Output independence: P(ot|s1, . . . , st) = P(ot|st)

Which is a stronger assumption?
SLIDE 17

Sequence likelihood

[Diagram: HMM with hidden tag states s1 … s4 (Tags) and observations o1 … o4 (Words)]

P(S, O) = π(s1) P(o1|s1) ∏t=2..n P(st|st−1) P(ot|st)
SLIDE 20

Learning

  • Maximum likelihood estimate:

    P(si|sj) = C(sj, si) / C(sj)
    P(o|s) = C(s, o) / C(s)
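A minimal sketch of these count-based estimates, assuming tagged sentences as lists of (word, tag) pairs (the data format and function name are illustrative):

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """MLE estimates: pi(s), P(si|sj) = C(sj, si)/C(sj), P(o|s) = C(s, o)/C(s)."""
    init, trans, emit, tag_count = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        prev = None
        for word, tag in sent:
            tag_count[tag] += 1
            emit[(tag, word)] += 1
            if prev is None:
                init[tag] += 1
            else:
                trans[(prev, tag)] += 1
            prev = tag
    n_sents = sum(1 for s in tagged_sentences if s)
    pi = {t: c / n_sents for t, c in init.items()}
    P_trans = {(sj, si): c / tag_count[sj] for (sj, si), c in trans.items()}
    P_emit = {(s, o): c / tag_count[s] for (s, o), c in emit.items()}
    return pi, P_trans, P_emit
```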

SLIDE 22

Example: POS tagging

the/?? cat/?? sat/?? on/?? the/?? mat/??

Transition probabilities P(st+1|st) (rows: st, columns: st+1):

         DT     NN     IN     VBD
  DT     0.5    0.8    0.05   0.1
  NN     0.05   0.2    0.15   0.6
  IN     0.5    0.2    0.05   0.25
  VBD    0.3    0.3    0.3    0.1

Emission probabilities P(ot|st) (rows: st, columns: ot; – = value not shown):

         the    cat    sat    on     mat
  DT     0.5    –      –      –      –
  NN     0.01   0.2    0.01   0.01   0.2
  IN     –      –      –      0.4    –
  VBD    0.01   –      0.1    0.01   0.01

π(DT) = 0.8
SLIDE 23

Example: POS tagging

the/?? cat/?? sat/?? on/?? the/?? mat/??

(Using the transition and emission tables above, with π(DT) = 0.8.)

P(the/DT cat/NN sat/VBD on/IN the/DT mat/NN)
  = π(DT) P(the|DT) · P(NN|DT) P(cat|NN) · P(VBD|NN) P(sat|VBD) · P(IN|VBD) P(on|IN) · P(DT|IN) P(the|DT) · P(NN|DT) P(mat|NN)
  ≈ 1.84 × 10⁻⁵
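A minimal sketch that reproduces this number from the tables above (only the table entries needed for this sentence are included; the function name is illustrative):

```python
# Toy HMM parameters from the slides (partial: just the entries used here)
pi = {"DT": 0.8}
trans = {("DT", "NN"): 0.8, ("NN", "VBD"): 0.6, ("VBD", "IN"): 0.3, ("IN", "DT"): 0.5}
emit = {("DT", "the"): 0.5, ("NN", "cat"): 0.2, ("VBD", "sat"): 0.1,
        ("IN", "on"): 0.4, ("NN", "mat"): 0.2}

def sequence_likelihood(tags, words, pi, trans, emit):
    """Joint probability P(S, O) under the HMM factorization."""
    p = pi[tags[0]] * emit[(tags[0], words[0])]
    for i in range(1, len(tags)):
        p *= trans[(tags[i - 1], tags[i])] * emit[(tags[i], words[i])]
    return p

tags = ["DT", "NN", "VBD", "IN", "DT", "NN"]
words = ["the", "cat", "sat", "on", "the", "mat"]
print(sequence_likelihood(tags, words, pi, trans, emit))  # ≈ 1.84e-05
```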

SLIDE 24

Decoding with HMMs

  • Task: Find the most probable sequence of states ⟨s1, s2, . . . , sn⟩ given the observations ⟨o1, o2, . . . , on⟩

[Diagram: unknown tags ? ? ? ? above observations o1 … o4]
SLIDE 27

Greedy decoding

[Diagram: greedy decoding step 1: DT chosen for "The"; remaining positions still ?]
SLIDE 28

Greedy decoding

[Diagram: greedy decoding step 2: DT NN chosen for "The cat"; remaining positions still ?]
SLIDE 29

Greedy decoding

  • Not guaranteed to be optimal!
  • Local decisions

[Diagram: greedy decoding result DT NN VBD IN for "The cat sat on"]
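A minimal greedy-decoding sketch under these local decisions (pi, trans, emit are probability dicts like those estimated earlier; unseen pairs default to 0):

```python
def greedy_decode(words, tags, pi, trans, emit):
    """Pick the locally best tag at each position, left to right."""
    seq = []
    for i, w in enumerate(words):
        def score(t):
            p_prev = pi.get(t, 0.0) if i == 0 else trans.get((seq[-1], t), 0.0)
            return p_prev * emit.get((t, w), 0.0)
        seq.append(max(tags, key=score))
    return seq
```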
SLIDE 30

Viterbi decoding

  • Use dynamic programming!
  • Probability lattice M[T, K]
      • T: number of time steps
      • K: number of states
      • M[i, j]: probability of the most probable sequence of states ending with state j at time i
SLIDE 31

Viterbi decoding

[Lattice column for "the" with states DT, NN, VBD, IN]

Forward (initialization):
  M[1, DT]  = π(DT) P(the|DT)
  M[1, NN]  = π(NN) P(the|NN)
  M[1, VBD] = π(VBD) P(the|VBD)
  M[1, IN]  = π(IN) P(the|IN)
SLIDE 32

Viterbi decoding

[Lattice columns for "the" and "cat", each with states DT, NN, VBD, IN]

Forward (recursion, step 2):
  M[2, DT]  = max_k M[1, k] P(DT|k) P(cat|DT)
  M[2, NN]  = max_k M[1, k] P(NN|k) P(cat|NN)
  M[2, VBD] = max_k M[1, k] P(VBD|k) P(cat|VBD)
  M[2, IN]  = max_k M[1, k] P(IN|k) P(cat|IN)
SLIDE 33

Viterbi decoding

[Lattice over "The cat sat on" with states DT, NN, VBD, IN at each position]

Forward:  M[i, j] = max_k M[i−1, k] P(sj|sk) P(oi|sj),   1 ≤ k ≤ K, 1 ≤ i ≤ n
Backward: pick max_k M[n, k] and backtrack
SLIDE 34

Viterbi decoding

[Lattice over "The cat sat on" with states DT, NN, VBD, IN at each position]

Forward:  M[i, j] = max_k M[i−1, k] P(sj|sk) P(oi|sj),   1 ≤ k ≤ K, 1 ≤ i ≤ n
Backward: pick max_k M[n, k] and backtrack

Time complexity?
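A minimal Viterbi sketch following this recurrence (same probability dicts as before; the two nested loops over tags make the O(n · K²) cost behind the question above visible):

```python
def viterbi(words, tags, pi, trans, emit):
    n = len(words)
    M = [{} for _ in range(n)]     # M[i][t]: score of the best path ending in tag t at position i
    back = [{} for _ in range(n)]  # backpointers for recovering the best path
    for t in tags:
        M[0][t] = pi.get(t, 0.0) * emit.get((t, words[0]), 0.0)
    for i in range(1, n):
        for t in tags:
            best_prev = max(tags, key=lambda k: M[i - 1][k] * trans.get((k, t), 0.0))
            M[i][t] = (M[i - 1][best_prev] * trans.get((best_prev, t), 0.0)
                       * emit.get((t, words[i]), 0.0))
            back[i][t] = best_prev
    # Backward pass: pick the best final tag and backtrack
    last = max(tags, key=lambda t: M[n - 1][t])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```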

SLIDE 35

Beam Search

  • If K (number of states) is too large, Viterbi is too expensive!

[Lattice over "The cat sat on" with states DT, NN, VBD, IN at each position]
SLIDE 36

Beam Search

[Lattice over "The cat sat on" with states DT, NN, VBD, IN at each position]

Many paths have very low likelihood!

  • If K (number of states) is too large, Viterbi is too expensive!
SLIDE 37

Beam Search

  • If K (number of states) is too large, Viterbi is too expensive!
  • Keep a fixed number of hypotheses at each point
  • Beam width, β
SLIDE 38

Beam Search

  • Keep a fixed number of hypotheses at each point

[Diagram: lattice column for "The", β = 2: DT (score = −4.1), NN (score = −9.8), VBD (score = −6.7), IN (score = −10.1)]
SLIDE 39

Beam Search

  • Keep a fixed number of hypotheses at each point

Step 1: Expand all partial sequences in the current beam

[Diagram: lattice columns for "The" and "cat", β = 2; expanded hypotheses with scores −16.5, −6.5, −13.0, −22.1]
SLIDE 40

Beam Search

  • Keep a fixed number of hypotheses at each point

Step 2: Prune the set back to the top β sequences

[Diagram: lattice columns for "The" and "cat", β = 2; hypotheses with scores −16.5, −6.5, −13.0, −22.1, of which the top 2 are kept]
SLIDE 41

Beam Search

  • Keep a fixed number of hypotheses at each point

[Diagram: lattice over "The cat sat on", with β = 2 hypotheses kept at each step]

Pick max_k M[n, k] from within the beam and backtrack
SLIDE 42

Beam Search

  • If K (number of states) is too large, Viterbi is too expensive!
  • Keep a fixed number of hypotheses at each point
  • Beam width, β
  • Trade off computation for (some) accuracy

Time complexity?
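A minimal beam-search sketch in log space under these assumptions (same probability dicts as before; with beam width β it does roughly O(n · β · K) score evaluations rather than Viterbi's O(n · K²)):

```python
import heapq
import math

def beam_search(words, tags, pi, trans, emit, beta=2):
    """Keep the top-beta partial tag sequences (log scores) at each step."""
    def logp(p):
        return math.log(p) if p > 0 else float("-inf")
    beam = [(logp(pi.get(t, 0.0)) + logp(emit.get((t, words[0]), 0.0)), [t]) for t in tags]
    beam = heapq.nlargest(beta, beam)
    for w in words[1:]:
        # Step 1: expand every partial sequence in the current beam
        expanded = [(score + logp(trans.get((seq[-1], t), 0.0)) + logp(emit.get((t, w), 0.0)),
                     seq + [t])
                    for score, seq in beam
                    for t in tags]
        # Step 2: prune back to the top beta sequences
        beam = heapq.nlargest(beta, expanded)
    return max(beam)[1]  # best-scoring sequence within the beam
```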

SLIDE 43

Beyond bigrams

  • Real-world HMM taggers have more relaxed assumptions
  • Trigram HMM: P(st+1|s1, s2, . . . , st) ≈ P(st+1|st−1, st)

[Diagram: trigram dependencies among tags DT NN VBD IN over "The cat sat on"]

Pros? Cons?
SLIDE 44

Maximum Entropy Markov Models

SLIDE 45

Generative vs Discriminative

  • HMM is a generative model
  • Can we model P(s1, . . . , sn|o1, . . . , on) directly?

Generative                                                   Discriminative
Naive Bayes: P(c) P(d|c)                                     Logistic Regression: P(c|d)
HMM: P(s1, . . . , sn) P(o1, . . . , on|s1, . . . , sn)      MEMM: P(s1, . . . , sn|o1, . . . , on)
SLIDE 46

MEMM

[Diagrams: HMM vs MEMM graphical structures over "The cat sat on" with tags DT NN VB IN]

  • Compute the posterior directly:
    Ŝ = arg max_S P(S|O) = arg max_S ∏i P(si|oi, si−1)
  • Use features:
    P(si|oi, si−1) ∝ exp(w ⋅ f(si, oi, si−1))
SLIDE 47

MEMM

[Diagrams: HMM vs MEMM graphical structures over "The cat sat on" with tags DT NN VB IN]

  • In general, we can use all observations and all previous states:
    Ŝ = arg max_S P(S|O) = arg max_S ∏i P(si|o1, . . . , on, si−1, . . . , s1)
    P(si|si−1, . . . , s1, O) ∝ exp(w ⋅ f(si, si−1, . . . , s1, O))
SLIDE 48

Features in an MEMM

[Table: feature templates and the corresponding instantiated features]
SLIDE 49

MEMMs: Decoding

  • Greedy decoding:
    Ŝ = arg max_S P(S|O) = arg max_S ∏i P(si|oi, si−1)

[Diagram: greedy tagging DT NN VBD IN over "The cat sat on"]

(assume features only on the previous time step and the current observation)
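A minimal sketch of this local classifier and greedy decoder, assuming simple indicator features and a weight dictionary w (both hypothetical; real MEMM taggers use much richer feature templates):

```python
import math

def features(tag, word, prev_tag):
    # Hypothetical indicator features f(s_i, o_i, s_{i-1})
    return {f"tag={tag}&word={word}": 1.0, f"tag={tag}&prev={prev_tag}": 1.0}

def local_prob(tag, word, prev_tag, tags, w):
    """P(s_i | o_i, s_{i-1}) ∝ exp(w · f(s_i, o_i, s_{i-1})), normalized over all tags."""
    scores = {t: sum(w.get(k, 0.0) * v for k, v in features(t, word, prev_tag).items())
              for t in tags}
    z = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[tag]) / z

def memm_greedy_decode(words, tags, w, start="<s>"):
    seq, prev = [], start
    for word in words:
        prev = max(tags, key=lambda t: local_prob(t, word, prev, tags, w))
        seq.append(prev)
    return seq
```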

SLIDE 52

MEMMs: Decoding

  • Greedy decoding
  • Viterbi decoding:
    Ŝ = arg max_S P(S|O) = arg max_S ∏i P(si|oi, si−1)
    M[i, j] = max_k M[i−1, k] P(sj|oi, sk),   1 ≤ k ≤ K, 1 ≤ i ≤ n
SLIDE 53

MEMM: Learning

  • Gradient descent: similar to logistic regression!
  • Given: pairs of (S, O), where each S = ⟨s1, s2, . . . , sn⟩
  • Loss for one sequence: L = − ∑i log P(si|s1, . . . , si−1, O),
    where P(si|s1, . . . , si−1, O) ∝ exp(w ⋅ f(s1, . . . , si, O))
  • Compute gradients with respect to the weights w and update
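A minimal sketch of this per-sequence loss, using the same hypothetical indicator features as in the decoding sketch above (a real implementation would also compute gradients, e.g. with an autodiff library):

```python
import math

def features(tag, word, prev_tag):
    # same hypothetical indicator features as in the decoding sketch
    return {f"tag={tag}&word={word}": 1.0, f"tag={tag}&prev={prev_tag}": 1.0}

def sequence_loss(words, gold_tags, tags, w, start="<s>"):
    """L = -sum_i log P(s_i | s_{i-1}, o_i) for one (S, O) training pair."""
    loss, prev = 0.0, start
    for word, gold in zip(words, gold_tags):
        scores = {t: sum(w.get(k, 0.0) * v for k, v in features(t, word, prev).items())
                  for t in tags}
        log_z = math.log(sum(math.exp(s) for s in scores.values()))
        loss -= scores[gold] - log_z  # -log P(gold tag | word, previous gold tag)
        prev = gold
    return loss
```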

SLIDE 54

Bidirectionality

Both HMM and MEMM assume left-to-right processing. Why can this be undesirable?

[Diagrams: HMM (states s1 … s4 over observations o1 … o4) and MEMM (tags DT NN VB IN over "The cat sat on")]
SLIDE 55

Bidirectionality

HMM: The/? old/? man/? the/? boat/?

[Diagram: HMM states s1 … s4 over observations o1 … o4]

Candidate tag sequences:
  P(JJ|DT) P(old|JJ) P(NN|JJ) P(man|NN) P(DT|NN)
  P(NN|DT) P(old|NN) P(VB|NN) P(man|VB) P(DT|VB)
SLIDE 56

Observation bias

SLIDE 57

Conditional Random Field (advanced)

  • Compute log-linear functions over cliques
  • Fewer independence assumptions
  • Ex: P(st|everything else) ∝ exp(w ⋅ f(st−1, st, st+1, O))

[Diagram: CRF over states s1 … s4 and observations o1 … o4]