SLIDE 1 复旦大学大数据学院
School of Data Science, Fudan University
DATA130006 Text Management and Analysis
Sequence Labeling
魏忠钰 (Zhongyu Wei)
November 15th, 2017
SLIDE 2
Natural Language Processing Startup
▪ 深度好奇
SLIDE 3
Joint Distributions
▪ A joint distribution over a set of random variables X1, …, Xn specifies a real number for each assignment (or outcome): P(X1 = x1, …, Xn = xn)
▪ Must obey: P(x1, …, xn) ≥ 0, and the probabilities of all assignments sum to 1
▪ Size of distribution if n variables with domain sizes d? d^n entries
▪ Impractical to write out!

T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3
SLIDE 4
Marginal Distributions
▪ Marginal distributions are sub-tables which eliminate variables
▪ Marginalization (summing out): combine collapsed rows by adding

T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

T     P          W     P
hot   0.5        sun   0.6
cold  0.5        rain  0.4
SLIDE 5
Conditional Probabilities
▪ A simple relation between joint and conditional probabilities:
P(a | b) = P(a, b) / P(b)
▪ In fact, this is taken as the definition of a conditional probability.

T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3
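The table manipulations on the last two slides can be sketched directly; the function names are my own:

```python
# Marginals and conditionals computed straight from the joint table.

JOINT = {("hot", "sun"): 0.4, ("hot", "rain"): 0.1,
         ("cold", "sun"): 0.2, ("cold", "rain"): 0.3}

def marginal_w(w):
    """P(W = w) = sum over t of P(t, w)  (summing out T)."""
    return sum(p for (t, w2), p in JOINT.items() if w2 == w)

def conditional(w, t):
    """P(w | t) = P(t, w) / P(t)."""
    p_t = sum(p for (t2, _), p in JOINT.items() if t2 == t)
    return JOINT[(t, w)] / p_t

# marginal_w("sun") ≈ 0.6; conditional("sun", "hot") ≈ 0.8
```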
SLIDE 6
Conditional Independence
▪ Unconditional (absolute) independence is very rare
▪ Conditional independence is our most basic and robust form of knowledge about uncertain environments
▪ X is conditionally independent of Y given Z if and only if:
P(x, y | z) = P(x | z) P(y | z)   for all x, y, z
▪ Or, equivalently, if and only if:
P(x | z, y) = P(x | z)   for all x, y, z
SLIDE 7
Outline
▪ Markov Model ▪ Hidden Markov Model ▪ Hidden Markov Model for Sequence Labeling ▪ Maximum Entropy Markov Model for Sequence Labeling
SLIDE 8
Markov Model
▪ Value of X at a given time is called the state
▪ Parameters: called transition probabilities or dynamics, specify how the state evolves over time (also, initial state probabilities)
▪ Stationarity assumption: transition probabilities are the same at all times

[Figure: Markov chain X1 → X2 → X3 → X4]
SLIDE 9
Joint Distribution of a Markov Model

[Figure: Markov chain X1 → X2 → X3 → X4]

▪ Joint distribution: P(X1, X2, X3, X4) = P(X1) P(X2 | X1) P(X3 | X2) P(X4 | X3)
▪ More generally: P(X1, …, XT) = P(X1) ∏_{t=2}^{T} P(Xt | Xt-1)
SLIDE 10
Example Markov Chain: Weather
▪ States: X = {rain, sun}
▪ Initial distribution: 1.0 sun
▪ CPT P(Xt | Xt-1):

Xt-1  Xt    P(Xt | Xt-1)
sun   sun   0.9
sun   rain  0.1
rain  sun   0.3
rain  rain  0.7

Two ways of representing the same CPT: the table above, or a state diagram with transition arcs (sun→sun 0.9, sun→rain 0.1, rain→sun 0.3, rain→rain 0.7).
SLIDE 11
Mini-Forward Algorithm
▪ Question: What's P(X) on some day t?
▪ Forward simulation: P(x1) known; P(xt) = Σ_{xt-1} P(xt | xt-1) P(xt-1)

[Figure: Markov chain X1 → X2 → X3 → X4]
SLIDE 12 Example Run of Forward Algorithm
▪ From initial observation of sun
▪ From initial observation of rain
▪ From yet another initial distribution P(X1):

[Figure: the sequences of distributions P(X1), P(X2), P(X3), P(X4), …, P(X∞) computed under each initial condition, using the CPT below]

Xt-1  Xt    P(Xt | Xt-1)
sun   sun   0.9
sun   rain  0.1
rain  sun   0.3
rain  rain  0.7
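The forward simulation above can be sketched directly from the CPT (plain Python, names are my own):

```python
# Mini-forward algorithm for the weather Markov chain (CPT from the slide).
# p[x] holds P(X_t = x); each step sums over the previous state.

TRANS = {  # TRANS[prev][nxt] = P(X_t = nxt | X_{t-1} = prev)
    "sun": {"sun": 0.9, "rain": 0.1},
    "rain": {"sun": 0.3, "rain": 0.7},
}

def forward_step(p):
    """One step of P(x_t) = sum over x_{t-1} of P(x_t | x_{t-1}) P(x_{t-1})."""
    return {nxt: sum(p[prev] * TRANS[prev][nxt] for prev in p)
            for nxt in ("sun", "rain")}

def simulate(p0, steps):
    p = dict(p0)
    for _ in range(steps):
        p = forward_step(p)
    return p

p1 = {"sun": 1.0, "rain": 0.0}   # initial observation of sun
p2 = forward_step(p1)            # P(X2 = sun) = 0.9
p3 = forward_step(p2)            # P(X3 = sun) = 0.9*0.9 + 0.1*0.3 = 0.84
p_inf = simulate(p1, 200)        # approaches the stationary distribution
```

Running from either initial observation, the simulated distribution converges to the same limit, which is the stationary distribution of the next slides.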
SLIDE 13 Stationary Distributions
▪ Stationary distribution:
  ▪ The distribution we end up with is called the stationary distribution P∞ of the chain
  ▪ It satisfies: P∞(X) = P∞+1(X) = Σ_x P(X | x) P∞(x)
▪ For most chains:
  ▪ Influence of the initial distribution gets less and less over time
  ▪ The distribution we end up in is independent of the initial distribution
SLIDE 14
Xt-1  Xt    P(Xt | Xt-1)
sun   sun   0.9
sun   rain  0.1
rain  sun   0.3
rain  rain  0.7
Example: Stationary Distributions
▪ Question: What's P(X) at time t = infinity?

P∞(sun) = P(sun | sun) P∞(sun) + P(sun | rain) P∞(rain) = 0.9 P∞(sun) + 0.3 P∞(rain)
P∞(rain) = P(rain | sun) P∞(sun) + P(rain | rain) P∞(rain) = 0.1 P∞(sun) + 0.7 P∞(rain)

Also: P∞(sun) + P∞(rain) = 1, which gives P∞(sun) = 3/4 and P∞(rain) = 1/4.
SLIDE 15
Stationary Distribution for Web Link Analysis
▪ PageRank over a web graph
  ▪ Each web page is a state
  ▪ Initial distribution: uniform over pages
  ▪ Transitions:
    ▪ With prob. c, uniform jump to a random page (dotted lines, not all shown)
    ▪ With prob. 1-c, follow a random outlink (solid lines)
▪ Stationary distribution
  ▪ Will spend more time on highly reachable pages
  ▪ Somewhat robust to link spam
  ▪ Google 1.0 returned the set of pages containing all your keywords, in decreasing rank; now all search engines use link analysis along with many other factors (rank actually getting less important over time)
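The two-part transition (uniform jump with prob. c, random outlink with prob. 1-c) can be sketched as power iteration; the three-page graph below is a made-up illustration, not from the slides:

```python
# Power-iteration sketch of PageRank as a stationary distribution.

def pagerank(links, c=0.15, iters=100):
    """links[p] = list of outlinks of page p; c = prob. of a uniform jump."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}    # initial distribution: uniform
    for _ in range(iters):
        new = {p: c / n for p in pages}   # random-jump mass
        for p in pages:
            share = (1 - c) * rank[p] / len(links[p])
            for q in links[p]:            # follow a random outlink
                new[q] += share
        rank = new
    return rank

links = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
ranks = pagerank(links)
# "A" is the most reachable page here, so it gets the highest rank
```

This sketch assumes every page has at least one outlink; a real implementation also handles dangling pages.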
SLIDE 16
Text as a Graph
▪ Nodes stand for sentences
▪ Edges stand for similarity
SLIDE 17
Centrality-based Summarization
▪ Assumption: the centrality of a node is an indication of its importance
▪ Representation: connectivity matrix based on inter-sentence cosine similarity
▪ Extraction mechanism:
  ▪ Compute the PageRank score for every sentence u
  ▪ Extract the k sentences with the highest PageRank scores
SLIDE 18
Outline
▪ Markov Model ▪ Hidden Markov Model ▪ Hidden Markov Model for Sequence Labeling ▪ Maximum Entropy Markov Model for Sequence Labeling
SLIDE 19
Hidden Markov Model ▪ Hidden Markov models (HMMs)
▪ Underlying Markov chain over states X
▪ You observe outputs (effects) at each time step

[Figure: hidden chain X1 → X2 → X3 → X4 → X5, each Xt emitting observation Et]
SLIDE 20 Example: Weather HMM
Rt   Rt+1  P(Rt+1 | Rt)
+r   +r    0.7
+r   -r    0.3
-r   +r    0.3
-r   -r    0.7

Rt   Ut   P(Ut | Rt)
+r   +u   0.9
+r   -u   0.1
-r   +u   0.2
-r   -u   0.8

[Figure: dynamic Bayes net Rain(t-1) → Rain(t) → Rain(t+1), each Rain(t) emitting Umbrella(t)]

▪ An HMM is defined by:
  ▪ Initial distribution: P(X1)
  ▪ Transitions: P(Xt | Xt-1)
  ▪ Emissions: P(Et | Xt)
SLIDE 21
Conditional Independence
▪ HMMs have two important independence properties:
  ▪ Markov hidden process: future depends on past via the present
  ▪ Current observation independent of all else given current state
▪ Does this mean that evidence variables are guaranteed to be independent?
  ▪ [No, they tend to be correlated by the hidden state]

[Figure: hidden chain X1 → X2 → X3 → X4 → X5 with observations E1 … E5]
SLIDE 22
Chain Rule and HMMs
▪ From the chain rule, every joint distribution over X1, E1, …, XT, ET can be written as:
P(X1, E1, …, XT, ET) = P(X1) P(E1 | X1) ∏_{t=2}^{T} P(Xt | X1, …, Xt-1, E1, …, Et-1) P(Et | X1, …, Xt, E1, …, Et-1)
▪ Assuming that for all t:
  ▪ State is independent of all past states and all past evidence given the previous state, i.e.: P(Xt | X1, …, Xt-1, E1, …, Et-1) = P(Xt | Xt-1)
  ▪ Evidence is independent of all past states and all past evidence given the current state, i.e.: P(Et | X1, …, Xt, E1, …, Et-1) = P(Et | Xt)
▪ So, we have:
P(X1, E1, …, XT, ET) = P(X1) P(E1 | X1) ∏_{t=2}^{T} P(Xt | Xt-1) P(Et | Xt)
SLIDE 23
Tasks for HMM
▪ Filtering
  ▪ Computing the belief state (the posterior distribution over the most recent state), given all evidence to date
  ▪ P(Xt | e1:t)
▪ Prediction
  ▪ Computing the posterior distribution over a future state, given all evidence to date
  ▪ P(Xt+k | e1:t), for k > 0
▪ Smoothing
  ▪ Computing the posterior distribution over a past state, given all evidence up to the present
  ▪ P(Xk | e1:t), for 1 ≤ k < t
▪ Most Likely Explanation
  ▪ Given a sequence of observations, find the sequence of states that is most likely to have generated those observations
SLIDE 24
Real HMM Examples ▪ Speech recognition HMMs:
▪ Observations are acoustic signals (continuous valued) ▪ States are specific positions in specific words (so, tens of thousands)
▪ Machine translation HMMs:
▪ Observations are words (tens of thousands) ▪ States are translation options
▪ Robot tracking:
▪ Observations are range readings (continuous) ▪ States are positions on a map (continuous)
SLIDE 25
Filtering / Monitoring
▪ Filtering, or monitoring, is the task of tracking the distribution Bt(X) = P(Xt | e1, …, et) (the belief state) over time
▪ We start with B1(X) in an initial setting, usually uniform
▪ As time passes, or we get observations, we update B(X)
▪ The Kalman filter was invented in the 1960s and first implemented as a method of trajectory estimation for the Apollo program
SLIDE 26
Inference: Base Cases

[Figure: two base cases, an observation update (E1, X1) and a time update (X1 → X2)]

P(x1 | e1) = P(x1) P(e1 | x1) / P(e1)
P(x2) = Σ_{x1} P(x1) P(x2 | x1)
SLIDE 27
Passage of Time
▪ Assume we have current belief P(X | evidence to date): B(Xt) = P(Xt | e1:t)
▪ Then, after one time step passes:
P(Xt+1 | e1:t) = Σ_{xt} P(Xt+1 | xt) P(xt | e1:t)
▪ Or compactly: B'(Xt+1) = Σ_{xt} P(Xt+1 | xt) B(xt)
SLIDE 28
Observation
▪ Assume we have current belief P(X | previous evidence): B'(Xt+1) = P(Xt+1 | e1:t)
▪ Then, after evidence comes in:
P(Xt+1 | e1:t+1) ∝ P(et+1 | Xt+1) P(Xt+1 | e1:t)
▪ Or, compactly: B(Xt+1) ∝ P(et+1 | Xt+1) B'(Xt+1)
SLIDE 29
The Forward Algorithm
▪ We are given evidence at each time and want to know: P(Xt | e1:t)
▪ We can derive the following update:
P(xt, e1:t) = P(et | xt) Σ_{xt-1} P(xt | xt-1) P(xt-1, e1:t-1)
We can normalize as we go if we want to have P(x|e) at each time step, or just once at the end…
SLIDE 30
Online Belief Updates
▪ Every time step, we start with current P(X | evidence)
▪ We update for time: P(xt | e1:t-1) = Σ_{xt-1} P(xt-1 | e1:t-1) P(xt | xt-1)
▪ We update for evidence: P(xt | e1:t) ∝ P(xt | e1:t-1) P(et | xt)
▪ The forward algorithm does both at once
SLIDE 31
In-class Quiz

Rt   Rt+1  P(Rt+1 | Rt)
+r   +r    0.7
+r   -r    0.3
-r   +r    0.3
-r   -r    0.7

Rt   Ut   P(Ut | Rt)
+r   +u   0.9
+r   -u   0.1
-r   +u   0.2
-r   -u   0.8

[Figure: Rain0 → Rain1 → Rain2, observing Umbrella1 and Umbrella2; B(+r) = 0.5, B(-r) = 0.5 at time 0; fill in B(+r), B(-r) at times 1 and 2]
SLIDE 32
Quiz: Weather HMM

Rt   Rt+1  P(Rt+1 | Rt)
+r   +r    0.7
+r   -r    0.3
-r   +r    0.3
-r   -r    0.7

Rt   Ut   P(Ut | Rt)
+r   +u   0.9
+r   -u   0.1
-r   +u   0.2
-r   -u   0.8

[Figure: Rain0 → Rain1 → Rain2, observing Umbrella1 and Umbrella2]

B(+r) = 0.5,    B(-r) = 0.5      (time 0)
B'(+r) = 0.5,   B'(-r) = 0.5     (after time update)
B(+r) = 0.818,  B(-r) = 0.182    (after observing Umbrella1)
B'(+r) = 0.627, B'(-r) = 0.373   (after time update)
B(+r) = 0.883,  B(-r) = 0.117    (after observing Umbrella2)
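The beliefs on this slide can be reproduced by alternating the time and observation updates; a minimal sketch:

```python
# Forward algorithm (time update + observation update) on the umbrella HMM.

TRANS = {"+r": {"+r": 0.7, "-r": 0.3}, "-r": {"+r": 0.3, "-r": 0.7}}
EMIT = {"+r": {"+u": 0.9, "-u": 0.1}, "-r": {"+u": 0.2, "-u": 0.8}}

def time_update(b):
    """B'(x_{t+1}) = sum over x_t of P(x_{t+1} | x_t) B(x_t)."""
    return {s: sum(b[p] * TRANS[p][s] for p in b) for s in ("+r", "-r")}

def observe(b, e):
    """B(x) proportional to P(e | x) B'(x), then renormalized."""
    unnorm = {s: EMIT[s][e] * b[s] for s in b}
    z = sum(unnorm.values())
    return {s: v / z for s, v in unnorm.items()}

belief = {"+r": 0.5, "-r": 0.5}
for evidence in ["+u", "+u"]:
    belief = observe(time_update(belief), evidence)
# belief["+r"] ≈ 0.883 after seeing an umbrella on both days
```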
SLIDE 33
Most Likely Explanation
SLIDE 34
HMMs: MLE Queries
▪ HMMs defined by
  ▪ States X
  ▪ Observations E
  ▪ Initial distribution: P(X1)
  ▪ Transitions: P(Xt | Xt-1)
  ▪ Emissions: P(Et | Xt)
▪ New query: most likely explanation: argmax_{x1:t} P(x1:t | e1:t)

[Figure: hidden chain X1 → X2 → X3 → X4 with observations E1 … E4]
SLIDE 35
HMMs: MLE Queries
▪ Graph of states and transitions over time (a trellis)
▪ Each arc represents some transition xt-1 → xt
▪ Each arc has weight P(xt | xt-1) P(et | xt)
▪ Each path is a sequence of states
▪ The product of weights on a path is that sequence's probability along with the evidence
▪ The forward algorithm computes sums over paths; Viterbi computes best paths

[Figure: trellis with states sun/rain at each time step]
SLIDE 36
HMMs: MLE Queries
[Figure: trellis with states sun/rain at each time step]

Forward Algorithm (Sum):  f_t(x_t) = P(e_t | x_t) Σ_{x_{t-1}} P(x_t | x_{t-1}) f_{t-1}(x_{t-1})
Viterbi Algorithm (Max):  m_t(x_t) = P(e_t | x_t) max_{x_{t-1}} P(x_t | x_{t-1}) m_{t-1}(x_{t-1})
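The max recurrence with backpointers can be sketched on the umbrella HMM from the earlier slides (function names are my own):

```python
# Viterbi: same trellis as the forward algorithm, with max instead of sum.

TRANS = {"+r": {"+r": 0.7, "-r": 0.3}, "-r": {"+r": 0.3, "-r": 0.7}}
EMIT = {"+r": {"+u": 0.9, "-u": 0.1}, "-r": {"+u": 0.2, "-u": 0.8}}
PRIOR = {"+r": 0.5, "-r": 0.5}

def viterbi(evidence):
    """Most likely state sequence given the observations."""
    states = list(PRIOR)
    m = {s: PRIOR[s] * EMIT[s][evidence[0]] for s in states}  # init
    back = []
    for e in evidence[1:]:
        # best predecessor for each state, then extend the max-probability path
        prev_best = {s: max(states, key=lambda p: m[p] * TRANS[p][s])
                     for s in states}
        m = {s: m[prev_best[s]] * TRANS[prev_best[s]][s] * EMIT[s][e]
             for s in states}
        back.append(prev_best)
    best = max(states, key=m.get)
    path = [best]
    for bp in reversed(back):     # follow backpointers
        path.append(bp[path[-1]])
    return list(reversed(path))
```

For example, two umbrella days give the sequence rain, rain.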
SLIDE 37
Outline
▪ Markov Model ▪ Hidden Markov Model ▪ Hidden Markov Model for Sequence Labeling ▪ Maximum Entropy Markov Model for Sequence Labeling
SLIDE 38 Sequence problems
▪ Many problems in NLP have data which is a sequence of characters, words, phrases, lines, or sentences …
▪ We can think of our task as one of labeling each item

POS tagging:
VBG      NN           IN   DT   NN    IN   NN
Chasing  opportunity  in   an   age   of   upheaval

Word segmentation:
B  B  I  I  B  I  B  I  B  B
而 相 对 于 这 些 品 牌 的 价

Named entity recognition:
PERS     O          O       O   ORG   ORG
Murdoch  discusses  future  of  News  Corp.

Text segmentation:
Q  A  Q  A  A  A  Q  A
SLIDE 39
HMM for sequence labeling
▪ How likely is it that a sequence of observations is generated?
  ▪ Forward algorithm
▪ Given a sequence of observations, what is the most likely hidden state sequence?
  ▪ Viterbi algorithm
SLIDE 40 Trigram Hidden Markov Models (Trigram HMMs)
▪ A trigram HMM consists of a finite set V of possible words, and a finite set K of possible tags.
▪ Transitions: for any tag trigram (u, v, s), we have the probability of seeing the tag s immediately after the tag bigram (u, v): q(s | u, v)
▪ Emissions: for any word x and tag s, we have the probability of seeing the observation word x paired with state s: e(x | s)
SLIDE 41
Trigram Hidden Markov Models (Trigram HMMs)
▪ Define S to be the set of all word-sequence/tag-sequence pairs
⟨x1 … xn, y1, y2 … yn+1⟩, in which yn+1 = STOP.
▪ The probability of seeing such a pair can thus be written as:
p(x1 … xn, y1 … yn+1) = ∏_{i=1}^{n+1} q(yi | yi-2, yi-1) × ∏_{i=1}^{n} e(xi | yi)
▪ And we have y-1 = y0 = *, where * is the start state.
SLIDE 42
Trigram Hidden Markov Models (Trigram HMMs) - Example
IO encoding:
Fred/PER showed/O Sue/PER Mengqiu/PER Huang/PER 's/O STOP

p = q(PER | *, *) × q(O | *, PER) × q(PER | PER, O) × q(PER | O, PER) × q(PER | PER, PER) × q(O | PER, PER) × q(STOP | PER, O)
  × e(Fred | PER) × e(showed | O) × e(Sue | PER) × e(Mengqiu | PER) × e(Huang | PER) × e('s | O)
SLIDE 43
Sequence Generation Process
SLIDE 44
In-class Quiz
[Figure: trellis * → {PER, O} → {PER, O} → STOP for the sentence "Fred shouts"]

Tag   Word     e(word | tag)
PER   Fred     0.91
PER   shouts   0.09
O     Fred     0.12
O     shouts   0.88

yi-1  yi     q(yi | yi-1)
PER   PER    0.71
PER   O      0.09
O     O      0.12
O     PER    0.88
*     PER    0.5
*     O      0.5
PER   STOP   0.2
O     STOP   0.4

p(Fred shouts, PER O STOP) = ?
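One way to check an answer to this quiz: score the pair directly. This sketch treats the quiz's model as a bigram HMM and reads each transition row as q(next | prev), following the convention of the earlier weather CPT; the function name is my own:

```python
# Scoring p(words, tags) for a bigram HMM:
# product of q(y_i | y_{i-1}) transitions and e(x_i | y_i) emissions,
# ending with the q(STOP | last tag) factor.

Q = {("*", "PER"): 0.5, ("*", "O"): 0.5,
     ("PER", "PER"): 0.71, ("PER", "O"): 0.09, ("PER", "STOP"): 0.2,
     ("O", "O"): 0.12, ("O", "PER"): 0.88, ("O", "STOP"): 0.4}
E = {("Fred", "PER"): 0.91, ("shouts", "PER"): 0.09,
     ("Fred", "O"): 0.12, ("shouts", "O"): 0.88}

def score(words, tags):
    p = 1.0
    prev = "*"
    for w, t in zip(words, tags):
        p *= Q[(prev, t)] * E[(w, t)]   # transition then emission
        prev = t
    return p * Q[(prev, "STOP")]        # final STOP transition

p = score(["Fred", "shouts"], ["PER", "O"])
```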
SLIDE 45
Parameter Estimation
q_ML(s | u, v) = c(u, v, s) / c(u, v)
e(x | s) = c(x, s) / c(s)

▪ For back-off (interpolation) smoothing:
q(s | u, v) = λ1 × q_ML(s | u, v) + λ2 × q_ML(s | v) + λ3 × q_ML(s)
λ1 + λ2 + λ3 = 1
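Counting-based MLE for the transition parameters can be sketched as follows; the two-sequence tag corpus is a made-up illustration:

```python
# MLE estimation of trigram transition probabilities by counting:
# q_ML(s | u, v) = c(u, v, s) / c(u, v), with *, * padding and a final STOP.

from collections import Counter

def estimate_q(tag_seqs):
    tri, bi = Counter(), Counter()
    for tags in tag_seqs:
        padded = ["*", "*"] + list(tags) + ["STOP"]
        for i in range(2, len(padded)):
            u, v, s = padded[i - 2], padded[i - 1], padded[i]
            tri[(u, v, s)] += 1
            bi[(u, v)] += 1
    return lambda s, u, v: tri[(u, v, s)] / bi[(u, v)]

q = estimate_q([["PER", "O", "PER"], ["PER", "PER", "O"]])
# q("O" | "*", "PER") = 1/2: after the bigram (*, PER) we saw O once, PER once
```

Emission probabilities e(x | s) = c(x, s) / c(s) can be estimated the same way from word/tag pairs.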
SLIDE 46
Decoding with HMMs
▪ HMMs defined by
  ▪ States X
  ▪ Observations E
  ▪ Initial distribution: P(X1)
  ▪ Transitions: P(Xt | Xt-1)
  ▪ Emissions: P(Et | Xt)
▪ New query: most likely explanation: argmax_{x1:t} P(x1:t | e1:t)
▪ New method: the Viterbi algorithm

[Figure: hidden chain X1 → X2 → X3 → X4 with observations E1 … E4]
SLIDE 47
Decoding for Trigram HMMs
▪ Finding the most likely tag sequence for an input sentence x1, x2, … xn:
arg max_{y1 … yn+1} p(x1 … xn, y1 … yn+1)
▪ For example, for "Fred showed Sue Mengqiu Huang's new painting"
▪ If we have 2 tags, how many candidate tag sequences are there? (2^n for an n-word sentence)
SLIDE 48 Greedy Decoding
▪ Greedy inference:
  ▪ We just start at the left, and use our classifier at each position to assign a label
  ▪ The classifier can depend on previous labeling decisions as well as observed data
▪ Advantages:
  ▪ Fast, no extra memory requirements
  ▪ Very easy to implement
  ▪ With rich features including observations to the right, it may perform quite well
▪ Disadvantage:
  ▪ Greedy. We may commit errors we cannot recover from

p(x1 … xn, y1 … yn+1) = ∏_{i=1}^{n+1} q(yi | yi-2, yi-1) × ∏_{i=1}^{n} e(xi | yi)
SLIDE 49
Probability of generating sub-sequence
▪ Suppose S(k, u, v) is the set of all tag sequences y-1, y0, y1 … yk of length k which end in the tag bigram (u, v), i.e. yk-1 = u and yk = v. We have:
α(k, u, v) = Σ_{⟨y-1 … yk⟩ ∈ S(k, u, v)} ∏_{i=1}^{k} q(yi | yi-2, yi-1) e(xi | yi)
SLIDE 50
Beam Inference
▪ Beam inference:
  ▪ At each position keep the top m complete sequences
  ▪ Extend each sequence in each local way
  ▪ The extensions compete for the m slots at the next position
▪ Advantages:
  ▪ Fast; beam sizes of 3–5 are almost as good as exact inference in many cases
  ▪ Easy to implement
▪ Disadvantage:
  ▪ Inexact: the globally best sequence can fall off the beam
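A beam decoder can be sketched the same way, again using the quiz's bigram model for illustration (tables and names are my own):

```python
# Beam inference with beam size m: keep the top-m partial sequences,
# extend each locally, and let the extensions compete for the m slots.

Q = {("*", "PER"): 0.5, ("*", "O"): 0.5,
     ("PER", "PER"): 0.71, ("PER", "O"): 0.09,
     ("O", "O"): 0.12, ("O", "PER"): 0.88}
E = {("Fred", "PER"): 0.91, ("shouts", "PER"): 0.09,
     ("Fred", "O"): 0.12, ("shouts", "O"): 0.88}
TAGS = ["PER", "O"]

def beam_decode(words, m=2):
    beam = [(1.0, [])]                  # (score, partial tag sequence)
    for w in words:
        candidates = []
        for score, tags in beam:
            prev = tags[-1] if tags else "*"
            for t in TAGS:
                candidates.append((score * Q[(prev, t)] * E[(w, t)], tags + [t]))
        beam = sorted(candidates, reverse=True)[:m]   # top-m survive
    return beam[0][1]
```

With m = 1 this reduces to greedy decoding; with m large enough to cover all states it matches exact inference.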
SLIDE 51
Forward Algorithm for probability computation
▪ For any position k in the sequence, we can use the following recursive definition:
α(k, u, v) = Σ_{w ∈ K_{k-2}} α(k-1, w, u) × q(v | w, u) × e(xk | v)
▪ K_k is the set of allowable tags at position k in the sequence.
SLIDE 52
Viterbi for HMM decoding to compute the highest probability

π(k, u, v) = max_{w ∈ K_{k-2}} π(k-1, w, u) × q(v | w, u) × e(xk | v)
SLIDE 53
Viterbi for HMM decoding to choose the best sequence

bp(k, u, v) = arg max_{w ∈ K_{k-2}} π(k-1, w, u) × q(v | w, u) × e(xk | v)
SLIDE 54
Viterbi Inference
▪ Viterbi inference:
  ▪ Dynamic programming
  ▪ Requires a small window of state influence (e.g., past two states are relevant)
▪ Advantage:
  ▪ Exact: the globally best sequence is returned
▪ Disadvantage:
  ▪ Harder to implement long-distance state-state interactions
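A sketch of trigram Viterbi with backpointers, checked against brute-force enumeration over all tag sequences. The tag set, words, and probability tables below are made up for illustration, and the q(STOP | …) factor is omitted for brevity:

```python
# Trigram Viterbi: pi(k, u, v) = max_w pi(k-1, w, u) q(v | w, u) e(x_k | v).

from itertools import product
import random

TAGS = ["A", "B"]
WORDS = ["w1", "w2", "w3"]

random.seed(0)
Q, E = {}, {}
for u, v in product(TAGS + ["*"], repeat=2):   # toy q(s | u, v), normalized
    weights = [random.random() for _ in TAGS]
    for s, w in zip(TAGS, weights):
        Q[(u, v, s)] = w / sum(weights)
for s in TAGS:                                 # toy e(x | s), normalized
    weights = [random.random() for _ in WORDS]
    for x, w in zip(WORDS, weights):
        E[(x, s)] = w / sum(weights)

def joint(words, tags):
    """prod_i q(y_i | y_{i-2}, y_{i-1}) e(x_i | y_i), with y_-1 = y_0 = *."""
    padded = ["*", "*"] + list(tags)
    p = 1.0
    for i, word in enumerate(words):
        p *= Q[(padded[i], padded[i + 1], padded[i + 2])] * E[(word, tags[i])]
    return p

def brute_force(words):
    return max(product(TAGS, repeat=len(words)),
               key=lambda tags: joint(words, tags))

def viterbi(words):
    n = len(words)
    K = lambda k: ["*"] if k <= 0 else TAGS    # allowable tags at position k
    pi = {("*", "*"): 1.0}
    bps = []
    for k in range(1, n + 1):
        new_pi, bp = {}, {}
        for u in K(k - 1):
            for v in K(k):
                score, w = max((pi[(w, u)] * Q[(w, u, v)] * E[(words[k - 1], v)], w)
                               for w in K(k - 2))
                new_pi[(u, v)], bp[(u, v)] = score, w
        pi = new_pi
        bps.append(bp)
    u, v = max(pi, key=pi.get)
    tags = [u, v]
    for bp in reversed(bps[1:]):               # follow backpointers
        tags.insert(0, bp[(tags[0], tags[1])])
    return tags[-n:]                           # drop the * padding
```

The dynamic program touches O(n · |K|^3) entries instead of enumerating |K|^n sequences.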
SLIDE 55
Outline
▪ Markov Model ▪ Hidden Markov Model ▪ Hidden Markov Model for Sequence Labeling ▪ Maximum Entropy Markov Model for Sequence Labeling
SLIDE 56
Conditional Markov Models
▪ Hidden Markov Model: generative model of p(E, X)
▪ Conditional Markov Model: discriminative model of P(X | E)
▪ Also known as a Maximum Entropy Markov Model (MEMM)

[Figure: HMM (states emit observations) vs. CMM (observations condition states)]
SLIDE 57
HMM VS CMM (MEMM)
▪ Hidden Markov Model: generative model, p(E, X) = ∏_t P(xt | xt-1) P(et | xt)
▪ Conditional Markov Model: discriminative model, P(X | E) = ∏_t P(xt | xt-1, et)
SLIDE 58 MEMM inference in systems
▪ For a Conditional Markov Model (CMM), a.k.a. a Maximum Entropy Markov Model (MEMM), the classifier makes a single decision at a time, conditioned on evidence from observations and previous decisions

Decision point (predicting the tag of "22.6"):
Position:  -3   -2   -1    0     +1
Word:      The  Dow  fell  22.6  %
Tag:       DT   NNP  VBD   ???   ???

Local context features:
W0        22.6
W+1       %
W-1       fell
T-1       VBD
T-1-T-2   NNP-VBD
hasDigit? true
…

(Ratnaparkhi 1996; Toutanova et al. 2003, etc.)
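The slide's local context can be reproduced by a small feature-extraction function; the sentinel values (`<S>`, `</S>`) and the exact feature names are my own:

```python
# Local features for one MEMM decision: any observed words may be used,
# but only the tags already assigned to earlier positions.

def local_features(words, prev_tags, i):
    return {
        "W0": words[i],
        "W-1": words[i - 1] if i > 0 else "<S>",
        "W+1": words[i + 1] if i + 1 < len(words) else "</S>",
        "T-1": prev_tags[-1] if prev_tags else "<S>",
        "T-1-T-2": "-".join(prev_tags[-2:]) if len(prev_tags) >= 2 else "<S>",
        "hasDigit?": any(ch.isdigit() for ch in words[i]),
    }

feats = local_features(["The", "Dow", "fell", "22.6", "%"],
                       ["DT", "NNP", "VBD"], 3)
# {'W0': '22.6', 'W-1': 'fell', 'W+1': '%', 'T-1': 'VBD',
#  'T-1-T-2': 'NNP-VBD', 'hasDigit?': True}
```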
SLIDE 59
Example: POS Tagging
▪ Scoring individual labeling decisions is no more complex than standard classification decisions
▪ We have some assumed labels to use for prior positions
▪ We use features of those and the observed data (which can include current, previous, and next words) to predict the current label

(Same decision point and local-context features as on slide 58; Ratnaparkhi 1996; Toutanova et al. 2003, etc.)
SLIDE 60 Example: POS Tagging
▪ POS tagging features can include:
  ▪ Current, previous, next words in isolation or together
  ▪ Previous one, two, three tags
  ▪ Word-internal features: word types, suffixes, dashes, etc.

(Same decision point and local-context features as on slide 58; Ratnaparkhi 1996; Toutanova et al. 2003, etc.)
SLIDE 61
Inference in Systems

[Figure: two-level architecture. Local level: Local Data → Feature Extraction → Features → Classifier (Maximum Entropy Models; smoothing via quadratic penalties; optimization via conjugate gradient) → Label. Sequence level: Sequence Data → local classifications → Sequence Model Inference → Label]
SLIDE 62
Features for sequence labeling
▪ Words
▪ Current word (essentially like a learned dictionary) ▪ Previous/next word (context)
▪ Other kinds of inferred linguistic classification
▪ Part-of-speech tags
▪ Label context
▪ Previous (and perhaps next) label
SLIDE 63
Features: Word Substrings

[Figure: counts of how often words containing a given substring fall into each class (drug, company, movie, place, person); e.g. the substring "field" selects mostly place names (Wethersfield), ":" selects mostly movie titles (Alien Fury: Countdown to Invasion), and Cotrimoxazole is a drug example]
SLIDE 64
Features: Word shapes
▪ Word Shapes
▪ Map words to simplified representation that encodes attributes such as length, capitalization, numerals, Greek letters, internal punctuation, etc.
Varicella-zoster → Xx-xxx
mRNA → xXXX
CPA1 → XXXd
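One simple variant of the mapping can be sketched as follows. The exact run-collapsing rule differs between implementations (the slide's Xx-xxx for Varicella-zoster suggests collapsing long runs); this sketch maps character classes and caps each run at three:

```python
# Word shape: uppercase -> X, lowercase -> x, digit -> d, other characters
# kept as-is, with runs longer than max_run collapsed.

def word_shape(word, max_run=3):
    shape = []
    for ch in word:
        if ch.isupper():
            cls = "X"
        elif ch.islower():
            cls = "x"
        elif ch.isdigit():
            cls = "d"
        else:
            cls = ch
        if shape[-max_run:] != [cls] * max_run:   # cap run length
            shape.append(cls)
    return "".join(shape)

# word_shape("mRNA") == "xXXX"; word_shape("CPA1") == "XXXd"
```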