SLIDE 1 复旦大学大数据学院
School of Data Science, Fudan University
DATA130006 Text Management and Analysis
Sequence Labeling
魏忠钰 (Zhongyu Wei)
November 15th, 2017
SLIDE 2
Natural Language Processing Startup
▪ 深度好奇
SLIDE 3
Joint Distributions
▪ A joint distribution over a set of random variables X1, …, Xn specifies a real number for each assignment (or outcome): P(X1 = x1, …, Xn = xn)
▪ Must obey: P(x1, …, xn) ≥ 0, and the probabilities of all assignments sum to 1
▪ Size of distribution if n variables with domain sizes d? d^n entries
▪ Impractical to write out!

T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3
SLIDE 4
Marginal Distributions
▪ Marginal distributions are sub-tables which eliminate variables
▪ Marginalization (summing out): combine collapsed rows by adding

T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

T     P          W     P
hot   0.5        sun   0.6
cold  0.5        rain  0.4
SLIDE 5
Conditional Probabilities
▪ A simple relation between joint and conditional probabilities:
P(a | b) = P(a, b) / P(b)
▪ In fact, this is taken as the definition of a conditional probability.

T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3
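The table manipulations on the last two slides can be sketched directly; the function names are my own:

```python
# Marginals and conditionals computed straight from the joint table.

JOINT = {("hot", "sun"): 0.4, ("hot", "rain"): 0.1,
         ("cold", "sun"): 0.2, ("cold", "rain"): 0.3}

def marginal_w(w):
    """P(W = w) = sum over t of P(t, w)  (summing out T)."""
    return sum(p for (t, w2), p in JOINT.items() if w2 == w)

def conditional(w, t):
    """P(w | t) = P(t, w) / P(t)."""
    p_t = sum(p for (t2, _), p in JOINT.items() if t2 == t)
    return JOINT[(t, w)] / p_t

# marginal_w("sun") ≈ 0.6; conditional("sun", "hot") ≈ 0.8
```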
SLIDE 6
Conditional Independence
▪ Unconditional (absolute) independence is very rare
▪ Conditional independence is our most basic and robust form of knowledge about uncertain environments
▪ X is conditionally independent of Y given Z if and only if:
P(x, y | z) = P(x | z) P(y | z)   for all x, y, z
▪ Or, equivalently, if and only if:
P(x | z, y) = P(x | z)   for all x, y, z
SLIDE 7
Outline
▪ Markov Model ▪ Hidden Markov Model ▪ Hidden Markov Model for Sequence Labeling ▪ Maximum Entropy Markov Model for Sequence Labeling
SLIDE 8
Markov Model
▪ Value of X at a given time is called the state
▪ Parameters: called transition probabilities or dynamics, specify how the state evolves over time (also, initial state probabilities)
▪ Stationarity assumption: transition probabilities are the same at all times

[Figure: Markov chain X1 → X2 → X3 → X4]
SLIDE 9
Joint Distribution of a Markov Model

[Figure: Markov chain X1 → X2 → X3 → X4]

▪ Joint distribution: P(X1, X2, X3, X4) = P(X1) P(X2 | X1) P(X3 | X2) P(X4 | X3)
▪ More generally: P(X1, …, XT) = P(X1) ∏_{t=2}^{T} P(Xt | Xt-1)
SLIDE 10
Example Markov Chain: Weather
▪ States: X = {rain, sun}
▪ Initial distribution: 1.0 sun
▪ CPT P(Xt | Xt-1):

Xt-1  Xt    P(Xt | Xt-1)
sun   sun   0.9
sun   rain  0.1
rain  sun   0.3
rain  rain  0.7

Two ways of representing the same CPT: the table above, or a state diagram with transition arcs (sun→sun 0.9, sun→rain 0.1, rain→sun 0.3, rain→rain 0.7).
SLIDE 11
Mini-Forward Algorithm
▪ Question: What's P(X) on some day t?
▪ Forward simulation: P(x1) known; P(xt) = Σ_{xt-1} P(xt | xt-1) P(xt-1)

[Figure: Markov chain X1 → X2 → X3 → X4]
SLIDE 12 Example Run of Forward Algorithm
▪ From initial observation of sun
▪ From initial observation of rain
▪ From yet another initial distribution P(X1):

[Figure: the sequences of distributions P(X1), P(X2), P(X3), P(X4), …, P(X∞) computed under each initial condition, using the CPT below]

Xt-1  Xt    P(Xt | Xt-1)
sun   sun   0.9
sun   rain  0.1
rain  sun   0.3
rain  rain  0.7
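The forward simulation above can be sketched directly from the CPT (plain Python, names are my own):

```python
# Mini-forward algorithm for the weather Markov chain (CPT from the slide).
# p[x] holds P(X_t = x); each step sums over the previous state.

TRANS = {  # TRANS[prev][nxt] = P(X_t = nxt | X_{t-1} = prev)
    "sun": {"sun": 0.9, "rain": 0.1},
    "rain": {"sun": 0.3, "rain": 0.7},
}

def forward_step(p):
    """One step of P(x_t) = sum over x_{t-1} of P(x_t | x_{t-1}) P(x_{t-1})."""
    return {nxt: sum(p[prev] * TRANS[prev][nxt] for prev in p)
            for nxt in ("sun", "rain")}

def simulate(p0, steps):
    p = dict(p0)
    for _ in range(steps):
        p = forward_step(p)
    return p

p1 = {"sun": 1.0, "rain": 0.0}   # initial observation of sun
p2 = forward_step(p1)            # P(X2 = sun) = 0.9
p3 = forward_step(p2)            # P(X3 = sun) = 0.9*0.9 + 0.1*0.3 = 0.84
p_inf = simulate(p1, 200)        # approaches the stationary distribution
```

Running from either initial observation, the simulated distribution converges to the same limit, which is the stationary distribution of the next slides.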
SLIDE 13 Stationary Distributions
▪ Stationary distribution:
  ▪ The distribution we end up with is called the stationary distribution P∞ of the chain
  ▪ It satisfies: P∞(X) = P∞+1(X) = Σ_x P(X | x) P∞(x)
▪ For most chains:
  ▪ Influence of the initial distribution gets less and less over time
  ▪ The distribution we end up in is independent of the initial distribution
SLIDE 14
Xt-1  Xt    P(Xt | Xt-1)
sun   sun   0.9
sun   rain  0.1
rain  sun   0.3
rain  rain  0.7
Example: Stationary Distributions
▪ Question: What's P(X) at time t = infinity?

P∞(sun) = P(sun | sun) P∞(sun) + P(sun | rain) P∞(rain) = 0.9 P∞(sun) + 0.3 P∞(rain)
P∞(rain) = P(rain | sun) P∞(sun) + P(rain | rain) P∞(rain) = 0.1 P∞(sun) + 0.7 P∞(rain)

Also: P∞(sun) + P∞(rain) = 1, which gives P∞(sun) = 3/4 and P∞(rain) = 1/4.
SLIDE 15
Stationary Distribution for Web Link Analysis
▪ PageRank over a web graph
  ▪ Each web page is a state
  ▪ Initial distribution: uniform over pages
  ▪ Transitions:
    ▪ With prob. c, uniform jump to a random page (dotted lines, not all shown)
    ▪ With prob. 1-c, follow a random outlink (solid lines)
▪ Stationary distribution
  ▪ Will spend more time on highly reachable pages
  ▪ Somewhat robust to link spam
  ▪ Google 1.0 returned the set of pages containing all your keywords, in decreasing rank; now all search engines use link analysis along with many other factors (rank actually getting less important over time)
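The two-part transition (uniform jump with prob. c, random outlink with prob. 1-c) can be sketched as power iteration; the three-page graph below is a made-up illustration, not from the slides:

```python
# Power-iteration sketch of PageRank as a stationary distribution.

def pagerank(links, c=0.15, iters=100):
    """links[p] = list of outlinks of page p; c = prob. of a uniform jump."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}    # initial distribution: uniform
    for _ in range(iters):
        new = {p: c / n for p in pages}   # random-jump mass
        for p in pages:
            share = (1 - c) * rank[p] / len(links[p])
            for q in links[p]:            # follow a random outlink
                new[q] += share
        rank = new
    return rank

links = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
ranks = pagerank(links)
# "A" is the most reachable page here, so it gets the highest rank
```

This sketch assumes every page has at least one outlink; a real implementation also handles dangling pages.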
SLIDE 16
Text as a Graph
▪ Nodes stand for sentences
▪ Edges stand for similarity
SLIDE 17
Centrality-based Summarization
▪ Assumption: the centrality of a node is an indication of its importance
▪ Representation: connectivity matrix based on inter-sentence cosine similarity
▪ Extraction mechanism:
  ▪ Compute the PageRank score for every sentence u
  ▪ Extract the k sentences with the highest PageRank scores
SLIDE 18
Outline
▪ Markov Model ▪ Hidden Markov Model ▪ Hidden Markov Model for Sequence Labeling ▪ Maximum Entropy Markov Model for Sequence Labeling
SLIDE 19
Hidden Markov Model ▪ Hidden Markov models (HMMs)
▪ Underlying Markov chain over states X
▪ You observe outputs (effects) at each time step

[Figure: hidden chain X1 → X2 → X3 → X4 → X5, each Xt emitting observation Et]
SLIDE 20 Example: Weather HMM
Rt   Rt+1  P(Rt+1 | Rt)
+r   +r    0.7
+r   -r    0.3
-r   +r    0.3
-r   -r    0.7

Rt   Ut   P(Ut | Rt)
+r   +u   0.9
+r   -u   0.1
-r   +u   0.2
-r   -u   0.8

[Figure: dynamic Bayes net Rain(t-1) → Rain(t) → Rain(t+1), each Rain(t) emitting Umbrella(t)]

▪ An HMM is defined by:
  ▪ Initial distribution: P(X1)
  ▪ Transitions: P(Xt | Xt-1)
  ▪ Emissions: P(Et | Xt)
SLIDE 21
Conditional Independence
▪ HMMs have two important independence properties:
  ▪ Markov hidden process: future depends on past via the present
  ▪ Current observation independent of all else given current state
▪ Does this mean that evidence variables are guaranteed to be independent?
  ▪ [No, they tend to be correlated by the hidden state]

[Figure: hidden chain X1 → X2 → X3 → X4 → X5 with observations E1 … E5]
SLIDE 22
Chain Rule and HMMs
▪ From the chain rule, every joint distribution over X1, E1, …, XT, ET can be written as:
P(X1, E1, …, XT, ET) = P(X1) P(E1 | X1) ∏_{t=2}^{T} P(Xt | X1, …, Xt-1, E1, …, Et-1) P(Et | X1, …, Xt, E1, …, Et-1)
▪ Assuming that for all t:
  ▪ State is independent of all past states and all past evidence given the previous state, i.e.: P(Xt | X1, …, Xt-1, E1, …, Et-1) = P(Xt | Xt-1)
  ▪ Evidence is independent of all past states and all past evidence given the current state, i.e.: P(Et | X1, …, Xt, E1, …, Et-1) = P(Et | Xt)
▪ So, we have:
P(X1, E1, …, XT, ET) = P(X1) P(E1 | X1) ∏_{t=2}^{T} P(Xt | Xt-1) P(Et | Xt)
SLIDE 23
Tasks for HMM
▪ Filtering
  ▪ Computing the belief state (the posterior distribution over the most recent state), given all evidence to date
  ▪ P(Xt | e1:t)
▪ Prediction
  ▪ Computing the posterior distribution over a future state, given all evidence to date
  ▪ P(Xt+k | e1:t), for k > 0
▪ Smoothing
  ▪ Computing the posterior distribution over a past state, given all evidence up to the present
  ▪ P(Xk | e1:t), for 1 ≤ k < t
▪ Most Likely Explanation
  ▪ Given a sequence of observations, find the sequence of states that is most likely to have generated those observations
SLIDE 24
Real HMM Examples ▪ Speech recognition HMMs:
▪ Observations are acoustic signals (continuous valued) ▪ States are specific positions in specific words (so, tens of thousands)
▪ Machine translation HMMs:
▪ Observations are words (tens of thousands) ▪ States are translation options
▪ Robot tracking:
▪ Observations are range readings (continuous) ▪ States are positions on a map (continuous)
SLIDE 25
Filtering / Monitoring
▪ Filtering, or monitoring, is the task of tracking the distribution Bt(X) = P(Xt | e1, …, et) (the belief state) over time
▪ We start with B1(X) in an initial setting, usually uniform
▪ As time passes, or we get observations, we update B(X)
▪ The Kalman filter was invented in the 1960s and first implemented as a method of trajectory estimation for the Apollo program
SLIDE 26
Inference: Base Cases

[Figure: two base cases, an observation update (E1, X1) and a time update (X1 → X2)]

P(x1 | e1) = P(x1) P(e1 | x1) / P(e1)
P(x2) = Σ_{x1} P(x1) P(x2 | x1)
SLIDE 27
Passage of Time
▪ Assume we have current belief P(X | evidence to date): B(Xt) = P(Xt | e1:t)
▪ Then, after one time step passes:
P(Xt+1 | e1:t) = Σ_{xt} P(Xt+1 | xt) P(xt | e1:t)
▪ Or compactly: B'(Xt+1) = Σ_{xt} P(Xt+1 | xt) B(xt)
SLIDE 28
Observation
▪ Assume we have current belief P(X | previous evidence): B'(Xt+1) = P(Xt+1 | e1:t)
▪ Then, after evidence comes in:
P(Xt+1 | e1:t+1) ∝ P(et+1 | Xt+1) P(Xt+1 | e1:t)
▪ Or, compactly: B(Xt+1) ∝ P(et+1 | Xt+1) B'(Xt+1)
SLIDE 29
The Forward Algorithm
▪ We are given evidence at each time and want to know: P(Xt | e1:t)
▪ We can derive the following update:
P(xt, e1:t) = P(et | xt) Σ_{xt-1} P(xt | xt-1) P(xt-1, e1:t-1)
We can normalize as we go if we want to have P(x|e) at each time step, or just once at the end…
SLIDE 30
Online Belief Updates
▪ Every time step, we start with current P(X | evidence)
▪ We update for time: P(xt | e1:t-1) = Σ_{xt-1} P(xt-1 | e1:t-1) P(xt | xt-1)
▪ We update for evidence: P(xt | e1:t) ∝ P(xt | e1:t-1) P(et | xt)
▪ The forward algorithm does both at once
SLIDE 31
In-class Quiz

Rt   Rt+1  P(Rt+1 | Rt)
+r   +r    0.7
+r   -r    0.3
-r   +r    0.3
-r   -r    0.7

Rt   Ut   P(Ut | Rt)
+r   +u   0.9
+r   -u   0.1
-r   +u   0.2
-r   -u   0.8

[Figure: Rain0 → Rain1 → Rain2, observing Umbrella1 and Umbrella2; B(+r) = 0.5, B(-r) = 0.5 at time 0; fill in B(+r), B(-r) at times 1 and 2]
SLIDE 32
Quiz: Weather HMM

Rt   Rt+1  P(Rt+1 | Rt)
+r   +r    0.7
+r   -r    0.3
-r   +r    0.3
-r   -r    0.7

Rt   Ut   P(Ut | Rt)
+r   +u   0.9
+r   -u   0.1
-r   +u   0.2
-r   -u   0.8

[Figure: Rain0 → Rain1 → Rain2, observing Umbrella1 and Umbrella2]

B(+r) = 0.5,    B(-r) = 0.5      (time 0)
B'(+r) = 0.5,   B'(-r) = 0.5     (after time update)
B(+r) = 0.818,  B(-r) = 0.182    (after observing Umbrella1)
B'(+r) = 0.627, B'(-r) = 0.373   (after time update)
B(+r) = 0.883,  B(-r) = 0.117    (after observing Umbrella2)
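The beliefs on this slide can be reproduced by alternating the time and observation updates; a minimal sketch:

```python
# Forward algorithm (time update + observation update) on the umbrella HMM.

TRANS = {"+r": {"+r": 0.7, "-r": 0.3}, "-r": {"+r": 0.3, "-r": 0.7}}
EMIT = {"+r": {"+u": 0.9, "-u": 0.1}, "-r": {"+u": 0.2, "-u": 0.8}}

def time_update(b):
    """B'(x_{t+1}) = sum over x_t of P(x_{t+1} | x_t) B(x_t)."""
    return {s: sum(b[p] * TRANS[p][s] for p in b) for s in ("+r", "-r")}

def observe(b, e):
    """B(x) proportional to P(e | x) B'(x), then renormalized."""
    unnorm = {s: EMIT[s][e] * b[s] for s in b}
    z = sum(unnorm.values())
    return {s: v / z for s, v in unnorm.items()}

belief = {"+r": 0.5, "-r": 0.5}
for evidence in ["+u", "+u"]:
    belief = observe(time_update(belief), evidence)
# belief["+r"] ≈ 0.883 after seeing an umbrella on both days
```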
SLIDE 33
Most Likely Explanation
SLIDE 34
HMMs: MLE Queries
▪ HMMs defined by
  ▪ States X
  ▪ Observations E
  ▪ Initial distribution: P(X1)
  ▪ Transitions: P(Xt | Xt-1)
  ▪ Emissions: P(Et | Xt)
▪ New query: most likely explanation: argmax_{x1:t} P(x1:t | e1:t)

[Figure: hidden chain X1 → X2 → X3 → X4 with observations E1 … E4]
SLIDE 35
HMMs: MLE Queries
▪ Graph of states and transitions over time (a trellis)
▪ Each arc represents some transition xt-1 → xt
▪ Each arc has weight P(xt | xt-1) P(et | xt)
▪ Each path is a sequence of states
▪ The product of weights on a path is that sequence's probability along with the evidence
▪ The forward algorithm computes sums over paths; Viterbi computes best paths

[Figure: trellis with states sun/rain at each time step]
SLIDE 36
HMMs: MLE Queries
[Figure: trellis with states sun/rain at each time step]

Forward Algorithm (Sum):  f_t(x_t) = P(e_t | x_t) Σ_{x_{t-1}} P(x_t | x_{t-1}) f_{t-1}(x_{t-1})
Viterbi Algorithm (Max):  m_t(x_t) = P(e_t | x_t) max_{x_{t-1}} P(x_t | x_{t-1}) m_{t-1}(x_{t-1})
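The max recurrence with backpointers can be sketched on the umbrella HMM from the earlier slides (function names are my own):

```python
# Viterbi: same trellis as the forward algorithm, with max instead of sum.

TRANS = {"+r": {"+r": 0.7, "-r": 0.3}, "-r": {"+r": 0.3, "-r": 0.7}}
EMIT = {"+r": {"+u": 0.9, "-u": 0.1}, "-r": {"+u": 0.2, "-u": 0.8}}
PRIOR = {"+r": 0.5, "-r": 0.5}

def viterbi(evidence):
    """Most likely state sequence given the observations."""
    states = list(PRIOR)
    m = {s: PRIOR[s] * EMIT[s][evidence[0]] for s in states}  # init
    back = []
    for e in evidence[1:]:
        # best predecessor for each state, then extend the max-probability path
        prev_best = {s: max(states, key=lambda p: m[p] * TRANS[p][s])
                     for s in states}
        m = {s: m[prev_best[s]] * TRANS[prev_best[s]][s] * EMIT[s][e]
             for s in states}
        back.append(prev_best)
    best = max(states, key=m.get)
    path = [best]
    for bp in reversed(back):     # follow backpointers
        path.append(bp[path[-1]])
    return list(reversed(path))
```

For example, two umbrella days give the sequence rain, rain.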
SLIDE 37
Outline
▪ Markov Model ▪ Hidden Markov Model ▪ Hidden Markov Model for Sequence Labeling ▪ Maximum Entropy Markov Model for Sequence Labeling
SLIDE 38 Sequence problems
▪ Many problems in NLP have data which is a sequence of characters, words, phrases, lines, or sentences …
▪ We can think of our task as one of labeling each item

POS tagging:
VBG      NN           IN   DT   NN    IN   NN
Chasing  opportunity  in   an   age   of   upheaval

Word segmentation:
B  B  I  I  B  I  B  I  B  B
而 相 对 于 这 些 品 牌 的 价

Named entity recognition:
PERS     O          O       O   ORG   ORG
Murdoch  discusses  future  of  News  Corp.

Text segmentation:
Q  A  Q  A  A  A  Q  A
SLIDE 39
HMM for sequence labeling
▪ How likely is it that a sequence of observations is generated?
  ▪ Forward algorithm
▪ Given a sequence of observations, what is the most likely hidden state sequence?
  ▪ Viterbi algorithm
SLIDE 40 Trigram Hidden Markov Models (Trigram HMMs)
▪ A trigram HMM consists of a finite set V of possible words, and a finite set K of possible tags.
▪ Transitions: for any tag trigram (u, v, s), we have the probability of seeing the tag s immediately after the tag bigram (u, v): q(s | u, v)
▪ Emissions: for any word x and tag s, we have the probability of seeing the observation word x paired with state s: e(x | s)
SLIDE 41
Trigram Hidden Markov Models (Trigram HMMs)
▪ Define S to be the set of all word-sequence/tag-sequence pairs
⟨x1 … xn, y1, y2 … yn+1⟩, in which yn+1 = STOP.
▪ The probability of seeing such a pair can thus be written as:
p(x1 … xn, y1 … yn+1) = ∏_{i=1}^{n+1} q(yi | yi-2, yi-1) × ∏_{i=1}^{n} e(xi | yi)
▪ And we have y-1 = y0 = *, where * is the start state.
SLIDE 42
Trigram Hidden Markov Models (Trigram HMMs) - Example
IO encoding:
Fred/PER showed/O Sue/PER Mengqiu/PER Huang/PER 's/O STOP

p = q(PER | *, *) × q(O | *, PER) × q(PER | PER, O) × q(PER | O, PER) × q(PER | PER, PER) × q(O | PER, PER) × q(STOP | PER, O)
  × e(Fred | PER) × e(showed | O) × e(Sue | PER) × e(Mengqiu | PER) × e(Huang | PER) × e('s | O)
SLIDE 43
Sequence Generation Process
SLIDE 44
In-class Quiz
[Figure: trellis * → {PER, O} → {PER, O} → STOP for the sentence "Fred shouts"]

Tag   Word     e(word | tag)
PER   Fred     0.91
PER   shouts   0.09
O     Fred     0.12
O     shouts   0.88

yi-1  yi     q(yi | yi-1)
PER   PER    0.71
PER   O      0.09
O     O      0.12
O     PER    0.88
*     PER    0.5
*     O      0.5
PER   STOP   0.2
O     STOP   0.4

p(Fred shouts, PER O STOP) = ?
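One way to check an answer to this quiz: score the pair directly. This sketch treats the quiz's model as a bigram HMM and reads each transition row as q(next | prev), following the convention of the earlier weather CPT; the function name is my own:

```python
# Scoring p(words, tags) for a bigram HMM:
# product of q(y_i | y_{i-1}) transitions and e(x_i | y_i) emissions,
# ending with the q(STOP | last tag) factor.

Q = {("*", "PER"): 0.5, ("*", "O"): 0.5,
     ("PER", "PER"): 0.71, ("PER", "O"): 0.09, ("PER", "STOP"): 0.2,
     ("O", "O"): 0.12, ("O", "PER"): 0.88, ("O", "STOP"): 0.4}
E = {("Fred", "PER"): 0.91, ("shouts", "PER"): 0.09,
     ("Fred", "O"): 0.12, ("shouts", "O"): 0.88}

def score(words, tags):
    p = 1.0
    prev = "*"
    for w, t in zip(words, tags):
        p *= Q[(prev, t)] * E[(w, t)]   # transition then emission
        prev = t
    return p * Q[(prev, "STOP")]        # final STOP transition

p = score(["Fred", "shouts"], ["PER", "O"])
```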
SLIDE 45
Parameter Estimation
q_ML(s | u, v) = c(u, v, s) / c(u, v)
e(x | s) = c(x, s) / c(s)

▪ For back-off (interpolation) smoothing:
q(s | u, v) = λ1 × q_ML(s | u, v) + λ2 × q_ML(s | v) + λ3 × q_ML(s)
λ1 + λ2 + λ3 = 1
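Counting-based MLE for the transition parameters can be sketched as follows; the two-sequence tag corpus is a made-up illustration:

```python
# MLE estimation of trigram transition probabilities by counting:
# q_ML(s | u, v) = c(u, v, s) / c(u, v), with *, * padding and a final STOP.

from collections import Counter

def estimate_q(tag_seqs):
    tri, bi = Counter(), Counter()
    for tags in tag_seqs:
        padded = ["*", "*"] + list(tags) + ["STOP"]
        for i in range(2, len(padded)):
            u, v, s = padded[i - 2], padded[i - 1], padded[i]
            tri[(u, v, s)] += 1
            bi[(u, v)] += 1
    return lambda s, u, v: tri[(u, v, s)] / bi[(u, v)]

q = estimate_q([["PER", "O", "PER"], ["PER", "PER", "O"]])
# q("O" | "*", "PER") = 1/2: after the bigram (*, PER) we saw O once, PER once
```

Emission probabilities e(x | s) = c(x, s) / c(s) can be estimated the same way from word/tag pairs.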
SLIDE 46
Decoding with HMMs
▪ HMMs defined by
  ▪ States X
  ▪ Observations E
  ▪ Initial distribution: P(X1)
  ▪ Transitions: P(Xt | Xt-1)
  ▪ Emissions: P(Et | Xt)
▪ New query: most likely explanation: argmax_{x1:t} P(x1:t | e1:t)
▪ New method: the Viterbi algorithm

[Figure: hidden chain X1 → X2 → X3 → X4 with observations E1 … E4]
SLIDE 47
Decoding for Trigram HMMs
▪ Finding the most likely tag sequence for an input sentence x1, x2, … xn:
arg max_{y1 … yn+1} p(x1 … xn, y1 … yn+1)
▪ For example, for "Fred showed Sue Mengqiu Huang's new painting"
▪ If we have 2 tags, how many candidate tag sequences are there? (2^n for an n-word sentence)
SLIDE 48 Greedy Decoding
▪ Greedy inference:
  ▪ We just start at the left, and use our classifier at each position to assign a label
  ▪ The classifier can depend on previous labeling decisions as well as observed data
▪ Advantages:
  ▪ Fast, no extra memory requirements
  ▪ Very easy to implement
  ▪ With rich features including observations to the right, it may perform quite well
▪ Disadvantage:
  ▪ Greedy. We may commit errors we cannot recover from

p(x1 … xn, y1 … yn+1) = ∏_{i=1}^{n+1} q(yi | yi-2, yi-1) × ∏_{i=1}^{n} e(xi | yi)
SLIDE 49
Probability of generating sub-sequence
▪ Suppose S(k, u, v) is the set of all tag sequences y-1, y0, y1 … yk of length k which end in the tag bigram (u, v), i.e. yk-1 = u and yk = v. We have:
α(k, u, v) = Σ_{⟨y-1 … yk⟩ ∈ S(k, u, v)} ∏_{i=1}^{k} q(yi | yi-2, yi-1) e(xi | yi)
SLIDE 50
Beam Inference
▪ Beam inference:
  ▪ At each position keep the top m complete sequences
  ▪ Extend each sequence in each local way
  ▪ The extensions compete for the m slots at the next position
▪ Advantages:
  ▪ Fast; beam sizes of 3–5 are almost as good as exact inference in many cases
  ▪ Easy to implement
▪ Disadvantage:
  ▪ Inexact: the globally best sequence can fall off the beam
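A beam decoder can be sketched the same way, again using the quiz's bigram model for illustration (tables and names are my own):

```python
# Beam inference with beam size m: keep the top-m partial sequences,
# extend each locally, and let the extensions compete for the m slots.

Q = {("*", "PER"): 0.5, ("*", "O"): 0.5,
     ("PER", "PER"): 0.71, ("PER", "O"): 0.09,
     ("O", "O"): 0.12, ("O", "PER"): 0.88}
E = {("Fred", "PER"): 0.91, ("shouts", "PER"): 0.09,
     ("Fred", "O"): 0.12, ("shouts", "O"): 0.88}
TAGS = ["PER", "O"]

def beam_decode(words, m=2):
    beam = [(1.0, [])]                  # (score, partial tag sequence)
    for w in words:
        candidates = []
        for score, tags in beam:
            prev = tags[-1] if tags else "*"
            for t in TAGS:
                candidates.append((score * Q[(prev, t)] * E[(w, t)], tags + [t]))
        beam = sorted(candidates, reverse=True)[:m]   # top-m survive
    return beam[0][1]
```

With m = 1 this reduces to greedy decoding; with m large enough to cover all states it matches exact inference.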
SLIDE 51
Forward Algorithm for probability computation
▪ For any position k in the sequence, we can use the following recursive definition:
α(k, u, v) = Σ_{w ∈ K_{k-2}} α(k-1, w, u) × q(v | w, u) × e(xk | v)
▪ K_k is the set of allowable tags at position k in the sequence.
SLIDE 52
Viterbi for HMM decoding to compute the highest probability

π(k, u, v) = max_{w ∈ K_{k-2}} π(k-1, w, u) × q(v | w, u) × e(xk | v)
SLIDE 53
Viterbi for HMM decoding to choose the best sequence

bp(k, u, v) = arg max_{w ∈ K_{k-2}} π(k-1, w, u) × q(v | w, u) × e(xk | v)
SLIDE 54
Viterbi Inference
▪ Viterbi inference:
  ▪ Dynamic programming
  ▪ Requires a small window of state influence (e.g., past two states are relevant)
▪ Advantage:
  ▪ Exact: the globally best sequence is returned
▪ Disadvantage:
  ▪ Harder to implement long-distance state-state interactions
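A sketch of trigram Viterbi with backpointers, checked against brute-force enumeration over all tag sequences. The tag set, words, and probability tables below are made up for illustration, and the q(STOP | …) factor is omitted for brevity:

```python
# Trigram Viterbi: pi(k, u, v) = max_w pi(k-1, w, u) q(v | w, u) e(x_k | v).

from itertools import product
import random

TAGS = ["A", "B"]
WORDS = ["w1", "w2", "w3"]

random.seed(0)
Q, E = {}, {}
for u, v in product(TAGS + ["*"], repeat=2):   # toy q(s | u, v), normalized
    weights = [random.random() for _ in TAGS]
    for s, w in zip(TAGS, weights):
        Q[(u, v, s)] = w / sum(weights)
for s in TAGS:                                 # toy e(x | s), normalized
    weights = [random.random() for _ in WORDS]
    for x, w in zip(WORDS, weights):
        E[(x, s)] = w / sum(weights)

def joint(words, tags):
    """prod_i q(y_i | y_{i-2}, y_{i-1}) e(x_i | y_i), with y_-1 = y_0 = *."""
    padded = ["*", "*"] + list(tags)
    p = 1.0
    for i, word in enumerate(words):
        p *= Q[(padded[i], padded[i + 1], padded[i + 2])] * E[(word, tags[i])]
    return p

def brute_force(words):
    return max(product(TAGS, repeat=len(words)),
               key=lambda tags: joint(words, tags))

def viterbi(words):
    n = len(words)
    K = lambda k: ["*"] if k <= 0 else TAGS    # allowable tags at position k
    pi = {("*", "*"): 1.0}
    bps = []
    for k in range(1, n + 1):
        new_pi, bp = {}, {}
        for u in K(k - 1):
            for v in K(k):
                score, w = max((pi[(w, u)] * Q[(w, u, v)] * E[(words[k - 1], v)], w)
                               for w in K(k - 2))
                new_pi[(u, v)], bp[(u, v)] = score, w
        pi = new_pi
        bps.append(bp)
    u, v = max(pi, key=pi.get)
    tags = [u, v]
    for bp in reversed(bps[1:]):               # follow backpointers
        tags.insert(0, bp[(tags[0], tags[1])])
    return tags[-n:]                           # drop the * padding
```

The dynamic program touches O(n · |K|^3) entries instead of enumerating |K|^n sequences.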
SLIDE 55
Outline
▪ Markov Model ▪ Hidden Markov Model ▪ Hidden Markov Model for Sequence Labeling ▪ Maximum Entropy Markov Model for Sequence Labeling
SLIDE 56
Conditional Markov Models
▪ Hidden Markov Model: generative model of p(E, X)
▪ Conditional Markov Model: discriminative model of P(X | E)
▪ Also known as a Maximum Entropy Markov Model (MEMM)

[Figure: HMM (states emit observations) vs. CMM (observations condition states)]
SLIDE 57
HMM VS CMM (MEMM)
▪ Hidden Markov Model: generative model, p(E, X) = ∏_t P(xt | xt-1) P(et | xt)
▪ Conditional Markov Model: discriminative model, P(X | E) = ∏_t P(xt | xt-1, et)
SLIDE 58 MEMM inference in systems
▪ For a Conditional Markov Model (CMM), a.k.a. a Maximum Entropy Markov Model (MEMM), the classifier makes a single decision at a time, conditioned on evidence from observations and previous decisions

Decision point (predicting the tag of "22.6"):
Position:  -3   -2   -1    0     +1
Word:      The  Dow  fell  22.6  %
Tag:       DT   NNP  VBD   ???   ???

Local context features:
W0        22.6
W+1       %
W-1       fell
T-1       VBD
T-1-T-2   NNP-VBD
hasDigit? true
…

(Ratnaparkhi 1996; Toutanova et al. 2003, etc.)
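The slide's local context can be reproduced by a small feature-extraction function; the sentinel values (`<S>`, `</S>`) and the exact feature names are my own:

```python
# Local features for one MEMM decision: any observed words may be used,
# but only the tags already assigned to earlier positions.

def local_features(words, prev_tags, i):
    return {
        "W0": words[i],
        "W-1": words[i - 1] if i > 0 else "<S>",
        "W+1": words[i + 1] if i + 1 < len(words) else "</S>",
        "T-1": prev_tags[-1] if prev_tags else "<S>",
        "T-1-T-2": "-".join(prev_tags[-2:]) if len(prev_tags) >= 2 else "<S>",
        "hasDigit?": any(ch.isdigit() for ch in words[i]),
    }

feats = local_features(["The", "Dow", "fell", "22.6", "%"],
                       ["DT", "NNP", "VBD"], 3)
# {'W0': '22.6', 'W-1': 'fell', 'W+1': '%', 'T-1': 'VBD',
#  'T-1-T-2': 'NNP-VBD', 'hasDigit?': True}
```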
SLIDE 59
Example: POS Tagging
▪ Scoring individual labeling decisions is no more complex than standard classification decisions
▪ We have some assumed labels to use for prior positions
▪ We use features of those and the observed data (which can include current, previous, and next words) to predict the current label

(Same decision point and local-context features as on slide 58; Ratnaparkhi 1996; Toutanova et al. 2003, etc.)
SLIDE 60 Example: POS Tagging
▪ POS tagging features can include:
  ▪ Current, previous, next words in isolation or together
  ▪ Previous one, two, three tags
  ▪ Word-internal features: word types, suffixes, dashes, etc.

(Same decision point and local-context features as on slide 58; Ratnaparkhi 1996; Toutanova et al. 2003, etc.)
SLIDE 61
Inference in Systems

[Figure: two-level architecture. Local level: Local Data → Feature Extraction → Features → Classifier (Maximum Entropy Models; smoothing via quadratic penalties; optimization via conjugate gradient) → Label. Sequence level: Sequence Data → local classifications → Sequence Model Inference → Label]
SLIDE 62
Features for sequence labeling
▪ Words
▪ Current word (essentially like a learned dictionary) ▪ Previous/next word (context)
▪ Other kinds of inferred linguistic classification
▪ Part-of-speech tags
▪ Label context
▪ Previous (and perhaps next) label
SLIDE 63
Features: Word Substrings

[Figure: counts of how often words containing a given substring fall into each class (drug, company, movie, place, person); e.g. the substring "field" selects mostly place names (Wethersfield), ":" selects mostly movie titles (Alien Fury: Countdown to Invasion), and Cotrimoxazole is a drug example]
SLIDE 64
Features: Word shapes
▪ Word Shapes
▪ Map words to simplified representation that encodes attributes such as length, capitalization, numerals, Greek letters, internal punctuation, etc.
Varicella-zoster → Xx-xxx
mRNA → xXXX
CPA1 → XXXd
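One simple variant of the mapping can be sketched as follows. The exact run-collapsing rule differs between implementations (the slide's Xx-xxx for Varicella-zoster suggests collapsing long runs); this sketch maps character classes and caps each run at three:

```python
# Word shape: uppercase -> X, lowercase -> x, digit -> d, other characters
# kept as-is, with runs longer than max_run collapsed.

def word_shape(word, max_run=3):
    shape = []
    for ch in word:
        if ch.isupper():
            cls = "X"
        elif ch.islower():
            cls = "x"
        elif ch.isdigit():
            cls = "d"
        else:
            cls = ch
        if shape[-max_run:] != [cls] * max_run:   # cap run length
            shape.append(cls)
    return "".join(shape)

# word_shape("mRNA") == "xXXX"; word_shape("CPA1") == "XXXd"
```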