Part-of-Speech Tagging
CMSC 723 / LING 723 / INST 725 MARINE CARPUAT
marine@cs.umd.edu
Today’s Agenda
– What are parts of speech (POS)?
– What is POS tagging?
– How to POS tag text automatically?

Parts of Speech
Source: Calvin and Hobbes
– “Categories” or “types” of words
– Dionysius Thrax of Alexandria (c. 100 BC)
– 8 parts of speech: noun, verb, pronoun, preposition, adverb, conjunction, participle, article
– Remarkably enduring list!
– Verbs are actions
– Adjectives are properties
– Nouns are things
– What occurs nearby?
– What does it act as?
– What affixes does it take?
Open classes
– Impossible to enumerate completely
– New words continuously being invented, borrowed, etc.
Closed classes
– Closed, fixed membership
– Reasonably easy to enumerate
– Generally, short function words that “structure” sentences
– Nouns
– Verbs
– Adjectives
– Adverbs
– Every language has nouns and verbs, but may not have the other two
– New inventions all the time: muggle, webinar, ...
– Generally, words for people, places, things
– But not always (bandwidth, energy, ...)
– Occurring with determiners
– Pluralizable, possessivizable
– Mass vs. count nouns
– New inventions all the time: google, tweet, ...
– Generally, denote actions, processes, etc.
– Intransitive, transitive, ditransitive
– Alternations
– Main vs. auxiliary verbs
– Gerunds (verbs behaving like nouns)
– Participles (verbs behaving like adjectives)
– Generally modify nouns, e.g., tall girl
– A semantic and formal potpourri…
– Sometimes modify verbs, e.g., sang beautifully
– Sometimes modify adjectives, e.g., extremely hot
– In English, occurring before noun phrases
– Specifying some type of relation (spatial, temporal, …)
– Examples: on the shelf, before noon
– Resembles a preposition, but used with a verb (“phrasal verbs”)
– Examples: find out, turn over, go on
He came by the office in a hurry (by = preposition)
He came by his fortune honestly (by = particle)
We ran up the phone bill (up = particle)
We ran up the small hill (up = preposition)
He lived down the block (down = preposition)
He never lived down the nicknames (down = particle)
– Establish reference for a noun
– Examples: a, an, the (articles), that, this, many, such, …
– Refer to persons or entities: he, she, it
– Possessive pronouns: his, her, its
– Wh-pronouns: what, who
– Join two elements of “equal status”
– Examples: cats and dogs, salad or soup
– Join two elements of “unequal status”
– Examples: We’ll leave after you finish eating. While I was waiting in line, I saw my friend.
– Complementizers are a special case: I think that you should finish your assignment
Chinese
– No verb/adjective distinction!
– 漂亮: beautiful / to be beautiful

Riau Indonesian/Malay
– No articles
– No tense marking
– 3rd-person pronouns neutral to both gender and number
– No features distinguishing verbs from nouns
Ayam (chicken) makan (eat) – “Ayam makan” can mean:
– The chicken is eating
– The chicken ate
– The chicken will eat
– The chicken is being eaten
– Where the chicken is eating
– How the chicken is eating
– Somebody is eating the chicken
– The chicken that is eating
– Coarse-grained: noun, verb, adjective, adverb, …
– Fine-grained: {proper, common} noun
– Even finer-grained: {proper, common} noun, animate or inanimate
– Choice of tags encodes certain distinctions/non-distinctions
– Tagsets will differ across languages!
– The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
– Prepositions and subordinating conjunctions are tagged “IN” (“Although/IN I/PRP ..”)
– Except the preposition/complementizer “to”, which is tagged “TO”
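To see Penn Treebank tags in practice, here is a minimal sketch using NLTK’s off-the-shelf tagger. It assumes the nltk package is installed; the tagger shown is a pretrained perceptron model, not the HMM tagger developed in this lecture, and recent NLTK releases may name the downloadable resources slightly differently.

```python
# Minimal sketch: tagging a sentence with Penn Treebank tags via NLTK.
import nltk

nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # pretrained tagger

tokens = nltk.word_tokenize("The grand jury commented on a number of other topics.")
print(nltk.pos_tag(tokens))
# Expected output along the lines of:
# [('The', 'DT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ('on', 'IN'), ...]
```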
– Nicely illustrates principles of statistical NLP
– Needed for syntactic analysis
– Needed for semantic analysis
– Machine translation
– Information extraction
– Lots more…
– Not just a lexical problem
– Ambiguity: many English words can take more than one tag
How do we build a system that can POS tag new text?
A simple baseline: tag each word type with its most frequent tag
– 90% accuracy if train and test sets are drawn from the Penn Treebank
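A minimal sketch of that baseline, assuming the training data comes as (word, tag) pairs; corpus loading and smoothing are left out, and the NN default for unknown words is a common heuristic rather than something fixed by the slides.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_corpus):
    """tagged_corpus: iterable of (word, tag) pairs.
    Returns a dict mapping each word type to its most frequent tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, word_to_tag, default="NN"):
    # Unknown words fall back to a default tag.
    return [(w, word_to_tag.get(w, default)) for w in words]
```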
Given x, predict y
– Binary prediction/classification
– Multiclass prediction/classification
– Structured prediction
– What is x? What is y?
– What model and training algorithm can we use?
– What kind of features can we use?
– Models sequences of predictions
– States connected by probabilistic transitions
– Markov assumption: the next state depends only on the current state, independent of previous history
– e.g., word sequences
– An underlying set of hidden (unobserved) states the model can be in (e.g., POS tags)
– Probabilistic transitions between states over time (e.g., from POS to POS in order)
– Probabilistic generation of (observed) tokens from states (e.g., words generated for each POS)
Credit: Jordan Boyd-Graber
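The generative story can be made concrete in a few lines. A sketch, assuming the HMM parameters are encoded as nested dicts (pi for start probabilities, A for transitions, B for emissions); the same encoding is reused in the examples that follow.

```python
import random

def sample_hmm(pi, A, B, length):
    """Sample a (state sequence, observation sequence) pair from an HMM.

    pi[s]    -- probability of starting in state s
    A[s][s2] -- probability of transitioning from state s to s2
    B[s][o]  -- probability of emitting symbol o while in state s
    """
    states, observations = [], []
    state = random.choices(list(pi), weights=list(pi.values()))[0]
    for _ in range(length):
        states.append(state)
        # Emit a symbol from the current state...
        observations.append(random.choices(list(B[state]), weights=list(B[state].values()))[0])
        # ...then move to the next state.
        state = random.choices(list(A[state]), weights=list(A[state].values()))[0]
    return states, observations
```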
– A set of states: Q = {q0, q1, q2, q3, …}
– Transition probabilities: aij = P(qj|qi), with Σj aij = 1 for all i
– A sequence of observations, each drawn from a given set of symbols (vocabulary V)
– Emission probabilities: bi(ot) = P(ot|qi), with Σ bi(o) = 1 over all symbols o, for each state i
– An explicit start state q0, or alternatively a prior distribution over start states: {π1, π2, π3, …}, Σ πi = 1
– The set of final states: qF
Day: 1 2 3 4 5 6
Obs: ↑ ↓ ↔ ↑ ↓ ↔

↑: Market is up
↓: Market is down
↔: Market hasn’t changed
[Figure: state transition diagram over three hidden states: Bull, Bear, S]
Bull: bull market; Bear: bear market; S: static market
The states are not observable! Here’s what you actually observe:
Credit: Jimmy Lin
States ✓ Transitions ✓ Vocabulary ✓ Emissions ✓ Priors ✓
π1 = 0.5, π2 = 0.2, π3 = 0.3
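Written out as data, λstock might look like the sketch below. Only a handful of numbers are given on the slides (the priors 0.5/0.2/0.3, and the entries used in the worked examples later: aBullBull = 0.6, bBull(↑) = 0.7, bBull(↓) = 0.1); the state-to-prior mapping is inferred from the worked example (α1(Bull) = 0.2 × 0.7), and every number marked as a fill-in is an invented placeholder chosen so each row sums to 1.

```python
# Sketch of lambda_stock as plain dicts; "up"/"down"/"unchanged" stand in
# for the arrow symbols. Entries marked (slides) appear in the lecture;
# entries marked (fill-in) are illustrative placeholders.
STATES = ["Bull", "Bear", "Static"]
VOCAB = ["up", "down", "unchanged"]

pi = {"Bull": 0.2, "Bear": 0.5, "Static": 0.3}  # values from slides; mapping inferred

A = {  # A[s][s2] = P(next state s2 | current state s)
    "Bull":   {"Bull": 0.6, "Bear": 0.2, "Static": 0.2},  # 0.6 (slides), rest (fill-in)
    "Bear":   {"Bull": 0.5, "Bear": 0.3, "Static": 0.2},  # (fill-in)
    "Static": {"Bull": 0.4, "Bear": 0.3, "Static": 0.3},  # (fill-in)
}

B = {  # B[s][o] = P(observation o | state s)
    "Bull":   {"up": 0.7, "down": 0.1, "unchanged": 0.2},  # 0.7, 0.1 (slides)
    "Bear":   {"up": 0.1, "down": 0.6, "unchanged": 0.3},  # "up" (slides), rest (fill-in)
    "Static": {"up": 0.3, "down": 0.3, "unchanged": 0.4},  # "up" (slides), rest (fill-in)
}
```

Conveniently, these fill-ins reproduce the numbers in the worked examples below (α2(Bull) = 0.0145 for the forward algorithm, v2(Bull) = 0.0084 for Viterbi), but they should still be read as illustrative values, not as the lecture’s actual matrices.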
– Notation: T = the number of observations
Three fundamental HMM problems:
– Likelihood: given the model λ and a sequence of observed events O, find P(O|λ)
– Decoding: given λ and O, find the most likely (hidden) state sequence
– Learning: given an observation sequence O and the set of states Q in λ, compute the parameters A and B
t: 1 2 3 4 5 6
O: ↑ ↓ ↔ ↑ ↓ ↔
Model: λstock
Assuming λstock models the stock market, how likely are we to observe the sequence of outputs?
π1=0.5 π2=0.2 π3=0.3
– Sum over all possible ways in which we could generate O from λ
– What’s the problem?
Takes O(N^T) time to compute!
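For concreteness, here is the brute-force computation, enumerating all N^T state sequences; a sketch reusing the dict encoding from above.

```python
from itertools import product

def brute_force_likelihood(obs, states, pi, A, B):
    """P(O | lambda) by summing over every possible state sequence: O(N^T)."""
    total = 0.0
    for seq in product(states, repeat=len(obs)):          # N^T sequences
        p = pi[seq[0]] * B[seq[0]][obs[0]]                # start + first emission
        for t in range(1, len(obs)):
            p *= A[seq[t - 1]][seq[t]] * B[seq[t]][obs[t]]
        total += p
    return total

# e.g., brute_force_likelihood(["up", "down", "up"], STATES, pi, A, B)
```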
– State sequences may have a lot of overlap…
– We’re recomputing the shared subsequences every time
– Let’s store intermediate results and reuse them!
– Can we do this?
αt(j) = P(being in state j after seeing the first t observations)
      = P(o1, o2, ... ot, qt=j)
αt(j) = ∑i αt-1(i) aij bj(ot)
– αt-1(i): forward path probability up to time (t-1)
– aij: transition probability of going from state i to j
– bj(ot): probability of emitting symbol ot in state j
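The recurrence translates almost line for line into code; a sketch with the same dict-encoded parameters as above.

```python
def forward(obs, states, pi, A, B):
    """Forward algorithm: computes P(O | lambda) in O(N^2 T) time."""
    # Initialization: alpha_1(j) = pi_j * b_j(o_1)
    alpha = [{j: pi[j] * B[j][obs[0]] for j in states}]
    # Recursion: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
    for t in range(1, len(obs)):
        alpha.append({
            j: sum(alpha[t - 1][i] * A[i][j] for i in states) * B[j][obs[t]]
            for j in states
        })
    # Termination: sum over the states at the final time step
    return sum(alpha[-1].values())

# e.g., forward(["up", "down", "up"], STATES, pi, A, B)
```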
O = ↑ ↓ ↑; find P(O|λstock)
[Forward trellis: rows = states (Bull, Bear, Static), columns = time steps t=1, 2, 3; observations ↑ ↓ ↑]

Initialization (t=1):
α1(Bull) = 0.2 × 0.7 = 0.14
α1(Bear) = 0.5 × 0.1 = 0.05
α1(Static) = 0.3 × 0.3 = 0.09

Recursion (t=2), for the Bull cell:
One term of the sum: α1(Bull) · aBullBull · bBull(↓) = 0.14 × 0.6 × 0.1 = 0.0084
Summing the corresponding terms from all three states: α2(Bull) = 0.0145
.... and so on
Work through the rest of these numbers… What’s the asymptotic complexity of this algorithm?
Given λstock as our model and O as our observations, what are the most likely states the market went through to produce O?
t: 1 2 3 4 5 6
O: ↑ ↓ ↔ ↑ ↓ ↔
Model: λstock
π1=0.5 π2=0.2 π3=0.3
– Compute P(O) for all possible state sequences, then choose the sequence with the highest probability
– What’s the problem here?
Viterbi algorithm: finds the most likely state sequence
– Another dynamic programming algorithm
– Efficient: polynomial vs. exponential (brute force)
– Store intermediate computation results in a trellis
– Build new cells from existing cells
– Just like in forward algorithm
vt(j) = P(in state j after seeing the first t observations and passing through the most likely state sequence so far)
      = P(q1, q2, ... qt-1, qt=j, o1, o2, ... ot)
vt(j) = maxi vt-1(i) aij bj(ot)
– vt-1(i): Viterbi probability up to time (t-1)
– aij: transition probability of going from state i to j
– bj(ot): probability of emitting symbol ot in state j
– In the forward algorithm, we only care about the probabilities
– What’s different here?
– Use “backpointers” to keep track of the most likely transition
– At the end, follow the chain of backpointers to recover the most likely state sequence
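Replacing the sum with a max and recording where each max came from gives the sketch below, with the same dict-encoded parameters as before.

```python
def viterbi(obs, states, pi, A, B):
    """Returns (probability, state sequence) for the most likely path."""
    # Initialization: v_1(j) = pi_j * b_j(o_1); no backpointers yet
    v = [{j: pi[j] * B[j][obs[0]] for j in states}]
    back = []
    # Recursion: v_t(j) = (max_i v_{t-1}(i) * a_ij) * b_j(o_t)
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for j in states:
            prob, best_prev = max((v[t - 1][i] * A[i][j], i) for i in states)
            v[t][j] = prob * B[j][obs[t]]
            back[-1][j] = best_prev          # remember where the max came from
    # Termination: best final state, then follow backpointers in reverse
    last = max(v[-1], key=v[-1].get)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return v[-1][last], list(reversed(path))
```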
O = ↑ ↓ ↑; find the most likely state sequence given λstock
[Viterbi trellis: rows = states (Bull, Bear, Static), columns = time steps t=1, 2, 3; observations ↑ ↓ ↑]

Initialization (t=1), identical to the forward algorithm:
v1(Bull) = 0.2 × 0.7 = 0.14
v1(Bear) = 0.5 × 0.1 = 0.05
v1(Static) = 0.3 × 0.3 = 0.09

Recursion (t=2), for the Bull cell:
Take the max instead of the sum: v1(Bull) · aBullBull · bBull(↓) = 0.14 × 0.6 × 0.1 = 0.0084 is the largest term, so v2(Bull) = 0.0084
Store a backpointer from this cell to the maximizing state (Bull)
.... and so on
Work through the rest of the algorithm…
– The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
– States: part-of-speech tags (t1, t2, ..., tN)
– Output symbols: words (w1, w2, ..., w|V|)
The same three HMM problems, now for POS tagging:
– Likelihood: given λ and a sequence of observed words O, find P(O|λ)
– Decoding: given λ and O, find the most likely (hidden) tag sequence
– Learning: given a word sequence O and the set of tags Q in λ, compute the parameters A and B (see the sketch below)
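When a tagged corpus is available (as in the Penn Treebank), the learning problem reduces to counting: relative-frequency (maximum-likelihood) estimates of the transition and emission probabilities. A minimal sketch, assuming sentences arrive as lists of (word, tag) pairs and ignoring smoothing for unseen events.

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    """MLE estimates: A[t][t2] = C(t, t2) / C(t) and B[t][w] = C(t, w) / C(t)."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sentence in tagged_sentences:
        tags = [t for _, t in sentence]
        start[tags[0]] += 1                      # which tag begins a sentence
        for word, t in sentence:
            emit[t][word] += 1                   # tag -> word counts
        for prev, nxt in zip(tags, tags[1:]):
            trans[prev][nxt] += 1                # tag -> next-tag counts

    def normalize(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    pi = normalize(start)
    A = {t: normalize(c) for t, c in trans.items()}
    B = {t: normalize(c) for t, c in emit.items()}
    return pi, A, B
```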
Summary
– POS tagging is a sequence labeling problem
– Solved by decoding with Hidden Markov Models