
slide-1
SLIDE 1

FSTs, HMMs & POS tagging

CMSC 723 / LING 723 / INST 725 MARINE CARPUAT

marine@cs.umd.edu

slide-2
SLIDE 2

Complete Morphological Parser

slide-3
SLIDE 3

Practical NLP Applications

  • In practice, it is almost never necessary to write FSTs by hand…

  • Typically, one writes rules:

– Chomsky and Halle notation: a → b / c __ d = rewrite a as b when it occurs between c and d
– E-insertion rule:

  • Rule → FST compiler handles the rest…

ε → e / {x, s, z} ^ __ s #
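
To make the rule → compiler idea concrete, here is a minimal sketch (not the actual compiler) that applies the E-insertion rule above with a plain regular expression. The function name and the fox/cat examples are illustrative assumptions; the intermediate tape is assumed to mark morpheme boundaries with ^ and the end of word with #.

```python
import re

# E-insertion: insert "e" when a stem ending in x, s, or z meets the "-s" suffix.
# The intermediate tape is assumed to mark morpheme boundaries with "^" and the
# end of word with "#", e.g. "fox^s#" for fox + plural -s.
E_INSERTION = re.compile(r"([xsz])\^(s)#")

def apply_e_insertion(intermediate: str) -> str:
    """Rewrite-rule style: epsilon -> e / {x, s, z} ^ __ s #."""
    surface = E_INSERTION.sub(r"\1e\2", intermediate)
    # Drop any remaining boundary markers to obtain the surface form.
    return surface.replace("^", "").replace("#", "")

print(apply_e_insertion("fox^s#"))   # foxes  (rule applies)
print(apply_e_insertion("cat^s#"))   # cats   (rule does not apply)
```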

slide-4
SLIDE 4

FSTs and Ambiguity

  • unionizable

– union +ize +able
– un+ ion +ize +able

slide-5
SLIDE 5

FSA as a language model

he saw me
he ran home
she talked

slide-6
SLIDE 6

Weighted FSA as a language model

slide-7
SLIDE 7

Weighted FSAs

  • Assigns a score to each string that it accepts

  • Score can be probability

– But not necessarily a probability
– Strings that are not accepted are said to have probability zero

slide-8
SLIDE 8

Weighted Finite-State Automata

  • We can view n-gram language models as weighted finite-state automata

  • We can also define weighted finite-state transducers

– Generates pairs of strings and assigns a weight to each pair
– Weight can often be interpreted as conditional probability P(output-string | input-string)
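
As a rough sketch of this view, the snippet below treats a bigram model over the toy sentences from the earlier FSA slide as a weighted automaton: states are the previous word, arcs carry P(next word | previous word), and strings with no accepting path score zero. All probabilities here are invented for illustration.

```python
# A bigram LM viewed as a weighted FSA: states = previous word ("<s>" is the
# start state), arcs = next word, weights = P(next | previous).
# These probabilities are made up for illustration only.
BIGRAM = {
    ("<s>", "he"): 0.6, ("<s>", "she"): 0.4,
    ("he", "saw"): 0.3, ("he", "ran"): 0.5, ("he", "talked"): 0.2,
    ("she", "talked"): 0.7, ("she", "ran"): 0.3,
    ("saw", "me"): 1.0, ("ran", "home"): 1.0,
    ("me", "</s>"): 1.0, ("home", "</s>"): 1.0, ("talked", "</s>"): 1.0,
}

def score(sentence: str) -> float:
    """Weight the automaton assigns to a string; 0.0 if it is not accepted."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        prob *= BIGRAM.get((prev, cur), 0.0)   # missing arc => not accepted
    return prob

print(score("he saw me"))    # 0.6 * 0.3 * 1.0 * 1.0 = 0.18
print(score("she talked"))   # 0.4 * 0.7 * 1.0 = 0.28
print(score("saw he me"))    # 0.0 -- not accepted by the automaton
```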

slide-9
SLIDE 9

Today

  • Computational tools

– Weighted Finite State Automata/Transducers
– Hidden Markov Models

  • Part-of-Speech Tagging
slide-10
SLIDE 10

WHAT ARE PARTS OF SPEECH?

slide-11
SLIDE 11

Parts of Speech

  • “Equivalence class” of linguistic entities

– “Categories” or “types” of words

  • Study dates back to the ancient Greeks

– Dionysius Thrax of Alexandria (c. 100 BC)
– 8 parts of speech: noun, verb, pronoun, preposition, adverb, conjunction, participle, article
– Remarkably enduring list!


slide-12
SLIDE 12

How can we define POS?

  • By meaning?

– Verbs are actions
– Adjectives are properties
– Nouns are things

  • By the syntactic environment

– What occurs nearby?
– What does it act as?

  • By what morphological processes affect it

– What affixes does it take?

  • Typically a combination of syntactic and morphological criteria
slide-13
SLIDE 13

Parts of Speech

  • Open class

– Impossible to completely enumerate
– New words continuously being invented, borrowed, etc.

  • Closed class

– Closed, fixed membership
– Reasonably easy to enumerate
– Generally, short function words that “structure” sentences

slide-14
SLIDE 14

Open Class POS

  • Four major open classes in English

– Nouns
– Verbs
– Adjectives
– Adverbs

  • All languages have nouns and verbs... but may not have the other two

slide-15
SLIDE 15

Nouns

  • Open class

– New inventions all the time: muggle, webinar, ...

  • Semantics:

– Generally, words for people, places, things
– But not always (bandwidth, energy, ...)

  • Syntactic environment:

– Occurring with determiners
– Pluralizable, possessivizable

  • Other characteristics:

– Mass vs. count nouns

slide-16
SLIDE 16

Verbs

  • Open class

– New inventions all the time: google, tweet, ...

  • Semantics

– Generally, denote actions, processes, etc.

  • Syntactic environment

– E.g., Intransitive, transitive

  • Other characteristics

– Main vs. auxiliary verbs
– Gerunds (verbs behaving like nouns)
– Participles (verbs behaving like adjectives)

slide-17
SLIDE 17

Adjectives and Adverbs

  • Adjectives

– Generally modify nouns, e.g., tall girl

  • Adverbs

– A semantic and formal hodge-podge…
– Sometimes modify verbs, e.g., sang beautifully
– Sometimes modify adjectives, e.g., extremely hot

slide-18
SLIDE 18

Closed Class POS

  • Prepositions

– In English, occurring before noun phrases
– Specifying some type of relation (spatial, temporal, …)
– Examples: on the shelf, before noon

  • Particles

– Resembles a preposition, but used with a verb (“phrasal verbs”)
– Examples: find out, turn over, go on

slide-19
SLIDE 19

Particles vs. Prepositions

He came by the office in a hurry (by = preposition)
He came by his fortune honestly (by = particle)
We ran up the phone bill (up = particle)
We ran up the small hill (up = preposition)
He lived down the block (down = preposition)
He never lived down the nicknames (down = particle)

slide-20
SLIDE 20

More Closed Class POS

  • Determiners

– Establish reference for a noun
– Examples: a, an, the (articles), that, this, many, such, …

  • Pronouns

– Refer to person or entities: he, she, it
– Possessive pronouns: his, her, its
– Wh-pronouns: what, who

slide-21
SLIDE 21

Closed Class POS: Conjunctions

  • Coordinating conjunctions

– Join two elements of “equal status”
– Examples: cats and dogs, salad or soup

  • Subordinating conjunctions

– Join two elements of “unequal status”
– Examples: We’ll leave after you finish eating. While I was waiting in line, I saw my friend.
– Complementizers are a special case: I think that you should finish your assignment

slide-22
SLIDE 22

Beyond English…

Chinese
– No verb/adjective distinction!
– 漂亮: beautiful / to be beautiful

Riau Indonesian/Malay
– No articles
– No tense marking
– 3rd person pronouns neutral to both gender and number
– No features distinguishing verbs from nouns

Ayam (chicken) Makan (eat) can mean any of:
– The chicken is eating
– The chicken ate
– The chicken will eat
– The chicken is being eaten
– Where the chicken is eating
– How the chicken is eating
– Somebody is eating the chicken
– The chicken that is eating

slide-23
SLIDE 23

POS TAGGING

slide-24
SLIDE 24

POS Tagging: What’s the task?

  • Process of assigning part-of-speech tags to words
  • But what tags are we going to assign?

– Coarse-grained: noun, verb, adjective, adverb, …
– Fine-grained: {proper, common} noun
– Even finer-grained: {proper, common} noun × animate

  • Important issues to remember

– Choice of tags encodes certain distinctions/non-distinctions
– Tagsets will differ across languages!

  • For English, Penn Treebank is the most common tagset
slide-25
SLIDE 25

Penn Treebank Tagset: 45 Tags

slide-26
SLIDE 26

Penn Treebank Tagset: Choices

  • Example:

– The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

  • Distinctions and non-distinctions

– Prepositions and subordinating conjunctions are tagged “IN” (“Although/IN I/PRP ..”)
– Except the preposition/complementizer “to” is tagged “TO”
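
For a quick hands-on look at these choices, one option (outside the slides) is NLTK's pretrained tagger, which emits Penn Treebank tags. This is only a sketch: it assumes the punkt and averaged_perceptron_tagger resources can be downloaded, and the expected output is approximate.

```python
import nltk

# One-time downloads: tokenizer model and the pretrained tagger (an assumption
# about the local environment; resource names can differ across NLTK versions).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The grand jury commented on a number of other topics.")
print(nltk.pos_tag(tokens))
# Roughly: [('The', 'DT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'),
#           ('on', 'IN'), ('a', 'DT'), ('number', 'NN'), ('of', 'IN'),
#           ('other', 'JJ'), ('topics', 'NNS'), ('.', '.')]
```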

slide-27
SLIDE 27

Why do POS tagging?

  • One of the most basic NLP tasks

– Nicely illustrates principles of statistical NLP

  • Useful for higher-level analysis

– Needed for syntactic analysis
– Needed for semantic analysis

  • Sample applications that require POS tagging

– Machine translation
– Information extraction
– Lots more…

slide-28
SLIDE 28

Try your hand at tagging…

  • The back door
  • On my back
  • Win the voters back
  • Promised to back the bill
slide-29
SLIDE 29

Try your hand at tagging…

  • I hope that she wins
  • That day was nice
  • You can go that far
slide-30
SLIDE 30

Why is POS tagging hard?

  • Ambiguity!

– Ambiguity in English

  • 11.5% of word types ambiguous in Brown corpus
  • 40% of word tokens ambiguous in Brown corpus
  • Annotator disagreement in Penn Treebank: 3.5%
slide-31
SLIDE 31

POS tagging: how to do it?

  • Given Penn Treebank, how would you build a system that can POS tag new text?

  • Baseline: pick most frequent tag for each word type

– 90% accuracy if train+test sets are drawn from Penn Treebank

  • How can we do better?
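
A minimal sketch of this most-frequent-tag baseline, assuming a tagged corpus in (word, tag) form; here it uses the small Penn Treebank sample that ships with NLTK, but any corpus with the same interface would do.

```python
from collections import Counter, defaultdict

import nltk

# Small Penn Treebank sample that ships with NLTK (an assumption: any corpus of
# (word, tag) pairs would work the same way).
nltk.download("treebank", quiet=True)
tagged_words = nltk.corpus.treebank.tagged_words()

# "Training": count tags per word type, keep the most frequent one.
tag_counts = defaultdict(Counter)
for word, tag in tagged_words:
    tag_counts[word.lower()][tag] += 1
most_frequent = {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}

# Back off to the globally most common tag for unseen words.
default_tag = Counter(t for _, t in tagged_words).most_common(1)[0][0]

def baseline_tag(tokens):
    """Tag each token with its most frequent tag in the training data."""
    return [(tok, most_frequent.get(tok.lower(), default_tag)) for tok in tokens]

print(baseline_tag("Promised to back the bill".split()))
```
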
slide-32
SLIDE 32

HOW TO SOLVE POS TAGGING?

slide-33
SLIDE 33

How can we POS tag automatically?

  • POS tagging as multiclass classification

– What is x? What is y?

  • POS tagging as sequence labeling

– Models sequences of predictions

slide-34
SLIDE 34

Hidden Markov Models

  • Common approach to sequence labeling
  • A finite state machine with probabilistic transitions

  • Markov Assumption

– the next state depends only on the current state and is independent of the previous history

slide-35
SLIDE 35

Hidden Markov Models (HMM) for POS tagging

  • Probabilistic model for generating sequences

– e.g., word sequences

  • Assume

– underlying set of hidden (unobserved) states in which the model can be (e.g., POS tags)
– probabilistic transitions between states over time (e.g., from POS to POS in order)
– probabilistic generation of (observed) tokens from states (e.g., words generated from each POS)

slide-36
SLIDE 36

HMM for POS tagging: intuition

Credit: Jordan Boyd-Graber

slide-37
SLIDE 37

HMM for POS tagging: intuition

Credit: Jordan Boyd-Graber

slide-38
SLIDE 38

HMM: Formal Specification

  • Q: a finite set of N states

– Q = {q0, q1, q2, q3, …}

  • N × N transition probability matrix A = [aij]

– aij = P(qj|qi),  Σj aij = 1 for all i

  • Sequence of observations O = o1, o2, ... oT

– Each drawn from a given set of symbols (vocabulary V)

  • N × |V| emission probability matrix B = [bit]

– bit = bi(ot) = P(ot|qi),  Σo bi(o) = 1 for all i

  • Start and end states

– An explicit start state q0, or alternatively a prior distribution over start states: {π1, π2, π3, …}, Σi πi = 1
– The set of final states: qF
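
To make the specification concrete, here is a sketch of λ = (A, B, π) as NumPy arrays for the three-state stock-market HMM used on the next slides. Only a few of these numbers appear later in the deck (e.g., πBull = 0.2, bBull(↑) = 0.7, aBullBull = 0.6); the remaining entries are assumptions, chosen so that each row is a proper distribution and the worked trellis values shown later (0.14, 0.05, 0.09, 0.0145) are reproduced.

```python
import numpy as np

# States: Bull, Bear, Static (indices 0, 1, 2).  Symbols: up, down, flat (↑ ↓ ↔).
# Values not shown on the slides are placeholder assumptions.
states = ["Bull", "Bear", "Static"]
vocab = ["up", "down", "flat"]

# Prior distribution over start states.
pi = np.array([0.2, 0.5, 0.3])

# A[i, j] = P(next state = j | current state = i); rows sum to 1.
A = np.array([
    [0.6, 0.2, 0.2],   # from Bull
    [0.5, 0.3, 0.2],   # from Bear
    [0.4, 0.1, 0.5],   # from Static
])

# B[i, k] = P(symbol k | state i); rows sum to 1.
B = np.array([
    [0.7, 0.1, 0.2],   # Bull emits ↑ ↓ ↔
    [0.1, 0.6, 0.3],   # Bear
    [0.3, 0.3, 0.4],   # Static
])

assert np.isclose(pi.sum(), 1)
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
```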

slide-39
SLIDE 39

Let’s model the stock market…

[State diagram: Bull (Bull Market), Bear (Bear Market), S (Static Market)]
Not observable! Here’s what you actually observe:

Day: 1 2 3 4 5 6
     ↑ ↓ ↔ ↑ ↓ ↔

↑: Market is up   ↓: Market is down   ↔: Market hasn’t changed

Credit: Jimmy Lin

slide-40
SLIDE 40

Stock Market HMM

States? ✓ Transitions? Vocabulary? Emissions? Priors?

slide-41
SLIDE 41

Stock Market HMM

States? ✓ Transitions? ✓ Vocabulary? Emissions? Priors?

slide-42
SLIDE 42

Stock Market HMM

States? ✓ Transitions? ✓ Vocabulary? ✓ Emissions? Priors?

slide-43
SLIDE 43

Stock Market HMM

States? ✓ Transitions? ✓ Vocabulary? ✓ Emissions? ✓ Priors?

slide-44
SLIDE 44

Stock Market HMM

π1=0.5 π2=0.2 π3=0.3

States? ✓ Transitions? ✓ Vocabulary? ✓ Emissions? ✓ Priors? ✓

slide-45
SLIDE 45

Properties of HMMs

  • The (first-order) Markov assumption holds
  • The probability of an output symbol depends only on the state generating it
  • The number of states (N) does not have to equal the number of observations (T)

slide-46
SLIDE 46

HMMs: Three Problems

  • Likelihood: Given an HMM λ = (A, B, Π) and a sequence of observed events O, find P(O|λ)

  • Decoding: Given an HMM λ = (A, B, Π) and an observation sequence O, find the most likely (hidden) state sequence

  • Learning: Given a set of observation sequences and the set of states Q in λ, compute the parameters A and B

slide-47
SLIDE 47

HMM Problem #1: Likelihood

slide-48
SLIDE 48

Computing Likelihood

t: 1 2 3 4 5 6
O: ↑ ↓ ↔ ↑ ↓ ↔     (model: λstock)

Assuming λstock models the stock market, how likely are we to observe the sequence of outputs?

π1=0.5 π2=0.2 π3=0.3

slide-49
SLIDE 49

Computing Likelihood

  • First try:

– Sum over all possible ways in which we could generate O from λ
– What’s the problem?

  • Right idea, wrong algorithm!

Takes O(N^T) time to compute!

slide-50
SLIDE 50

Computing Likelihood

  • What are we doing wrong?

– State sequences may have a lot of overlap…
– We’re recomputing the shared subsequences every time
– Let’s store intermediate results and reuse them!
– Can we do this?

  • Sounds like a job for dynamic programming!
slide-51
SLIDE 51

Forward Algorithm

  • Use an N × T trellis or chart [αtj]
  • Forward probabilities: αtj or αt(j)

– = P(being in state j after seeing t observations)
– = P(o1, o2, ... ot, qt=j)

  • Each cell = ∑ extensions of all paths from other cells

αt(j) = ∑i αt-1(i) aij bj(ot)

– αt-1(i): forward path probability until time (t-1)
– aij: transition probability of going from state i to state j
– bj(ot): probability of emitting symbol ot in state j

  • P(O|λ) = ∑i αT(i)
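
A minimal sketch of the forward algorithm just described, run on the stock-market HMM from the earlier sketch (so every parameter not shown on the slides is still an assumption). It fills the trellis row by row and sums the last row for the termination step.

```python
import numpy as np

# lambda_stock from the earlier sketch (parameters beyond those shown on the
# slides are assumptions).
pi = np.array([0.2, 0.5, 0.3])
A = np.array([[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]])
B = np.array([[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]])
UP, DOWN, FLAT = 0, 1, 2   # column indices for ↑ ↓ ↔

def forward(pi, A, B, obs):
    """Return P(O | lambda) and the T x N forward trellis alpha."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization: pi_j * b_j(o_1)
    for t in range(1, T):
        # recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha[-1].sum(), alpha                     # termination: sum_i alpha_T(i)

likelihood, trellis = forward(pi, A, B, [UP, DOWN, UP])   # the ↑ ↓ ↑ example below
print(trellis[0])   # [0.14 0.05 0.09] -- the initialization column on the next slides
print(likelihood)   # P(↑ ↓ ↑ | lambda_stock); each step costs O(N^2), so O(N^2 T) total
```
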
slide-52
SLIDE 52

Forward Algorithm: Formal Definition

  • Initialization
  • Recursion
  • Termination
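
Spelled out in the notation of the previous slide, the three steps of the standard forward algorithm are:

– Initialization: α1(j) = πj · bj(o1), for 1 ≤ j ≤ N
– Recursion: αt(j) = ∑i αt-1(i) · aij · bj(ot), for 1 < t ≤ T and 1 ≤ j ≤ N
– Termination: P(O|λ) = ∑i αT(i)
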
slide-53
SLIDE 53

Forward Algorithm

O = ↑ ↓ ↑     Find P(O|λstock)

slide-54
SLIDE 54

Forward Algorithm

[Empty forward trellis: states (Bull, Bear, Static) × time (t=1, t=2, t=3), observations ↑ ↓ ↑]

slide-55
SLIDE 55

Forward Algorithm: Initialization

[Forward trellis at t=1, observation ↑]

α1(Bull) = 0.2 × 0.7 = 0.14
α1(Bear) = 0.5 × 0.1 = 0.05
α1(Static) = 0.3 × 0.3 = 0.09

slide-56
SLIDE 56

Forward Algorithm: Recursion

α1(Bull) · aBullBull · bBull(↓) = 0.14 × 0.6 × 0.1 = 0.0084

.... and so on

[Forward trellis at t=2, observation ↓: the t=1 column holds 0.14, 0.05, 0.09; summing all extensions into the Bull state gives α2(Bull) = 0.0145]

slide-57
SLIDE 57

Forward Algorithm: Recursion

[Forward trellis with observations ↑ ↓ ↑: the t=1 column holds 0.14, 0.05, 0.09 and α2(Bull) = 0.0145; the remaining cells are marked ?]

Work through the rest of these numbers…
What’s the asymptotic complexity of this algorithm?

slide-58
SLIDE 58

HMMs: Three Problems

  • Likelihood: Given an HMM λ = (A, B, Π) and a sequence of observed events O, find P(O|λ)

  • Decoding: Given an HMM λ = (A, B, Π) and an observation sequence O, find the most likely (hidden) state sequence

  • Learning: Given a set of observation sequences and the set of states Q in λ, compute the parameters A and B

slide-59
SLIDE 59

Today

  • Computational tools

– Weighted Finite State Automata/Transducers
– Hidden Markov Models

  • Part-of-Speech Tagging