

SLIDE 1

Accelerated Natural Language Processing Lecture 3 Morphology and Finite State Machines; Edit Distance

Sharon Goldwater (based on slides by Philipp Koehn) 20 September 2019

Sharon Goldwater ANLP Lecture 3 20 September 2019

SLIDE 2

Recap: Tasks

  • Recognition

– given: surface form
– wanted: yes/no decision if it is in the language

  • Generation

– given: lemma and morphological properties
– wanted: surface form

  • Analysis

– given: surface form
– wanted: lemma and morphological properties

SLIDE 3

Recap: General approach

  • Could list all words with their analyses, but

– List gets too big
– Language is infinite, cannot generalize beyond list

  • Instead, use finite state machines

– Finite and compact representation of infinite language
– Several toolkits available

SLIDE 4

Recap: Finite State Automata

[Diagram: FSA with START and END states and transitions labeled a, b, c]

Can be viewed as either emitting or recognizing strings

SLIDE 5

Today’s lecture

  • How are FSMs and FSTs used for morphological recognition, analysis and generation?

  • How can we deal with spelling changes in morphological analysis?
  • What is an alignment between two strings?
  • What is minimum edit distance and how do we compute it?

– What’s wrong with a brute force solution, and how do we solve that problem?

SLIDE 6

One Word

[Diagram: S --walk--> E]

Basic finite state automaton:

  • start state
  • transition that emits the word

walk

  • end state

SLIDE 7

One Word and One Inflection

[Diagram: S --walk--> 1 --+ed--> E]

Two transitions and intermediate state

  • first transition emits walk
  • second transition emits +ed

→ walked

SLIDE 8

One Word and Multiple Inflections

[Diagram: S --walk--> 1 --+ed / +ing / +s--> E]

Multiple transitions between states

  • three different paths

→ walks, walked, walking

SLIDE 9

Multiple Words and Multiple Inflections

[Diagram: S --walk / laugh / report--> 1 --+ed / +ing / +s--> E]

Multiple stems

  • implements regular verb morphology

→ laughs, laughed, laughing
   walks, walked, walking
   reports, reported, reporting
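The regular-verb FSA above can be sketched as a small transition table. The state names S, 1, E and the dict encoding are my own illustration, not output of any toolkit:

```python
# Transition table for the verb FSA: stems take S -> 1, suffixes take 1 -> E.
TRANSITIONS = {
    ("S", "walk"): "1", ("S", "laugh"): "1", ("S", "report"): "1",
    ("1", "s"): "E", ("1", "ed"): "E", ("1", "ing"): "E",
}

def recognize(word):
    """Accept a word iff some stem+suffix split traces a path S -> 1 -> E."""
    for i in range(1, len(word)):
        stem, suffix = word[:i], word[i:]
        mid = TRANSITIONS.get(("S", stem))
        if mid is not None and TRANSITIONS.get((mid, suffix)) == "E":
            return True
    return False
```

With this machine, recognize("walked") succeeds while recognize("bakeed") fails, since bake is not in the stem table.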

SLIDE 10

Multiple Words and Multiple Inflections

[Diagram: S --walk / laugh / report--> 1 --+ed / +ing / +s--> E]

Multiple stems

  • implements regular verb morphology

→ what about bake, emit, fuss? more on this later...

SLIDE 11

Derivational Morphology

[Diagram: derivational FSA: stem word with suffix ion]

double lines = end state

SLIDE 12

Derivational Morphology

[Diagram: derivational FSA: stem word with suffixes y and ion]

SLIDE 13

Derivational Morphology

[Diagram: derivational FSA: stem word with suffixes y, fy, and ion]

again: wordify not wordyfy! again, will come back to that later...

SLIDE 14

Derivational Morphology

[Diagram: derivational FSA: stem word with suffixes y, fy, cate, and ion, including a loop]

why a loop? could it be placed differently?

SLIDE 15

Derivational Morphology

[Diagram: derivational FSA: stem word with suffixes y, fy, cate, ion, er, ist, and ism]

SLIDE 16

Marking Part of Speech

[Diagram: the same derivational FSA, with part-of-speech labels N, A, V marking states]

SLIDE 17

Marking Part of Speech

[Diagram: the same derivational FSA, with part-of-speech labels N, A, V marking states]

Now: where to add -less? -ness? Others?

SLIDE 18

Concatenation

  • Constructing an FSA gets very complicated
  • Build components as separate FSAs

– L: FSA for lexicon
– D: FSA for derivational morphology
– I: FSA for inflectional morphology

  • Concatenate L + D + I (there are standard algorithms)

– In fact, each component may consist of multiple components (e.g., D has different sets of affixes with ordering constraints)
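One of the standard algorithms alluded to above can be sketched as follows. The (start, end, arcs) representation is an assumption of mine, not the format of any particular toolkit:

```python
# Sketch: an FSA is (start, end, arcs), each arc a (source, label, target)
# triple. Concatenation renames the second machine's states apart, then
# links the end of the first to the start of the second with an epsilon
# ("" label) arc -- the standard construction.
def concatenate(fsa1, fsa2):
    start1, end1, arcs1 = fsa1
    start2, end2, arcs2 = fsa2
    renamed = [((2, s), lab, (2, t)) for (s, lab, t) in arcs2]
    link = (end1, "", (2, start2))   # epsilon transition joining the parts
    return (start1, (2, end2), list(arcs1) + [link] + renamed)
```

Applying this twice gives L + D + I without ever building the combined machine by hand.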

SLIDE 19

What is Required?

  • Lexicon of lemmas

– very large, needs to be collected by hand

  • Inflection and derivation rules

– not large, but requires understanding of the language

Recent work: automatically learn lemmas and suffixes from a corpus

  • OK solution for languages with few resources
  • Hand-engineered systems much better when available

SLIDE 20

Generation and analysis

  • FSAs used as morphological recognizers
  • What if we want to generate or analyze?

walk+V+past ↔ walked report+V+prog ↔ reporting

  • Use a finite-state transducer (FST)

– Replace output symbols with input-output pairs x : y

SLIDE 21

FSA for verbs

[Diagram: FSA with stem arcs laugh, walk, report followed by suffix arcs s, ed, ing]

SLIDE 22

Schematically

[Diagram: schematic FSA: verb-reg arc followed by suffix arcs s, ed, ing]

SLIDE 23

FST for verbs

[Diagram: FST: verb-reg arc, then +V:ε, then +past:ed, +prog:ing, or +3sg:s]

where x means x:x and x: means x:ε.
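The FST above can be sketched in Python, with generation walking the arcs, matching analysis symbols on the input side and emitting the output side. The arc encoding and state names are illustrative assumptions, not the course's toolkit format:

```python
# Each arc is (source state, input symbol, output string, target state);
# "" plays the role of epsilon on the output side.
ARCS = [
    ("S", "walk",  "walk",  "1"),   # x:x arcs for the stems
    ("S", "laugh", "laugh", "1"),
    ("1", "+V",    "",      "2"),   # +V:epsilon
    ("2", "+past", "ed",    "E"),
    ("2", "+prog", "ing",   "E"),
    ("2", "+3sg",  "s",     "E"),
]

def generate(symbols):
    """Map e.g. ['walk', '+V', '+past'] to 'walked' by following arcs."""
    state, out = "S", []
    for sym in symbols:
        for (src, inp, emit, dst) in ARCS:
            if src == state and inp == sym:
                out.append(emit)
                state = dst
                break
        else:
            return None   # no matching arc: analysis not in the language
    return "".join(out) if state == "E" else None
```

Running the same arcs with input and output sides swapped would give analysis instead of generation.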

SLIDE 24

Accounting for spelling changes

  • We now have:

walk+V+past ↔ walked
BUT bake+V+past ↔ bakeed

  • How to fix this?

SLIDE 25

Accounting for spelling changes

  • We now have:

walk+V+past ↔ walked
BUT bake+V+past ↔ bakeed

  • How to fix this? Use two FSTs in a row!

walk+V+past ↔ walkˆed# ↔ walked
bake+V+past ↔ bakeˆed# ↔ baked

SLIDE 26
  • 1. Analysis to intermediate form

[Diagram: FST: verb-reg arc, then +V:^, then +3sg:s#, +past:ed#, or +prog:ing#]

where x means x:x and x: means x:ε.

  • Examples:

walk+V+past ↔ walkˆed#
bake+V+past ↔ bakeˆed#
bake+V+prog ↔ bakeˆing#

SLIDE 27
  • 2. Intermediate form to surface form

Simplified version, only handles some aspects of past tense:

[Diagram: nondeterministic FST with arcs e:ε, ^:ε, ed#:ed, e, and other]

where other means any character except ‘e’.

  • Examples: walkˆed# ↔ walked, bakeˆed# ↔ baked
  • A nondeterministic FST: multiple transitions may be possible on the same input (where?)
  • If any path goes to the end state, the string is accepted.
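This simplified rule can be approximated with string rewriting in Python. This is a sketch, not a real FST: "^" stands in for the morpheme boundary and "#" for the end-of-word marker:

```python
import re

def intermediate_to_surface(s):
    """Intermediate-to-surface step for this past-tense fragment only."""
    s = re.sub(r"e\^ed#", "ed", s)   # bake^ed# -> baked (drop stem-final e)
    s = s.replace("^", "")           # otherwise just erase the boundary
    s = s.replace("#", "")           # and the end-of-word marker
    return s
```

As on the slide, only some aspects of the past tense are handled: bake^ing# would come out as bakeing rather than baking under this fragment.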

SLIDE 28

Plural transducer (J&M, Fig. 3.17)

  • Complete FST for English plural (‘other’ = none of {z,s,x,ˆ,#,ε})
  • What happens in each case?

catˆs# foxˆs# axleˆs#
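A similar sketch works for the plural transducer, again using plain string rewriting rather than an FST; the e-insertion context {z,s,x} follows the slide, and "^"/"#" are stand-ins for the boundary and end-of-word markers:

```python
import re

def plural_surface(s):
    """Insert e between a sibilant and the s suffix (fox^s# -> foxes);
    otherwise just delete the ^ and # markers."""
    s = re.sub(r"([szx])\^s#", r"\1es", s)
    return s.replace("^", "").replace("#", "")
```

This gives cats, foxes, and axles for the three cases asked about above.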

SLIDE 29

Remaining problem: ambiguity

  • FSTs often produce multiple analyses for a single form:

walks → walk+V+3sg OR walk+N+pl
German ‘the’: 6 surface forms, but 24 possible analyses

  • Resolve using context (surrounding words), usually in a probabilistic system (stay tuned...)

SLIDE 30

More info and tools

  • More information: Oflazer (2009): Computational Morphology

http://fsmnlp2009.fastar.org/Program files/Oflazer - slides.pdf

  • OpenFST (Google and NYU)

http://www.openfst.org/

  • Carmel Toolkit

http://www.isi.edu/licensed-sw/carmel/

  • FSA Toolkit

http://www-i6.informatik.rwth-aachen.de/~kanthak/fsa.html

SLIDE 31

Related task: string similarity

Given two strings, how “similar” are they?

  • Could indicate morphological relationships:

walk - walks, sleep - slept

  • Or possible spelling errors (and corrections):

definition - defintion, separate - seperate

  • Also used in other fields, e.g., bioinformatics:

ACCGTA - ACCGATA

SLIDE 32

One measure: minimum edit distance

  • How many changes to go from string s1 → s2?

S T A L L
T A L L      (deletion)
T A B L      (substitution)
T A B L E    (insertion)

  • To solve the problem, we need to find the best alignment between the words.

– Could be several equally good alignments.

SLIDE 33

Example alignments

Let ins/del cost (distance) = 1, sub cost = 2 (0 if no change); other costs are possible, including different costs for different characters.

  • Two optimal alignments (cost = 4):

    S T A L L -        S T A - L L
    d | | s | i        d | | i | s
    - T A B L E        - T A B L E

SLIDE 34

Example alignments

Let ins/del cost (distance) = 1, sub cost = 2 (0 if no change); other costs are possible, including different costs for different characters.

  • Two optimal alignments (cost = 4):

    S T A L L -        S T A - L L
    d | | s | i        d | | i | s
    - T A B L E        - T A B L E

  • LOTS of non-optimal alignments, such as:

    S T A - L - L      S T A L - L -
    s d | i | i d      d d s s i | i
    T - A B L E -      - - T A B L E

SLIDE 35

Brute force solution: too slow

How many possible alignments to consider?

  • First character could align to any of: T, A, B, L, E (or nothing)
  • Next character can align anywhere to its right
  • And so on... the number of alignments grows exponentially with the length of the sequences.

SLIDE 36

Brute force solution: too slow

How many possible alignments to consider?

  • First character could align to any of: T, A, B, L, E (or nothing)
  • Next character can align anywhere to its right
  • And so on... the number of alignments grows exponentially with the length of the sequences.

To solve, we use a dynamic programming algorithm

  • Store solutions to smaller computations and combine them
  • Widespread in NLP, e.g. tagging (HMMs), parsing (CKY)

SLIDE 37

Intuition

  • Minimum distance D(stall, table) must be the minimum of:

– D(stall, tabl) + 1 (ins e)
– D(stal, table) + 1 (del l)
– D(stal, tabl) + 2 (sub l/e)

  • Similarly for the smaller subproblems
  • So proceed as follows:

– solve smallest subproblems first
– store solutions in a table (chart)
– use these to solve and store larger subproblems until we get the full solution

SLIDE 38

Chart: starting point

        T   A   B   L   E
    0   1   2   3   4   5
S   1
T   2
A   3
L   4
L   5

  • Chart[i, j] stores D(stall[0..i],table[0..j])

– Ex: Chart[3,0] = D(stall[0..3], table[0..0]) = D(sta, ε)

SLIDE 39

Chart: next step

        T   A   B   L   E
    0   1   2   3   4   5
S   1   ?
T   2
A   3
L   4
L   5

  • Chart[1, 1] is D(S, T): the minimum of

D(-, T) + 1   (Chart[0, 1] + 1 = 2)
D(S, -) + 1   (Chart[1, 0] + 1 = 2)
D(-, -) + 2   (Chart[0, 0] + 2 = 2)

SLIDE 40

Chart: one more step

        T   A   B   L   E
    0   1   2   3   4   5
S   1   2
T   2   ?
A   3
L   4
L   5

  • Chart[2, 1] is D(ST, T): the minimum of

D(S, T) + 1   (Chart[1, 1] + 1 = 3)
D(ST, -) + 1  (Chart[2, 0] + 1 = 3)
D(S, -) + 0   (Chart[1, 0] + 0 = 1)

SLIDE 41

Chart: next steps

        T   A   B   L   E
    0   1   2   3   4   5
S   1   2
T   2   1
A   3
L   4
L   5

  • Continue by filling in each full column in order (or go by rows)

SLIDE 42

Chart: completed

        T   A   B   L   E
    0   1   2   3   4   5
S   1   2   3   4   5   6
T   2   1   2   3   4   5
A   3   2   1   2   3   4
L   4   3   2   3   2   3
L   5   4   3   4   3   4
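The chart-filling procedure can be sketched as a short dynamic program. This is a standard textbook implementation using the lecture's costs, not code from the course:

```python
def min_edit_distance(s1, s2, ins=1, dele=1, sub=2):
    """Fill the chart row by row; ins/del = 1, sub = 2, match = 0."""
    n, m = len(s1), len(s2)
    chart = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):           # first column: delete everything
        chart[i][0] = i * dele
    for j in range(1, m + 1):           # first row: insert everything
        chart[0][j] = j * ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else sub
            chart[i][j] = min(chart[i - 1][j] + dele,      # deletion
                              chart[i][j - 1] + ins,       # insertion
                              chart[i - 1][j - 1] + cost)  # sub / match
    return chart[n][m]
```

The bottom-right cell reproduces the value 4 in the completed chart for stall and table.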

SLIDE 43

To find alignments

          T       A       B       L       E
     0    ←1      ←2      ←3      ←4      ←5
S    ↑1   ←↖↑2    ←↖↑3    ←↖↑4    ←↖↑5    ←↖↑6
T    ↑2   ↖1      ←2      ←3      ←4      ←5
A    ↑3   ↑2      ↖1      ←2      ←3      ←4
L    ↑4   ↑3      ↑2      ←↖↑3    ↖2      ←3
L    ↑5   ↑4      ↑3      ←↖↑4    ↖↑3     ←↖↑4

  • also store which subproblem the best score came from
  • backtrack to get the best alignment

⇒ More complete worked example on lecture page, with exercises.
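The backtracking step can be sketched by storing, for each cell, which neighbour produced its minimum, then following those pointers back from the end. This is a standard construction, not the course's code; the op letters d/i/s/| follow the alignment notation above:

```python
def alignment(s1, s2, ins=1, dele=1, sub=2):
    """Return one optimal alignment as a string of ops (d, i, s, |)."""
    n, m = len(s1), len(s2)
    chart = [[(0, None)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        chart[i][0] = (i * dele, "up")
    for j in range(1, m + 1):
        chart[0][j] = (j * ins, "left")
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else sub
            chart[i][j] = min(
                (chart[i - 1][j][0] + dele, "up"),        # deletion
                (chart[i][j - 1][0] + ins, "left"),       # insertion
                (chart[i - 1][j - 1][0] + cost, "diag"))  # sub / match
    ops, i, j = [], n, m
    while i > 0 or j > 0:                # backtrack along stored pointers
        move = chart[i][j][1]
        if move == "diag":
            ops.append("|" if s1[i - 1] == s2[j - 1] else "s")
            i, j = i - 1, j - 1
        elif move == "up":
            ops.append("d")
            i -= 1
        else:
            ops.append("i")
            j -= 1
    return "".join(reversed(ops))
```

For stall and table this recovers one of the two optimal alignments above; which one depends on how ties between pointers are broken.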

SLIDE 44

Questions for review

  • How are FSMs and FSTs used for morphological recognition, analysis and generation?

  • How can we deal with spelling changes in morphological analysis?
  • What is an alignment between two strings?
  • What is minimum edit distance and how do we compute it?

– What’s wrong with a brute force solution, and how do we solve that problem?

SLIDE 45

Announcements

  • Next lecture: Probability models and estimation.

– Assumes you know or are getting to grips with the material in the maths tutorials in the Readings section. Do it now!

  • Tutorials start on Tuesday. Try to register for this class ASAP so you’re assigned to a group. But if not, go to a group anyway!

– Work through the exercise sheet for Week 2 (see web page): bring questions to your tutorial group.

  • Remember to register for Piazza (use link on Learn).
  • Drop-in TA hours Mon/Fri 10-11, see Help+Support on Learn.
