Finite State Methods for Lexicon and Morphology (Bernd Kiefer)

SLIDE 1

Foundations of Language Science and Technology

Finite State Methods for Lexicon and Morphology

Bernd Kiefer

Bernd.Kiefer@dfki.de

Deutsches Forschungszentrum für Künstliche Intelligenz

Finite State Methods for Morphology – p.1/41

SLIDE 2

Morphological Parsing

Break a surface form into morphemes:
  • foxes into fox (noun stem) and -e -s (plural suffix + e-insertion)
Compute stem and features:
  • goose → goose +N +SG or +V
  • geese → goose +N +PL
  • gooses → goose +V +3SG
Needed for (among others):
  • spell-checking: is steadyly or steadily correct?
  • identifying a word’s part of speech
  • reducing a word to its stem

SLIDE 3

Morphological Knowledge

Components needed in a morphological parser:

  • 1. Lexicon: list of stems and class information (base, inflectional class etc.)
  • 2. Morphotactics: a model of morphological processes like English adjective inflection on the last slide

Lexical and morphotactic knowledge will be encoded using finite-state automata

  • 3. Orthography: a model of how the spelling changes when morphemes combine, e.g., city+s → cities, or in → il in the context of l, as in in- + legal

Orthography will be modeled using finite-state transducers

SLIDE 4

Detour: Describing Languages

Language: a set of finite sequences of symbols
Symbols can be anything: graphemes, phonemes, etc.
Alphabet: the inventory of symbols
We want formal devices to describe the strings in a language

SLIDE 5

Formal Languages - Definitions

Alphabet Σ (Sigma): a nonempty finite set of symbols
Strings of a language: arbitrary finite sequences of symbols in Σ
  • ε (epsilon) denotes the empty string
  • Σ* is the set of all strings over Σ, including ε
A language L is a subset of Σ*, L ⊆ Σ*
  • grammatical sentences: w ∈ L
  • ungrammatical sentences: v ∉ L

[Figure: the language L drawn as a region inside Σ*]

SLIDE 6

Formal Grammars - Definitions

Mathematical devices to describe languages
Goal: separate the grammatical from the ungrammatical strings
One of the devices: rule systems
  • Two alphabets: terminals Σ, nonterminals N
  • Rules rewrite strings in (Σ ∪ N)* into new strings in (Σ ∪ N)*
Languages differ in complexity
Complexity depends on the type of rule system / device needed

SLIDE 7

Chomsky Hierarchy

Type 3: regular languages
  • A → α, A → α B; A, B ∈ N; α ∈ Σ*
Type 2: context free languages
  • A → ψ; ψ ∈ (Σ ∪ N)*
Type 1: context sensitive languages
  • α A β → α ψ β; α, β ∈ Σ*
Type 0: unrestricted
  • α A β → ψ

The following inclusions hold: Type 3 ⊂ Type 2 ⊂ Type 1 ⊂ Type 0

SLIDE 8

Regular Languages

Simplest formal languages, rules A → x, A → x B
Alternative characterization: use symbols from the alphabet and combine them using
    concatenation: •
    alternative: |
    Kleene star: * (repeat zero or more times)
Examples:
    {the}•{gifted}•{student}
    {the}•({very}|{extremely})•{gifted}•{student}
    ({0}|{1}|{2}|{3}|{4}|{5}|{6}|{7}|{8}|{9})*•({0}|{2}|{4}|{6}|{8})
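
The three example expressions can be tried out with an ordinary regular expression library. The following is a minimal sketch using Python's `re` module; the names `GIFTED`, `EVEN` and `in_language` are illustrative and not part of the slides, and words are assumed to be separated by single spaces.

```python
import re

# Words are the symbols here; a single space separates them in the input string.
GIFTED = re.compile(r"the (very|extremely) gifted student")

# Digits are the symbols: any digit sequence followed by one even digit.
EVEN = re.compile(r"[0-9]*[02468]")


def in_language(pattern, s):
    """True if the whole string s belongs to the language of pattern."""
    return pattern.fullmatch(s) is not None


print(in_language(GIFTED, "the very gifted student"))  # True
print(in_language(EVEN, "123"))                         # False
```

`fullmatch` is used instead of `search` because membership in a language is a statement about the whole string, not about a substring.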

SLIDE 13

Properties of Regular Languages

Rule systems are right linear
  • Nonterminal always at the right end of the rule’s right hand side: A → x, A → x B
A linear (in the size of the string) number of steps is enough to answer: w ∈ L?
Can describe arbitrarily long strings, e.g., sheep talk: ba(a)*h
Can describe infinite languages
What is the simplest thing not possible (Hotz’s question)?
  • aⁿbⁿ, n ∈ ℕ
  • only finite counting!
Equivalent to finite automata
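
The contrast between sheep talk and aⁿbⁿ can be made concrete. The sketch below is an illustration, not taken from the slides: `sheep_talk` uses a fixed transition table with four states, while `a_n_b_n` needs a counter that can grow without bound, which is exactly what a finite automaton lacks. It assumes n ≥ 1.

```python
def sheep_talk(s):
    """Finite automaton for ba(a)*h: a fixed set of states, one step per symbol."""
    # States: 0 start, 1 after 'b', 2 after at least one 'a', 3 final.
    delta = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 2, (2, "h"): 3}
    state = 0
    for ch in s:
        state = delta.get((state, ch))
        if state is None:  # no arc for this symbol: reject
            return False
    return state == 3


def a_n_b_n(s):
    """Recognizer for a^n b^n (n >= 1). The counter n is unbounded,
    so this is not a finite-state device."""
    n = 0
    while n < len(s) and s[n] == "a":
        n += 1
    return n > 0 and s[n:] == "b" * n


print(sheep_talk("baaaah"), a_n_b_n("aaabbb"))  # True True
```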

SLIDE 14

Finite Automata

A finite set of states Q, containing a start state q0 and a subset of final states F
An input tape containing the input string and a pointer to mark the current input position
A transition relation δ ⊆ Q × (Σ ∪ {ε}) × Q
Possible moves depend on:
  • the current state
  • the current input symbol
Every move that reads a symbol advances the input pointer
Graphical representation: directed graph, states are nodes, edges are state transitions

SLIDE 21

Nondeterministic Finite Automata

Automata where δ is a relation and ε arcs are allowed are called nondeterministic automata
The move may not be uniquely determined based on the next input symbol
Example: the (extremely gifted|ε) gifted* student

[Figure: automaton with states q0 to q4; arc labels in order of appearance: the, extremely, gifted, ε, gifted, student]
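
One way to run such an automaton without guessing is to track the set of all states it could be in. The sketch below assumes a topology reconstructed from the regular expression above (q1 reaches q3 either via extremely gifted or via an ε arc, and q3 loops on gifted); the figure itself may differ in detail. Words are the input symbols, and the empty string stands for ε.

```python
EPS = ""  # label used for ε arcs in this sketch

# Assumed topology, reconstructed from: the (extremely gifted|ε) gifted* student
ARCS = {
    ("q0", "the"): {"q1"},
    ("q1", "extremely"): {"q2"},
    ("q2", "gifted"): {"q3"},
    ("q1", EPS): {"q3"},
    ("q3", "gifted"): {"q3"},
    ("q3", "student"): {"q4"},
}
START, FINAL = "q0", {"q4"}


def eps_closure(states):
    """All states reachable from `states` using only ε arcs."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in ARCS.get((q, EPS), ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def accepts(words):
    """Track the set of states the automaton could be in after each word."""
    current = eps_closure({START})
    for w in words:
        moved = set()
        for q in current:
            moved |= ARCS.get((q, w), set())
        current = eps_closure(moved)
    return bool(current & FINAL)


print(accepts("the extremely gifted student".split()))  # True
print(accepts("the extremely student".split()))         # False
```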

SLIDE 22

Closure Properties

Language type A is closed under operation x means: applying x to members of A results in an element of the same type
Regular languages are closed under
  • Concatenation, Union (trivial)
  • Complementation: exchange final and nonfinal states of a (deterministic, complete) automaton
  • Intersection: L1 ∩ L2 = ¬(¬L1 ∪ ¬L2)
Applicability of these operations facilitates modularization
  • E.g., concatenate automaton for base word forms with one for inflectional suffixes
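
The closure operations can be sketched directly on deterministic automata. The `DFA` class and the two toy languages below are hypothetical illustrations, not from the slides. Union is built here as a product automaton so that the result stays deterministic and complete; this is one possible construction, chosen because complementation by swapping final states is only valid in that case.

```python
from itertools import product


class DFA:
    """Complete deterministic automaton: delta is defined for every (state, symbol)."""

    def __init__(self, states, alphabet, delta, start, finals):
        self.states, self.alphabet = set(states), set(alphabet)
        self.delta, self.start, self.finals = dict(delta), start, set(finals)

    def accepts(self, s):
        q = self.start
        for ch in s:
            q = self.delta[(q, ch)]
        return q in self.finals


def complement(m):
    """Exchange final and nonfinal states."""
    return DFA(m.states, m.alphabet, m.delta, m.start, m.states - m.finals)


def union(m1, m2):
    """Product automaton that is final when either component is final."""
    states = set(product(m1.states, m2.states))
    delta = {((p, q), a): (m1.delta[(p, a)], m2.delta[(q, a)])
             for (p, q) in states for a in m1.alphabet}
    finals = {(p, q) for (p, q) in states if p in m1.finals or q in m2.finals}
    return DFA(states, m1.alphabet, delta, (m1.start, m2.start), finals)


def intersection(m1, m2):
    """L1 ∩ L2 = ¬(¬L1 ∪ ¬L2)."""
    return complement(union(complement(m1), complement(m2)))


# Two toy languages over {a, b}: an even number of a's, and strings ending in b.
EVEN_A = DFA({0, 1}, "ab", {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}, 0, {0})
ENDS_B = DFA({0, 1}, "ab", {(0, "a"): 0, (1, "a"): 0, (0, "b"): 1, (1, "b"): 1}, 0, {1})
BOTH = intersection(EVEN_A, ENDS_B)

print(BOTH.accepts("aab"), BOTH.accepts("ab"))  # True False
```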

SLIDE 23

Finite Automata: Search

[Figure: automaton with states q0 to q3 and arcs labeled st, er, e, n, m, r, s and ε. The build sequence of this slide animates the search step by step: each dead end is marked "Failure!", followed by "Backtracking", until the last step is marked "Success!"]

German adjective endings. Input: klein + er + es
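
The search can be sketched as a depth-first traversal that backtracks from dead ends. The arc layout below is an assumption reconstructed from the arc labels: q0 to q1 via st, er or ε; q1 to q2 via e or ε; q2 to q3 via n, m, r, s or ε. The order in which arcs are tried is also arbitrary. The input is the suffix string `eres` of kleineres, with the stem already stripped.

```python
# Assumed arcs (hypothetical reconstruction); "" stands for ε.
ARCS = {
    "q0": [("st", "q1"), ("", "q1"), ("er", "q1")],
    "q1": [("", "q2"), ("e", "q2")],
    "q2": [("", "q3"), ("n", "q3"), ("m", "q3"), ("r", "q3"), ("s", "q3")],
    "q3": [],
}
FINAL = {"q3"}


def search(state, rest, path, dead_ends):
    """Depth-first search. Returns the list of arc labels of the first accepting
    path, or None. Every abandoned partial path is recorded in dead_ends."""
    if state in FINAL and rest == "":
        return path  # Success!
    for label, target in ARCS[state]:
        if rest.startswith(label):
            found = search(target, rest[len(label):], path + [label or "ε"], dead_ends)
            if found is not None:
                return found
            # otherwise: Backtracking, try the next arc
    dead_ends.append(path)  # Failure! nothing worked from this state
    return None


DEAD_ENDS = []
PATH = search("q0", "eres", [], DEAD_ENDS)
print(PATH)            # ['er', 'e', 's']
print(len(DEAD_ENDS))  # number of dead ends visited before success
```

The number of dead ends depends on the order of the arcs, which is why search cost becomes a concern for large nondeterministic automata.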

SLIDE 36

Nondeterministic vs. Deterministic

Search becomes a problem in big automata
Solution: determinisation
  • the transition relation has to be a total function Q × Σ → Q: exactly one choice
  • for every nondeterministic automaton, a deterministic automaton can be constructed that accepts the same language
  • recognition is linear in the size of the string
  • but: the size of the deterministic automaton can be exponential in the size of the original automaton
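
A common way to construct the deterministic automaton is the subset construction, in which every deterministic state is a set of nondeterministic states. The sketch below is a minimal version under that assumption; the slides do not name a specific algorithm. The example automaton (strings over {a, b} whose third symbol from the end is a) is a standard illustration of the blow-up: 4 nondeterministic states become 8 deterministic ones.

```python
from collections import deque

EPS = ""  # label used for ε arcs in this sketch


def eps_closure(arcs, states):
    """All states reachable from `states` using only ε arcs."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in arcs.get((q, EPS), ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return frozenset(seen)


def determinize(arcs, start, finals, alphabet):
    """Subset construction: each deterministic state is a frozenset of NFA states."""
    d_start = eps_closure(arcs, {start})
    d_delta, d_finals = {}, set()
    queue, seen = deque([d_start]), {d_start}
    while queue:
        S = queue.popleft()
        if S & finals:
            d_finals.add(S)
        for a in alphabet:
            moved = set()
            for q in S:
                moved |= arcs.get((q, a), set())
            T = eps_closure(arcs, moved)
            d_delta[(S, a)] = T  # total function: the empty set acts as dead state
            if T not in seen:
                seen.add(T)
                queue.append(T)
    return d_start, d_delta, d_finals


def run(d_start, d_delta, d_finals, s):
    """Deterministic recognition: exactly one table lookup per input symbol."""
    S = d_start
    for ch in s:
        S = d_delta[(S, ch)]
    return S in d_finals


# NFA: the third symbol from the end is 'a'.
NFA_ARCS = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "a"): {2}, (1, "b"): {2},
    (2, "a"): {3}, (2, "b"): {3},
}
D_START, D_DELTA, D_FINALS = determinize(NFA_ARCS, 0, {3}, "ab")
DFA_STATES = {S for (S, _) in D_DELTA}
print(len(DFA_STATES))                         # 8
print(run(D_START, D_DELTA, D_FINALS, "abb"))  # True
```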

SLIDE 37

Advantages of Finite Automata

efficiency
  • very fast if deterministic or low-degree non-determinism
  • space: compressed representations of data
system development and maintenance
  • modular design and automatic compilation of system components
  • high level specifications
language modelling
  • uniform framework for modelling dictionaries and rules

SLIDE 38

FSA for Morphology

Let’s first have a look at concatenative morphology
  • cats: cat + s
  • unbelievable: un + believe + able
Use different automata for
  • prefixes
  • base form ⇒ lexicon (we’ll do this first)
  • suffixes
and combine them with concatenation
Recognition is not enough: analysis should return information, e.g., inflectional class
Idea: associate final states with information

SLIDE 43

Lexicon representation

Why not simply list all words?

    stiff  pos         stiffer  comp    stiffest  sup    stiffly  adv
    still  pos & adv   stiller  comp    stillest  sup
    stout  pos & adv   stouter  comp    stoutest  sup
    stony  pos         stonier  comp    . . .

  • large, wasteful, incomplete
  • no (morphological) handling of new words
  • what about languages with a more productive morphology, e.g., Finnish or Turkish?
Encode each phenomenon / process in one automaton
Combine them and get an efficient machine

SLIDE 44

Lexicon representation

    stiff  pos         stiffer  comp    stiffest  sup    stiffly  adv
    still  pos & adv   stiller  comp    stillest  sup
    stout  pos & adv   stouter  comp    stoutest  sup
    stony  pos         stonier  comp    . . .

Separate base form and modifications, e.g., (inflectional) affixes:

    { stiff, still, stout, stony, stolen, straight, . . . }  +  { ε pos, er comp, est sup, ly adv }    really?

Other morphological processes like un- negation:
  • un + happy
  • un + clear + ly

SLIDE 47

Lexicon Automaton

. . ., sandy, still, stolen, stony, stout, . . .

  • 1. construct a letter tree (or trie); leaves ≡ final nodes
  • 2. associate the leaves with lexical information
  • 3. merge the nodes with identical information: minimize the automaton

[Figure: letter tree for the listed words, branching on s, then a / t, and so on; final nodes carry labels such as -ly-adv, +ly-adv and +ly-adv, y→i; in the minimized version, nodes with identical information are shared]
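
The three steps can be sketched as follows. The feature strings attached to each word are illustrative assumptions loosely read off the figure labels, and `build_trie`, `lookup`, `minimize` and `count_nodes` are hypothetical helper names. Merging is done bottom-up: two nodes are shared when they carry the same information and the same outgoing arcs to already shared nodes.

```python
LEXICON = {  # features are illustrative assumptions, not authoritative
    "sandy": "+ly-adv, y→i",
    "still": "-ly-adv",
    "stolen": "-ly-adv",
    "stony": "+ly-adv, y→i",
    "stout": "+ly-adv",
}


def build_trie(lexicon):
    """Steps 1 and 2: one node per prefix, lexical information on final nodes."""
    root = {"info": None, "next": {}}
    for word, info in lexicon.items():
        node = root
        for ch in word:
            node = node["next"].setdefault(ch, {"info": None, "next": {}})
        node["info"] = info
    return root


def lookup(root, word):
    """Follow the arcs letter by letter; return the information on the final node."""
    node = root
    for ch in word:
        node = node["next"].get(ch)
        if node is None:
            return None
    return node["info"]


def minimize(node, registry):
    """Step 3: merge nodes with identical information and identical successors."""
    for ch, child in list(node["next"].items()):
        node["next"][ch] = minimize(child, registry)
    key = (node["info"], tuple(sorted((ch, id(c)) for ch, c in node["next"].items())))
    return registry.setdefault(key, node)


def count_nodes(node, seen=None):
    """Count distinct nodes reachable from node (shared nodes count once)."""
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_nodes(c, seen) for c in node["next"].values())


TRIE = build_trie(LEXICON)
BEFORE = count_nodes(TRIE)
DAG = minimize(TRIE, {})
AFTER = count_nodes(DAG)
print(BEFORE, AFTER)         # 18 15
print(lookup(DAG, "stony"))  # +ly-adv, y→i
```

With only five words the saving is small; the point of the sketch is that shared endings such as the final y of sandy and stony collapse into one node.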

SLIDE 48

Suffixes: German Adjectives

[Figure: automaton with states q0 to q3 and arcs labeled st, er, e, n, m, r, s and ε]

Only one final state: How to get the different values?

SLIDE 49

Suffixes: German Adjectives

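[Figure: automaton with states q0 to q3 and arcs labeled st, er, e, n, m, r, s and ε]

final states with different information cannot be combined: expand automaton

[Figure: expanded automaton with separate branches for e, er and st, each continuing with the endings n, m, r, s and ending in its own final state; final-state labels include comp+ntr+sg+(nom|acc) and sup+ntr+sg+(nom|acc)]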

SLIDE 51

Combining the Levels

[Figure: combined automaton with states q0 to q6 and q′1; arcs labeled un, ε, adj-lex, ly, est and er; the final states of adj-lex are marked +ly or -ly, its start states +un or -un]

What about: un. . . with big; . . .ly with still?
  • Split start nodes in adj-lex, like the final nodes
  • But: splits the lexicon, less compact
  • Alternative: special flags that are handled by the machinery

SLIDE 52

Two-Level Morphology

Represents a word as correspondence between two levels
  • Lexical level: abstract morphemes and features
  • Surface level: the actual spelling of the word
Can be implemented using finite state transducers
A finite state transducer rewrites the input onto a second, additional tape

    Surface:  c a t s
    Lexical:  c a t +N +PL

SLIDE 53

Automaton vs. Transducer

Finite-state Automaton
  • Arcs are labeled with symbols like a and b
  • Accepts strings like aaab
  • Defines a regular language: { a, ab, aab, aaab, . . . }
Finite-state Transducer
  • Arcs are labeled with symbol pairs like a:b and b:b, but also b:ε and ε:a (and b as shorthand for b:b)
  • Accepts a pair of strings like aaab:aabb
  • Defines a regular relation: { a:b, aa:bb, aaa:bbb, . . . }
We will use it to accept string pairs like cat+N+PL:cats and fox+N+PL:foxes
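
A transducer can be sketched as an automaton whose arcs carry symbol pairs. The toy machine below is a hypothetical example covering only the stem cat; arcs are written lexical:surface, and the empty string stands for ε. Swapping the two sides of every pair turns the generator into an analyzer.

```python
EPS = ""  # stands for ε on either side of a pair

# Arcs: (source, lexical symbol, surface symbol, target). Hypothetical toy machine.
ARCS = [
    (0, "c", "c", 1), (1, "a", "a", 2), (2, "t", "t", 3),
    (3, "+N", EPS, 4),
    (4, "+SG", EPS, 5),
    (4, "+PL", "s", 5),
]
START, FINALS = 0, {5}


def invert(arcs):
    """Swap input and output side of every arc."""
    return [(p, out, inp, q) for (p, inp, out, q) in arcs]


def translate(arcs, symbols, state=START, output=()):
    """Yield every output sequence that the transducer pairs with `symbols`."""
    if not symbols and state in FINALS:
        yield list(output)
    for p, inp, out, q in arcs:
        if p != state:
            continue
        emitted = output + ((out,) if out != EPS else ())
        if inp == EPS:  # arc reads nothing from the input tape
            yield from translate(arcs, symbols, q, emitted)
        elif symbols and symbols[0] == inp:
            yield from translate(arcs, symbols[1:], q, emitted)


# Generation: lexical to surface.
print(list(translate(ARCS, ["c", "a", "t", "+N", "+PL"])))  # [['c', 'a', 't', 's']]
# Analysis: surface to lexical, using the inverted machine.
print(list(translate(invert(ARCS), list("cats"))))          # [['c', 'a', 't', '+N', '+PL']]
```

The recursion terminates here because the toy machine has no cycle of arcs that read ε; a general implementation would need to guard against such cycles.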

SLIDE 54

Four Views on Transducers

    Surface:  c a t s
    Lexical:  c a t +N +PL

  • 1. Recognizer: machine that accepts or rejects pairs of strings
  • 2. Generator: machine that outputs pairs of strings
  • 3. Translator: machine that reads one string and outputs another string (in both directions)
  • 4. Set Relator: machine that computes relations between sets

SLIDE 55

Cascaded Transducers

To accommodate all spelling / pronunciation changes, one transducer alone is not powerful enough
Use intermediate tapes that contain the output of one transducer and serve as input to another transducer
To handle irregular spelling changes, we can add intermediate tapes with intermediate symbols: ˆ for morpheme boundary, # for word boundary

[Figure: tapes labeled Surface and Lexical; tape contents shown: f o x ˆ s # and f o x +N +PL]
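
A cascade can be sketched as plain function composition, where each stage stands in for one transducer and the value passed along is the tape. Everything below is a hypothetical illustration: the tag table `SUFFIXES` covers only two feature combinations, the ASCII character `^` replaces ˆ, and the second stage only deletes boundary symbols, so it does not yet apply any spelling rule.

```python
SUFFIXES = {("+N", "+SG"): "", ("+N", "+PL"): "^s"}  # hypothetical mini tag table


def to_intermediate(lexical):
    """Lexical tape (list of symbols) -> intermediate tape with ^ and # boundaries."""
    stem = "".join(sym for sym in lexical if not sym.startswith("+"))
    tags = tuple(sym for sym in lexical if sym.startswith("+"))
    return stem + SUFFIXES[tags] + "#"


def strip_boundaries(intermediate):
    """Default stage: delete boundary symbols, apply no spelling change."""
    return intermediate.replace("^", "").replace("#", "")


def cascade(*stages):
    """Feed the output tape of each stage into the next one."""
    def run(tape):
        for stage in stages:
            tape = stage(tape)
        return tape
    return run


generate = cascade(to_intermediate, strip_boundaries)
print(to_intermediate(["f", "o", "x", "+N", "+PL"]))  # fox^s#
print(generate(["c", "a", "t", "+N", "+PL"]))         # cats
```

Spelling rules would be inserted as additional stages between the two shown here; without them, this pipeline is only correct for regular forms such as cats.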

SLIDE 56

Some English Orthographic Rules

English orthographic rules that apply at particular morpheme boundaries

    Name                 Description of rule                                 Example
    consonant doubling   consonant doubled before -ing/-ed                   beg / begging
    e-deletion           silent e dropped before -ing/-ed                    make / making
    e-insertion          e added after -s, -z, -x, -ch, -sh before -s        watch / watches
    y-replacement        -y changes to -ie before -s, to -i before -ed       try / tries
    k-insertion          verbs ending with vowel + -c add -k                 panic / panicked

SLIDE 57

Orthographic Rules II

Spelling rules take the concatenation of morphemes (the intermediate tape) as input and produce the surface form
Example: the e-insertion rule is applied to the intermediate form foxˆs#

    Surface:       f o x e s
    Intermediate:  f o x ˆ s #
    Lexical:       f o x +N +PL

SLIDE 58

e-Insertion

rule: .* ((z|s|x) ˆ:ε ε:e | ¬(z|s|x) ˆ:ε) s #

[Figure: transducer with states q0 to q5; arcs labeled z,s,x; ˆ:ε; ε:e; s; # and ⋆]

⋆: all pairs not in this transducer; remember y is shorthand for y:y

States q0 and q1 accept default pairs like catˆs#:cats#
State q5 rejects incorrect pairs like foxˆs#:foxs#
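
The effect of the rule on an intermediate form can be approximated with a regular expression substitution. This is a sketch of the mapping only, not of the six-state transducer in the figure; it uses the ASCII character `^` for ˆ and covers only the contexts x, z and s named in the rule.

```python
import re


def e_insertion(intermediate):
    """Map an intermediate form such as 'fox^s#' to a surface form such as 'foxes#'."""
    # After x, z or s, the boundary before a word-final s is realized as e
    # (the pairs ^:ε and ε:e taken together).
    out = re.sub(r"(?<=[xzs])\^(?=s#)", "e", intermediate)
    # In every other context the boundary is simply deleted (^:ε).
    return out.replace("^", "")


print(e_insertion("fox^s#"))  # foxes#
print(e_insertion("cat^s#"))  # cats#
```

Unlike the transducer, this function only works in the generation direction; it cannot be inverted to map foxes# back to fox^s#.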

SLIDE 59

y-Replacement

[Figure: transducer with states q0 to q5; arcs labeled y:i, ˆ:e, ˆ:ε, s:s and #:#]

Ex.: spy+s → spies
rule: .* ((y:i ˆ:e)|(¬ y ˆ:ε)) #
All these machines do not change input to which they do not apply
Nevertheless, the rule writer must take care of all interactions
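
The same kind of approximation works for y-replacement. The sketch below follows the figure in requiring a following s (the arc s:s), again writes `^` for ˆ, and is an illustration rather than the transducer itself. It also shows why interactions need care: as stated, the rule has no condition on the letter before y, so it over-applies to stems such as boy.

```python
import re


def y_replacement(intermediate):
    """Map 'spy^s#' to 'spies#': y:i and ^:e before a word-final s."""
    out = re.sub(r"y\^(?=s#)", "ie", intermediate)
    # In every other context the boundary is simply deleted (^:ε).
    return out.replace("^", "")


print(y_replacement("spy^s#"))  # spies#
print(y_replacement("cat^s#"))  # cats#
print(y_replacement("boy^s#"))  # boies#  (over-application: the rule lacks a consonant condition)
```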

SLIDE 60

Summary

The task of morphological analysis/generation
(Very short) introduction to formal languages
  • Basics of regular languages
  • Nondeterministic and deterministic finite automata
Applying finite state techniques to morphological knowledge
  • Lexicon: compacted tries
  • Concatenative phenomena: finite automata
  • Associating information with final states
  • Derivational phenomena: finite state transducers
