SLIDE 1

Example Applications of Finite State Machines

Data structures and algorithms for Computational Linguistics III Çağrı Çöltekin ccoltekin@sfs.uni-tuebingen.de

University of Tübingen Seminar für Sprachwissenschaft

Winter Semester 2018–2019

SLIDE 2

Introduction Pattern matching Finite-state lexicons Morphology Finite-state morphology XFST: a quick introduction

Applications of finite-state methods

  • Finite-state methods are attractive for formal and computational reasons
  • They are applied in a wide variety of fields
    – Electronic circuit design
    – Workflow management
    – Games
    – Pattern matching
    – Tokenization, stemming
    – Morphological analysis
    – Chunking
    – …
  • This lecture
    – FSA for pattern matching
    – FSA for storing a lexicon
    – Finite-state morphology

Ç. Çöltekin, SfS / University of Tübingen WS 18–19 1 / 19

SLIDE 3

Finite state automata

a refresher

  • An FSA recognizes (or generates) a regular language; FSA are equivalent in expressive power to regular expressions
  • FSA are closed under
    – Concatenation
    – Kleene star
    – Union
    – Intersection
    – Complement
    – Reversal
  • Two types:
    – DFA: a single transition from each state on each input symbol
    – NFA: transitions to possibly multiple states on a single input symbol, or without consuming an input symbol (ϵ-NFA)
  • Every FSA has a unique minimal DFA
    – For every NFA there is a DFA that accepts the same regular language (determinization)
    – A DFA can be minimized to an equivalent DFA with the minimum number of states (minimization)
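
Determinization can be illustrated with the textbook subset construction. The following is a minimal sketch, assuming an NFA without ϵ-transitions that is stored as a dictionary from (state, symbol) to a set of next states; the function names and the example automaton (strings over {a, b} ending in "ab") are illustrative assumptions, not taken from the lecture.

```python
from itertools import chain


def determinize(nfa, start, finals, alphabet):
    """Subset construction for an NFA without epsilon transitions.

    nfa maps (state, symbol) to a set of next states.
    DFA states are frozensets of NFA states.
    """
    dfa_start = frozenset([start])
    dfa, dfa_finals = {}, set()
    agenda, seen = [dfa_start], {dfa_start}
    while agenda:
        current = agenda.pop()
        if current & finals:  # contains at least one accepting NFA state
            dfa_finals.add(current)
        for symbol in alphabet:
            target = frozenset(
                chain.from_iterable(nfa.get((q, symbol), ()) for q in current)
            )
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                agenda.append(target)
    return dfa, dfa_start, dfa_finals


def accepts(dfa, dfa_start, dfa_finals, string):
    """Run the DFA: exactly one transition per input symbol."""
    state = dfa_start
    for symbol in string:
        state = dfa[(state, symbol)]
    return state in dfa_finals


# Hypothetical example NFA: state 0 loops on a and b, then reads "ab".
EXAMPLE_NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
DFA, DFA_START, DFA_FINALS = determinize(EXAMPLE_NFA, 0, {2}, "ab")
```

In general the subset construction can produce exponentially many DFA states; for small automata such as this one the result stays close to the size of the NFA.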

SLIDE 4

Finite state transducers

a refresher

  • FST transitions are defined on pairs of input–output symbols
  • An FST moves between states on the input symbol, while emitting the output symbol
  • FSTs define a regular relation
  • FSTs are closed under
    – Concatenation
    – Kleene star
    – Union
    – Reversal
    – Inversion
    – Composition
  • Not all FSTs can be determinized
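
To make the notion of a regular relation concrete, here is a minimal sketch of an FST without ϵ-arcs, assuming arcs are stored as (source, input, output, target) tuples. The names `transduce` and `invert` and both example transducers are hypothetical; real toolkits such as HFST or Foma use their own representations.

```python
def transduce(arcs, start, finals, symbols):
    """Return the set of output strings the FST relates to the input."""
    configs = {(start, "")}  # pairs of (current state, output so far)
    for symbol in symbols:
        configs = {
            (dst, output + out_sym)
            for state, output in configs
            for src, in_sym, out_sym, dst in arcs
            if src == state and in_sym == symbol
        }
    return {output for state, output in configs if state in finals}


def invert(arcs):
    """Inversion: swap the input and output symbol on every arc."""
    return [(src, out_sym, in_sym, dst) for src, in_sym, out_sym, dst in arcs]


# Hypothetical one-state transducers over {a, b}.
UPPER = [(0, "a", "A", 0), (0, "b", "B", 0)]
MERGE = [(0, "a", "x", 0), (0, "b", "x", 0)]  # maps both a and b to x
```

`transduce(invert(MERGE), 0, {0}, "xx")` returns four strings, which shows why an FST defines a relation rather than a function: the inverse of a deterministic transducer need not be deterministic.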

SLIDE 5

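Naive string match

Example: searching ‘abab’ in ‘abbabbbabababbab’

Text:      a b b a b b b a b a b a b b a b
Start 0:   a b ×
Start 1:     ×
Start 2:       ×
Start 3:         a b ×
Start 4:           ×
Start 5:             ×
Start 6:               ×
Start 7:                 a b a b   (match)
Start 8:                   ×
Start 9:                     a b a b   (match)
Start 10:                      ×
Start 11:                        a b ×
Start 12:                          ×
Start 13:                            ×
Start 14:                              a b ×

Note the wasted effort after a partial match.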
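
The trace above corresponds to the usual brute-force search. The following sketch counts symbol comparisons so the wasted effort becomes visible; the function name and the comparison counter are illustrative additions. It only tries start positions where the whole pattern still fits.

```python
def naive_search(text, pattern):
    """Return (list of match start positions, number of symbol comparisons)."""
    matches, comparisons = [], 0
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        k = 0
        while k < m:
            comparisons += 1
            if text[start + k] != pattern[k]:
                break  # mismatch: give up on this start position
            k += 1
        if k == m:
            matches.append(start)
    return matches, comparisons


RESULT = naive_search("abbabbbabababbab", "abab")
```

After every mismatch the search restarts one position to the right and re-reads symbols it has already seen, which gives a worst case proportional to the text length times the pattern length.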

SLIDE 7

String matching with an NFA

Another solution

Consider running the following NFA over the string.

[Figure: NFA for the pattern abab, with states numbered up to 4 and arcs labeled a, b, a, b, a, b]

  • The NFA is in the accepting state whenever the last four letters processed match abab (including overlapping matches)
  • Is this faster than the naive algorithm?
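
One way to explore the question is to simulate the NFA directly by tracking the set of active states. This is a sketch under the assumption that the NFA has states 0 to 4, where state 0 loops on every symbol and states 1 to 4 follow the pattern; the function name is hypothetical.

```python
def nfa_match_ends(text, pattern):
    """Simulate the NFA for 'anything followed by pattern'.

    State q means that the first q symbols of the pattern have just been read.
    Returns the text indices at which the accepting state is active.
    """
    accepting = len(pattern)
    active = {0}
    ends = []
    for i, symbol in enumerate(text):
        following = {0}  # state 0 loops on every symbol
        for q in active:
            if q < accepting and pattern[q] == symbol:
                following.add(q + 1)
        active = following
        if accepting in active:
            ends.append(i)
    return ends


ENDS = nfa_match_ends("abbabbbabababbab", "abab")
```

Each input symbol is read once, but the inner loop may visit as many active states as there are pattern positions, so in the worst case this simulation does not beat the naive algorithm.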

SLIDE 9

DFA version

Knuth-Morris-Pratt (KMP) algorithm

[Figure: DFA obtained by determinizing the NFA; the visible state labels are 01, 02, 013, 024 (sets of NFA states), with arcs labeled a and b]

  • The DFA processes every input symbol only once
  • The resulting DFA has the same number of states as the NFA (in general, it is not much larger than the NFA)
  • The approach generalizes to arbitrary regular expressions without additional cost at matching time
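
The matching DFA can also be built directly from the pattern. The sketch below follows the transition-table formulation that is often presented alongside KMP (classic KMP stores a failure function rather than a full table); the function names are assumptions, and symbols outside the pattern's alphabet are assumed to reset the automaton to state 0.

```python
def build_dfa(pattern):
    """Build the matching DFA; state j means j pattern symbols are matched."""
    m = len(pattern)
    alphabet = sorted(set(pattern))
    dfa = [{c: 0 for c in alphabet} for _ in range(m + 1)]
    dfa[0][pattern[0]] = 1
    restart = 0  # state reached on the longest proper border seen so far
    for j in range(1, m + 1):
        for c in alphabet:
            dfa[j][c] = dfa[restart][c]  # on mismatch, behave like restart
        if j < m:
            dfa[j][pattern[j]] = j + 1  # on match, advance
            restart = dfa[restart][pattern[j]]
    return dfa


def dfa_search(text, pattern):
    """Return start positions of all (possibly overlapping) matches."""
    dfa, m = build_dfa(pattern), len(pattern)
    state, matches = 0, []
    for i, symbol in enumerate(text):
        state = dfa[state].get(symbol, 0)  # exactly one step per symbol
        if state == m:
            matches.append(i - m + 1)
    return matches


MATCHES = dfa_search("abbabbbabababbab", "abab")
```

For the pattern abab, states 0 to 4 of this table play the role of the subset states 0, 01, 02, 013 and 024 that determinizing the NFA would produce.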

SLIDE 10

Finite state lexicons

  • FSA are an efficient way to store lexicons
  • One can start from NFA for individual words, and determinize/minimize their union
  • Alternatively, there are algorithms for constructing finite-state lexicons incrementally

[Figure: lexicon FSA with states numbered up to 7 and arcs labeled b, c, d, a, a, o, t, w, g]
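
As a rough illustration of a lexicon stored as an automaton, the sketch below builds a trie, which is a deterministic but not minimized FSA that shares common prefixes. The nested-dictionary representation, the `END` marker and the word list are assumptions chosen for brevity; a minimized lexicon would also share common suffixes.

```python
END = None  # marker key: the path from the root to this node spells a word


def add_word(trie, word):
    """Add a word incrementally, reusing arcs for shared prefixes."""
    node = trie
    for symbol in word:
        node = node.setdefault(symbol, {})
    node[END] = True


def contains(trie, word):
    """Follow one arc per symbol; accept only in a word-final state."""
    node = trie
    for symbol in word:
        if symbol not in node:
            return False
        node = node[symbol]
    return END in node


LEXICON_TRIE = {}
for entry in ("cat", "cow", "dog"):
    add_word(LEXICON_TRIE, entry)
```

Lookup time depends only on the length of the word, not on the size of the lexicon.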

SLIDE 11

Morphology

some definitions

  • Morpheme: an abstract linguistic unit, often defined as the smallest meaningful or grammatical unit. Morphemes make up words
  • Root: a free morpheme of a word, often carrying the semantic information
  • Derivational morphemes change the meaning of a word, sometimes changing the POS
  • Inflectional morphemes change the syntactic properties of words
  • Lemma: the ‘citation’ form of a word, what you look up in a lexicon
  • Stem: the common string shared by all morphologically related forms of a (possibly derived) word

SLIDE 12

Morphological typology

Languages of the world behave differently with respect to how words are formed.

  • Isolating languages have little or no morphology; all words are simple (e.g., Vietnamese, Chinese)
  • Analytic languages have little or no inflectional morphology (e.g., English)
  • Synthetic languages have a rich morphological system
    – In agglutinative languages, each morpheme has a single function (e.g., Finnish, Turkish)
    – In inflecting/fusional languages, a single morpheme indicates multiple functions (e.g., Latin, Russian)
    – Polysynthetic languages may pack multiple ‘words’ into a single word (e.g., Ainu, Chukchi)

Note that these are tendencies.

SLIDE 13

Where do morphemes go

  • Affixation: attach → un-attach-ed
  • Infixes: aussteigen → auszusteigen
  • Circumfixation: spiel → gespielt
  • Root-pattern morphology (Arabic): ktb → kitāb ‘book’, ktb → kātib ‘writer’
  • Reduplication (some Austronesian languages): orang ‘person’ → orang-orang ‘people’

SLIDE 14

Interaction of morphology and phonology

or morphology and orthography

Morphology and phonology/orthography interact. A few examples:

  • dog-s, but fox-es
  • city → citi-es
  • stop → stopping
  • panic → panick-ed
  • goose → geese
  • Vowel harmony (Turkish)
    – ev ‘house’ → ev-ler ‘houses’
    – oda ‘room’ → oda-lar ‘rooms’
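
The Turkish example can be sketched as a rule that picks the plural suffix from the last vowel of the stem. This is a simplified illustration that assumes lowercase input and ignores exceptional stems; the function name and the extra test words are not from the lecture.

```python
FRONT_VOWELS = set("eiöü")
BACK_VOWELS = set("aıou")


def turkish_plural(stem):
    """Attach -ler after a front vowel and -lar after a back vowel."""
    for symbol in reversed(stem):
        if symbol in FRONT_VOWELS:
            return stem + "ler"
        if symbol in BACK_VOWELS:
            return stem + "lar"
    raise ValueError(f"no vowel found in {stem!r}")


PLURALS = [turkish_plural(stem) for stem in ("ev", "oda")]
```

In a finite-state analyzer the same alternation would typically be written as a rule over an abstract suffix vowel, so that it can be composed with the lexicon and run in both directions.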

SLIDE 15

Two-level morphology

  • We assume that there are two ‘levels’ of representation
    – A surface representation, which is what we hear or see
    – An underlying, abstract representation of the word

Surface:    cat s
Underlying: cat ⟨PL⟩

  • An FST is used to map the underlying representation to the surface representation (generation)
  • If we run the FST in the inverse direction, we get an analysis
  • Often the FST is a complex combination of many small FSA or FSTs

SLIDE 16

Two-level morphology

a typical architecture

  • Typically, the lexicon is converted to an FSA
  • It is concatenated (or composed) with morphological rules (affixation, applying templates, …)
  • The result is composed with phonological/orthographic alternations
  • The phonological/orthographic rules can be designed as cascades (composition), or can be applied in parallel

SLIDE 17

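Two-level morphology

a (simplified) example

[Figure: three machines]
  • L (lexicon): FSA with states numbered up to 7 and arcs labeled b, c, f, a, a, o, t, w, x
  • M (morphology): a single arc labeled ⟨PL⟩:⟨S⟩
  • P (orthography): states numbered up to 3 with arcs labeled x, not x, ϵ:e, and ⟨S⟩:s

Generator: LM ◦ P
Analyzer: (LM ◦ P)⁻¹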
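
The cascade can be mimicked with ordinary string functions. The sketch below is not a finite-state implementation: it assumes a three-word lexicon, writes the tags as `<PL>` and `<S>`, and obtains the analyzer by trying every underlying form, which only works because the lexicon is finite. All names are hypothetical.

```python
LEXICON = ("cat", "dog", "fox")
PL, S = "<PL>", "<S>"


def morphology(underlying):
    """M: rewrite the plural tag <PL> as the abstract suffix <S>."""
    return underlying.replace(PL, S)


def orthography(intermediate):
    """P: realize <S> as 'es' after x, and as 's' otherwise."""
    if not intermediate.endswith(S):
        return intermediate
    stem = intermediate[: -len(S)]
    return stem + ("es" if stem.endswith("x") else "s")


def generate(underlying):
    """Generator: check the root against L, then apply M and P in turn."""
    root = underlying[: -len(PL)] if underlying.endswith(PL) else underlying
    if root not in LEXICON:
        raise ValueError(f"unknown root: {root!r}")
    return orthography(morphology(underlying))


def analyze(surface):
    """Analyzer: invert the generator by enumerating underlying forms."""
    candidates = [root + tag for root in LEXICON for tag in ("", PL)]
    return [form for form in candidates if generate(form) == surface]
```

With real transducers the analyzer comes for free by inverting the composed machine, without enumerating the lexicon.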

SLIDE 18

How to specify morphological analyzers

  • Lexicons are easiest to specify as lists of (root) words
        cat dog fox …
  • For affixation, regular expressions (or regular rewrite rules)
        Nplu → N ⟨PL⟩:⟨S⟩
  • For phonological/orthographic alternations, context-sensitive rules
        ⟨S⟩ → es / x _
  • There are a few standard languages for specifying morphological analyzers
    – SFST
    – Xerox languages: XFST, Twolc, lexc
    – OpenFST, OpenGRM (more general purpose)

SLIDE 19

XFST

A quick reference: some common notation/operations

?          any symbol
0          empty string (ϵ)
(a)        optional a
[a|b]      grouping
a*         Kleene star
a+         Kleene plus
a b        concatenation
a&b        intersection
a|b        union
~b         complement
a-b        difference
{cat}      concatenation of c a t
a:b        FST rule with input ‘a’ and output ‘b’
a .o. b    compose a with b
a -> b     replace a with b unconditionally

SLIDE 20

XFST (cont.)

A quick reference: some common notation/operations

a (->) b            optionally replace a with b
a -> b || c _       replace a with b only after c
a -> b || c _ d     replace a with b only after c and before d

  • There are (at least) two free implementations of xfst
    – Foma
    – hfst-xfst (part of HFST)
  • You will receive a separate ‘tutorial’ (and an exercise) on working with xfst and lexc
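
For readers who know regular-expression substitution better than xfst, the contextual replace rules in the quick reference above have a rough analogue in Python's `re` module, using lookbehind for the left context and lookahead for the right context. This is only an analogy with hypothetical function names: xfst compiles a rule into a transducer that can be inverted and composed, while `re.sub` rewrites in one direction only.

```python
import re


def replace_after_c(string):
    """Rough analogue of: a -> b || c _"""
    return re.sub(r"(?<=c)a", "b", string)


def replace_between_c_and_d(string):
    """Rough analogue of: a -> b || c _ d"""
    return re.sub(r"(?<=c)a(?=d)", "b", string)


AFTER = replace_after_c("caca a")
BETWEEN = replace_between_c_and_d("cad cab")
```

The context symbols are only inspected, not consumed, so they survive the rewrite unchanged.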

SLIDE 21

Tools of the trade

Some of the practical, freely available tools (with an emphasis on ones targeted at CL) include:

  • Gertjan van Noord’s FSA tools
  • OpenFST: a general-purpose finite-state library
  • Helsinki Finite-State Technology (HFST): library and tools from the University of Helsinki
  • Foma: a re-implementation of Xerox’s xfst, a language/toolbox for defining/manipulating FSTs
  • SFST: another language/toolbox for defining/manipulating FSTs

SLIDE 22

Wrapping up

  • Finite-state tools are commonly used in a number of CL tasks
  • There are off-the-shelf free tools

Next:

  • Dependency grammars and dependency parsing
  • Constituency parsing

SLIDE 24

References / additional reading material

  • Jurafsky and Martin (2009, Ch. 3)
  • Roche and Schabes (1997) includes more examples of FSTs used for NLP
  • The Xerox languages and tools are described in Beesley and Karttunen (2003)
  • HFST and Foma web pages include some documentation and (links to) tutorials

SLIDE 25

References / additional reading material (cont.)

Beesley, Kenneth R. and Lauri Karttunen (2003). Finite-State Morphology: Xerox Tools and Techniques. CSLI, Stanford.

Hulden, Mans (2009). “Foma: a finite-state compiler and library”. In: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 29–32.

Jurafsky, Daniel and James H. Martin (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2nd ed. Pearson Prentice Hall. ISBN: 978-0-13-504196-3.

Lindén, Krister, Erik Axelson, Senka Drobac, Sam Hardwick, Juha Kuokkala, Jyrki Niemi, Tommi A. Pirinen, and Miikka Silfverberg (2013). “HFST — A System for Creating NLP Tools”. In: Systems and Frameworks for Computational Morphology. Ed. by Cerstin Mahlow and Michael Piotrowski. Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 53–71.

Roche, Emmanuel and Yves Schabes (1997). Finite-state Language Processing. A Bradford book. MIT Press. ISBN: 9780262181822.
