Finite-State Methods in Natural-Language Processing: 1 Motivation



slide-1
SLIDE 1

1 Motivation —

Finite-State Methods in Natural-Language Processing: 1—Motivation

Ronald M. Kaplan and Martin Kay

slide-2
SLIDE 2

2 Motivation —

Finite-State Methods in Language Processing

The application of a branch of mathematics
    - the regular branch of automata theory
to a branch of computational linguistics in which what is crucial is (or can be reduced to)
    - properties of string sets and string relations with
    - a notion of bounded dependency

slide-3
SLIDE 3

3 Motivation —

Applications

  • Finite languages
      - Dictionaries
      - Compression
  • Phenomena involving bounded dependency
      - Morphology: spelling, hyphenation, tokenization, morphological analysis
      - Phonology
  • Approximations to phenomena involving mostly bounded dependency
      - Syntax
  • Phenomena that can be translated into the realm of strings with bounded dependency
      - Syntax

slide-4
SLIDE 4

4 Motivation —

Correspondences

Computational device: Finite-State Automaton
Descriptive notation: Regular Expression
Set of objects: Regular Language

slide-5
SLIDE 5

5 Motivation —

The Basic Idea

  • At any given moment, an automaton is in one of a finite number of states.
  • A transition from one state to another is possible when the automaton contains a corresponding transition.
  • The process can stop only when the automaton is in one of a subset of the states, called final.
  • Transitions are labeled with symbols so that a sequence of transitions corresponds to a sequence of symbols.
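
The four points above can be sketched as a small Python function. The dictionary encoding of transitions and the example machine for the language a*b are illustrative assumptions, not taken from the slides.

```python
def accepts(transitions, start, finals, symbols):
    """Run a deterministic automaton over a sequence of symbols."""
    state = start
    for symbol in symbols:
        # A move is possible only if the automaton contains a matching transition.
        if (state, symbol) not in transitions:
            return False
        state = transitions[(state, symbol)]
    # The process may stop only in a final state.
    return state in finals


# Hypothetical two-state machine for the language a*b.
TRANSITIONS = {(0, "a"): 0, (0, "b"): 1}

print(accepts(TRANSITIONS, 0, {1}, "aaab"))  # True
print(accepts(TRANSITIONS, 0, {1}, "aaba"))  # False
```

The sequence of symbols read along the way is exactly the sequence of transition labels, which is what makes the machine a recognizer for a set of strings.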
slide-6
SLIDE 6

6 Motivation —

Bounded Dependency

The choice between γ1 and γ2 depends on a bounded number of preceding symbols.

[Figure: a string followed by a choice point (?) between the continuations γ1 and γ2]

slide-7
SLIDE 7

7 Motivation —

Bounded Dependency

The choice between γ1 and γ2 depends on a bounded number of preceding symbols.

[Figure: the same choice point (?) between γ1 and γ2, with a symbol si marked and part of the string labeled "irrelevant"]

slide-8
SLIDE 8

8 Motivation —

Closure Properties and Operations

  • By definition
      - Union
      - Concatenation
      - Iteration
  • By deduction
      - Intersection
      - Complementation
      - Substitution
      - Reversal
      - ...

slide-9
SLIDE 9

9 Motivation —

Operations on Languages and Automata

For the set-theoretic operations on languages there are corresponding operations on automata.
M(L) is a machine that characterizes the language L.
We will use the same symbols for corresponding operations.

M(L1 ⊗ L2) = M(L1) ⊕ M(L2)
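
One concrete instance of this correspondence is intersection, where the machine-level operation is the standard product construction. The sketch below assumes complete deterministic machines stored as tuples; the two example machines are hypothetical.

```python
from itertools import product


def intersect(m1, m2):
    """Product construction: a machine for L1 ∩ L2 from machines for L1 and L2.

    Each machine is (alphabet, states, delta, start, finals), where delta is a
    total transition function stored as delta[(state, symbol)].
    """
    alphabet, states1, delta1, start1, finals1 = m1
    _, states2, delta2, start2, finals2 = m2
    states = set(product(states1, states2))
    delta = {((p, q), a): (delta1[(p, a)], delta2[(q, a)])
             for (p, q) in states for a in alphabet}
    finals = {(p, q) for (p, q) in states if p in finals1 and q in finals2}
    return alphabet, states, delta, (start1, start2), finals


def accepts(machine, string):
    _, _, delta, state, finals = machine
    for symbol in string:
        state = delta[(state, symbol)]
    return state in finals


SIGMA = {"a", "b"}
# Hypothetical example machines: an even number of a's, and strings ending in b.
EVEN_A = (SIGMA, {0, 1},
          {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}, 0, {0})
ENDS_B = (SIGMA, {0, 1},
          {(0, "a"): 0, (1, "a"): 0, (0, "b"): 1, (1, "b"): 1}, 0, {1})
BOTH = intersect(EVEN_A, ENDS_B)

print(accepts(BOTH, "aab"))  # True
print(accepts(BOTH, "ab"))   # False
```

The product machine runs both components in lockstep, so it accepts exactly the strings that both accept.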

slide-10
SLIDE 10

10 Motivation —

Automata-based Calculus

  • Closure gives:
      - Complementation → Universal quantification
      - Intersection → Combinations of constraints
  • Machines give:
      - Finite representations for (potentially) infinite sets
      - Practical implementation
  • Combination gives:
      - Coherence
      - Robustness
      - Reasonable machine transformations

slide-11
SLIDE 11

11 Motivation —

Quantification

(¬ denotes complementation.)

There is an x followed by a y in the string: Σ*xyΣ*
There is no xy sequence in the string: ¬(Σ*xyΣ*)
There is a y preceded by something that is not an x: ¬(Σ*x) y Σ*
Every y is preceded by an x: ¬(¬(Σ*x) y Σ*)

∃y.∃x.precedes(x, y)
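
The step from an existential statement to a universal one by complementation can be checked by brute force. In the sketch below, ¬ stands for complementation, the three-letter alphabet and the function names are assumptions, and a negative lookbehind in Python's `re` module plays the role of the prefix language ¬(Σ*x).

```python
import re
from itertools import product

SIGMA = "xyz"  # assumed alphabet; z stands for "any other symbol"


def strings(max_len):
    """Enumerate every string over SIGMA up to the given length."""
    for n in range(max_len + 1):
        for letters in product(SIGMA, repeat=n):
            yield "".join(letters)


def some_y_not_after_x(s):
    # Membership in ¬(Σ*x) y Σ*: some y whose left context does not end in x.
    return re.search(r"(?<!x)y", s) is not None


def every_y_after_x(s):
    # Direct reading of the universally quantified statement.
    return all(i > 0 and s[i - 1] == "x" for i, ch in enumerate(s) if ch == "y")


# Complementing the existential language yields the universal one.
assert all(every_y_after_x(s) == (not some_y_not_after_x(s)) for s in strings(6))
```

The exhaustive check over short strings is only a sanity test, not a proof, but it makes the role of complementation concrete.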

slide-12
SLIDE 12

12 Motivation —

Universal Quantification— i before e except after c

[Figure: automaton with arcs labeled c, e, "e, i, other", and "i, other"]

¬(¬(Σ*c) ei Σ*)   (¬ denotes complementation)

slide-13
SLIDE 13

13 Motivation —

Universal Quantification— i before e except after c

[Figure: the same automaton, with its states annotated]

  • After e: no i
  • After c: anything
  • Not after c or e: anything but ei
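
The three state annotations (after e, after c, not after c or e) translate directly into a transition table. The following is a minimal sketch; the state names are hypothetical, every state is treated as final, and a missing table entry means rejection.

```python
def symbol_class(ch):
    # Everything except c, e and i is lumped together as "other".
    return ch if ch in "cei" else "other"


# Hypothetical state names; a missing entry means the string is rejected.
DELTA = {
    ("start", "c"): "after_c", ("start", "e"): "after_e",
    ("start", "i"): "start", ("start", "other"): "start",
    # After c: anything. An e here is "safe", so a following i is allowed.
    ("after_c", "c"): "after_c", ("after_c", "e"): "start",
    ("after_c", "i"): "start", ("after_c", "other"): "start",
    # After e: no i, hence no ("after_e", "i") entry.
    ("after_e", "c"): "after_c", ("after_e", "e"): "after_e",
    ("after_e", "other"): "start",
}


def obeys_rule(word):
    """True if every "ei" in the word is immediately preceded by c."""
    state = "start"
    for ch in word:
        state = DELTA.get((state, symbol_class(ch)))
        if state is None:
            return False
    return True


print(obeys_rule("receive"))  # True
print(obeys_rule("weird"))    # False
```

Note that this machine encodes only the constraint that ei must follow c; it does not yet rule out ie after c.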

slide-14
SLIDE 14

14 Motivation —

Only e i after c

¬(¬(Σ*c) ei Σ*) ∩ ¬(Σ*cieΣ*)   (¬ denotes complementation)

[Figure: automaton for the intersection, with arcs labeled c, e, i, "i, other", and other]

slide-15
SLIDE 15

15 Motivation —

Only e i after c

[Figure: the same automaton, with its states annotated]

  • After e: no i
  • After c: not ie
  • Not after c or e: anything but ei
  • After ci: no e
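
The combined constraint is an intersection of two complement languages, and that can be mimicked with two membership tests. This sketch uses Python's `re` module and substring search in place of explicit automata; the function names are assumptions, and ¬ stands for complementation.

```python
import re


def every_ei_after_c(word):
    # ¬(¬(Σ*c) ei Σ*): no "ei" whose left context fails to end in c.
    return re.search(r"(?<!c)ei", word) is None


def no_cie(word):
    # ¬(Σ* cie Σ*): the sequence "cie" never occurs.
    return "cie" not in word


def only_ei_after_c(word):
    # Intersection of the two constraint languages.
    return every_ei_after_c(word) and no_cie(word)


for w in ["receive", "recieve", "believe", "weird"]:
    print(w, only_ei_after_c(w))
```

Each constraint stays independently readable, and adding a further constraint is one more conjunct, which is the "combinations of constraints" benefit of closure under intersection.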

slide-16
SLIDE 16

16 Motivation —

Alternative Notations

Closure ⇒ Recursive Formalisms ⇒ Higher-level Constructs

Choose notation for theoretical significance and practical convenience.

L1 ← L2 ≡ ¬(¬(L1) L2)
¬(¬(Σ*c) ei Σ*) ≡ Σ*c ← eiΣ*

(¬ denotes complementation.)

slide-17
SLIDE 17

17 Motivation —

What is a Finite-State Automaton?

  • An alphabet of symbols,
  • A finite set of states,
  • A transition function from states and symbols to states,
  • A distinguished member of the set of states called the start state, and
  • A distinguished subset of the set of states called final states.

Pace terminology, this is the same definition as for directed graphs with labeled edges, plus initial and final states.

slide-18
SLIDE 18

18 Motivation —

i to x

[Figure: automaton for the Roman numerals i to x, with numbered states and arcs labeled i, v, x, and "v, x"]

Unless otherwise marked, the start state is usually the leftmost in the diagram.
We draw final states with a double circle.
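
A machine for the Roman numerals i to x can be written out with the five components of a finite-state automaton: alphabet, states, transition function, start state, and final states. The state names and the exact layout below are assumptions made for illustration, since the original diagram is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Automaton:
    alphabet: frozenset
    states: frozenset
    delta: dict        # partial transition function: (state, symbol) -> state
    start: str
    finals: frozenset

    def accepts(self, string):
        state = self.start
        for symbol in string:
            state = self.delta.get((state, symbol))
            if state is None:
                return False
        return state in self.finals


# Hypothetical state names: each records what has been read so far.
ROMAN = Automaton(
    alphabet=frozenset("ivx"),
    states=frozenset({"S", "I", "II", "V", "VI", "END"}),
    delta={
        ("S", "i"): "I", ("S", "v"): "V", ("S", "x"): "END",
        ("I", "i"): "II", ("I", "v"): "END", ("I", "x"): "END",
        ("II", "i"): "END",
        ("V", "i"): "VI", ("VI", "i"): "II",
    },
    start="S",
    finals=frozenset({"I", "II", "V", "VI", "END"}),
)

NUMERALS = ["i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x"]
print(all(ROMAN.accepts(n) for n in NUMERALS))  # True
```

States reached by different prefixes are shared whenever the same continuations remain possible; for example, "ii" and "vii" both lead to a state from which exactly one more i is allowed.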

slide-19
SLIDE 19

19 Motivation —

Regular Languages

  • Languages — sets of strings
  • Regular languages — a subset of languages
  • Closed under concatenation, union, and iteration
  • Every regular language is characterized by (at least) one finite-state automaton
  • Languages may contain infinitely many strings but automata are finite

slide-20
SLIDE 20

20 Motivation —

Regular Expressions

  • Formulae with operators that denote
      - union
      - concatenation
      - iteration

a* [b | c]
Any number of a’s followed by either b or c.
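
For comparison, the same expression can be written in the syntax of Python's `re` module, where grouping uses parentheses rather than square brackets. `fullmatch` is used so that the whole string must belong to the language.

```python
import re

# a* [b | c] in Python's regular-expression syntax.
PATTERN = re.compile(r"a*(b|c)")

for s in ["b", "aac", "aaab", "a", "abc"]:
    print(s, PATTERN.fullmatch(s) is not None)
```

Iteration is `*`, union is `|`, and concatenation is juxtaposition, just as in the notation above.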

slide-21
SLIDE 21

21 Motivation —

Some Motivations

  • Word Recognition
  • Dictionary Lookup
  • Spelling Conventions
slide-22
SLIDE 22

22 Motivation —

Word Recognition

A word recognizer takes a string of characters as input and returns “yes” or “no” according to whether the word is in a given set. It solves the membership problem.

e.g. spell checking, Scrabble

slide-23
SLIDE 23

23 Motivation —

Approximate Methods

  • Has the right set of letters (in any order).
  • Has the right sounds (Soundex).
  • Random (superimposed) coding (Unix Spell).

[Figure: a word is passed through hash functions hash1, hash2, ..., hashk, each addressing a bit in a bit table]
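
The bit-table idea can be sketched as follows: each stored word sets k bits, and a lookup answers "yes" only if all k bits are set. This is a generic superimposed-coding sketch, not a reconstruction of Unix Spell; the table size, the value of k, and the use of salted SHA-256 digests as hash functions are all assumptions.

```python
import hashlib


class BitTable:
    """Approximate word set: k hash functions set k bits per stored word."""

    def __init__(self, size=1 << 16, k=4):
        self.size, self.k = size, k
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, word):
        # k hash values derived by salting one digest (an arbitrary choice).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))


table = BitTable()
for w in ["drip", "drips", "drop", "drops"]:
    table.add(w)
print("drop" in table)  # True
```

A stored word is always found, but an unstored word can occasionally be reported as present when its k bits happen to be set by other words, which is why the method counts as approximate.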

slide-24
SLIDE 24

24 Motivation —

Exact Methods

  • Hashing
  • Search (linear, binary ...)
  • Digital search (“Tries”)

[Figure: trie over a small word list, including aardvark]

Folds together common prefixes
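
A trie can be sketched with nested dictionaries. The word list and the use of "$" as an end-of-word marker are assumptions for illustration.

```python
def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker (assumes "$" is not a letter)
    return root


def in_trie(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node


def count_nodes(node):
    return 1 + sum(count_nodes(child)
                   for key, child in node.items() if key != "$")


# Hypothetical word list.
WORDS = ["aardvark", "back", "backs", "backed", "backing", "buzz"]
TRIE = build_trie(WORDS)
print(in_trie(TRIE, "backs"), in_trie(TRIE, "bac"))  # True False
print(count_nodes(TRIE))  # 22
```

The six words contain 34 letters in total, but the trie needs only 21 nodes below the root because shared prefixes such as "back" are stored once.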

slide-25
SLIDE 25

25 Motivation —

Exact Methods (continued)

  • Finite-state automata

Folds together common prefixes and suffixes
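
One way to see the extra folding is to start from a trie and merge every pair of nodes that accept the same set of suffixes. The sketch below does this bottom-up with a signature table; it is one simple construction under assumed names, not necessarily the one used by the authors.

```python
def build_trie(words):
    root = {"final": False, "next": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["next"].setdefault(ch, {"final": False, "next": {}})
        node["final"] = True
    return root


def count_trie_states(node):
    return 1 + sum(count_trie_states(child) for child in node["next"].values())


def minimize(node, registry):
    """Return a canonical state id; nodes with the same suffix set share an id."""
    signature = (node["final"],
                 tuple(sorted((ch, minimize(child, registry))
                              for ch, child in node["next"].items())))
    if signature not in registry:
        registry[signature] = len(registry)
    return registry[signature]


WORDS = ["drip", "drips", "drop", "drops"]
trie = build_trie(WORDS)
registry = {}
minimize(trie, registry)
print(count_trie_states(trie), len(registry))  # 9 6
```

The trie needs 9 states for these four words, while the automaton needs 6, because the endings "p" and "ps" are shared between the "dri" and "dro" branches.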

slide-26
SLIDE 26

26 Motivation —

Enumeration vs. Description

  • Enumeration
      - Representation includes an item for each object. Size = f(Items)
  • Description
      - Representation provides a characterization of the set of all items. Size = g(Common properties, Exceptions)
      - Adding an item can decrease size.

slide-27
SLIDE 27

27 Motivation —

Classification

               Exact            Approximate
Enumeration    Hash table       Soundex
               Binary search
Description    Trie             Unix Spell
               FSM              Right letter

slide-28
SLIDE 28

28 Motivation —

FSM Extends to Infinite Sets

Productive compounding Kindergartensgeselschaft

slide-29
SLIDE 29

29 Motivation —

Statistics

                    English    Portuguese
Vocabulary
    Words            81,142       206,786
    KBytes              858         2,389
    PKPAK               313           683
    PKZIP               253           602
FSM
    States           29,317        17,267
    Transitions      67,709        45,838
    KBytes              203           124

From Lucchesi and Kowaltowski (1993)

slide-30
SLIDE 30

30 Motivation —

Dictionary Lookup

Dictionary lookup takes a string of characters as input, returns “yes” or “no” according to whether the word is in a given set, and, if it is, returns information about the word.

slide-31
SLIDE 31

31 Motivation —

Lookup Methods

Approximate: guess the information
    - If it ends in “ed”, it’s a past-tense verb.

Exact: store the information for finitely many words
  • Table lookup
      - Hash
      - Search
      - Trie: store at word endings.
  • FSM
      - Store at final states? Then there is no suffix collapse, and the FSM reverts to a trie.

slide-32
SLIDE 32

32 Motivation —

Word Identifiers

Associate a unique, useful, identifier with each of n words, e.g. an integer from 1 to n. This can be used to index a vector of dictionary information.

[Figure: word → i, where i indexes a vector of n information entries]

slide-33
SLIDE 33

33 Motivation —

Pre-order Walk

A pre-order walk of an n-word FSM, counting final states, assigns such integers, even if suffixes are collapsed. Lookup by walking is a linear search, however.

drip → 1
drips → 2
drop → 3
drops → 4
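
The numbering can be reproduced with a short recursive walk. The automaton below is a hand-built machine for the four example words with collapsed suffixes; the state numbers and the alphabetical ordering of arcs are assumptions.

```python
# Hand-built automaton for drip, drips, drop, drops with shared suffixes.
ARCS = {0: {"d": 1}, 1: {"r": 2}, 2: {"i": 3, "o": 3},
        3: {"p": 4}, 4: {"s": 5}, 5: {}}
FINALS = {4, 5}


def walk(state=0, prefix=""):
    """Pre-order walk: yield each word when its final state is reached."""
    if state in FINALS:
        yield prefix
    for symbol in sorted(ARCS[state]):
        yield from walk(ARCS[state][symbol], prefix + symbol)


def word_number(word):
    # Linear search: count final-state visits until the word turns up.
    for number, candidate in enumerate(walk(), start=1):
        if candidate == word:
            return number
    return None


print({w: word_number(w) for w in ["drip", "drips", "drop", "drops"]})
```

State 3 is visited twice during the walk, once via i and once via o, which is why the numbering works even though the two branches share their suffix states.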

slide-34
SLIDE 34

34 Motivation —

Suffix Counts

  • Store with each state the size of its suffix set.
  • Skip irrelevant transitions, incrementing the count by the destination suffix sizes.

drip → 1
drips → 2
drop → 3
drops → 4

[Figure: the automaton for these words, with states annotated by suffix-set sizes 4, 4, 4, 2, 1]
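
With suffix counts, a word's number can be computed in a single pass along its own path instead of a full walk. The sketch below reuses a hand-built automaton for the four example words; the state numbers, the alphabetical arc order, and the function names are assumptions.

```python
from functools import lru_cache

# Hand-built automaton for drip, drips, drop, drops with shared suffixes.
ARCS = {0: {"d": 1}, 1: {"r": 2}, 2: {"i": 3, "o": 3},
        3: {"p": 4}, 4: {"s": 5}, 5: {}}
FINALS = {4, 5}


@lru_cache(maxsize=None)
def suffix_count(state):
    """Number of words that can be completed from this state."""
    return (state in FINALS) + sum(suffix_count(t) for t in ARCS[state].values())


def word_number(word):
    state, number = 0, 0
    for symbol in word:
        if symbol not in ARCS[state]:
            return None
        if state in FINALS:
            number += 1  # a shorter word ends at this state
        for other in sorted(ARCS[state]):
            if other < symbol:
                # Skip the whole branch, counting the words it contains.
                number += suffix_count(ARCS[state][other])
        state = ARCS[state][symbol]
    return number + 1 if state in FINALS else None


print({w: word_number(w) for w in ["drip", "drips", "drop", "drops"]})
```

For "drop", the i branch at state 2 is skipped and contributes its suffix count of 2, giving the number 3 without visiting "drip" or "drips".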

slide-35
SLIDE 35

35 Motivation —

  • Minimal Perfect Hash (Lucchesi and Kowaltowski)
  • Word-number mapping (Kaplan and Kay, 1985)
slide-36
SLIDE 36

36 Motivation —

Spelling Conventions

iN+tractable → intractable
iN+practical → impractical

iN is the common negative prefix:
    - im before a labial
    - in otherwise

cf. input → input

slide-37
SLIDE 37

37 Motivation —

An in/im Transducer

[Figure: transducer diagram with two annotated states]

    - No exit from this state except over a labial.
    - No labials from this state.
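
The two annotated states can be simulated with a small nondeterministic transducer that works in both directions. This is a sketch under assumptions: state 1 is taken to be the state after N:m, state 2 the state after N:n, the labials are taken to be p, b and m, the morpheme boundary "+" is omitted, and the function names are hypothetical.

```python
LABIALS = set("pbm")  # assumed labial inventory for this sketch


def step(state, lex, surf):
    """Next state after reading the pair lex:surf, or None if it is blocked."""
    if state == 1 and surf not in LABIALS:
        return None  # no exit from state 1 except over a labial
    if state == 2 and surf in LABIALS:
        return None  # no labials from state 2
    if lex == "N":
        return {"m": 1, "n": 2}.get(surf)  # N:m or N:n
    return 0 if lex == surf else None      # identity pair


def generate(lexical):
    """Map a lexical string such as 'iNpractical' to its surface forms."""
    paths = [(0, "")]
    for lex in lexical:
        candidates = "mn" if lex == "N" else lex
        paths = [(nxt, out + surf)
                 for state, out in paths
                 for surf in candidates
                 for nxt in [step(state, lex, surf)] if nxt is not None]
    return sorted({out for state, out in paths if state != 1})


def recognize(surface):
    """Map a surface string to every lexical string that could underlie it."""
    paths = [(0, "")]
    for surf in surface:
        candidates = [surf] + (["N"] if surf in "mn" else [])
        paths = [(nxt, out + lex)
                 for state, out in paths
                 for lex in candidates
                 for nxt in [step(state, lex, surf)] if nxt is not None]
    return sorted({out for state, out in paths if state != 1})


print(generate("iNtractable"))   # ['intractable']
print(generate("iNpractical"))   # ['impractical']
print(recognize("intractable"))  # ['iNtractable', 'intractable']
```

Generation is deterministic in outcome because one of the two N branches always dies at the next symbol, while recognition is ambiguous: a surface n before a non-labial may be a plain n or a realization of N.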

slide-38
SLIDE 38

38 Motivation —

Generation — “intractable”

[Figure: trace through the in/im transducer with arcs i:i, N:m, N:n, t:t, r:r, a:a and states 1 and 2]

slide-39
SLIDE 39

39 Motivation —

Generation — “impractical”

[Figure: trace through the in/im transducer with arcs i:i, N:m, N:n, p:p, r:r, a:a and states 1 and 2]

slide-40
SLIDE 40

40 Motivation —

Recognition — “intractable”

[Figure: recognition trace with arcs i:i, then n:n or N:n (state 2), each followed by t:t, r:r, a:a]

slide-41
SLIDE 41

41 Motivation —

Generation — “input”

[Figure: trace through the in/im transducer with arcs i:i, N:m, N:n, p:p, u:u, t:t and states 1 and 2]

slide-42
SLIDE 42

42 Motivation —

A Word Transducer

[Figure: diagram relating Base Forms, Morphology, Spelling Rules, and Text Forms; the components are labeled as finite-state transducers and finite-state machines]

slide-43
SLIDE 43

43 Motivation —

Bibliography

Kaplan, Ronald M. and Martin Kay. “Regular models of phonological rule systems.” Computational Linguistics, 20:3, September 1994.

Hopcroft, John E. and Jeffrey D. Ullman. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, 1979. Chapters 2 and 3.

Partee, Barbara H., Alice ter Meulen and Robert E. Wall. Mathematical Methods in Linguistics. Kluwer, 1990. Chapters 16 and 17.

Lucchesi, Cláudio L. and Tomasz Kowaltowski. “Applications of Finite Automata Representing Large Vocabularies.” Software: Practice and Experience, 23:1, January 1993.