Introduction to Natural Language Processing MORPHOLOGY - - PowerPoint PPT Presentation


Introduction to Natural Language Processing: MORPHOLOGY – TRANSDUCERS. Martin Rajman (Martin.Rajman@epfl.ch) and Jean-Cédric Chappelier (Jean-Cedric.Chappelier@epfl.ch), Artificial Intelligence Laboratory (LIA)


slide-1
SLIDE 1

Introduction to Natural Language Processing

MORPHOLOGY – TRANSDUCERS

Martin Rajman

Martin.Rajman@epfl.ch

and Jean-Cédric Chappelier

Jean-Cedric.Chappelier@epfl.ch

Artificial Intelligence Laboratory

LIA I&C Introduction to Natural Language Processing (CS-431)

  • M. Rajman

J.-C. Chappelier 1/24

slide-2
SLIDE 2

Objectives of this lecture

➥ Present morphology, an important part of NLP
➥ Introduce transducers, tools for computational morphology


slide-3
SLIDE 3

Contents

➥ Morphology
➥ Transducers
➥ Operations and Regular Expressions on Transducers


slide-4
SLIDE 4

Morphology

Study of the internal structure and the variability of the words in a language:

✏ verb conjugation
✏ plurals
✏ nominalization (enjoy → enjoyment)

➜ inflectional morphology: preserves the grammatical category

give, given, gave, gives, ...

➜ derivational morphology: changes the category

process, processing, processable, processor, processability


slide-5
SLIDE 5

Morphology (2)

Interest: use a priori knowledge about word structure to decompose a word into morphemes and produce additional syntactic and semantic information (on the current word)

processable → process- -able   ☞ 2 morphemes

                        process-    -able
meaning:                process     possible
role:                   root        suffix
semantic information:   main        less

The importance and complexity of morphology vary from language to language.

Some information represented at the morphological level in English may be represented differently in other languages (and vice versa). The paradigmatic/syntagmatic distribution changes from one language to another.

Example in Chinese: ate → expressed as “eat yesterday”


slide-6
SLIDE 6

Stems – Affixes

Words are decomposed into morphemes: roots (or stems) and affixes. There are several kinds of affixes:

➊ prefixes: in- -credible

➋ suffixes: incred- -ible

➌ infixes:
Example in Tagalog (Philippines): hingi (to borrow) → humingi (agent of the action)
In slang English: “fucking” in the middle of a word, e.g. Man-fucking-hattan

➍ circumfixes:
Example in German: sagen (to say) → gesagt (said)


slide-7
SLIDE 7

Stems – Affixes (2)

Several affixes may be combined: examples in Turkish, where you can have up to 10 (!) affixes.

uygarlaştıramadıklarımızdanmışsınızcasına

uygar      laş   tır    ama       dık     lar  ımız   dan   mış    sınız  casına
civilized  +BEC  +CAUS  +NEGABLE  +PPART  +PL  +P1PL  +ABL  +PAST  +2PL   +ASIF

“as if you are among those whom we could not cause to become civilized”

When only prefixes and suffixes are involved: concatenative morphology

Some languages are not concatenative:

  • infixes
  • pattern-based morphology


slide-8
SLIDE 8

Example of Semitic languages: pattern-based morphology

In Hebrew, the verb morphology is based on the association of

  • a root, often made of 3 consonants, which indicates the main meaning,
  • and a vocalic structure (insertion of vowels) that refines the meaning.

Example: LMD (learn or teach)
LAMAD → he was learning
LUMAD → he was taught
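
The root-and-pattern association can be sketched in a few lines of Python. This is only an illustration: the function name `interdigitate` and the template notation (a `C` marks a slot for the next root consonant, any other character is copied as is) are assumptions, not part of the lecture.

```python
def interdigitate(root, pattern):
    """Fill each 'C' slot of `pattern` with the next consonant of `root`."""
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

print(interdigitate("LMD", "CACAC"))  # LAMAD
print(interdigitate("LMD", "CUCAC"))  # LUMAD
```

The sketch assumes the pattern has exactly as many `C` slots as the root has consonants.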


slide-9
SLIDE 9

Computational Morphology

Let us consider inflectional morphology, for instance for verbs and nouns

Noun inflections: plural
General rule: +s, but several exceptions (e.g. foxes, mice)

Verb inflections: conjugations

  • tense, mood
  • regular/irregular

☞ How to handle inflections (computationally)?


slide-10
SLIDE 10

Computational Morphology

Example:
surface form: is
canonical representation at the lexicon level (formalization): be+3+s+Ind+Pres

The objective of computational morphology tools is precisely to go from one to the other:

  • Analysis: find the canonical representation corresponding to the surface form
  • Generation: produce the surface form described by the canonical representation

Challenge: have a “good” implementation of these two transformations
Tools: associations of strings → transducers


slide-11
SLIDE 11

String associations

$(X_1, X'_1), \ldots, (X_n, X'_n)$

(eaten, eat) (processed, process) ... (thought, think)

Easy situation: $\forall i,\ |X_i| = |X'_i|$

Example: (abc, ABC)

⇒ represented as a sequence of character transductions

(abc, ABC) = (a,A)(b,B)(c,C)

☞ strings on a new alphabet: strings of character couples

Not so easy: if $\exists i,\ |X_i| \neq |X'_i|$ ⇒ requires the introduction of the empty string ε

Example: (ab, ABC) ≃ (εab, ABC) = (ε,A)(a,B)(b,C)
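
A minimal Python sketch of this encoding as a string of character couples. The name `to_pairs` and the use of the empty Python string for ε are illustrative assumptions; padding at the front of the shorter string is only one possible convention.

```python
EPS = ""  # the empty Python string stands for ε (illustrative choice)

def to_pairs(upper, lower):
    """Encode (upper, lower) as a list of character couples,
    padding the shorter string with ε at the front."""
    n = max(len(upper), len(lower))
    up = [EPS] * (n - len(upper)) + list(upper)
    lo = [EPS] * (n - len(lower)) + list(lower)
    return list(zip(up, lo))

print(to_pairs("abc", "ABC"))  # [('a', 'A'), ('b', 'B'), ('c', 'C')]
print(to_pairs("ab", "ABC"))   # [('', 'A'), ('a', 'B'), ('b', 'C')]
```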


slide-12
SLIDE 12

Dealing with ε

Where to put the ε?

Example: (ab, ABC) ≃ (εab, ABC)
but also (ab, ABC) ≃ (aεb, ABC)
or (ab, ABC) ≃ (abε, ABC)

General case: $\binom{n}{m}$ possibilities (with $m < n$), where $m$ and $n$ are the lengths of the shorter and the longer string

Hard problem in general → need for a convention
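
The count can be checked by enumeration. The sketch below is an illustration only (the helper name `paddings` is an assumption): it chooses which of the n slots receive the m real characters and fills the rest with ε.

```python
from itertools import combinations
from math import comb

def paddings(short, n, eps="ε"):
    """Yield every way of padding `short` with ε up to length n,
    keeping the characters of `short` in their original order."""
    m = len(short)
    for positions in combinations(range(n), m):
        slots = [eps] * n
        for pos, ch in zip(positions, short):
            slots[pos] = ch
        yield "".join(slots)

results = list(paddings("ab", 3))
print(results)                     # ['abε', 'aεb', 'εab']
print(len(results) == comb(3, 2))  # True
```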


slide-13
SLIDE 13

Transducer (definition)

Let Σ1 and Σ2 be two enumerable sets (alphabets), and

$\Sigma = \bigl((\Sigma_1 \cup \{\varepsilon\}) \times (\Sigma_2 \cup \{\varepsilon\})\bigr) \setminus \{(\varepsilon, \varepsilon)\}$

A transducer is a DFSA (deterministic finite-state automaton) on Σ

Σ1: “left” language, upper language, input language

Σ2: “right” language, lower language, output language


slide-14
SLIDE 14

Example

[Figure: transducer diagram with states 0, 1 and 2, with the initial state and the final state(s) marked; arc labels: b, a:b, a, b:a, b:a, b:ε, a]

Some transductions:
(bb, b)         [0,0,2]
(ababb, baab)   [0,1,2,0,0,2]
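
Since a transducer is a DFSA over character couples, checking a given string of couples is an ordinary automaton run. The sketch below illustrates this in Python. The transition table is hypothetical: the diagram did not survive in this text, so the arcs, the initial state 0 and the final state 2 were reconstructed to be consistent with the transductions listed on the slides, and may differ from the original figure.

```python
EPS = ""  # the empty Python string stands for ε

# Hypothetical transition table (reconstructed, see above):
# (state, (upper, lower)) -> next state
DELTA = {
    (0, ("b", "b")): 0,
    (0, ("b", EPS)): 2,
    (0, ("a", "b")): 1,
    (1, ("b", "a")): 2,
    (1, ("a", "a")): 2,
    (2, ("a", "a")): 0,
    (2, ("b", "a")): 1,
}
INITIAL, FINALS = 0, {2}

def run(pairs):
    """Follow a string of (upper, lower) couples; return the state path
    if it ends in a final state, otherwise None."""
    state, path = INITIAL, [INITIAL]
    for pair in pairs:
        if (state, pair) not in DELTA:
            return None
        state = DELTA[(state, pair)]
        path.append(state)
    return path if state in FINALS else None

# (bb, b) encoded as (b,b)(b,ε)
print(run([("b", "b"), ("b", EPS)]))  # [0, 0, 2]
# (ababb, baab) encoded as (a,b)(b,a)(a,a)(b,b)(b,ε)
print(run([("a", "b"), ("b", "a"), ("a", "a"), ("b", "b"), ("b", EPS)]))  # [0, 1, 2, 0, 0, 2]
```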


slide-15
SLIDE 15

Different usages of a transducer

➊ Association checking: (abba, baaa) ∈ Σ∗ ?

➋ Generation: string1 → string2
bbab → ?

➌ Analysis: string2 → string1
? → ba

➊: easy (= FSA: nothing special)

What about ➋ and ➌?


slide-16
SLIDE 16

Transduction

Walk through the FSA following one or the other element of the couple (projections)

☞ not deterministic in general!

The fact that a transducer is a deterministic (couple-)FSA does not at all imply that the automaton resulting from one projection or the other is also deterministic!

non-deterministic evaluation, backtracking on “wrong” solutions
⇒ The projection is not constant time (in general)

When a transducer is deterministic with respect to one projection or the other, it is called a sequential transducer.

A transducer is not sequential in general. In particular, if one language or the other (upper or lower) is not finite, there is no guarantee that a sequential transducer can be produced.


slide-17
SLIDE 17

Transduction (2)

Example: bbab → ?

[Figure: search tree of the arcs explored when reading “bbab” on the upper side, with the states reached; arcs shown: b:b, b:ε, a:b, a:a, b:a]

Results:
bbab → bbba
bbab → ba
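
Generation can be sketched as a depth-first search with backtracking over the upper projection. The arc list below is hypothetical: it is a reconstruction chosen to agree with the transductions given on the slides (initial state 0, final state 2), not a copy of the original diagram. The sketch also assumes no arc has ε on its upper side, which holds for this table.

```python
EPS = ""  # the empty Python string stands for ε

# Hypothetical arcs: (source, upper, lower, target)
ARCS = [
    (0, "b", "b", 0), (0, "b", EPS, 2), (0, "a", "b", 1),
    (1, "b", "a", 2), (1, "a", "a", 2),
    (2, "a", "a", 0), (2, "b", "a", 1),
]
INITIAL, FINALS = 0, {2}

def generate(upper):
    """Return every lower string associated with `upper`,
    trying all arcs whose upper symbol matches (backtracking search)."""
    results = set()

    def explore(state, pos, out):
        if pos == len(upper):
            if state in FINALS:
                results.add(out)
            return
        for src, up, lo, dst in ARCS:
            if src == state and up == upper[pos]:
                explore(dst, pos + 1, out + lo)

    explore(INITIAL, 0, "")
    return results

print(sorted(generate("bbab")))  # ['ba', 'bbba']
```

In state 0 the symbol b matches two arcs (b:b and b:ε), which is exactly where the upper projection stops being deterministic.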


slide-18
SLIDE 18

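Transduction (3)

Example: ? → ba

[Figure: search tree of the arcs explored when reading “ba” on the lower side, with the states reached; two branches end in (FAIL); arcs shown: b:b, b:ε, a:a, a:b, b:a]

Results:
aa → ba
ab → ba
bbab → ba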
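
Analysis is the same search driven by the lower projection. Arcs with ε on the lower side consume no output character, so they can be taken at any point. As before, the arc list is a hypothetical reconstruction consistent with the results on the slides, and the `max_len` bound is an added safeguard (an assumption of this sketch) against endless ε-output loops in other transducers.

```python
EPS = ""  # the empty Python string stands for ε

# Hypothetical arcs: (source, upper, lower, target)
ARCS = [
    (0, "b", "b", 0), (0, "b", EPS, 2), (0, "a", "b", 1),
    (1, "b", "a", 2), (1, "a", "a", 2),
    (2, "a", "a", 0), (2, "b", "a", 1),
]
INITIAL, FINALS = 0, {2}

def analyze(lower, max_len=10):
    """Return every upper string (up to max_len symbols) whose
    transduction is `lower`."""
    results = set()

    def explore(state, pos, inp):
        if pos == len(lower) and state in FINALS:
            results.add(inp)
        if len(inp) >= max_len:
            return
        for src, up, lo, dst in ARCS:
            if src != state:
                continue
            if lo == EPS:
                explore(dst, pos, inp + up)      # nothing consumed on the lower side
            elif pos < len(lower) and lo == lower[pos]:
                explore(dst, pos + 1, inp + up)  # one lower symbol consumed

    explore(INITIAL, 0, "")
    return results

print(sorted(analyze("ba")))  # ['aa', 'ab', 'bbab']
```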


slide-19
SLIDE 19

Operations and Regular Expressions on Transducers

➮ All FSA regular expressions: concatenation, or, Kleene closure (*), ...

Example (concatenation): “a:b c:a” recognizes ac and produces ba

➮ Cross-product of regular languages: E1 ⊗ E2 recognizes L1 × L2

Example: a+ ⊗ b+ → $(a^n, b^m)$, ∀ n ≥ 1, m ≥ 1

!! this is ≠ (a ⊗ b)+, which only produces the pairs $(a^n, b^n)$

➮ Composition of transducers: T = T1 ◦ T2
(X1, X2) ∈ T ⟺ ∃Y : (X1, Y) ∈ T1 and (Y, X2) ∈ T2

➮ Reduction: extraction of the upper or the lower FSA
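
Cross-product and composition can be illustrated on finite relations represented as Python sets of string pairs. This is a simplification made for the sketch: real transducers represent possibly infinite relations, so a+ and b+ are truncated here to at most 3 repetitions, and the function names are assumptions.

```python
from itertools import product

def cross(l1, l2):
    """Cross-product of two finite languages: all (x, y) with x in l1, y in l2."""
    return set(product(l1, l2))

def compose(t1, t2):
    """(x1, x2) is in the result iff some y has (x1, y) in t1 and (y, x2) in t2."""
    return {(x1, x2) for (x1, y) in t1 for (y2, x2) in t2 if y == y2}

# a+ ⊗ b+ versus (a ⊗ b)+, truncated to 1..3 repetitions
a_plus = {"a" * n for n in range(1, 4)}
b_plus = {"b" * m for m in range(1, 4)}
cross_product = cross(a_plus, b_plus)
paired_plus = {("a" * n, "b" * n) for n in range(1, 4)}

print(("a", "bb") in cross_product)  # True
print(("a", "bb") in paired_plus)    # False

# Composition chains two relations through a shared middle string
print(compose({("ac", "ba")}, {("ba", "BA")}))  # {('ac', 'BA')}
```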


slide-20
SLIDE 20

(Other) examples of applications (morphology)

★ text-to-speech (grapheme to phoneme transduction)
★ specific lexicon representation (composition of some access and inverse functions)
★ filters (remove/add/modify marks; e.g. HTML)
★ text segmentation


slide-21
SLIDE 21

Computational morphology using transducers

Use of composition:

➛ Identification of a paradigm (T1)
➛ Implementation of this paradigm (T2)
➛ Exception handling (T3)

Example: input: chat+NP, fox+NP, ... (“+NP” means “noun plural”)

T1: ([a-z]+)(\+NP ⊗ \+1)
paradigm identification: plural nouns (trivial here: only one paradigm (+1))

T2: ([a-z]+)(\+1 ⊗ \+Xs)
plural inflection of nouns (regular part)

T3: ([a-z]+)(h\+Xs ⊗ hes | x\+Xs ⊗ xen | ... | [^hx...](\+X⊗ε)s)
correction of exceptions

T1 ◦ T2 ◦ T3: plural for nouns


slide-22
SLIDE 22

Computational morphology using transducers (2)

Detailed example on the plural of nouns:

general case: add a terminal “s”
cat+NP → cats, dog+NP → dogs, ...

Exceptions (several kinds):

  • fly → flies
  • fox → foxes, but ox → oxen!
  • ...

Method: find all the paradigms (linguists’ role) and implement a transducer for each of them

☞ add the paradigm identification in the lexical description


slide-23
SLIDE 23

Keypoints

➟ Inflectional and derivational morphologies, their roles
➟ Main functions of transducers: association checking, generation and analysis
➟ Deterministic and non-deterministic nature of transduction


slide-24
SLIDE 24

References

  • E. Roche, Y. Schabes, Finite-State Language Processing, pp. 14-63, 67-96, A Bradford Book, 1997.


slide-25
SLIDE 25

A quick reminder about noun plurals in English

Computational Linguistics Martin Rajman Artificial Intelligence Laboratory

slide-26
SLIDE 26

Fully regular plurals

  • default rule:

Add “s” to the end of the singular form

  • Examples:

(dog, dogs) (arrow, arrows) ...

slide-27
SLIDE 27

Semi-regular plurals

Some “regular” plurals need to be modified to be easy to pronounce (“euphonic rules”)

  • Euphonic rule 1: if the singular noun ends in “s”, “x”, “z”, “ch”, or “sh”, add “es” instead of “s”

(guess, guesses) (box, boxes) (buzz, buzzes) (catch, catches) (dish, dishes)

... but (systematic exception) if the final “ch” is pronounced “k”, add “s” instead of “es”

(stomach, stomachs)

as well as some fully irregular exceptions

(ox, oxen)

slide-28
SLIDE 28

Semi-regular plurals (2)

  • Euphonic rule 2: if the singular noun ends in a consonant followed by “y”, change the “y” to “ies”

(baby, babies) (fly, flies)

Note: there must be a consonant before the “y”...

(boy, boys) (buy, buys)
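
The default rule and the two euphonic rules can be sketched as a small Python function. This is an illustration under stated assumptions: the function name `regular_plural` is invented here, vowels are taken to be a, e, i, o, u, and the pronunciation-based exception (stomach, stomachs) and fully irregular forms (ox, oxen) are deliberately not handled.

```python
VOWELS = set("aeiou")  # assumption: every other letter counts as a consonant

def regular_plural(noun):
    """Default rule plus euphonic rules 1 and 2 (spelling only)."""
    # Euphonic rule 1: sibilant-like endings take "es"
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    # Euphonic rule 2: consonant + "y" becomes "ies"
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in VOWELS:
        return noun[:-1] + "ies"
    # Default rule: add "s"
    return noun + "s"

for noun in ("dog", "box", "catch", "baby", "fly", "boy"):
    print(noun, regular_plural(noun))
```

Because the function only looks at spelling, it returns “stomaches” for “stomach”; handling that case needs pronunciation or lexical information.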

slide-29
SLIDE 29

Irregular plurals

  • Mass nouns (aka uncountable nouns) have no plural form

(hair, ---) (mud, ---)

... but the regular plurals may also be acceptable in specific contexts:

“Her hair is black” ... but ... “I saw at least one grey hair, and there are probably more grey hairs there” → (hair, hairs)

“They throw mud at each other” ... but ... “These subterranean muds are being removed” → (mud, muds)

slide-30
SLIDE 30

Irregular plurals (2)

  • Invariant nouns (aka invariable nouns) do not change when inflected to the plural

“Deer have antlers”

Note that there is a (subtle) difference between a noun having no plural (i.e. being uncountable) and a noun having a plural form that is the same as the singular one:
→ uncountable: “Her hair is black” is correct, while “Her hair are black” is not
→ invariable: “This deer is fast” and “Deer are fast” are both correct (but do not mean the same thing)

slide-31
SLIDE 31

Other irregular plurals

  • Case 1: For most nouns ending in “f” or “fe”, change the ending “f” or “fe” to “ves”

(half, halves) (knife, knives)

... but

(belief, beliefs) (if, ifs) → “There are so many ifs and buts in this policy”

slide-32
SLIDE 32

Other irregular plurals (2)

  • Case 2: For most nouns ending in “is”, change the ending “is” to “es”

(crisis, crises) (hypothesis, hypotheses)

... but

(vis, vires) where “vis” is a Latin word meaning “power” that has been imported into English while preserving its Latin plural (“vires”) → “An example of vis is the influence of the leader”

slide-33
SLIDE 33

Other irregular plurals (3)

  • Case 3: For many nouns ending in “o”, change the ending “o” to “oes”

(tomato, tomatoes) (mosquito, mosquitoes) (volcano, volcanoes)

... but

(photo, photos) (video, videos) (piano, pianos)

slide-34
SLIDE 34

Fully irregular plurals

  • For some (often very frequent) words, the plural corresponds to a much more complicated modification

(man, men) (mouse, mice) (foot, feet) (tooth, teeth) ...

slide-35
SLIDE 35

Computational morphology for English nouns

Computational Linguistics Martin Rajman Artificial Intelligence Laboratory

slide-36
SLIDE 36

Fundamentals

  • Goal: use transducers to represent associations between strings representing:
      • surface forms, i.e. words as they appear in texts; and
      • canonical representations, i.e. formal representations of the morphological analysis of these words

  • Examples of surface forms:

cats, book, flies, ...

  • Examples of canonical representations:

cat+N+p, book+N+s, fly+N+p, ...

slide-37
SLIDE 37

Canonical representations

The typical format of a canonical representation is:

Lemma+GrammaticalCategory+MorphoSyntacticFeature1+MorphoSyntacticFeature2+...

where:

  • Lemma (or Root) is the canonical form of an inflected word, i.e. the form usually found in dictionaries, e.g. the singular form for nouns, or the infinitive for verbs;
  • GrammaticalCategory (or Part-of-Speech) is the tag used to represent the grammatical category of the word, e.g. N for a noun, Adj for an adjective, or V for a verb;
  • MorphoSyntacticFeaturek (k=1, 2, 3, ...) are the tags used to represent the morphosyntactic features (e.g. the number, the gender, the tense, the person, etc.) that are relevant to identify a specific inflection of a word; and
  • "+" is a (conventional) separating character.
slide-38
SLIDE 38

Examples of canonical representations

  • (cat+N+p, cats): associating the canonical representation "cat+N+p" to the surface form "cats" expresses in a formal way that "cats" is the inflection of the noun "cat" corresponding to its plural form ("p" being the tag for the value "plural" of the morphosyntactic feature "number");

  • (turn+V+Ind+Pres+3+s, turns): associating the canonical representation "turn+V+Ind+Pres+3+s" to the surface form "turns" expresses in a formal way that the surface form corresponding to the inflection of the verb ("V") "to turn" at the 3rd person ("3") singular ("s") of the present ("Pres") indicative ("Ind") is "turns".
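
A canonical representation in the Lemma+GrammaticalCategory+Features format can be split into its fields with a few lines of Python. The type and function names below (`Canonical`, `parse_canonical`) are assumptions made for this sketch, which also assumes the string contains at least a lemma and a category.

```python
from typing import NamedTuple, Tuple

class Canonical(NamedTuple):
    lemma: str
    category: str
    features: Tuple[str, ...]

def parse_canonical(text):
    """Split 'Lemma+Category+Feature1+...' on the '+' separator."""
    lemma, category, *features = text.split("+")
    return Canonical(lemma, category, tuple(features))

print(parse_canonical("cat+N+p"))
# Canonical(lemma='cat', category='N', features=('p',))
print(parse_canonical("turn+V+Ind+Pres+3+s"))
# Canonical(lemma='turn', category='V', features=('Ind', 'Pres', '3', 's'))
```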

slide-39
SLIDE 39

In other words...

Implementing computational morphology for English nouns means finding an efficient way of representing a potentially very large set of (canonical representation, surface form) associations, such as:

(cat+N+s, cat)       (cat+N+p, cats)
(book+N+s, book)     (book+N+p, books)
(fly+N+s, fly)       (fly+N+p, flies)
(fox+N+s, fox)       (fox+N+p, foxes)
(deer+N+s, deer)     (deer+N+p, deer)
(mouse+N+s, mouse)   (mouse+N+p, mice)
(ox+N+s, ox)         (ox+N+p, oxen)
...

By "efficient way", we mean a method that:

  • makes it possible to describe all the targeted associations without having to write them explicitly one by one;
  • provides a computational mechanism with a low algorithmic complexity able to produce the surface form(s) associated with a given canonical representation ("generation"), or the canonical representation(s) associated with a given surface form ("analysis")

slide-40
SLIDE 40

How to do this with transducers?

The idea is to use the composition T1 o T2 o T3 of 3 transducers:

  1. a transducer T1 that identifies the morphological paradigm, i.e. the systematic transformation rule(s) to be implemented for regular forms
  2. a transducer T2 that implements the identified systematic rule(s)
  3. a transducer T3 that handles all the exceptions to the implemented rules

slide-41
SLIDE 41

T1 : Identifying the morphological paradigm

  • In English, the morphology of regular noun plurals is very simple, as it corresponds to a single systematic rule
  • The morphological paradigm thus consists of only one rule, arbitrarily numbered here as rule 1
  • T1 is therefore the transducer that associates a canonical representation of the form “root+N+p”, where root is any possible nominal root, to the intermediate string “root+1”:

T1 = ([a-z]+)((\+N\+p)x(\+1))

where “x” represents the “cross-product” operator, “\” is a special character that prevents the character “+” from being interpreted as the Kleene plus operator, and “[a-z]” represents any lowercase alphabetic character

slide-42
SLIDE 42

T1 : Example

When applied to the list

( cat+N+p, book+N+p, fly+N+p, fox+N+p, deer+N+p, mouse+N+p, ox+N+p )

T1 represents the following list of associations:

(cat+N+p, cat+1)
(book+N+p, book+1)
(fly+N+p, fly+1)
(fox+N+p, fox+1)
(deer+N+p, deer+1)
(mouse+N+p, mouse+1)
(ox+N+p, ox+1)

slide-43
SLIDE 43

T2 : Implementing the morphological paradigm

  • The identified (single) systematic rule for English regular noun plurals is: add “s” to the end of the root (as, for nouns, the root corresponds to the singular form)
  • T2 is therefore the transducer that associates an intermediate string of the form “root+1” to a new intermediate string of the form “rootXs”, where the character X (called the “trace”) identifies the “border” between the root and the suffix “s”:

T2 = ([a-z]+)((\+1)x(Xs))

Note: placing a trace X in the new intermediate string will make it easier to handle the various exceptions to be implemented in T3

slide-44
SLIDE 44

T2 : Example

When applied to the list (resulting from T1)

( cat+1, book+1, fly+1, fox+1, deer+1, mouse+1, ox+1 )

T2 represents the following list of associations:

(cat+1, catXs)
(book+1, bookXs)
(fly+1, flyXs)
(fox+1, foxXs)
(deer+1, deerXs)
(mouse+1, mouseXs)
(ox+1, oxXs)

slide-45
SLIDE 45

T3 : Handling the exceptions

  • In this illustrative example, we will only consider 2 types of exceptions:
      1. Euphonic rule 1 (simplified): if the root ends in “x”, change the ending “x” to “xes”
      2. Euphonic rule 2 (simplified): if the root ends in “y”, change the ending “y” to “ies”
  • T3 is therefore the transducer that associates an intermediate string of the form “rootxXs” (resp. “rootyXs”) to a new string of the form “rootxes” (resp. “rooties”), where “rootx” (resp. “rooty”) is any root ending in “x” (resp. “y”):

T3 = ([a-z]+)(((xXs)x(xes))|((yXs)x(ies))|([^xy]((Xs)x(s))))

where “[^xy]” represents any character but “x” or “y”

slide-46
SLIDE 46

T3 : Example

When applied to the list (resulting from T2)

( catXs, bookXs, flyXs, foxXs, deerXs, mouseXs, oxXs )

T3 represents the following list of associations:

(catXs, cats)
(bookXs, books)
(flyXs, flies)
(foxXs, foxes)
(deerXs, deers)
(mouseXs, mouses)
(oxXs, oxes)

slide-47
SLIDE 47

T1 o T2 o T3 : Example

When applied to the original list

( cat+N+p, book+N+p, fly+N+p, fox+N+p, deer+N+p, mouse+N+p, ox+N+p )

T1 o T2 o T3 represents the following list of associations:

(cat+N+p, cats)
(book+N+p, books)
(fly+N+p, flies)
(fox+N+p, foxes)
(deer+N+p, deers)
(mouse+N+p, mouses)
(ox+N+p, oxes)

where the first 4 associations are correct, but the last 3 (deers, mouses, oxes; shown in red on the original slide) are erroneous and would require a more sophisticated definition of the transducer T3 responsible for handling the exceptions
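
The three-stage composition can be imitated in Python with ordinary regular expressions. This is a sketch only: each stage is written as a function rewriting one string (generation direction), which is an approximation of a real transducer, and the function names are assumptions. The patterns follow the definitions of T1, T2 and T3 given on the slides.

```python
import re

def t1(s):
    """root+N+p -> root+1 (paradigm identification)."""
    m = re.fullmatch(r"([a-z]+)\+N\+p", s)
    return m.group(1) + "+1" if m else None

def t2(s):
    """root+1 -> rootXs (regular rule, with the trace X)."""
    m = re.fullmatch(r"([a-z]+)\+1", s)
    return m.group(1) + "Xs" if m else None

# (pattern on the intermediate string, suffix replacing the matched ending)
T3_RULES = (
    (r"([a-z]+)xXs", "xes"),
    (r"([a-z]+)yXs", "ies"),
    (r"([a-z]+[^xy])Xs", "s"),
)

def t3(s):
    """rootXs -> surface form (simplified exception handling)."""
    for pattern, suffix in T3_RULES:
        m = re.fullmatch(pattern, s)
        if m:
            return m.group(1) + suffix
    return None

def plural(canonical):
    """Apply T1, then T2, then T3; None if any stage rejects its input."""
    out = canonical
    for stage in (t1, t2, t3):
        if out is None:
            return None
        out = stage(out)
    return out

for word in ("cat", "book", "fly", "fox", "deer", "mouse", "ox"):
    print(word + "+N+p", "->", plural(word + "+N+p"))
```

The sketch reproduces the associations listed above, including the three erroneous ones (deers, mouses, oxes), since it implements the same simplified T3.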