

slide-1
SLIDE 1

Morphological Analysis and Generation for Pali

David Alfter Jürgen Knauth 18 September 2015

slide-2
SLIDE 2

@daalft

slide-3
SLIDE 3

Pali

slide-4
SLIDE 4

Pali

(Dead) Indo-Aryan language
Fusional language
Rich morphology
Sandhi

slide-5
SLIDE 5

Source: https://commons.wikimedia.org/wiki/File:BoreanLanguageTree.png

slide-6
SLIDE 6

Fusional language

Morphological information added by affixation
No 1:1 correspondence

slide-7
SLIDE 7

DEVO

Base: DEV- god/deity
Ending: -O noun singular masculine nominative

slide-8
SLIDE 8

Compounding

naccagītavāditavisūkadassanamālāgandhavilepanadhāraṇamaṇḍanavibhūsanaṭṭhānā

slide-9
SLIDE 9

Compounding

naccagītavāditavisūkadassanamālāgandhavilepanadhāraṇamaṇḍanavibhūsanaṭṭhānā
dancing, singing, music, show-watching, garland, perfume, cosmetics, wearing, decoration, decoration

slide-10
SLIDE 10

Compounding

naccagītavāditavisūkadassanamālāgandhavilepanadhāraṇamaṇḍanavibhūsanaṭṭhānā
dancing, singing, music, going to see entertainments, wearing garlands, using perfumes, and beautifying the body with cosmetics

slide-11
SLIDE 11

7th precept

naccagītavāditavisūkadassanamālāgandhavilepanadhāraṇamaṇḍanavibhūsanaṭṭhānā veramaṇī sikkhāpadaṃ samādiyāmi
I adopt the precept of refraining from ...

slide-12
SLIDE 12

Sandhi

slide-13
SLIDE 13

External sandhi

evaṃ ca (and thus) → evañca

slide-14
SLIDE 14

Internal sandhi

paca + ti → pacati (he cooks)
paca + mi → pacāmi (I cook)
canda (moon) + udayo (rising) → candodayo (rising of the moon)

slide-16
SLIDE 16

The Problem

slide-17
SLIDE 17

Low-resource language

slide-18
SLIDE 18

Why don't we adapt resources from Sanskrit?

slide-19
SLIDE 19

Top Resources

Dictionaries
Morphological analyzers

slide-20
SLIDE 20

Credit: http://iflizwerequeen.com

slide-21
SLIDE 21

Lingua Franca

slide-22
SLIDE 22

Lingua Franca

Written in different scripts

slide-23
SLIDE 23

Lingua Franca

Written in different scripts
Introduces variation!

slide-24
SLIDE 24

Scripts

Sinhalese
Devanagari
Burmese
Transliterations
...

slide-25
SLIDE 25

Literature

slide-26
SLIDE 26

Literature

Scarce and not exhaustive

slide-27
SLIDE 27

No annotated corpus

slide-28
SLIDE 28

Generation

slide-29
SLIDE 29

Generation

and Overgeneration

slide-30
SLIDE 30

Irregular

Dictionary lookup
Rule-based generation:
Lemma => Stem
Stem + Ending => Form

Regular

Dictionary lookup

slide-31
SLIDE 31

Word class specific lemma ending
Lemma - Ending → Stem
Stem + Ending → Surface Form

slide-32
SLIDE 32

Stem + Ending → Form

[Diagram: one stem combined with multiple endings]

slide-33
SLIDE 33

Compiled Morphological Information

slide-34
SLIDE 34

<paradigms>
  <paradigm type="noun">
    <number type="singular">
      <declension type="a">
        <gender type="masculine">
          <case type="nominative">
            <ending>o</ending>
            <ending type="Drare">e</ending>
          </case>
          <case type="vocative">
            <ending>a</ending>
            <ending>ā</ending>
            <ending type="Drare">e</ending>
            <ending type="Drare">o</ending>
          </case>
          <case type="accusative">
            <ending>aṃ</ending>
          </case>

slide-38
SLIDE 38

Lemma => Stem
Stem + Ending => Form

deva => dev-
dev- + -o => devo

slide-40
SLIDE 40

<declension type="ant">
  <gender type="masculine">
    <case type="nominative">
      <ending>aṃ</ending>
      <ending>ā</ending>
      <ending type="Cm2">anto</ending>
      <ending type="Drare">o</ending>
      <ending>ato</ending>
    </case>

slide-41
SLIDE 41

karo + mi = karomi (I make)
paca + mi = pacāmi (I cook)

slide-42
SLIDE 42

bhavaṃ (sir)

stem: bhav-
ending: -anto
form: bhavanto bhanto

slide-43
SLIDE 43

Lemma

Derive stem
Select paradigm(s) based on word class
Combine stem and endings
Return generated forms and associated information
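
The four steps above can be sketched in Java for the deva → devo example. This is a minimal illustration, not the authors' implementation: the class and method names are hypothetical, the endings are copied from the masculine singular a-declension excerpt shown earlier, and stem derivation is assumed to be plain removal of the lemma-final "a".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: generate forms for an a-declension masculine noun.
final class NounGeneratorSketch {
    // Endings per case, taken from the paradigm excerpt (singular only).
    static final Map<String, List<String>> ENDINGS = new LinkedHashMap<>();
    static {
        ENDINGS.put("nominative", List.of("o", "e"));
        ENDINGS.put("vocative", List.of("a", "ā", "e", "o"));
        ENDINGS.put("accusative", List.of("aṃ"));
    }

    // Lemma => Stem: strip the word-class-specific lemma ending ("a" here).
    static String deriveStem(String lemma) {
        if (!lemma.endsWith("a")) {
            throw new IllegalArgumentException("not an a-declension lemma: " + lemma);
        }
        return lemma.substring(0, lemma.length() - 1);
    }

    // Stem + Ending => Form, returned as "form (case)" strings.
    static List<String> generate(String lemma) {
        String stem = deriveStem(lemma);
        List<String> forms = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : ENDINGS.entrySet()) {
            for (String ending : entry.getValue()) {
                forms.add(stem + ending + " (" + entry.getKey() + ")");
            }
        }
        return forms;
    }

    public static void main(String[] args) {
        // First line printed is "devo (nominative)".
        generate("deva").forEach(System.out::println);
    }
}
```

Because every listed ending is attached to the stem, rare variants are produced alongside common ones, which is one way overgeneration can arise.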

slide-44
SLIDE 44

Verbs

Of Roots and Bases

slide-45
SLIDE 45

Abstract Root

√kar

(to make)

slide-46
SLIDE 46

Base

√kar → karo (to make)
√pac → paca (to cook)
√yudh → yujjha (to fight)

slide-47
SLIDE 47

Seven conjugation classes

slide-48
SLIDE 48

1+ bases

√cur

core-, coraya- (to steal)

slide-49
SLIDE 49

1+ bases

√rudh

rundha-, rundhi-, rundhī-, rundhe-, rundho- (to obstruct)

slide-50
SLIDE 50

Verb forms based on Root or Base?

slide-51
SLIDE 51
slide-52
SLIDE 52

Irregular forms

Dictionary lookup

Full/Partial Irregularity

slide-53
SLIDE 53

Output

JSON/XML

slide-54
SLIDE 54

Key:Value pairs
Receiver can decide what information to use

slide-55
SLIDE 55

{"lemma":"eka","forms":{"numeral":[
  {"gender":"masculine", "number":"singular", "word":"eko", "case":"nominative"},
  {"gender":"masculine", "number":"singular", "word":"ekassa", "case":"genitive"},...

slide-56
SLIDE 56

Analysis

slide-57
SLIDE 57

Lookup

Dictionary/Table lookup

Heuristic approach

Identify paradigmatic ending → Morphological Analysis → Separation Stem-Ending

slide-58
SLIDE 58

buddhe

<gender type="masculine">
  <case type="nominative">
    <ending>o</ending>
    <ending type="Drare">e</ending>
  </case>
  <case type="vocative">
    <ending>a</ending>
    <ending>ā</ending>
    <ending type="Drare">e</ending>
    <ending type="Drare">o</ending>
  </case>
  <case type="accusative">
    <ending>aṃ</ending>
  </case>
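
One way to read the buddhe example is as a reverse lookup over the same paradigm data: find every ending the form could carry, then separate stem and ending. The Java sketch below assumes exactly that and covers only the three singular cases in the excerpt; the class name, the row layout, and the output format are hypothetical, and the "Drare" marker is passed through uninterpreted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: ending-based analysis over the paradigm excerpt.
final class EndingAnalyzerSketch {
    // One row per <ending> element: {ending, case, type attribute or ""}.
    static final String[][] PARADIGM = {
        {"o", "nominative", ""},
        {"e", "nominative", "Drare"},
        {"a", "vocative", ""},
        {"ā", "vocative", ""},
        {"e", "vocative", "Drare"},
        {"o", "vocative", "Drare"},
        {"aṃ", "accusative", ""},
    };

    // Returns one analysis per paradigm ending that matches the end of the form.
    static List<String> analyze(String form) {
        List<String> analyses = new ArrayList<>();
        for (String[] row : PARADIGM) {
            String ending = row[0];
            // Require a non-empty stem in front of the ending.
            if (form.length() > ending.length() && form.endsWith(ending)) {
                String stem = form.substring(0, form.length() - ending.length());
                String marker = row[2].isEmpty() ? "" : " [" + row[2] + "]";
                analyses.add(stem + "- + -" + ending
                        + ": masculine singular " + row[1] + marker);
            }
        }
        return analyses;
    }

    public static void main(String[] args) {
        analyze("buddhe").forEach(System.out::println);
    }
}
```

With this excerpt, buddhe yields two candidate analyses, both built on the stem buddh- and the ending -e. A full paradigm would return more candidates, which is where pruning becomes necessary.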

slide-60
SLIDE 60

Word Class Guesser

slide-61
SLIDE 61

Heuristic Approach

Lemma
Free Form

Identify possible endings
Weigh by length
Weigh by frequency
Prune results
Identify possible endings

slide-62
SLIDE 62

Word Class Guesser: Lemma

Code Excerpt

if (ends(lemma, "a", "ā", "i", "ī", "u", "ū", "ant", "vā", "mā", "at")) {
    guesses.add("adjective");
}
if (ends(lemma, "a", "i", "aṃ", "ma", "ya")) {
    guesses.add("numeral");
}
if (ends(lemma, "uṃ")) {
    guesses.add("indeclinable");
}
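
The excerpt relies on an `ends` helper and a `guesses` collection that are not shown. A self-contained version might look like the following sketch, in which the helper's signature, the list type, and the class name are assumptions. Only the three rules from the excerpt are included; the real guesser presumably has rules for other word classes as well.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: runnable wrapper around the three rules in the excerpt.
final class WordClassGuesserSketch {
    // Assumed helper: true if the lemma ends with any of the given suffixes.
    static boolean ends(String lemma, String... suffixes) {
        for (String suffix : suffixes) {
            if (lemma.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    // Collects every word class whose rule matches; several guesses are possible.
    static List<String> guess(String lemma) {
        List<String> guesses = new ArrayList<>();
        if (ends(lemma, "a", "ā", "i", "ī", "u", "ū", "ant", "vā", "mā", "at")) {
            guesses.add("adjective");
        }
        if (ends(lemma, "a", "i", "aṃ", "ma", "ya")) {
            guesses.add("numeral");
        }
        if (ends(lemma, "uṃ")) {
            guesses.add("indeclinable");
        }
        return guesses;
    }

    public static void main(String[] args) {
        System.out.println(guess("eka"));
    }
}
```

For the lemma eka from the output example, both the adjective and the numeral rule fire, so the guesser returns a candidate set rather than a single answer.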

slide-63
SLIDE 63

Results

Accuracy
Nouns-Adjectives  99.96%
Pronouns          88.57%
Numerals          76.62%
Verbs             63.37%

slide-64
SLIDE 64

Sandhi

slide-65
SLIDE 65

Compound Sandhi

slide-66
SLIDE 66

Intuition

Identify possible sandhi loci
Split into n words such that

$\forall n : w_n \in D$
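
The intuition can be sketched as a recursive search that accepts a split only when every part is a dictionary entry. The Java below is a simplified illustration under stated assumptions: D is taken to be the dictionary, every character position is treated as a possible split point, and no sandhi changes are undone at the boundaries. The class name and the toy dictionary are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: enumerate all splits whose parts are all in the dictionary.
final class CompoundSplitterSketch {
    static List<List<String>> split(String word, Set<String> dictionary) {
        List<List<String>> results = new ArrayList<>();
        if (word.isEmpty()) {
            // Nothing left to split: one valid (empty) continuation.
            results.add(new ArrayList<>());
            return results;
        }
        for (int i = 1; i <= word.length(); i++) {
            String head = word.substring(0, i);
            if (!dictionary.contains(head)) {
                continue;
            }
            // Keep the head only if the remainder can also be fully split.
            for (List<String> tail : split(word.substring(i), dictionary)) {
                List<String> analysis = new ArrayList<>();
                analysis.add(head);
                analysis.addAll(tail);
                results.add(analysis);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Set<String> toyDictionary = Set.of("nacca", "gīta", "vādita");
        // Prints [[nacca, gīta, vādita]]
        System.out.println(split("naccagītavādita", toyDictionary));
    }
}
```

The method returns every valid segmentation, so an ambiguous input produces several analyses, and a word that is itself a dictionary entry is returned as a single-part result.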

slide-67
SLIDE 67

Problems

Requires extensive Dictionary
More than one analysis possible
Not a compound

slide-68
SLIDE 68

External Sandhi

slide-69
SLIDE 69

Sandhi-inducing words

ca (and)
hi (because)
pi (also)

Corpus-based resolution

slide-70
SLIDE 70

Hand-written rules

Regular Expressions

slide-71
SLIDE 71

Replacement rules

Pattern      Replacement
\bpañca\b    X
ñca\b        ṃ ca
X            pañca
ñhi\b        ṃ hi
ñpi\b        ṃ pi
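
Applied in the order listed, these rules can be sketched with `java.util.regex`. The placeholder X follows the slide and is assumed never to occur in the input text; the class name and the use of `Pattern.UNICODE_CHARACTER_CLASS` (so that `\b` treats letters with diacritics as word characters) are choices made for this sketch, not details taken from the original system.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: ordered replacement rules for external sandhi.
final class ExternalSandhiSketch {
    private static final int FLAGS = Pattern.UNICODE_CHARACTER_CLASS;
    private static final Pattern PANCA = Pattern.compile("\\bpañca\\b", FLAGS);
    private static final Pattern NCA = Pattern.compile("ñca\\b", FLAGS);
    private static final Pattern NHI = Pattern.compile("ñhi\\b", FLAGS);
    private static final Pattern NPI = Pattern.compile("ñpi\\b", FLAGS);

    static String resolve(String text) {
        // 1. Shield the word pañca so the next rule cannot split it.
        String s = PANCA.matcher(text).replaceAll("X");
        // 2. Undo sandhi before ca.
        s = NCA.matcher(s).replaceAll("ṃ ca");
        // 3. Restore the shielded word.
        s = s.replace("X", "pañca");
        // 4. Undo sandhi before hi and pi.
        s = NHI.matcher(s).replaceAll("ṃ hi");
        s = NPI.matcher(s).replaceAll("ṃ pi");
        return s;
    }

    public static void main(String[] args) {
        // Prints "evaṃ ca pañca"
        System.out.println(resolve("evañca pañca"));
    }
}
```

The order matters: without the shielding step, the ñca rule would rewrite the word pañca itself.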

slide-73
SLIDE 73

Internal Sandhi

slide-75
SLIDE 75

Conclusion

slide-76
SLIDE 76

Paradigms for Generation and Analysis

slide-77
SLIDE 77

Dictionary Integration for additional information

slide-78
SLIDE 78

Rule-based and heuristic backup

slide-79
SLIDE 79

RegEx-based External Sandhi Resolution

slide-80
SLIDE 80

Lookup

slide-81
SLIDE 81

Server Architecture

slide-82
SLIDE 82

Well-documented REST API

Easy integration

slide-83
SLIDE 83

Data Processing

slide-84
SLIDE 84

Extract structured data from unstructured data

slide-85
SLIDE 85

[n. ag. fr. abhijjhita in med. function] one who covets M <smallcaps>i.</smallcaps> 287 (T. abhijjhātar, v. l. °itar) = A <smallcaps>v.</smallcaps> 265 (T. °itar, v. l. °ātar).

slide-87
SLIDE 87

Pacati,[Ved.pacati,Idg.*peqǔō,Av.pac-; Obulg.peka to fry,roast,Lith,kepū bake,Gr.pέssw cook,pέpwn ripe] to cook,boil,roast Vin.IV,264; fig.torment in purgatory (trs.and intrs.):Niraye pacitvā after roasting in N.S.II, 225,PvA.10,14.-- ppr.pacanto tormenting,Gen.pacato (+Caus.pācayato) D.I,52 (expld at DA.I,159,where read pacato for paccato,by pare daṇḍena pīḷentassa).-- pp. pakka (q.v.).‹-› Caus.pacāpeti & pāceti (q.v.).-- Pass.paccati to be roasted or tormented (q.v.).(Page 382)

slide-88
SLIDE 88

Manual annotation

slide-89
SLIDE 89

Open Problems

slide-90
SLIDE 90

Verbs

slide-91
SLIDE 91

Use verb form table

Attested forms only

slide-92
SLIDE 92

Internal Sandhi

slide-93
SLIDE 93

Illustrating Calculation

Splitting Internal Sandhi

slide-94
SLIDE 94

"When two vowels meet, one may be elided."

When two vowels meet:
elide first vowel
elide second vowel
no elision

slide-95
SLIDE 95

8 vowels, n-vowel word

$N = (1 + (2 \cdot 8))^n$

n = 1 → N = 17
n = 2 → N = 289
n = 3 → N = 4913
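
The figures can be reproduced with a few lines of Java. The reading assumed here is that each vowel in the surface word either hides no elision (1 option) or hides an elided first or second vowel, each of which could be any of the 8 vowels (2 × 8 options). The class and method names are hypothetical.

```java
// Hypothetical sketch: number of candidate analyses for an n-vowel word.
final class ElisionCountSketch {
    // N = (1 + 2 * vowelInventory)^vowelsInWord
    static long candidates(int vowelsInWord, int vowelInventory) {
        long perVowel = 1 + 2L * vowelInventory;
        long total = 1;
        for (int i = 0; i < vowelsInWord; i++) {
            // multiplyExact throws instead of silently overflowing.
            total = Math.multiplyExact(total, perVowel);
        }
        return total;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            // Prints 17, 289, 4913 for n = 1, 2, 3.
            System.out.println("n = " + n + " -> N = " + candidates(n, 8));
        }
    }
}
```

The count grows exponentially with word length, which is why exhaustively undoing this single rule is already impractical for long compounds.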

slide-96
SLIDE 96
slide-97
SLIDE 97

"A final dental is assimilated to the following consonant"

slide-98
SLIDE 98

"A final dental is assimilated to the following consonant"

(DENTAL) (CONSONANT) : duplicate($2)

slide-99
SLIDE 99

kk: t k
kk: th k
kk: d k
kk: dh k
kk: n k
kk: l k
kk: s k
...

224 possibilities
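
The expansion of the single merge rule into explicit split rules can be sketched as a nested loop. The seven dentals are those listed on the slide. The 32-consonant inventory below is an assumption chosen because 7 × 32 reproduces the 224 possibilities, and the doubling is done literally, as in duplicate($2). The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: expand (DENTAL)(CONSONANT) : duplicate($2) into split rules.
final class DentalAssimilationSketch {
    // Dentals as listed on the slide.
    static final List<String> DENTALS = List.of("t", "th", "d", "dh", "n", "l", "s");

    // Assumed consonant inventory (32 entries); not taken from the source.
    static final List<String> CONSONANTS = List.of(
        "k", "kh", "g", "gh", "ṅ",
        "c", "ch", "j", "jh", "ñ",
        "ṭ", "ṭh", "ḍ", "ḍh", "ṇ",
        "t", "th", "d", "dh", "n",
        "p", "ph", "b", "bh", "m",
        "y", "r", "l", "v", "s", "h", "ḷ");

    // Each rule reads "surface: dental consonant", e.g. "kk: t k".
    static List<String> splitRules() {
        List<String> rules = new ArrayList<>();
        for (String consonant : CONSONANTS) {
            for (String dental : DENTALS) {
                rules.add(consonant + consonant + ": " + dental + " " + consonant);
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        List<String> rules = splitRules();
        rules.subList(0, 7).forEach(System.out::println);
        System.out.println(rules.size() + " possibilities");
    }
}
```

One merge rule thus fans out into hundreds of split rules, which matches the asymmetry between the merge and split rule counts reported in the talk.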

slide-100
SLIDE 100

Sandhi merge rules: 151 rules

slide-101
SLIDE 101

Sandhi merge rules: 151 rules
Sandhi split rules: 1103 rules

slide-102
SLIDE 102
slide-103
SLIDE 103

Overall architecture

slide-104
SLIDE 104

Morphological analyzer and generator
Dictionary

slide-105
SLIDE 105

Morphological analyzer and generator
Dictionary
Server

slide-106
SLIDE 106

Morphological analyzer and generator
Dictionary
Server
Dictionary GUI
Data processor and scripting engine
Corpus management and processing tool

slide-107
SLIDE 107

Thank you for your attention!

slide-108
SLIDE 108

Thank you for your attention!

Questions?