Introduction The system Results

Robust Multilingual Statistical Morphology Generation Models

Ondřej Dušek and Filip Jurčíček

Institute of Formal and Applied Linguistics Charles University in Prague

August 6, 2013

Ondřej Dušek and Filip Jurčíček Robust Multilingual Statistical Morphology Generation Models 1/ 12

Introduction

Morphology in NLG

  • Last step of the whole NLG pipeline
  • Usually does not get a lot of attention, but is necessary

What we do (Flect)

  • Natural Language Generation pipeline: Semantics → Syntax → Morphology → Text
  • We solve this: the Morphology step
  • In these languages: CS, EN, ES, CA, DE, JA

The need for morphology in generation

  • English – not so much: hard-coded solutions often work well enough
  • Languages with more inflection (e.g. Czech): even for the simplest things

Example 1: Toto se líbí uživateli Jana Nováková.
           "This is liked by user (name)"
           Corrections marked on the slide: ě, é (i.e. Janě Novákové)

Example 2: Děkujeme, Jan Novák, vaše hlasování bylo vytvořeno.
           "Thank you, (name), your poll has been created"
           Corrections marked on the slide: e, u (i.e. Jane Nováku)

Gender and case labels shown on the slide: [masc] [dat]; [fem] [nom] [nom]

The task at hand

  • word + NNS → words
  • Wort + NN (Neut,Pl,Dat) → Wörtern
  • be + VBZ → is
  • ser + V (gen=c,num=s,person=3,mood=indicative,tense=present) → es

  • Input: Lemma (base form) or stem + morphological properties (POS, case, gender, etc.)
  • Output: Inflected word form
  • Inverse to POS tagging

Possible solutions

Dictionary?

  • Works well, but has limited size
  • Not many large-coverage openly available ones

Hand-written rules?

  • Work well, but are hard to maintain

Machine learning!

  • Obtain the rules automatically
  • Plenty of treebanks of sufficient size available
  • Only work known to us: Bohnet et al. 2010

Casting inflection patterns as multi-class classification

Our inflection rules: edit scripts

  • A kind of diffs: how to modify the lemma to get the form
  • Based on Levenshtein distance

Examples (lemma → form: edit script)

  • fly → flies: >1-ies
      [at the end] [delete one letter] [and add these]
  • sparen → gespart: >2-t, <ge
      <ge: [at the beginning] [add this]
  • Mutter → Mütter: 5:1-ü
      [5 letters from the end] [delete one letter] [and add this]
  • be → is: *is
      [replace the whole word]
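Edit scripts like the ones above can be derived automatically from (lemma, form) pairs. The following is a minimal sketch, not Flect's actual code: it uses Python's standard difflib.SequenceMatcher as a stand-in for the Levenshtein-based alignment named on the slide, and the function name edit_script and the order of operations in the output string are assumptions chosen to match the slide examples.

```python
from difflib import SequenceMatcher


def edit_script(lemma: str, form: str) -> str:
    """Describe how to turn `lemma` into `form` in the slide notation.

    Assumed notation (read off the slide examples):
      >N-xyz  at the end, delete N letters and add "xyz" (">xyz" if N is 0)
      <xyz    at the beginning, add "xyz"
      P:N-xyz P letters from the end, delete N letters and add "xyz"
      *xyz    replace the whole word by "xyz"
    """
    if lemma == form:
        return ""
    # NOTE: difflib alignment is a stand-in for a Levenshtein alignment.
    matcher = SequenceMatcher(None, lemma, form, autojunk=False)
    suffix, prefix, internal = [], [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        deleted, added = i2 - i1, form[j1:j2]
        if i1 == 0 and i2 == len(lemma):
            # Nothing of the lemma survives: replace the whole word.
            return "*" + form
        if i2 == len(lemma):
            # The change reaches the end of the lemma.
            suffix.append(f">{deleted}-{added}" if deleted else f">{added}")
        elif tag == "insert" and i1 == 0:
            # Pure insertion in front of the lemma.
            prefix.append("<" + added)
        else:
            # Change inside the word, addressed from the end of the lemma.
            internal.append(f"{len(lemma) - i1}:{deleted}-{added}")
    return ", ".join(suffix + prefix + internal)


if __name__ == "__main__":
    for lemma, form in [("fly", "flies"), ("sparen", "gespart"),
                        ("Mutter", "Mütter"), ("be", "is")]:
        print(lemma, form, edit_script(lemma, form))
```

Each distinct script string can then serve as one class label, which is what turns inflection into multi-class classification.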

Features useful for morphology generation

  • Same POS + same ending = (often) same inflection
      sky, fly + NNS → -ies
      bind, find + VBD → -ound
  • Suffixes = good features to generalize to unseen inputs
  • Machine learning should be able to deal with counter-examples
  • Capitalization: no influence on morphology
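The feature ideas above can be sketched as a small extraction function. This is an illustration under assumptions, not Flect's feature set: the function name extract_features, the feature keys, and the maximum suffix length of 3 are hypothetical, and lowercasing the lemma is just one way to act on the observation that capitalization does not influence morphology.

```python
from typing import Dict, Iterable, Union

FeatureValue = Union[str, bool]


def extract_features(lemma: str, pos: str, morph: Iterable[str] = (),
                     max_suffix_len: int = 3) -> Dict[str, FeatureValue]:
    """Build a feature dictionary for one (lemma, POS, properties) input."""
    lemma_lc = lemma.lower()  # assumption: ignore capitalization
    feats: Dict[str, FeatureValue] = {"lemma": lemma_lc, "pos": pos}
    # Lemma suffixes let a model generalize to lemmas it has never seen.
    for n in range(1, min(max_suffix_len, len(lemma_lc)) + 1):
        feats[f"suffix{n}"] = lemma_lc[-n:]
    # One binary indicator per morphological property.
    for prop in morph:
        feats[f"morph={prop}"] = True
    return feats


if __name__ == "__main__":
    print(extract_features("sky", "NNS"))
    print(extract_features("fly", "NNS"))
```

With such features, sky and fly share the POS and the one-letter suffix even though the lemmas differ, which is the kind of overlap a classifier can exploit.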

Our system Flect: Overall procedure

Running example: Wort + NN (Pl, Neut, Dat)

  • 1. Get features from lemma, POS, suffixes
       (+morph. properties & their combinations, possibly context)
       Example: Wort, NN, Pl, Neut, Dat, suffixes ort, rt, t
  • 2. Predict edit scripts using logistic regression
       Example: >ern, 3:1-ö
  • 3. Use them as rules to obtain form from lemma
       Example: Wort → Wörtern
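Step 3 of the procedure, applying a predicted edit script to the lemma, can be sketched as follows. This is a minimal illustration rather than Flect's implementation: the function name apply_edit_script is hypothetical, and the parsing of the notation (">N-add" or ">add" at the end, "<add" at the beginning, "P:N-add" counted from the end, "*form" for a full replacement) is an assumption based on the examples shown in the slides.

```python
import re


def apply_edit_script(lemma: str, script: str) -> str:
    """Apply an edit script in the slide notation to a lemma (assumed parsing)."""
    ops = [op.strip() for op in script.split(",") if op.strip()]
    prefix, suffix_delete, suffix_add = "", 0, ""
    internal = []  # (letters from the end, letters to delete, letters to add)
    for op in ops:
        if op.startswith("*"):      # replace the whole word
            return op[1:]
        if op.startswith("<"):      # add at the beginning
            prefix = op[1:]
        elif op.startswith(">"):    # change at the end: ">N-add" or ">add"
            match = re.fullmatch(r">(?:(\d+)-)?(.*)", op)
            suffix_delete = int(match.group(1) or 0)
            suffix_add = match.group(2)
        else:                       # change inside the word: "P:N-add"
            match = re.fullmatch(r"(\d+):(\d+)-(.*)", op)
            if match is None:
                raise ValueError(f"unrecognized operation: {op!r}")
            internal.append((int(match.group(1)), int(match.group(2)),
                             match.group(3)))
    word = lemma
    # Leftmost change first, so positions counted from the end stay valid.
    for pos, n_delete, add in sorted(internal, reverse=True):
        start = len(word) - pos
        word = word[:start] + add + word[start + n_delete:]
    word = word[: len(word) - suffix_delete] + suffix_add
    return prefix + word


if __name__ == "__main__":
    print(apply_edit_script("Wort", ">ern, 3:1-ö"))
```

Because the script addresses positions relative to the end of the lemma, the same rule can apply to any lemma with a matching shape, not only to the one it was learned from.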

Testing Flect on 6 languages

  • CoNLL 2009 data: varying morphology richness & tagsets

[Figure: accuracy (%), Total vs. Unseen forms, for CS, EN, ES, CA, DE, JA]

  • Works well even on unseen forms: suffixes help
  • Over-generalization errors, e.g. torpedo + VBN = torpedone
  • German: syntax-sensitive morphology

Flect vs. a dictionary from the same data

  • English: Dictionary gets OK relatively soon

[Figure (EN): accuracy (%) vs. training data part (%), from 0.1 to 100; series: Dictionary (Total), Dictionary (Unknown forms), Flect (Total), Flect (Unknown forms); annotations: 58% error reduction, 76% error reduction]

  • Czech: Dictionary fails on unknown forms, our system works

[Figure (CS): accuracy (%) vs. training data part (%), from 0.1 to 100; same four series; annotation: 92% error reduction]
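The "error reduction" figures annotated on the plots are presumably relative error reductions, i.e. the share of the dictionary's errors that the system removes. The sketch below shows that arithmetic under this assumption; the function name error_reduction is hypothetical and the accuracies used are made-up illustrative values, not numbers read off the slides.

```python
def error_reduction(baseline_acc: float, system_acc: float) -> float:
    """Relative error reduction of a system over a baseline.

    Both accuracies are given in percent; the result is a fraction
    (0.5 means that half of the baseline's errors are gone).
    """
    baseline_err = 100.0 - baseline_acc
    system_err = 100.0 - system_acc
    if baseline_err <= 0:
        raise ValueError("the baseline makes no errors to reduce")
    return (baseline_err - system_err) / baseline_err


if __name__ == "__main__":
    # Hypothetical accuracies, for illustration only.
    print(f"{error_reduction(90.0, 95.0):.0%}")
```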

Conclusions

General observations:

  • Inflection rules/patterns can be learned from a corpus
  • Suffix features are useful to inflect unseen words
  • Detailed morphological features and context features help

Our system Flect:

  • improves on a dictionary learnt from the same data
  • gains more in morphologically rich languages (Czech)
  • can be combined with a dictionary as a back-off for OOVs
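One reading of the last point is that a dictionary answers for the forms it knows, while the statistical model handles out-of-vocabulary (OOV) inputs. The sketch below illustrates that combination under this assumption; inflect, toy_dictionary and toy_fallback are hypothetical names, and the fallback is a naive stand-in rule, not Flect's classifier.

```python
from typing import Callable, Dict, Tuple

Key = Tuple[str, str]  # (lemma, tag string), e.g. ("word", "NNS")


def inflect(lemma: str, tags: str, dictionary: Dict[Key, str],
            fallback: Callable[[str, str], str]) -> str:
    """Look the form up in the dictionary; back off to a model for OOVs."""
    form = dictionary.get((lemma, tags))
    if form is not None:
        return form
    return fallback(lemma, tags)


# Toy dictionary built from the examples in the slides.
toy_dictionary: Dict[Key, str] = {("word", "NNS"): "words", ("be", "VBZ"): "is"}


def toy_fallback(lemma: str, tags: str) -> str:
    # Stand-in for a trained model: a naive English plural rule.
    return lemma + "s" if tags == "NNS" else lemma


if __name__ == "__main__":
    print(inflect("be", "VBZ", toy_dictionary, toy_fallback))
    print(inflect("model", "NNS", toy_dictionary, toy_fallback))
```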


Thank you for your attention

You may download Flect (and these slides) at:

  • http://ufal.mff.cuni.cz/~odusek/flect/
  • http://bit.ly/flect

The system is based on Python and Scikit-Learn.

You may contact us:

  • Ondřej Dušek & Filip Jurčíček, Charles University in Prague
  • dusek@ufal.mff.cuni.cz