Data Augmentation for Context-Sensitive Neural Lemmatization Using Inflection Tables and Raw Text



slide-1
SLIDE 1

Data Augmentation for Context-Sensitive Neural Lemmatization Using Inflection Tables and Raw Text

Toms Bergmanis, Sharon Goldwater

slide-2
SLIDE 2

Lemmatization

        Sing        Plural
NOM     ceļš        ceļi
GEN     ceļa        ceļu
DAT     ceļam       ceļiem
ACC     ceļu        ceļus
INST    ar ceļu     ar ceļiem
LOC     ceļā        ceļos
VOC     ceļ         ceļi

Latvian: ceļš (English: road)

Lemma: ceļš

slide-3
SLIDE 3

Previous work:

“sentence context helps to lemmatize ambiguous and unseen words”

Bergmanis and Goldwater, 2018

slide-4
SLIDE 4

Ambiguous words: ceļu

Lemma could be:

  • A. ceļš (road): NOUN, sing., ACC
  • B. celis (knee): NOUN, plur., GEN
  • C. celt (to lift): VERB, 1st p., sing., pres.

Latvian examples

slide-5
SLIDE 5
Learning from sentences

  • 1. Lemma-annotated sentences are scarce for low-resource languages
  • 2. Annotating sentences is slow
  • 3. N types > N (contiguous) tokens

slide-6
SLIDE 6
Learning from sentences

  • 1. Lemma-annotated sentences are scarce for low-resource languages
  • 2. Annotating sentences is slow
  • 3. N types > N (contiguous) tokens

Chakrabarty et al., 2017

slide-7
SLIDE 7
Learning from sentences

  • 1. Lemma-annotated sentences are scarce for low-resource languages
  • 2. Annotating sentences is slow
  • 3. N types > N (contiguous) tokens

Garrette et al., 2013

slide-8
SLIDE 8

N types > N tokens

Training on 1k UDT tokens/types
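
To make the types-versus-tokens contrast concrete, here is a minimal sketch of two ways to draw N training items from an annotated corpus. The sampling procedure (taking the first N items in corpus order) and the toy data are assumptions for illustration; the slides do not say how the 1k items were selected.

```python
def first_n_tokens(corpus, n):
    """Return the first n (form, lemma) tokens in running-text order."""
    return corpus[:n]


def first_n_types(corpus, n):
    """Return the first n distinct (form, lemma) pairs, skipping repeats."""
    seen = set()
    types = []
    for pair in corpus:
        if pair not in seen:
            seen.add(pair)
            types.append(pair)
            if len(types) == n:
                break
    return types


# Toy annotated corpus (hypothetical data): a list of (form, lemma) tokens.
corpus = [
    ("the", "the"), ("roads", "road"), ("the", "the"),
    ("road", "road"), ("roads", "road"), ("led", "lead"),
]

token_sample = first_n_tokens(corpus, 3)  # contains a repeated ("the", "the")
type_sample = first_n_types(corpus, 3)    # three distinct pairs
```

With the same budget of three items, the token sample covers two distinct form-lemma pairs while the type sample covers three, which is the intuition behind preferring types when annotation is scarce.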

slide-9
SLIDE 9

Types in context

Sentence: algorithms get smarter, computers faster

Lemma of "smarter": smart

Bergmanis and Goldwater, 2018
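
A "type in context" can be pictured as a target word plus a window of surrounding text, paired with its lemma. The sketch below builds such an example with a character window on each side. The function name, the dictionary layout, and the window size are assumptions for illustration and are not taken from the slides.

```python
def make_example(tokens, index, lemma, n_chars=20):
    """Pair the token at `index` with up to n_chars of left and right context."""
    left_text = " ".join(tokens[:index])
    right_text = " ".join(tokens[index + 1:])
    return {
        # Keep only the n_chars characters closest to the target word.
        "left": left_text[max(0, len(left_text) - n_chars):],
        "target": tokens[index],
        "right": right_text[:n_chars],
        "lemma": lemma,
    }


tokens = "algorithms get smarter , computers faster".split()
example = make_example(tokens, 2, "smart", n_chars=8)
```

Here the target "smarter" is stored with the left context "thms get", the right context ", comput", and the lemma "smart".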

slide-10
SLIDE 10

Proposal: Data Augmentation

Combine UniMorph inflection tables + raw text...

...to get types in context

slide-11
SLIDE 11

Method: Data Augmentation

Inflection

Dzīves pēdējā ceļā pavadot mūsu → ceļš
(Latvian, roughly: "accompanying our ... on life's last road")

UniMorph inflection tables:

...
ceļš    ceļš    N;NOM;SG
ceļš    ceļā    N;LOC;SG
...

slide-12
SLIDE 12

Method: Data Augmentation

Context

Dzīves pēdējā ceļā pavadot mūsu

UniMorph inflection tables:

...
ceļš    ceļš    N;NOM;SG
ceļš    ceļā    N;LOC;SG
...

slide-13
SLIDE 13

Method: Data Augmentation

Lemma

Dzīves pēdējā ceļā pavadot mūsu → ceļš

UniMorph inflection tables:

...
ceļš    ceļš    N;NOM;SG
ceļš    ceļā    N;LOC;SG
...
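
The three ingredients on these slides (inflection, context, lemma) can be sketched as a lookup of UniMorph forms in raw sentences. This is a minimal illustration and not the authors' implementation: it assumes tab-separated UniMorph lines (lemma, form, features), whitespace tokenization, and exact case-sensitive matching, and the function names are hypothetical.

```python
def parse_unimorph(lines):
    """Parse tab-separated UniMorph lines into (lemma, form, features) tuples."""
    entries = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:  # skip blank or malformed lines
            entries.append(tuple(parts))
    return entries


def augment(entries, sentences):
    """Emit a training example for every UniMorph form found in raw text."""
    # Assumption: each form maps to one lemma; ambiguity is not handled here.
    form_to_lemma = {form: lemma for lemma, form, _features in entries}
    examples = []
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            lemma = form_to_lemma.get(token)
            if lemma is not None:
                examples.append({
                    "left": tokens[:i],
                    "form": token,
                    "right": tokens[i + 1:],
                    "lemma": lemma,
                })
    return examples


entries = parse_unimorph([
    "ceļš\tceļš\tN;NOM;SG\n",
    "ceļš\tceļā\tN;LOC;SG\n",
])
examples = augment(entries, ["Dzīves pēdējā ceļā pavadot mūsu"])
```

For the sentence on the slide this yields one example: the form ceļā with lemma ceļš, left context "Dzīves pēdējā", and right context "pavadot mūsu".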

slide-14
SLIDE 14

Inflection Tables:

        Sing        Plural
NOM     ceļš        ceļi
GEN     ceļa        ceļu
DAT     ceļam       ceļiem
ACC     ceļu        ceļus
INST    ar ceļu     ar ceļiem
LOC     ceļā        ceļos
VOC     ceļ         ceļi

Latvian: ceļš (English: road)

slide-15
SLIDE 15

Inflection Tables:

        Sing        Plural
NOM     ceļš        ceļi
GEN     ceļa        ceļu
DAT     ceļam       ceļiem
ACC     ceļu        ceļus
INST    ar ceļu     ar ceļiem
LOC     ceļā        ceļos
VOC     ceļ         ceļi

celt (build), ceļot (travel), celis (knee)
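
When several inflection tables share a surface form, raw text alone cannot tell which lemma is meant. The sketch below assumes that such forms are simply left out of the augmentation data, which is consistent with the later mention of training "without the ambiguous types" but is not spelled out on the slides. The function name and toy entries are hypothetical.

```python
from collections import defaultdict


def unambiguous_forms(pairs):
    """Map each form to its lemma, keeping only forms with exactly one lemma."""
    lemmas_by_form = defaultdict(set)
    for lemma, form in pairs:
        lemmas_by_form[form].add(lemma)
    return {
        form: next(iter(lemmas))
        for form, lemmas in lemmas_by_form.items()
        if len(lemmas) == 1
    }


# Toy (lemma, form) pairs: "ceļu" belongs to three different tables.
pairs = [
    ("ceļš", "ceļu"),
    ("celis", "ceļu"),
    ("celt", "ceļu"),
    ("ceļš", "ceļā"),
]
kept = unambiguous_forms(pairs)
```

In this toy input only ceļā survives the filter, because ceļu is listed under ceļš, celis, and celt.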

slide-18
SLIDE 18

Key question:

If ambiguous words “enforce” the use of context, is context still useful in the absence of ambiguous forms?

slide-19
SLIDE 19

Experiments

Train: 1k types from the Universal Dependencies corpus
Augment: 1k, 5k, or 10k UniMorph types in Wikipedia contexts
Languages: Bulgarian, Czech, Estonian, Finnish, Latvian, Polish, Romanian, Russian, Swedish, Turkish

slide-20
SLIDE 20

Experiments

Metric: type-level macro-average accuracy
Test: standard splits of the Universal Dependencies corpus
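
One possible reading of the metric is sketched below: accuracy is computed per language over distinct types, and the per-language scores are then averaged with equal weight. Both the interpretation of "macro average" as an average across languages and the input layout are assumptions, since the slides do not define the metric further.

```python
def accuracy(items):
    """Fraction of (gold_lemma, predicted_lemma) pairs that match."""
    items = list(items)
    if not items:
        raise ValueError("cannot compute accuracy on an empty evaluation set")
    return sum(gold == predicted for gold, predicted in items) / len(items)


def macro_average(per_language):
    """Average per-language accuracies so that each language counts equally."""
    scores = {lang: accuracy(items) for lang, items in per_language.items()}
    return sum(scores.values()) / len(scores), scores


# Toy evaluation data (hypothetical): one (gold, predicted) pair per type.
per_language = {
    "lv": [("ceļš", "ceļš"), ("celis", "celt")],
    "en": [("smart", "smart")],
}
overall, scores = macro_average(per_language)
```

In this toy input the Latvian score is 0.5 and the English score is 1.0, giving a macro average of 0.75 even though the languages contribute different numbers of types.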

slide-21
SLIDE 21

Results: Data augmentation

using context

slide-22
SLIDE 22

Does the model learn from context?

context vs no context

slide-23
SLIDE 23

Affix ambiguity: wuger

Lemma depends on context:

  • A. If wuger is an adjective, the lemma could be wug
  • B. If wuger is a noun, the lemma could be wuger

English examples

slide-24
SLIDE 24

Takeaways/conclusions:

Despite biased data and divergent lemmatization standards, type-based data augmentation helps (+14% accuracy).

slide-25
SLIDE 25

Takeaways/conclusions:

Even without the ambiguous types that “enforce” the use of context, the model uses context to disambiguate affixes of unseen words (+5% accuracy).

slide-26
SLIDE 26

Data Augmentation for Context-Sensitive Neural Lemmatization Using Inflection Tables and Raw Text

toms.bergmanis@gmail.com

https://bitbucket.org/tomsbergmanis/data_augumentation_um_wiki