SLIDE 1

Tokenization and Word Segmentation

Daniel Zeman, Rudolf Rosa

March 6, 2020

NPFL120 Multilingual Natural Language Processing

Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

SLIDE 2

Tokenization and Word Segmentation

  • IMPORTANT because:
  • Training tokenization ≠ test tokenization
  • ⇒ accuracy goes down
  • Not always trivial
  • May interact with morphology
  • May include normalization (character-level)

SLIDE 3

Tokenization

“María, I love you!” Juan exclaimed.
«¡María, te amo!», exclamó Juan.

TOKENS:  «      ¡      María  ,      te    amo   !      »      ,
UPOS:    PUNCT  PUNCT  PROPN  PUNCT  PRON  VERB  PUNCT  PUNCT  PUNCT

  • Classic tokenization:
  • Separate punctuation from words
  • Recognize certain clusters of symbols like “...”
  • Perhaps keep together things like user@mail.x.edu
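The classic strategy above can be sketched with a single token-matching regex (an illustrative sketch, not the course's reference tokenizer; the pattern names and the e-mail heuristic are my own). Alternatives are tried left to right, so protected spans win over punctuation splitting:

```python
import re

# Alternatives tried left to right: keep e-mail-like spans whole,
# keep "..." as one symbol cluster, then word runs, then any single
# remaining non-space symbol (punctuation).
TOKEN = re.compile(r"\S+@\S+\.\w+"   # user@mail.x.edu stays one token
                   r"|\.\.\."        # "..." stays one token
                   r"|\w+"           # runs of letters/digits
                   r"|[^\w\s]")      # any other symbol, one by one

def tokenize(text: str) -> list[str]:
    return TOKEN.findall(text)

print(tokenize("Write to user@mail.x.edu, please..."))
# → ['Write', 'to', 'user@mail.x.edu', ',', 'please', '...']
```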

SLIDE 4

Using Unicode Character Categories

  • https://perldoc.perl.org/perlunicode.html

$text =~ s/(\pP)/ $1 /g;
$text =~ s/^\s+//;
$text =~ s/\s+$//;

  • Put spaces around every punctuation character (\pP), then trim leading/trailing whitespace
  • Optionally recombine email addresses, URLs etc.
  • Some problems:
  • haven ’ t (English; should be have n’t)
  • instal · lació (Catalan; should be 1 token)
  • single quote (punctuation) misspelled as acute accent (modifier letter)
  • writing systems without spaces
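A rough Python equivalent of the Perl one-liner above (an illustrative sketch using the standard library, not the course's tool): put spaces around every character whose Unicode general category starts with "P" (punctuation), then re-split on whitespace. It reproduces both problem cases from the list:

```python
import unicodedata

def split_punct(text: str) -> list[str]:
    out = []
    for ch in text:
        # general categories Pc, Pd, Ps, Pe, Pi, Pf, Po ~ Perl's \pP
        if unicodedata.category(ch).startswith("P"):
            out.append(f" {ch} ")
        else:
            out.append(ch)
    return "".join(out).split()

print(split_punct("haven't"))       # → ['haven', "'", 't']
print(split_punct("instal·lació"))  # → ['instal', '·', 'lació']
```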

SLIDE 6

Normalization

  • Often part of tokenization
  • Decimal comma to decimal point; separator of thousands
  • Unicode directed quotes and long hyphens to undirected ASCII
  • “English” — ‘English’ — „česky“ — ‚česky‘ — « français » — ‹ français › — „magyar” — »magyar« — ’magyar’
  • Sometimes mistaken for ACUTE ACCENT, PRIME (math) etc.
  • TeX-like ASCII directed quotes `` and '' and hyphens -- and ---

  • English/ASCII punctuation in foreign writing systems
  • 「你看過《三國演義》嗎?」他問我。
  • “你看過‘三國演義’嗎?”他問我.
  • European/ASCII digits in Arabic, Devanagari etc.
  • 0 1 2 3 4 5 6 7 8 9 (Western Arabic/European)
  • ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩ (Eastern Arabic)
  • ० १ २ ३ ४ ५ ६ ७ ८ ९ (Devanagari)
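Two of the normalizations listed above can be sketched with translation tables (a minimal illustration, not the authors' normalizer; the mapping choices are my own): Eastern Arabic and Devanagari digits to ASCII, and directed Unicode quotes to undirected ASCII ones.

```python
# digit → ASCII: both scripts list their digits in 0..9 order
DIGITS = {ord(c): str(i)
          for digits in ("٠١٢٣٤٥٦٧٨٩", "०१२३४५६७८९")
          for i, c in enumerate(digits)}
# directed quotes → undirected ASCII equivalents
QUOTES = {ord(c): '"' for c in "“”„«»"} | {ord(c): "'" for c in "‘’‚‹›"}

def normalize(text: str) -> str:
    return text.translate(DIGITS | QUOTES)

print(normalize("«٢٠٢٠»"))   # → "2020" (in ASCII quotes)
print(normalize("५ + ५"))    # → 5 + 5
```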

SLIDE 11

Word Segmentation

Let’s go to the sea.
Vámonos al mar.

TOKENS:  Vámonos  al  mar   .
UPOS:    VERB?    X   NOUN  PUNCT
WORDS:   Vamos  nos   a    el   mar   .
UPOS:    VERB   PRON  ADP  DET  NOUN  PUNCT

  • Syntactic word vs. orthographic word
  • Multi-word tokens
  • Two-level scheme:
  • Tokenization (low level, punctuation, concatenative)
  • Word segmentation (higher level, not necessarily concatenative)

Tokenization and Word Segmentation

5/27

slide-12
SLIDE 12

Word Segmentation

  • Lexicalist hypothesis:
  • Words (not morphemes) are the basic units in syntax
  • Words enter in dependency relations
  • Words are forms of lemmas and have morphological features
  • Orthographic vs. syntactic word
  • Syntactically autonomous part of orthographic word
  • Contractions (al = a + el)
  • Clitics (vámonos = vamos + nos)
  • ¿A qué hora nos vamos mañana?
  • Nos despertamos a las cinco.

“We wake up at five.”

  • Nuestro guía nos despierta a las cinco.

“Our guide wakes us up at five.”

SLIDE 13

Contractions in Arabic

He abdicated in favour of his son Baudouin.
يتنازل عن العرش لابنه بودوان

FORM:   yatanāzalu   ʿan  al-ʿarši    li+ibni+hi     būdūān
GLOSS:  surrendered       the throne  to son his     Baudouin
UPOS:   VERB         ADP  NOUN        ADP+NOUN+PRON  PROPN

SLIDE 14

Segmentation as Part of Morphological Analysis

  • Arabic
  • ElixirFM: http://lindat.mff.cuni.cz/services/elixirfm/run.php
  • Enter “لابنه” (labnh)
  • Sanskrit
  • Sanskrit Reader Companion: http://sanskrit.inria.fr/DICO/reader.fr.html
  • Select Input convention = Devanagari
  • Enter “सकलार्थशास्त्रसारं जगति समालोक्य विष्णुशर्मेदम्” (sakalārthaśāstrasāraṁ jagati samālokya viṣṇuśarmedam)

  • German compound splitting (unsupervised)

SLIDE 15

Chinese Word Segmentation

We are now in Valencia.

UNSEGMENTED:  現在我們在瓦倫西亞。   Xiàn zài wǒ men zài wǎ lún xī yǎ.
SEGMENTED:    現在 我們 在 瓦倫西亞 。   Xiànzài wǒmen zài Wǎlúnxīyǎ .
GLOSS:        Now  we   in  Valencia  .
UPOS:         ADV  PRON ADP PROPN     PUNCT
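A classic dictionary baseline for this task is greedy longest-match ("maximum matching") segmentation: at each position, take the longest dictionary word that matches. This is an illustrative sketch with a toy dictionary, not the method used to produce the slide's segmentation:

```python
def max_match(text: str, vocab: set[str], max_len: int = 4) -> list[str]:
    words, i = [], 0
    while i < len(text):
        # try the longest candidate first; a single character always succeeds
        for l in range(min(max_len, len(text) - i), 0, -1):
            if l == 1 or text[i:i + l] in vocab:
                words.append(text[i:i + l])
                i += l
                break
    return words

vocab = {"現在", "我們", "瓦倫西亞"}
print(max_match("現在我們在瓦倫西亞。", vocab))
# → ['現在', '我們', '在', '瓦倫西亞', '。']
```

Real segmenters refine this with statistics or neural models, since greedy matching fails on ambiguous overlaps.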

SLIDE 16

Words in Japanese

I went to the beauty salon of Kyōdō. [Beyond-R]

SURFACE:  経堂   の   美容室         に   行っ   て     き     まし   た
ROMAJI:   Kyōdō  no   biyōshitsu    ni   it    te     ki     mashi  ta
LEMMA:    経堂   の   美容室         に   行く   て     来る    ます   た
GLOSS:    Kyōdō  of   beauty-salon  to   go    CONV   come   will   PAST
UPOS:     PROPN  ADP  NOUN          ADP  VERB  SCONJ  AUX    AUX    AUX

(dependency arcs in the tree: nmod, case, case, mark, aux, aux, aux)

SLIDE 17

Words in Japanese

I went to the beauty salon of Kyōdō. [Beyond-R]

SURFACE:  経堂   の   美容室         に   行って          きました
ROMAJI:   Kyōdō  no   biyōshitsu    ni   itte           kimashita
LEMMA:    経堂   の   美容室         に   行く            来る
GLOSS:    Kyōdō  of   beauty-salon  to   going          come
UPOS:     PROPN  ADP  NOUN          ADP  VERB           VERB
FEATS:                                   VerbForm=Conv  VerbForm=Fin|Tense=Past|Polite=Form

(dependency arcs in the tree: nmod, case, case, advcl)

SLIDE 18

Words in Japanese

I went to the beauty salon of Kyōdō. [Beyond-R]

SURFACE:  経堂の     美容室に          行って          きました
ROMAJI:   Kyōdōno   biyōshitsuni     itte           kimashita
LEMMA:    経堂       美容室           行く            来る
GLOSS:    of-Kyōdō  to-beauty-salon  going          come
UPOS:     PROPN     NOUN             VERB           VERB
FEATS:    Case=Gen  Case=Dat         VerbForm=Conv  VerbForm=Fin|Tense=Past|Polite=Form

(dependency arcs in the tree: nmod, advcl)

SLIDE 19

Vietnamese: Words with Spaces

All the concrete country roads are the result of…

FORM:   Tất cả  đường  bêtông    nội đồng  là   thành quả    …
GLOSS:  All     road   concrete  country   is   achievement  …
UPOS:   PRON    NOUN   NOUN      NOUN      AUX  NOUN         PUNCT

  • Spaces delimit monosyllabic morphemes, not words.
  • Multiple syllables without space occur in loanwords (bêtông).
  • Spaces are allowed to occur word-internally in Vietnamese UD.

SLIDE 20

Numbers with Spaces

# text = Il touche environ 100 000 sesterces par an.
1  Il         il        PRON   …  2  nsubj   _  _
2  touche     toucher   VERB   …  0  root    _  _
3  environ    environ   ADV    …  4  advmod  _  _
4  100 000    100 000   NUM    …  5  nummod  _  _
5  sesterces  sesterce  NOUN   …  2  obj     _  _
6  par        par       ADP    …  7  case    _  _
7  an         an        NOUN   …  2  obl     _  SpaceAfter=No
8  .          .         PUNCT  …  2  punct   _  _

SLIDE 21

Fixed Expressions

One syntactic word spans several orthographic words?

# text = Bin nach wie vor sehr zufrieden.
# text_en = I am still very satisfied.
1  Bin        sein       AUX    …  6  cop     _  _
2  nach       nach       ADP    …  6  obl     _  _
3  wie        wie        ADV    …  2  fixed   _  _
4  vor        vor        ADP    …  2  fixed   _  _
5  sehr       sehr       ADV    …  6  advmod  _  _
6  zufrieden  zufrieden  ADJ    …  0  root    _  SpaceAfter=No
7  .          .          PUNCT  …  6  punct   _  _

SLIDE 22

Fixed Expressions

One syntactic word spans several orthographic words?
I am still very satisfied.

FORM:   Bin  nach   wie   vor     sehr  zufrieden  .
GLOSS:  Am   after  like  before  very  satisfied  .
UPOS:   AUX  ADP    ADV   ADP     ADV   ADJ        PUNCT

(dependency arcs: cop, obl, fixed, fixed, advmod, punct)

SLIDE 23

Multi-Word Expressions outside UD

Some corpora use the underscore character to glue MWEs together.
I am still very satisfied.

FORM:   Bin  nach_wie_vor       sehr  zufrieden  .
GLOSS:  Am   after_like_before  very  satisfied  .
UPOS:   AUX  ADV                ADV   ADJ        PUNCT

(dependency arcs: cop, advmod, advmod, punct)

SLIDE 24

Multi-Word Expressions outside UD

Some corpora use the underscore character to glue MWEs together.

  • Durante la presentación del libro ” La_prosperidad_por_medio_de_la_investigación_._La_investigación_básica_en_EEUU ” , editado por la Comunidad_de_Madrid , el secretario general de la Confederación_Empresarial_de_Madrid-CEOE ( CEIM ) , Alejandro_Couceiro , abogó por la formación de los investigadores en temas de innovación tecnológica .

  • Lemmas?
  • Tags?

SLIDE 25

Word Segmentation Summary

  • When to split?
  • Only part of the token involved in a relation to something outside the token? Split!
  • Hard time finding POS tag? Split!
  • Hard time finding dependency relation? Don’t split!
  • Or not hard time but the relation would be compound, flat, fixed or goeswith.
  • Border case? Keep orthographic words (if they exist).
  • Words with spaces
  • Vietnamese writing system
  • Very restricted set of exceptions (numbers)
  • Special relations elsewhere (fixed, compound)

SLIDE 30

Recoverability: CoNLL-U Format

# text = Vámonos al mar.
# text_en = Let’s go to the sea.
ID   FORM     LEMMA     UPOS   …  HEAD  DEPREL  DEPS  MISC
1-2  Vámonos  _         _      …  _     _       _     _
1    Vamos    ir        VERB   …  0     root    _     _
2    nos      nosotros  PRON   …  1     obj     _     _
3-4  al       _         _      …  _     _       _     _
3    a        a         ADP    …  5     case    _     _
4    el       el        DET    …  5     det     _     _
5    mar      mar       NOUN   …  1     obl     _     SpaceAfter=No
6    .        .         PUNCT  …  1     punct   _     _
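The recoverability guarantee can be shown in a few lines of Python (a simplified sketch, not an official CoNLL-U reader; for brevity the rows carry only three columns: ID, FORM, MISC). Multiword-token lines like "1-2 Vámonos" supply the surface form, the covered word lines are skipped, and SpaceAfter=No suppresses the following space:

```python
def detokenize(conllu_lines: list[str]) -> str:
    out, skip_until = [], 0
    for line in conllu_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.split("\t")
        tok_id, form, misc = cols[0], cols[1], cols[-1]
        if "-" in tok_id:                  # multiword token range, e.g. "1-2"
            skip_until = int(tok_id.split("-")[1])
        elif int(tok_id) <= skip_until:
            continue                       # syntactic word covered by an MWT
        out.append(form + ("" if "SpaceAfter=No" in misc else " "))
    return "".join(out).strip()

rows = ["1-2\tVámonos\t_", "1\tVamos\t_", "2\tnos\t_",
        "3-4\tal\t_", "3\ta\t_", "4\tel\t_",
        "5\tmar\tSpaceAfter=No", "6\t.\t_"]
print(detokenize(rows))   # → Vámonos al mar.
```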

SLIDE 31

Recoverability: CoNLL-U Format

# text = Vámonos al mar.
# text_en = Let’s go to the sea.
ID   FORM     LEMMA     UPOS   …  HEAD  DEPREL  DEPS  MISC
1-2  Vámonos  _         _      …  _     _       _     _
1    Vamos    ir        VERB   …  0     root    _     _
2    nos      nosotros  PRON   …  1     obj     _     _
3-4  al       _         _      …  _     _       _     _
3    a        a         ADP    …  5     case    _     _
4    el       el        DET    …  5     det     _     _
5-6  mar.     _         _      …  _     _       _     _
5    mar      mar       NOUN   …  1     obl     _     _
6    .        .         PUNCT  …  1     punct   _     _

SLIDE 32

Tokenization vs. Multi-word Tokens

  • Parallelism among closely related languages
  • ca: informar-se sobre el patrimoni cultural
  • es: informarse sobre el patrimonio cultural
  • en: learn about cultural heritage
  • ca: L’únic que veig és => L’ únic que veig és
  • en: don’t => do n’t
  • No strict guidelines for tokenization (yet)
  • UD English: non-stop, post-war: single-word tokens
  • UD Czech: non-stop would be split to three tokens
  • Abbreviations: etc.
  • End of sentence…
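The two high-level splits above can be sketched with a couple of regexes (an illustration of the idea, not the UD tools' implementation; the patterns are my own and deliberately minimal): English "don't" becomes "do" + "n't", and an elided article like "L'únic" splits after the apostrophe.

```python
import re

def split_contractions(tokens: list[str]) -> list[str]:
    out = []
    for tok in tokens:
        m = re.fullmatch(r"(\w+)(n't)", tok, re.IGNORECASE)
        if m:                                   # don't -> do + n't
            out.extend([m.group(1), m.group(2)])
        elif re.fullmatch(r"[LlDd]'\w+", tok):  # L'únic -> L' + únic
            apo = tok.index("'") + 1
            out.extend([tok[:apo], tok[apo:]])
        else:
            out.append(tok)
    return out

print(split_contractions(["don't", "L'únic", "que"]))
# → ['do', "n't", "L'", 'únic', 'que']
```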

SLIDE 33

Tokenization vs. Multi-word Tokens Summary

  • Punctuation involved? Low level!
  • Exceptions: Spanish-Catalan parallelism.
  • Boundary between two letters? Typically high level.
  • Exceptions: Chinese, Japanese.
  • Non-concatenative? High level!

SLIDE 36

Errors in Underlying Text

  • We do not want to hide errors (learning robust parsers!)
  • But: reference corpora (linguistic research) may want to hide them.
  • Possibilities:
  • Typo not involving word boundary: FORM = anotation; LEMMA = annotation; FEATS: Typo=Yes; MISC: Correct=annotation
  • Wrongly split word: ann otation (the second token attached to the first via the goeswith relation)
  • Wrongly merged words: thecar
  • Fix tokenization (i.e. two lines); first line MISC: SpaceAfter=No | CorrectSpaceAfter=Yes
  • Sentence segmentation can be affected, too!

SLIDE 40

Errors in Underlying Text

  • Wrong morphology: the cars is produced in Detroit
  • Not like normal typo (the car iss produced…)
  • Not obvious what is correct
  • the car is
  • the cars are
  • Suggestion: select which word to fix, e.g. cars to car
  • FORM = cars; FEATS: Number=Plur; MISC: Correct=car | CorrectNumber=Sing
  • cs: viděl moři “he saw the sea”
  • Should be moře
  • Would be Case=Acc (disambiguated from Case=Acc,Gen,Nom,Voc)
  • This form is Case=Dat,Loc (but which one?)
  • cestoval k moři “he traveled to the sea” Case=Dat
  • plavil se po moři “he sailed the sea” Case=Loc

SLIDE 45

Tokenization Alignment

  • If you need to match two different tokenizations
  • Use case: evaluation of end-to-end parsing systems
  • Normalization involved? Bad luck…
  • Normalization rules needed
  • Or: Longest common subsequence (LCS) algorithm
  • Otherwise easy
  • Non-whitespace character offsets
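The LCS idea can be sketched with Python's standard difflib (an illustration, not the shared-task evaluator): concatenate each tokenization's non-whitespace characters and take the longest matching blocks between the two strings.

```python
import difflib

def align(gold: list[str], system: list[str]):
    g, s = "".join(gold), "".join(system)
    sm = difflib.SequenceMatcher(a=g, b=s, autojunk=False)
    # each block is (offset in gold chars, offset in system chars, length)
    return sm.get_matching_blocks()

# gold "im" vs. a system that expanded it to "in dem": the characters
# differ, so exact offset matching fails, but the matching blocks still
# recover which character spans correspond
blocks = align(["auch", "im", "Rahmen"], ["auch", "in", "dem", "Rahmen"])
print(blocks)
```

Every gold character lands in some matching block here (5 + 7 = 12 characters), so each gold token's span can be mapped into the system string.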

SLIDE 48

Evaluation Metrics

  • Align system-output tokens to gold tokens

Al-Zaman : American forces killed Shaikh Abdullah al-Ani, the preacher at the mosque in the town of Qaim, near the Syrian border.

GOLD:    Al   -  Zaman  :  American  forces  killed  Shaikh  …
OFFSET:  0-1  2  3-7    8  9-16      17-22   23-28   29-34

  • All characters except for whitespace match => easy align!

SYSTEM:  Al-Zaman  :  American  forces  killed  Shaikh  …
OFFSET:  0-7       8  9-16      17-22   23-28   29-34
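Computing the non-whitespace character offsets above is mechanical (a small assumed helper for illustration, not part of the slides' toolkit): walk the tokens, counting only non-space characters, so tokens from two tokenizations of the same text can be matched whenever their spans coincide.

```python
def char_offsets(tokens: list[str]) -> list[tuple[int, int]]:
    spans, pos = [], 0
    for tok in tokens:
        compact = "".join(tok.split())   # drop spaces inside e.g. "100 000"
        spans.append((pos, pos + len(compact) - 1))
        pos += len(compact)
    return spans

gold = ["Al", "-", "Zaman", ":", "American"]
system = ["Al-Zaman", ":", "American"]
print(char_offsets(gold))     # → [(0, 1), (2, 2), (3, 7), (8, 8), (9, 16)]
print(char_offsets(system))   # → [(0, 7), (8, 8), (9, 16)]
```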

SLIDE 49

Evaluation Metrics

  • Align system-output tokens to gold tokens

Die Kosten sind definitiv auch im Rahmen.

GOLD:    Die  Kosten  sind  definitiv  auch   im     Rahmen  .
SPLIT:   Die  Kosten  sind  definitiv  auch   in dem Rahmen  .
OFFSET:  0-2  3-8     9-12  13-21      22-25  26-27  28-33   34

  • Corresponding but not identical spans?
  • Find longest common subsequence

SYSTEM:  Kosten  sind  definitiv   auch   im     Rahmen  .
SPLIT:   Kosten  sind  de finitiv  auch   im     Rahmen  .
OFFSET:  3-8     9-12  13-21       22-25  26-27  28-33   34

SLIDE 50

Evaluation Metrics

  • Align system-output tokens to gold tokens

Die Kosten sind definitiv auch im Rahmen.

GOLD:    Die  Kosten  sind  definitiv  auch   im     Rahmen  .
SPLIT:   Die  Kosten  sind  definitiv  auch   in dem Rahmen  .
OFFSET:  0-2  3-8     9-12  13-21      22-25  26-27  28-33   34

  • Corresponding but not identical spans?
  • Find longest common subsequence

SYSTEM:  auch   im                               Rahmen  .
SPLIT:   auch   in einem , dem alle zustimmen ,  Rahmen  .
OFFSET:  22-25  26-27                            28-33   34
