Tokenization and Word Segmentation
Daniel Zeman, Rudolf Rosa
March 6, 2020
NPFL120 Multilingual Natural Language Processing
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
“María, I love you!” Juan exclaimed.
«¡María, te amo!», exclamó Juan.

«¡María,  te    amo!»,  exclamó  Juan.
X         PRON  X       VERB     X

«      ¡      María  ,      te    amo   !      »      ,
PUNCT  PUNCT  PROPN  PUNCT  PRON  VERB  PUNCT  PUNCT  PUNCT
$text =~ s/(\pP)/ $1 /g;
$text =~ s/^\s+//;
$text =~ s/\s+$//;
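For readers more used to Python, the three substitutions above can be approximated as follows. This is only a sketch and not part of the original slides: the function name `naive_tokenize` is made up for illustration, and because Python's built-in `re` module has no `\pP` class, punctuation is detected through `unicodedata.category`, which is an assumption about what should count as punctuation.

```python
import unicodedata


def naive_tokenize(text: str) -> list[str]:
    """Put spaces around every Unicode punctuation character, then split."""
    spaced = []
    for ch in text:
        # General categories starting with "P" (Pc, Pd, Ps, Pe, Pi, Pf, Po)
        # are the ones covered by Perl's \pP.
        if unicodedata.category(ch).startswith("P"):
            spaced.append(f" {ch} ")
        else:
            spaced.append(ch)
    # str.split() without arguments also discards leading and trailing
    # whitespace, which covers the two trimming substitutions.
    return "".join(spaced).split()


print(naive_tokenize("«¡María, te amo!», exclamó Juan."))
```

Like the Perl version, this splits on every punctuation character without exception, so abbreviations, decimal numbers, and hyphenated names would be broken apart as well.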
»magyar« — ’magyar’
TeX-like ASCII directed quotes `` and '' and hyphens -- and ---
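A minimal sketch of how such ASCII sequences could be mapped to single Unicode characters before tokenization. The function name `normalize_ascii_punctuation` and the choice of target characters are assumptions made for illustration; a real pipeline may prefer a different mapping or may keep the original characters.

```python
# Longer sequences come first so that "---" is not consumed as "--" plus "-".
# The target characters are one possible choice, not a standard.
REPLACEMENTS = [
    ("---", "\u2014"),  # EM DASH
    ("--", "\u2013"),   # EN DASH
    ("``", "\u201c"),   # LEFT DOUBLE QUOTATION MARK
    ("''", "\u201d"),   # RIGHT DOUBLE QUOTATION MARK
]


def normalize_ascii_punctuation(text: str) -> str:
    """Replace ASCII quote and dash sequences with Unicode characters."""
    for ascii_seq, unicode_char in REPLACEMENTS:
        text = text.replace(ascii_seq, unicode_char)
    return text


print(normalize_ascii_punctuation("``magyar'' --- 1990--2000"))
```

Normalizing in this direction loses the information about how the text was originally typed, so it is safer to keep the raw text alongside the normalized version.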
Let’s go to the sea.

Vámonos  al  mar   .
VERB?    X   NOUN  PUNCT

Vamos  nos   a    el   mar   .
VERB   PRON  ADP  DET  NOUN  PUNCT
“We wake up at five.”
“Our guide wakes us up at five.”
He abdicated in favour of his son Baudouin.
يتنازل عن العرش لابنه بودوان
yatanāzalu   ʿan  al-ʿarši    li+ibni+hi     būdūān
surrendered       the throne  to son his     Baudouin
VERB         ADP  NOUN        ADP+NOUN+PRON  PROPN
viṣṇuśarmedam)
We are now in Valencia.
現在我們在瓦倫西亞。
Xiàn zài wǒ men zài wǎ lún xī yǎ.

現在  我們  在  瓦倫西亞  。
Xiànzài  wǒmen  zài  Wǎlúnxīyǎ  .
Now      we     in   Valencia   .
ADV      PRON   ADP  PROPN      PUNCT
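Going from the unsegmented string to the word sequence requires a segmenter. The sketch below uses greedy forward maximum matching against a word list, a classic baseline that is shown here only for illustration and is not claimed to be the method used for the data in this lecture. The function name `max_match` and the four-entry `TOY_LEXICON` are hypothetical.

```python
def max_match(sentence: str, lexicon: set[str]) -> list[str]:
    """Greedy forward maximum matching: always take the longest known word."""
    max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first; an unknown single character
        # is accepted as a word of its own so the loop always advances.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Toy lexicon invented for this one sentence.
TOY_LEXICON = {"現在", "我們", "在", "瓦倫西亞"}
print(max_match("現在我們在瓦倫西亞。", TOY_LEXICON))
```

The greedy strategy fails whenever a longer lexicon entry overlaps the correct boundary, which is one reason why trained segmenters are normally preferred.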
I went to the beauty salon of Kyōdō [, Beyond-R.]
経堂  の  美容室  に  行っ  て  き  まし  た
Kyōdō  no   biyōshitsu    ni   it    te     ki    mashi  ta
経堂  の  美容室  に  行く  て  来る  ます  た
Kyōdō       beauty-salon  to   go    CONV   come  will   PAST
PROPN  ADP  NOUN          ADP  VERB  SCONJ  AUX   AUX    AUX
Dependency relations in the tree: nmod, case, case, mark, aux, aux, aux
I went to the beauty salon of Kyōdō [, Beyond-R.]
経堂  の  美容室  に  行って  きました
Kyōdō  no   biyōshitsu    ni   itte           kimashita
経堂  の  美容室  に  行く  来る
Kyōdō       beauty-salon  to   going          come
PROPN  ADP  NOUN          ADP  VERB           VERB
                               VerbForm=Conv  VerbForm=Fin Tense=Past Polite=Form
Dependency relations in the tree: nmod, case, case, advcl
I went to the beauty salon of Kyōdō [, Beyond-R.]
経堂の  美容室に  行って  きました
Kyōdōno   biyōshitsuni  itte           kimashita
経堂  美容室  行く  来る
                        going          come
PROPN     NOUN          VERB           VERB
Case=Gen  Case=Dat      VerbForm=Conv  VerbForm=Fin Tense=Past Polite=Form
Dependency relations in the tree: nmod, advcl
All the concrete country roads are the result of…
Tất cả  đường  bêtông    nội đồng  là   thành quả    …
All     road   concrete  country   is   achievement  …
PRON    NOUN   NOUN      NOUN      AUX  NOUN         PUNCT
# text = Il touche environ 100 000 sesterces par an.
1  Il         il        PRON   …  2  nsubj   _  _
2  touche     toucher   VERB   …     root    _  _
3  environ    environ   ADV    …  4  advmod  _  _
4  100 000    100 000   NUM    …  5  nummod  _  _
5  sesterces  sesterce  NOUN   …  2          _  _
6  par        par       ADP    …  7  case    _  _
7  an         an        NOUN   …  2          _  SpaceAfter=No
8  .          .         PUNCT  …  2  punct   _  _
One syntactic word spans several orthographic words?
# text = Bin nach wie vor sehr zufrieden.
# text_en = I am still very satisfied.
1  Bin        sein       AUX    …  6  cop     _  _
2  nach       nach       ADP    …  6          _  _
3  wie        wie        ADV    …  2  fixed   _  _
4  vor        vor        ADP    …  2  fixed   _  _
5  sehr       sehr       ADV    …  6  advmod  _  _
6  zufrieden  zufrieden  ADJ    …     root    _  SpaceAfter=No
7  .          .          PUNCT  …  6          _  _
One syntactic word spans several orthographic words?
I am still very satisfied.
Bin  nach   wie   vor     sehr  zufrieden  .
Am   after  like  before  very  satisfied  .
AUX  ADP    ADV   ADP     ADV   ADJ        PUNCT
Dependency relations in the tree: cop, fixed, fixed, advmod, punct
Some corpora use the underscore character to glue MWEs together.
I am still very satisfied.
Bin  nach_wie_vor       sehr  zufrieden  .
Am   after_like_before  very  satisfied  .
AUX  ADV                ADV   ADJ        PUNCT
Dependency relations in the tree: cop, advmod, advmod, punct
Some corpora use the underscore character to glue MWEs together.
La_prosperidad_por_medio_de_la_investigación_._La_investigación_básica_en_EEUU ” , editado por la Comunidad_de_Madrid , el secretario general de la Confederación_Empresarial_de_Madrid-CEOE ( CEIM ) , Alejandro_Couceiro , abogó por la formación de los investigadores en temas de innovación tecnológica .
# text = Vámonos al mar.
# text_en = Let’s go to the sea.
ID   FORM     LEMMA     UPOS   …  HEAD         _  MISC
1-2  Vámonos  _         _      …  _     _      _  _
1    Vamos    ir        VERB   …        root   _  _
2    nos      nosotros  PRON   …  1            _  _
3-4  al       _         _      …  _     _      _  _
3    a        a         ADP    …  5     case   _  _
4    el       el        DET    …  5     det    _  _
5    mar      mar       NOUN   …  1            _  SpaceAfter=No
6    .        .         PUNCT  …  1     punct  _  _
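The range lines (1-2, 3-4) carry the surface tokens, while the lines with plain integer IDs carry the syntactic words. A small sketch of how a reader might separate the two levels is given below. It works on (ID, FORM) pairs only, the function name `split_tokens_and_words` is hypothetical, and empty nodes (IDs with a decimal point) are assumed not to occur.

```python
def split_tokens_and_words(rows):
    """Return (surface tokens, syntactic words) from (ID, FORM) pairs.

    IDs are strings: "1-2" marks a multiword token, "3" a syntactic word.
    """
    tokens, words = [], []
    covered_until = 0  # last word ID covered by the latest range line
    for id_, form in rows:
        if "-" in id_:
            # Multiword token: a surface token that spans several words.
            covered_until = int(id_.split("-")[1])
            tokens.append(form)
        else:
            words.append(form)
            # A word outside any range is also a surface token.
            if int(id_) > covered_until:
                tokens.append(form)
    return tokens, words


rows = [
    ("1-2", "Vámonos"), ("1", "Vamos"), ("2", "nos"),
    ("3-4", "al"), ("3", "a"), ("4", "el"),
    ("5", "mar"), ("6", "."),
]
tokens, words = split_tokens_and_words(rows)
print(tokens)  # surface tokens
print(words)   # syntactic words
```

Keeping both sequences makes it possible to restore the original text from the surface tokens while tagging and parsing operate on the syntactic words.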
# text = Vámonos al mar.
# text_en = Let’s go to the sea.
ID   FORM     LEMMA     UPOS   …  HEAD         _  MISC
1-2  Vámonos  _         _      …  _     _      _  _
1    Vamos    ir        VERB   …        root   _  _
2    nos      nosotros  PRON   …  1            _  _
3-4  al       _         _      …  _     _      _  _
3    a        a         ADP    …  5     case   _  _
4    el       el        DET    …  5     det    _  _
5-6  mar.     _         _      …  _     _      _  _
5    mar      mar       NOUN   …  1            _  _
6    .        .         PUNCT  …  1     punct  _  _
Correct=annotation
ann otation X X
goeswith
Al-Zaman : American forces killed Shaikh Abdullah al-Ani, the preacher at the mosque in the town of Qaim, near the Syrian border.
GOLD:    Al   -  Zaman  :  American  forces  killed  Shaikh
OFFSET:  0-1  2  3-7    8  9-16      17-22   23-28   29-34
SYSTEM:  Al-Zaman  :  American  forces  killed  Shaikh
OFFSET:  0-7       8  9-16      17-22   23-28   29-34
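The offsets above count only non-whitespace characters, so gold and system tokens can be compared even when the two tokenizations disagree. The sketch below computes such offsets and scores the system with precision, recall, and F1 over exactly matching spans. The scoring formula and the function names `char_spans` and `token_prf` are assumptions made for illustration; the slide itself shows only the offsets.

```python
def char_spans(tokens):
    """Map each token to (first, last) offsets over non-whitespace characters."""
    spans, pos = [], 0
    for tok in tokens:
        n = len("".join(tok.split()))  # ignore whitespace inside a token
        spans.append((pos, pos + n - 1))
        pos += n
    return spans


def token_prf(gold_tokens, system_tokens):
    """Precision, recall and F1 of system tokens whose spans match gold spans."""
    gold = set(char_spans(gold_tokens))
    system = set(char_spans(system_tokens))
    correct = len(gold & system)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = ["Al", "-", "Zaman", ":", "American", "forces", "killed", "Shaikh"]
system = ["Al-Zaman", ":", "American", "forces", "killed", "Shaikh"]
print(char_spans(gold))
print(token_prf(gold, system))
```

On this prefix of the sentence, five of the six system tokens match a gold span, giving precision 5/6 and recall 5/8.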
Die Kosten sind definitiv auch im Rahmen.
GOLD:    Die  Kosten  sind  definitiv  auch   im      Rahmen  .
SPLIT:   Die  Kosten  sind  definitiv  auch   in dem  Rahmen  .
OFFSET:  0-2  3-8     9-12  13-21      22-25  26-27   28-33   34
SYSTEM:  Kosten  sind  definitiv   auch   im     Rahmen  .
SPLIT:   Kosten  sind  de finitiv  auch   im     Rahmen  .
OFFSET:  3-8     9-12  13-21       22-25  26-27  28-33   34
Die Kosten sind definitiv auch im Rahmen.
GOLD:    Die  Kosten  sind  definitiv  auch   im      Rahmen  .
SPLIT:   Die  Kosten  sind  definitiv  auch   in dem  Rahmen  .
OFFSET:  0-2  3-8     9-12  13-21      22-25  26-27   28-33   34
SYSTEM:  auch   im                                   Rahmen  .
SPLIT:   auch   in einem , dem alle zustimmen ,      Rahmen  .
OFFSET:  22-25  26-27                                28-33   34