Natural Language Processing for historical language varieties
Cristina Sánchez Marco, Gjøvik University College, MTL lectures
April 3 2013
Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. Applications: question answering, sentiment analysis, machine translation, information extraction, ...
→ Email:
  Subject: curriculum meeting
  Date: January 15, 2012
  To: Dan
  Hi Dan, we’ve now scheduled the curriculum
→ Create new Calendar entry:
  Event: Curriculum meeting
  Date: Jan-16-2012
  Start: 10:00am
  End: 11:30am
  Where: Gates 159
Morphological analysis and tagging is a basic NLP task useful in many
Part-of-speech tagging: the process of marking up a word in a text as corresponding to a particular part of speech.
  9 parts of speech: noun, verb, adjective, adverb, preposition, determiner, conjunction, pronoun, interjection.
  For example, the word “the” is a determiner.
Lemmatisation: the process of grouping together the different inflected forms of a word so they can be analysed as a single item.
  For example, the word “better” has “good” as its lemma.
→ Words often have more than one POS (“ambiguity”). The POS tagging problem is to determine the POS tag for a particular instance of a word in a sentence.
→ Uses: text-to-speech (how do we pronounce “lead”?), spelling checkers, as input to or to speed up a full parser, OCR scanning, ...
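The ambiguity problem can be illustrated with a minimal lookup sketch. The lexicon below is a hypothetical toy (the entries, the `analyses` and `is_ambiguous` helpers, and the second reading of “better” are illustrative assumptions, not part of any real resource); it only shows why a tagger is needed on top of a dictionary.

```python
# Toy lexicon (hypothetical): each word form lists every
# (lemma, part of speech) reading it may have out of context.
LEXICON = {
    "the": [("the", "determiner")],
    "better": [("good", "adjective"), ("well", "adverb")],
    "lead": [("lead", "noun"), ("lead", "verb")],
}


def analyses(word):
    """Return all (lemma, POS) readings of a word form, or [] if unknown."""
    return LEXICON.get(word.lower(), [])


def is_ambiguous(word):
    """A form is ambiguous when the lexicon offers more than one POS for it."""
    return len({pos for _lemma, pos in analyses(word)}) > 1


if __name__ == "__main__":
    for form in ("the", "better", "lead"):
        print(form, analyses(form), "ambiguous" if is_ambiguous(form) else "unambiguous")
```

The dictionary lookup only enumerates the candidate readings; choosing among them for a given sentence is the tagging step.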
→ Electronic editions prepared by the Hispanic Seminary of Medieval Studies
[fol. 32v] {CB1.
Ca delo q<ue> mas amaua yal viene el mandado
Dozi[en]tos cauall<er>os mando exir p<r><<i>>uado
Q<ue> Reçiban a myanaya & alas duenas fijas dalgo
El sedie en valençia curiando & guardando
Ca bie<n> sabe q<ue> albarfanez t<r><<a>>he todo Recabdo
[fol. 32v] {CB1.
Ca delo q<ue> mas amaua yal viene el mandado
Dozi[en]tos cauall<er>os mando exir p<r><i>uado
Q<ue> Reçiban a myanaya & alas duenas fijas dalgo
El sedie en valençia curiando & guardando
Ca bie<n> sabe q<ue> albarfanez t<r><a>he todo Recabdo
[fol. 32v] {CB1.
Ca delo (de lo) que mas amaua (amaba) yal (ya le) viene el mandado
Dozientos (Doscientos) caualleros (caballeros) mando (mandó) [...]
Que Reçiban (reciban) a myanaya & alas (a las) duenas (dueñas) fijas dalgo (hidalgas)
El (Él) sedie (sería) en valençia (valencia) curiando (curando) & guardando
Ca bien sabe que albarfanez trahe (trae) todo Recabdo (Recaudo)
[fol. 32v] {CB1.
Ca delo que mas amaua yal viene el mandado(,)
Dozientos caualleros mando exir priuado(,)
Que Reçiban a myanaya (Myanaya) & alas duenas fijas dalgo(,)
El sedie en valençia (Valencia) curiando & guardando(,)
Ca bien sabe que albarfanez (Albarfanez) trahe todo Recabdo (recaudo)(.)
[fol. 32v] {CB1.
Ca delo que mas amaua yal viene el mandado (Ca yal viene el mandado delo que mas amaua)
Dozientos caualleros mando exir priuado
Que Reçiban a myanaya & alas duenas fijas dalgo
El sedie en valençia curiando & guardando
Ca bien sabe que albarfanez trahe todo Recabdo
The challenge is to enrich the text with lemmas and POS tags.
1. Manually
2. Build a tagger from scratch
3. Use an existing tool for a modern language variety
   Advantages: resource saving
   Disadvantages: unacceptable accuracy, manual correction
4. Adapt an existing tool for a modern language variety
   Advantages: reusable, sustainable, relatively easy to adapt, resource saving, extensible to other language varieties
   Disadvantages: Is it easy and resource-saving to do this? Is it easy to adapt to
Adapt the tool
Tool: Freeling http://nlp.lsi.upc.edu/freeling
Language: (standard) Modern Spanish
Specific advantages of adapting Freeling:
  well documented and actively maintained
  modular, relatively easy to adapt
[Figure: Freeling architecture. Raw text → analyzer (tokenizer, morphological analysis: dictionary, affixation, probabilities) → tagger → tagged corpus]
Using the existing standard Modern Spanish tool as a basis to create an Old Spanish analyzer:
  Expansion of the dictionary
  Retraining of the tagger
  Modification of other modules: tokenization, affixation
1. Old Spanish Corpus
2. Gold Standard Corpus
3. (Standard) Modern Spanish Corpus
Electronic editions by the Hispanic Seminary of Medieval Studies
Critical editions of the original manuscripts
12th to 16th century Spanish
Representative corpus:
  more than 20 million tokens, 470 thousand types
  variety of genres (fiction and non-fiction)
→ To expand the dictionary
60,000 tokens from the Old Spanish Corpus (50%) and a Modern Spanish tagged corpus (50%)
It mirrors the Old Spanish Corpus in size and text-type distribution
→ To retrain the tagger and carry out the evaluation and error analysis
Corpus LexEsp (Sebastián et al. 2000)
  from 1975 to 1995
  more than 5 million words
  variety of genres
→ baseline performance for the tagger
556,210 Standard Spanish words (669,121 lemma-tag pairs) + 58,435 Old Spanish words = 614,000 word forms (744,160 lemma-tag pairs)
Distribution of words added to the dictionary:
  Verbs          83.4%     Pronouns        1.3%
  Nouns          26.8%     Determiners     1%
  Adjectives      9.4%     Adverbs         0.7%
  Prepositions    2.1%     Conjunctions    0.5%
  Numbers         1.7%     Interjections   0.3%
  Proper names    1.4%     Punctuation     0.01%
Mapping rules:
Substring rules (54 sequences of characters): 42% of the words added
  Old     Modern   Example
  euo     evo      nueuo → nuevo ‘new’
  uio     vio      uio → vio ‘saw’
  [...]   [...]    nuf → nube ‘cloud’
  sp-     esp-     spera → espera ‘wait’
Word rules: 39% of the words added
  consul → cónsul
  catholica → católica
VARD 2 (69 spelling rules): 19% of the words added
  Old     Modern
  j       í
  nn      ñ
  rr      r
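A minimal sketch of how substring rules of this kind could be used to expand the dictionary: each Old Spanish form is rewritten into a candidate modern form and, if that candidate is found in the modern dictionary, the old form inherits its lemma-tag pairs. Only three of the 54 substring rules are shown; `MODERN_LEXICON`, its tags, and the function names are hypothetical placeholders, and this is not Freeling's or VARD 2's actual implementation.

```python
import re

# Three of the substring rules listed above (Old -> Modern).
SUBSTRING_RULES = [
    (re.compile(r"euo"), "evo"),
    (re.compile(r"uio"), "vio"),
    (re.compile(r"^sp"), "esp"),  # "sp-" applies word-initially only
]

# Hypothetical stand-in for the Modern Spanish dictionary:
# word form -> list of (lemma, tag) pairs. Tags are illustrative.
MODERN_LEXICON = {
    "nuevo": [("nuevo", "AQ0MS0")],
    "vio": [("ver", "VMIS3S0")],
    "espera": [("esperar", "VMIP3S0")],
}


def modern_candidates(old_form):
    """Apply each rule on its own and collect the forms that changed."""
    candidates = []
    for pattern, replacement in SUBSTRING_RULES:
        new_form = pattern.sub(replacement, old_form)
        if new_form != old_form and new_form not in candidates:
            candidates.append(new_form)
    return candidates


def expand_dictionary(old_forms):
    """Give each old form the analyses of its first candidate found in the lexicon."""
    new_entries = {}
    for old_form in old_forms:
        for candidate in modern_candidates(old_form):
            if candidate in MODERN_LEXICON:
                new_entries[old_form] = MODERN_LEXICON[candidate]
                break
    return new_entries


if __name__ == "__main__":
    print(expand_dictionary(["nueuo", "uio", "spera", "coita"]))
```

Forms that no rule maps onto a known modern word (here `coita`) stay out of the dictionary, which is one source of the remaining tagging errors.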
Use of the Gold Standard Corpus
Two taggers:
  Hybrid (relax), integrating statistical and hand-coded grammatical rules
  Hidden Markov Model (hmm), trigram Markovian tagger based on TnT (Brants, 2000)
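To give an idea of how an HMM tagger resolves ambiguity, here is a deliberately simplified sketch: a bigram (not trigram) model decoded with the Viterbi algorithm, without the smoothing and suffix handling of TnT. The tag set, the probabilities, and all names are invented for illustration; in practice they would be estimated from the Gold Standard Corpus.

```python
import math

TAGS = ["DET", "PRON", "NOUN", "VERB"]

# Invented toy probabilities (assumptions, not estimated from any corpus).
START_P = {"DET": 0.5, "PRON": 0.3, "NOUN": 0.1, "VERB": 0.1}
TRANS_P = {
    "DET": {"NOUN": 0.9, "VERB": 0.05, "PRON": 0.03, "DET": 0.02},
    "PRON": {"VERB": 0.9, "NOUN": 0.05, "PRON": 0.03, "DET": 0.02},
    "NOUN": {"VERB": 0.5, "DET": 0.2, "PRON": 0.2, "NOUN": 0.1},
    "VERB": {"DET": 0.5, "NOUN": 0.2, "PRON": 0.2, "VERB": 0.1},
}
EMIT_P = {
    "DET": {"la": 1.0},
    "PRON": {"la": 1.0},
    "NOUN": {"casa": 1.0},
    "VERB": {"vio": 1.0},
}


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Most probable tag sequence under a bigram HMM, computed in log space."""
    if not words:
        return []

    def lp(p):
        # Floor zero probabilities so that log() is always defined.
        return math.log(max(p, floor))

    # best[i][t] = (log-probability of the best path ending in t at i, previous tag)
    best = [{t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)), None)
             for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + lp(trans_p[p].get(t, 0)), p) for p in tags
            )
            column[t] = (score + lp(emit_p[t].get(word, 0)), prev)
        best.append(column)

    # Backtrace from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


if __name__ == "__main__":
    print(viterbi(["la", "casa"], TAGS, START_P, TRANS_P, EMIT_P))
    print(viterbi(["la", "vio"], TAGS, START_P, TRANS_P, EMIT_P))
```

With these toy numbers the same form “la” is read as a determiner before a noun and as a clitic pronoun before a verb, because the transition probabilities decide between the two readings.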
C0: original tools for standard Modern Spanish (baseline)
C1-hmm: expanded dict. + modules + hmm trained tagger (60,000-token gold standard corpus)
C1-relax: expanded dict. + modules + relax trained tagger (60,000-token gold standard corpus)

            Lemma   PoS-1   PoS-2
  C0        72.4    70.9    77.4
  C1-hmm    95.8    90.1    95.3
  C1-relax  95.8    92.6    95.7
  SS        99.1    94      97.6

→ PoS-1: whole label. E.g. viene VMIP3S0
→ PoS-2: word class. E.g. viene VMIP3S0
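The three measures can be sketched as token-level accuracies over (form, lemma, tag) triples. The function names and the sample data are hypothetical, and how much of the tag counts as the “word class” for PoS-2 is an assumption here: the `class_prefix_len` parameter defaults to the first character of the tag.

```python
def accuracy(gold, predicted, key):
    """Percentage of tokens for which key(gold) equals key(predicted)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty corpus")
    hits = sum(1 for g, p in zip(gold, predicted) if key(g) == key(p))
    return 100.0 * hits / len(gold)


def evaluate(gold, predicted, class_prefix_len=1):
    """Lemma, PoS-1 (whole tag) and PoS-2 (tag prefix) accuracy.

    Tokens are (form, lemma, tag) triples. class_prefix_len is an
    assumption about how many tag characters encode the word class.
    """
    return {
        "Lemma": accuracy(gold, predicted, lambda t: t[1]),
        "PoS-1": accuracy(gold, predicted, lambda t: t[2]),
        "PoS-2": accuracy(gold, predicted, lambda t: t[2][:class_prefix_len]),
    }


if __name__ == "__main__":
    # Hypothetical four-token sample: one tag differs in tense and person only.
    gold = [
        ("viene", "venir", "VMIP3S0"),
        ("el", "el", "DA0MS0"),
        ("mandado", "mandado", "NCMS000"),
        ("mando", "mandar", "VMIS3S0"),
    ]
    predicted = gold[:3] + [("mando", "mandar", "VMIP1S0")]
    print(evaluate(gold, predicted))
```

An error that keeps the word class but gets the morphological features wrong lowers PoS-1 and leaves PoS-2 untouched, which is why the PoS-2 column is consistently higher.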
100 most frequent errors in the tagging, under condition C1-relax
81.6% of the errors are due to ambiguity:
  determiner vs clitic readings of la, las ‘the/it’
  first vs third person singular (queria ‘I/he wanted’)
  accentuation (llego ‘arrive/arrived’) (standard: llego/llegó)
18.4% of the errors are words not in the dictionary:
  proper nouns (pierres, antolinez)
  words not covered by any mapping rule (coita ‘wish’) (standard: cuita)
Ca delo que mas amaua yal viene el mandado
Dozientos caualleros mando exir priuado
Que Reçiban a myanaya & alas duenas fijas dalgo
El sedie en valençia curiando & guardando
Ca bien sabe que albarfanez trahe todo Recabdo
Ca/ca/CS delo de/de/SPS00 delo el/el/DA0NS0 que/que/PR0CN000 mas/más/RG amaua/amar/VMII3S0 yal/yal/AQ0CS0 viene/venir/VMIP3S0 el/el/DA0MS0 mandado/mandado/NCMS000
Dozientos/200/Z caualleros/caballero/NCMP000 mando/mandar/VMIP1S0 exir/salir/VMN0000 priuado/privar/VMP00SM
Que/que/PR0CN000 Reçiban/recibir/VMSP3P0 a/a/SPS00 myanaya/myanaya/NCFS000 &/y/CC alas a/a/SPS00 alas el/el/DA0FP0 duenas/duenas/AQ0FP0 fijas/fijo/AQ0FP0 dalgo/dalgo/NCMS000
El/él/PP3MS000 sedie/ser/VSIC3S0 en/en/SPS00 valençia/valencia/NCFS000 curiando/curiando/VMG0000 &/y/CC guardando/guardar/VMG0000
Ca/ca/CS bien/bien/RG sabe/saber/VMIP3S0 que/que/CS albarfanez/albarfanez/NCFS000 trahe/traer/VMIP3S0 todo/todo/DI0MS0 Recabdo/recaudo/NCMS000
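Output in this form/lemma/tag layout is easy to load for further processing. The sketch below is one possible reader; the `Token` type and `parse_tagged` function are hypothetical, and treating a bare item such as `delo` as the contracted surface form of the analysis that follows it is an assumption read off the sample above, not a documented format.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    form: str
    lemma: str
    tag: str
    source: Optional[str]  # contracted surface form this token was split from, if any


def parse_tagged(text: str) -> List[Token]:
    """Parse whitespace-separated form/lemma/tag items into Token records."""
    tokens = []
    source = None
    for item in text.split():
        if item.count("/") >= 2:
            # Split from the right so the lemma and tag are always the last two fields.
            form, lemma, tag = item.rsplit("/", 2)
            tokens.append(Token(form, lemma, tag, source))
            source = None
        else:
            # Assumed to be the surface form of a contraction (e.g. "delo").
            source = item
    return tokens


if __name__ == "__main__":
    sample = "Ca/ca/CS delo de/de/SPS00 delo el/el/DA0NS0 que/que/PR0CN000"
    for token in parse_tagged(sample):
        print(token)
```

Keeping the original contracted form alongside each component makes it possible to map the annotation back onto the manuscript text.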
Simple and general method to adapt an existing tool for standard Modern Spanish in order to annotate Old Spanish
Advantages: resource-saving, sustainable, relatively easy
Benefit from:
  similarities between historical and modern standard language varieties
  existing tools
  existing philological editions of old manuscripts
The quality of the tagging is close to that of state-of-the-art taggers (but it can still be improved!)
The greatest improvement is obtained when the lexicon is expanded with non-standard variants
Method extensible to other non-standard language varieties
→ Other NLP applications can be built to deal with historical language varieties: syntactic and semantic analyzers, ...
Thanks also to Gemma Boleda, Lluís Padró