Tagging in a nutshell: Forewords


  1. Tagging in a nutshell

     Vincent Claveau
     IRISA - CNRS, Rennes, France
     Master 2 RI/BIG/MTIBH : DSS/ADT

     Forewords

     Sources
     ◮ Slides inspired by M. Rajman and J.-C. Chappelier, EPFL

     Vocabulary
     ◮ tagging, French: étiquetage
     ◮ tag, Fr. étiquette
     ◮ Part-Of-Speech (partie-du-discours), morpho-syntactic categories, grammatical categories

     Outline
     ◮ Generalities
     ◮ Symbolic approaches
     ◮ Stochastic approaches
     ◮ Zoom on TreeTagger
     ◮ Conclusion

     Generalities: The big picture

     What do we want to do?
     ◮ assign a sequence of symbols (tags) to a sequence of symbols (sentence)
     ◮ usually: one output symbol for each input symbol
     ◮ in this course, only one/two well-known approaches

     In the framework of the track P4
     ◮ cf. HMM and other stochastic approaches

     Generalities: About the tagging task

     Goals
     ◮ associate morpho-syntactic information to each word-form
     ◮ finite number of tags: tagset
     ◮ collateral advantage: lemmatization

     Interests
     ◮ generalization: suppress morphological and lexical variability
     ◮ reducing the vocabulary size

     Generalities: About lemmatization

     Reducing word-forms to their lemmas
     ◮ lemma = canonical form of word-forms that differ only by inflection
     ◮ canonical form is language-dependent and arbitrary

     Examples (French)
     ◮ ADJ: médicales → médical
     ◮ NOUN: chiens → chien, but chiennes → chienne

  2. Generalities: Examples of tagged texts 1/2

         Vous/PRV:pl faites/VCJ:pl preuve/SBC:sg de/PREP mesure/SBC:sg dans/PREP vos/DTN:pl propos/SBC:pl ,/,
         et/COO votre/DTN:sg discours/SBC:sg est/ECJ:sg toujours/ADV empreint/ADJ1PAR:sg de/PREP
         réserve/SBC:sg ./. Vous/PRV:pl n’/ADV êtes/ECJ:pl certainement/ADV pas/ADV indifférent/SBC:sg ,/,
         mais/COO peu/ADV expansif/SBC:pl ./. Votre/DTN:sg approche/SBC:sg plutôt/ADV formaliste/SBC:sg
         peut/VCJ:sg amener/VNCFF vos/DTN:pl interlocuteurs/SBC:pl à/PREP penser/VNCFF que/SUB vous/PRV:pl
         portez/VCJ:pl une/DTN:sg grande/ADJ:sg attention/SBC:sg aux/DTC:pl conventions/SBC:pl ou/COO
         aux/DTC:pl usages/SBC:pl ./. Votre/DTN:sg comportement/SBC:sg peut/VCJ:sg ,/, par/PREP contre/PREP ,/,
         paraître/VNCFF assez/ADV fermé/ADJ2PAR:sg à/PREP ceux/PRO:pl qui/REL ont/ACJ:pl coutume/ADJ:sg
         de/PREP réagir/VNCFF spontanément/ADV ./. Votre/DTN:sg approche/SBC:sg sérieuse/ADJ:sg peut/VCJ:sg
         amener/VNCFF vos/DTN:pl interlocuteurs/SBC:pl à/PREP penser/VNCFF que/SUB vous/PRV:pl
         considérez/VCJ:pl le/DTN:sg temps/SBC:sg comme/SUB un/DTN:sg...

     Generalities: Examples of tagged texts 2/2

         ===== DÉBUT DE PHRASE =====
         1 3 6 Bien sûr bien sûr ADV 0x0000 Rgp - H 1 oblige
         2 3 6 , , PCTFAIB - Ypw - H 1 oblige
         3 3 6 rien rien A2 PII 0xE080 Pi-.sn 3—3 S 1 oblige
         4 3 6 n’ ne A2 ADV 0x0200 Rpn 5 V 1 oblige
         5 3 6 oblige obliger A5 VINDP3S - Vmip3s 5 V 1 oblige
         6 3 6 un un A3 DETIMS 0xA000 Da-ms-i 7—7 D 1 oblige
         7 3 6 site Web site web NCMS 0xA040 Ncms 7—7 D 1 oblige
         8 3 6 à à PREP 0x0000 Sp 9 F 1 oblige
         9 3 6 choisir choisir VINF - Vmn– 9 F 1 oblige
         10 3 6 un un A3 DETIMS 0xA000 Da-ms-i 11 D 1 oblige
         11 3 6 nom nom NCMS 0xA040 Ncms 11 D 1 oblige
         12 3 6 en en A3 PREP 0x0000 Sp 13 H 1 oblige
         13 3 6 www www NCI 0xF020 Nc.. 13 H 1 oblige
         14 3 6 : : PCTFORTE - Yps - - 0
         ===== FIN DE PHRASE =====
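The word/TAG notation of the first example is easy to read programmatically. The sketch below is only an illustration: the function name `parse_tagged` is hypothetical, and it assumes that tokens are separated by whitespace and that the tag follows the last slash of each token, which is what the example suggests.

```python
from typing import List, Tuple


def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs.

    The split is made on the LAST slash, so punctuation tokens such as
    ',/,' or './.' still yield the punctuation mark as the word.
    """
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


# Opening of the first tagged example.
sample = ("Vous/PRV:pl faites/VCJ:pl preuve/SBC:sg de/PREP mesure/SBC:sg "
          "dans/PREP vos/DTN:pl propos/SBC:pl ,/,")
pairs = parse_tagged(sample)

# Keep the words whose main category (the part before ':') is SBC.
nouns = [word for word, tag in pairs if tag.split(":")[0] == "SBC"]
print(nouns)
```

Separating the main category from the number feature (`sg`/`pl`) makes it possible to count tags category by category.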
     Generalities: Problems

     Ambiguities
     ◮ most words are polyfunctional
     ◮ Ex. Fr.: règle can be a common noun or a verb (indicative 1st person, 3rd person, subjunctive...)
     ◮ depends on the tagset

     Contextual disambiguation
     ◮ use context to choose the most reliable part-of-speech
       ◮ je règle la longueur avec la règle ("I adjust the length with the ruler")
     ◮ hard task, not always possible
       ◮ la belle ferme le voile ("the beautiful woman closes the veil" or "the beautiful farm veils it")
       ◮ la petite brise la glace ("the little girl breaks the ice" or "the light breeze chills her")

     Unknown word-form
     ◮ named entities
       ◮ person names, places, companies...
     ◮ imports
       ◮ words, phrases or sentences from another language: leasing ...
     ◮ specialized terms
       ◮ from specialized domains: parenthésage, kinesimétrie ...
     ◮ language register
       ◮ je la kiffe à donf, un fruit sur

     Generalities: Formalization of the task

     Sequence to sequence
     ◮ for each word, given its context, find the correct tag
     ◮ correct means there exists a ground truth (given by a human expert), but even humans may disagree in some cases

     Two families of approaches
     ◮ symbolic: Brill’s tagger
     ◮ stochastic: Multext tagger (HMM)

     Generalities: Evaluation

     Comparison with ground-truth
     ◮ human annotation
     ◮ costly: only a large enough sample of the corpus is annotated

     Standard measures
     ◮ precision (sometimes recall)
     ◮ possibly evaluation category by category

  3. Generalities: Evaluation

     Exercise
     ◮ from example 1, compute the precision and recall of the tagger on common nouns (SBC)
       ◮ R = 14/15, P = 14/17
     ◮ how could you easily obtain a 100% recall?
       ◮ what would the precision be then?
     ◮ how could you easily obtain a 100% precision?
       ◮ what would the recall be then?

     Outline
     ◮ Generalities
     ◮ Symbolic approaches
       ◮ Brill’s tagger
       ◮ Transducers
     ◮ Stochastic approaches
     ◮ Zoom on TreeTagger
     ◮ Conclusion

     Symbolic approaches: Brill’s tagger

     Overview

     Well-known approach
     ◮ much used during the 90’s
     ◮ freely available, developed for many languages
     ◮ conceptually simple

     Error-driven transformation-based tagger
     ◮ error-driven → supervised learning
     ◮ transformation-based → using induced transformation rules
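As an illustration of the two measures used in the exercise, here is a minimal sketch of per-category precision and recall. The function name and the five-token toy annotation are assumptions made for the example; they are not the gold standard of example 1, which the slides do not reproduce.

```python
from typing import Sequence, Tuple


def precision_recall(gold: Sequence[str],
                     predicted: Sequence[str],
                     tag: str) -> Tuple[float, float]:
    """Precision and recall of one tag, comparing tagger output to ground truth."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p == tag)
    n_predicted = sum(1 for p in predicted if p == tag)  # tokens the tagger labelled `tag`
    n_gold = sum(1 for g in gold if g == tag)            # tokens that really are `tag`
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall


# Hypothetical toy annotation (not taken from the slides).
gold      = ["DTN", "SBC", "VCJ", "ADJ", "SBC"]
predicted = ["DTN", "SBC", "VCJ", "SBC", "ADJ"]
p, r = precision_recall(gold, predicted, "SBC")   # 1 correct out of 2 predicted, 2 expected

# Degenerate tagger that labels every token SBC: recall is perfect, precision is not.
p_all, r_all = precision_recall(gold, ["SBC"] * len(gold), "SBC")

# Counts behind the figures given in the exercise: 14 correct, 17 predicted, 15 expected.
p_ex, r_ex = 14 / 17, 14 / 15
print(p, r, p_all, r_all, round(p_ex, 3), round(r_ex, 3))
```

With the counts from the exercise, precision is 14/17 (about 0.82) and recall is 14/15 (about 0.93).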

  4. Symbolic approaches: Brill’s tagger

     Brill’s algorithm

     Input
     ◮ PoS lexicon: for each word-form, list of all the possible tags

     Initialization
     ◮ for known words (i.e. in the lexicon): most frequent tag for this word-form
     ◮ for unknown words
       ◮ 1992: proper noun for words with a capital, noun for others
       ◮ 1994: machine learning of “guessing rules”

     Learning transformation rules
     ◮ for each rule, compute a score = # errors before applying the rule minus # errors after
     ◮ choose the best rule, add it to the rule base
     ◮ repeat while rules with score > threshold are proposed

     Type of transformation rules
     ◮ lexical: assign a tag to an unknown word (not in lexicon)
     ◮ contextual: change the tag of a given word based on its context

     Examples of rules
     ◮ lexical: if condition then word ← tag
       ◮ suffix(word) = x or xy or xyz
       ◮ prefix(word) = x or xy or xyz
       ◮ word contains character x
       ◮ suppressing prefix/suffix gives a known word
       ◮ word is preceded by w’ (fixed for a given rule)
     ◮ contextual: if condition then tag ← tag
       ◮ (1st/2nd/3rd) tag before/after word is X
       ◮ tag bigram before/after word is YZ
       ◮ preceding or next word is W’
       ◮ word is W and preceding or next word is W’
       ◮ word is W and preceding or next tag is Z

     Symbolic approaches: Transducers

     Inferring transducers
     ◮ to be done; not covered in the course this year

     Stochastic approaches
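The initialization and learning steps above can be sketched in a few lines. This is a simplified illustration, not Brill's actual implementation: it uses a single contextual template ("change tag A to B when the preceding tag is Z"), evaluates rule conditions on the tags as they were before the rule is applied, and relies on a hypothetical mini-lexicon and a simplified tagset (PRO, VERB, DET, NOUN, PREP, PROPN) invented for the example.

```python
from typing import Dict, List, Tuple

Rule = Tuple[str, str, str]  # (from_tag, to_tag, previous_tag)


def initial_tags(words: List[str], lexicon: Dict[str, str]) -> List[str]:
    """Known words get their most frequent tag; unknown words follow the
    1992 heuristic (proper noun if capitalized, noun otherwise)."""
    tags = []
    for word in words:
        if word in lexicon:
            tags.append(lexicon[word])
        else:
            tags.append("PROPN" if word[:1].isupper() else "NOUN")
    return tags


def apply_rule(tags: List[str], rule: Rule) -> List[str]:
    """Change from_tag to to_tag wherever the preceding tag is previous_tag.
    Conditions are checked on the input tags, not on already-rewritten ones."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


def count_errors(tags: List[str], gold: List[str]) -> int:
    return sum(t != g for t, g in zip(tags, gold))


def learn_rules(tags: List[str], gold: List[str],
                threshold: int = 0) -> Tuple[List[Rule], List[str]]:
    """Greedy loop: keep the rule with the best score
    (# errors before minus # errors after) while that score > threshold."""
    rules: List[Rule] = []
    while True:
        # Candidate rules are instantiated only at currently mis-tagged positions.
        candidates = {(tags[i], gold[i], tags[i - 1])
                      for i in range(1, len(tags)) if tags[i] != gold[i]}
        best_rule, best_score = None, threshold
        for rule in sorted(candidates):
            score = count_errors(tags, gold) - count_errors(apply_rule(tags, rule), gold)
            if score > best_score:
                best_rule, best_score = rule, score
        if best_rule is None:
            return rules, tags
        rules.append(best_rule)
        tags = apply_rule(tags, best_rule)


# Hypothetical lexicon: one "most frequent" tag per word-form.
lexicon = {"je": "PRO", "règle": "NOUN", "la": "DET", "longueur": "NOUN", "avec": "PREP"}
words = "je règle la longueur avec la règle".split()
gold = ["PRO", "VERB", "DET", "NOUN", "PREP", "DET", "NOUN"]

start = initial_tags(words, lexicon)
rules, final = learn_rules(start, gold)
print(rules, final)
```

On this toy sentence the only initial error is the first "règle", so a single rule (NOUN becomes VERB after PRO) is learned. Because every accepted rule must strictly reduce the number of errors, the loop always terminates.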
