something from nothing

Arne Skjærholt, LTG seminar


  1. Something from nothing
      Arne Skjærholt, LTG seminar
      Sections: The project; Using Czech to parse Latin

  2. The goal
      - Integrating linguistic and data-driven methods
        - Use linguistic knowledge to guide data-driven methods
        - Leverage data-driven approaches to inform linguistic and rule-driven methods?

  5. What to do?
      - Focus on syntax
      - Focus on languages with few resources up-front
        - Norwegian
          - Decent resources at word level
          - No syntactic resources
        - Latin
          - Long tradition of linguistic inquiry
          - Quality and quantity of annotated data extremely variable

  6. Plans
      - Dependency corpus adaptation
      - Constrained CRF models
      - Annotation studies

  9. Current project
      1. Take a large corpus
      2. Remove 90% of the information in it
      3. ???
      4. Profit!

  12. The general idea
      1. Delexicalise source language corpus
      2. Train language model over target language PoS sequences
      3. Filter source corpus with LM
      4. Train model, parse target
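Step 1 can be illustrated with a minimal sketch. The slides do not say exactly which fields delexicalisation removes; the code below assumes tab-separated CoNLL-X rows (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and assumes that only the word form and lemma are blanked. The function name `delexicalise` is hypothetical.

```python
def delexicalise(conll_text):
    """Blank the FORM and LEMMA columns of CoNLL-X formatted text.

    Assumption: rows are tab-separated with FORM in column 2 and LEMMA in
    column 3; blank lines separate sentences and are passed through.
    """
    out = []
    for line in conll_text.splitlines():
        if not line.strip():
            out.append("")          # sentence boundary
            continue
        cols = line.split("\t")
        cols[1] = "_"               # FORM
        cols[2] = "_"               # LEMMA
        out.append("\t".join(cols))
    return "\n".join(out)


sample = (
    "1\tPuella\tpuella\tn\tn\t_\t2\tSBJ\t_\t_\n"
    "2\tcantat\tcanto\tv\tv\t_\t0\tPRED\t_\t_\n"
)
delexicalised = delexicalise(sample)
```

After this step only the PoS tags, morphology, heads and dependency labels remain, which is what lets a parser trained on one language be applied to another.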

  14. Corpora
      - Prague Dependency Treebank (PDT)
        - 1.5M tokens
        - Dependency syntax and complex morphological annotation
      - Latin Dependency Treebank (LDT)
        - 53,143 tokens
        - Annotation scheme based on PDT

  16. Parsing Latin
      - Previous baseline: MSTParser, 65% unlabelled, 53% labelled accuracy (Bamman & Crane 2008)
      - New baseline: MSTParser, 64% unlabelled, 54% labelled
      - Prose/poetry distribution (tokens): prose 40,884; poetry 12,259
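For readers unfamiliar with the two accuracy figures: unlabelled accuracy counts tokens whose predicted head is correct, and labelled accuracy additionally requires the dependency label to match. The sketch below shows that computation on toy data. The function name and the toy labels are illustrative, and the slides do not state whether punctuation tokens were scored, so none are excluded here.

```python
def attachment_scores(gold, predicted):
    """Return (unlabelled, labelled) accuracy.

    gold and predicted are equal-length lists of (head, deprel) pairs,
    one pair per token, in the same token order.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        return 0.0, 0.0
    head_hits = sum(1 for (gh, _), (ph, _) in zip(gold, predicted) if gh == ph)
    full_hits = sum(1 for g, p in zip(gold, predicted) if g == p)
    return head_hits / len(gold), full_hits / len(gold)


# Toy example: 3 of 4 heads correct, 2 of 4 head+label pairs correct.
gold = [(2, "SBJ"), (0, "PRED"), (2, "OBJ"), (2, "ADV")]
pred = [(2, "SBJ"), (0, "PRED"), (2, "ADV"), (3, "ADV")]
uas, las = attachment_scores(gold, pred)
```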

  17. Workflow
      - PDT -> reformat -> CoNLL -> tagset map -> common tagset(s) -> delexicalise -> delexicalised -> filter (with the LM) -> train -> parser
      - LDT -> reformat -> CoNLL -> tagset map -> common tagset(s) -> delexicalise -> delexicalised -> train -> LM
      - Parser -> parse Latin

  19. Tagsets
      - LDT annotation guidelines derived from PDT
      - PoS mappings:
        - LDT has a participle tag
        - Czech has particles, Latin doesn’t
      - Deprel mappings:
        - Reflexive tantum
        - Reflexive passive
        - Emotional dative
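The slides name the mismatches between the two tagsets but not how they were resolved. As a purely illustrative sketch, a mapping to a common tagset can be expressed as a lookup table per treebank. Everything in the tables below is an assumption: descriptive names stand in for the real tag codes, and folding participles into verbs or particles into a catch-all class is only one possible choice, not necessarily the one used in the project.

```python
# Hypothetical mapping tables; the actual mappings are not given in the slides.
LDT_POS_MAP = {"participle": "verb"}   # assumption: fold participles into verbs
PDT_POS_MAP = {"particle": "other"}    # assumption: particles have no Latin counterpart


def to_common_tagset(tags, mapping):
    """Map each tag through `mapping`; tags without an entry pass through unchanged."""
    return [mapping.get(tag, tag) for tag in tags]


latin_tags = to_common_tagset(["noun", "participle", "verb"], LDT_POS_MAP)
czech_tags = to_common_tagset(["particle", "noun"], PDT_POS_MAP)
```

The same table-driven approach would apply to the dependency relation mappings, with one table per treebank-specific label.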

  21. Data splits
      - PDT:
        - 8 training folds
        - development fold
        - evaluation fold
      - LDT:
        - Distributed as one file/author
        - Round-robin split into 10 folds
        - Fold 10 held out for evaluation
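A round-robin split deals items to folds in turn, so each fold draws on every author's file. The sketch below assumes the unit being dealt is the sentence and that sentences are taken in corpus order; neither detail is stated in the slides, and the function name is hypothetical.

```python
def round_robin_folds(sentences, n_folds=10):
    """Deal sentences to folds in turn: sentence i goes to fold i mod n_folds."""
    folds = [[] for _ in range(n_folds)]
    for i, sentence in enumerate(sentences):
        folds[i % n_folds].append(sentence)
    return folds


# Toy corpus of 25 "sentences" (integers stand in for parsed sentences).
folds = round_robin_folds(list(range(25)))
training_folds = folds[:-1]   # folds 1 to 9
evaluation_fold = folds[-1]   # fold 10, held out
```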

  22. Language modelling
      - LM over LDT PoS sequences
      - Best order: trigrams
      - Best smoothing: constant discounting (D = 0.1)
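The slides do not name the language modelling toolkit or the exact smoothing variant. As a sketch under that caveat, the code below reads "constant discounting" as interpolated absolute discounting: a constant D is subtracted from every observed n-gram count and the freed probability mass is redistributed via the lower-order model. The class name, the `<s>`/`</s>` padding symbols and the uniform base distribution are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class DiscountedNgramLM:
    """N-gram model over tag sequences with interpolated absolute discounting."""

    def __init__(self, sequences, order=3, discount=0.1):
        self.order = order
        self.discount = discount
        self.ngram_counts = Counter()       # (history, tag) -> count
        self.history_counts = Counter()     # history -> count
        self.followers = defaultdict(set)   # history -> distinct tags seen after it
        self.vocab = {"</s>"}
        for seq in sequences:
            self.vocab.update(seq)
            padded = self._pad(seq)
            for i in range(order - 1, len(padded)):
                for n in range(order):      # history lengths 0 .. order-1
                    history = tuple(padded[i - n:i])
                    self.ngram_counts[(history, padded[i])] += 1
                    self.history_counts[history] += 1
                    self.followers[history].add(padded[i])

    def _pad(self, seq):
        return ["<s>"] * (self.order - 1) + list(seq) + ["</s>"]

    def prob(self, tag, history):
        history = tuple(history)
        # Lower-order estimate; the recursion bottoms out in a uniform distribution.
        lower = self.prob(tag, history[1:]) if history else 1.0 / len(self.vocab)
        total = self.history_counts[history]
        if total == 0:
            return lower                    # unseen history: use the lower order only
        discounted = max(self.ngram_counts[(history, tag)] - self.discount, 0.0) / total
        reserved = self.discount * len(self.followers[history]) / total
        return discounted + reserved * lower

    def perplexity(self, seq):
        padded = self._pad(seq)
        start = self.order - 1
        log_prob = sum(
            math.log(self.prob(padded[i], padded[i - start:i]))
            for i in range(start, len(padded))
        )
        return math.exp(-log_prob / (len(padded) - start))


# Toy PoS sequences standing in for LDT sentences.
lm = DiscountedNgramLM([["n", "v"], ["a", "n", "v"], ["n", "v", "d"]])
familiar = lm.perplexity(["n", "v"])
unfamiliar = lm.perplexity(["d", "d", "a"])
```

A tag sequence that resembles the training data receives a lower perplexity than one that does not, which is the property the filtering step relies on.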

  23. PDT perplexity
      [Figure: histogram; x-axis: Perplexity (0 to 50); y-axis: Frequency (0 to 10000)]
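Given per-sentence perplexities like those summarised here, filtering the source corpus could be as simple as a threshold. The sketch below assumes that sentences scoring at or below a cutoff are kept, on the reasoning that low perplexity under the Latin model means a more Latin-like tag sequence. The cutoff value, the function name and the toy scores are all hypothetical.

```python
def filter_by_perplexity(sentences, score, cutoff):
    """Keep the sentences whose score(sentence) is at most `cutoff`.

    `score` is any callable returning a perplexity for a sentence.
    """
    return [sentence for sentence in sentences if score(sentence) <= cutoff]


# Toy scores standing in for language model perplexities.
toy_scores = {"sent-a": 8.5, "sent-b": 31.0, "sent-c": 12.0}
kept = filter_by_perplexity(list(toy_scores), toy_scores.get, cutoff=15.0)
```

Sweeping the cutoff and retraining the parser at each value would trace out a performance-versus-cutoff curve.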

  25. Parser optimisation
      - Do parameter tuning on the Czech development set
      - Numbers forthcoming...

  26. Future work
      - Further analysis of Latin baseline
        - Per author/genre performance
        - Why is MaltParser so bad?
      - Feature engineering
      - Learning curve: performance vs. perplexity cutoff

  27. Further forward
      - Extend workflow to Talbanken/Norwegian Dependency Treebank
      - Evaluate impact of preprocessing data for annotation
        - Annotation speed?
        - Annotator agreement?
        - Annotator error?
