USING CZECH TO PARSE LATIN
Something from nothing

Arne Skjærholt, LTG seminar
THE GOAL
◮ Integrating linguistic and data-driven methods
◮ Use linguistic knowledge to guide data-driven methods
◮ Leverage data-driven approaches to inform linguistic and rule-driven methods?
WHAT TO DO?
◮ Focus on syntax
◮ Focus on languages with few resources up front
◮ Norwegian
  ◮ Decent resources at the word level
  ◮ No syntactic resources
◮ Latin
  ◮ Long tradition of linguistic inquiry
  ◮ Quality and quantity of annotated data extremely variable
PLANS
◮ Dependency corpus adaptation
◮ Constrained CRF models
◮ Annotation studies
CURRENT PROJECT
1. Take a large corpus
2. Remove 90% of the information in it
3. ???
4. Profit!
THE GENERAL IDEA
1. Delexicalise the source-language corpus
2. Train a language model over target-language PoS sequences
3. Filter the source corpus with the LM
4. Train a parsing model, parse the target
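Step 1 can be sketched concretely. Assuming CoNLL-X column order (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, ...), delexicalising amounts to blanking the lexical columns while keeping the tags and dependency structure:

```python
def delexicalise_conll(lines):
    """Blank out the FORM and LEMMA columns of CoNLL-X lines,
    keeping PoS, features and dependency columns intact."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:            # empty line = sentence boundary
            out.append(line)
            continue
        cols = line.split("\t")
        cols[1] = "_"           # FORM
        cols[2] = "_"           # LEMMA
        out.append("\t".join(cols))
    return out
```

A parser trained on such data can only condition on tags and morphology, which is what makes cross-lingual transfer possible at all.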
CORPORA
◮ Prague Dependency Treebank (PDT)
  ◮ 1.5M tokens
  ◮ Dependency syntax and complex morphological annotation
◮ Latin Dependency Treebank (LDT)
  ◮ 53,143 tokens
  ◮ Annotation scheme based on the PDT
PARSING LATIN
◮ Previous baseline: MSTParser, 65% unlabelled, 53% labelled accuracy (Bamman & Crane 2008)
◮ New baseline: MSTParser, 64% unlabelled, 54% labelled accuracy

Prose/poetry distribution (tokens):
  Prose   40,884
  Poetry  12,259
WORKFLOW
[Flowchart: the PDT and the LDT are each reformatted to CoNLL, mapped to common tagset(s), and delexicalised; an LM is trained on the delexicalised LDT and used to filter the delexicalised PDT; a parser is trained on the filtered data and parses Latin.]
TAGSETS
◮ LDT annotation guidelines derived from the PDT
◮ PoS mappings:
  ◮ The LDT has a participle tag
  ◮ Czech has particles, Latin doesn't
◮ Deprel mappings:
  ◮ Reflexive tantum
  ◮ Reflexive passive
  ◮ Emotional dative
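A PoS mapping of this kind can be implemented as a small lookup table plus a default. The single-letter tags and target labels below are illustrative, not the actual LDT/PDT inventories:

```python
# Hypothetical coarse-tag mapping from LDT-style single-letter tags to a
# common tagset. The participle tag ("t") is folded into VERB; the Czech
# particle tag simply never occurs on the Latin side, so it needs no entry.
LDT_TO_COMMON = {
    "n": "NOUN", "v": "VERB", "t": "VERB",   # participle -> VERB
    "a": "ADJ",  "d": "ADV",  "c": "CONJ",
    "r": "ADP",  "p": "PRON", "m": "NUM",
    "i": "INTJ", "u": "PUNCT",
}

def map_tag(tag, table=LDT_TO_COMMON, default="X"):
    """Map a treebank-specific tag to the common tagset."""
    return table.get(tag, default)
```

The deprel mappings (reflexive tantum, reflexive passive, emotional dative) are the harder cases, since they depend on context rather than a one-to-one label correspondence.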
DATA SPLITS
◮ PDT:
  ◮ 8 training folds
  ◮ 1 development fold
  ◮ 1 evaluation fold
◮ LDT:
  ◮ Distributed as one file per author
  ◮ Round-robin split into 10 folds
  ◮ Fold 10 held out for evaluation
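A round-robin split deals sentences into folds like cards, so every author's file contributes roughly evenly to each fold. A sketch (the real split may operate on different units):

```python
def round_robin_folds(sentences, k=10):
    """Deal the input sentences into k folds in round-robin order:
    sentence i goes into fold i mod k."""
    folds = [[] for _ in range(k)]
    for i, sent in enumerate(sentences):
        folds[i % k].append(sent)
    return folds
```

With per-author files this avoids the alternative failure mode of contiguous splits, where a whole author (and genre) could end up in a single fold.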
LANGUAGE MODELLING
◮ LM over LDT PoS sequences
◮ Best order: trigrams
◮ Best smoothing: constant discounting (D = 0.1)
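A trigram model with constant (absolute) discounting subtracts D from every observed n-gram count and gives the freed mass to the next-lower-order distribution. The sketch below assumes a closed tag vocabulary and an interpolated variant of the discounting; the exact smoothing used here may differ:

```python
import math
from collections import Counter, defaultdict

class DiscountedTrigramLM:
    """Trigram LM over PoS-tag sequences with constant (absolute)
    discounting D, interpolating each order with the next lower one."""

    def __init__(self, sequences, D=0.1):
        self.D = D
        self.uni, self.bi, self.tri = Counter(), Counter(), Counter()
        self.bi_followers = defaultdict(set)    # v  -> {w : c(vw) > 0}
        self.tri_followers = defaultdict(set)   # uv -> {w : c(uvw) > 0}
        for tags in sequences:
            s = ["<s>", "<s>"] + list(tags) + ["</s>"]
            for i, w in enumerate(s):
                self.uni[w] += 1
                if i >= 1:
                    self.bi[(s[i - 1], w)] += 1
                    self.bi_followers[s[i - 1]].add(w)
                if i >= 2:
                    self.tri[(s[i - 2], s[i - 1], w)] += 1
                    self.tri_followers[(s[i - 2], s[i - 1])].add(w)
        self.N = sum(self.uni.values())

    def p_uni(self, w):
        # Discount each seen type by D; redistribute the mass uniformly
        # over the (closed) tag vocabulary.
        return (max(self.uni[w] - self.D, 0.0) + self.D) / self.N

    def p_bi(self, w, v):
        cv = self.uni[v]
        if cv == 0:
            return self.p_uni(w)
        lam = self.D * len(self.bi_followers[v]) / cv
        return max(self.bi[(v, w)] - self.D, 0.0) / cv + lam * self.p_uni(w)

    def p_tri(self, w, u, v):
        cuv = self.bi[(u, v)]
        if cuv == 0:
            return self.p_bi(w, v)          # back off on unseen context
        lam = self.D * len(self.tri_followers[(u, v)]) / cuv
        return (max(self.tri[(u, v, w)] - self.D, 0.0) / cuv
                + lam * self.p_bi(w, v))

    def perplexity(self, tags):
        s = ["<s>", "<s>"] + list(tags) + ["</s>"]
        logp = sum(math.log(self.p_tri(s[i], s[i - 2], s[i - 1]))
                   for i in range(2, len(s)))
        return math.exp(-logp / (len(s) - 2))
```

Filtering then amounts to dropping source-corpus sentences whose PoS-sequence perplexity under this model exceeds a chosen cutoff.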
PDT PERPLEXITY
[Histogram of PDT sentence perplexities under the LDT PoS LM: perplexity on the x-axis (10–50), frequency on the y-axis (up to 10,000).]
PARSER OPTIMISATION
◮ Do parameter tuning on the Czech development set
◮ Numbers forthcoming...
FUTURE WORK
◮ Further analysis of the Latin baseline
  ◮ Per-author/genre performance
  ◮ Why is MaltParser so bad?
◮ Feature engineering
◮ Learning curve: performance vs. perplexity cutoff
FURTHER FORWARD
◮ Extend the workflow to Talbanken / the Norwegian Dependency Treebank
◮ Evaluate the impact of preprocessing data for annotation
  ◮ Annotation speed?
  ◮ Annotator agreement?
  ◮ Annotator error?