Something from nothing: Arne Skjærholt, LTG seminar

SLIDE 1

THE PROJECT USING CZECH TO PARSE LATIN

Something from nothing

Arne Skjærholt LTG seminar

SLIDE 7

THE GOAL

◮ Integrating linguistic and data-driven methods
◮ Use linguistic knowledge to guide data-driven methods
◮ Leverage data-driven approaches to inform linguistic and rule-driven methods?

SLIDE 10

WHAT TO DO?

◮ Focus on syntax
◮ Focus on languages with few resources up-front
◮ Norwegian
    ◮ Decent resources at word-level
    ◮ No syntactic resources
◮ Latin
    ◮ Long tradition of linguistic inquiry
    ◮ Quality and quantity of annotated data extremely variable

SLIDE 11

PLANS

◮ Dependency corpus adaptation
◮ Constrained CRF models
◮ Annotation studies

SLIDE 14

CURRENT PROJECT

1. Take a large corpus
2. Remove 90% of the information in it
3. ???
4. Profit!

SLIDE 17

THE GENERAL IDEA

1. Delexicalise source language corpus
2. Train language model over target language PoS sequences
3. Filter source corpus with LM
4. Train model, parse target
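
Step 1 can be sketched as follows. This is a minimal illustration, assuming the corpus is stored in the tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, ...). The function names are hypothetical, and blanking both FORM and LEMMA is an assumption about what "delexicalise" covers in this project.

```python
def delexicalise_line(line: str) -> str:
    """Blank out FORM and LEMMA (columns 2 and 3) of one CoNLL-X token line."""
    if not line.strip():
        return line  # empty line: sentence separator, keep as-is
    cols = line.split("\t")
    cols[1] = "_"  # FORM
    cols[2] = "_"  # LEMMA
    return "\t".join(cols)


def delexicalise(conll_text: str) -> str:
    """Delexicalise a whole CoNLL-X document given as one string."""
    return "\n".join(delexicalise_line(line) for line in conll_text.split("\n"))


# Toy two-token sentence; the tags and relation labels are made up.
sample = (
    "1\tGallia\tGallia\tn\tn\t_\t2\tSBJ\t_\t_\n"
    "2\test\tsum\tv\tv\t_\t0\tPRED\t_\t_\n"
)
print(delexicalise(sample))
```

After this step only the part-of-speech tags, features, heads and relation labels remain, which is the information a parser trained on Czech can plausibly carry over to Latin.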

SLIDE 19

CORPORA

◮ Prague Dependency Treebank (PDT)
    ◮ 1.5M tokens
    ◮ Dependency syntax and complex morphological annotation
◮ Latin Dependency Treebank (LDT)
    ◮ 53,143 tokens
    ◮ Annotation scheme based on PDT

SLIDE 21

PARSING LATIN

◮ Previous baseline: MSTParser, 65% unlabelled, 53% labelled accuracy (Bamman & Crane 2008)
◮ New baseline: MSTParser, 64% unlabelled, 54% labelled

[Figure: prose/poetry distribution. Prose: 40,884; poetry: 12,259]
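
For readers unfamiliar with the two figures, the sketch below shows how unlabelled and labelled accuracy are commonly computed, assuming they are the usual attachment scores: the share of tokens with the correct head, and the share with both the correct head and the correct relation label. The function name and the toy labels are illustrative. Whether punctuation was scored is not stated on the slide.

```python
def attachment_scores(gold, predicted):
    """Return (unlabelled, labelled) accuracy.

    gold, predicted: equal-length lists of (head_index, relation_label),
    one pair per token, in the same token order.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("need two non-empty analyses of the same length")
    head_correct = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    both_correct = sum(g == p for g, p in zip(gold, predicted))
    return head_correct / len(gold), both_correct / len(gold)


# Toy four-token sentence with made-up relation labels.
gold = [(2, "SBJ"), (0, "PRED"), (2, "OBJ"), (2, "ADV")]
pred = [(2, "SBJ"), (0, "PRED"), (2, "ADV"), (1, "ADV")]
uas, las = attachment_scores(gold, pred)
```

In the toy example three of four heads are right and two of four head-plus-label pairs are right.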

SLIDE 22

WORKFLOW

[Diagram: two parallel pipelines]
PDT → reformat → CoNLL → tagset map → Common tagset(s) → delexicalise → Delexicalised → train → Parser → Parse Latin
LDT → reformat → CoNLL → tagset map → Common tagset(s) → delexicalise → Delexicalised → train → LM
LM → filter → delexicalised PDT data

SLIDE 24

TAGSETS

◮ LDT annotation guidelines derived from PDT
◮ PoS mappings:
    ◮ LDT has a participle tag
    ◮ Czech has particles, Latin doesn’t
◮ Deprel mappings:
    ◮ Reflexive tantum
    ◮ Reflexive passive
    ◮ Emotional dative
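
A tagset map of the kind listed above can be as simple as a lookup table per treebank. The sketch below is only an illustration: the tag names are hypothetical placeholders rather than the real PDT or LDT tag inventories, and the chosen targets (participle to verb, particle to adverb) are assumptions, not the mappings used in the project.

```python
from typing import Mapping

# Hypothetical coarse tag names; the real treebank inventories differ.
LDT_TO_COMMON = {"participle": "verb"}   # assumption: fold participles into verbs
PDT_TO_COMMON = {"particle": "adverb"}   # assumption: purely illustrative target


def to_common_tag(tag: str, mapping: Mapping[str, str]) -> str:
    """Map a treebank-specific tag to the shared tagset.

    Tags that both treebanks already share pass through unchanged.
    """
    return mapping.get(tag, tag)


latin_tags = [to_common_tag(t, LDT_TO_COMMON) for t in ["noun", "participle"]]
czech_tags = [to_common_tag(t, PDT_TO_COMMON) for t in ["particle", "verb"]]
```

The same pattern would apply to the dependency relation labels, with one table covering the Czech-only relations.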

SLIDE 26

DATA SPLITS

◮ PDT:
    ◮ 8 training folds
    ◮ development fold
    ◮ evaluation fold
◮ LDT:
    ◮ Distributed as one file/author
    ◮ Round-robin split into 10 folds
    ◮ Fold 10 held out for evaluation
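
The LDT split can be sketched as below, assuming "round-robin" means that sentence i goes to fold i mod 10. Whether the dealing was done per author file or over the concatenated corpus is not stated on the slide, and the function name is illustrative.

```python
def round_robin_folds(sentences, n_folds=10):
    """Deal sentences into n_folds folds like cards: 0, 1, ..., n-1, 0, 1, ..."""
    folds = [[] for _ in range(n_folds)]
    for i, sentence in enumerate(sentences):
        folds[i % n_folds].append(sentence)
    return folds


# Toy corpus of 25 "sentences", represented by their indices.
folds = round_robin_folds(list(range(25)))
held_out = folds[-1]                                     # fold 10: evaluation
working = [s for fold in folds[:-1] for s in fold]       # folds 1-9
```

Dealing sentences out in turn spreads each author's text across all folds, which a contiguous split of per-author files would not do.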

SLIDE 27

LANGUAGE MODELLING

◮ LM over LDT PoS sequences
◮ Best order: trigrams
◮ Best smoothing: constant discounting (D = 0.1)
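
To make the setup concrete, here is a small trigram model over tag sequences. It is a sketch under assumptions, not the toolkit or exact estimator used in the talk: "constant discounting" is read here as absolute discounting (a fixed D subtracted from every observed n-gram count), the freed mass is interpolated with the next lower order, and the tagset is taken to be closed. The class and method names are hypothetical.

```python
import math
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"


class DiscountedNgramLM:
    """N-gram LM with interpolated absolute discounting (0 < discount <= 1)."""

    def __init__(self, order=3, discount=0.1):
        self.order = order
        self.discount = discount
        self.ngram_counts = Counter()      # (history, tag) -> count
        self.history_counts = Counter()    # history -> count
        self.followers = defaultdict(set)  # history -> distinct next tags
        self.vocab = {EOS}

    def _pad(self, tags):
        return [BOS] * (self.order - 1) + list(tags) + [EOS]

    def train(self, sequences):
        for tags in sequences:
            padded = self._pad(tags)
            self.vocab.update(tags)
            for i in range(self.order - 1, len(padded)):
                for n in range(self.order):  # history lengths 0 .. order-1
                    hist = tuple(padded[i - n:i])
                    self.ngram_counts[hist, padded[i]] += 1
                    self.history_counts[hist] += 1
                    self.followers[hist].add(padded[i])

    def prob(self, tag, history):
        hist = tuple(history)
        # Lower-order estimate; the recursion bottoms out in a uniform distribution.
        lower = self.prob(tag, hist[1:]) if hist else 1.0 / len(self.vocab)
        total = self.history_counts[hist]
        if total == 0:
            return lower  # unseen history: use the shorter context directly
        discounted = max(self.ngram_counts[hist, tag] - self.discount, 0.0)
        freed_mass = self.discount * len(self.followers[hist])
        return (discounted + freed_mass * lower) / total

    def perplexity(self, tags):
        padded = self._pad(tags)
        log_prob = 0.0
        for i in range(self.order - 1, len(padded)):
            history = padded[i - self.order + 1:i]
            log_prob += math.log(self.prob(padded[i], history))
        return math.exp(-log_prob / (len(tags) + 1))  # +1 for the end symbol


lm = DiscountedNgramLM(order=3, discount=0.1)
lm.train([["n", "v", "n"], ["a", "n", "v"]])  # toy tag sequences
```

A tag sequence seen in training receives a much lower perplexity than one whose trigrams never occurred, which is the property the filtering step relies on.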

SLIDE 28

PDT PERPLEXITY

[Figure: histogram of perplexity; axis labels: Frequency, Perplexity]
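
Filtering the source corpus with the LM then amounts to choosing a perplexity cutoff. The sketch below assumes that sentences at or below the cutoff are kept, that is, the Czech sentences the Latin tag model finds least surprising. The function name, the cutoff values and the stand-in scorer are all hypothetical.

```python
def subsets_for_cutoffs(sentences, perplexity, cutoffs):
    """Return {cutoff: sentences whose perplexity is at or below that cutoff}.

    perplexity: any callable mapping a tag sequence to a float.
    """
    scored = [(perplexity(s), s) for s in sentences]  # score each sentence once
    return {c: [s for ppl, s in scored if ppl <= c] for c in cutoffs}


def toy_perplexity(tags):
    """Stand-in scorer for the example only: pretend longer means more surprising."""
    return float(len(tags))


sentences = [["n"], ["n", "v"], ["n", "v", "a"]]
subsets = subsets_for_cutoffs(sentences, toy_perplexity, cutoffs=[1, 2])
```

Training one parser per subset gives the points of a performance-against-cutoff curve.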

SLIDE 30

PARSER OPTIMISATION

◮ Do parameter tuning on the Czech development set
◮ Numbers forthcoming...

SLIDE 31

FUTURE WORK

◮ Further analysis of Latin baseline
    ◮ Per author/genre performance
    ◮ Why is MaltParser so bad?
◮ Feature engineering
◮ Learning curve: performance vs. perplexity cutoff

SLIDE 32

FURTHER FORWARD

◮ Extend workflow to Talbanken/Norwegian Dependency Treebank
◮ Evaluate impact of preprocessing data for annotation
    ◮ Annotation speed?
    ◮ Annotator agreement?
    ◮ Annotator error?