Typology & IGT Robin Westphal, 13.07.16 Institute for - - PowerPoint PPT Presentation



slide-1
SLIDE 1
slide-2
SLIDE 2

Typology & IGT

The Online Database of Interlinear Text

HS: Computational Linguistics for Low-Resource Languages
Robin Westphal, 13.07.16
Institute for Computational Linguistics, University Heidelberg

slide-3
SLIDE 3

Papers

3/36

slide-4
SLIDE 4

Papers

  • Developing ODIN: A Multilingual Repository of Annotated Language Data for Hundreds of the World’s Languages (2006)
  • Automatically Identifying Computationally Relevant Typological Features (2008)

by William D. Lewis & Fei Xia

4/36

slide-5
SLIDE 5

Overview

5/36

slide-6
SLIDE 6

Overview

ODIN: 1. What? 2. Why? 3. How? 4. Practical Use?

6/36

slide-7
SLIDE 7
  • 1. What is ODIN?

7/36

slide-8
SLIDE 8

What is ODIN?

“ODIN is a database of interlinear text ‘snippets’, harvested mostly from scholarly documents posted to the web”

Developed by:

  • GOLD Community of Practice (Farrar and Lewis, 2006)
  • Electronic Metastructure for Endangered Languages Data efforts (EMELD)

8/36

slide-9
SLIDE 9
  • 2. Why develop ODIN?

9/36

slide-10
SLIDE 10

Why develop ODIN? - Problem

The web contains a vast amount of maintained data. BUT:

  • Spread everywhere
  • No uniform search strategy
  • Cannot be easily manipulated or used

10/36

slide-11
SLIDE 11

Why develop ODIN? – Solution

A database like ODIN provides:

  • Summary of most IGT instances on the web
  • Easy-to-use search engine
  • A standardized presentation for easier access

11/36

slide-12
SLIDE 12

What is IGT?

12/36

slide-13
SLIDE 13

Reminder: What is IGT?

  • “Interlinear Glossed Text”

(Bailyn, 2001)

13/36

[Figure: IGT example with its three lines labeled Source, Gloss, and Translation]

slide-14
SLIDE 14

IGT – Challenging benefits

Challenges:

  • Unclear structural associations between elements
  • Descriptions of grammatical concepts are inconsistent

Benefits:

  • Consistent format for mining & enrichment

14/36

slide-15
SLIDE 15
  • 3. How to get all the data?

15/36

slide-16
SLIDE 16

How to get data?

1.) Find documents that could contain IGT.
2.) Detect & extract IGT by matching its characteristic patterns.
3.) Store the results in the ODIN database.

16/36

slide-17
SLIDE 17

3.1. Crawler

17/36

slide-18
SLIDE 18

Crawler

Query type (Top 100)      Avg. no. of docs    Avg. no. of docs w/ IGT
Gram(s)                   1,184               239
Language name(s)          1,314               259
Both grams and names      1,536               289
Language words            1,159               193

Number of findings at the time of writing the article: 150,000 / 1.5 million (10%)

18/36

slide-19
SLIDE 19

Crawler - Method 1

Regex approach:

\t*(\()\d*\).*\n     first line begins with a number in parentheses
\t*.*\n              second line can be anything
\t*\’.*\n            third line begins with a quote

Check the first line against the surrounding language codes.
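As a rough illustration, the three-line pattern can be tried out in Python. This is a minimal sketch, not the ODIN crawler itself: the names `IGT_PATTERN` and `find_igt_candidates` and the sample text are assumptions made for this example. The slide's `\t*` is loosened to `[ \t]*`, at least one digit is required, and several quote characters are accepted.

```python
import re

# Hypothetical adaptation of the slide's three-line pattern.
IGT_PATTERN = re.compile(
    r"^[ \t]*\(\d+\).*\n"      # line 1: example number in parentheses
    r"[ \t]*.*\n"              # line 2: anything (typically the gloss)
    r"[ \t]*['‘’\"“].*$",      # line 3: starts with a quote (translation)
    re.MULTILINE,
)


def find_igt_candidates(text):
    """Return every three-line block that matches the pattern."""
    return [m.group(0) for m in IGT_PATTERN.finditer(text)]


# Made-up sample document with one IGT-like block.
SAMPLE = """Some running prose before the example.
(1) Der Hund schläft
    the dog sleeps
    'The dog is sleeping.'
More prose after the example.
"""

candidates = find_igt_candidates(SAMPLE)
for block in candidates:
    print(block)
```

Because the pattern only looks at line shapes, any numbered line followed two lines later by a quoted line counts as a candidate, which is why a separate check against language codes is still needed.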

19/36

slide-20
SLIDE 20

Reminder: What is IGT?

  • “Interlinear Glossed Text”

(Bailyn, 2001)

20/36

[Figure: IGT example with its three lines labeled Source, Gloss, and Translation]

slide-21
SLIDE 21

Crawler - Method 1 - Problems

  • rigid format requirements
  • clusters of IGT with multiple languages are incorrectly identified
  • PDF conversion corrupts the formatting

21/36

slide-22
SLIDE 22

Crawler - Method 2

Machine Learning:

  • Tag each line based on a feature list
  • Convert the best tag sequence into a span sequence “B [ I | BL ]* E”

B = Begin
I = Inside
BL = Blank
E = End
O = Outside
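The conversion from a tag sequence to spans can be sketched as follows. This is only an illustration under stated assumptions: the function name `tags_to_spans` is hypothetical, the per-line tags are taken as given (for example from the classifier), and the handling of ill-formed sequences (an `O` line discards an open span, a second `B` restarts it) is a choice made for this sketch rather than something the slides specify.

```python
def tags_to_spans(tags):
    """Convert per-line tags into (start, end) line spans, end inclusive.

    A span is accepted only if it has the shape B [I | BL]* E.
    """
    spans = []
    start = None                      # index of the currently open B tag
    for i, tag in enumerate(tags):
        if tag == "B":
            start = i                 # open (or restart) a span
        elif tag == "E" and start is not None:
            spans.append((start, i))  # close the open span
            start = None
        elif tag == "O":
            start = None              # an outside line breaks an open span
        # "I" and "BL" simply extend an open span
    return spans


# Made-up tag sequence for a ten-line document.
tags = ["O", "B", "I", "E", "O", "B", "BL", "I", "E", "O"]
spans = tags_to_spans(tags)
print(spans)
```

Each returned pair gives the first and last line index of one IGT candidate; blank lines inside a span are kept because `BL` is allowed between `B` and `E`.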

22/36

slide-23
SLIDE 23

Crawler - Method 2 - Features

Feature1    words of current line
Feature2    collection of 16 IGT features (quotes, numbering, tokens)
Feature3    tags for previous lines
Feature4    tags for neighboring lines

23/36

slide-24
SLIDE 24

Crawler - Results

               Precision    Recall    F-score
Regex          74.95%       52.19%    61.54%
F2             57.02%       48.64%    52.50%
F2+F4          75.50%       76.04%    75.77%
F1+F2+F3+F4    82.29%       81.02%    81.65%
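The F-score column can be recomputed from the other two columns. The sketch below assumes that the reported F-score is the balanced F-score, i.e. the harmonic mean of precision and recall; the slides do not state this explicitly, and the function name `f_score` is chosen for this example.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall in percent, as reported in the results table.
results = {
    "Regex": (74.95, 52.19),
    "F2": (57.02, 48.64),
    "F2+F4": (75.50, 76.04),
    "F1+F2+F3+F4": (82.29, 81.02),
}

for name, (precision, recall) in results.items():
    print(f"{name}: {f_score(precision, recall):.2f}%")
```

Under that assumption the recomputed values agree with the reported ones up to rounding in the last digit.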

24/36

slide-25
SLIDE 25

3.2. Converting raw data

25/36

slide-26
SLIDE 26

Language ID

Problems for classifiers:

  • far too many languages to distinguish between
  • not enough training data for “rarer” languages
  • clusters of IGT with multiple languages

26/36

slide-27
SLIDE 27

Language ID - Features

Feature1    nearest language code
Feature2    neighboring language codes
Feature3    n-grams in current IGT
Feature4    n-grams in all IGT

83.08% accuracy for 7,816 language codes and 47,728 (code, name) pairs
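To make the first feature more concrete, here is a minimal sketch of a "nearest language code" lookup. Everything in it is an assumption for illustration: the three-entry table `LANGUAGE_CODES`, the function name `nearest_language_code`, the character-distance measure, and the sample document. The real system works with thousands of codes and combines this signal with the other features.

```python
import re

# Hypothetical mini table of (name, code) pairs; the real table is far larger.
LANGUAGE_CODES = {"German": "deu", "Welsh": "cym", "Russian": "rus"}


def nearest_language_code(text, igt_offset):
    """Return the code of the language name mentioned closest to igt_offset."""
    best_code, best_distance = None, None
    for name, code in LANGUAGE_CODES.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            distance = abs(match.start() - igt_offset)
            if best_distance is None or distance < best_distance:
                best_code, best_distance = code, distance
    return best_code


# Made-up document: two language names, one IGT-like line.
doc = (
    "Word order in Russian is flexible. "
    "Consider the following German example:\n"
    "(1) Der Hund schläft\n"
)
code = nearest_language_code(doc, doc.index("(1)"))
print(code)
```

A document that mentions several languages close together shows why this feature alone is not enough for clusters of IGT with multiple languages.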

27/36

slide-28
SLIDE 28

The final product

28/36

slide-29
SLIDE 29

29/36

German

slide-30
SLIDE 30

30/36

German

slide-31
SLIDE 31
  • 5. How is ODIN used?

31/36

slide-32
SLIDE 32

Usage

  • Searching via
      • Language name / code
      • Language family
      • Concept / Gram
  • Data enrichment
      • for English
      • for source language

32/36

slide-33
SLIDE 33

5.1 Typology research

33/36

slide-34
SLIDE 34

Typology research – IGT enrichment

Typology = the study of classifying languages by organising them into an enumerated list of possible types and identifying them via structural features.

Based on: ODIN data

  • -> enriched source languages

34/36

slide-35
SLIDE 35

Typology research – IGT enrichment

  • parse the English translation using an English parser
  • align the target sentence and the English translation using the gloss line
  • project the phrase structures onto the target sentence

Possible flaws: IGT / English bias (unnatural examples based on another language)
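The alignment step can be sketched roughly as follows. This is a simplified illustration, not the method used in the papers: the function name `align_via_gloss`, the exact-match linking rule, and the sample IGT are assumptions. Source words and gloss tokens are assumed to correspond one-to-one by position, and a gloss token is linked to a translation word only when one of its parts is identical to that word. Parsing and structure projection are not shown.

```python
def align_via_gloss(source, gloss, translation):
    """Align source words to translation word indices through the gloss line."""
    src_words = source.split()
    gloss_tokens = gloss.split()
    if len(src_words) != len(gloss_tokens):
        raise ValueError("source and gloss must have the same number of tokens")
    # Lowercase the translation and strip surrounding punctuation and quotes.
    trans_words = [w.strip(".,!?'\"").lower() for w in translation.split()]

    alignment = []
    for src, tok in zip(src_words, gloss_tokens):
        # Split glosses such as "sleep-3SG" or "to.the" into their parts.
        parts = tok.lower().replace("-", " ").replace(".", " ").split()
        links = [i for i, w in enumerate(trans_words) if w in parts]
        alignment.append((src, links))
    return alignment


# Made-up IGT instance: source line, gloss line, translation line.
pairs = align_via_gloss(
    "Der Hund schläft",
    "the dog sleep-3SG",
    "'The dog sleeps.'",
)
for src, links in pairs:
    print(src, links)
```

In this example the first two source words are linked to translation positions 0 and 1, while the verb stays unaligned because `sleep` and `sleeps` differ; a more realistic aligner would need stemming or similar normalization.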

35/36

slide-36
SLIDE 36

Typology research - Features

36/36

slide-37
SLIDE 37

Typology research – Results & Error analysis

37/36

slide-38
SLIDE 38

Typology research - Results - Error analysis

  • Insufficient data
  • Skewed or inaccurate data
  • Projection error
  • Free constituent order

38/36

slide-39
SLIDE 39