Typology & IGT: Robin Westphal, 13.07.16


SLIDE 1

SLIDE 2

Typology & IGT

The Online Database of Interlinear Text

HS: Computational Linguistics for Low-Resource Languages
Robin Westphal, 13.07.16
Institute for Computational Linguistics, University Heidelberg

SLIDE 3

Papers

SLIDE 4

Papers

Developing ODIN: A Multilingual Repository of Annotated Language Data for Hundreds of the World’s Languages (2006)
Automatically Identifying Computationally Relevant Typological Features (2008)
by William D. Lewis & Fei Xia

SLIDE 5

Overview

SLIDE 6

Overview

ODIN:
1. What?
2. Why?
3. How?
4. Practical Use?

SLIDE 7

1. What is ODIN?

SLIDE 8

What is ODIN?

“ODIN is a database of interlinear text ‘snippets’, harvested mostly from scholarly documents posted to the web”

Developed by:

GOLD Community of Practice (Farrar and Lewis, 2006)
Electronic Metastructure for Endangered Languages Data efforts (EMELD)

SLIDE 9

2. Why develop ODIN?

SLIDE 10

Why develop ODIN? - Problem

The web contains a vast amount of maintained data. BUT:

Spread everywhere
No uniform search strategy
Cannot be easily manipulated or used

SLIDE 11

Why develop ODIN? – Solution

A database like ODIN provides:

Summary of most IGT instances on the web
Easy-to-use search engine
A normalized presentation for easier access

SLIDE 12

What is IGT?

SLIDE 13

Reminder: What is IGT?

“Interlinear Glossed Text”

(Bailyn, 2001)

[Figure: example IGT with its three lines labeled Source, Gloss, Translation]
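
As a rough sketch of the three-line structure, an IGT instance can be modeled as a small record. The class name `IGT`, its field names, and the German sentence are invented here for illustration; they are not taken from ODIN or from the cited example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class IGT:
    """One interlinear glossed text instance (hypothetical layout)."""
    source: str       # sentence in the language under study
    gloss: str        # word-by-word gloss, one gloss token per source word
    translation: str  # free translation, usually English

    def word_pairs(self) -> List[Tuple[str, str]]:
        """Pair each source word with its gloss token by position."""
        return list(zip(self.source.split(), self.gloss.split()))


# Invented example for illustration only.
example = IGT(
    source="Der Hund schläft",
    gloss="the.NOM dog sleep.3SG",
    translation="The dog sleeps.",
)
print(example.word_pairs())
```

The positional pairing of source and gloss tokens is the property that later makes the gloss line usable as a bridge between the source sentence and its translation.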

SLIDE 14

IGT – Challenges and benefits

Challenges:

Unclear structural associations between elements
Descriptions of grammatical concepts are inconsistent

Benefits:

Consistent format for mining & enrichment

SLIDE 15

3. How to get all the data?

SLIDE 16

How to get data?

1.) Find documents that could contain IGT.
2.) Detect & extract IGT via its characteristic patterns.
3.) Store in ODIN database.

SLIDE 17

3.1. Crawler

SLIDE 18

Crawler

Query type (top 100)      Avg. no. docs    Avg. no. docs w/ IGT
Gram(s)                   1,184            239
Language name(s)          1,314            259
Both grams and names      1,536            289
Language words            1,159            193

Number of findings at the time the article was written: 150,000 / 1.5 million (10%)
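
To compare the query types, the share of returned documents that actually contain IGT can be computed from the averages above. This is only a small arithmetic sketch; the names `crawl_stats` and `igt_yield` are made up for the example.

```python
# Average document counts per query type, copied from the figures above:
# (avg. no. docs, avg. no. docs with IGT)
crawl_stats = {
    "Gram(s)": (1184, 239),
    "Language name(s)": (1314, 259),
    "Both grams and names": (1536, 289),
    "Language words": (1159, 193),
}


def igt_yield(total_docs: int, docs_with_igt: int) -> float:
    """Share of returned documents that contain IGT, in percent."""
    return 100.0 * docs_with_igt / total_docs


for query_type, (total, with_igt) in crawl_stats.items():
    print(f"{query_type:22s} {igt_yield(total, with_igt):5.1f}%")
```

Read this way, combining grams and names returns the most documents with IGT in absolute terms, while the share of IGT-bearing documents stays between roughly one in six and one in five for every query type.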

SLIDE 19

Crawler - Method 1

Regex approach:

\t*(\()\d*\).*\n     first line begins with a number in parentheses
\t*.*\n              second line can be anything
\t*\'.*\n            third line begins with a quote

Then check the first line against the surrounding language codes.
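
The three-line pattern above can be tried out with Python's `re` module. This is a minimal sketch and a slightly loosened variant of the slide's regex, not the original crawler code: it allows spaces as well as tabs for indentation, requires at least one digit, and accepts several quote characters. The function name `find_igt_candidates` and the sample text are assumptions.

```python
import re
from typing import List

# Line 1: example number in parentheses; line 2: anything;
# line 3: starts with a quote character (assumed set of quotes).
IGT_PATTERN = re.compile(
    r"^[ \t]*\(\d+\).*\n"
    r"[ \t]*.*\n"
    r"""[ \t]*['"`‘“].*$""",
    re.MULTILINE,
)


def find_igt_candidates(text: str) -> List[str]:
    """Return every three-line block that looks like an IGT instance."""
    return [match.group(0) for match in IGT_PATTERN.finditer(text)]


# Invented sample document for illustration only.
sample = (
    "Some prose about German.\n"
    "(1) Der Hund schläft\n"
    "    the.NOM dog sleep.3SG\n"
    "    'The dog sleeps.'\n"
    "More prose.\n"
)
for block in find_igt_candidates(sample):
    print(block)
```

Any instance that deviates from this layout, for example one without a numbered first line or with an extra line between gloss and translation, is missed by the pattern.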

SLIDE 20

Reminder: What is IGT?

“Interlinear Glossed Text”

(Bailyn, 2001)

[Figure: example IGT with its three lines labeled Source, Gloss, Translation]

SLIDE 21

Crawler - Method 1 - Problems

rigid format requirements
clusters of IGT with multiple languages are incorrectly identified
PDF conversion corrupts the formatting

SLIDE 22

Crawler - Method 2

Machine Learning:

Tag each line based on a feature list
Convert the best tag sequence into a span sequence “B [ I | BL ]* E”

B = Begin
I = Inside
BL = Blank
E = End
O = Outside
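
The conversion from per-line tags to spans can be sketched as a single pass over the tag sequence. This is an illustration under assumptions: the function name `tags_to_spans` is invented, and discarding ill-formed sequences (an E without a preceding B, or a B interrupted by O) is a choice made here, not something the slides specify.

```python
from typing import List, Optional, Tuple


def tags_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Return (first_line, last_line) index pairs matching B [I|BL]* E."""
    spans: List[Tuple[int, int]] = []
    start: Optional[int] = None  # index of the currently open B, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            start = i                      # open (or restart) a span
        elif tag in ("I", "BL"):
            continue                       # stays inside an open span
        elif tag == "E":
            if start is not None:
                spans.append((start, i))   # close a well-formed span
            start = None
        else:                              # "O" or anything unexpected
            start = None                   # abandon an unfinished span
    return spans


tags = ["O", "B", "I", "E", "O", "B", "BL", "I", "E", "I", "E"]
print(tags_to_spans(tags))
```

In the example, the trailing `I`, `E` pair is ignored because no `B` opens it.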

SLIDE 23

Crawler - Method 2 - Features

Feature1: words of the current line
Feature2: collection of 16 IGT features (quotes, numbering, tokens)
Feature3: tags for previous lines
Feature4: tags for neighboring lines
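
The slides do not list the 16 IGT features individually, so the following sketch only shows what line-level cues of this kind (quotes, numbering, tokens) might look like. All feature names and the function `line_features` are hypothetical.

```python
import re
from typing import Dict, Union


def line_features(line: str) -> Dict[str, Union[bool, int]]:
    """Compute a few illustrative cues for one line of a document."""
    stripped = line.strip()
    return {
        # Blank lines correspond to the BL tag.
        "is_blank": stripped == "",
        # Example numbering such as "(1)", "1." or "1)".
        "starts_with_number": bool(re.match(r"\(?\d+[.)]", stripped)),
        # Translation lines are often wrapped in quotes.
        "starts_with_quote": stripped[:1] in {"'", '"', "`", "‘", "“"},
        # Number of whitespace-separated tokens.
        "token_count": len(stripped.split()),
        # Upper-case grammatical abbreviations such as ".NOM" or "-3SG".
        "has_gram_marker": bool(re.search(r"[.\-][A-Z0-9]{2,}\b", stripped)),
    }


print(line_features("    the.NOM dog sleep.3SG"))
```

A sequence tagger would combine such cues with the words of the line and the tags or features of the surrounding lines.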

SLIDE 24

Crawler - Results

               Precision    Recall     F-score
Regex          74.95%       52.19%     61.54%
F2             57.02%       48.64%     52.50%
F2+F4          75.50%       76.04%     75.77%
F1+F2+F3+F4    82.29%       81.02%     81.65%
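
The F-scores above are consistent with the balanced F-score, the harmonic mean of precision and recall. The short check below recomputes them from the rounded percentages; the names `f_score` and `results` are chosen for the example.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) in percent, copied from the results above.
results = {
    "Regex": (74.95, 52.19),
    "F2": (57.02, 48.64),
    "F2+F4": (75.50, 76.04),
    "F1+F2+F3+F4": (82.29, 81.02),
}

for system, (p, r) in results.items():
    print(f"{system:12s} F = {f_score(p, r):.2f}%")
```

Because the inputs are already rounded, a recomputed value can differ from the reported one in the last digit.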

SLIDE 25

3.2. Converting raw data

SLIDE 26

Language ID

Problems for classifiers:

far too many languages to distinguish between
not enough training data for “rarer” languages
clusters of IGT with multiple languages

SLIDE 27

Language ID - Features

Feature1: nearest language code
Feature2: neighboring language codes
Feature3: n-grams in current IGT
Feature4: n-grams in all IGT

83.08% accuracy for 7,816 language codes and 47,728 (code, name) pairs
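
The n-gram features can be illustrated with a toy character-trigram comparison. This sketch covers only the n-gram idea, not the language-code features, and it is not the classifier from the papers: the functions `char_ngrams` and `best_language`, the overlap score, and the tiny invented training strings are all assumptions.

```python
from collections import Counter
from typing import Dict


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams in a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def best_language(source_line: str, profiles: Dict[str, Counter], n: int = 3) -> str:
    """Pick the language code whose profile shares the most n-grams."""
    line_grams = char_ngrams(source_line, n)

    def overlap(profile: Counter) -> int:
        # Counter intersection keeps the minimum count of each shared n-gram.
        return sum((line_grams & profile).values())

    return max(profiles, key=lambda code: overlap(profiles[code]))


# Invented miniature "training data", keyed by ISO 639-3 code.
profiles = {
    "deu": char_ngrams("der hund schläft und die katze"),
    "eng": char_ngrams("the dog sleeps and the cat"),
}
print(best_language("Die Katze schläft", profiles))
```

With thousands of candidate languages and very little text per language, such profiles become unreliable on their own, which is why language codes and names found near the IGT are used as additional evidence.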

SLIDE 28

The final product

SLIDE 29

[Figure not preserved; label: German]

SLIDE 30

[Figure not preserved; label: German]

SLIDE 31

5. How is ODIN used?

SLIDE 32

Usage

Searching via:
  Language name / code
  Language family
  Concept / Gram
Data enrichment:
  for English
  for source language

SLIDE 33

5.1 Typology research

SLIDE 34

Typology research – IGT enrichment

Typology = the study of classifying languages by organising them into an enumerated list of possible types and identifying each type via structural features

Based on: ODIN data
> enriched source languages

SLIDE 35

Typology research – IGT enrichment

parse the English translation using an English parser
align the target sentence and the English translation using the gloss line
project the phrase structures onto the target sentence

Possible flaws: IGT / English bias (unnatural examples based on another language)
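
The alignment and projection steps above can be sketched in a heavily simplified form. Assumptions: the English parser is replaced by hand-written part-of-speech tags, per-word tags are projected instead of full phrase structures, and gloss tokens are matched to translation words by crude string comparison. The names `align_via_gloss` and `project_labels` and the example sentence are invented.

```python
import re
from typing import Dict, List, Optional


def align_via_gloss(source: List[str], gloss: List[str],
                    translation: List[str]) -> Dict[int, Optional[int]]:
    """Map each source word index to a translation word index (or None).

    Source and gloss are aligned by position; each gloss token is linked
    to the first translation word matching one of its lowercase pieces.
    """
    trans_lower = [w.lower().strip(".,!?'\"") for w in translation]
    alignment: Dict[int, Optional[int]] = {}
    for i, gloss_token in enumerate(gloss[:len(source)]):
        pieces = [p.lower() for p in re.split(r"[.\-]", gloss_token) if p]
        alignment[i] = None
        for j, word in enumerate(trans_lower):
            # Exact match, or prefix match for inflected forms ("sleep"/"sleeps").
            if any(word == p or (len(p) >= 3 and word.startswith(p)) for p in pieces):
                alignment[i] = j
                break
    return alignment


def project_labels(alignment: Dict[int, Optional[int]],
                   translation_labels: List[str]) -> List[Optional[str]]:
    """Copy each translation word's label onto its aligned source word."""
    return [
        translation_labels[j] if j is not None else None
        for _, j in sorted(alignment.items())
    ]


# Invented example; the tags stand in for the output of an English parser.
source = "Der Hund schläft".split()
gloss = "the.NOM dog sleep.3SG".split()
translation = "The dog sleeps.".split()
english_tags = ["DT", "NN", "VBZ"]

alignment = align_via_gloss(source, gloss, translation)
print(list(zip(source, project_labels(alignment, english_tags))))
```

Source words whose gloss has no counterpart in the translation receive no label in this sketch, which is one way a projection error can arise.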

SLIDE 36

Typology research - Features

SLIDE 37

Typology research – Results & Error analysis

SLIDE 38

Typology research - Results - Error analysis

Insufficient data
Skewed or inaccurate data
Projection error
Free constituent order

SLIDE 39