Typology & IGT
The Online Database of Interlinear Text
HS: Computational Linguistics for Low-Resource Languages
Robin Westphal, 13.07.16
Institute for Computational Linguistics, University Heidelberg
Papers
- Developing ODIN: A Multilingual Repository of Annotated Language Data for Hundreds of the World’s Languages (2006)
- Automatically Identifying Computationally Relevant Typological Features (2008)
Both by William D. Lewis & Fei Xia
Overview
ODIN:
1. What?
2. Why?
3. How?
4. Practical Use?
- 1. What is ODIN?
“ODIN is a database of interlinear text ‘snippets’, harvested mostly from scholarly documents posted to the web”
Developed by:
- GOLD Community of Practice (Farrar and Lewis, 2006)
- Electronic Metastructure for Endangered Languages Data efforts (EMELD)
- 2. Why develop ODIN?
Why develop ODIN? – Problem
The web contains a vast amount of maintained data, but the data is:
- Scattered across many sites
- Not reachable through a uniform search strategy
- Not easy to manipulate or reuse
Why develop ODIN? – Solution
A database like ODIN provides:
- Summary of most IGT instances on the web
- Easy-to-use search engine
- A normalized presentation for easier access
Reminder: What is IGT?
- “Interlinear Glossed Text”
- Three lines per instance: Source, Gloss, Translation
(Example: Bailyn, 2001)
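A minimal sketch of how one IGT instance with its source, gloss and translation lines could be represented. The class name `IGT`, the method `word_pairs` and the German sentence are illustrative assumptions, not ODIN's actual schema or data.

```python
from dataclasses import dataclass


@dataclass
class IGT:
    """One interlinear glossed text instance (hypothetical container)."""
    source: str       # sentence in the language being described
    gloss: str        # word-by-word gloss, one gloss token per source word
    translation: str  # free translation, usually English

    def word_pairs(self):
        """Pair each source word with its gloss token by position."""
        src, gls = self.source.split(), self.gloss.split()
        if len(src) != len(gls):
            raise ValueError("source and gloss lines differ in length")
        return list(zip(src, gls))


# Illustrative example, not taken from ODIN.
example = IGT(
    source="Der Lehrer gab dem Jungen ein Buch",
    gloss="the.NOM teacher gave the.DAT boy a book",
    translation="The teacher gave the boy a book.",
)
print(example.word_pairs())
```

The positional pairing only works when the gloss line has exactly one token per source word, which real IGT does not always guarantee.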
IGT – Challenges and benefits
Challenges:
- Unclear structural associations between elements
- Descriptions of grammatical concepts are inconsistent
Benefits:
- Consistent format for mining & enrichment
- 3. How to get all the data?
1.) Find documents that could contain IGT.
2.) Detect & extract IGT via resembling patterns.
3.) Store in ODIN database.
3.1. Crawler
Query type (top 100)    Avg. no. docs    Avg. no. docs w/ IGT
Gram(s)                 1,184            239
Language name(s)        1,314            259
Both grams and names    1,536            289
Language words          1,159            193
Findings at the time of writing the article: 150,000 / 1.5 million (10%)
18/36
Crawler - Method 1
Regex approach:
\t*\(\d*\).*\n    first line begins with a number in parentheses
\t*.*\n           second line can be anything
\t*\'.*\n         third line begins with a quote
- Check the first line against the surrounding language codes
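The three-line pattern can be tried out in Python roughly as follows. This is a sketch, not the crawler's actual code: the name `find_igt_candidates` is hypothetical, and accepting spaces as indentation, requiring at least one digit, and allowing several quote characters are assumptions added here.

```python
import re

# Three consecutive lines: "(number) ...", anything, then a quoted line.
IGT_PATTERN = re.compile(
    r"^[ \t]*\(\d+\).*\n"      # line 1: starts with a number in parentheses
    r"[ \t]*.*\n"              # line 2: can be anything (usually the gloss)
    r"[ \t]*['‘“`\"].*$",      # line 3: starts with a quote (translation)
    re.MULTILINE,
)


def find_igt_candidates(text):
    """Return every three-line block that matches the IGT template."""
    return [m.group(0) for m in IGT_PATTERN.finditer(text)]


# Illustrative document text, not taken from ODIN.
sample = (
    "Consider the following example.\n"
    "(1) Der Lehrer gab dem Jungen ein Buch\n"
    "    the.NOM teacher gave the.DAT boy a book\n"
    "    'The teacher gave the boy a book.'\n"
    "As the example shows, ...\n"
)
for block in find_igt_candidates(sample):
    print(block)
```

A template this strict misses any instance that deviates from the three-line layout, which is the rigidity noted on the next slide.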
Crawler - Method 1 - Problems
- Rigid format requirements
- Clusters of IGT with multiple languages are incorrectly identified
- PDF extraction corrupts the formatting
Crawler - Method 2
Machine Learning:
- Tag each line based on a feature list
- Convert the best tag sequence into a span sequence “B [ I | BL ]* E”
Tags:
- B = Begin
- I = Inside
- BL = Blank
- E = End
- O = Outside
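The conversion from a tag sequence to spans following “B [ I | BL ]* E” could look like this. It is a minimal sketch: the function name `tags_to_spans` and the single-letter encoding are assumptions made for illustration, and the line tags are taken as given (no classifier is included).

```python
import re

# Map each line tag to one character so the span pattern becomes a regex.
TAG_CODE = {"B": "b", "I": "i", "BL": "l", "E": "e", "O": "o"}


def tags_to_spans(tags):
    """Return (first_line, last_line) index pairs for every B [I|BL]* E run."""
    encoded = "".join(TAG_CODE[t] for t in tags)
    return [(m.start(), m.end() - 1) for m in re.finditer(r"b[il]*e", encoded)]


# One tag per document line; two IGT instances at lines 1-3 and 5-8.
tags = ["O", "B", "I", "E", "O", "B", "BL", "I", "E", "O"]
print(tags_to_spans(tags))
```

Sequences that do not fit the pattern, such as a B that is never closed by an E, yield no span in this sketch.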
Crawler - Method 2 - Features
- Feature1 (F1): words of the current line
- Feature2 (F2): collection of 16 IGT features (quotes, numbering, tokens)
- Feature3 (F3): tags for previous lines
- Feature4 (F4): tags for neighboring lines
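To make the idea of line-level IGT features concrete, here is a small sketch that computes four cues for a single line. The slide does not list the 16 features, so the feature names and heuristics below are illustrative assumptions rather than the feature set used for ODIN.

```python
import re


def line_features(line):
    """Compute a few hypothetical IGT cues for one line of a document."""
    stripped = line.strip()
    return {
        # example numbering such as "(1)" at the start of the line
        "starts_with_numbering": bool(re.match(r"\(\d+\)", stripped)),
        # translation lines are usually quoted
        "starts_with_quote": stripped[:1] in {"'", '"', "‘", "“", "`"},
        # gloss lines tend to contain upper-case grams such as NOM or PST
        "has_uppercase_gram": bool(re.search(r"[A-Z]{2,}", stripped)),
        # number of whitespace-separated tokens
        "token_count": len(stripped.split()),
    }


print(line_features("(1) Der Lehrer gab dem Jungen ein Buch"))
print(line_features("    the.NOM teacher gave the.DAT boy a book"))
print(line_features("    'The teacher gave the boy a book.'"))
```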
Crawler - Results
Approach      Precision   Recall    F-score
Regex         74.95%      52.19%    61.54%
F2            57.02%      48.64%    52.50%
F2+F4         75.50%      76.04%    75.77%
F1+F2+F3+F4   82.29%      81.02%    81.65%
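The F-score column is the harmonic mean of precision and recall, which can be checked with a few lines of Python. The sketch assumes the balanced F1 measure; the function name `f_score` is chosen here. Small deviations in the last digit are expected because the slide reports rounded percentages.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) in percent, as reported in the results table.
results = {
    "Regex": (74.95, 52.19),
    "F2": (57.02, 48.64),
    "F2+F4": (75.50, 76.04),
    "F1+F2+F3+F4": (82.29, 81.02),
}
for name, (p, r) in results.items():
    print(f"{name}: {f_score(p, r):.2f}%")
```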
3.2. Converting raw data
Language ID
Problems for classifiers:
- far too many languages to distinguish between
- not enough training data for “rarer” languages
- clusters of IGT with multiple languages
Language ID - Features
- Feature1: nearest language code
- Feature2: neighboring language codes
- Feature3: n-grams in current IGT
- Feature4: n-grams in all IGT
Result: 83.08% accuracy for 7,816 language codes and 47,728 (code, name) pairs
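The "nearest language code" feature can be illustrated with a simple lookup: find the language name mentioned closest to the IGT and return its code. This is a sketch under stated assumptions. The three-entry `LANGUAGE_TABLE` stands in for the full (code, name) table, and `nearest_language_code` is a hypothetical helper, not ODIN's implementation.

```python
import re

# Tiny stand-in for the (code, name) table; codes are ISO 639-3.
LANGUAGE_TABLE = {"German": "deu", "Welsh": "cym", "Russian": "rus"}


def nearest_language_code(lines, igt_index, table=LANGUAGE_TABLE):
    """Return the code of the language name mentioned closest to the IGT line."""
    names = "|".join(re.escape(name) for name in table)
    pattern = re.compile(rf"\b(?:{names})\b")
    best = None  # (distance in lines, language code)
    for i, line in enumerate(lines):
        for match in pattern.finditer(line):
            distance = abs(i - igt_index)
            if best is None or distance < best[0]:
                best = (distance, table[match.group(0)])
    return best[1] if best else None


# Illustrative document, not taken from ODIN.
document = [
    "Welsh is a verb-initial language.",
    "Consider the following German example:",
    "(1) Der Lehrer gab dem Jungen ein Buch",
]
print(nearest_language_code(document, igt_index=2))
```

When two names are equally close, this sketch keeps the one found first, which is one reason the slide combines this cue with neighboring codes and n-gram features.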
27/36
The final product
28/36
29/36
German
30/36
German
- 5. How is ODIN used?
Usage
- Searching via
  - Language name / code
  - Language family
  - Concept / Gram
- Data enrichment
  - for English
  - for source language
5.1 Typology research
Typology research – IGT enrichment
Typology = the study of classifying languages by organising them into an enumerated list of possible types and identifying them via structural features
Based on: ODIN data -> enriched source languages
Typology research – IGT enrichment
- parse the English translation using an English parser
- align the target sentence and the English translation using the gloss line
- project the phrase structures onto the target sentence
Possible flaw: IGT / English bias (unnatural examples based on another language)
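The alignment and projection steps above can be sketched in a heavily simplified form: source words are linked to English words through matching gloss tokens, and English word-level tags are copied across. Projecting POS tags instead of full phrase structures is a simplification chosen here; the function names, the example sentence and the tag list (standing in for real parser output) are assumptions.

```python
import re


def align_via_gloss(source, gloss, translation):
    """Map source word index -> translation word index using the gloss line."""
    src, gls = source.split(), gloss.split()
    if len(src) != len(gls):
        raise ValueError("source and gloss lines differ in length")
    trans = [w.strip(".,!?'\"").lower() for w in translation.split()]
    alignment = {}
    for i, token in enumerate(gls):
        # "the.NOM" -> ["the", "nom"]; match any piece against the translation
        pieces = [p.lower() for p in re.split(r"[.\-=]", token) if p]
        for j, word in enumerate(trans):
            if word in pieces and j not in alignment.values():
                alignment[i] = j
                break
    return alignment


def project_tags(source, alignment, english_tags):
    """Copy each English word's tag onto the aligned source word."""
    return [english_tags[alignment[i]] if i in alignment else None
            for i in range(len(source.split()))]


# Illustrative example; the tags imitate the output of an English parser.
source = "Gestern gab der Lehrer dem Jungen ein Buch"
gloss = "yesterday gave the.NOM teacher the.DAT boy a book"
translation = "The teacher gave the boy a book yesterday."
english_tags = ["DT", "NN", "VBD", "DT", "NN", "DT", "NN", "NN"]

alignment = align_via_gloss(source, gloss, translation)
print(list(zip(source.split(), project_tags(source, alignment, english_tags))))
```

Because every label comes from the English side, errors or English-specific choices in the parse carry over to the source sentence, which is the bias noted above.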
Typology research - Features
Typology research – Results & Error analysis
Typology research - Results - Error analysis
- Insufficient data
- Skewed or inaccurate data
- Projection error
- Free constituent order