Treebanking a Blackfoot Corpus, Joel Dunham (UBC)


  1. Treebanking a Blackfoot Corpus Joel Dunham UBC

  2. Overview • Blackfoot language • Online Linguistic Database (OLD) • Blackfoot OLD (BOLD) • BOLD Annotation/treebanking

  3. Blackfoot language • Algonquian (Plains): Alberta & Montana • Endangered: < 5000 speakers • Fieldwork: UBC, UCalgary, UMontana

  4. Blackfoot language • Salient properties: • Direct-inverse system • Grammatical animacy • Agglutinative

  5. Blackfoot language • Agglutinative: • kimaaksawohpokooyimasi • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ • 'Why don't you eat with her?'

  6. OLD • Online Linguistic Database • www.onlinelinguisticdatabase.org • Web application for documenting and analyzing languages

  7. OLD • Open source (GPL): Python (Pylons), MySQL, HTML/JS • Powerful search capability: regex, boolean • Multi-user, web-based, collaborative • Multi-media: audio, video, images, text • Auto-linking of morphemes

  8. Blackfoot OLD • OLD web application for Blackfoot (BLAOLD; funded by SSHRC) • http://blaold.webfactional.com/ • Other OLD web apps: • Okanagan OLD (OKAOLD) • Plains Cree OLD (CRKOLD) • etc.

  9. BLAOLD

  10. BLAOLD • Forms (morphemes & sentences): 21,788 (2011-07-25) • morphemes: 5,094 • sentences: 3,193 • unclassified: 13,501 • (word tokens: 20,577)

  11. BLAOLD • Sources: • textual: 16,209 forms • field work: 5,569 forms (and growing...)

  12. BLAOLD • Collections • texts created by ordered references to forms • 135 Collections at present • E.g., Creation Story: • http://blaold.webfactional.com/creationstory

  13. BLAOLD Collection (text) created by referencing Forms entered into the BLAOLD. • ...
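
The idea of a Collection as an ordered list of references to Forms can be illustrated with a small data structure. This is only a sketch: the class names, field names and ID numbers below are hypothetical and are not taken from the OLD's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    """A single entry (morpheme or sentence). Field names are hypothetical."""
    form_id: int
    transcription: str
    translation: str = ""


@dataclass
class Collection:
    """A text defined only by ordered references to Forms."""
    title: str
    form_ids: list = field(default_factory=list)  # order = order in the text

    def render(self, forms_by_id):
        # Look each reference up at display time, so an edit to a Form
        # shows up in every Collection that references it.
        return [forms_by_id[i].transcription for i in self.form_ids]


# Hypothetical IDs; the transcriptions are examples quoted in the slides.
forms = {
    101: Form(101, "kimaaksawohpokooyimasi", "Why don't you eat with her?"),
    102: Form(102, "oma aakííwa iihpóma ónnikii"),
}
text = Collection("Example text", form_ids=[102, 101])
lines = text.render(forms)
print("\n".join(lines))
```

Storing references rather than copies means the same Form can appear in several Collections without being duplicated.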

  14. BLAOLD • Files: • Associate Forms, Collections & Files • 2,159 files (2011-07-25) • 1,744 audio • 259 image • 148 text • 4 video

  15. Morpheme segmentation • Form with morphemic analysis and morpheme gloss lines • Blue text indicates links to morphemic Form entries found by the system • POS string auto-generated: "prev-asp-vta drt-num nan drt-num agra-nan adt-asp-vai-oth-num" • Associated WAV file (tagged as an object language utterance) • Associated JPG (used as a stimulus in elicitation)

  16. BLAOLD: Goal • Improve efficiency of data collection, dissemination & analysis • automate subtasks & improve search • morphological parsing • treebanking?

  17. Morphological Parser • 'A morphological parser for Blackfoot' (Dunham, 2010; WAIL) • input = transcription: • kimaaksawohpokooyimasi • output = <segmentation, morph glosses, POSes>: • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ • agra-adt-oth-adt-vai-fin-thm-agrb-agrb
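
The three output tiers line up morpheme by morpheme, which can be checked mechanically. The sketch below is not the parser itself; it only splits the hyphen-delimited tiers quoted on the slide and pairs them up. The function name `align_tiers` is made up for illustration.

```python
def align_tiers(segmentation, glosses, pos_tags, sep="-"):
    """Split three hyphen-delimited tiers and pair them morpheme by morpheme."""
    tiers = [tier.split(sep) for tier in (segmentation, glosses, pos_tags)]
    lengths = [len(tier) for tier in tiers]
    if len(set(lengths)) != 1:
        raise ValueError(f"tiers differ in length: {lengths}")
    return list(zip(*tiers))


# Tiers copied from the slide.
parse = align_tiers(
    "k-máak-sa-ohpook-ooyi-m-yii-wa-hsi",
    "2-why-NEG-with-eat-TA-DIR-3SG-CONJ",
    "agra-adt-oth-adt-vai-fin-thm-agrb-agrb",
)
for morpheme, gloss, pos in parse:
    print(f"{morpheme:8} {gloss:6} {pos}")
```

A length mismatch between tiers raises an error, which is a cheap sanity check on hand-entered or machine-generated analyses.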

  18. Morphological Parser • Input: kimaaksawohpokooyimasi • FST components: • Phonology: phonology (from a grammar) hand-coded into FST • Morphotactics (lexicon): morphotactics & lexicon extracted programmatically from the BLAOLD • POS/morphemic N-grams used to select most probable parse • Output: • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ • agra-adt-oth-adt-vai-fin-thm-agrb-agrb • Accuracy: ca. 70% • Challenges: • variations in transcription • no hard and fast spelling rules • researchers differ in the extent to which they use the standard phonemic orthography to capture phonetic detail
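
Selecting the most probable parse with N-grams might look roughly like the following. The slides do not say which N-gram order or smoothing was used, so this sketch assumes a bigram model over POS tags with add-one smoothing. The training sequences and the competing candidate are invented for illustration; only the first candidate comes from the slide.

```python
import math
from collections import Counter

# Hypothetical training data: POS sequences of already-analysed words.
TRAINING = [
    "agra-adt-vai-fin-thm-agrb",
    "agra-adt-oth-adt-vai-fin-thm-agrb-agrb",
    "adt-asp-fin-fin-thm",
    "agra-vai-agrb",
]


def bigrams(tags):
    """Return adjacent tag pairs, with start and end markers added."""
    padded = ["<s>"] + tags + ["</s>"]
    return list(zip(padded, padded[1:]))


bigram_counts = Counter()
context_counts = Counter()
vocab = {"</s>"}
for sequence in TRAINING:
    tags = sequence.split("-")
    vocab.update(tags)
    for prev, cur in bigrams(tags):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1


def log_prob(pos_string):
    """Add-one smoothed bigram log-probability of a hyphenated POS string."""
    total = 0.0
    for prev, cur in bigrams(pos_string.split("-")):
        numerator = bigram_counts[(prev, cur)] + 1
        denominator = context_counts[prev] + len(vocab)
        total += math.log(numerator / denominator)
    return total


def best_parse(candidates):
    """Pick the candidate POS string with the highest log-probability."""
    return max(candidates, key=log_prob)


candidates = [
    "agra-adt-oth-adt-vai-fin-thm-agrb-agrb",  # POS string from the slide
    "agra-adt-oth-adt-fin-vai-thm-agrb-agrb",  # hypothetical competitor
]
print(best_parse(candidates))
```

In a real system the candidates would come from the FST and the counts from analysed forms in the database; the ranking step itself stays this simple.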

  19. Morphological Parser • Benefits of a morphological parse(r): • parse online in real time (i.e., during data entry): save researcher time • create more data to improve searching

  20. Morphological Parser • Search example: find all sentences with an overt subject and an overt object • Regex on POS string for 2 nominal roots: • /n[ai][nr].*n[ai][nr].*/
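
The search can be tried out on POS strings quoted elsewhere in these slides. The sketch below uses Python's `re` module and also shows one possible refinement that is not from the slides: requiring the two nominal-root tags to occur in different words. The function names are made up for illustration.

```python
import re

# The pattern from the slide, and the single-tag pattern it is built from.
NOMINAL_PAIR = re.compile(r"n[ai][nr].*n[ai][nr].*")
NOMINAL_ROOT = re.compile(r"n[ai][nr]")


def naive_match(pos_string):
    """True if the slide's regex finds two nominal-root tags anywhere."""
    return NOMINAL_PAIR.search(pos_string) is not None


def stricter_match(pos_string):
    """Assumed refinement: the two tags must sit in two different words."""
    words = pos_string.split()
    return sum(1 for word in words if NOMINAL_ROOT.search(word)) >= 2


# POS string from the morpheme segmentation slide: two separate nominals.
two_nominals = "prev-asp-vta drt-num nan drt-num agra-nan adt-asp-vai-oth-num"
# First clause of the 'He is building that house' example: one compound noun.
one_compound = "drt-num adt-asp-fin-fin-thm drt-num nan-nin"

for pos_string in (two_nominals, one_compound):
    print(naive_match(pos_string), stricter_match(pos_string), pos_string)
```

The second string matches the original regex only because a single compound noun carries two nominal-root tags. The stricter variant rejects it, at the cost of also missing sentences whose subject or object is expressed by something other than a nominal root, such as a demonstrative.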

  21. Morphological Parser • /n[ai][nr].*n[ai][nr].*/ • [Figure: example search results labelled "Good" and "Bad"]

  22. Treebank • (S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma) (NP ónnikii))) • TGrep: 'S < (NP $. (VP < NP))' • [Tree diagram: S over NP and VP; NP over DT and NP; VP over VBD and NP]
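
What the TGrep pattern asks for can be spelled out in a few lines of code. This is a hand-written sketch, not TGrep: it parses the bracketed tree into nested tuples and checks for an S that immediately dominates an NP whose immediately following sister is a VP that immediately dominates an NP. The function names are made up, and the second, object-less tree is a hypothetical variant added for contrast.

```python
def parse_brackets(text):
    """Parse '(LABEL child ...)' into (label, [children]); leaves are strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def read_node():
        nonlocal position
        assert tokens[position] == "("
        position += 1
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(read_node())
            else:
                children.append(tokens[position])
                position += 1
        position += 1  # consume ")"
        return (label, children)

    return read_node()


def label(node):
    return node[0] if isinstance(node, tuple) else None


def subtrees(node):
    """Yield every non-leaf node, top down."""
    if isinstance(node, tuple):
        yield node
        for child in node[1]:
            yield from subtrees(child)


def matches(node):
    """Hand-coded equivalent of: S < (NP $. (VP < NP))."""
    if label(node) != "S":
        return False
    children = node[1]
    for left, right in zip(children, children[1:]):
        if label(left) == "NP" and label(right) == "VP":
            if any(label(child) == "NP" for child in right[1]):
                return True
    return False


tree = parse_brackets(
    "(S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma) (NP ónnikii)))"
)
hits = [node for node in subtrees(tree) if matches(node)]

# Hypothetical variant without an object NP, for contrast.
no_object = parse_brackets("(S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma)))")
no_hits = [node for node in subtrees(no_object) if matches(node)]
print(len(hits), len(no_hits))
```

Unlike the regex over POS strings, this query refers to structure (which NP is inside the VP), which is the kind of search a treebank would make possible.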

  23. Treebank • Assuming a flat morphological structure, the syntactic phrase structure parsing of Blackfoot may actually be easy relative to English • one of the longest sentences in the BLAOLD by character count (69 characters) has only 5 words

  24. Treebank • Words: ann-wa á'p-á-istot-i-m om-yi náápi-moyis ki saaki-á'p-á-istot-i-m-wa-áyi • POS: drt-num adt-asp-fin-fin-thm drt-num nan-nin und adt-adt-asp-fin-fin-thm-agrb-oth • Word tags: DEM VBZ DEM NN CC VBZ • Phrase nodes (tree diagram not reproduced): S S S VP VP NP NP • 'He is building that house and he is still building it.'

  25. Treebank • Worth it to treebank Blackfoot? • Pros: • might significantly improve search, and therefore research efficiency • automated parsing may be relatively easy • Cons: • lots of researcher hours & money • time might be better spent elsewhere, e.g., elicitation

  26. Nitsííkoohtaahsi'taki
