A Corpus For Large-Scale Phonetic Typology, Elizabeth Salesky

  1. A Corpus For Large-Scale Phonetic Typology. Elizabeth Salesky, Eleanor Chodroff, Tiago Pimentel, Matthew Wiesner, Ryan Cotterell, Jason Eisner, Alan W Black. VoxClamantis in deserto: “a voice crying out in the wilderness”

  2. In the beginning, there was SPEECH (Tower of Babel): ‘ipeuhcan’ (Nahuatl), ‘am Anfang’ (German), ‘በመጀመሪያ’ (Amharic), ‘in the beginning’ (English)

  3. In the beginning, there was SPEECH. Then the linguist asked: How do speech and language vary?
 ↳ prior cross-linguistic phonetic studies have relied on reported [language-aggregate] measurements
 We create our new corpus, VoxClamantis v1.0, to answer this question!
 ✔ spoken readings of the Bible
 ✔ >600 languages
 ✔ time-aligned phonemic transcriptions
 ✔ phonetic measures for vowel and sibilant tokens

  4. This talk: ① WHY we want this data ② HOW we create it ③ CASE STUDIES validating the corpus & illustrating two possible uses

  5. Why?

  6. Motivation: Variation in and across languages
 [Figure: vowel spaces for Spanish (5 vowels: /i/, /u/, /o/, /e/, /a/) and Romanian (7 vowels: /i/, /u/, /o/, /e/, /a/, /ɨ/, /ə/)]
 We know phonetic variation exists within a language, but what are its range and limits?
 How does the number and set of phonemic categories influence their realizations?

  7. How?

  8. Resources Needed: ① speech ② transcripts ③ phonemic labels. Grapheme-to-Phoneme (G2P), Amharic: በመጀመሪያ → b ə m ə d ʒ ə m ə r i j a
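
To make the G2P step concrete, here is a minimal sketch of a table-lookup grapheme-to-phoneme mapping for the slide's Amharic example. The table `TOY_G2P` and the function `toy_g2p` are hypothetical names invented for illustration, and the table covers only the six characters of this one word; it is not the mapping used by any of the tools named in the talk.

```python
# Hypothetical toy table (an assumption for illustration): each Ethiopic
# syllable character maps to a space-separated phoneme string.
TOY_G2P = {
    "በ": "b ə",
    "መ": "m ə",
    "ጀ": "d ʒ ə",
    "ሪ": "r i",
    "ያ": "j a",
}


def toy_g2p(word: str) -> str:
    """Map each grapheme to its phonemes; unknown graphemes become '?'."""
    return " ".join(TOY_G2P.get(ch, "?") for ch in word)


print(toy_g2p("በመጀመሪያ"))
```

A lookup like this ignores context, which is roughly why a naive mapping can only give a first-pass transcription.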

  9. Resources Needed: ① speech ② transcripts ③ phonemic labels ④ time alignments ⑤ phonetic measures. Forced alignment (HMM acoustic model), Amharic: በመጀመሪያ → b ə m ə d ʒ ə m ə r i j a. Phonetic measures (R or Praat): formant frequencies, mid-frequency peak, duration…

  10. Extraction Process: ① speech ② transcripts. CMU Wilderness (2019): 699 Bible readings! with ① speech! and ② transcripts! (e.g. Amharic ‘በመጀመሪያ’) >1TB 😲 >6 years of CPU compute 😲

  11. Extraction Process: ① speech ② transcripts. CMU Wilderness dataset. Chapter: ~30min
 1 የፍጥረት አጀማመር በመጀመሪያ እግዚአብሔር ( ኤሎሂም ) ሰማያትንና ምድርን ፈጠረ። 2 ምድርም ቅርጽ የለሽና ባዶ ነበረች። ※ የምድርን ጥልቅ ስፍራ ሁሉ ጨለማ ውጦት ነበር። የእግዚአብሔርም ( ኤሎሂም ) መንፈስ በውሆች ላይ ይረብብ ነበር። 3 ከዚያም እግዚአብሔር ( ኤሎሂም ) “ ብርሃን ይሁን ” አለ፤ ብርሃንም ሆነ። 4 እግዚአብሔርም ( ኤሎሂም ) ብርሃኑ መልካም እንደሆነ አየ፤ ብርሃኑን ከጨለማ ለየ። 5 እግዚአብሔርም ( ኤሎሂም ) ብርሃኑን “ ቀን ” ፣ ጨለማውን “ ሌሊት ” ብሎ ጠራው። መሸ፤ ነጋም፤ የመጀመሪያ ቀን። 6 እግዚአብሔር ( ኤሎሂም ) ፣ “ ውሃን ከውሃ የሚለይ ጠፈር በውሆች መካከል ይሁን ” አለ። 7 ስለዚህ እግዚአብሔር ( ኤሎሂም ) ጠፈርን አድርጎ ከጠፈሩ በላይና ከጠፈሩ በታች ያለውን ውሃ ለየ፤ እንዳለውም ሆነ። 8 እግዚአብሔር ( ኤሎሂም ) ጠፈርን “ ሰማይ ” ብሎ ጠራው። መሸ፤ ነጋም፤ ሁለተኛ ቀን። 9 ከዚያም እግዚአብሔር ( ኤሎሂም ) ፣ “ ከሰማይ በታች ያለው ውሃ በአንድ …
 Utterance: < 30s 😲 በመጀመሪያ

  12. Extraction Process: ① speech ② transcripts ③ phonemic labels. Which phonemes are present?
 [Figure: G2P maps text to phonemes; example text ‘read’ / ‘read’, phonemes /ɹɛt/, /ɹɛd/, vowels /ɛ/, /i/]

  13. Extraction Process: ① speech ② transcripts ③ phonemic labels. Phoneme “Transcriptions”: Grapheme-to-Phoneme
 ① Linguist-created rules (Epitran): 39 of 690 readings
 ② Wisdom of Crowds (Wiktionary/WikiPron) + our own WFST-models (Phonetisaurus 🦖): 18 of 690 readings (disjoint)
 ③ Naïve baseline (Unitran), “first-pass transcription”: all 690 readings 😲

  14. G2P Summary
 “High-resource (HR)”: 57 readings (39 + 18)
 “First-pass (FP)”: ALL 690 readings
 🤕 Why provide FP alignments for languages with HR? We’ll come back to that 😊

  15. Extraction Process: ① speech ② transcripts ③ phonemic labels. Forced alignment (HMM acoustic model), Amharic: b ə m ə d ʒ ə m ə r i j a

  16. Extraction Process: ① speech ② transcripts ③ phonemic labels ④ time alignments. Forced alignment (HMM acoustic model), Amharic: b ə m ə d ʒ ə m ə r i j a; each phoneme (e.g. b) gets a start time and an end time

  17. Extraction Process: ① speech ② transcripts ③ phonemic labels ④ time alignments. Forced alignment (HMM acoustic model), Amharic: b ə m ə d ʒ ə m ə r i j a; each phoneme (e.g. b) gets a start time and an end time

  18. Extraction Process: ① speech ② transcripts ③ phonemic labels ④ time alignments. Amharic phoneme tokens: b, ə, m, … each with a start time and an end time
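
One way to picture a time-aligned phoneme token is as a small record holding a label, a start time, and an end time. The sketch below is illustrative only: the class name `PhonemeToken`, the helper `is_contiguous`, and all time values are assumptions, not the corpus's actual file format or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhonemeToken:
    """A phoneme label with its aligned start and end time in seconds."""
    label: str
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def is_contiguous(tokens):
    """True if every token starts exactly where the previous one ends."""
    return all(a.end == b.start for a, b in zip(tokens, tokens[1:]))


# Made-up times for the first three tokens of the Amharic example.
tokens = [
    PhonemeToken("b", 0.00, 0.05),
    PhonemeToken("ə", 0.05, 0.12),
    PhonemeToken("m", 0.12, 0.20),
]

for token in tokens:
    print(token.label, token.start, token.end, round(token.duration, 3))
```

Duration falls out of the alignment directly, which is why it appears among the phonetic measures without any extra acoustic analysis.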

  19. Extraction Process: ① speech ② transcripts ③ phonemic labels ④ time alignments ⑤ phonetic measures. Phonetic Measures (Praat TextGrid)
 VOWELS (e.g. a, o): formants (F1, F2, F3, F4), eg high-amplitude frequencies
 SIBILANTS (e.g. s, z): spectral peak, COG, duration, ...
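
As an illustration of one sibilant measure, the sketch below computes a spectral center of gravity (COG) as the power-weighted mean frequency of a frame, which is one common definition. The talk does not state the exact settings used for the corpus, so the weighting, the frame length, and the synthetic 4000 Hz test tone are all assumptions; the function names are hypothetical.

```python
import cmath
import math


def power_spectrum(samples, sample_rate):
    """Return (frequency_hz, power) pairs from a plain DFT of one frame."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(samples)
        )
        spectrum.append((k * sample_rate / n, abs(coeff) ** 2))
    return spectrum


def center_of_gravity(spectrum):
    """Power-weighted mean frequency (assumed definition of COG)."""
    total_power = sum(power for _, power in spectrum)
    return sum(freq * power for freq, power in spectrum) / total_power


# Synthetic frame: a pure 4000 Hz tone sampled at 16 kHz (made-up input).
sample_rate = 16000
frame = [math.sin(2 * math.pi * 4000 * i / sample_rate) for i in range(64)]
cog = center_of_gravity(power_spectrum(frame, sample_rate))
print(cog)
```

For a pure tone the COG sits at the tone's frequency; for a real sibilant it summarizes where the noise energy is concentrated.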

  20. Evaluation 🤕 Why provide both Unitran and High-Resource alignments? Use multiple sets of alignments to assess Unitran alignment quality
 ‣ How much does quality vary across languages?
 ‣ Are certain phonemes more accurate than others?
 ‣ What about time alignment accuracy?
 See paper! (+ appendices)

  21. Corpus Summary: VoxClamantis v1.0 provides tokens of phoneme-level measurements in hundreds of languages!
 ‣ 690 recorded readings of the Bible
 ‣ 635 languages (ISO 639-3)
 ‣ 70 language families
 ‣ >400 million aligned phoneme-level segments
 ‣ Subsequent phonetic measures for all vowels and sibilants

  22. Case Studies

  23. Case Studies: Case studies with VoxClamantis v1.0. 48 High-Resource Readings
 Vowels: ~50 phonemes
 Sibilants: /s/, /z/
 [Rotated figure labels, only partly legible: ① reproduction of previous results, validating the resource; ② research at scale, suggesting cross-linguistic principles]

  24. Phonetic Uniformity: Are shared characteristics (eg: vowel height, POA) realized uniformly (eg: measures strongly correlated) within languages (eg: language)?
 Vowels: formants; /i/, /u/: high vowels
 Sibilants: mid-frequency peak; /s/, /z/: alveolar place of articulation
 While variation exists across languages, within a language F1 is strongly correlated
 Supports hypothesis that this may be a universal principle
 Reproduce previous results, but with many more languages

  25. Phonetic Dispersion: Is inventory size correlated with articulatory precision? VOWELS [Figure: vowel charts for Marshallese (4 vowels) and English (20 vowels)]

  26. Phonetic Dispersion: Is inventory size correlated with articulatory precision? [Figure: vowel charts for Marshallese (4 vowels) and English (20 vowels)]

  27. Phonetic Dispersion: Is inventory size correlated with articulatory precision? No (Spearman ρ = 0.11, p = 0.44; Pearson r = 0.11, p = 0.46) [Figure: vowel charts for Marshallese (4 vowels) and English (20 vowels)]
 Supports hypothesis that this may [not] be a universal principle
 Previously shown, but not possible to study at scale
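
The two correlation coefficients reported on this slide can be sketched in plain Python as follows. The data below are invented toy values, not corpus results, and the "precision" numbers stand in for whatever per-language measure the study uses, which the slides do not define; p-values are omitted from this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rho: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Invented toy data: vowel inventory size vs. a per-language precision score.
inventory_size = [4, 5, 7, 10, 20]
precision = [0.30, 0.10, 0.25, 0.15, 0.20]
print(spearman(inventory_size, precision), pearson(inventory_size, precision))
```

Reporting both coefficients guards against a relationship that is monotonic but not linear, or one driven by a few extreme inventories.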

  28. CAUTION
 Utterance alignment: Filter -- in future, realign!
 Automatic phoneme labels: Better G(+A)2P; Curate more resources!
 Alignment assessment!
 Corpus representation (e.g. speakers) 😲: Curate more resources!

  29. Summary

  30. Conclusion: VoxClamantis v1.0 corpus: voxclamantisproject.github.io
 Aligned phoneme-level segments in hundreds of languages: 57 high-resource, 690 first-pass
 😲 methodology is not perfect – version 1.0!
 ⬇ download 🥴 use for research ⬆ contribute to v2.0!

  31. Contact Us! Questions! Comments! Contributions!
 voxclamantisproject.github.io
 voxclamantisproject@gmail.com
 Elizabeth Salesky, Eleanor Chodroff, Tiago Pimentel, Matthew Wiesner, Ryan Cotterell, Jason Eisner, Alan W Black. VoxClamantis in deserto: “a voice crying out in the wilderness”
