A Corpus For Large-Scale Phonetic Typology Elizabeth Salesky - - PowerPoint PPT Presentation



SLIDE 1

A Corpus For Large-Scale Phonetic Typology

Elizabeth Salesky, Eleanor Chodroff, Tiago Pimentel, Matthew Wiesner, Ryan Cotterell, Alan W Black, Jason Eisner

VoxClamantis in deserto: “a voice crying out in the wilderness”

SLIDE 2

In the beginning, there was SPEECH

‘in the beginning’ (English)
‘ipeuhcan’ (Nahuatl)
‘am Anfang’ (German)
‘በመጀመሪያ’ (Amharic)

Tower of Babel

SLIDE 3

In the beginning, there was SPEECH

‘in the beginning’ (English)
‘ipeuhcan’ (Nahuatl)
‘am Anfang’ (German)
‘በመጀመሪያ’ (Amharic)

Then the linguist asked:

How do speech and language vary?

↳ prior cross-linguistic phonetic studies have relied on reported [language-aggregate] measurements

We create our new corpus, VoxClamantis v1.0, to answer this question!

✔ spoken readings of the Bible
✔ >600 languages
✔ time-aligned phonemic transcriptions
✔ phonetic measures for vowel and sibilant tokens

SLIDE 4

This talk

① WHY we want this data
② HOW we create it
③ CASE STUDIES validating the corpus & illustrating two possible uses

SLIDE 5

Why?

SLIDE 6

Motivation

Variation in and across languages

⑤ Spanish: /i/ /u/ /o/ /a/ /e/
⑦ Romanian: /i/ /u/ /o/ /a/ /e/ /ɨ/ /ə/

[Figure: scattered "s" tokens illustrating variation]

We know there is phonetic variation within a language, but what are its range and limits? How do the number and set of phonemic categories influence their realizations?

SLIDE 7

How?

SLIDE 8

Resources Needed

① speech ② transcripts ③ phonemic labels

Amharic: በመጀመሪያ

Grapheme-to-Phoneme (G2P): bəmədʒəmərija

SLIDE 9

Resources Needed

① speech ② transcripts ③ phonemic labels ④ time alignments ⑤ phonetic measures

Amharic: በመጀመሪያ

b ə m ə dʒ ə m ə r i j a

Forced alignment (HMM acoustic model)

Phonetic measures (R or Praat): formant frequencies, mid-frequency peak, duration…

SLIDE 10

Extraction Process

① speech ② transcripts

CMU Wilderness (2019): 699 Bible readings! with ① speech! and ② transcripts!

>1TB 😲
>6 years of CPU compute 😲

‘በመጀመሪያ’ (Amharic)

SLIDE 11

Extraction Process

① speech ② transcripts

CMU Wilderness dataset

Chapter: ~30min
Utterance: <30s

በመጀመሪያ

1 የፍጥረት አጀማመር በመጀመሪያ እግዚአብሔር (ኤሎሂም) ሰማያትንና ምድርን ፈጠረ። 2 ምድርም ቅርጽ የለሽና ባዶ ነበረች።※ የምድርን ጥልቅ ስፍራ ሁሉ ጨለማ ውጦት ነበር። የእግዚአብሔርም (ኤሎሂም) መንፈስ በውሆች ላይ ይረብብ ነበር። 3 ከዚያም እግዚአብሔር (ኤሎሂም) “ብርሃን ይሁን” አለ፤ ብርሃንም ሆነ። 4 እግዚአብሔርም (ኤሎሂም) ብርሃኑ መልካም እንደሆነ አየ፤ ብርሃኑን ከጨለማ ለየ። 5 እግዚአብሔርም (ኤሎሂም) ብርሃኑን “ቀን”፣ ጨለማውን “ሌሊት” ብሎ ጠራው። መሸ፤ ነጋም፤ የመጀመሪያ ቀን። 6 እግዚአብሔር (ኤሎሂም)፣ “ውሃን ከውሃ የሚለይ ጠፈር በውሆች መካከል ይሁን” አለ። 7 ስለዚህ እግዚአብሔር (ኤሎሂም) ጠፈርን አድርጎ ከጠፈሩ በላይና ከጠፈሩ በታች ያለውን ውሃ ለየ፤

እንዳለውም ሆነ። 8 እግዚአብሔር (ኤሎሂም) ጠፈርን “ሰማይ” ብሎ ጠራው። መሸ፤ ነጋም፤ ሁለተኛ ቀን። 9 ከዚያም እግዚአብሔር (ኤሎሂም)፣ “ከሰማይ በታች ያለው ውሃ በአንድ. 


😲

SLIDE 12

Extraction Process

① speech ② transcripts ③ phonemic labels

Which phonemes are present?

text → G2P → phonemes

read /ɛ/
read /i/

/ɹɛt/ /ɹɛd/
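
The text → G2P → phonemes step can be pictured with a minimal rule-based sketch. The rule table below is a hypothetical toy (it is not the Epitran, WikiPron, or Unitran rule set), and greedy longest-match is only one simple way to apply such rules. A spelling-only mapping like this always returns the same output for the same spelling, which is why a word such as "read" is hard to label from text alone.

```python
# Hypothetical toy grapheme-to-phoneme rules, for illustration only.
TOY_RULES = {
    "sh": ["ʃ"],
    "ch": ["tʃ"],
    "s": ["s"],
    "h": ["h"],
    "i": ["i"],
    "p": ["p"],
    "a": ["a"],
    "t": ["t"],
}


def g2p(word, rules):
    """Convert a spelled word to phoneme labels by greedy longest match."""
    longest = max(len(grapheme) for grapheme in rules)
    phonemes = []
    pos = 0
    while pos < len(word):
        # Try the longest grapheme chunk first, then shorter ones.
        for size in range(min(longest, len(word) - pos), 0, -1):
            chunk = word[pos:pos + size]
            if chunk in rules:
                phonemes.extend(rules[chunk])
                pos += size
                break
        else:
            raise ValueError(f"no rule for {word[pos]!r}")
    return phonemes


print(g2p("ship", TOY_RULES))
print(g2p("chat", TOY_RULES))
```

Trying multi-character graphemes before single characters keeps "sh" from being read as "s" followed by "h".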

SLIDE 13

Extraction Process

① speech ② transcripts ③ phonemic labels

Phoneme “Transcriptions”: Grapheme-to-Phoneme

① Linguist-created rules (Epitran): 39 readings
② Wisdom of Crowds (Wiktionary/WikiPron) + our own WFST models (Phonetisaurus 🦖): 18 readings
③ Naïve baseline (Unitran): all 690 readings 😲 “first-pass transcription”

(① and ② are disjoint)

[Figure: coverage of each source out of 690 readings; other labels: 64, 165]

SLIDE 14

G2P Summary

“High-resource (HR)”: 57 readings (39 + 18)
“First-pass (FP)”: ALL 690 readings

🤕 Why provide FP alignments for languages with HR?

We’ll come back to that 😊

SLIDE 15

Extraction Process

① speech ② transcripts ③ phonemic labels

Amharic: bəmədʒəmərija

Forced alignment (HMM acoustic model)

SLIDE 16

Extraction Process

① speech ② transcripts ③ phonemic labels ④ time alignments

Amharic: b ə m ə dʒ ə m ə r i j a

Forced alignment (HMM acoustic model)

b: start time → end time

SLIDE 17

Extraction Process

① speech ② transcripts ③ phonemic labels ④ time alignments

Amharic: b ə m ə dʒ ə m ə r i j a

Forced alignment (HMM acoustic model)

b: start time → end time

SLIDE 18

Extraction Process

① speech ② transcripts ③ phonemic labels ④ time alignments

Amharic: b ə m …

Phoneme tokens:

b: start time → end time
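
One way to picture a phoneme token is as a label with a start and end time, from which duration follows directly. The sketch below is an assumption about representation, not the corpus's actual file format; the class name, helper name, and the boundary times are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhonemeToken:
    """A time-aligned phoneme: label plus start and end time in seconds."""
    label: str
    start: float
    end: float

    @property
    def duration(self):
        return self.end - self.start


def tokens_from_boundaries(labels, boundaries):
    """Pair N phoneme labels with N+1 consecutive boundary times."""
    if len(boundaries) != len(labels) + 1:
        raise ValueError("need exactly one more boundary than labels")
    return [
        PhonemeToken(label, start, end)
        for label, start, end in zip(labels, boundaries, boundaries[1:])
    ]


# Made-up boundary times for the first three phonemes of the example word.
tokens = tokens_from_boundaries(["b", "ə", "m"], [0.0, 0.25, 0.5, 1.0])
for token in tokens:
    print(token.label, token.start, token.end, token.duration)
```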

SLIDE 19

Extraction Process

① speech ② transcripts ③ phonemic labels ④ time alignments ⑤ phonetic measures

Phonetic Measures

PRAAT TEXTGRID

VOWELS (a a): Formants (F1, F2, F3, F4), eg high-amplitude frequencies
SIBILANTS (z z s): Spectral peak, COG, Duration, ...
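
As a rough illustration of two of the sibilant measures, the sketch below computes a center of gravity (COG) as the power-weighted mean frequency and a spectral peak as the highest-power frequency within a band. It assumes a power spectrum has already been computed; the function names, the band limits, and the toy spectrum are hypothetical and do not reflect the settings used to build the corpus.

```python
def center_of_gravity(freqs_hz, power):
    """Power-weighted mean frequency of a spectrum."""
    total = sum(power)
    if total <= 0:
        raise ValueError("spectrum has no energy")
    return sum(f * p for f, p in zip(freqs_hz, power)) / total


def spectral_peak(freqs_hz, power, lo_hz, hi_hz):
    """Frequency with the highest power inside the band [lo_hz, hi_hz]."""
    band = [(p, f) for f, p in zip(freqs_hz, power) if lo_hz <= f <= hi_hz]
    if not band:
        raise ValueError("no spectral bins inside the band")
    return max(band)[1]


# Toy spectrum with four bins (made-up values).
freqs = [1000.0, 2000.0, 3000.0, 4000.0]
power = [1.0, 1.0, 2.0, 0.0]
print(center_of_gravity(freqs, power))
print(spectral_peak(freqs, power, 0.0, 5000.0))
```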

SLIDE 20

Evaluation

See paper! (+ appendices)

  • How much does quality vary across languages?
  • Are certain phonemes more accurate than others?
  • What about time alignment accuracy?

🤕 Why provide both Unitran and High-Resource alignments?

Use multiple sets of alignments to assess Unitran alignment quality
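
One plausible way to compare two sets of phoneme labels for the same utterance is a phoneme error rate based on edit distance, treating one set as the reference. This is a sketch under that assumption; the evaluation actually reported in the paper and its appendices may differ, and the label sequences below are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Invented labels: a reference sequence and a first-pass guess.
reference = ["b", "ə", "m", "ə"]
first_pass = ["b", "e", "m"]
print(phoneme_error_rate(reference, first_pass))
```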

SLIDE 21

Corpus Summary

VoxClamantis v1.0 provides tokens of phoneme-level measurements in hundreds of languages!

  • 690 recorded readings of the Bible
  • 635 languages (ISO 639-3)
  • 70 language families
  • >400 million aligned phoneme-level segments
  • Subsequent phonetic measures for all vowels and sibilants

SLIDE 22

Case Studies

SLIDE 23

Case Studies

Case studies with VoxClamantis v1.0

48 High-Resource Readings

Vowels: ~50 phonemes
Sibilants: /s/ /z/

① Reproduction of previous results validates resource
② Research at scale suggests general cross-linguistic principles

SLIDE 24

Phonetic Uniformity

Are shared characteristics realized uniformly within languages?
(eg: vowel height, POA) (eg: measures strongly correlated) (eg: language)

Formants: Vowels
/i/, /u/: high vowels

Mid-Freq Peak: Sibilants
/s/, /z/: alveolar place of articulation

Reproduce previous results, but with many more languages

While variation exists across languages, within a language F1 is strongly correlated

Supports hypothesis that this may be a universal principle
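
The "strongly correlated" claim can be pictured as a Pearson correlation over per-language means, for example mean F1 of /i/ against mean F1 of /u/ with one point per language. The sketch below assumes that setup; the function name and the numbers are made up and are not corpus results.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-language mean F1 values in Hz (one entry per language).
mean_f1_i = [280.0, 300.0, 320.0, 350.0]
mean_f1_u = [300.0, 310.0, 345.0, 360.0]
print(pearson_r(mean_f1_i, mean_f1_u))
```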

SLIDE 25

Phonetic Dispersion

Is inventory size correlated with articulatory precision?

VOWELS

Marshallese: 4 vowels
English: 20 vowels

[Figure: vowel charts. Labels in one chart: e æ ɛ i; labels in the other: ɜ: i: ə u u: ɚ a: o ɑ ɑ: ɪ ɒ ɔ ɔ: ᵿ e æ ɛ i]

SLIDE 26

Phonetic Dispersion

Is inventory size correlated with articulatory precision?

Marshallese: 4 vowels
English: 20 vowels

[Figure: vowel charts. Labels in one chart: e æ ɛ i; labels in the other: ɜ: i: ə u u: ɚ a: o ɑ ɑ: ɪ ɒ ɔ ɔ: ᵿ e æ ɛ i]

SLIDE 27

Phonetic Dispersion

Is inventory size correlated with articulatory precision?

Marshallese: 4 vowels
English: 20 vowels

[Figure: vowel charts. Labels in one chart: e æ ɛ i; labels in the other: ɜ: i: ə u u: ɚ a: o ɑ ɑ: ɪ ɒ ɔ ɔ: ᵿ e æ ɛ i]

Previously shown, but not possible to study at scale

No (Spearman ρ = 0.11, p = 0.44; Pearson r = 0.11, p = 0.46)

Supports hypothesis that this may [not] be a universal principle
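
Spearman's ρ is the Pearson correlation of the ranks of the two samples, so it captures monotonic rather than strictly linear association. The sketch below assumes average ranks for ties; the helper names and the data are illustrative and do not reproduce the reported values or their p-values.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Made-up data: vowel inventory size against a dispersion score per language.
inventory_size = [4, 5, 7, 20]
dispersion = [0.9, 1.4, 1.1, 1.3]
print(spearman_rho(inventory_size, dispersion))
```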

SLIDE 28

CAUTION 😲

Utterance alignment
Automatic phoneme labels
Alignment assessment!
Corpus representation (e.g. speakers)

Filter -- in future, realign!
Better G(+A)2P
Curate more resources!
Curate more resources!

SLIDE 29

Summary

SLIDE 30

VoxClamantis v1.0 corpus: Conclusion

aligned phoneme-level segments in hundreds of languages
57 high-resource, 690 first-pass

😲 methodology is not perfect – version 1.0!
⬇ download
🥴 use for research
⬆ contribute to v2.0!

voxclamantisproject.github.io

SLIDE 31

Questions! Comments! Contributions!

Contact Us!

voxclamantisproject@gmail.com
voxclamantisproject.github.io

Elizabeth Salesky, Eleanor Chodroff, Tiago Pimentel, Matthew Wiesner, Ryan Cotterell, Alan W Black, Jason Eisner

VoxClamantis in deserto: “a voice crying out in the wilderness”