A quantitative comparison of sung and spoken lyrics
NUS Sung and Spoken Lyrics Corpus
Zhiyan Duan, Haotian Fang, Bo Li, Khe Chai Sim, Ye Wang
❖ Motivation
❖ Dataset Description
❖ Duration Analysis
❖ Spectral Analysis
❖ Conclusion
❖ Future Work
❖ Understanding the characteristics of the singing voice
❖ Benefiting a wide range of research problems
❖ Lack of a comprehensive dataset with phoneme-level annotations
❖ Diversity: in gender, accent, tempo, etc.
❖ Size: number of songs, subjects
❖ Phonetic richness: to get the most out of selected songs
❖ Phonetic balance: to minimize bias
❖ Tempo balance: to cover songs with different tempos
❖ Popularity: easier to recruit subjects
❖ Ease of learning: easier for subjects to learn
❖ Songs: 20
❖ Est. phoneme count: 140 to 980 per song
❖ Tempo: 68 to 150 bpm
❖ 6 males, 6 females
❖ All levels of vocal experience: amateur to 10+ years of vocal training
❖ All common voice types: soprano, alto, tenor, baritone and bass
[Figure: Number of subjects with different accents. Accent groups: North American, Mild Malay, Malay, Mild Singaporean, Singaporean, North Chinese. Series: Singing, Speech.]
❖ Sound-proof recording studio
❖ 44.1 kHz, 16-bit
❖ Pro Tools 9
❖ Metronome with downbeat accent (through earphones)
❖ Lyrics printouts on a music stand
❖ Phoneme set: CMU Dictionary *
❖ Annotators: with musical and phonetic backgrounds
❖ Software: Audacity
* http://www.speech.cs.cmu.edu/cgi-bin/cmudict
❖ Annotated sung tracks: 48
❖ Subjects: 12 (6 male, 6 female), 4 tracks per subject
❖ Total length: 169 min
❖ Phoneme count: 25,474
❖ Spoken data: alignment of labels from sung data
❖ Focus on consonants
❖ Stretching in time and subject variations
❖ Proportion in syllable and position effects
❖ Comparison among different types of consonants
Class        CMU Phonemes
Vowels       AA, AE, AH, AO, AW, AY, EH, ER, EY, IH, IY, OW, OY, UH, UW
Semivowels   W, Y
Stops        B, D, G, K, P, T
Affricates   CH, JH
Fricatives   DH, F, S, SH, TH, V, Z, ZH
Aspirates    HH
Liquids      L, R
Nasals       M, N, NG
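The class table above can be written down as a lookup structure for later analysis. The following is a minimal sketch, not the authors' tooling: the names `PHONEME_CLASSES`, `phoneme_class`, and `is_consonant` are hypothetical, and stripping a trailing stress digit (as in `AH0`) assumes the labels follow the CMU dictionary convention of marking stress on vowels.

```python
# Phoneme classes exactly as listed in the table (CMU phoneme set).
PHONEME_CLASSES = {
    "Vowels": ["AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
               "EY", "IH", "IY", "OW", "OY", "UH", "UW"],
    "Semivowels": ["W", "Y"],
    "Stops": ["B", "D", "G", "K", "P", "T"],
    "Affricates": ["CH", "JH"],
    "Fricatives": ["DH", "F", "S", "SH", "TH", "V", "Z", "ZH"],
    "Aspirates": ["HH"],
    "Liquids": ["L", "R"],
    "Nasals": ["M", "N", "NG"],
}

# Reverse lookup: phoneme label -> class name.
CLASS_OF = {p: cls for cls, phones in PHONEME_CLASSES.items() for p in phones}


def phoneme_class(label: str) -> str:
    """Return the class of a phoneme label, ignoring a trailing stress digit.

    Assumption: labels may carry CMU-style stress markers (0, 1, 2).
    """
    return CLASS_OF[label.rstrip("012")]


def is_consonant(label: str) -> bool:
    """Everything outside the 'Vowels' class is treated as a consonant here."""
    return phoneme_class(label) != "Vowels"
```

For example, `phoneme_class("NG")` gives `"Nasals"`, and `is_consonant("AH0")` is `False` because the stress digit is stripped before lookup.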
❖ Intuitively, vowels can be stretched arbitrarily
❖ Consonants are expected to be less stretchable
[Figure: Duration of vowels and consonants in speech and singing. Axis: Time (s).]
[Figure: Average stretching ratio comparison of different types of consonants (semivowels, stops, affricates, fricatives, aspirates, liquids, nasals), for male, female, and overall. Axis: Average Stretching Ratio.]
Stretching ratio = singing duration / speech duration
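The stretching ratio defined above (singing duration divided by speech duration) could be averaged per consonant class roughly as follows. This is a sketch under assumptions: the function name, the input layout, and the sample durations are all hypothetical, and averaging the per-phoneme ratios (rather than dividing mean durations) is one possible reading that the slides do not confirm.

```python
from collections import defaultdict
from statistics import mean


def average_stretching_ratio(pairs, class_of):
    """Average per-phoneme stretching ratio for each phoneme class.

    pairs:    iterable of (phoneme, sung_seconds, spoken_seconds)
    class_of: mapping from phoneme label to class name
    """
    ratios = defaultdict(list)
    for phoneme, sung, spoken in pairs:
        if spoken <= 0:
            continue  # skip entries without a usable spoken duration
        # Stretching ratio = singing duration / speech duration
        ratios[class_of[phoneme]].append(sung / spoken)
    return {cls: mean(vals) for cls, vals in ratios.items()}


# Hypothetical durations, for illustration only (not corpus data).
class_of = {"S": "Fricatives", "M": "Nasals", "N": "Nasals"}
pairs = [("S", 0.12, 0.10), ("M", 0.15, 0.10), ("N", 0.25, 0.10)]
result = average_stretching_ratio(pairs, class_of)
```

A ratio above 1 means the phoneme lasts longer when sung than when spoken.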
[Figure: Probability density functions of the consonant duration stretching ratio, compared by gender.]
             Gender   Accent             Musical Exposure
Subject 05   Female   Malay              2 years of choral experience
Subject 08   Male     Northern Chinese   No vocal training
[Figure: Comparison of the consonant duration stretching ratios of subjects 05 and 08.]
[Figure: Phoneme proportion in syllable comparison of different types of consonants (semivowels, stops, affricates, fricatives, aspirates, liquids, nasals), for male, female, and overall. Axis: Consonant Proportion in Syllable (%).]
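The quantity plotted above, consonant proportion in syllable (%), can be illustrated with a small calculation. The sketch below assumes the proportion is the consonant's duration divided by the duration of its whole syllable, times 100; the slides do not spell out the formula, and the function name and sample durations are hypothetical.

```python
def consonant_proportion(consonant_sec: float, syllable_sec: float) -> float:
    """Share of a syllable's duration taken by one consonant, in percent.

    Assumption: proportion = consonant duration / syllable duration * 100.
    """
    if syllable_sec <= 0:
        raise ValueError("syllable duration must be positive")
    return 100.0 * consonant_sec / syllable_sec


# Hypothetical syllable "go": /G/ lasts 0.08 s and /OW/ lasts 0.32 s.
syllable = [("G", 0.08), ("OW", 0.32)]
syllable_sec = sum(duration for _, duration in syllable)
proportion_g = consonant_proportion(0.08, syllable_sec)  # about 20 percent
```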
❖ Syllabic proportions of consonants are higher in males
❖ Absolute length of both consonants and syllables are
Type         Description                                             Example
Starting     At the beginning of a word                              /g/ in go
Preceding    Preceding a vowel, but not at the beginning of a word   /m/ in small
Succeeding   Succeeding a vowel, but not at the end                  /l/ in angel
Ending       At the end of a word                                    /t/ in at
[Figure: The effect of positioning on consonant proportion in syllable, by consonant type (semivowels, stops, affricates, fricatives, aspirates, liquids, nasals) and position (starting, preceding, succeeding, ending). Axis: Consonant Proportion in Syllable (%).]
❖ Likelihood score comparison of sung and spoken phonemes
❖ Discrepancies between the effects of duration and pitch on MFCC features
❖ Use a GMM-HMM system trained on the WSJ0 corpus
❖ Perform alignment on both speech and singing data
❖ Phoneme boundaries are fixed for sung tracks
[Diagram: Spoken phonemes and sung phonemes are each passed through the GMM-HMM system, which outputs a score for each.]
[Figure: Average likelihood difference comparison of different types of phonemes (vowels, semivowels, stops, affricates, fricatives, aspirates, liquids, nasals), for male, female, and overall. Axis: Average Likelihood Difference.]
Average likelihood difference = |Average likelihood score (sung) - Average likelihood score (spoken)|
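The average likelihood difference defined above is a direct computation once per-phoneme scores are available. The sketch below assumes the scores are plain numbers already produced by the acoustic model; the function name and the sample values are hypothetical and are not taken from the corpus.

```python
from statistics import mean


def average_likelihood_difference(sung_scores, spoken_scores):
    """|average likelihood score (sung) - average likelihood score (spoken)|."""
    return abs(mean(sung_scores) - mean(spoken_scores))


# Hypothetical scores for one phoneme class, for illustration only.
sung = [-80.0, -90.0]
spoken = [-60.0, -70.0]
difference = average_likelihood_difference(sung, spoken)
```

Because of the absolute value, the measure says how far apart the two averages are, not which of them is higher.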
❖ Discretize phoneme duration/pitch into 10 bins
❖ Ensure bins have balanced cumulative density masses
❖ Cluster using decision tree
❖ Lower reduction rate indicates larger impact on low
[Figure: Model reduction rate for duration and pitch, sung versus spoken. Axis: Model Reduction Rate.]
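The discretization step described above (10 bins with balanced probability mass) corresponds to equal-frequency, or quantile, binning. The following is a minimal sketch of that idea using only the Python standard library; it is one plausible way to do it, not necessarily the procedure used in the study, and the function name and sample values are hypothetical. The decision-tree clustering step is not covered.

```python
from bisect import bisect_right
from statistics import quantiles


def equal_mass_bins(values, n_bins=10):
    """Assign each value a bin index in 0..n_bins-1 using quantile cut points.

    Cut points are chosen so that each bin holds roughly the same number
    of samples, which balances the probability mass across bins.
    """
    cuts = quantiles(values, n=n_bins)  # returns n_bins - 1 cut points
    return [bisect_right(cuts, v) for v in values]


# Hypothetical, skewed "durations" in seconds (not corpus data).
durations = [(i * i) / 1000 for i in range(1, 101)]
bins = equal_mass_bins(durations)
counts = [bins.count(b) for b in range(10)]
```

With this skewed input, fixed-width bins would leave most samples in the first few bins, whereas the quantile cut points put the same number of samples in every bin.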
❖ Created the NUS-48E dataset of sung and spoken lyrics
❖ Conducted comparative study of sung and spoken
❖ Continue to annotate the remaining tracks (currently 80
❖ Annotate the spoken data
❖ Repeat some previous work related to singing voice
❖ Further exploration based on current observations