NUS Sung and Spoken Lyrics Corpus: A quantitative comparison of sung and spoken lyrics



slide-1
SLIDE 1

A quantitative comparison of sung and spoken lyrics

NUS Sung and Spoken Lyrics Corpus

Zhiyan Duan Haotian Fang Bo Li Khe Chai Sim Ye Wang


slide-2
SLIDE 2

Outline

❖ Motivation
❖ Dataset Description
❖ Duration Analysis
❖ Spectral Analysis
❖ Conclusion
❖ Future Work

slide-3
SLIDE 3

Outline

❖ Motivation
❖ Dataset Description
❖ Duration Analysis
❖ Spectral Analysis
❖ Conclusion
❖ Future Work

slide-4
SLIDE 4

Motivation

❖ Understanding the characteristics of singing voice
❖ Benefiting a wide range of research problems
❖ Lack of a comprehensive dataset with phoneme-level annotation

slide-5
SLIDE 5

Outline

❖ Motivation
❖ Dataset Description
❖ Duration Analysis
❖ Spectral Analysis
❖ Conclusion
❖ Future Work

slide-6
SLIDE 6

Dataset

❖ Diversity: in gender, accent, tempo, etc.
❖ Size: number of songs, subjects
❖ Balance the two

Image by Digitalnative

slide-7
SLIDE 7

Song Selection

❖ Phonetic richness: to get the most out of the selected songs
❖ Phonetic balance: to minimize bias
❖ Tempo balance: to cover songs with different tempos
❖ Popularity: easier to recruit subjects
❖ Ease of learning: easier for subjects to learn

slide-8
SLIDE 8

Song Selection

❖ Songs: 20
❖ Estimated phoneme count: 140 to 980 per song
❖ Tempo: 68 to 150 bpm

slide-9
SLIDE 9

Subjects

❖ 6 males, 6 females
❖ All levels of vocal experience: amateur to 10+ years of vocal training
❖ All common voice types: soprano, alto, tenor, baritone and bass

slide-10
SLIDE 10

Subjects - Accents

[Figure: Number of subjects with different accents. Accent categories: North American, Mild Malay, Malay, Mild Singaporean, Singaporean, North Chinese. Series: Singing, Speech.]

slide-11
SLIDE 11

Recording

❖ Sound-proof recording studio
❖ 44.1 kHz, 16-bit
❖ Pro Tools 9
❖ Metronome with downbeat accent (through earphone)
❖ Lyrics printouts on music stand

slide-12
SLIDE 12

Annotation

❖ Phoneme set: CMU Dictionary *
❖ Annotators: with musical & phonetic backgrounds
❖ Software: Audacity

* http://www.speech.cs.cmu.edu/cgi-bin/cmudict
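For readers who want to work with such annotations: Audacity exports a label track as tab-separated text, one label per line (start time, end time, label text, times in seconds). The sketch below reads text in that layout into phoneme segments. It assumes the corpus labels follow this export format, which the slides do not state, and the names `Segment` and `parse_labels` are illustrative only.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    start: float   # segment start, in seconds
    end: float     # segment end, in seconds
    phoneme: str   # label text, e.g. a CMU phoneme symbol such as "OW"

    @property
    def duration(self) -> float:
        return self.end - self.start


def parse_labels(text: str) -> list[Segment]:
    """Parse label text laid out as start<TAB>end<TAB>label, one per line."""
    segments = []
    for line in text.splitlines():
        fields = line.strip().split("\t")
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        segments.append(Segment(float(fields[0]), float(fields[1]), fields[2]))
    return segments


# Made-up label text for the word "go", for illustration only.
example = "0.000000\t0.120000\tG\n0.120000\t0.870000\tOW\n"
segments = parse_labels(example)
```

Each segment's `duration` can then feed the duration analysis later in the talk.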

slide-13
SLIDE 13

Annotation

slide-14
SLIDE 14

Annotation

❖ Annotated sung tracks: 48 tracks
❖ Subjects: 12 (6 male, 6 female), 4 tracks per subject
❖ Total length: 169 mins
❖ Phoneme count: 25,474
❖ Spoken data: alignment of labels from sung data


slide-15
SLIDE 15

Outline

❖ Motivation
❖ Dataset Description
❖ Duration Analysis
❖ Spectral Analysis
❖ Conclusion
❖ Future Work

slide-16
SLIDE 16

Duration Analysis

❖ Focus on consonants
❖ Stretching in time and subject variations
❖ Proportion in syllable and position effects
❖ Compare among different types of consonants

slide-17
SLIDE 17

Phoneme Classes

Class        CMU Phonemes
Vowels       AA, AE, AH, AO, AW, AY, EH, ER, EY, IH, IY, OW, OY, UH, UW
Semivowels   W, Y
Stops        B, D, G, K, P, T
Affricates   CH, JH
Fricatives   DH, F, S, SH, TH, V, Z, ZH
Aspirates    HH
Liquids      L, R
Nasals       M, N, NG
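The class table above maps directly onto a lookup structure. The sketch below is one possible encoding, not the authors' code; the names `PHONEME_CLASSES`, `CLASS_OF` and `phoneme_class` are illustrative. It also strips the stress digits (0, 1, 2) that the CMU dictionary attaches to vowels.

```python
# Phoneme classes as listed on the slide.
PHONEME_CLASSES = {
    "Vowels": ["AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
               "IH", "IY", "OW", "OY", "UH", "UW"],
    "Semivowels": ["W", "Y"],
    "Stops": ["B", "D", "G", "K", "P", "T"],
    "Affricates": ["CH", "JH"],
    "Fricatives": ["DH", "F", "S", "SH", "TH", "V", "Z", "ZH"],
    "Aspirates": ["HH"],
    "Liquids": ["L", "R"],
    "Nasals": ["M", "N", "NG"],
}

# Inverted index: phoneme symbol -> class name.
CLASS_OF = {p: cls for cls, phonemes in PHONEME_CLASSES.items() for p in phonemes}


def phoneme_class(symbol: str) -> str:
    """Return the class of a CMU symbol, ignoring stress digits (e.g. 'AH0')."""
    return CLASS_OF[symbol.upper().rstrip("012")]
```

For example, `phoneme_class("AH0")` gives `"Vowels"` and `phoneme_class("NG")` gives `"Nasals"`.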

slide-18
SLIDE 18

Consonant Stretching

❖ Intuitively, vowels can be stretched arbitrarily.
❖ Consonants are supposed to be less so

?

slide-19
SLIDE 19

Consonant Stretching

[Figure: Vowel and consonant durations in speech and singing. Axis: Time (s). Series: Speech, Singing.]

slide-20
SLIDE 20

Consonant Stretching

[Figure: Average stretching ratio comparison of different types of consonants. Y-axis: Average Stretching Ratio. Categories: Semivowel, Stops, Affricates, Fricatives, Aspirates, Liquids, Nasals. Series: Male, Female, Overall.]

Stretching Ratio = Singing Duration / Speech Duration
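The stretching ratio can be computed per phoneme and then averaged per consonant class. The following is a minimal sketch under two assumptions not stated on the slide: each sung phoneme has a matching spoken phoneme, and the class average is the mean of per-phoneme ratios. The function names and the sample durations are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def stretching_ratio(singing_duration: float, speech_duration: float) -> float:
    """Stretching Ratio = Singing Duration / Speech Duration."""
    if speech_duration <= 0:
        raise ValueError("speech duration must be positive")
    return singing_duration / speech_duration


def average_ratio_by_class(pairs):
    """pairs: iterable of (consonant_class, singing_seconds, speech_seconds).

    Assumption: the average is the mean of per-phoneme ratios in each class.
    """
    ratios = defaultdict(list)
    for cls, sung, spoken in pairs:
        ratios[cls].append(stretching_ratio(sung, spoken))
    return {cls: mean(values) for cls, values in ratios.items()}


# Made-up durations in seconds, for illustration only.
pairs = [
    ("Nasals", 0.30, 0.10),
    ("Nasals", 0.10, 0.10),
    ("Stops", 0.06, 0.08),
]
averages = average_ratio_by_class(pairs)
```

A ratio above 1 means the phoneme is longer when sung than when spoken; a ratio below 1 means it is shorter.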

slide-21
SLIDE 21

Consonant Stretching - Subject Variations

[Figure: Comparison of the probability density functions of the consonant duration stretching ratio by gender.]

slide-22
SLIDE 22

Consonant Stretching - Subject Variations

Subject      Gender   Accent             Musical Exposure
Subject 05   Female   Malay              2 years of choral experience
Subject 08   Male     Northern Chinese   No vocal training

slide-23
SLIDE 23

Consonant Stretching - Subject Variations

[Figure: Comparison of the consonant duration stretching ratios of subjects 05 and 08.]

slide-24
SLIDE 24

Consonant Proportion

[Figure: Phoneme proportion in syllable comparison of different types of consonants. Y-axis: Consonant Proportion in Syllable (%). Categories: Semivowel, Stops, Affricates, Fricatives, Aspirates, Liquids, Nasals. Series: Male, Female, Overall.]
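The slides do not spell out how the proportion is computed. One plausible reading, used in the sketch below as an explicit assumption, is the consonant's duration as a percentage of the total duration of its syllable. The function name and the sample durations are illustrative.

```python
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}


def consonant_proportions(syllable):
    """syllable: list of (phoneme, duration_seconds) in order.

    Returns a list of (consonant, percent_of_syllable_duration).
    Assumption: proportion = 100 * consonant duration / syllable duration.
    """
    total = sum(duration for _, duration in syllable)
    if total <= 0:
        raise ValueError("syllable duration must be positive")
    return [
        (phoneme, 100.0 * duration / total)
        for phoneme, duration in syllable
        if phoneme not in VOWELS
    ]


# Made-up durations for the syllable "go" (G OW), for illustration only.
proportions = consonant_proportions([("G", 0.05), ("OW", 0.45)])
```

Here the stop G takes up one tenth of the syllable, so its proportion is about 10%.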

slide-25
SLIDE 25

Consonant Proportion

❖ Syllabic proportions of consonants are higher in males
❖ Absolute lengths of both consonants and syllables are higher in males

slide-26
SLIDE 26

Consonant Proportion - Position Effect

Type         Description                                             Example
Starting     At the beginning of a word                              /g/ in go
Preceding    Preceding a vowel, but not at the beginning of a word   /m/ in small
Succeeding   Succeeding a vowel, but not at the end of a word        /l/ in angel
Ending       At the end of a word                                    /t/ in at
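The four position types can be expressed as a small rule-based classifier over a word's phoneme sequence. This is a sketch, not the authors' procedure. The table leaves some cases open (a consonant between two vowels, or one with no adjacent vowel), so the order in which the rules are checked and the "Other" fallback are assumptions made here.

```python
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}


def consonant_position(word, index):
    """Classify the consonant at word[index]; word is a list of CMU phonemes.

    Rule order (an assumption): Starting, Ending, Preceding, Succeeding.
    """
    if word[index] in VOWELS:
        raise ValueError("phoneme at index is a vowel")
    if index == 0:
        return "Starting"
    if index == len(word) - 1:
        return "Ending"
    if word[index + 1] in VOWELS:
        return "Preceding"
    if word[index - 1] in VOWELS:
        return "Succeeding"
    return "Other"  # no adjacent vowel; not covered by the table


go = ["G", "OW"]
small = ["S", "M", "AO", "L"]
hands = ["HH", "AE", "N", "D", "Z"]
at = ["AE", "T"]
```

With these rules, G in "go" is Starting, M in "small" is Preceding, N in "hands" is Succeeding, and T in "at" is Ending.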

slide-27
SLIDE 27

Consonant Proportion - Position Effect

Consonant Proportion in Syllable (%)

[Figure: The effect of positioning on consonant proportion in syllable. Categories: Semivowel, Stops, Affricates, Fricatives, Aspirates, Liquids, Nasals. Series: Start, Preceding, Succeeding, Ending.]

slide-28
SLIDE 28

Outline

❖ Motivation
❖ Dataset Description
❖ Duration Analysis
❖ Spectral Analysis
❖ Conclusion
❖ Future Work

slide-29
SLIDE 29

Spectral Analysis

❖ Likelihood score comparison of sung and spoken phonemes
❖ Discrepancies between the effects of duration & pitch on MFCC features

slide-30
SLIDE 30

Likelihood Score Comparison

❖ Using a GMM-HMM system trained on the WSJ0 corpus
❖ Perform alignment on both speech and singing data
❖ Phoneme boundaries are fixed for sung tracks

slide-31
SLIDE 31

Likelihood Score Comparison

[Diagram: Spoken Phoneme and Sung Phoneme each pass through the GMM-HMM System, producing one Score each.]

slide-32
SLIDE 32

Likelihood Score Comparison

[Figure: Average likelihood difference comparison of different types of phonemes. Y-axis: Average Likelihood Difference. Categories: Vowels, Semivowels, Stops, Affricates, Fricatives, Aspirates, Liquids, Nasals. Series: Male, Female, Overall.]

Average likelihood difference = |Average likelihood score (sung) - Average likelihood score (spoken)|
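The average likelihood difference defined on this slide reduces to a few lines once per-phoneme scores are available. The sketch below assumes the scores are plain per-phoneme numbers (for example log-likelihoods from the alignment); the sample values are made up and the function name is illustrative.

```python
from statistics import mean


def average_likelihood_difference(sung_scores, spoken_scores):
    """|average score (sung) - average score (spoken)| for one phoneme class."""
    return abs(mean(sung_scores) - mean(spoken_scores))


# Made-up scores for one phoneme class, for illustration only.
sung = [-80.0, -90.0]
spoken = [-60.0, -70.0]
difference = average_likelihood_difference(sung, spoken)
```

Because of the absolute value, the measure shows how far the sung and spoken scores are apart, not which of the two is higher.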

slide-33
SLIDE 33

Effects of Duration & Pitch on Acoustic Features

❖ Discretize phoneme duration/pitch into 10 bins
❖ Ensure bins have balanced cumulative density masses
❖ Cluster using decision tree
❖ Lower reduction rate indicates larger impact on low-level acoustic features (i.e. MFCC)
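The first two bullets describe equal-mass (quantile) binning: bin edges are placed so that each of the 10 bins holds roughly the same number of samples. A minimal sketch using only the Python standard library follows. It illustrates the binning step alone, not the decision-tree clustering, and the sample durations are made up.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles


def equal_mass_bins(values, n_bins=10):
    """Assign each value a bin index in 0..n_bins-1 using quantile cut points."""
    cuts = quantiles(values, n=n_bins)  # n_bins - 1 cut points
    bin_ids = [bisect_right(cuts, v) for v in values]
    return bin_ids, cuts


# Made-up phoneme durations in milliseconds, for illustration only.
durations_ms = list(range(10, 1010, 10))
bin_ids, cuts = equal_mass_bins(durations_ms)
counts = Counter(bin_ids)
```

With evenly spread data like this, every bin receives the same number of samples; with skewed real durations, the bins become narrow where the data is dense and wide where it is sparse.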

slide-34
SLIDE 34

Effects of Duration & Pitch on Acoustic Features

[Figure: Model reduction rate for duration and pitch. Y-axis: Model Reduction Rate. Categories: Duration, Pitch. Series: Sung, Spoken.]

slide-35
SLIDE 35

Outline

❖ Motivation
❖ Dataset Description
❖ Duration Analysis
❖ Spectral Analysis
❖ Conclusion
❖ Future Work

slide-36
SLIDE 36

Conclusion

❖ Created the NUS-48E dataset of sung and spoken lyrics
❖ Conducted a comparative study of sung and spoken phonemes in both the time and frequency domains

slide-37
SLIDE 37

Outline

❖ Motivation
❖ Dataset Description
❖ Duration Analysis
❖ Spectral Analysis
❖ Conclusion
❖ Future Work

slide-38
SLIDE 38

Future Work

❖ Continue to annotate the remaining tracks (currently 80 out of 420 are annotated)
❖ Annotate the spoken data
❖ Repeat some previous work related to singing voice using the new dataset
❖ Further exploration based on current observations

slide-39
SLIDE 39

Thank you!

slide-40
SLIDE 40

Question & Answer