Automatic Prosody Labeling - Final Presentation
Andrew Rosenberg
ELEN 6820 - Speech and Audio Processing and Recognition
4/27/05
Overview
- Project Goal
- ToBI standard for prosodic labeling
- Previous Work
- Method
- Results
- Conclusion
Project Goal:
- Automatic assignment of tones tier elements
– Given the waveform, orthographic and break index tiers, predict a subset/simplification of elements in the tones tier.
– Distinct experiments for determining each of pitch accents, phrase tones, and phrase boundary tones.
ToBI Annotation
- The Tones and Break Indices (ToBI) labeling scheme
consists of a speech waveform and 4 tiers:
– Tones
- Annotation of pitch accents and phrasal tones
– Orthographic
- Transcription of text
– Break Index
- Strength of the juncture between words, rated on a scale from 0 to 4.
– Miscellaneous
- Notes about the annotation (e.g., ambiguities, non-speech noise)
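The four tiers can be pictured as parallel annotations over one word sequence. A minimal sketch in Python (the container, field names, and sample utterance are illustrative assumptions, not part of the ToBI standard):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative container for one word of a ToBI transcription.
# Field names and the example labels below are assumptions.
@dataclass
class ToBIWord:
    text: str            # orthographic tier: the word itself
    tone: Optional[str]  # tones tier label, e.g. "H*" or "H* L-L%" (None if unaccented)
    break_index: int     # break index tier, 0-4
    note: str = ""       # miscellaneous tier

utterance = [
    ToBIWord("Marianna", "H*", 1),
    ToBIWord("made", "L+H*", 1),
    ToBIWord("the", None, 0),
    ToBIWord("marmalade", "H* L-L%", 4),  # phrase-final: boundary tone, break index 4
]
```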
ToBI Transcription Example
ToBI Examples
- Pitch Accents (made3.wav):
– H*, L*, L+H*
- Boundary Tones (money.wav):
– L-H%, H-H%, L-L%, H-L%, (H-, L-)
Previous Work
- Ross: “Prediction of abstract prosodic labels for speech synthesis” 1996
– BU Radio News Corpus (~48 minutes)
- Public news broadcasts spoken by 7 speakers
– Uses decision tree output as input to an HMM for pitch accent identification; decision trees for phrase/boundary tone identification
– Employs no acoustic features
- Narayanan: “An Automatic Prosody Recognizer using a Coupled Multi-Stream Acoustic Model and a Syntactic-Prosodic Language Model” 2005
– BU Radio News Corpus
– Detects stressed syllables (collapsed ToBI labels) and all boundaries
– Uses a CHMM on pitch, intensity, and duration to track these “asynchronous” acoustic features, plus a trigram POS/stress-boundary language model
- Wightman: “Automatic Labeling of Prosodic Patterns” 1994
– Single-speaker subset of BNC and an ambiguous-sentence corpus (read speech)
– Like Ross, uses decision tree output as input to an HMM
– Uses many acoustic features
Method
- JRip
– Classification rule learner
– Better at working with nominal attributes
– Produces easier-to-read output
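As a rough picture of what a learned rule list does at classification time, here is a hand-written Python sketch; the two rules and attribute names are hypothetical, not actual JRip output:

```python
# Hypothetical rule list in the style a rule learner like JRip induces:
# an ordered set of if-then rules over a word's attributes, tried in
# order, with a default class when no rule fires.
def classify_accent(word):
    """Return a pitch-accent class for one word (illustrative rules only)."""
    if word["pos"] == "NN" and word["f0_max_z"] > 0.5:
        return "accented"
    if word["num_syllables"] >= 3:
        return "accented"
    return "unaccented"  # default rule

print(classify_accent({"pos": "NN", "f0_max_z": 1.2, "num_syllables": 1}))  # accented
```

The readability advantage noted above comes from this form: each rule is a human-checkable conjunction of attribute tests.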
- Corpus
– Boston Directions Corpus
- 4 speakers
- ~65 minutes of semi-spontaneous speech
- Original Plan:
– HMMs and SVMs
- SVMs took a prohibitive amount of time to train and performed worse.
- HMM implementation problems, and not enough time to implement my own.
Method - Features
- Min, max, mean, and std. dev. of F0 and intensity
- # syllables, duration, approximate vowel length, POS
- F0 slope (weighted)
- z-score of max F0 and intensity
- Phrase-level F0, intensity, and vowel length features
- Phrase position
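A minimal sketch of how several of these per-word features could be computed from frame-level F0 and intensity tracks (the function, feature names, and input format are assumptions for illustration, not the project's actual extraction code):

```python
import statistics

def word_features(f0, intensity, phrase_f0):
    """Illustrative per-word features from frame-level F0 (Hz) and
    intensity (dB) values; phrase_f0 is the F0 track of the whole phrase."""
    feats = {
        "f0_min": min(f0), "f0_max": max(f0),
        "f0_mean": statistics.mean(f0), "f0_std": statistics.pstdev(f0),
        "int_max": max(intensity), "int_mean": statistics.mean(intensity),
    }
    # Linear (unweighted) F0 slope via least squares over the frame index.
    n = len(f0)
    xbar, ybar = (n - 1) / 2, feats["f0_mean"]
    num = sum((x - xbar) * (y - ybar) for x, y in enumerate(f0))
    den = sum((x - xbar) ** 2 for x in range(n))
    feats["f0_slope"] = num / den if den else 0.0
    # z-score of the word's max F0 relative to the surrounding phrase.
    mu, sigma = statistics.mean(phrase_f0), statistics.pstdev(phrase_f0)
    feats["f0_max_z"] = (feats["f0_max"] - mu) / sigma if sigma else 0.0
    return feats
```

Normalizing against the phrase (the z-score) is what lets the same threshold-style rule apply across speakers with different pitch ranges.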
Results - Tasks
- Pitch Accent
– Identification
– Detection
- Phrase Tone identification
- Boundary Tone identification
- Phrase/Boundary Tone
– Identification
– Detection
Results - Pitch Accent Identification
- Accuracy
– Base: 58.8%; Ross*: 80.2%; No Breaks: 78.0%; Best: 79.2%
– *Ross identifies a different subset of ToBI pitch accents
- Relevant Features
– # syllables, duration (previous 2), vowel length (prev, next 2), POS, max & std. dev. F0, F0 slope, max & std. dev. intensity, z-score of F0, phrase-level z-scores of F0 and intensity
Results - Pitch Accent Detection
- Accuracy (Acc.) and true/false positive rates (T/F)
– Wightman: T/F 83/14%
– Narayanan: T/F 79.5/13.2%
– Ross: Acc. 82.5%
– No Breaks: Acc. 83.9%, T/F 80.1/14%
– Best: Acc. 85.7%, T/F 83.2/12.4%
- Baseline: 58.9%
- On BNC, human agreement is 91%; in general, 86-88%
- Identical relevant features as the identification task
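The paired "T/F" numbers can be read as true-positive and false-positive rates. A small sketch of how they and accuracy would be computed for binary accent detection (the helper and its 0/1-label input format are assumptions for illustration):

```python
def detection_scores(gold, pred):
    """Accuracy plus true/false positive rates for binary detection,
    given parallel lists of 0/1 labels (1 = accent present)."""
    pairs = list(zip(gold, pred))
    acc = sum(g == p for g, p in pairs) / len(pairs)
    pos = sum(gold)  # words that actually carry an accent
    tpr = sum(1 for g, p in pairs if g and p) / pos
    fpr = sum(1 for g, p in pairs if not g and p) / (len(pairs) - pos)
    return acc, tpr, fpr
```

Reporting the false-positive rate alongside accuracy matters here because the 58.9% majority-class baseline means a detector can look accurate while over-predicting accents.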
Results - Phrase Tone
- Accuracy
– Base: 77.4%, No Break: 86.7%; Base: 57.9%, Best: 72.4%
- Relevant Features
– Duration of next word; max, min, and mean F0
– Linear F0 slope, z-score of intensity, phrase-level z-scores of F0 and intensity
Results - Boundary Tone Identification
- Accuracy
– Base: 84.5%, No Break: 91.3%; Base: 65.1%, Best: 73.2%
- Relevant Features
– Quadratically weighted F0 slope
Results - Phrase/Boundary Tone Identification
- Accuracy
– Base: 56.3%, Ross: 66.9%; Base: 33.8%, Best: 54.7%
- Relevant Features
– Duration of next two words; POS (current and next 2); max, mean, and slope (all weightings) of F0; mean intensity; phrase-level z-scores of F0 and intensity
– z-score of the difference between max intensity in the current word and in the phrase
Results – Phrase/Boundary Tone Detection
- Accuracy
– Human agreement (in general): 95%
– Best agreement: 93.0% over a 77% baseline
– T/F: Wightman 77/3%, Narayanan 80.9/16.0%, Best 82.5/3.9%
- Relevant Features
– Vowel length (current and next word)
– POS of the next word
Conclusion
- Relatively low-tech acoustic features and ML algorithms can perform competitively with more complicated NLP approaches.
- Break index information was not as helpful as initially suspected.
- Potential Improvements:
– Sequential modeling (HMM)
– Different features
- More sophisticated pitch contour feature
- Content-based features (similar to Ross)