Automatic Prosody Labeling - Final Presentation
Andrew Rosenberg
ELEN 6820 - Speech and Audio Processing and Recognition
4/27/05
Overview
- Project Goal
- ToBI standard for prosodic labeling
- Previous Work
- Method
- Results
- Conclusion
Project Goal:
- Automatic assignment of tones tier elements
– Given the waveform, orthographic and break index tiers, predict a subset/simplification of elements in the tones tier.
– Distinct experiments for determining each of pitch accents, phrase tones, and phrase boundary tones.
ToBI Annotation
- The Tones and Break Indices (ToBI) labeling scheme
consists of a speech waveform and 4 tiers:
– Tones
- Annotation of pitch accents and phrasal tones
– Orthographic
- Transcription of text
– Break Index
- Strength of the juncture between words, rated on a scale from 0 to 4.
– Miscellaneous
- Notes about the annotation (e.g., ambiguities, non-speech noise)
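The four tiers can be pictured as parallel annotations over one word sequence. A minimal sketch in Python (the container, field names, and sample utterance are illustrative assumptions, not part of the ToBI standard):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative container for one word of a ToBI transcription.
# Field names and the example labels below are assumptions.
@dataclass
class ToBIWord:
    text: str            # orthographic tier: the word itself
    tone: Optional[str]  # tones tier label, e.g. "H*" or "H* L-L%" (None if unaccented)
    break_index: int     # break index tier, 0-4
    note: str = ""       # miscellaneous tier

utterance = [
    ToBIWord("Marianna", "H*", 1),
    ToBIWord("made", "L+H*", 1),
    ToBIWord("the", None, 0),
    ToBIWord("marmalade", "H* L-L%", 4),  # phrase-final: boundary tone, break index 4
]
```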
ToBI Transcription Example
ToBI Examples
- Pitch Accents (made3.wav):
– H*, L*, L+H*
- Boundary Tones (money.wav):
– L-H%, H-H%, L-L%, H-L%, (H-, L-)
Previous Work
- Ross: “Prediction of abstract prosodic labels for speech synthesis” 1996
– BU Radio News Corpus (~48 minutes)
- Public news broadcasts spoken by 7 speakers
– Uses decision tree output as input to an HMM for pitch accent identification; decision trees for phrase/boundary tone identification
– Employs no acoustic features
- Narayanan: “An Automatic Prosody Recognizer using a Coupled Multi-Stream Acoustic Model and a Syntactic-Prosodic Language Model” 2005
– BU Radio News Corpus
– Detects stressed syllables (collapsed ToBI labels) and all boundaries
– Uses a CHMM on pitch, intensity, and duration to track these “asynchronous” acoustic features, plus a trigram POS/stress-boundary language model
- Wightman: “Automatic Labeling of Prosodic Patterns” 1994
– Single-speaker subset of BNC and an ambiguous-sentence corpus (read speech)
– Like Ross, uses decision tree output as input to an HMM
– Uses many acoustic features
Method
- JRip
– Classification rule learner
– Better at working with nominal attributes
– Produces easier-to-read output
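As a rough picture of what a learned rule list does at classification time, here is a hand-written Python sketch; the two rules and attribute names are hypothetical, not actual JRip output:

```python
# Hypothetical rule list in the style a rule learner like JRip induces:
# an ordered set of if-then rules over a word's attributes, tried in
# order, with a default class when no rule fires.
def classify_accent(word):
    """Return a pitch-accent class for one word (illustrative rules only)."""
    if word["pos"] == "NN" and word["f0_max_z"] > 0.5:
        return "accented"
    if word["num_syllables"] >= 3:
        return "accented"
    return "unaccented"  # default rule

print(classify_accent({"pos": "NN", "f0_max_z": 1.2, "num_syllables": 1}))  # accented
```

The readability advantage noted above comes from this form: each rule is a human-checkable conjunction of attribute tests.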
- Corpus
– Boston Directions Corpus
- 4 speakers
- ~65 minutes of semi-spontaneous speech
- Original Plan:
– HMMs and SVMs
- SVMs took a prohibitive amount of time to train and performed worse.
- HMM implementation problems, and not enough time to implement my own.
Method - Features
- Min, max, mean, and std. dev. of F0 and intensity
- # syllables, duration, approximate vowel length, POS
- F0 slope (weighted)
- z-score of max F0 and intensity
- Phrase-level F0, intensity, and vowel length features
- Phrase position
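A minimal sketch of how several of these per-word features could be computed from frame-level F0 and intensity tracks (the function, feature names, and input format are assumptions for illustration, not the project's actual extraction code):

```python
import statistics

def word_features(f0, intensity, phrase_f0):
    """Illustrative per-word features from frame-level F0 (Hz) and
    intensity (dB) values; phrase_f0 is the F0 track of the whole phrase."""
    feats = {
        "f0_min": min(f0), "f0_max": max(f0),
        "f0_mean": statistics.mean(f0), "f0_std": statistics.pstdev(f0),
        "int_max": max(intensity), "int_mean": statistics.mean(intensity),
    }
    # Linear (unweighted) F0 slope via least squares over the frame index.
    n = len(f0)
    xbar, ybar = (n - 1) / 2, feats["f0_mean"]
    num = sum((x - xbar) * (y - ybar) for x, y in enumerate(f0))
    den = sum((x - xbar) ** 2 for x in range(n))
    feats["f0_slope"] = num / den if den else 0.0
    # z-score of the word's max F0 relative to the surrounding phrase.
    mu, sigma = statistics.mean(phrase_f0), statistics.pstdev(phrase_f0)
    feats["f0_max_z"] = (feats["f0_max"] - mu) / sigma if sigma else 0.0
    return feats
```

Normalizing against the phrase (the z-score) is what lets the same threshold-style rule apply across speakers with different pitch ranges.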
Results - Tasks
- Pitch Accent
– Identification
– Detection
- Phrase Tone identification
- Boundary Tone identification
- Phrase/Boundary Tone
– Identification
– Detection
Results - Pitch Accent Identification
- Accuracy
– Base: 58.8%; Ross*: 80.2%; No Breaks: 78.0%; Best: 79.2%
– *Ross identifies a different subset of ToBI pitch accents
- Relevant Features
– # syllables, duration (previous 2), vowel length (prev, next 2), POS, max & std. dev. F0, F0 slope, max & std. dev. intensity, z-score of F0, phrase-level z-scores of F0 and intensity
Results - Pitch Accent Detection
- Accuracy (Acc.) and true/false positive rates (T/F)
– Wightman: T/F 83/14%
– Narayanan: T/F 79.5/13.2%
– Ross: Acc. 82.5%
– No Breaks: Acc. 83.9%, T/F 80.1/14%
– Best: Acc. 85.7%, T/F 83.2/12.4%
- Baseline: 58.9%
- On BNC, human agreement is 91%; in general, 86-88%
- Identical relevant features as the identification task
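The paired "T/F" numbers can be read as true-positive and false-positive rates. A small sketch of how they and accuracy would be computed for binary accent detection (the helper and its 0/1-label input format are assumptions for illustration):

```python
def detection_scores(gold, pred):
    """Accuracy plus true/false positive rates for binary detection,
    given parallel lists of 0/1 labels (1 = accent present)."""
    pairs = list(zip(gold, pred))
    acc = sum(g == p for g, p in pairs) / len(pairs)
    pos = sum(gold)  # words that actually carry an accent
    tpr = sum(1 for g, p in pairs if g and p) / pos
    fpr = sum(1 for g, p in pairs if not g and p) / (len(pairs) - pos)
    return acc, tpr, fpr
```

Reporting the false-positive rate alongside accuracy matters here because the 58.9% majority-class baseline means a detector can look accurate while over-predicting accents.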
Results - Phrase Tone
- Accuracy
– Base: 77.4%, No Break: 86.7%; Base: 57.9%, Best: 72.4%
- Relevant Features
– Duration of next word; max, min, and mean F0
– Linear F0 slope, z-score of intensity, phrase-level z-scores of F0 and intensity
Results - Boundary Tone Identification
- Accuracy
– Base: 84.5%, No Break: 91.3%; Base: 65.1%, Best: 73.2%
- Relevant Features
– Quadratically weighted F0 slope
Results - Phrase/Boundary Tone Identification
- Accuracy
– Base: 56.3%, Ross: 66.9%; Base: 33.8%, Best: 54.7%
- Relevant Features
– Duration of next two words; POS (current and next 2); max, mean, and slope (all weightings) of F0; mean intensity; phrase-level z-scores of F0 and intensity
– z-score of the difference between max intensity in the current word and in the phrase
Results – Phrase/Boundary Tone Detection
- Accuracy
– Human agreement (in general): 95%
– Best agreement: 93.0% over a 77% baseline
– T/F: Wightman 77/3%, Narayanan 80.9/16.0%, Best 82.5/3.9%
- Relevant Features
– Vowel length (current and next word)
– POS of the next word
Conclusion
- Relatively low-tech acoustic features and ML algorithms can perform competitively with more complicated NLP approaches.
- Break index information was not as helpful as initially suspected.
- Potential Improvements:
– Sequential modeling (HMM)
– Different features
- More sophisticated pitch contour feature
- Content-based features (similar to Ross)