Automatic Prosody Labeling - Final Presentation - Andrew Rosenberg



SLIDE 1

Automatic Prosody Labeling

Final Presentation

Andrew Rosenberg ELEN 6820 - Speech and Audio Processing and Recognition 4/27/05

SLIDE 2

Overview

  • Project Goal
  • ToBI standard for prosodic labeling
  • Previous Work
  • Method
  • Results
  • Conclusion
SLIDE 3

Project Goal

  • Automatic assignment of tones tier elements

– Given the waveform, orthographic and break index tiers, predict a subset/simplification of elements in the tones tier.
– Distinct experiments for determining each of pitch accents, phrase tones, and phrase boundary tones.

SLIDE 4

ToBI Annotation

  • The Tones and Break Indices (ToBI) labeling scheme consists of a speech waveform and 4 tiers:

– Tones

  • Annotation of pitch accents and phrasal tones

– Orthographic

  • Transcription of text

– Break Index

  • Pauses between words, rated on a scale from 0-4.

– Miscellaneous

  • Notes about the annotation (e.g., ambiguities, non-speech

noise)
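The four tiers can be bundled into a simple per-word record for downstream feature extraction. A minimal Python sketch, assuming one tone label and one break index per word (the class and field names are illustrative, not part of the ToBI standard):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToBIWord:
    """One word of a ToBI transcription (illustrative field names)."""
    text: str                   # orthographic tier: the word itself
    break_index: int            # break index tier: disjuncture after the word, 0-4
    tone: Optional[str] = None  # tones tier: e.g. "H*", "L-L%"; None if unaccented
    misc: Optional[str] = None  # miscellaneous tier: annotator notes

    def __post_init__(self):
        # ToBI break indices are restricted to the 0-4 scale
        if not 0 <= self.break_index <= 4:
            raise ValueError("break index must be between 0 and 4")

# An accented word ending an intonational phrase (break index 4)
w = ToBIWord(text="Boston", break_index=4, tone="H*")
```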

SLIDE 5

ToBI Transcription Example

SLIDE 6

ToBI Examples

  • Pitch Accents (made3.wav):

– H*, L*, L+H*

  • Boundary Tones (money.wav):

– L-H%, H-H%, L-L%, H-L%, (H-, L-)

SLIDE 7

Previous Work

  • Ross: “Prediction of abstract prosodic labels for speech synthesis” 1996

– BU Radio News Corpus (~48 minutes)

  • Public news broadcasts spoken by 7 speakers

– Uses decision tree output as input to an HMM for pitch accent identification; decision trees for phrase/boundary tone identification
– Employs no acoustic features.

  • Narayanan: “An Automatic Prosody Recognizer using a Coupled Multi-Stream Acoustic Model and a Syntactic-Prosodic Language Model” 2005

– BU Radio News Corpus
– Detects stressed syllables (collapsed ToBI labels) and all boundaries.
– Uses a CHMM on pitch, intensity and duration to track these “asynchronous” acoustic features, plus a trigram POS/stress-boundary language model

  • Wightman: “Automatic Labeling of Prosodic Patterns” 1994

– Single-speaker subset of BNC and an ambiguous-sentence corpus (read speech).
– Like Ross, uses decision tree output as input to an HMM
– Uses many acoustic features

SLIDE 8

Method

  • JRip

– Classification rule learner
– Better at working with nominal attributes
– Easier-to-read output

  • Corpus

– Boston Directions Corpus

  • 4 speakers
  • ~65 minutes of semi-spontaneous speech
  • Original Plan:

– HMMs and SVMs

  • SVMs took a prohibitive amount of time to learn and performed worse.
  • Problems with the HMM implementation, and not enough time to implement my own
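JRip (Weka's RIPPER implementation) learns an ordered list of if-then rules over nominal attributes, applied first-match-wins with a default class. A toy sketch of that rule-list shape; the rules and attribute names below are invented for illustration, not rules learned from the corpus:

```python
# Ordered rule list of the kind a rule learner like JRip produces.
# Each rule is (condition over nominal attributes, predicted class).
RULES = [
    (lambda w: w["pos"] == "NN" and w["stressed"], "accented"),
    (lambda w: w["break_index"] >= 3, "accented"),
]
DEFAULT = "unaccented"  # default class when no rule fires

def classify(word):
    """Apply the rules in order; the first matching rule wins."""
    for condition, label in RULES:
        if condition(word):
            return label
    return DEFAULT

print(classify({"pos": "NN", "stressed": True, "break_index": 1}))  # accented
```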
SLIDE 9

Method - Features

  • Min, max, mean, std.dev. F0 and Intensity
  • # Syllables, Duration, approx. vowel length, POS

  • F0 slope (weighted)
  • zscore of max F0 and intensity
  • Phrase-length F0, intensity and vowel length features

  • Phrase position
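A sketch of how a few of the word-level F0 features above might be computed from per-frame pitch values. This is one plausible definition in plain Python; the project's exact slope weighting and normalization are not specified here, so treat the formulas as assumptions:

```python
import statistics

def f0_features(f0, phrase_f0):
    """Word-level F0 statistics: f0 holds the word's voiced-frame pitch
    values (Hz); phrase_f0 holds the values for the enclosing phrase."""
    mean = statistics.mean(f0)
    p_mean = statistics.mean(phrase_f0)
    p_std = statistics.pstdev(phrase_f0)
    # z-score of the word's max F0 relative to the phrase distribution
    z_max = (max(f0) - p_mean) / p_std if p_std else 0.0
    # unweighted least-squares slope of F0 over frame index
    n = len(f0)
    x_mean = (n - 1) / 2
    num = sum((x - x_mean) * (y - mean) for x, y in enumerate(f0))
    den = sum((x - x_mean) ** 2 for x in range(n))
    slope = num / den if den else 0.0
    return {"min": min(f0), "max": max(f0), "mean": mean,
            "std": statistics.pstdev(f0), "zmax": z_max, "slope": slope}

feats = f0_features([180, 200, 220], [150, 160, 180, 200, 220, 170])
```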
SLIDE 10

Results - Tasks

  • Pitch Accent

– Identification
– Detection

  • Phrase Tone identification
  • Boundary Tone identification
  • Phrase/Boundary Tone

– Identification
– Detection

SLIDE 11

Results - Pitch Accent Identification

  • Accuracy

– Base: 58.8%
– Ross*: 80.2%
– No Breaks: 78.0%
– Best: 79.2%

  • Relevant Features

– # syllables, duration (previous 2), vowel length (prev, next 2), POS, max & stdev F0, slope F0, max & stdev intensity, zscore of F0, phrase level zscore of F0 and intensity

*Ross identifies a different subset of ToBI pitch accents

SLIDE 12

Results - Pitch Accent Detection

  • Accuracy: Ross 82.5%, No Breaks 83.9%, Best 85.7% (baseline: 58.9%)
  • T/F rates: Wightman 83/14%, Narayanan 79.5/13.2%, No Breaks 80.1/14%, Best 83.2/12.4%
  • On BNC, human agreement is 91%; in general, 86-88%
  • Identical relevant features as the identification task

SLIDE 13

Results - Phrase Tone

  • Accuracy

– No Break: 86.7% (base 77.4%)
– Best: 72.4% (base 57.9%)

  • Relevant Features

– Duration of next word; max, min, mean F0.
– Linear slope F0, zscore of intensity, phrase zscores of F0 and intensity

SLIDE 14

Results - Boundary Tone Identification

  • Accuracy

– No Break: 91.3% (base 84.5%)
– Best: 73.2% (base 65.1%)

  • Relevant Features

– Quadratically weighted F0 slope

SLIDE 15

Results - Phrase/Boundary Tone Identification

  • Accuracy

– Ross: 66.9% (base 56.3%)
– Best: 54.7% (base 33.8%)

  • Relevant Features

– Duration of next two words, POS (current and 2 next), max, mean and slope (all weightings) of F0, mean intensity, phrase zscores of F0 and intensity
– zscore of difference in max intensity between the current word and the phrase.

SLIDE 16

Results - Phrase/Boundary Tone Detection

  • Accuracy
  • Human agreement (in general): 95%
  • Best agreement: 93.0% over 77% baseline
  • Relevant Features

– Vowel length (current and next word)
– POS of the next word

  • T/F rates: Wightman 77/3%, Narayanan 80.9/16.0%, Best 82.5/3.9%

SLIDE 17

Conclusion

  • Relatively low-tech acoustic features and ML algorithms can perform competitively with more complicated NLP approaches.

  • Break index information was not as helpful as initially suspected.

  • Potential Improvements:

– Sequential Modeling (HMM)
– Different features

  • More sophisticated pitch contour feature
  • Content-based features (similar to Ross)