
A Preliminary Studies on Landmark Detection



  1. SPEECH LAB NTHU EE
     A Preliminary Studies on Landmark Detection
     Advisor: Hsiao-Chuan Wang
     Speaker: Jyh-Min CHENG
     Date: Aug. 18, 2006

  2. Introduction
     - A knowledge-based speech recognition system is dedicated to processing speech (versus signals in general) and therefore is efficient
     - Rather than explicitly specifying speech knowledge in a recognition system, a statistical approach builds models by training on speech data, thereby implicitly acquiring knowledge on its own

  3. Introduction (cont.)
     - Statistical methods have been successful for large-vocabulary, speaker-independent speech recognition
     - Lee, K.-F. (1989). Automatic speech recognition: the development of the SPHINX system
     - Because they rely heavily on data, statistical methods do not generalize easily to tasks for which they are not explicitly trained
     - Retraining, adaptation, etc.

  4. Introduction (cont.)
     - Performance degrades when there is an environment mismatch
     - Das, S., Bakis, R., Nadas, A., Nahamoo, D., and Picheny, M. (1993). Influence of background noise and microphone on the performance of the IBM Tangora speech recognition system
     - A combination of both knowledge-based and statistical approaches
     - Knowledge sources such as phone duration, an auditory front-end, the mel-frequency scale, etc. are added

  5. Introduction (cont.)
     - A knowledge-based speech recognition system was proposed
     - Stevens, K. N., Manuel, S. Y., Shattuck-Hufnagel, S., and Liu, S. (1992). “Implementation of a model for lexical access based on features”, ICSLP

  6. Introduction (cont.)

  7. Introduction (cont.) – Distinctive features
     - Distinctive features concisely describe the sounds of a language at a sub-segmental level
     - They have a relatively direct relation to acoustics and articulation
     - Jakobson, R., Fant, G., and Halle, M. (1952). “Preliminaries to speech analysis”
     - They can concisely describe many of the contextual variations of a segment
     - Speaking styles, phonological assimilation across word boundaries, etc.

  8. Introduction (cont.) – Landmark detection
     - Landmarks are a guide to the presence of underlying segments, which organize distinctive features into bundles
     - They define regions in an utterance where the acoustic correlates of distinctive features are most salient
     - They mark perceptual foci and articulatory targets

  9. Introduction (cont.) – Landmark detection
     - For some phonetic contrasts, a listener focuses on landmarks to get the acoustic cues necessary for deciphering the underlying distinctive features
     - Stevens, K. N. (1985). “Evidence for the role of acoustic boundaries in the perception of speech sounds”
     - Furui, S. (1986). “On the role of spectral transition for speech perception”
     - Ohde, R. M. (1994). “The developmental role of acoustic boundaries in speech perception”

  10. Introduction (cont.) – Landmark detection
      - After the landmarks are found, subsequent processing can focus on the relevant portions of the speech, instead of treating each part of the signal as equally important
      - Minimizes the amount of processing necessary
      - Independent of timing factors such as speaking rate and segmental duration
      - Gives timing information to aid in later processing

  11. Introduction (cont.) – Landmark detection: Landmark detection, frame-based processing, and segmentation
      - Landmark detection is just one way to organize the speech waveform
      - Frame-based processing and segmentation are two other possibilities

  12. Introduction (cont.) – Landmark detection: Landmark detection, frame-based processing, and segmentation
      - Frame-based processing is the most popular way of dividing up the speech waveform
      - Segmentation is more structured than frame-based processing
      - Finds boundaries in the speech waveform
      - Delimits unequal-length, semi-steady-state, abutting regions, with each region corresponding to a phone or sub-phone unit
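The contrast between the two ways of dividing the waveform on slide 12 can be sketched as follows. This is only an illustration, not the method of any cited system: the function names, the sample counts, and the boundary positions are all hypothetical, and a real segmenter would have to find the boundaries from the signal.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) in samples, end exclusive


def fixed_frames(n_samples: int, frame_len: int, hop: int) -> List[Span]:
    """Frame-based processing: equal-length windows at a fixed hop."""
    return [(start, start + frame_len)
            for start in range(0, n_samples - frame_len + 1, hop)]


def segments_from_boundaries(n_samples: int, boundaries: List[int]) -> List[Span]:
    """Segmentation: abutting, unequal-length regions delimited by boundaries."""
    inner = sorted(b for b in boundaries if 0 < b < n_samples)
    edges = [0] + inner + [n_samples]
    return list(zip(edges[:-1], edges[1:]))


# Hypothetical numbers chosen only to make the contrast visible.
frames = fixed_frames(n_samples=1000, frame_len=400, hop=200)
segments = segments_from_boundaries(n_samples=1000, boundaries=[150, 620])
print(frames)
print(segments)
```

The frames overlap and all have the same length regardless of content, while the segments abut and vary in length with the boundary positions.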

  13. Introduction (cont.) – Landmark detection: Landmark detection, frame-based processing, and segmentation
      - Subsequent processing focuses on these regions, typically acquiring averages across a region and sometimes measuring attributes near the boundaries
      - Gish, H., and Ng, K. (1993). “A segmental speech model with applications to word spotting”, ICASSP
      - Zue, V. W., Glass, J. R., Goodine, D., Leung, H., Phillips, M., Polifroni, J., and Seneff, S. (1990b). “Recent progress on the SUMMIT system”

  14. Introduction (cont.) – Landmark detection: Landmark detection, frame-based processing, and segmentation
      - A segmentation approach performs better than or comparably to a frame-based approach while significantly reducing the computational load in training and testing
      - Flammia, G., Dalsgaard, P., Andersen, O., and Lindberg, B. (1992). “Segment based variable frame rate speech analysis and recognition using spectral variation function”, ICSLP
      - Marcus, J. (1993). “Phonetic recognition in a segmental-based HMM”, ICASSP

  15. Introduction (cont.) – Landmark detection: Landmark detection, frame-based processing, and segmentation
      - Segmentation was a popular method of organizing the speech waveform from the 1970s through the mid-1980s
      - Compatible with acoustic-phonetic processing
      - Weinstein, C. J., McCandless, S. S., Mondshein, L. F., and Zue, V. W. (1975). “A system for acoustic-phonetic analysis of continuous speech”, IEEE ASSP

  16. Introduction (cont.) – Landmark detection: Landmark detection, frame-based processing, and segmentation
      - Segmentation fails when parts of the waveform do not have sharp boundaries, like those corresponding to diphthongs and semivowels
      - Over-segmentation
      - Andre-Obrecht, R. (1988). “A new statistical approach for the automatic segmentation of continuous speech signals”, IEEE ASSP
      - Multi-level representation
      - Glass, J. R. (1988). “Finding acoustic regularities in speech: applications to phonetic recognition”

  17. Introduction (cont.) – Landmark detection: Landmark detection, frame-based processing, and segmentation
      - Landmark detection is different from frame-based processing and segmentation
      - Landmarks are foci, so speech processing is done around a landmark rather than in between two landmarks
      - Not all boundaries are landmarks, and not all landmarks are boundaries
      - The problem of semivowels and diphthongs is avoided altogether
      - Typically more hierarchical
      - Associated with distinctive features, rather than with phones as in segmentation
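The point on slide 17 that processing is done around a landmark, rather than between two boundaries, can be illustrated with a small sketch. The function names, the window half-width, and the time points below are hypothetical; the slides do not specify how wide the analysis region around a landmark should be.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) in samples, end exclusive


def windows_around_landmarks(landmarks: List[int], half_width: int,
                             n_samples: int) -> List[Span]:
    """Landmark view: analyze a window centered on each landmark."""
    return [(max(0, t - half_width), min(n_samples, t + half_width))
            for t in sorted(landmarks)]


def regions_between_boundaries(boundaries: List[int], n_samples: int) -> List[Span]:
    """Segmentation view: analyze the regions lying between boundaries."""
    edges = [0] + sorted(boundaries) + [n_samples]
    return list(zip(edges[:-1], edges[1:]))


# The same hypothetical time points, interpreted in the two ways.
times = [150, 620]
around = windows_around_landmarks(times, half_width=50, n_samples=1000)
between = regions_between_boundaries(times, n_samples=1000)
print(around)
print(between)
```

With the landmark view, only the neighborhoods of the marked instants are examined, so most of the signal is never processed in detail; with the segmentation view, every sample belongs to some region.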

  18. Objective
      - The most numerous types of landmarks are acoustically abrupt
      - Zue, V., Seneff, S., and Glass, J. (1990a). “Speech database development at MIT: TIMIT and beyond”, Speech Commun.
      - An estimate based on a phonetically balanced subset of sentences in the TIMIT corpus shows that acoustically abrupt landmarks comprise approximately 68% of the total number of landmarks in speech
      - Often associated with consonantal segments, like a stop closure or release

  19. I. LANDMARKS
      - Categorized into four groups
      - Abrupt-consonantal (AC)
      - Abrupt (A)
      - Nonabrupt (N)
      - Vocalic (V)

  20. I. LANDMARKS (cont.)
      - Phonologically, segments can be classified as [+consonantal] or [-consonantal]
      - Sagey, E. (1986). “The representation of features and relations in nonlinear phonology”
      - A [+consonantal] segment involves a primary articulator (lips, tongue blade, tongue body) forming a tight constriction in the midline of the vocal tract
      - A [-consonantal] segment does not involve a primary articulator forming a tight constriction (soft palate and glottis)
      - Speech is formed by a series of articulator narrowings and releases
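The [+consonantal] / [-consonantal] distinction on slide 20 reduces to a simple rule, sketched below. The articulator lists come from the slide; the function name, its parameters, and the string labels are hypothetical, and the rule is a simplification of the phonological definition.

```python
# Primary articulators named on the slide.
PRIMARY_ARTICULATORS = {"lips", "tongue blade", "tongue body"}


def is_consonantal(articulator: str, tight_midline_constriction: bool) -> bool:
    """Return True for [+consonantal]: a primary articulator forms a tight
    constriction in the midline of the vocal tract."""
    return articulator in PRIMARY_ARTICULATORS and tight_midline_constriction


print(is_consonantal("lips", True))          # primary articulator, tight constriction
print(is_consonantal("glottis", True))       # not a primary articulator
print(is_consonantal("tongue body", False))  # no tight constriction
```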

  21. I. LANDMARKS (cont.)
      - The most salient of these narrowings and releases are acoustically abrupt
      - An acoustically abrupt constriction involving a primary articulator is typically tight and is a consequence of implementing a [+consonantal] segment
      - An abrupt-consonantal (AC) landmark marks the closure and another marks the release of one of these constrictions

  22. I. LANDMARKS (cont.)
      - The clearest manifestation of an AC landmark is when the constriction occurs adjacent to a [-consonantal] segment
      - A pair of these landmarks, one on either side of the constriction, will be referred to as the outer AC landmarks
      - Ex: [b] closure and release in “able”
      - Other landmarks can occur within or outside of the pair of outer AC landmarks

  23. I. LANDMARKS (cont.)
      - A common sequence of landmarks is one in which the outer AC landmarks are governed by the same underlying segment and thus are implemented by the same articulator
      - Ex: [b] closure and release in “able”
      - Some outer AC landmarks are not governed by the same articulator
      - Ex: [p] closure and [d] release in “tap dance”
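The four landmark groups and the two outer-AC examples above can be written down as a small data structure. This is a sketch only: the class and field names are hypothetical, and the articulator assigned to each stop (lips for [b] and [p], tongue blade for [d]) is an assumption based on standard phonetics rather than something stated on the slides.

```python
from dataclasses import dataclass
from enum import Enum


class LandmarkType(Enum):
    AC = "abrupt-consonantal"
    A = "abrupt"
    N = "nonabrupt"
    V = "vocalic"


@dataclass(frozen=True)
class Landmark:
    kind: LandmarkType
    event: str        # "closure" or "release"
    segment: str      # underlying segment, e.g. "b"
    articulator: str  # assumed primary articulator


def same_articulator(closure: Landmark, release: Landmark) -> bool:
    """True when a pair of outer AC landmarks shares one articulator."""
    return closure.articulator == release.articulator


# "able": [b] closure and [b] release, both made with the lips.
able = (Landmark(LandmarkType.AC, "closure", "b", "lips"),
        Landmark(LandmarkType.AC, "release", "b", "lips"))

# "tap dance": [p] closure (lips) followed by [d] release (tongue blade).
tap_dance = (Landmark(LandmarkType.AC, "closure", "p", "lips"),
             Landmark(LandmarkType.AC, "release", "d", "tongue blade"))

print(same_articulator(*able))
print(same_articulator(*tap_dance))
```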
