Cue-based analysis of speech: Implications for prosodic transcription
Stefanie Shattuck-Hufnagel, Speech Communication Group, Research Laboratory of Electronics, MIT
A stark view: Some unanswered questions
- What are the contrastive categories of spoken prosody?
- How does their phonetic implementation vary systematically with context?
- How do they relate to meaning and to interaction?
Prosodic parallels to a feature-cue-based approach to speech processing?
1) Segmental phonology: growing evidence that language users systematically control:
- individual acoustic cues to contrastive phonemic segments
- contextually appropriate parameter values of these cues
2) Models: representation and processing of surface phonetic information at this level of detail
- feature-cue-based processing (Halle, Stevens)
3) Parallels in prosodic phonology?
- if so, what are the implications for prosodic transcription?
Instruction giver’s map Instruction follower’s map
Reduction of surface word forms
It’s probably the same thing.
probably the
Strengthening/clarification of surface word forms
Are you going to have to do that all over again? ProbabLY.
Extremes of variation in word forms
Surface phonetic segments
- Often not appropriate for transcription
- Cues not aligned in time
– Cues to a feature can be distributed over time
- nasality in V preceding a nasal coda C in I can go
- duration of V preceding a voiceless coda C in I can’t go
– Cues to features of two segments can overlap in time
- /n + dh/ of win those: interdental nasal
- Cues selected individually
– Individual cues to features survive ‘deletion’ of segment
- Duration of V preceding a ‘deleted’ voiced coda C in cat
– Individual cues to features are sometimes added
- Glottalized word-final /t/ sometimes also has closure and release burst
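The points above (cues to one feature spread over time, cues to two segments overlapping) can be pictured as a transcription record in which each cue carries its own time interval instead of being tied to a segment slot. The following is only a minimal sketch: the `Cue` class, the cue names, and all times are invented for illustration and are not taken from the talk or from any existing labelling tool.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Cue:
    segment: str   # segment whose feature is being signalled
    feature: str   # contrastive feature, e.g. "nasal"
    cue: str       # acoustic cue observed in the signal
    start: float   # seconds; all times here are invented
    end: float


def overlaps(a, b):
    """True when the two cue intervals share some stretch of time."""
    return a.start < b.end and b.start < a.end


def cross_segment_overlaps(cues):
    """Pairs of cues that belong to different segments but overlap in time."""
    return [(a, b) for a, b in combinations(cues, 2)
            if a.segment != b.segment and overlaps(a, b)]


def feature_span(cues, segment, feature):
    """Earliest start and latest end over all cues to one feature of one segment."""
    hits = [c for c in cues if c.segment == segment and c.feature == feature]
    return (min(c.start for c in hits), max(c.end for c in hits))


# Hypothetical labelling of "win those": nasality from /n/ and place
# from /dh/ occupy the same interval (the "interdental nasal").
win_those = [
    Cue("n", "nasal", "nasal murmur", 0.30, 0.38),
    Cue("dh", "place", "interdental constriction", 0.30, 0.38),
    Cue("n", "nasal", "nasalization of preceding vowel", 0.18, 0.30),
]

pairs = cross_segment_overlaps(win_those)
nasal_span = feature_span(win_those, "n", "nasal")
```

Here `pairs` holds the one /n/–/dh/ pair that shares an interval, and `nasal_span` shows that the cues to nasality start in the preceding vowel, well before the nasal murmur itself.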
Feature-cue-based transcription provides a better fit
- Stevens 2002 (extending Halle 1972): Two types of features, two types of cues
– Landmarks: abrupt spectral changes as cues to articulator-free features
- Consonant, Vowel, Glide, Continuant, Sonorant, Strident
– Landmark-related cues: spectral patterns near Landmarks, as cues to articulator-bound features
- Labial, Coronal, Velar, Voiced, Nasal etc.
– Additional acoustic events
Landmark cues
Boyce et al. 2013
Rapid spectral changes across several energy bands which provide information about articulator-free features
Landmark labelling captures individual cue patterns
Advantages of Landmark Cues in Speech Perception
- Reliably produced
– 80% of predicted LMs in AEMT Corpus (Shattuck-Hufnagel & Veilleux 2007)
- Robustly detectable (‘auditory edges’)
- Highly informative
– Articulator-free features (~manner) provide estimate of CV structure of the utterance
– Identification of regions rich in cues to other features (place, voicing)
– Inter-Landmark times provide estimate of durational markers of prosodic structure
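The last point, that inter-Landmark times give an estimate of durational markers, can be illustrated with a few lines of code. This is a sketch under stated assumptions: the landmark labels and times below are made up, and `inter_landmark_times` is a hypothetical helper, not part of any landmark-detection system cited here.

```python
def inter_landmark_times(landmarks):
    """Return (from_label, to_label, interval_s) for each pair of
    consecutive landmarks, after sorting them by time."""
    ordered = sorted(landmarks)
    return [(a_label, b_label, round(b_time - a_time, 3))
            for (a_time, a_label), (b_time, b_label)
            in zip(ordered, ordered[1:])]


# Invented (time in seconds, label) pairs for a short stretch of speech.
landmarks = [
    (0.10, "consonant closure"),
    (0.18, "consonant release"),
    (0.25, "vowel"),
    (0.47, "consonant closure"),
]

intervals = inter_landmark_times(landmarks)
longest = max(intervals, key=lambda item: item[2])
```

In this toy input the longest interval runs from the vowel landmark to the following closure, the kind of stretch one would then inspect for lengthening associated with a boundary or prominence.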
Extension to Production A sketch of an extrinsic timing model
Stage 1: a phonological planning stage
– symbolic segmental representations are sequenced and slotted into an appropriate prosodic structure
– appropriate acoustic cues are selected for each segment’s features in its context
Stage 2: a phonetic planning stage
– cues are mapped onto sets of articulators
– appropriate values for spatial and temporal parameters of movement are computed
Stage 3: a motor-sensory implementation stage
– articulator movements are generated and tracked.
Turk and Shattuck-Hufnagel 2014
Evidence for a Feature-Cue-Based production planning model
- Evidence that speakers can choose among individual cues
– Feature cues left behind in phonetic reduction
– New cues in challenging speaking circumstances
– Inventory constraints on LM modification
- Evidence that speakers compute cue parameter values
– Conversational convergence: partial, governed by social values
– Covert contrast in development
– Inventory constraints on final lengthening
Conversational convergence/divergence
Neilson 2011
Covert contrast in child speech
Scobbie 1998; see also Gibbon 1990
Covert contrast for stop voicing
Macken & Barton 1980 JCL
Characteristics of the FCBP approach
- More complex planning by the speaker
– Not ‘choose a surface allophone’
– But instead, ‘choose context-appropriate feature cues and cue parameter values’
- Extensive interpretation by the listener
– Which linguistic constituents and structures does the signal contain cues for?
– What information about the interaction and the situation does the signal contain cues for?
Parallels in Prosodic Processing?
- Individual variation in cue patterns
– Irregular pitch periods at prosodic boundaries and prominences (Pierrehumbert & Talkin 1992, Dilley et al. 1996)
- New cues in challenging speaking situations
– Dysarthric speakers use duration instead of F0 to signal question vs statement (Patel 2003)
– Whispered speech in Mandarin shows amplitude variation analogous to F0 shape for tones (Gao 2003)
- Interpretation of ambiguous cues in context
– Early prominence patterns influence interpretation of ambiguous later prominence (Dilley & Shattuck-Hufnagel 1998)
– Early speaking rate influences interpretation of ambiguous cues to function words (Dilley & Pitt 2008)
New cues in challenging speaking situations: Dysarthric Speech
Patel 2003
New cues in challenging speaking situations: Whispered Speech
https://lingos.co/blog/mandarin-tones/ Gao 1999
New Cues in challenging speaking situations: Whispered Speech
Gao 2003
Implications for Prosodic Transcription?
- Determine the contrastive categories
- Determine the range of appropriate cues and cue parameter values for each category, across contexts
- Determine the relationship of the categories (and cue parameter values) to meaning and to interaction
- Can cue-based transcription move us toward these goals?
Some useful steps
- Consider prosodic elements in terms of distributed cues to contrastive elements and parameter values for those cues
– Rather than as a sequence of surface elements
- Develop displays of parameters as compelling as F0 contours
– Duration and amplitude as % of typical
– Autodetection of irregular pitch periods
- Create inventories of contrastive use of prosodic phrasing and prominence across languages
- Investigate ‘phonological equivalence’ in prosody
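The two display ideas listed above (duration and amplitude as % of typical, and autodetection of irregular pitch periods) can be sketched in a few lines. This is a toy illustration under explicit assumptions: the function names, the 20% period-change threshold, and all numeric values are invented here, and real irregular-pitch-period detection would need a far more careful criterion than a single jitter threshold.

```python
def percent_of_typical(observed, typical):
    """Express an observed duration (or amplitude) as a percentage of
    a reference 'typical' value for that unit."""
    if typical <= 0:
        raise ValueError("typical value must be positive")
    return 100.0 * observed / typical


def irregular_periods(pulse_times, threshold=0.2):
    """Given glottal pulse times in seconds, return the period lengths
    and a parallel list of flags marking periods that differ from the
    preceding period by more than `threshold` (as a fraction).
    The threshold is an arbitrary assumption, not a published value."""
    periods = [b - a for a, b in zip(pulse_times, pulse_times[1:])]
    flags = [False]  # the first period has no predecessor to compare with
    for prev, cur in zip(periods, periods[1:]):
        flags.append(abs(cur - prev) / prev > threshold)
    return periods, flags


# A phrase-final syllable measured at 180 ms against a typical 120 ms.
final_syllable_pct = percent_of_typical(180, 120)

# Invented pulse times: regular 10 ms periods, then one long period.
pulses = [0.000, 0.010, 0.020, 0.030, 0.046, 0.055, 0.065]
periods, flags = irregular_periods(pulses)
```

With these inputs the syllable comes out at 150% of typical, and the flags pick out the long period and the one immediately after it, which is where a display could mark possible irregular phonation.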
Phonological equivalence
Which differences distinguish contrasts?
Some unanswered (but answerable) questions
- What are the contrastive categories of spoken prosody?
- How does the phonetic implementation of these categories vary systematically with context?
- How do these categories relate to meaning and to interaction?
Evidence for a Feature-Cue-Based production planning model
- Evidence that speakers can choose among individual cues
– Feature cues left behind in phonetic reduction
– New cues in challenging speaking circumstances
– Inventory constraints on LM modification
- Evidence that speakers compute cue parameter values
– Covert contrast in development
– Conversational convergence: partial, governed by social values
– Inventory constraints on final lengthening
– Recall evidence from prosodic hierarchy: ipp’s, VOT
Speakers can select among cues: cues left behind in reductions
- …even if whole strings of segments are no longer delimitable in reduced forms compared with fuller pronunciations of the same lexical items, there will still be articulatory prosodies, superimposed upon the remaining sound material, which retain essential components of the fuller forms, the phonetic essence that characterizes the whole form class of a word. The extreme reduction [ai~~i] of the German modal particle eigentlich, 'actually', is a case in point. The length, palatality and nasality of its gliding movement reflect the polysyllabicity, the central nasal consonant and the final palatal syllable of the fuller forms. It is assumed that this phonetic essence triggers lexical identification in the listener.
Niebuhr & Kohler 2011
Speakers can select among cues
- Speakers constrain LM modification to maintain contrasts in their language (Lavoie 2002)
Lavoie 2002
Speakers can compute context-appropriate cue values
- Recall examples from cues to prosodic structure:
– Duration adjustments of rhyme for boundaries, prominences (many languages)
– Duration adjustments of stop VOT for initial boundaries (Korean)
Speakers can compute context-appropriate cue values
- Conversational convergence/divergence
Neilson 2011
Speakers can compute context-appropriate cue values
- Inventory-sensitive constraints on degree of final lengthening:
- “Nakai et al. (2009) have also shown that a quantity language like Finnish exhibits final lengthening, but its implementation is regulated to preserve the language-specific quantity system, namely the contrast between single or short vowels and double or long vowels. This important empirical finding raises the question whether final lengthening, and perhaps also other prosodic cues, is a universal cue to phrasing that is implemented in language-particular ways. If so, cue variation may be the result of the conspiracy of specific phonologies against universal tendencies in language, and experimental approaches are decisive to disentangle the two factors.” (Frota 2012)
A sketch of an extrinsic timing model of speech production
- Stage 1) a phonological planning stage
– symbolic segmental representations are sequenced and slotted into an appropriate prosodic structure
– appropriate acoustic cues are selected for each segment’s features in its context
- Stage 2) a phonetic planning stage
– cues are mapped onto sets of articulators
– appropriate values for spatial and temporal parameters of movement are computed
- Stage 3) a motor-sensory implementation stage
– articulator movements are generated and tracked.
Turk & Shattuck-Hufnagel, Sp Prosody 2014
Outline
- The past 30 years have seen two contrasting developments in speech prosody: the emergence of grammatical theories of prosodic structure, and deep disagreements about how prosodic structure should be annotated, what matters about prosody, and the role of prosodic phonology.
- In this talk I will review what seems to me to be a parallel pair of contrasting views about segmental phonology, one of which focuses on the range of variation in word forms across contexts, and the other on the systematic nature of this variation in relation to abstract linguistic categories.
- This approach seems to me to provide a framework for uniting the largely statistical and probabilistic view of phonetics with the more traditional view of phonology based on abstract contrastive categories, to the benefit of both.
Collaborators
- Segmental phonology and phonetics: Cue-based approaches
– Ken Stevens, Morris Halle, Jay Keyser, Haruko Kawasaki, Helen Hanson
- The role of prosodic structure in utterance planning
– Pat Keating, Alice Turk
- Cues to prosodic structure
– Jon Barnes, Alejna Brugos, Nanette Veilleux
– Jennifer Cole
Both segmental and prosodic phonetics signal several different types of information
- Linguistic constituents and structures
- Situation-specific information
– the language/dialect
– the frequency/predictability of the words
– the speaker (attitude, emotion, physiology)
– the speaking situation
– the relationship between the speakers
The plan for this talk
- Evidence that segmental phonetics has many of the same problems as prosodic phonetics
- How a model based on cues to contrastive categories, and their parameter values, addresses some of these problems for segmental phonetics
- How this might work for prosodic phonetics
- Evidence for a phonetics model based on cues and their parameter values
- Implications of this Cue-Based-Processing approach for prosodic transcription
Segmental phonetics has many of the same problems as prosodic phonetics
- Often, expected cues aren’t there
– Reduction in segmental phonology
- Ambiguity
Cue-based processing solves problems
Evidence for Cue Based Processing
- Speakers typically leave cues behind in reduction
- Speakers use new cues in challenging situations
- Developing child speakers use cues differently
- Listeners are sensitive to cue parameter values even if they can’t accurately report them
- Listeners can parse cues and parameter values into their sources appropriately
- Listeners can reproduce cue parameter values in their conversational or imitative behavior
- Listeners tune their perception to cue parameter values of individual speakers
Speakers typically leave cues behind
Speakers use new cues
Developing child speakers use cues differently
Listeners are sensitive to cue parameter values
Implications of Cue-Based-Processing for Prosodic Transcription
- After three decades of literature focusing on prosodic structure, the view that prosodic structure has a role to play as the organizing framework of speech is well established. This structure consists of the grouping of chunks of speech into prosodic constituents arranged according to a hierarchy, delimited by prosodic boundaries or edges and with prominences or heads at the various levels. Prominence strength and boundary strength reflect the hierarchy. Prosodic domains are marked by constellations of cues, which stand as the major empirical evidence for prosodic structure and the constituents it comprises. These cues have been shown to be used in lexical processing, in the disambiguation of syntax, or in the identification of morpho-syntactic units (as in bootstrapping).
An alternative title for this talk
- Why I sleep better at night these days, and how you can too
Summary
- What questions do we need to address?
- What counts as an answer to the question ‘what