SLIDE 1

Cue-based analysis of speech: Implications for prosodic transcription

Stefanie Shattuck-Hufnagel, Speech Communication Group, Research Laboratory of Electronics, MIT

SLIDE 2

A stark view: Some unanswered questions

What are the contrastive categories of spoken prosody?

How does their phonetic implementation vary systematically with context?

How do they relate to meaning and to interaction?

SLIDE 3

Prosodic parallels to a feature-cue-based approach to speech processing?

1) Segmental phonology: growing evidence that language users systematically control:

individual acoustic cues to contrastive phonemic segments
contextually appropriate parameter values of these cues

2) Models: representation and processing of surface phonetic information at this level of detail

feature-cue-based processing (Halle, Stevens)

3) Parallels in prosodic phonology?

if so, what are the implications for prosodic transcription?

SLIDE 4

[Figure: instruction giver’s map; instruction follower’s map]

SLIDE 5

Reduction of surface word forms

It’s probably the same thing.

SLIDE 6

probably the

SLIDE 7

Strengthening/clarification of surface word forms

Are you going to have to do that all over again? ProbabLY.

SLIDE 8

Extremes of variation in word forms

SLIDE 9

Surface phonetic segments

Often not appropriate for transcription
Cues not aligned in time

– Cues to a feature can be distributed over time

nasality in V preceding a nasal coda C in I can go
duration of V preceding a voiceless coda C in I can’t go

– Cues to features of two segments can overlap in time

/n + dh/ of win those → interdental nasal
Cues selected individually

– Individual cues to features survive ‘deletion’ of segment

Duration of V preceding a ‘deleted’ voiced coda C in cat

– Individual cues to features are sometimes added

Glottalized word-final /t/ sometimes also has closure and release burst

SLIDE 10

Feature-cue-based transcription provides a better fit

Stevens 2002 (extending Halle 1972): Two types of features, two types of cues

– Landmarks: abrupt spectral changes as cues to articulator-free features

Consonant, Vowel, Glide, Continuant, Sonorant, Strident

– Landmark-related cues: spectral patterns near Landmarks, as cues to articulator-bound features

Labial, Coronal, Velar, Voiced, Nasal etc.

– Additional acoustic events

SLIDE 11

Landmark cues

Boyce et al. 2013

Rapid spectral changes across several energy bands which provide information about articulator-free features
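
A rough Python sketch of this idea, under stated assumptions: the per-band energies (in dB, one value per analysis frame), the 9 dB change threshold, and the requirement that at least two bands change together are all illustrative choices. None of them is taken from Boyce et al. 2013 or from any published Landmark detector.

```python
def find_landmark_frames(band_energies_db, min_change_db=9.0, min_bands=2):
    """Return frame indices where energy changes abruptly in several bands at once.

    band_energies_db: list of bands, each a list of per-frame energies in dB.
    min_change_db, min_bands: hypothetical thresholds chosen for illustration.
    """
    n_frames = len(band_energies_db[0])
    landmarks = []
    for t in range(1, n_frames):
        # Count the bands whose energy jumps or drops sharply between frames.
        changed = sum(
            1 for band in band_energies_db
            if abs(band[t] - band[t - 1]) >= min_change_db
        )
        if changed >= min_bands:
            landmarks.append(t)
    return landmarks


# Made-up energies for three bands over six frames: an abrupt rise at frame 2
# (e.g. a consonant release) and an abrupt fall at frame 5 (e.g. a closure).
bands = [
    [40, 41, 60, 61, 60, 42],
    [35, 36, 55, 54, 55, 37],
    [30, 30, 33, 32, 31, 30],
]
landmark_frames = find_landmark_frames(bands)
print(landmark_frames)
```

Requiring agreement across bands is what makes the change "abrupt and spectral" rather than a fluctuation in a single band; a real detector would also need smoothing and band-specific thresholds.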

SLIDE 12

Landmark labelling captures individual cue patterns

SLIDE 13

Advantages of Landmark Cues in Speech Perception

Reliably produced

– 80% of predicted Landmarks (LMs) in AEMT Corpus (Shattuck-Hufnagel & Veilleux 2007)

Robustly detectable (‘auditory edges’)
Highly informative

– Articulator-free features (~manner) provide estimate of CV structure of the utterance
– Identification of regions rich in cues to other features (place, voicing)
– Inter-Landmark times provide estimate of durational markers of prosodic structure
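
The point about inter-Landmark times can be illustrated with a short Python sketch. The Landmark times below are invented, and treating the longest interval as a candidate site of prosodic lengthening is an assumption made only for illustration.

```python
def inter_landmark_intervals(landmark_times_s):
    """Durations (in seconds) between successive Landmarks."""
    return [
        round(later - earlier, 3)
        for earlier, later in zip(landmark_times_s, landmark_times_s[1:])
    ]


def longest_interval_index(intervals):
    """Index of the longest interval, a crude candidate for a lengthened region."""
    return max(range(len(intervals)), key=lambda i: intervals[i])


# Hypothetical Landmark times (seconds) for one short utterance.
times = [0.10, 0.18, 0.27, 0.35, 0.60, 0.68]
intervals = inter_landmark_intervals(times)
candidate = longest_interval_index(intervals)
print(intervals, candidate)
```

In practice the intervals would be compared with typical values for the segments involved rather than with each other, since segments differ in inherent duration.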

SLIDE 14

Extension to Production: A sketch of an extrinsic timing model

Stage 1: a phonological planning stage

– symbolic segmental representations are sequenced and slotted into an appropriate prosodic structure
– appropriate acoustic cues are selected for each segment’s features in its context

Stage 2: a phonetic planning stage

– cues are mapped onto sets of articulators
– appropriate values for spatial and temporal parameters of movement are computed

Stage 3: a motor-sensory implementation stage

– articulator movements are generated and tracked.

Turk and Shattuck-Hufnagel 2014

SLIDE 16

Evidence for a Feature-Cue-Based production planning model

Evidence that speakers can choose among individual cues

– Feature cues left behind in phonetic reduction
– New cues in challenging speaking circumstances
– Inventory constraints on LM modification

Evidence that speakers compute cue parameter values

– Conversational convergence: partial, governed by social values
– Covert contrast in development
– Inventory constraints on final lengthening

SLIDE 18

Conversational convergence/divergence

Neilson 2011

SLIDE 19

Evidence for a Feature-Cue-Based production planning model

Evidence that speakers can choose among individual cues

– Feature cues left behind in phonetic reduction
– New cues in challenging speaking circumstances
– Inventory constraints on LM modification

Evidence that speakers compute cue parameter values

– Conversational convergence: partial, governed by social values
– Covert contrast in development
– Inventory constraints on final lengthening

SLIDE 20

Covert contrast in child speech

Scobbie 1998; see also Gibbon 1990

SLIDE 21

Covert contrast for stop voicing

Macken & Barton 1980 JCL

SLIDE 22

Characteristics of the FCBP approach

More complex planning by the speaker

– Not ‘choose a surface allophone’
– But instead, ‘choose context-appropriate feature cues and cue parameter values’

Extensive interpretation by the listener

– Which linguistic constituents and structures does the signal contain cues for?
– What information about the interaction and the situation does the signal contain cues for?

SLIDE 23

Parallels in Prosodic Processing?

Individual variation in cue patterns

– Irregular pitch periods at prosodic boundaries and prominences (Pierrehumbert & Talkin 1992, Dilley et al. 1996)

New cues in challenging speaking situations

– Dysarthric speakers use duration instead of F0 to signal question vs statement (Patel 2003)
– Whispered speech in Mandarin shows amplitude variation analogous to F0 shape for tones (Gao 2003)

Interpretation of ambiguous cues in context

– Early prominence patterns influence interpretation of ambiguous later prominence (Dilley & Shattuck-Hufnagel 1998)
– Early speaking rate influences interpretation of ambiguous cues to function words (Dilley & Pitt 2008)

SLIDE 24

Parallels in Prosodic Processing?

Individual variation in cue patterns

– Irregular pitch periods at prosodic boundaries and prominences (Pierrehumbert & Talkin 1992, Dilley et al. 1996)

New cues in challenging speaking situations

– Dysarthric speakers use duration instead of F0 to signal question vs statement (Patel 2003)
– Whispered speech in Mandarin shows amplitude variation analogous to F0 shape for tones (Gao 2003)

Interpretation of cues in context

– Early prominence patterns influence interpretation of ambiguous later prominence (Dilley & Shattuck-Hufnagel 1998)
– Early speaking rate influences interpretation of ambiguous cues to function words (Dilley & Pitt 2008)

SLIDE 25

New cues in challenging speaking situations: Dysarthric Speech

Patel 2003

SLIDE 26

New cues in challenging speaking situations: Whispered Speech

https://lingos.co/blog/mandarin-tones/ Gao 1999

SLIDE 27

New cues in challenging speaking situations: Whispered Speech

Gao 1999

SLIDE 28

New Cues in challenging speaking situations: Whispered Speech

Gao 2003

SLIDE 29

Implications for Prosodic Transcription?

Determine the contrastive categories
Determine the range of appropriate cues and cue parameter values for each category, across contexts

Determine the relationship of the categories (and cue parameter values) to meaning and to interaction

SLIDE 30

Implications for Prosodic Transcription?

Determine the contrastive categories
Determine the range of appropriate cues and cue parameter values for each category, across contexts

Determine the relationship of the categories (and cue parameter values) to meaning and to interaction

Can cue-based transcription move us toward these goals?

SLIDE 31

Some useful steps

Consider prosodic elements in terms of distributed cues to contrastive elements and parameter values for those cues

– Rather than as a sequence of surface elements

Develop displays of parameters as compelling as F0 contours

– Duration and amplitude as % of typical
– Autodetection of irregular pitch periods

Create inventories of contrastive use of prosodic phrasing and prominence across languages

Investigate ‘phonological equivalence’ in prosody
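
The "% of typical" display suggested in the steps above could be computed along the following lines. This is a minimal Python sketch: the table of typical durations and the measured tokens are hypothetical, and in practice the typical values would have to come from a reference corpus.

```python
def percent_of_typical(observed, typical):
    """Express an observed value as a percentage of its typical value."""
    if typical <= 0:
        raise ValueError("typical value must be positive")
    return 100.0 * observed / typical


# Hypothetical typical vowel durations (ms), e.g. corpus means per vowel.
TYPICAL_DURATION_MS = {"ae": 120.0, "iy": 100.0}

# Hypothetical measured tokens: (vowel label, observed duration in ms).
tokens = [("ae", 180.0), ("iy", 90.0)]

display = [
    (label, percent_of_typical(duration, TYPICAL_DURATION_MS[label]))
    for label, duration in tokens
]
for label, pct in display:
    print(f"{label}: {pct:.0f}% of typical duration")
```

Normalizing by segment identity is what lets a lengthened short vowel and a lengthened long vowel be read on the same scale; the same function applies to amplitude if typical amplitudes are available.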

SLIDE 32

Phonological equivalence

SLIDE 33

Which differences distinguish contrasts?

SLIDE 34

Some unanswered (but answerable) questions

What are the contrastive categories of spoken prosody?

How does the phonetic implementation of these categories vary systematically with context?

How do these categories relate to meaning and to interaction?

SLIDE 35

SLIDE 36

Evidence for a Feature-Cue-Based production planning model

Evidence that speakers can choose among individual cues

– Feature cues left behind in phonetic reduction
– New cues in challenging speaking circumstances
– Inventory constraints on LM modification

Evidence that speakers compute cue parameter values

– Covert contrast in development
– Conversational convergence: partial, governed by social values
– Inventory constraints on final lengthening
– Recall evidence from prosodic hierarchy: irregular pitch periods (ipp’s), VOT

SLIDE 38

Speakers can select among cues: cues left behind in reductions

…even if whole strings of segments are no longer delimitable in reduced forms compared with fuller pronunciations of the same lexical items, there will still be articulatory prosodies, superimposed upon the remaining sound material, which retain essential components of the fuller forms, the phonetic essence that characterizes the whole form class of a word. The extreme reduction [ai~~i] of the German modal particle eigentlich, 'actually', is a case in point. The length, palatality and nasality of its gliding movement reflect the polysyllabicity, the central nasal consonant and the final palatal syllable of the fuller forms. It is assumed that this phonetic essence triggers lexical identification in the listener.

Niebuhr & Kohler 2011

SLIDE 39

Speakers can select among cues

Speakers constrain LM modification to maintain contrasts in their language (Lavoie 2002)

SLIDE 40

Lavoie 2002

SLIDE 41

Lavoie 2002

SLIDE 42

Speakers can compute context-appropriate cue values

Recall examples from cues to prosodic structure:

– Duration adjustments of rhyme for boundaries, prominences (many languages)
– Duration adjustments of stop VOT for initial boundaries (Korean)

SLIDE 43

Speakers can compute context-appropriate cue values

Conversational convergence/divergence

Neilson 2011

SLIDE 44

Speakers can compute context-appropriate cue values

Inventory-sensitive constraints on degree of final lengthening:

“Nakai et al. (2009) have also shown that a quantity language like Finnish exhibits final lengthening, but its implementation is regulated to preserve the language-specific quantity system, namely the contrast between single or short vowels and double or long vowels. This important empirical finding raises the question whether final lengthening, and perhaps also other prosodic cues, is a universal cue to phrasing that is implemented in language-particular ways. If so, cue variation may be the result of the conspiracy of specific phonologies against universal tendencies in language, and experimental approaches are decisive to disentangle the two factors.” (Frota 2012)
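
One simple way to picture "regulated" lengthening is as a lengthening factor with a ceiling that keeps a short vowel out of the long-vowel range. The Python sketch below is a toy illustration only: the cap mechanism and every number in it are assumptions, not the analysis or the measurements of Nakai et al. (2009) or Frota (2012).

```python
def final_lengthening(duration_ms, factor, ceiling_ms=None):
    """Lengthen a phrase-final vowel, optionally capped to protect a contrast.

    ceiling_ms: hypothetical upper limit for a short vowel, chosen so that the
    lengthened short vowel stays below the range of long vowels.
    """
    lengthened = duration_ms * factor
    if ceiling_ms is not None:
        lengthened = min(lengthened, ceiling_ms)
    return lengthened


# Invented values: an 80 ms short vowel and a 110 ms ceiling for short vowels.
SHORT_VOWEL_MS = 80.0
SHORT_VOWEL_CEILING_MS = 110.0

unconstrained = final_lengthening(SHORT_VOWEL_MS, 1.5)
regulated = final_lengthening(SHORT_VOWEL_MS, 1.5, SHORT_VOWEL_CEILING_MS)
print(unconstrained, regulated)
```

The contrast between the two calls is the point: the same lengthening tendency yields different surface durations once a language-particular constraint is applied.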

SLIDE 45

A sketch of an extrinsic timing model of speech production

Stage 1) a phonological planning stage

– symbolic segmental representations are sequenced and slotted into an appropriate prosodic structure
– appropriate acoustic cues are selected for each segment’s features in its context

Stage 2) a phonetic planning stage

– cues are mapped onto sets of articulators
– appropriate values for spatial and temporal parameters of movement are computed

Stage 3) a motor-sensory implementation stage

– articulator movements are generated and tracked.

Turk & Shattuck-Hufnagel, Sp Prosody 2014

SLIDE 46

Outline

The past 30 years have seen two contrasting developments in speech prosody: the emergence of grammatical theories of prosodic structure, and deep disagreements about how prosodic structure should be annotated, what matters about prosody, and the role of prosodic phonology.

In this talk I will review what seems to me to be a parallel pair of contrasting views about segmental phonology, one of which focuses on the range of variation in word forms across contexts, and the other on the systematic nature of this variation in relation to abstract linguistic categories.

This approach seems to me to provide a framework for uniting the largely statistical and probabilistic view of phonetics with the more traditional view of phonology based on abstract contrastive categories, to the benefit of both.

SLIDE 47

Collaborators

Segmental phonology and phonetics: Cue-based approaches

– Ken Stevens, Morris Halle, Jay Keyser, Haruko Kawasaki, Helen Hanson

The role of prosodic structure in utterance planning

– Pat Keating, Alice Turk

Cues to prosodic structure

– Jon Barnes, Alejna Brugos, Nanette Veilleux
– Jennifer Cole

SLIDE 48

Both segmental and prosodic phonetics signal several different types of information

Linguistic constituents and structures
Situation-specific information

– the language/dialect
– the frequency/predictability of the words
– the speaker (attitude, emotion, physiology)
– the speaking situation
– the relationship between the speakers

SLIDE 49

The plan for this talk

Evidence that segmental phonetics has many of the same problems as prosodic phonetics

How a model based on cues to contrastive categories, and their parameter values, addresses some of these problems for segmental phonetics

How this might work for prosodic phonetics
Evidence for a phonetics model based on cues and their parameter values

Implications of this Cue-Based-Processing approach for prosodic transcription

SLIDE 50

Segmental phonetics has many of the same problems as prosodic phonetics

Often, expected cues aren’t there

– Reduction in segmental phonology

Ambiguity

SLIDE 51

Cue-based processing solves problems

SLIDE 52

Evidence for Cue Based Processing

Speakers typically leave cues behind in reduction
Speakers use new cues in challenging situations
Developing child speakers use cues differently
Listeners are sensitive to cue parameter values even if they can’t accurately report them

Listeners can parse cues and parameter values into their sources appropriately

Listeners can reproduce cue parameter values in their conversational or imitative behavior

Listeners tune their perception to cue parameter values of individual speakers

SLIDE 53

Speakers typically leave cues behind

SLIDE 54

Speakers use new cues

SLIDE 55

Developing child speakers use cues differently

SLIDE 56

Listeners are sensitive to cue parameter values

SLIDE 57

Implications of Cue-Based-Processing for Prosodic Transcription

SLIDE 58

After three decades of literature focusing on prosodic structure, the view that prosodic structure has a role to play as the organizing framework of speech is well established. This structure consists of the grouping of chunks of speech into prosodic constituents arranged according to a hierarchy, delimited by prosodic boundaries or edges and with prominences or heads at the various levels. Prominence strength and boundary strength reflect the hierarchy. Prosodic domains are marked by constellations of cues, which stand as the major empirical evidence for prosodic structure and the constituents it comprises. These cues have been shown to be used in lexical processing, in the disambiguation of syntax, or in the identification of morpho-syntactic units (as in bootstrapping).

SLIDE 59

SLIDE 60

An alternative title for this talk

Why I sleep better at night these days, and how you can too

SLIDE 61

Summary

What questions do we need to address?
What counts as an answer to the question ‘what is the prosody of this utterance?’

– An abstract specification of the linguistic constituents that make up the utterance and its structure?
– An exact specification of the quantifiable, measurable and prosodically-relevant aspects of the signal?
– A cue-based specification that links the abstract contrastive categories to their quantitative implementation?