Stefanie Shattuck-Hufnagel Speech Communication Group Research - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Cue-based analysis of speech: Implications for prosodic transcription

Stefanie Shattuck-Hufnagel Speech Communication Group Research Laboratory of Electronics MIT

slide-2
SLIDE 2

A stark view: Some unanswered questions

  • What are the contrastive categories of spoken prosody?
  • How does their phonetic implementation vary systematically with context?
  • How do they relate to meaning and to interaction?

slide-3
SLIDE 3

Prosodic parallels to a feature-cue-based approach to speech processing?

1) Segmental phonology: growing evidence that language users systematically control:

  • individual acoustic cues to contrastive phonemic segments
  • contextually appropriate parameter values of these cues

2) Models: representation and processing of surface phonetic information at this level of detail

  • feature-cue-based processing (Halle, Stevens)

3) Parallels in prosodic phonology?

  • if so, what are the implications for prosodic transcription?
slide-4
SLIDE 4

Instruction giver’s map Instruction follower’s map

slide-5
SLIDE 5

Reduction of surface word forms

It’s probably the same thing.

slide-6
SLIDE 6

probably the

slide-7
SLIDE 7

Strengthening/clarification of surface word forms

Are you going to have to do that all over again? ProbabLY.

slide-8
SLIDE 8

Extremes of variation in word forms

slide-9
SLIDE 9

Surface phonetic segments often not appropriate for transcription
  • Cues not aligned in time

– Cues to a feature can be distributed over time

  • nasality in V preceding a nasal coda C in I can go
  • duration of V preceding a voiceless coda C in I can’t go

– Cues to features of two segments can overlap in time

  • /n + dh/ of win those → interdental nasal
  • Cues selected individually

– Individual cues to features survive ‘deletion’ of segment

  • Duration of V preceding a ‘deleted’ voiced coda C in cat

– Individual cues to features are sometimes added

  • Glottalized word-final /t/ sometimes also has closure and release burst

slide-10
SLIDE 10

Feature-cue-based transcription provides a better fit

  • Stevens 2002 (extending Halle 1972): Two types of features, two types of cues

– Landmarks: abrupt spectral changes as cues to articulator-free features

  • Consonant, Vowel, Glide, Continuant, Sonorant, Strident

– Landmark-related cues: spectral patterns near Landmarks, as cues to articulator-bound features

  • Labial, Coronal, Velar, Voiced, Nasal etc.

– Additional acoustic events

slide-11
SLIDE 11

Landmark cues

Boyce et al. 2013

Rapid spectral changes across several energy bands which provide information about articulator-free features
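
The definition above can be sketched in code. This is a minimal illustration, not the detector used in the cited work: it assumes per-frame band energies in dB are already available, and the function name, the 9 dB threshold, and the three-band requirement are hypothetical choices made for the example.

```python
def detect_landmarks(band_energy_db, min_change_db=9.0, min_bands=3):
    """Return (frame_index, sign) pairs where energy changes abruptly in several bands.

    band_energy_db: list of frames, each a list of per-band energies in dB.
    min_change_db, min_bands: hypothetical thresholds chosen for illustration.
    """
    landmarks = []
    for i in range(1, len(band_energy_db)):
        # Frame-to-frame energy change in each band.
        deltas = [now - prev
                  for prev, now in zip(band_energy_db[i - 1], band_energy_db[i])]
        rising = sum(1 for d in deltas if d >= min_change_db)
        falling = sum(1 for d in deltas if d <= -min_change_db)
        if rising >= min_bands:
            landmarks.append((i, "+"))   # abrupt energy onset
        elif falling >= min_bands:
            landmarks.append((i, "-"))   # abrupt energy offset
    return landmarks


# Toy input: a low-energy stretch, a high-energy stretch, then low energy again.
quiet = [20.0, 20.0, 20.0, 20.0]
loud = [60.0, 58.0, 55.0, 50.0]
frames = [quiet] * 5 + [loud] * 5 + [quiet] * 5
print(detect_landmarks(frames))
```

A change confined to a single band is ignored, which is the point of requiring agreement across several bands. A practical detector would also smooth the energy tracks and merge neighbouring candidates.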

slide-12
SLIDE 12

Landmark labelling captures individual cue patterns

slide-13
SLIDE 13

Advantages of Landmark Cues in Speech Perception

  • Reliably produced

– 80% of predicted LMs in AEMT Corpus (Shattuck-Hufnagel & Veilleux 2007)

  • Robustly detectable (‘auditory edges’)
  • Highly informative

– Articulator-free features (~manner) provide estimate of CV structure of the utterance
– Identification of regions rich in cues to other features (place, voicing)
– Inter-Landmark times provide estimate of durational markers of prosodic structure
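
The last point, inter-Landmark times as a duration estimate, can be sketched as follows. The function names and the landmark times are made up for illustration, and comparing the final interval with the mean of the earlier ones is only one assumed way of turning the intervals into a lengthening measure.

```python
def inter_landmark_intervals(landmark_times_s):
    """Durations (in seconds) between successive landmarks."""
    times = sorted(landmark_times_s)
    return [later - earlier for earlier, later in zip(times, times[1:])]


def final_interval_ratio(landmark_times_s):
    """Ratio of the last inter-landmark interval to the mean of the earlier ones.

    Values well above 1 would be compatible with lengthening near the end of
    the stretch; this ratio is a hypothetical measure, not one from the slides.
    """
    intervals = inter_landmark_intervals(landmark_times_s)
    if len(intervals) < 2:
        raise ValueError("need at least three landmarks")
    earlier = intervals[:-1]
    return intervals[-1] / (sum(earlier) / len(earlier))


# Hypothetical landmark times in seconds.
times = [0.10, 0.25, 0.40, 0.55, 0.85]
print(inter_landmark_intervals(times))
print(final_interval_ratio(times))
```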

slide-14
SLIDE 14

Extension to Production A sketch of an extrinsic timing model

Stage 1: a phonological planning stage

– symbolic segmental representations are sequenced and slotted into an appropriate prosodic structure
– appropriate acoustic cues are selected for each segment’s features in its context

Stage 2: a phonetic planning stage

– cues are mapped onto sets of articulators
– appropriate values for spatial and temporal parameters of movement are computed

Stage 3: a motor-sensory implementation stage

– articulator movements are generated and tracked.

Turk and Shattuck-Hufnagel 2014
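
The data flow through the three stages can be illustrated with a toy sketch. It is not the authors' implementation: the cue rules reuse the earlier examples (vowel nasalization before a nasal coda, vowel duration before a voiceless coda), while the function names, segment sets, articulator labels, and millisecond values are all hypothetical placeholders.

```python
NASALS = {"m", "n", "ng"}
VOICELESS = {"p", "t", "k", "f", "s"}


def select_cues(vowel, coda):
    """Stage 1 (phonological planning): pick cues for the coda's features in context."""
    cues = []
    if any(c in NASALS for c in coda):
        cues.append(("nasalization", vowel))
    if any(c in VOICELESS for c in coda):
        cues.append(("shortened duration", vowel))
    return cues


def compute_parameters(cues, base_vowel_ms=120.0, shortening=0.75):
    """Stage 2 (phonetic planning): map cues to articulators and parameter values."""
    plan = []
    for name, host in cues:
        step = {"cue": name, "host": host}
        if name == "nasalization":
            step["articulators"] = ["velum"]
        elif name == "shortened duration":
            # Placeholder articulator set and placeholder duration values.
            step["articulators"] = ["jaw", "tongue body"]
            step["duration_ms"] = base_vowel_ms * shortening
        plan.append(step)
    return plan


def implement(plan):
    """Stage 3 (motor-sensory implementation): stand-in that only reports each step."""
    return [f"{'+'.join(step['articulators'])} -> {step['cue']} on {step['host']}"
            for step in plan]


# "can" has a nasal coda; "can't" has a nasal plus a voiceless stop.
print(implement(compute_parameters(select_cues("ae", ["n"]))))
print(implement(compute_parameters(select_cues("ae", ["n", "t"]))))
```

The sketch keeps cue selection (stage 1) separate from the computation of parameter values (stage 2), which is the distinction the model draws.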

slide-16
SLIDE 16

Evidence for a Feature-Cue-Based production planning model

  • Evidence that speakers can choose among individual cues

– Feature cues left behind in phonetic reduction
– New cues in challenging speaking circumstances
– Inventory constraints on LM modification

  • Evidence that speakers compute cue parameter values

– Conversational convergence: partial, governed by social values
– Covert contrast in development
– Inventory constraints on final lengthening

slide-18
SLIDE 18

Conversational convergence/divergence

Neilson 2011

slide-20
SLIDE 20

Covert contrast in child speech

Scobbie 1998; see also Gibbon 1990

slide-21
SLIDE 21

Covert contrast for stop voicing

Macken & Barton 1980 JCL

slide-22
SLIDE 22

Characteristics of the FCBP approach

  • More complex planning by the speaker

– Not ‘choose a surface allophone’
– But instead, ‘choose context-appropriate feature cues and cue parameter values’

  • Extensive interpretation by the listener

– Which linguistic constituents and structures does the signal contain cues for?
– What information about the interaction and the situation does the signal contain cues for?

slide-23
SLIDE 23

Parallels in Prosodic Processing?

  • Individual variation in cue patterns

– Irregular pitch periods at prosodic boundaries and prominences (Pierrehumbert & Talkin 1992, Dilley et al. 1996)

  • New cues in challenging speaking situations

– Dysarthric speakers use duration instead of F0 to signal question vs statement (Patel 2003)
– Whispered speech in Mandarin shows amplitude variation analogous to F0 shape for tones (Gao 2003)

  • Interpretation of ambiguous cues in context

– Early prominence patterns influence interpretation of ambiguous later prominence (Dilley & Shattuck-Hufnagel 1998)
– Early speaking rate influences interpretation of ambiguous cues to function words (Dilley & Pitt 2008)

slide-24
SLIDE 24

Parallels in Prosodic Processing?

  • Individual variation in cue patterns

– Irregular pitch periods at prosodic boundaries and prominences (Pierrehumbert & Talkin 1992, Dilley et al. 1996)

  • New cues in challenging speaking situations

– Dysarthric speakers use duration instead of F0 to signal question vs statement (Patel 2003)
– Whispered speech in Mandarin shows amplitude variation analogous to F0 shape for tones (Gao 2003)

  • Interpretation of cues in context

– Early prominence patterns influence interpretation of ambiguous later prominence (Dilley & Shattuck-Hufnagel 1998)
– Early speaking rate influences interpretation of ambiguous cues to function words (Dilley & Pitt 2008)

slide-25
SLIDE 25

New cues in challenging speaking situations: Dysarthric Speech

Patel 2003

slide-26
SLIDE 26

New cues in challenging speaking situations: Whispered Speech

https://lingos.co/blog/mandarin-tones/ Gao 1999

slide-27
SLIDE 27

New cues in challenging speaking situations: Whispered Speech

Gao 1999

slide-28
SLIDE 28

New cues in challenging speaking situations: Whispered Speech

Gao 2003

slide-29
SLIDE 29

Implications for Prosodic Transcription?

  • Determine the contrastive categories
  • Determine the range of appropriate cues and cue parameter values for each category, across contexts
  • Determine the relationship of the categories (and cue parameter values) to meaning and to interaction

slide-30
SLIDE 30

Implications for Prosodic Transcription?

  • Determine the contrastive categories
  • Determine the range of appropriate cues and cue parameter values for each category, across contexts
  • Determine the relationship of the categories (and cue parameter values) to meaning and to interaction
  • Can cue-based transcription move us toward these goals?

slide-31
SLIDE 31

Some useful steps

  • Consider prosodic elements in terms of distributed cues to contrastive elements and parameter values for those cues

– Rather than as a sequence of surface elements

  • Develop displays of parameters as compelling as F0 contours

– Duration and amplitude as % of typical
– Autodetection of irregular pitch periods

  • Create inventories of contrastive use of prosodic phrasing and prominence across languages

  • Investigate ‘phonological equivalence’ in prosody
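
Two of the display ideas above, parameters expressed as a percentage of typical and autodetection of irregular pitch periods, can be sketched roughly as follows. This is an assumption-laden illustration: the function names, the 20% period-to-period threshold, and the sample values are hypothetical, and the detection rule is a simple jitter-style heuristic rather than an established algorithm.

```python
def percent_of_typical(measured, typical):
    """Express a measured duration or amplitude as a percentage of a typical value."""
    return 100.0 * measured / typical


def irregular_periods(periods_ms, max_relative_change=0.2):
    """Indices of glottal periods that differ sharply from the preceding period.

    max_relative_change is a hypothetical threshold: a period is flagged when it
    is more than 20% longer or shorter than the one before it.
    """
    flagged = []
    for i in range(1, len(periods_ms)):
        change = abs(periods_ms[i] - periods_ms[i - 1]) / periods_ms[i - 1]
        if change > max_relative_change:
            flagged.append(i)
    return flagged


# A 180 ms vowel against a hypothetical typical value of 120 ms.
print(percent_of_typical(180.0, 120.0))

# Made-up period lengths in ms: regular at first, then irregular.
print(irregular_periods([8.0, 8.1, 7.9, 12.5, 6.0, 8.0]))
```

Where the "typical" value comes from (per speaker, per segment type, per speaking rate) is left open here and would need to be decided before such a display could be interpreted.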
slide-32
SLIDE 32

Phonological equivalence

slide-33
SLIDE 33

Which differences distinguish contrasts?

slide-34
SLIDE 34

Some unanswered (but answerable) questions

  • What are the contrastive categories of spoken prosody?
  • How does the phonetic implementation of these categories vary systematically with context?
  • How do these categories relate to meaning and to interaction?

slide-35
SLIDE 35
slide-36
SLIDE 36

Evidence for a Feature-Cue-Based production planning model

  • Evidence that speakers can choose among individual cues

– Feature cues left behind in phonetic reduction
– New cues in challenging speaking circumstances
– Inventory constraints on LM modification

  • Evidence that speakers compute cue parameter values

– Covert contrast in development
– Conversational convergence: partial, governed by social values
– Inventory constraints on final lengthening
– Recall evidence from prosodic hierarchy: ipp’s, VOT

slide-38
SLIDE 38

Speakers can select among cues: cues left behind in reductions

  • …even if whole strings of segments are no longer delimitable in reduced forms compared with fuller pronunciations of the same lexical items, there will still be articulatory prosodies, superimposed upon the remaining sound material, which retain essential components of the fuller forms, the phonetic essence that characterizes the whole form class of a word. The extreme reduction [ai~~i] of the German modal particle eigentlich, 'actually', is a case in point. The length, palatality and nasality of its gliding movement reflect the polysyllabicity, the central nasal consonant and the final palatal syllable of the fuller forms. It is assumed that this phonetic essence triggers lexical identification in the listener.

Niebuhr & Kohler 2011

slide-39
SLIDE 39

Speakers can select among cues

  • Speakers constrain LM modification to maintain contrasts in their language (Lavoie 2002)

slide-40
SLIDE 40

Lavoie 2002

slide-41
SLIDE 41

Lavoie 2002

slide-42
SLIDE 42

Speakers can compute context-appropriate cue values

  • Recall examples from cues to prosodic structure:

– Duration adjustments of rhyme for boundaries, prominences (many languages)
– Duration adjustments of stop VOT for initial boundaries (Korean)

slide-43
SLIDE 43

Speakers can compute context-appropriate cue values

  • Conversational convergence/divergence

Neilson 2011

slide-44
SLIDE 44

Speakers can compute context-appropriate cue values

  • Inventory-sensitive constraints on degree of final lengthening:
  • “Nakai et al. (2009) have also shown that a quantity language like Finnish exhibits final lengthening, but its implementation is regulated to preserve the language-specific quantity system, namely the contrast between single or short vowels and double or long vowels. This important empirical finding raises the question whether final lengthening, and perhaps also other prosodic cues, is a universal cue to phrasing that is implemented in language-particular ways. If so, cue variation may be the result of the conspiracy of specific phonologies against universal tendencies in language, and experimental approaches are decisive to disentangle the two factors.” (Frota 2012)
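
One way to picture an inventory-sensitive constraint of this kind is a lengthening rule that is capped for short vowels, so that a lengthened short vowel does not reach the duration range of long vowels. The sketch below is purely illustrative: the function name, the lengthening factor, the category boundary, and the margin are hypothetical values, not measurements from Nakai et al. (2009), and the cap is an assumed mechanism rather than the one the study reports.

```python
def final_lengthening(duration_ms, is_long, factor=1.5,
                      boundary_ms=150.0, margin_ms=10.0):
    """Lengthen a phrase-final vowel while keeping short vowels in the short range.

    factor, boundary_ms and margin_ms are hypothetical values for illustration.
    """
    lengthened = duration_ms * factor
    if not is_long:
        # Cap short vowels below the assumed short/long category boundary.
        lengthened = min(lengthened, boundary_ms - margin_ms)
    return lengthened


print(final_lengthening(80.0, is_long=False))   # lengthened freely
print(final_lengthening(120.0, is_long=False))  # lengthening is capped
print(final_lengthening(200.0, is_long=True))   # long vowels are not capped
```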

slide-45
SLIDE 45

A sketch of an extrinsic timing model of speech production

  • Stage 1) a phonological planning stage

– symbolic segmental representations are sequenced and slotted into an appropriate prosodic structure
– appropriate acoustic cues are selected for each segment’s features in its context

  • Stage 2) a phonetic planning stage

– cues are mapped onto sets of articulators
– appropriate values for spatial and temporal parameters of movement are computed

  • Stage 3) a motor-sensory implementation stage

– articulator movements are generated and tracked.

Turk & Shattuck-Hufnagel, Sp Prosody 2014

slide-46
SLIDE 46

Outline

  • The past 30 years have seen two contrasting developments in speech prosody: the emergence of grammatical theories of prosodic structure, and deep disagreements about how prosodic structure should be annotated, what matters about prosody, and the role of prosodic phonology.

  • In this talk I will review what seems to me to be a parallel pair of contrasting views about segmental phonology, one of which focuses on the range of variation in word forms across contexts, and the other on the systematic nature of this variation in relation to abstract linguistic categories.

  • This approach seems to me to provide a framework for uniting the largely statistical and probabilistic view of phonetics with the more traditional view of phonology based on abstract contrastive categories, to the benefit of both.

slide-47
SLIDE 47

Collaborators

  • Segmental phonology and phonetics: Cue-based approaches

– Ken Stevens, Morris Halle, Jay Keyser, Haruko Kawasaki, Helen Hanson

  • The role of prosodic structure in utterance planning

– Pat Keating, Alice Turk

  • Cues to prosodic structure

– Jon Barnes, Alejna Brugos, Nanette Veilleux
– Jennifer Cole

slide-48
SLIDE 48

Both segmental and prosodic phonetics signal several different types of information

  • Linguistic constituents and structures
  • Situation-specific information

– the language/dialect
– the frequency/predictability of the words
– the speaker (attitude, emotion, physiology)
– the speaking situation
– the relationship between the speakers

slide-49
SLIDE 49

The plan for this talk

  • Evidence that segmental phonetics has many of the same problems as prosodic phonetics
  • How a model based on cues to contrastive categories, and their parameter values, addresses some of these problems for segmental phonetics
  • How this might work for prosodic phonetics
  • Evidence for a phonetics model based on cues and their parameter values
  • Implications of this Cue-Based-Processing approach for prosodic transcription

slide-50
SLIDE 50

Segmental phonetics has many of the same problems as prosodic phonetics

  • Often, expected cues aren’t there

– Reduction in segmental phonology

  • Ambiguity
slide-51
SLIDE 51

Cue-based processing solves problems

slide-52
SLIDE 52

Evidence for Cue Based Processing

  • Speakers typically leave cues behind in reduction
  • Speakers use new cues in challenging situations
  • Developing child speakers use cues differently
  • Listeners are sensitive to cue parameter values even if they can’t accurately report them
  • Listeners can parse cues and parameter values into their sources appropriately
  • Listeners can reproduce cue parameter values in their conversational or imitative behavior
  • Listeners tune their perception to cue parameter values of individual speakers

slide-53
SLIDE 53

Speakers typically leave cues behind

slide-54
SLIDE 54

Speakers use new cues

slide-55
SLIDE 55

Developing child speakers use cues differently

slide-56
SLIDE 56

Listeners are sensitive to cue parameter values

slide-57
SLIDE 57

Implications of Cue-Based-Processing for Prosodic Transcription

slide-58
SLIDE 58
  • After three decades of literature focusing on prosodic structure, the view that prosodic structure has a role to play as the organizing framework of speech is well established. This structure consists of the grouping of chunks of speech into prosodic constituents arranged according to a hierarchy, delimited by prosodic boundaries or edges and with prominences or heads at the various levels. Prominence strength and boundary strength reflect the hierarchy. Prosodic domains are marked by constellations of cues, which stand as the major empirical evidence for prosodic structure and the constituents it comprises. These cues have been shown to be used in lexical processing, in the disambiguation of syntax, or in the identification of morpho-syntactic units (as in bootstrapping).

slide-59
SLIDE 59
slide-60
SLIDE 60

An alternative title for this talk

  • Why I sleep better at night these days, and how you can too

slide-61
SLIDE 61

Summary

  • What questions do we need to address?
  • What counts as an answer to the question ‘what is the prosody of this utterance?’

– An abstract specification of the linguistic constituents that make up the utterance and its structure?
– An exact specification of the quantifiable, measurable and prosodically-relevant aspects of the signal?
– A cue-based specification that links the abstract contrastive categories to their quantitative implementation?