Spontaneous Speech: How People Really Talk and Why Engineers Should Care (PowerPoint PPT Presentation)



Slide 1

Spontaneous Speech

How People Really Talk and Why Engineers Should Care

Elizabeth Shriberg

Slide 2

Acknowledgments

Matthew Aylett, Hermann Ney, Harry Bratt, Mari Ostendorf, Ozgur Cetin, Fernando Pereira, Nizar Habash, Owen Rambow, Mary Harper, Andreas Stolcke, Dilek Hakkani-Tur, Isabel Trancoso, Jeremy Kahn, Gokhan Tur, Kornel Laskowski, Dimitra Vergyri, Robin Lickley, Wen Wang, Yang Liu, Jing Zheng, Evgeny Matusov, Matthias Zimmermann

Other SRI and ICSI Colleagues

Artwork: Patrick Stolcke

Slide 3

Spontaneous speech

• Most speech produced every day is spontaneous
• It has been this way for a long time: natural spoken language precedes written language
• Speaking requires no special training, is efficient, and carries a wealth of information

Introduction Sentences Disfluencies Turn-Taking Emotion Conclusions

(Timeline graphic: long, long ago / long ago / today)

Slide 4

Problems for NLP

• Most natural language processing, however, is based on text
• Two problems:
  - Spontaneous speech violates assumptions stemming from text-based NLP approaches
  - Spontaneous speech is rich in information that is often not utilized by spoken language technology
• Goal of this talk: suggest that technology can work better if we pay attention to the special properties of spontaneous speech


Slide 5

Four challenge areas

• Humans do these easily, but computers do not
• Tasks important for a range of computational applications
• Tasks are interrelated
• Currently far from “solved”
• Apply across languages (although the focus here is on English)


  • 1. Recovering punctuation
  • 2. Coping with disfluencies
  • 3. Allowing real turn-taking
  • 4. Hearing real emotion


Slide 6

Topics cover a range from lower-level to higher-level tasks

• Punctuation: one speaker, basic segmentation
• Disfluencies: within segments, regions of disfluency
• Turn-taking: expand from single to multiple speakers
• Emotion: hearing more than just words


In this talk, for the benefit of engineers: more focus on lower-level than higher-level tasks.

Slide 7

Punctuation Disfluencies Turn-taking Emotion

we -- it drives more like a car anyway . that’s something i i wouldn’t go as far as to say that it’s it’s just like a car . but uh the- that’s what the advertisement would say . ah ok


Slide 8

Four claims

If computers were really listening:

1. They would listen for sentence units, not speech between pauses.
2. They would cope with (and maybe even use) disfluencies.
3. They would model overlap and turns that are not strictly sequential.
4. They would “hear” our emotions.

Goal: not “strong AI,” but rather engineering solutions to model the cues that humans use.

Slide 9

  • 1. Recovering Hidden Punctuation

If computers were really listening, they would listen for sentences instead of speech between pauses.

Slide 10

Recovering hidden punctuation

• In many written languages, punctuation is explicit
• But in speech, punctuation is conveyed by other means
• Most ASR systems output only a stream of words; punctuation is “hidden”:
  - tomorrow is fado here is the banquet tonight where
  - Tomorrow is fado, here. Is the banquet tonight? Where?
• Problem for downstream natural language processing (NLP)
• Will focus on sentence-level punctuation, the most important


Slide 11

Tasks that need sentence boundaries

• Processing by humans: humans comprehend transcripts of spoken language more effectively if they contain punctuation [Jones et al.]
• Processing by machine: ASR, parsing, information extraction, summarization, translation


Slide 12

Segmenting speech: the common approach

• ASR systems perform better on shorter segments of speech: keeps the search manageable; avoids insertions in nonspeech regions
• In some dialog systems this is no problem, since turns ≈ 1 sentence
• But conversational (and read) speech often has longer turns
• Current ASR systems typically chop at pauses: pauses are easy to detect automatically, and the approach avoids fragmenting words
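The pause-based chopping described above can be sketched in a few lines, assuming word-level timestamps are available from the recognizer (the function name and the 0.5-second threshold are illustrative):

```python
def chop_at_pauses(words, min_pause=0.5):
    """Split a list of (word, start_time, end_time) tuples into
    segments wherever the silence gap exceeds min_pause seconds."""
    segments, current = [], []
    prev_end = None
    for word, start, end in words:
        # Cut whenever the gap since the previous word is long enough.
        if prev_end is not None and start - prev_end > min_pause:
            segments.append(current)
            current = []
        current.append(word)
        prev_end = end
    if current:
        segments.append(current)
    return segments
```

Note that any hesitation pause longer than the threshold splits the stream, whether or not a sentence actually ended there; that weakness is exactly the point made on the next slide.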


Slide 13

Pauses ≠ sentence boundaries

• Many real sentence boundaries have no pause
• Speakers use other cues, including intonation; some use a “rush-through” to prevent interruption
• And some nonboundaries do have a pause (hesitations)
• Example statistics (Switchboard): 56% of within-turn sentences have no pause; 10% of within-turn pauses are not sentence boundaries
• Focus here is on NLP, but FYI, sentences also help ASR: a significant (3% relative) reduction in WER by segmenting at sentence boundaries instead of pauses


Slide 14

Computational models for punctuation

• Typically involve combining lexical and prosodic cues
• Language model: N-grams over words and punctuation tokens
• Prosody model:
  - Features: pauses, duration, F0, turn-taking
  - Models: decision trees, neural networks
  - Models improved by sampling and ensemble techniques
• Prosody and LM combined via HMMs, maximum entropy models, or CRFs [Liu et al., 2005]
• Gains from multiple-system combination
• Research based on reference words or 1-best ASR; recent work uses multiple ASR hypotheses [Hillard et al., 2004]
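As a toy illustration of combining the two knowledge sources, the sketch below log-linearly interpolates per-boundary posteriors from a prosody model and an LM, then thresholds the result. This is a simplified stand-in for the HMM/maxent/CRF combinations cited above; the function names, the weight, and the `<s>` marker are all assumptions:

```python
import math

def boundary_posteriors(prosody_probs, lm_probs, lm_weight=0.5):
    """Log-linear interpolation of two streams of per-boundary
    posteriors, renormalized back to probabilities."""
    combined = []
    for p_pros, p_lm in zip(prosody_probs, lm_probs):
        log_p = (1 - lm_weight) * math.log(p_pros) + lm_weight * math.log(p_lm)
        log_q = (1 - lm_weight) * math.log(1 - p_pros) + lm_weight * math.log(1 - p_lm)
        combined.append(1.0 / (1.0 + math.exp(log_q - log_p)))
    return combined

def segment(words, posteriors, threshold=0.5):
    """Insert a sentence marker after each word whose combined
    boundary posterior exceeds the decision threshold."""
    out = []
    for word, p in zip(words, posteriors):
        out.append(word)
        if p > threshold:
            out.append("<s>")
    return out
```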


Slide 15

Sentence segmentation: state-of-the-art results

• Same system of Liu et al. just mentioned
• Systems use lexical, prosodic, POS, and “turn” information
• Difficult problem; large degradation for ASR (especially CTS)
• Broadcast News has higher NIST error rates, due to:
  - Fewer true boundaries (longer sentences)
  - Few 1st-person pronouns and fillers (cues to sentence starts)

NIST error rate (errors per reference sentence):

                            ASR     Ref Words
  Baseline (chance)         100     100
  Broadcast News Speech     54.3    46.3
  Conv. Telephone Speech    41.9    29.3


Slide 16

Sentence segmentation and parsing

• Parsing is useful for many downstream NLP tasks
• Parsing algorithms need short input units; otherwise processing becomes too computationally expensive (super-linear algorithmic complexity)
• Parsing of text can use sentence punctuation; for speech, the units must be inferred automatically
• Hot off the press: results from the JHU 2005 Workshop project on parsing and “metadata” (thanks to M. Harper & Y. Liu); earlier related work: [Kahn et al., 2004]


Slide 17

Sentence segmentation and parsing

[JHU WS-2005; M. Harper, Y. Liu]

• Charniak parser on true words or ASR (~13% WER) output
• Parsing results (bracket F-measure):

  Sentence segmentation           ASR      Ref Words
  Human                           71.42    83.25
  Automatic [Liu et al., 2005]    64.03    74.34
  Pause-based (0.5 sec)           54.62    63.09

• Sentences really matter: large effects (1% is significant)
• Automatic system (words and prosody) results are more than halfway from pause-based to reference-based performance


Slide 18

Decision threshold depends on task [Y. Liu]

(Plot: sentence-detection error metric and parsing F-measure, higher = better, as a function of decision threshold)

• Threshold for the sentence task itself may be suboptimal for downstream NLP
• Optimal threshold for the boundary task: a wide range in the middle
• But parsers prefer lower decision thresholds (shorter units)
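The observation above, that the best threshold for boundary detection need not be best for the downstream task, suggests tuning the threshold directly on the downstream metric. A minimal sketch, assuming the caller supplies a scoring function (e.g., parsing F-measure on a dev set as a function of threshold):

```python
def tune_threshold(downstream_score, candidates):
    """Sweep candidate decision thresholds and keep the one that
    maximizes a downstream metric (e.g., parsing F-measure),
    rather than the boundary-detection metric itself."""
    best_t, best_s = None, float("-inf")
    for t in candidates:
        s = downstream_score(t)
        if s > best_s:
            best_t, best_s = t, s
    return best_t, best_s
```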


Slide 19

Sentence segmentation and other NLP

• Other areas of NLP have also been using models trained on text containing punctuation:
  - Information extraction [Makhoul et al., IS05]
  - Summarization [Murray et al., IS05]
  - Machine translation (in a moment)
• As these areas become consumers of ASR, problems arise when punctuation must be inferred:
  - Automatic segmentation can cause downstream errors
  - Basic assumptions of the scoring paradigm are violated
• Little published work in this new area, but new programs (like DARPA GALE) mean we should see some soon


Slide 20

Sentence segmentation & machine translation

• Like parsing, MT requires chopping speech into small units
• Some meanings depend on within-sentence context, so we need to get the boundaries right
• For example, suppose Isabel asks Fernando whether the audience gave him a hard time at his keynote:


  Correct: “Não . Foram simpáticos .” → “No. They were nice!”
  ASR + auto punctuation: “Não foram simpáticos .” → MT output: “They were not nice.”

Slide 21

Sentence segmentation and MT scoring

• MT scoring relies on sentence-by-sentence comparison of reference and hypothesized translations; if segmentations differ, sentence-level comparison is not meaningful
• In parsing, one can string together all reference and all system output into one long “sentence” and then apply standard metrics
• But in MT, N-gram metrics (BLEU) are too forgiving of reorderings; too many spurious far-away matches get counted as correct
• Recently proposed solution [Matusov et al., 2005]: resegment hypotheses according to the reference sentences
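The resegmentation idea can be illustrated with a much-simplified sketch: align the hypothesis words to the concatenated reference by word-level edit distance, then cut the hypothesis wherever the alignment crosses a reference-segment boundary. The actual method of Matusov et al. [2005] is more sophisticated; every name below is illustrative:

```python
def word_edit_alignment(hyp, ref):
    """Standard Levenshtein DP over words, returning the backtraced
    alignment path as a list of (hyp_index, ref_index) states."""
    n, m = len(hyp), len(ref)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    path, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
        path.append((i, j))
    path.reverse()
    return path

def resegment(hyp, ref_segments):
    """Cut the hypothesis word stream wherever the alignment crosses
    a reference-segment boundary (in the spirit of Matusov et al.)."""
    ref = [w for seg in ref_segments for w in seg]
    ends, total = [], 0
    for seg in ref_segments:
        total += len(seg)
        ends.append(total)
    path = word_edit_alignment(hyp, ref)
    # For each internal reference boundary, find the matching hyp cut.
    cuts = [next(i for (i, j) in path if j == b) for b in ends[:-1]]
    segments, prev = [], 0
    for c in cuts:
        segments.append(hyp[prev:c])
        prev = c
    segments.append(hyp[prev:])
    return segments
```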


Slide 22

  • 2. Coping with Disfluencies

If computers were really listening, they would cope with (and maybe even use) disfluencies.

Slide 23

Disfluencies: frequency and types


Across languages and studies, disfluencies affect ~5-15% of words.

• Edits (edited region → interruption point (IP) → editing terms → repair region):
  - Deletion: “the”
  - Substitution: “the SUV the car”
  - Insertion: “the new car the car”
  - Repetition: “the the”
• Fillers: “uh”

Slide 24

Distribution of disfluencies


• The majority of disfluencies are not “errors”; rather, they are used to manage the difficult task of speaking in time
• Not randomly distributed:
  - Occur at/near beginnings of sentences
  - Occur at points of higher entropy
• Related to both the cognitive demands of speaking in time and discourse phenomena:
  - Cognitive: planning
  - Discourse: turn-taking (grabbing or keeping the floor)

Slide 25

Speaker differences: “repeaters” and “deleters”

Rates and types of disfluencies are highly speaker-dependent.

A repeater: “but i - i would believe that they - they need to be able to equip the - the teachers with - with what they need to do their job and - and …”

A deleter:

“actually the test sets - are they - the italia- - i mean the italian and finn - the test sets are all different. i guess wha- - the italian was - was that done in the car?”

Relevance:
• Machine models: repeats are easier to process than deletions
• Human processing: two different strategies
• Personal relationships? (opposites do seem to attract ☺)


Slide 26

Disfluency processing: humans vs. machines


• Humans easily process naturally disfluent speech; in fact, people even appear to use disfluencies in comprehension [Arnold, 2004; Watanabe, 2005]
• But machines have trouble with disfluencies, which are problematic for different reasons:
  - Cut-off words (not in the ASR dictionary)
  - Fillers (disrupt the language-model history)
  - Edit disfluencies (poor LM probability at the interruption point)

Slide 27

Computational models of disfluencies


• Two tasks:
  - Find the interruption point (IP)
  - Determine how far back to delete words (the edit extent)
• IP detection methods are similar to sentence segmentation:
  - N-gram LMs or features
  - Prosodic features and models
  - HMM, maxent, or CRF for classification
• Finding the edit extent is challenging because of cross-serial dependencies between edited and corrected words
• Approaches use rules, POS information [Heeman, 1999], tree-adjoining grammars [Charniak & Johnson, 2004], maxent, and CRFs [Liu et al., IS2005]
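As a toy illustration of the rule-based end of this spectrum, the sketch below treats an immediately repeated word sequence as edited region plus repair and keeps only the repair. Real systems (the cited TAG and CRF models) also handle substitutions and deletions; everything here is illustrative:

```python
def remove_repetition_edits(words):
    """Very simple disfluency cleanup: delete the first copy of any
    immediately repeated word sequence (up to length 3), i.e., keep
    only the repair region of a repetition edit."""
    out = list(words)
    changed = True
    while changed:
        changed = False
        for k in (3, 2, 1):  # prefer longer repeated spans
            for i in range(len(out) - 2 * k + 1):
                if out[i:i + k] == out[i + k:i + 2 * k]:
                    del out[i:i + k]  # drop the edited region
                    changed = True
                    break
            if changed:
                break
    return out
```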

Slide 28

Automatic detection results


• Difficult problem: high error rates even on human transcripts
• Large degradation on ASR output
• Edit and filler word tagging F-measure (combined precision & recall; bigger is better); for system details see [Liu et al., IS2005]:

  Task       Ref Words   ASR output
  edits      .666        .459
  fillers    .897        .811

Slide 29

Parsing and edit disfluencies

Parsers do not deal well with disfluencies because:
• Tree structures can’t model cross-serial dependencies
• Disfluencies fragment the event space
• Treebanked training data contains few disfluencies

Solution [Charniak & Johnson]: detect edits, remove them, parse the remaining material, then insert the edits back into the parse.

Bracket F-measure results [2005 JHU workshop, Harper et al.]:

  Edit removal method    Ref words   ASR output
  Manual (reference)     88.06       76.55
  Automatic              83.25       71.42
  None (parse edits)     81.79       71.36

• Large loss from manual to the default (none)
• Significant gain from automatic modeling, but not for ASR


Slide 30

Disfluencies and other applications

• In addition to parsing, disfluencies impact other NLP tasks concerned with content (which need disfluencies removed):
  - Information extraction: “governor uh president bush”
  - Summarization (wants fluent, short output)
  - Machine translation (wants to translate the intended content)
• In addition, unlike sentence boundaries, disfluencies convey speaker state and style information:
  - Courtroom applications
  - Tutoring
• So maybe disfluencies shouldn’t be completely lost for certain applications, e.g., MT


Slide 31

  • 3. Allowing for Realistic Turn-taking

If computers were really listening, they would model overlap and turns that are not strictly sequential.

Slide 32

Allowing for realistic turn-taking

• ASR systems tend to listen to one speaker at a time, facilitated by recordings on separate channels
• But conversations are interactive: “joint projects” [Clark]
• Classic papers in conversation analysis [Sacks et al., 1974; Schegloff; Jefferson]; more recent work [Ward & Tsukahara, 2000; ten Bosch et al., 2005]


Slide 33

Different functions yielding speaker overlap

• Backchanneling: “We bought our first car and it’s getting old now.” / “Uh-huh.” “Oh.”
• Floor-grabbing: “We bought our first car and it’s -” / “Wel- well I have a Honda.”
• Starting before the current talker is done (projection): “We bought our first car.” / “Does it get good mileage?”
• Multiple people responding to a previous talker: “We bought our first car.” / “Wow.” “Great.”


Slide 34

Overlap percentages in different data sets

  % Units Overlapped              By Words   By Segments
  ICSI Meetings: Interactive      17.0       54.4
  ICSI Meetings: Directed         8.8        31.4
  Phone Conversations: Familiar   11.7       53.0
  Phone Conversations: Strangers  12.0       54.4

• Used pause-based segments, since automatic
• Rate for segments is high because overlaps tend to be short and located at turn-relevant points
• Telephone speech is not so different from meetings
• Talk with strangers has the same rate as talk with friends/family

Slide 35

Scoring recognition of overlapping speech

Current NIST scoring of multiple-speaker recognition yields overlap segments containing non-overlapped words. Thus, the degradation due to overlap is actually underestimated.

(Diagram: segments from Spkr 1-4, with the actual overlap regions marked)

Slide 36

Impact of overlap on ASR

• NIST RT-05S Meeting Evaluation [thanks to J. Fiscus, NIST]
• ASR results for the multiple-distant-microphone condition: no speaker separation, just delay-sum beamforming
• Promising work by many groups on blind speaker separation, but not yet applied to realistic ASR scenarios

(Plot: ICSI/SRI word error rate vs. number of overlapping speakers, 1-5 and ALL)

Slide 37

Speaker overlap & language modeling

• Most language models look at speakers individually
• But some predictive information exists across speaker changes; for example, Ji & Bilmes [2004] found lower language-model perplexity by modeling cross-speaker information

  Example: “Really?” / “Yeah”


Slide 38

Turns & dialog act modeling

Modeling turn sequences is useful for dialog act tagging [Jurafsky et al., 1997; Stolcke et al., 2000; Venkataraman et al., 2002]. Harder to model in meetings [Ang et al., 2005; Ji & Bilmes, 2005], but the patterns are there:

  “What’s next on the agenda?” (Question)
  “Disk space.” (Statement)        p(S | Q) = 0.407 **
  “Uh-huh.” (Backchannel)          p(B | Q) = 0.013

But ordering is less canonical, especially for more than 2 talkers. Example: many responses to a suggestion:

  “ok ok that’s good sure fine with me let’s ju- and see what we get it’ll be easier on the subjects”
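Transition probabilities like p(S | Q) = 0.407 above are just conditional relative frequencies over adjacent dialog-act labels. A toy maximum-likelihood estimator (the label sequence in the test is invented for illustration):

```python
from collections import Counter, defaultdict

def dialog_act_bigrams(act_sequence):
    """Maximum-likelihood estimates of p(next act | previous act)
    from a sequence of dialog-act labels."""
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(act_sequence, act_sequence[1:]):
        pair_counts[prev][nxt] += 1
    probs = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        probs[prev] = {nxt: c / total for nxt, c in counter.items()}
    return probs
```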


Slide 39

Endpointing for dialog systems

• So far we have discussed turn-taking among humans, but good turn-taking is also critical in human-computer dialog
• This last topic in the turn-taking section demonstrates the interrelatedness of all the topics so far:
  - Finding sentence boundaries
  - Detecting disfluency (hesitation)
  - Turn-taking (here, between human and machine)
• Application: endpointing of user input
• Most current systems wait for a pause of some minimum length
• Problem: people pause mid-utterance while thinking


Slide 40

Endpointing for dialog systems

Example from an in-car navigation system data collection [Bosch-VW-Stanford-SRI NIST ATP project, thanks to H. Bratt]:

  “What do I dooooooooo after I cross the river?”  (pauses of 0.97 and 1.40 secs)

If the endpointer stops listening at a pause, it will:
• Miss crucial content (post-hesitation regions have high entropy)
• Annoy the user, because a human listener would know the speaker isn’t done yet
And if we just increase the pause threshold, then at true boundaries the speaker has to wait the extra time.
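One way to act on this is to make the endpoint decision depend on more than pause length: endpoint early at a short pause only when a prosody/LM model says the utterance is likely complete, and otherwise fall back to a long timeout. A sketch with illustrative thresholds (not the actual Ferrer et al. system):

```python
def should_endpoint(pause_sec, end_of_utt_prob,
                    short_pause=0.5, long_pause=1.5, prob_threshold=0.7):
    """Decide whether to stop listening, given the current pause length
    and a model's probability that the utterance is complete."""
    if pause_sec >= long_pause:
        return True   # hard timeout, regardless of the model
    if pause_sec >= short_pause and end_of_utt_prob >= prob_threshold:
        return True   # confident early endpoint at a short pause
    return False      # keep listening (e.g., mid-utterance hesitation)
```

With these settings, a 0.97-second hesitation like the one in the example would not trigger the endpointer as long as the model's end-of-utterance probability stays low.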


Slide 41

Better endpointing: Results [Ferrer et al., 2003]


• Predict end of utterance using prosody & LM; baseline: pause-only
• Dialog system data
• Dramatic fall in false alarms (60% reduction) and in speaker waiting time (SWT) at true utterance ends (80% reduction)
• Most of the benefit is from prosody

(Plot: false alarm rate vs. speaker waiting time for Baseline, Prosody Only, and Prosody + LM)

Slide 42

Computational challenges

• In ASR acoustic modeling, overlap needs speaker-separation work (an ongoing effort by multiple research groups)
• But overlap does not occur randomly; we should utilize word-level and prosodic cues to model turn-taking
• E.g., in a study at ICSI [2001], we found that in meetings:
  - When grabbing the floor, people raise their energy and F0
  - The locations in another’s speech at which they jump in are also characterized prosodically (similar to sentence ends)
• Another relevant, fascinating area: turn-taking for interactive conversational agents [e.g., Ward et al.; Fujie et al.]


Slide 43

  • 4. Detecting Emotion

If computers were really listening, they would “hear” our emotions.

Slide 44

Detecting real emotion

• Emotion recognition is increasingly important for many applications:
  - Customer service
  - Navigation systems
  - Speech-enabled toys and games
  - Automatic tutoring
  - Health monitoring
• Real example from a customer service application (thanks to D. Hakkani-Tur and AT&T)


Slide 45

Challenges for emotion research

• Emotion is a “hot” research topic, with lots of interest
• Problem: lack of large, publicly available data for studying real emotion
• Real emotions occur only in real applications, but:
  - The data is proprietary
  - There are privacy issues for the speakers
• Therefore, most work is on acted speech: easier to obtain and control, and it doesn’t need to be labeled for emotion
• But while acted emotions are easier to obtain, real emotions are harder to recognize [Douglas-Cowie; Cowie; Batliner; Devillers]


Slide 46

Emotion detection for DARPA Communicator

[Ang et al., 2002]

• Example of a large data set with a mock application (air travel)
• Emotion labels: “neutral”, “annoyed”, “frustrated”, other
• Automatic classification based on prosodic and lexical features using ASR output
• Prosodic features extracted automatically:
  - Pitch, energy, duration, voice quality
  - Normalized for the talker
• Additional features labeled by humans, to assess correlations:
  - Stylistic: hyperarticulation and “raised voice”
  - Repeated request or correction after a system error
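Per-talker normalization, as mentioned above, rescales each speaker's raw feature values by that speaker's own statistics, so that “raised pitch” means raised relative to that speaker's habitual range. A minimal sketch (the data layout is illustrative):

```python
import statistics

def normalize_per_talker(features):
    """Z-normalize (talker, value) pairs using each talker's own
    mean and (population) standard deviation."""
    by_talker = {}
    for talker, value in features:
        by_talker.setdefault(talker, []).append(value)
    stats = {t: (statistics.mean(v), statistics.pstdev(v))
             for t, v in by_talker.items()}
    normed = []
    for talker, value in features:
        mean, sd = stats[talker]
        # Guard against a degenerate talker with constant values.
        normed.append((talker, 0.0 if sd == 0 else (value - mean) / sd))
    return normed
```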


Slide 47

Emotion detection: results and implications

• Words alone are not very helpful in this domain:
  - Users often shorten utterances after ASR errors
  - Users cannot switch to a human; they must stay in the domain
• Prosodic cues are helpful (pitch, energy, duration)
• Interactions with cues not represented in acted speech:
  - Dialog context (hand-labeled)
  - Hyperarticulation (hand-labeled)

(Audio examples: neutral, annoyed, frustrated, tired)


Slide 48

Conclusions & Future Directions

Slide 49

Conclusions

• Described four challenge areas:
  - Recovering hidden punctuation
  - Coping with disfluencies
  - Allowing for realistic turn-taking
  - Hearing real emotion
• In each area, speakers convey useful information that humans process easily but that is often overlooked in current technology
• This is important as the focus turns to speech processing for the benefit of natural language understanding


Slide 50

Future directions: features and models

• In all four areas, performance would benefit from:
  - Improved basic features (lexical, prosodic)
  - Better methods for feature integration
  - Robustness to ASR errors
• Better integration with downstream processing, by preserving multiple hypotheses (and their probabilities); this side-steps the need to pick task-dependent thresholds
• For multimodal applications: integrate with other cues, such as visual information


Slide 51

Future directions: scope of modeling

• The first three areas in particular (punctuation, disfluencies, turn-taking) should benefit from longer-range dependency modeling:
  - Language modeling beyond N-grams (e.g., parsing models)
  - Prosodic modeling with longer-range features
  - Joint decoding of multiple related tasks
• Modeling individual speakers:
  - Large variation in spontaneous speech [Blaauw; Eskenazi]
  - Occurs in all four areas mentioned


Slide 52

In Closing

Attention to how people really talk should yield:
• Better scientific understanding of natural speaking behavior
• Long-term benefit for intelligent spoken language technology

Thank You