Various Approaches: The Prosody of Turn-Taking



SLIDE 1

Various Approaches

The Prosody of Turn-Taking

  • classic linguistic methods
  • perception of synthesized stimuli
  • conversation analysis
  • acoustic measurement and hypothesis testing
  • machine learning
  • dialog system user studies

SLIDE 2

A Case Study in the Identification of Prosodic Cues to Turn-Taking: Back-Channeling in Arabic

Nigel Ward and Yaffa Al Bayyari

University of Texas at El Paso

Interspeech 2006

SLIDE 3

The Second Channel

  • Form: gesture, gaze, prosody ...
  • Content: uncertainty, novelty, dialog control ...
  • Value: efficiency, satisfaction + (Shriberg 2005) ...

SLIDE 4

1. Project Aims

Discover the rules governing back-channeling in Arabic, to teach soldiers how to “show you’re listening”

  • a qualitative description, to use for teaching, plus
  • a quantitative description, to drive the characters

SLIDE 5

2. Problem Formulation

co-construction, planning for dialog goals, mutual models, joint action

if X then back-channel

bfgklfm ...

tvzyxyv ...

  • using only past information (no look-ahead)
  • using only features computable from the signal (no hand-labeling)

SLIDE 6

3. Corpus Preparation

All the usual issues ...

UTEP Corpus of Iraqi Arabic

112 minutes, 689 back-channel tokens

big enough to

  • find a good dialog
  • do proper evaluation
SLIDE 7

4a: Feature Discovery: unclear where to look

(Timeline figure: distant causes; likely location of proximate cause; back-channel; time axis marked at 1000 ms and 200 ms)

Complications:

  • time from cue to back-channel varies (complicates machine learning)
  • salient events can obscure the cues (complicates perceptual analysis)

Example: what do the following have in common?

SLIDE 8

4b: Feature Discovery: the overwhelming multitude of features

computed over prosody (pitch, energy, timing), voicing ... (possibly in combination), for example:

  • height of the highest pitch peak in the last 400 ms, relative to the baseline over the past 2000 ms
  • first coefficient of a second-order approximation to the pitch curve over the last three syllables before a pause of at least 200 milliseconds
  • presence of a 150 millisecond region with the pitch consistently below the 26th percentile

...
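
Two of the example features above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' detectors (which the slides say were written in C): it assumes one pitch value every 10 ms, unvoiced frames stored as 0, the "baseline" taken as the median of voiced frames, and the percentile computed by nearest rank over the voiced frames passed in. The function names are hypothetical.

```python
import statistics

FRAME_MS = 10  # assumption: one pitch value every 10 ms


def voiced(frames):
    """Keep only voiced frames (assumption: unvoiced frames are stored as 0)."""
    return [f for f in frames if f > 0]


def peak_height_over_baseline(pitch_hz, recent_ms=400, baseline_ms=2000):
    """Highest pitch in the last `recent_ms`, as a ratio to the median pitch
    over the last `baseline_ms`. Returns None if either window is unvoiced."""
    recent = voiced(pitch_hz[-(recent_ms // FRAME_MS):])
    past = voiced(pitch_hz[-(baseline_ms // FRAME_MS):])
    if not recent or not past:
        return None
    return max(recent) / statistics.median(past)


def has_low_pitch_region(pitch_hz, region_ms=150, percentile=26):
    """True if some run of at least `region_ms` stays strictly below the given
    percentile of the voiced frames in `pitch_hz` (nearest-rank percentile)."""
    ranked = sorted(voiced(pitch_hz))
    if not ranked:
        return False
    # Integer ceiling of percentile * n / 100, clamped to at least rank 1.
    rank = max(1, (percentile * len(ranked) + 99) // 100)
    threshold = ranked[rank - 1]
    needed = region_ms // FRAME_MS
    run = 0  # length of the current run of low, voiced frames
    for value in pitch_hz:
        run = run + 1 if 0 < value < threshold else 0
        if run >= needed:
            return True
    return False
```

Both functions look only at the frames they are given, so calling them on the pitch track up to the current moment respects the no-look-ahead constraint from the problem formulation.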

SLIDE 9

4c: Feature Discovery: harnessing perception

audio inspection

  • perceive lots of information
  • hard to focus on specific features
  • hard to scan to specific places

visual inspection

  • perceive only what’s graphically salient
  • easy to focus on specific features
  • easy to scan to specific places

neither

  • no subjectivity
  • no insight

both

  • perceive lots of information
  • navigate quickly
  • focus easily
  • need tools
SLIDE 10

A Custom Tool for Integrated Analysis

Didi

SLIDE 11

4d: Feature Discovery: quantifying perceptions

(Flowchart: perceptually identified feature; label occurrences; identify acoustic correlates; program feature detector (in C, alas); good match? yes / no)

Since some features are pervasive, hence un-informative, listen casually first, to get familiar with the pervasive patterns.

SLIDE 12

5. Feature Combination

(Timeline figure: pitch downdash; no flat pitch region; pause; substantial speech; back-channel; time axis marked at 200 ms and 1000 ms)

feature combination is tricky, since features not always synchronized

SLIDE 13

6. Hypothesis Refinement

  • refining the qualitative description (by listening and looking)
  • refining the quantitative description (by programming and debugging)
  • plus evaluation against corpus

(Figure labels: ideas; missed predictions and false alarms)

(Examples: a back-channel cue in Spanish; a false alarm)

SLIDE 14

7. Hypothesis Tuning

hill-climbing suffices (iff the previous steps were done well)

Resulting rule:

If

  • an utterance has lasted at least 1.2 seconds, and
  • contains a pitch downdash
  • lasting at least 40 milliseconds, with
  • a pitch drop of at least 0.7% every 10 ms

... then

  • predict a back-channel in response, 300 ms later
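
The rule on this slide can be sketched as a causal detector over a pitch track. This is a minimal illustration, not the authors' implementation. It assumes one pitch value every 10 ms, unvoiced frames stored as 0, an utterance that starts at the first frame, and that the rule fires at the moment a downdash reaches the minimum length, at most once per downdash. The function name and parameter names are hypothetical; the default thresholds are the ones given in the rule.

```python
FRAME_MS = 10  # assumption: one pitch value every 10 ms


def predict_back_channels(pitch_hz, min_utterance_ms=1200, min_downdash_ms=40,
                          min_drop_per_frame=0.007, delay_ms=300):
    """Return predicted back-channel times (ms from utterance start).

    Frames are visited left to right, so each decision uses only past
    information (no look-ahead)."""
    predictions = []
    run = 0        # consecutive frame-to-frame drops that are steep enough
    fired = False  # whether the current downdash already gave a prediction
    for i in range(1, len(pitch_hz)):
        prev, cur = pitch_hz[i - 1], pitch_hz[i]
        if prev > 0 and cur > 0 and cur <= prev * (1 - min_drop_per_frame):
            run += 1
        else:
            run, fired = 0, False
        elapsed_ms = (i + 1) * FRAME_MS  # time at the end of frame i
        if (not fired and run * FRAME_MS >= min_downdash_ms
                and elapsed_ms >= min_utterance_ms):
            predictions.append(elapsed_ms + delay_ms)
            fired = True
    return predictions


# Usage: 1.5 s of flat pitch, then five frames each dropping by 1%.
track = [120.0] * 150
for _ in range(5):
    track.append(track[-1] * 0.99)
track += [track[-1]] * 20
print(predict_back_channels(track))  # one prediction, 300 ms after the cue
```

Because the thresholds are plain keyword arguments, a hill-climbing tuner could adjust them one at a time and re-score the predictions against the corpus.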

SLIDE 15

8. Evaluation

  • by native-speaker acclamation
  • by interacting with it
  • by correspondence to the corpus (51% coverage, 16% accuracy)
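
The slides do not define coverage and accuracy. A common reading, assumed here, is that coverage is the share of actual back-channels matched by some prediction and accuracy is the share of predictions that match some actual back-channel. The sketch below follows that reading; the function name and the 500 ms matching tolerance are hypothetical choices for illustration, not values from the source.

```python
def coverage_and_accuracy(predicted_ms, actual_ms, tolerance_ms=500):
    """Score predicted back-channel times against corpus times.

    Assumed definitions (not given in the slides):
      coverage = actual back-channels with a prediction nearby / all actual
      accuracy = predictions with an actual back-channel nearby / all predictions
    """
    def near(t, others):
        # True if any time in `others` lies within the tolerance of t.
        return any(abs(t - o) <= tolerance_ms for o in others)

    covered = sum(near(a, predicted_ms) for a in actual_ms)
    correct = sum(near(p, actual_ms) for p in predicted_ms)
    coverage = covered / len(actual_ms) if actual_ms else 0.0
    accuracy = correct / len(predicted_ms) if predicted_ms else 0.0
    return coverage, accuracy


# Usage: four predictions, two actual back-channels, one match.
print(coverage_and_accuracy([1000, 3000, 5000, 9000], [1200, 7000]))
```

Under these definitions a rule that predicts often can reach high coverage with low accuracy, which is one way to read a result like 51% coverage with 16% accuracy.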

SLIDE 16

Summary

An integrated answer (qualitative + quantitative)

  • achievable
  • costly (~$90,000)
SLIDE 17

What Next?

  • need more usable tools
  • need more feature-rich tools

The Prosody of Turn-Taking

  • classic linguistic methods
  • perception of synthesized stimuli
  • conversation analysis
  • acoustic measurement and hypothesis testing
  • machine learning
  • dialog system user studies

SLIDE 18
SLIDE 19

An Integrated Method

Eight steps to discovery of a prosodic cue

  1. Project aims
  2. Problem formulation
  3. Corpus preparation
  4. Feature discovery
  5. Feature combination
  6. Hypothesis refinement
  7. Tuning
  8. Evaluation
SLIDE 20

Fostering Progress

  • let’s build tools!
  • let’s look at the same elephants!

SLIDE 21

Why Engineers Should Care

  • Spontaneous speech is different, in ways that affect recognition (Shriberg 2005)
  • Dialog systems are pervasive but unnatural and disliked
  • Intrinsic scientific interest
  • Language teaching applications
SLIDE 22

3. Corpus Preparation

Corpus size is a Goldilocks question

50 hours, 20,000 tokens: this corpus is too big

  • labeling too expensive
  • can’t listen to the data

5 minutes, 25 tokens: this corpus is too small

  • results not general
  • can analyze too deeply

80 minutes, 400 tokens: this corpus is just right (for us)

  • can find a good dialog
  • can evaluate properly

SLIDE 23

Applications

Making Machines more like People

  • acknowledgements in tutorial systems
  • adapting pace in information-delivery systems
  • noticing user reactions in persuasive systems

Making People more like People

  • learning to show you’re listening ... actively