SLIDE 1 Algorithms for Dysfluency Detection in Symbolic Sequences using Suffix Arrays
Juraj Pálfy1,2
Jiří Pospíchal1
1Slovak University of Technology, Faculty of Informatics and Information
Technologies, Bratislava, Slovakia
2Slovak Academy of Sciences, Institute of Informatics, Bratislava, Slovakia
Text, Speech and Dialogue, September 3, 2013
SLIDE 2
Overview
◮ Introduction to Dysfluencies
◮ Motivation in Dysfluent Speech Recognition
◮ Common Approach & Problem with “Complex” Dysfluencies
◮ Methodology
◮ Results
◮ Conclusion
SLIDE 3 Introduction to Dysfluencies
◮ Dysfluencies are disruptions or breaks in the smooth flow of speech. (Shipley & McAfee, 1998)
◮ Unlike read speech, spontaneous speech contains high rates of disfluencies. (Shriberg, 1994)
SLIDE 4 Understanding Different Types of Disfluencies
"Normal" Disfluencies
◮ Hesitations (pauses)
◮ Interjections (um, uh, er)
◮ Revisions ("I want- I need that")
◮ Repetitions of phrases ("I want- I want that")
◮ Repetitions of multisyllabic whole words ("mommy- mommy- mommy let's go.")
◮ Repetitions of monosyllabic whole words ("I-I-I want to go.")
"Stuttered" Disfluencies
◮ Repetitions of sounds or syllables ("li-li-like this")
◮ Prolongations ("llllllike this")
◮ Blocks ("l---ike this")
Along the continuum from "normal" toward "stuttered": disfluencies occur more frequently, their duration (length) increases, tension or struggle increases, reactions to disfluencies increase, and tension appears during "normal" disfluencies.
NOTE: "Normal" disfluencies can be used to avoid or postpone stuttering (e.g., "I um, you know, uh, I want to um, g-g-g-o with you.")
From Yaruss & Reardon (2006), Young Children Who Stutter: Information and Support for Parents. New York: National Stuttering Association (NSA).
SLIDE 5 Motivation in Dysfluent Speech Recognition
Dysfluent speech recognition is useful for:
◮ Speech Language Pathology (SLP) - an automatic analysis tool
◮ Automatic Speech Recognition (ASR) - improving accuracy, e.g. via a dedicated dysfluency module
SLIDE 6
Problem with Dysfluencies
◮ the statistical distribution of atomic parts of speech is used to build an Automatic Speech Recognition (ASR) system
◮ the sparse regularity of dysfluencies complicates ASR design (e.g. with Hidden Markov Models (HMM))
◮ ASR complexity: every state transition that can occur during dysfluent events must be defined
SLIDE 7
Conventions
In our work we used the following conventions:
◮ "simple" dysfluencies - e.g. part-word/syllable repetitions (R1), prolongations (P); already studied in many works
  e.g. P: rrrun, R1: re re research
◮ "complex" dysfluencies - a chaotic mixture of dysfluent events (e.g. repetition of a phrase, prolongation combined with hesitation & repetition); frequent in stutterers' speech
  e.g. I do my, I do my work; j j j jer j j jer ja just
SLIDE 8
Common Approach & Problem with “Complex” Dysfluencies
common approach
◮ fix a window (e.g. 200 - 800 ms)
◮ build a dysfluency recognition system (e.g. Artificial Neural Networks, Support Vector Machines)
◮ recognize the "simple" dysfluent events in a fixed interval
problem
◮ dysfluencies frequently do not fit the fixed window, but are dynamically distributed throughout much longer 2 - 4 s intervals
◮ how to choose the right window size for "complex" dysfluencies?
SLIDE 9 Our Methodology
[Figure: our approach at the intersection of Bioinformatics, Data Mining (time series), and Speech Language Pathology, applied to dysfluency sequence analysis]
◮ our solution: combine & apply methods from other fields of science
◮ SLP - domain knowledge about dysfluencies
◮ Data Mining - mining time series, Symbolic Aggregate Approximation (SAX)
◮ Bioinformatics - sequence (DNA) analysis, Suffix Arrays
SLIDE 10
Methodology: Corpus
◮ University College London Archive of Stuttered Speech
(UCLASS)
◮ Howell, Huckvale, 2004: ~500 recordings, 16 - 44.1 kHz, 2 - 15 min playing time, ages 8 - 47 years, male / female
◮ Howell, Davis, Bartrip, 2009: 12 selected recordings, a working set from UCLASS
◮ we annotated & used a subset of this working set: 22.05 kHz, 19:32 min playing time
SLIDE 11 Methodology: Feature Extraction PAA, SAX
◮ speech, 22.05 kHz
◮ short-time energy, X = x_1, …, x_n
◮ Piecewise Aggregate Approx. reduces X to
  X̄ = x̄_1, …, x̄_N   (1)
  x̄_i = (N/n) · Σ_{j=(n/N)(i−1)+1}^{(n/N)·i} x_j   (2)
◮ Symbolic Aggregate Approx. uses breakpoints
  B = β_1, …, β_{a−1}   (3)
  and produces a word
  W = w_1, …, w_m   (4)
◮ map X̄ → W by
  w_i = α_j  iff  β_{j−1} < x̄_i ≤ β_j   (5)
[Figure: speech waveform, short-time energy, and the resulting SAX symbol sequence. Lexical content: "c can c c can"]
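As a sketch, the PAA reduction (Eqs. 1-2) and the SAX symbol mapping (Eq. 5) fit in a few lines of Python. This assumes a 4-symbol alphabet with the standard Gaussian breakpoints and that N divides the signal length; the input array is a toy stand-in for the short-time energy contour:

```python
import numpy as np

# Gaussian breakpoints beta_1..beta_3 for a 4-symbol alphabet
# (standard SAX lookup-table values).
BREAKPOINTS = np.array([-0.6745, 0.0, 0.6745])

def sax(x, N, alphabet="abcd"):
    """Z-normalize x, reduce it to N segment means (PAA, Eqs. 1-2),
    then map each mean to a symbol via the breakpoints (Eq. 5)."""
    x = np.asarray(x, dtype=float)
    x = (x - x.mean()) / x.std()                      # z-normalization
    x_bar = x.reshape(N, len(x) // N).mean(axis=1)    # PAA (assumes N | n)
    # w_i = alphabet[j] such that beta_{j-1} < x_bar_i <= beta_j
    return "".join(alphabet[j] for j in np.searchsorted(BREAKPOINTS, x_bar))
```

For example, a low-then-high energy contour maps to a low symbol followed by a high one: `sax([0, 0, 0, 0, 1, 1, 1, 1], 2)` yields `"ad"`.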
SLIDE 12 Methodology: Data Structure, Suffix Arrays
C = p r o c e s s i n g $   (i = 1 … 11)

 k   Pos[k]   C[Pos[k] … n]
 1     11     $
 2      4     cessing$
 3      5     essing$
 4     10     g$
 5      8     ing$
 6      9     ng$
 7      3     ocessing$
 8      1     processing$
 9      2     rocessing$
10      7     sing$
11      6     ssing$

◮ large sequence C = c_0 c_1 … c_{N−1}
◮ suffix of C: C_i = c_i c_{i+1} … c_{N−1}
◮ lexicographically sorted array Pos; Pos[k] is the k-th smallest suffix in the set C_0, C_1, …, C_{N−1}
◮ assuming Pos is given, then C_Pos[0] < C_Pos[1] < ⋯ < C_Pos[N−1], where '<' denotes the lexicographic order
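A minimal construction of Pos for the slide's example, using naive sorting rather than the Manber-Myers algorithm (note the code is 0-indexed, while the slide's table is 1-indexed, so each entry is shifted by one):

```python
def suffix_array(s):
    """Pos for string s: start indices of its suffixes in lexicographic
    order. Naive O(n^2 log n) construction for illustration; Manber & Myers
    (1993) give an O(n log n) method that avoids materializing suffixes."""
    return sorted(range(len(s)), key=lambda i: s[i:])

pos = suffix_array("processing$")
print(pos)  # [10, 3, 4, 9, 7, 8, 2, 0, 1, 6, 5]
```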
SLIDE 13 Methodology: Our Derived Functions
◮ prolongations are characterized by a minimal difference between n neighboring frames
◮ functions from video segmentation were adapted for speech:
  x = x_1, …, x_N,  y = y_1, …, y_N   (6)
  D(x, y) = (1/N) Σ_{i=1}^{N} |x_i − y_i|   (7)
  D_b(x) = Σ_{i=1}^{b} D(x_i, x_{i+l})   (8)
  D_h(H_x) = Σ_{i=1}^{h} D(H_x(i), H_x(i+l))   (9)
[Figure: speech waveform, wideband spectrogram, and the prolongation detection functions D_b, D_h, D_g over a 4.5 s interval]
Lexical content: “personal s:eedee player”
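The distance functions can be sketched as follows (the summation bounds of Eq. 8 are garbled on the slide, so summing over the first b frame pairs below is an assumption); a prolonged segment yields near-zero values because consecutive frames barely change:

```python
import numpy as np

def D(x, y):
    """Mean absolute difference between two equal-length frames (Eq. 7)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.abs(x - y).mean()

def D_b(frames, b, l=1):
    """Accumulated distance between frames l steps apart (Eq. 8 as
    reconstructed here: a plain sum over the first b frame pairs).
    Low values flag candidate prolongations."""
    return sum(D(frames[i], frames[i + l]) for i in range(b))
```

For instance, `D_b([[1, 1]] * 4, b=2)` is `0.0` for a perfectly static (prolonged) run of frames.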
SLIDE 14
Methodology: Our Developed Algorithms
◮ Alg. 1 - for speech pattern searching
◮ Alg. 2 - for searching repeated patterns (repetitions) in speech
◮ P is a short sequence, C is a long sequence, s is a shift, l is the length of C

1: while i < n do                                  ⊲ Begin: Alg. 2
2:   In the i-th window, set the 1st block to P and put the remaining blocks into C
3:   Compute Pos for P                             ⊲ Pos is a suffix array
4:   From Pos, construct Tab for P                 ⊲ Tab is a look-up table
5:   while s < l do                                ⊲ Begin: Alg. 1
6:     Use Tab to query C in P
7:     Save the pattern's position and length
8:   end while                                     ⊲ End: Alg. 1
9: end while                                       ⊲ End: Alg. 2
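A toy sketch of the Alg. 1 inner loop, under the assumption that the suffix-array query amounts to a binary search over P's sorted suffixes (here whole suffix strings stand in for Pos and Tab, which a real implementation would keep as index structures):

```python
from bisect import bisect_left

def sorted_suffixes(P):
    """Demo stand-in for (Pos, Tab): the sorted suffixes of the short
    block P, materialized as strings for simplicity."""
    return sorted(P[i:] for i in range(len(P)))

def longest_match(sufs, q):
    """Length of the longest prefix of q occurring somewhere in P, found
    by binary search over P's sorted suffixes."""
    lo = bisect_left(sufs, q)
    best = 0
    for j in (lo - 1, lo):            # nearest lexicographic neighbours
        if 0 <= j < len(sufs):
            s, k = sufs[j], 0
            while k < min(len(s), len(q)) and s[k] == q[k]:
                k += 1
            best = max(best, k)
    return best

def find_repeats(P, C, min_len=2):
    """Slide a shift s along the long sequence C and save the position and
    length of every pattern of P that recurs in C (Alg. 1, lines 5-8)."""
    sufs = sorted_suffixes(P)
    return [(s, m) for s in range(len(C))
            if (m := longest_match(sufs, C[s:])) >= min_len]
```

For example, `find_repeats("can", "xcany")` returns `[(1, 3), (2, 2)]`: the whole pattern recurs at shift 1 and its suffix "an" at shift 2.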
SLIDE 15 Methodology: Our Features for “Complex” Dysfluencies
For every 5 s long interval, 3 features of 100 ms blocks were computed:
◮ patterns' average redundancy
◮ patterns' relative frequency
◮ patterns' redundancies sum

[Figure: iterative output of the algorithms; blocks 1-10 grouped into window 1 and window 2, evaluated by rows and by columns]
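Since the slide does not define the three features precisely, the following is only a hypothetical reading, in which a pattern's "redundancy" is its number of repeat occurrences beyond the first within the interval:

```python
from collections import Counter

def interval_features(patterns, n_blocks=50):
    """Three features of one 5 s interval of 100 ms blocks (5 s / 100 ms
    = 50 blocks). `patterns` is the list of symbolic patterns the search
    algorithms found in that interval."""
    counts = Counter(patterns)
    redundancies = [c - 1 for c in counts.values()]   # extra occurrences
    avg_redundancy = sum(redundancies) / max(len(counts), 1)
    rel_frequency = len(patterns) / n_blocks
    redundancy_sum = sum(redundancies)
    return avg_redundancy, rel_frequency, redundancy_sum
```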
SLIDE 16 Methodology: Main Steps in Running Algorithms
◮ Alg. 1-2 based on SAX - Symrep
◮ in relational DBs, a short query is executed over a large set of data
◮ Alg. 1 - opposite to relational DBs: a long sequence C is queried in a short sequence P
◮ Alg. 2 - adapts to an unknown repeated speech pattern length
◮ DTW based on MFCC - Specrep
SLIDE 17 Results: Statistical Analysis
process of classifier design:
◮ measurement of data class separability - correlation
◮ study of data characteristics
compared features:
◮ Specrep - DTW on the basis of MFCC features
◮ Symrep - our developed algorithms on the basis of SAX
◮ r - correlation coefficients
◮ h - accepted hypotheses (p-values < 0.05)
SLIDE 18
Results: Objective Assessment
◮ SVMs to perform objective assessment of MFCC, Specrep, Symrep
◮ training (80 %) & testing (20 %) sets
◮ we trained individual SVMs with a sigmoidal kernel function
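A sketch of this evaluation recipe with scikit-learn; the sigmoid kernel and the 80/20 split follow the slide, while the feature vectors and fluent-vs-dysfluent labels below are synthetic placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-ins for per-interval feature vectors (MFCC / Specrep /
# Symrep) and their labels; only the evaluation recipe matters here.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 80 % training / 20 % testing, as on the slide
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)
clf = SVC(kernel="sigmoid", gamma="scale").fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"test accuracy: {acc:.2f}")
```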
SLIDE 19
Conclusion
◮ derived functions for prolongation detection
◮ developed algorithms Alg. 1-2 for detection of "complex" dysfluencies
◮ newly designed features - statistically analyzed
◮ objective assessment of the new features & MFCC by SVM (47.4 %)
◮ symbolic sequences are competitive with the spectral domain
SLIDE 20
Bibliography 1/2
Camastra, F. and Vinciarelli, A., Machine Learning for Audio, Image and Video Analysis: Theory and Applications. Springer-Verlag London Limited, 2008.
Hamel, L., Knowledge Discovery with Support Vector Machines. John Wiley & Sons, Inc., Hoboken, NJ, USA, July 2009.
Howell, P., Davis, S., Bartrip, J., The UCLASS archive of stuttered speech. Journal of Speech, Language, and Hearing Research, 52, pp. 556-569, 2009.
Keogh, E., Chakrabarti, K., Pazzani, M., and Mehrotra, S., Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Knowledge and Information Systems 3, pp. 263-286, 2001.
SLIDE 21
Bibliography 2/2
Lin, J., Keogh, E., Lonardi, S., Patel, P., Finding motifs in time series. ACM Special Interest Group on Knowledge Discovery and Data Mining, 2002.
Liu, Y., Shriberg, E., Stolcke, A., Harper, M., Comparing HMM, Maximum Entropy, and Conditional Random Fields for Disfluency Detection. In Proc. of the Eu. Conf. on Speech Comm. and Tech., 2005.
Lustyk, T., Bergl, P., Čmejla, R., and Vokřál, J., Change evaluation of Bayesian detector for dysfluent speech assessment. In International Conference on Applied Electronics 2011, Pilsen, Czech Republic, pp. 231-234, 2011.
Manber, U. and Myers, G., Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5), pp. 935-948, 1993.
SLIDE 22
Thank you for your attention.
Questions?
Contact
e-mail: juraj.palfy@savba.sk
home page: sites.google.com/site/georgepalfy/