SLIDE 1 Algorithms for Dysfluency Detection in Symbolic Sequences using Suffix Arrays
Juraj Pálfy1,2
Jiří Pospíchal1
1Slovak University of Technology, Faculty of Informatics and Information
Technologies, Bratislava, Slovakia
2Slovak Academy of Sciences, Institute of Informatics, Bratislava, Slovakia
Text, Speech and Dialogue, September 3, 2013
SLIDE 2
Overview
◮ Introduction to Dysfluencies
◮ Motivation in Dysfluent Speech Recognition
◮ Common Approach & Problem with “Complex” Dysfluencies
◮ Methodology
◮ Results
◮ Conclusion
SLIDE 3 Introduction to Dysfluencies
◮ Dysfluencies are disruptions or breaks in the smooth flow of speech. (Shipley & McAfee, 1998)
◮ Unlike read speech, spontaneous speech contains high rates of disfluencies. (Shriberg, 1994)
SLIDE 4 Understanding Different Types of Disfluencies
"Normal" Disfluencies
◮ Hesitations (pauses)
◮ Interjections (um, uh, er)
◮ Revisions ("I want- I need that")
◮ Repetitions of phrases ("I want- I want that")
◮ Repetitions of multisyllabic whole words ("mommy- mommy- mommy let's go.")
◮ Repetitions of monosyllabic whole words ("I-I-I want to go.")
"Stuttered" Disfluencies
◮ Repetitions of sounds or syllables ("li-li-like this")
◮ Prolongations ("llllllike this")
◮ Blocks ("l---ike this")
Along the continuum from "normal" toward "stuttered": disfluencies occur more frequently, their duration (length) increases, tension or struggle increases, reactions to disfluencies increase, and tension appears during "normal" disfluencies.
NOTE: "Normal" disfluencies can be used to avoid or postpone stuttering (e.g., "I um, you know, uh, I want to um, g-g-g-o with you.")
From Yaruss & Reardon (2006), Young Children Who Stutter: Information and Support for Parents. New York: National Stuttering Association (NSA).
SLIDE 5 Motivation in Dysfluent Speech Recognition
Dysfluent speech recognition is useful for:
◮ Speech Language Pathology (SLP) - an automatic analysis tool
◮ Automatic Speech Recognition (ASR) - improving accuracy, e.g. via a dedicated dysfluency module
SLIDE 6
Problem with Dysfluencies
◮ the statistical distribution of atomic parts of speech is used to build an Automatic Speech Recognition (ASR) system
◮ the sparse regularity of dysfluencies complicates ASR design (e.g. with Hidden Markov Models (HMM))
◮ ASR complexity: every state transition that can occur during dysfluent events must be defined
SLIDE 7
Conventions
In our work we used the following conventions:
◮ "simple" dysfluencies - e.g. part-word/syllable repetitions (R1), prolongations (P); already studied in many works
  e.g. P: rrrun, R1: re re research
◮ "complex" dysfluencies - a chaotic mixture of dysfluent events (e.g. repetition of a phrase, prolongation combined with hesitation & repetition); frequent in stutterers' speech
  e.g. I do my, I do my work; j j j jer j j jer ja just
SLIDE 8
Common Approach & Problem with “Complex” Dysfluencies
common approach
◮ fix a window (e.g. 200 - 800 ms)
◮ build a dysfluency recognition system (e.g. Artificial Neural Networks, Support Vector Machines)
◮ recognize the "simple" dysfluent events in a fixed interval
problem
◮ dysfluencies frequently do not fit the fixed window, but are dynamically distributed throughout much longer 2 - 4 s intervals
◮ how to choose the right window size for "complex" dysfluencies?
SLIDE 9 Our Methodology
[Figure: our approach at the intersection of Bioinformatics, Data Mining (time series), and Speech Language Pathology, applied to dysfluency sequence analysis]
◮ our solution: combine & apply methods from other fields of science
◮ SLP - domain knowledge about dysfluencies
◮ Data Mining - mining time series, Symbolic Aggregate Approximation (SAX)
◮ Bioinformatics - sequence (DNA) analysis, Suffix Arrays
SLIDE 10
Methodology: Corpus
◮ University College London Archive of Stuttered Speech
(UCLASS)
◮ Howell, Huckvale, 2004: ~500 recordings, 16 - 44.1 kHz, 2 - 15 min playing time, ages 8 - 47 years, male / female
◮ Howell, Davis, Bartrip, 2009: 12 selected recordings, a working set from UCLASS
◮ we annotated & used a subset of this working set: 22.05 kHz, 19:32 min playing time
SLIDE 11 Methodology: Feature Extraction PAA, SAX
◮ speech, 22.05 kHz
◮ short-time energy, X = x_1, …, x_n
◮ Piecewise Aggregate Approx. reduces X to
  X̄ = x̄_1, …, x̄_N   (1)
  x̄_i = (N/n) · Σ_{j=(n/N)(i−1)+1}^{(n/N)·i} x_j   (2)
◮ Symbolic Aggregate Approx. uses breakpoints
  B = β_1, …, β_{a−1}   (3)
  and produces a word
  W = w_1, …, w_m   (4)
◮ map X̄ → W by
  w_i = α_j  iff  β_{j−1} < x̄_i ≤ β_j   (5)
[Figure: speech waveform, short-time energy, and the resulting SAX symbol sequence. Lexical content: "c can c c can"]
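As a sketch, the PAA reduction (Eqs. 1-2) and the SAX symbol mapping (Eq. 5) fit in a few lines of Python. This assumes a 4-symbol alphabet with the standard Gaussian breakpoints and that N divides the signal length; the input array is a toy stand-in for the short-time energy contour:

```python
import numpy as np

# Gaussian breakpoints beta_1..beta_3 for a 4-symbol alphabet
# (standard SAX lookup-table values).
BREAKPOINTS = np.array([-0.6745, 0.0, 0.6745])

def sax(x, N, alphabet="abcd"):
    """Z-normalize x, reduce it to N segment means (PAA, Eqs. 1-2),
    then map each mean to a symbol via the breakpoints (Eq. 5)."""
    x = np.asarray(x, dtype=float)
    x = (x - x.mean()) / x.std()                      # z-normalization
    x_bar = x.reshape(N, len(x) // N).mean(axis=1)    # PAA (assumes N | n)
    # w_i = alphabet[j] such that beta_{j-1} < x_bar_i <= beta_j
    return "".join(alphabet[j] for j in np.searchsorted(BREAKPOINTS, x_bar))
```

For example, a low-then-high energy contour maps to a low symbol followed by a high one: `sax([0, 0, 0, 0, 1, 1, 1, 1], 2)` yields `"ad"`.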
SLIDE 12 Methodology: Data Structure, Suffix Arrays
C = p r o c e s s i n g $   (i = 1 … 11)

 k   Pos[k]   C[Pos[k] … n]
 1     11     $
 2      4     cessing$
 3      5     essing$
 4     10     g$
 5      8     ing$
 6      9     ng$
 7      3     ocessing$
 8      1     processing$
 9      2     rocessing$
10      7     sing$
11      6     ssing$

◮ large sequence C = c_0 c_1 … c_{N−1}
◮ suffix of C: C_i = c_i c_{i+1} … c_{N−1}
◮ lexicographically sorted array Pos; Pos[k] is the k-th smallest suffix in the set C_0, C_1, …, C_{N−1}
◮ assuming Pos is given, then C_Pos[0] < C_Pos[1] < ⋯ < C_Pos[N−1], where '<' denotes the lexicographic order
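A minimal construction of Pos for the slide's example, using naive sorting rather than the Manber-Myers algorithm (note the code is 0-indexed, while the slide's table is 1-indexed, so each entry is shifted by one):

```python
def suffix_array(s):
    """Pos for string s: start indices of its suffixes in lexicographic
    order. Naive O(n^2 log n) construction for illustration; Manber & Myers
    (1993) give an O(n log n) method that avoids materializing suffixes."""
    return sorted(range(len(s)), key=lambda i: s[i:])

pos = suffix_array("processing$")
print(pos)  # [10, 3, 4, 9, 7, 8, 2, 0, 1, 6, 5]
```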
SLIDE 13 Methodology: Our Derived Functions
◮ prolongations are characterized by a minimal difference between n neighboring frames
◮ functions from video segmentation were adapted for speech:
  x = x_1, …, x_N,  y = y_1, …, y_N   (6)
  D(x, y) = (1/N) Σ_{i=1}^{N} |x_i − y_i|   (7)
  D_b(x) = Σ_{i=1}^{b} D(x_i, x_{i+l})   (8)
  D_h(H_x) = Σ_{i=1}^{h} D(H_x(i), H_x(i+l))   (9)
[Figure: speech waveform, wideband spectrogram, and the prolongation detection functions D_b, D_h, D_g over a 4.5 s interval]
Lexical content: “personal s:eedee player”
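The distance functions can be sketched as follows (the summation bounds of Eq. 8 are garbled on the slide, so summing over the first b frame pairs below is an assumption); a prolonged segment yields near-zero values because consecutive frames barely change:

```python
import numpy as np

def D(x, y):
    """Mean absolute difference between two equal-length frames (Eq. 7)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.abs(x - y).mean()

def D_b(frames, b, l=1):
    """Accumulated distance between frames l steps apart (Eq. 8 as
    reconstructed here: a plain sum over the first b frame pairs).
    Low values flag candidate prolongations."""
    return sum(D(frames[i], frames[i + l]) for i in range(b))
```

For instance, `D_b([[1, 1]] * 4, b=2)` is `0.0` for a perfectly static (prolonged) run of frames.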
SLIDE 14
Methodology: Our Developed Algorithms
◮ Alg. 1 - for speech pattern searching
◮ Alg. 2 - for searching repeated patterns (repetitions) in speech
◮ P is a short sequence, C is a long sequence, s is a shift, l is the length of C

1: while i < n do                                  ⊲ Begin: Alg. 2
2:   In the i-th window, set the 1st block to P and put the remaining blocks into C
3:   Compute Pos for P                             ⊲ Pos is a suffix array
4:   From Pos, construct Tab for P                 ⊲ Tab is a look-up table
5:   while s < l do                                ⊲ Begin: Alg. 1
6:     Use Tab to query C in P
7:     Save the pattern's position and length
8:   end while                                     ⊲ End: Alg. 1
9: end while                                       ⊲ End: Alg. 2
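A toy sketch of the Alg. 1 inner loop, under the assumption that the suffix-array query amounts to a binary search over P's sorted suffixes (here whole suffix strings stand in for Pos and Tab, which a real implementation would keep as index structures):

```python
from bisect import bisect_left

def sorted_suffixes(P):
    """Demo stand-in for (Pos, Tab): the sorted suffixes of the short
    block P, materialized as strings for simplicity."""
    return sorted(P[i:] for i in range(len(P)))

def longest_match(sufs, q):
    """Length of the longest prefix of q occurring somewhere in P, found
    by binary search over P's sorted suffixes."""
    lo = bisect_left(sufs, q)
    best = 0
    for j in (lo - 1, lo):            # nearest lexicographic neighbours
        if 0 <= j < len(sufs):
            s, k = sufs[j], 0
            while k < min(len(s), len(q)) and s[k] == q[k]:
                k += 1
            best = max(best, k)
    return best

def find_repeats(P, C, min_len=2):
    """Slide a shift s along the long sequence C and save the position and
    length of every pattern of P that recurs in C (Alg. 1, lines 5-8)."""
    sufs = sorted_suffixes(P)
    return [(s, m) for s in range(len(C))
            if (m := longest_match(sufs, C[s:])) >= min_len]
```

For example, `find_repeats("can", "xcany")` returns `[(1, 3), (2, 2)]`: the whole pattern recurs at shift 1 and its suffix "an" at shift 2.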
SLIDE 15 Methodology: Our Features for “Complex” Dysfluencies
For every 5 s long interval, 3 features of 100 ms blocks were computed:
◮ patterns' average redundancy
◮ patterns' relative frequency
◮ patterns' redundancies sum

[Figure: iterative output of the algorithms; blocks 1-10 grouped into window 1 and window 2, evaluated by rows and by columns]
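Since the slide does not define the three features precisely, the following is only a hypothetical reading, in which a pattern's "redundancy" is its number of repeat occurrences beyond the first within the interval:

```python
from collections import Counter

def interval_features(patterns, n_blocks=50):
    """Three features of one 5 s interval of 100 ms blocks (5 s / 100 ms
    = 50 blocks). `patterns` is the list of symbolic patterns the search
    algorithms found in that interval."""
    counts = Counter(patterns)
    redundancies = [c - 1 for c in counts.values()]   # extra occurrences
    avg_redundancy = sum(redundancies) / max(len(counts), 1)
    rel_frequency = len(patterns) / n_blocks
    redundancy_sum = sum(redundancies)
    return avg_redundancy, rel_frequency, redundancy_sum
```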
SLIDE 16 Methodology: Main Steps in Running Algorithms
◮ Alg. 1-2 based on SAX - Symrep
◮ in relational DBs, a short query is executed over a large set of data
◮ Alg. 1 - opposite to relational DBs: a long sequence C is queried in a short sequence P
◮ Alg. 2 - adapts to an unknown repeated speech pattern length
◮ DTW based on MFCC - Specrep
SLIDE 17 Results: Statistical Analysis
process of classifier design:
◮ measurement of data class separability - correlation
◮ study of data characteristics
compared features:
◮ Specrep - DTW on the basis of MFCC features
◮ Symrep - our developed algorithms on the basis of SAX
◮ r - correlation coefficients
◮ h - accepted hypotheses (p-values < 0.05)
SLIDE 18
Results: Objective Assessment
◮ SVMs to perform objective assessment of MFCC, Specrep, Symrep
◮ training (80 %) & testing (20 %) sets
◮ we trained individual SVMs with a sigmoidal kernel function
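A sketch of this evaluation recipe with scikit-learn; the sigmoid kernel and the 80/20 split follow the slide, while the feature vectors and fluent-vs-dysfluent labels below are synthetic placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-ins for per-interval feature vectors (MFCC / Specrep /
# Symrep) and their labels; only the evaluation recipe matters here.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 80 % training / 20 % testing, as on the slide
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)
clf = SVC(kernel="sigmoid", gamma="scale").fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"test accuracy: {acc:.2f}")
```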
SLIDE 19
Conclusion
◮ derived functions for prolongation detection
◮ developed algorithms Alg. 1-2 for detection of "complex" dysfluencies
◮ newly designed features - statistically analyzed
◮ objective assessment of the new features & MFCC by SVM (47.4 %)
◮ symbolic sequences are competitive with the spectral domain
SLIDE 20
Bibliography 1/2
Camastra, F. and Vinciarelli, A., Machine Learning for Audio, Image and Video Analysis: Theory and Applications. Springer-Verlag London Limited, 2008.
Hamel, L., Knowledge Discovery with Support Vector Machines. John Wiley & Sons, Inc., Hoboken, NJ, USA, July 2009.
Howell, P., Davis, S., Bartrip, J., The UCLASS archive of stuttered speech. Journal of Speech, Language, and Hearing Research, 52, pp. 556-569, 2009.
Keogh, E., Chakrabarti, K., Pazzani, M., and Mehrotra, S., Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Knowledge and Information Systems 3, pp. 263-286, 2001.
SLIDE 21
Bibliography 2/2
Lin, J., Keogh, E., Lonardi, S., Patel, P., Finding motifs in time series. ACM Special Interest Group on Knowledge Discovery and Data Mining, 2002.
Liu, Y., Shriberg, E., Stolcke, A., Harper, M., Comparing HMM, Maximum Entropy, and Conditional Random Fields for Disfluency Detection. In Proc. of the Eu. Conf. on Speech Comm. and Tech., 2005.
Lustyk, T., Bergl, P., Čmejla, R., and Vokřál, J., Change evaluation of Bayesian detector for dysfluent speech assessment. In International Conference on Applied Electronics 2011, Pilsen, Czech Republic, pp. 231-234, 2011.
Manber, U. and Myers, G., Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5), pp. 935-948, 1993.
SLIDE 22
Thank you for your attention.
Questions?
Contact
e-mail: juraj.palfy@savba.sk
home page: sites.google.com/site/georgepalfy/