SPEAKER, ENVIRONMENT AND CHANNEL CHANGE DETECTION AND CLUSTERING VIA THE BAYESIAN INFORMATION CRITERION

Scott Shaobing Chen & P.S. Gopalakrishnan
IBM T.J. Watson Research Center
email: schen@watson.ibm.com
ABSTRACT

In this paper, we are interested in detecting changes in speaker identity, environmental condition and channel condition; we call this the problem of acoustic change detection. The input audio stream can be modeled as a Gaussian process in the cepstral space. We present a maximum likelihood approach to detect turns of a Gaussian process; the decision of a turn is based on the Bayesian Information Criterion (BIC), a model selection criterion well known in the statistics literature. The BIC criterion can also be applied as a termination criterion in hierarchical methods for clustering of audio segments: two nodes can be merged only if the merging increases the BIC value. Our experiments on the Hub4 1996 and 1997 evaluation data show that our segmentation algorithm can successfully detect acoustic changes; our clustering algorithm can produce clusters with high purity, leading to improvements in accuracy through unsupervised adaptation as much as the ideal clustering by the true speaker identities.

1. INTRODUCTION

Automatic segmentation of an audio stream and automatic clustering of audio segments according to speaker identities, environmental conditions and channel conditions have received quite a bit of attention recently [4, 8, 6, 10]. For example, in the task of automatic transcription of broadcast news [3], the data contains clean speech, telephone speech, music segments, speech corrupted by music or noise, etc. There are no explicit cues for the changes in speaker identity, environmental condition and channel condition. Also, the same speaker may appear multiple times in the data. In order to transcribe the speech content in audio streams of this nature,

• we would like to segment the audio stream into homogeneous regions according to speaker identity, environmental condition and channel condition, so that regions of different nature can be handled differently: for example, regions of pure music and noise can be rejected; also, one might design a separate recognition system for telephone speech.

• we would like to cluster speech segments into homogeneous clusters according to speaker identity, environment and channel; unsupervised adaptation can then be performed on each cluster. [8, 10] showed that a good clustering procedure can greatly improve the performance of unsupervised adaptation such as MLLR.

Various segmentation algorithms have been proposed in the literature [2, 4, 6, 8, 10, 14], which can be categorized as follows:

• Decoder-guided segmentation. The input audio stream can first be decoded; then the desired segments can be produced by cutting the input at the silence locations generated from the decoder [14, 8]. Other information from the decoder, such as the gender information, could also be utilized in the segmentation [8].

• Model-based segmentation. [2] proposed to build different models, e.g. Gaussian mixture models, for a fixed set of acoustic classes, such as telephone speech, pure music, etc., from a training corpus; the incoming audio stream can be classified by maximum likelihood selection over a sliding window; segmentation can be made at the locations where there is a change in the acoustic class.

• Metric-based segmentation. [4, 6, 10] proposed to segment the audio stream at maxima of the distances between neighboring windows placed at every sample; distances such as the KL distance and the generalized likelihood ratio distance have been investigated.

In our opinion, these methods are not very successful in detecting the acoustic changes present in the data. The decoder-guided segmentation only places boundaries at silence locations, which in general have no direct connection with the acoustic changes in the data. Both the model-based segmentation and the metric-based segmentation rely on thresholding of measurements which lack stability and robustness. Besides, the model-based segmentation does not generalize to unseen acoustic conditions.

Clustering of audio segments is often performed via hierarchical clustering [10, 8]. First, a distance matrix is computed; the common practice is to model each audio segment as one Gaussian in the cepstral space and to use the KL distance or the generalized likelihood ratio as the distance measure [6]. Then bottom-up hierarchical clustering can be performed to generate a clustering tree. It is often difficult to determine the number of clusters. One can heuristically pre-determine the number of clusters or the minimum size of each cluster; accordingly, one can go down the tree to obtain the desired clustering [14]. Another heuristic solution is to threshold the distance measures during the hierarchical process; the thresholding level is tuned on a training set [10]. Jin et al. [7] shed some light on automatically choosing a clustering solution.
This paper is organized as follows: section 2 describes model selection criteria in the statistics literature; sections 3 and 4 explain our maximum likelihood approach for acoustic change detection and our clustering algorithm based on BIC; we present our experiments on the Hub4 1996 and 1997 evaluation data; we compare our algorithms with other recent work in the literature.

2. MODEL SELECTION CRITERIA
The problem of model identification is to choose one among a set of candidate models to describe a given data set. We often have a series of candidate models with different numbers of parameters. It is evident that when the number of parameters in the model is increased, the likelihood of the training data is also increased; however, when the number of parameters is too large, this might cause the problem of overtraining. Several criteria for model selection have been introduced in the statistics literature, ranging from non-parametric methods such as cross-validation to parametric methods such as the Bayesian Information Criterion (BIC) [11].

BIC is a likelihood criterion penalized by the model complexity: the number of parameters in the model. In detail, let X = {x_i : i = 1, ..., N} be the data set we are modeling; let M = {M_i : i = 1, ..., K} be the set of candidate parametric models. Assume we maximize the likelihood function separately for each model M, obtaining, say, L(X, M). Denote #(M) as the number of parameters in the model M. The BIC criterion is defined as

    BIC(M) = log L(X, M) − λ (1/2) #(M) log N        (1)

where the penalty weight λ = 1. The BIC procedure is to choose the model for which the BIC criterion is maximized. This procedure can be derived as a large-sample version of Bayes procedures for the case of independent, identically distributed observations and linear models [11].

The BIC criterion is well known in the statistics literature; it has been widely used for model identification in statistical modeling, time series [13], linear regression [5], etc. It is commonly known in the engineering literature as the minimum description length (MDL). It has been used in the speech recognition literature, e.g. for speaker adaptation [12]. BIC is closely related to other penalized likelihood criteria such as AIC [1] and RIC [5]. One can vary the penalty weight λ in (1), although only λ = 1 corresponds to the definition of BIC.
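As a toy illustration of the BIC procedure (our sketch, not part of the original experiments; the one-dimensional data and the helpers `gauss_loglik` and `bic` are ours), consider choosing between a one-Gaussian and a two-Gaussian description of a scalar sample, using the closed-form ML log-likelihood of a fitted Gaussian, −(N/2)(log(2πσ̂²) + 1):

```python
import numpy as np

def gauss_loglik(x):
    """Maximized log-likelihood of a 1-D Gaussian fit by ML to x."""
    n = len(x)
    var = np.var(x)  # ML (biased) variance estimate
    return -0.5 * n * (np.log(2 * np.pi * var) + 1.0)

def bic(loglik, n_params, n_samples, lam=1.0):
    """Equation (1): log-likelihood penalized by model complexity."""
    return loglik - lam * 0.5 * n_params * np.log(n_samples)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(5, 1, 500)])
n = len(x)

# Candidate M1: one Gaussian for all the data (2 parameters).
bic1 = bic(gauss_loglik(x), 2, n)
# Candidate M2: one Gaussian per half (4 parameters), here with the
# true split assumed known for illustration.
bic2 = bic(gauss_loglik(x[:500]) + gauss_loglik(x[500:]), 4, n)

print(bic2 > bic1)  # True: the two-Gaussian model is selected
```

The penalty term matters only when the likelihood gain from extra parameters is small; here the gain is large, so the richer model wins despite its higher penalty.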
3. CHANGE DETECTION VIA BIC

In this section, we describe a maximum likelihood approach for acoustic change detection based on the BIC criterion. Denote x = {x_i ∈ R^d : i = 1, ..., N} as the sequence of cepstral vectors extracted from the entire audio stream; assume x is drawn from an independent multivariate Gaussian process:

    x_i ~ N(μ_i, Σ_i)

where μ_i is the mean vector and Σ_i is the full covariance matrix.
3.1. Detecting One Changing Point

We first examine a simplified problem: assume that there is at most one changing point in the Gaussian process. We are interested in the hypothesis test of a change occurring at time i:

    H0 : x_1, ..., x_N ~ N(μ, Σ)

versus

    H1 : x_1, ..., x_i ~ N(μ_1, Σ_1);   x_{i+1}, ..., x_N ~ N(μ_2, Σ_2).

The maximum likelihood ratio statistic is

    R(i) = (N/2) log|Σ| − (N_1/2) log|Σ_1| − (N_2/2) log|Σ_2|        (2)

where Σ, Σ_1 and Σ_2 are the sample covariance matrices from all the data, from {x_1, ..., x_i} and from {x_{i+1}, ..., x_N}, respectively, and N_1 = i, N_2 = N − i. Thus the maximum likelihood estimate of the changing point is

    t̂ = argmax_i R(i).

On the other hand, we can view the hypothesis test as a problem of model selection. We are comparing two models: one models the data as two Gaussians; the other models the data as just one Gaussian. The difference between the BIC values of these two models can be expressed as

    ΔBIC(i) = R(i) − λP        (3)

where the likelihood ratio R(i) is defined in (2), the penalty is P = (1/2)(d + (1/2)d(d+1)) log N, the penalty weight λ = 1, and d is the dimension of the space. Thus if (3) is positive, the model of two Gaussians is favored. We decide there is a change if

    max_i ΔBIC(i) > 0,        (4)

and it is clear that the maximum likelihood estimate of the changing point can also be expressed as

    t̂ = argmax_i ΔBIC(i).        (5)

Compared with the metric-based segmentation described in the introduction, our BIC procedure has the following advantages:
Figure 1. Detecting one changing point: (a) first cepstral dimension; (b) log likelihood distance; (c) KL2 distance; (d) BIC criterion.
• Robustness. [10, 4] proposed to measure the variation at location i as the distance between a window to the left and a window to the right; typically the window size is short, e.g. two seconds; the distance can be chosen to be the log likelihood ratio distance [6] or the KL distance. In our opinion, such measurements are often noisy and not robust, because they involve only the limited samples in two short windows. In contrast, the BIC criterion is rather robust, since it computes the variation at time i utilizing all the samples. Figure 1 shows an example which indicates the robustness of our procedure. We experimented on a speech signal of 77 seconds which contains two speakers. Panel (a) plots the first dimension of the cepstral vectors; the dotted line indicates the location of the change. One can clearly notice the changing behavior around the changing point. We computed both the log likelihood ratio distance (i.e. the Gish distance) and the KL2 distance [10] between two adjacent sliding windows of 100 frames. Panel (b) shows the log likelihood distance: it attains a local maximum at the location of the change; however, it has several maxima which do not correspond to any changing points, and it also seems rather noisy. Similarly, Panel (c) shows the KL2 distances: there is a sharp spike at the location of the change; however, there are several other spikes which do not correspond to any changing points. Panel (d) displays the BIC criterion; it clearly predicts the changing point.

• Thresholding-free. Our BIC procedure automatically performs model selection, whereas [10] is based on thresholding. As shown in Figure 1 (b) and (c), it is difficult to set a thresholding level to pick the changing points. Figure 1 (d) indicates there is a change, since the BIC value at the detected changing point is positive.

• Optimality. Our procedure is derived from the theory of maximum likelihood and model selection. It can be shown that our estimate (5) converges to the true changing point as the sample size increases.

Figure 2. The detectability of a change: BIC value against detectability in seconds.

The performance of our procedure relies heavily on the amount of data available for each of the two Gaussian models separated by the true changing point. We define the detectability of a changing point at t as

    D(t) = min(t, N − t).        (6)

In general the BIC procedure is less accurate as the detectability decreases. This can be demonstrated in the following experiment. We placed multiple windows of the same size around a speaker changing point in an audio stream, with each window corresponding to a different detectability. Within each window, the BIC procedure was performed to detect if there was a change. Figure 2 plots the BIC value against the detectability of the sampling. The BIC value starts negative, suggesting that there is only one speaker. As the detectability increases, the BIC value also increases sharply; it is well above zero for detectability greater than 2 seconds, strongly supporting the change point hypothesis.
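The decision rules (2)-(5) can be sketched as follows. This is illustrative code, not the implementation used in the experiments; the `margin` parameter, which keeps enough frames on each side for a non-singular covariance estimate, is our addition:

```python
import numpy as np

def delta_bic(x, i, lam=1.0):
    """Equations (2)-(3): BIC difference for a change at frame i.
    Positive values favor the two-Gaussian (changed) model."""
    n, d = x.shape
    logdet = lambda a: np.linalg.slogdet(np.cov(a, rowvar=False, bias=True))[1]
    r = 0.5 * (n * logdet(x) - i * logdet(x[:i]) - (n - i) * logdet(x[i:]))
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return r - lam * penalty

def detect_one_change(x, margin=10):
    """Equations (4)-(5): the changing point t-hat, or None if no change.
    margin > d keeps at least margin frames on each side so the sample
    covariances stay non-singular."""
    n = x.shape[0]
    scores = [delta_bic(x, i) for i in range(margin, n - margin)]
    best = int(np.argmax(scores))
    return best + margin if scores[best] > 0 else None

rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0, 1, (300, 4)),
               rng.normal(3, 1, (200, 4))])  # mean shift at frame 300
print(detect_one_change(x))  # an index near the true change at frame 300
```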
3.2. Detecting Multiple Changing Points

We propose the following algorithm to sequentially detect the changing points in the Gaussian process x:

    (1) initialize the interval [a, b]: a = 1, b = 2;
    (2) detect if there is one changing point in [a, b] via BIC;
    (3) if (no change in [a, b])
            let b = b + 1;
        else
            let t̂ be the changing point detected;
            set a = t̂ + 1, b = a + 1;
        end
    (4) go to (2).
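The steps above can be sketched as follows. This is an illustrative scalar (d = 1) rendering; the frame-based windows, the growth step of 50 frames and the simple inner detector are our simplifications, not the settings used in the experiments:

```python
import numpy as np

def bic_change(x):
    """Best changing point in the 1-D window x by eq. (3), or None.
    For d = 1 the penalty is (1/2)(d + d(d+1)/2) log N = log N."""
    n = len(x)
    pen = np.log(n)
    best_i, best = None, 0.0
    for i in range(5, n - 5):  # keep a few samples on both sides
        r = 0.5 * (n * np.log(np.var(x))
                   - i * np.log(np.var(x[:i]))
                   - (n - i) * np.log(np.var(x[i:])))
        if r - pen > best:
            best_i, best = i, r - pen
    return best_i

def detect_changes(x, step=50):
    """Steps (1)-(4): grow the window [a, b); restart after each change."""
    changes, a, b = [], 0, 2 * step
    while b <= len(x):
        t = bic_change(x[a:b])
        if t is None:
            b += step                  # (3) no change found: widen window
        else:
            changes.append(a + t)      # change at absolute frame a + t
            a = a + t + 1              # restart just after the change
            b = a + 2 * step
    return changes

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1, 400),
                    rng.normal(0, 4, 400),   # variance change at 400
                    rng.normal(6, 1, 400)])  # mean change at 800
print(detect_changes(x))  # frames near 400 and 800
```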
By expanding the window [a, b], the final decision on a change point is made based on as many data points as possible. In our view, this can be more robust than decisions based on the distance between two adjacent sliding windows of fixed sizes [10], though our approach is more costly.
The BIC criterion can be viewed as thresholding the log likelihood distance, with the thresholding level automatically chosen as λ(1/2)(d + (1/2)d(d+1)) log N, where N is the size of the decision window and d is the dimension of the feature space.

Again we emphasize that the accuracy of our procedure depends on the detectabilities of the true changing points. Let T = {t_i} be the true changing points; the detectability can be defined as

    D(t_i) = min(t_i − t_{i−1} + 1, t_{i+1} − t_i + 1).

When the detectability is low, the current changing point is often missed; moreover, this error contaminates the statistics for the next Gaussian model, and thus affects the detection of the next changing point.

Our algorithm has quadratic complexity; however, one can reduce the complexity dramatically by performing a crude search, without much sacrifice of resolution.
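This automatic level is easy to compute. For instance, assuming 24-dimensional features and a decision window of 1000 frames (our illustrative numbers, not a reported operating point):

```python
import math

def bic_threshold(d, n, lam=1.0):
    """Automatic thresholding level lam * (1/2)(d + d(d+1)/2) * log N."""
    return lam * 0.5 * (d + 0.5 * d * (d + 1)) * math.log(n)

print(round(bic_threshold(24, 1000), 1))  # 1119.1
```

The level grows with both the dimension d and the window size N, which is why no hand-tuned threshold is needed as the decision window expands.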
3.3. Change detection on the Hub4 1997 evaluation data

We applied our algorithm to the Hub4 1997 evaluation data, which consists of 3 hours of broadcast news programs; detection was performed using 24-dimensional Mel-cepstral vectors extracted at a 10ms frame rate. NIST provided a hand-segmentation of this data according to different categories: clean prepared speech, clean spontaneous speech, telephone-quality speech, speech with background music and speech with background noise. As commented in [4], it is very hard to come up with a standard for analyzing the errors in segmentation, since segmentation can be very subjective; even two people listening to the same speech may segment it differently. Nevertheless, we analyze the performance of our detection by comparing with the hand-segmentation provided by NIST.

We first examine whether our detected changing points were true, i.e. the Type-I errors. Among the 462 detected changes, there were 19 (4.1%) errors which happened in the middle of speaker turns. Our BIC criterion seems sensitive in pure music regions. There were 14 (3.0%) detected changes in the middle of pure music segments; we did not count them as errors, since first, one can argue that the music tune changed in those areas, and second, the pure music segments were discarded by the classifier and did not affect the recognition accuracy. There were 20 (4.3%) detected changes slightly biased from the true changes. The biases were less than 1 second, as shown in panel (a) of Figure 3. We did not count these as errors, since they came so close to the true changes. The bias might be caused by contamination of the statistics for estimating the Gaussian models by outliers, or by the statistics from the previous turn if the previous change point was missed in the detection. Usually these errors can be fixed, for example by moving to the nearest silence. It is also possible to refine the boundary by finer analysis in the detected region.

We also examine whether true changing points were missed in our detection, i.e. the Type-II errors. In the NIST segmentation, there were 620 changes. In total, 207 (33.4%) changes were missed; 154 (25.0%) errors were caused by short turns with duration less than 2 seconds.

    Type-I Error     4.1%
    Type-II Error   33.4%   (≤ 2s: 25.0%;  > 2s: 8.4%)

    Table 1. Change detection error rates.

Figure 3. Error analysis of change detection: (a) biases; (b) histogram of the detectability of all true changes; (c) histogram of the detectability of missed true changes; (d) Type-II error rates by detectability.

Examples of these short turns are sentences made up of only brief phrases such as "Good morning" and "Thank you". About 50 of these short turns contained voices from more than one speaker. They were labeled as "excluded regions" by NIST and were not included in the final scoring of the recognition system, but were included in determining the change detection accuracy. Figure 3 analyzes the Type-II errors in detail. Panel (b) shows the histogram of the detectability of the true changes; there were 223 true changes with detectability less than 2 seconds. Panel (c) shows the histogram of the detectability of the true changes which were missed in the detection; it is clear that most of the errors came from low detectabilities, less than 2 seconds. Panel (d) describes the Type-II error rates according to different degrees of detectability: when detectability is below 1 second, the Type-II error rate is 78%, i.e. most such changing points were missed; as the detectability increases, the Type-II error drops.

4. CLUSTERING VIA BIC
In this section, we describe how to apply the BIC criterion in clustering. Let S = {s_i : i = 1, ..., M} be the collection of signals we wish to cluster; each signal is associated with a sequence of independent random variables X_i = {x_ij : j = 1, ..., n_i}. In the context of speech clustering, S is a collection of audio segments; X_i can be the cepstral vectors extracted from the i-th segment. Denote N = sum_i n_i as the total sample size of the vectors X_i.

Let C_k = {c_i : i = 1, ..., k} be a clustering which has k clusters. We model each cluster c_i as a multivariate Gaussian distribution N(μ_i, Σ_i), where μ_i can be estimated as the sample mean vector and Σ_i can be estimated as the sample covariance matrix. Thus the number of parameters for each cluster is d + (1/2)d(d+1). Let n_i be the number of samples in cluster c_i. One can show that

    BIC(C_k) = sum_{i=1..k} { −(1/2) n_i log|Σ_i| } − λP        (7)

where the penalty P = (1/2)(d + (1/2)d(d+1)) k log N grows with the number of clusters k, and the penalty weight λ = 1. We choose the clustering which maximizes the BIC criterion.

4.1. Hierarchical Clustering via greedy BIC
As one can imagine, it is often very costly to search globally for the best BIC value, since clustering has to be performed to obtain different numbers of clusters. However, for hierarchical clustering methods, it is possible to optimize the BIC criterion in a greedy fashion.

Bottom-up methods start with each signal as one initial node, then successively merge the two nearest nodes according to a distance measure. Let S = {s_1, ..., s_k} be the current set of nodes; suppose s_1 and s_2 are the candidate pair for merging, and the merged new node is s. Thus we are comparing the current clustering S with a new clustering S' = {s, s_3, ..., s_k}. We model each node s_i as a multivariate Gaussian distribution N(μ_i, Σ_i). It is clear from (7) that the increase of the BIC value by merging s_1 and s_2 is

    ΔBIC = λP − (1/2)( n log|Σ| − n_1 log|Σ_1| − n_2 log|Σ_2| )        (8)

where n = n_1 + n_2 is the sample size of the merged node, Σ is the sample covariance matrix of the merged node, the penalty P = (1/2)(d + (1/2)d(d+1)) log N and the penalty weight λ = 1. Our BIC termination procedure is that two nodes should not be merged if (8) is negative. Since the BIC value is increased at each merge, we are searching for an "optimal" clustering tree by optimizing the BIC criterion in a greedy fashion.

Note that we merely use our criterion (8) for termination. It is possible to use criterion (8) as the distance measure in the bottom-up process. However, in many applications it is probably better to use more sophisticated distance measures. It is also clear that our criterion can be applied to top-down methods.
the Hub4 1996 ev alua- tion data The data set consists
  • f
the clean prepared and the clean sp
  • n
taneous p
  • rtion
  • f
the HUB4 1996 ev aluation data [2], hand-segmen ted in to 824 short segmen ts. Cepstral co e- cien ts w ere extracted as feature v ectors X i for eac h segmen t. W e used the log lik eliho
  • d
ratio distance measure; Bottom- up clustering w as p erformed with maxim um link age, with the BIC termination criterion (8). The true n um b er
  • f
sp eak ers is 28; the BIC termination criterion c hose 31 clusters. F
  • r
eac h cluster, w e dene the

−5 5 10 15 20 25 30 35 0.2 0.4 0.6 0.8 1 Purity of the BIC clustering

Figure 4. Clustering Purities Prepared Sp
  • n
taneous Baseline 18.8% 27.0% MLLR w/o clustering 18.7% 26.9% MLLR w/ ideal clustering 17.5% 24.8% MLLR w/ BIC clustering 17.5% 24.6% T able 2. MLLR adaptation enhanced b y BIC clus- tering purit y as the ratio b et w een the n um b er
  • f
segmen ts b y the dominating sp eak er in that cluster and the total n um b er
  • f
segmen ts in that cluster. Figure 3 sho ws the purities
  • f
eac h cluster. Clearly
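The purity measure is straightforward to compute (a sketch; representing a cluster as a list of per-segment speaker labels is our assumption):

```python
from collections import Counter

def purity(cluster_speakers):
    """Fraction of segments from the dominating speaker (section 4.2)."""
    counts = Counter(cluster_speakers)
    return max(counts.values()) / len(cluster_speakers)

print(purity(["spk1", "spk1", "spk2", "spk1"]))  # 0.75
```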
Clearly, our algorithm results in not only clusters with high purity, but also the appropriate number of clusters.

Speaker clustering can enhance the performance of unsupervised adaptation. The reason is that most of the 824 segments here are quite short, around 2-3 seconds. Without speaker clustering, unsupervised adaptation techniques such as MLLR [9] yield small improvements due to lack of data. Good speaker clustering can bring the segments of the same speaker together, thus improving the performance of unsupervised adaptation. We started from a baseline system which had about 90k Gaussians. The decoding results were scored according to two conditions: clean prepared and clean spontaneous. As shown in Table 2, the baseline error rates were 18.8% and 27.0% for the two conditions respectively. Without clustering, MLLR reduced the error rates by only 0.1%. With our clustering, MLLR reduced the error rates by 1.3% for the clean condition and by 2.4% for the spontaneous condition. Table 2 also shows the error rates of MLLR using the ideal clustering by the true speaker identities. It is clear that our speaker clustering enhanced the performance of MLLR as much as the ideal clustering.

4.3. Discussion
Jin et al. of BBN [7] proposed a similar automatic speaker clustering algorithm. They also used the log likelihood ratio distance measure proposed in Gish et al. [6], however with the distances between consecutive segments scaled down by a parameter α. They performed hierarchical clustering; for any given number k, the clustering tree was pruned to obtain the k tightest clusters. A heuristic model selection criterion

    sum_{j=1..k} n_j log|Σ_j| + p k        (9)

was then used to search through the space of (α, k) for the best clustering. They applied this algorithm to cluster the HUB4-96 evaluation data for the purpose of unsupervised adaptation. Similar to our results above, this automatic clustering enhanced the unsupervised adaptation as much as the ideal clustering according to the true speaker identities. The heuristic model selection criterion (9) resembles the BIC criterion (7): they both penalize the likelihood by the number of clusters. However, the BIC criterion has a solid theoretical foundation and seems more appropriate. Indeed, the number of speaker clusters found in [7] is considerably smaller than the truth. Moreover, extra information such as the adjacency of the segments was utilized in [7].

Siegler et al. of CMU [10] proposed another speaker clustering algorithm. They chose the symmetric Kullback-Leibler metric as the distance measure, and performed hierarchical clustering. The clusters were obtained by thresholding the distances. Unlike our method and the BBN clustering, this clustering is not fully automatic: the thresholding level was tuned in a delicate fashion; it had to be small enough such that the clusters created were made up of segments from only one speaker, and yet large enough to improve the performance of the unsupervised adaptation.

5. CONCLUSION
We presented a maximum likelihood approach to detecting changing points in an independent Gaussian process; the decision of a change is based on the BIC criterion. The key features of our approach are:

• Instead of making local decisions based on the distance between two adjacent sliding windows of fixed sizes, we expand the decision windows as wide as possible, so that our final decision on change points can be more robust.

• Our approach is thresholding-free. The BIC criterion can be viewed as thresholding the log likelihood distance, with the thresholding level automatically chosen as λ(1/2)(d + (1/2)d(d+1)) log N, where N is the size of the decision window and d is the dimension of the feature space.

We also proposed to apply the BIC criterion as a termination criterion in hierarchical clustering. Our change detection algorithm can successfully detect acoustic changing points with reasonable detectability (> 2s); our experiments on clustering demonstrated that the BIC criterion is able to choose the number of clusters according to the intrinsic complexity present in the data set and produce clustering solutions with high purity.

We applied our algorithms to the Hub4 1997 evaluation data [3]. Table 3 shows the recognition error rates. Our segmentation was only 0.6% worse than the NIST hand-segmentation. After clustering, the unsupervised adaptation further reduced the error rate by 2.7%.

                                   Error Rate
    NIST hand-segmentation           19.8%
    IBM segmentation                 20.4%
    Adaptation after clustering      17.7%

    Table 3. Segmentation and clustering in the Hub4 1997 task.

We comment that the penalty weight λ in the BIC criterion could be tuned to obtain various degrees of segmentation and clustering. A smaller weight would result in more changes and more clusters. In this paper, we simply chose λ = 1 according to the BIC theory.

REFERENCES
[1] H. Akaike, "A new look at the statistical identification model", IEEE Trans. Automatic Control, vol. 19, pp. 716-723, 1974.
[2] R. Bakis et al., "Transcription of broadcast news shows with the IBM large vocabulary speech recognition system", Proceedings of the Speech Recognition Workshop, pp. 67-72, 1997.
[3] S. Chen et al., "IBM's LVCSR System for Transcription of Broadcast News Used in the 1997 Hub4 English Evaluation", Proceedings of the Speech Recognition Workshop, 1998.
[4] H. Beigi and S. Maes, "Speaker, channel and environment change detection", Proceedings of the World Congress on Automation, 1998.
[5] D. Foster and E. George, "The risk inflation factor in multiple linear regression", Technical Report, Univ. of Texas, 1993.
[6] H. Gish and N. Schmidt, "Text-independent speaker identification", IEEE Signal Processing Magazine, pp. 18-21, Oct. 1994.
[7] H. Jin, F. Kubala and R. Schwartz, "Automatic speaker clustering", Proceedings of the Speech Recognition Workshop, pp. 108-111, 1997.
[8] F. Kubala et al., "The 1996 BBN Byblos Hub-4 transcription system", Proceedings of the Speech Recognition Workshop, pp. 90-93, 1997.
[9] C. J. Leggetter and P. C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density HMMs", Computer Speech and Language, vol. 9, no. 2, pp. 171-186.
[10] M. Siegler, U. Jain, B. Raj and R. Stern, "Automatic segmentation, classification and clustering of broadcast news audio", Proceedings of the Speech Recognition Workshop, pp. 97-99, 1997.
[11] G. Schwarz, "Estimating the dimension of a model", The Annals of Statistics, vol. 6, pp. 461-464, 1978.
[12] K. Shinoda et al., "Speaker adaptation with autonomous model complexity control by MDL principle", Proceedings of ICASSP, pp. 717-720, 1996.
[13] W.S. Wei, Time Series Analysis, Addison-Wesley, 1993.
[14] P. Woodland, M. Gales, D. Pye and S. Young, "The development of the 1996 HTK broadcast news transcription system", Proceedings of the Speech Recognition Workshop, pp. 73-78, 1997.