SPEAKER, ENVIRONMENT AND CHANNEL CHANGE DETECTION AND CLUSTERING VIA THE BAYESIAN INFORMATION CRITERION

Scott Shaobing Chen & P.S. Gopalakrishnan
IBM T.J. Watson Research Center
email: schen@watson.ibm.com

ABSTRACT

In this paper, we are interested in detecting changes in speaker identity, environmental condition and channel condition; we call this the problem of acoustic change detection. The input audio stream can be modeled as a Gaussian process in the cepstral space. We present a maximum likelihood approach to detect turns of a Gaussian process; the decision of a turn is based on the Bayesian Information Criterion (BIC), a model selection criterion well known in the statistics literature. The BIC criterion can also be applied as a termination criterion in hierarchical methods for clustering of audio segments: two nodes can be merged only if the merging increases the BIC value. Our experiments on the Hub4 1996 and 1997 evaluation data show that our segmentation algorithm can successfully detect acoustic changes; our clustering algorithm can produce clusters with high purity, leading to improvements in accuracy through unsupervised adaptation as much as the ideal clustering by the true speaker identities.

1. INTRODUCTION

Automatic segmentation of an audio stream and automatic clustering of audio segments according to speaker identities, environmental conditions and channel conditions have received quite a bit of attention recently [4, 8, 6, 10]. For example, in the task of automatic transcription of broadcast news [3], the data contains clean speech, telephone speech, music segments, speech corrupted by music or noise, etc. There are no explicit cues for the changes in speaker identity, environmental condition and channel condition. Also, the same speaker may appear multiple times in the data. In order to transcribe the speech content in audio streams of this nature,

• we would like to segment the audio stream into homogeneous regions according to speaker identity, environmental condition and channel condition so that regions of different nature can be handled differently: for example, regions of pure music and noise can be rejected; also, one might design a separate recognition system for telephone speech.

• we would like to cluster speech segments into homogeneous clusters according to speaker identity, environment and channel; unsupervised adaptation can then be performed on each cluster. [8, 10] showed that a good clustering procedure can greatly improve the performance of unsupervised adaptation such as MLLR.

Various segmentation algorithms have been proposed in the literature [2, 4, 6, 8, 10, 14], which can be categorized as follows:

• Decoder-guided segmentation. The input audio stream can be first decoded; then the desired segments can be produced by cutting the input at the silence locations generated from the decoder [14, 8]. Other information from the decoder, such as the gender information, could also be utilized in the segmentation [8].

• Model-based segmentation. [2] proposed to build different models, e.g. Gaussian mixture models, for a fixed set of acoustic classes, such as telephone speech, pure music, etc., from a training corpus; the incoming audio stream can be classified by maximum likelihood selection over a sliding window; segmentation can be made at the locations where there is a change in the acoustic class.

• Metric-based segmentation. [4, 6, 10] proposed to segment the audio stream at maxima of the distances between neighboring windows placed at every sample; distances such as the KL distance and the generalized likelihood ratio distance have been investigated.

In our opinion, these methods are not very successful in detecting the acoustic changes present in the data. The decoder-guided segmentation only places boundaries at silence locations, which in general have no direct connection with the acoustic changes in the data. Both the model-based segmentation and the metric-based segmentation rely on thresholding of measurements which lack stability and robustness. Besides, the model-based segmentation does not generalize to unseen acoustic conditions.

Clustering of audio segments is often performed via hierarchical clustering [10, 8]. First, a distance matrix is computed; the common practice is to model each audio segment as one Gaussian in the cepstral space and to use the KL distance or the generalized likelihood ratio as the distance measure [6]. Then bottom-up hierarchical clustering can be performed to generate a clustering tree. It is often difficult to determine the number of clusters. One can heuristically pre-determine the number of clusters or the minimum size of each cluster; accordingly, one can go down the tree to obtain the desired clustering [14]. Another heuristic solution is to threshold the distance measures during the hierarchical process; the thresholding level is tuned on a training set [10]. Jin et al. [7] shed some light on automatically choosing a clustering solution.
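
As an illustration of the common practice described above (one Gaussian per audio segment, with the KL distance as the distance measure), the following is a minimal Python sketch of how such a distance matrix might be computed. It is not the authors' implementation. The function names (fit_gaussian, symmetric_kl, distance_matrix), the use of the symmetrized form of the KL divergence, the full-covariance maximum-likelihood estimate, and the synthetic 3-dimensional "cepstral" frames are all assumptions made for this example.

```python
import numpy as np


def fit_gaussian(frames):
    """Fit one Gaussian to a segment (rows = frames, columns = features).

    Returns the maximum-likelihood mean and full covariance (assumed model).
    """
    frames = np.asarray(frames, dtype=float)
    mean = frames.mean(axis=0)
    cov = np.atleast_2d(np.cov(frames, rowvar=False, bias=True))
    return mean, cov


def symmetric_kl(g1, g2):
    """Symmetrized KL divergence KL(p||q) + KL(q||p) between two Gaussians.

    The log-determinant terms of the two directed divergences cancel,
    leaving only trace and mean-difference terms.
    """
    (m1, c1), (m2, c2) = g1, g2
    diff = m1 - m2
    inv1, inv2 = np.linalg.inv(c1), np.linalg.inv(c2)
    dim = m1.size
    trace_term = np.trace(inv2 @ c1) + np.trace(inv1 @ c2) - 2 * dim
    mean_term = diff @ (inv1 + inv2) @ diff
    return 0.5 * (trace_term + mean_term)


def distance_matrix(segments):
    """Pairwise symmetric-KL distances between single-Gaussian segment models."""
    models = [fit_gaussian(s) for s in segments]
    n = len(models)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = symmetric_kl(models[i], models[j])
    return dist


# Synthetic stand-ins for cepstral frames (hypothetical data):
# segments 0 and 1 share a distribution, segment 2 has a shifted mean.
rng = np.random.default_rng(0)
demo_segments = [
    rng.normal(0.0, 1.0, size=(200, 3)),
    rng.normal(0.0, 1.0, size=(200, 3)),
    rng.normal(3.0, 1.0, size=(200, 3)),
]
demo_dist = distance_matrix(demo_segments)
```

A bottom-up hierarchical clustering would repeatedly merge the pair of nodes with the smallest entry in such a matrix; the heuristics mentioned above differ only in when that merging is stopped.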
