SLIDE 1

Soft VAD in Factor Analysis Based Speaker Segmentation of Broadcast News

Brecht Desplanques, Kris Demuynck & Jean-Pierre Martens ELIS Data Science Lab – Ghent University/iMinds

21 June 2016

SLIDE 2

22 June 2016

  • VRT STON project: Subtitling of TV shows by using speech technology
  • Subtitle generation is a time-consuming process that can be (partially) automated

Speaker Diarization for Automatic Subtitle Generation

Why solve the “who-spoke-when?” problem?

  • Subtitles with color codes
  • Enable the use of speaker-adapted models for speech recognition (SR)
  • Extra information for the SR language model through detected sentence boundaries

[Figure: diarization output, e.g. Speaker 1 | Speaker 1 | Speaker 2 | Speaker 2 | Speaker 2]

SLIDE 3

STON platform

http://www.esat.kuleuven.be/psi/spraak/demo/STON/

SLIDE 4

Main Approach to Diarization

Focus on more accurate speaker segmentation:

  • Segments that are too short do not provide enough data for reliable speaker models
  • Non-homogeneous segments cause errors to propagate into the clustering stage
  • Oversegmentation makes clustering considerably slower

Two step diarization process:

  • 1. Speaker segmentation or speaker change point detection
  • 2. Speaker clustering

STON subtitling workflow: audio signal → speech/non-speech segmentation → speaker diarization → language detection → speech recognition → post-processing

slide-5
SLIDE 5

5 22 June 2016

Speaker Diarization Architecture

1st pass: speech segments → speaker segmentation through generic eigenvoices → speaker clustering → speaker clusters

Between the passes: retrain the UBM on the speech segments and retrain the eigenvoices on the speaker clusters

2nd pass: speech segments → speaker segmentation through the file-specific eigenvoices → speaker clustering → speaker clusters

SLIDE 6

Speaker Segmentation: Boundary Generation

Speaker segmentation: Boundary generation and boundary elimination

  • 1. Boundary generation: creation of candidate speaker change points
  • Two hypotheses: different or identical speakers in left and right fixed-length sliding windows
  • Search for maximal dissimilarity by comparing the distribution of acoustic features (MFCCs)

[Figure: left (L) and right (R) sliding comparison windows]

SLIDE 7

Speech/non-speech segmentation does not eliminate short pauses (<1s) between speakers

Overlapping Comparison Windows

[Figure: overlapping left (L) and right (R) comparison windows around a short pause]

  • Adjacent comparison windows maximize dissimilarity at the pause boundaries
  • Overlapping windows maximize dissimilarity at the center of the pause
SLIDE 8

Speaker Segmentation via Speaker Factor Extraction

Extract speaker-specific information in each comparison window through factor analysis, yielding a speaker factor y_t per window.

  • GMM-UBM speech model with 32 components
  • Low-dimensional speaker variability (eigenvoice) matrix V (rank R = 20)
  • Extract the speaker factors y_t with a sliding-window (1 s) approach

The mean supervector m_t of window t is modeled as

m_t ≈ m_UBM + V y_t

  • Training data for the UBM and eigenvoice model: HUB4 BN96 English broadcast news
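The eigenvoice model above admits a standard closed-form point estimate of the speaker factor given the zeroth- and first-order Baum-Welch statistics of a window: y = (I + VᵀΣ⁻¹NV)⁻¹ VᵀΣ⁻¹F̃. A minimal NumPy sketch, assuming diagonal UBM covariances (the function name and argument layout are ours, not from the slides):

```python
import numpy as np

def extract_speaker_factor(N, F, V, Sigma, means):
    """MAP point estimate of the speaker factor y for one window.

    N:     (C,)   zeroth-order Baum-Welch stats (soft counts per UBM component)
    F:     (C, D) first-order Baum-Welch stats
    V:     (C*D, R) eigenvoice matrix
    Sigma: (C, D) diagonal UBM covariances
    means: (C, D) UBM component means
    """
    C, D = F.shape
    R = V.shape[1]
    # Center the first-order stats around the UBM means
    F_c = (F - N[:, None] * means).reshape(C * D)
    # Expand counts and covariances to supervector dimension
    N_sv = np.repeat(N, D)
    Sigma_sv = Sigma.reshape(C * D)
    # Posterior precision and mean: y = (I + V^T Sigma^-1 N V)^-1 V^T Sigma^-1 F_c
    VtSi = V.T / Sigma_sv                       # (R, C*D)
    prec = np.eye(R) + VtSi @ (N_sv[:, None] * V)
    return np.linalg.solve(prec, VtSi @ F_c)
```

With zero statistics the prior dominates and the estimate collapses to the origin, which is the expected behavior for an empty window.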
SLIDE 9

Speaker Factor Distance Measures

  • Model the intra-speaker variability of the left (L) and right (R) speaker with two covariance matrices Σ_L and Σ_R
  • Emphasize local changes that are not explained by intra-speaker variability with the Mahalanobis distance:

d_MAH(t) = Δy_t^T Σ_L^{-1} Δy_t + Δy_t^T Σ_R^{-1} Δy_t,  with Δy_t = y_{t+τ} − y_{t−τ}

  • Significant local changes in the speaker factors indicate a speaker change
  • Phonetic content has an impact on the speaker factors due to the short extraction windows, so this intra-speaker variability is estimated on the test data itself

[Figure: distance d(y_{t−τ}, y_{t+τ}) between the speaker factors of the left and right windows, modeled with Σ_L and Σ_R]
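The Mahalanobis change score above (written d_MAH here) can be sketched directly; this minimal NumPy version assumes a precomputed sequence of speaker factors and full covariance matrices (the function name and loop form are ours):

```python
import numpy as np

def mahalanobis_change_score(Y, nu, Sigma_L, Sigma_R):
    """Change score per position t: dy^T Sigma_L^-1 dy + dy^T Sigma_R^-1 dy,
    with dy = Y[t+nu] - Y[t-nu].

    Y: (T, R) speaker factors, one per sliding-window position.
    """
    T = Y.shape[0]
    SiL = np.linalg.inv(Sigma_L)
    SiR = np.linalg.inv(Sigma_R)
    scores = np.zeros(T)
    # Score only positions where both shifted factors exist
    for t in range(nu, T - nu):
        dy = Y[t + nu] - Y[t - nu]
        scores[t] = dy @ SiL @ dy + dy @ SiR @ dy
    return scores
```

A jump in the factor trajectory produces a peak in the score, while a constant trajectory scores zero.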

SLIDE 10

Generate Candidate Change Points

d_MAH(t) = Δy_t^T Σ_L^{-1} Δy_t + Δy_t^T Σ_R^{-1} Δy_t

Peak selection algorithm:

  • Apply an averaging filter to avoid the detection of spurious peaks
  • Select a number of maxima according to the length of the speech segment
  • Enforce a minimum duration of 1 s for each generated speaker turn
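The three steps above can be sketched as a greedy peak picker; this is a minimal interpretation, with illustrative parameter values (smoothing length, peaks-per-second rate) that are ours, not from the slides:

```python
import numpy as np

def select_change_points(scores, seg_len_s, frame_rate, min_turn_s=1.0,
                         smooth=5, peaks_per_s=0.2):
    """Smooth the change-score curve, then greedily keep the highest local
    maxima while enforcing a minimum turn duration between them."""
    # Averaging filter against spurious peaks
    kernel = np.ones(smooth) / smooth
    sm = np.convolve(scores, kernel, mode="same")
    # Number of candidates scales with the segment length
    n_peaks = max(1, int(seg_len_s * peaks_per_s))
    min_gap = int(min_turn_s * frame_rate)
    chosen = []
    # Visit positions from highest to lowest smoothed score
    for t in np.argsort(sm)[::-1]:
        if len(chosen) == n_peaks:
            break
        if all(abs(t - c) >= min_gap for c in chosen):
            chosen.append(int(t))
    return sorted(chosen)
```

The minimum-gap test is what enforces the 1 s minimum duration of each generated speaker turn.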
SLIDE 11

Speaker Segmentation: Boundary Elimination

  • 2. Boundary elimination: eliminate false positives
  • Split the speech segment into speaker turns defined by the candidate boundaries
  • 1st pass: ΔBIC agglomerative clustering of adjacent speaker turns

ΔBIC = (N_i + N_{i+1}) log|Σ_{i,i+1}| − N_i log|Σ_i| − N_{i+1} log|Σ_{i+1}| − λP

  • 2nd pass: CDS (cosine distance scoring) agglomerative clustering of adjacent speaker turns:

d_CDS(i, i+1) = 1 − (y_i · y_{i+1}) / (‖y_i‖ ‖y_{i+1}‖)

  • Clustering threshold controls the number of eliminated boundaries

[Figure: adjacent speaker turns i and i+1 considered for merging]
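Both elimination criteria are easy to sketch. The ΔBIC below follows the slide's unscaled form with full-covariance Gaussians and a tunable penalty weight, and the cosine distance is the CDS score between speaker factors (function names and the exact penalty term are our assumptions):

```python
import numpy as np

def delta_bic(X1, X2, lam=1.0):
    """Delta-BIC between two adjacent turns modeled as full-covariance
    Gaussians; positive values suggest a speaker change, so adjacent turns
    with low Delta-BIC are merged.  lam scales the model-size penalty."""
    def logdet(X):
        # log-determinant of the ML covariance of the frames in X
        return np.linalg.slogdet(np.cov(X, rowvar=False, bias=True))[1]
    n1, n2 = len(X1), len(X2)
    d = X1.shape[1]
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n1 + n2)
    return ((n1 + n2) * logdet(np.vstack([X1, X2]))
            - n1 * logdet(X1) - n2 * logdet(X2) - lam * penalty)

def cds_distance(y1, y2):
    """Cosine distance scoring between the speaker factors of two turns."""
    return 1.0 - (y1 @ y2) / (np.linalg.norm(y1) * np.linalg.norm(y2))
```

Raising the clustering threshold on either score eliminates more candidate boundaries, trading recall for precision.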

SLIDE 12

Speaker Segmentation: Baseline Results

  • COST278 broadcast news test set: 11(+1) languages, 30 hours, 4400 speaker turns

Evaluation

  • Mapping with 500ms margin
  • Recall: percentage of real boundaries mapped to computed ones
  • Precision: percentage of computed boundaries mapped to real ones
  • Popular ΔBIC boundary generation baseline (overlapping comparison windows)

Maximum recall: 90.6%
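The evaluation protocol above (one-to-one mapping of boundaries within a 500 ms margin, then precision and recall) can be sketched as follows; the greedy matching order is our simplification of the mapping step:

```python
def boundary_precision_recall(ref, hyp, margin=0.5):
    """Map each reference boundary to at most one hypothesis boundary
    within +/- margin seconds (greedy, in time order), then report
    precision and recall over the boundary sets (times in seconds)."""
    available = sorted(hyp)
    matched = 0
    for r in sorted(ref):
        for h in available:
            if abs(h - r) <= margin:
                available.remove(h)  # each computed boundary maps once
                matched += 1
                break
    recall = matched / len(ref) if ref else 0.0
    precision = matched / len(hyp) if hyp else 0.0
    return precision, recall
```

For example, with real boundaries at 1.0 s, 5.0 s, 9.0 s and computed ones at 1.2 s, 5.6 s, 8.8 s, 12.0 s, only two pairs fall within the 500 ms margin, giving a precision of 50% and a recall of 66.7%.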

SLIDE 13

Two-Pass Adaptive Speaker Segmentation

  • 1. Cluster the speaker turns generated by our best system (d_MAH + ΔBIC)
  • 2. Retrain the UBM and the eigenvoice model on the speech and speaker clusters of the test file
  • 3. Repeat the boundary generation (d_MAH) and elimination (CDS) with the adapted models
SLIDE 14

Soft VAD for Speaker Factor Extraction

  • Our speaker factor extraction does not differentiate between speech and non-speech frames
  • Give speech frames more weight during speaker factor extraction
  • Integrate GMM-based soft Voice Activity Detection (VAD) into the estimation of the Baum-Welch statistics via the frame-level speech posterior

P(S|o_t) = p(o_t|UBM_S) / [p(o_t|UBM_S) + p(o_t|UBM_NS)]

and the posterior-weighted statistics over a window χ

N_m = Σ_{o_t ∈ χ} P(S|o_t) γ(UBM_{S,m}|o_t)
F_m = Σ_{o_t ∈ χ} P(S|o_t) γ(UBM_{S,m}|o_t) o_t

Modifications to the 2nd-pass adaptive system:

  • Also retrain the nonspeech UBM on the test file
  • Use soft VAD speaker factor extraction during CDS boundary elimination
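The posterior weighting above can be sketched in a few lines; this assumes per-frame log-likelihoods under the speech and non-speech models and precomputed component responsibilities are available (argument names are ours), and computes the posterior from log-likelihoods in a numerically stable way with equal priors:

```python
import numpy as np

def soft_vad_weighted_stats(ll_speech, ll_nonspeech, gamma, X):
    """Weight Baum-Welch statistics by the per-frame speech posterior
    P(S|o_t) = p(o_t|UBM_S) / (p(o_t|UBM_S) + p(o_t|UBM_NS)).

    ll_speech, ll_nonspeech: (T,) frame log-likelihoods
    gamma: (T, C) component responsibilities under the speech UBM
    X:     (T, D) acoustic features
    """
    # Stable posterior from log-likelihoods (equal priors assumed)
    m = np.maximum(ll_speech, ll_nonspeech)
    ps = np.exp(ll_speech - m)
    pn = np.exp(ll_nonspeech - m)
    w = ps / (ps + pn)                     # (T,) speech posterior per frame
    wg = w[:, None] * gamma                # posterior-weighted responsibilities
    N = wg.sum(axis=0)                     # (C,) zeroth-order stats
    F = wg.T @ X                           # (C, D) first-order stats
    return N, F
```

Frames the VAD considers non-speech contribute almost nothing to the statistics, so the extracted speaker factor is driven by speech frames only.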
SLIDE 15

Soft VAD Speaker Segmentation

SLIDE 16

Agglomerative Clustering

  • Initial ΔBIC clustering
  • iVector PLDA clustering

[Figure: should Cluster 1 and Cluster 2 be merged? Each cluster c is represented by an iVector: m_{c1} → x_{c1}, m_{c2} → x_{c2}, m_{c3} → x_{c3}]

  • 1. Extract an iVector x_c for each cluster c (after VAD and feature warping):

m_c = m_UBM + T x_c

  • 2. Hypothesis test with Gaussian PLDA, using the likelihood ratio

p(x_{c1}, x_{c3} | H_same_speaker) / p(x_{c1}, x_{c3} | H_different_speaker)

  • 3. Merge cluster pair with largest ratio
  • 4. Iterate whole process until maximum ratio is too small
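The merge-and-iterate loop above can be sketched generically. Here a pluggable `score(a, b)` callable stands in for the PLDA same/different likelihood ratio (implementing the full Gaussian PLDA scoring is beyond this sketch), and a merged cluster is represented by the size-weighted mean of its vectors, which is a simplification of re-extracting an iVector:

```python
import numpy as np

def agglomerative_merge(vectors, score, threshold):
    """Greedy agglomerative clustering: repeatedly merge the pair of
    clusters with the highest pairwise score until the best score drops
    below `threshold`.  Returns lists of original item indices."""
    clusters = [[i] for i in range(len(vectors))]
    vecs = [np.asarray(v, dtype=float) for v in vectors]
    while len(clusters) > 1:
        best, pair = -np.inf, None
        # Exhaustive search over all current cluster pairs
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = score(vecs[i], vecs[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break                      # stopping criterion of step 4
        i, j = pair
        ni, nj = len(clusters[i]), len(clusters[j])
        vecs[i] = (ni * vecs[i] + nj * vecs[j]) / (ni + nj)
        clusters[i] += clusters[j]
        del clusters[j], vecs[j]
    return clusters
```

With cosine similarity as the score, two tight groups of vectors collapse into two clusters and the loop then stops, mirroring steps 1–4.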
SLIDE 17

Results Clustering + Conclusion

| Initial segmentation | DER (%) | Boundary precision (%) | Boundary recall (%) |
|---|---|---|---|
| ΔBIC (baseline) | 10.1 | 65.2 | 75.8 |
| d_MAH | 9.7 | 74.3 | 84.2 |
| 2-pass d_MAH | 9.8 | 79.7 | 81.3 |
| 2-pass with soft VAD | 8.9 | 81.7 | 85.0 |

  • Factor analysis based speaker segmentation produces more accurate boundaries
  • File-by-file adaptation further improves results
  • Soft VAD makes the speaker factor extraction more accurate (adaptive system)
  • Viterbi resegmentation deteriorates the boundary accuracy of the proposed system
  • COST278 broadcast news test set: 11(+1) languages, 30 hours, 4400 speaker turns
  • Diarization Error Rate (DER): percentage of frames attributed to a wrong speaker