SLIDE 1

Soft VAD in Factor Analysis Based Speaker Segmentation of Broadcast News

Brecht Desplanques, Kris Demuynck & Jean-Pierre Martens ELIS Data Science Lab – Ghent University/iMinds

21 June 2016

SLIDE 2

22 June 2016

  • VRT STON project: Subtitling of TV shows by using speech technology
  • Subtitle generation is a time-consuming process that can be (partially) automated

Speaker Diarization for Automatic Subtitle Generation

Why solve the “who-spoke-when?” problem?

  • Subtitles with color codes
  • Enable the use of speaker-adapted models for speech recognition (SR)
  • Extra information for the SR language model through detected sentence boundaries

[Figure: diarization output, e.g. Speaker 1 | Speaker 1 | Speaker 2 | Speaker 2 | Speaker 2]

SLIDE 3

STON platform

http://www.esat.kuleuven.be/psi/spraak/demo/STON/

SLIDE 4

Main Approach to Diarization

Focus on more accurate speaker segmentation:

  • Segments that are too short do not provide enough data for reliable speaker models
  • Non-homogeneous segments cause errors to propagate into the clustering stage
  • Oversegmentation makes clustering considerably slower

Two step diarization process:

  • 1. Speaker segmentation or speaker change point detection
  • 2. Speaker clustering

STON subtitling workflow: audio signal → speech/non-speech segmentation → speaker diarization → language detection → speech recognition → post-processing

slide-5
SLIDE 5

5 22 June 2016

Speaker Diarization Architecture

1st pass: speech segments → speaker segmentation through generic eigenvoices → speaker clustering → speaker clusters

Between the passes: retrain the UBM on the speech segments and retrain the eigenvoices on the speaker clusters

2nd pass: speech segments → speaker segmentation through the file-specific eigenvoices → speaker clustering → speaker clusters

SLIDE 6

Speaker Segmentation: Boundary Generation

Speaker segmentation: Boundary generation and boundary elimination

  • 1. Boundary generation: creation of candidate speaker change points
  • Two hypotheses: different or identical speakers in left and right fixed-length sliding windows
  • Search for maximal dissimilarity by comparing the distribution of acoustic features (MFCCs)

[Figure: left (L) and right (R) sliding comparison windows]

SLIDE 7

Speech/non-speech segmentation does not eliminate short pauses (<1s) between speakers

Overlapping Comparison Windows

[Figure: overlapping left (L) and right (R) comparison windows around a short pause]

  • Adjacent comparison windows maximize dissimilarity at the pause boundaries
  • Overlapping windows maximize dissimilarity at the center of the pause
SLIDE 8

Speaker Segmentation via Speaker Factor Extraction

Extract speaker-specific information in each comparison window through factor analysis, yielding a speaker factor y_t per window.

  • GMM-UBM speech model with 32 components
  • Low-dimensional speaker variability (eigenvoice) matrix V (rank R = 20)
  • Extract the speaker factors y_t with a sliding-window (1 s) approach

The mean supervector m_t of window t is modeled as

m_t ≈ m_UBM + V y_t

  • Training data for the UBM and eigenvoice model: HUB4 BN96 English broadcast news
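The eigenvoice model above admits a standard closed-form point estimate of the speaker factor given the zeroth- and first-order Baum-Welch statistics of a window: y = (I + VᵀΣ⁻¹NV)⁻¹ VᵀΣ⁻¹F̃. A minimal NumPy sketch, assuming diagonal UBM covariances (the function name and argument layout are ours, not from the slides):

```python
import numpy as np

def extract_speaker_factor(N, F, V, Sigma, means):
    """MAP point estimate of the speaker factor y for one window.

    N:     (C,)   zeroth-order Baum-Welch stats (soft counts per UBM component)
    F:     (C, D) first-order Baum-Welch stats
    V:     (C*D, R) eigenvoice matrix
    Sigma: (C, D) diagonal UBM covariances
    means: (C, D) UBM component means
    """
    C, D = F.shape
    R = V.shape[1]
    # Center the first-order stats around the UBM means
    F_c = (F - N[:, None] * means).reshape(C * D)
    # Expand counts and covariances to supervector dimension
    N_sv = np.repeat(N, D)
    Sigma_sv = Sigma.reshape(C * D)
    # Posterior precision and mean: y = (I + V^T Sigma^-1 N V)^-1 V^T Sigma^-1 F_c
    VtSi = V.T / Sigma_sv                       # (R, C*D)
    prec = np.eye(R) + VtSi @ (N_sv[:, None] * V)
    return np.linalg.solve(prec, VtSi @ F_c)
```

With zero statistics the prior dominates and the estimate collapses to the origin, which is the expected behavior for an empty window.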
SLIDE 9

Speaker Factor Distance Measures

  • Model the intra-speaker variability of the left (L) and right (R) speaker with two covariance matrices Σ_L and Σ_R
  • Emphasize local changes that are not explained by intra-speaker variability with the Mahalanobis distance:

d_MAH(t) = Δy_t^T Σ_L^{-1} Δy_t + Δy_t^T Σ_R^{-1} Δy_t,  with Δy_t = y_{t+τ} − y_{t−τ}

  • Significant local changes in the speaker factors indicate a speaker change
  • Phonetic content has an impact on the speaker factors due to the short extraction windows, so this intra-speaker variability is estimated on the test data itself

[Figure: distance d(y_{t−τ}, y_{t+τ}) between the speaker factors of the left and right windows, modeled with Σ_L and Σ_R]
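The Mahalanobis change score above (written d_MAH here) can be sketched directly; this minimal NumPy version assumes a precomputed sequence of speaker factors and full covariance matrices (the function name and loop form are ours):

```python
import numpy as np

def mahalanobis_change_score(Y, nu, Sigma_L, Sigma_R):
    """Change score per position t: dy^T Sigma_L^-1 dy + dy^T Sigma_R^-1 dy,
    with dy = Y[t+nu] - Y[t-nu].

    Y: (T, R) speaker factors, one per sliding-window position.
    """
    T = Y.shape[0]
    SiL = np.linalg.inv(Sigma_L)
    SiR = np.linalg.inv(Sigma_R)
    scores = np.zeros(T)
    # Score only positions where both shifted factors exist
    for t in range(nu, T - nu):
        dy = Y[t + nu] - Y[t - nu]
        scores[t] = dy @ SiL @ dy + dy @ SiR @ dy
    return scores
```

A jump in the factor trajectory produces a peak in the score, while a constant trajectory scores zero.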

SLIDE 10

Generate Candidate Change Points

d_MAH(t) = Δy_t^T Σ_L^{-1} Δy_t + Δy_t^T Σ_R^{-1} Δy_t

Peak selection algorithm:

  • Apply an averaging filter to avoid the detection of spurious peaks
  • Select a number of maxima according to the length of the speech segment
  • Enforce a minimum duration of 1 s for each generated speaker turn
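The three steps above can be sketched as a greedy peak picker; this is a minimal interpretation, with illustrative parameter values (smoothing length, peaks-per-second rate) that are ours, not from the slides:

```python
import numpy as np

def select_change_points(scores, seg_len_s, frame_rate, min_turn_s=1.0,
                         smooth=5, peaks_per_s=0.2):
    """Smooth the change-score curve, then greedily keep the highest local
    maxima while enforcing a minimum turn duration between them."""
    # Averaging filter against spurious peaks
    kernel = np.ones(smooth) / smooth
    sm = np.convolve(scores, kernel, mode="same")
    # Number of candidates scales with the segment length
    n_peaks = max(1, int(seg_len_s * peaks_per_s))
    min_gap = int(min_turn_s * frame_rate)
    chosen = []
    # Visit positions from highest to lowest smoothed score
    for t in np.argsort(sm)[::-1]:
        if len(chosen) == n_peaks:
            break
        if all(abs(t - c) >= min_gap for c in chosen):
            chosen.append(int(t))
    return sorted(chosen)
```

The minimum-gap test is what enforces the 1 s minimum duration of each generated speaker turn.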
SLIDE 11

Speaker Segmentation: Boundary Elimination

  • 2. Boundary elimination: eliminate false positives
  • Split the speech segment into speaker turns defined by the candidate boundaries
  • 1st pass: ΔBIC agglomerative clustering of adjacent speaker turns

ΔBIC = (N_i + N_{i+1}) log|Σ_{i,i+1}| − N_i log|Σ_i| − N_{i+1} log|Σ_{i+1}| − λP

  • 2nd pass: CDS (cosine distance scoring) agglomerative clustering of adjacent speaker turns:

d_CDS(i, i+1) = 1 − (y_i · y_{i+1}) / (‖y_i‖ ‖y_{i+1}‖)

  • Clustering threshold controls the number of eliminated boundaries

[Figure: adjacent speaker turns i and i+1 considered for merging]
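Both elimination criteria are easy to sketch. The ΔBIC below follows the slide's unscaled form with full-covariance Gaussians and a tunable penalty weight, and the cosine distance is the CDS score between speaker factors (function names and the exact penalty term are our assumptions):

```python
import numpy as np

def delta_bic(X1, X2, lam=1.0):
    """Delta-BIC between two adjacent turns modeled as full-covariance
    Gaussians; positive values suggest a speaker change, so adjacent turns
    with low Delta-BIC are merged.  lam scales the model-size penalty."""
    def logdet(X):
        # log-determinant of the ML covariance of the frames in X
        return np.linalg.slogdet(np.cov(X, rowvar=False, bias=True))[1]
    n1, n2 = len(X1), len(X2)
    d = X1.shape[1]
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n1 + n2)
    return ((n1 + n2) * logdet(np.vstack([X1, X2]))
            - n1 * logdet(X1) - n2 * logdet(X2) - lam * penalty)

def cds_distance(y1, y2):
    """Cosine distance scoring between the speaker factors of two turns."""
    return 1.0 - (y1 @ y2) / (np.linalg.norm(y1) * np.linalg.norm(y2))
```

Raising the clustering threshold on either score eliminates more candidate boundaries, trading recall for precision.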

SLIDE 12

Speaker Segmentation: Baseline Results

  • COST278 broadcast news test set: 11(+1) languages, 30 hours, 4400 speaker turns

Evaluation

  • Mapping with 500ms margin
  • Recall: percentage of real boundaries mapped to computed ones
  • Precision: percentage of computed boundaries mapped to real ones
  • Popular ΔBIC boundary generation baseline (overlapping comparison windows)

Maximum recall: 90.6%
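The evaluation protocol above (one-to-one mapping of boundaries within a 500 ms margin, then precision and recall) can be sketched as follows; the greedy matching order is our simplification of the mapping step:

```python
def boundary_precision_recall(ref, hyp, margin=0.5):
    """Map each reference boundary to at most one hypothesis boundary
    within +/- margin seconds (greedy, in time order), then report
    precision and recall over the boundary sets (times in seconds)."""
    available = sorted(hyp)
    matched = 0
    for r in sorted(ref):
        for h in available:
            if abs(h - r) <= margin:
                available.remove(h)  # each computed boundary maps once
                matched += 1
                break
    recall = matched / len(ref) if ref else 0.0
    precision = matched / len(hyp) if hyp else 0.0
    return precision, recall
```

For example, with real boundaries at 1.0 s, 5.0 s, 9.0 s and computed ones at 1.2 s, 5.6 s, 8.8 s, 12.0 s, only two pairs fall within the 500 ms margin, giving a precision of 50% and a recall of 66.7%.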

SLIDE 13

Two-Pass Adaptive Speaker Segmentation

  • 1. Cluster the speaker turns generated by our best system (d_MAH + ΔBIC)
  • 2. Retrain the UBM and the eigenvoice model on the speech and speaker clusters of the test file
  • 3. Repeat the boundary generation (d_MAH) and elimination (CDS) with the adapted models
SLIDE 14

Soft VAD for Speaker Factor Extraction

  • Our speaker factor extraction does not differentiate between speech and non-speech frames
  • Give speech frames more weight during speaker factor extraction
  • Integrate GMM-based soft Voice Activity Detection (VAD) into the estimation of the Baum-Welch statistics via the frame-level speech posterior

P(S|o_t) = p(o_t|UBM_S) / [p(o_t|UBM_S) + p(o_t|UBM_NS)]

and the posterior-weighted statistics over a window χ

N_m = Σ_{o_t ∈ χ} P(S|o_t) γ(UBM_{S,m}|o_t)
F_m = Σ_{o_t ∈ χ} P(S|o_t) γ(UBM_{S,m}|o_t) o_t

Modifications to the 2nd-pass adaptive system:

  • Also retrain the nonspeech UBM on the test file
  • Use soft VAD speaker factor extraction during CDS boundary elimination
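The posterior weighting above can be sketched in a few lines; this assumes per-frame log-likelihoods under the speech and non-speech models and precomputed component responsibilities are available (argument names are ours), and computes the posterior from log-likelihoods in a numerically stable way with equal priors:

```python
import numpy as np

def soft_vad_weighted_stats(ll_speech, ll_nonspeech, gamma, X):
    """Weight Baum-Welch statistics by the per-frame speech posterior
    P(S|o_t) = p(o_t|UBM_S) / (p(o_t|UBM_S) + p(o_t|UBM_NS)).

    ll_speech, ll_nonspeech: (T,) frame log-likelihoods
    gamma: (T, C) component responsibilities under the speech UBM
    X:     (T, D) acoustic features
    """
    # Stable posterior from log-likelihoods (equal priors assumed)
    m = np.maximum(ll_speech, ll_nonspeech)
    ps = np.exp(ll_speech - m)
    pn = np.exp(ll_nonspeech - m)
    w = ps / (ps + pn)                     # (T,) speech posterior per frame
    wg = w[:, None] * gamma                # posterior-weighted responsibilities
    N = wg.sum(axis=0)                     # (C,) zeroth-order stats
    F = wg.T @ X                           # (C, D) first-order stats
    return N, F
```

Frames the VAD considers non-speech contribute almost nothing to the statistics, so the extracted speaker factor is driven by speech frames only.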
SLIDE 15

Soft VAD Speaker Segmentation

SLIDE 16

Agglomerative Clustering

  • Initial ΔBIC clustering
  • iVector PLDA clustering

[Figure: should Cluster 1 and Cluster 2 be merged? Each cluster c is represented by an iVector: m_{c1} → x_{c1}, m_{c2} → x_{c2}, m_{c3} → x_{c3}]

  • 1. Extract an iVector x_c for each cluster c (after VAD and feature warping):

m_c = m_UBM + T x_c

  • 2. Hypothesis test with Gaussian PLDA, using the likelihood ratio

p(x_{c1}, x_{c3} | H_same_speaker) / p(x_{c1}, x_{c3} | H_different_speaker)

  • 3. Merge cluster pair with largest ratio
  • 4. Iterate whole process until maximum ratio is too small
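The merge-and-iterate loop above can be sketched generically. Here a pluggable `score(a, b)` callable stands in for the PLDA same/different likelihood ratio (implementing the full Gaussian PLDA scoring is beyond this sketch), and a merged cluster is represented by the size-weighted mean of its vectors, which is a simplification of re-extracting an iVector:

```python
import numpy as np

def agglomerative_merge(vectors, score, threshold):
    """Greedy agglomerative clustering: repeatedly merge the pair of
    clusters with the highest pairwise score until the best score drops
    below `threshold`.  Returns lists of original item indices."""
    clusters = [[i] for i in range(len(vectors))]
    vecs = [np.asarray(v, dtype=float) for v in vectors]
    while len(clusters) > 1:
        best, pair = -np.inf, None
        # Exhaustive search over all current cluster pairs
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = score(vecs[i], vecs[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break                      # stopping criterion of step 4
        i, j = pair
        ni, nj = len(clusters[i]), len(clusters[j])
        vecs[i] = (ni * vecs[i] + nj * vecs[j]) / (ni + nj)
        clusters[i] += clusters[j]
        del clusters[j], vecs[j]
    return clusters
```

With cosine similarity as the score, two tight groups of vectors collapse into two clusters and the loop then stops, mirroring steps 1–4.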
SLIDE 17

Results Clustering + Conclusion

| Initial segmentation | DER (%) | Boundary precision (%) | Boundary recall (%) |
|---|---|---|---|
| ΔBIC (baseline) | 10.1 | 65.2 | 75.8 |
| d_MAH | 9.7 | 74.3 | 84.2 |
| 2-pass d_MAH | 9.8 | 79.7 | 81.3 |
| 2-pass with soft VAD | 8.9 | 81.7 | 85.0 |

  • Factor analysis based speaker segmentation produces more accurate boundaries
  • File-by-file adaptation further improves results
  • Soft VAD makes the speaker factor extraction more accurate (adaptive system)
  • Viterbi resegmentation deteriorates the boundary accuracy of the proposed system
  • COST278 broadcast news test set: 11(+1) languages, 30 hours, 4400 speaker turns
  • Diarization Error Rate (DER): percentage of frames attributed to a wrong speaker