On the Use of PLDA i-Vector Scoring for Clustering Short Segments



slide-1
SLIDE 1

On the Use of PLDA i-Vector Scoring for Clustering Short Segments

Itay Salmun Irit Opher Itshak Lapidot

itaysa@afeka.ac.il irito@afeka.ac.il itshakl@afeka.ac.il

slide-2
SLIDE 2

Outline

  • Briefly about DNNs
  • Motivation
  • Problem Definition
  • Basic Mean-Shift Algorithm
  • Modified Mean-Shift Algorithm
  • Speaker Clustering System
  • Experiments and Results
  • Summary

June 2016

slide-3
SLIDE 3

Briefly about DNNs


slide-4
SLIDE 4

Motivation


Bli… Blu… Blo… Pla… Pli… Dla… Gla… Tra… Tra Ta Ta Ta Ta…

slide-5
SLIDE 5

Motivation (cont.)


slide-6
SLIDE 6

Motivation (cont.)


slide-7
SLIDE 7

Problem Definition


  • Given many short speech segments, we are required to cluster them into homogeneous groups such that:

– Each cluster is occupied mostly by a single speaker (cluster purity).
– Each speaker belongs mostly to a single cluster (speaker purity).

slide-8
SLIDE 8

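Mean-Shift Algorithm

Basic

  • Objective: find the densest region.
  • The mean shift vector:

$$ m_h(\varphi) = \frac{\sum_{i=1}^{n} \varphi_i \, g\left( \left\| \frac{\varphi - \varphi_i}{h} \right\|^2 \right)}{\sum_{i=1}^{n} g\left( \left\| \frac{\varphi - \varphi_i}{h} \right\|^2 \right)} - \varphi $$

[Figure: region of interest, center of mass, and mean shift vector]

  • The uniform kernel with bandwidth $h$ for Euclidean pairwise distances:

$$ g(\varphi, \varphi_i, h) = \begin{cases} 1, & \|\varphi - \varphi_i\|^2 \le h^2 \\ 0, & \|\varphi - \varphi_i\|^2 > h^2 \end{cases} $$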
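
The basic procedure above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the function names, the stopping tolerance, and the toy 2-D data are hypothetical and do not come from the slides.

```python
import numpy as np

def mean_shift_vector(phi, points, h):
    """Mean shift vector m_h(phi) under the uniform kernel of bandwidth h."""
    # Uniform kernel: weight 1 for points with squared distance <= h^2, else 0.
    sq_dist = np.sum((points - phi) ** 2, axis=1)
    inside = sq_dist <= h ** 2
    if not np.any(inside):
        return np.zeros_like(phi)  # no neighbors in the region: do not move
    # Center of mass of the region of interest, minus the current position.
    return points[inside].mean(axis=0) - phi

def shift_to_mode(phi, points, h, tol=1e-6, max_iter=100):
    """Move phi by its mean shift vector until the step becomes negligible."""
    phi = np.asarray(phi, dtype=float)
    for _ in range(max_iter):
        step = mean_shift_vector(phi, points, h)
        phi = phi + step
        if np.linalg.norm(step) < tol:
            break
    return phi

# Toy data (hypothetical): two well-separated 2-D blobs.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
                    rng.normal(3.0, 0.1, (50, 2))])
mode = shift_to_mode(points[0], points, h=1.0)
```

Starting from a point in the first blob, the iteration settles on that blob's center of mass, because the second blob never falls inside the region of interest.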

slide-9
SLIDE 9

Mean-Shift Algorithm (cont.)

Modified

  • The mean shift vector:

$$ m_{h_i}(\varphi_i) = \frac{\sum_{l=1}^{k} \varphi_{il} \, g(\varphi_i, \varphi_{il}, h_i)}{\sum_{l=1}^{k} g(\varphi_i, \varphi_{il}, h_i)} - \varphi_i $$

  • The adaptive bandwidth parameter $h_i$ is calculated using the K-nearest-neighbor algorithm. If $\varphi_{iK}$ is the Kth nearest neighbor of $\varphi_i$, then the bandwidth is calculated as:

$$ h_i = s(\varphi_i, \varphi_{iK}) $$

  • where $s(\varphi_i, \varphi_{ik})$ is the two-covariance scoring.
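
A minimal sketch of the kNN bandwidth rule, assuming the pairwise scores are already available as a matrix in which a higher score means a closer pair. Excluding a point from its own neighbor list is an assumption, and the function name and toy scores are hypothetical.

```python
import numpy as np

def adaptive_bandwidths(scores, K):
    """h_i = score between point i and its K-th nearest (highest-scoring) neighbor.

    `scores` is an (n, n) matrix of pairwise similarity scores.
    Requires 1 <= K <= n - 1.
    """
    s = np.array(scores, dtype=float)
    np.fill_diagonal(s, -np.inf)      # assumption: a point is not its own neighbor
    ranked = -np.sort(-s, axis=1)     # each row sorted from highest to lowest score
    return ranked[:, K - 1]           # K-th highest score in each row

# Toy score matrix (hypothetical values, not taken from the slides).
scores = np.array([[9.0, 5.0, 1.0, 0.5],
                   [5.0, 9.0, 2.0, 0.1],
                   [1.0, 2.0, 9.0, 4.0],
                   [0.5, 0.1, 4.0, 9.0]])
h = adaptive_bandwidths(scores, K=2)
```

Because the bandwidth is a score rather than a distance, a larger K gives a lower threshold and therefore a wider neighborhood.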

slide-10
SLIDE 10

Mean-Shift Algorithm (cont.)

Modified

  • We select a subset of data points $S_{h_i}(\varphi_i)$ whose PLDA pairwise score with $\varphi_i$ is larger than or equal to the adaptive bandwidth $h_i$:

$$ S_{h_i}(\varphi_i) = \{ \varphi_{il} : s(\varphi_i, \varphi_{il}) \ge h_i \} $$

  • We use a mean shift weighted kernel:

$$ g(\varphi_i, \varphi_{il}, h_i) = \begin{cases} s(\varphi_i, \varphi_{il}), & s(\varphi_i, \varphi_{il}) \ge h_i \\ 0, & s(\varphi_i, \varphi_{il}) < h_i \end{cases} $$

  • The PLDA pairwise score:

$$ s(\varphi_1, \varphi_2) = \log \frac{p(\varphi_1, \varphi_2 \mid H_s)}{p(\varphi_1, \varphi_2 \mid H_d)} $$
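
One update with the score-weighted kernel might look like the sketch below. Cosine similarity is used here only as a stand-in for the PLDA (two-covariance) score, which needs a trained model; the function names and toy points are hypothetical. The sketch also assumes the selected scores are positive, since negative weights would not give a meaningful average.

```python
import numpy as np

def cosine_score(a, b):
    """Stand-in similarity; the slides use the PLDA (two-covariance) score."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def weighted_shift(phi_i, points, h_i, score_fn=cosine_score):
    """Return the new position of phi_i after one modified mean-shift update.

    Points scoring at least h_i against phi_i form the selected subset and
    are averaged with their scores as weights; all other points get weight 0.
    The mean shift vector is the returned position minus phi_i.
    """
    s = np.array([score_fn(phi_i, p) for p in points])
    g = np.where(s >= h_i, s, 0.0)     # weighted kernel g(phi_i, phi_il, h_i)
    total = g.sum()
    if total <= 0:
        return phi_i                   # no usable neighbors: leave the point alone
    return (g[:, None] * points).sum(axis=0) / total

# Toy points (hypothetical): the third one scores 0 and is excluded.
points = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
new_phi = weighted_shift(points[0], points, h_i=0.9)
```

In this toy case the updated point lands between the first two points and is unaffected by the third.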

slide-11
SLIDE 11

Speaker Clustering System

“i-vectors” → PLDA score → Mean Shift algorithm → Clustering results

* In previous work:
  I. a fixed h threshold
  II. a cosine distance instead of PLDA
  III. random mean shift

slide-12
SLIDE 12

Speaker Clustering System

Before clustering:

  • Train the UBM and TV matrix.
  • Train the PCA matrix T and the whitening transformation matrix C.
  • Calculate the low-rank i-vectors:

$$ \phi = \frac{C T \varphi}{\| C T \varphi \|} $$

  • Using the low-rank i-vectors, train the two-covariance model parameters.
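
The projection step can be illustrated as follows. This is a sketch under assumptions: the slides do not say how T and C are estimated, so an eigen-decomposition of the covariance of mean-centered training vectors is used here, and all names, dimensions, and the random training data are hypothetical.

```python
import numpy as np

def train_pca_whitening(train, dim):
    """Estimate a PCA matrix T (dim x D) and a whitening matrix C (dim x dim).

    Assumes `train` (n x D) is already mean-centered.
    """
    cov = train.T @ train / len(train)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:dim]         # indices of the `dim` largest
    T = eigvecs[:, top].T                         # rows are principal directions
    C = np.diag(1.0 / np.sqrt(eigvals[top]))      # unit variance after projection
    return T, C

def low_rank_ivector(raw, T, C):
    """Project, whiten, and length-normalize: C T raw / ||C T raw||."""
    y = C @ T @ raw
    return y / np.linalg.norm(y)

# Random stand-in for training i-vectors (hypothetical, 20-dimensional).
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 20)) * np.linspace(3.0, 0.5, 20)
train -= train.mean(axis=0)
T, C = train_pca_whitening(train, dim=5)
low_rank = low_rank_ivector(train[0], T, C)
```

After this step every low-rank vector has unit length, so only its direction carries information.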

slide-13
SLIDE 13

Speaker Clustering System

Given a set of speech segments, cluster them according to the following steps:

  • 1. For each speech segment, extract the i-vectors $\{\varphi_i\}$.
  • 2. Calculate the low-rank i-vectors $\{\phi_i\}$.
  • 3. Apply two-covariance score mean-shift.
  • 4. Merge all shifted points according to Euclidean distance with a fixed threshold.
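
The final merge step could be sketched as below. The slides only state that shifted points are merged by Euclidean distance with a fixed threshold; treating this as connected-components (single-linkage style) merging is an assumption, and the function name and toy coordinates are hypothetical.

```python
import numpy as np

def merge_shifted_points(shifted, threshold):
    """Label points so that any two closer than `threshold` share a cluster.

    Chains of close points end up in the same cluster (connected components).
    """
    n = len(shifted)
    labels = np.arange(n)                        # every point starts alone
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(shifted[i] - shifted[j]) < threshold:
                # Relabel the whole cluster of j with the label of i.
                labels[labels == labels[j]] = labels[i]
    # Renumber the surviving labels as 0 .. (number of clusters - 1).
    _, compact = np.unique(labels, return_inverse=True)
    return compact

# Toy shifted points (hypothetical): two converged groups.
shifted = np.array([[0.0, 0.0], [0.01, 0.0], [5.0, 5.0], [5.0, 5.01]])
cluster_ids = merge_shifted_points(shifted, threshold=0.1)
```

The number of distinct labels returned is the number of detected speakers for that run.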

slide-14
SLIDE 14

Experiments and Results

Experimental Setup

Experiments on telephone conversations

  • NIST-2008 is cut into segments according to a given statistic.
  • Average segment length: 2.5 sec
  • Average number of segments per speaker: 33

Clustering evaluation

  • 1. Average Speaker Purity (ASP).
  • 2. Average Cluster Purity (ACP).
  • 3. K: $K = \sqrt{acp \cdot asp}$.
  • 4. Average Number of Detected Speakers (ANDS).
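
The slides do not spell out how the purities are computed. The sketch below assumes the commonly used definitions, counting segments: for each cluster (or speaker), sum the squared counts of its speakers (or clusters), divide by its size, and average over all segments. The function name and toy labels are hypothetical.

```python
from collections import Counter
from math import sqrt

def purity_scores(cluster_ids, speaker_ids):
    """Return (acp, asp, K) for parallel lists of cluster and speaker labels."""
    n = len(cluster_ids)
    # n_ij: number of segments of speaker j assigned to cluster i.
    joint = Counter(zip(cluster_ids, speaker_ids))
    per_cluster = Counter(cluster_ids)
    per_speaker = Counter(speaker_ids)
    acp = sum(c * c / per_cluster[i] for (i, _), c in joint.items()) / n
    asp = sum(c * c / per_speaker[j] for (_, j), c in joint.items()) / n
    return acp, asp, sqrt(acp * asp)

# Toy labels (hypothetical): speaker "b" is split over three clusters.
clusters = [0, 0, 0, 1, 1, 2]
speakers = ["a", "a", "b", "b", "b", "b"]
acp, asp, k = purity_scores(clusters, speakers)
```

The values are fractions between 0 and 1; multiplying by 100 gives percentages on the same scale as the results tables.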

slide-15
SLIDE 15


Experiments and Results (cont.)

Bandwidth parameter h (for 30 speakers)

Cosine based random mean shift clustering: adaptive threshold using kNN VS a fixed threshold

slide-16
SLIDE 16


Experiments and Results (cont.)

Mean shift point-selection configuration (for 30 speakers)

Cosine based mean shift clustering with adaptive threshold: full mean shift VS random mean shift

slide-17
SLIDE 17


Experiments and Results (cont.)

PLDA based Mean Shift (for 30 speakers)

Clustering with adaptive threshold: PLDA based mean shift VS cosine based mean shift

slide-18
SLIDE 18


Experiments and Results (cont.)

PLDA training (for 30 speakers)

PLDA based mean shift: PLDA model trained on short segments VS PLDA model trained on long segments

slide-19
SLIDE 19


Experiments and Results (cont.)

Summary of Mean Shift configuration (for 30 speakers)

Comparing the K values of the mean shift configurations.

slide-20
SLIDE 20


Experiments and Results (cont.)

Summary of Mean Shift configuration (for 30 speakers)

Comparing the average number of detected speakers (ANDS) of mean shift configurations.

slide-21
SLIDE 21

Experiments and Results (cont.)

Influence of the Population Size (Baseline System)

Table 1: Results for different numbers of speakers for the cosine based mean shift (baseline system)

Number of Speakers    h       ACP     ASP     K       ANDS
3                     0.35    92.2    80.1    85.7    6.1
7                     0.40    89.5    71.6    79.9    21.1
15                    0.45    77.6    63.3    70.0    60.6
22                    0.50    85.0    57.6    69.9    136.6
30                    0.50    81.7    53.2    65.9    195.0
60                    0.55    84.6    44.3    61.2    614.1
188                   0.55    68.4    42.8    54.1    1742.1

slide-22
SLIDE 22

Experiments and Results (cont.)

Influence of the Population Size (Proposed System)

Table 2: Results for different numbers of speakers for the PLDA based mean shift (proposed system)

Number of Speakers    k (kNN)    ACP     ASP     K       ANDS
3                     19         90.0    71.3    79.8    5.0
7                     17         84.8    67.5    75.5    11.2
15                    15         86.6    63.6    74.1    26.9
22                    15         86.6    65.3    75.1    36.4
30                    17         80.8    64.3    72.1    46.6
60                    17         73.8    61.1    67.2    90.0
188                   17         61.4    53.1    57.1    283.0

slide-23
SLIDE 23

Experiments and Results (cont.)

Baseline VS New system

Results of the PLDA based mean shift (proposed system), with the cosine based mean shift (baseline system) results in parentheses

Number of Speakers    K              ANDS
3                     79.8 (85.7)    5.0 (6.1)
7                     75.5 (79.9)    11.2 (21.1)
15                    74.1 (70.0)    26.9 (60.6)
22                    75.1 (69.9)    36.4 (136.6)
30                    72.1 (65.9)    46.6 (195.0)
60                    67.2 (61.2)    90.0 (614.1)
188                   57.1 (54.1)    283.0 (1742.1)
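
To make the ANDS comparison easier to read, the short sketch below divides each ANDS value in the table by the true number of speakers, so a ratio of 1.0 would mean the detected count is exactly right. The numbers are copied from the table; the variable names are hypothetical.

```python
# (true number of speakers, ANDS of proposed system, ANDS of baseline system)
rows = [(3, 5.0, 6.1), (7, 11.2, 21.1), (15, 26.9, 60.6), (22, 36.4, 136.6),
        (30, 46.6, 195.0), (60, 90.0, 614.1), (188, 283.0, 1742.1)]

# Ratio of detected to true speakers for each system.
ratios = [(n, proposed / n, baseline / n) for n, proposed, baseline in rows]

for n, p, b in ratios:
    print(f"{n:>3} speakers: proposed {p:.2f}x, baseline {b:.2f}x")
```

Both systems detect more speakers than are present, but the proposed system's ratio is the smaller of the two in every row.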

slide-24
SLIDE 24


Summary

  • While the proposed system is more time consuming, it outperforms the baseline system in the following aspects:
  • 1. It yields better results when clustering large numbers of speakers.
  • 2. It is more robust to changes in the number of speakers.
  • 3. Almost no bandwidth adjustment is needed.
  • 4. The average number of detected speakers is far more accurate.

slide-25
SLIDE 25

Thanks
