SLIDE 1

Out-of-set i-vector selection for open-set language identification

Hamid Behravan, Tomi Kinnunen, Ville Hautamäki

School of Computing University of Eastern Finland

Odyssey 2016 June 21-24 Bilbao

SLIDE 2

Test utterance

Target languages: English, Spanish, Persian, Finnish, Swedish

Which language? → Spanish

Closed-set: a test segment corresponds to one of the known target (in-set) languages

SLIDE 3

Test utterance

Target languages: English, Spanish, Persian, Finnish, Swedish

Which language? → one of the target languages, or the out-of-set model

Open-set: the language of a test segment might not be any of the in-set languages

Non-target and unknown languages are grouped as out-of-set

SLIDE 4

One way to perform open-set LID is to train an out-of-set model.

LID: language identification

SLIDE 5

SLIDE 6

What are good out-of-set candidates?

[Figure: in-set data (+) and good candidates for out-of-set data (×) in the i-vector space]

  • Out-of-set candidates should come from different linguistic language families
  • Out-of-set candidates should be close to the in-set languages, while others lie far away [Zhang and Hansen, 2014]

Q. Zhang and J. H. L. Hansen, “Training candidate selection for effective rejection in open-set language identification,” in Proc. of SLT, 2014, pp. 384–389.

SLIDE 7

Out-of-set candidate detection methods (1): One-class SVM

Idea: enclose the in-set data with a hypersphere and classify new data as normal (+) if it falls within the hypersphere, and otherwise as out-of-set (−).
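The hypersphere idea can be sketched with scikit-learn's `OneClassSVM` on synthetic data; the toy 4-dimensional "i-vectors", the RBF kernel, and the `nu` setting here are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
inset = rng.normal(0.0, 1.0, size=(300, 4))   # toy in-set "i-vectors"
far = rng.normal(4.0, 1.0, size=(5, 4))       # toy far-away candidates

# Fit the enclosing boundary on in-set data only; nu upper-bounds the
# fraction of training points allowed to fall outside it.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05).fit(inset)

print(ocsvm.predict(far))        # -1 = out-of-set candidate
print(ocsvm.predict(inset[:5]))  # mostly +1 = in-set
```

Points far from all training data get a negative decision value and are flagged as out-of-set candidates.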
SLIDE 8

Out-of-set candidate detection methods (2)

  • K-nearest neighbour (kNN): score a candidate by its distances to its K nearest in-set neighbours (d1, d2, d3 in the K = 3 illustration)
  • Distance to class mean: score a candidate by its distance to the mean of the in-set i-vectors
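Both baseline scores fit in a few lines of NumPy. This is a sketch on synthetic data; averaging the k nearest distances is one common variant and an assumption here, not necessarily the paper's exact scoring:

```python
import numpy as np

def knn_score(x, inset, k=3):
    """Outlier score: mean distance to the k nearest in-set i-vectors."""
    dists = np.sort(np.linalg.norm(inset - x, axis=1))
    return dists[:k].mean()

def class_mean_score(x, inset):
    """Outlier score: distance to the mean of the in-set i-vectors."""
    return np.linalg.norm(x - inset.mean(axis=0))

rng = np.random.default_rng(0)
inset = rng.normal(0.0, 1.0, size=(200, 4))   # toy in-set i-vectors
near = rng.normal(0.0, 1.0, size=4)           # looks like in-set data
far = rng.normal(5.0, 1.0, size=4)            # a likely out-of-set point
print(knn_score(near, inset), knn_score(far, inset))
print(class_mean_score(near, inset), class_mean_score(far, inset))
```

In both cases a larger score indicates a better out-of-set candidate.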
SLIDE 9

Proposed method: non-parametric Kolmogorov-Smirnov test

Idea: Estimate whether two samples have the same underlying distribution by computing the maximum difference between their empirical cumulative distribution functions (ECDFs):

[Figure: two ECDFs; the KS statistic is their maximum difference]
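A minimal NumPy implementation of the two-sample KS statistic, evaluating both ECDFs on the pooled sample:

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: the maximum absolute difference
    between the empirical CDFs of samples a and b."""
    grid = np.sort(np.concatenate([a, b]))
    ecdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    ecdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.abs(ecdf_a - ecdf_b).max()

print(ks_statistic(np.array([1., 2., 3.]), np.array([1., 2., 3.])))    # 0.0
print(ks_statistic(np.array([1., 2., 3.]), np.array([10., 11., 12.])))  # 1.0
```

Identical samples give 0, fully separated samples give 1, matching the intuition that the statistic measures how different the two underlying distributions are.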

SLIDE 10

Adapting the Kolmogorov-Smirnov test to our open-set LID task

Goal: give each unlabeled i-vector an outlier score.

[Diagram: compute the ECDFs, compute the KS statistics, and average over all KS values]

SLIDE 11

Computing the outlier score for an unlabeled i-vector

[Diagram: the per-language average KS values are combined by taking the minimum over the in-set languages]
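One reading of the two slides above, as a hedged sketch: for each in-set language, average the KS statistic between the unlabeled i-vector and every i-vector of that language, then take the minimum over languages as the outlier score. The data is synthetic and the exact pooling in the paper may differ:

```python
import numpy as np

def ks_stat(a, b):
    """Two-sample KS statistic over the pooled sample."""
    grid = np.sort(np.concatenate([a, b]))
    fa = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    fb = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.abs(fa - fb).max()

def kse_score(w, languages):
    """Outlier score for unlabeled i-vector w: per-language average KS
    against all i-vectors of that language, minimized over languages.
    Small = close to some in-set language; near 1 = likely out-of-set."""
    return min(np.mean([ks_stat(w, v) for v in vecs]) for vecs in languages)

rng = np.random.default_rng(0)
dim = 100
languages = [rng.normal(mu, 1.0, size=(30, dim)) for mu in (0.0, 2.0)]
w_in = rng.normal(0.0, 1.0, size=dim)    # matches the first language
w_oos = rng.normal(5.0, 1.0, size=dim)   # matches no in-set language
print(kse_score(w_in, languages), kse_score(w_oos, languages))
```

This reproduces the behaviour shown on the next slide: scores near zero for in-set data and near one for out-of-set data.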

SLIDE 12

KSE values within each in-set language are close to zero, whereas for out-of-set data they tend towards one.

Distribution of in-set and OOS KSE values for two different languages: a) Dari and b) French.

SLIDE 13

So far, four methods have been presented for out-of-set data selection: one-class SVM, kNN, distance to class mean, and KSE.

SLIDE 14

NIST language i-vector challenge 2015 corpus

Distribution of training, development and test sets from the NIST 2015 language i-vector machine learning challenge.

  • 300 i-vectors for each of the 50 target languages
  • i-vectors are of dimensionality 400
  • i-vectors are further post-processed by within-class covariance normalization (WCCN) and linear discriminant analysis (LDA)
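The post-processing step can be sketched on synthetic data. All dimensions below are toy values, and the LDA-then-WCCN ordering is an illustrative assumption rather than the paper's confirmed pipeline:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_lang, per_lang, dim = 5, 60, 40   # toy stand-ins for 50 languages x 400 dims
means = rng.normal(0, 3, size=(n_lang, 1))
X = np.vstack([rng.normal(m, 1.0, size=(per_lang, dim)) for m in means])
y = np.repeat(np.arange(n_lang), per_lang)

# LDA: project onto the (n_lang - 1) most language-discriminative directions
Z = LinearDiscriminantAnalysis(n_components=n_lang - 1).fit_transform(X, y)

# WCCN: whiten with the inverse of the average within-class covariance,
# so that within-language scatter becomes isotropic
W = sum(np.cov(Z[y == c], rowvar=False) for c in range(n_lang)) / n_lang
B = np.linalg.cholesky(np.linalg.inv(W))
Z_wccn = Z @ B
print(Z_wccn.shape)  # (300, 4)
```

After the transform, the average within-language covariance of `Z_wccn` is the identity, which is exactly what WCCN is designed to achieve.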

SLIDE 15

Segmenting train data into three portions for out-of-set evaluation

All portions are subsets of the original NIST 2015 LRE i-vector challenge training set.


SLIDE 16

Example of test utterance labeling for the evaluation of the out-of-set (OOS) data detection task, given multiple in-set languages.

SLIDE 17

KSE outperforms kNN and one-class SVM by 14% and 16% relative EER reductions, respectively.


SLIDE 18

Fusion of KSE with the baseline OOS detection methods. Fusion of KSE with one-class SVM yields the best performance.

SLIDE 19

Data selected for out-of-set modeling     Identification cost
Random (1067)                             32.11
Training (15000)                          32.61
Development (6431)                        31.23
Training + development (21431)            31.74
Proposed selection method (1067)          26.61
Closed-set (no OOS model)                 37.23

  • Results are reported from the NIST evaluation online system
  • Numbers in parentheses indicate the amount of data selected for OOS modeling
  • The back-end is based on an SVM classifier
  • NIST baseline result: 39.59

The lowest identification cost is 26.61, outperforming the NIST baseline system by a 33% relative improvement.

SLIDE 20

Open-set LID results for different out-of-set data selection methods. KSE outperforms the other methods.

  • The results are reported from the NIST evaluation online system.
  • Out of 1500 out-of-set test utterances, 1012 are classified correctly as out-of-set using KSE.

SLIDE 21

A simple and effective technique to find out-of-set data in the i-vector space.

Open-set LID with the proposed OOS model: a 33% relative reduction in identification cost over the NIST baseline (29% relative to closed-set LID).