
Out-of-set i-vector selection for open-set language identification

Out-of-set i-vector selection for open-set language identification. Hamid Behravan, Tomi Kinnunen, Ville Hautamäki. School of Computing, University of Eastern Finland. Odyssey 2016, June 21-24, Bilbao.


  1. Out-of-set i-vector selection for open-set language identification. Hamid Behravan, Tomi Kinnunen, Ville Hautamäki. School of Computing, University of Eastern Finland. Odyssey 2016, June 21-24, Bilbao.

  2. Closed-set: a test segment corresponds to one of the known target (in-set) languages. [Figure: a test utterance is matched against the target languages English, Spanish, Persian, Finnish, and Swedish; the answer to "Which language?" is Spanish.]

  3. Open-set: the language of a test segment might not be any of the in-set languages. [Figure: a test utterance is matched against the target languages (English, Spanish, Persian, Finnish, Swedish) and an out-of-set model covering non-target and unknown languages; the answer to "Which language?" is one of the target languages or out-of-set.]

  4. One way to perform open-set LID is to train an out-of-set model. (LID: language identification.)

  5. What are good out-of-set candidates?
     - Out-of-set candidates should come from different language families.
     - Some out-of-set candidates should be close to the in-set languages, while others should be far away [Zhang and Hansen, 2014].
     [Figure: in-set data for languages A and B, with good candidates for out-of-set data marked.]
     Q. Zhang and J. H. L. Hansen, “Training candidate selection for effective rejection in open-set language identification,” in Proc. of SLT, 2014, pp. 384–389.

  6. Out-of-set candidate detection methods (1). One-class SVM. Idea: enclose the data with a hypersphere and classify new data as normal (+) if it falls within the hypersphere, and as out-of-set (-) otherwise.
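
The hypersphere idea can be illustrated with a deliberately simplified stand-in. This is not a one-class SVM: it places the centre at the data mean and picks the radius from a distance quantile, whereas a real one-class SVM learns the boundary by optimization. The function names and the `keep_fraction` parameter are hypothetical, chosen only for this sketch.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def fit_hypersphere(points, keep_fraction=0.9):
    """Return (center, radius) so that roughly keep_fraction of points lie inside.

    Simplified stand-in for a one-class SVM: the center is the mean and the
    radius is a quantile of the distances to it (an assumption of this sketch).
    """
    center = centroid(points)
    dists = sorted(math.dist(p, center) for p in points)
    idx = max(0, math.ceil(keep_fraction * len(dists)) - 1)
    return center, dists[idx]


def is_in_set(x, center, radius):
    """Normal (+) if x falls within the hypersphere, out-of-set (-) otherwise."""
    return math.dist(x, center) <= radius


# Toy 2-D "i-vectors"; real i-vectors in the slides have 400 dimensions.
train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
center, radius = fit_hypersphere(train, keep_fraction=1.0)
```

With this toy data, a point such as `[0.5, 0.6]` is accepted as in-set and `[3.0, 3.0]` is rejected as out-of-set.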

  7. Out-of-set candidate detection methods (2). - K-nearest neighbour (kNN), illustrated with K = 3 and distances d1, d2, d3. - Distance to class mean.
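
A minimal sketch of these two baselines, under assumptions the slide does not spell out: the kNN score is taken here as the mean Euclidean distance to the k nearest in-set vectors, and the class-mean score as the distance to the closest per-language mean. The function names and toy data are hypothetical.

```python
import math


def knn_outlier_score(x, in_set_vectors, k=3):
    """Mean Euclidean distance from x to its k nearest in-set vectors."""
    dists = sorted(math.dist(x, v) for v in in_set_vectors)
    nearest = dists[:k]
    return sum(nearest) / len(nearest)


def class_mean_score(x, vectors_by_language):
    """Distance from x to the closest per-language mean vector."""
    best = math.inf
    for vectors in vectors_by_language.values():
        dim = len(vectors[0])
        mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
        best = min(best, math.dist(x, mean))
    return best


# Two toy in-set languages in 2-D.
languages = {
    "A": [[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]],
    "B": [[10.0, 10.0], [10.0, 12.0], [12.0, 10.0], [12.0, 12.0]],
}
all_vectors = [v for vecs in languages.values() for v in vecs]
```

In both cases a larger score marks a stronger out-of-set candidate: a point between the two clusters, such as `[6.0, 6.0]`, scores higher than a point inside cluster A.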

  8. Proposed method: the non-parametric Kolmogorov-Smirnov test. Idea: estimate whether two samples have the same underlying distribution by computing the maximum difference between their empirical cumulative distribution functions (ECDFs). [Figure: two ECDFs with their maximum difference (KS) marked.]
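
The two-sample KS statistic described above can be sketched in a few lines of standard-library Python. The function names are illustrative; the computation uses the fact that the largest gap between two step-shaped ECDFs occurs at one of the observed values.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return F(t): the fraction of sample values less than or equal to t."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda t: bisect_right(ordered, t) / n


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |F_a(t) - F_b(t)| over all observed values t."""
    f_a, f_b = ecdf(sample_a), ecdf(sample_b)
    points = list(sample_a) + list(sample_b)
    return max(abs(f_a(t) - f_b(t)) for t in points)
```

Identical samples give 0, fully separated samples give 1, and partially overlapping samples fall in between.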

  9. Adapting the Kolmogorov-Smirnov test to our open-set LID task. Goal: give each unlabeled i-vector an outlier score. [Figure: compute ECDFs, then take the average over all KS values.]

  10. Computing the outlier score for an unlabeled i-vector. [Figure: the score is obtained with a Min operation.]
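
The slides indicate only that ECDFs are computed, KS values are averaged, and a minimum is taken. The sketch below fills in the rest with assumptions that are not confirmed by the slides: each ECDF is built from Euclidean distances to the vectors of one in-set language, the averaged KS value (called `kse` here) compares the unlabeled vector with every member of that language, and the minimum runs over languages. All names are hypothetical.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Maximum absolute difference between the ECDFs of samples a and b."""
    sa, sb = sorted(a), sorted(b)
    return max(
        abs(bisect_right(sa, t) / len(sa) - bisect_right(sb, t) / len(sb))
        for t in sa + sb
    )


def kse(x, language_vectors):
    """Average KS value between x's distance sample and each member's sample.

    Assumption: a vector's sample is its set of distances to the language's vectors.
    """
    x_dists = [math.dist(x, v) for v in language_vectors]
    ks_values = []
    for i, ref in enumerate(language_vectors):
        ref_dists = [math.dist(ref, v)
                     for j, v in enumerate(language_vectors) if j != i]
        ks_values.append(ks_statistic(x_dists, ref_dists))
    return sum(ks_values) / len(ks_values)


def outlier_score(x, vectors_by_language):
    """Assumed scoring: the minimum averaged KS value over all in-set languages."""
    return min(kse(x, vecs) for vecs in vectors_by_language.values())


# Two toy languages laid out on 3x3 grids in 2-D.
grid = [[float(i), float(j)] for i in range(3) for j in range(3)]
languages = {"A": grid, "B": [[p[0] + 10.0, p[1] + 10.0] for p in grid]}
```

Under these assumptions a vector lying inside one language's cluster gets a score below 1, while a vector far from every language gets a score of exactly 1.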

  11. KSE values within each language are close to zero, whereas they tend towards 1 for out-of-set data. [Figure: distribution of in-set and OOS KSE values for two different languages, (a) Dari and (b) French.]

  12. So far, four methods have been presented for out-of-set data selection (one-class SVM, kNN, distance to class mean, and KSE).

  13. NIST language i-vector challenge 2015 corpus. [Table: distribution of training, development and test sets from the NIST 2015 language i-vector machine learning challenge.]
      - 300 i-vectors for each of the 50 target languages.
      - The i-vectors have dimensionality 400.
      - The i-vectors are further post-processed by within-class covariance normalization (WCCN) and linear discriminant analysis (LDA).

  14. Segmenting the training data into three portions for out-of-set evaluation. All portions are subsets of the original NIST 2015 LRE i-vector challenge training set.

  15. Example of test utterance labeling for evaluating the out-of-set (OOS) data detection task given multiple in-set languages.

  16. KSE outperforms kNN and one-class SVM, with relative EER reductions of 14% and 16%, respectively.
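
For readers unfamiliar with the metric, the equal error rate (EER) is the operating point where the miss rate and the false-alarm rate coincide. The following is a rough sketch of one way to approximate it from outlier scores by sweeping a threshold; it is not the evaluation code behind the slide, and the function name and score convention (higher means more likely out-of-set) are assumptions.

```python
def equal_error_rate(oos_scores, in_set_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Convention assumed here: a score >= threshold means "declared out-of-set".
    """
    best_gap, eer = None, None
    for thr in sorted(set(oos_scores) | set(in_set_scores)):
        # Miss: a true out-of-set sample scored below the threshold.
        miss = sum(s < thr for s in oos_scores) / len(oos_scores)
        # False alarm: a true in-set sample scored at or above the threshold.
        false_alarm = sum(s >= thr for s in in_set_scores) / len(in_set_scores)
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer
```

On the toy scores `[0.9, 0.8, 0.7, 0.4]` (out-of-set) and `[0.1, 0.2, 0.3, 0.6]` (in-set), one sample of each kind lands on the wrong side of the best threshold, so the EER is 0.25.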

  17. Fusion of KSE with the baseline OOS detection methods. Fusing KSE with one-class SVM yields the best performance.

  18. The lowest identification cost is 26.61, a 33% relative improvement over the NIST baseline system.

      Data selected for out-of-set modeling    Identification cost
      Random (1067)                            32.11
      Training (15000)                         32.61
      Development (6431)                       31.23
      Training + development (21431)           31.74
      Proposed selection method (1067)         26.61
      Closed-set (no OOS model)                37.23

      - Results are reported from the NIST evaluation online system.
      - Numbers in parentheses indicate the amount of data selected for OOS modeling.
      - The back-end is based on an SVM classifier.
      - NIST baseline result: 39.59.
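
The 33% figure can be checked directly from the costs on this slide. The helper below is an illustrative name, not something from the presentation; it computes the relative reduction of the proposed method's cost against a reference cost.

```python
def relative_reduction(reference, proposed):
    """Relative cost reduction of `proposed` against `reference`, in percent."""
    return 100.0 * (reference - proposed) / reference


nist_baseline = 39.59   # NIST baseline result
closed_set = 37.23      # closed-set system, no OOS model
proposed = 26.61        # proposed selection method

vs_baseline = relative_reduction(nist_baseline, proposed)  # about 32.8
vs_closed_set = relative_reduction(closed_set, proposed)   # about 28.5
```

Against the NIST baseline the reduction is about 32.8%, which rounds to the reported 33%; against the closed-set row of the same table it is about 28.5%.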

  19. Open-set LID results for different out-of-set data selection methods. KSE outperforms the other methods.
      - The results are reported from the NIST evaluation online system.
      - Out of 1500 out-of-set samples, 1012 are correctly classified as out-of-set using KSE.

  20. A simple and effective technique to find out-of-set data in the i-vector space. Open-set LID achieves a 33% relative reduction in identification cost over closed-set LID.
