SLIDE 1

BAT System Description for NIST LRE 2015

BUT+Agnitio+Torino

Oldrich Plchot, Pavel Matejka, Radek Fer, Ondrej Glembek, Ondrej Novotny, Jan Pesan, Lukas Burget, Martin Karafiat, Karel Vesely, Lucas Ondel, Santosh Kesiraju, Frantisek Grezl, Sri Harish Mallidi (JHU), Ruizhi Li (JHU), Niko Brummer, Albert Swart, Sandro Cumani

June 22, Bilbao, Odyssey 2016

SLIDE 2

Data

  • Fixed training condition

○ Train - 60% of training data, short cuts generated evenly from 3 to 30 seconds
○ Dev - 40% of training data, short cuts ranging from 3 to 30 seconds with uniform distribution

  • Open training condition

○ All relevant data we managed to find ;) (no Babel data for i-vectors, just for BN features)
○ Main additions are KALAKA-3 (European Spanish, British English) and Arabic - Al Jazeera free corpus

  • Details in our system description / Odyssey paper
SLIDE 3

Stacked Bottleneck features (SBN)

  • Based on a hierarchy of two NNs. Bottlenecks from the first network are stacked in time and used as inputs to the second NN.

  • Bottlenecks from the second NN are the final features.
  • Fixed condition training data

○ Switchboard with ~7k triphone state targets
○ LRE15 training data with labels obtained using an acoustic unit discovery tool (200 3-state units)

  • Open condition training data

○ 17 languages from Babel project (IARPA) as Multilingual BN - with ~100 phone states per language
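The time-stacking step between the two networks can be illustrated with a short sketch; the context offsets and the 80-dimensional bottleneck here are illustrative, not the exact configuration used in the system:

```python
import numpy as np

def stack_in_time(bn, offsets=(-10, -5, 0, 5, 10)):
    """Stack first-stage bottleneck frames at the given time offsets
    (offsets are illustrative) to form the input of the second NN.
    bn: (T, D) matrix of bottleneck features -> (T, D * len(offsets))."""
    T, D = bn.shape
    # for each frame, gather the frames at the context offsets,
    # clipping at the utterance boundaries
    idx = np.clip(np.arange(T)[:, None] + np.array(offsets)[None, :], 0, T - 1)
    return bn[idx].reshape(T, D * len(offsets))

# toy example: 100 frames of 80-dim bottlenecks -> 400-dim stacked input
bn1 = np.random.randn(100, 80)
stacked = stack_in_time(bn1)
```

The second NN then consumes these stacked vectors, and its own bottleneck layer yields the final SBN features.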

SLIDE 4

General system overview

  • i-vector based systems:

○ Features:
  ■ DNN bottlenecks trained on
    • Switchboard English (Fixed cond.)
    • Babel data – multilingual bottleneck features (Open cond.)
  ■ MFCC-SDC+PLLR (phone LLH ratios)
○ 2048 Full or Diagonal GMM/UBM, 600-dimensional i-vectors
○ Gaussian Linear Classifier (GLC) seems sufficient
  ■ Including i-vector uncertainty in scoring helps

  • Frame Level Sequence Summarizing NN (SSNN)
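A minimal sketch of the Gaussian Linear Classifier backend mentioned above: one mean per language and a single shared within-class covariance, which makes the scores linear in the i-vector. Function names and the toy data are ours, not the authors' code, and the uncertainty-aware variant is not shown:

```python
import numpy as np

def train_glc(X, y, n_classes):
    """Gaussian Linear Classifier: per-language mean, shared
    within-class covariance (illustrative sketch)."""
    D = X.shape[1]
    means = np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])
    W = np.zeros((D, D))
    for c in range(n_classes):
        Xc = X[y == c] - means[c]
        W += Xc.T @ Xc
    P = np.linalg.inv(W / len(X))  # shared precision matrix
    return means, P

def glc_scores(X, means, P):
    """Per-class log-likelihoods up to a class-independent constant;
    linear in X because the covariance is shared across classes."""
    return X @ P @ means.T - 0.5 * np.einsum('cd,de,ce->c', means, P, means)

# toy check on separable 2-D "i-vectors" for two languages
rng = np.random.default_rng(0)
X0 = rng.normal([0, 0], 0.3, (50, 2))
X1 = rng.normal([3, 3], 0.3, (50, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)
means, P = train_glc(X, y, 2)
pred = glc_scores(X, means, P).argmax(axis=1)
```

In the real system, X would be 600-dimensional i-vectors rather than this 2-D toy data.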
SLIDE 5

Fusion with Prior-weighted Logistic Regression

  • Fusion is trained on dev data in score domain
  • One weight per system and one bias per language
  • Cluster prior: For the data of each cluster, we used a cluster-specific prior, with zero probabilities for out-of-cluster languages and equal weights within the cluster.

  • Alternative system to allow between-cluster analysis: a uniform (flat) prior over all languages
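The fusion parameterization above, one scalar weight per system and one bias per language, can be sketched as prior-weighted multiclass logistic regression. The plain gradient-descent optimizer and the exact form of the prior weighting are our assumptions for illustration:

```python
import numpy as np

def fuse(scores, alpha, beta):
    """scores: (S, N, L) stack of per-system score matrices.
    One scalar weight per system (alpha), one bias per language (beta)."""
    return np.tensordot(alpha, scores, axes=1) + beta  # (N, L)

def train_fusion(scores, labels, prior, lr=0.5, iters=300):
    """Prior-weighted multiclass logistic regression (sketch).
    `prior` reweights each trial by the prior of its true language,
    so a cluster prior with zeros outside the cluster drops those trials."""
    S, N, L = scores.shape
    alpha, beta = np.ones(S), np.zeros(L)
    Y = np.eye(L)[labels]
    w = prior[labels] / prior[labels].sum()  # per-trial weight
    for _ in range(iters):
        f = fuse(scores, alpha, beta)
        p = np.exp(f - f.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        g = (p - Y) * w[:, None]             # weighted softmax gradient
        alpha -= lr * np.einsum('snl,nl->s', scores, g)
        beta -= lr * g.sum(axis=0)
    return alpha, beta
```

With a cluster-specific prior, trials of out-of-cluster languages get zero weight and the fusion is effectively trained per cluster.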

SLIDE 6

Fixed Training Condition

Single systems
  System name                   classf   DEV cavg*   EVL cavg
  SBN80-SWB1-KALDI-CD           GLCCOV   2.41
  SBN80-SWB1-CD                 NN       2.80
  SDC-PLLR-CD                   GLC      4.72
  SBN80-AUTO600-KALDI-CD        GLCCOV   5.46
  SSNN / Alternate 2            NN       10.46
  SBN80-SWB1-KALDI-CD / Alt 3   GLC      2.31

Fusion
  System name   DEV cavg*   EVL cavg/cavg*
  Primary       1.9
  Alternate1    1.24

SLIDE 7

Fixed Training Condition

  • Eval: Single best system better than Primary fusion
  • Calibration

○ Almost no calibration loss on DEV
○ Fairly large calibration loss on EVAL

Single systems
  System name                   classf   DEV cavg*   EVL cavg
  SBN80-SWB1-KALDI-CD           GLCCOV   2.41        16.9
  SBN80-SWB1-CD                 NN       2.80        19.9
  SDC-PLLR-CD                   GLC      4.72        22.0
  SBN80-AUTO600-KALDI-CD        GLCCOV   5.46        27.0
  SSNN / Alternate 2            NN       10.46       35.0
  SBN80-SWB1-KALDI-CD / Alt 3   GLC      2.31        18.48

Fusion
  System name   DEV cavg*   EVL cavg/cavg*
  Primary       1.9         18.1 / 13.5
  Alternate1    1.24        19.4 / 13.4

SLIDE 8

Cluster-dependent i-vector

  • Average of scores from 6 systems, where
  • UBM is trained only on data in a given cluster

Fixed Training Condition
  System name                DEV cavg*   EVAL cavg   EVAL cavg*
  SBN80-SWB1-KALDI           2.9         20.1        16.2
  SBN80-SWB1-KALDI-CD        2.5         19.7        15.4
  SBN80-SWB1-KALDI-CD diag   2.3         18.5        14.9

SLIDE 9

Sequence Summarizing NN

SLIDE 10

Open Training Condition

Single systems
  System name                   classf   DEV cavg*   EVL cavg
  SSNN                          NN       30.0
  ML-17-SBN-CD                  GLC      8.8
  MultilangRDT                  GLC      10.4
  SBN80-SWB1-KALDI-CD           GLC      10.4
  SDC-PLLR-CD                   NN       12.7
  SBN80-AUTO600-KALDI           NN       15.6
  ML-17-SBN (trained on Open)   GLCCOV   8.9

Fusion
  System name   DEV cavg*   EVL cavg/cavg*
  Primary       7.14
  Alternate1    7.15

SLIDE 11

Open Training Condition

  • Single best system trained fully on Open Training condition better than fusion

Single systems
  System name                   classf   DEV cavg*   EVL cavg
  SSNN                          NN       30.0        41.3
  ML-17-SBN-CD                  GLC      8.8         13.9
  MultilangRDT                  GLC      10.4        13.6
  SBN80-SWB1-KALDI-CD           GLC      10.4        17.6
  SDC-PLLR-CD                   NN       12.7        21.4
  SBN80-AUTO600-KALDI           NN       15.6        25.0
  ML-17-SBN (trained on Open)   GLCCOV   8.9         12.0

Fusion
  System name   DEV cavg*   EVL cavg/cavg*
  Primary       7.14        14.1 / 10.3
  Alternate1    7.15        14.1 / 10.4

SLIDE 12

Analysis of training data

  • Analysis of using different training data for the UBM/i-vector extractor and the classifier
  • Important to train both the i-vector extractor and the classifier on the Open dataset

[Figure: results for combinations of UBM/i-vector and classifier training data (F = Fixed, O = Open), with the submitted combination marked]

SLIDE 13

Comparison of different features

  • Fixed Training Condition
  • All systems use a 2048-Gaussian full-covariance UBM, 600-dimensional i-vectors and a Gaussian classifier
  • Violates the fixed data condition (post-eval analysis only)

[Figure: EVAL cavg for different features; values shown: 16.1, 20.1, 19.7, 22.1, 28.9, 23.8 (per-feature labels not preserved in this transcript)]

SLIDE 14

French cluster disaster

  • Radio vs. telephone in DEV - most probably overtrained on channel
  • Channel dominates on the EVAL data
  • Calibration on EVAL data cannot fix a wrong classifier
SLIDE 15

Comparison of different i-vector classifiers

  • Different classifiers perform similarly
  • Gaussian Linear Classifier (GLC)
  • Language Dependent Ivector (LDI)
  • Multiclass Multivariate Fully Bayesian Gaussian Classifier (MMFBG)
  • Neural Network
  • Logistic Regression
SLIDE 16

Automatically derived acoustic units for BN training

  • Variational Bayes trained Dirichlet process mixture of HMMs
  • Open loop over a potentially infinite number of phone-like units
  • 3-state HMMs, 2 Gaussians per state
  • 2048-Gaussian full-covariance UBM, 600-dimensional i-vectors and GLC + cuts
  • We can do better than the SDC baseline on DEV even without transcriptions
  • A conventional bottleneck trained on (probably) any data is still better

Fixed data condition
  Features              DEV cavg*   EVAL cavg / cavg*
  MFCC-SDC              6.3         23.8 / 21.5
  SBN80-AUTO600-KALDI   5.4         28.9 / 24.2
  SBN80-SWB1-KALDI      2.9         20.1 / 16.2

SLIDE 17

Conclusion - lessons learned

  • The state-of-the-art system is an i-vector system with bottleneck features
  • GLC with uncertainty performs similarly to GLC trained with a lot of small cuts
  • Phonotactic systems do not contribute to the final fusion
  • Data engineering is always important
  • Frame level NN approaches

○ Prone to overtraining
○ Better to use the NN as a source of counts which are modelled by another classifier

  • Other systems

○ Denoising/dereverberation with NN - helping on EVL but not on DEV
○ Phonotactic systems - with Switchboard phoneme recognizer
○ Frame-level DNN

SLIDE 18

THANK YOU