

SLIDE 1

The MITLL NIST LRE 2015 Language Recognition System*

Contributors (in alphabetical order): Najim Dehak**, Elizabeth Godoy, Douglas Reynolds, Fred Richardson, Stephen Shum**, Elliot Singer, Doug Sturim, Pedro Torres-Carrasquillo

** Johns Hopkins University  *** Spoken Language System Group, MIT-CSAIL

* This work was sponsored by the Department of Defense under Air Force contract F19628-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

SLIDE 2

Odyssey 2016 PAT 2

  • Systems
  • Development Data
  • Evaluation Results
  • Observations

Outline

SLIDE 3

  • Classic I-Vector systems

– IVEC: cep + sdc features
– PITCH1: cep + sdc + log_F0 + Dlog_F0 features

  • ASR DNN / I-Vector systems

– BNF1, BNF2: DNN bottleneck features
– PITCH2: DNN bottleneck + log_F0 + Dlog_F0 features
– STATS: DNN posteriors and cep + sdc features

  • ASR DNN / GMM-MMI

– MMI: GMM-MMI classifier using DNN bottleneck features

  • Multilingual ASR DNN / I-Vector system (Open data task)

– MLBNF: 5 Babel language DNN bottleneck features

LRE15 Systems - I

All ivec systems scored with LDA+WCCN or WCCN.
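The cep + sdc front end of the classic i-vector systems can be sketched as below. This is a minimal NumPy implementation of shifted delta cepstra; the common 7-1-3-7 (N-d-P-k) configuration is our assumption, since the slide does not give the parameters.

```python
import numpy as np

def shifted_delta_cepstra(cep, N=7, d=1, P=3, k=7):
    """SDC features with the N-d-P-k scheme from a (T, C) cepstral matrix.

    For frame t, stack the k delta vectors taken at offsets t, t+P, ...,
    t+(k-1)*P, where each delta spans +/- d frames (edge-padded).
    """
    T = cep.shape[0]
    cep = cep[:, :N]                                  # keep first N coefficients
    padded = np.pad(cep, ((d, d), (0, 0)), mode="edge")
    delta = padded[2 * d:] - padded[:-2 * d]          # (T, N): c[t+d] - c[t-d]
    blocks = []
    for i in range(k):
        s = i * P                                     # shift the deltas forward by i*P frames
        shifted = np.pad(delta, ((0, s), (0, 0)), mode="edge")[s:s + T]
        blocks.append(shifted)
    return np.hstack(blocks)                          # (T, N * k)
```

The SDC vectors are typically appended to the static cepstra before i-vector extraction.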

SLIDE 4

  • Unsupervised Unit Discovery DNN / I-Vector system

– BAUD: DNN bottleneck features

  • DNN Counts Subspace Multinomial Model systems

– CNT1: Counts from ASR DNN layers
– CNT2: Counts from LID DNN layers
– CNT3: Joint subspace of CNT1 and CNT2 counts

  • Calibration and Fusion

– Multiclass calibration followed by linear fusion
– Duration weighting on system scores
– Per-system calibration: MMI-trained Gaussian
– Linear fusion optimized with logistic regression

LRE15 Systems - II

All ivec systems scored with LDA+WCCN or WCCN.
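As a sketch of the WCCN scoring noted above (the LDA step is omitted for brevity, and the function names are ours): estimate the average within-class covariance of the training i-vectors and whiten with the Cholesky factor of its inverse, then score by cosine similarity.

```python
import numpy as np

def wccn_projection(ivectors, labels):
    """WCCN projection B such that B.T @ W @ B = I, where W is the
    average within-class covariance of the training i-vectors."""
    classes = np.unique(labels)
    dim = ivectors.shape[1]
    W = np.zeros((dim, dim))
    for c in classes:
        X = ivectors[labels == c]
        Xc = X - X.mean(axis=0)                # center within the class
        W += Xc.T @ Xc / len(X)
    W /= len(classes)
    # lower-triangular B with B @ B.T = inv(W)
    return np.linalg.cholesky(np.linalg.inv(W))

def cosine_score(model_ivec, test_ivec, B):
    """Cosine similarity between WCCN-projected i-vectors."""
    a, b = B.T @ model_ivec, B.T @ test_ivec
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

The design choice behind WCCN is to down-weight directions of high within-language variability so that the cosine metric emphasizes between-language differences.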

SLIDE 5

  • Systems
  • Development Data
  • Evaluation Results
  • Observations

Outline

SLIDE 6

  • Randomly divided the development data by file count

– 60% train
– 40% test

  • Augmented both train and test sets with variable-duration segmentation (uniform distribution between 3 and 30 secs)

– Allowed for duration calibration in test
– Found that duration augmentation of the train data improved performance
– Other forms of augmentation (warping pitch, spectrum, speed) did not show any appreciable gains

  • For submissions, calibration and fusion were trained using scores from the train+test sets

Fixed Development Data Preparation
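The preparation above might be scripted as follows. This is a sketch under stated assumptions: the number of cuts per file and the seeding are ours, and each file is assumed to be at least 30 s of audio described by a (path, duration) pair.

```python
import random

def prepare_dev_data(files, train_frac=0.60, cuts_per_file=3, seed=0):
    """Split (path, duration_sec) pairs 60/40 by file count, then emit
    variable-duration cut specs with durations uniform on [3, 30] s."""
    rng = random.Random(seed)
    files = list(files)
    rng.shuffle(files)                               # random split by file count
    n_train = int(round(len(files) * train_frac))
    train, test = files[:n_train], files[n_train:]

    def augment(file_list):
        cuts = []
        for path, full_dur in file_list:
            for _ in range(cuts_per_file):
                dur = rng.uniform(3.0, min(30.0, full_dur))
                start = rng.uniform(0.0, full_dur - dur)
                cuts.append((path, start, dur))      # (file, offset, duration)
        return cuts

    return augment(train), augment(test)
```

Augmenting the train side as well as the test side matches the slide's observation that duration augmentation of the training data improved performance.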

SLIDE 7

  • Found other data sources for all languages

– LRE07, 09, 11, OHSU, OGI-22, Fisher, Callfriend, Babel, Ahumada, MI5-UK, Appen, Qatar-Dialect, Kalaka
– Types of speech: CTS, BNBS, BWBS
– All data audited

  • Extra data used for language model training

– Used fixed data test set for performance estimation

  • The multilingual DNN was the only system to explicitly rely on extra data

  • During development, found that using all the extra data hurt performance

– Only 3 of the languages contributed to improved performance (Brazilian Portuguese, British English, and Arabic MSA)

Open Development Data Preparation

CTS = Conversational Telephone Speech; BNBS = Broadcast Narrow Band Speech; BWBS = Broadcast Wide Band Speech

SLIDE 8

Development Results

Primary Systems

[Figure: bar chart of cost (0.01-0.06) per cluster (arabic, chinese, english, french, iberian, slavic, average) for the Fixed Primary and Open Primary systems]

SLIDE 9

  • Systems
  • Development Data
  • Evaluation Results
  • Observations

Outline

SLIDE 10

Fixed Primary

Component Breakout

[Figure: cost (average and average sans French) for components BAUD, CNT1, BNF1 (best single system), PITCH1, STATS, and the PRIMARY and Oracle fusions; values shown: 0.176, 0.173, 0.093, 0.089 on a 0.05-0.30 axis]

  • Primary not far from oracle fusion
  • Unsupervised BAUD does almost as well as the single best ASR DNN system (BNF1)

SLIDE 11

  • Analysis shows that BNBS vs. CTS is a major effect in the French cluster
  • Arabic and Iberian clusters have the highest costs after French

– Language / source?*

Fixed Primary

Per-Cluster Breakout

[Figure: cost (0.05-0.25) per cluster: arabic, chinese, english, french, iberian, slavic, average, and average without French]

*MSA and Portuguese are the least confusable languages in their clusters (both dominated by BNBS)

SLIDE 12

  • Type (BNBS vs. CTS) appears to be a large factor in dev/eval mismatch

French Cluster Analysis

[Figure: French cluster data plotted by language (Haitian, West African French) and source (BNBS, CTS)]

SLIDE 13

  • Type (BNBS vs. CTS) is a factor but does not affect language separation

Slavic Cluster Analysis

[Figure: Slavic cluster data plotted by language and source (Russian; BNBS vs. CTS)]

SLIDE 14

Open Primary

Component Breakout

[Figure: cost (average and average sans French) for components CNT1, BNF1, PITCH1, MLBNF (best single system), STATS, and the PRIMARY and Oracle fusions; values shown: 0.169, 0.167, 0.086, 0.084 on a 0.05-0.30 axis]

  • Minor improvement using extra data
  • Multilingual BNF has a slight gain over BNF1

SLIDE 15

  • Looked at the effect of adding extra data to the Arabic languages
  • Bottom line: extra data provided little gain or hurt performance on eval
  • Post-eval?

Open Task

Adding Data to Arabic

Source   Languages                           Audit            Files   Speech (hrs)
Appen    Iraqi, Levantine                    Appen            2012    121.90
Fisher   Levantine                           LDC              1572    120.69
LRE11    Iraqi, Levantine, Maghrebi, MSA     LDC              2727    29.89
Qatar    Egyptian, Levantine, Maghrebi, MSA  Mechanical Turk  20056   122.91

System            Cost
Baseline          0.2292
Baseline+Appen    0.2235
Baseline+Fisher   0.2255
Baseline+LRE11    0.2155
Baseline+Qatar    0.2604

SLIDE 16

  • Additional data

– After revisiting open-set submission, training with all data available would have reduced “French” cluster error

  • Multilingual

– Work in progress but reductions observed for some configurations that include a more diverse set of languages

Post-eval Experiments

Highlights

SLIDE 17

  • Spanish errors

– 50 samples chosen randomly
– Main issues present in these errors:

  • Cuban females (10)
  • Little speech content (5-7)

  • English errors

– 50 samples chosen randomly
– Main issues present in these errors:

  • 80% of errors do not involve Indian English
  • 5 files with little or no speech content

Post-eval Experiments

Highlights

SLIDE 18

  • DNN bottleneck features used in an i-vector system continue to be the best single system
  • Fusion with count (phonotactic) systems provides moderate gains
  • Possible factors affecting performance this year

– Language confusability (amplified by short durations)
– Source mismatch (BNBS vs. CTS)

  • Adding more data did not solve the problem… on dev set
  • Path forward

– Need to better focus on robustness over wider conditions vs. incremental improvements over narrow conditions

Observations

SLIDE 19

SLIDE 20

Fixed Development Data

CODE     LANGUAGE        # Cuts   Speech (hrs)
ara-acm  Iraqi           2206     75.59
ara-apc  Levantine       4073     266.67
ara-arb  MSA             912      8.18
ara-ary  Maghrebi        919      46.91
ara-arz  Egyptian        440      97.27
eng-gbr  British Eng.    147      2.10
eng-sas  Indian Eng.     1689     25.37
eng-usg  Amer. Eng.      2448     165.92
fre-hat  Haitian Cr.     2192     110.79
fre-waf  West Afr. Fr.   1229     7.02
por-brz  Braz. Port.     1838     5.96
qsl-pol  Polish          695      32.14
qsl-rus  Russian         2021     37.80
spa-car  Carib. Spa.     194      30.59
spa-eur  Eur. Spa.       366      8.55
spa-lac  Lat. Am. Spa.   160      15.30
zho-cdo  Min             209      6.46
zho-cmn  Mandarin        4131     200.70
zho-wuu  Wu              234      10.36
zho-yue  Cantonese       2382     123.61

SLIDE 21

Open Development Data Preparation

LANGUAGE               Sources                            Type       Cuts
Arabic.egyptian        None                               -          -
Arabic.iraqi           LRE11, Appen                       CTS        1788
Arabic.levantine       LRE11, Fisher, Appen               CTS        3623
Arabic.maghrebi        LRE11                              BNBS       505
Arabic.msa             LRE11                              BNBS       506
Chinese.cantonese      LRE09, Babel                       CTS, BNBS  2359
Chinese.mandarin       LRE05-07-09-11, Callfriend, OHSU   CTS, BNBS  3693
Chinese.minnan         LRE07-09                           CTS        168
Chinese.wu             LRE07-09                           CTS        189
Spanish.caribbean      LRE07                              CTS        74
Spanish.european       Ahumada                            CTS        328
Spanish.latinamerican  OHSU (Mexican)                     CTS        130
Portuguese.brazilian   LRE09, OGI-22, VOA scrape          CTS, BNBS  1791
English.american       LRE05-07-09-11, Callfriend, OHSU   CTS        2088
English.indian         LRE07-09-11, OHSU, OGI-22          CTS        1271
English.british        UK-MI5 SID                         CTS        148
Polish                 LRE11                              CTS, BNBS  208
Russian                LRE07-09-11, Callfriend            CTS, BNBS  1551
West African French    LRE09, VOA scrape                  BNBS       1195
Haitian Creole         Babel, VOA scrape                  CTS, BNBS  1869

Missing Qatar and Kalaka

SLIDE 22

Calibration and Fusion Backend

[Diagram: each of K detectors produces scores s_k,1 ... s_k,M for the M target languages. Per detector, the scores pass through a duration scale (a function of the number of frames N; the formula is garbled in the source), then per-system calibration with an MMI-trained Gaussian, yielding log-likelihoods LL_k,1 ... LL_k,M. System fusion weights each calibrated system by w_k and sums, giving fused log-likelihoods LL_1 ... LL_M, which Bayes' rule combines with the priors to produce posteriors P_1 ... P_M.]
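The fusion stage of this backend can be sketched as below. This is a simplification: the MMI Gaussian calibration is assumed to have already been applied, and because the duration-scaling formula is garbled in the source, a generic per-system scale factor stands in for it.

```python
import numpy as np

def fuse_backend(loglikes, weights, priors, dur_scales=None):
    """Fuse K calibrated systems' log-likelihoods into language posteriors.

    loglikes:   (K, M) per-system calibrated log-likelihoods LL_{k,m}
    weights:    (K,) fusion weights w_k
    priors:     (M,) language priors
    dur_scales: optional (K,) duration-dependent scale per system
    """
    ll = np.asarray(loglikes, dtype=float)
    if dur_scales is not None:
        ll = ll * np.asarray(dur_scales)[:, None]    # duration scaling
    fused = np.asarray(weights) @ ll                 # (M,) weighted-sum fusion
    log_post = fused + np.log(priors)                # Bayes' rule, unnormalized
    log_post -= log_post.max()                       # numerical stability
    post = np.exp(log_post)
    return post / post.sum()                         # posteriors P_1 ... P_M
```

In the submission the weights were trained with multiclass logistic regression; here they are simply passed in.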

SLIDE 23

Calibration and Fusion

  • Multiclass calibration followed by linear fusion

– Per system calibration: MMI-trained Gaussian

  • Maximum Mutual Information is equivalent to minimum average cross entropy with the answer key (MCLLR)
  • Shared diagonalized covariance to reduce free parameters
  • Replaces redundant combination of ML Gaussian + regression

– Linear fusion optimized with logistic regression (FoCal)
– Multiclass: generates identification posteriors

  • Single back-end trained for all pairs and durations

– Parametric duration modeling replaces separate bins
– Use Bayes' rule to get language-pair scores

Identification posterior:

$$P_{ID}(C_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_i)\, P(C_i)}{\sum_{j=1}^{M} p(\mathbf{x} \mid C_j)\, P(C_j)}$$

Pair likelihood ratio:

$$LR_{PAIR}(C_m, C_n \mid \mathbf{x}) = \frac{P_{ID}(C_m \mid \mathbf{x})\, P(C_n)}{P_{ID}(C_n \mid \mathbf{x})\, P(C_m)}$$
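The pair-score derivation via Bayes' rule can be written out directly; a minimal sketch (function names are ours):

```python
import math

def pair_log_lr(id_post, priors, m, n):
    """Log pair likelihood ratio derived from identification posteriors:
    log LR = log[P_ID(C_m|x) P(C_n)] - log[P_ID(C_n|x) P(C_m)]."""
    return (math.log(id_post[m]) + math.log(priors[n])
            - math.log(id_post[n]) - math.log(priors[m]))
```

With uniform priors this reduces to the log-odds of the two identification posteriors.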

SLIDE 24

  • Data breakdown by source and gender

French Cluster Analysis

Language      Dev source     Dev gender       Eval source            Eval gender
Haitian       BNBS (LRE09)   F: 69,  M: 291   CTS: 8997, BNBS: 0     CTS: F: 4478, M: 4519
West African  CTS (LRE15)    F: 153, M: 149   CTS: 6213, BNBS: 722   CTS: F: 3444, M: 2769; BNBS: F: 208, M: 514

SLIDE 25

  • Slower gains at durations > 15s
  • How distinguishable are confusable languages by humans at 3-5 secs?

Effect of Test Durations

[Figure: cost per cluster (sans French) vs. test duration, from the fixed primary submission]
SLIDE 26

Average over clusters vs. 20-language detection

Fixed Primary System              COST/DCF
Average of 6 language clusters    0.176458
  Sans French cluster             0.092629
20-language detection             0.102240
  Sans French languages           0.082453

SLIDE 27

LRE Performance Trends: 1996-2015

MITLL Systems

[Figure: EER (%) (5-35) across the evaluations 1996, 2003, 2005, 2005, 2007, 2007, 2009, 2011, 2015*, 2015 for 30s, 10s, and 3s test durations]

SLIDE 28

LRE Performance Trends: 1996-2011

MITLL Systems

[Figure: EER (%) (10-40) by evaluation, 1996-2011, for 30s, 10s, and 3s durations; evaluation corpora: CallFriend (12-lang), OHSU (7-lang), CTS+BN (23-lang), Mixer3 (14-lang), PAIRS (24-lang); EER values shown include 11.3, 4.2, 3.2, 3.2, 1.9, 1.6, 1.4, 1.0]

SLIDE 29

Fixed Primary Dev vs. Eval Results

[Figure: cost (0.1-0.6) per cluster (arabic, chinese, english, french, iberian, slavic, average) for DEV vs. EVAL]

SLIDE 30

French Cluster

SLIDE 31

French Cluster

SLIDE 32

Slavic Cluster

SLIDE 33

Slavic Cluster