

SLIDE 1

The MITLL NIST LRE 2015 Language Recognition System*

Contributors (in alphabetical order): Najim Dehak**, Elizabeth Godoy, Douglas Reynolds, Fred Richardson, Stephen Shum**, Elliot Singer, Doug Sturim, Pedro Torres-Carrasquillo

** Johns Hopkins University  *** Spoken Language System Group, MIT-CSAIL

* This work was sponsored by the Department of Defense under Air Force contract F19628-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

SLIDE 2

Odyssey 2016 PAT 2

  • Systems
  • Development Data
  • Evaluation Results
  • Observations

Outline

SLIDE 3

  • Classic I-Vector systems

– IVEC: cep + sdc features
– PITCH1: cep + sdc + log_F0 + Dlog_F0 features

  • ASR DNN / I-Vector systems

– BNF1, BNF2: DNN bottleneck features
– PITCH2: DNN bottleneck + log_F0 + Dlog_F0 features
– STATS: DNN posteriors and cep + sdc features

  • ASR DNN / GMM-MMI

– MMI: GMM-MMI classifier using DNN bottleneck features

  • Multilingual ASR DNN / I-Vector system (Open data task)

– MLBNF: 5 Babel language DNN bottleneck features

LRE15 Systems - I

All ivec systems scored with LDA+WCCN or WCCN.
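The cep + sdc front end of the classic i-vector systems can be sketched as below. This is a minimal NumPy implementation of shifted delta cepstra; the common 7-1-3-7 (N-d-P-k) configuration is our assumption, since the slide does not give the parameters.

```python
import numpy as np

def shifted_delta_cepstra(cep, N=7, d=1, P=3, k=7):
    """SDC features with the N-d-P-k scheme from a (T, C) cepstral matrix.

    For frame t, stack the k delta vectors taken at offsets t, t+P, ...,
    t+(k-1)*P, where each delta spans +/- d frames (edge-padded).
    """
    T = cep.shape[0]
    cep = cep[:, :N]                                  # keep first N coefficients
    padded = np.pad(cep, ((d, d), (0, 0)), mode="edge")
    delta = padded[2 * d:] - padded[:-2 * d]          # (T, N): c[t+d] - c[t-d]
    blocks = []
    for i in range(k):
        s = i * P                                     # shift the deltas forward by i*P frames
        shifted = np.pad(delta, ((0, s), (0, 0)), mode="edge")[s:s + T]
        blocks.append(shifted)
    return np.hstack(blocks)                          # (T, N * k)
```

The SDC vectors are typically appended to the static cepstra before i-vector extraction.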

SLIDE 4

  • Unsupervised Unit Discovery DNN / I-Vector system

– BAUD: DNN bottleneck features

  • DNN Counts Subspace Multinomial Model systems

– CNT1: Counts from ASR DNN layers
– CNT2: Counts from LID DNN layers
– CNT3: Joint subspace of CNT1 and CNT2 counts

  • Calibration and Fusion

– Multiclass calibration followed by linear fusion
– Duration weighting on system scores
– Per-system calibration: MMI-trained Gaussian
– Linear fusion optimized with logistic regression

LRE15 Systems - II

All ivec systems scored with LDA+WCCN or WCCN.
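As a sketch of the WCCN scoring noted above (the LDA step is omitted for brevity, and the function names are ours): estimate the average within-class covariance of the training i-vectors and whiten with the Cholesky factor of its inverse, then score by cosine similarity.

```python
import numpy as np

def wccn_projection(ivectors, labels):
    """WCCN projection B such that B.T @ W @ B = I, where W is the
    average within-class covariance of the training i-vectors."""
    classes = np.unique(labels)
    dim = ivectors.shape[1]
    W = np.zeros((dim, dim))
    for c in classes:
        X = ivectors[labels == c]
        Xc = X - X.mean(axis=0)                # center within the class
        W += Xc.T @ Xc / len(X)
    W /= len(classes)
    # lower-triangular B with B @ B.T = inv(W)
    return np.linalg.cholesky(np.linalg.inv(W))

def cosine_score(model_ivec, test_ivec, B):
    """Cosine similarity between WCCN-projected i-vectors."""
    a, b = B.T @ model_ivec, B.T @ test_ivec
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

The design choice behind WCCN is to down-weight directions of high within-language variability so that the cosine metric emphasizes between-language differences.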

SLIDE 5

  • Systems
  • Development Data
  • Evaluation Results
  • Observations

Outline

SLIDE 6

  • Randomly divided the development data by file count

– 60% train
– 40% test

  • Augmented both train and test sets with variable-duration segmentation (uniform distribution between 3 and 30 secs)

– Allowed for duration calibration in test
– Found that duration augmentation of the train data improved performance
– Other forms of augmentation (warping pitch, spectrum, speed) did not show any appreciable gains

  • For submissions, calibration and fusion were trained using scores from the train+test sets

Fixed Development Data Preparation
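The preparation above might be scripted as follows. This is a sketch under stated assumptions: the number of cuts per file and the seeding are ours, and each file is assumed to be at least 30 s of audio described by a (path, duration) pair.

```python
import random

def prepare_dev_data(files, train_frac=0.60, cuts_per_file=3, seed=0):
    """Split (path, duration_sec) pairs 60/40 by file count, then emit
    variable-duration cut specs with durations uniform on [3, 30] s."""
    rng = random.Random(seed)
    files = list(files)
    rng.shuffle(files)                               # random split by file count
    n_train = int(round(len(files) * train_frac))
    train, test = files[:n_train], files[n_train:]

    def augment(file_list):
        cuts = []
        for path, full_dur in file_list:
            for _ in range(cuts_per_file):
                dur = rng.uniform(3.0, min(30.0, full_dur))
                start = rng.uniform(0.0, full_dur - dur)
                cuts.append((path, start, dur))      # (file, offset, duration)
        return cuts

    return augment(train), augment(test)
```

Augmenting the train side as well as the test side matches the slide's observation that duration augmentation of the training data improved performance.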

SLIDE 7

  • Found other data sources for all languages

– LRE07, 09, 11, OHSU, OGI-22, Fisher, Callfriend, Babel, Ahumada, MI5-UK, Appen, Qatar-Dialect, Kalaka
– Types of speech: CTS, BNBS, BWBS
– All data audited

  • Extra data used for language model training

– Used fixed data test set for performance estimation

  • The multilingual DNN was the only system to explicitly rely on extra data

  • During development, found that using all the extra data hurt performance

– Only 3 of the languages contributed to improved performance (Brazilian Portuguese, British English, and Arabic MSA)

Open Development Data Preparation

CTS = Conversational Telephone Speech; BNBS = Broadcast Narrow Band Speech; BWBS = Broadcast Wide Band Speech

SLIDE 8

Development Results

Primary Systems

[Figure: bar chart of cost (0.01-0.06) per cluster (arabic, chinese, english, french, iberian, slavic, average) for the Fixed Primary and Open Primary systems]

SLIDE 9

  • Systems
  • Development Data
  • Evaluation Results
  • Observations

Outline

SLIDE 10

Fixed Primary

Component Breakout

[Figure: cost (average and average sans French) for components BAUD, CNT1, BNF1 (best single system), PITCH1, STATS, and the PRIMARY and Oracle fusions; values shown: 0.176, 0.173, 0.093, 0.089 on a 0.05-0.30 axis]

  • Primary not far from oracle fusion
  • Unsupervised BAUD does almost as well as the single best ASR DNN system (BNF1)

SLIDE 11

  • Analysis shows that BNBS vs. CTS is a major effect in the French cluster
  • Arabic and Iberian clusters have the highest costs after French

– Language / source?*

Fixed Primary

Per-Cluster Breakout

[Figure: cost (0.05-0.25) per cluster: arabic, chinese, english, french, iberian, slavic, average, and average without French]

*MSA and Portuguese are the least confusable languages in their clusters (both dominated by BNBS)

SLIDE 12

  • Type (BNBS vs. CTS) appears to be a large factor in dev/eval mismatch

French Cluster Analysis

[Figure: French cluster data plotted by language (Haitian, West African French) and source (BNBS, CTS)]

SLIDE 13

  • Type (BNBS vs. CTS) is a factor but does not affect language separation

Slavic Cluster Analysis

[Figure: Slavic cluster data plotted by language and source (Russian; BNBS vs. CTS)]

SLIDE 14

Open Primary

Component Breakout

[Figure: cost (average and average sans French) for components CNT1, BNF1, PITCH1, MLBNF (best single system), STATS, and the PRIMARY and Oracle fusions; values shown: 0.169, 0.167, 0.086, 0.084 on a 0.05-0.30 axis]

  • Minor improvement using extra data
  • Multilingual BNF has a slight gain over BNF1

SLIDE 15

  • Looked at the effect of adding extra data to the Arabic languages
  • Bottom line: extra data provided little gain or hurt performance on eval
  • Post-eval?

Open Task

Adding Data to Arabic

Source   Languages                           Audit            Files   Speech (hrs)
Appen    Iraqi, Levantine                    Appen            2012    121.90
Fisher   Levantine                           LDC              1572    120.69
LRE11    Iraqi, Levantine, Maghrebi, MSA     LDC              2727    29.89
Qatar    Egyptian, Levantine, Maghrebi, MSA  Mechanical Turk  20056   122.91

System            Cost
Baseline          0.2292
Baseline+Appen    0.2235
Baseline+Fisher   0.2255
Baseline+LRE11    0.2155
Baseline+Qatar    0.2604

SLIDE 16

  • Additional data

– After revisiting open-set submission, training with all data available would have reduced “French” cluster error

  • Multilingual

– Work in progress but reductions observed for some configurations that include a more diverse set of languages

Post-eval Experiments

Highlights

SLIDE 17

  • Spanish errors

– 50 samples chosen randomly
– Main issues present in these errors:

  • Cuban females (10)
  • Little speech content (5-7)

  • English errors

– 50 samples chosen randomly
– Main issues present in these errors:

  • 80% of errors do not involve Indian English
  • 5 files with little or no speech content

Post-eval Experiments

Highlights

SLIDE 18

  • DNN bottleneck features used in an i-vector system continue to be the best single system
  • Fusion with count (phonotactic) systems provides moderate gains
  • Possible factors affecting performance this year

– Language confusability (amplified by short durations)
– Source mismatch (BNBS vs. CTS)

  • Adding more data did not solve the problem… on dev set
  • Path forward

– Need to better focus on robustness over wider conditions vs. incremental improvements over narrow conditions

Observations

SLIDE 19

SLIDE 20

Fixed Development Data

CODE     LANGUAGE        # Cuts   Speech (hrs)
ara-acm  Iraqi           2206     75.59
ara-apc  Levantine       4073     266.67
ara-arb  MSA             912      8.18
ara-ary  Maghrebi        919      46.91
ara-arz  Egyptian        440      97.27
eng-gbr  British Eng.    147      2.10
eng-sas  Indian Eng.     1689     25.37
eng-usg  Amer. Eng.      2448     165.92
fre-hat  Haitian Cr.     2192     110.79
fre-waf  West Afr. Fr.   1229     7.02
por-brz  Braz. Port.     1838     5.96
qsl-pol  Polish          695      32.14
qsl-rus  Russian         2021     37.80
spa-car  Carib. Spa.     194      30.59
spa-eur  Eur. Spa.       366      8.55
spa-lac  Lat. Am. Spa.   160      15.30
zho-cdo  Min             209      6.46
zho-cmn  Mandarin        4131     200.70
zho-wuu  Wu              234      10.36
zho-yue  Cantonese       2382     123.61

SLIDE 21

Open Development Data Preparation

LANGUAGE               Sources                            Type       Cuts
Arabic.egyptian        None                               -          -
Arabic.iraqi           LRE11, Appen                       CTS        1788
Arabic.levantine       LRE11, Fisher, Appen               CTS        3623
Arabic.maghrebi        LRE11                              BNBS       505
Arabic.msa             LRE11                              BNBS       506
Chinese.cantonese      LRE09, Babel                       CTS, BNBS  2359
Chinese.mandarin       LRE05-07-09-11, Callfriend, OHSU   CTS, BNBS  3693
Chinese.minnan         LRE07-09                           CTS        168
Chinese.wu             LRE07-09                           CTS        189
Spanish.caribbean      LRE07                              CTS        74
Spanish.european       Ahumada                            CTS        328
Spanish.latinamerican  OHSU (Mexican)                     CTS        130
Portuguese.brazilian   LRE09, OGI-22, VOA scrape          CTS, BNBS  1791
English.american       LRE05-07-09-11, Callfriend, OHSU   CTS        2088
English.indian         LRE07-09-11, OHSU, OGI-22          CTS        1271
English.british        UK-MI5 SID                         CTS        148
Polish                 LRE11                              CTS, BNBS  208
Russian                LRE07-09-11, Callfriend            CTS, BNBS  1551
West African French    LRE09, VOA scrape                  BNBS       1195
Haitian Creole         Babel, VOA scrape                  CTS, BNBS  1869

Missing Qatar and Kalaka

SLIDE 22

Calibration and Fusion Backend

[Diagram: each of K detectors produces scores s_k,1 ... s_k,M for the M target languages. Per detector, the scores pass through a duration scale (a function of the number of frames N; the formula is garbled in the source), then per-system calibration with an MMI-trained Gaussian, yielding log-likelihoods LL_k,1 ... LL_k,M. System fusion weights each calibrated system by w_k and sums, giving fused log-likelihoods LL_1 ... LL_M, which Bayes' rule combines with the priors to produce posteriors P_1 ... P_M.]
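The fusion stage of this backend can be sketched as below. This is a simplification: the MMI Gaussian calibration is assumed to have already been applied, and because the duration-scaling formula is garbled in the source, a generic per-system scale factor stands in for it.

```python
import numpy as np

def fuse_backend(loglikes, weights, priors, dur_scales=None):
    """Fuse K calibrated systems' log-likelihoods into language posteriors.

    loglikes:   (K, M) per-system calibrated log-likelihoods LL_{k,m}
    weights:    (K,) fusion weights w_k
    priors:     (M,) language priors
    dur_scales: optional (K,) duration-dependent scale per system
    """
    ll = np.asarray(loglikes, dtype=float)
    if dur_scales is not None:
        ll = ll * np.asarray(dur_scales)[:, None]    # duration scaling
    fused = np.asarray(weights) @ ll                 # (M,) weighted-sum fusion
    log_post = fused + np.log(priors)                # Bayes' rule, unnormalized
    log_post -= log_post.max()                       # numerical stability
    post = np.exp(log_post)
    return post / post.sum()                         # posteriors P_1 ... P_M
```

In the submission the weights were trained with multiclass logistic regression; here they are simply passed in.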

SLIDE 23

Calibration and Fusion

  • Multiclass calibration followed by linear fusion

– Per system calibration: MMI-trained Gaussian

  • Maximum Mutual Information is equivalent to minimum average cross entropy with the answer key (MCLLR)
  • Shared diagonalized covariance to reduce free parameters
  • Replaces redundant combination of ML Gaussian + regression

– Linear fusion optimized with logistic regression (FoCal)
– Multiclass: generates identification posteriors

  • Single back-end trained for all pairs and durations

– Parametric duration modeling replaces separate bins
– Use Bayes' rule to get language-pair scores

Identification posterior:

$$P_{ID}(C_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_i)\, P(C_i)}{\sum_{j=1}^{M} p(\mathbf{x} \mid C_j)\, P(C_j)}$$

Pair likelihood ratio:

$$LR_{PAIR}(C_m, C_n \mid \mathbf{x}) = \frac{P_{ID}(C_m \mid \mathbf{x})\, P(C_n)}{P_{ID}(C_n \mid \mathbf{x})\, P(C_m)}$$
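The pair-score derivation via Bayes' rule can be written out directly; a minimal sketch (function names are ours):

```python
import math

def pair_log_lr(id_post, priors, m, n):
    """Log pair likelihood ratio derived from identification posteriors:
    log LR = log[P_ID(C_m|x) P(C_n)] - log[P_ID(C_n|x) P(C_m)]."""
    return (math.log(id_post[m]) + math.log(priors[n])
            - math.log(id_post[n]) - math.log(priors[m]))
```

With uniform priors this reduces to the log-odds of the two identification posteriors.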

SLIDE 24

  • Data breakdown by source and gender

French Cluster Analysis

Language      Dev source     Dev gender       Eval source            Eval gender
Haitian       BNBS (LRE09)   F: 69,  M: 291   CTS: 8997, BNBS: 0     CTS: F: 4478, M: 4519
West African  CTS (LRE15)    F: 153, M: 149   CTS: 6213, BNBS: 722   CTS: F: 3444, M: 2769; BNBS: F: 208, M: 514

SLIDE 25

  • Slower gains at durations > 15s
  • How distinguishable are confusable languages by humans at 3-5 secs?

Effect of Test Durations

[Figure: cost per cluster (sans French) vs. test duration, from the fixed primary submission]
SLIDE 26

Average over clusters vs. 20-language detection

Fixed Primary System              COST/DCF
Average of 6 language clusters    0.176458
  Sans French cluster             0.092629
20-language detection             0.102240
  Sans French languages           0.082453

SLIDE 27

LRE Performance Trends: 1996-2015

MITLL Systems

[Figure: EER (%) (5-35) across the evaluations 1996, 2003, 2005, 2005, 2007, 2007, 2009, 2011, 2015*, 2015 for 30s, 10s, and 3s test durations]

SLIDE 28

LRE Performance Trends: 1996-2011

MITLL Systems

[Figure: EER (%) (10-40) by evaluation, 1996-2011, for 30s, 10s, and 3s durations; evaluation corpora: CallFriend (12-lang), OHSU (7-lang), CTS+BN (23-lang), Mixer3 (14-lang), PAIRS (24-lang); EER values shown include 11.3, 4.2, 3.2, 3.2, 1.9, 1.6, 1.4, 1.0]

SLIDE 29

Fixed Primary Dev vs. Eval Results

[Figure: cost (0.1-0.6) per cluster (arabic, chinese, english, french, iberian, slavic, average) for DEV vs. EVAL]

SLIDE 30

French Cluster

SLIDE 31

French Cluster

SLIDE 32

Slavic Cluster

SLIDE 33

Slavic Cluster