

SLIDE 1

High-Performance Session Variability Compensation in Forensic Automatic Speaker Recognition

Daniel Ramos, Javier Gonzalez-Dominguez, Eugenio Arevalo and Joaquin Gonzalez-Rodriguez

ATVS – Biometric Recognition Group, Universidad Autonoma de Madrid. daniel.ramos@uam.es, http://atvs.ii.uam.es

3aSC5 Special Session on Forensic Voice Comparison and Forensic Acoustics @ 2nd Pan-American/Iberian Meeting on Acoustics, Cancún, México, 15–19 November, 2010 http://cancun2010.forensic-voice-comparison.net

SLIDE 2

Outline

Forensic Automatic Speaker Recognition:

Where are we?

State of the art dominated by high-performance session variability compensation

Some challenges affecting session var. comp.

Database mismatch

Sparse background data

Duration variability

Research trends

Facing the challenges

2nd Pan-American Meeting on Acoustics, ASA, Cancun, Mexico, November 2010

SLIDE 3

Where Are We?

Automatic Speaker Recognition (ASpkrR) technology

Driven by NIST Speaker Recognition Evaluations (SRE)

State of the art dominated by

Spectral systems

High-performance session variability compensation

Factor Analysis, its flavors and evolutions

Data-driven

Currently a mature technology

Usable in many applications

SLIDE 4

Where Are We?

Discrimination performance (DET plots)

ATVS single spectral system in NIST SRE 2010

i-Vectors, session variability compensation

Primary Male: EER = 5.0%; Primary Female: EER = 7.1%; Contrastive Male: EER = 6.0%; Contrastive Female: EER = 8.1%
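An i-vector back-end of this kind typically scores two utterances by the cosine similarity of their session-compensated, length-normalized i-vectors. The sketch below illustrates only that final scoring step, with toy low-dimensional vectors standing in for real i-vectors; it is not necessarily the exact ATVS back-end.

```python
import math

def length_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_score(ivector_a, ivector_b):
    """Cosine similarity between two i-vectors; after length
    normalization this reduces to a plain inner product."""
    a = length_normalize(ivector_a)
    b = length_normalize(ivector_b)
    return sum(x * y for x, y in zip(a, b))

# Toy 4-dimensional "i-vectors" (real ones have a few hundred dims).
same_speaker = cosine_score([1.0, 2.0, 0.5, -1.0], [1.1, 1.9, 0.4, -0.9])
diff_speaker = cosine_score([1.0, 2.0, 0.5, -1.0], [-2.0, 0.1, 1.5, 2.0])
```

In a full system the session-variability compensation (e.g. within-class covariance normalization) is applied to the i-vectors before this scoring step.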

SLIDE 5

Where Are We?

To consider in Forensic ASpkrR

Convergence to scientific standards

“Emulating DNA”, Likelihood Ratio (LR) paradigm

Unfavorable environment

Mostly uncontrolled conditions

Sparse amount of speech (comparison and background)

SLIDE 6

Where Are We?

LR paradigm in Forensic ASpkrR

Speaker Recognition System → Score-to-LR Transformation (calibration) → LR

Score taken as Evidence (E):

LR = p(E | H_p, I) / p(E | H_d, I)

Two stages

Discrimination stage (standard, score-based architecture)

Calibration stage (LR computation)
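The calibration stage is often implemented as an affine score-to-log-LR map fitted by logistic regression (one common choice; the slide does not specify the method, so treat this sketch as an illustrative assumption rather than the authors' exact system).

```python
import math

def fit_linear_calibration(target_scores, nontarget_scores,
                           learn_rate=0.05, epochs=2000):
    """Fit log LR = a*score + b by logistic regression: gradient
    descent on the cross-entropy of same/different-speaker labels.
    With a balanced training set, the posterior log-odds a*s + b
    can be read directly as a (natural-log) likelihood ratio."""
    data = ([(s, 1.0) for s in target_scores] +
            [(s, 0.0) for s in nontarget_scores])
    a, b = 1.0, 0.0
    n = float(len(data))
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in data:
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s / n
            grad_b += (p - y) / n
        a -= learn_rate * grad_a
        b -= learn_rate * grad_b
    return a, b

def score_to_llr(score, a, b):
    """Calibrated log likelihood ratio for a new comparison score."""
    return a * score + b

# Toy training scores: same-speaker comparisons score higher on average.
target = [2.1, 1.8, 2.5, 1.5, 2.9]
nontarget = [-1.2, -0.5, -2.0, -0.8, -1.5]
a, b = fit_linear_calibration(target, nontarget)
```

After fitting, a high score maps to a positive log-LR (supporting same speaker) and a low score to a negative one, which is exactly the two-stage architecture the slide describes.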

SLIDE 7

Where Are We?

Discrimination performance

Example with AhumadaIV-Baeza database

Thanks to Guardia Civil Española

NIST-SRE-like task: comparison between

120 s of GSM or microphone (controlled) speech

Acquired following Guardia Civil protocols

120 s of GSM-SITEL speech

Acquired using SITEL, the Spanish national wire-tapping system

SLIDE 8

NIST SRE vs. Forensic ASpkrR

Main commonalities

Highly variable environment (telephone, different microphones, interview, etc.)

LR paradigm

NIST SRE allows LR calibration (assessed by Cllr)…

…although we believe this should be further encouraged

But in Forensic ASpkrR (and not in NIST SRE)

Typical lack of representative background data

NIST SRE: lots of speech from past SRE

Utterance duration is uncontrolled

NIST SRE: conditions of fixed, controlled duration
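Cllr, the metric mentioned above, measures the average information loss (in bits) of a set of likelihood ratios: 0 for perfect LRs, 1 for uninformative ones. A direct sketch of its standard definition:

```python
import math

def cllr(target_lrs, nontarget_lrs):
    """Log-likelihood-ratio cost over a set of LRs from
    same-speaker (target) and different-speaker (non-target) trials."""
    c_tgt = sum(math.log2(1.0 + 1.0 / lr) for lr in target_lrs) / len(target_lrs)
    c_non = sum(math.log2(1.0 + lr) for lr in nontarget_lrs) / len(nontarget_lrs)
    return 0.5 * (c_tgt + c_non)

# A system that always answers LR = 1 carries no information: Cllr = 1.
neutral = cllr([1.0, 1.0], [1.0, 1.0])  # -> 1.0
```

Well-calibrated, discriminating LRs (large for targets, small for non-targets) drive Cllr toward zero; poorly calibrated ones can push it above 1.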

SLIDE 9

Challenges of Session Variability Compensation

Some typical forensic scenarios where session variability compensation degrades

Strong database mismatch

Sparse background data

Extreme duration variability

Scenarios not present in NIST SRE

Hence, little attention has been paid to these problems

SLIDE 10

Challenges: Database Mismatch

Speaker Recognition System → Score-to-LR Transformation (calibration) → LR

Background database conditions (different from Q and S conditions)

Database mismatch: background and comparison (Questioned Q, Suspect S) databases are different

Additional problem to mismatch between Q and S

Degrades performance of session variability compensation

Subspaces are not representative of comparison speech

SLIDE 11

Challenges: Database Mismatch

Example in NIST SRE 2008

Comparison of two speech utterances

Speech from a single channel (microphone m3 or m5)

Speech from any channel in SRE08

Speech from m3/m5 included or not in the background (UBM, normalization and session variability compensation)

[DET plot: m5 match, EER = 7.28%; m5 mismatch (no m5 in background), EER = 8.82%; m3 match, EER = 21.06%; m3 mismatch (no m3 in background), EER = 22.60%]

SLIDE 12

Challenges: Database Mismatch

Example: AhumadaIV-Baeza

Background: NIST SRE telephone-only speech

Bad performance at low false-acceptance rates when microphone speech is used for training

Even when the microphone speech is controlled and of higher quality

Following the standard acquisition procedures of Guardia Civil Española

SLIDE 13

Database Mismatch: Research

Need for collection of more representative databases

Case study: continuous efforts of Guardia Civil Española

Ahumada-Gaudi (2000, spontaneous speech, landline telephone and microphone)

AhumadaIII (2008, real forensic cases, multidialect, GSM over magnetic tape)

AhumadaIV (2009, speech from SITEL)

SLIDE 14

Database Mismatch: Research

Predictors of database mismatch

E.g., log-likelihood with respect to the UBM (UBML)

Low UBML indicates database mismatch

Performance degrades
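The UBML predictor is the average per-frame log-likelihood of an utterance under the UBM (a GMM trained on background data). A minimal sketch with a toy one-dimensional, two-component UBM whose parameters are hypothetical:

```python
import math

def gmm_loglike(frame, weights, means, variances):
    """Log-likelihood of one 1-D feature frame under a diagonal GMM (the UBM)."""
    total = 0.0
    for w, m, v in zip(weights, means, variances):
        total += w * math.exp(-0.5 * (frame - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
    return math.log(total)

def ubml(frames, weights, means, variances):
    """Average per-frame log-likelihood: a low value suggests the
    utterance is mismatched with the UBM (background) training data."""
    return sum(gmm_loglike(f, weights, means, variances) for f in frames) / len(frames)

# Toy 2-component 1-D UBM (hypothetical parameters).
w, m, v = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
matched = ubml([-0.9, 1.1, 0.0, -1.2], w, m, v)
mismatched = ubml([8.0, 9.5, 7.2, 10.0], w, m, v)  # frames far from UBM mass
```

Frames that fall far from the UBM's probability mass produce a low UBML, flagging the database mismatch that degrades compensation performance.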

SLIDE 15

Challenges: Sparse Background Data

Typical in forensics: some representative background data is available

But typically a sparse corpus

Optimal use of this background data for session variability compensation

Speaker Recognition System → Score-to-LR Transformation (calibration) → LR, with the background database supporting both stages and the Q and S material

SLIDE 16

Sparse Background Data: Research

Example: simulation using NIST SRE 2008

Rich background corpus of telephone data

Sparse background corpus of microphone data

Microphone and telephone data to be compared

Session variability compensation strategies

Joining compensation matrices

Pooling Gaussian statistics

Scaling Gaussian statistics
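The slides name "pooling" and "scaling" of Gaussian statistics without defining them; one plausible reading, sketched here purely as an assumption, combines the zeroth-order (N) and first-order (F) Baum-Welch statistics of the two corpora either by direct summation or by up-weighting the sparse corpus so it is not swamped.

```python
def pool_stats(n_tel, f_tel, n_mic, f_mic):
    """Pooling: sum the zeroth-order (N) and first-order (F)
    Baum-Welch statistics of both corpora per Gaussian component."""
    n = [a + b for a, b in zip(n_tel, n_mic)]
    f = [a + b for a, b in zip(f_tel, f_mic)]
    return n, f

def scale_stats(n_tel, f_tel, n_mic, f_mic, alpha):
    """Scaling: up-weight the sparse microphone statistics by alpha
    so they are not dominated by the much larger telephone corpus."""
    n = [a + alpha * b for a, b in zip(n_tel, n_mic)]
    f = [a + alpha * b for a, b in zip(f_tel, f_mic)]
    return n, f

# Toy 2-component statistics: the telephone corpus is ~100x larger.
n_tel, f_tel = [1000.0, 800.0], [500.0, -200.0]
n_mic, f_mic = [10.0, 8.0], [6.0, -3.0]
n_pool, f_pool = pool_stats(n_tel, f_tel, n_mic, f_mic)
n_scal, f_scal = scale_stats(n_tel, f_tel, n_mic, f_mic, alpha=100.0)
```

The combined statistics would then feed the estimation of the session-variability subspaces; the third strategy on the slide, joining compensation matrices, instead trains a subspace per corpus and concatenates them.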

SLIDE 17

Sparse Background Data: Research

Combination strategies of available data

Rich corpus: telephone data (dTel)

Small corpus, sparse microphone data (dMic3)

[Bar chart: EER (%) for train-test conditions 1conv4w–1conv4w, 1conv4w–1mic, 1mic–1conv4w and 1mic–1mic, comparing strategies U=0, dTel, dMic3, Joint, Pooling and Scaling]

SLIDE 18

Challenges: Duration Variability

Impact on session variability compensation and score normalization

Subspaces/cohorts trained with long utterances

Comparison with short utterances

Other effects

Misalignment in the scores due to duration variability

Degrades global discrimination performance

Seriously affects calibration

SLIDE 19

Challenges: Duration Variability

Impact on score normalization

Cohorts trained with fixed-length utterances

Example in Ahumada III (10 s test utterances)

EER with ZT-Norm: 12.48% with NIST SRE cohorts (roughly 150 s) vs. 10.46% with cohorts adjusted in length

Impact on session variability compensation

More difficult to avoid

Supervectors from short utterances are highly variable

More research needed

SLIDE 20

Challenges: Duration Variability

Duration variability: misalignment effects

Different ranges for different test segment durations

Even after score normalization (T-Norm)
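T-Norm standardizes each test score against the scores that the same test segment obtains versus a cohort of impostor models; fixed-length cohorts are what cause the duration misalignment described here. A minimal sketch:

```python
import math

def t_norm(raw_score, cohort_scores):
    """T-Norm: standardize a test score using the mean and standard
    deviation of the same test segment's scores against a cohort of
    impostor speaker models."""
    mu = sum(cohort_scores) / len(cohort_scores)
    var = sum((s - mu) ** 2 for s in cohort_scores) / len(cohort_scores)
    return (raw_score - mu) / math.sqrt(var)

# Toy cohort scores for one test segment against five impostor models.
cohort = [0.2, -0.1, 0.4, 0.1, -0.3]
normalized = t_norm(1.5, cohort)
```

If the cohort utterances are much longer than the test segment, the estimated mean and variance no longer match the test score's distribution, and scores from different durations land in different ranges even after normalization.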

SLIDE 21

Duration Variability: Research

Calibration incorporating duration variability

Corrects misalignments due to fixed-cohort normalizations

Improves overall discrimination performance

Score → Calibration Transformation (using duration information and training scores) → Log-Likelihood Ratio: different score alignment for different durations becomes better score alignment
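The duration-aware calibration in the diagram is not specified in detail; as a toy illustration (an assumption, not the authors' method), one can estimate from training scores how far each test-segment duration drifts from a reference duration and remove that shift before calibrating.

```python
def duration_shift(training, ref_duration):
    """Estimate, per duration, how far scores drift from the reference
    duration's mean, and return a correction function.
    `training` maps duration (s) -> list of training scores
    (e.g. T-normed non-target scores) at that duration."""
    ref_mean = sum(training[ref_duration]) / len(training[ref_duration])
    shift = {d: sum(s) / len(s) - ref_mean for d, s in training.items()}
    return lambda score, d: score - shift[d]

# Toy training scores: 10 s segments sit roughly 0.8 higher than 150 s ones.
training = {150: [-0.1, 0.1, 0.0], 10: [0.7, 0.9, 0.8]}
align = duration_shift(training, ref_duration=150)
aligned_short = align(1.0, 10)   # roughly 0.2 after removing the shift
aligned_long = align(1.0, 150)   # -> 1.0 (reference duration, no shift)
```

After this alignment, scores from different durations share a common range and a single score-to-LR calibration can be applied to all of them.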

SLIDE 22

Duration Variability: Research

Calibration incorporating duration variability

Corrects misalignments due to fixed-cohort normalizations

Improves overall discrimination performance

Exception…

SLIDE 23

Conclusions

High-performance session variability compensation

Works for NIST SRE scenarios

Works for forensic scenarios comparable to NIST

Forensic scenarios where session var. comp. degrades

Database mismatch

Sparse background data

Duration of utterances

Research directions

Predicting and compensating database mismatch

Robustness to the lack of background data

Robustness to variability in the duration of the utterances

SLIDE 24

High-Performance Session Variability Compensation in Forensic Automatic Speaker Recognition

Daniel Ramos, Javier Gonzalez-Dominguez, Eugenio Arevalo and Joaquin Gonzalez-Rodriguez

ATVS – Biometric Recognition Group, Universidad Autonoma de Madrid. daniel.ramos@uam.es, http://atvs.ii.uam.es