Speaker Verification – The present and future of voiceprint based security


SLIDE 1

APSIPA

Asia-Pacific Signal and Information Processing Association

Speaker Verification – The present and future of voiceprint based security

  • Prof. Eliathamby Ambikairajah

Head of School of Electrical Engineering & Telecommunications, University of New South Wales, Australia

21 Oct 2013

SLIDE 2

APSIPA Distinguished Lecture Series @ IIU, Malaysia

Outline

  • Introduction
  • Speaker Verification Applications
  • Speaker Verification System
  • Performance measure
  • NIST Speaker Recognition Evaluation (SRE)
  • Discussion


SLIDE 3

Introduction

  • Speech conveys several types of information:

– Linguistic: message and language information
– Paralinguistic: emotional and physiological characteristics

From a single utterance such as “How are you?”, different systems recover different information: speech recognition (the message, “How are you?”), language recognition (English), speaker recognition (the speaker, e.g. Hsing Ming), emotion recognition (happy) and accent recognition (Taiwanese). The first two are linguistic; the rest are paralinguistic.

SLIDE 4

Introduction

  • Speaker Identification: determines who is speaking, given a set of enrolled speakers
  • Speaker Verification: determines whether an unknown voice is from the claimed speaker
  • Speaker Diarization: partitions an input audio stream into homogeneous segments according to speaker identity

SLIDE 5

Identification vs. verification (block diagrams):

  • Speaker identification: features from the unknown speaker are scored against every model in the repository (Speaker 1 Model … Speaker M Model) and the best-matching speaker is returned.
  • Speaker verification: the input is scored only against the claimed speaker's model (e.g. Speaker 2 Model) and the claim is accepted or rejected.

SLIDE 6

Speaker Verification Applications - Biometrics

  • Access control, e.g. to physical facilities
  • Transaction authentication, e.g. telephone credit-card purchases

SLIDE 7

Speaker Verification System – Basic Overview

  • In automatic speaker verification:

– The front-end converts the speech signal into a more convenient representation (typically a set of feature vectors)
– The back-end compares this representation to a model of a speaker to determine how well they match

Pipeline: Speech → Feature Extraction (front-end) → Classification against a Speaker Model → Decision Making → Accept/Reject (back-end)

SLIDE 8

Speaker Verification System – Verification Example

UBM: a general, speaker-independent model against which a person-specific model is compared when making an accept or reject decision.

A claimant says “I am John”. Feature extraction produces a cepstral feature vector (c0, c1, c2, …, cn) per frame. The system determines the level of match against John's model (likelihood of John) and against the UBM (likelihood of a generic male); decision making is based on the likelihood ratio, and in this example the outcome is NOT JOHN. Universal Background Models (generic male, generic female) and speaker models (Speaker 1 Model, John's Model) are held in model repositories.

SLIDE 9

Speaker Verification System – Speaker Enrolment

Enrolment proceeds in two steps:

Step 1 – Creating a male UBM: feature extraction from background male speaker data (Speaker 1 … Speaker N) followed by model training yields the generic male Universal Background Model (a generic female UBM is built likewise).

Step 2 – Creating male speaker-specific models: feature extraction from each target male speaker (x1 … xM) followed by model adaptation of the UBM yields the speaker models (Speaker x1 Model … Speaker xM Model).

SLIDE 10

Detailed Speaker Verification System

Pipeline: Speech → Feature Extraction → Feature Normalisation → Speaker Modelling → Model Normalisation → Classification (Scoring) → Score Normalisation → Decision Making → Accept/Reject

  • Feature normalisation: Cepstral Mean Subtraction (CMS), RelAtive SpecTrAl (RASTA), Feature Warping, Feature Mapping
  • Model normalisation: Nuisance Attribute Projection (NAP), Joint Factor Analysis (JFA), i-vectors, Within-Class Covariance Normalisation (WCCN), Linear Discriminant Analysis (LDA), Probabilistic Linear Discriminant Analysis (PLDA)
  • Score normalisation: Zero-normalisation (Z-norm), Test-normalisation (T-norm)

SLIDE 11

Front-end: Feature Extraction

The speech signal is divided into short frames (Frame 1 … Frame N, e.g. 25 ms each). Each frame is windowed and passed through feature extraction, yielding a cepstral feature vector (c0, c1, c2, …, cn) per frame, which is then feature-normalised. (Figure: the distribution of c0 before and after normalisation.)
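The framing and normalisation steps above can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the 25 ms frame length is from the slide, while the 10 ms frame shift, the Hamming window and plain cepstral mean subtraction are common assumptions.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=25.0, shift_ms=10.0):
    """Split a signal into overlapping Hamming-windowed frames
    (25 ms frames as on the slide; the 10 ms shift is an assumption)."""
    flen = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    n_frames = 1 + (len(x) - flen) // shift
    win = np.hamming(flen)
    return np.stack([x[i * shift : i * shift + flen] * win
                     for i in range(n_frames)])

def cms(features):
    """Cepstral Mean Subtraction: remove each coefficient's mean over
    the utterance, a simple form of feature normalisation."""
    return features - features.mean(axis=0, keepdims=True)
```

With an 8 kHz signal, one second of audio gives 98 frames of 200 samples; after CMS each cepstral coefficient has zero mean over the utterance.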

SLIDE 12

Temporal Derivative

c0 c1 cn c2 c0 c1 cn c2 c0 c1 cn c2

Normalised Feature vectors (Frame 1) Normalised Feature vectors (Frame 2) Normalised Feature vectors (Frame P)

d0d1 dn d2

Delta Feature vectors (Frame 1) Delta Feature vectors (Frame 2) Delta Feature vectors (Frame P)

d0d1 dn d2 d0d1 dn d2

Temporal Derivative

a0a1 an a2

Acceleration Feature vectors (Frame 1) Acceleration Feature vectors (Frame 2) Acceleration Feature vectors (Frame P)

a0a1 an a2 a0a1 an a2

c0c1 cn c2

d0d1 dn d2

a0a1 an a2

Frame 1 Features: (e.g: 39 dimensions)
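A sketch of the delta computation, using the standard regression formula over a ±K frame window (K = 2 and edge-frame repetition are common assumptions, not stated on the slide):

```python
import numpy as np

def deltas(feats, K=2):
    """Temporal derivative via the regression formula
    d_t = sum_k k*(c_{t+k} - c_{t-k}) / (2*sum_k k^2) over a +/-K window;
    edges are handled by repeating the first/last frame.
    feats: (P, D) array of per-frame features."""
    P = len(feats)
    padded = np.pad(feats, ((K, K), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, K + 1))
    out = np.zeros_like(feats, dtype=float)
    for k in range(1, K + 1):
        out += k * (padded[K + k : K + k + P] - padded[K - k : K - k + P])
    return out / denom

# static -> delta -> acceleration, concatenated per frame:
# full = np.hstack([c, deltas(c), deltas(deltas(c))])   # (P, 3*D)
```

Applying `deltas` twice gives the acceleration features; stacking all three blocks turns 13 static coefficients into the 39-dimensional vector mentioned on the slide.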

SLIDE 13

Detailed Speaker Verification System

(Recap: the same pipeline and normalisation techniques shown on Slide 10.)

SLIDE 14

Speaker Modelling

  • The probability density function is approximated by a 3-component Gaussian mixture model (GMM)
  • Each Gaussian component consists of a mean (µ), a covariance (Σ) and a weight (w); all weights must sum to 1

(Figure: the overall PDF is the sum of Weighted Gaussians 1–3; feature-space modelling and the resulting probability distribution are shown over Dimension 1 (C0) and Dimension 2 (C1).)
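Evaluating such a mixture density can be sketched as follows. This assumes diagonal covariances (a common choice in speaker verification, though the slide does not specify the covariance structure):

```python
import numpy as np

def gmm_logpdf(x, weights, means, variances):
    """Log-density of a diagonal-covariance GMM evaluated at frames x
    (N x D). weights must sum to 1; each component contributes
    w_m * N(x; mu_m, Sigma_m), summed in the log domain for stability."""
    x = np.atleast_2d(np.asarray(x, float))
    comp = []
    for w, mu, var in zip(weights, means, variances):
        ll = -0.5 * (np.sum(np.log(2 * np.pi * np.asarray(var, float)))
                     + np.sum((x - mu) ** 2 / var, axis=1))
        comp.append(np.log(w) + ll)
    return np.logaddexp.reduce(np.stack(comp), axis=0)
```

A single standard-normal component recovers the familiar Gaussian log-density, and splitting it into two identical half-weight components leaves the density unchanged, illustrating why the weights must sum to 1.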

SLIDE 15

Database for creating UBM (example)

  • Training set

– 56 male speakers (2 minutes of active speech each) for creating the UBM

  • Target set

– 20 male speakers (2 minutes of active speech each) for the speaker-specific models

  • Test set

– 250 male utterances (each speaker has many test utterances) with known identities

SLIDE 16

  • The Universal Background Model (UBM) consists of 1024 Gaussian mixture components; the target speaker model likewise consists of 1024 components
  • Each Gaussian mixture component consists of a mean (µ), covariance (Σ) and weight (w)

(Figure: target speaker data adapts UBM components 1, 2, …, 998, …, 1024 towards the target model over Feature Dimensions 1 and 2; illustrative values show one component's weight moving from 0.2 to 0.3, its mean from 0.8 to 0.9 and its covariance from 0.5 to 0.9.)
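The adaptation step can be sketched for the means alone. Mean-only MAP adaptation with a relevance factor of 16 is a common configuration in GMM-UBM systems, assumed here rather than taken from the slide:

```python
import numpy as np

def map_adapt_means(ubm_means, responsibilities, data, r=16.0):
    """Mean-only MAP adaptation of UBM component means towards a target
    speaker. ubm_means: (M, D); responsibilities: (T, M) per-frame
    component posteriors under the UBM; data: (T, D) target features;
    r: relevance factor (r=16 is a common assumption)."""
    n = responsibilities.sum(axis=0)                  # soft count per component
    ex = responsibilities.T @ data / np.maximum(n, 1e-10)[:, None]
    alpha = (n / (n + r))[:, None]                    # data-dependent adaptation weight
    return alpha * ex + (1.0 - alpha) * ubm_means     # interpolate UBM towards the data
```

Components that see a lot of target data move almost all the way to the data mean, while components with no responsibility stay at the UBM mean, which matches the picture of only some mixtures shifting towards the target speaker.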

SLIDE 17

Representing GMMs

  • The UBM and each speaker model is a GMM
  • Each mixture component (Mixture 1 … Mixture 1024) is represented by a 1×1 weight, a 39×1 mean vector and a 39×1 covariance vector
  • Stacked across components, each GMM is therefore represented by a 1×1024 weight vector, a 39×1024 means matrix and a 39×1024 covariances matrix

SLIDE 18

Decision Making

Score:

L = log [ (likelihood S came from the speaker model) / (likelihood S did not come from the speaker model) ]

i.e. the log-ratio of the likelihood of utterance S under the claimed speaker's model (e.g. John's model) to its likelihood under the Universal Background Model (e.g. generic male). The claim is accepted if L ≥ θ and rejected otherwise, for a decision threshold θ.
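The log-likelihood-ratio decision can be sketched as below. For brevity a single diagonal Gaussian stands in for each GMM likelihood (the real system would use the 1024-component models described earlier); the threshold θ = 0 is an arbitrary illustrative choice:

```python
import numpy as np

def avg_loglik(x, mu, var):
    """Average per-frame log-likelihood under a diagonal Gaussian
    (a single-component stand-in for the GMM likelihoods)."""
    x = np.atleast_2d(x)
    return float(np.mean(-0.5 * np.sum(np.log(2 * np.pi * var)
                                       + (x - mu) ** 2 / var, axis=1)))

def verify(x, spk_mu, spk_var, ubm_mu, ubm_var, theta=0.0):
    """Score L = log p(S|speaker) - log p(S|UBM); accept iff L >= theta."""
    L = avg_loglik(x, spk_mu, spk_var) - avg_loglik(x, ubm_mu, ubm_var)
    return L, L >= theta
```

Frames near the speaker's mean give a positive L (accept); swapping the two models flips the sign and the decision.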

SLIDE 19

Score Normalisation

Multiple systems (Speaker Verification System 1 … System N) perform speaker verification on the same speech in parallel.

SLIDE 20

Score Normalisation

Each system produces a raw score (Score 1 … Score N). These raw scores may not fall in the same range, i.e. they are NOT directly comparable.

SLIDE 21

Score Normalisation

Passing each raw score through score normalisation yields Normalised Score 1 … Normalised Score N; the normalised scores (L) are comparable.
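Z-norm, one of the score-normalisation techniques listed earlier, can be sketched as a simple standardisation. The impostor-score statistics would in practice be estimated offline per target model; here they are passed in directly:

```python
import numpy as np

def z_norm(raw_score, impostor_scores):
    """Zero-normalisation (Z-norm): standardise a trial score using the
    mean and standard deviation of impostor scores against the same
    target model, so scores from different systems become comparable.
    (T-norm is analogous, using a cohort of models at test time.)"""
    imp = np.asarray(impostor_scores, float)
    return float((raw_score - imp.mean()) / imp.std())
```

After Z-norm, a score is expressed in standard deviations above the impostor mean, the same scale for every system.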

SLIDE 22

Fusion

The final score is a weighted sum of the normalised scores s1 … sN from each system:

Final Score = w1·s1 + w2·s2 + … + wN·sN
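The weighted sum above is a one-liner; the fusion weights are assumed given (in practice they would be tuned on a held-out development set, which the slide does not cover):

```python
import numpy as np

def fuse(normalised_scores, weights):
    """Linear score fusion: Final Score = w1*s1 + ... + wN*sN,
    combining the N systems' normalised scores."""
    return float(np.dot(weights, normalised_scores))
```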

SLIDE 23

Performance measure

  • Types of error:

– Misses: a valid identity is rejected

  • Probability of miss: ratio of the number of falsely rejected speaker tests to the total number of true-speaker trials

– False alarms: an invalid identity is accepted

  • Probability of false alarm: ratio of the number of falsely accepted impostor tests to the total number of impostor trials

                  ACCEPT CLAIM        REJECT CLAIM
  TRUE SPEAKER    correct decision    miss
  IMPOSTOR        false acceptance    correct decision
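The two error probabilities defined above can be computed directly from trial scores and labels; this is a minimal sketch with an explicit threshold argument:

```python
import numpy as np

def error_rates(scores, is_target, theta):
    """Miss and false-alarm probabilities at decision threshold theta.
    is_target: True for true-speaker trials, False for impostor trials."""
    scores = np.asarray(scores, float)
    is_target = np.asarray(is_target, bool)
    p_miss = float(np.mean(scores[is_target] < theta))   # valid identity rejected
    p_fa = float(np.mean(scores[~is_target] >= theta))   # invalid identity accepted
    return p_miss, p_fa
```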

SLIDE 24

Performance measure – Detection Error Trade-off (DET) curve

The DET curve plots the miss rate (in %) against the false acceptance rate (in %); each point on the curve corresponds to a different decision threshold θ.

SLIDE 25

Performance measure – Detection Error Trade-off (DET) curve (continued)

The application operating point on the DET curve depends on the relative costs of the two error types (example shown: Equal Error Rate (EER) = 1%):

  • High security (e.g. wire transfer): false acceptance is very costly, and users may tolerate rejections for security
  • High convenience (e.g. customisation): false rejections alienate customers, while any customisation is beneficial
  • Or a balanced operating point between the two

(Axes: false acceptance rate in % vs. miss rate in %.)
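The EER mentioned above is the point where the two error rates coincide. A simple threshold sweep over the observed scores approximates it (real evaluations interpolate the DET curve; this sketch just returns the average of the two rates at their closest point):

```python
import numpy as np

def eer(scores, is_target):
    """Equal Error Rate: sweep the threshold over the observed scores and
    return (p_miss + p_fa) / 2 where the two error rates are closest,
    approximating the DET-curve crossover."""
    scores = np.asarray(scores, float)
    is_target = np.asarray(is_target, bool)
    best_gap, best_rate = np.inf, 1.0
    for theta in np.sort(scores):
        p_miss = np.mean(scores[is_target] < theta)
        p_fa = np.mean(scores[~is_target] >= theta)
        if abs(p_miss - p_fa) < best_gap:
            best_gap = abs(p_miss - p_fa)
            best_rate = (p_miss + p_fa) / 2
    return float(best_rate)
```

Perfectly separated scores give an EER of 0; one impostor scoring as high as a target pushes the EER up.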

SLIDE 26

NIST Speaker Recognition Evaluation (SRE)

  • Ongoing text-independent speaker recognition evaluations conducted by NIST (http://www.itl.nist.gov/iad/mig/tests/spk/)

– A driving force in advancing the state of the art
– Conditions for different amounts of data:

  • 10 seconds
  • 3–5 minutes
  • 8 minutes
  • Separate-channel and summed-channel conditions

– English speakers, non-English speakers, multilingual speakers

SLIDE 27

NIST SRE Trends

  • 1996 – First SRE in the current series
  • 2000 – AHUMADA Spanish data, first non-English speech
  • 2001 – Cellular data, Automatic Speech Recognition (ASR) transcripts provided
  • 2005 – Multiple languages with bilingual speakers, room-microphone recordings, cross-channel trials
  • 2008 – Interview data
  • 2010 – High and low vocal effort, aging, HASR (Human-Assisted Speaker Recognition) evaluation
  • 2012 – Broad range of test conditions, with added noise and reverberation; target speakers defined beforehand

SLIDE 28

Basic System

Pipeline: Speech → Feature Extraction → MAP adaptation (of the UBM) → Log-likelihood scoring

SLIDE 29

Trends

  • Around 2004: classification with Support Vector Machines (SVMs)

Pipeline: Speech → Feature Extraction → MAP adaptation (UBM) → Extract Supervectors → SVM Scoring

SLIDE 30

Trends

  • Around 2005: channel compensation with Nuisance Attribute Projection (NAP)

Pipeline: Speech → Feature Extraction → MAP adaptation (UBM) → Extract Supervectors → NAP → SVM Scoring

SLIDE 31

Trends

  • Around 2007: channel compensation with Joint Factor Analysis (JFA)

Pipeline: Speech → Feature Extraction → Baum-Welch statistics estimation → Factor analysis (UBM; JFA hyperparameters V, U, D) → Speaker factor extraction → WCCN → SVM

SLIDE 32

Trends

  • Around 2009: channel compensation with i-vectors

Pipeline: Speech → Feature Extraction → Baum-Welch statistics estimation → Factor analysis / i-vector extraction (UBM; Total Variability Matrix) → WCCN → LDA → Cosine Distance Scoring
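The final cosine distance scoring step is simply the cosine of the angle between the target and test i-vectors (after the WCCN/LDA compensation stages), compared against a threshold:

```python
import numpy as np

def cosine_score(target_ivec, test_ivec):
    """Cosine distance scoring: the cosine of the angle between the
    (channel-compensated) target and test i-vectors; higher means a
    better match, with a threshold deciding accept/reject."""
    t = np.asarray(target_ivec, float)
    s = np.asarray(test_ivec, float)
    return float(t @ s / (np.linalg.norm(t) * np.linalg.norm(s)))
```

Because only the angle matters, the score is invariant to the length of either i-vector.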

SLIDE 33

Trends

  • Around 2009: i-vectors scored with Probabilistic Linear Discriminant Analysis (PLDA)

Pipeline: Speech → Feature Extraction → Baum-Welch statistics estimation → Factor analysis / i-vector extraction (UBM; Total Variability Matrix) → PLDA → Log-likelihood scoring
