Measuring Reliability in Forensic Voice Comparison Geoffrey Stewart - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Measuring Reliability in Forensic Voice Comparison

Geoffrey Stewart Morrison Julien Epps Philip Rose Tharmarajah Thiruvaran Cuiling Zhang

3aSC6

SCHOOL OF LANGUAGE STUDIES

China Criminal Police University

Special Session on Forensic Voice Comparison and Forensic Acoustics @ 2nd Pan-American/Iberian Meeting on Acoustics, Cancún, México, 15–19 November, 2010 http://cancun2010.forensic-voice-comparison.net

slide-2
SLIDE 2

Validity and Reliability (Accuracy and Precision)

slide-3
SLIDE 3

[Figure: four panels, each showing the true value and the mean of the measurements: poor accuracy and poor precision; good accuracy and poor precision; poor accuracy and good precision; good accuracy and good precision]

slide-4
SLIDE 4

Validity and Reliability in Forensic Science

The National Research Council report to Congress on Strengthening Forensic Science in the United States (2009) urged that procedures be adopted which include:

  • “quantifiable measures of the reliability and accuracy of forensic analyses” (p. 23)
  • “the reporting of a measurement with an interval that has a high probability of containing the true value” (p. 121)
  • “the conducting of validation studies of the performance of a forensic procedure” (p. 121)

slide-5
SLIDE 5

Testing the Validity of a Forensic-Comparison System

slide-6
SLIDE 6

Measuring Validity

  • Test set consisting of a large number of pairs known to be same origin and a large number of pairs known to be different origin
  • Use forensic-comparison system to calculate LR for each pair
  • Compare output with knowledge about input

slide-7
SLIDE 7

Measuring Validity

Correct-classification / classification-error rate is not appropriate

  – based on posterior probabilities
  – hard threshold rather than gradient

                   decision: same          decision: different
fact: same         correct acceptance      incorrect rejection
fact: different    incorrect acceptance    correct rejection

slide-8
SLIDE 8

Measuring Validity

  • Goodness is extent to which LRs from same-origin pairs > 1, and LRs from different-origin pairs < 1
  • A metric which captures the gradient goodness of a set of likelihood ratios derived from test data is the log-likelihood-ratio cost, Cllr

slide-9
SLIDE 9

Measuring Validity

  • Goodness is extent to which LRs from same-origin pairs > 1, and LRs from different-origin pairs < 1
  • Goodness is extent to which log(LR)s from same-origin pairs > 0, and log(LR)s from different-origin pairs < 0

LR           1/1000   1/100   1/10   1    10   100   1000
log10(LR)    −3       −2      −1     0    +1   +2    +3

slide-10
SLIDE 10
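
$$
C_{llr} = \frac{1}{2}\left(\frac{1}{N_{ss}}\sum_{i=1}^{N_{ss}}\log_2\left(1+\frac{1}{LR_{ss_i}}\right)+\frac{1}{N_{ds}}\sum_{j=1}^{N_{ds}}\log_2\left(1+LR_{ds_j}\right)\right)
$$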
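The log-likelihood-ratio cost averages log2(1 + 1/LR) over the same-origin test pairs and log2(1 + LR) over the different-origin test pairs, then halves the sum. A minimal Python sketch of that calculation follows; the function name and the example likelihood ratios are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def cllr(same_origin_lrs: Sequence[float], different_origin_lrs: Sequence[float]) -> float:
    """Log-likelihood-ratio cost for a set of test likelihood ratios."""
    if not same_origin_lrs or not different_origin_lrs:
        raise ValueError("need at least one LR of each kind")
    # Same-origin pairs are penalised when the LR is small (1/LR is large).
    ss_term = sum(math.log2(1.0 + 1.0 / lr) for lr in same_origin_lrs) / len(same_origin_lrs)
    # Different-origin pairs are penalised when the LR is large.
    ds_term = sum(math.log2(1.0 + lr) for lr in different_origin_lrs) / len(different_origin_lrs)
    return 0.5 * (ss_term + ds_term)


# Hypothetical likelihood ratios, for illustration only.
uninformative = cllr([1.0, 1.0], [1.0, 1.0])        # every LR equals 1
well_behaved = cllr([100.0, 1000.0], [0.01, 0.001])  # LRs point the right way
```

A system that always outputs LR = 1 scores a Cllr of 1, and the value approaches 0 as same-origin LRs grow and different-origin LRs shrink, so lower is better.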

slide-11
SLIDE 11

[Figure: plot of Cllr against log10 likelihood ratio]

slide-12
SLIDE 12

Example of Testing the Validity of Forensic-Comparison Systems

slide-13
SLIDE 13

System and Data (Morrison, 2011)

  • Acoustic-phonetic systems:
    – dual-target: “initial target” and “final target” in tokens
    – trajectory: coefficient values of cubic polynomial fitted to formant trajectories of tokens
    – Aitken & Lucy (2004) MVKD
    – logistic-regression calibration
  • Database:
    – 25 male Australian English speakers
    – two non-contemporaneous recordings (24 tokens / recording)
    – cross-validation

[Figure: frequency (Hz) against time (s)]

slide-14
SLIDE 14

Results

  • dual-target: Cllr = 0.43
  • trajectory: Cllr = 0.10

[Figure: cumulative proportion against log10 likelihood ratio]

slide-15
SLIDE 15

Testing the Reliability of a Forensic-Comparison System

slide-16
SLIDE 16

Measuring Reliability

Imagine that we have four recordings (A, B, C, D) of each speaker

  • There are two non-overlapping pairs for each same-speaker comparison and four non-overlapping pairs for each different-speaker comparison
  • These are statistically independent and can be used to estimate a 95% credible interval (CI)

slide-17
SLIDE 17

Measuring Reliability

Two non-overlapping pairs for each same-speaker comparison

suspect recording    offender recording
001 A                001 B
001 C                001 D
002 A                002 B
002 C                002 D
:                    :

slide-18
SLIDE 18

Measuring Reliability

Four non-overlapping pairs for each different-speaker comparison

suspect recording    offender recording
001 A                002 B
001 C                002 D
001 A                003 B
001 C                003 D
:                    :
002 A                001 B
002 C                001 D
:                    :
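The pairing scheme on these two slides can be sketched in Python as follows. This is only an illustration of the enumeration: the recording labels A to D come from the slides, while the function names, the tuple layout, and the three-speaker example are assumptions.

```python
from itertools import combinations
from typing import List, Tuple

Recording = Tuple[str, str]          # (speaker id, recording label)
Pair = Tuple[Recording, Recording]   # (suspect recording, offender recording)


def same_speaker_pairs(speaker: str) -> List[Pair]:
    """Two pairs that share no recording: A with B, and C with D."""
    return [((speaker, "A"), (speaker, "B")), ((speaker, "C"), (speaker, "D"))]


def different_speaker_pairs(s: str, t: str) -> List[Pair]:
    """Four pairs that share no recording, two in each suspect/offender direction."""
    return [
        ((s, "A"), (t, "B")),
        ((s, "C"), (t, "D")),
        ((t, "A"), (s, "B")),
        ((t, "C"), (s, "D")),
    ]


# Hypothetical three-speaker test set.
speakers = ["001", "002", "003"]
ss_pairs = {spk: same_speaker_pairs(spk) for spk in speakers}
ds_pairs = {(s, t): different_speaker_pairs(s, t) for s, t in combinations(speakers, 2)}
```

Because no recording is reused within a comparison, the likelihood ratios in each group can be treated as independent estimates for that comparison.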

slide-19
SLIDE 19

Measuring Reliability

log(LR) →

slide-20
SLIDE 20

Measuring Reliability

[Figure: log(LR) axis with the mean of each group marked]

slide-21
SLIDE 21

Measuring Reliability

[Figure: deviation from mean, shown along the log(LR) axis]

slide-22
SLIDE 22

Measuring Reliability

  • non-parametric (heteroscedastic)

[Figure: deviation from mean, shown along the log(LR) axis, with 5% and 95% marked]

slide-23
SLIDE 23

Measuring Reliability

  • non-parametric (heteroscedastic)

[Figure: absolute deviation from mean, shown along the log(LR) axis, with 5% and 95% marked]

slide-24
SLIDE 24

Measuring Reliability

  • non-parametric (heteroscedastic)
    – local linear regression

[Figure: absolute deviation from mean log10(LR) against mean log10(LR)]

slide-25
SLIDE 25

Measuring Reliability

  • non-parametric (heteroscedastic)
    – local linear regression

[Figure: deviation from mean against mean log10(LR)]

slide-26
SLIDE 26

Measuring Reliability

  • parametric (homoscedastic)
    – pooled variance
    – assume uniform priors
    – t distribution

[Figure: deviation from mean, with 5% and 95% marked]
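The pooled-variance idea behind the parametric estimate can be sketched in Python. This is a simplified illustration under stated assumptions, not necessarily the authors' exact procedure: the deviations of each group of log10(LR) values from that group's mean are pooled into a single variance, and the interval half-width is taken as a t critical value times the pooled standard deviation. The function names and data are hypothetical, and the t critical value is supplied by the caller (for example from a t table) because the Python standard library has no t quantile function.

```python
import math
from statistics import mean
from typing import Mapping, Sequence


def pooled_variance(groups: Mapping[str, Sequence[float]]) -> float:
    """Pool squared deviations from each group's own mean (homoscedastic assumption)."""
    sum_sq = 0.0
    dof = 0
    for values in groups.values():
        centre = mean(values)
        sum_sq += sum((v - centre) ** 2 for v in values)
        dof += len(values) - 1  # one degree of freedom lost per group mean
    if dof <= 0:
        raise ValueError("need at least one group with two or more values")
    return sum_sq / dof


def interval_half_width(groups: Mapping[str, Sequence[float]], t_critical: float) -> float:
    """Assumed form of the interval: t critical value times pooled standard deviation."""
    return t_critical * math.sqrt(pooled_variance(groups))


# Hypothetical log10(LR) values, two per comparison, for illustration only.
log_lrs = {"cmp 1": [1.0, 3.0], "cmp 2": [2.0, 2.0], "cmp 3": [0.0, 4.0]}
variance = pooled_variance(log_lrs)           # 10 / 3, with 3 degrees of freedom
half_width = interval_half_width(log_lrs, 3.182)  # 3.182: two-sided 95% t value for 3 d.o.f.
```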

slide-27
SLIDE 27

Example of Testing the Validity and Reliability of a Forensic-Comparison System

slide-28
SLIDE 28

System and Data (Morrison, Thiruvaran, Epps, 2010)

  • Automatic system:
    – 16 MFCCs (20 ms window, 10 ms overlap) + deltas
    – cumulative density mapping
    – 512 mixture GMM-UBM
    – logistic-regression calibration
  • Databases:
    – Background: 800 recordings from NIST SRE 2004
    – Calibration: 2 recordings of each of 32 speakers from NIST SRE 2008 8conv
    – Test: 4 recordings of each of 100 speakers from NIST SRE 2008 8conv

slide-29
SLIDE 29

Results

  • 40 s of speech per offender recording in the test set
    – Cllr = 0.150
    – 95% CI (parametric) = ±1.63 log10(LR)
  • 20 s of speech per offender recording in the test set
    – Cllr = 0.150
    – 95% CI (parametric) = ±1.69 log10(LR)

slide-30
SLIDE 30

Results

40 s of speech per offender recording in the test set

[Figure: cumulative proportion against log10(LR)]

slide-31
SLIDE 31

Results

20 s of speech per offender recording in the test set

[Figure: cumulative proportion against log10(LR)]

slide-32
SLIDE 32

Summation

If the background and test data were consistent with the conditions in a case at trial, and the comparison of the known- and questioned-voice samples resulted in a likelihood ratio of, say, 100 (log10(LR) of +2), then the non-parametric 95% CI estimate would be ±1.17 log10(LR), and the forensic scientist could make a statement of the following sort:

slide-33
SLIDE 33

Based on my evaluation of the evidence, I have calculated that one would be 100 times more likely to obtain the acoustic differences between the voice samples if the questioned-voice sample had been produced by the accused than if it had been produced by someone other than the accused.

slide-34
SLIDE 34

What this means is that whatever you believed before this evidence was presented, you should now be 100 times more likely than before to believe that the voice on the questioned-voice recording is that of the accused.
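This statement corresponds to the odds form of Bayes' theorem: posterior odds = prior odds × likelihood ratio. The Python sketch below works through one case. The prior odds of 1 to 1000 are a hypothetical value chosen for illustration, and the function names are assumptions. Note that the factor of 100 applies to the odds, not directly to the probability.

```python
def posterior_odds(prior_odds: float, likelihood_ratio: float) -> float:
    """Odds form of Bayes' theorem."""
    return prior_odds * likelihood_ratio


def odds_to_probability(odds: float) -> float:
    """Convert odds in favour of a proposition to a probability."""
    return odds / (1.0 + odds)


# Hypothetical prior odds of 1 to 1000 that the accused is the speaker.
prior = 1.0 / 1000.0
post = posterior_odds(prior, 100.0)   # odds become 1 to 10
prob = odds_to_probability(post)      # probability of 1/11
```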

slide-35
SLIDE 35

Based on my calculations, I am 95% certain that the acoustic differences are at least 7 times more likely and not more than 1450 times more likely if the questioned-voice sample had been produced by the accused than if it had been produced by someone other than the accused.
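The bounds in this statement come from converting an interval on the log10(LR) scale back to likelihood ratios. A short Python sketch of that conversion, using the log10(LR) of +2 and the ±1.17 interval given in the Summation slide; the function name is hypothetical.

```python
from typing import Tuple


def lr_bounds(log10_lr: float, half_width: float) -> Tuple[float, float]:
    """Convert log10(LR) ± half_width into lower and upper likelihood ratios."""
    return 10.0 ** (log10_lr - half_width), 10.0 ** (log10_lr + half_width)


lower, upper = lr_bounds(2.0, 1.17)
```

Here `lower` is a little under 7 and `upper` is a little under 1480, close to the rounded 7 and 1450 quoted in the statement. The small difference in the upper bound may reflect rounding of the interval width.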

slide-36
SLIDE 36

Latest Thoughts on Measuring the Reliability of a Forensic Comparison System

slide-37
SLIDE 37

Measuring Reliability

  • In a trial the offender sample is fixed, and precision should be measured given this fixed sample
  • Imagine that we have four recordings (A, B, C, D) of each speaker in our test database, and that these are matched to the conditions of the suspect recording from the trial
  • Use each recording to build four suspect models for each test speaker
  • Calculate likelihood ratios using each suspect model and the fixed offender sample
  • Use these likelihood ratios to calculate the precision of the system given the fixed offender sample

slide-38
SLIDE 38

Measuring Reliability

Suspect models from test database compared to fixed offender data from trial

suspect recording    offender recording
001 A                trial
001 B                trial
001 C                trial
001 D                trial
002 A                trial
002 B                trial
002 C                trial
002 D                trial
:                    :
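The fixed-offender procedure can be sketched in Python as a loop over test speakers. This is an illustration only: `log10_lr` stands in for whatever forensic-comparison system is being tested, and it, the function name, and the toy scores are all hypothetical. Each speaker's suspect recordings are scored against the one fixed offender sample, and the deviations from that speaker's mean are collected as raw material for a precision estimate.

```python
from statistics import mean
from typing import Callable, Dict, List, Mapping, Sequence


def deviations_given_fixed_offender(
    suspect_recordings: Mapping[str, Sequence[str]],
    offender_sample: str,
    log10_lr: Callable[[str, str], float],
) -> Dict[str, List[float]]:
    """Per-speaker deviations from the mean log10(LR) against one fixed offender sample."""
    deviations: Dict[str, List[float]] = {}
    for speaker, recordings in suspect_recordings.items():
        # One suspect model per recording, each compared with the same offender sample.
        scores = [log10_lr(rec, offender_sample) for rec in recordings]
        centre = mean(scores)
        deviations[speaker] = [score - centre for score in scores]
    return deviations


# Toy stand-in for a real system: fixed, made-up log10(LR) values per suspect recording.
toy_scores = {"001 A": 1.0, "001 B": 2.0, "001 C": 3.0, "001 D": 2.0,
              "002 A": -3.0, "002 B": -1.0, "002 C": -2.0, "002 D": -2.0}
test_db = {"001": ["001 A", "001 B", "001 C", "001 D"],
           "002": ["002 A", "002 B", "002 C", "002 D"]}
devs = deviations_given_fixed_offender(test_db, "trial", lambda rec, off: toy_scores[rec])
```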

slide-39
SLIDE 39

Conclusion

slide-40
SLIDE 40

Conclusion

  • At admissibility hearing (Daubert), must supply judge with all relevant information about system performance (validity & reliability a.k.a. accuracy & precision)
  • Must take account of the speaker level as well as the recording level (akin to activity and source levels)
    – Intrinsic variability of voice data (cf. DNA profiles)
    – Limited data for suspect models – underestimating within-speaker variability
    – Limited offender data
  • Not to present information about the precision of the system would be to mislead the trier of fact

slide-41
SLIDE 41

Thank You

http://geoff-morrison.net
http://forensic-voice-comparison.net
http://forensic.unsw.edu.au