Measuring Reliability in Forensic Voice Comparison

3aSC6 Measuring Reliability in Forensic Voice Comparison
Geoffrey Stewart Morrison, Julien Epps, Philip Rose, Tharmarajah Thiruvaran, Cuiling Zhang
School of Language Studies; China Criminal Police University
Special Session on Forensic Voice Comparison and Forensic Acoustics @ 2nd Pan-American/Iberian Meeting on Acoustics, Cancún, México, 15–19 November, 2010
http://cancun2010.forensic-voice-comparison.net

Validity and Reliability (Accuracy and Precision)

Figure: four panels showing measurements and their mean relative to the true value: poor accuracy, poor precision; good accuracy, poor precision; poor accuracy, good precision; good accuracy, good precision.

Validity and Reliability in Forensic Science
• The National Research Council report to Congress on Strengthening Forensic Science in the United States (2009) urged that procedures be adopted which include:
– “quantifiable measures of the reliability and accuracy of forensic analyses” (p. 23)
– “the reporting of a measurement with an interval that has a high probability of containing the true value” (p. 121)
– “the conducting of validation studies of the performance of a forensic procedure” (p. 121)

Testing the Validity of a Forensic-Comparison System

Measuring Validity
• Test set consisting of a large number of pairs known to be same origin and a large number of pairs known to be different origin
• Use forensic-comparison system to calculate LR for each pair
• Compare output with knowledge about input

Measuring Validity
• Correct-classification / classification-error rate is not appropriate
– based on posterior probabilities
– hard threshold rather than gradient

                  decision: same         decision: different
fact: same        correct acceptance     incorrect rejection
fact: different   incorrect acceptance   correct rejection

Measuring Validity
• Goodness is extent to which LRs from same-origin pairs > 1, and LRs from different-origin pairs < 1
• A metric which captures the gradient goodness of a set of likelihood ratios derived from test data is the log-likelihood-ratio cost, C_llr

Measuring Validity
• Goodness is extent to which LRs from same-origin pairs > 1, and LRs from different-origin pairs < 1
• Goodness is extent to which log(LR)s from same-origin pairs > 0, and log(LR)s from different-origin pairs < 0

LR          1/1000   1/100   1/10   1   10   100   1000
log10(LR)   -3       -2      -1     0   +1   +2    +3

$$C_{\mathrm{llr}} = \frac{1}{2}\left(\frac{1}{N_{ss}}\sum_{i=1}^{N_{ss}}\log_2\left(1+\frac{1}{LR_{ss_i}}\right) + \frac{1}{N_{ds}}\sum_{j=1}^{N_{ds}}\log_2\left(1+LR_{ds_j}\right)\right)$$

Figure: C_llr penalty (y-axis, 1 to 9) against log10 likelihood ratio (x-axis, -3 to +3).
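The log-likelihood-ratio cost can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it takes C_llr to be half the sum of the mean of log2(1 + 1/LR) over same-origin pairs and the mean of log2(1 + LR) over different-origin pairs. The function name `cllr` and the example likelihood ratios are assumptions made up for the illustration.

```python
import math


def cllr(same_origin_lrs, different_origin_lrs):
    """Log-likelihood-ratio cost for two lists of likelihood ratios (not logs)."""
    # Same-origin pairs are penalised when their LR is small.
    ss_penalty = sum(math.log2(1 + 1 / lr) for lr in same_origin_lrs) / len(same_origin_lrs)
    # Different-origin pairs are penalised when their LR is large.
    ds_penalty = sum(math.log2(1 + lr) for lr in different_origin_lrs) / len(different_origin_lrs)
    return 0.5 * (ss_penalty + ds_penalty)


# A system that always outputs LR = 1 gives no information.
uninformative = cllr([1.0, 1.0], [1.0, 1.0])

# Hypothetical output that points strongly in the right direction.
good = cllr([100.0, 1000.0], [0.01, 0.001])

# The same values assigned to the wrong pairs: strongly misleading output.
misleading = cllr([0.01, 0.001], [100.0, 1000.0])
```

Because the penalty is gradient rather than thresholded, the misleading set is penalised far more heavily than the uninformative one, which is the property a classification-error rate cannot capture.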

Example of Testing the Validity of Forensic-Comparison Systems

System and Data (Morrison, 2011)
• Acoustic-phonetic systems:
– dual-target: “initial target” and “final target” in tokens
– trajectory: coefficient values of cubic polynomial fitted to formant trajectories of tokens
– Aitken & Lucy (2004) MVKD
– logistic-regression calibration
• Database:
– 25 male Australian English speakers
– two non-contemporaneous recordings (24 tokens / recording)
– cross-validation
Figure: formant trajectories, frequency (Hz, 0 to 2500) against time (s, 0 to 0.25).

Results
• dual-target: C_llr = 0.43
• trajectory: C_llr = 0.10
Figure: cumulative proportion (0 to 1) against log10 likelihood ratio (-10 to +5) for the two systems.

Testing the Reliability of a Forensic-Comparison System

Measuring Reliability
• Imagine that we have four recordings (A, B, C, D) of each speaker
• There are two non-overlapping pairs for each same-speaker comparison and four non-overlapping pairs for each different-speaker comparison
• These are statistically independent and can be used to estimate a 95% credible interval (CI)

Measuring Reliability
• Two non-overlapping pairs for each same-speaker comparison

suspect recording   offender recording
001 A               001 B
001 C               001 D
002 A               002 B
002 C               002 D
:                   :

Measuring Reliability
• Four non-overlapping pairs for each different-speaker comparison

suspect recording   offender recording
001 A               002 B
001 C               002 D
001 A               003 B
001 C               003 D
:                   :
002 A               001 B
002 C               001 D
:                   :
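The pairing scheme on the two preceding slides can be enumerated as follows. This is a minimal sketch assuming exactly four recordings labelled A to D per speaker; the function names and the three-speaker list are hypothetical and only serve the illustration.

```python
from itertools import combinations


def same_speaker_pairs(speaker):
    """Two (suspect, offender) pairs that share no recording: A-B and C-D."""
    return [
        ((speaker, "A"), (speaker, "B")),
        ((speaker, "C"), (speaker, "D")),
    ]


def different_speaker_pairs(speaker_1, speaker_2):
    """Four (suspect, offender) pairs; each of the eight recordings is used once."""
    return [
        ((speaker_1, "A"), (speaker_2, "B")),
        ((speaker_1, "C"), (speaker_2, "D")),
        ((speaker_2, "A"), (speaker_1, "B")),
        ((speaker_2, "C"), (speaker_1, "D")),
    ]


speakers = ["001", "002", "003"]  # hypothetical speaker IDs
ss = {s: same_speaker_pairs(s) for s in speakers}
ds = {(s1, s2): different_speaker_pairs(s1, s2) for s1, s2 in combinations(speakers, 2)}
```

Each entry of `ss` or `ds` then holds the group of likelihood ratios whose spread around their mean is used to estimate precision.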

Measuring Reliability
• non-parametric (heteroscedastic)
Figure (built up over five slides): values on a log(LR) axis, their mean, the deviation from mean of each value, the absolute value | deviation from mean |, and the 5% and 95% bounds.

Measuring Reliability
• non-parametric (heteroscedastic)
• local linear regression
Figure: absolute deviation from mean log10(LR) (y-axis, 0 to 3) against mean log10(LR) (x-axis, -0.5 to 4.5).

Measuring Reliability
• non-parametric (heteroscedastic)
• local linear regression
Figure: deviation from mean (y-axis, -4 to +4) against mean log10(LR) (x-axis, -6 to +6).

Measuring Reliability
• parametric (homoscedastic)
• pooled variance
• t distribution
• assume uniform priors
Figure: deviation from mean, with 5% and 95% bounds.
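One plausible reading of the parametric procedure is sketched below: pool the squared deviations from each group's mean log10(LR), divide by the degrees of freedom (number of values minus number of groups), and scale the pooled standard deviation by a critical value of the t distribution. The slides do not spell out the exact calculation, so the function names, the example data, and the choice to pass the t critical value in by hand (3.182 is the two-sided 95% value for 3 degrees of freedom from standard t tables) are all assumptions.

```python
import math


def pooled_deviation_summary(log_lr_groups):
    """Return (pooled standard deviation, degrees of freedom).

    Each group holds the log10(LR) values from the non-overlapping
    pairs of one comparison.
    """
    sum_sq = 0.0
    n_values = 0
    for group in log_lr_groups:
        mean = sum(group) / len(group)
        sum_sq += sum((x - mean) ** 2 for x in group)
        n_values += len(group)
    df = n_values - len(log_lr_groups)  # one mean estimated per group
    return math.sqrt(sum_sq / df), df


def parametric_half_width(pooled_sd, t_critical):
    """Half-width of the interval, in log10(LR) units."""
    return t_critical * pooled_sd


# Hypothetical log10(LR) values: three comparisons, two pairs each.
groups = [[2.1, 1.5], [-3.0, -2.2], [0.4, 1.0]]
pooled_sd, df = pooled_deviation_summary(groups)
half_width = parametric_half_width(pooled_sd, t_critical=3.182)
```

Because the variance is pooled, the same half-width applies at every value of log10(LR), which is what distinguishes this homoscedastic estimate from the non-parametric one.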

Example of Testing the Validity and Reliability of a Forensic-Comparison System

System and Data (Morrison, Thiruvaran, Epps, 2010)
• Automatic system:
– 16 MFCCs (20 ms window, 10 ms overlap) + deltas
– cumulative density mapping
– 512 mixture GMM-UBM
– logistic-regression calibration
• Databases:
– Background: 800 recordings from NIST SRE 2004
– Calibration: 2 recordings of each of 32 speakers from NIST SRE 2008 8conv
– Test: 4 recordings of each of 100 speakers from NIST SRE 2008 8conv

Results
• 40 s of speech per offender recording in the test set
– C_llr = 0.150, 95% CI (parametric) = ±1.63 log10(LR)
• 20 s of speech per offender recording in the test set
– C_llr = 0.150, 95% CI (parametric) = ±1.69 log10(LR)

Results
• 40 s of speech per offender recording in the test set
Figure: cumulative proportion (0 to 1) against log10(LR) (-6 to +6).

Results
• 20 s of speech per offender recording in the test set
Figure: cumulative proportion (0 to 1) against log10(LR) (-6 to +6).

Summation
If the background and test data were consistent with the conditions in a case at trial, and the comparison of the known- and questioned-voice samples resulted in a likelihood ratio of, say, 100 (log10(LR) of +2), then the non-parametric 95% CI estimate would be ±1.17 log10(LR), and the forensic scientist could make a statement of the following sort:

Based on my evaluation of the evidence, I have calculated that one would be 100 times more likely to obtain the acoustic differences between the voice samples if the questioned-voice sample had been produced by the accused than if it had been produced by someone other than the accused.

What this means is that whatever you believed before this evidence was presented, you should now be 100 times more likely than before to believe that the voice on the questioned-voice recording is that of the accused.

Based on my calculations, I am 95% certain that the acoustic differences are at least 7 times more likely and not more than 1450 times more likely if the questioned-voice sample had been produced by the accused than if it had been produced by someone other than the accused.
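The bounds quoted in the statement follow from converting the interval in log10(LR) units back to likelihood ratios. A minimal sketch, assuming a symmetric interval around log10(LR) = +2 with the ±1.17 half-width given in the summation (the function name `lr_bounds` is made up for the illustration):

```python
def lr_bounds(log10_lr, half_width):
    """Convert a symmetric interval in log10(LR) units to likelihood-ratio bounds."""
    lower = 10 ** (log10_lr - half_width)
    upper = 10 ** (log10_lr + half_width)
    return lower, upper


lower, upper = lr_bounds(2.0, 1.17)
```

With these inputs the bounds come out at roughly 6.8 and 1480, in line with the rounded figures of 7 and 1450 in the statement; the small difference in the upper bound presumably reflects an unrounded interval behind the slide's numbers.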

Latest Thoughts on Measuring the Reliability of a Forensic Comparison System

Measuring Reliability
• In a trial the offender sample is fixed, and precision should be measured given this fixed sample
• Imagine that we have four recordings (A, B, C, D) of each speaker in our test database, and that these are matched to the conditions of the suspect recording from the trial
• Use each recording to build four suspect models for each test speaker
• Calculate likelihood ratios using each suspect model and the fixed offender sample
• Use these likelihood ratios to calculate the precision of the system given the fixed offender sample

Measuring Reliability
• Suspect models from test database compared to fixed offender data from trial

suspect recording   offender recording
001 A               trial
001 B               trial
001 C               trial
001 D               trial
002 A               trial
002 B               trial
002 C               trial
002 D               trial
:                   :

Conclusion

Conclusion
• At admissibility hearing (Daubert), must supply judge with all relevant information about system performance (validity & reliability a.k.a. accuracy & precision)
• Not to present information about the precision of the system would be to mislead the trier of fact
• Must take account of the speaker level as well as the recording level (akin to activity and source levels)
• Intrinsic variability of voice data (cf. DNA profiles)
• Limited data for suspect models
– underestimating within-speaker variability
• Limited offender data

Thank You
http://geoff-morrison.net
http://forensic-voice-comparison.net
http://forensic.unsw.edu.au
