SLIDE 1

Identification of voices in disguised speech

Jessica Clark* & Paul Foulkes**

* University of York
** University of York & JP French Associates
pf11@york.ac.uk
IAFPA, Göteborg 2006

SLIDE 2

0.1 outline

experiment to test ability of lay listeners to identify disguised familiar voices

voices have been disguised artificially, as with commercially available voice changers

– pitch modified

SLIDE 3

0.2 structure

1. introduction
– rationale for experiment
2. experimental design
– speakers
– listeners
– Control condition
– Experimental conditions
3. results
4. discussion & conclusion

SLIDE 4

1. Introduction

SLIDE 5

1. Introduction

technical speaker identification is the most frequent task for the forensic phonetician

lay identification is also common in legal cases

many previous studies have thus examined lay listeners’ ability to identify voices and the factors which affect that ability

SLIDE 6

1.1 previous studies

identification is not automatic or flawless

listeners can make errors even with highly familiar voices

– Ladefoged did not recognise his mother from a short sample (Ladefoged & Ladefoged 1980)
– flatmates scored only 68% with 10 second samples (Foulkes & Barron 2000)

SLIDE 7

1.1 previous studies

identification may be affected by (Bull & Clifford 1984):

– type of exposure (active/passive)
– length of sample
– nature of sample (phone, direct, shouting etc)
– delay between exposure and test
– age of listener
– hearing ability
– sightedness
– natural variability across individual listeners
– specific features of voice
– degree of familiarity
– nature and extent of any disguise

SLIDE 8

1.2 degree of familiarity

all things equal, more familiar voices are easier to identify

e.g. Hollien, Majewski & Doherty (1982)

– listening tests with 10 male voices

listener group   N    % correct (normal condition)
familiar         10   98
trained          47   40
unfamiliar       14   27

SLIDE 9

1.3 disguise

all things equal, disguised voices are harder to identify

e.g. Hollien, Majewski & Doherty (1982)

– various forms of disguise used

listener group   N    % correct (normal)   % correct (disguised)
familiar         10   98                   79
trained          47   40                   21
unfamiliar       14   27                   18

machine approach (LTAS): 30

SLIDE 10

1.3 disguise

previous studies have examined various types of disguise

– whisper, pencils between teeth, hypernasality, dialect change, rate change, professional mimics

but little if any work on voice changers

– hardware based
– software based
– easily available

SLIDE 11

www.maplin.co.uk
www.crimebusters911.com
www.blazeaudio.com

SLIDE 12

1.3 disguise

in our study we chose not to use real voice changers, in favour of total control over effects

pitch shift chosen as a universal function

SLIDE 13

2. Experimental design

SLIDE 14

2.1 design outline

simple design

listeners asked to identify samples of familiar voices

Control condition: unmodified stimuli

4 Experimental conditions: modified stimuli

SLIDE 15

2.1 design outline

degree of familiarity known to affect rate of successful identification

thus we trained listeners to identify a group of speakers

– controls degree of familiarity
– all listeners had exactly the same exposure in terms of length & quality of samples
– identification task carried out under same conditions

SLIDE 16

2.2 speakers

4 male speakers

– 16-18 years old

taken from IViE corpus (Grabe, Post & Nolan 2001)

– Leeds dialect (nearest to York)
– reading text of Cinderella story

IViE speaker   Experimental name
JP             Edward
JW             Matthew
MD             Harry
RP             David

SLIDE 17

2.2 speakers

training materials created for each speaker

– c. 90 seconds of Cinderella (302 words)
– edited out disfluencies, non-speech sounds, long pauses
– samples normalised for amplitude with Audacity 1.2.5

SLIDE 18

2.3 listeners

36 listeners
variety of regional/social backgrounds
York residents
age range 19-55
10 male, 26 female

SLIDE 19

2.4 Control condition

all 36 listeners

1. training phase
– 4 voices * 90 seconds = c. 6 minutes
– presented by PowerPoint with speakers’ names
– Toshiba laptop
– Aiwa A170 headphones
– individually in quiet room
2. break
3. listening test

SLIDE 20

2.4 Control condition

all 36 listeners

1. training phase
2. break
– 10 minutes
3. listening test

SLIDE 21

2.4 Control condition

all 36 listeners

1. training phase
2. break
3. listening test
– 8 stimuli (2 per speaker)
– duration c. 10 seconds
– 5 second gap between
– extracts from other parts of Cinderella story
– normalised for amplitude with Audacity 1.2.5
– answer sheet with names

SLIDE 22

2.5 Experimental conditions

4 Experimental conditions
listening tests same format as Control condition
but stimuli modified for pitch
Sound Forge 8.0

– pitch shift effect
– accuracy setting ‘high’
– speech 1 mode
– preserved durations

SLIDE 23

2.5 Experimental conditions

(i) +8 semitones
(ii) +4 semitones
(iii) -4 semitones
(iv) -8 semitones

pitch shift > 8 semitones unnatural and partly incomprehensible
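For orientation, the semitone shifts listed above can be converted to frequency ratios with the standard equal-tempered relation, ratio = 2^(n/12). The sketch below is only an illustration of that arithmetic: the function name and the 120 Hz example f0 are assumptions introduced here, not values from the study, and it does not reproduce the Sound Forge pitch shift algorithm.

```python
def semitone_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of `semitones` (equal temperament)."""
    return 2.0 ** (semitones / 12.0)


# The four experimental shifts named on the slides.
SHIFTS = (+8, +4, -4, -8)

# Hypothetical fundamental frequency, for illustration only.
EXAMPLE_F0_HZ = 120.0

for n in SHIFTS:
    ratio = semitone_ratio(n)
    shifted = EXAMPLE_F0_HZ * ratio
    print(f"{n:+d} semitones: ratio {ratio:.3f}, "
          f"{EXAMPLE_F0_HZ:.0f} Hz -> {shifted:.1f} Hz")
```

On this relation, +8 and -8 semitones are not symmetric in Hz: the upward ratio is about 1.59 and the downward ratio about 0.63.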

SLIDE 24

2.5 Experimental conditions

listener group   N    conditions (semitones)
A                18   -8, +4
B                18   -4, +8

SLIDE 25

2.5 Experimental conditions

listening test 16-92 days after Control test

– no clear effects for length of delay

same training as in Control condition
10 minute break
2 stimuli for familiarisation
8 experimental stimuli per condition

– consecutive runs for + and - stimuli
– order reversed for half of each group, but no effect

SLIDE 26

3. Results

SLIDE 27

3.1 Control condition

average correct identification = 4.8/8 (60%)

[chart: average N correct (scale 1-8) by condition: Minus 8, Minus 4, Control, Plus 4, Plus 8]

SLIDE 28

3.1 Control condition

individuals’ range 8 to 0
29/36 performed better than chance

[chart: Control condition, N correct (scale 1-8) by listener]
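With 4 possible speakers and 8 stimuli, pure guessing would yield 2 correct answers on average. The slides do not say how "better than chance" was defined, so the sketch below only shows the underlying binomial arithmetic, assuming each guess is independent and equally likely to land on any of the four names. The function and constant names are introduced here for illustration.

```python
from math import comb

N_STIMULI = 8    # stimuli per listening test (from the slides)
N_SPEAKERS = 4   # names on the answer sheet (from the slides)
P_GUESS = 1 / N_SPEAKERS


def prob_at_least(k: int, n: int = N_STIMULI, p: float = P_GUESS) -> float:
    """P(X >= k) for X ~ Binomial(n, p): k or more correct by guessing."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Mean number correct under pure guessing.
expected_by_chance = N_STIMULI * P_GUESS

print(f"expected correct by guessing: {expected_by_chance}")
for k in range(N_STIMULI + 1):
    print(f"P(at least {k} correct by guessing) = {prob_at_least(k):.4f}")
```

Under these assumptions, a score of 5 or more out of 8 would arise by guessing in under 3% of cases.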

SLIDE 29

3.2 Experimental conditions

** sig. lower than in Control (p < .005, Wilcoxon)
trend (n.s.) for higher scores in + conditions

[chart: average N correct (scale 1-8) by condition: Minus 8, Minus 4, Control, Plus 4, Plus 8; four ** markers]
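The slides report only that a Wilcoxon test was used to compare each Experimental condition with Control. As a rough illustration of what such a paired comparison computes, the sketch below derives the signed-rank sums for two lists of per-listener scores. The scores are invented for the example and are not the study's data. Dropping zero differences and averaging tied ranks is one common convention, assumed here, and no p-value is computed.

```python
def wilcoxon_signed_rank(x, y):
    """Return (W_plus, W_minus, n) for paired samples x and y.

    Zero differences are dropped; tied absolute differences share the
    average of the ranks they span.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus, len(ordered)


# Hypothetical scores out of 8 for six listeners (NOT the study's data).
control_scores = [5, 6, 4, 7, 5, 3]
shifted_scores = [3, 5, 4, 4, 2, 4]

w_plus, w_minus, n = wilcoxon_signed_rank(control_scores, shifted_scores)
print(f"W+ = {w_plus}, W- = {w_minus}, non-zero pairs = {n}")
```

A large imbalance between W+ and W- indicates that scores in one condition are consistently higher; the smaller of the two sums is the usual test statistic.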

SLIDE 30

[four charts: N correct (scale 1-8) by listener, one panel each for -8, +8, +4 and -4 semitones]

variability in listener performance, esp. ±4
majority perform above chance except -8

SLIDE 31

3.3 variation by listener sex

women sig. better in Control (p = .008, Mann-Whitney)

– trend (n.s.) maintained in Experimental tests
– same pattern reported by Bull & Clifford (1984)

[chart: N correct (scale 1-8) by condition (Minus 8, Minus 4, Control, Plus 4, Plus 8), Male vs Female; one ** marker]

SLIDE 32

3.4 summary

as predicted, identification rates were lower with disguised voices

– lowest scores with most extreme form of disguise (±8 semitones)

identification rates slightly better when pitch shifted up than down

trend for women to perform better than men
variability across listeners

SLIDE 33

4. Discussion & conclusion

SLIDE 34

4. discussion & conclusion

tests reported here were not forensically realistic

results may be affected by e.g.

– degree of familiarity with voice
– content of sample (vocabulary, syntax etc)
– conditions of exposure (stress etc)
– specific form of artificial disguise: software, hardware system; combination of effects

SLIDE 35

4. discussion & conclusion

considerable variation in listeners’ scores

– courts should not assume all witnesses are equally good at such tasks
– supports broader principle that lay witnesses should be tested in their ability to identify a voice

SLIDE 36

4. discussion & conclusion

but even marked disguise was not catastrophic for listeners

a broadly positive conclusion for lay speaker identification

– a reasonable chance of identifying familiar voices

SLIDE 37

4. discussion & conclusion

but a less positive conclusion with respect to use of voice changers as a means of protecting vulnerable witnesses giving evidence

more extreme forms of modification may affect intelligibility & naturalness

less extreme forms of modification may render witness’s voice recognisable

different modifications for different voices?

SLIDE 38

4. discussion & conclusion
as ever…
more work is needed

SLIDE 39

thanks tack

thanks to Peter French, Phil Harrison, Robin How

SLIDE 40

References

Bull, R. & Clifford, B. (1984) Earwitness voice recognition accuracy. In G. Wells & E. Loftus (eds.) Eyewitness Testimony: Psychological Perspectives. Cambridge: CUP. pp. 92-123.

Foulkes, P. & Barron, A. (2000) Telephone speaker recognition amongst members of a close social network. Forensic Linguistics 7: 181-198.

Grabe, E., Post, B. & Nolan, F. (2001) English intonation in the British Isles: the IViE corpus. Final report to UK ESRC R000 237145. www.phon.ox.ac.uk/IViE

Hollien, H., Majewski, W. & Doherty, E. (1982) Perceptual identification of voices under normal, stress and disguise speaking conditions. Journal of Phonetics 10: 139-148.

Ladefoged, P. & Ladefoged, J. (1980) The ability of listeners to identify voices. UCLA Working Papers in Phonetics 49: 43-51.
