Identification of voices in disguised speech



SLIDE 1

Identification of voices in disguised speech

Jessica Clark* & Paul Foulkes**

* University of York
** University of York & JP French Associates
pf11@york.ac.uk
IAFPA, Göteborg 2006

SLIDE 2

0.1 outline

  • experiment to test ability of lay listeners to identify disguised familiar voices
  • voices have been disguised artificially, as with commercially available voice changers
    – pitch modified

SLIDE 3

0.2 structure

  • 1. introduction
    – rationale for experiment
  • 2. experimental design
    – speakers
    – listeners
    – Control condition
    – Experimental conditions
  • 3. results
  • 4. discussion & conclusion
SLIDE 4

1. Introduction
SLIDE 5

1. Introduction

  • technical speaker identification is the most frequent task for the forensic phonetician
  • lay identification is also common in legal cases
  • many previous studies have thus examined lay listeners’ ability to identify voices and the factors which affect their ability

SLIDE 6

1.1 previous studies

  • identification is not automatic or flawless
  • listeners can make errors even with highly familiar voices
    – Ladefoged did not recognise his mother from a short sample (Ladefoged & Ladefoged 1980)
    – flatmates scored only 68% with 10 second samples (Foulkes & Barron 2000)

SLIDE 7

1.1 previous studies

  • identification may be affected by [Bull & Clifford 1984]
    – type of exposure (active/passive)
    – length of sample
    – nature of sample (phone, direct, shouting etc)
    – delay between exposure and test
    – age of listener
    – hearing ability
    – sightedness
    – natural variability across individual listeners
    – specific features of voice
    – degree of familiarity
    – nature and extent of any disguise

SLIDE 8

1.2 degree of familiarity

  • all things equal, more familiar voices are easier to identify
  • e.g. Hollien, Majewski & Doherty (1982)
    – listening tests with 10 male voices

listener group   N    % correct (normal condition)
familiar         10   98
trained          47   40
unfamiliar       14   27

SLIDE 9

1.3 disguise

  • all things equal, disguised voices are harder to identify
  • e.g. Hollien, Majewski & Doherty (1982)
    – various forms of disguise used

listener group   N    % correct (normal)   % correct (disguised)
familiar         10   98                   79
trained          47   40                   21
unfamiliar       14   27                   18

machine approach (LTAS): 30

SLIDE 10

1.3 disguise

  • previous studies have examined various types of disguise
    – whisper, pencils between teeth, hypernasality, dialect change, rate change, professional mimics
  • but little if any work on voice changers
    – hardware based
    – software based
    – easily available

SLIDE 11

www.maplin.co.uk
www.crimebusters911.com
www.blazeaudio.com

SLIDE 12

1.3 disguise

  • in our study we chose not to use real voice changers, in favour of total control over effects
  • pitch shift chosen as a universal function
SLIDE 13

2. Experimental design
SLIDE 14

2.1 design outline

  • simple design
  • listeners asked to identify samples of familiar voices
  • Control condition: unmodified stimuli
  • 4 Experimental conditions: modified stimuli
SLIDE 15

2.1 design outline

  • degree of familiarity known to affect rate of successful identification
  • thus we trained listeners to identify a group of speakers
    – controls degree of familiarity
    – all listeners had exactly the same exposure in terms of length & quality of samples
    – identification task carried out under same conditions

SLIDE 16

2.2 speakers

  • 4 male speakers
    – 16-18 years old
  • taken from IViE corpus (Grabe, Post & Nolan 2001)
    – Leeds dialect (nearest to York)
    – reading text of Cinderella story

IViE speaker   Experimental name
JP             Edward
JW             Matthew
MD             Harry
RP             David

SLIDE 17

2.2 speakers

  • training materials created for each speaker
    – c. 90 seconds of Cinderella (302 words)
    – edited out disfluencies, non-speech sounds, long pauses
    – samples normalised for amplitude with Audacity 1.2.5
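
The amplitude normalisation step can be pictured with a minimal sketch. It assumes simple peak normalisation (every sample scaled so the loudest one reaches a target level). The slides do not say which Audacity settings were used, and the function name `normalise_peak` and the toy sample values are illustrative only.

```python
from typing import List


def normalise_peak(samples: List[float], target_peak: float = 1.0) -> List[float]:
    """Scale samples so the largest absolute value equals target_peak.

    Assumption: plain peak normalisation. The study's actual Audacity
    settings are not reported in the slides.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Silent (or empty) input: nothing to scale.
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


# Toy waveform, invented for illustration (not study data).
quiet_clip = [0.1, -0.5, 0.25]
print(normalise_peak(quiet_clip))
```

Applying the same target to every training sample is what keeps loudness from acting as an extra cue to speaker identity.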

SLIDE 18

2.3 listeners

  • 36 listeners
  • variety of regional/social backgrounds
  • York residents
  • age range 19-55
  • 10 male, 26 female
SLIDE 19

2.4 Control condition

  • all 36 listeners
    – 4 voices * 90 seconds = c. 6 minutes
    – presented by PowerPoint with speakers’ names
    – Toshiba laptop
    – Aiwa A170 headphones
    – individually in quiet room
  • 1. training phase
  • 2. break
  • 3. listening test
SLIDE 20

2.4 Control condition

  • all 36 listeners
    – 10 minutes
  • 1. training phase
  • 2. break
  • 3. listening test
SLIDE 21

2.4 Control condition

  • all 36 listeners
    – 8 stimuli (2 per speaker)
    – duration c. 10 seconds
    – 5 second gap between
    – extracts from other parts of Cinderella story
    – normalised for amplitude with Audacity 1.2.5
    – answer sheet with names
  • 1. training phase
  • 2. break
  • 3. listening test
SLIDE 22

2.5 Experimental conditions

  • 4 Experimental conditions
  • listening tests same format as Control condition
  • but stimuli modified for pitch
  • Sound Forge 8.0
    – pitch shift effect
    – accuracy setting ‘high’
    – speech 1 mode
    – preserved durations

SLIDE 23

2.5 Experimental conditions

(i) +8 semitones
(ii) +4 semitones
(iii) -4 semitones
(iv) -8 semitones

pitch shift > 8 semitones unnatural and partly incomprehensible
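
To make the size of these shifts concrete: in equal temperament, a shift of n semitones multiplies frequency by 2^(n/12). The sketch below applies that standard relation to the four conditions. The 120 Hz starting F0 is an assumed round figure for illustration, not a measurement from the IViE speakers, and the Sound Forge algorithm itself is not modelled.

```python
def semitone_ratio(n_semitones: float) -> float:
    """Frequency ratio for a shift of n semitones (12 semitones = one octave)."""
    return 2.0 ** (n_semitones / 12.0)


def shifted_f0(f0_hz: float, n_semitones: float) -> float:
    """F0 after shifting by n semitones."""
    return f0_hz * semitone_ratio(n_semitones)


# Assumption: 120 Hz is a hypothetical male F0, chosen only for illustration.
ASSUMED_F0_HZ = 120.0

for shift in (+8, +4, -4, -8):
    print(
        f"{shift:+d} semitones: ratio {semitone_ratio(shift):.3f}, "
        f"{ASSUMED_F0_HZ:.0f} Hz -> {shifted_f0(ASSUMED_F0_HZ, shift):.1f} Hz"
    )
```

The ±8 conditions correspond to ratios of roughly 1.59 and 0.63, i.e. two thirds of an octave in either direction.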

SLIDE 24

2.5 Experimental conditions

listener group   N    conditions (semitones)
A                18   -8, +4
B                18   -4, +8

SLIDE 25

2.5 Experimental conditions

  • listening test 16-92 days after Control test
    – no clear effects for length of delay
  • same training as in Control condition
  • 10 minute break
  • 2 stimuli for familiarisation
  • 8 experimental stimuli per condition
    – consecutive runs for + and - stimuli
    – order reversed for half of each group, but no effect
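
The group and order assignment described for the Experimental conditions can be sketched as a small schedule builder. Group sizes and condition pairs follow the slides. The listener numbering, and which half of each group hears which order first, are assumptions made only for this sketch.

```python
from typing import Dict, List, Tuple

# From the slides: group A hears -8 and +4 semitones, group B hears -4 and +8.
GROUP_CONDITIONS: Dict[str, Tuple[int, int]] = {"A": (-8, +4), "B": (-4, +8)}
LISTENERS_PER_GROUP = 18


def build_schedule() -> List[dict]:
    """Return one entry per listener with group and condition order."""
    schedule = []
    listener_id = 1  # hypothetical numbering, not from the study
    for group, conditions in GROUP_CONDITIONS.items():
        for i in range(LISTENERS_PER_GROUP):
            # Assumption: first half keeps the listed order, second half is reversed.
            if i < LISTENERS_PER_GROUP // 2:
                order = conditions
            else:
                order = conditions[::-1]
            schedule.append({"listener": listener_id, "group": group, "order": order})
            listener_id += 1
    return schedule


schedule = build_schedule()
print(len(schedule), schedule[0], schedule[-1])
```

Each listener hears one raised and one lowered condition, so every condition is judged by 18 listeners.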

SLIDE 26

3. Results
SLIDE 27

3.1 Control condition

  • average correct identification = 4.8/8 (60%)

[Figure: average N correct (scale 1-8) by condition: Minus 8, Minus 4, Control, Plus 4, Plus 8]

SLIDE 28

3.1 Control condition

  • individuals’ range 8 to 0
  • 29/36 performed better than chance

[Figure: Control condition; axes: listeners, N correct (1-8)]
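
One way to read "better than chance" is against a listener who picks at random among the four named speakers. The sketch below assumes independent guesses with probability 1/4 per stimulus. The slides do not state the criterion actually applied, so this is an illustration and not the authors' calculation.

```python
from math import comb

N_SPEAKERS = 4   # from the slides
N_STIMULI = 8    # from the slides
P_GUESS = 1 / N_SPEAKERS  # assumption: uniform random guessing


def prob_at_least(k: int, n: int = N_STIMULI, p: float = P_GUESS) -> float:
    """Binomial probability of k or more correct out of n by guessing."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


expected_by_chance = N_STIMULI * P_GUESS
print(f"expected correct by guessing: {expected_by_chance}")
for k in range(N_STIMULI + 1):
    print(f"P(at least {k} correct) = {prob_at_least(k):.4f}")
```

Under these assumptions a pure guesser expects 2 of 8 correct, well below the Control average of 4.8.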

SLIDE 29

3.2 Experimental conditions

  • ** sig. lower than in Control (p < .005, Wilcoxon)
  • trend (n.s.) for higher scores in + conditions

[Figure: average N correct (scale 1-8) by condition: Minus 8, Minus 4, Control, Plus 4, Plus 8; four conditions marked **]
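
The Wilcoxon comparison with Control is a paired test: each listener's score in a shifted condition is set against the same listener's Control score. A minimal sketch of the signed-rank sums follows. It computes only the rank sums, not a p-value, and the scores are invented for illustration and are not the study's data.

```python
from typing import List, Tuple


def signed_rank_sums(control: List[int], shifted: List[int]) -> Tuple[float, float]:
    """Return (W+, W-) for paired scores, dropping zero differences.

    Tied absolute differences receive the average of the ranks they span.
    """
    diffs = [s - c for c, s in zip(control, shifted) if s != c]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        i = j
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


# Hypothetical scores out of 8 for six listeners (NOT the study's data).
control_scores = [5, 6, 4, 7, 5, 3]
shifted_scores = [3, 4, 4, 5, 2, 4]
print(signed_rank_sums(control_scores, shifted_scores))
```

A small W+ relative to W- indicates that most listeners scored lower in the shifted condition, which is the direction reported on the slide.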

SLIDE 30

[Figure: four panels (-8, -4, +4, +8 semitones); axes in each panel: listeners, N correct (1-8)]

  • variability in listener performance, esp. ±4
  • majority perform above chance except -8
SLIDE 31

3.3 variation by listener sex

  • women sig. better in Control (p = .008, Mann-Whitney)
    – trend (n.s.) maintained in Experimental tests
    – same pattern reported by Bull & Clifford (1984)

[Figure: N correct (scale 1-8) by condition (Minus 8, Minus 4, Control, Plus 4, Plus 8), Male vs Female; one comparison marked **]
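
The Mann-Whitney test suits the comparison by listener sex because the two groups are independent and unequal in size (10 men, 26 women). A minimal sketch of the U statistic follows. It stops short of a p-value, and the scores are invented for illustration and are not the study's data.

```python
from typing import List


def mann_whitney_u(group_x: List[int], group_y: List[int]) -> float:
    """U statistic for group_x: each (x, y) pair scores 1 if x > y, 0.5 if tied."""
    u = 0.0
    for x in group_x:
        for y in group_y:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Hypothetical scores out of 8 (NOT the study's data).
women_scores = [6, 7, 5, 8]
men_scores = [4, 5, 3]

u_women = mann_whitney_u(women_scores, men_scores)
u_men = mann_whitney_u(men_scores, women_scores)
print(u_women, u_men)
```

The two U values always sum to the number of cross-group pairs, so a value near that maximum for one group means its scores are consistently higher.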

SLIDE 32

3.4 summary

  • as predicted, identification rates were lower with disguised voices
    – lowest scores with most extreme form of disguise (±8 semitones)
  • identification rates slightly better when pitch shifted up than down
  • trend for women to perform better than men
  • variability across listeners
SLIDE 33

4. Discussion & conclusion
SLIDE 34

4. discussion & conclusion

  • tests reported here were not forensically realistic
  • results may be affected by e.g.
    – degree of familiarity with voice
    – content of sample (vocabulary, syntax etc)
    – conditions of exposure (stress etc)
    – specific form of artificial disguise
  • software, hardware system
  • combination of effects
SLIDE 35

4. discussion & conclusion

  • considerable variation in listeners’ scores
    – courts should not assume all witnesses are equally good at such tasks
    – supports broader principle that lay witnesses should be tested in their ability to identify a voice

SLIDE 36

4. discussion & conclusion

  • but even marked disguise was not catastrophic for listeners
  • a broadly positive conclusion for lay speaker identification
    – a reasonable chance of identifying familiar voices

SLIDE 37

4. discussion & conclusion

  • but a less positive conclusion with respect to use of voice changers as a means of protecting vulnerable witnesses giving evidence
  • more extreme forms of modification may affect intelligibility & naturalness
  • less extreme forms of modification may render witness’s voice recognisable
  • different modifications for different voices?
SLIDE 38

4. discussion & conclusion

  • as ever…
  • more work is needed
SLIDE 39

thanks tack

thanks to Peter French, Phil Harrison, Robin How

SLIDE 40

References

Bull, R. & Clifford, B. (1984) Earwitness voice recognition accuracy. In G. Wells & E. Loftus (eds.) Eyewitness Testimony: Psychological Perspectives. Cambridge: CUP. pp. 92-123.

Foulkes, P. & Barron, A. (2000) Telephone speaker recognition amongst members of a close social network. Forensic Linguistics 7: 181-198.

Grabe, E., Post, B. & Nolan, F. (2001) English intonation in the British Isles: the IViE corpus. Final report to UK ESRC R000 237145. www.phon.ox.ac.uk/IViE

Hollien, H., Majewski, W. & Doherty, E. (1982) Perceptual identification of voices under normal, stress and disguise speaking conditions. Journal of Phonetics 10: 139-148.

Ladefoged, P. & Ladefoged, J. (1980) The ability of listeners to identify voices. UCLA Working Papers in Phonetics 49: 43-51.