
SLIDE 1

Two sides of the same coin: between-speaker F0 differences in linguistic-tonetic description and forensic voice comparison.

Division of Humanities, Hong Kong University of Science & Technology
School of Language Studies, Australian National University
Joseph Bell Centre for Forensic Statistics and Legal Reasoning, University of Edinburgh

TAL 2012 (“Prosody in the Real World”): Tonal Aspects across Tone and Non-Tone Languages. Invited Talk

phil rose

SLIDE 2

Between-Speaker Differences (BSDs) in tonally-relevant acoustic output from two complementary perspectives:

(1) BSDs in Forensic Voice Comparison
A case of “prosody in the real world”: a real-world FVC case where intonational F0 played an important part

(2) BSDs in Linguistic Tonetics
Tonal normalisation and some of its uses for a quantifiable linguistic-tonetic representation of tonal and intonational pitch.

The Theme

SLIDE 3

Cords vibrating like a string (Titze 1994)

F0 = (1 / 2L) · √(σ / ρ)
L = vocal cord length
σ = longitudinal stress in the cords
(stress = the tension in the cords divided by the cross-sectional area of vibrating tissue; (cover) tension is controlled by crico-thyroid contraction/relaxation)
ρ = tissue density

Since F0 is inversely proportional to cord length, other things being equal, if the speaker's cords are long, their F0 will be lower.

Cords vibrating like a spring

F0 = (1 / 2π) · √(k / m)
k = cord stiffness
m = vocal cord mass

Since F0 is inversely proportional to the square root of cord mass, other things being equal, if the speaker's cords are bigger, their F0 will be lower.

Main anatomical source of F0 BSDs
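
The two formulas on this slide can be tried out numerically. The following is a minimal sketch: the function names and all numeric values are illustrative assumptions, not measurements from the talk. It only shows the direction of the effect: longer or heavier cords give a lower F0, other things being equal.

```python
import math

def string_f0(length_m, stress_pa, density_kg_m3):
    """F0 of an ideal string: (1 / 2L) * sqrt(stress / density)."""
    return math.sqrt(stress_pa / density_kg_m3) / (2.0 * length_m)

def spring_f0(stiffness_n_per_m, mass_kg):
    """F0 of a mass-spring oscillator: (1 / 2*pi) * sqrt(k / m)."""
    return math.sqrt(stiffness_n_per_m / mass_kg) / (2.0 * math.pi)

# Hypothetical values chosen only for illustration.
STRESS = 20_000.0   # longitudinal stress (Pa), assumed
DENSITY = 1040.0    # tissue density (kg/m^3), assumed

short_cords = string_f0(0.012, STRESS, DENSITY)
long_cords = string_f0(0.016, STRESS, DENSITY)

# Same stiffness, four times the mass: F0 halves (inverse square root).
light = spring_f0(100.0, 0.0001)
heavy = spring_f0(100.0, 0.0004)

print(f"string model: short cords {short_cords:.1f} Hz, long cords {long_cords:.1f} Hz")
print(f"spring model: light cords {light:.1f} Hz, heavy cords {heavy:.1f} Hz")
```

Doubling the length halves F0 in the string model, whereas the mass has to be quadrupled to halve F0 in the spring model.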

SLIDE 4

Forensic Voice Comparison

Self-evidently, it is the differences between speakers that are important.

Absent BSDs, it is not possible to recognise someone by their voice.

FVC = comparing speech samples with respect to any aspect of voice (not just phonetics!) to help the trier-of-fact decide whether the suspect said the incriminating speech.

SLIDE 5

On Christmas Eve 2003 a fraudulent fax was sent to the investment bank JP Morgan Chase in Australia requesting the transfer of $150 million to accounts in Switzerland, Greece and Hong Kong.

About 10 minutes before the close of business, the bank received a phone call from a Craig Slater, asking for a call-back on the fax (= a procedure confirming the details of the fax and verifying that the transfer could go ahead).

Here is part of the money-making phone call.

The Crime

SLIDE 6

“JP Morgan Greg speaking”
“Yeah hello Greg this is Craig Slater here mate”
“Oh g’day how are you?”
“Not too bad I bin havin a bit of trouble here…”

The Offender

SLIDE 7

“em.. And we’re going to pay Hong Kong dollars 118,678,543 spot 29 to HSBC em… Hong Kong?”
“Correct”
“Hong Kong I think Hong Kong Power Limited six three six double oh three oh five five double oh one [$636,003,055,001]?”
“Yes”

Out goes the money …

SLIDE 8

That is how you make $150 million in one phone call.

And also how the Australian Commonwealth Superannuation Scheme account administered by the bank lost $150 million.

The Result

SLIDE 9

15 intercepted telephone calls containing “not too bad”, e.g.

“…mate, how are you?”
“Oh not too bad, everything’s good.”

The Suspect

SLIDE 10

Both suspect and offender samples contain the utterance “not too bad” said with the same H.L.LH intonation:
– rise nuclear tone on bad (“supportive interest encouraging further conversation”)
– high head on not (the suspect’s not: high/low head)

Therefore F0 is highly comparable.
Usually F0 is not much good in FVC: high within-speaker variation gives a disadvantageous variance ratio.

The (Intonational F0) Evidence

SLIDE 11

Offender’s “not too bad” F0

[Figure: offender’s F0 contour (F0 axis 60 to 300) against Duration (sec.): H on not, L on too, LH on bad]

SLIDE 12

Degree of similarity between suspect’s and offender’s not too bad F0

[Figure: Offender F0 and Suspect samples F0]

SLIDE 13

You want to know the probability the suspect said the incriminating speech, given the similarity between the suspect and offender data? p(H|E)

Evaluating Evidence Rationally

By my theorem, that is proportional to the strength of your evidence …

… and the probability that the suspect said the incriminating speech BEFORE the evidence is taken into account …

Bayes’ Theorem: Posterior Odds = Prior Odds × Likelihood Ratio
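
The odds form of Bayes' Theorem on this slide is a single multiplication. The sketch below assumes hypothetical prior odds of 1 to 100 and a hypothetical likelihood ratio of 20; neither number is a prior from the case, and the function names are illustrative.

```python
def posterior_odds(prior_odds, likelihood_ratio):
    """Bayes' Theorem in odds form: posterior odds = prior odds * LR."""
    return prior_odds * likelihood_ratio

def odds_to_probability(odds):
    """Convert odds in favour of a hypothesis to a probability."""
    return odds / (1.0 + odds)

# Hypothetical inputs, for illustration only.
prior = 1.0 / 100.0   # assumed prior odds that the suspect is the speaker
lr = 20.0             # assumed strength of evidence

post = posterior_odds(prior, lr)
prob = odds_to_probability(post)
print(f"posterior odds = {post:.3f}, posterior probability = {prob:.3f}")
```

The same likelihood ratio leads to very different posterior probabilities depending on the prior odds, which is why the expert reports the LR and leaves the prior to the trier-of-fact.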

SLIDE 14

Strength of evidence in support of one hypothesis over another = ratio of the probabilities of the evidence under the competing hypotheses:

LR = p(E | H_same-speaker) / p(E | H_different-speaker)

That is: the probability of the difference between suspect and offender F0 in not too bad assuming the suspect said it, vs. the probability of the difference assuming it was said by someone else randomly chosen from the relevant population.

The Likelihood Ratio

LR denominator is where the between-speaker differences come in!
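
To make the role of the denominator concrete, here is a deliberately simplified univariate sketch. It is not the multivariate formula used in the case: it assumes a single F0 value per sample, normal distributions throughout, and hypothetical means and standard deviations. The between-speaker spread of the reference population enters only in the denominator.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def univariate_lr(offender_value, suspect_mean, within_sd,
                  population_mean, between_sd):
    """Simplified LR = p(E | same speaker) / p(E | different speaker)."""
    # Numerator: how typical the offender value is of the suspect.
    numerator = normal_pdf(offender_value, suspect_mean, within_sd)
    # Denominator: how typical it is of the reference population, whose
    # spread combines between-speaker and within-speaker variance.
    total_sd = math.sqrt(between_sd ** 2 + within_sd ** 2)
    denominator = normal_pdf(offender_value, population_mean, total_sd)
    return numerator / denominator

# Hypothetical F0 values in Hz, for illustration only.
lr_close = univariate_lr(180.0, 178.0, 10.0, 120.0, 20.0)
lr_far = univariate_lr(120.0, 178.0, 10.0, 120.0, 20.0)
print(f"offender close to suspect, atypical of population: LR = {lr_close:.1f}")
print(f"offender far from suspect, typical of population: LR = {lr_far:.2e}")
```

An offender value that is close to the suspect but unusual in the population gives an LR above 1; one that is far from the suspect but typical of the population gives an LR below 1.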

SLIDE 15

So we have to collect a Reference Sample of “not too bad”s: natural responses to “how’s it going?” etc.

Do any two samples sound as if they are from the same speaker?

Relatively easy to find speakers with very similar voices!!

[Audio samples: Speaker 1 to Speaker 10]

SLIDE 16

[Figure: 30 panels of “not too bad” F0, one per reference speaker, each with axis ticks 2, 4, 6 and 100, 200, 300. Panel titles: Adam, Alderman, Andrew, Bevan, Brown, Cameron, Collette, Dando, Dave, DavidDoroth, GaryNgale, GaryYuko, GaryRenata, Hendriks, Hill, James, Jeffries, Langford, Lee, Mac, Malcolm, Pavlic-Searle, Hunter, Rose, Ruggieri, Sidwell, Stephen, Stewart, Windle, Young]

You have to go and get this!

Reference sample: non-contemporaneous variation in 30 males’ “not too bad” F0.

SLIDE 17

Multivariate Likelihood Ratio formula (Aitken & Lucy 2002)

[Formula: MVLR numerator = …; MVLR denominator = …]

The Formula

SLIDE 18

[Figure: density plot of LR values; value marked: 20.6]

Multivariate LR values for comparison between suspect and offender samples using F0 in “not too bad” against a reference population of 30 males.

About 20 times more likely to get this difference in not too bad F0 if the suspect said it than if someone else had said it.

NOT: the suspect is about 20 times more likely to have said it than someone else!!

The Finding

SLIDE 19

By combining LRs from different features, one can get quite large strengths of evidence in support of either defence or prosecution hypotheses.

In this case the acoustics (F-pattern) in “yes” were also used.
They gave an LR of about 70.
Combined with not too bad F0 the LR is now about 1400.
All the acoustic voice evidence in the case gave an LR of about 11 million.
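
The combination step can be sketched as follows, assuming the features are treated as independent so that their LRs multiply (equivalently, their log LRs add). The inputs 70 and 20.6 are the approximate figures quoted on the slides; the function name is illustrative, and the independence assumption is ours, not a statement of how the case work was done.

```python
import math

def combine_lrs(lrs):
    """Combine LRs from independent features by summing log10 LRs."""
    log_sum = sum(math.log10(lr) for lr in lrs)
    return 10.0 ** log_sum

# "yes" F-pattern LR (about 70) and "not too bad" F0 LR (20.6).
combined = combine_lrs([70.0, 20.6])
print(f"combined LR = {combined:.0f}")
```

The product of 70 and 20.6 is 1442, consistent with the rounded figure of about 1400. If features are correlated, simple multiplication overstates the strength of evidence.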

[Figure legend: Offender, Suspect, Reference sample]

The Other Voice Evidence

SLIDE 20

I don’t know the prior odds (= the other evidence in the case), but:

The suspect was found guilty.
Most of the money was recovered.

The Verdict

SLIDE 21

Forensic Voice Comparison with Tonal F0?

Yes, small contribution from tones: improves /i/ Cllr on fusion.

[Figure: Tippett/reliability plots for F-pattern and [22] tonal F0 in Cantonese yih ‘two’ for 26 young male Cantonese speakers’ non-contemporaneous natural speech. Curves: LRs from same-speaker comparisons; LRs from different-speaker comparisons. Log-LR cost (Cllr) = 0.51; EER = 15%]
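
For readers unfamiliar with the log-LR cost, the sketch below computes Cllr from two lists of LRs using its standard definition. The LR values are made up for illustration and are unrelated to the Cantonese data; the function name is an assumption.

```python
import math

def cllr(same_speaker_lrs, different_speaker_lrs):
    """Log-likelihood-ratio cost: 0 is perfect, 1 is uninformative."""
    # Same-speaker comparisons are penalised for small LRs.
    same_cost = sum(math.log2(1.0 + 1.0 / lr) for lr in same_speaker_lrs)
    same_cost /= len(same_speaker_lrs)
    # Different-speaker comparisons are penalised for large LRs.
    diff_cost = sum(math.log2(1.0 + lr) for lr in different_speaker_lrs)
    diff_cost /= len(different_speaker_lrs)
    return 0.5 * (same_cost + diff_cost)

# Hypothetical LRs, for illustration only.
uninformative = cllr([1.0, 1.0], [1.0, 1.0])
good_system = cllr([1000.0, 100.0], [0.001, 0.01])
misleading = cllr([0.01], [100.0])
print(uninformative, good_system, misleading)
```

A system that always returns LR = 1 scores exactly 1, so values below 1 (such as the 0.51 reported on the slide) indicate that the system carries useful information, and values above 1 indicate misleading LRs.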

SLIDE 22

Theme 2:

Using between-speaker differences to get a quantified linguistic-tonetic description of the tones of a variety
– For tonal typology
– Acoustic reconstruction

SLIDE 23

Modelling tones

Wu dialect tones
Merit in complexity
Some typically complex data from Wencheng, Jinyun

Can all be easily modelled with a continuous model (e.g. Fujisaki)

But perhaps not quite so easily with a discrete phonological Bao-type model

SLIDE 24

Jinyun tones (Steed & Rose 2009)

Register: Contour
high: rise-fall
low: fall
low: fall-rise
low: rise
mid: dipping
low: rise-fall
low: depressed fall
whole range: fall

SLIDE 25

Wencheng 文成 tonal acoustics (Rose 2010)

Mean tonal F0 as function of mean tonal duration

low falling-rising mid? falling-rising lower rising upper short rising mid depressed level low level upper mid? level

Register Contour

SLIDE 26

Observations

These systems mostly do not behave in the way theory tells us to expect.

Simple tones are rare; complex contours abound; phonation type contrasts are found in nearly all possible different interactions with tone.

The rules/constraints relating the isolation tones to tone sandhi forms lack phoneticity.

Why would such systems evolve?
Difficult to avoid the idea of tones as indexical features.

SLIDE 27

Normalisation

Before you can answer these typological questions you need to be able to characterise varieties’ acoustics quantitatively.

Let’s look at a simple single variety: Shanghai.

SLIDE 28

Shanghai raw tonal acoustics

Unstopped tones: “high falling”, “mid dipping”, “low rising”
8 male (thick lines), 8 female
Controlled for intrinsic vowel F0
Controlled for intrinsic consonantal F0

SLIDE 29

Shanghai normalised tonal acoustics

(Rose 1993)

8 males, 8 females
Normalisation: F0 – intrinsic z-score; duration – percent (NB not equalised!)
Coloured lines = mean normalised F0, duration
Solid = male; dotted = female
Note sex-related differences in high falling tone

Normalisation index (Earle 1975): how much does the normalisation reduce the original tonally-related between-speaker F0 variance? With this normalisation, about 9.5 times.
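
A minimal sketch of the two normalisations named on this slide, assuming each speaker's mean and standard deviation are taken from that speaker's own F0 samples (the "intrinsic" part) and duration is expressed as a percentage of the tone's total duration. The F0 values and function names are hypothetical, and the normalisation index itself is not computed here.

```python
import statistics

def zscore_normalise(f0_values):
    """Express one speaker's F0 samples in units of that speaker's own sd."""
    mean = statistics.mean(f0_values)
    sd = statistics.stdev(f0_values)
    return [(value - mean) / sd for value in f0_values]

def percent_duration(times):
    """Express sampling times as a percentage of total tone duration."""
    start, end = times[0], times[-1]
    return [100.0 * (t - start) / (end - start) for t in times]

# Hypothetical falling contour for a male speaker (Hz), and the same shape
# produced in a higher, wider range by a hypothetical female speaker.
male_f0 = [130.0, 120.0, 100.0, 95.0, 110.0]
female_f0 = [1.8 * value + 10.0 for value in male_f0]

male_norm = zscore_normalise(male_f0)
female_norm = zscore_normalise(female_f0)
print([round(z, 2) for z in male_norm])
print([round(z, 2) for z in female_norm])
print(percent_duration([0.05, 0.10, 0.15, 0.20, 0.25]))
```

Two speakers who realise the same contour shape in different ranges come out with identical normalised values, which is the between-speaker variance reduction the normalisation index is meant to quantify.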

SLIDE 30

Comparing varieties

If we want to find out how languages differ in their tonal acoustics, and how they are the same (Anderson’s 1973 “linguistic-phonetic properties”), we need to compare varieties.

Problem: comparing different varieties with normalisation is not straightforward: you need to be sure that your normalisation parameters are comparable across varieties!

For example: how many linguistic-tonetically shared tones are there between Standard Thai (5) and Southern Thai (7)?

SLIDE 31

[Figure: Standard Thai (female); Southern Thai (male) (Thompson 1996)]

What is the correct relationship between these two sets?

SLIDE 32

Using bilingual’s tones

The female speaker is a Southern Thai educated professional, bilingual in both Southern & Standard Thai (Rose 1997).

So our normalisation strategy must adequately reflect the relationship between her two sets of tones …

SLIDE 33

testing z-score normalisation with bilingual’s tones

[Figure: bilingual speaker’s tones; axis label: percent of total F0 range]

Speaker is controlling 11 different tones
Z-score normalisation

SLIDE 34

(conservative HK) Cantonese

six contrasting pitch shapes on unstopped syllables (subminimal sextuplet from CF4):

woman, 婦, fu: low to mid rise “[23]”
ancient, 古, ku: low to high rise “[24]”
support, 扶, fu: falling from low “[21], [1↓]”
part, 部, pu: lower mid level “[22]”
cause, 故, ku: mid level “[33]”
father's sister, 姑, ku: high level “[55]”

SLIDE 35

Z-score normalised Cantonese unstopped tones

(Rose 2000)

5 males, 5 females, controlled for intrinsic vowel F0

SLIDE 36

Comparing Cantonese, Shanghai tones

8 different tones, “low rising” shared?

SLIDE 37

Comparing tones across varieties: high falling tones in Yongjiang 涌江 & Oujiang 甌江 sub-groups of Wu

[Figure: normalised F0 & duration for OJ & YJ high falling tone; x-axis: normalised duration (%); y-axis: normalised F0 (sd); curves: Yongjiang, Oujiang]

Oujiang normalisation index = 21.6

Is anything the same??

Problem: these are two different Middle Chinese tonal cognates.
The amount of variance around these normalised curves is less than that for a single variety (Shanghai).
They are demonstrably linguistic-tonetically the same tone.

SLIDE 38

Summary

Talk has focussed on quantified comparison of tonal/intonational F0 shapes ...
And testing of hypotheses about them!
It has shown that BSDs (from a lot of data!) are crucial for doing this.

SLIDE 39

Two sides of the same coin: between-speaker F0 differences in linguistic-tonetic description and forensic voice comparison.

phil rose
