Speaker change detection using fundamental frequency with application to multi-talker segmentation. May 16, 2019. Aidan Hogg, Christine Evers and Patrick Naylor, Electrical and Electronic Engineering, Imperial College London, UK.


slide-1
SLIDE 1

speaker change detection using fundamental frequency with application to multi-talker segmentation

May 16, 2019 Aidan Hogg, Christine Evers and Patrick Naylor Electrical and Electronic Engineering, Imperial College London, UK

slide-2
SLIDE 2

diarization

Motivation

What is speaker diarization?
Answers the question “who spoke when?” in an audio recording.

Is diarization really that useful?
∙ Speaker indexing and rich transcription
∙ Speaker segmentation and clustering helping Automatic Speech Recognition (ASR) systems
∙ Preprocessing modules for single speaker-based algorithms

  • A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation


slide-3
SLIDE 3

diarization method

slide-4
SLIDE 4

speech signal

Diarization Method


slide-5
SLIDE 5

segmentation

Diarization Method


slide-6
SLIDE 6

clustering

Diarization Method


slide-7
SLIDE 7

segmentation motivation

Diarization Method

Is good segmentation really that useful? Why not just segment the audio stream into small uniform segments and cluster with realignment? If the speech segments are small then each segment only contains a small amount of information that can be used for clustering.


slide-8
SLIDE 8

speaker pitch tracks

slide-9
SLIDE 9

the ami meeting room corpus

Speaker Pitch Tracks

A multi-modal data set consisting of 100 hours of meeting recordings. The meetings were recorded in English in three different rooms with different acoustic properties, and the speakers are mostly non-native.


slide-10
SLIDE 10

speaker pitch tracks from ‘es2004b’

Speaker Pitch Tracks

[Figure: estimated pitch (Hz) against time (s) for Speaker A, Speaker B, Speaker C and Speaker D]


slide-11
SLIDE 11

speaker pitch tracks from ‘ts3003b’

Speaker Pitch Tracks

[Figure: estimated pitch (Hz) against time (s) for Speaker A, Speaker B, Speaker C and Speaker D]


slide-12
SLIDE 12

pitch segmentation

slide-13
SLIDE 13

the new idea

Pitch Segmentation

Assumption: If the speaker’s pitch only varies in a smooth manner due to physiological constraints (Xu, 2002), it should be possible to estimate the future pitch of the speaker based on their current pitch.

Main Idea: Use a Kalman filter to carry out this future pitch estimation. If the pitch can’t be estimated then the speaker has potentially changed.


slide-14
SLIDE 14

proposed system

Pitch Segmentation

[Block diagram with the blocks: Audio input, VAD, Pitch Estimation, Kalman filter, Change detection, Segmentation file]

Proposed pitch segmentation system


slide-15
SLIDE 15

kalman filter

Pitch Segmentation

The pitch $x(n)$ for a given frame $n$ can be written in the following way:

$$x(n+1) = x(n) + w, \quad w \sim \mathcal{N}(0, \sigma_w^2).$$

The measurement $z(n)$ of the true pitch $x(n)$ can be modelled according to:

$$z(n) = x(n) + v, \quad v \sim \mathcal{N}(0, \sigma_v^2).$$
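The model on this slide is a random walk for the true pitch plus additive Gaussian measurement noise. A minimal simulation sketch follows; the function name, the starting pitch of 120 Hz and the two noise standard deviations are illustrative assumptions, not values from the presentation.

```python
import random


def simulate_pitch(n_frames, x0=120.0, sigma_w=1.0, sigma_v=3.0, seed=0):
    """Simulate a random-walk pitch and its noisy measurements (all values in Hz).

    x0, sigma_w and sigma_v are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    true_pitch, measured_pitch = [], []
    x = x0
    for _ in range(n_frames):
        true_pitch.append(x)
        # Measurement = true pitch + zero-mean Gaussian measurement noise.
        measured_pitch.append(x + rng.gauss(0.0, sigma_v))
        # Next true pitch = current pitch + zero-mean Gaussian process noise.
        x = x + rng.gauss(0.0, sigma_w)
    return true_pitch, measured_pitch


true_pitch, measured_pitch = simulate_pitch(100)
```

A small process-noise deviation relative to the measurement-noise deviation encodes the assumption that a single speaker's pitch drifts slowly while individual pitch estimates can be noisy.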


slide-16
SLIDE 16

prediction

Pitch Segmentation

Performed on every frame.

Predicted pitch estimate:

$$\hat{x}_{n|n-1} = \hat{x}_{n-1|n-1}.$$

Predicted estimate variance:

$$P_{n|n-1} = P_{n-1|n-1} + \sigma_w^2.$$
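As a rough sketch of the prediction step, the function below carries the previous estimate forward unchanged and adds the process-noise variance to the estimate variance. The function and argument names are assumptions made for illustration.

```python
def kalman_predict(x_est, p_est, sigma_w2):
    """Prediction step for a scalar random-walk model.

    x_est: previous pitch estimate (Hz)
    p_est: previous estimate variance
    sigma_w2: process-noise variance (hypothetical tuning value)
    """
    x_pred = x_est             # the pitch is predicted to stay where it was
    p_pred = p_est + sigma_w2  # uncertainty grows by the process-noise variance
    return x_pred, p_pred


# With prediction only (no update), the variance grows frame by frame.
x, p = 120.0, 2.0
for _ in range(3):
    x, p = kalman_predict(x, p, sigma_w2=0.5)
```

Because prediction runs on every frame while the update runs only on voiced frames, the variance keeps increasing through stretches without a voiced measurement.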


slide-17
SLIDE 17

update

Pitch Segmentation

Performed if the frame is considered to be voiced.

Updated pitch estimate and updated estimate variance:

$$\hat{x}_{n|n} = \hat{x}_{n|n-1} + K_n (z_n - \hat{x}_{n|n-1})$$

$$P_{n|n} = (1 - K_n)^2 P_{n|n-1} + K_n^2 \sigma_v^2.$$

If the Kalman gain is $K_n = 1$: $\hat{x}_{n|n} = z_n$ (just the measurement)
If the Kalman gain is $K_n = 0$: $\hat{x}_{n|n} = \hat{x}_{n|n-1}$ (just the prediction)

Optimal Kalman gain:

$$K_n = \frac{P_{n|n-1}}{S_n}.$$
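The update step can be sketched as follows. The slide does not define the denominator of the optimal gain; this sketch assumes the standard innovation variance for a scalar model, that is, the predicted variance plus the measurement-noise variance. All names are illustrative.

```python
def kalman_update(x_pred, p_pred, z, sigma_v2):
    """Update step for a scalar Kalman filter (run on voiced frames only).

    x_pred, p_pred: predicted pitch estimate (Hz) and its variance
    z: measured pitch for this frame (Hz)
    sigma_v2: measurement-noise variance (hypothetical tuning value)
    """
    s = p_pred + sigma_v2  # assumed innovation variance (not given on the slide)
    k = p_pred / s         # Kalman gain, between 0 and 1
    x_new = x_pred + k * (z - x_pred)
    p_new = (1.0 - k) ** 2 * p_pred + k ** 2 * sigma_v2
    return x_new, p_new, k


# Equal prediction and measurement variances give a gain of 0.5,
# so the estimate lands halfway between prediction and measurement.
x_new, p_new, k = kalman_update(x_pred=100.0, p_pred=4.0, z=110.0, sigma_v2=4.0)
```

The two limiting cases on the slide fall out directly: a measurement-noise variance of zero gives a gain of 1 (trust the measurement), and a predicted variance of zero gives a gain of 0 (trust the prediction).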


slide-18
SLIDE 18

variance ‘p’

Pitch Segmentation

[Figure: spectrogram (frequency in kHz, level in dB) against time in seconds, shown with a plot of the variance P]


slide-19
SLIDE 19

speaker change detection

Pitch Segmentation

A Kalman filter is initialised and tracks the first speaker. If the error between the measurement and the prediction becomes larger than a threshold (10 Hz), then all previously generated Kalman tracks are checked.
∙ If the distance from the measurement to the closest previous Kalman pitch track is below a threshold (50 Hz), then that Kalman filter is continued.
∙ Otherwise, a new Kalman filter is generated.
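One possible reading of this decision rule is sketched below. Only the 10 Hz and 50 Hz thresholds come from the slide. Several details are assumptions made for illustration: the function name, the noise variances, excluding the currently active track from the search over previous tracks, and leaving inactive tracks frozen rather than running prediction on them every frame.

```python
def detect_speaker_changes(measurements, err_thresh=10.0, reuse_thresh=50.0,
                           sigma_w2=1.0, sigma_v2=9.0):
    """Return (change_frames, track_ids) for a list of voiced pitch values in Hz."""
    tracks = []    # each track is [pitch_estimate, variance]
    active = None  # index of the track currently being followed
    change_frames, track_ids = [], []
    for n, z in enumerate(measurements):
        if active is None:
            # First measurement: initialise the first Kalman track.
            tracks.append([z, sigma_v2])
            active = 0
        elif abs(z - tracks[active][0]) > err_thresh:
            # Prediction error too large: check the other existing tracks.
            others = [i for i in range(len(tracks)) if i != active]
            closest = min(others, key=lambda i: abs(z - tracks[i][0]), default=None)
            if closest is not None and abs(z - tracks[closest][0]) < reuse_thresh:
                active = closest              # continue a previous track
            else:
                tracks.append([z, sigma_v2])  # start a new track
                active = len(tracks) - 1
            change_frames.append(n)
        # Scalar predict + update on the active track.
        x, p = tracks[active]
        p = p + sigma_w2
        k = p / (p + sigma_v2)
        tracks[active] = [x + k * (z - x), (1.0 - k) ** 2 * p + k ** 2 * sigma_v2]
        track_ids.append(active)
    return change_frames, track_ids


changes, ids = detect_speaker_changes([120, 121, 119, 210, 211, 122, 300])
```

In this toy input the jump to about 210 Hz opens a second track, the return to about 122 Hz is matched back to the first track, and 300 Hz is too far from every existing track, so a third one is created.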


slide-20
SLIDE 20

ground truth

slide-21
SLIDE 21

a comparison of pitch and speaker changes

Ground Truth

Meeting    SC | PC (%)    PC | SC (%)
ES2004a    94.49          78.76
ES2004b    89.25          68.60
ES2004c    95.21          70.22
ES2004d    91.85          73.38
IS1009a    96.12          68.91
IS1009b    98.94          64.27
IS1009c    97.67          59.38
IS1009d    98.55          66.60
EN2002a    92.35          88.59
EN2002b    87.01          83.40
EN2002c    79.37          87.70
EN2002d    86.00          81.02
TS3003a    76.54          52.08
TS3003b    76.59          48.46
TS3003c    75.82          56.47
TS3003d    81.34          62.68

SC | PC: The probability that there is a ‘speaker change’ given that there is a ‘pitch change’
PC | SC: The probability that there is a ‘pitch change’ given that there is a ‘speaker change’
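For readers who want to see how two conditional rates of this kind relate to event counts, here is a minimal sketch. It assumes the recording has already been reduced to aligned lists of boolean flags, one pair per candidate position; the presentation does not state how pitch changes and speaker changes were aligned, so that representation and the function name are hypothetical.

```python
def conditional_rates(speaker_change, pitch_change):
    """Return (P(SC | PC), P(PC | SC)) in percent from aligned boolean flags."""
    both = sum(1 for s, p in zip(speaker_change, pitch_change) if s and p)
    n_sc = sum(1 for s in speaker_change if s)
    n_pc = sum(1 for p in pitch_change if p)
    # Each conditional rate divides the joint count by the count of the condition.
    sc_given_pc = 100.0 * both / n_pc if n_pc else float("nan")
    pc_given_sc = 100.0 * both / n_sc if n_sc else float("nan")
    return sc_given_pc, pc_given_sc


sc_given_pc, pc_given_sc = conditional_rates(
    speaker_change=[True, False, True, True],
    pitch_change=[True, True, False, False],
)
```

The two rates share a numerator but not a denominator, which is why a meeting can score high on one and much lower on the other.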


slide-22
SLIDE 22

evaluation

slide-23
SLIDE 23

mfcc vs pitch segmentation

EVALUATION

[Block diagram with the blocks: Audio input, VAD, MFCC extraction, Segmentation file]

Benchmark system (‘Sidekit’)

https://projets-lium.univ-lemans.fr/s4d/

[Block diagram with the blocks: Audio input, VAD, Pitch Estimation, Kalman filter, Change detection, Segmentation file]

Proposed system


slide-24
SLIDE 24

benchmark system evaluation

EVALUATION

[Figure: hit, miss and multi-hit rates (%) for each of the 16 meetings]

500 ms collar around each speaker change boundary (250 ms before and after)
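A collar-based score of this kind might be computed as in the sketch below. The slides do not define the three categories, so the definitions used here are assumptions: a reference boundary is a hit if exactly one detection falls within 250 ms of it, a multi-hit if more than one does, and a miss if none does. The function name is also illustrative.

```python
def score_boundaries(reference, detected, collar=0.5):
    """Return hit, miss and multi-hit rates (%) for reference boundaries.

    reference, detected: boundary times in seconds
    collar: total collar width in seconds (half before, half after)
    """
    half = collar / 2.0
    counts = {"hit": 0, "miss": 0, "multi_hit": 0}
    for ref in reference:
        # Count detections that fall inside the collar around this boundary.
        n_inside = sum(1 for d in detected if abs(d - ref) <= half)
        if n_inside == 0:
            counts["miss"] += 1
        elif n_inside == 1:
            counts["hit"] += 1
        else:
            counts["multi_hit"] += 1
    total = len(reference)
    return {name: 100.0 * c / total if total else 0.0 for name, c in counts.items()}


rates = score_boundaries(reference=[1.0, 5.0, 9.0, 12.0],
                         detected=[1.1, 4.9, 5.2, 20.0])
```

Under these definitions the three rates always sum to 100% for a given meeting; detections that fall outside every collar are not penalised in this sketch.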


slide-25
SLIDE 25

proposed system evaluation

EVALUATION

[Figure: hit, miss and multi-hit rates (%) for each of the 16 meetings]

500 ms collar around each speaker change boundary (250 ms before and after)


slide-26
SLIDE 26

evaluation comparison

EVALUATION

[Figure: hit, miss and multi-hit rates (%) for the Pitch System and the MFCC System]

500 ms collar around each speaker change boundary (250 ms before and after)


slide-27
SLIDE 27

conclusion

EVALUATION

The proposed approach, based on the Kalman filter prediction error, performed well when compared against a previous MFCC-based method. An evaluation on the AMI corpus showed an increase in speaker change detection from 43.3% to 70.5%.


slide-28
SLIDE 28

paper contributions

EVALUATION

In this paper we have...
∙ ...carried out a study of meetings in the AMI corpus that has shown that a pitch change is a strong indicator of a speaker change.
∙ ...highlighted that an individual’s pitch is smoothly varying and, therefore, can be predicted by using a Kalman filter.
∙ ...proposed a Kalman filtering approach to identify speaker change boundaries based on a model of the temporal variation of pitch.


slide-29
SLIDE 29

Questions?