speaker change detection using fundamental frequency with - PowerPoint PPT Presentation

Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation
May 16, 2019
Aidan Hogg, Christine Evers and Patrick Naylor
Electrical and Electronic Engineering, Imperial College London, UK

Motivation: Diarization
What is speaker diarization? It answers the question “who spoke when?” in an audio recording.
Is diarization really that useful?
∙ Speaker indexing and rich transcription
∙ Speaker segmentation and clustering to help Automatic Speech Recognition (ASR) systems
∙ Preprocessing for algorithms that assume a single speaker
A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

Diarization Method

Diarization Method: Speech signal

Diarization Method: Segmentation

Diarization Method: Clustering

Diarization Method: Segmentation motivation
Is good segmentation really that useful? Why not just split the audio stream into small uniform segments and cluster them with realignment?
If the speech segments are small, each segment contains only a small amount of information that can be used for clustering.

Speaker Pitch Tracks

Speaker Pitch Tracks: The AMI Meeting Room Corpus
A multi-modal data set of 100 hours of meeting recordings, recorded in English in three different rooms with different acoustic properties. Most of the speakers are non-native.

Speaker Pitch Tracks: meeting ‘ES2004b’
[Figure: estimated pitch (Hz, 100 to 300) against time (s, 0 to 1000) for Speakers A, B, C and D]

Speaker Pitch Tracks: meeting ‘TS3003b’
[Figure: estimated pitch (Hz, 100 to 300) against time (s, 0 to 1000) for Speakers A, B, C and D]

Pitch Segmentation

Pitch Segmentation: The new idea
Assumption: if a speaker’s pitch varies only smoothly, owing to physiological constraints (Xu, 2002), it should be possible to predict the speaker’s future pitch from their current pitch.
Main idea: use a Kalman filter to carry out this pitch prediction. If the pitch cannot be predicted, the speaker has potentially changed.

Pitch Segmentation: Proposed system
Proposed pitch segmentation system: Audio input file → VAD → Pitch estimation → Kalman filter → Change detection → Segmentation

Pitch Segmentation: Kalman filter
The pitch $x(n)$ for a given frame $n$ is modelled as a random walk:
$x(n+1) = x(n) + w, \quad w \sim \mathcal{N}(0, \sigma_w^2)$
The measurement $z(n)$ of the true pitch $x(n)$ is modelled as:
$z(n) = x(n) + v, \quad v \sim \mathcal{N}(0, \sigma_v^2)$
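To make the state-space model concrete, the sketch below draws a synthetic pitch track and its noisy measurements from it. The starting pitch, the noise standard deviations and the random seed are illustrative assumptions, not values from the presentation.

```python
import random


def simulate_pitch_track(n_frames, x0=150.0, sigma_w=1.0, sigma_v=5.0, seed=0):
    """Draw x(n) from the random walk x(n+1) = x(n) + w and noisy
    measurements z(n) = x(n) + v, all in Hz.

    x0, sigma_w, sigma_v and seed are assumed illustrative values.
    """
    rng = random.Random(seed)
    true_pitch, measurements = [], []
    x = x0
    for _ in range(n_frames):
        true_pitch.append(x)
        measurements.append(x + rng.gauss(0.0, sigma_v))  # z(n) = x(n) + v
        x = x + rng.gauss(0.0, sigma_w)                    # x(n+1) = x(n) + w
    return true_pitch, measurements


x, z = simulate_pitch_track(200)
print(f"first true pitch {x[0]:.1f} Hz, first measurement {z[0]:.1f} Hz")
```

Setting `sigma_v=0.0` makes every measurement equal the true pitch, which is a quick sanity check on the measurement model.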

Pitch Segmentation: Prediction
Performed on every frame.
Predicted pitch estimate: $\hat{x}_{n|n-1} = \hat{x}_{n-1|n-1}$
Predicted estimate variance: $P_{n|n-1} = P_{n-1|n-1} + \sigma_w^2$
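A minimal sketch of the prediction step, assuming the scalar random-walk model above; `kalman_predict` is a hypothetical helper name. Because prediction runs on every frame, repeated calls with no update in between leave the estimate unchanged and simply widen the variance.

```python
def kalman_predict(x_prev: float, p_prev: float, sigma_w2: float) -> tuple:
    """Prediction step of a scalar random-walk Kalman filter.

    x_prev   : previous posterior pitch estimate x_hat(n-1|n-1), in Hz
    p_prev   : previous posterior variance P(n-1|n-1)
    sigma_w2 : process-noise variance sigma_w^2 (an assumed tuning value)
    """
    x_pred = x_prev             # x_hat(n|n-1) = x_hat(n-1|n-1): the walk has no drift
    p_pred = p_prev + sigma_w2  # P(n|n-1) = P(n-1|n-1) + sigma_w^2
    return x_pred, p_pred


print(kalman_predict(150.0, 4.0, 1.0))  # (150.0, 5.0)

x, p = 150.0, 0.0
for _ in range(3):  # three frames with no update (e.g. unvoiced)
    x, p = kalman_predict(x, p, 2.0)
print(x, p)  # 150.0 6.0
```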

Pitch Segmentation: Update
Performed only if the frame is considered to be voiced.
Updated pitch estimate and updated estimate variance:
$\hat{x}_{n|n} = \hat{x}_{n|n-1} + K_n (z_n - \hat{x}_{n|n-1})$
$P_{n|n} = (1 - K_n)^2 P_{n|n-1} + K_n^2 \sigma_v^2$
If the Kalman gain is $K_n = 1$ (just the measurement): $\hat{x}_{n|n} = z_n$
If the Kalman gain is $K_n = 0$ (just the prediction): $\hat{x}_{n|n} = \hat{x}_{n|n-1}$
Optimal Kalman gain: $K_n = P_{n|n-1} / S_n$, where $S_n = P_{n|n-1} + \sigma_v^2$ is the innovation variance.
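The update step can be sketched as a second hypothetical helper, assuming the scalar model and the optimal gain given above. The usage lines exercise the limiting cases: a noiseless measurement drives the gain to 1 (the estimate becomes the measurement), while a very noisy one drives it towards 0 (the estimate stays at the prediction).

```python
def kalman_update(x_pred: float, p_pred: float, z: float, sigma_v2: float) -> tuple:
    """Update step for a voiced frame.

    Uses the optimal gain K = P(n|n-1) / S with S = P(n|n-1) + sigma_v^2,
    and the variance update P(n|n) = (1 - K)^2 P(n|n-1) + K^2 sigma_v^2.
    Assumes p_pred + sigma_v2 > 0 so the gain is defined.
    """
    s = p_pred + sigma_v2                # innovation variance S(n)
    k = p_pred / s                       # optimal Kalman gain K(n)
    x_post = x_pred + k * (z - x_pred)   # blend prediction and measurement
    p_post = (1 - k) ** 2 * p_pred + k ** 2 * sigma_v2
    return x_post, p_post, k


print(kalman_update(150.0, 5.0, 160.0, 5.0))  # (155.0, 2.5, 0.5)
print(kalman_update(150.0, 5.0, 160.0, 0.0))  # (160.0, 0.0, 1.0)
x, p, k = kalman_update(150.0, 5.0, 160.0, 1e9)
print(f"K = {k:.2e}, estimate = {x:.4f} Hz")  # gain near 0: estimate stays near 150 Hz
```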

Pitch Segmentation: Variance ‘P’
[Figure: spectrogram (0 to 0.6 kHz, colour scale −40 to 0 dB) above a plot of the estimate variance P (0 to 2.0) against time (0 to 6 seconds)]
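The plotted variance is presumably the result of predicting on every frame and updating only on voiced frames. The sketch below reproduces that qualitative behaviour on a synthetic voiced/unvoiced mask. The initial variance and both noise variances are assumed values, so the numbers will not match the figure.

```python
def variance_trace(voiced, p0=1.0, sigma_w2=0.05, sigma_v2=1.0):
    """Return P(n|n) for each frame given a voiced/unvoiced mask.

    Every frame is predicted (P grows by sigma_w^2); only voiced frames
    are updated (P shrinks). p0, sigma_w2 and sigma_v2 are assumptions.
    """
    p = p0
    trace = []
    for is_voiced in voiced:
        p = p + sigma_w2                  # prediction: P(n|n-1) = P(n-1|n-1) + sigma_w^2
        if is_voiced:
            k = p / (p + sigma_v2)        # optimal gain K(n) = P(n|n-1) / S(n)
            p = (1 - k) ** 2 * p + k ** 2 * sigma_v2
        trace.append(p)
    return trace


# 20 voiced frames, a 20-frame pause, then 20 voiced frames again
mask = [True] * 20 + [False] * 20 + [True] * 20
trace = variance_trace(mask)
print(f"end of speech: {trace[19]:.3f}, end of pause: {trace[39]:.3f}, "
      f"after resuming: {trace[59]:.3f}")
```

During the pause the variance climbs linearly, so the first voiced frame after it gets a large gain and the filter quickly locks on to the new measurement.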

Pitch Segmentation: Speaker change detection
A Kalman filter is initialised and tracks the first speaker. If the error between the measurement and the prediction exceeds a threshold (10 Hz), all previously generated Kalman tracks are checked.
∙ If the closest previous Kalman pitch track is within a threshold (50 Hz) of the measurement, that Kalman filter is continued.
∙ Otherwise, if even the closest Kalman filter does not satisfy this threshold, a new Kalman filter is generated.
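The rules above can be sketched as a small multi-track detector. The 10 Hz and 50 Hz thresholds come from the slide; everything else is an assumption made for illustration: the noise variances, the initial variance `p0`, treating `None` as an unvoiced frame, and running only the active filter while dormant tracks keep their last estimate. `detect_changes` and `PitchTrack` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PitchTrack:
    x: float  # current pitch estimate (Hz)
    p: float  # current estimate variance


def detect_changes(pitch: List[Optional[float]], change_thr: float = 10.0,
                   match_thr: float = 50.0, sigma_w2: float = 1.0,
                   sigma_v2: float = 4.0, p0: float = 10.0) -> List[Tuple[int, int]]:
    """Return (frame, track_id) at the start of each pitch segment.

    `pitch` holds one measurement per frame in Hz, or None for unvoiced frames.
    """
    tracks: List[PitchTrack] = []
    active: Optional[int] = None
    starts: List[Tuple[int, int]] = []
    for n, z in enumerate(pitch):
        if active is not None:
            tracks[active].p += sigma_w2          # predict: x unchanged, P grows
        if z is None:
            continue                              # unvoiced frame: prediction only
        if active is None:                        # first voiced frame starts track 0
            tracks.append(PitchTrack(z, p0))
            active = 0
            starts.append((n, 0))
            continue
        if abs(z - tracks[active].x) > change_thr:
            # Prediction error too large: check every track generated so far.
            best = min(range(len(tracks)), key=lambda i: abs(z - tracks[i].x))
            if abs(z - tracks[best].x) < match_thr:
                new_active = best                 # continue the closest track
            else:
                tracks.append(PitchTrack(z, p0))  # no track close enough: new filter
                new_active = len(tracks) - 1
            if new_active != active:
                starts.append((n, new_active))    # potential speaker change
                active = new_active
        t = tracks[active]
        k = t.p / (t.p + sigma_v2)                # optimal Kalman gain
        t.x += k * (z - t.x)                      # update the estimate
        t.p = (1 - k) ** 2 * t.p + k ** 2 * sigma_v2
    return starts


# Hypothetical input: ~110 Hz speaker, pause, ~220 Hz speaker, pause, back to ~112 Hz
pitch = [110.0] * 5 + [None] * 3 + [220.0] * 5 + [None] * 2 + [112.0] * 5
print(detect_changes(pitch))  # [(0, 0), (8, 1), (15, 0)]
```

The return to 112 Hz is matched to the existing 110 Hz track rather than spawning a third filter, which is what lets the scheme label a returning speaker consistently.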

Ground Truth

Ground Truth: A comparison of pitch and speaker changes

Meeting    SC | PC (%)    PC | SC (%)
ES2004a    94.49          78.76
ES2004b    89.25          68.60
ES2004c    95.21          70.22
ES2004d    91.85          73.38
IS1009a    96.12          68.91
IS1009b    98.94          64.27
IS1009c    97.67          59.38
IS1009d    98.55          66.60
EN2002a    92.35          88.59
EN2002b    87.01          83.40
EN2002c    79.37          87.70
EN2002d    86.00          81.02
TS3003a    76.54          52.08
TS3003b    76.59          48.46
TS3003c    75.82          56.47
TS3003d    81.34          62.68

SC | PC: the probability that there is a ‘speaker change’ given that there is a ‘pitch change’.
PC | SC: the probability that there is a ‘pitch change’ given that there is a ‘speaker change’.
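The table does not state how a pitch change and a speaker change were judged to coincide. A minimal sketch, assuming two events match when they lie within a hypothetical tolerance of 0.25 s, estimates both conditional probabilities from lists of change times. The change times used below are made up for illustration.

```python
def conditional_match_rate(given, target, tol=0.25):
    """Fraction of events in `given` with at least one event in `target`
    within +/- tol seconds: an estimate of P(target | given).

    The 0.25 s tolerance is an assumption, not a value from the slides.
    """
    if not given:
        return 0.0
    hits = sum(1 for g in given if any(abs(g - t) <= tol for t in target))
    return hits / len(given)


speaker_changes = [3.0, 7.5, 12.0, 20.0]     # hypothetical boundaries (s)
pitch_changes = [3.1, 7.4, 9.0, 12.2, 15.0]  # hypothetical pitch changes (s)
sc_given_pc = conditional_match_rate(pitch_changes, speaker_changes)
pc_given_sc = conditional_match_rate(speaker_changes, pitch_changes)
print(f"SC | PC = {sc_given_pc:.0%}, PC | SC = {pc_given_sc:.0%}")  # SC | PC = 60%, PC | SC = 75%
```

Read this way, a high SC | PC means most pitch changes are genuine speaker changes, while a lower PC | SC means some speaker changes occur without a detectable pitch change.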

Evaluation

Evaluation: MFCC vs pitch segmentation
Benchmark system (‘Sidekit’, https://projets-lium.univ-lemans.fr/s4d/): Audio input file → VAD → MFCC extraction → Segmentation
Proposed system: Audio input file → VAD → Pitch estimation → Kalman filter → Change detection → Segmentation

Evaluation: Benchmark system
[Figure: Hit, Miss and Multi-Hit rates (%) for each meeting]
Scored with a 500 ms collar around each speaker change boundary (250 ms before and after).
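The slides do not define the three outcomes precisely. One plausible reading, used here purely as an assumption, classifies each reference boundary by how many detected changes fall inside its 500 ms collar: exactly one is a hit, more than one is a multi-hit, none is a miss. `score_boundaries` is a hypothetical helper and the boundary times are made up.

```python
def score_boundaries(reference, detected, half_collar=0.25):
    """Classify each reference speaker-change boundary (times in seconds).

    Assumed definitions (not spelled out on the slides):
      hit       - exactly one detected change inside the collar
      multi_hit - more than one detected change inside the collar
      miss      - no detected change inside the collar
    Returns each outcome as a percentage of the reference boundaries.
    """
    counts = {"hit": 0, "multi_hit": 0, "miss": 0}
    for b in reference:
        inside = sum(1 for d in detected if abs(d - b) <= half_collar)
        if inside == 0:
            counts["miss"] += 1
        elif inside == 1:
            counts["hit"] += 1
        else:
            counts["multi_hit"] += 1
    total = len(reference) or 1  # avoid dividing by zero for an empty reference
    return {k: 100.0 * v / total for k, v in counts.items()}


reference = [2.0, 5.0, 9.0, 14.0]
detected = [2.1, 4.8, 5.2, 11.0, 13.9]
print(score_boundaries(reference, detected))
# {'hit': 50.0, 'multi_hit': 25.0, 'miss': 25.0}
```

Under this reading the three rates always sum to 100%; if the original evaluation counted a multi-hit as a hit as well, the sums would differ.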

Evaluation: Proposed system
[Figure: Hit, Miss and Multi-Hit rates (%) for each meeting]
Scored with the same 500 ms collar (250 ms before and after each speaker change boundary).

Evaluation: Comparison
[Figure: overall Hit, Miss and Multi-Hit rates (%) for the Pitch System and the MFCC System]
Scored with the same 500 ms collar (250 ms before and after each speaker change boundary).

Evaluation: Conclusion
The proposed approach, based on the Kalman filter prediction error, outperformed a previous MFCC-based method. An evaluation on the AMI corpus showed the speaker change detection rate increasing from 43.3% to 70.5%.
