

SLIDE 1

multimodal meeting manager - m4 Speech and Hearing Research Group, University of Sheffield, UK

Crosstalk Analysis

Stuart N Wrigley Vincent Wan Guy J Brown Steve Renals

29 January 2003

SLIDE 2

Crosstalk Analysis

Goals

  • Detection of crosstalk.
  • Ideally, we would like to segment each channel into: channel speaker alone, channel speaker + crosstalk, and crosstalk alone.

  • Segmentation must be channel (i.e. speaker, meeting, environment) independent.

Data

  • ICSI: close-talking mics for each participant (mix of lapel and head-mounted), plus 4 tabletop mics. Large amounts of data, which have already been checked and transcribed.

  • IDIAP: lapel mics, plus 12 tabletop mics and a manikin. Still in the initial stages of collection and transcription.

SLIDE 3

Initial notes

Despite its attractiveness, channel energy may be an unreliable cue to speaker activity:

  • The ICSI data were recorded primarily with head-mounted microphones, so the microphone is fixed relative to the mouth (with one or two notable exceptions).

  • However, the M4 recordings are made with lapel microphones: head and body movement will change the channel gain throughout the meeting. For example, if a speaker turns their head to speak to a colleague, the signal energy in that channel may drop significantly.

SLIDE 4

Channel Activity Classifier

Our goal is to produce a system that will classify each frame of a meeting as one of:

  • Current channel speaker alone
  • Current channel speaker + crosstalk
  • Crosstalk alone
  • Silence / background noise

We have taken a similar approach to that of ICSI by using an ergodic HMM (EHMM). However, our classifier differs:

  • Four main states, as opposed to ICSI’s two (speech / nonspeech).
  • No intermediate state pairs (which ICSI uses to impose time constraints on transitions).

SLIDE 5

Ergodic HMM (EHMM)

  • Four states, each representing a particular label.
  • Equal prior probability of first state being any one of the four.
  • Each state modelled as a multivariate GMM.
  • Transitions allowed between every state pair.
  • No minimum residency time in each state.

States: S = speaker alone, C = crosstalk alone, SC = speaker + crosstalk, N = silence / noise. A minimal decoding sketch follows.
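Below is a minimal decoding sketch (our illustration in Python, not the project's code): uniform initial state probabilities, a fully connected transition matrix (the uniform values are placeholders; in practice these would be trained), and Viterbi decoding over per-frame GMM log-likelihoods supplied by the caller.

```python
# Sketch of the four-state ergodic HMM: uniform priors, transitions allowed
# between every state pair, per-state GMM emissions supplied by the caller.
import numpy as np

STATES = ["S", "C", "SC", "N"]   # speaker, crosstalk, speaker+crosstalk, noise
N_STATES = len(STATES)

log_pi = np.full(N_STATES, np.log(1.0 / N_STATES))  # equal state priors
# Placeholder fully connected transition matrix; real values would be trained.
log_A = np.log(np.full((N_STATES, N_STATES), 0.25))

def viterbi(frame_loglik: np.ndarray) -> np.ndarray:
    """Decode the most likely state sequence.

    frame_loglik: (T, N_STATES) per-frame GMM log-likelihoods.
    Returns an array of T state indices.
    """
    T = frame_loglik.shape[0]
    delta = np.zeros((T, N_STATES))          # best log-prob ending in state
    psi = np.zeros((T, N_STATES), dtype=int) # backpointers
    delta[0] = log_pi + frame_loglik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # indexed (from, to)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + frame_loglik[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path
```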

SLIDE 6

Features

As mentioned at the August meeting, we wished to look at many different features and determine which were the best. The list (which is still growing) is:

  • MFCCs (20 coeffs)
  • Energy
  • Zero crossing rate (ZCR)
  • Time-domain kurtosis (a measure of nongaussianity of the signal)
  • Frequency-domain kurtosis (a measure of nongaussianity of the spectrum)
  • Spectral autocorrelation peak-valley ratio (SAPVR)
  • Fundamentalness (a measure related to AM and FM at different frequencies)
  • max, min and mean crosscorrelation of all channel pairs
  • autocorrelation normalised max, min and mean crosscorrelation of all channel pairs

Total number of features: 13 (Total number of dimensions: 32)
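As an illustration of the last two items in the list, here is one plausible way to compute the cross-channel correlation features for a single frame. This is a sketch under assumptions: the slides do not specify the lag range or the exact normalisation, so both are our choices.

```python
# Sketch: peak cross-correlation of every channel pair for one 16 ms frame,
# summarised by max/min/mean; "normalised" variants divide each peak by the
# geometric mean of the two channels' zero-lag autocorrelations (energies).
import itertools
import numpy as np

def xcorr_features(frames: np.ndarray) -> dict:
    """frames: (n_channels, frame_len) array, one frame per channel."""
    peaks, norm_peaks = [], []
    for i, j in itertools.combinations(range(frames.shape[0]), 2):
        xc = np.correlate(frames[i], frames[j], mode="full")
        peak = np.abs(xc).max()
        peaks.append(peak)
        energy = np.sqrt(np.dot(frames[i], frames[i]) *
                         np.dot(frames[j], frames[j]))
        norm_peaks.append(peak / max(energy, 1e-12))
    peaks, norm_peaks = np.array(peaks), np.array(norm_peaks)
    return {
        "max_xc": peaks.max(), "min_xc": peaks.min(), "mean_xc": peaks.mean(),
        "max_nxc": norm_peaks.max(), "min_nxc": norm_peaks.min(),
        "mean_nxc": norm_peaks.mean(),
    }
```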

SLIDE 7

Spotlight on...

Kurtosis (the fourth central moment divided by the fourth power of the standard deviation; the statements below refer to the excess kurtosis, i.e. this ratio minus 3)

  • Kurtosis is based on the size of a distribution's tails - i.e. it is a measure of nongaussianity.
  • Excess kurtosis is zero for a gaussian random variable; nongaussian random variables have nonzero kurtosis.
  • The kurtosis of co-channel speech (crosstalk) is generally less than the kurtosis of the individual speech utterances#.

# See LeBlanc and de Leon, "Speech Separation by Kurtosis Maximization", Proc. IEEE ICASSP 1998, pp. 1029-1032.
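A minimal sketch of both kurtosis features for one analysis frame, following the definition above and reporting excess kurtosis so that a gaussian signal scores zero:

```python
# Sketch: time-domain kurtosis of the samples and frequency-domain kurtosis
# of the magnitude spectrum, both as excess kurtosis (gaussian -> 0).
import numpy as np

def excess_kurtosis(x: np.ndarray) -> float:
    x = x - x.mean()
    var = (x ** 2).mean()
    return (x ** 4).mean() / (var ** 2 + 1e-12) - 3.0

def frame_kurtosis(frame: np.ndarray) -> tuple:
    """Return (time-domain, frequency-domain) kurtosis of one frame."""
    time_k = excess_kurtosis(frame)
    spectrum = np.abs(np.fft.rfft(frame))
    freq_k = excess_kurtosis(spectrum)
    return time_k, freq_k
```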

SLIDE 8

Spotlight on...

Fundamentalness (see Kawahara et al., Speech Communication 27 (1999), p. 196, eqns (13)-(19))

  • A wide analysing wavelet makes the output corresponding to the fundamental component have smaller FM and AM than the other outputs.

  • Fundamentalness is defined to have its maximum value when the FM and AM modulation magnitudes are at their minimum - corresponding to the fundamental component.

  • Although this was developed to analyse a single harmonic series, the concept that a single fundamental produces high fundamentalness is useful:

  • If more than one fundamental is present, interference between the two components will cause AM and FM modulation, thus decreasing the fundamentalness measure.
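The following is only a loose illustrative proxy, not Kawahara's eqns (13)-(19): it takes one band-pass (wavelet-like) filter output, extracts instantaneous amplitude and frequency via the analytic signal, and scores fundamentalness as high when the AM and FM modulation magnitudes are small. The function name and the exact modulation measures are our assumptions.

```python
# Rough proxy: a single clean fundamental has small AM/FM modulation in its
# band, so -log(AM variance * FM variance) is large; competing fundamentals
# beat against each other, raising the modulation and lowering the score.
import numpy as np
from scipy.signal import hilbert

def fundamentalness_proxy(band: np.ndarray, fs: float) -> float:
    """band: output of one (wavelet-like) band-pass filter channel."""
    analytic = hilbert(band)
    amp = np.abs(analytic)                         # instantaneous amplitude
    phase = np.unwrap(np.angle(analytic))
    inst_freq = np.diff(phase) * fs / (2 * np.pi)  # instantaneous frequency
    am = np.var(np.diff(np.log(amp + 1e-12)))      # AM modulation magnitude
    fm = np.var(np.diff(inst_freq) / max(abs(inst_freq.mean()), 1e-12))
    return -np.log(am * fm + 1e-12)
```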

SLIDE 9

Data

  • For each classifier, the multivariate GMMs were trained on 1M frames (16 ms) per class, taken randomly from four ICSI meetings (bro012, bmr006, bed010, bed008).

  • The classifier was evaluated using 1K frames (16 ms) per class, taken randomly from one ICSI meeting (bmr001).

Note that the crosscorrelation information is incorporated into the feature set, as opposed to being a post-processing stage as in the ICSI classifier.
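A minimal training sketch, assuming scikit-learn's GaussianMixture (an anachronism for 2003, used here purely for illustration; the mixture size is not stated on the slides, so n_components is a placeholder):

```python
# Sketch: train one multivariate GMM per class label on its feature frames.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_gmms(features_by_class: dict, n_components: int = 16) -> dict:
    """features_by_class: class label -> (n_frames, n_dims) feature matrix.
    n_components is a placeholder; the slides do not state the mixture size."""
    models = {}
    for label, feats in features_by_class.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="full")
        models[label] = gmm.fit(feats)
    return models
```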

SLIDE 10

Selection of best features

  • The parcel algorithm (see Scott, Niranjan and Prager) was used to assess the classification performance of the different feature combinations.

  • A receiver operating characteristic (ROC) curve shows classification performance.

  • For each feature combination, the GMMs are trained and then evaluated to create a ROC for each class.

  • Each point on a ROC curve represents the performance of a classifier with a different decision threshold between two classes (i.e. the class of interest vs all others).

  • Given a number of ROCs (one per feature combination), a maximum realisable ROC (MRROC) can be calculated by fitting a convex hull over the existing ROCs.

  • Therefore, each point on an MRROC represents the optimum feature combination for that class at a particular trade-off between true positives and false positives; a sketch of the hull construction follows.
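A minimal sketch of that hull construction: pool the (false positive, true positive) points of every candidate ROC, add the trivial endpoints, and keep the upper convex hull, remembering which feature combination produced each retained vertex.

```python
# Sketch: maximum realisable ROC (MRROC) as the upper convex hull over the
# pooled operating points of all candidate ROCs (monotone-chain construction).
import numpy as np

def mrroc(rocs: dict) -> list:
    """rocs: combination name -> (n_points, 2) array of (fp, tp) rates.
    Returns the hull vertices as (fp, tp, combination) tuples."""
    points = [(fp, tp, name) for name, arr in rocs.items() for fp, tp in arr]
    points += [(0.0, 0.0, "trivial"), (1.0, 1.0, "trivial")]  # chance ends
    points.sort()
    hull = []
    for p in points:
        # Pop the last vertex while it lies on or below the new edge.
        while len(hull) >= 2:
            (x1, y1, _), (x2, y2, _) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull
```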

SLIDE 11

ROCs

After initial inspection of the ‘raw’ ROCs, it was determined that only a subset of features should be investigated, reducing the number of combinations from 8191 (all non-empty subsets of the 13 features: 2^13 - 1) to 127 (2^7 - 1, for 7 features). For example, the performance of the MFCCs was sufficiently low that they were not considered in combination with others, e.g. MFCCs vs crosscorrelation in detecting speaker alone:

[Figure: single-feature ROC curves (correct detection probability vs. false alarm probability, in %) for speaker alone, crosstalk alone, speaker+crosstalk and silence. Left: MFCCs (speaker alone ~64 %). Right: max normalised XC (speaker alone ~78 %).]
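For concreteness, a sketch (our illustration) of how each one-vs-rest ROC is traced by sweeping a decision threshold over the log-likelihood ratio between the class of interest and the rest:

```python
# Sketch: trace a ROC by thresholding the log-likelihood ratio per frame.
import numpy as np

def roc_points(ll_target: np.ndarray, ll_rest: np.ndarray,
               is_target: np.ndarray, n_thresholds: int = 100) -> np.ndarray:
    """ll_target/ll_rest: per-frame log-likelihoods under the two models;
    is_target: boolean ground truth. Returns (n_thresholds, 2) (fp, tp)."""
    ratio = ll_target - ll_rest
    thresholds = np.linspace(ratio.min(), ratio.max(), n_thresholds)
    points = []
    for th in thresholds:
        decide = ratio >= th
        tp = (decide & is_target).sum() / max(is_target.sum(), 1)
        fp = (decide & ~is_target).sum() / max((~is_target).sum(), 1)
        points.append((fp, tp))
    return np.array(points)
```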

SLIDE 12

MRROCs

We computed the MRROCs of each combination of:

  • energy
  • kurtosis
  • fundamentalness
  • max XC
  • mean XC
  • max normalised XC
  • mean normalised XC

... and then the MRROC of those MRROCs! The final MRROC tells us which feature combination to use in the final classifier.

SLIDE 13

MRROC

[Figure: MRROC using features energy, kurtosis, fundamentalness, max XC, mean XC, max normalised XC and mean normalised XC; correct detection probability vs. false alarm probability (in %), with curves for speaker alone, crosstalk alone, speaker+crosstalk and silence.]

Speaker alone: ~83 %

SLIDE 14

MRROC discarding energy

[Figure: MRROC using features kurtosis, fundamentalness, max XC, mean XC, max normalised XC and mean normalised XC; same axes and classes as above.]

Speaker alone: ~81 %

SLIDE 15

MRROC discarding energy and crosscorrelation

(note ~10 % performance drop when not using crosscorrelation)

[Figure: MRROC using features kurtosis and fundamentalness only; same axes and classes as above.]

Speaker alone: ~71 %

SLIDE 16

Ergodic HMM performance (preliminary)

  • The results above show the GMM classification performance.
  • When channel 0 of ICSI meeting bmr001 was classified by an EHMM (i.e. GMMs + transition probabilities) trained using the best features, speaker alone classification increased to 90 % true positives (false positives: 11 %).

  • However, false positives are an issue, as is silence/noise detection:

[Figure: manual frame transcriptions of a portion of meeting bmr001 (top) vs. the EHMM classification of the same portion (bottom); time axis in 16 ms frames with 10 ms shift; classes: speaker alone, speaker + crosstalk, crosstalk alone, silence / noise.]
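A sketch (our illustration) of the frame-level scoring behind these figures: compare the decoded state sequence with the manual transcription and compute true and false positive rates for the speaker alone class.

```python
# Sketch: per-frame true/false positive rates for one class of interest.
import numpy as np

def speaker_alone_rates(decoded: np.ndarray, reference: np.ndarray,
                        speaker_alone: int = 0) -> tuple:
    """decoded/reference: per-frame state indices; speaker_alone is the
    index of the 'speaker alone' state (assumed 0 here).
    Returns (true positive rate, false positive rate)."""
    hyp = decoded == speaker_alone
    ref = reference == speaker_alone
    tp_rate = (hyp & ref).sum() / max(ref.sum(), 1)
    fp_rate = (hyp & ~ref).sum() / max((~ref).sum(), 1)
    return tp_rate, fp_rate
```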

SLIDE 17

Summary

Optimum features appear to be

  • energy (but may not want to use this)
  • kurtosis
  • fundamentalness
  • crosscorrelation

Performance rises when transition probabilities are incorporated (ergodic HMM as opposed to pure GMMs).

Next

These results are still preliminary and need more analysis. We also intend to look at crosscorrelation in more detail, with the aim of using this feature to determine the number of active speakers during crosstalk periods.