Deep complementary features for speaker identification in TV broadcast data - PowerPoint PPT Presentation



SLIDE 1

Deep complementary features for speaker identification in TV broadcast data

Mateusz Budnik¹, Ali Khodabakhsh², Laurent Besacier¹, Cenk Demiroglu²

¹ Univ. Grenoble-Alpes   ² Ozyegin University

SLIDE 2

Agenda

  • Motivation
  • Related work
  • System overview
  • Experimental setup and dataset
  • Results
  • Conclusion and perspectives

SLIDE 3

Motivation

  • To investigate the use of a Convolutional Neural Network (a typical image-processing approach) for the task of speaker identification.
  • To study its fusion with more traditional systems.

SLIDE 4

Related work

  • In [1] a CNN is trained on spectrograms in order to identify disguised voices.
  • [2] uses 1D convolutions on filter banks. Surrounding frames are taken into account and serve as context to reduce the impact of noise.

1. Lior Uzan and Lior Wolf, “I know that voice: Identifying the voice actor behind the voice,” in Biometrics (ICB), 2015 International Conference on. IEEE, 2015, pp. 46–51.
2. Pavel Matejka, Le Zhang, Tim Ng, HS Mallidi, Ondrej Glembek, Jeff Ma, and Bing Zhang, “Neural network bottleneck features for language identification,” Proc. IEEE Odyssey, pp. 299–304, 2014.

SLIDE 5

System overview

SLIDE 6

Approaches

  • Convolutional Neural Network
  • TVS
  • GMM-UBM
  • PLDA

SLIDE 7

The network structure

SLIDE 8

CNN setup

  • Trained for around 12 epochs
  • ReLU and dropout (rate 0.5) after each fully connected (FC) layer
  • No random cropping or rotation
  • Average pooling instead of max pooling
  • Averaging the scores of individual spectrograms to get the score for a given speech segment
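The segment-level scoring step above can be sketched as follows; a minimal sketch assuming per-spectrogram class scores (e.g. softmax outputs) are already available, with all names hypothetical:

```python
import numpy as np

def segment_score(spectrogram_scores):
    """Average the per-spectrogram CNN scores of one speech segment.

    spectrogram_scores: shape (n_spectrograms, n_speakers), e.g. the
    softmax output for every spectrogram cut from the segment.
    Returns one score vector of shape (n_speakers,).
    """
    return np.asarray(spectrogram_scores, dtype=float).mean(axis=0)

# Example: three spectrograms scored over two speakers.
seg = segment_score([[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]])
speaker = int(np.argmax(seg))  # identified speaker index
```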

SLIDE 9

GMM-UBM, TVS and PLDA

  • A UBM consisting of 1024 Gaussians is trained on the training data

  • Segmentation outputs come from a conventional BIC-criterion approach
  • The i-vector dimension is 500
  • Length normalization is used
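As a rough illustration of this pipeline (not the actual toolkit used): the UBM can be sketched with scikit-learn's GaussianMixture, downscaled here to 8 diagonal-covariance Gaussians on random stand-in features, and length normalization simply projects each i-vector onto the unit sphere. The i-vector extractor itself is out of scope here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
frames = rng.normal(size=(2000, 59))   # stand-in for 59-dim MFCC frames

# UBM: one speaker-independent GMM over all training frames
# (8 components here instead of the 1024 used in the paper).
ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      max_iter=20, random_state=0).fit(frames)

def length_normalize(ivectors):
    """Length normalization: scale each i-vector to unit Euclidean norm."""
    iv = np.asarray(ivectors, dtype=float)
    return iv / np.linalg.norm(iv, axis=-1, keepdims=True)

ivecs = length_normalize(rng.normal(size=(3, 500)))  # 500-dim i-vectors
```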

SLIDE 10

Fusion

  • Fusion between TVS and CNN
  • Late fusion
  • Duration-based late fusion

○ s = (1 − tanh(d)) · s_cnn + s_ivec

  • Early fusion with SVMs

○ CNN’s last hidden layer with PCA (500) + i-vector (500)
○ Linear SVM
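The duration-based rule above can be written out directly; a small sketch where d is the segment duration in seconds and the score vectors are hypothetical:

```python
import numpy as np

def late_fuse(s_cnn, s_ivec, d):
    """Duration-based late fusion: s = (1 - tanh(d)) * s_cnn + s_ivec.

    For short segments (small d) the CNN contributes strongly; as d
    grows, tanh(d) -> 1 and the i-vector (TVS) score dominates.
    """
    return (1.0 - np.tanh(d)) * np.asarray(s_cnn) + np.asarray(s_ivec)

short = late_fuse([0.6, 0.4], [0.5, 0.5], d=0.5)   # CNN still weighs in
long_ = late_fuse([0.6, 0.4], [0.5, 0.5], d=30.0)  # ~ i-vector score only
```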

SLIDE 11

Dataset

  • The REPERE corpus
  • French language, 7 types of videos (news, debates, etc.)
  • Noisy and imbalanced
  • Train set:

○ 821 speakers
○ 9377 speech segments from 148 videos (22h of speech)

  • Test set:

○ 113 speakers
○ 2410 segments from 57 videos (6h of speech)

SLIDE 12

Dataset


Total amount of speech per speaker, for speakers present in both the train and test sets of the REPERE corpus. Speakers are sorted by total speech duration in the training set.

SLIDE 13

Experimental setup

  • In the test set:

○ 24.8% of speech segments are shorter than 2 seconds
○ 70.4% are shorter than 10 seconds

  • MFCC:

○ 19 dimensions are extracted every 10 ms with a window length of 20 ms
○ Concatenated with delta and delta-delta coefficients
○ 59-dimensional feature vector after feature warping
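A minimal numpy sketch of this frame layout (sampling rate assumed to be 16 kHz, which is not stated on the slide; `np.diff` stands in for the actual delta regression, and the 19-dim MFCCs are placeholders). Note that 19 statics plus deltas and delta-deltas give 57 dimensions, so the quoted 59 presumably includes additional terms not detailed on the slide:

```python
import numpy as np

SR = 16000                 # assumed sampling rate
HOP = int(0.010 * SR)      # 10 ms frame shift -> 160 samples
WIN = int(0.020 * SR)      # 20 ms analysis window -> 320 samples

def frame_signal(x, win=WIN, hop=HOP):
    """Slice a waveform into overlapping analysis frames."""
    n = 1 + (len(x) - win) // hop
    return np.stack([x[i * hop : i * hop + win] for i in range(n)])

def deltas(feats):
    """First-order frame-to-frame differences (stand-in for true deltas)."""
    return np.diff(feats, axis=0, prepend=feats[:1])

x = np.zeros(SR)                      # 1 s of (silent) audio
frames = frame_signal(x)              # 99 frames of 320 samples
mfcc = np.zeros((len(frames), 19))    # placeholder 19-dim MFCCs
full = np.hstack([mfcc, deltas(mfcc), deltas(deltas(mfcc))])  # 57 dims
```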

SLIDE 14

Experimental setup

  • Spectrograms:

○ 240 ms duration with a frequency of 25 Hz
○ Overlap of 200 ms between neighboring spectrograms
○ For each spectrogram:
  ■ Audio segment was windowed every 5 ms with a window length of 20 ms
  ■ Hamming windowing
  ■ Log-spectral amplitude values extraction
  ■ Final resolution: 48x128 pixels
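The extraction above can be sketched in numpy (16 kHz sampling rate assumed; the final resize to 48x128 pixels is omitted). With a 20 ms window and 5 ms shift, a 240 ms chunk yields 45 frames of 161 FFT bins, and the 200 ms overlap between successive 240 ms chunks is consistent with the quoted 25 Hz rate (one spectrogram every 40 ms):

```python
import numpy as np

SR = 16000                 # assumed sampling rate
HOP = int(0.005 * SR)      # 5 ms shift -> 80 samples
WIN = int(0.020 * SR)      # 20 ms window -> 320 samples

def log_spectrogram(x, win=WIN, hop=HOP, eps=1e-10):
    """Hamming-windowed short-time log-spectral amplitudes."""
    w = np.hamming(win)
    n = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop : i * hop + win] * w for i in range(n)])
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + eps)

chunk = np.random.default_rng(0).normal(size=int(0.240 * SR))  # 240 ms
spec = log_spectrogram(chunk)  # (45, 161) before resizing to 48x128
```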

SLIDE 15

Results

SLIDE 16

Results

SLIDE 17

Conclusion and future work

  • CNN + TVS fusion improves over the baseline
  • More data may be needed for CNN (and PLDA)
  • Perspectives:

○ Multimodal CNN (including faces)
○ Vertical and horizontal CNNs for better insight
