Deep complementary features for speaker identification in TV broadcast data
Mateusz Budnik 1 Ali Khodabakhsh 2 Laurent Besacier 1 Cenk Demiroglu 2
1 Univ. Grenoble-Alpes 2 Ozyegin University
Agenda: Motivation, Related work, System
Network (typical image approach) algorithm for the task
identify disguised voices.
frames are taken into account and serve as context to reduce noise impact.
1. Lior Uzan and Lior Wolf, “I know that voice: Identifying the voice actor behind the voice,” in Biometrics (ICB), 2015 International Conference on. IEEE, 2015, pp. 46–51.
2. Pavel Matejka, Le Zhang, Tim Ng, HS Mallidi, Ondrej Glembek, Jeff Ma, and Bing Zhang, “Neural network bottleneck features for language identification,” Proc. IEEE Odyssey, 2014, pp. 299–304.
a given speech segment
training data
○ s = (1 − tanh(d)) · s_cnn + s_ivec
○ CNN’s last hidden layer with PCA (500) + i-vector (500)
○ Linear SVM
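The score-level fusion formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide fragment does not define d, so it is treated here as a non-negative scalar supplied by the caller, and the function names `fuse_scores` and `identify` as well as the per-speaker score dictionaries are hypothetical.

```python
import math


def fuse_scores(s_cnn, s_ivec, d):
    """Combine one CNN score and one i-vector score.

    Implements s = (1 - tanh(d)) * s_cnn + s_ivec.
    The meaning of d is an assumption: the source does not define it.
    """
    cnn_weight = 1.0 - math.tanh(d)
    return cnn_weight * s_cnn + s_ivec


def identify(cnn_scores, ivec_scores, d):
    """Fuse per-speaker scores and return (best_speaker, fused_scores).

    cnn_scores and ivec_scores are hypothetical dicts mapping a speaker
    label to that system's score for the segment being identified.
    """
    fused = {
        speaker: fuse_scores(cnn_scores[speaker], ivec_scores[speaker], d)
        for speaker in cnn_scores
    }
    best = max(fused, key=fused.get)
    return best, fused
```

With d = 0 both scores count fully. As d grows, tanh(d) approaches 1, so the CNN term fades and the fused score approaches the i-vector score alone.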
○ 821 speakers
○ 9377 speech segments from 148 videos (22h of speech)
○ 113 speakers
○ 2410 segments from 57 videos (6h of speech)
Total amount of speech per speaker for speakers present in both train / test sets of REPERE corpus. Speakers are sorted according to total speech duration in training set.
○ 24.8% of speech segments are shorter than 2 seconds
○ 70.4% are shorter than 10 seconds
○ 19 dimensions are extracted every 10 ms with a window length of 20 ms
○ Concatenated with delta and delta-delta coefficients
○ 59 dimensional feature vector after feature warping
○ 240 ms duration with a frequency of 25 Hz
○ Overlap of 200 ms between neighboring spectrograms
○ For each spectrogram:
  ■ Audio segment was windowed every 5 ms with a window length of 20 ms
  ■ Hamming windowing
  ■ Log-spectral amplitude values extraction
  ■ Final resolution: 48x128 pixels
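The timing figures above are mutually consistent, which a short calculation makes explicit: a rate of 25 Hz means one spectrogram every 40 ms, and 240 ms patches spaced 40 ms apart overlap by 200 ms. The sketch below only reproduces that arithmetic; the function names are hypothetical, and it assumes patches start at 0 ms and that no padding is applied, neither of which the slides state.

```python
def hop_ms(rate_hz=25):
    """Spacing between consecutive spectrograms, in milliseconds."""
    return round(1000 / rate_hz)


def overlap_ms(patch_ms=240, rate_hz=25):
    """Overlap between neighboring spectrograms, in milliseconds."""
    return patch_ms - hop_ms(rate_hz)


def patch_starts_ms(segment_ms, patch_ms=240, rate_hz=25):
    """Start times of all full-length patches that fit in a segment.

    Assumes the first patch starts at 0 ms and no padding is used,
    so segments shorter than one patch yield an empty list.
    """
    return list(range(0, segment_ms - patch_ms + 1, hop_ms(rate_hz)))


def frames_per_patch(patch_ms=240, window_ms=20, step_ms=5):
    """Number of full analysis windows inside one patch, without padding."""
    return (patch_ms - window_ms) // step_ms + 1
```

For example, `patch_starts_ms(400)` gives five patches starting at 0, 40, 80, 120 and 160 ms. Under the no-padding assumption `frames_per_patch()` gives 45 windows; how these are mapped to the final 48x128 resolution is not stated in the slides.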
○ Multimodal CNN (including faces)
○ Vertical and horizontal CNNs for better insight