Multimodal audio-video person recognition using Deep Neural Networks - PowerPoint PPT Presentation



SLIDE 1

Multimodal audio-video person recognition using Deep Neural Networks

  • Thesis advisor: Conf. Dr. Ing. Horia Cucu
  • Student: Sandu Marian Gabriel

SLIDE 2

Table of contents

  • Introduction
  • CDEP Dataset Development
  • Neural network architectures
  • Face and speaker recognition results
  • Multimodal results
  • Personal contributions
  • Conclusions

SLIDE 3

Introduction

  • Motivation
  • Objectives
  • Implementation steps
  • Initial data specifications

SLIDE 4

Introduction

  • Motivation
  • Objectives

  • Implementation steps
  • Initial data specifications

SLIDE 5

Introduction

  • Motivation
  • Objectives
  • Implementation steps

  • Initial data specifications

SLIDE 6

Introduction

  • Motivation
  • Objectives
  • Implementation steps
  • Initial data specifications

  • Videos
  • HTML files corresponding to each video
  • Audio files, each corresponding to a speech
  • Text file which contains every speaker in the database with an associated unique ID

SLIDE 7

CDEP* Dataset Development

Image processing · Audio processing · HTML processing


*CDEP = Chamber of Deputies

Already done

SLIDE 8

CDEP Dataset Development

Image processing · Audio processing · HTML processing

SLIDE 9

CDEP Dataset Development

Extraction of valid frames


Examples of faces

SLIDE 10

CDEP Dataset Development

Bad cases

  • Bad images
  • More than one person in a folder

Faulty image removal algorithm


Similarity matrix (pairwise similarity scores between the images in one person's folder; diagonal = 1.0)
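The faulty-image-removal idea can be sketched as follows: within one person's folder, each image's mean similarity to all the other images is computed from the similarity matrix, and images below a threshold are flagged as likely wrong-person or bad-crop cases. The threshold value and the list-based matrix are illustrative assumptions, not the thesis implementation.

```python
# Sketch of the faulty-image-removal step (illustrative, not the thesis code).

def mean_similarity(sim_matrix, i):
    """Average similarity of image i to all other images in the folder."""
    row = sim_matrix[i]
    others = [s for j, s in enumerate(row) if j != i]
    return sum(others) / len(others)

def faulty_images(sim_matrix, threshold=0.7):
    """Indices of images whose mean similarity falls below the threshold."""
    n = len(sim_matrix)
    return [i for i in range(n) if mean_similarity(sim_matrix, i) < threshold]

# Toy 4x4 similarity matrix: image 3 matches the others poorly.
sim = [
    [1.0, 0.95, 0.95, 0.3],
    [0.95, 1.0, 0.95, 0.3],
    [0.95, 0.95, 1.0, 0.3],
    [0.3, 0.3, 0.3, 1.0],
]
print(faulty_images(sim))  # [3] - image 3 is flagged for removal
```

The same outlier test extends directly to the larger matrix shown above, where most off-diagonal scores sit between 0.6 and 0.9.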

SLIDE 11

CDEP Dataset Development

Image processing · Audio processing · HTML processing

SLIDE 12

CDEP Dataset Development

Voice Activity Detector [1] · Audio pre-processing algorithm

Audio processing algorithms


Already done
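The audio pre-processing step can be illustrated with a toy voice activity detector. The thesis relies on a long-term Mel-frequency-band VAD [1]; the stdlib-only sketch below only shows the basic frame-level keep/drop decision on short-time energy, with an assumed frame length and threshold.

```python
# Toy energy-based VAD (illustrative assumption; the thesis uses the
# long-term Mel-band VAD of reference [1]).

def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def vad_mask(samples, frame_len=160, threshold=0.01):
    """Return one boolean per frame: True where speech energy is detected."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    return [frame_energy(f) > threshold for f in frames]

# Toy signal: silence, a louder "speech" burst, then silence again.
signal = [0.0] * 160 + [0.5, -0.5] * 80 + [0.0] * 160
print(vad_mask(signal))  # [False, True, False]
```

Frames marked False are dropped before the speech segments are cut into the fixed-length audio samples used for training.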

SLIDE 13

CDEP Dataset Development

Database Development conclusions


  # audio files range | # persons
  3 - 5               | 2
  6 - 10              | 23
  11 - 20             | 15
  21 - 50             | 39
  51 - 100            | 250

Audio histogram · Image histogram

SLIDE 14

CDEP Dataset Development

Database Development conclusions


  Dataset     | Samples/class | Classes | Training samples | Evaluation samples | Test samples
  Image10     | 10            | 257     | 1542             | 514                | 514
  Image50     | 50            | 132     | 3960             | 1320               | 1320
  Image100    | 100           | 84      | 5040             | 1680               | 1680
  Audio_3s_10 | 10            | 257     | 1542             | 514                | 514
  Audio_3s_50 | 50            | 132     | 1542             | 514                | 514

Derived datasets configuration
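The derived datasets follow a 60/20/20 per-class split (e.g. Image10: 257 classes x 10 samples each gives 1542/514/514). A minimal sketch of that per-class split, assuming samples are grouped by class ID; the fixed seed and tuple output format are illustrative choices:

```python
# Per-class 60/20/20 train/eval/test split (sketch of the dataset derivation).
import random

def split_per_class(samples_by_class, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split each class's samples into train/eval/test with fixed ratios."""
    rng = random.Random(seed)
    train, evaluation, test = [], [], []
    for cls, samples in samples_by_class.items():
        shuffled = samples[:]
        rng.shuffle(shuffled)
        n = len(shuffled)
        n_train = int(n * ratios[0])
        n_eval = int(n * ratios[1])
        train += [(cls, s) for s in shuffled[:n_train]]
        evaluation += [(cls, s) for s in shuffled[n_train:n_train + n_eval]]
        test += [(cls, s) for s in shuffled[n_train + n_eval:]]
    return train, evaluation, test

# 257 classes with 10 samples each reproduces the Image10 row: 1542/514/514.
data = {c: list(range(10)) for c in range(257)}
tr, ev, te = split_per_class(data)
print(len(tr), len(ev), len(te))  # 1542 514 514
```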

SLIDE 15

Neural network architectures - Face recognition

  • VGG16
  • GoogLeNet
  • FaceNet


SLIDE 16

Neural network architectures - Speaker recognition

VGGVox [5]


SLIDE 17

Monomodal results

  • Face recognition results
  • Speaker recognition results


Face recognition results:

  Database | Classes | Architecture | Test accuracy
  Image10  | 257     | FaceNet      | 93.2%
  Image10  | 257     | VGG16        | 81.3%
  Image10  | 257     | GoogLeNet    | 67%
  Image50  | 132     | FaceNet      | 98.3%
  Image50  | 132     | VGG16        | 95%
  Image50  | 132     | GoogLeNet    | 95%
  Image100 | 84      | FaceNet      | 99.2%
  Image100 | 84      | VGG16        | 97%
  Image100 | 84      | GoogLeNet    | 97%

Speaker recognition results:

  Database    | Classes | Test accuracy
  Audio_3s_10 | 257     | 98.24%
  Audio_3s_50 | 132     | 99.23%

SLIDE 18

Multimodal recognition. Results

Multimodal architecture

Transfer learning High-level feature classification


  Database              | Classes | Batch size | Optimizer | Test accuracy
  Image10 + Audio_3s_10 | 257     | 32         | SGD       | 99.82%
  Image50 + Audio_3s_50 | 132     | 32         | SGD       | 99.92%

Results
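The multimodal architecture's core idea, as described above, is transfer learning followed by high-level feature classification: the pretrained face and speaker networks act as frozen feature extractors, their embeddings are concatenated, and a small classifier is trained on top. The sketch below illustrates only the fusion-and-classify step; the 2-D embeddings, the nearest-centroid classifier, and the names `deputy_A`/`deputy_B` are hypothetical stand-ins for the thesis's trained classifier head.

```python
# Sketch of feature-level fusion for multimodal person identification.

def fuse(face_embedding, voice_embedding):
    """Concatenate the two modality embeddings into one feature vector."""
    return face_embedding + voice_embedding  # list concatenation

def classify(fused, centroids):
    """Assign the fused vector to the closest class centroid."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda cls: dist2(fused, centroids[cls]))

# Toy 2-D embeddings per modality; two known speakers (hypothetical names).
centroids = {
    "deputy_A": [1.0, 0.0, 1.0, 0.0],
    "deputy_B": [0.0, 1.0, 0.0, 1.0],
}
sample = fuse([0.9, 0.1], [0.8, 0.2])  # face + voice evidence for deputy_A
print(classify(sample, centroids))  # deputy_A
```

Combining both modalities is what pushes the accuracy above either monomodal system, as the results table shows.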

SLIDE 19

Conclusions

  • Final results

  • Further validation of the model
  • Personal contributions
  • Further development

Test accuracy


SLIDE 20

Conclusions

  • Final results
  • Further validation of the model

  • Personal contributions
  • Further development

VidTIMIT database examples · VidTIMIT database · CDEP database


SLIDE 21

Conclusions

  • Final results
  • Further validation of the model
  • Personal contributions

  • Further development


  • Developing an algorithm for face database creation and validation
  • Developing an algorithm for audio database creation and validation
  • Fine-tuning three state-of-the-art neural networks for face recognition and choosing the best option, and one state-of-the-art network for audio recognition
  • Creating and evaluating a multimodal architecture for person identification

SLIDE 22

Conclusions

  • Final results
  • Further validation of the model
  • Personal contributions
  • Further development

  • Model optimization
    ○ Multimodal parameter tuning
  • Training on a larger dataset
    ○ Cloud computing for a higher data volume
  • Live recognition
    ○ API support for video upload
    ○ Automatic database update
  • Website integration
    ○ User-friendly interface for identification


SLIDE 23

Bibliography


  • [1] S. Salishev, A. Barabanov, D. Kocharov, P. Skrelin, M. Moiseev, "Voice Activity Detector (VAD) Based on Long-Term Mel Frequency Band Features", Lecture Notes in Computer Science, vol. 9924, pp. 352-358, 2016, doi:10.1007/978-3-319-45510-5_40
  • [2] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, "Going Deeper with Convolutions", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1-9
  • [3] K. Simonyan, A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition"
  • [4] F. Schroff, D. Kalenichenko, J. Philbin, "FaceNet: A Unified Embedding for Face Recognition and Clustering"
  • [5] A. Nagrani*, J. S. Chung*, A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset", INTERSPEECH, 2017

SLIDE 24

Thank you!

SLIDE 25

Source code

https://git.speed.pub.ro/diploma/multimodal-person-identification
