Multimodal audio-video person recognition using Deep Neural Networks


  1. Multimodal audio-video person recognition using Deep Neural Networks. Thesis advisor: Conf. Dr. Ing. Horia Cucu. Student: Sandu Marian Gabriel

  2. Table of contents ● Introduction ● CDEP Dataset Development ● Neural network architectures ● Face and speaker recognition results ● Multimodal results ● Personal contributions ● Conclusions

  3. Introduction ● Motivation ● Objectives ● Implementation steps ● Initial data specifications

  4. Introduction ● Motivation ● Objectives ● Implementation steps ● Initial data specifications

  5. Introduction ● Motivation ● Objectives ● Implementation steps ● Initial data specifications

  6. Introduction ● Motivation ● Objectives ● Implementation steps ● Initial data specifications: videos; HTML files corresponding to each video; audio files, each corresponding to a speech; a text file listing every speaker in the database with an associated unique ID

  7. CDEP* Dataset Development ● HTML processing ● Image processing ● Audio processing. Diagram legend: Already done. *CDEP = Chamber of Deputies

  8. CDEP Dataset Development ● HTML processing ● Image processing ● Audio processing

  9. CDEP Dataset Development ● Extraction of valid frames ● Examples of faces

  10. CDEP Dataset Development: faulty image removal algorithm

      Similarity matrix (15 x 15):
      1.0 0.9 0.7 0.7 0.8 0.9 0.7 0.8 0.9 0.8 0.7 0.8 0.8 0.7 0.8
      0.9 1.0 0.7 0.7 0.8 0.9 0.7 0.8 0.9 0.8 0.8 0.8 0.9 0.7 0.9
      0.7 0.7 1.0 0.8 0.8 0.7 0.9 0.7 0.7 0.8 0.8 0.8 0.6 0.9 0.7
      0.7 0.7 0.8 1.0 0.9 0.8 0.8 0.6 0.7 0.9 0.9 0.8 0.7 0.8 0.6
      0.8 0.8 0.8 0.9 1.0 0.8 0.8 0.7 0.7 0.9 0.9 0.9 0.7 0.8 0.7
      0.9 0.9 0.7 0.8 0.8 1.0 0.7 0.8 0.9 0.8 0.8 0.8 0.8 0.7 0.8
      0.7 0.7 0.9 0.8 0.8 0.7 1.0 0.6 0.7 0.8 0.8 0.8 0.6 0.9 0.6
      0.8 0.8 0.7 0.6 0.7 0.8 0.6 1.0 0.9 0.7 0.7 0.7 0.9 0.6 0.9
      0.9 0.9 0.7 0.7 0.7 0.9 0.7 0.9 1.0 0.7 0.7 0.8 0.9 0.6 0.9
      0.8 0.8 0.8 0.9 0.9 0.8 0.8 0.7 0.7 1.0 0.9 0.9 0.7 0.7 0.7
      0.7 0.8 0.8 0.9 0.9 0.8 0.8 0.7 0.7 0.9 1.0 0.9 0.7 0.8 0.7
      0.8 0.8 0.8 0.8 0.9 0.8 0.8 0.7 0.8 0.9 0.9 1.0 0.8 0.7 0.7
      0.8 0.9 0.6 0.7 0.7 0.8 0.6 0.9 0.9 0.7 0.7 0.8 1.0 0.6 0.9
      0.7 0.7 0.9 0.8 0.8 0.7 0.9 0.6 0.6 0.7 0.8 0.7 0.6 1.0 0.6
      0.8 0.9 0.7 0.6 0.7 0.8 0.6 0.9 0.9 0.7 0.7 0.7 0.9 0.6 1.0

      Bad cases ● Bad images ● Multiple persons in a folder
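
The slide does not say how the similarity matrix is turned into a removal decision. One plausible reading, offered here only as an assumption, is to flag every image whose average similarity to the other images in the same folder falls below a threshold. The function name `flag_faulty_images`, the threshold value and the 4 x 4 matrix are all hypothetical and chosen for illustration.

```python
def flag_faulty_images(sim, threshold=0.5):
    """Return indices of images whose mean similarity to the others is low.

    `sim` is a square matrix (list of lists) of pairwise similarities.
    This rule is an assumption, not the algorithm from the slides.
    """
    n = len(sim)
    if n < 2:
        return []  # nothing to compare against
    flagged = []
    for i in range(n):
        # Skip the diagonal: an image is always identical to itself.
        others = [sim[i][j] for j in range(n) if j != i]
        if sum(others) / len(others) < threshold:
            flagged.append(i)
    return flagged


# Hypothetical folder of four face crops; image 3 resembles none of the rest.
toy_sim = [
    [1.0, 0.9, 0.8, 0.3],
    [0.9, 1.0, 0.9, 0.2],
    [0.8, 0.9, 1.0, 0.3],
    [0.3, 0.2, 0.3, 1.0],
]
print(flag_faulty_images(toy_sim))
```

A mean-based rule handles the "bad image" case; a folder that mixes several persons would more likely need clustering, which this sketch does not attempt.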

  11. CDEP Dataset Development ● HTML processing ● Image processing ● Audio processing

  12. CDEP Dataset Development: audio processing algorithms ● Voice Activity Detector [1] ● Audio pre-processing algorithm. Diagram legend: Already done

  13. CDEP Dataset Development: database development conclusions. Figures: image histogram, audio histogram

      # audio files (range)   # persons
      3 - 5                   2
      6 - 10                  23
      11 - 20                 15
      21 - 50                 39
      51 - 100                250
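
As a rough illustration of how such a histogram can be produced, the sketch below bins per-speaker audio file counts into the ranges shown on the slide. The input dictionary and the speaker IDs are made up; only the bin edges come from the slide.

```python
# Bin edges taken from the slide (inclusive on both ends).
BINS = [(3, 5), (6, 10), (11, 20), (21, 50), (51, 100)]


def audio_histogram(files_per_person):
    """Count how many persons fall into each audio-file range."""
    counts = {bin_range: 0 for bin_range in BINS}
    for n_files in files_per_person.values():
        for low, high in BINS:
            if low <= n_files <= high:
                counts[(low, high)] += 1
                break  # each person belongs to at most one bin
    return counts


# Hypothetical speaker IDs and file counts, for illustration only.
toy_counts = {"id_001": 4, "id_002": 8, "id_003": 60, "id_004": 75, "id_005": 2}
hist = audio_histogram(toy_counts)
for (low, high), persons in hist.items():
    print(f"{low} - {high}: {persons}")
```

Persons with fewer than 3 or more than 100 files fall outside every bin and are ignored, which mirrors the ranges listed on the slide.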

  14. CDEP Dataset Development: database development conclusions. Derived datasets configuration

      Dataset       No. of samples / class   No. of classes   No. of training samples   No. of evaluation samples   No. of test samples
      Image10       10                       257              1542                      514                         514
      Image50       50                       132              3960                      1320                        1320
      Image100      100                      84               5040                      1680                        1680
      Audio_3s_10   10                       257              1542                      514                         514
      Audio_3s_50   50                       132              1542                      514                         514
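
The image rows on this slide are consistent with a 60% / 20% / 20% train / evaluation / test split of (samples per class) x (classes). The slide does not state the ratio, so it is inferred here from the counts. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def split_counts(samples_per_class, n_classes, train_pct=60, eval_pct=20):
    """Return (train, evaluation, test) sample counts for a dataset.

    The 60/20/20 default is inferred from the slide's numbers,
    not stated by the author.
    """
    total = samples_per_class * n_classes
    n_train = total * train_pct // 100
    n_eval = total * eval_pct // 100
    n_test = total - n_train - n_eval  # remainder goes to the test set
    return n_train, n_eval, n_test


for name, per_class, classes in [("Image10", 10, 257),
                                 ("Image50", 50, 132),
                                 ("Image100", 100, 84)]:
    print(name, split_counts(per_class, classes))
```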

  15. Neural network architectures - Face recognition ● FaceNet ● GoogLeNet ● VGG16

  16. Neural network architectures - Speaker recognition ● VGGVox [5]

  17. Monomodal results

      Face recognition results
      Database   No. of classes   Architecture   Test accuracy
      Image10    257              FaceNet        93.2%
      Image10    257              VGG16          81.3%
      Image10    257              GoogLeNet      67%
      Image50    132              FaceNet        98.3%
      Image50    132              VGG16          95%
      Image50    132              GoogLeNet      95%
      Image100   84               FaceNet        99.2%
      Image100   84               VGG16          97%
      Image100   84               GoogLeNet      97%

      Speaker recognition results
      Database      No. of classes   Test accuracy
      Audio_3s_10   257              98.24%
      Audio_3s_50   132              99.23%

  18. Multimodal recognition. Results ● Multimodal architecture: transfer learning, high-level feature classification

      Results
      Database                Number of classes   Batch size   Optimizer   Test accuracy
      Image10 + Audio_3s_10   257                 32           SGD         99.82%
      Image50 + Audio_3s_50   132                 32           SGD         99.92%
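
The slide names the ingredients (transfer learning, high-level feature classification, SGD, batch size 32) but not the exact architecture. One common reading is feature-level fusion: take the high-level feature vectors produced by the pretrained face and speaker networks, concatenate them, and train a small classifier on top. The sketch below illustrates that idea in plain Python. It is an assumption, not the thesis code: the feature sizes, the toy data and the `LinearSoftmax` head are all hypothetical, and the features are assumed to be already extracted.

```python
import math
import random

BATCH_SIZE = 32  # batch size reported on the slide


def fuse(face_feat, voice_feat):
    """Feature-level fusion: concatenate the two high-level feature vectors."""
    return list(face_feat) + list(voice_feat)


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LinearSoftmax:
    """Hypothetical classifier head: one linear layer followed by softmax."""

    def __init__(self, n_in, n_classes, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.01, 0.01) for _ in range(n_in)]
                  for _ in range(n_classes)]
        self.b = [0.0] * n_classes

    def predict_proba(self, x):
        logits = [sum(wj * xj for wj, xj in zip(row, x)) + bk
                  for row, bk in zip(self.w, self.b)]
        return softmax(logits)

    def predict(self, x):
        proba = self.predict_proba(x)
        return max(range(len(proba)), key=proba.__getitem__)

    def sgd_step(self, batch, lr=0.5):
        """One mini-batch SGD step on the cross-entropy loss."""
        # Compute every error first so the update uses pre-step weights.
        errors = []
        for x, label in batch:
            proba = self.predict_proba(x)
            errors.append([p - (1.0 if k == label else 0.0)
                           for k, p in enumerate(proba)])
        scale = lr / len(batch)
        for (x, _), err in zip(batch, errors):
            for k, e in enumerate(err):
                self.b[k] -= scale * e
                for j, xj in enumerate(x):
                    self.w[k][j] -= scale * e * xj


def toy_sample(label, rng):
    """Made-up 2-D face and voice features for person 0 or person 1."""
    base = [1.0, 0.0] if label == 0 else [0.0, 1.0]
    face = [v + rng.uniform(-0.1, 0.1) for v in base]
    voice = [v + rng.uniform(-0.1, 0.1) for v in base]
    return fuse(face, voice), label


rng = random.Random(42)
data = [toy_sample(i % 2, rng) for i in range(64)]
model = LinearSoftmax(n_in=4, n_classes=2)
for _ in range(20):  # epochs
    for start in range(0, len(data), BATCH_SIZE):
        model.sgd_step(data[start:start + BATCH_SIZE])

print(model.predict(fuse([1.0, 0.0], [1.0, 0.0])))
print(model.predict(fuse([0.0, 1.0], [0.0, 1.0])))
```

In a real pipeline the fused vector would come from the penultimate layers of the fine-tuned face and speaker networks rather than from hand-made toy features, and the head would typically be trained in a deep learning framework.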

  19. Conclusions ● Final results ● Further validation of the model ● Personal contributions ● Further development. Figure: test accuracy

  20. Conclusions ● Final results ● Further validation of the model ● Personal contributions ● Further development. Further validation uses the VidTIMIT database. Figure: database examples (CDEP database, VidTIMIT database)

  21. Conclusions ● Final results ● Further validation of the model ● Personal contributions ● Further development

      Personal contributions
      ● Developing an algorithm for face database creation and validation
      ● Developing an algorithm for audio database creation and validation
      ● Fine-tuning three state-of-the-art neural networks for face recognition and choosing the best option, and one state-of-the-art network for audio recognition
      ● Creating and evaluating a multimodal architecture for person identification

  22. Conclusions ● Final results ● Further validation of the model ● Personal contributions ● Further development

      Further development
      ● Model optimization
        ○ Multimodal parameter tuning
      ● Training on a larger dataset
        ○ Cloud computing for a higher data volume
      ● Live recognition
        ○ API support for video upload
        ○ Automatic database update
      ● Website integration
        ○ User-friendly interface for identification

  23. Bibliography
      ● [1] Sergey Salishev, Andrey Barabanov, Daniil Kocharov, Pavel Skrelin, Mikhail Moiseev, Voice Activity Detector (VAD) Based on Long-Term Mel Frequency Band Features, Lecture Notes in Computer Science, vol. 9924, pp. 352-358, 2016, doi:10.1007/978-3-319-45510-5_40
      ● [2] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, Going Deeper with Convolutions, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1-9
      ● [3] Karen Simonyan, Andrew Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition
      ● [4] Florian Schroff, Dmitry Kalenichenko, James Philbin, FaceNet: A Unified Embedding for Face Recognition and Clustering
      ● [5] A. Nagrani, J. S. Chung, A. Zisserman, VoxCeleb: a large-scale speaker identification dataset, INTERSPEECH, 2017

  24. Thank you!

  25. Source code: https://git.speed.pub.ro/diploma/multimodal-person-identification
