Multimodal audio-video person recognition using Deep Neural Networks - PowerPoint PPT Presentation



SLIDE 1

Multimodal audio-video person recognition using Deep Neural Networks

  • Thesis advisor: Conf. Dr. Ing. Horia Cucu
  • Student: Sandu Marian Gabriel

SLIDE 2

Table of contents

  • Introduction
  • CDEP Dataset Development
  • Neural network architectures
  • Face and speaker recognition results
  • Multimodal results
  • Personal contributions
  • Conclusions

SLIDE 3

Introduction

  • Motivation
  • Objectives
  • Implementation steps
  • Initial data specifications

SLIDE 4

Introduction

  • Motivation
  • Objectives

  • Implementation steps
  • Initial data specifications

SLIDE 5

Introduction

  • Motivation
  • Objectives
  • Implementation steps

  • Initial data specifications

SLIDE 6

Introduction

  • Motivation
  • Objectives
  • Implementation steps
  • Initial data specifications

  • Videos
  • HTML files corresponding to each video
  • Audio files, each corresponding to a speech
  • Text file which contains every speaker in the database with an associated unique ID

SLIDE 7

CDEP* Dataset Development

Image processing · Audio processing · HTML processing


*CDEP = Chamber of Deputies

Already done

SLIDE 8

CDEP Dataset Development

Image processing · Audio processing · HTML processing

SLIDE 9

CDEP Dataset Development

Extraction of valid frames


Examples of faces

SLIDE 10

CDEP Dataset Development

Bad cases

  • Bad images
  • More than one person in a folder

Faulty image removal algorithm


Similarity matrix (pairwise similarity scores between the images in one person's folder; diagonal = 1.0)
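The faulty-image-removal idea can be sketched as follows: within one person's folder, each image's mean similarity to all the other images is computed from the similarity matrix, and images below a threshold are flagged as likely wrong-person or bad-crop cases. The threshold value and the list-based matrix are illustrative assumptions, not the thesis implementation.

```python
# Sketch of the faulty-image-removal step (illustrative, not the thesis code).

def mean_similarity(sim_matrix, i):
    """Average similarity of image i to all other images in the folder."""
    row = sim_matrix[i]
    others = [s for j, s in enumerate(row) if j != i]
    return sum(others) / len(others)

def faulty_images(sim_matrix, threshold=0.7):
    """Indices of images whose mean similarity falls below the threshold."""
    n = len(sim_matrix)
    return [i for i in range(n) if mean_similarity(sim_matrix, i) < threshold]

# Toy 4x4 similarity matrix: image 3 matches the others poorly.
sim = [
    [1.0, 0.95, 0.95, 0.3],
    [0.95, 1.0, 0.95, 0.3],
    [0.95, 0.95, 1.0, 0.3],
    [0.3, 0.3, 0.3, 1.0],
]
print(faulty_images(sim))  # [3] - image 3 is flagged for removal
```

The same outlier test extends directly to the larger matrix shown above, where most off-diagonal scores sit between 0.6 and 0.9.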

SLIDE 11

CDEP Dataset Development

Image processing · Audio processing · HTML processing

SLIDE 12

CDEP Dataset Development

Voice Activity Detector [1] · Audio pre-processing algorithm

Audio processing algorithms


Already done
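The audio pre-processing step can be illustrated with a toy voice activity detector. The thesis relies on a long-term Mel-frequency-band VAD [1]; the stdlib-only sketch below only shows the basic frame-level keep/drop decision on short-time energy, with an assumed frame length and threshold.

```python
# Toy energy-based VAD (illustrative assumption; the thesis uses the
# long-term Mel-band VAD of reference [1]).

def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def vad_mask(samples, frame_len=160, threshold=0.01):
    """Return one boolean per frame: True where speech energy is detected."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    return [frame_energy(f) > threshold for f in frames]

# Toy signal: silence, a louder "speech" burst, then silence again.
signal = [0.0] * 160 + [0.5, -0.5] * 80 + [0.0] * 160
print(vad_mask(signal))  # [False, True, False]
```

Frames marked False are dropped before the speech segments are cut into the fixed-length audio samples used for training.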

SLIDE 13

CDEP Dataset Development

Database Development conclusions


  # audio files range | # persons
  3 - 5               | 2
  6 - 10              | 23
  11 - 20             | 15
  21 - 50             | 39
  51 - 100            | 250

Audio histogram · Image histogram

SLIDE 14

CDEP Dataset Development

Database Development conclusions


  Dataset     | Samples/class | Classes | Training samples | Evaluation samples | Test samples
  Image10     | 10            | 257     | 1542             | 514                | 514
  Image50     | 50            | 132     | 3960             | 1320               | 1320
  Image100    | 100           | 84      | 5040             | 1680               | 1680
  Audio_3s_10 | 10            | 257     | 1542             | 514                | 514
  Audio_3s_50 | 50            | 132     | 1542             | 514                | 514

Derived datasets configuration
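The derived datasets follow a 60/20/20 per-class split (e.g. Image10: 257 classes x 10 samples each gives 1542/514/514). A minimal sketch of that per-class split, assuming samples are grouped by class ID; the fixed seed and tuple output format are illustrative choices:

```python
# Per-class 60/20/20 train/eval/test split (sketch of the dataset derivation).
import random

def split_per_class(samples_by_class, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split each class's samples into train/eval/test with fixed ratios."""
    rng = random.Random(seed)
    train, evaluation, test = [], [], []
    for cls, samples in samples_by_class.items():
        shuffled = samples[:]
        rng.shuffle(shuffled)
        n = len(shuffled)
        n_train = int(n * ratios[0])
        n_eval = int(n * ratios[1])
        train += [(cls, s) for s in shuffled[:n_train]]
        evaluation += [(cls, s) for s in shuffled[n_train:n_train + n_eval]]
        test += [(cls, s) for s in shuffled[n_train + n_eval:]]
    return train, evaluation, test

# 257 classes with 10 samples each reproduces the Image10 row: 1542/514/514.
data = {c: list(range(10)) for c in range(257)}
tr, ev, te = split_per_class(data)
print(len(tr), len(ev), len(te))  # 1542 514 514
```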

SLIDE 15

Neural network architectures - Face recognition

  • VGG16
  • GoogLeNet
  • FaceNet


SLIDE 16

Neural network architectures - Speaker recognition

VGGVox [5]


SLIDE 17

Monomodal results

  • Face recognition results
  • Speaker recognition results


Face recognition results:

  Database | Classes | Architecture | Test accuracy
  Image10  | 257     | FaceNet      | 93.2%
  Image10  | 257     | VGG16        | 81.3%
  Image10  | 257     | GoogLeNet    | 67%
  Image50  | 132     | FaceNet      | 98.3%
  Image50  | 132     | VGG16        | 95%
  Image50  | 132     | GoogLeNet    | 95%
  Image100 | 84      | FaceNet      | 99.2%
  Image100 | 84      | VGG16        | 97%
  Image100 | 84      | GoogLeNet    | 97%

Speaker recognition results:

  Database    | Classes | Test accuracy
  Audio_3s_10 | 257     | 98.24%
  Audio_3s_50 | 132     | 99.23%

SLIDE 18

Multimodal recognition. Results

Multimodal architecture

Transfer learning High-level feature classification


  Database              | Classes | Batch size | Optimizer | Test accuracy
  Image10 + Audio_3s_10 | 257     | 32         | SGD       | 99.82%
  Image50 + Audio_3s_50 | 132     | 32         | SGD       | 99.92%

Results
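The multimodal architecture's core idea, as described above, is transfer learning followed by high-level feature classification: the pretrained face and speaker networks act as frozen feature extractors, their embeddings are concatenated, and a small classifier is trained on top. The sketch below illustrates only the fusion-and-classify step; the 2-D embeddings, the nearest-centroid classifier, and the names `deputy_A`/`deputy_B` are hypothetical stand-ins for the thesis's trained classifier head.

```python
# Sketch of feature-level fusion for multimodal person identification.

def fuse(face_embedding, voice_embedding):
    """Concatenate the two modality embeddings into one feature vector."""
    return face_embedding + voice_embedding  # list concatenation

def classify(fused, centroids):
    """Assign the fused vector to the closest class centroid."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda cls: dist2(fused, centroids[cls]))

# Toy 2-D embeddings per modality; two known speakers (hypothetical names).
centroids = {
    "deputy_A": [1.0, 0.0, 1.0, 0.0],
    "deputy_B": [0.0, 1.0, 0.0, 1.0],
}
sample = fuse([0.9, 0.1], [0.8, 0.2])  # face + voice evidence for deputy_A
print(classify(sample, centroids))  # deputy_A
```

Combining both modalities is what pushes the accuracy above either monomodal system, as the results table shows.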

SLIDE 19

Conclusions

  • Final results

  • Further validation of the model
  • Personal contributions
  • Further development

Test accuracy


SLIDE 20

Conclusions

  • Final results
  • Further validation of the model

  • Personal contributions
  • Further development

VidTIMIT database examples · VidTIMIT database · CDEP database


SLIDE 21

Conclusions

  • Final results
  • Further validation of the model
  • Personal contributions

  • Further development


  • Developing an algorithm for face database creation and validation
  • Developing an algorithm for audio database creation and validation
  • Fine-tuning three state-of-the-art neural networks for face recognition and choosing the best option, and one state-of-the-art network for audio recognition
  • Creating and evaluating a multimodal architecture for person identification

SLIDE 22

Conclusions

  • Final results
  • Further validation of the model
  • Personal contributions
  • Further development

  • Model optimization
    ○ Multimodal parameter tuning
  • Training on a larger dataset
    ○ Cloud computing for a higher data volume
  • Live recognition
    ○ API support for video upload
    ○ Automatic database update
  • Website integration
    ○ User-friendly interface for identification


SLIDE 23

Bibliography


  • [1] S. Salishev, A. Barabanov, D. Kocharov, P. Skrelin, M. Moiseev, "Voice Activity Detector (VAD) Based on Long-Term Mel Frequency Band Features", Lecture Notes in Computer Science, vol. 9924, pp. 352-358, 2016, doi:10.1007/978-3-319-45510-5_40
  • [2] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, "Going Deeper with Convolutions", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1-9
  • [3] K. Simonyan, A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition"
  • [4] F. Schroff, D. Kalenichenko, J. Philbin, "FaceNet: A Unified Embedding for Face Recognition and Clustering"
  • [5] A. Nagrani*, J. S. Chung*, A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset", INTERSPEECH, 2017

SLIDE 24

Thank you!

SLIDE 25

Source code

https://git.speed.pub.ro/diploma/multimodal-person-identification
