Multimodal Information Fusion




  1. Multimodal Information Fusion
     Ling Guan
     Ryerson Multimedia Laboratory & Centre for Interactive Multimedia Information Mining
     Ryerson University, Toronto, Ontario, Canada
     lguan@ee.ryerson.ca
     https://www.ryerson.ca/multimedia-research-laboratory/

     Acknowledgement
     - The presenter would like to thank his former and current students, P. Muneesawang, Y. Wang, R. Zhang, M.T. Ibrahim, L. Gao, C. Liang and N. El Madany, for their contributions to this research.
     - Slides on fusion fundamentals provided by Prof. S-Y Kung of Princeton University are greatly appreciated.
     - This presentation is supported by the Canada Research Chair (CRC) Program, the Canada Foundation for Innovation (CFI), the Ontario Research Fund (ORF), and Ryerson University.

     Contributors: L. Guan, P. Muneesawang, Y. Wang, R. Zhang, Y. Tie, A. Bulzacki, M.T. Ibrahim, N. Joshi, Z. Xie, L. Gao, N. El Madany

  2. Relevant Publications
     1. Y. Wang and L. Guan, "Combining speech and facial expression for recognition of human emotional state," IEEE Trans. on Multimedia, vol. 10, no. 5, pp. 936-946, Aug. 2008.
     2. C. Liang, E. Chen, L. Qi and L. Guan, "Heterogeneous features fusion with collaborative representation learning for 3D action recognition," Proc. IEEE Int. Symposium on Multimedia, pp. 162-168, Taichung, Taiwan, Dec. 2017 (Top Six (5%) Paper Honor).
     3. L. Gao, L. Qi, E. Chen and L. Guan, "Discriminative multiple canonical correlation analysis for information fusion," IEEE Trans. on Image Processing, vol. 27, no. 4, pp. 1951-1965, Apr. 2018.
     4. P. Muneesawang, T. Amin and L. Guan, "A new learning algorithm for the fusion of adaptive audio-visual features for the retrieval and classification of movie clips," J. Signal Processing Systems, vol. 59, no. 2, pp. 177-188, May 2010.
     5. T. Amin, M. Zeytinoglu and L. Guan, "Application of Laplacian mixture model for image and video retrieval," IEEE Trans. on Multimedia, vol. 9, no. 7, pp. 1416-1429, Nov. 2007.
     6. P. Muneesawang and L. Guan, "Adaptive video indexing and automatic/semi-automatic relevance feedback," IEEE Trans. on Circuits and Systems for Video Technology, vol. 15, no. 8, pp. 1032-1046, Aug. 2005.
     7. L. Guan, P. Muneesawang, Y. Wang, R. Zhang, Y. Tie, A. Bulzacki and M.T. Ibrahim, "Multimedia multimodal technologies," Proc. IEEE Workshop on Multimedia Signal Processing and Novel Parallel Computing (in conjunction with ICME 2009), pp. 1600-1603, NYC, USA, Jul. 2009 (overview paper).
     8. R. Zhang and L. Guan, "Multimodal image retrieval via Bayesian information fusion," Proc. IEEE Int. Conf. on Multimedia and Expo, pp. 830-833, NYC, USA, Jun./Jul. 2009.

     Why Multimedia Multimodal Methodology? (revisit)
     - Multimedia is a domain of many facets, e.g., audio, visual, text, graphics, etc.
     - A central aspect of multimedia processing is the coherent integration of media from different sources or multimodalities.
     - It is easy to define each facet individually, but difficult to treat them as a combined identity.
     - Humans are natural and generic multimedia processing machines. Can we teach computers/machines to do the same (via fusion technologies)?

  3. Potential Applications
     - Human-Computer Interaction
     - Learning Environments
     - Consumer Relations
     - Entertainment
     - Digital Home, Domestic Helper
     - Security/Surveillance
     - Educational Software
     - Computer Animation
     - Call Centers

     Source of Fusion for Classification
     [Figure: fusion can occur at three points in the classification pipeline over the inputs Data/Feature #1 and Data/Feature #2: at the data/feature level, at the representation level, or at the score (decision) level.]

  4. Feature (Data) Level Fusion

     Direct Data (Feature) Level Fusion
     [Figure: feature vectors from the individual modalities are combined directly before classification.]
     Prior knowledge can be incorporated into the fusion models by modifying ...
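
As a concrete illustration of direct feature-level fusion, a minimal sketch follows (not from the slides; the data, dimensions, and classifier choice are invented). It z-score normalizes each modality, which addresses the normalization issue raised later in the deck, then concatenates the normalized vectors before classification.

# Direct feature-level fusion: normalize each modality, then concatenate
# into one vector per sample before classification. Illustrative sketch;
# in practice, fit the scalers on training data only.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
audio_feat = rng.normal(size=(n, 30))    # stand-in for prosodic/MFCC features
visual_feat = rng.normal(size=(n, 120))  # stand-in for Gabor-based features
labels = rng.integers(0, 6, size=n)      # e.g., six emotion classes

# Per-modality normalization keeps one modality's scale from dominating.
audio_z = StandardScaler().fit_transform(audio_feat)
visual_z = StandardScaler().fit_transform(visual_feat)

# Concatenation: the fused dimensionality is the sum of the parts,
# which is exactly where the curse of dimensionality enters.
fused = np.hstack([audio_z, visual_z])
clf = SVC().fit(fused, labels)
print(clf.predict(fused[:5]))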

  5. Representation Level Fusion

     [Figure: fused HMM architecture, including an HMM (face model) component; the modality models are coupled at the representation level.]
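
The slide's fused HMM couples the modality models themselves. As a hedged stand-in (not the fused-HMM method on the slide), the sketch below shows one simple representation-level strategy: train an HMM per modality with the third-party hmmlearn package and fuse the mean state-posterior vectors as compact model-level representations. The data, model sizes, and the posterior-pooling choice are all illustrative assumptions.

# Representation-level fusion sketch: per-modality HMMs, fused via their
# state-posterior summaries rather than via raw features or final scores.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(1)
audio_seq = rng.normal(size=(300, 12))  # one observation sequence per modality
face_seq = rng.normal(size=(300, 40))

audio_hmm = hmm.GaussianHMM(n_components=4, covariance_type="diag").fit(audio_seq)
face_hmm = hmm.GaussianHMM(n_components=4, covariance_type="diag").fit(face_seq)

# Mean posterior state occupancy acts as a model-level representation.
audio_repr = audio_hmm.predict_proba(audio_seq).mean(axis=0)
face_repr = face_hmm.predict_proba(face_seq).mean(axis=0)

fused_repr = np.concatenate([audio_repr, face_repr])  # feed to any classifier
print(fused_repr)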

  6. Decision (Score) Level Fusion

     A Different Interpretation: Modular Networks (Decision Level)
     [Figure: modular network for human emotion recognition; expert sub-networks E_1, ..., E_r feed outputs y_1, ..., y_r to a decision module.]
     - Hierarchical structure.
     - Each sub-network E_r is an expert system.
     - The decision module classifies the input vector as a particular class when y_net = arg max_j y_j.
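
A minimal sketch of the modular-network decision rule above: each expert emits a class-score vector, the decision module combines them (here by a simple average, an assumed choice) and applies y_net = arg max_j y_j. The expert scores are made up.

# Decision-module sketch for a modular network (decision-level fusion).
import numpy as np

# Rows: experts (e.g., an audio expert and a visual expert); columns: classes.
expert_scores = np.array([
    [0.10, 0.70, 0.20],   # expert 1 favors class 1
    [0.25, 0.45, 0.30],   # expert 2 agrees, less confidently
])

y = expert_scores.mean(axis=0)   # combine the expert outputs
y_net = int(np.argmax(y))        # decide: class with the largest fused score
print(y, y_net)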

  7. Score Fusion Architecture (Audio-Visual)

     The scores are obtained independently and then combined:
     - The lower layer contains local experts, each producing a local score based on a single modality.
     - The upper layer combines the scores.

     Linear Fusion (non-uniformly weighted)
     [Figure: a weighted linear boundary in the (Score 1, Score 2) plane separating Accept from Reject.]
     The most prevalent unsupervised approaches estimate the confidence weights from prior knowledge or training data. Linear SVM (supervised) fusion is an appealing alternative.
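
A minimal sketch of non-uniformly weighted linear score fusion; the weights and threshold are hand-picked stand-ins for values that would come from prior knowledge or training data, per the slide.

# Linear score fusion: accept/reject from two per-modality match scores.
import numpy as np

w = np.array([0.7, 0.3])    # trust the first modality's score more
threshold = 0.5

def linear_fuse(score1: float, score2: float) -> bool:
    """Accept iff the weighted score sum clears the threshold."""
    return float(np.dot(w, [score1, score2])) >= threshold

print(linear_fuse(0.8, 0.2))   # accept
print(linear_fuse(0.3, 0.4))   # reject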

  8. Nonlinear Adaptive Fusion (via supervision)

     [Figure: a nonlinear (kernel SVM) boundary in the (Score 1, Score 2) plane; a sketch follows below.]

     Data/Feature Fusion
     - Simple and straightforward (good)
     - Curse of dimensionality (bad)
     - Normalization issue
     - Case studies:
       1. Bimodal human emotion recognition (also with a score-level fusion flavor)
       2. 3D human action recognition
       3. Fusion by feature mapping
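
For the supervised nonlinear fusion above, the sketch below learns the accept/reject boundary in the (Score 1, Score 2) plane with an RBF-kernel SVM; the training score pairs and labels are synthetic stand-ins, not data from the slides.

# Nonlinear adaptive score fusion: a kernel SVM replaces the fixed linear rule.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
genuine = rng.normal(loc=[0.8, 0.7], scale=0.10, size=(100, 2))
impostor = rng.normal(loc=[0.3, 0.3], scale=0.15, size=(100, 2))
scores = np.vstack([genuine, impostor])
labels = np.array([1] * 100 + [0] * 100)   # 1 = accept, 0 = reject

fuser = SVC(kernel="rbf", gamma="scale").fit(scores, labels)
print(fuser.predict([[0.75, 0.65], [0.35, 0.25]]))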

  9. Bimodal Human Emotion Recognition (also with a score-level fusion flavor)
     1. Y. Wang and L. Guan, "Recognizing human emotional state from audiovisual signals," IEEE Trans. on Multimedia, vol. 10, no. 5, pp. 936-946, Aug. 2008.

     Indicators of Emotion
     - Speech and facial expression: the major indicators of emotion
     - Body language: highly dependent on personality, gender, age, etc.
     - Semantic meaning: two sentences could have the same lexical meaning but different emotional information
     - ...

  10. Objective
      To develop a generic, language- and cultural-background-independent system for recognition of human emotional state from audiovisual signals.

  11. Audio Feature Extraction
      [Pipeline: input speech -> preprocessing (noise reduction via wavelet thresholding; leading- and trailing-edge elimination) -> windowing (Hamming window, 512 points, 50% overlap) -> prosodic, MFCC, and formant features -> audio feature set.]

      Visual Feature Extraction
      [Pipeline: input image sequence -> key frame extraction (at maximum speech amplitude) -> face detection -> Gabor filter bank -> feature mapping -> visual features.]
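
A hedged sketch of the two front ends using the slide's windowing settings: MFCCs over Hamming-windowed 512-sample frames with 50% overlap (via the third-party librosa package), and one Gabor response standing in for a full filter bank (via scikit-image). The signal, the "face" image, and the Gabor frequency are placeholders; the preprocessing, prosodic/formant, and feature-mapping stages are omitted.

# Audio and visual feature-extraction sketch (illustrative inputs only).
import numpy as np
import librosa
from skimage.filters import gabor

sr = 16000
speech = np.random.default_rng(3).normal(size=sr).astype(np.float32)

mfcc = librosa.feature.mfcc(
    y=speech, sr=sr, n_mfcc=13,
    n_fft=512, hop_length=256,      # 512-point frames, 50% overlap
    window="hamming",
)
print(mfcc.shape)                   # (13, n_frames)

face = np.random.default_rng(4).random((64, 64))    # stand-in face crop
real, imag = gabor(face, frequency=0.25)            # one filter of a bank
print(real.shape)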
