Multimodal Information Fusion

Ling Guan

Ryerson Multimedia Laboratory & Centre for Interactive Multimedia Information Mining
Ryerson University, Toronto, Ontario, Canada
lguan@ee.ryerson.ca
https://www.ryerson.ca/multimedia-research-laboratory/


Acknowledgement

• The presenter would like to thank his former and current students, P. Muneesawang, Y. Wang, R. Zhang, M.T. Ibrahim, L. Gao, C. Liang and N. El Madany for their contributions to this research.

• Slides on fusion fundamentals provided by Prof. S.-Y. Kung of Princeton University are greatly appreciated.

• This presentation is supported by:
  • The Canada Research Chair (CRC) Program,
  • The Canada Foundation for Innovation (CFI),
  • The Ontario Research Fund (ORF), and
  • Ryerson University.

Ryerson University – Multimedia Research Laboratory
  • L. Guan, P. Muneesawang, Y. Wang, R. Zhang, Y. Tie, A. Bulzacki, M.T. Ibrahim, N. Joshi, Z. Xie, L. Gao, N. El Madany


Relevant Publications

1. Y. Wang and L. Guan, "Combining speech and facial expression for recognition of human emotional state," IEEE Trans. on Multimedia, vol. 10, no. 5, pp. 936-946, Aug. 2008.
2. C. Liang, E. Chen, L. Qi and L. Guan, "Heterogeneous features fusion with collaborative representation learning for 3D action recognition," Proc. IEEE Int. Symposium on Multimedia, pp. 162-168, Taichung, Taiwan, Dec. 2017 (Top Six (5%) Paper Honor).
3. L. Gao, L. Qi, E. Chen and L. Guan, "Discriminative multiple canonical correlation analysis for information fusion," IEEE Trans. on Image Processing, vol. 27, no. 4, pp. 1951-1965, Apr. 2018.
4. P. Muneesawang, T. Amin and L. Guan, "A new learning algorithm for the fusion of adaptive audio-visual features for the retrieval and classification of movie clips," J. Signal Processing Systems, vol. 59, no. 2, pp. 177-188, May 2010.
5. T. Amin, M. Zeytinoglu and L. Guan, "Application of Laplacian mixture model for image and video retrieval," IEEE Trans. on Multimedia, vol. 9, no. 7, pp. 1416-1429, Nov. 2007.
6. P. Muneesawang and L. Guan, "Adaptive video indexing and automatic/semi-automatic relevance feedback," IEEE Trans. on Circuits and Systems for Video Technology, vol. 15, no. 8, pp. 1032-1046, Aug. 2005.
7. L. Guan, P. Muneesawang, Y. Wang, R. Zhang, Y. Tie, A. Bulzacki and M.T. Ibrahim, "Multimedia multimodal technologies," Proc. IEEE Workshop on Multimedia Signal Processing and Novel Parallel Computing (in conjunction with ICME 2009), pp. 1600-1603, NYC, USA, Jul. 2009 (Overview Paper).
8. R. Zhang and L. Guan, "Multimodal image retrieval via Bayesian information fusion," Proc. IEEE Int. Conf. on Multimedia and Expo, pp. 830-833, NYC, USA, Jun./Jul. 2009.

Why Multimedia Multimodal Methodology? (revisit)

• Multimedia is a domain of multiple facets, e.g., audio, visual, text, graphics, etc.
• A central aspect of multimedia processing is the coherent integration of media from different sources or modalities.
• It is easy to define each facet individually, but difficult to consider them as a combined identity.
• Humans are natural and generic multimedia processing machines. Can we teach computers/machines to do the same (via fusion technologies)?

Potential Applications

• Human–Computer Interaction
• Learning Environments
• Consumer Relations
• Entertainment
• Digital Home, Domestic Helper
• Security/Surveillance
• Educational Software
• Computer Animation
• Call Centers

Source of Fusion for Classification

[Diagram: each modality (Data/Feature #1, Data/Feature #2) passes through a Representation stage to a Score (Decision) stage; fusion can take place at the data/feature, representation, or score/decision level.]

Direct Data (Feature) Level Fusion

Prior knowledge can be incorporated into the fusion models by modifying the visual features.


Representation Level Fusion

[Diagram: an audio HMM and a face-model HMM are combined into a fused HMM.]

Decision (Score) Level Fusion

Modular Networks (Decision Level)

[Diagram: a hierarchical modular network; expert sub-networks E1, E2, ..., Er each produce outputs y1, y2, ..., yj, which a decision module combines into the network output Ynet, the recognized emotion.]

• Hierarchical structure.
• Each sub-network Er is an expert system.
• The decision module classifies the input vector as a particular class when Ynet = arg max_j y_j.

A Different Interpretation (human emotion recognition)
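To make the decision rule concrete, here is a minimal sketch (not the authors' implementation) of decision-level fusion: each expert emits a class-score vector, and the decision module picks the class maximizing the combined scores, i.e. Ynet = arg max_j y_j. The averaging step and the toy score values are assumptions.

```python
import numpy as np

# Decision-level fusion sketch: each expert E_r maps its modality's input
# to a vector of class scores; the decision module takes the arg max.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

def fuse_decisions(expert_scores):
    """expert_scores: list of arrays, each of shape (n_classes,).
    Returns the index of the winning class, Ynet = arg max_j y_j."""
    y = np.mean(expert_scores, axis=0)   # combine expert outputs y_1..y_j
    return int(np.argmax(y))

# Example: two experts (audio, visual) voting over six emotions
audio_scores = np.array([0.10, 0.05, 0.20, 0.40, 0.15, 0.10])
visual_scores = np.array([0.20, 0.10, 0.15, 0.35, 0.10, 0.10])
print(EMOTIONS[fuse_decisions([audio_scores, visual_scores])])  # happiness
```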


Score Fusion Architecture (Audio-Visual)

• The lower layer contains local experts, each producing a local score based on a single modality.
• The upper layer combines the scores.

The scores are obtained independently and then combined.

Linear Fusion

[Plot: a non-uniformly weighted linear decision boundary in the (Score 1, Score 2) plane separating Accept from Reject.]

Linear SVM (supervised) fusion is an appealing alternative. The most prevalent unsupervised approaches estimate the confidence weights from prior knowledge or training data.
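A minimal sketch of unsupervised, non-uniformly weighted linear score fusion; the weights here are hard-coded placeholders standing in for confidences estimated from prior knowledge or training data.

```python
def linear_fusion(score_audio, score_visual, w=(0.6, 0.4), threshold=0.5):
    """Non-uniformly weighted linear fusion of two match scores in [0, 1].
    Accept if the fused score clears the threshold, otherwise reject."""
    fused = w[0] * score_audio + w[1] * score_visual
    return "accept" if fused >= threshold else "reject"

print(linear_fusion(0.7, 0.3))  # accept: 0.6*0.7 + 0.4*0.3 = 0.54
```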

Nonlinear Adaptive Fusion (via supervision; kernel, SVM)

[Plot: a nonlinear decision boundary in the (Score 1, Score 2) plane learned by a kernel SVM.]
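A hedged sketch of supervised nonlinear fusion: a kernel SVM is trained on (Score 1, Score 2) pairs with accept/reject labels and learns a nonlinear decision boundary. The training data below is illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

# Rows are (score_1, score_2) pairs from the two local experts;
# labels are 1 = accept, 0 = reject (toy values for illustration).
scores = np.array([[0.9, 0.8], [0.7, 0.9], [0.8, 0.2],
                   [0.2, 0.3], [0.1, 0.6], [0.3, 0.1]])
labels = np.array([1, 1, 1, 0, 0, 0])

svm_fusion = SVC(kernel="rbf", gamma="scale").fit(scores, labels)
print(svm_fusion.predict([[0.75, 0.65]]))  # fused decision for a new score pair
```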

Data/Feature Fusion

• Simple and straightforward (good)
• Curse of dimensionality (bad)
• Normalization issue
• Case studies (see the sketch after this list):
1. Bimodal human emotion recognition (also with a score-level-fusion flavor)
2. 3D human action recognition
3. Fusion by feature mapping
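A minimal sketch of direct feature-level fusion: z-score normalize each modality per dimension (addressing the normalization issue above), then concatenate, at the cost of a higher-dimensional space (the curse of dimensionality).

```python
import numpy as np

def fuse_features(feat_a, feat_v, eps=1e-12):
    """Normalize each modality per dimension, then concatenate.
    feat_a: (n_samples, d_a) audio features; feat_v: (n_samples, d_v) visual features."""
    za = (feat_a - feat_a.mean(axis=0)) / (feat_a.std(axis=0) + eps)
    zv = (feat_v - feat_v.mean(axis=0)) / (feat_v.std(axis=0) + eps)
    return np.hstack([za, zv])   # dimensionality grows to d_a + d_v
```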


Bimodal Human Emotion Recognition
— Also with a score-level-fusion flavor

1. Y. Wang and L. Guan, "Recognizing human emotional state from audiovisual signals," IEEE Transactions on Multimedia, vol. 10, no. 5, pp. 936-946, August 2008.

Indicators of Emotion

Major indicators of emotion:
• Speech
• Facial expression
• Body language: highly dependent on personality, gender, age, etc.
• Semantic meaning: two sentences could have the same lexical meaning but carry different emotional information
• ...

Objective

To develop a generic, language- and cultural-background-independent system for recognition of human emotional state from audiovisual signals.

Audio Feature Extraction

Input speech → Preprocessing (noise reduction via wavelet thresholding; leading- and trailing-edge elimination) → Windowing (Hamming window, 512 points, 50% overlap) → Prosodic, MFCC, and formant features → Audio feature set
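A sketch of the windowing/MFCC stage of this pipeline, assuming librosa; the 512-point Hamming window with 50% overlap follows the slide, while the file name and the number of coefficients are illustrative assumptions (the prosodic and formant features are omitted here).

```python
import librosa

y, sr = librosa.load("utterance.wav", sr=None)   # placeholder file name
# 512-point Hamming window with 50% overlap (hop of 256 samples), as on the slide
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=512, hop_length=256, window="hamming")
print(mfcc.shape)   # (13, n_frames)
```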

Visual Feature Extraction

Input image sequence → Key frame extraction (at maximum speech amplitude) → Face detection → Gabor filter bank → Feature mapping → Visual features
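A sketch of the Gabor filter bank stage, assuming scikit-image and SciPy; the frequencies and orientations are illustrative assumptions. With 4 frequencies × 6 orientations the bank has 24 filters, matching the flavor of the 24-dimensional mean and standard deviation Gabor statistics used in the hand-crafted experiments later in the deck.

```python
import numpy as np
from scipy import ndimage
from skimage.filters import gabor_kernel

def gabor_stats(img, freqs=(0.05, 0.1, 0.2, 0.3), n_orient=6):
    """Mean and standard deviation of Gabor filter-bank responses (sketch).
    4 frequencies x 6 orientations = 24 filters -> 24 means + 24 stds."""
    means, stds = [], []
    for f in freqs:
        for k in range(n_orient):
            kern = np.real(gabor_kernel(f, theta=k * np.pi / n_orient))
            resp = ndimage.convolve(img.astype(float), kern, mode="wrap")
            means.append(resp.mean())
            stds.append(resp.std())
    return np.array(means), np.array(stds)
```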

The Recognition System (with Decision Fusion)

Audio path: input speech → preprocessing → windowing → prosodic, MFCC, and formant features.
Visual path: input video → key frame extraction → face detection → Gabor wavelet features.
Both paths: feature selection → corresponding classifiers (AN, DI, FE, HA, SA, SU) → decision module → recognized emotion.

Modular Networks (Decision Level Fusion)

The same hierarchical expert/decision-module architecture shown earlier, here applied to human emotion recognition: each sub-network Er is an expert system, and the decision module classifies the input as the class with Ynet = arg max_j y_j.

Experimental Results

Accuracy (%):

Method            Anger   Disgust  Fear   Happiness  Sadness  Surprise  Overall
Audio             88.46   61.90    59.09  48.00      57.14    80.00     66.43
Visual            34.62   47.62    68.18  52.00      47.62    48.00     49.29
Audiovisual       76.92   76.19    77.27  56.00      61.90    72.00     70.00
Stepwise          80.77   61.90    77.27  80.00      76.19    76.00     75.71
Multi-classifier  88.46   80.95    77.27  80.00      80.95    84.00     82.14

• Experiments were performed on 500 video samples from 8 subjects speaking 6 languages.
• Six emotion labels: anger, disgust, fear, happiness, sadness, and surprise.
• 360 samples (from six subjects) were used for training and the remaining 140 (from the other two subjects) for testing; there is no overlap between training and testing subjects.

3D Human Action Recognition

2. C. Liang, E. Chen, L. Qi and L. Guan, "Heterogeneous features fusion with collaborative representation learning for 3D action recognition," Proc. IEEE Int. Symposium on Multimedia, pp. 162-168, Taichung, Taiwan, December 2017 (Top Six (5%) Paper Honor).

3D action recognition representations:
• RGB (color) based approaches
• Depth based approaches
• Skeleton based approaches
• Fusion based approaches

The Recognition Framework

[Diagram: the proposed recognition framework.]

Experiment on the SBU Interaction Dataset

• A collection of two-person interactive activities.
• Composed of 8 interactions performed by 21 subjects, 265 sequences in total (6,822 frames).
• Following 5-fold cross-validation, the dataset is randomly split into 5 folds of 4-5 two-actor sets each.

[Figure: sample frames, Subject01–Subject07.]

Results on the SBU Interaction Dataset

• Energy-based segmentation is better than time-guided segmentation.
• Heterogeneous feature fusion based on the CCA-serial method is better than the CCA-sum method.
• The best recognition accuracy of the proposed method is 95.39%, better than the six deep-learning-based methods:
  • Structured Model [21]
  • ST-LSTM + Trust Gates [22]
  • Hierarchical RNN [23]
  • LSTM + co-occurrence [23]
  • SkeletonNet (Skeleton + CNN) [27]
  • Global Context-Aware + LSTM [26]

Fusion by Feature Mapping

3. L. Gao, L. Qi, E. Chen and L. Guan, "Discriminative multiple canonical correlation analysis for information fusion," IEEE Trans. on Image Processing, vol. 27, no. 4, pp. 1951-1965, Apr 2018.

Knowledge Discovery (revisit)

[Diagram: Data → feature generation (Features 1 to N) → feature mapping (generating an effective representation) → classification/recognition → result. Feature generation and feature mapping together constitute knowledge discovery.]

Knowledge Discovery - 2 (revisit)

In the schematic diagram on the previous slide, knowledge discovery includes:

• Feature generation block: generates features/descriptors by
  • identifying the key points
  • generating features at the key points
    - Classical methods incorporating prior knowledge (hand-crafted features)
    - Deep learning structures such as CNN (hand-crafted architecture)

• Feature mapping block: maps the features into a more effective representation by
  • feature selection,
  • explicit mapping by statistical machine learning (SML), or
  • implicit mapping by FFNN, normally including pooling, a statistical processing step.

Feature generation and mapping are two different but complementary processing steps. Both are critically important in information discovery. SML methods are solidly rooted in mathematics, and the analysis procedure can be clearly and convincingly presented.

Feature Mapping (1)

• The purpose: explicitly or implicitly reorganize the information for the best possible analysis and recognition performance.
• Advantages: provides a more complete and discriminatory description of the intrinsic characteristics of different patterns.
• Major challenges:
  — How to extract the discriminatory description of the intrinsic characteristics from the data?
  — How to design a mapping strategy that can effectively utilize the complementary information presented in different datasets?

Feature Mapping (2)

• When multiple features/modalities are involved, it is in essence multimodal information fusion [L1].
• When only one feature set is involved, it simplifies to information transformation.
• The following presentation focuses on mapping by LMCCA and DMCCA for the purpose of multimodal information fusion.
• Their common characteristic: generic, i.e., feature independent.

Discriminative Multiple Canonical Correlation Analysis (DMCCA) (1)

• In DMCCA,
  • the correlation among feature sets derived from multiple modalities, multiple features, or multiple channels is taken as the metric of similarity;
  • the within-class correlation and the between-class correlation are considered jointly, leading to a more discriminant space.
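DMCCA itself jointly models within-class and between-class correlation; as a simplified, hedged stand-in, the sketch below fuses two feature sets with plain two-view CCA (scikit-learn), projecting both views into a correlated subspace and concatenating the projections.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_fusion(X1, X2, n_components=10):
    """Plain two-view CCA fusion (a simplified stand-in for DMCCA).
    X1: (n_samples, d1), X2: (n_samples, d2) -> (n_samples, 2*n_components).
    n_components must not exceed min(n_samples, d1, d2)."""
    cca = CCA(n_components=n_components).fit(X1, X2)
    Z1, Z2 = cca.transform(X1, X2)   # maximally correlated projections
    return np.hstack([Z1, Z2])       # fused representation fed to a classifier
```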

DMCCA (2)

A key characteristic of DMCCA (analytically verified):

• Capable of extracting the discriminatory representation.
• The number of projected dimensions (d) corresponding to the optimal recognition accuracy is smaller than or equal to the number of classes (c) being studied; mathematically, $d \le c$.
• The property can be graphically illustrated accurately.

Experimental Results (1)

• We conduct experiments on different applications:
  • Handwritten digit recognition (MNIST database) — http://yann.lecun.com/exdb/mnist/
  • Face recognition (ORL database) — http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
  • Image recognition (Caltech 101) — http://www.vision.caltech.edu/Image_Datasets/Caltech101/

Experimental Results (2)

Table 1. Experimental settings on the different databases

Database   Total samples            Training samples  Testing samples  Number of classes
MNIST      3000                     1500              1500             10
ORL        400 (all samples)        200               200              40
Caltech    31 to 800 images/class   3030              6084             101

Experimental Results (3)

• Hand-crafted feature extraction
  • Handwritten digit recognition:
    • 24-dimensional: the mean of the digit images transformed by the Gabor filters.
    • 24-dimensional: the standard deviation of the digit images transformed by the Gabor filters.
    • 36-dimensional: Zernike moment features.
  • Face recognition:
    • 36-dimensional: histogram of oriented gradients (HOG) features.
    • 33-dimensional: local binary pattern (LBP) features.
    • 48-dimensional: Gabor transformation features, the mean and standard deviation of the face images transformed by each filter.

Experimental Results (4)

Table 2. Performance of different methods on MNIST (1500 training samples; 1500 testing samples)

Methods             Performance
Serial Fusion [L6]  70.33%
CCA [L7]            74.60%
GCCA [L8]           75.53%
MCCA [L9]           72.73%
CNN [L10]           76.40%
PCANet [L11]        79.20%
DMCCA               82.60%

Graphically Selecting the Optimal Projection in DMCCA (MNIST)

[Plot: J(η) versus projected dimension for handwritten digit recognition on the MNIST database; the peak occurs at dimension 9 < 10 classes.]

Experimental Results (5)

Table 3. Performance of different methods on ORL

Methods                                                  Performance
Serial Fusion [L6]                                       77.5%
CCA [L7]                                                 94.5%
GCCA [L8]                                                95.5%
MCCA [L9]                                                94.5%
Discriminative Sparse Representation (DSR) [L12]         94.5%
Collaborative Representation Classification (CRC) [L13]  88.5%
l1-regularized Least Squares (L1LS) [L14]                92.5%
Fast L1-Minimization Algorithms (FLMA) [L15]             90.0%
CNN [L10]                                                76.0%
PCANet [L11]                                             92.0%
DMCCA                                                    98.5%

Graphically Selecting the Optimal Projection in DMCCA (ORL)

[Plot: J(η) versus projected dimension for face recognition on the ORL database; the peak occurs at dimension 27 < 40 classes.]

Experimental Results (6)

• Deep-NN-based feature extraction
• Caltech 101 dataset with AlexNet: features extracted from the fc6, fc7 and fc8 layers.

Experimental Results (7)

• The parameters of fc6, fc7 and fc8 and the recognition results:
  • fc6: 4096-unit fully connected layer; recognition rate 77.84%
  • fc7: 4096-unit fully connected layer; recognition rate 77.65%
  • fc8: 1000-unit fully connected layer; recognition rate 73.31%

Experimental Results (8)

Table 4. Comparison with AlexNet on Caltech 101

Methods                   Performance
AlexNet fc6               77.80%
AlexNet fc7               77.65%
AlexNet fc8               77.31%
LMCCA [2018]              83.68%
LMCCA (new)               87.21%
DMCCA [2018]              89.38%
DMCCA+KECA [unpublished]  90.61%

Experimental Results (9)

Table 5. Comparison with different methods on Caltech 101

Methods                     Year  Performance
B. Du et al. [L16]          2017  78.60%
L. Mansourian et al. [L17]  2017  75.37%
P. Tang et al. [L18]        2017  82.45%
G. Lin et al. [L19]         2017  78.83%
W. Xiong et al. [L20]       2017  75.90%
S. Kim et al. [L21]         2017  83.00%
W. Yu et al. [L22]          2018  77.90%
L. Sheng et al. [L23]       2018  74.78%
DMCCA [L2]                  2018  89.38%

Summary

1. Transformation-based feature coding can improve the quality of visual features, both hand-crafted features and those obtained by deep NNs.
2. Optimal coding by LMCCA, DMCCA and DMCCA+KECA has been analytically derived and experimentally verified.

Score Level Fusion

• Could be straightforward or involve more analysis.
• Rigid, due to the limited information left at this level.

Case study:
1. Video retrieval based on audiovisual cues

Video Retrieval by Audiovisual Cues

4. P. Muneesawang, T. Amin and L. Guan, "A new learning algorithm for the fusion of adaptive audio-visual features for the retrieval and classification of movie clips," Journal of Signal Processing Systems, vol. 59, no. 2, pp. 177-188, May 2010.
5. T. Amin, M. Zeytinoglu and L. Guan, "Application of Laplacian mixture model for image and video retrieval," IEEE Trans. on Multimedia, vol. 9, no. 7, pp. 1416-1429, November 2007.
6. P. Muneesawang and L. Guan, "Adaptive video indexing and automatic/semi-automatic relevance feedback," IEEE Trans. on Circuits and Systems for Video Technology, vol. 15, no. 8, pp. 1032-1046, August 2005.

Video Retrieval

The scores are obtained independently and then combined: a classifier for the audio channel produces an audio score $X^{(A)}$, a classifier for the video channel produces a visual score $X^{(V)}$, and an SVM combines them into a fused score.

SVM Fusion

[Figure: the SVM fusion architecture for the audio and visual scores.]

Visual Feature Representation

• Visual: Adaptive Video Indexing (AVI)
  • using visual templates,
  • a TF×IDF weighting model, and
  • cosine distance for similarity matching.

TF×IDF template weight:

$$f_v[j] = \frac{fr[j]}{\max_j\{fr[j]\}}\,\log\frac{N}{n[j]}$$

where $fr[j]$ is the frequency of visual template $j$ in the clip, $n[j]$ is the number of clips containing template $j$, and $N$ is the total number of clips. Frame feature vectors are assigned to templates by arg-min nearest-template matching of the form

$$c^{(1)} = \arg\min_{i\in\{1,\dots,N_l\}} \big\|\hat{h}_j - h_i\big\|^2 .$$
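A minimal sketch of the weighting and matching steps under the reading above: TF×IDF-style weights over visual templates, compared with cosine similarity. The function and variable names are assumptions.

```python
import numpy as np

def template_weights(fr, n, N):
    """f_v[j] = (fr[j] / max_j fr[j]) * log(N / n[j]).
    fr: per-template frequencies in the clip; n: number of clips
    containing each template; N: total number of clips."""
    return (fr / fr.max()) * np.log(N / n)

def cosine_sim(a, b):
    """Cosine similarity used for clip-to-clip matching."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```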


Audio Feature Representation

• Laplacian mixture model (LMM) of the wavelet coefficients of the audio signal:

$$p(w) = \alpha_1\,p_1(w \mid b_1) + \alpha_2\,p_2(w \mid b_2), \qquad \alpha_1 + \alpha_2 = 1$$

• Audio feature vector built from the model parameters (using an EM estimator):

$$f_a = \{\, m,\ \alpha_{1,i},\ b_{1,i},\ b_{2,i} \,\}, \qquad i = 1, 2, \dots, L-1$$

where $\alpha_{1,i}, b_{1,i}, b_{2,i}$ are the model parameters obtained from the $i$-th high-frequency subband.
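A hedged sketch of the EM estimator for a two-component, zero-mean Laplacian mixture on one wavelet subband; the initialization and iteration count are illustrative choices, not the paper's settings.

```python
import numpy as np

def laplace_pdf(w, b):
    """Zero-mean Laplacian density with scale b."""
    return np.exp(-np.abs(w) / b) / (2.0 * b)

def lmm_em(w, n_iter=50):
    """EM for p(w) = a*p1(w|b1) + (1-a)*p2(w|b2); returns (a, b1, b2)."""
    a = 0.5
    b1, b2 = 0.5 * np.abs(w).mean(), 2.0 * np.abs(w).mean()
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each coefficient
        p1 = a * laplace_pdf(w, b1)
        p2 = (1.0 - a) * laplace_pdf(w, b2)
        r = p1 / (p1 + p2 + 1e-300)
        # M-step: mixing weight, and weighted mean absolute deviations
        a = r.mean()
        b1 = (r * np.abs(w)).sum() / (r.sum() + 1e-12)
        b2 = ((1.0 - r) * np.abs(w)).sum() / ((1.0 - r).sum() + 1e-12)
    return a, b1, b2
```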

LMM vs. GMM (2 components)

[Figure: fit of a two-component LMM versus a two-component GMM to the wavelet-coefficient histogram.]

Video Classification Results

• Recognition rates obtained by the SVM-based fusion model on a video database of 6,000 clips.
• Five semantic concepts.

Multimodal Human Authentication with Signature, Iris and Fingerprint

Signature Recognition

[Figure: signature recognition examples.]

Fingerprint Image Enhancement System Overview

[Diagram: filter-bank-based and minutiae-based matchers each produce a match decision; decision-level fusion yields the final decision.]

Iris Segmentation System Overview

[Diagram: two feature extraction paths produce iris codes; feature-level fusion precedes matching.]

Fingerprint/Signature/Iris Fusion

[Diagram: three feature extraction modules produce match scores; score-level fusion yields the final decision.]

Robot Applications

[Diagram: component modules — body tracking, camera tracking, movement recognition, face tracking, emotion recognition, hand gesture recognition, and stereo vision / depth calculation.]

Robot Application: Domestic Helper (via emotion/intention recognition)

1. Target people group: elderly and disabled people at home or in community houses.
2. Capable of simple gestures and body language.
3. Capable of simple and, sometimes, incomplete verbal communications.

Robot Application: Domestic Helper (via emotion/intention recognition)

1. Help the elderly and the disabled with their daily life.
2. Entertain the people they look after.
3. Call the nurse or emergency services when in need.

[Figure: gesture/action recognition, emotion recognition, and head tracking.]

Multimodal Fusion for Human Intention Recognition

[Diagram: audio cues feed human-emotion analysis; visual cues feed human-emotion analysis and human-action analysis (hand gesture, body movement); the results combine into the possible human intention.]

Challenges

Will fusion help in the problem at hand?

What is the best fusion model for the problem at hand?
• Data/feature level,
• Representation level,
• Score/decision level,
• or multilevel.

New data analysis and information mining tools need to be developed to address these issues, or the existing tools may be revisited.

Challenge 1: Does fusion help?

[Diagram: Sensor 1 (audio) and Sensor 2 (visual) feed both unimodal recognizers and a fusion module; compare recognition by audio, by video, and by fusion.]

Challenge 2: Fusion at which level?

[Diagram: each modality passes from data/feature through representation to score (decision); fusion may occur at any of these levels.]

Information Entropy

• Entropy is a measure of the uncertainty associated with a random variable; uncertainty is useful information.
• Entropy: $H(X) = -\sum_x p(x) \log p(x)$
• Conditional entropy: $H(X \mid Y) = -\sum_{x,y} p(x,y) \log p(x \mid y)$

Multimodal Source Type

• Conflict: H(x), H(y) < H(x,y) — the joint uncertainty is larger; negative relevance.
• Redundancy: H(x,y) < H(x), H(y) — the joint uncertainty is smaller; positive relevance.
• Complementary: H(x) ≈ H(x,y) ≈ H(y) — the joint uncertainty is roughly unchanged; weak relevance.
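A sketch of the quantities behind this taxonomy: marginal entropies H(x) and H(y), joint entropy H(x,y), and mutual information, estimated from a joint histogram of two quantized feature streams. The bin count is an illustrative assumption.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a probability table, ignoring zero cells."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def relevance_stats(x, y, bins=16):
    """Returns H(x), H(y), H(x,y) and I(x;y) = H(x)+H(y)-H(x,y).
    Large mutual information suggests redundant sources; near-zero
    mutual information suggests complementary (weakly relevant) ones."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    hx, hy = entropy(pxy.sum(axis=1)), entropy(pxy.sum(axis=0))
    hxy = entropy(pxy)
    return hx, hy, hxy, hx + hy - hxy
```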

Multimodal Fusion Method

• When a feature has a conflict with other features:
  • if the conflict is beyond a threshold, eliminate the conflicting feature;
  • keep the total uncertainty at an acceptable level.
• Redundancy fusion (A ∩ B): according to the ranking of uncertainty values.
• Complementary fusion (A ∪ B).

Mapping from Entropy to Weight

• Weights are inversely proportional to entropies:
  • high entropy → low confidence → low weight;
  • corrupted feature → maximum entropy → zero weight;
  • ∑w = 1.

Boundary cases: H = 0 → w = 1; H = Hmax → w = 0; Hi = Hj → wi = wj = 0.5.
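One simple mapping consistent with the boundary cases above (an assumption, not necessarily the authors' formula): make each weight proportional to Hmax − Hi and normalize so the weights sum to one.

```python
import numpy as np

def entropy_weights(h, h_max):
    """w_i proportional to (h_max - h_i), normalized so sum(w) = 1.
    H = 0 against a fully uncertain partner -> w = 1; H = Hmax -> w = 0;
    equal entropies -> equal weights (0.5 each for two modalities)."""
    conf = h_max - np.asarray(h, dtype=float)   # per-modality confidence
    return conf / conf.sum()

print(entropy_weights([0.0, 2.0], h_max=2.0))  # [1. 0.]
print(entropy_weights([1.2, 1.2], h_max=2.0))  # [0.5 0.5]
```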


Entropy vs. Correlation in Fusion

• Developed an entropy-based method, kernel entropy component analysis (KECA).
• Fused the entropy-based method with discriminative analysis (KECA-DMCCA).
• Compared with correlation-based methods:
  • KPCA – kernel principal component analysis
  • KCCA – kernel canonical correlation analysis

1. Z. Xie and L. Guan, "Multimodal information fusion of audio emotion recognition based on kernel entropy component analysis," Int. J. of Semantic Computing, 7(1), 25-42, Aug 2013.
2. L. Gao, L. Qi and L. Guan, "A novel discriminative framework integrating kernel entropy component analysis and discriminative multiple canonical correlation for information fusion," IEEE Int. Symposium on Multimedia, San Jose, USA, December 2016.

Experimental Results

[Figure: recognition performance of KECA, KPCA and KCCA on the RML and eNTERFACE databases.]

Summary

• Fusion is the coherent integration of multimedia multimodal information.
• It is a natural process for human beings, but not straightforward for machines.
• It may be carried out at different information levels, but how do we choose the right model?
• Several case studies were used to demonstrate the power of information fusion.
• Multiple challenges remain to be addressed.

Thank You
