Artificial Neural Networks for Multimodal Information Fusion - PowerPoint PPT Presentation


SLIDE 1

Artificial Neural Networks for Multimodal Information Fusion

Friedhelm Schwenker, Institute of Neural Information Processing, University of Ulm. Cairo University, April 9, 2010.

SLIDE 2

Outline

Artificial neural networks (ANN)
Recognition of bio-acoustic time series
Emotion recognition in human-computer interaction

SLIDE 3

Pattern recognition applications at NI

Recognition of visual objects from camera images (OCR, face recognition)
Medical diagnosis and bioinformatics
Speaker identification
Speech recognition/understanding
Recognition of human emotions from speech and facial expressions
Bio-acoustic pattern recognition
...

SLIDE 4
1. Artificial Neural Networks

           Von Neumann computer     Biological neural net
Processor  complex                  simple
           high speed               low speed
           1 or a few               large number
Computing  centralized              distributed
           sequential               parallel
           by programs              by learning from data
Memory     localized                distributed
           addressable by keys      addressable by content
           not fault-tolerant       fault-tolerant

SLIDE 5

Layered Networks

Layered neural networks (single or multilayer perceptrons, radial basis function networks) are widely used in pattern recognition and regression applications.

[Diagram: input layer, weight matrix, nonlinear transfer function, output layer]
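To make this concrete, here is a minimal sketch (not taken from the slides; the weights are random illustrative values) of a single-hidden-layer perceptron forward pass, i.e. two weight matrices with a nonlinear transfer function:

```python
# Minimal forward pass of a layered network: weight matrices plus a nonlinear
# transfer function (logistic sigmoid). All sizes and values are illustrative.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(x, W1, W2):
    h = sigmoid(W1 @ x)      # hidden layer: weight matrix + nonlinearity
    return sigmoid(W2 @ h)   # output layer

rng = np.random.default_rng(0)
x = rng.normal(size=4)            # input vector
W1 = rng.normal(size=(8, 4))      # input-to-hidden weight matrix
W2 = rng.normal(size=(3, 8))      # hidden-to-output weight matrix
print(forward(x, W1, W2))
```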

SLIDE 6

Neural Models

Neuron with input x, weight vector c, transfer function f and output y:

Linear neuron:     y = ⟨x, c⟩ = Σ_{i=1}^{n} x_i c_i
Threshold neuron:  y = 1 if ⟨x, c⟩ ≥ θ, 0 otherwise
Sigmoidal neuron:  y = f(⟨x, c⟩ − θ),  f(s) = 1 / (1 + exp(−βs))
RBF neuron:        y = f(‖x − c‖),  f(r) = exp(−r² / (2σ²))
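As a small sketch, the four neuron models can be evaluated directly; the vectors and the parameters θ, β, σ below are illustrative assumptions, not values from the talk:

```python
# The four neuron models above, evaluated for one input vector x and one
# weight/centre vector c (all values illustrative).
import numpy as np

x = np.array([0.2, 0.8, -0.5])
c = np.array([0.1, 0.4, 0.3])
theta, beta, sigma = 0.1, 2.0, 1.0

linear    = x @ c                                         # y = <x, c>
threshold = 1.0 if x @ c >= theta else 0.0                # step at theta
sigmoidal = 1.0 / (1.0 + np.exp(-beta * (x @ c - theta)))
r         = np.linalg.norm(x - c)
rbf       = np.exp(-r**2 / (2 * sigma**2))                # Gaussian RBF
print(linear, threshold, sigmoidal, rbf)
```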

SLIDE 7

Learning in artificial neural nets

[Diagram: input x, network with connectivity matrix C, output y, teacher T]

Mapping F_C : X → Y; the connectivity matrix C is learnt from examples.
Data: x ∈ X or pairs (x, T) ∈ X × Y.
Different types of target function E(C); optimising E(C) leads to learning rules for C.

SLIDE 8

Supervised learning

[Diagram: input x_i, weights c_ij, output y_j, teacher signal T_j]

Output y_j, teaching signal T_j. The weights c_ij are adapted such that y_j ≈ T_j.
Example: Delta rule  Δc_ij ∼ x_i (T_j − y_j)
The delta rule minimises  E(c) = ‖T − y‖²
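A minimal NumPy sketch of this update for a linear output layer; the learning rate eta is an added illustrative parameter:

```python
# Delta-rule update: Delta c_ij ~ x_i (T_j - y_j); here C[j, i] holds c_ij.
import numpy as np

def delta_rule_step(C, x, T, eta=0.1):
    y = C @ x                       # current outputs y_j
    C += eta * np.outer(T - y, x)   # c_ij += eta * x_i * (T_j - y_j)
    return C
```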

SLIDE 9

Unsupervised Learning

[Diagram: input x_i, weights c_ij, output y_j]

The weights c_ij are adapted without a teaching signal.
Example: Hebbian learning  Δc_ij ∼ x_i y_j
Hebbian learning maximises  E(c) = ‖y‖²
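The Hebbian rule looks almost the same in code (a sketch with an added illustrative learning rate; without a decay or normalisation term the weights grow without bound):

```python
# Hebbian update: Delta c_ij ~ x_i * y_j; here C[j, i] holds c_ij.
import numpy as np

def hebbian_step(C, x, eta=0.01):
    y = C @ x                      # outputs y_j
    C += eta * np.outer(y, x)      # c_ij += eta * x_i * y_j
    return C
```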

SLIDE 10

Competitive learning

[Diagram: input x_i, weights c_ij, output y_j, winner and its neighbourhood N_j]

Winner detection; the winner and the neurons in its neighbourhood are adapted.
Example: SOM learning or k-means  Δc_ij ∼ (x_i − c_ij) · N_j
k-means minimises  E(c) = ‖c − x‖²
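A sketch of one online update step: plain k-means if only the winner is adapted, SOM-like if a neighbourhood weighting N_j is supplied (the weighting function here is an assumption for illustration):

```python
# Competitive learning step: winner detection, then Delta c_j ~ (x - c_j) * N_j.
import numpy as np

def competitive_step(C, x, eta=0.1, neighbourhood=None):
    # C: (units, dim) matrix of prototype/weight vectors c_j
    winner = np.argmin(np.linalg.norm(C - x, axis=1))   # winner detection
    N = np.zeros(len(C))
    N[winner] = 1.0                                      # k-means: winner only
    if neighbourhood is not None:
        N = neighbourhood(winner)                        # SOM: neighbourhood weights
    C += eta * N[:, None] * (x - C)
    return C
```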

SLIDE 11

Model complexity and training data

Artificial neural networks can solve complex tasks, e.g. high-dimensional input (many input variables) and high-dimensional output (multi-class problems).

Large networks (with many parameters) are needed to achieve good approximations. The size of the training set grows with the number of free parameters:

M_{ε,δ} = O( (VCdim/ε)·log(1/ε) + (1/ε)·log(1/δ) ),  VCdim = O(W log K)

where W (K) is the number of weights (units), ε the error and 1 − δ the confidence. Typically the training data set is too small. Possible approach: decomposition of the learning task in combination with information/sensor fusion.

SLIDE 12

Multimodal Information Fusion

[Diagram: sensors (audio, vision, ...) → feature extraction 1 ... N → fusion → decision]

SLIDE 13

Early fusion
Mid-level fusion
Late fusion (MCS)

SLIDE 14

Multiple Classifier Systems architecture

[Diagram: feature vectors x_1, ..., x_I → classifier layer C_1(x_1), ..., C_I(x_I) → fusion layer F → decision z]

SLIDE 15

Fixed decision fusion mappings

Fusion by averaging:

F(P) := (1/I) Σ_{i=1}^{I} C_i(x_i)    (1)

Probabilistic fusion (product rule, assuming conditionally independent channels):

Pr(ω = l | x_1, ..., x_I) = (1/α) · Pr(ω = l) · Π_{i=1}^{I} [ Pr(ω = l | x_i) / Pr(ω = l) ]    (2)

with α a normalisation constant and Pr(ω = l | x_i) given by the l-th output of classifier C_i(x_i).

Voting, median fusion, ...
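A sketch of two such fixed fusion rules, assuming every classifier returns a vector of class probabilities; the product rule below is the prior-corrected form given above, with the normalisation constant computed explicitly:

```python
# Fixed decision fusion: averaging and a prior-corrected product rule.
import numpy as np

def average_fusion(outputs):
    # outputs: list of I probability vectors C_i(x_i)
    return np.mean(outputs, axis=0)

def product_fusion(outputs, prior):
    # Pr(l | x_1..x_I) ~ Pr(l) * prod_i [ Pr(l | x_i) / Pr(l) ], then normalise
    p = prior * np.prod([np.asarray(o) / prior for o in outputs], axis=0)
    return p / p.sum()

outs = [np.array([0.7, 0.2, 0.1]), np.array([0.5, 0.3, 0.2])]
prior = np.full(3, 1 / 3)
print(average_fusion(outs), product_fusion(outs, prior))
```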

SLIDE 16

Examples of trainable fusion mappings

1. Train the classifier layer.
2. Train the fusion mapping:
   Decision templates (see the sketch below)
   Bayes rule
   Behaviour knowledge space
   (Linear) associative memory networks (Hebbian learning, delta learning rule, pseudo-inverse solution)
   Artificial neural networks / kernel methods
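As one example of a trainable fusion mapping, a sketch of decision templates: the template of a class is the mean decision profile of its training samples, and fusion picks the class whose template is closest to the current decision profile (Euclidean distance is an assumption; other similarity measures are common):

```python
# Decision-template fusion: store the mean stacked classifier output (decision
# profile) per class during training; at test time the nearest template wins.
import numpy as np

def train_templates(profiles, labels, n_classes):
    # profiles: array (N, I, L) of I classifier outputs over L classes, N samples
    return np.stack([profiles[labels == l].mean(axis=0) for l in range(n_classes)])

def dt_fuse(templates, profile):
    dists = np.linalg.norm(templates - profile, axis=(1, 2))
    return int(np.argmin(dists))   # predicted class label
```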

SLIDE 17
2. Bio-acoustic pattern recognition

SLIDE 18

Example: Ephippiger

[Figure: Ephippiger song waveforms, amplitude vs. time at three zoom levels (seconds down to milliseconds)]

SLIDE 19

Extraction of Local Features in Time series

[Diagram: signal s(t), t = 1, ..., T, analysed with sliding windows W_1, W_2, ..., W_J]

For each window W_j, I local features are extracted: X(j) = (x_1(j), ..., x_I(j)).
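A sketch of the windowing step; the two local features computed here (short-time energy and zero-crossing rate) are illustrative stand-ins for the pulse-based features used later:

```python
# Sliding-window extraction of local features X(j) = (x_1(j), ..., x_I(j)).
import numpy as np

def local_features(s, win=1024, hop=512):
    feats = []
    for start in range(0, len(s) - win + 1, hop):
        w = s[start:start + win]                                # window W_j
        energy = float(np.mean(w ** 2))                         # x_1(j)
        zcr = float(np.mean(np.abs(np.diff(np.sign(w)))) / 2)   # x_2(j)
        feats.append((energy, zcr))
    return np.array(feats)                                      # one row per window
```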

SLIDE 20

FCT-Architecture

[Diagram: for each window j = 1, ..., J the feature vectors x_1(j), ..., x_I(j) are fused into X(j), classified to give z_j, and the window-wise decisions are combined by temporal fusion into the final classification z_o]

Fusion:            X(j) = (x_1(j), ..., x_I(j)) ∈ R^Φ,  Φ = Σ_{i=1}^{I} d_i
Classification:    C_j := C(X(j))
Temporal fusion:   C_o := F(C_1, ..., C_J)

SLIDE 21

CDT-Architecture

[Diagram: for each window j = 1, ..., J each feature stream x_i(j) is classified by its own classifier C_i, the decisions are fused per window, and temporal fusion yields the final classification]

Classification:    C_i : R^{d_i} → Δ,  i = 1, ..., I,  giving C_1(x_1(j)), ..., C_I(x_I(j))
Decision fusion:   C_j := F(C_1(x_1(j)), ..., C_I(x_I(j)))
Temporal fusion:   C_o := F(C_1, ..., C_J)
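The corresponding CDT sketch: each feature stream is classified by its own classifier, the decisions are fused per window, and temporal fusion combines the windows (averaging is again used for both fusion mappings, as an illustrative choice):

```python
# CDT: Classify each stream, Decision fusion per window, Temporal fusion over windows.
import numpy as np

def cdt(windows, classifiers):
    # windows: list over j of [x_1(j), ..., x_I(j)]; classifiers: [C_1, ..., C_I]
    per_window = []
    for feats in windows:
        outs = [C_i(x_i) for C_i, x_i in zip(classifiers, feats)]
        per_window.append(np.mean(outs, axis=0))   # decision fusion: C_j
    return np.mean(per_window, axis=0)             # temporal fusion: C_o
```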

SLIDE 22

Results for cricket songs

Cross-validation experiments (mean error rates) on 28 cricket species with 4 to 6 animals per species. Radial basis function networks as first-level classifiers. Extracted features: pulse length, pulse distance, energy contour. Averaged fusion leads to an error ≥ 0.1.

Algorithm     ρ=0.0  ρ=0.2  ρ=0.4  ρ=0.6  ρ=0.8  ρ=1.0
DT            8.61   7.88   8.03   7.74   7.59   7.59
Multiple DT   8.32   8.03   7.15   6.86   6.86   6.72
Cluster DT    8.61   7.30   7.15   7.15   7.30   7.30

SLIDE 23
3. Multimodal pattern recognition of emotions in HCI

Human machine interaction (HCI)
Emotion theory and emotional data collection
Recognition of facial expressions
Audio-visual laughter detection

SLIDE 24

Human machine interaction (1)

SLIDE 25

Human machine interaction (2)

In many situations human machine interaction (HCI) could be improved by having machines naturally adapt to their users. HCI should take into account information about the emotional state of the user, e.g. frustration, confusion, disliking, interest, surprise, anger, ...

SLIDE 26

Ekman’s 6 basic emotions

Based on psychophysical experiments on facial expressions, Ekman and Friesen defined 6 basic emotions: anger, surprise, disgust, sadness, happiness, fear.

SLIDE 27

More complex emotion theories

SLIDE 28

Frontal views

Recognition of emotions in facial expressions based on frontal views seems to be easy ...

SLIDE 29

Helmut data set

Three camera views of the user: frontal, back, total. The person is labelled as interested.

SLIDE 30

Face detection from the frontal view

A Viola-Jones classifier is used to detect the region of the user's face.
A Sobel edge detector is applied to extract features relevant for classifying the user's emotional state.
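A minimal OpenCV sketch of this pipeline; the cascade file is OpenCV's bundled frontal-face model and the detector parameters are illustrative defaults, not values from the talk:

```python
# Viola-Jones face detection followed by Sobel edge extraction on the face region.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_edges(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]                            # first detected face region
    roi = gray[y:y + h, x:x + w]
    gx = cv2.Sobel(roi, cv2.CV_32F, 1, 0, ksize=3)   # horizontal gradient
    gy = cv2.Sobel(roi, cv2.CV_32F, 0, 1, ksize=3)   # vertical gradient
    return gx, gy
```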

SLIDE 31

Multimodal emotions

Emotions are expressed through:
Body movements (head, arms, torso, legs)
Hand gestures
Gaze
Facial expressions
Speech
Biophysiological measures (e.g. skin conductance, heart rate, blood volume pressure)

SLIDE 32

Multimodal emotional data

Nexus with 24 EEG sensors, 4 EMG sensors, blood pressure and respiration meters
1 camera
1 microphone

SLIDE 33

3.1 Emotion recognition from facial expressions

Cohn-Kanade benchmark database. Basic emotions (anger, disgust, fear, happiness, sadness, surprise) acted by semi-professional actors. 432 sequences (97 individuals) at 30 frames per second; resolution 640 × 480.

SLIDE 34

Feature extraction

4 regions were detected: full face, left/right eye, mouth. Principal components, orientation histograms and optical flow histograms per region ⇒ 12 base classifiers.

SLIDE 35

Orientation histograms

Divide the image into n × n subimages. Apply a Sobel edge detector in each subimage. Orientations of dark-light edges are computed and collected into a predefined number of m bins.
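A sketch of this feature; n (grid size) and m (number of bins) correspond to the slide, while the gradient-magnitude weighting and default values are assumptions for illustration:

```python
# Orientation histograms: Sobel gradients per subimage, orientations binned into m bins.
import cv2
import numpy as np

def orientation_histogram(gray, n=4, m=8):
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    ang = np.arctan2(gy, gx)                 # edge orientation in [-pi, pi]
    mag = np.hypot(gx, gy)                   # edge strength used as weight
    H, W = gray.shape
    feats = []
    for i in range(n):
        for j in range(n):
            sl = (slice(i * H // n, (i + 1) * H // n),
                  slice(j * W // n, (j + 1) * W // n))
            hist, _ = np.histogram(ang[sl], bins=m, range=(-np.pi, np.pi),
                                   weights=mag[sl])
            feats.append(hist)
    return np.concatenate(feats)             # n * n * m dimensional feature vector
```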

SLIDE 36

Class labels

15 individuals were asked to select the emotion currently shown in the video. The results were fuzzy labels; unique labels were determined through majority voting.

SLIDE 37

Base Classifiers

Multilayer neural networks (radial basis functions)
Fuzzy SVM (training with a fuzzy teacher; output fuzzified)
Hidden Markov models (forward model)

SLIDE 38

HMM results of single features

10-fold cross-validation results (10 trials)

no.  feature                 region     st/mix  accuracy
1    PCA                     face       9 / 2   0.764
2    PCA                     mouth      8 / 2   0.685
3    PCA                     right eye  7 / 2   0.456
4    PCA                     left eye   3 / 2   0.451
5    Orientation histograms  face       4 / 2   0.704
6    Orientation histograms  mouth      4 / 3   0.729
7    Orientation histograms  right eye  4 / 2   0.500
8    Orientation histograms  left eye   9 / 2   0.479
9    Optical flow            face       8 / 2   0.639
10   Optical flow            mouth      9 / 2   0.607
11   Optical flow            right eye  7 / 3   0.442
12   Optical flow            left eye   8 / 4   0.491

SLIDE 39

HMM results - fusion with product rule

no.  models combined     accuracy
1    1 2 5 6 7 9 10 11   0.861
2    1 2 3 5 6 9 10 12   0.859
3    1 2 3 5 6 9 10 12   0.859
4    1 2 5 6 7 9 10 12   0.859
5    1 5 6 9             0.857
6    2 5 6 9 10 12       0.857
7    1 2 4 5 6 7 9       0.857
8    1 2 4 5 6 9 11      0.857
9    1 5 6 9 10          0.854
10   1 2 4 5 6 10 12     0.854

SLIDE 40

3.2 Audio-Visual Laughter Detection

Paralinguistic dialog elements such as laughs are important factors in human-to-human interaction. Laughs can be used to measure engagement in an interaction. Lively discourses are important not only for face-to-face communication but also for the acceptance of artificial agents. There are many different types of laughs (e.g. nervous social laughs), so laughter detection is expected to be difficult.

SLIDE 41

Data

90 minutes of multi-party conversation were analyzed; a centrally positioned microphone and a 360-degree video camera; the Viola-Jones algorithm was used for face detection.

SLIDE 42

Feature extraction (audio)

[Diagram: utterance → frame conversion and spectral analysis → mel filter bank (band 1 ... band 8) → secondary spectral analysis per band]

Modulation spectrum features:
16 kHz sampling rate
8 bands, mel-scaled from 32 Hz to 8 kHz
50 Hz feature extraction frequency
200 ms analysis window (offset 20 ms)
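A rough NumPy/SciPy sketch of such a modulation-spectrum front end; the parameter values follow the slide, but the simple log-spaced band grouping stands in for a proper mel filter bank and the framing details are assumptions:

```python
# Modulation spectrum: short-time spectrogram -> 8 band envelopes -> secondary FFT
# over a 200 ms window every 20 ms (50 Hz feature rate).
import numpy as np
from scipy.signal import spectrogram

def modulation_spectrum(x, sr=16000, n_bands=8, win=0.2, hop=0.02):
    f, t, S = spectrogram(x, fs=sr, nperseg=int(0.025 * sr), noverlap=int(0.015 * sr))
    edges = np.geomspace(32.0, 8000.0, n_bands + 1)        # crude mel-like bands
    band_env = np.stack([S[(f >= lo) & (f < hi)].sum(axis=0)
                         for lo, hi in zip(edges[:-1], edges[1:])])
    frame_rate = 1.0 / (t[1] - t[0])
    w, h = int(win * frame_rate), int(hop * frame_rate)
    feats = []
    for start in range(0, band_env.shape[1] - w, h):
        seg = band_env[:, start:start + w]                 # 200 ms of band envelopes
        feats.append(np.abs(np.fft.rfft(seg, axis=1)).ravel())
    return np.array(feats)                                 # one feature vector per 20 ms
```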

SLIDE 43

Echo State Networks

[Diagram: K input neurons → W_in → reservoir of M neurons with recurrent weights W → W_out → L output neurons]

SLIDE 44

Echo State Networks

1

Given I/O training sequence (U(n), D(n))

2

Generate randomly the matrices (W in, W, W back), scaling the weight matrix W such that it’s maximum eingenvalue |λmax| ≤ 1.

3

Drive the network using the training I/O training data, by computing X(n + 1) = f(W inU(n + 1) + WX(n))

4

Collect at each time the state X(n) as a new row into a state collection matrix S, and collect similarly at each time the sigmoid-inverted teacher output tanh−1D(n) into a teacher collection matrix T.

5

Compute the pseudo inverse S+ of S and put W out = S+T
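A compact NumPy sketch of this recipe; the reservoir size, weight scaling and clipping of the teacher signal are illustrative assumptions, and the W_back feedback matrix is omitted because the state update above does not use it:

```python
# ESN training: random reservoir, drive with inputs, pseudo-inverse readout.
import numpy as np

def train_esn(U, D, n_res=100, seed=0):
    # U: (T, K) input sequence, D: (T, L) teacher sequence
    rng = np.random.default_rng(seed)
    W_in = rng.uniform(-0.5, 0.5, (n_res, U.shape[1]))
    W = rng.uniform(-0.5, 0.5, (n_res, n_res))
    W *= 0.95 / np.max(np.abs(np.linalg.eigvals(W)))            # |lambda_max| <= 1
    X = np.zeros(n_res)
    S, T_mat = [], []
    for n in range(U.shape[0]):
        X = np.tanh(W_in @ U[n] + W @ X)                        # X(n+1) = f(W_in U + W X)
        S.append(X.copy())
        T_mat.append(np.arctanh(np.clip(D[n], -0.999, 0.999)))  # tanh^-1 D(n)
    W_out = np.linalg.pinv(np.array(S)) @ np.array(T_mat)       # W_out = S+ T
    return W_in, W, W_out
```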

SLIDE 45

ESN output

[Figure: testing, teacher sequence vs. predicted sequence; target, ESN output and hard decision plotted over time]

SLIDE 46

Results of fused ESNs

[Diagram: audio network and video network outputs combined by post-processing]

ESN (audio)   13.4
ESN (video)   17.8
ESN (fused)    9.1

SLIDE 47

Summary

ANN and decomposition of learning tasks
MCS with static and adaptive fusion mappings
Multi-layer neural networks support flexible fusion schemes (early, mid-level, late, multi-level fusion)
Information fusion for human-computer interfaces (e.g. to estimate the mental/emotional state)
HMM or recurrent artificial neural nets to process sequential data
Unimodal and multimodal information fusion techniques
