IN5490 Advanced Topics in Artificial Intelligence for Intelligent Systems - PowerPoint PPT Presentation



SLIDE 1

IN5490 – Advanced Topics in Artificial Intelligence for Intelligent Systems

  • Md. Zia Uddin

16/10/2018

SLIDE 2

SLIDE 3

Principal Components Analysis

SLIDE 4

Principal Component Analysis (PCA)

PCA is a way of identifying patterns in data, and of expressing the data in such a way as to highlight their similarities and differences. It is a powerful tool for analyzing data.

Main advantage: compression of the data by reducing the number of dimensions, without much loss of information. This technique is used in image compression, as we will see later.

SLIDE 5

(Figure: data in original variables A and B, with principal components PC 1 and PC 2 overlaid.)

  • Orthogonal directions of greatest variance in data
  • Projections along PC1 discriminate the data most along any one axis

Principal Component Analysis (PCA)

SLIDE 6

Principal Components Analysis (PCA)

16.10.2017

SLIDE 7
  • First principal component is the direction of greatest variability (covariance) in the data
  • Second is the next orthogonal (uncorrelated) direction of greatest variability
  • So first remove all the variability along the first component, and then find the next direction of greatest variability
  • And so on …

Principal Component Analysis (PCA)

SLIDE 8

Principal Component Analysis (PCA)

SLIDE 9

Principal Components

SLIDE 10

Principal Components

SLIDE 11

Reconstruction Using PCA

SLIDE 12

Silhouettes

SLIDE 13

Top 150 eigenvalues (one per eigenvector)

Eigenvalues

SLIDE 14

Principal Components

SLIDE 15

1) Convert each image to a row vector
2) Calculate the mean
3) Subtract the mean
4) Calculate the covariance matrix
5) Eigenvalue decomposition
6) Choose the top eigenvectors based on their eigenvalues
7) Project each image vector into the PCA space

Principal Component Analysis (PCA) Steps
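The seven steps above can be sketched in NumPy (a minimal illustration; the function name and interface are my own, not from the slides):

```python
import numpy as np

def pca_project(images, k):
    """Project flattened images onto the top-k principal components.

    images: (n_samples, n_pixels) array, one image per row.
    Returns the projected data, the mean image, and the basis W.
    """
    X = images.astype(float)                 # 1) each image is a row vector
    mean = X.mean(axis=0)                    # 2) calculate the mean
    Xc = X - mean                            # 3) subtract the mean
    cov = np.cov(Xc, rowvar=False)           # 4) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # 5) eigenvalue decomposition
    order = np.argsort(eigvals)[::-1][:k]    # 6) top-k eigenvectors
    W = eigvecs[:, order]
    return Xc @ W, mean, W                   # 7) project to the PCA space
```

Note that `np.linalg.eigh` returns eigenvalues in ascending order, hence the reversal before selecting the top k.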

SLIDE 16

Linear Discriminant Analysis

SLIDE 17

▪ LDA seeks directions along which the classes are best separated. ▪ It takes into consideration the within-class scatter S_W but also the between-class scatter S_B. ▪ LDA computes a transformation that maximizes the between-class scatter while minimizing the within-class scatter. ▪ It can be solved as a generalized eigenvalue problem, where λ are the eigenvalues of S_W⁻¹S_B.

Linear Discriminant Analysis(LDA)

$$S_B = \sum_{i=1}^{c} J_i\,(m_i - m)(m_i - m)^T$$

$$S_W = \sum_{i=1}^{c}\ \sum_{m_k \in C_i} (m_k - m_i)(m_k - m_i)^T$$

where m is the overall mean, m_i and C_i are the mean and the sample set of class i, and J_i is the number of samples in class i.

$$D_{LDA} = \arg\max_D \frac{\left|D^T S_B D\right|}{\left|D^T S_W D\right|}$$

$$S_W^{-1} S_B\, D = \lambda D$$
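A minimal NumPy sketch of these scatter matrices and the generalized eigenproblem (the function name and the use of class sizes as the weights J_i are illustrative assumptions):

```python
import numpy as np

def lda_directions(X, y, k):
    """Top-k LDA directions for data X (n, d) with integer labels y."""
    m = X.mean(axis=0)
    d = X.shape[1]
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_B += len(Xc) * np.outer(mc - m, mc - m)   # between-class scatter
        S_W += (Xc - mc).T @ (Xc - mc)              # within-class scatter
    # generalized eigenproblem: S_W^{-1} S_B D = lambda D
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1][:k]
    return eigvecs[:, order].real
```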

SLIDE 18

3-D plot of LDA of the binary silhouettes of different activities.

(Figure: 3-D scatter of all activity binary silhouettes along LDC1, LDC2 and LDC3; classes: walking, running, skipping, right hand waving, both hand waving.)

Linear Discriminant Analysis(LDA)

SLIDE 19

Independent Components Analysis

SLIDE 20

What is ICA?

“Independent component analysis (ICA) is a method for finding underlying factors or components from multivariate (multi-dimensional) statistical data. What distinguishes ICA from other methods is that it looks for components that are both statistically independent and non-Gaussian.”

A. Hyvärinen, J. Karhunen, E. Oja, ‘Independent Component Analysis’

SLIDE 21

ICA

Blind Signal Separation (BSS) or Independent Component Analysis (ICA) is the identification & separation of mixtures of sources with little prior information.

  • Applications include:
  • Audio Processing
  • Medical data
  • Finance
  • Array processing (beamforming)
  • Coding
  • … and most applications where Factor Analysis and PCA are currently used.
  • While PCA seeks directions that represent the data best in a Σ|x₀ − x|² sense, ICA seeks directions that are most independent from each other.

Often used on time-series separation of multiple targets.

SLIDE 22

The simple “Cocktail Party” Problem

(Figure: sources s₁, s₂ are mixed by matrix A into observations x₁, x₂.)

x = As, with n sources and m = n observations.

SLIDE 23

ICA

(Figure: the observed mixed signals and the original source signals V1–V4.)
SLIDE 24

Motivation

(Figure: two independent sources, and the mixture observed at two microphones.)

The coefficients a_ij depend on the distances of the microphones from the speakers:

$$x_1(t) = a_{11} s_1 + a_{12} s_2$$
$$x_2(t) = a_{21} s_1 + a_{22} s_2$$

SLIDE 25

ICA Model

  • Use a statistical “latent variables“ system
  • Random variable s_k instead of a time signal
  • x_j = a_{j1} s_1 + a_{j2} s_2 + … + a_{jn} s_n, for all j

x = As

  • The ICs s are latent variables and unknown, AND the mixing matrix A is also unknown
  • Task: estimate A and s using only the observable random vector x
  • Let’s assume that the number of ICs equals the number of observable mixtures, and that A is square and invertible
  • So after estimating A, we can compute W = A⁻¹ and hence

s = Wx = A⁻¹x

SLIDE 26

Illustration of ICA with 2 signals

(Figure: original sources s₁, s₂ and the mixed signals x₁, x₂.)

$$x_1(t) = a_{11} s_1(t) + a_{12} s_2(t)$$
$$x_2(t) = a_{21} s_1(t) + a_{22} s_2(t), \qquad t = 1, \ldots, T$$

Step 1: Sphering. Step 2: Rotation.

SLIDE 27

ICA

Fixed-Point Algorithm

Input: X
Random initialization of W
Iterate until convergence:

$$S = WX$$
$$W \leftarrow g(S)\,X^T$$
$$W \leftarrow (W W^T)^{-1/2}\, W$$

Output: W, S
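A sketch of this fixed-point iteration in the FastICA style, assuming whitened input X. The tanh nonlinearity and its derivative correction term follow the standard symmetric FastICA update, which the g(S)Xᵀ step above abbreviates; all names are illustrative:

```python
import numpy as np

def fastica(X, n_iter=200, seed=0):
    """Minimal symmetric FastICA sketch (tanh nonlinearity).

    X: (n_components, n_samples) whitened data (zero mean, unit covariance).
    Returns the unmixing matrix W and the estimated sources S = W @ X.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.normal(size=(n, n))              # random initialization of W
    for _ in range(n_iter):
        S = W @ X
        G = np.tanh(S)                       # g(S)
        g_prime = 1.0 - G ** 2               # g'(S)
        W = (G @ X.T) / m - np.diag(g_prime.mean(axis=1)) @ W
        # symmetric decorrelation: W <- (W W^T)^{-1/2} W
        eigvals, eigvecs = np.linalg.eigh(W @ W.T)
        W = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T @ W
    return W, W @ X
```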

SLIDE 28

Basic steps of ICA

  • Collect the data matrix
  • Whitening, using the eigenvector and eigenvalue matrices of the data covariance
  • Initialize the un-mixing matrix W randomly
  • Apply the iterative procedure with each vector of the un-mixing matrix W on Y to approximate the corresponding basis S until it converges

Enhanced ICA

▪ Apply PCA first. ▪ Apply ICA on the PCs. ▪ Project the silhouette features onto the IC feature space.

ICA Model

SLIDE 29

ICs

SLIDE 30

▪ The ICA looks for statistically independent basis images. ▪ ICA focuses on the local feature information.

Ten ICs from all activity silhouettes

ICA on Binary Silhouettes

All activity binary silhouettes


SLIDE 31

Solve the pixel correspondence problem

– given a pixel in I_{t−1}, look for nearby pixels of the same color in I_t

Key assumptions

– color constancy: a point in I_{t−1} looks the same in I_t. For grayscale images, this is brightness constancy.

Optical Flows

How do we estimate pixel motion from image I_{t−1} to image I_t?

I(x, y, t) = I(x + u, y + v, t − 1)

with displacement (u, v): pixel (x, y) moves to (x + u, y + v).

SLIDE 32

Once the optical flows of the silhouettes from two consecutive activity frames are obtained, the flow region is divided into 256 sub-blocks, each of size 4×4, and the average flow vector of each sub-block is computed as below. The flows are augmented and represented as a single vector. Finally, the averaged optical-flow features are extended by PCA and LDA.

Optical Flow Features

$$K_p = \frac{1}{n} \sum_{(i,j)\,\in\,p\text{-th sub-block}} \left(K_{px}^{ij},\, K_{py}^{ij}\right), \qquad n = 16,\ 1 \le p \le 256$$

$$\left[K_1, K_2, \ldots, K_{256}\right]$$

(Figure: sample optical flows from two (a) walking and (b) running frames, and a 3-D plot of LDA on the optical flows of different activities: walking, running, skipping, sitting down, standing up.)
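The sub-block averaging step can be sketched as follows (a minimal illustration; the function name and the (H, W, 2) flow layout are assumptions):

```python
import numpy as np

def block_flow_features(flow, blocks=16):
    """Average optical flow per sub-block, concatenated into one vector.

    flow: (H, W, 2) array of per-pixel (u, v) flow vectors. The frame is
    divided into blocks x blocks sub-blocks (16 x 16 = 256 for a 64 x 64
    frame, each 4x4), and the mean flow of each sub-block is kept.
    """
    H, W, _ = flow.shape
    bh, bw = H // blocks, W // blocks
    feats = []
    for i in range(blocks):
        for j in range(blocks):
            sub = flow[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            feats.append(sub.reshape(-1, 2).mean(axis=0))
    return np.concatenate(feats)  # length 2 * blocks^2
```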

SLIDE 33

▪ LBP features are local binary patterns based on the intensity values of the pixels surrounding a center pixel. The LBP pattern at a given pixel (x_c, y_c) can be represented as an ordered set of binary comparisons:

▪ where g_c and g_i represent the intensity of the given pixel and the intensities of the surrounding pixels, respectively.

Local Binary Pattern (LBP)

$$LBP(x_c, y_c) = \sum_{i=0}^{7} f(g_i - g_c)\, 2^i, \qquad f(a) = \begin{cases} 1, & a \ge 0 \\ 0, & a < 0 \end{cases}$$

SLIDE 34

(Figure: LBP operator example: the 3×3 neighborhood of intensity values is thresholded against the center pixel, producing the binary pattern 11011110 = 222.)

LBP Operator

Local Binary Pattern (LBP)
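The thresholding in the operator above can be sketched for a single 3×3 patch (the clockwise bit ordering is an illustrative convention, not necessarily the one used on the slide):

```python
import numpy as np

def lbp_code(patch):
    """LBP code of a 3x3 patch: compare the 8 neighbors against the
    center pixel and weight the resulting bits by 2^i."""
    center = patch[1, 1]
    # neighbors clockwise, starting at the top-left corner
    idx = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    bits = [1 if patch[r, c] >= center else 0 for r, c in idx]
    return sum(b << i for i, b in enumerate(bits))
```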

SLIDE 35

A depth activity image is divided into small regions and the regions’ LBP histograms are concatenated to represent the features for one image.

LBP Features

▪ To reduce the high dimensionality, PCA is applied on the LBP features.

Local Binary Pattern (LBP)

slide-36
SLIDE 36

▪ The Local Directional Pattern (LDP) assigns an eight-bit binary code to each pixel of an input depth image. ▪ The Kirsch edge detector detects the edges considering all eight neighbors. ▪ Given a central pixel in the image, the eight directional edge response values {mk}, k=0,1,..,7 are computed by Kirsch masks Mk in eight different orientations centered on its position.

Local Directional Pattern (LDP)


SLIDE 37

Kirsch edge masks in eight directions:

M0 (east):    M1 (north-east):  M2 (north):   M3 (north-west):
[-3 -3  5]    [-3  5  5]        [ 5  5  5]    [ 5  5 -3]
[-3  0  5]    [-3  0  5]        [-3  0 -3]    [ 5  0 -3]
[-3 -3  5]    [-3 -3 -3]        [-3 -3 -3]    [-3 -3 -3]

M4 (west):    M5 (south-west):  M6 (south):   M7 (south-east):
[ 5 -3 -3]    [-3 -3 -3]        [-3 -3 -3]    [-3 -3 -3]
[ 5  0 -3]    [ 5  0 -3]        [-3  0 -3]    [-3  0  5]
[ 5 -3 -3]    [ 5  5 -3]        [ 5  5  5]    [-3  5  5]

SLIDE 38

▪ It is interesting to know the p most prominent directions in order to generate the LDP feature for a pixel.

▪ Here, the top-p directional bit responses b_k are set to 1; the remaining bits of the 8-bit LDP pattern are set to 0. ▪ The Local Directional Pattern (LDP) thus assigns an eight-bit binary code to each pixel of an input depth image.

$$LDP_p = \sum_{k=0}^{7} B_k\,(m_k - m_p)\, 2^k, \qquad B_k(a) = \begin{cases} 1, & a \ge 0 \\ 0, & a < 0 \end{cases}$$

where m_p is the p-th most prominent edge response.

Local Directional Pattern (LDP)

SLIDE 39

(Figure: the eight directional edge responses m₀ … m₇ around the center pixel X, and the corresponding LDP binary bit positions B₀ … B₇.)

Local Directional Pattern (LDP)

SLIDE 40

LDP feature example for a pixel, considering the top 4 positions: the 3×3 intensity neighborhood is convolved with the Kirsch masks {M_i} to obtain the eight edge responses m_k, and the four most prominent responses receive bit 1.

(Example 1) LDP Binary Code = 00010011, LDP Decimal Code = 19
(Example 2) LDP Binary Code = 00011011, LDP Decimal Code = 27

Local Directional Pattern (LDP)
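Picking the top-p responses can be sketched as follows (the mapping of response index k to bit position is an illustrative convention, which is why the codes below need not match the slide's examples):

```python
import numpy as np

def ldp_code(responses, top=4):
    """8-bit LDP code: set bit k to 1 for the `top` largest Kirsch
    edge responses m_k, and to 0 otherwise."""
    m = np.asarray(responses)
    prominent = np.argsort(m)[::-1][:top]   # indices of the top-p responses
    code = 0
    for k in prominent:
        code |= 1 << int(k)
    return code
```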

SLIDE 41


A depth expression image is divided into small regions and the regions’ LDP histograms are concatenated to represent features for one image

LDP Features

Local Directional Pattern (LDP)

SLIDE 42


The image textural feature is represented by the histogram of the LDP map, whose bins can be defined as follows, where n = 256 normally for an image I:

$$T_q = \sum_{x,y} I\big(LDP(x, y) = q\big), \qquad q = 0, 1, \ldots, n - 1$$

The histogram of the LDP map for a region is then

$$H = (T_0, T_1, \ldots, T_{n-1}).$$

Finally, the whole LDP feature F is expressed as a concatenated sequence of the histograms of all regions, where s is the number of regions:

$$F = (H_1, H_2, \ldots, H_s)$$

Local Directional Pattern (LDP)
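The per-region histogram concatenation can be sketched as (the function name and the regions × regions grid parameterization are assumptions):

```python
import numpy as np

def ldp_feature(ldp_map, regions=4, n_bins=256):
    """Concatenate per-region histograms of an LDP code map
    (the map is split into a regions x regions grid)."""
    H, W = ldp_map.shape
    rh, rw = H // regions, W // regions
    feats = []
    for i in range(regions):
        for j in range(regions):
            sub = ldp_map[i * rh:(i + 1) * rh, j * rw:(j + 1) * rw]
            hist, _ = np.histogram(sub, bins=n_bins, range=(0, n_bins))
            feats.append(hist)
    return np.concatenate(feats)
```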

SLIDE 43

Support Vector Machines (SVM): Background

SVM is used for extreme classification cases.

(Figure: cat vs. dog classification example.)

SLIDE 44
  • Intro. to Support Vector Machines (SVM)
  • Properties of SVM
  • Applications

➢ Gene Expression Data Classification
➢ Text Categorization (if time permits)

  • Discussion

Support Vector Machines (SVM)

SLIDE 45

Linear Classifiers

f(x, w, b) = sign(w·x + b)

How would you classify this data?

(Figure: labeled points with w·x + b < 0 on one side of the line and w·x + b > 0 on the other.)

SLIDE 46

Maximum Margin

(Figure: linearly separable points labeled +1 and −1.)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM).

Support Vectors are those datapoints that the margin pushes up against.

SLIDE 47

◼ Goal:

1) Correctly classify all training data:

w·x_i + b ≥ 1 if y_i = +1
w·x_i + b ≤ −1 if y_i = −1
i.e. y_i(w·x_i + b) ≥ 1 for all i

2) Maximize the margin M = 2/‖w‖, which is the same as minimizing Φ(w) = ½ wᵀw

We can formulate a Quadratic Optimization Problem and solve for w and b:

◼ Minimize Φ(w) = ½ wᵀw subject to y_i(w·x_i + b) ≥ 1 for all i

Support Vector Machines (SVM)
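As a toy stand-in for solving this quadratic program, the same max-margin objective can be approximately minimized by subgradient descent on the hinge loss (an illustrative sketch, not the QP solver implied by the slide; all names are my own):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200, seed=0):
    """Toy linear SVM via subgradient descent on the hinge loss.

    X: (n, d) data, y: labels in {-1, +1}. Returns weights w and bias b.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:                        # point violates the margin
                w = (1 - lr * lam) * w + lr * y[i] * X[i]
                b += lr * y[i]
            else:                                 # only shrink w (regularize)
                w = (1 - lr * lam) * w
    return w, b
```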

SLIDE 48

Non-linear SVMs

◼ Datasets that are linearly separable with some noise work out great.

◼ But what are we going to do if the dataset is just too hard? ◼ How about mapping the data to a higher-dimensional space?

(Figure: 1-D data x that is not linearly separable becomes separable after mapping x → (x, x²).)

SLIDE 49

Non-linear SVMs: Feature spaces

◼ General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

SLIDE 50

Binary to multiclass

  • One-vs-all
  • All-vs-all


SLIDE 51

1. One-vs-all classification

  • Assumption: Each class is individually separable from all the others
  • Learning: Given a dataset D = {<x_i, y_i>}, note x_i ∈ ℝⁿ, y_i ∈ {1, 2, …, K}
  • Decompose into K binary classification tasks
  • For class k, construct a binary classification task as:
  • Positive examples: Elements of D with label k
  • Negative examples: All other elements of D
  • Train K binary classifiers w_1, w_2, …, w_K using any learning algorithm we have seen
  • Decision: arg max_i w_iᵀx
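The decomposition can be sketched with a perceptron as the per-class binary learner (the choice of learner and all names are illustrative):

```python
import numpy as np

def perceptron(X, t, epochs=50):
    """Simple perceptron used as the binary learner for each class."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, ti in zip(X, t):
            if ti * (w @ xi) <= 0:   # misclassified: update
                w = w + ti * xi
    return w

def one_vs_all_train(X, y, K):
    """Train K binary classifiers: class k (positive) vs. the rest (negative)."""
    return [perceptron(X, np.where(y == k, 1, -1)) for k in range(K)]

def one_vs_all_predict(ws, x):
    """Winner-take-all decision: arg max_k w_k^T x."""
    return int(np.argmax([w @ x for w in ws]))
```

Including a constant bias feature in X lets each per-class separator have an offset.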

SLIDE 52

Visualizing One-vs-all

From the full dataset, construct three binary classifiers, one for each class:

w_blueᵀx > 0 for blue inputs
w_redᵀx > 0 for orange inputs
w_greenᵀx > 0 for gray inputs

Winner Take All will predict the right answer: only the correct label will have a positive score. (Notation: w_blueᵀx is the score for the blue label.)

SLIDE 53

One-vs-all may not always work

Black points are not separable with a single binary classifier; the decomposition will not work for these cases!

w_blueᵀx > 0 for blue inputs
w_redᵀx > 0 for orange inputs
w_greenᵀx > 0 for gray inputs
??? for black inputs

SLIDE 54

2. All-vs-all classification

  • Assumption: Every pair of classes is separable
  • Learning: Given a dataset D = {<x_i, y_i>}, note x_i ∈ ℝⁿ, y_i ∈ {1, 2, …, K}
  • For every pair of labels (j, k), create a binary classifier with:
  • Positive examples: All examples with label j
  • Negative examples: All examples with label k
  • Train K(K − 1)/2 classifiers in all
  • Prediction: More complex, each label gets K − 1 votes
  • How to combine the votes? Many methods:
  • Majority: Pick the label with maximum votes
  • Organize a tournament between the labels

SLIDE 55

Support Vector Machines (SVM): SVM Examples

The SVM trained on a linearly separable dataset (top row) and on a dataset that needs two straight lines to separate in 2D (bottom row), with the linear kernel (left), the polynomial kernel of degree 3 (middle), and the RBF kernel (right).

SLIDE 56

Convolutional Neural Network (CNN)

  • We know it is good to learn a small model.
  • From this fully connected model, do we really need all the edges?
  • Can some of these be shared?
SLIDE 57

A Convolutional Layer

A filter

A CNN is a neural network with some convolutional layers (and some other layers). A convolutional layer has a number of filters that perform the convolution operation.

Beak detector

SLIDE 58

Convolution

(Figure: a 6×6 binary image convolved with two 3×3 filters.)

Filter 1:
 1 −1 −1
−1  1 −1
−1 −1  1

Filter 2:
−1  1 −1
−1  1 −1
−1  1 −1

These are the network parameters to be learned. Each filter detects a small pattern (3 x 3).
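The sliding-window operation can be sketched as follows (a minimal single-channel "valid" convolution; as usual in CNNs, it is actually cross-correlation, and the function name is my own):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """'Valid' 2-D convolution of a single-channel image with one filter."""
    H, W = image.shape
    kh, kw = kernel.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = (patch * kernel).sum()   # dot product at this position
    return out
```

With stride 1, a 6×6 image and a 3×3 filter give a 4×4 output.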

SLIDE 59

(Figure: filter 1 slides over the 6×6 image with stride = 1; each placement computes a dot product between the 3×3 filter and the image patch, e.g. 3 at the first position and −1 at the next.)

Convolution

SLIDE 60

(Figure: convolution viewed against a fully connected layer. The 6×6 image is flattened into a 36-dimensional vector; each convolution output value, e.g. the 3 produced at the first filter position, is a neuron connected to only 9 of the 36 inputs, and the filter weights are shared across neurons.)

Fully-connected

Convolution & Fully Connected

SLIDE 61

Fully Connected Feedforward network

cat dog ……

Convolution Max Pooling Convolution Max Pooling Flattened

Can repeat many times

SLIDE 62

Max Pooling

(Figure: the 4×4 feature maps produced by filter 1 and filter 2 are each divided into 2×2 regions; max pooling keeps the maximum of each region.)
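2×2 max pooling with stride 2 can be sketched as (a minimal sketch for one feature map; even height and width are assumed):

```python
import numpy as np

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a single (H, W) feature map."""
    H, W = fmap.shape
    # group pixels into 2x2 blocks, then take the max of each block
    return fmap.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))
```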

SLIDE 63

(Figure: after convolution and 2×2 max pooling, the 6×6 image becomes a 2×2 image.)

Each filter is a channel. New image but smaller. Conv → Max Pooling.

Max Pooling

SLIDE 64

Convolution Max Pooling Convolution Max Pooling

Can repeat many times A new image

The number of channels is the number of filters Smaller than the original image


SLIDE 65

Fully Connected Feedforward network

cat dog ……

Convolution Max Pooling Convolution Max Pooling

Flattened A new image A new image

SLIDE 66

Flattening

(Figure: the pooled feature maps are flattened into a single vector and fed to a fully connected feedforward network.)

Fully Connected Feedforward network

SLIDE 67

Fully Connected Feedforward network

cat dog ……

Convolution Max Pooling Convolution Max Pooling Flattened

Can repeat many times

SLIDE 68
  • Gradient-Based Learning Applied to Document Recognition - Y. LeCun, L. Bottou, Y. Bengio, P. Haffner; 1998

  • Helped establish how we use CNNs today
  • Replaced manual feature extraction

[LeCun et al., 1998]

LeNet-5

SLIDE 69
  • ImageNet Classification with Deep Convolutional Neural Networks - Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton; 2012
  • Facilitated by GPUs, a highly optimized convolution implementation, and large datasets (ImageNet)
  • Has 60 million parameters, compared to the 60k parameters of LeNet-5

[Krizhevsky et al., 2012]

AlexNet

SLIDE 70

AlexNet

227×227×3 → conv 11×11, s=4, P=0 → 55×55×96 → max pool 3×3, s=2 → 27×27×96 → conv 5×5, s=1, P=2 → 27×27×256 → max pool 3×3, s=2 → 13×13×256 → conv 3×3, s=1, P=1 → 13×13×384 → conv 3×3, s=1, P=1 → 13×13×384 → conv 3×3, s=1, P=1 → 13×13×256 → max pool 3×3, s=2 → 6×6×256 [Krizhevsky et al., 2012]

This slide is taken from Andrew Ng.

Architecture: CONV1, MAX POOL1, NORM1, CONV2, MAX POOL2, NORM2, CONV3, CONV4, CONV5, MAX POOL3, FC6, FC7, FC8

SLIDE 71

AlexNet

[Krizhevsky et al., 2012]

SLIDE 72

AlexNet

SLIDE 73

AlexNet

AlexNet was the coming-out party for CNNs in the computer vision community. This was the first time a model performed so well on the historically difficult ImageNet dataset. The paper illustrated the benefits of CNNs and backed them up with record-breaking performance in the competition.

[Krizhevsky et al., 2012]

SLIDE 74

GoogleNet

  • 22 layers
  • Efficient “Inception” module: strayed from the general approach of simply stacking conv and pooling layers on top of each other in a sequential structure
  • No FC layers
  • Only 5 million parameters!

[Szegedy et al., 2014]

SLIDE 75

GoogleNet

Introduced the idea that CNN layers don’t always have to be stacked up sequentially. Coming up with the Inception module, the authors showed that a creative structuring of layers can lead to improved performance and computational efficiency.

[Szegedy et al., 2014]

SLIDE 76

ResNet

  • Deep Residual Learning for Image Recognition - Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun; 2015
  • Extremely deep network: 152 layers
  • Deeper neural networks are more difficult to train
  • Deep networks suffer from vanishing and exploding gradients
  • Presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously

[He et al., 2015]


SLIDE 77

ResNet

  • ILSVRC’15 classification winner (3.57% top-5 error; humans generally hover around a 5-10% error rate). Swept all classification and detection competitions in ILSVRC’15 and COCO’15!

Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.

[He et al., 2015]

SLIDE 78

ResNet

  • Hypothesis: The problem is an optimization problem; very deep networks are harder to optimize.
  • Solution: Use network layers to fit a residual mapping instead of directly trying to fit a desired underlying mapping.
  • We will use skip connections, allowing us to take the activation from one layer and feed it into another layer much deeper in the network.
  • Use layers to fit the residual F(x) = H(x) − x instead of H(x) directly.

Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.

[He et al., 2015]

SLIDE 79

ResNet

Residual Block: input x goes through a conv-relu-conv series and gives us F(x). That result is then added to the original input x; call that H(x) = F(x) + x. In traditional CNNs, H(x) would just be equal to F(x). So, instead of just computing that transformation (straight from x to F(x)), we compute the term F(x) that we have to add to the input x.

[He et al., 2015]
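The residual block H(x) = F(x) + x can be sketched for a single-channel map (a minimal illustration with 'same' 3×3 convolutions; function names are my own). Note that with zero weights the block reduces to the identity, which is exactly why residual mappings are easy to optimize:

```python
import numpy as np

def conv3x3(x, w):
    """'Same' 3x3 convolution of a single-channel map x with kernel w (zero padding)."""
    xp = np.pad(x, 1)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (xp[i:i + 3, j:j + 3] * w).sum()
    return out

def residual_block(x, w1, w2):
    """H(x) = F(x) + x, with F = conv -> relu -> conv."""
    f = conv3x3(np.maximum(conv3x3(x, w1), 0.0), w2)  # conv-relu-conv gives F(x)
    return f + x                                      # skip connection adds x
```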

SLIDE 80

ResNet

Full ResNet architecture:

  • Stack residual blocks
  • Every residual block has two 3x3 conv layers
  • Periodically, double the number of filters and downsample spatially using stride 2 (in each dimension)
  • Additional conv layer at the beginning
  • No FC layers at the end (only an FC-1000 to the output classes)
  • Total depths of 34, 50, 101, or 152 layers for ImageNet

[He et al., 2015]

Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.

SLIDE 81

ResNet

The best CNN architecture that we currently have, and a great innovation through the idea of residual learning.

[He et al., 2015]

SLIDE 82

Emotion Recognition via CNN (2 Classes)

SLIDE 83

Emotion Recognition via CNN (4 Classes)

SLIDE 84