IN5490 – Advanced Topics in Artificial Intelligence for Intelligent Systems
- Md. Zia Uddin
16/10/2018
PCA is a way of identifying patterns in data and expressing the data in such a way as to highlight their similarities and differences. It is a powerful tool for analyzing data. Main advantage: compression of the data by reducing the number of dimensions, without much loss of information. This technique is used in image compression, as we will see later.
Figure: PC 1 is the direction of greatest variability in the data, and PC 2 is the next direction of greatest variability (axes: Original Variable A, Original Variable B). Figure: top 150 eigenvalues of the eigenvectors.
1) Convert each image to a row vector
2) Calculate the mean
3) Subtract the mean
4) Calculate the covariance matrix
5) Eigenvalue decomposition
6) Choose the top eigenvectors based on their eigenvalues
7) Project each image vector into the PCA space
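The steps above can be sketched in pure Python for 2-D data, using the closed-form eigendecomposition of the 2x2 covariance matrix; the sample points are hypothetical, chosen only for illustration:

```python
import math

def pca_2d(points):
    n = len(points)
    # 2) calculate the mean
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # 3) subtract the mean
    centered = [(x - mx, y - my) for x, y in points]
    # 4) covariance matrix [[cxx, cxy], [cxy, cyy]] with 1/(n-1)
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)
    # 5) eigenvalues of the symmetric 2x2 covariance matrix
    tr, det = cxx + cyy, cxx * cyy - cxy * cxy
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    l1, l2 = tr / 2 + disc, tr / 2 - disc      # l1 >= l2
    # 6) eigenvector for the top eigenvalue l1
    if abs(cxy) > 1e-12:
        v = (l1 - cyy, cxy)
    else:
        v = (1.0, 0.0) if cxx >= cyy else (0.0, 1.0)
    norm = math.hypot(*v)
    pc1 = (v[0] / norm, v[1] / norm)
    # 7) project each centered point onto PC 1
    scores = [x * pc1[0] + y * pc1[1] for x, y in centered]
    return l1, l2, pc1, scores

points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2),
          (3.1, 3.0), (2.3, 2.7), (2.0, 1.6), (1.0, 1.1),
          (1.5, 1.6), (1.1, 0.9)]
l1, l2, pc1, scores = pca_2d(points)
```

The variance of the projections onto PC 1 equals the top eigenvalue, which is exactly the "greatest variability" property described above.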
▪ LDA seeks directions along which the classes are best separated.
▪ It takes into consideration the within-class scatter S_W as well as the between-class scatter S_B.
▪ LDA computes a transformation that maximizes the between-class scatter while minimizing the within-class scatter:

S_B = Σ_{i=1..c} J_i (m_i − m)(m_i − m)^T

S_W = Σ_{i=1..c} Σ_{m_k ∈ C_i} (m_k − m_i)(m_k − m_i)^T

D_LDA = arg max_D |D^T S_B D| / |D^T S_W D|

▪ It can be solved by S_W^{-1} S_B D = λD, where λ holds the eigenvalues of S_W^{-1} S_B.
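A minimal two-class, 2-D sketch of this idea is Fisher's discriminant, whose optimal direction is w = S_W^{-1}(m_1 − m_2); below it is computed in pure Python with a closed-form 2x2 inverse, and the data points are hypothetical:

```python
def mean2(pts):
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def scatter(pts, m):
    # one class's contribution to S_W: sum (x - m)(x - m)^T
    sxx = sum((x - m[0]) ** 2 for x, _ in pts)
    syy = sum((y - m[1]) ** 2 for _, y in pts)
    sxy = sum((x - m[0]) * (y - m[1]) for x, y in pts)
    return [[sxx, sxy], [sxy, syy]]

def lda_direction(c1, c2):
    m1, m2 = mean2(c1), mean2(c2)
    s1, s2 = scatter(c1, m1), scatter(c2, m2)
    sw = [[s1[0][0] + s2[0][0], s1[0][1] + s2[0][1]],
          [s1[1][0] + s2[1][0], s1[1][1] + s2[1][1]]]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    d = (m1[0] - m2[0], m1[1] - m2[1])
    # w = S_W^{-1} (m1 - m2)
    return (inv[0][0] * d[0] + inv[0][1] * d[1],
            inv[1][0] * d[0] + inv[1][1] * d[1])

class1 = [(4.0, 2.0), (4.2, 2.1), (3.9, 2.2), (4.1, 1.9)]
class2 = [(1.0, 1.0), (1.2, 1.1), (0.9, 0.8), (1.1, 1.2)]
w = lda_direction(class1, class2)

# Projections of the two classes onto w should be well separated.
proj1 = [w[0] * x + w[1] * y for x, y in class1]
proj2 = [w[0] * x + w[1] * y for x, y in class2]
```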
Figure: 3-D plot of LDA of the binary silhouettes of different activities (axes LDC1, LDC2, LDC3; classes: walking, running, skipping, right-hand waving, both-hand waving). Figure: all activity binary silhouettes.
“Independent component analysis (ICA) is a method for finding underlying factors or components from multivariate (multi-dimensional) statistical data. What distinguishes ICA from other methods is that it looks for components that are both statistically independent and non-Gaussian.” A. Hyvärinen, J. Karhunen, E. Oja, ‘Independent Component Analysis’
Blind Signal Separation (BSS), or Independent Component Analysis (ICA), is the identification and separation of mixtures of sources with little prior information.
ICA seeks directions that are most independent from each other. It is often used for time-series separation of multiple targets.
Sources s1, s2 pass through a mixing matrix A to give observations x1, x2: x = As, with n sources and m = n observations. For two microphones recording two speakers:

x1(t) = a11 s1(t) + a12 s2(t)
x2(t) = a21 s1(t) + a22 s2(t),  t = 1, ..., T

The coefficients aij depend on the distances of the microphones from the speakers. Figure: two independent sources and their mixture at the two mics. Since x = As, and A is square and invertible, the sources can be recovered as s = Wx = A⁻¹x.
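The mixing model above can be sketched directly: mix two sources with a known A and recover them with W = A⁻¹. The mixing coefficients and source samples are hypothetical:

```python
A = [[0.8, 0.3],   # a11, a12
     [0.2, 0.7]]   # a21, a22

def mix(M, s):
    # matrix-vector product for the 2x2 case
    return [M[0][0] * s[0] + M[0][1] * s[1],
            M[1][0] * s[0] + M[1][1] * s[1]]

def inverse_2x2(M):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

# hypothetical source samples (s1(t), s2(t)) for three time steps
sources = [(1.0, 0.0), (0.0, 1.0), (0.5, -0.5)]
observations = [mix(A, s) for s in sources]   # x = As

W = inverse_2x2(A)                            # unmixing matrix
recovered = [mix(W, x) for x in observations] # s = Wx = A^{-1} x
```

In real ICA, of course, A is unknown; the whole point of the algorithms that follow is to estimate W from the statistics of x alone.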
Step 1: Sphering. Step 2: Rotation. Figure: original sources s, the mixed signals, and the mixing directions a1, a2.
Fixed Point Algorithm (FastICA)
Input: X
Random init of W
Iterate until convergence:
S = WX
W = g(S) Xᵗ
W = (W Wᵗ)^(−1/2) W
Output: W, S
Basic steps of ICA: iteratively update the unmixing matrix W on the observations to approximate the corresponding basis S until it converges.
Enhanced ICA
▪ Apply PCA first.
▪ Apply ICA on the PCs.
▪ Project the silhouette features onto the IC feature space.
▪ ICA looks for statistically independent basis images.
▪ ICA focuses on the local feature information.
Figure: ten ICs from all activity silhouettes. Figure: all activity binary silhouettes.
Solve the pixel correspondence problem:
– given a pixel in I_{t−1}, look for nearby pixels of the same color in I_t.
Key assumptions:
– color constancy: a point in I_{t−1} looks the same in I_t (for grayscale images, this is brightness constancy).
How to estimate pixel motion from image I_{t−1} to image I_t?
Brightness constancy: I(x, y, t) = I(x + u, y + v, t − 1), where (u, v) is the displacement.
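A brute-force sketch of this correspondence idea: synthesize a second frame shifted by a known (u, v), then recover the displacement by searching for the shift with the lowest squared brightness difference. The 4x5 grayscale frames are hypothetical:

```python
prev = [[0, 0, 0, 0, 0],
        [0, 9, 9, 0, 0],
        [0, 9, 9, 0, 0],
        [0, 0, 0, 0, 0]]

# synthesize the next frame using I(x, y, t) = I(x+u, y+v, t-1)
true_u, true_v = 1, 1
h, w = len(prev), len(prev[0])
curr = [[prev[y + true_v][x + true_u]
         if 0 <= x + true_u < w and 0 <= y + true_v < h else 0
         for x in range(w)] for y in range(h)]

def best_displacement(prev, curr, max_d=2):
    # exhaustive search over small displacements, minimizing the
    # mean squared brightness-constancy violation
    h, w = len(prev), len(prev[0])
    best, best_err = (0, 0), float("inf")
    for u in range(-max_d, max_d + 1):
        for v in range(-max_d, max_d + 1):
            err, n = 0, 0
            for y in range(h):
                for x in range(w):
                    if 0 <= x + u < w and 0 <= y + v < h:
                        err += (curr[y][x] - prev[y + v][x + u]) ** 2
                        n += 1
            if n and err / n < best_err:
                best_err, best = err / n, (u, v)
    return best

u, v = best_displacement(prev, curr)
```

Real optical-flow methods (e.g. Lucas-Kanade) replace this exhaustive search with a gradient-based solution, but the objective is the same brightness-constancy residual.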
Once the optical flows of the silhouettes from two consecutive activity frames are obtained, the flow region is divided into 256 sub-blocks, and the average flow vector of each sub-block is computed (n = 16 flow vectors per sub-block):

K_p = (1/n) Σ_{(i,j) ∈ p-th sub-block} (K_px(i,j), K_py(i,j)),  1 ≤ p ≤ 256

The flows are augmented and represented as K = (K_1, K_2, ..., K_256). Finally, the averaged optical-flow features are extended by PCA and LDA.
Figure: 3-D plot of LDA on the optical flows of different activities (classes: walking, running, skipping, sitting down, standing up). Sample optical flows from two (a) walking and (b) running frames.
▪ LBP features are local binary patterns based on the intensity values of the surrounding pixels of a center pixel. The LBP pattern at a given pixel (x_c, y_c) can be represented as:

LBP(x_c, y_c) = Σ_{i=0..7} f(g_i − g_c) 2^i,  f(a) = 1 if a ≥ 0, 0 otherwise

▪ where g_c and g_i represent the intensity of the given pixel and the intensities of the surrounding pixels.
Figure: LBP operator example. Thresholding a 3×3 neighborhood (pixel values 26, 85, 53, 60, 45, 41, 43, 25, 101) against its center yields the binary code 11011110 = 222.
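The LBP operator above can be sketched in a few lines; the 3x3 patch and the clockwise neighbor ordering are illustrative choices (implementations differ on where bit 0 starts):

```python
def lbp_code(patch):
    gc = patch[1][1]  # center pixel g_c
    # eight neighbors g_i, clockwise from top-left
    neighbors = [patch[0][0], patch[0][1], patch[0][2],
                 patch[1][2], patch[2][2], patch[2][1],
                 patch[2][0], patch[1][0]]
    # f(g_i - g_c) = 1 if g_i >= g_c, weighted by 2^i
    return sum((1 if g >= gc else 0) << i for i, g in enumerate(neighbors))

patch = [[90, 100,  70],
         [50,  80, 120],
         [60,  85,  40]]
code = lbp_code(patch)
```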
A depth activity image is divided into small regions, and the regions' LBP histograms are concatenated to represent the features for one image.
LBP Features
▪ To reduce the high dimensionality, PCA is applied on the LBP features.
▪ The Local Directional Pattern (LDP) assigns an eight-bit binary code to each pixel of an input depth image. ▪ The Kirsch edge detector detects the edges considering all eight neighbors. ▪ Given a central pixel in the image, the eight directional edge response values {mk}, k=0,1,..,7 are computed by Kirsch masks Mk in eight different orientations centered on its position.
Kirsch edge masks in eight directions (M0–M7):
M0 (east): [−3 −3 5; −3 0 5; −3 −3 5]
M1 (northeast): [−3 5 5; −3 0 5; −3 −3 −3]
M2 (north): [5 5 5; −3 0 −3; −3 −3 −3]
M3 (northwest): [5 5 −3; 5 0 −3; −3 −3 −3]
M4 (west): [5 −3 −3; 5 0 −3; 5 −3 −3]
M5 (southwest): [−3 −3 −3; 5 0 −3; 5 5 −3]
M6 (south): [−3 −3 −3; −3 0 −3; 5 5 5]
M7 (southeast): [−3 −3 −3; −3 0 5; −3 5 5]
▪ It is interesting to know the p most prominent directions of the edge responses. Here, the top-p directional bit responses b_k are set to 1, and the remaining bits of the 8-bit LDP pattern are set to 0.
LDP = Σ_{k=0..7} B(m_k − m_p) 2^k,  B(a) = 1 if a ≥ 0, 0 otherwise

where m_p is the p-th most significant directional response.
Figure: edge responses to the eight directions (m0–m7) around a center pixel X, and the corresponding LDP binary bit positions (B0–B7).
Figure: LDP feature example for a pixel considering the top 4 positions. A 3×3 neighborhood is convolved with the Kirsch masks {M_k} to obtain the eight edge responses m_k; setting the bits of the top-4 responses to 1 yields LDP binary code 00010011 (decimal 19). Figure: a second example yields LDP binary code 00011011 (decimal 27).
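The LDP computation can be sketched end to end: convolve a 3x3 patch with the eight Kirsch masks, keep the top-p responses, and set the corresponding bits. The patch values are hypothetical, and taking the absolute response is one common convention:

```python
KIRSCH = [
    [[-3, -3,  5], [-3, 0,  5], [-3, -3,  5]],  # M0 east
    [[-3,  5,  5], [-3, 0,  5], [-3, -3, -3]],  # M1 northeast
    [[ 5,  5,  5], [-3, 0, -3], [-3, -3, -3]],  # M2 north
    [[ 5,  5, -3], [ 5, 0, -3], [-3, -3, -3]],  # M3 northwest
    [[ 5, -3, -3], [ 5, 0, -3], [ 5, -3, -3]],  # M4 west
    [[-3, -3, -3], [ 5, 0, -3], [ 5,  5, -3]],  # M5 southwest
    [[-3, -3, -3], [-3, 0, -3], [ 5,  5,  5]],  # M6 south
    [[-3, -3, -3], [-3, 0,  5], [-3,  5,  5]],  # M7 southeast
]

def ldp_code(patch, p=4):
    # eight directional responses m_k = |patch * M_k|
    responses = [abs(sum(mask[r][c] * patch[r][c]
                         for r in range(3) for c in range(3)))
                 for mask in KIRSCH]
    # the top-p responses get bit value 1; the rest stay 0
    top = sorted(range(8), key=lambda k: responses[k], reverse=True)[:p]
    return sum(1 << k for k in top)

patch = [[80, 90, 70],
         [60, 50, 40],
         [30, 20, 10]]
code = ldp_code(patch)
```

Whatever the patch, an LDP code with p = 4 always has exactly four bits set, which is what makes the resulting histograms comparable across pixels.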
A depth expression image is divided into small regions, and the regions' LDP histograms are concatenated to represent the features for one image.
LDP Features
The image textural feature is presented by the histogram of the LDP map, of which the q-th bin can be defined as follows:

T_q = Σ_{x,y} I(LDP(x, y) = q),  q = 0, 1, ..., n − 1

where n = 256 normally for an image I. The histogram of the LDP map for a region is presented as

H = (T_0, T_1, ..., T_{n−1}).

Finally, the whole LDP feature F is expressed as a concatenated sequence of the histograms of all regions:

F = (H_1, H_2, ..., H_s)

where s = number of regions.
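The histogram concatenation above can be sketched on a toy code map; n is reduced from 256 to 8 bins here purely for readability, and the map values are hypothetical:

```python
def region_histogram(codes, n_bins=8):
    # T_q = number of pixels whose code equals q
    hist = [0] * n_bins
    for c in codes:
        hist[c] += 1
    return hist

def ldp_feature(code_map, region_rows=2, region_cols=2, n_bins=8):
    # split the code map into a grid of regions, histogram each
    # region, and concatenate: F = (H_1, H_2, ..., H_s)
    h, w = len(code_map), len(code_map[0])
    rh, rw = h // region_rows, w // region_cols
    feature = []
    for ri in range(region_rows):
        for ci in range(region_cols):
            codes = [code_map[y][x]
                     for y in range(ri * rh, (ri + 1) * rh)
                     for x in range(ci * rw, (ci + 1) * rw)]
            feature.extend(region_histogram(codes, n_bins))
    return feature

code_map = [[0, 1, 2, 3],
            [4, 5, 6, 7],
            [7, 6, 5, 4],
            [3, 2, 1, 0]]
F = ldp_feature(code_map)
```

With s = 4 regions and n = 8 bins the feature has 32 entries, and the histogram counts sum to the number of pixels.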
SVM is used for extreme classification cases (e.g., telling CAT from DOG images).
➢ Gene expression data classification
➢ Text categorization, if time permits
f(x, w, b) = sign(w·x + b). How would you classify this data? Points with w·x + b > 0 are denoted +1; points with w·x + b < 0 are denoted −1.
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM).
Support Vectors are those datapoints that the margin pushes up against.
◼ Goal: 1) Correctly classify all training data:
w·x_i + b ≥ +1 if y_i = +1
w·x_i + b ≤ −1 if y_i = −1
i.e. y_i (w·x_i + b) ≥ 1 for all i
2) Maximize the margin M = 2/‖w‖, which is the same as minimizing (1/2) wᵗw.
◼ We can formulate a Quadratic Optimization Problem and solve for w and b:
◼ Minimize Φ(w) = (1/2) wᵗw subject to y_i (w·x_i + b) ≥ 1 for all i.
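The quantities above can be checked numerically for a fixed separator: the decision function sign(w·x + b), the constraint y_i(w·x_i + b) ≥ 1, and the margin 2/‖w‖. The weights and training points are hypothetical (no optimization is performed here):

```python
import math

w, b = (1.0, 1.0), -3.0

def f(x):
    # f(x, w, b) = sign(w.x + b)
    s = w[0] * x[0] + w[1] * x[1] + b
    return 1 if s >= 0 else -1

# hypothetical training points (x, y)
data = [((3.0, 2.0), 1), ((4.0, 1.0), 1),
        ((0.0, 1.0), -1), ((1.0, 0.0), -1)]

# every point satisfies the margin constraint y (w.x + b) >= 1
constraints_ok = all(y * (w[0] * x[0] + w[1] * x[1] + b) >= 1
                     for x, y in data)

margin = 2.0 / math.hypot(*w)   # M = 2 / ||w||
```

A solver (the quadratic program above) would pick the w, b with the largest such margin among all constraint-satisfying separators.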
◼ Datasets that are linearly separable with some noise work out great.
◼ But what are we going to do if the dataset is just too hard?
◼ How about mapping the data to a higher-dimensional space? Figure: 1-D data x becomes separable after mapping to (x, x²).
◼ General idea: the original input space can always be
mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
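A tiny sketch of this mapping idea: 1-D points labeled by whether |x| is small cannot be split by a single threshold on x, but after the feature map φ(x) = (x, x²) a threshold on the second coordinate separates them. The data are hypothetical:

```python
points = [(-3, -1), (-1, 1), (1, 1), (3, -1)]   # (x, label)

def separable_1d(pts):
    # separable in 1-D only if one class lies entirely to one side
    xs_pos = [x for x, y in pts if y == 1]
    xs_neg = [x for x, y in pts if y == -1]
    return max(xs_pos) < min(xs_neg) or max(xs_neg) < min(xs_pos)

def phi(x):
    # the feature map x -> (x, x**2)
    return (x, x * x)

one_d = separable_1d(points)

# after mapping, the second coordinate x**2 separates the classes
sep_2d = max(phi(x)[1] for x, y in points if y == 1) < \
         min(phi(x)[1] for x, y in points if y == -1)
```

Kernel methods exploit exactly this: they work in the mapped space without ever materializing φ(x) explicitly.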
Note: x_i ∈ ℝⁿ, y_i ∈ {1, 2, ..., K}
Train K linear classifiers w_1, ..., w_K using any learning algorithm we have seen; each scores an input via w_kᵀx.
From the full dataset, construct three binary classifiers, one for each class:
w_blueᵀx > 0 for blue inputs
w_redᵀx > 0 for orange inputs
w_greenᵀx > 0 for gray inputs
Winner-take-all will predict the right answer: only the correct label will have a positive score. Notation: w_blueᵀx is the score for the blue label.
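A winner-take-all sketch of this one-vs-all scheme: K linear scorers, prediction = argmax of the scores. The weight vectors and inputs are hypothetical 2-D examples:

```python
W = {
    "blue":  (1.0, 0.0),
    "red":   (-1.0, 0.0),
    "green": (0.0, 1.0),
}

def predict(x):
    # score each label with w_k . x and take the winner
    scores = {label: w[0] * x[0] + w[1] * x[1] for label, w in W.items()}
    return max(scores, key=scores.get)

label = predict((2.0, 0.5))
```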
Black points are not separable with a single binary classifier; the decomposition will not work for these cases:
w_blueᵀx > 0 for blue inputs
w_redᵀx > 0 for orange inputs
w_greenᵀx > 0 for gray inputs
???
Note: x_i ∈ ℝⁿ, y_i ∈ {1, 2, ..., K}. The all-pairs (one-vs-one) decomposition trains K(K − 1)/2 binary classifiers, one per pair of classes.
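The K(K − 1)/2 count is just the number of unordered class pairs, which a quick sketch confirms:

```python
from itertools import combinations

def n_pairwise_classifiers(K):
    # one binary classifier per unordered pair of classes
    return K * (K - 1) // 2

K = 5
pairs = list(combinations(range(K), 2))
```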
Figure: SVM learning on a linearly separable dataset (top row) and a dataset that needs two straight lines to separate it in 2-D (bottom row), with the linear kernel (left), the polynomial kernel of degree 3 (middle), and the RBF kernel (right).
A CNN is a neural network with some convolutional layers (and some other layers). A convolutional layer has a number of filters, each performing a convolution operation. Figure: a "beak detector" filter that responds to a bird's beak.
Figure: a 6×6 binary image is convolved with 3×3 filters (Filter 1, Filter 2, ...). These filters are the network parameters to be learned; each filter detects a small 3×3 pattern. With stride 1, the filter slides across the image, and each output value is the dot product of the filter with the underlying 3×3 patch (e.g., 3 at the first position). Viewed as a fully connected layer over the 36 flattened pixels x_1, ..., x_36, each convolution output connects to only 9 of the inputs, with the same 9 weights shared across positions.
The whole CNN pipeline: Convolution → Max Pooling (can repeat many times) → Flattened → Fully Connected Feedforward network → outputs (cat, dog, ...).
Figure: applying Filter 1 and Filter 2 to the 6×6 image gives two 4×4 feature maps; 2×2 max pooling then reduces each to a 2×2 image. Each filter produces one channel of the new, smaller image. Convolution and Max Pooling can repeat many times: the number of channels is the number of filters, and each new image is smaller than the original image.
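The convolution (stride 1) and 2x2 max-pooling steps above can be sketched directly: a 6x6 binary image, one 3x3 filter, a 4x4 feature map, then a 2x2 pooled map. The image and filter values are hypothetical (a diagonal-pattern detector):

```python
image = [
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
]
filt = [[ 1, -1, -1],
        [-1,  1, -1],
        [-1, -1,  1]]   # responds strongly to a diagonal of ones

def conv2d(img, f, stride=1):
    # each output value is the dot product of f with a 3x3 patch
    fh, fw = len(f), len(f[0])
    oh = (len(img) - fh) // stride + 1
    ow = (len(img[0]) - fw) // stride + 1
    return [[sum(f[i][j] * img[y * stride + i][x * stride + j]
                 for i in range(fh) for j in range(fw))
             for x in range(ow)] for y in range(oh)]

def max_pool(fmap, size=2):
    # keep the maximum of each non-overlapping size x size window
    oh, ow = len(fmap) // size, len(fmap[0]) // size
    return [[max(fmap[y * size + i][x * size + j]
                 for i in range(size) for j in range(size))
             for x in range(ow)] for y in range(oh)]

fmap = conv2d(image, filt)      # 4x4 feature map
pooled = max_pool(fmap)         # 2x2 pooled map (one channel)
```

With a second filter the pooled output would gain a second channel, exactly as in the figure.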
Figure: the pooled feature maps are flattened into a vector and fed to a Fully Connected Feedforward network that produces the outputs (cat, dog, ...). The Convolution + Max Pooling stages can repeat many times before flattening.
LeNet-5: "Gradient-Based Learning Applied to Document Recognition", Y. LeCun, L. Bottou, Y. Bengio, P. Haffner; 1998 [LeCun et al., 1998]
AlexNet: "ImageNet Classification with Deep Convolutional Neural Networks", Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton; 2012. Made practical by GPU implementation and large datasets (ImageNet); far more parameters than LeNet-5. [Krizhevsky et al., 2012]
AlexNet layer sizes (this slide is taken from Andrew Ng) [Krizhevsky et al., 2012]:
Input: 227×227×3
CONV1: 11×11, s = 4, p = 0 → 55×55×96
MAX POOL1: 3×3, s = 2 → 27×27×96
CONV2: 5×5, s = 1, p = 2 → 27×27×256
MAX POOL2: 3×3, s = 2 → 13×13×256
CONV3: 3×3, s = 1, p = 1 → 13×13×384
CONV4: 3×3, s = 1, p = 1 → 13×13×384
CONV5: 3×3, s = 1, p = 1 → 13×13×256
MAX POOL3: 3×3, s = 2 → 6×6×256
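The spatial sizes in that table follow from the standard output-size formula out = (n + 2p − f)/s + 1 for input size n, filter size f, padding p, stride s; a quick check of the first few layers:

```python
def conv_out(n, f, s, p):
    # output spatial size of a convolution or pooling layer
    return (n + 2 * p - f) // s + 1

# AlexNet's first conv: 227x227 input, 11x11 filter, stride 4, pad 0
c1 = conv_out(227, 11, 4, 0)
# first max pool: 3x3, stride 2 on 55x55
p1 = conv_out(55, 3, 2, 0)
# second conv: 5x5, stride 1, pad 2 on 27x27 (size preserved)
c2 = conv_out(27, 5, 1, 2)
```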
Architecture CONV1 MAX POOL1 NORM1 CONV2 MAX POOL2 NORM2 CONV3 CONV4 CONV5 Max POOL3 FC6 FC7 FC8
[Krizhevsky et al., 2012]
AlexNet was the coming out party for CNNs in the computer vision community. This was the first time a model performed so well on a historically difficult ImageNet dataset. This paper illustrated the benefits of CNNs and backed them up with record breaking performance in the competition.
[Krizhevsky et al., 2012]
GoogLeNet moved beyond the general approach of simply stacking conv and pooling layers on top of each other in a sequential structure.
[Szegedy et al., 2014]
Introduced the idea that CNN layers don't always have to be stacked up sequentially. Coming up with the Inception module, the authors showed that a creative structuring of layers can lead to improved performance and computational efficiency.
[Szegedy et al., 2014]
ResNet: "Deep Residual Learning for Image Recognition", Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun; 2015. Residual connections ease the flow of gradients and enable the training of networks that are substantially deeper than those used previously.
[He et al., 2015]
Achieved a 3.57% top-5 ImageNet error (better than human-level error; humans generally hover around a 5-10% error rate). Swept all classification and detection competitions in ILSVRC'15 and COCO'15!
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
[He et al., 2015]
Hypothesis: deep networks are harder to optimize. Solution: take the activations from one layer and feed them into another layer, much deeper into the network, learning the residual F(x) = H(x) − x instead of H(x) directly.
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
[He et al., 2015]
Residual Block: input x goes through a conv-relu-conv series, giving F(x). That result is then added to the original input x; call that H(x) = F(x) + x. In traditional CNNs, H(x) would just be equal to F(x). So instead of computing H(x) from scratch, we compute the term F(x) that we have to add to the input x.
[He et al., 2015]
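A minimal numeric sketch of the residual block above, H(x) = F(x) + x; a toy linear function stands in for the learned conv-relu-conv branch:

```python
def F(x):
    # hypothetical residual branch; a real block would be conv-relu-conv
    return [0.1 * v for v in x]

def residual_block(x):
    fx = F(x)
    return [a + b for a, b in zip(fx, x)]   # H(x) = F(x) + x

x = [1.0, -2.0, 3.0]
h = residual_block(x)
```

Note the key property: if F(x) is driven to zero, the block reduces to the identity map, which is what makes very deep stacks of these blocks easier to optimize.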
Full ResNet architecture: stacked residual blocks, periodically downsampling spatially using stride 2 (in each dimension). Figure: full ResNet architecture for ImageNet.
[He et al., 2015]
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
One of the best CNN architectures that we currently have, and a great innovation for the idea of residual learning.
[He et al., 2015]