Data Mining II Neural Networks and Deep Learning
Heiko Paulheim
Deep Learning
A recent hype topic

Deep Learning
Just the same as artificial neural networks with a new buzzword?
Deep Learning
– Recap of neural networks
– The backpropagation algorithm
– Auto Encoders
– Deep Learning
– Network Architectures
– “Anything2Vec”
Revisited Example: Credit Rating
– Given the training data below, we try to build a model
– which is as small as possible (recall: Occam's Razor)

Person         Employed  Owns House  Balanced Account  Get Credit
Peter Smith    yes       yes         no                yes
Julia Miller   no        yes         no                no
Stephen Baker  yes       no          yes               yes
Mary Fisher    no        no          yes               no
Kim Hanson     no        yes         yes               yes
John Page      yes       no          no                no
Revisited Example: Credit Rating
– The underlying pattern: if at least two of Employed, Owns House, and Balanced Account are yes → Get Credit is yes
– This is hard to express with rules as we know them (attribute-value conditions)
(training data as above)
Revisited Example: Credit Rating
– if at least two of Employed, Owns House, and Balanced Account are yes → Get Credit is yes
– expressed as classical rules:

Employed=yes and OwnsHouse=yes => yes
Employed=yes and BalancedAccount=yes => yes
OwnsHouse=yes and BalancedAccount=yes => yes
otherwise => no
– in general, “at least m out of n attributes need to be yes => yes” requires C(n, m) = n! / (m!·(n−m)!) rules
– e.g., “5 out of 20 attributes need to be yes” requires C(20, 5) = 15,504, i.e., more than 15,000 rules! (see the check below)
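A quick sanity check of that count (a sketch; the “one rule per m-subset of attributes” reading of the formula above is how it is applied here):

```python
from math import comb

# Number of "m attributes out of n are yes" rules: C(n, m) = n! / (m! * (n-m)!)
print(comb(3, 2))    # 3 rules, as listed above for the credit rating example
print(comb(20, 5))   # 15504 -> more than 15,000 rules for "5 out of 20"
```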
Artificial Neural Networks
– The human brain: one of the most powerful super computers in the world
Artificial Neural Networks (ANN)
X1  X2  X3  Y
 1   0   0  0
 1   0   1  1
 1   1   0  1
 1   1   1  1
 0   0   0  0
 0   0   1  0
 0   1   0  0
 0   1   1  1

[Figure: Input (X1, X2, X3) → Black box → Output (Y)]

Output Y is 1 if at least two of the three inputs are equal to 1.
Example: Credit Rating
– if at least two of Employed, Owns House, and Balanced Account are yes → Get Credit is yes
– equivalently: if (Employed + Owns House + Balanced Account) > 1.5 → Get Credit is yes
Artificial Neural Networks (ANN)
(same truth table as above)

[Figure: input nodes X1, X2, X3 with weights 0.3, 0.3, 0.3, connected to an output node with threshold t=0.4]

Y = I(0.3·X1 + 0.3·X2 + 0.3·X3 − 0.4 > 0), where I(z) = 1 if z is true, and 0 otherwise (see the sketch below)
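A minimal sketch of this threshold unit; the weights and threshold are the ones from the slide:

```python
# Perceptron-style threshold unit: Y = I(0.3*X1 + 0.3*X2 + 0.3*X3 - 0.4 > 0)
def threshold_unit(x, weights=(0.3, 0.3, 0.3), t=0.4):
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1 if s - t > 0 else 0

# Fires exactly when at least two of the three inputs are 1
for x in [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]:
    print(x, threshold_unit(x))
```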
Artificial Neural Networks (ANN)
– The model is an assembly of inter-connected nodes and weighted links
– The output node sums up each of its input values according to the weights
– The sum is compared against some threshold t
[Figure: input nodes X1, X2, X3 connected with weights w1, w2, w3 to an output node Y with threshold t]

Perceptron Model:
Y = I(Σi wi·Xi − t > 0), or equivalently
Y = sign(Σi wi·Xi − t)
General Structure of ANN
[Figure: a single neuron i with inputs I1, I2, I3, weights wi1, wi2, wi3, threshold t, and output Oi = g(Si), where g is the activation function]
[Figure: a network with input layer (x1...x5), hidden layer, and output layer (y)]

Training an ANN means learning the weights of the neurons
Algorithm for Learning ANN
– Goal: adjust the weights so that the output of the ANN is consistent with the class labels of the training examples
– Objective function: E = Σi [Yi − f(wi, Xi)]²
– Find the weights wi that minimize the above objective function
Backpropagation Algorithm
– Goal: adjust the weights so that the output of the ANN is consistent with the class labels of the training examples
– Objective function: E = Σi [Yi − f(wi, Xi)]²
– Find the weights wi that minimize the above objective function
– This can be done directly for a single perceptron
– But for the hidden layers of a multi-layer network, the desired output Yi is not known

[Figure: a network with input layer (x1...x5), hidden layer, and output layer (y)]
Backpropagation Algorithm
– Present an example to the ANN
– Compute the error at the output layer
– Distribute the error to the hidden layer according to the weights
– Adjust the weights so that the error is minimized
– Repeat until the input layer is reached (see the sketch below)
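A compact numpy sketch of these steps for one hidden layer, assuming sigmoid activations and squared error; the data and layer sizes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]], dtype=float)  # 4 examples, 3 inputs
y = np.array([[0],[1],[1],[1]], dtype=float)                  # target: x1 or x2 is 1

W1 = rng.normal(size=(3, 4))   # input -> hidden weights
W2 = rng.normal(size=(4, 1))   # hidden -> output weights
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5                       # learning rate

for epoch in range(5000):
    # feed-forward: push predictions through the network
    h = sigmoid(X @ W1)
    out = sigmoid(h @ W2)
    # backpropagation: push errors backwards through the network
    err_out = (out - y) * out * (1 - out)      # error (delta) at the output layer
    err_hid = (err_out @ W2.T) * h * (1 - h)   # error distributed to the hidden layer
    W2 -= lr * h.T @ err_out                   # adjust weights to reduce the error
    W1 -= lr * X.T @ err_hid

print(np.round(out, 2))  # typically close to the targets after training
```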
Backpropagation Algorithm
– Predictions are pushed forward through the network (“feed-forward neural network”)
– Errors are pushed backwards through the network (“backpropagation”)
Backpropagation Algorithm – Gradient Descent
– Minimize the error g(w1·i1 + … + wn·in) − y
– Gradient descent: adapt the weights towards the value where g′ is maximal

[Figure: a single neuron i with inputs I1, I2, I3, weights wi1, wi2, wi3, threshold t, and output Oi = g(Si), where g is the activation function]
Backpropagation Algorithm – Gradient Descent
– the value where g’ is maximal
function
true is if 1 ) ( where ) 4 . 3 . 3 . 3 . (
3 2 1
z z I X X X I Y
03/26/19 Heiko Paulheim 21
Alternative Differentiable Activation Functions
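For reference, a small sketch of commonly used differentiable activation functions and their derivatives (which curves the slide's figure shows is not visible here; sigmoid, tanh, and ReLU are standard choices and are assumed below):

```python
import numpy as np

# Differentiable activation functions and their derivatives,
# which is what gradient descent needs
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)  # subgradient: use 0 at z = 0

z = np.linspace(-3, 3, 7)
print(sigmoid(z).round(2), sigmoid_grad(z).round(2))
```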
Properties of ANNs and Backpropagation
– May approximate any arbitrary function, even with one hidden layer
– Convergence may take time
– Higher learning rate: faster convergence, but danger of ending in local optima
– Lower learning rate: higher probability of finding the global optimum
Learning Rate, Momentum, and Local Minima
– The learning rate is a value between 0 and 1:
– 0: no adaptation, use the previous weights
– 1: forget everything learned so far, simply use the weights that are best for the current example
Learning Rate, Momentum, and Local Minima
– Small: very small steps
– High: very large steps
Dynamic Learning Rates
– Search coarse-grained first, fine-grained later
– Allow bigger jumps in the beginning
– Patterns in weight change differ between weights
– Allow local (per-weight) learning rates, e.g., RMSProp, AdaGrad, Adam (see the sketch below)
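A sketch of one per-weight adaptive update step in the style of Adam (standard Adam formulas; all names and values are illustrative):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-weight step size
    return w, m, v

w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
w, m, v = adam_step(w, np.array([0.1, -2.0, 0.5]), m, v, t=1)
print(w)
```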
ANNs vs. SVMs
– ANNs learn complex decision boundaries and keep the data as it is
– SVMs learn simple (linear) decision boundaries and transform the data first
Recap: Feature Subset Selection & PCA
– Feature Subset Selection: focus on relevant attributes
– Principal Component Analysis: create new attributes
– In both cases, we assume that the data can be described with fewer variables
– Without losing much information
What Happens at the Hidden Layer?
– Suppose the hidden layer is smaller than the input layer
– Input: x1...xn
– Hidden: h1...hm
– n > m
– The output y is predicted only from the values at the hidden layer
– i.e., the m hidden features should be sufficient to predict y!

[Figure: a network with input layer (x1...x5), hidden layer, and output layer (y)]
What Happens at the Hidden Layer?
– The hidden layer thus learns a compressed representation of the dataset
– Hidden: h1...hm
– which still conveys the information needed to predict y
– This is particularly useful for sparse datasets
– The resulting representation is usually dense
Auto Encoders
– Auto encoders exploit this idea: they train a model for predicting an example from itself
– using fewer variables
– The idea is similar to PCA, but PCA provides only a linear transformation
– ANNs can also create non-linear parameter transformations
Denoising Auto Encoders
– Add random noise to the input
– Keep the output (i.e., the training target) clean
– Result: a model that learns to remove noise from an instance (see the sketch below)
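A minimal Keras sketch of a (denoising) auto encoder, assuming TensorFlow/Keras is available; the data, noise level, and layer sizes are made up for illustration:

```python
import numpy as np
import tensorflow as tf  # assumption: TensorFlow/Keras is installed

# The network is trained to reproduce its own input from a smaller hidden layer.
# For the denoising variant, the input is corrupted while the target stays clean.
X = np.random.rand(1000, 20).astype("float32")                     # 20 original features
X_noisy = (X + np.random.normal(0, 0.1, X.shape)).astype("float32")

autoencoder = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(20,)),  # compressed hidden layer
    tf.keras.layers.Dense(20, activation="sigmoid"),                 # reconstruction of the input
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_noisy, X, epochs=10, batch_size=32, verbose=0)     # plain auto encoder: fit(X, X)
```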
Stacked (Denoising) Auto Encoders
– Hidden layers capture more complex hidden variables and/or denoising patterns
– They are often trained consecutively:
– First: train an auto encoder with one hidden layer
– Second: train a second one-layer neural net that maps hidden layer 1 to hidden layer 2

[Figure: a first auto encoder maps the (noisy) input to hidden layer 1; a second one maps hidden layer 1 to hidden layer 2]
Footnote: Auto Encoders for Outlier Detection
(Hawkins et al., 2002)
– Train an auto encoder on the dataset: it learns a model that captures the patterns in the data
– Outliers do not follow these patterns and are therefore reconstructed poorly
– The deviation between an instance and its reconstruction is a measure for its outlier score
From Classifiers to Feature Detectors
Some of the following slides are borrowed from https://www.macs.hw.ac.uk/~dwcorne/Teaching/
From Classifiers to Feature Detectors
What does a particular neuron do?
What Happens at the Hidden Layer?
[Figure: input pixel weights of one hidden neuron — high weights along the top row, low/zero weights elsewhere: a strong signal for a horizontal line in the top row, ignoring everything else]
What Happens at the Hidden Layer?
[Figure: input pixel weights of one hidden neuron — high weights in the top left corner: a strong signal for a dark area in the top left corner]
Is that enough? What Features do we Need?
Vertical Lines Horizontal Lines Circles
Is that enough? What Features do we Need?
– What we can detect so far: a line at the top, a dark area in the top left corner, …
– What we actually need: vertical lines, horizontal lines, circles
– The problem: positional variance, color variance
On the Quest for Higher Level Features
[Figure: a hierarchy of feature detectors — lower layers detect lines in specific positions; higher level detectors combine them (horizontal line, RHS vertical line, upper loop, etc.)]
Regularization with Dropout
– Suppose the network has learned five hidden features for recognizing an elephant:
– big object
– grey
– trunk
– tail
– ears
– and expects all five to fire for the output “elephant”
Regularization with Dropout
– Randomly deactivate hidden neurons when training an example
– E.g., factor α=0.4: deactivate neurons randomly with probability 0.4

[Figure: the elephant example with two of the five feature neurons (marked X) deactivated for this training example]
Regularization with Dropout
– Randomly deactivate hidden neurons when training an example
– E.g., factor α=0.4: deactivate neurons randomly with probability 0.4
– The learned model is more robust and less overfit
– At prediction time: use all hidden neurons
– Correction: multiply each output with 1/(1+α) (see the sketch below)
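A small numpy sketch of this scheme, following the convention on these slides (deactivate with probability α during training, multiply outputs by 1/(1+α) at prediction time); the weights are the elephant example from the following slides:

```python
import numpy as np

alpha = 0.4
weights = np.array([0.4, 0.7, 1.0, 0.3, 0.3])   # big object, grey, trunk, tail, ears
activations = np.ones(5)                        # all five features fire

# training time: random dropout mask deactivates each neuron with probability alpha
mask = (np.random.rand(5) >= alpha).astype(float)
train_sum = np.sum(weights * activations * mask)

# prediction time: no dropout, but each output is corrected by 1/(1+alpha) = 5/7
predict_sum = np.sum(weights * activations) / (1 + alpha)
print(train_sum, predict_sum, predict_sum > 1.3)  # 2.7 * 5/7 ≈ 1.93 > 1.3 -> "elephant"
```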
Regularization with Dropout
– At prediction time: use all hidden neurons
– Correction: multiply each output with 1/(1+α)
– Weights: big object 0.4, grey 0.7, trunk 1.0, tail 0.3, ears 0.3; threshold for “elephant”: >1.3
– Without correction, an example without a trunk already fires: 0.4+0.7+0.3+0.3 = 1.7 > 1.3
Regularization with Dropout
– At prediction time: use all hidden neurons
– Correction: multiply each output with 1/(1+α)
– With correction, the same example no longer fires: (5/7)·(0.4+0.7+0.3+0.3) = 1.21 < 1.3
Regularization with Dropout
– At prediction time: use all hidden neurons
– Correction: multiply each output with 1/(1+α)
– An example with a trunk but without grey still fires: (5/7)·(0.4+1.0+0.3+0.3) = 1.43 > 1.3
Architectures: Convolutional Neural Networks
– Consider a photo of, say, 8 megapixels
– Treating each pixel as an input: 8M input neurons
– Connecting that to a hidden layer of the same size: 8M² = 64 trillion weights to learn
– This is hardly practical…
– Solution: convolutional neural networks
Architectures: Convolutional Neural Networks
– Convolution layer
– Pooling layer
Architectures: Convolutional Neural Networks
– Convolution layer: each neuron is connected to a small n x n square of the input neurons
– i.e., the number of connections is linear, not quadratic
– Neurons detecting the same feature can share their weights
– (intuition: a horizontal line looks the same everywhere)
Architectures: Convolutional Neural Networks
– Pooling layer: use only the maximum value of a neighborhood of neurons
– Think: downsizing a picture
– The number of neurons is divided by four with each pooling layer
Architectures: Convolutional Neural Networks
– With each pooling/subsampling step: 4 times fewer neurons
– After a few layers, we have a decent number of inputs
– Feed those into a fully connected ANN for the actual classification (see the sketch below)
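A Keras sketch of this layout (alternating convolution and pooling layers followed by a fully connected part), assuming TensorFlow/Keras; the input size, filter counts, and number of classes are illustrative:

```python
import tensorflow as tf  # assumption: TensorFlow/Keras is installed

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),       # pooling: keep the max of each 2x2 neighborhood
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),    # fully connected part
    tf.keras.layers.Dense(10, activation="softmax"), # e.g., 10 classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```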
Architectures: Convolutional Neural Networks
– Recall: treating each pixel of an 8M-pixel image as an input gives 8M input neurons
– Connecting that to a hidden layer of the same size: 8M² = 64 trillion weights to learn
– With a convolutional layer:
– Assume each hidden neuron is connected to a 16x16 square
– and we learn 256 hidden features (i.e., 256 layers of convolutional neurons)
– 16x16x256x8M ≈ still roughly 520 billion weights
– But the neurons of one feature layer share their weights
– Thus, it’s just 16x16x256 = 65k weights (see the check below)
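The weight counts can be reproduced in a few lines (8M input pixels assumed, as on the slide):

```python
pixels = 8_000_000
fully_connected = pixels ** 2             # 6.4e13: 64 trillion weights
local_patches   = 16 * 16 * 256 * pixels  # ~5.2e11: hundreds of billions of weights
shared_weights  = 16 * 16 * 256           # 65,536: ~65k weights with weight sharing
print(fully_connected, local_patches, shared_weights)
```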
Architectures: Convolutional Neural Networks
http://scs.ryerson.ca/~aharley/vis/conv/flat.html
Architectures: Convolutional Neural Networks
– Google’s GoogLeNet (Inception)
– Current state of the art in image classification
Turning a Neural Network Upside Down
– Reverse application: given a label, synthesize an image
– Additional constraint (prior): neighboring pixels correlate

https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html
Making a Neural Network Daydream
– Idea: present an arbitrary image to the network and use it, together with the predicted label, as an additional training example

https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html
Neural Networks for Arts
– Idea: combine the content of one image with the style of another, both described by learned features

https://arxiv.org/pdf/1508.06576.pdf
Reusing Pre-trained Networks
– Idea: feed the output of a pre-trained network’s hidden layers to yet another classifier (neural network or other)
– Example: using an image classifier trained for object recognition (see the sketch below)
http://demo.caffe.berkeleyvision.org/
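A hedged Keras sketch of this idea (a pre-trained image network as a fixed feature extractor for a new classifier); the choice of base model, input size, and number of classes are assumptions for illustration:

```python
import tensorflow as tf  # assumption: TensorFlow/Keras with bundled pre-trained models

# Use a network pre-trained on ImageNet as a fixed feature extractor
base = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                         input_shape=(96, 96, 3))
base.trainable = False                      # keep the pre-trained weights

model = tf.keras.Sequential([
    base,                                   # hidden layers of the pre-trained network
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # new classifier for 5 own classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```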
What does an Artificial Neural Network Learn?
What does an Artificial Neural Network Learn?
– Adversarial examples: changing a few pixels, barely noticed by humans, can change the prediction completely

Goodfellow et al.: Explaining and Harnessing Adversarial Examples, 2015
Possible Implications
Sharif et al.: Accessorize to a Crime: Real and Stealthy Attacks on State-of-the-Art Face Recognition, 2016
Possible Implications
Papernot et al.: Practical Black-Box Attacks against Machine Learning, 2017
Using ANNs for Time Series Prediction
– Time series contain long term trends, seasonal effects, random fluctuation, …
– Simple approach: let’s use the last five values as features (a sliding window, see the sketch below)

[Figure: a feed-forward network that uses the values at T-5 … T-1 as inputs to predict the value at T]
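A sketch of turning a univariate series into (last five values → next value) training examples, which could then be fed into a standard feed-forward network; the series itself is made up:

```python
import numpy as np

series = np.sin(np.linspace(0, 20, 200))          # made-up example time series
window = 5
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
print(X.shape, y.shape)                           # (195, 5) (195,)
```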
Using ANNs for Time Series Prediction
– Problem: we always use only the last five values, so we cannot detect longer term trends
– Idea: introduce a memory
– Implementation: backward (recurrent) loops in the network

[Figure: the same network with backward loops added to the hidden layer]
Long Short Term Memory Networks (LSTM)
– A folded deep neural network (see the sketch below)
– Note: the influence of the past decays over time
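A Keras sketch of an LSTM for sequence prediction, assuming TensorFlow/Keras; the data, sequence length, and layer size are illustrative:

```python
import numpy as np
import tensorflow as tf  # assumption: TensorFlow/Keras is installed

# The recurrent layer keeps an internal memory, so it is not limited
# to a fixed window of past values.
X = np.random.rand(100, 10, 1)    # 100 sequences of length 10, one feature each
y = np.random.rand(100, 1)        # next value to predict

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(16, input_shape=(10, 1)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)
```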
CNNs for Time Series Prediction
– Convolution filters can also detect local patterns in time series
– Think: trends, seasonal variation, ...

Zheng et al.: Time Series Classification Using Multi-Channels Deep Convolutional Neural Networks, 2014
word2vec
– Continuous bag of words (CBOW): predict a word from the surrounding words
– Skip-Gram: predict the surrounding words of a word

Xin Rong: word2vec parameter learning explained
word2vec
– Similar words are positioned close to each other
– Relations have the same direction
word2vec
– king − man + woman ≈ queen (see the sketch below)
– king:man ↔ queen:woman
– knee:leg ↔ elbow:forearm
– Hillary Clinton:Democrat ↔ Donald Trump:Republican
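A sketch of training a small skip-gram model and querying such an analogy, assuming gensim 4.x is available; the toy corpus is made up (meaningful analogies require a large corpus such as the ones mentioned on the next slide):

```python
from gensim.models import Word2Vec  # assumption: gensim 4.x

sentences = [["the", "king", "rules", "the", "kingdom"],
             ["the", "queen", "rules", "the", "kingdom"],
             ["a", "man", "and", "a", "woman"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1: skip-gram

# king - man + woman ≈ queen (only meaningful with a large training corpus)
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```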
word2vec
– Pre-trained models are available, e.g., trained on the Google News corpus or Wikipedia
From word2vec to anything2vec
– Basically, everything that can be expressed as sequences can be processed by the word2vec pipeline
– Graph2vec (social graphs)
– Doc2vec (entire documents)
– RDF2Vec (RDF graphs)
– Chord2Vec (music chords)
– Audio2vec
– Video2vec
– Gene2vec (amino acid sequences)
– Emoji2vec
– ...
Summary
– Artificial neural networks are a powerful learning tool
– They can approximate arbitrary functions / decision boundaries
– Deep learning: ANNs with multiple hidden layers
– Hidden layers learn to identify relevant features
– Many architectural variants exist
– e.g., convolutional networks for image recognition
– word embeddings (word2vec)
– ...
Questions?