SLIDE 1

Data Mining II: Neural Networks and Deep Learning

Heiko Paulheim

SLIDE 2

Deep Learning

  • A recent hype topic
SLIDE 3

Deep Learning

  • Just the same as artificial neural networks with a new buzzword?
SLIDE 4

Deep Learning

  • Contents of this Lecture

– Recap of neural networks
– The backpropagation algorithm
– Auto Encoders
– Deep Learning
– Network Architectures
– “Anything2Vec”

SLIDE 5

Revisited Example: Credit Rating

  • Consider the following example:

– and try to build a model
– which is as small as possible (recall: Occam's Razor)

Person         Employed  Owns House  Balanced Account  Get Credit
Peter Smith    yes       yes         no                yes
Julia Miller   no        yes         no                no
Stephen Baker  yes       no          yes               yes
Mary Fisher    no        no          yes               no
Kim Hanson     no        yes         yes               yes
John Page      yes       no          no                no

SLIDE 6

Revisited Example: Credit Rating

  • Smallest model:

– if at least two of Employed, Owns House, and Balanced Account are yes → Get Credit is yes

  • Not nicely expressible in trees and rule sets

– as we know them (attribute-value conditions)

Person         Employed  Owns House  Balanced Account  Get Credit
Peter Smith    yes       yes         no                yes
Julia Miller   no        yes         no                no
Stephen Baker  yes       no          yes               yes
Mary Fisher    no        no          yes               no
Kim Hanson     no        yes         yes               yes
John Page      yes       no          no                no

SLIDE 7

Revisited Example: Credit Rating

  • Smallest model:

– if at least two of Employed, Owns House, and Balanced Account are yes → Get Credit is yes

  • As rule set:

Employed=yes and OwnsHouse=yes => yes
Employed=yes and BalancedAccount=yes => yes
OwnsHouse=yes and BalancedAccount=yes => yes
=> no (default rule)

  • General case:

– at least m out of n attributes need to be yes => yes
– this requires C(n, m) = n! / (m!·(n−m)!) rules
– e.g., “5 out of 20 attributes need to be yes” requires more than 15,000 rules (C(20, 5) = 15,504)!

SLIDE 8

Artificial Neural Networks

  • Inspiration

– the human brain: one of the most powerful “super computers” in the world

SLIDE 9

Artificial Neural Networks (ANN)

X1  X2  X3 | Y
 0   0   0 | 0
 0   0   1 | 0
 0   1   0 | 0
 0   1   1 | 1
 1   0   0 | 0
 1   0   1 | 1
 1   1   0 | 1
 1   1   1 | 1

Black box: inputs X1, X2, X3 → output Y

Output Y is 1 if at least two of the three inputs are equal to 1.

SLIDE 10

Example: Credit Rating

  • Smallest model:

– if at least two of Employed, Owns House, and Balanced Account are yes → Get Credit is yes

  • Given that we represent yes and no by 1 and 0, we want

– if (Employed + Owns House + Balanced Account) > 1.5 → Get Credit is yes

SLIDE 11

Artificial Neural Networks (ANN)

X1  X2  X3 | Y
 0   0   0 | 0
 0   0   1 | 0
 0   1   0 | 0
 0   1   1 | 1
 1   0   0 | 0
 1   0   1 | 1
 1   1   0 | 1
 1   1   1 | 1

Black box: input nodes X1, X2, X3 with weights 0.3, 0.3, 0.3; output node with threshold t = 0.4

Y = I(0.3·X1 + 0.3·X2 + 0.3·X3 − 0.4 > 0),
where I(z) = 1 if z is true, and 0 otherwise
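A minimal Python sketch (my addition, not part of the original slides) of this threshold unit; the weights 0.3, 0.3, 0.3 and threshold 0.4 are the ones given above, and the loop reproduces the truth table:

```python
# Threshold unit for "Y = 1 if at least two of the three inputs are 1"
from itertools import product

def perceptron(x1, x2, x3, w=(0.3, 0.3, 0.3), t=0.4):
    s = w[0] * x1 + w[1] * x2 + w[2] * x3
    return 1 if s > t else 0          # I(z): 1 if the condition holds, else 0

for x1, x2, x3 in product([0, 1], repeat=3):
    print(x1, x2, x3, "->", perceptron(x1, x2, x3))
```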

SLIDE 12

Artificial Neural Networks (ANN)

  • Model is an assembly of inter-connected nodes and weighted links
  • The output node sums up each of its input values according to the weights of its links
  • Compare the output node against some threshold t

[Figure: black box with input nodes X1, X2, X3, weights w1, w2, w3, and an output node with threshold t producing Y]

Perceptron Model:
Y = I(Σi wi·Xi − t > 0),   or   Y = sign(Σi wi·Xi − t)
SLIDE 13

General Structure of ANN

[Figure: a single neuron i with inputs I1, I2, I3, weights wi1, wi2, wi3, threshold t, activation function g(Si), and output Oi; and a network with an input layer (x1 … x5), a hidden layer, and an output layer producing y]

Training ANN means learning the weights of the neurons

SLIDE 14

Algorithm for Learning ANN

  • Initialize the weights (w0, w1, …, wk), usually randomly
  • Adjust the weights in such a way that the output of ANN is consistent

with class labels of training examples

– Objective function: E = Σi [Yi − f(wi, Xi)]²
– Find the weights wi that minimize this objective function

SLIDE 15

Backpropagation Algorithm

  • Adjust the weights in such a way

that the output of ANN is consistent with class labels of training examples

– Objective function: E = Σi [Yi − f(wi, Xi)]²
– Find the weights wi that minimize this objective function

  • This is simple for a single layer

perceptron

  • But for a multi-layer network,

Yi is not known

[Figure: network with an input layer (x1 … x5), a hidden layer, and an output layer producing y]

SLIDE 16

Backpropagation Algorithm

  • Sketch of the Backpropagation Algorithm:

– Present an example to the ANN
– Compute the error at the output layer
– Distribute the error to the hidden layer according to the weights

  • i.e., the error is distributed according to the contribution of the previous neurons to the result

– Adjust the weights so that the error is minimized

  • Adjustment factor: learning rate
  • Use gradient descent

– Repeat until the input layer is reached
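A minimal numeric sketch of this procedure (my addition, assuming sigmoid activations, a squared-error objective, bias-as-extra-input, and full-batch gradient descent; it illustrates the idea rather than the slides' exact derivation), trained on the "at least two of three" example from above:

```python
import numpy as np

rng = np.random.default_rng(0)

# the "at least two of three inputs are 1" example
X = np.array([[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)], dtype=float)
y = (X.sum(axis=1) >= 2).astype(float).reshape(-1, 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Xb = np.hstack([X, np.ones((len(X), 1))])   # constant 1 acts as a bias input
W1 = rng.normal(size=(4, 4))                # input -> hidden weights
W2 = rng.normal(size=(5, 1))                # hidden (+ bias) -> output weights
lr = 0.5                                     # learning rate

for epoch in range(5000):
    # forward pass: predictions are pushed through the network
    h = sigmoid(Xb @ W1)
    hb = np.hstack([h, np.ones((len(h), 1))])
    o = sigmoid(hb @ W2)
    # backward pass: errors are pushed back, layer by layer
    err_o = (o - y) * o * (1 - o)                 # error signal at the output
    err_h = (err_o @ W2[:-1].T) * h * (1 - h)     # error distributed to the hidden layer
    W2 -= lr * hb.T @ err_o                        # gradient-descent weight updates
    W1 -= lr * Xb.T @ err_h

print(np.round(o.ravel(), 2))   # outputs should be close to the 0/1 target pattern
```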

SLIDE 17

Backpropagation Algorithm

  • Important notions:

– Predictions are pushed forward through the network (“feed-forward neural network”) – Errors are pushed backwards through the network (“backpropagation”)


SLIDE 19

Backpropagation Algorithm – Gradient Descent

  • Output of a neuron: o = g(w1·i1 + … + wn·in)
  • Assume the desired output is y; then the error is
    o − y = g(w1·i1 + … + wn·in) − y
  • We want to minimize the error, i.e., minimize
    g(w1·i1 + … + wn·in) − y
  • We follow the steepest descent of g, i.e.,

– the value where g’ is maximal

[Figure: neuron i with inputs I1, I2, I3, weights wi1, wi2, wi3, threshold t, activation function g(Si), and output Oi]

SLIDE 20

Backpropagation Algorithm – Gradient Descent

  • Hey, wait…

– the value where g’ is maximal

  • To find the steepest gradient, we have to differentiate the activation function

  • But I(z) is not differentiable!

    Y = I(0.3·X1 + 0.3·X2 + 0.3·X3 − 0.4 > 0),
    where I(z) = 1 if z is true, and 0 otherwise

SLIDE 21

Alternative Differentiable Activation Functions

  • Sigmoid Function (classic ANNs): 1/(1+e^−x)
  • Rectified Linear Unit (ReLU, since 2010s): max(0,x)
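A small sketch (my addition, not from the slides) of these two activation functions together with their derivatives, which is what gradient descent needs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # 1 / (1 + e^-z)

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)                    # derivative of the sigmoid

def relu(z):
    return np.maximum(0.0, z)               # max(0, z)

def relu_prime(z):
    return (z > 0).astype(float)            # 0 for z < 0, 1 for z > 0

z = np.linspace(-4, 4, 9)
print(np.round(sigmoid(z), 3))
print(relu(z))
```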
SLIDE 22

Properties of ANNs and Backpropagation

  • Non-linear activation function:

– May approximate any arbitrary function, even with one hidden layer

  • Convergence:

– Convergence may take time – Higher learning rate: faster convergence

  • Gradient Descent Strategy:

– Danger of ending in local optima

  • Use momentum to prevent getting stuck

– Lower learning rate: higher probability of finding global optimum

SLIDE 23

Learning Rate, Momentum, and Local Minima

  • Learning rate: how much do we adapt the weights with each step

– 0: no adaptation, use the previous weights
– 1: forget everything we have learned so far, simply use the weights that are best for the current example

  • Smaller: slow convergence, less overfitting
  • Higher: faster convergence, more overfitting
SLIDE 24

Learning Rate, Momentum, and Local Minima

  • Momentum: how much of the previous weight change do we carry over into the next step

– Small: very small steps
– High: very large steps

  • Smaller: better convergence, sticks in local minimum
  • Higher: worse convergence, does not get stuck
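The slides do not give a formula, so here is a sketch of the common textbook momentum update (parameter values are illustrative):

```python
import numpy as np

def momentum_step(w, grad, velocity, learning_rate=0.1, momentum=0.9):
    """One weight update that keeps a fraction of the previous step."""
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity

w, v = np.array([1.0, -2.0]), np.zeros(2)
grad = np.array([0.5, -0.3])               # stand-in for the gradient of the error
for _ in range(3):
    w, v = momentum_step(w, grad, v)
print(w)
```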
SLIDE 25

Dynamic Learning Rates

  • Adapting learning rates over time

– Search coarse-grained first, fine-grained later
– Allow bigger jumps in the beginning

  • Local learning rates

– Patterns in weight change differ between weights
– Allow local (per-weight) learning rates, e.g., RMSProp, AdaGrad, Adam
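As an illustration of one such adaptive scheme, here is a sketch of an RMSProp-style update (my addition; parameter names and values are illustrative, not taken from the slides):

```python
import numpy as np

def rmsprop_step(w, grad, cache, learning_rate=0.01, decay=0.9, eps=1e-8):
    """Each weight gets its own effective learning rate via a running
    average of its squared gradients."""
    cache = decay * cache + (1 - decay) * grad ** 2
    w = w - learning_rate * grad / (np.sqrt(cache) + eps)
    return w, cache

w, cache = np.array([1.0, -2.0]), np.zeros(2)
grad = np.array([0.5, -0.05])              # very different gradient magnitudes
for _ in range(3):
    w, cache = rmsprop_step(w, grad, cache)
print(w)
```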

SLIDE 26

ANNs vs. SVMs

  • ANNs have arbitrary decision boundaries

– and keep the data as it is

  • SVMs have linear decision boundaries

– and transform the data first

SLIDE 27

Recap: Feature Subset Selection & PCA

  • Idea: reduce the dimensionality of high dimensional data
  • Feature Subset Selection

– Focus on relevant attributes

  • PCA

– Create new attributes

  • In both cases

– We assume that the data can be described with fewer variables – Without losing much information

SLIDE 28

What Happens at the Hidden Layer?

  • Usually, the hidden layer is

smaller than the input layer

– Input: x1 … xn
– Hidden: h1 … hm
– n > m

  • The output can be predicted

from the values at the hidden layer

  • Hence:

– m features should be sufficient to predict y!

[Figure: network with an input layer (x1 … x5), a hidden layer, and an output layer producing y]

SLIDE 29

What Happens at the Hidden Layer?

  • We create a more compact

representation of the dataset

– Hidden: h1...hm – Which still conveys the information needed to predict y

  • Particularly interesting for

sparse datasets

– The resulting representation is usually dense

  • But what if we don’t know y?

[Figure: network with an input layer (x1 … x5), a hidden layer, and an output layer producing y]

SLIDE 30

Auto Encoders

  • Auto encoders use the same example as input and output

– i.e., they train a model for predicting an example from itself – using fewer variables

  • Similar to PCA

– But PCA provides only a linear transformation – ANNs can also create non-linear parameter transformations
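A sketch of such an auto encoder in Keras (my addition; layer sizes and the random data are illustrative): the same example is used as input and as output, with a smaller hidden layer in between.

```python
import numpy as np
import tensorflow as tf

X = np.random.rand(1000, 100).astype("float32")   # stand-in for a dataset with 100 features

inputs = tf.keras.Input(shape=(100,))
hidden = tf.keras.layers.Dense(10, activation="relu")(inputs)      # compact representation
outputs = tf.keras.layers.Dense(100, activation="sigmoid")(hidden)

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)          # predict each example from itself
```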

SLIDE 31

Denoising Auto Encoders

  • Instead of training with the same input and output

– Add random noise to input – Keep output clean

  • Result

– A model that learns to remove noise from an instance
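The denoising variant changes only the training pairs; a sketch (the noise level 0.1 is an arbitrary illustrative choice, and `autoencoder` and `X` are from the sketch above):

```python
import numpy as np

X_noisy = X + np.random.normal(0.0, 0.1, size=X.shape)              # corrupt the input
autoencoder.fit(X_noisy, X, epochs=10, batch_size=32, verbose=0)     # noisy in, clean out
```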

SLIDE 32

Stacked (Denoising) Auto Encoders

  • Stacked Auto Encoders contain several hidden layers

– Hidden layers capture more complex hidden variables and/or denoising patterns
– They are often trained consecutively:
– First: train an auto encoder with one hidden layer
– Second: train a second one-layer neural net:

  • first hidden layer as input
  • original as output

[Figure: training stages of a stacked auto encoder, from the (noisy) input via hidden 1 and hidden 2 to the output]

SLIDE 33

Footnote: Auto Encoders for Outlier Detection

  • Also known as Replicator Neural Networks

(Hawkins et al., 2002)

  • Train an autoencoder

– That captures the patterns in the data

  • Encode and decode each data point, measure deviation

– Deviation is a measure for outlier score
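A sketch of this use of reconstruction error as an outlier score (my addition; `autoencoder` and `X` are the ones from the auto encoder sketch above and are assumed to be trained):

```python
import numpy as np

reconstruction = autoencoder.predict(X, verbose=0)
outlier_score = np.mean((X - reconstruction) ** 2, axis=1)   # per-instance deviation
print(outlier_score.argsort()[-10:])                          # the ten most deviating instances
```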

SLIDE 34

From Classifiers to Feature Detectors

Some of the following slides are borrowed from https://www.macs.hw.ac.uk/~dwcorne/Teaching/

SLIDE 35

From Classifiers to Feature Detectors

What does a particular neuron do?

SLIDE 36

What Happens at the Hidden Layer?

[Figure: weights from the input pixels to one hidden neuron (high weight vs. low/zero weight): a strong signal for a horizontal line in the top row, ignoring everywhere else]

SLIDE 37

What Happens at the Hidden Layer?

[Figure: weights from the input pixels to one hidden neuron (high weight vs. low/zero weight): a strong signal for a dark area in the top left corner]

SLIDE 38

Is that enough? What Features do we Need?

Vertical Lines Horizontal Lines Circles

SLIDE 39

Is that enough? What Features do we Need?

  • What we have

– Line at the top – Dark area in the top left corner – …

  • What we want

– Vertical Line – Horizontal Line – Circle

  • Challenges

– Positional variance – Color variance

SLIDE 40

On the Quest for Higher Level Features

[Figure: lower-level neurons detect lines in specific positions; higher-level detectors combine them (horizontal line, RHS vertical line, upper loop, etc.)]

SLIDE 41

Regularization with Dropout

  • ANNs, and in particular Deep ANNs, tend to overfit
  • Example: image classification
  • Elephant: five features in the highest level layer

– big object
– grey
– trunk
– tail
– ears

  • Possible tendency to overfit:

– expect all five features to fire → “elephant”

SLIDE 42

Regularization with Dropout

  • Regularization

– Randomly deactivate hidden neurons when training an example – E.g., factor α=0.4: deactivate neurons randomly with probability 0.4

  • Example:

– big object
– grey
– trunk
– tail
– ears

(two of the five randomly deactivated, marked X on the slide) → elephant
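A sketch of dropout during training (my addition; α = 0.4 as on the slide, and the activation values are the ones used in the elephant example on the later slides):

```python
import numpy as np

rng = np.random.default_rng(42)
alpha = 0.4

hidden = np.array([0.4, 0.7, 1.0, 0.3, 0.3])      # activations of the five features
mask = rng.random(hidden.shape) >= alpha           # keep a neuron with probability 1 - α
dropped = hidden * mask                            # deactivated neurons output 0
print(dropped)
```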

SLIDE 43

Regularization with Dropout

  • Regularization

– Randomly deactivate hidden neurons when training an example – E.g., factor α=0.4: deactivate neurons randomly with probability 0.4

  • Result:

– Learned model is more robust, less overfit

  • For classification:

– use all hidden neurons

  • Problem: activation levels will be higher!

– Multiply each output with 1/(1+α)
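A sketch of the prediction-time correction, following the slides' 1/(1+α) convention and the numbers from the example on the following slides (one feature, with activation 1.0, does not fire):

```python
import numpy as np

alpha = 0.4
detected = np.array([0.4, 0.7, 0.3, 0.3])          # activations of the firing features
print(detected.sum())                               # 1.7  > 1.3: fires without correction
print(detected.sum() / (1 + alpha))                 # ~1.21 < 1.3: does not fire with correction
```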

SLIDE 44

Regularization with Dropout

  • For classification:

– use all hidden neurons

  • Problem: activation levels will be higher!

– Correction: multiply each output with 1/(1+α)

  • Example:

– features: big object, grey, trunk, tail, ears → elephant
– activations: 0.4, 0.7, 1.0, 0.3, 0.3; decision threshold: > 1.3
– without correction (one feature not firing): 0.4 + 0.7 + 0.3 + 0.3 = 1.7 > 1.3

SLIDE 45

Regularization with Dropout

  • For classification:

– use all hidden neurons

  • Problem: activation levels will be higher!

– Correction: multiply each output with 1/(1+α)

  • Example:

– features: big object, grey, trunk, tail, ears → elephant
– activations: 0.4, 0.7, 1.0, 0.3, 0.3; decision threshold: > 1.3
– with correction (one feature not firing): (5/7)·(0.4 + 0.7 + 0.3 + 0.3) = 1.21 < 1.3

SLIDE 46

Regularization with Dropout

  • For classification:

– use all hidden neurons

  • Problem: activation levels will be higher!

– Correction: multiply each output with 1/(1+α)

  • Example:

– features: big object, grey, trunk, tail, ears → elephant
– activations: 0.4, 0.7, 1.0, 0.3, 0.3; decision threshold: > 1.3
– with correction: (5/7)·(0.4 + 1.0 + 0.3 + 0.3) = 1.43 > 1.3

SLIDE 47

Architectures: Convolutional Neural Networks

  • Special architecture for image processing
  • Problem: imagine a 4k resolution picture (3840x2160)

– Treating each pixel as an input: 8M input neurons
– Connecting that to a hidden layer of the same size: 8M² = 64 trillion weights to learn
– This is hardly practical…

  • Solution:

– Convolutional neural networks

SLIDE 48

Architectures: Convolutional Neural Networks

  • Two parts:

– Convolution layer – Pooling layer

  • Stacks of those are usually used
SLIDE 49

Architectures: Convolutional Neural Networks

  • Convolution layer

– Each neuron is connected to a small n x n square of the input neurons – i.e., number of connections is linear, not quadratic

  • Use different neurons for detecting different features

– They can share their weights – (intuition: a horizontal line looks the same everywhere)

SLIDE 50

Architectures: Convolutional Neural Networks

  • Pooling layer (a.k.a. subsampling layer)

– Use only the maximum value of a neighborhood of neurons
– Think: downsizing a picture
– Number of neurons is divided by four with each pooling layer

SLIDE 51

Architectures: Convolutional Neural Networks

  • The big picture

– With each pooling/subsampling step: 4 times fewer neurons
– After a few layers, we have a decent number of inputs
– Feed those into a fully connected ANN for the actual classification
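A sketch of this convolution / pooling / fully-connected pattern in Keras (my addition; layer sizes and input shape are illustrative, not the slides' exact architecture):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D((2, 2)),          # each 2x2 pooling step quarters the neurons
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),                      # feed into a fully connected part
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```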

SLIDE 52

Architectures: Convolutional Neural Networks

  • The 4K picture revisited (3840x2160):

– Treating each pixel as an input: 8M input neurons – Connecting that to a hidden layer of the same size: 8M² = 64 trillion weights to learn

  • Number of connections (weights to be learned) in the first

convolutional layer:

– Assume each hidden neuron is connected to a 16x16 square
– and we learn 256 hidden features (i.e., 256 layers of convolutional neurons)
– 16x16x256x8M = still 526 billion weights

  • But: neurons for the same hidden feature share their weight

– Thus, it’s just 16x16x256 = 65k weights
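A quick back-of-the-envelope check of these numbers (my addition, using Python as a calculator):

```python
pixels = 3840 * 2160                 # ~8.3M input neurons for a 4K image
print(pixels * pixels)               # fully connected hidden layer: tens of trillions of weights
print(16 * 16 * 256 * pixels)        # 16x16 receptive fields, 256 features, no sharing: ~5e11
print(16 * 16 * 256)                 # with weight sharing: 65,536 weights
```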

SLIDE 53

Architectures: Convolutional Neural Networks

  • A nice visualization to play around with, for handwritten digit recognition:

http://scs.ryerson.ca/~aharley/vis/conv/flat.html

SLIDE 54

Architectures: Convolutional Neural Networks

  • In practice, several layers are used
  • Example (pictured on the original slide):

– Google’s GoogLeNet (Inception)
– Current state of the art in image classification

  • Can be used as a pre-trained network
SLIDE 55

Turning a Neural Network Upside Down

  • Assume you have a neural network trained for image classification

– Reverse application: given label, synthesize image
– Additional constraint (prior): neighboring pixels correlate

https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html

SLIDE 56

Turning a Neural Network Upside Down

  • Asking for prototype pictures of certain labels

https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html

SLIDE 57

Making a Neural Network Daydream

  • First step: classify an image
  • Second step: amplify (i.e., use the pair of input image and predicted label as an additional training example)

https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html


SLIDE 59

Neural Networks for Arts

  • Train a neural network that extracts both artistic and content features

https://arxiv.org/pdf/1508.06576.pdf

SLIDE 60

Neural Networks for Arts

  • Then: generate picture with a given set of contents and style

https://arxiv.org/pdf/1508.06576.pdf

SLIDE 61

Reusing Pre-trained Networks

  • The output of a network can be used as an input to yet another classifier (neural network or other)

  • Think: a multi-label image classifier as an auto-encoder
  • Example: predict movie genre from poster

– Using an image classifier trained for object recognition

http://demo.caffe.berkeleyvision.org/
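A sketch of reusing a pre-trained image network as a feature extractor in Keras (my addition; the Inception weights are the ones bundled with Keras, and the five-label genre head is purely illustrative):

```python
import tensorflow as tf

base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False,
                                          pooling="avg", input_shape=(299, 299, 3))
base.trainable = False                        # keep the pre-trained weights fixed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(5, activation="sigmoid"),   # e.g., 5 movie-genre labels
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```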

SLIDE 62

What does an Artificial Neural Network Learn?

SLIDE 63

What does an Artificial Neural Network Learn?

  • Image recognition networks can be attacked

– small changes to pixels, barely noticed by humans

Goodfellow et al.: Explaining and Harnessing Adversarial Examples, 2015

SLIDE 64

Possible Implications

  • Face Detection

Sharif et al.: Accessorize to a Crime: Real and Stealthy Attacks on State-of-the-Art Face Recognition, 2016

SLIDE 65

Possible Implications

  • Autonomous Driving

Papernot et al.: Practical Black-Box Attacks against Machine Learning, 2017

SLIDE 66

Using ANNs for Time Series Prediction

  • Last week, we learned about time series prediction

– Long term trends
– Seasonal effects
– Random fluctuation
– …

  • Scenario: predict the continuation of a time series

– let’s use the last five values as features (sliding window)

[Figure: network with inputs T−5, T−4, T−3, T−2, T−1, a hidden layer, and output T]
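A sketch of this windowing step (my addition; the sine series is a stand-in for any time series):

```python
import numpy as np

series = np.sin(np.linspace(0, 20, 200))           # stand-in time series
window = 5

X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]                                 # the value to predict (T)
print(X.shape, y.shape)                             # (195, 5) (195,)
```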

SLIDE 67

Using ANNs for Time Series Prediction

  • Assume that this is running continuously

– we will always just use the last five examples – we cannot detect longer term trends

  • Solution

– introduce a memory
– Implementation: backward loops

[Figure: network with inputs T−5 … T−1, a hidden layer with backward loops, and output T]

SLIDE 68

Long Short Term Memory Networks (LSTM)

  • Notion of a recurrent neural network

– A folded deep neural network
– Note: influence of the past decays over time

  • LSTMs are special recurrent neural networks
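A sketch of an LSTM for the windowed time series in Keras (my addition; layer sizes are illustrative, and `X`, `y` refer to the windowing sketch above):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(16, input_shape=(5, 1)),   # 5 time steps, 1 feature each
    tf.keras.layers.Dense(1),                       # predict the next value T
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X.reshape(-1, 5, 1), y, epochs=10)      # X, y from the windowing sketch
```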
SLIDE 69

CNNs for Time Series Prediction

  • Notion: time series also have typical features

– Think: trends, seasonal variation, ...

Zheng et al.: Time Series Classification Using Multi-Channels Deep Convolutional Neural Networks, 2014

SLIDE 70

word2vec

  • word2vec is similar to an auto encoder for words
  • Training set: a text corpus
  • Training task variants:

– Continuous bag of words (CBOW): predict a word from the surrounding words
– Skip-gram: predict the surrounding words of a word

Xin Rong: word2vec Parameter Learning Explained
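A sketch of training word2vec with gensim (my addition; the toy corpus and parameters are illustrative, and gensim 4.x uses `vector_size` where older versions used `size`):

```python
from gensim.models import Word2Vec

sentences = [["the", "king", "rules", "the", "country"],
             ["the", "queen", "rules", "the", "country"]]

# sg=1 selects skip-gram, sg=0 selects CBOW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["king"][:5])     # the first dimensions of the learned vector
```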

SLIDE 71

word2vec

  • word2vec creates an n-dimensional vector for each word
  • Each word becomes a point in a vector space
  • Properties:

– Similar words are positioned close to each other
– Relations have the same direction

SLIDE 72

word2vec

  • Arithmetics are possible in the vector space

– king – man + woman ≈ queen

  • This allows for finding analogies:

– king : man ↔ queen : woman
– knee : leg ↔ elbow : forearm
– Hillary Clinton : Democrat ↔ Donald Trump : Republican
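A sketch of this vector arithmetic via gensim's analogy query (my addition; `model` is the one from the training sketch above or a pre-trained model, and with a tiny toy corpus the result will not be meaningful, it only shows the mechanism):

```python
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)    # with a good model, ideally something like [('queen', ...)]
```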

SLIDE 73

word2vec

  • Pre-trained models exist

– e.g., on Google News Corpus or Wikipedia

  • Can be downloaded and used instantly
SLIDE 74

From word2vec to anything2vec

  • Vector space embeddings have recently become en vogue

– Basically, everything that can be expressed as sequences can be processed by the word2vec pipeline

  • There are vector space embeddings for…

– Graph2vec (social graphs)
– Doc2vec (entire documents)
– RDF2Vec (RDF graphs)
– Chord2Vec (music chords)
– Audio2vec
– Video2vec
– Gene2vec (amino acid sequences)
– Emoji2vec
– ...

SLIDE 75

Summary

  • Artificial Neural Networks

– Are a powerful learning tool
– Can approximate arbitrary functions / decision boundaries

  • Deep neural networks

– ANNs with multiple hidden layers
– Hidden layers learn to identify relevant features
– Many architectural variants exist

  • Pre-trained models

– e.g., for image recognition
– word embeddings
– ...

SLIDE 76

Questions?