Advanced Analytics in Business [D0S07a] / Big Data Platforms & Technologies [D0S06a]
Artificial Neural Networks, Deep Learning, Q-learning, Opening the Black Box (Part 2)
Overview
Artificial neural networks
Deep learning (conceptual)
Q-learning
Opening the black box (part 2)
2
Artificial neural networks
3
The basic idea
Very loosely based on how (we think) the human brain works A collection of software “neurons” are created and connected together, allowing them to send messages to each other Next, the network is asked to solve a problem, which it attempts to do over and over, each time strengthening the connections that lead to success and diminishing those that lead to failure
4
A brief history
https://beamandrew.github.io/deeplearning/2017/02/23/deep_learning_101_part1.html
5
A brief history
An electronic brain (1940s): since the dawn of computing, researchers have been thinking about the idea of an “intelligent”, perhaps even “conscious” machine
Alan Turing laid out several criteria to assess whether a machine could be said to be intelligent in Computing Machinery and Intelligence (the Turing test)
The perceptron (1950s): early work in machine learning was inspired by the (then) working theories on the human brain
Frank Rosenblatt kickstarts the field by introducing the "perceptron": a simplified mathematical representation of a neuron. He was convinced that this would quickly lead to true AI
Into the AI winter (1960s-80s): Marvin Minsky, considered as one of the fathers of AI, is less convinced
Along with Seymour Papert, Minsky wrote a book entitled Perceptrons that in fact ended the optimism around the perceptron. They showed that the perceptron was incapable of learning the simple exclusive-or (XOR) function
Backpropagation to the rescue (1980s-90s): interest returns to neural networks
Geoff Hinton shows that neural networks with many hidden layers (i.e. consisting of more than one perceptron) could be effectively trained by a relatively simple procedure, called "backpropagation"
Such networks have the ability to learn any function, a result known as the "universal approximation theorem", and with that, neural networks were hot again
The idea of using multiple perceptrons in a layered fashion was not new, though it was unclear how exactly such networks could be "trained"
The backpropagation algorithm works by starting from a network's "error" and "back-propagating" it throughout all layers to adjust their parameters
This leads to some early successes: multi-layer perceptrons (MLP), and the first convolutional neural networks (CNNs) to recognize handwritten digits by Yann LeCun at AT&T Bell Labs ("LeNet")
6
A brief history
A second AI winter (1990s-early 2000s): the MLP approach didn’t scale well to larger problems
Computing power was lacking for larger networks
Meanwhile, by the 90s, the support vector machine (SVM) was rapidly taking center stage as the method of choice
Neural networks were left behind once again
The field shifts to a much more theoretical angle
Deep learning (early 2000s): around 2006, Hinton introduces the idea of “unsupervised pretraining” and “deep belief nets”
Train a simple 2-layer unsupervised model, freeze all its parameters, add a new layer on top and just train the parameters for the new layer Keep adding and training layers until you have a “deep network”
Deep learning unleashed (2010s): based on Hinton’s work, more and more research papers began to take form
In 2010, a large database known as "Imagenet" containing millions of labeled images was created and published by a research group at Stanford. This database was coupled with the annual Large Scale Visual Recognition Challenge (LSVRC), where contestants would build computer vision models
In the first two years of the contest, the top models had error rates of about 25%. In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton entered a submission that would halve the error rate!
It combined several critical components that would go on to become mainstays in deep learning models: the use of graphics processing units (GPUs) to train the model, a method to reduce overfitting known as dropout, and the rectified linear activation unit (ReLU)
The network went on to become known as "Alexnet" and the paper describing it has been cited nearly 10000 times since it was published
7
A brief history
Deep learning unleashed (2010s-today): many innovations would follow after this result
Appearance of large, high-quality labeled datasets
Massively parallel computing with GPUs, TPUs
New activation functions
Improved architectures: CNNs, RCNNs, GANs, RNNs, and many others
Software support: Tensorflow, Theano, Keras, Mxnet, CNTK, PyTorch, and many others
New regularization techniques to protect against overfitting: dropout, batch normalization, data augmentation
New optimizers: from stochastic gradient descent (SGD) to RMSprop, ADAM and others
Focus returns to practice, experiments, empirical approaches
8
Foundations: the perceptron
Models one neuron taking a number of inputs and providing one output
Every input unit's output is multiplied by a weight and summed by the perceptron unit. The final output (or activation) of the perceptron unit is equal to the result of an "activation function" applied to the weighted sum of the inputs
9
Foundations: activation functions
Logistic (sigmoid): f(x) = 1/(1 + e^(−x))
Output between 0 and 1
Hyperbolic tangent (tanh): f(x) = (e^x − e^(−x)) / (e^x + e^(−x))
Output between -1 and 1
Rectified Linear Unit (ReLU): f(x) = max(0, x)
Output between 0 and +∞
Many modifications exist, very common method
Others
Linear: f(x) = x
Exponential: f(x) = e^x
Radial Basis Function (RBF)
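As a quick illustration, these activation functions are one-liners in plain Python (a minimal sketch using only the standard library):

import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))  # output in (0, 1)

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))  # output in (-1, 1)

def relu(x):
    return max(0.0, x)  # output in [0, +inf)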
10
Foundations: the full picture
Combination (transfer) functions
Most neural networks use a linear combination over the inputs, like a weighted sum, though other approaches exist as well, as we will see later on. Note: the difference between the "transfer function", the "net input" and the "activation function" is often not explicitly stated. Most references simply talk about the "activation function". Sometimes, both operations are described as separate "layers" (e.g. two neurons, one performing the weighted sum and the other applying the activation function)
Bias (threshold)
Very often, an additional weight is added as a bias Basically: an additional input fixed to 1 and with its own weight
11
Foundations: the perceptron
12
So how to train it?
Feedforward is easy: just plug in the inputs and feed them through the network, obtaining an output
To train, we'll iteratively adjust the weights based on the error (loss) function defined over the predicted output and desired target
E.g. in the case of a simple perceptron: w_i = w_i + η × (y − ŷ) × x_i
We're nudging the weights in order to minimize the error
The learning rate η determines the "speed" of the convergence
Higher: quicker towards the minimum, but risk of overshooting
Lower: slower towards the minimum, risk of getting trapped in local minima
Adaptive learning rate: start high and decrease over time
Momentum based: also prevents overshooting
A lot of research in this field!
13
So how to train it?
More generally
The loss is a function of the weights given a piece of training data
Minimize the error using the gradient of the loss function
Gradient descent is the process of minimizing a function by following the gradients of the cost function
This involves knowing the form of the cost as well as its derivative, so that from a given point you know the gradient and can move downhill towards the minimum value
Stochastic gradient descent
One iteration: one instance fed-forward, weights are updated after each instance One epoch: one full pass over all instances in the training set Non-stochastic gradient descent: first iterate over all instances, average the error, then update the weights (takes longer, but more stable)
14
So how to train it?
import math

X = [[0, 1], [1, 0], [2, 2], [3, 4], [4, 2], [5, 2], [4, 1], [5, 0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

weights = [0, 0, 0]  # Initialize the weights (first weight is the bias)

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def predict(instance, weights):
    output = weights[0]  # bias term
    for i in range(len(weights) - 1):
        output += weights[i + 1] * instance[i]
    return sigmoid(output)

def train(instance, weights, y_true, l_rate=0.01):
    prediction = predict(instance, weights)
    error = y_true - prediction
    weights[0] = weights[0] + l_rate * error
    for i in range(len(weights) - 1):
        weights[i + 1] = weights[i + 1] + l_rate * error * instance[i]
    return weights
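The epoch loop that produces the predictions on the next slide is not shown on this one; a hedged sketch driving the functions above:

for epoch in range(2000):  # use 20 for the intermediate result shown
    for instance, y_true in zip(X, y):
        weights = train(instance, weights, y_true)

for instance in X:
    print(instance, round(predict(instance, weights), 2))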
15
So how to train it?

Without training:
Instance | y | ŷ
[0, 1] | 0 | 0.5
[1, 0] | 0 | 0.5
[2, 2] | 0 | 0.5
[3, 4] | 0 | 0.5
[4, 2] | 1 | 0.5
[5, 2] | 1 | 0.5
[4, 1] | 1 | 0.5
[5, 0] | 1 | 0.5

After 20 epochs:
Instance | y | ŷ
[0, 1] | 0 | 0.36
[1, 0] | 0 | 0.57
[2, 2] | 0 | 0.49
[3, 4] | 0 | 0.42
[4, 2] | 1 | 0.71
[5, 2] | 1 | 0.79
[4, 1] | 1 | 0.78
[5, 0] | 1 | 0.89

After 2000 epochs:
Instance | y | ŷ
[0, 1] | 0 | 0.00
[1, 0] | 0 | 0.10
[2, 2] | 0 | 0.04
[3, 4] | 0 | 0.01
[4, 2] | 1 | 0.94
[5, 2] | 1 | 0.99
[4, 1] | 1 | 0.99
[5, 0] | 1 | 0.99
16
However...
x1 | x2 | y
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0

Learned weights after training:
[0.004697241052453581, -0.009743527387551375, -0.00476408160440969]
[0, 0] 0 -> 0.5011743081039458 [0, 1] 1 -> 0.4999832898620173 [1, 0] 1 -> 0.49873843109337934 [1, 1] 0 -> 0.4975474276853999
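These numbers can be reproduced with the perceptron code from the earlier slide (a hedged sketch reusing its train and predict functions):

X_xor = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_xor = [0, 1, 1, 0]
weights = [0, 0, 0]

for epoch in range(2000):
    for instance, y_true in zip(X_xor, y_xor):
        weights = train(instance, weights, y_true)

for instance in X_xor:
    print(instance, predict(instance, weights))  # every prediction stays near 0.5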
This is the XOR problem
17
So far, not very spectacular...
One neuron on its own is hardly a brain Multilayer Perceptron (MLP): stack different neurons in layers
Input layer Hidden layer Output layer Connect all outputs with all inputs of next layer ("fully connected" or "dense" architecture)
The question is now: how to train?
18
Backpropagation
We can’t use the same approach as we did for a single perceptron as we don’t know what the “true outcome” should be for the lower layers
This issue took quite some time to solve Eventually, a method called “backpropagation” was devised to overcome this
19
Backpropagation
Note that feedforward is still easy
http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html
20
Backpropagation
We can also still compare the predicted output with the expected one, from which we can derive a loss value
http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html
21
Backpropagation
The idea of backpropagation is to “back propagate” the error through the network
Using the chain rule of partial derivatives
http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html
22
Backpropagation
Using this, we know how to shift the weights
http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html
Further information:
https://victorzhou.com/blog/intro-to-neural-networks/
http://www.emergentmind.com/neural-network
https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
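To make the chain-rule mechanics concrete, here is a hedged numpy sketch of backpropagation on a tiny two-layer network (MSE loss, sigmoid activations; all names are illustrative) that learns the XOR function a single perceptron could not:

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)  # hidden layer (4 neurons)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)  # output layer

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

lr = 0.5
for epoch in range(10000):
    # feedforward
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    # back-propagate the error using the chain rule (MSE loss)
    d_out = (y_hat - y) * y_hat * (1 - y_hat)  # gradient at the output layer
    d_hid = (d_out @ W2.T) * h * (1 - h)       # error propagated to the hidden layer
    # gradient descent step on all weights and biases
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_hid; b1 -= lr * d_hid.sum(axis=0)

print(y_hat.round(2))  # approaches [[0], [1], [1], [0]]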
23
Further aspects
Here, we have used one output neuron, but more than one output is possible as well (e.g. for a multi-class problem) Multiple hidden layers can be added in as well Note that the neurons in the hidden layers commonly use different activation functions than the output neurons
E.g. ReLU is common for the hidden layers. The activation function of the output layer depends on the task (regression, binary classification, multiclass)
For multiclass, a "softmax" layer is added on top of the output neurons to normalize their outputs so they sum to 1
The discussion regarding backpropagation reveals that the error function used and activation functions should be differentiable For the perceptron model, a naïve error function was used (absolute difference)
Many other error (or “loss”) functions exist as well…
24
Further aspects: loss functions
Common choices include:
For regression:
Mean squared error (MSE): E = ½ ∑_{i=1..N} (ŷ_i − y_i)²
For classification:
Cross entropy: E = − ∑_{i=1..N} ∑_{c=1..C} y_ic × ln(ŷ_ic)
Cross entropy for binary classification: E = − ∑_{i=1..N} (y_i × ln(ŷ_i) + (1 − y_i) × ln(1 − ŷ_i))
Note that a great deal of research regarding new architectures consists of finding appropriate loss functions
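These loss functions are straightforward to express in numpy (a minimal sketch; y_true and y_pred are arrays of targets and predicted values):

import numpy as np

def mse(y_true, y_pred):
    # E = 1/2 * sum_i (y_hat_i - y_i)^2
    return 0.5 * np.sum((y_pred - y_true) ** 2)

def cross_entropy(y_true, y_pred):
    # multiclass: (N, C) arrays of one-hot targets and predicted probabilities
    return -np.sum(y_true * np.log(y_pred))

def binary_cross_entropy(y_true, y_pred):
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))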
25
Further aspects: gradient descent
Recall:
Normal gradient descent ("batch" gradient descent) presents all training instances to the network
One update of the weights follows, based on gradients averaged over the whole training set
Very precise, but very time-consuming
Stochastic gradient descent updates weights after every instance
Quicker, but more sensitive to particular examples (looks like a "drunk walk" towards the minimum) Might need to shuffle instances every epoch
Most implementations hence use a "mini-batch" approach
Shuffle the training set, present in small batches Update weights after each mini-batch
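A hedged sketch of such a mini-batch loop (X, y, num_epochs and the update_weights function are hypothetical placeholders):

import numpy as np

batch_size = 32
indices = np.arange(len(X))
for epoch in range(num_epochs):
    np.random.shuffle(indices)  # reshuffle the training set every epoch
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        update_weights(X[batch], y[batch])  # one update per mini-batch, on averaged gradients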
26
Further aspects: initialization
In our simple example, we have initialized all weights to 0, though it is also common to initialize the weights randomly
A possible approach here as well is preliminary training Good starting values for the weights are essential for getting good solutions (and not getting stuck in local minima) Preliminary training uses a small number of random starting weights and takes a few iterations from each Use the best of the final values as the new starting value and continue with those
27
Further aspects: backpropagation alternatives
Most implementations will use backpropagation, though other approaches to train an artificial neural network exist as well
Advanced nonlinear optimization algorithms Hessian based Newton based methods Conjugate gradient Levenberg-Marquardt Genetic algorithm based …
28
http://cs231n.github.io/neural-networks-3/
Further aspects: beyond stochastic gradient descent
Even when using backpropagation, different optimization strategies exist other than stochastic gradient descent
Momentum Rmsprop Adagrad Adadelta Eve Adabound ... Adaptive learning rate tuning, see e.g. the learning rate finder (https://github.com/surmenok/keras_lr_finder) and cyclical training (bouncing the learning rate back and forth, https://arxiv.org/abs/1506.01186)
A lot of research is being put into this field
29
Further aspects: ReLU
Another interesting aspect to note is the popularity of ReLU (f(x) = max(0, x)) as an activation function in the hidden nodes, instead of the previously used tanh or sigmoid functions
ReLU reduces the likelihood of the vanishing gradient
The problem of vanishing gradients occurs for activation functions whose gradient becomes increasingly small as the absolute value of x increases, causing updates in "lower" layers to happen very slowly and get "vanished out"
The constant gradient of ReLUs results in faster learning
Variants such as noisy or leaky ReLUs are also commonly used
https://towardsdatascience.com/activation-functions-and-its-types-which-is-better-a9a5310cc8f
30
Further aspects: preventing overfitting
Continuous training will continue to lower the error on the training set, but will eventually lead to overfitting (memorizing the training data) As such, validation is crucial (commonly with a validation split)
Early stopping: stop training when validation error has reached its minimum level Regularization (penalizing large weights) is another approach, as larger weights generally are a sign that overfitting is occurring
31
Further aspects: preventing overfitting
Dropout is another method: at each training stage, individual nodes are either "dropped out" of the net with a given probability, so that a reduced network is left: incoming and outgoing edges to a dropped-out node are also removed
Improves training and reduces node interactions
Forces the network to learn alternative pathways, i.e. enforces redundancy
Leading to better generalization
Batch normalization has also become popular: one often normalizes the input layer by adjusting and scaling the activations; if the input layer is benefiting from it, why not do the same thing also for the values in the hidden layers, that are changing all the time?
Batch normalization reduces the amount by which the hidden unit values shift around (covariate shift)
Allows higher learning rates, because batch normalization makes sure that no activation goes really extreme
Reduces overfitting because it has a slight regularization effect: similar to dropout, it adds some noise to each hidden layer's activations
Some people have recently argued against dropout altogether, in favor of (heavily) using batch normalization only
Though not all: http://nyus.joshuawise.com/batchnorm.pdf (Batch Normalization for Improved DNN Performance, My Ass)
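In Keras (used later in this deck), all three techniques are one-liners; a hedged sketch with illustrative layer sizes:

from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization
from keras.callbacks import EarlyStopping

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(784,)))
model.add(BatchNormalization())  # normalize the hidden activations per mini-batch
model.add(Dropout(0.5))          # randomly drop half of the nodes during training
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Early stopping: halt once the validation loss stops improving
early_stop = EarlyStopping(monitor='val_loss', patience=3)
# model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])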
32
Example
Let's summarize a bit... https://playground.tensorflow.org/
33
MLPs are already powerful
import keras
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.datasets import mnist
from matplotlib import pyplot as plt
import numpy as np
from PIL import Image

(X_train, y_train), (X_test, y_test) = mnist.load_data()
print(X_train.shape)  # (60000, 28, 28)

plt.imshow(X_train[0], cmap='gray'); plt.show()
print(y_train[0])  # 5

X_train = X_train.astype('float32') / 255  # 60000 train samples
X_test = X_test.astype('float32') / 255    # 10000 test samples

num_classes = 10
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
34
MLPs are already powerful
model = Sequential()
model.add(Flatten(input_shape=(28, 28)))
model.add(Dense(8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
model.summary()

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
35
MLPs are already powerful
batch_size = 128
epochs = 20

model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs,
          verbose=2, validation_data=(X_test, y_test))

score = model.evaluate(X_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

# Epoch 1/20
# - 3s - loss: 0.9510 - acc: 0.7003 - val_loss: 0.4914 - val_acc: 0.8681
# Epoch 2/20
# - 2s - loss: 0.4345 - acc: 0.8772 - val_loss: 0.3707 - val_acc: 0.8979
# ...
# Epoch 20/20
# - 2s - loss: 0.2511 - acc: 0.9295 - val_loss: 0.2701 - val_acc: 0.9262

# Test loss: 0.27007879534959794
# Test accuracy: 0.9262
36
MLPs are already powerful
Layer (type)         Output Shape   Param #
=================================================
flatten_1 (Flatten)  (None, 784)    0
dense_1 (Dense)      (None, 8)      6280 = 784 * 8 + 8
dense_2 (Dense)      (None, 8)      72   = 8 * 8 + 8
dense_3 (Dense)      (None, 10)     90   = 8 * 10 + 10
=================================================
Total params: 6,442
37
MLPs are already powerful, but how do they learn?
38
MLPs are already powerful, but how do they learn?
[[0 0 0.016035 *0.983951* 0 0.000015 0 0 0 0]] # This is a three?
39
Deep learning
40
Deep what?
The deep in deep learning isn’t a reference to any kind of deeper understanding achieved by the approach, but stands for the idea of successive layers of representations
Other appropriate names for the field could have been:
Layered representations learning Hierarchical representations learning Differential function learning
Modern deep learning often involves tens or even hundreds of successive layers of representations
Enabled by computational power rise Main contributions follow from architecture and loss functions
The goal is to create algorithms that can take in very unstructured data, like images, audio waves or text blocks (things traditionally very hard for computers to process) and predict the properties of those inputs – Andrew Ng
41
We'll look at the following types
Convolutional neural networks Recurrent neural networks Generative adversarial networks Reinforcement learning Embeddings and representational learning: when we discuss text mining
42
Convolutional neural networks (CNNs)
Our “deep” MLP already does pretty well on a simple data set
Black-white image Small Only 10 classes
How about a data set with pictures of 1000 classes? (Cats, dogs, cars, boats, …)
Increase number of layers? Hidden units? Lots of weights to train!
43
Convolutional neural networks (CNNs)
In 2010, a large database known as “Imagenet” containing millions of labeled images was created and published by a research group at Stanford In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton entered a submission that would halve the error rate This model combined several critical components
Probably the most important piece was the use of graphics processing units (GPUs) to train the model They also introduced a method to reduce overfitting known as dropout and used the rectified linear activation unit (ReLU)
The network went on to become known as “Alexnet” and the paper describing it has been cited nearly 10000 times since it was published
And even before this
First convolutional neural networks (CNNs) to recognize handwritten digits by Yann Lecun at AT&T Bell Labs (“LeNet”)
44
Convolutional neural networks (CNNs)
Series of convolutional, pooling layers, followed by fully connected layers, and a softmax output layer
width × height × 3 input layer for colored images
Convolutional layer does most of the heavy lifting: learns a number of "filters" while retaining the spatial topology
I.e. don't fully connect everything
Pooling layer applies simple downsampling (i.e. downsizing the image)
45
Convolutional neural networks (CNNs)
https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
Look at every w_f × h_f × 3 window and apply the filter: a spatially local weighted sum
Do this for every such window by moving it n pixels at a time (the stride)
Paddings can be defined for the edges of an image
Every window leads to one output: together, a convolved image
The same kernel is used for all positions in the image: parameter sharing!
A convolutional layer learns multiple filters (the depth)
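A hedged numpy sketch of this sliding-window operation for a single-channel image and one filter (no padding; names are illustrative):

import numpy as np

def convolve2d(image, kernel, stride=1):
    """Naive 'valid' convolution: slide the kernel over the image."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(window * kernel)  # spatially local weighted sum
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 "image"
kernel = np.array([[1., 0., -1.]] * 3)            # simple vertical edge detector
print(convolve2d(image, kernel).shape)            # (3, 3): the convolved image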
46
Convolutional neural networks (CNNs)
See example: http://scs.ryerson.ca/~aharley/vis/
47
Convolutional neural networks (CNNs)
Many variations have been developed
Newer architectures remove the pooling layers Use dropout, batch normalization Data augmentation: add in variety (e.g. see https://keras.io/preprocessing/image/, https://github.com/aleju/imgaug)
Prevents overfitting Can also be applied at prediction time (test-time augmentation)
LeNet: the first successful applications of CNNs, developed by Yann LeCun in 1990’s AlexNet: the first work that popularized CNNs in Computer Vision ZF Net: a Convolutional Network from Matthew Zeiler and Rob Fergus; an improvement on AlexNet by tweaking the architecture hyperparameters GoogLeNet: main contribution was the development of an “Inception Module” that dramatically reduced the number of parameters in the network VGGNet: showed that the depth of the network is a critical component for good performance (140M parameters) ResNet and ResNeXt: features special skip connections and a heavy use of batch normalization. The architecture is also missing fully connected layers at the end of the network SqueezeNet: achieves AlexNet performance levels with 50x fewer parameters, leading to a very small model that is easy to deploy on e.g. smart devices
48
Transfer learning
While data is a critical part of creating the network, the idea of transfer learning has helped to lessen the data demands
Transfer learning is the process of taking a pre-trained model and “fine-tuning” the model with your own dataset The idea is that this pre-trained model will act as a feature extractor: you remove the last layer(s) of the network and replace it with your own classifier, and only retrain those weights while keeping the rest frozen Or simply keep as is but only retrain last layers When we think about the lower layers of the network, we know that they will detect features like edges and curves Rather than training the whole network through a random initialization of weights, we can use the weights of the pre-trained model and focus on the more important layers (ones that are higher up) for training "Clever" example: https://teachablemachine.withgoogle.com/
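In Keras, the freeze-and-replace recipe described above looks roughly like this hedged sketch (VGG16 as the pre-trained feature extractor; layer sizes and the 5-class head are illustrative):

from keras.applications import VGG16
from keras.models import Model
from keras.layers import Dense, Flatten

# Load a network pre-trained on Imagenet, without its classification head
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False  # freeze the pre-trained weights

x = Flatten()(base.output)
x = Dense(256, activation='relu')(x)
out = Dense(5, activation='softmax')(x)  # our own classifier on top

model = Model(inputs=base.input, outputs=out)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# model.fit(...) now only trains the new top layers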
Recently also heavily applied in the textual domain!
49
Convolutional neural networks (CNNs) and other image tasks
Basic CNNs are easy to set up for image classification
Taking an input image and outputting a class number out of a set of categories
For object localization, the goal is not only to produce a class label but also a bounding box that describes where the object is in the picture
RCNN, Fast RCNN, Faster RCNN, MultiBox, Bayesian Optimization, Multi-region, RCNN Minus R, Image Windows
For object segmentation, the task is to output a class label as well as an outline of every object in the input image
Semantic Seg, Unconstrained Video, Shape Guided, Object Regions, Shape Sharing
50
Convolutional neural networks (CNNs) and other image tasks
A basic CNN setup can also be used to localize objects of interest by preprocessing the data appropriately
At prediction time, the model is queried for each slice over the image
51
Convolutional neural networks (CNNs) and other tasks
One dimensional CNNs have been used for text and time series analysis as well Capsule networks (Geoffrey Hinton) try to remove standing issues of the traditional CNN architecture
Standard CNNs focus heavily on small texture and edge based filters but have difficulty with pose and overall composition
52
Convolutional neural networks (CNNs) and other tasks
Image classification, segmentation, detection Face recognition and classification (from proper to “bad” science)
Alibaba launches 'smile to pay' facial recognition system at KFC in China: https://www.cnbc.com/2017/09/04/alibaba-launches-smile-to-pay-facial-recognition-system-at-kfc-china.html
Beijing KFC is pioneering technology to try to predict and remember people's fast food choices: https://www.theguardian.com/technology/2017/jan/11/china-beijing-first-smart-restaurant-kfc-facial-recognition
New AI can guess whether you're gay or straight from a photograph: https://www.theguardian.com/technology/2017/sep/07/new-artificial-intelligence-can-tell-whether-youre-gay-or-straight-from-a-photograph
Facial Recognition Is Accurate, if You're a White Guy: https://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html
Pose and gait detection Business applications, e.g. in insurance (take a picture of your car to file a damage claim), fraud detection (forged signatures), etc. Stylistic and artistic use cases, e.g. photo editing and processing
53
Convolutional neural networks (CNNs) and style transfer
https://mspoweruser.com/popular-ios-app-prisma-coming-windows-10-month/
54
Convolutional neural networks (CNNs) and style transfer
Uses a pre-trained network with three input images: original, style, and combined image A simple optimizer is used to minimize a custom loss by tweaking the combined image (starting from the original or a random image) Content loss (difference original and combination), style loss (difference style and combination) and variance loss (keep generated image smooth)
55
Convolutional neural networks (CNNs) and deep dreaming
https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html
"One way to visualize what goes on is to turn the network upside down and ask it to enhance an input image in such a way as to elicit a particular interpretation. Say you want to know what sort of image would result in "banana": start with an image full of random noise, then gradually tweak the image towards what the neural net considers a banana. By itself, that doesn't work very well, but it does if we impose a prior constraint that the image should have similar statistics to natural images, such as neighboring pixels needing to be correlated."
56
Convolutional neural networks (CNNs) and deep dreaming
Convolutional layer outputs attain higher values when the corresponding pattern has been detected
Therefore, we choose some layers in the network and aim to maximize the intensity of their output
The selection of the layers to maximize depends primarily on whether we want to focus on lower or higher level feature representations (or perhaps a combination)
A continuity loss (total variation loss) gives the image local coherence and avoids messy blurs
An L2 norm loss on the resulting image prevents pixels from taking very high values (otherwise, the image overall would become too bright)
57
One-shot learning
Deep neural networks are really good at learning from high dimensional data like images or spoken language, but only when they have huge amounts of labelled examples to train on
Humans on the other hand, are capable of one-shot learning Take a human who’s never seen a tomato before, and show them a single picture of a tomato, they will probably be able to distinguish tomatoes from other fruits with astoundingly high precision Trivial to us, but not so much for a computer
58
One-shot learning
1 nearest neighbor (take the nearest known sample based on Euclidean distance)
Very low accuracy, but still better than random
Hierarchical Bayesian Learning (Lake et al.)
Better results, but inputs modified or annotated
Naïve deep neural network approach
Would horribly overfit
Transfer learning
Works better, makes sense
Siamese networks (Koch et al.)
Provide two images and train the network to predict whether they have the same category During prediction-time, the network can be used to compare a new image to each in the support set and pick the best matching category based on this We want an architecture that takes two inputs and outputs the probability of sharing the same class Symmetry: p(x1, x2) = p(x2, x1) – which means we cannot just “join” both images together to one large image Siamese network: shared parameters for identical convnets, then joined by a distance function
Also possible: zero-shot learning (no examples for some classes), student-teacher networks (alternative transfer learning approach)
59
http://openaccess.thecvf.com/content_cvpr_2018/CameraReady/2406.pdf
60
Recurrent neural networks (RNNs)
The basic idea behind RNNs is to make use of sequential information
In a traditional neural network we assume that all inputs (and outputs) are independent of each other
However, if you want to predict the next word in a sentence, it makes sense to know the words that came before it RNNs are called recurrent because they perform the same task for every element of a sequence, with the output being dependent on the previous computations RNNs have a “memory” which captures information about what has been calculated so far Similar to human reasoning: humans don’t start their thinking from scratch for every input. Your thoughts (previously seen instances) have persistence
61
Recurrent neural networks (RNNs)
RNNs have shown great success in many NLP tasks
Text classification Language modeling and generating text Machine translation Question-answering Chatbots Speech recognition Generating image descriptions
RNN + CNN
RCNN: object detection
62
Recurrent neural networks (RNNs)
The most well known variant of RNN is probably the LSTM (long short-term memory)
Sometimes, we only need to look at recent information to perform the present task
But there are also cases where we need more context, from further back
Standard RNNs don't remember this context from far back
Long short-term memory networks solve this issue by learning long-term dependencies
Introduced by Hochreiter and Schmidhuber
Work very well on a large variety of problems, and are still widely used
Instead of having a single neural network layer per repeating block as in a plain RNN, there are multiple layers interacting in a special way: an input gate, a "forget gate" and an output gate
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
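A hedged Keras sketch of a typical LSTM text classifier (vocabulary size and layer widths are illustrative):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=32))  # integer-encoded words -> vectors
model.add(LSTM(64))  # the LSTM carries a "memory" across the sequence
model.add(Dense(1, activation='sigmoid'))  # e.g. binary sentiment classification
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])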
63
Recurrent neural networks (RNNs)
https://www.altumintelligence.com/articles/a/Time-Series-Prediction-Using-LSTM-Deep-Neural-Networks
64
Generative adversarial networks (GANs)
Generative adversarial networks (GANs) are deep neural net architectures comprised of two nets, pitting one against the other
The generative network (the generator) tries to fool the discriminator
The discriminator tries to spot the fooling attempts
GANs were introduced in a paper by Ian Goodfellow et al. https://arxiv.org/abs/1406.2661
Yann LeCun called adversarial training “the most interesting idea in the last 10 years in ML”
65
Generative adversarial networks (GANs)
Traditional discriminative algorithms try to classify input data; that is, given the features of an instance, they predict the class to which that instance belongs, or P(Y=1|X)
Discriminative algorithms map features to labels. They are concerned solely with that correlation
One way to think about generative algorithms is that they do the opposite. Instead of predicting a label given certain features, they attempt to predict features given a certain label, or P(X|Y)
The question a generative algorithm tries to answer is: assuming this label, how likely are these features? While discriminative models care about the relation between y and x, generative models care about “how you get x” In other words: discriminative models learn the boundary between classes, generative models model the distribution of individual classes
66
Auto-encoders
67
Generative adversarial networks (GANs)
https://www.kdnuggets.com/2017/01/generative-adversarial-networks-hot-topic-machine-learning.html
68
Generative adversarial networks (GANs)
See example: https://poloclub.github.io/ganlab/
Text to image generation: https://arxiv.org/abs/1605.05396 Image to image translation: https://arxiv.org/abs/1611.07004 Increasing image resolution: https://arxiv.org/abs/1609.04802 Predicting next video frames: https://arxiv.org/abs/1511.06380
See example: https://affinelayer.com/pixsrv/
69
Generative adversarial networks (GANs)
https://thispersondoesnotexist.com/
https://thisrentaldoesnotexist.com/
https://www.thiswaifudoesnotexist.net/
We're getting better at this: DCGAN, StyleGAN, ...
"Deep fakes": http://fortune.com/2019/01/31/what-is-deep-fake-video/, http://fortune.com/2018/09/11/deep-fakes-obama-video/, https://www.theguardian.com/technology/2018/nov/12/deep-fakes-fake-news-truth
70
Generative adversarial networks (GANs)
"In May, a video appeared on the internet of Donald Trump offering advice to the people of Belgium on the issue of climate change. The video was created by a Belgian political party, sp.a, and posted on sp.a's Twitter and Facebook. It provoked hundreds of comments, many expressing outrage that the American president would dare weigh in on Belgium's climate policy. But this anger was misdirected. The speech, it was later revealed, was nothing more than a hi-tech forgery. It was a small-scale demonstration of how this technology might be used to threaten our already vulnerable information ecosystem – and perhaps undermine the possibility of a reliable, shared reality.
Fake videos can now be created using a machine learning technique called a "generative adversarial network", or a GAN. The use of this machine learning technique was mostly limited to the AI research community until late 2017, when a Reddit user who went by the moniker "Deepfakes" – a portmanteau of "deep learning" and "fake" – started posting digitally altered pornographic videos. He was building GANs using TensorFlow, Google's free open source machine learning software, to superimpose celebrities' faces on the bodies of women in pornographic movies.
When Danielle Citron, a professor of law at the University of Maryland, first became aware of the fake porn movies, she was initially struck by how viscerally they violated these women's right to privacy. But once she started thinking about deep fakes, she realized that if they spread beyond the trolls on Reddit they could be even more dangerous. They could be weaponized in ways that weaken the fabric of democratic society itself. "What would've happened if a deep fake emerged of the police chief saying something racist?" In particular, they could foresee deep fakes being exploited by purveyors of "fake news"."
71
Reinforcement learning
Reinforcement learning allows us to create AI agents that learn from the environment by interacting with it
Learns by trial and error The environment exposes a state to the agent, with a number of possible actions the agent can perform After each action, the agent receives the feedback The feedback consists of the reward and next state of the environment
See: http://projects.rajivshah.com/rldemo/
72
Q-learning
Given one run of the agent through an environment (one episode), we can easily calculate the total reward for that episode: R = r_1 + r_2 + ... + r_n
The total future reward from time point t onward can be expressed as:
R_t = r_t + r_{t+1} + r_{t+2} + ... + r_n
Because the environment is stochastic, it is common to use the discounted future reward instead:
R_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... + γ^(n−t) r_n
R_t = r_t + γ (r_{t+1} + γ (r_{t+2} + ...)) = r_t + γ R_{t+1}
If we set the discount factor γ = 0, our strategy will be short-sighted and we rely only on the immediate rewards
Balance between immediate and future rewards with e.g. γ = 0.9
If our environment is fully deterministic and the same actions always result in the same rewards: γ = 1
A good strategy for an agent would be to always choose an action that maximizes the (discounted) future reward
73
Q-learning
Define a function Q(s, a) representing the maximum discounted future reward when performing action a in state s and continuing optimally from that point onwards:
Q(s_t, a_t) = max R_{t+1}
The best possible score at the end of the game after performing action a in state s
Quality of a certain action in a certain state
74
Q-learning
But: how can we estimate the score at the end of the game?
We know just the current state and action, and not the actions and rewards coming after that
We can't; Q is just a theoretical construct
If we could find an estimate for Q, we could determine a policy as follows: just pick the action with the highest Q-value in a certain state:
π(s) = argmax_a Q(s, a)
Here π represents the policy: the rule for how we choose an action in each state
75
Q-learning
Say we have one transition: <s, a, r, s'>
Just like with discounted future rewards, we can express the Q-value of state s and action a in terms of the Q-value of the next state s’:
Q(s, a) = r + γ × Q(s', π(s'))
This is called the Bellman equation
The main idea in Q-learning is that we can iteratively approximate the Q-function using the Bellman equation. In the simplest case the Q-function is implemented as a table, with states as rows and actions as columns
76
Q-learning
initialize Q[num_states, num_actions] arbitrarily
observe initial state s
repeat
    select action a according to policy
    execute action a and obtain reward r and new state s'
    select action a' in s' according to policy
    Q[s,a] = Q[s,a] + α(r + γ * Q[s',a'] - Q[s,a])
    move to new state s'
until termination
α is a learning rate that controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account. In particular, when α = 1, the update is exactly the Bellman equation
The Q[s',a'] with the a' that we use to update Q[s,a] is only an approximation, and in the early stages of learning it may be completely wrong
However, the approximation gets more and more accurate with every iteration, and it has been shown that if we perform this update enough times, the Q-function will converge and represent the true Q-value
77
Q-learning: example
http://mnemstudio.org/path-finding-q-learning-tutorial.htm
78
Q-learning: example
Reward matrix
Indicates possible actions from a certain state And the reward per action
Q matrix
The “brain” of our agent Initially all values are 0
79
Q-learning: example
for each episode, do
    start from a random initial state as the current state
    while the current state is not the goal state, do
        select a random action from the possible ones
        see to which new state that action leads
        get the max Q value in that new state
        update Q[state][action] = Q[state][action] + alpha * (R[state][action] + gamma * max_Q - Q[state][action])
        set the new state as the current one
    end while
end for
80
Q-learning: example
Start from a random state (let’s say 3) and see which actions are possible In state 3, we can do actions: go to 1, go to 2, go to 4 Pick a random action to explore, e.g. go to 4 This would bring us to a new state (4) Check the actions which are possible there and determine the max Q value 0, 3 and 5 are possible, max(Q[4][0], Q[4][3], Q[4][5]) = 0
81
Q-learning: example
We are in state 3, and are exploring action 4, and have determined max_Q
Q[state][action] = Q[state][action] + alpha * (R[state][action] + gamma * max_Q - Q[state][action])
If alpha = 1, the formula is easy: Q[3][4] = R[3][4] + 0.8 * 0 = 0 (gamma = 0.8)
We move now to state 4 and continue
82
Q-learning: example
We are now in state 4, and can go to 0, 3 or 5 from here; let's randomly pick 5 to explore
The immediate reward R[4][5] = 100
Determine max_Q in state 5: max(Q[5][1], Q[5][4], Q[5][5]) = 0
Q[state][action] = Q[state][action] + alpha * (R[state][action] + gamma * max_Q - Q[state][action])
If alpha = 1, the formula is easy: Q[4][5] = R[4][5] + 0.8 * 0 = 100
We move now to state 5. This is the end goal, so we start a new episode
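The whole procedure fits in a few lines of Python; a hedged sketch assuming the reward matrix from the linked tutorial (alpha = 1, gamma = 0.8, goal state 5):

import random

# Reward matrix: -1 marks impossible actions; entering state 5 yields 100
R = [[-1, -1, -1, -1,  0,  -1],
     [-1, -1, -1,  0, -1, 100],
     [-1, -1, -1,  0, -1,  -1],
     [-1,  0,  0, -1,  0,  -1],
     [ 0, -1, -1,  0, -1, 100],
     [-1,  0, -1, -1,  0, 100]]
gamma, alpha, goal = 0.8, 1.0, 5
Q = [[0.0] * 6 for _ in range(6)]

for episode in range(1000):
    state = random.randrange(6)
    while state != goal:
        action = random.choice([a for a in range(6) if R[state][a] != -1])
        max_q = max(Q[action])  # best Q-value in the new state (the action is the new state)
        Q[state][action] += alpha * (R[state][action] + gamma * max_q - Q[state][action])
        state = action

print([[round(q) for q in row] for row in Q])  # converges to the Q matrix shown below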
83
[[0, 0, 0, 0, 80, 0], [0, 0, 0, 64, 0, 100], [0, 0, 0, 64, 0, 0], [0, 80, 51, 0, 80, 0], [64, 0, 0, 64, 0, 100], [0, 0, 0, 0, 0, 0]]
Q-learning: example
The approximation gets more and more accurate with every iteration, and it has been shown that if we perform this update enough times, the Q-function will converge and represent the true Q-value
84
Q-learning
"If we apply the same preprocessing to game screens as in the DeepMind paper – take the four last screen images, resize them to 84×84 and convert to grayscale with 256 gray levels – we would have 256^(84×84×4) ≈ 10^67970 possible game states"
This means 10^67970 rows in our Q-table, more than the number of atoms in the known universe
One could argue that many states never occur; we could possibly represent the table as a sparse table containing only visited states
Even so, most of the states are very rarely visited and it would take a lifetime of the universe for the Q-table to converge
Ideally, we would also like to have a good guess for Q-values for states we have never seen before
85
Deep Q-learning
This is the point where deep learning steps in
We could represent our Q-function with a neural network that takes the state and action as input and outputs the corresponding Q-value
"According to the network, which action leads to the highest payoff in a given state?"
86
Deep Q-learning
Estimate the future reward in each state using Q-learning and approximate the Q-function using a convolutional neural network It turns out that approximation of Q-values using non-linear functions is not very stable Not easy to converge and takes a long time, almost a week on a single GPU Hence, experience replay is applied. During gameplay all the experiences <s, a, r, s’> are stored in a replay memory
When training the network, random mini-batches from the replay memory are used instead of the most recent transition
This breaks the similarity of subsequent training samples, which otherwise might drive the network into a local minimum
It helps avoid the network overly adjusting its weights for the most recent state, which may affect the action output for other states
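A hedged Keras sketch of the two ideas above: a Q-network that outputs one Q-value per action (the DeepMind-style variant), plus a replay memory trained on random mini-batches. All sizes and names are illustrative:

import random
from collections import deque
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

n_state_features, n_actions = 4, 2  # hypothetical small environment

model = Sequential()
model.add(Dense(24, activation='relu', input_shape=(n_state_features,)))
model.add(Dense(n_actions, activation='linear'))  # one Q-value per action
model.compile(loss='mse', optimizer='adam')

replay_memory = deque(maxlen=10000)  # stores <s, a, r, s', done> experiences

def replay(batch_size=32, gamma=0.9):
    # Train on a random mini-batch instead of the most recent transition
    batch = random.sample(list(replay_memory), min(batch_size, len(replay_memory)))
    for s, a, r, s_next, done in batch:
        target = model.predict(s[np.newaxis], verbose=0)[0]
        target[a] = r if done else r + gamma * np.max(model.predict(s_next[np.newaxis], verbose=0)[0])
        model.fit(s[np.newaxis], target[np.newaxis], verbose=0)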
87
Deep Q-learning
Q-learning attempts to solve the credit assignment problem – it propagates rewards back in time, until it reaches the crucial decision point which was the actual cause for the obtained reward
When a Q-table or Q-network is initialized randomly, then its predictions are initially random as well. If we pick an action with the highest Q-value, the action will be random and the agent performs crude “exploration” As a Q-function converges, it returns more consistent Q-values and the amount of exploration decreases
But this exploration is "greedy": it settles with the first effective strategy it finds. We need a tradeoff between exploration and exploitation
A simple and effective fix for this problem is ε-greedy exploration: with probability ε choose a random action, otherwise go with the "greedy" action with the highest Q-value
In their system, DeepMind actually decreases ε over time from 1 to 0.1 – in the beginning the system makes completely random moves to explore the state space maximally, and then it settles down to a fixed exploration rate
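In code, ε-greedy selection is a couple of lines (a hedged sketch reusing the hypothetical model and n_actions from the sketch above):

import random
import numpy as np

def choose_action(state, epsilon, model, n_actions):
    # With probability epsilon explore randomly, otherwise exploit the Q-network
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return int(np.argmax(model.predict(state[np.newaxis], verbose=0)[0]))

# DeepMind-style annealing: start at 1 (pure exploration), decay towards 0.1
epsilon = 1.0
epsilon = max(0.1, epsilon * 0.999)  # applied once per training step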
88
Reinforcement learning
Deep Q Learning (DQN)
https://arxiv.org/abs/1312.5602 Used to play simple Atari games together with a CNN
Double DQN
https://arxiv.org/abs/1509.06461 The Q-learning algorithm is known to overestimate action values under certain conditions
Deep Deterministic Policy Gradient (DDPG)
https://arxiv.org/abs/1509.02971
Asynchronous Advantage Actor-Critic (A3C)
https://arxiv.org/abs/1602.01783
Continuous DQN (CDQN or NAF)
https://arxiv.org/abs/1603.00748
Cross-Entropy Method (CEM)
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.6579&rep=rep1&type=pdf
Dueling network DQN (Dueling DQN)
https://arxiv.org/abs/1511.06581
Deep SARSA
http://ieeexplore.ieee.org/document/7849837/
89
Reinforcement learning
https://www.alexirpan.com/2018/02/14/rl-hard.html
90
Conclusions
CNNs: “using filters to construct topologically relevant abstractions” RNNs: “order matters: dealing with sequences, temporal aspect” GANs: “if I can generate something, I understand it” RL: “learning how to behave optimally in an environment”
Artificial neural networks are back
Powerful
But: require a lot of tuning and configuration, risk of overfitting, and huge amounts of samples for unexplored problems
So many architectures; what are best practices?
Black box!
Probability versus uncertainty: https://alexgkendall.com/computer_vision/bayesian_deep_learning_for_safe_ai/
The last 10%: deep learning gets you quickly to 90% of good results, but the last 10% is still very hard to reach
CNN and RNN pretty "stable", RL and GANs still face a lot of open questions
91
Conclusions
Tooling support has definitely improved:
PyTorch and Torch
Torch is a computational framework with an API written in Lua that supports machine-learning algorithms
Its powerful functionality drew interest from e.g. Facebook and Twitter, though the use of Lua was a bit of a drawback for wide adoption
A Python version of Torch, known as PyTorch, was open-sourced by Facebook in January 2017
PyTorch offers dynamic computation graphs, which let you process variable-length inputs and outputs instead of being limited to a fixed neural net architecture: a very powerful concept!
PyTorch has quickly become a favorite among researchers, because it allows complex architectures to be built easily
Adoption in industry is slowly growing
Caffe and Caffe2
Caffe2 is the long-awaited successor to the original Caffe Its creator Yangqing Jia now works at Facebook The main difference with Torch is that Caffe2 is somewhat more light-weight Though not much in use these days
92
Conclusions
Tooling support has definitely improved:
TensorFlow and Theano
Theano is the grand-daddy of deep-learning frameworks
Written in Python, with a focus on fast handling of multidimensional arrays
GPU support not perfect, speed not perfect, but solid for experimentation, learning and research
Yoshua Bengio announced in September 2017 that development on Theano would cease, so it is not viable anymore
Google created TensorFlow to replace Theano
Some of the creators of Theano, such as Ian Goodfellow, went on to create Tensorflow at Google before leaving for OpenAI
TensorFlow is written with a Python API over a C/C++ engine; a Java API exists as well
In October 2017, Google introduced Eager, a dynamic computation graph module for TensorFlow, to compete with PyTorch
Very popular with strong industry adoption
93
Conclusions
Tooling support has definitely improved:
Keras
Keras is a deep-learning library that sits on top of Theano, TensorFlow, or CNTK
Provides a high-level, easy API inspired by Torch on top of these engines
Created by Francois Chollet, a software engineer at Google
Chosen as an official high-level Tensorflow API by Google
On its way to become a standard wrapper around different "engines"
For newcomers: an easy way to get started!
Relatively easy to install
Lots of tutorials, code, … available
High level, pre-made layers and neurons
Less preferred by expert-level coders
94
Conclusions
https://towardsdatascience.com/deep-learning-framework-power-scores-2018-23607ddf297a
95
Conclusions
“Lessons from Optics, The Other Deep Learning” http://www.argmin.net/2018/01/25/optics/
"There's a mass influx of newcomers to our field and we're equipping them with little more than folklore and pre-trained deep nets, then asking them to innovate. We can barely agree on the phenomena that we should be explaining away. I think we're far from teaching this stuff in high schools."
"It would be nice if we could provide mental models, at various layers of abstraction, of the action of the layers of a deep net. What could be our equivalent of refraction, dispersion, and diffraction? Maybe you already think in terms of these actions, but we just haven't standardized our language around these concepts?"
Don’t believe the short-term hype, but do believe in the long-term vision. It may take a while for AI to be deployed to its true potential—a potential the full extent of which no one has yet dared to dream—but AI is coming, and it will transform our world in a fantastic way. - Francois Chollet
96
Conclusions
97
Conclusions
Aspect | Traditional algorithms | Deep learning
Accuracy | Fair to good (on structured data) | Good to excellent
Training time | Short (seconds) to medium (hours) | Medium to (very) long (weeks)
Data requirements | Limited (a couple of hundred rows of "small" data) | High (many thousands of e.g. images, though "transfer learning" possible in some cases)
Feature engineering | Manual (trend features, windowing, aggregations, domain-specific approaches) | Automatic, done "by the model"
Hyperparameters | Few to some (depending on the algorithm) | Many (architecture, number of hidden layers, activation functions, optimizer, …)
Interpretability | High (white-box models) to reasonable | Low (black-box model, though some explanations can be extracted)
Cost and operational efficiency | Low to reasonable | Reasonable to high (GPU, cloud, parallel computational requirements)
98
Opening the black box (part 2)
99
Adversarial learning: a motivating example
NIPS 2017
100
Adversarial learning: a motivating example
Su et al, 2017
101
Adversarial learning: a motivating example
Eykholt et al., 2018
102
Adversarial learning: a motivating example
Brown et al., 2018
103
Opening the black box
How do you inspect something which has not thousands but easily millions of parameters?
104
Layer activations
The most straight-forward visualization technique is to show the activations of the network during the forward pass
For ReLU networks, the activations usually start out looking relatively blobby and dense, but as the training progresses the activations usually become more sparse and localized http://cs231n.github.io/understanding-cnn/
See: https://distill.pub/2019/activation-atlas/
105
Hinton diagrams
The size of the square indicates the size of the weight The color of the square indicates the sign of the weight
Suitable for simple MLP models
106
Convolutional filters
Another strategy to visualize the weights
These are usually most interpretable on the first CONV layer which is looking directly at the raw pixel data, but it is possible to also show the filter weights deeper in the network http://cs231n.github.io/understanding-cnn/
107
Maximally activating inputs
Another visualization technique is to take a large dataset of images, feed them through the network and keep track of which images maximally activate some neuron
We can then visualize the images to get an understanding of what the neuron is looking for in its receptive field
108
Dimensionality reduction
CNNs can be interpreted as gradually transforming the images into a representation in which the classes are separable by a linear classifier
We can get a rough idea about the topology of this space by embedding images into two dimensions so that their low-dimensional representation approximately preserves the distances of their high-dimensional representation
To produce an embedding, we can take a set of images and use the CNN to extract the vector of outputs right before the final softmax (classifier) layer
We can then plug these into t-SNE and get a 2-dimensional vector for each image
109
Input occlusion
Suppose that a CNN classifies an image as a dog. How can we be certain that it’s actually picking up on the dog in the image as opposed to some contextual cues from the background or some other miscellaneous object?
One way of investigating which part of the image some classification prediction is coming from is by plotting the probability of the class of interest (e.g. “dog” class) as a function of the position of an occluder object We iterate over regions of the image, set a patch of the image to be all zero, and look at the probability of the class. We can visualize the probability as a 2-dimensional heat map http://cs231n.github.io/understanding-cnn/
Can also be used for simple MLP models:
Train the neural network Prune the input where the input-to-hidden layer weights are closest to zero and retrain the network If the predictive power increases (or stays the same), then repeat the process If not, reconnect the input and stop
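A hedged sketch of the occlusion heat map for a Keras-style image classifier (the model, patch size and shapes are illustrative):

import numpy as np

def occlusion_heatmap(model, image, class_idx, patch=8):
    """Slide a zeroed-out patch over the image and record the class probability."""
    h, w = image.shape[:2]
    heatmap = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            occluded = image.copy()
            occluded[i:i+patch, j:j+patch] = 0  # occlude one region
            prob = model.predict(occluded[np.newaxis], verbose=0)[0][class_idx]
            heatmap[i // patch, j // patch] = prob  # a low value marks an important region
    return heatmap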
110
Rule extraction
Decompositional rule extraction algorithms
Are closely intertwined with the internal workings of the neural network Analyze weights, biases, and activation values
Pedagogical rule extraction algorithms
Consider the neural network as a black box Use the neural network as an oracle to label and generate additional training observations
111
Decompositional rule extraction
Extract rules that describe the network outputs in terms of the discretized hidden unit activation values Generate rules that describe the discretized hidden unit activation values in terms of the network inputs Merge the two sets of rules to obtain a set of rules that relate the inputs and outputs of the network
112
Pedagogical rule extraction
Use the neural network to relabel the training data Build a simple model (e.g. decision tree) on the relabeled data Use the neural network as an oracle to generate additional training data when the data becomes too partitioned (for example, less than S observations for deciding upon splits, with S as a user- defined parameter)
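A hedged sketch of the first two steps, assuming a trained Keras-style model and numpy arrays (the oracle-based data generation step is omitted):

from sklearn.tree import DecisionTreeClassifier

# Relabel the training data with the network's own predictions...
y_surrogate = model.predict(X_train).argmax(axis=1)

# ...and fit an interpretable tree that mimics the network
tree = DecisionTreeClassifier(max_depth=4)
tree.fit(X_train, y_surrogate)
print(tree.score(X_train, y_surrogate))  # fidelity of the surrogate to the network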
113
Two-stage models
114
LIME
https://github.com/marcotcr/lime
Local Interpretable Model-agnostic Explanations
Explaining individual predictions for text classifiers or classifiers that act on tables (numpy arrays of numerical or categorical data)
An explanation is a local linear approximation of the model's behavior. While the model may be very complex globally, it is easier to approximate around the vicinity of a particular instance. While treating the model as a black box, the instance we want to explain is perturbed and a sparse linear model is constructed around it, serving as an explanation
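A hedged usage sketch for tabular data (the dataset and model names are placeholders):

from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(X_train, feature_names=feature_names,
                                 class_names=['no', 'yes'], mode='classification')

# Perturb one instance and fit a sparse linear model around it
explanation = explainer.explain_instance(X_test[0], model.predict_proba, num_features=5)
print(explanation.as_list())  # per-feature contributions for this single prediction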
115
SHAP
https://github.com/slundberg/shap SHapley Additive exPlanations A unified approach to explain the output of any machine learning model Connects game theory with local explanations
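A hedged usage sketch of the model-agnostic Kernel SHAP explainer (names are placeholders):

import shap

# A background sample summarizes the training data for the explainer
explainer = shap.KernelExplainer(model.predict_proba, shap.sample(X_train, 100))
shap_values = explainer.shap_values(X_test[:10])  # Shapley values per feature, per prediction
shap.summary_plot(shap_values, X_test[:10])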
116
Conclusion
Also keep in mind the other interpretability approaches we've seen before
Many of these are model-agnostic (surrogate models, feature importance plots, partial dependence plots)
Data scientists are automating themselves... ... though interpretability will become crucial!
https://searchenterpriseai.techtarget.com/feature/Interpretable-AI-has-benefits-beyond-compliance
https://www.fastcompany.com/90317658/we-need-an-algorithmic-bill-of-rights
https://www.wired.com/story/inside-black-box-of-neural-network/
https://www.forbes.com/sites/tomtaulli/2019/03/09/deep-learning-when-should-you-use-it/#2cf1dc954e36
https://bdtechtalks.com/2019/02/04/explainable-ai-gan-dissection-ibm-mit/
The work of IBM and MIT is one of several efforts collectively known as explainable AI: projects that aim to create tools that can interpret AI decisions, or to create artificial intelligence models that are more transparent and open to investigation.