SLIDE 1

An introduction to Neural Networks and Deep Learning

Talk given at the Department of Mathematics of the University of Bologna

February 20, 2018

Andrea Asperti
DISI - Department of Informatics: Science and Engineering, University of Bologna
Mura Anteo Zamboni 7, 40127, Bologna, ITALY
andrea.asperti@unibo.it

SLIDE 2

A branch of Machine Learning

What is Machine Learning?

There are problems that are difficult to address with traditional programming techniques:

◮ classify a document according to some criteria (e.g. spam detection, sentiment analysis, ...)
◮ compute the probability that a credit card transaction is fraudulent
◮ recognize an object in some image (possibly from an unusual viewpoint, in new lighting conditions, in a cluttered scene)
◮ ...

Typically the result is a weighted combination of a large number of parameters, each one contributing to the solution to a small degree.

SLIDE 3

The Machine Learning approach

Suppose we have a set of input-output pairs (the training set) {x_i, y_i}; the problem consists in guessing the map x_i → y_i. The ML approach:

• describe the problem with a model depending on some parameters Θ (i.e. choose a parametric class of functions)
• define a loss function to compare the results of the model with the expected (experimental) values
• optimize (fit) the parameters Θ to reduce the loss to a minimum
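The three ingredients fit in a few lines of code. A minimal sketch, assuming a toy 1-D dataset, with scipy's off-the-shelf minimizer playing the role of the fitting step:

import numpy as np
from scipy.optimize import minimize

xs = np.array([0.0, 1.0, 2.0, 3.0])      # inputs x_i
ys = np.array([1.1, 2.9, 5.2, 7.1])      # expected outputs y_i

def model(theta, x):                     # parametric class: the lines a*x + b
    a, b = theta
    return a * x + b

def loss(theta):                         # mean squared error against the y_i
    return np.mean((model(theta, xs) - ys) ** 2)

fit = minimize(loss, x0=np.zeros(2))     # fit the parameters Θ
print(fit.x)                             # ≈ [2.03, 1.03]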

SLIDE 4

Why Learning?

Machine Learning problems are in fact optimization problems! So, why talk about learning?

SLIDE 5

Why Learning?

Machine Learning problems are in fact optimization problems! So, why talk about learning? The point is that the solution to the optimization problem is not given in an analytical form (often there is no closed-form solution).

SLIDE 6

Why Learning?

Machine Learning problems are in fact optimization problems! So, why talk about learning? The point is that the solution to the optimization problem is not given in an analytical form (often there is no closed-form solution). So, we use iterative techniques (typically, gradient descent) to progressively approximate the result.

SLIDE 7

Why Learning?

Machine Learning problems are in fact optimization problems! So, why talk about learning? The point is that the solution to the optimization problem is not given in an analytical form (often there is no closed-form solution). So, we use iterative techniques (typically, gradient descent) to progressively approximate the result. This form of iteration over data can be understood as progressively learning the objective function from the experience of past observations.

SLIDE 8

Using gradients

The objective is to minimize some loss function over (fixed) training samples, e.g.

Θ(w) = Σ_i E(o(w, x_i), y_i)

by suitably adjusting the parameters w. See how Θ changes according to small perturbations ∆(w) of the parameters w: this is the gradient

∇_w Θ = [∂Θ/∂w_1, ..., ∂Θ/∂w_n]

of Θ w.r.t. w.

The gradient is a vector pointing in the direction of steepest ascent.
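A sketch (not from the talk) of what "small perturbations" means in practice: the gradient can be approximated numerically by probing the loss one parameter at a time with a central difference.

import numpy as np

def numerical_gradient(loss, w, eps=1e-6):
    grad = np.zeros_like(w)
    for i in range(len(w)):
        dw = np.zeros_like(w)
        dw[i] = eps                                          # perturb the i-th parameter only
        grad[i] = (loss(w + dw) - loss(w - dw)) / (2 * eps)  # central difference
    return grad

# Example: Θ(w) = w_1^2 + 3*w_2, whose gradient is [2*w_1, 3]
loss = lambda w: w[0] ** 2 + 3 * w[1]
print(numerical_gradient(loss, np.array([1.0, 0.0])))        # ≈ [2., 3.]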

SLIDE 9

Gradient descent

Goal: minimize some loss function Θ(w) by suitably adjusting the parameters. We can reach a minimal configuration for Θ(w) by iteratively taking small steps in the direction opposite to the gradient (gradient descent). This is a general technique. Warning: it is not guaranteed to work:

◮ it may end up in local minima
◮ it may get lost in plateaus
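A sketch of the descent loop on a toy loss with a hand-written gradient; the learning rate (0.1) and the step budget (100) are illustrative choices:

import numpy as np

def loss(w):                  # Θ(w) = (w_1 - 3)^2 + (w_2 + 1)^2, minimum at (3, -1)
    return (w[0] - 3) ** 2 + (w[1] + 1) ** 2

def grad(w):                  # the gradient ∇_w Θ, computed analytically
    return np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])

w = np.array([0.0, 0.0])
for _ in range(100):
    w -= 0.1 * grad(w)        # small step opposite to the gradient
print(w)                      # ≈ [3., -1.]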

SLIDE 10

Next arguments

A bit of taxonomy

SLIDE 11

Different types of Learning Tasks

• supervised learning: inputs + outputs (labels)
  • classification
  • regression
• unsupervised learning: just inputs
  • clustering
  • component analysis
  • autoencoding
• reinforcement learning: actions and rewards
  • learning long-term gains
  • planning

SLIDE 12

Classification vs. Regression

Two forms of supervised learning over pairs {x_i, y_i}:

[Figure: classification — a new input is assigned a label ("Probably a cat!"); regression — a new input is mapped to an expected value]

In classification y is discrete (e.g. y ∈ {•, +}); in regression y is (conceptually) continuous.

SLIDE 13

Many different techniques

• Different ways to define the models:
  • decision trees
  • linear models
  • neural networks
  • ...
• Different error (loss) functions:
  • mean squared error
  • logistic loss
  • cross entropy
  • cosine distance
  • maximum margin
  • ...

[Figure: a decision tree (Outlook / Humidity / Wind) and a neural net as example models; mean squared error and maximum margin as example losses]

SLIDE 14

Next argument

Neural Networks

SLIDE 15

Neural Network

A network of (artificial) neurons

Artificial neuron

Each neuron takes multiple inputs and produces a single output (that can be passed as input to many other neurons).

SLIDE 16

The artificial neuron

[Figure: an artificial neuron — inputs x_1, ..., x_n with weights w_1, ..., w_n, a bias b on a constant +1 input, a summation Σ, and an activation function producing the output]

The purpose of the activation function is to introduce a thresholding mechanism (similar to the axon-hillock of cortical neurons).
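A minimal sketch of such a neuron, assuming a logistic activation (any of the activations on the next slide would do):

import numpy as np

def neuron(x, w, b):
    z = np.dot(w, x) + b           # weighted sum of the inputs, plus the bias
    return 1 / (1 + np.exp(-z))    # logistic activation: the thresholding mechanism

x = np.array([0.5, -1.0, 2.0])     # inputs
w = np.array([0.8, 0.2, -0.5])     # weights
print(neuron(x, w, b=0.1))         # a single output in (0, 1)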

SLIDE 17

Different activation functions

The activation function is responsible for threshold triggering.

◮ threshold: if x > 0 then 1 else 0
◮ logistic function: 1/(1 + e^(-x))
◮ hyperbolic tangent: (e^x - e^(-x))/(e^x + e^(-x))
◮ rectified linear (ReLU): if x > 0 then x else 0
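A sketch of the four activations, vectorized with NumPy:

import numpy as np

def threshold(x):
    return (x > 0).astype(float)     # 1 if x > 0 else 0

def logistic(x):
    return 1 / (1 + np.exp(-x))      # squashes into (0, 1)

def tanh(x):
    return np.tanh(x)                # (e^x - e^(-x))/(e^x + e^(-x)), into (-1, 1)

def relu(x):
    return np.maximum(0.0, x)        # x if x > 0 else 0

xs = np.array([-2.0, 0.0, 2.0])
for f in (threshold, logistic, tanh, relu):
    print(f.__name__, f(xs))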

SLIDE 18

A comparison with the cortical neuron

SLIDE 19

Next argument

Networks typology/topology

SLIDE 20

Layers

A neural network is a collection of artificial neurons connected together. Neurons are usually organized in layers. If there is more than one hidden layer the network is deep, otherwise it is called a shallow network.

SLIDE 21

Feed-forward networks

If the network is acyclic, it is called a feed-forward network. Feed-forward networks are (at present) the commonest type of network in practical applications. Important: composing linear transformations makes no sense, since the composition is still a linear transformation. What is the source of non-linearity in Neural Networks?

SLIDE 22

Feed-forward networks

If the network is acyclic, it is called a feed-forward network. Feed-forward networks are (at present) the commonest type of network in practical applications. Important: composing linear transformations makes no sense, since the composition is still a linear transformation. What is the source of non-linearity in Neural Networks? The activation function!

SLIDE 23

Dense networks

The most typical feed-forward network is a dense network, where each neuron at layer k − 1 is connected to each neuron at layer k. The network is defined by a matrix of parameters (weights) W_k for each layer (+ biases). The matrix W_k has dimension L_k × L_{k+1}, where L_k is the number of neurons at layer k.
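A sketch of the forward pass of such a network: one weight matrix and one bias vector per layer, with a ReLU in between (layer sizes and random weights are illustrative):

import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 8, 3]                                  # L_0, L_1, L_2
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    for W, b in zip(weights, biases):
        x = np.maximum(0.0, x @ W + b)             # linear map + non-linear activation
    return x

print(forward(rng.normal(size=4)))                 # a 3-dimensional output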

SLIDE 24

Parameters and hyper-parameters

The weights W_k are the parameters of the model: they are learned during the training phase. The number of layers and the number of neurons per layer are hyper-parameters: they are chosen by the user and fixed before training starts. Other important hyper-parameters govern training, such as the learning rate, the batch size, the number of epochs, and many others.

SLIDE 25

Convolutional networks

Convolutional networks are used on inputs with a topological structure: signal sequences (e.g. sound), or images. They repeatedly apply a (small) uniform linear transformation, called a kernel, shifting it over the whole input image.

SLIDE 26

Example

[Figure: an image convolved with the 2 × 2 kernel ((−1, 1), (−1, 1)) → the filtered output]
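A sketch of the shifting operation (stride 1, no padding); deep-learning libraries actually compute this cross-correlation and still call it convolution:

import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)   # local weighted sum
    return out

image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
kernel = np.array([[-1, 1],
                   [-1, 1]], dtype=float)    # horizontal difference: responds to vertical edges
print(convolve2d(image, kernel))             # strongest response at the 0 → 1 boundary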

SLIDE 27

Computing features

Many interesting kernels (filters) known from Image Processing:

◮ first and second order derivatives, image gradients
◮ Sobel, Prewitt, ...

In Neural Networks, kernels are learned by training. Since kernels are small and weights are shared, training is relatively fast.

SLIDE 28

Recurrent Networks

In a recurrent network you may have cycles:

• the dynamics is very complex; it is not even clear that it stabilizes
• difficult to train
• biologically more realistic

Restricted models:

◮ Long Short-Term Memory models (LSTM)
◮ Gated Recurrent Units (GRU)

SLIDE 29

LSTM and GRU

LSTMs are useful to model sequences:

◮ equivalent to very deep nets with one hidden layer per time slice (net unrolling)
◮ weights are shared between different time slices
◮ they can keep information for a long time in an internal state

SLIDE 30

Symmetrically connected networks

Similar to recurrent networks, but connections between units are symmetrical (they have the same weight in both directions). They have stable configurations corresponding to local minima of a suitable energy function.

Hopfield nets: symmetrically connected nets without hidden units.
Boltzmann machines: symmetrically connected nets with hidden units:

◮ more powerful models than Hopfield nets
◮ less powerful than general recurrent networks
◮ have a nice and simple learning algorithm

SLIDE 31

What a real network looks like

VGG 16 (Simonyan and Zisserman): 92.7% top-5 accuracy on ImageNet.

Picture by Davi Frossard: VGG in TensorFlow

SLIDE 32

How do we implement a neural net?

Neural nets look complicated. How do we implement them? There exist suitable frameworks:

◮ Theano, University of Montreal
◮ TensorFlow, Google Brain
◮ Caffe, Berkeley Vision
◮ Keras, F. Chollet
◮ PyTorch, Facebook
◮ ...

SLIDE 33

VGG 16 in Keras

From GitHub:

def VGG_16(weights_path=None):
    model = Sequential()
    model.add(ZeroPadding2D((1,1), input_shape=(3,224,224)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))
    ...

The whole model is defined in 50 lines of code.

SLIDE 34

But what about training?

So complex ...

fit(x, y, batch_size=32, epochs=10)

◮ x: input, an array of data (hence, typically, an array of arrays)
◮ y: labels, an array of target categories
◮ batch_size: integer, number of samples per gradient update
◮ epochs: integer, the number of epochs (passes over the data) used to train the model
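A minimal sketch of the call in context (modern Keras API, not the talk's code), training a small dense classifier on random data:

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input

x = np.random.rand(100, 4)                    # 100 samples, 4 features each
y = np.random.randint(0, 2, size=(100,))      # binary labels

model = Sequential([Input(shape=(4,)),
                    Dense(8, activation='relu'),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='sgd', loss='binary_crossentropy')
model.fit(x, y, batch_size=32, epochs=10)     # the call discussed above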

SLIDE 35

Next arguments

Features and deep features

SLIDE 36

Features

Any individual measurable property of the data useful for the solution of a specific task is called a feature.

Examples:

◮ Emergency C-section: age, first pregnancy, anemia, fetus malpresentation, previous premature birth, anomalous ultrasound, ...
◮ Meteo: humidity, pressure, temperature, wind, rain, snow, ...
◮ Expected lifetime: age, health, annual income, kind of work, ...

SLIDE 37

Derived (inner) features

New interesting features may be derived as combinations of input features. Suppose for instance that we are interested in modeling some phenomenon with a cubic function f(x) = ax^3 + bx^2 + cx + d. We can use x as input or ...

SLIDE 38

Derived (inner) features

New interesting features may be derived as combinations of input features. Suppose for instance that we are interested in modeling some phenomenon with a cubic function f(x) = ax^3 + bx^2 + cx + d. We can use x as input or ... we can precompute x, x^2 and x^3, reducing the problem to a linear model!
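A sketch of the trick on synthetic data: precompute the powers of x as derived features, and a plain linear least-squares fit recovers the cubic's coefficients.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=50)
y = 2*x**3 - x**2 + 0.5*x + 1 + rng.normal(scale=0.01, size=50)   # a noisy cubic

features = np.stack([x**3, x**2, x, np.ones_like(x)], axis=1)     # derived features
coeffs, *_ = np.linalg.lstsq(features, y, rcond=None)             # linear model on them
print(coeffs)                                                     # ≈ [2, -1, 0.5, 1]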

SLIDE 39

Traditional Image Processing

In order to process an image we start computing interesting derived features on the image:

• first order derivatives
• second order derivatives
• difference of Gaussians
• Laplacian
• ...

[Figure: the original image, two Gaussian blurs (radii 10 and 25), and their Gaussian difference]

Then we use these derived features to get the desired output.

SLIDE 40

Deep learning, in deeper sense

Discovering good features is a complex task. Why not delegate the task to the machine, letting it learn them? Deep learning exploits a hierarchical organization of the learning model, allowing complex features to be computed in terms of simpler ones, through non-linear transformations.

SLIDE 41

AI, machine learning, deep learning

• Knowledge-based systems: take an expert, ask him how he solves a problem, and try to mimic his approach by means of logical rules

SLIDE 42

AI, machine learning, deep learning

• Knowledge-based systems: take an expert, ask him how he solves a problem, and try to mimic his approach by means of logical rules
• Traditional Machine Learning: take an expert, ask him what features of the data are relevant to solve a given problem, and let the machine learn the mapping

SLIDE 43

AI, machine learning, deep learning

• Knowledge-based systems: take an expert, ask him how he solves a problem, and try to mimic his approach by means of logical rules
• Traditional Machine Learning: take an expert, ask him what features of the data are relevant to solve a given problem, and let the machine learn the mapping
• Deep Learning: get rid of the expert

SLIDE 44

Relations between research areas

[Figure: nested sets — Deep Learning (example: MLPs, autoencoders) inside Representation Learning, inside Machine Learning (example: logistic regression), inside Artificial Intelligence (example: knowledge bases)]

Picture taken from “Deep Learning” by Y. Bengio, I. Goodfellow and A. Courville, MIT Press.

SLIDE 45

Components trained to learn

[Figure: input-to-output pipelines for rule-based systems (hand-designed program), classic machine learning (hand-designed features + learned mapping), representation learning (learned features + learned mapping), and deep learning (learned simple features, learned more complex features, learned mapping); shaded boxes mark components trained to learn]

Picture taken from “Deep Learning” by Y. Bengio, I. Goodfellow and A. Courville, MIT Press.

SLIDE 46

Next arguments

Some successful applications

  • MNIST and ImageNet
  • Speech Recognition
  • Lip reading
  • Text generation
  • Deep dreams and Inceptionism
  • Mimicking style
  • Robot navigation
  • Game simulation

SLIDE 47

MNIST

Modified National Institute of Standards and Technology database

◮ grayscale images of handwritten digits, 20 × 20 pixels each
◮ 60,000 training images and 10,000 testing images

SLIDE 48

MNIST

A comparison of different techniques:

Classifier                      Error rate (%)
Linear classifier               7.6
K-Nearest Neighbors             0.52
SVM                             0.56
Shallow neural network          1.6
Deep neural network             0.35
Convolutional neural network    0.21

See LeCun's page "the mnist database" for more data.

SLIDE 49

ImageNet

ImageNet (@Stanford Vision Lab)

◮ high resolution color images covering 22K object classes
◮ over 15 million labeled images from the web

SLIDE 50

ImageNet competition

Annual competition of image classification (since 2010).

◮ 1.2 million images (30K categories)
◮ make five guesses about the image label, ordered by confidence

SLIDE 51

ImageNet samples

SLIDE 52

ImageNet results

SLIDE 53

Speech recognition

Several stages (similar to optical character recognition):

◮ Segmentation. Convert the sound wave into a vector of acoustic coefficients. Typical sampling: 10 milliseconds.
◮ The acoustic model. Use adjacent vectors of acoustic coefficients to associate probabilities with phonemes.
◮ Decoding. Find the sequence of phonemes that best fits the acoustic data and a model of expected sentences.

Deep neural networks, pioneered by George Dahl and Abdel-rahman Mohamed, are replacing previous machine learning methods.

SLIDE 54

Speech recognition

Major industries are investing a lot of money in speech recognition: Amazon (with Intel), Google, Microsoft, ...

Achieving Human Parity in Conversational Speech Recognition, Speech & Dialog research group at Microsoft, 2016. R. Zweig (project manager) attributes the accomplishment to the systematic use of the latest neural network technology in all aspects of the system.

SLIDE 55

Lip reading

Google’s DeepMind AI can lip-read TV shows better than a professional

SLIDE 56

Text Generation

See Andrej Karpathy’s blog The Unreasonable Effectiveness of Recurrent Neural Networks

Examples of fake algebraic documents automatically generated by an RNN.

SLIDE 57

Deep dreams

Visit Deep dreams generator

SLIDE 58

Mimicking style

A Neural Algorithm of Artistic Style, by L.A. Gatys, A.S. Ecker, M. Bethge.

Similar to inceptionism, but with “style” (texture) instead of content.

SLIDE 59

More examples

SLIDE 60

More examples

SLIDE 61

Mimicking style: a different approach

Image-to-image translation with Cycle Generative Adversarial Networks

SLIDE 62

Robot navigation

Quadcopter Navigation in the Forest using Deep Neural Networks

Robotics and Perception Group, University of Zurich, Switzerland & Institute for Artificial Intelligence (IDSIA), Lugano, Switzerland

Based on Imitation Learning

SLIDE 63

Atari Games and Q-learning

Google DeepMind’s system playing Atari games (2013), recently extended to Augmented Imagination (2017) [video]. Based on:

◮ deep neural networks
◮ an innovative reinforcement learning technique called Q-learning
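A sketch of the tabular Q-learning update at the heart of the technique (deep Q-networks replace the table with a neural network); the learning rate alpha and the discount gamma are illustrative:

import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))    # table of action values Q(s, a)
alpha, gamma = 0.1, 0.99               # learning rate, discount factor

def q_update(s, a, reward, s_next):
    # move Q(s, a) toward the observed reward plus the best predicted future value
    target = reward + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

q_update(s=0, a=1, reward=1.0, s_next=2)
print(Q[0])                            # [0., 0.1]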

SLIDE 64

Atari Games and Q-learning

The same network architecture was applied to all games. Inputs are screen frames. It works well for reactive games, not for planning.
