 
An introduction to Neural Networks and Deep Learning

Talk given at the Department of Mathematics of the University of Bologna
February 20, 2018

Andrea Asperti
DISI - Department of Informatics: Science and Engineering
University of Bologna
Mura Anteo Zamboni 7, 40127, Bologna, ITALY
andrea.asperti@unibo.it

Andrea Asperti, Università di Bologna - DISI: Dipartimento di Informatica: Scienza e Ingegneria

A branch of Machine Learning

What is Machine Learning?

There are problems that are difficult to address with traditional programming techniques:
• classify a document according to some criteria (e.g. spam, sentiment analysis, ...)
• compute the probability that a credit card transaction is fraudulent
• recognize an object in some image (possibly from an unusual viewpoint, in new lighting conditions, in a cluttered scene)
• ...

Typically the result is a weighted combination of a large number of parameters, each one contributing to the solution to a small degree.

The Machine Learning approach

Suppose we have a set of input-output pairs (training set) $\{\langle x_i, y_i \rangle\}$; the problem consists in guessing the map $x_i \mapsto y_i$.

The M.L. approach:
• describe the problem with a model depending on some parameters $\Theta$ (i.e. choose a parametric class of functions)
• define a loss function to compare the results of the model with the expected (experimental) values
• optimize (fit) the parameters $\Theta$ to reduce the loss to a minimum

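The three steps above can be sketched on a toy problem. This is only an illustration under assumptions that are not in the talk: a one-parameter model `theta * x`, a mean squared error loss, a crude search over a grid of candidate values standing in for the optimizer, and made-up data.

```python
# Toy illustration of the three steps: model, loss, fit.
# All names and data are hypothetical, chosen only for this example.

# Training set: pairs (x_i, y_i), here generated by y = 2 * x
training_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


def model(theta, x):
    """Step 1: a parametric class of functions, here x -> theta * x."""
    return theta * x


def loss(theta, data):
    """Step 2: mean squared error between model outputs and expected values."""
    return sum((model(theta, x) - y) ** 2 for x, y in data) / len(data)


def fit(data, candidates):
    """Step 3: pick the candidate parameter with the smallest loss."""
    return min(candidates, key=lambda theta: loss(theta, data))


# Candidate values 0.0, 0.1, ..., 4.0
candidates = [k / 10 for k in range(41)]
best_theta = fit(training_set, candidates)
print(best_theta, loss(best_theta, training_set))
```

A grid search only works with very few parameters; for larger models the fitting step is usually done iteratively.
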
Why Learning?

Machine Learning problems are in fact optimization problems! So, why talk about learning?

The point is that the solution to the optimization problem is not given in an analytical form (often there is no closed-form solution).

So, we use iterative techniques (typically, gradient descent) to progressively approximate the result.

This form of iteration over data can be understood as a way of progressively learning the objective function based on the experience of past observations.

Using gradients

The objective is to minimize some loss function over (fixed) training samples, e.g.

$$\Theta(w) = \sum_i E(o(w, x_i), y_i)$$

by suitably adjusting the parameters $w$.

See how $\Theta$ changes under small perturbations $\Delta(w)$ of the parameters $w$: this is the gradient

$$\nabla_w[\Theta] = \left[\frac{\partial \Theta}{\partial w_1}, \dots, \frac{\partial \Theta}{\partial w_n}\right]$$

of $\Theta$ w.r.t. $w$. The gradient is a vector pointing in the direction of steepest ascent.

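The idea of measuring how the loss reacts to small perturbations of each parameter can be sketched with central finite differences. The loss below is a hypothetical instance of $\Theta(w)$, assuming a linear output `w[0] * x + w[1]` and a squared error; the sample data are made up.

```python
def numerical_gradient(f, w, eps=1e-6):
    """Approximate the gradient of f at w by central differences."""
    grad = []
    for j in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[j] += eps   # perturb only the j-th parameter upwards
        w_minus[j] -= eps  # and downwards
        grad.append((f(w_plus) - f(w_minus)) / (2 * eps))
    return grad


# Hypothetical training samples (x_i, y_i)
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]


def theta(w):
    """Hypothetical loss: sum of squared errors of the output w[0] * x + w[1]."""
    return sum((w[0] * x + w[1] - y) ** 2 for x, y in samples)


g = numerical_gradient(theta, [0.0, 0.0])
print(g)  # approximately [-26, -18]
```

Both components are negative at `[0.0, 0.0]`, so increasing either parameter decreases the loss there. In practice gradients are computed analytically rather than by finite differences; the numerical version is mainly useful as a check.
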
Gradient descent

Goal: minimize some loss function $\Theta(w)$ by suitably adjusting the parameters.

We can reach a minimal configuration for $\Theta(w)$ by iteratively taking small steps in the direction opposite to the gradient (gradient descent).

This is a general technique.

Warning: it is not guaranteed to work:
• it may end up in a local minimum
• it may get lost on a plateau

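A minimal sketch of the iteration, assuming a fixed step size (`learning_rate`) and a fixed number of steps. The loss is a hypothetical convex function chosen so that the minimum is known in advance.

```python
def gradient_descent(grad, w, learning_rate=0.1, steps=100):
    """Repeatedly move w a small step in the direction opposite to the gradient."""
    for _ in range(steps):
        g = grad(w)
        w = [wj - learning_rate * gj for wj, gj in zip(w, g)]
    return w


def grad_theta(w):
    """Gradient of the hypothetical loss (w0 - 3)^2 + (w1 + 1)^2, minimal at (3, -1)."""
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]


w_min = gradient_descent(grad_theta, [0.0, 0.0])
print(w_min)  # close to [3, -1]
```

On this convex loss the iteration converges to the unique minimum. On a non-convex loss the same loop may stop in a local minimum or slow down on a plateau, as warned above.
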
Next topic

A bit of taxonomy

Different types of Learning Tasks

• supervised learning: inputs + outputs (labels)
  - classification
  - regression
• unsupervised learning: just inputs
  - clustering
  - component analysis
  - autoencoding
• reinforcement learning: actions and rewards
  - learning long-term gains
  - planning

Classification vs. Regression

Two forms of supervised learning: $\{\langle x_i, y_i \rangle\}$

• classification: $y$ is discrete, $y \in \{\bullet, +\}$
• regression: $y$ is (conceptually) continuous

[Figure: two panels. Classification: a new input is assigned a class ("Probably a cat!"). Regression: an expected value is predicted for a new input.]

Many different techniques

• Different ways to define the models:
  - decision trees
  - linear models
  - neural networks
  - ...
• Different error (loss) functions:
  - mean squared errors
  - logistic loss
  - cross entropy
  - cosine distance
  - maximum margin
  - ...

[Figures: a decision tree (Outlook: Sunny / Overcast / Rain; Humidity: High / Normal; Wind: Strong / Weak; leaves Yes / No), a neural net, mean squared errors, maximum margin]

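Two of the loss functions listed above can be written in a few lines. This is a sketch under assumptions: the cross entropy is shown in its binary form (labels in {0, 1}, predicted probabilities strictly between 0 and 1), and the function names are chosen for the example.

```python
import math


def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def binary_cross_entropy(probabilities, labels):
    """Average negative log-likelihood of 0/1 labels under predicted probabilities.

    Assumes every probability is strictly between 0 and 1.
    """
    return -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for p, t in zip(probabilities, labels)
    ) / len(labels)


print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))  # 2.0
print(binary_cross_entropy([0.5, 0.5], [1, 0]))    # log(2), about 0.693
```
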
Next topic

Neural Networks

Neural Network

A network of (artificial) neurons.

Artificial neuron: each neuron takes multiple inputs and produces a single output (that can be passed as input to many other neurons).

The artificial neuron

[Figure: inputs $x_1, x_2, \dots, x_n$ with weights $w_1, w_2, \dots, w_n$ and a bias $b$ (the weight of a constant input $+1$) feed a sum $\Sigma$, followed by the activation function, which produces the output]

The purpose of the activation function is to introduce a thresholding mechanism (similar to the axon hillock of cortical neurons).

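The neuron described above (weighted sum of the inputs, plus bias, passed through an activation function) can be sketched as follows. The weights and bias are hypothetical values, chosen so that the neuron computes the logical AND of two 0/1 inputs.

```python
def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def threshold(z):
    """Fire (1) only if the weighted sum is positive."""
    return 1 if z > 0 else 0


# Hypothetical parameters: the sum exceeds 0 only when both inputs are 1
and_weights, and_bias = [1.0, 1.0], -1.5
outputs = [
    neuron([a, b], and_weights, and_bias, threshold)
    for a in (0, 1)
    for b in (0, 1)
]
print(outputs)  # [0, 0, 0, 1]
```
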
Different activation functions

The activation function is responsible for threshold triggering.

• threshold: if $x > 0$ then 1 else 0
• logistic function: $\frac{1}{1 + e^{-x}}$
• hyperbolic tangent: $\frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
• rectified linear (ReLU): if $x > 0$ then $x$ else 0

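The four activation functions listed above translate directly into code. This is a plain transcription of the formulas, with function names chosen for the example; it makes no attempt at numerical robustness for very large inputs.

```python
import math


def threshold(x):
    """1 if x > 0, else 0."""
    return 1.0 if x > 0 else 0.0


def logistic(x):
    """1 / (1 + e^(-x)), with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def hyperbolic_tangent(x):
    """(e^x - e^(-x)) / (e^x + e^(-x)), with values in (-1, 1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def relu(x):
    """x if x > 0, else 0."""
    return x if x > 0 else 0.0


for f in (threshold, logistic, hyperbolic_tangent, relu):
    print(f.__name__, [round(f(x), 3) for x in (-2.0, 0.0, 2.0)])
```
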
A comparison with the cortical neuron

Next topic

Network typology/topology

Layers

A neural network is a collection of artificial neurons connected together. Neurons are usually organized in layers.

If there is more than one hidden layer the network is called deep; otherwise it is called a shallow network.

Feed-forward networks

If the network is acyclic, it is called a feed-forward network. Feed-forward networks are (at present) the most common type of network in practical applications.

Important: composing linear transformations alone is pointless, since the result is still a linear transformation.

What is the source of nonlinearity in Neural Networks? The activation function.

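The remark on composing linear transformations can be checked numerically. The sketch below assumes small hypothetical weight matrices, inputs treated as row vectors (so a layer computes `x W`), no biases, and ReLU as the activation.

```python
def vecmat(v, m):
    """Row vector times matrix: v (length n) times m (n x p) gives length p."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def matmul(a, b):
    """Matrix product of a (n x k) and b (k x p)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def relu(v):
    """Apply the rectified linear activation to each component."""
    return [max(0.0, x) for x in v]


# Hypothetical weights for two layers: 2 -> 2 -> 1
w1 = [[1.0, 0.0], [-1.0, 2.0]]
w2 = [[1.0], [1.0]]
x = [1.0, 3.0]

two_linear_layers = vecmat(vecmat(x, w1), w2)      # (x W1) W2
one_linear_layer = vecmat(x, matmul(w1, w2))       # x (W1 W2)
with_activation = vecmat(relu(vecmat(x, w1)), w2)  # relu(x W1) W2

print(two_linear_layers, one_linear_layer, with_activation)  # [4.0] [4.0] [6.0]
```

The two purely linear layers give the same result as the single matrix `W1 W2`, so the extra layer adds nothing. With the activation in between, the result differs and can no longer be written as one matrix product.
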
Dense networks

The most typical feed-forward network is a dense network, where each neuron at layer $k-1$ is connected to each neuron at layer $k$.

The network is defined by a matrix of parameters (weights) $W_k$ for each layer (+ biases). The matrix $W_k$ has dimension $L_k \times L_{k+1}$, where $L_k$ is the number of neurons at layer $k$.

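A forward pass through a dense network can be sketched with one weight matrix and one bias vector per layer. The sketch assumes that $W_k$ maps the $L_k$ outputs of layer $k$ to the $L_{k+1}$ neurons of the next layer, that inputs are row vectors, and that every layer uses ReLU; the layer sizes and values are hypothetical.

```python
def dense_layer(x, weights, biases, activation):
    """One dense layer.

    x has L_k entries, weights is an L_k x L_{k+1} matrix (list of rows),
    biases has L_{k+1} entries. Every input feeds every output neuron.
    """
    return [
        activation(sum(x[i] * weights[i][j] for i in range(len(x))) + biases[j])
        for j in range(len(biases))
    ]


def forward(x, layers, activation):
    """Feed x through the layers in order; no cycles, hence feed-forward."""
    for weights, biases in layers:
        x = dense_layer(x, weights, biases, activation)
    return x


def relu(z):
    return max(0.0, z)


# Hypothetical network with layer sizes 3 -> 2 -> 1
layers = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]], [0.0, 0.5]),  # 3 x 2 weights, 2 biases
    ([[2.0], [1.0]], [-1.0]),                               # 2 x 1 weights, 1 bias
]
output = forward([1.0, 2.0, 3.0], layers, relu)
print(output)  # [7.0]
```
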
Parameters and hyper-parameters

The weights $W_k$ are the parameters of the model: they are learned during the training phase.

The number of layers and the number of neurons per layer are hyper-parameters: they are chosen by the user and fixed before training starts.

Other important hyper-parameters govern training, such as the learning rate, batch size, number of epochs and many others.

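The distinction can be made concrete by counting how many parameters a given choice of hyper-parameters implies. The sketch assumes a dense network with one bias per neuron in every layer after the input; the hyper-parameter values are hypothetical.

```python
def count_parameters(layer_sizes):
    """Number of learned weights and biases in a dense network."""
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights + biases


# Hypothetical hyper-parameters: chosen by the user, fixed before training
hyper_parameters = {
    "layer_sizes": [784, 128, 10],
    "learning_rate": 0.01,
    "batch_size": 32,
    "epochs": 10,
}

n_params = count_parameters(hyper_parameters["layer_sizes"])
print(n_params)  # 101770
```

Only the 101770 weights and biases are adjusted by training; the entries of `hyper_parameters` stay fixed throughout.
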