

SLIDE 1

Deep Networks

Andrea Passerini passerini@disi.unitn.it

Machine Learning

Deep Networks

SLIDE 2

Need for Deep Networks

Perceptron: can only model linear functions.
Kernel machines: non-linearity is provided by kernels, but one needs to design appropriate kernels (possibly selecting from a set, i.e. kernel learning), and the solution is a linear combination of kernels.

SLIDE 3

Need for Deep Networks

Multilayer Perceptron (MLP): a network of interconnected neurons with a layered architecture, where neurons from one layer send outputs to the following layer.
Input layer at the bottom (input features), one or more hidden layers in the middle (learned features), output layer on top (predictions).

SLIDE 4

Multilayer Perceptron (MLP)

SLIDE 5

Activation Function

Perceptron: threshold activation f(x) = sign(w^T x). The derivative is zero everywhere except at zero, where the function is not differentiable, so it is impossible to run gradient-based optimization.

SLIDE 6

Activation Function

[Figure: plot of the sigmoid 1/(1+exp(-x))]

Sigmoid: f(x) = σ(w^T x) = 1 / (1 + exp(−w^T x)). A smooth version of the threshold: approximately linear around zero, saturates for large positive and negative values.
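The sigmoid is straightforward to implement; a minimal NumPy sketch (the function name is mine, not from the slides) showing the near-linear and saturating regimes:

```python
import numpy as np

def sigmoid(z):
    # smooth version of the threshold: ~linear around 0, saturates for |z| large
    return 1.0 / (1.0 + np.exp(-z))

# near zero the response is ~linear; far from zero it saturates towards 0 or 1
values = sigmoid(np.array([-10.0, 0.0, 10.0]))
```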

SLIDE 7

Output Layer

Binary classification: one output neuron o(x) with sigmoid activation f(x) = σ(o(x)) = 1 / (1 + exp(−o(x))). Decision threshold at 0.5: y* = sign(f(x) − 0.5).

SLIDE 8

Output Layer

Multiclass classification: one output neuron per class (called the logits layer): [o1(x), ..., oc(x)].
Softmax activation: fi(x) = exp(oi(x)) / Σ_{j=1}^{c} exp(oj(x)).
Decision is the highest scoring class: y* = argmax_{i ∈ [1,c]} fi(x).
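A small NumPy sketch of the softmax decision rule (names are mine; subtracting the maximum logit is a standard numerical-stability trick, not shown on the slide, and does not change the result):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))   # shift by max: same probabilities, avoids overflow
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # [o1(x), ..., oc(x)]
probs = softmax(logits)
y_star = int(np.argmax(probs))       # decision: highest scoring class
```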

SLIDE 9

Output Layer

Regression: one output neuron o(x) with linear activation. Decision is the value of the output neuron: f(x) = o(x).

SLIDE 10

Representational power of MLP

Representable functions:
Boolean functions: any boolean function can be represented by some network with two layers of units.
Continuous functions: every bounded continuous function can be approximated with arbitrarily small error by a network with two layers of units (sigmoid hidden activation, linear output activation).
Arbitrary functions: any function can be approximated to arbitrary accuracy by a network with three layers of units (sigmoid hidden activation, linear output activation).

SLIDE 11

Shallow vs deep architectures: Boolean functions

Conjunctive normal form (CNF): one neuron for each clause (OR gate), with negative weights for negated literals, and one neuron at the top (AND gate).
Problem: the number of gates. Some functions require an exponential number of gates (e.g. the parity function), but can be expressed with a linear number of gates using a deep network (e.g. a combination of XOR gates).

SLIDE 12

Training MLP

Loss functions (common choices):
Cross entropy for binary classification (y ∈ {0, 1}): ℓ(y, f(x)) = −(y log f(x) + (1 − y) log(1 − f(x)))
Cross entropy for multiclass classification (y ∈ [1, c]): ℓ(y, f(x)) = −log f_y(x)
Mean squared error for regression: ℓ(y, f(x)) = (y − f(x))²
Note: minimizing cross entropy corresponds to maximizing likelihood.
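The three losses can be coded directly from the formulas above (a toy NumPy sketch; function names are mine, and f is assumed to already be a sigmoid/softmax output):

```python
import numpy as np

def binary_cross_entropy(y, f):        # y in {0, 1}, f = sigmoid output in (0, 1)
    return -(y * np.log(f) + (1 - y) * np.log(1 - f))

def multiclass_cross_entropy(y, f):    # y = class index, f = softmax vector
    return -np.log(f[y])

def squared_error(y, f):               # regression
    return (y - f) ** 2
```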

SLIDE 13

Training MLP

Stochastic gradient descent. Training error for example (x, y) (e.g. regression):
E(W) = ½ (y − f(x))²
Gradient update (η learning rate):
w_lj = w_lj − η ∂E(W)/∂w_lj
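As a sketch, one stochastic update for a single linear output neuron f(x) = w^T x (a toy example of mine, not from the slides), plugging the gradient of E(W) = ½(y − f(x))² into the update rule:

```python
import numpy as np

def sgd_step(w, x, y, eta):
    f = w @ x                  # linear output neuron f(x) = w^T x
    grad = -(y - f) * x        # gradient of 1/2 (y - f(x))^2 wrt w
    return w - eta * grad      # w <- w - eta * dE/dw

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=2)      # small random initialization
x, y = np.array([1.0, 2.0]), 1.5       # one training example (x, y)
w_new = sgd_step(w, x, y, eta=0.1)
```

A single step moves w against the gradient, so the squared error on this example decreases.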

SLIDE 14

Training MLP

Backpropagation: use the chain rule for the derivative:
∂E(W)/∂w_lj = (∂E(W)/∂a_l) (∂a_l/∂w_lj) = δ_l φ_j
where δ_l = ∂E(W)/∂a_l and ∂a_l/∂w_lj = φ_j (a_l is the pre-activation of unit l, φ_j the output of unit j feeding into it).

SLIDE 15

Training MLP

Output units: the delta is easy to compute on output units. E.g. for regression with linear outputs:
δ_o = ∂E(W)/∂a_o = ∂[½(y − f(x))²]/∂a_o = ∂[½(y − a_o)²]/∂a_o = −(y − a_o)

SLIDE 16

Training MLP

Hidden units: consider the contribution to the error through all outer connections (sigmoid activation):
δ_l = ∂E(W)/∂a_l
    = Σ_{k ∈ ch[l]} (∂E(W)/∂a_k) (∂a_k/∂a_l)
    = Σ_{k ∈ ch[l]} δ_k (∂a_k/∂φ_l) (∂φ_l/∂a_l)
    = Σ_{k ∈ ch[l]} δ_k w_kl ∂σ(a_l)/∂a_l
    = Σ_{k ∈ ch[l]} δ_k w_kl σ(a_l)(1 − σ(a_l))

SLIDE 17

Training MLP

Derivative of the sigmoid:
∂σ(x)/∂x = ∂/∂x [1/(1 + exp(−x))]
         = −(1 + exp(−x))^(−2) ∂/∂x (1 + exp(−x))
         = −(1 + exp(−x))^(−2) (−exp(−x))
         = exp(−x) / (1 + exp(−x))²
         = [1/(1 + exp(−x))] [1 − 1/(1 + exp(−x))]
         = σ(x)(1 − σ(x))
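The identity σ'(x) = σ(x)(1 − σ(x)) is easy to check numerically with a central finite difference (a quick sanity check of mine, not part of the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.3
analytic = sigmoid(x) * (1.0 - sigmoid(x))             # sigma(x)(1 - sigma(x))
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
```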

SLIDE 18

Deep architectures: modular structure

A deep network is a stack of modules, each taking the previous representation as input:
φ1 = F1(x, W1), φ2 = F2(φ1, W2), φ3 = F3(φ2, W3), E = Loss(φ3, y)
Generic layer: φ_j = F_j(φ_{j−1}, W_j)
Gradient with respect to the weights of layer j:
∂E/∂W_j = (∂E/∂φ_j)(∂φ_j/∂W_j) = (∂E/∂φ_j) ∂F_j(φ_{j−1}, W_j)/∂W_j
Gradient propagated backwards to the previous layer:
∂E/∂φ_{j−1} = (∂E/∂φ_j)(∂φ_j/∂φ_{j−1}) = (∂E/∂φ_j) ∂F_j(φ_{j−1}, W_j)/∂φ_{j−1}

SLIDE 19

Remarks on backpropagation

Local minima: the error surface of a multilayer neural network can contain several minima, and backpropagation is only guaranteed to converge to a local minimum. Heuristic attempts to address the problem:
use stochastic instead of true gradient descent
train multiple networks with different random weights and average or choose the best
many more...
Note: training kernel machines requires solving quadratic optimization problems → global optimum guaranteed. Deep networks are more expressive in principle, but harder to train.

SLIDE 20

Stopping criterion and generalization

Stopping criterion: how can we choose the training termination condition? Overtraining the network increases the possibility of overfitting the training data.
The network is initialized with small random weights ⇒ very simple decision surface. Overfitting occurs at later iterations, when increasingly complex surfaces are being generated. Use a separate validation set to estimate the performance of the network and choose when to stop training.

[Figure: training, validation and test error vs training epochs]
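The validation-based stopping rule can be sketched as an early-stopping loop (the hooks `train_epoch` and `val_error` are hypothetical placeholders for your training pass and validation-set evaluation):

```python
def train_with_early_stopping(train_epoch, val_error, max_epochs=100, patience=5):
    """Stop when validation error has not improved for `patience` epochs."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_epoch()              # one pass over the (shuffled) training set
        err = val_error()          # error estimated on a held-out validation set
        if err < best:
            best, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break              # overfitting suspected: stop training
    return best_epoch, best
```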

SLIDE 21

Training deep architectures

Problem: vanishing gradient. The error gradient is backpropagated through the layers, and at each step it is multiplied by the derivative of the sigmoid, which is very small for saturated units. The gradient thus vanishes in the lower layers: this is the difficulty of training deep networks!

SLIDE 22

Tricks of the trade

A few simple suggestions:
Do not initialize weights to zero, but to small random values around zero.
Standardize inputs (x' = (x − μ_x)/σ_x) to avoid saturating hidden units.
Randomly shuffle training examples before each training epoch.
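The standardization suggestion in code (toy data of mine; `axis=0` standardizes each input feature independently):

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])            # toy dataset, one row per example
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sigma                # x' = (x - mu_x) / sigma_x per feature
```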

SLIDE 23

Tricks of the trade: activation functions

Rectifier: f(x) = max(0, w^T x). Linearity is nice for learning; saturation (as in the sigmoid) is bad for learning (the gradient vanishes → no weight update). A neuron employing the rectifier activation is called a rectified linear unit (ReLU).
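A minimal sketch of the rectifier and its (sub)gradient, which is what makes ReLU attractive for backpropagation (function names are mine):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)            # linear for z > 0, zero otherwise

def relu_grad(z):
    return (z > 0).astype(float)         # 1 for active units: no saturation
```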

SLIDE 24

Tricks of the trade: regularization

[Figure: contours of E(W) and of ||W||² in weight space (W1, W2)]

2-norm regularization: J(W) = E(W) + λ||W||². Penalizes weights by their Euclidean norm; weights with less influence on the error get smaller values.

SLIDE 25

Tricks of the trade: regularization

[Figure: contours of E(W) and of |W| in weight space (W1, W2)]

1-norm regularization: J(W) = E(W) + λ|W|. Penalizes weights by the sum of their absolute values; encourages less relevant weights to be exactly zero (sparsity-inducing norm).

SLIDE 26

Tricks of the trade: initialization

Suggestions:
Randomly initialize weights (to break symmetries between neurons).
Carefully set the initialization range (to preserve forward and backward variance), e.g.:
W_ij ∼ U(−√6/√(n + m), √6/√(n + m))
with n and m the number of inputs and outputs.
Sparse initialization: enforce a fraction of weights to be non-zero (encourages diversity between neurons).
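The uniform range above is the Glorot (Xavier) initialization; a sketch (function name mine):

```python
import numpy as np

def glorot_uniform(n_in, n_out, seed=0):
    # U(-sqrt(6)/sqrt(n+m), +sqrt(6)/sqrt(n+m)) preserves forward/backward variance
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    rng = np.random.default_rng(seed)
    return rng.uniform(-limit, limit, size=(n_out, n_in))

W = glorot_uniform(100, 50)   # weight matrix for a 100 -> 50 layer
```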

SLIDE 27

Tricks of the trade: gradient descent

Batch vs stochastic:
Batch gradient descent updates parameters after seeing all examples → too slow for large datasets.
Fully stochastic gradient descent updates parameters after seeing each example → objective too different from the true one.
Minibatch gradient descent: update parameters after seeing a minibatch of m examples (m depends on many factors, e.g. size of GPU memory).

SLIDE 28

Tricks of the trade: gradient descent

Momentum:
v_ji = α v_ji − η ∂E(W)/∂w_ji
w_ji = w_ji + v_ji
where 0 ≤ α < 1 is called the momentum. It tends to keep updating weights in the same direction: think of a ball rolling on an error surface. Possible effects:
roll through small local minima without stopping
traverse flat surfaces instead of stopping there
increase the step size of the search in regions of constant gradient
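One momentum update in code, a direct sketch of the two equations above (function name mine):

```python
def momentum_step(w, v, grad, eta=0.01, alpha=0.9):
    v = alpha * v - eta * grad   # velocity accumulates past gradients
    w = w + v
    return w, v

# with a constant gradient the step size grows: the ball picks up speed
w, v = 0.0, 0.0
w, v1 = momentum_step(w, v, grad=1.0)
w, v2 = momentum_step(w, v1, grad=1.0)
```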

SLIDE 29

Tricks of the trade: adaptive gradient

Decreasing learning rate:
η_t = (1 − t/τ) η_0 + (t/τ) η_τ   if t < τ
η_t = η_τ                          otherwise
Larger learning rate at the beginning for faster convergence towards the attraction basin; smaller learning rate at the end to avoid oscillation close to the minimum.

SLIDE 30

Tricks of the trade: adaptive gradient

Adagrad:
r_ji = r_ji + (∂E(W)/∂w_ji)²
w_ji = w_ji − (η/√r_ji) ∂E(W)/∂w_ji
Reduces the learning rate in steep directions and increases it in gentler directions.
Problem: the squared gradient is accumulated over all iterations, so for non-convex problems the learning-rate reduction can be excessive/premature.
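An Adagrad update in code (sketch of mine; the small `eps` guards against division by zero and is an implementation detail, not on the slide):

```python
import numpy as np

def adagrad_step(w, r, grad, eta=0.1, eps=1e-8):
    r = r + grad ** 2                            # accumulate ALL squared gradients
    w = w - eta / (np.sqrt(r) + eps) * grad      # per-parameter learning rate
    return w, r

# repeated identical gradients: the effective step shrinks every iteration
w1, r1 = adagrad_step(0.0, 0.0, grad=1.0)
w2, r2 = adagrad_step(w1, r1, grad=1.0)
```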

SLIDE 31

Tricks of the trade: adaptive gradient

RMSProp:
r_ji = ρ r_ji + (1 − ρ)(∂E(W)/∂w_ji)²
w_ji = w_ji − (η/√r_ji) ∂E(W)/∂w_ji
Exponentially decaying accumulation of the squared gradient (0 < ρ < 1) avoids the premature reduction of Adagrad, with Adagrad-like behaviour when reaching a convex bowl.
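The RMSProp variant differs only in the accumulator (sketch of mine, same caveat about `eps` as for Adagrad):

```python
import numpy as np

def rmsprop_step(w, r, grad, eta=0.01, rho=0.9, eps=1e-8):
    r = rho * r + (1 - rho) * grad ** 2          # exponentially decaying average
    w = w - eta / (np.sqrt(r) + eps) * grad
    return w, r

# with a constant gradient, r converges to grad**2 instead of growing unboundedly
w, r = 0.0, 0.0
for _ in range(100):
    w, r = rmsprop_step(w, r, grad=2.0)
```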

SLIDE 32

Tricks of the trade: batch normalization

Covariate shift is the problem of the input distribution to a model changing over time, while the model does not adapt to the change. In (very) deep networks, internal covariate shift takes place among layers when they get updated by backpropagation.

SLIDE 33

Tricks of the trade: batch normalization

Solution (sketch): normalize each node activation (the input to the activation function) by its batch statistics:
x̂_i = (x_i − μ_B)/σ_B
where x is the activation of an arbitrary node in an arbitrary layer, B = {x_1, ..., x_m} is a batch of values for that activation, and μ_B, σ²_B are the batch mean and variance.
Then scale and shift each activation with adjustable parameters (γ and β become part of the network parameters):
y_i = γ x̂_i + β
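A forward-pass sketch of batch normalization for a single node (training-time batch statistics only; the running averages used at test time, and the `eps` stabilizer, are implementation details of mine):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu, var = x.mean(), x.var()                 # batch statistics mu_B, sigma^2_B
    x_hat = (x - mu) / np.sqrt(var + eps)       # normalized activation
    return gamma * x_hat + beta                 # learnable scale and shift

x = np.array([1.0, 2.0, 3.0, 4.0])              # batch of values for one activation
y = batchnorm_forward(x, gamma=1.0, beta=0.0)
```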

SLIDE 34

Tricks of the trade: batch normalization

Advantages:
more robustness to parameter initialization
allows faster learning rates without divergence
keeps activations in the non-saturated region even for saturating activation functions
regularizes the model

SLIDE 35

Tricks of the trade: layerwise pre-training

[Figure: autoencoder — input → code → reconstruction]

Autoencoder: train a shallow network to reproduce the input in the output. It learns to map inputs into a sensible hidden representation (representation learning), and can be done with unlabelled examples (unsupervised learning).

SLIDE 36

Tricks of the trade: layerwise pre-training

[Figure: stacked autoencoder — hidden layers trained one at a time on input reconstruction]

Stacked autoencoder, repeat:
1. discard the output layer
2. freeze the hidden layer weights
3. add another hidden + output layer
4. train the network to reproduce the input

SLIDE 37

Tricks of the trade: layerwise pre-training

[Figure: pre-trained stack with a supervised output layer replacing the reconstruction layer]

Global refinement: discard the autoencoder output layer, add an appropriate output layer for the supervised task (e.g. one-hot encoding for multiclass classification), then learn the output layer weights and refine all internal weights by the backpropagation algorithm.

SLIDE 38

Tricks of the trade: layerwise pre-training

Modern pre-training:
Supervised pre-training: layerwise training with actual labels.
Transfer learning: train a network on a similar task, discard the last layers and retrain on the target task.
Multi-level supervision: auxiliary output nodes at intermediate layers to speed up learning.

SLIDE 39

Popular deep architectures

Many different architectures:
convolutional networks for exploiting local correlations (e.g. for images)
recurrent and recursive networks for collective predictions (e.g. sequential labelling)
deep Boltzmann machines as probabilistic generative models (can also generate new instances of a certain class)
generative adversarial networks to generate new instances as a game between discriminator and generator

SLIDE 40

Convolutional networks (CNN)

Location invariance + compositionality:
convolution filters extracting local features
pooling to provide invariance to local variations
hierarchy of filters to compose complex features from simpler ones (e.g. pixels to edges to shapes)
fully connected layers for the final classification

SLIDE 41

Long Short-Term Memory Networks (LSTM)

Recurrent computation with selective memory:
the cell state is propagated along the chain
the forget gate selectively forgets parts of the cell state
the input gate selectively chooses parts of the candidate for the cell update
the output gate selectively chooses parts of the cell state for output

SLIDE 42

Generative Adversarial Networks (GAN)

Generative learning as an adversarial game: a generator network learns to generate items (e.g. images) from random noise, while a discriminator network learns to distinguish between real items and generated ones. The two networks are jointly learned (adversarial game). No human supervision needed!

SLIDE 43

References

Libraries:
TensorFlow (https://www.tensorflow.org/)
Keras (https://keras.io/)
PyTorch (http://pytorch.org/)

Literature:
Yoshua Bengio, Learning Deep Architectures for AI, Foundations & Trends in Machine Learning, 2009.
Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning, Book in preparation for MIT Press, 2016 (http://www.deeplearningbook.org/).
Christopher Olah, Understanding LSTM Networks (http://colah.github.io/posts/2015-08-Understanding-LSTMs/).
