Deep Learning
Michèle Sebag, TAO
Université Paris-Saclay
Jan. 21st, 2016
Credit for slides: Yoshua Bengio, Yann Le Cun, Nando de Freitas, Christian Perone, Honglak Lee, Ronan Collobert, Tomas Mikolov, Rich Caruana
Overview
Neural Nets
◮ Main ingredients
◮ Invariances and convolutional networks
Deep Learning
Deep Learning Applications
◮ Computer vision
◮ Natural language processing
◮ Deep Reinforcement Learning
The importance of being deep, revisited
Take-home message
Neural Nets
(Image: (C) David MacKay, Cambridge Univ. Press)
History
◮ 1943 A neuron as a computable function y = f(x) (McCulloch & Pitts)
Intelligence → Reasoning → Boolean functions
◮ 1960 Connectionism + learning algorithms (Rosenblatt)
◮ 1969 AI Winter (Minsky & Papert)
◮ 1989 Back-propagation (Amari; Rumelhart & McClelland; LeCun)
◮ 1995 Winter again (Vapnik)
◮ 2005 Deep Learning (Bengio, Hinton)
One neuron: input, weights, activation function
A neuron computes y = f(z): input x = (x1, . . . , xd) ∈ ℝ^d, weights w = (w1, . . . , wd), weighted sum z = Σ_i w_i x_i, output f(z) ∈ ℝ.

Activation functions
◮ Thresholded: 0 if z < threshold, 1 otherwise
◮ Linear: z
◮ Sigmoid: 1/(1 + e^(−z))
◮ Tanh: (e^z − e^(−z)) / (e^z + e^(−z))
◮ Radius-based: e^(−z²/σ²)
◮ Rectified linear (ReLU): max(0, z)
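As an illustration, a minimal NumPy sketch of these activation functions (the function names are mine, not the lecture's):

import numpy as np

def thresholded(z, threshold=0.0):
    return np.where(z < threshold, 0.0, 1.0)

def linear(z):
    return z

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)                      # (e^z - e^-z) / (e^z + e^-z)

def radius_based(z, sigma=1.0):
    return np.exp(-(z ** 2) / sigma ** 2)

def relu(z):
    return np.maximum(0.0, z)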
Learning the weights
An optimization problem: Define a criterion
◮ Supervised learning (classification, regression)
E = {(xi, yi), xi ∈ ℝ^d, yi ∈ ℝ, i = 1 . . . n}
◮ Reinforcement learning: learn a policy
π : state space ℝ^d → action space ℝ^(d′)
Mnih et al., 2015
Main issues
◮ Requires a differentiable / continuous activation function
◮ Non-convex optimization problem
Back-propagation, 1
Notations
Input x = (x1, . . . , xd)
From the input to the first hidden layer:
z_j^(1) = Σ_k w_jk x_k
x_j^(1) = f(z_j^(1))
From layer i to layer i + 1:
z_j^(i+1) = Σ_k w_jk^(i) x_k^(i)
x_j^(i+1) = f(z_j^(i+1))
(f: e.g. the sigmoid)
Back-propagation, 2
Input (x, y), x ∈ ℝ^d, y ∈ {−1, 1}
Phase 1: propagate the information forward
◮ For layer i = 1 . . . ℓ, for every neuron j on layer i:
z_j^(i) = Σ_k w_j,k^(i) x_k^(i−1)
x_j^(i) = f(z_j^(i))
Phase 2: compare the target output y to the obtained output x_1^(ℓ)
(assuming a scalar output for simplicity)
◮ Error: difference between ŷ = x_1^(ℓ) and y. Define
e_output = f′(z_1^(ℓ)) [ŷ − y]
where f′(t) is the (scalar) derivative of f at point t.
Back-propagation, 3
Phase 3: back-propagate the errors
e_j^(i−1) = f′(z_j^(i−1)) Σ_k w_kj^(i) e_k^(i)
Phase 4: update the weights on all layers
Δw_ij^(k) = α e_i^(k) x_j^(k−1)
where α < 1 is the learning rate. Adjusting the learning rate is a main issue.
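A compact NumPy sketch of the four phases for a fully connected network with sigmoid activations and the squared-error criterion (a sketch, not the lecture's code; variable names are mine):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W, alpha=0.1):
    # W: list of weight matrices, W[i] of shape (n_{i+1}, n_i)
    # Phase 1: forward pass, storing the activations
    activations = [x]
    for Wi in W:
        activations.append(sigmoid(Wi @ activations[-1]))
    # Phase 2: output error; for the sigmoid, f'(z) = f(z)(1 - f(z))
    y_hat = activations[-1]
    e = (y_hat * (1 - y_hat)) * (y_hat - y)
    # Phases 3 and 4: back-propagate errors and update weights (gradient descent)
    for i in reversed(range(len(W))):
        grad = np.outer(e, activations[i])      # e^(i) (x^(i-1))^t
        if i > 0:
            a = activations[i]
            e = (a * (1 - a)) * (W[i].T @ e)    # e^(i-1) = f'(z^(i-1)) sum_k w_kj e_k
        W[i] -= alpha * grad                    # descent step of size alpha
    return W

Note that for the sigmoid f′(z) = f(z)(1 − f(z)), so the stored forward activations suffice to compute every f′ term.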
Properties of NN
Good news
◮ MLP, RBF: universal approximators
For every decent function f (i.e. f² has a finite integral on every compact of ℝ^d) and every ε > 0, there exists some MLP/RBF g such that ||f − g|| < ε.
Bad news
◮ Not a constructive proof (the solution exists, so what?)
◮ Everything is possible → no guarantee (overfitting)
Very bad news
◮ A non-convex (and hard) optimization problem
◮ Lots of local minima
◮ Low reproducibility of the results
The curse of NNs
Le Cun 2007
http://videolectures.net/eml07_lecun_wia/
Old Key Issues (many still hold)
Model selection
◮ Selecting the number of neurons, the connection graph
More ⇒ Better
◮ Which learning criterion (avoid overfitting)

Algorithmic choices: a difficult optimization problem
◮ Enforce stability through relaxation:
W_new ← (1 − α) W_old + α W_new
◮ Decrease the learning rate α with time
◮ Stopping criterion: early stopping

Tricks
◮ Normalize the data
◮ Initialize W small!
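A sketch of these two tricks (my conventions: rows of X are samples; small initial weights keep sigmoid/tanh units near their linear regime at the start of training):

import numpy as np

def normalize(X):
    # zero mean, unit variance for each input variable
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

def init_weights(n_out, n_in, scale=0.01, rng=np.random.default_rng(0)):
    # small random weights: avoid saturating the activation functions early
    return scale * rng.standard_normal((n_out, n_in))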
Overview
Neural Nets
◮ Main ingredients
◮ Invariances and convolutional networks
Deep Learning
Deep Learning Applications
◮ Computer vision
◮ Natural language processing
◮ Deep Reinforcement Learning
The importance of being deep, revisited
Take-home message
Toward deeper representations
Invariances matter
◮ The label of an image is invariant under small translations, homothety, rotations...
◮ Invariance of the labels → invariance of the model:
y(x) = y(σ(x)) → h(x) = h(σ(x))

Enforcing invariances
◮ by augmenting the training set with transformed copies:
E = {(xi, yi)} ∪ {(σ(xi), yi)}
◮ by structuring the hypothesis space:
Convolutional networks
Hubel & Wiesel 1968
Visual cortex of the cat
◮ cells arranged in such a way that...
◮ each cell observes a fraction of the visual field: its receptive field
◮ their union covers the whole field
◮ Layer m: detection of local patterns (same weights everywhere)
◮ Layer m + 1: non-linear aggregation of the outputs of layer m
Ingredients of convolutional networks
◮ Weight sharing: the same local detector (aka kernel or filter) is applied at every position, by adapting the gradient-based update: the update is averaged over all positions
◮ Reduces the number of parameters by several orders of magnitude
Ingredients of convolutional networks, 2
Pooling
◮ Overlapping / non-overlapping regions
◮ Return the max / the sum of the feature map over the region
◮ Larger receptive fields (see more of the input)
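A direct (unoptimized) NumPy sketch of the two ingredients, a shared-weight convolution and non-overlapping max pooling, for a single-channel image (a sketch under these simplifying assumptions, not LeCun's implementation):

import numpy as np

def convolve2d(image, kernel):
    # valid convolution: the same kernel (shared weights) slides over the image
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out

def max_pool(fmap, size=2):
    # non-overlapping max pooling over size x size regions
    H, W = fmap.shape
    H, W = H - H % size, W - W % size               # crop to a multiple of size
    blocks = fmap[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))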
Convolutional networks, summary
LeCun 1998; Krizhevsky et al. 2012
Properties
◮ Invariance to small transformations (over the region)
◮ Reduced number of weights
◮ Usually many convolutional layers
Overview
Neural Nets
◮ Main ingredients
◮ Invariances and convolutional networks
Deep Learning
Deep Learning Applications
◮ Computer vision
◮ Natural language processing
◮ Deep Reinforcement Learning
The importance of being deep, revisited
Take-home message
Manifesto for Deep Learning
Bengio, Hinton 2006
◮ Computational efficiency
◮ Statistical efficiency
◮ Prior efficiency: the architecture relies on human labor... rather, on student labor:
build skills on top of simpler skills (Piaget)
The importance of being deep
A toy example: n-bit parity (Håstad 1987)
Pros: efficient representation; deep neural nets are (exponentially) more compact
Cons: poor learning
◮ More layers → a more difficult optimization problem
◮ Getting stuck in poor local optima
Overcoming the learning problem
Long Short-Term Memory
◮ Sepp Hochreiter and Jürgen Schmidhuber (1997). Long Short-Term Memory. Neural Computation.
Deep Belief Networks
◮ Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh (2006). A fast learning algorithm for deep belief nets. Neural Computation.
Auto-Encoders
◮ Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle (2007). Greedy Layer-Wise Training of Deep Networks. Advances in Neural Information Processing Systems.
Auto-encoders
E = {(xi, yi), xi ∈ ℝ^d, yi ∈ ℝ, i = 1 . . . n}
First layer: x → h1 → x̂
◮ An auto-encoder: find
W* = arg min_W Σ_i ||W^t ∘ W(xi) − xi||²
or, for binary inputs, maximize the cross-entropy
Σ_i Σ_j [ xi,j log x̂i,j + (1 − xi,j) log(1 − x̂i,j) ]
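A minimal sketch of a tied-weight auto-encoder trained by stochastic gradient descent on the squared reconstruction error (one possible reading of the slide's W^t ∘ W parameterization; the hyper-parameters are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, alpha=0.01, epochs=100,
                      rng=np.random.default_rng(0)):
    # X: (n, d). Tied weights: encoder h = s(W x), decoder x_hat = W^t h.
    n, d = X.shape
    W = 0.01 * rng.standard_normal((n_hidden, d))
    for _ in range(epochs):
        for x in X:
            h = sigmoid(W @ x)
            x_hat = W.T @ h
            err = x_hat - x                       # gradient of 0.5 ||x_hat - x||^2
            grad_dec = np.outer(h, err)           # decoder contribution to dL/dW
            back = (W @ err) * h * (1 - h)        # error reaching the hidden layer
            grad_enc = np.outer(back, x)          # encoder contribution to dL/dW
            W -= alpha * (grad_dec + grad_enc)    # tied weights: sum both parts
    return W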
Auto-encoders, 2
First layer: x → h1 → x̂
Second layer: same, replacing x with h1: h1 → h2 → ĥ1
Discussion
Layerwise training
◮ A less complex optimization problem (compared to training all layers simultaneously)
◮ Requires a local criterion: e.g. reconstruction
◮ Ensures that layer i + 1 encodes the same information as layer i...
◮ ... but in a more abstract way:
layer 1 encodes the patterns formed by the (descriptive) features;
layer 2 encodes the patterns formed by the activation of the previous patterns
◮ When to stop? Trial and error.
Now pre-training is almost obsolete: gradient problems are better understood
◮ Initialization
◮ New activation functions (ReLU)
◮ Regularization
◮ Mooore data
◮ Better optimization algorithms
Dropout
Why
◮ Ensemble learning is effective
◮ But training several deep NNs is too costly
◮ The many neurons in a large DNN can “form coalitions”
◮ Not robust!
How
◮ During training:
⋆ for each hidden neuron, each sample, each iteration,
⋆ for each input (of this hidden neuron),
⋆ zero the input with probability p (e.g. .5)
⋆ (roughly doubles the number of iterations needed to converge)
◮ During validation/test:
⋆ use all inputs
⋆ rescale the sum (× p) to preserve the average activation
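A sketch of the two regimes; note the slide's convention of rescaling at test time (modern “inverted dropout” rescales at training time instead):

import numpy as np

def dropout_train(x, p=0.5, rng=np.random.default_rng(0)):
    # training: zero each input with probability p
    mask = rng.random(x.shape) >= p
    return x * mask

def dropout_test(x, p=0.5):
    # test: use all inputs; multiply by the keep probability (1 - p),
    # which equals the slide's "x p" rescaling when p = 0.5
    return x * (1.0 - p)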
Recommendations
Ingredients as of 2015
◮ ReLU non-linearities
◮ Cross-entropy loss for classification
◮ Stochastic gradient descent on minibatches
◮ Shuffle the training samples
◮ Normalize the input variables (zero mean, unit variance)
◮ If you cannot overfit, increase the model size; if you can, regularize
Regularization
◮ L2: penalizes large weights
◮ L1: penalizes non-zero weights

Adaptive learning rate
◮ adjusted per neuron to fit the moving average of the last gradients
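The slide does not name a specific algorithm; an RMSProp-style update is one common realization of a per-weight learning rate driven by a moving average of past gradients:

import numpy as np

def rmsprop_update(w, grad, mean_sq, alpha=1e-3, decay=0.9, eps=1e-8):
    # per-weight step size, scaled by a moving average of squared gradients
    mean_sq = decay * mean_sq + (1 - decay) * grad ** 2
    w = w - alpha * grad / (np.sqrt(mean_sq) + eps)
    return w, mean_sq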
Hyper-parameters
◮ Grid search
◮ Continue training the most promising model
More: Neural Networks: Tricks of the Trade (2012 edition), G. Montavon, G. B. Orr, and K.-R. Müller, eds.
Not covered
◮ Long Short-Term Memory
◮ Restricted Boltzmann Machines
◮ Natural gradient
Overview
Neural Nets
◮ Main ingredients
◮ Invariances and convolutional networks
Deep Learning
Deep Learning Applications
◮ Computer vision
◮ Natural language processing
◮ Deep Reinforcement Learning
The importance of being deep, revisited
Take-home message
ImageNet Classification with Deep Convolutional Neural Networks
Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton; Advances in Neural Information Processing Systems, 2012

ImageNet
◮ 15M images
◮ 22K categories
◮ Images collected from the Web
◮ Human labelers (Amazon's Mechanical Turk crowd-sourcing)
◮ ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2010):
⋆ 1K categories
⋆ 1.2M training images (about 1,000 per category)
⋆ 50,000 validation images
⋆ 150,000 testing images
◮ RGB images of variable resolution
ImageNet
Evaluation
◮ Guess it right: top-1 error
◮ Guess the right one among the top 5: top-5 error
What is new ?
Former state of the art
◮ SIFT: scale-invariant feature transform
◮ HOG: histogram of oriented gradients
◮ Textons: “vector-quantized responses of a linear filter bank”
What is new, 2
Traditional approach: manually crafted features → trainable classifier
Deep learning: trainable feature extractor → trainable classifier
Activation function
◮ On CIFAR-10, ReLU trains 6 times faster than tanh

Data augmentation (60 million parameters and 650,000 neurons to learn)
◮ Translations and horizontal reflections
◮ Altering the RGB intensities:
⋆ PCA of the RGB values, giving eigenvector / eigenvalue pairs (p_i, λ_i)
⋆ add (p1, p2, p3)(α1 λ1, α2 λ2, α3 λ3)^t to each image, with each α drawn once per image (Gaussian, std 0.1)
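A sketch of this RGB-intensity augmentation, assuming float images of shape (H, W, 3); following the paper, α is drawn once per image from a Gaussian with standard deviation 0.1:

import numpy as np

def pca_color_augment(images, image, rng=np.random.default_rng(0)):
    # images: (N, H, W, 3) training set; image: one (H, W, 3) image to perturb
    pixels = images.reshape(-1, 3)
    cov = np.cov(pixels, rowvar=False)        # 3x3 covariance of RGB values
    lam, P = np.linalg.eigh(cov)              # eigenvalues, eigenvectors (columns)
    alpha = rng.normal(0.0, 0.1, size=3)      # one draw per image
    shift = P @ (alpha * lam)                 # (p1 p2 p3)(a1 l1, a2 l2, a3 l3)^t
    return image + shift                      # same shift added to every pixel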
Architecture: 5 convolutional layers
◮ 1st layer: 96 kernels (11 × 11 × 3; stride 4)
◮ normalized, pooled
◮ 2nd layer: 256 kernels (5 × 5 × 48)
◮ normalized, pooled
◮ 3rd layer: 384 kernels (3 × 3 × 256)
◮ 4th layer: 384 kernels (3 × 3 × 192)
◮ 5th layer: 256 kernels (3 × 3 × 192)
◮ followed by 2 fully connected layers, 4096 neurons each
Pre-processing
◮ Variable-resolution images → i) down-sampling; ii) rescaling to a fixed size
◮ Subtract the mean value of each pixel
Results on the test data
◮ top-1 error rate: 37.5%
◮ top-5 error rate: 17.0%
Results on ILSVRC-2012 competition
◮ top-5 error rate: 15.3%
◮ 2nd best team: 26.2%
“Understanding” the result
Interpreting a neuron: plotting the input (image) which maximally excites this neuron; 20 million images from YouTube.
“Understanding” the result, 2
Interpreting the representation: Plotting the induced topology
http://cs.stanford.edu/people/karpathy/cnnembed/
Overview
Neural Nets
◮ Main ingredients
◮ Invariances and convolutional networks
Deep Learning
Deep Learning Applications
◮ Computer vision
◮ Natural language processing
◮ Deep Reinforcement Learning
The importance of being deep, revisited
Take-home message
Natural Language Processing
Dimensionality of the vocabulary: 20K (speech), 50K (Penn Treebank), 500K (big vocabulary), 13M (Google)

From bag-of-words to latent representations: Latent Semantic Analysis
◮ Input: a matrix (documents × words)
◮ “You know a word by the company it keeps” (Firth, 1957)
◮ Dimensionality reduction
− high-dimensional, sparsity issues, scales quadratically, updates are problematic
− also, something non-additive is needed: “not bad” ≠ “not” + “bad”
NLP: which learning criterion ?
Criterion for learning
The labelling cost
◮ criteria 1 and 2 require labels (ground truth)
◮ criterion 3 can be handled in an unsupervised way (tons of data!)

Criterion for evaluation
◮ evaluate relationships
Continuous language models
Bengio et al. 2001
Principle
◮ Input: 10,000-dim Boolean vector (one dimension per word)
◮ Hidden layer: 500 continuous neurons
◮ Output: from a text window (w_i . . . w_{i+2k}), predict
⋆ the grammatical tag of the central word w_{i+k}
⋆ other targets: see next

Trained embeddings
◮ The hidden layer defines a mapping from a text window onto ℝ^500
◮ Applicable to any discrete space
Continuous language models
The window approach
◮ A fixed-size window works fine for some tasks
◮ but does not deal with long-range dependencies
The sentence approach
◮ Feed the whole sentence to the network
◮ Convolutions to handle variable-length inputs
◮ Convert the network outputs into probabilities (softmax):
p(i) = exp(f(i, x, θ)) / Σ_j exp(f(j, x, θ))
◮ Maximize the log-likelihood: find θ* = arg max_θ Σ log p(y | x, θ) (see the sketch after this slide)

Results
◮ Small improvements
◮ The 15% most frequent words of the dictionary are seen 90% of the time...
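The softmax and log-likelihood sketch announced above (NumPy; the max is subtracted for numerical stability, a detail not on the slide):

import numpy as np

def softmax(scores):
    # p(i) = exp(f(i, x, theta)) / sum_j exp(f(j, x, theta))
    z = scores - scores.max()          # stabilize: exp would overflow otherwise
    e = np.exp(z)
    return e / e.sum()

def log_likelihood(scores, y):
    # log p(y | x, theta), maximized over theta during training
    return np.log(softmax(scores)[y])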
Going unlabelled
Collobert et al. 08
Idea: a lesion study
◮ Take a sentence from Wikipedia: label true
◮ Replace the middle word with a random word: label false
◮ Tons of labelled data, zero-cost labels
◮ Captures semantics and syntax

Two window-approach (size 11) networks (100 hidden units) trained on two corpora:
⋆ LM1: Wikipedia, 631M words
⋆ LM2: Wikipedia + Reuters RCV1, 631M + 221M = 852M words

Massive dataset: cannot afford the classical training/validation scheme. Like in biology: breed a couple of network lines, with breeding decisions based on a 1M-word validation set.
LM1:
⋆ order the dictionary words by frequency
⋆ increase the dictionary size: 5,000, 10,000, 30,000, 50,000, 100,000
⋆ 4 weeks of training
LM2:
⋆ initialized with LM1; dictionary size 130,000 (30,000 additional most frequent Reuters words)
⋆ 3 additional weeks of training
Nearest neighbours in the learned embedding space (header: word and its frequency rank):

france (454)   jesus (1973)   xbox (6909)   reddish (11724)   scratched (29869)   megabits (87025)
austria        god            amiga         greenish          nailed              octets
belgium        sati           playstation   bluish            smashed             mb/s
germany        christ         msx           pinkish           punched             bit/s
italy          satan          ipod          purplish          popped              baud
greece         kali           sega          brownish          crimped             carats
sweden         indra          psNUMBER      greyish           scraped             kbit/s
norway         vishnu         hd            grayish           screwed             megahertz
europe         ananda         dreamcast     whitish           sectioned           megapixels
hungary        parvati        geforce       silvery           slashed             gbit/s
switzerland    grace          capcom        yellowish         ripped              amperes

Continuous language models, Collobert et al. 2008
Word to Vec
Mikolov et al., 2013-2014; https://code.google.com/p/word2vec/
Continuous bag of words model
◮ Input, projection layer, hidden layer (linear), output
◮ Sums the inputs from the window to predict the current word
◮ Shares the weights across the different positions
◮ Very efficient
Word to Vec, two models
Computational aspects
Tricks
◮ Undersample frequent words (the, is, ...)
◮ Linear hidden layer
◮ Negative sampling: only the output neuron that represents the positive class, plus a few randomly sampled neurons, are evaluated
◮ The output neurons are independent logistic regression classifiers
◮ → training speed independent of the vocabulary size
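A sketch of one negative-sampling update for a (center, context) word pair, with illustrative names W_in / W_out for the input and output embedding matrices (negatives drawn uniformly here; the paper uses a smoothed unigram distribution):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_step(W_in, W_out, center, context, k=5, alpha=0.025,
                      rng=np.random.default_rng(0)):
    # update only the positive output neuron and k random negative ones
    v = W_in[center]
    targets = [(context, 1.0)] + [(int(j), 0.0)
               for j in rng.integers(0, W_out.shape[0], size=k)]
    grad_v = np.zeros_like(v)
    for idx, label in targets:
        u = W_out[idx]
        g = sigmoid(v @ u) - label       # gradient of the logistic loss
        grad_v += g * u
        W_out[idx] -= alpha * g * v      # independent logistic regression per neuron
    W_in[center] -= alpha * grad_v
    return W_in, W_out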
(Slides: Tomas Mikolov, COLING 2014)
Overview
Neural Nets
◮ Main ingredients
◮ Invariances and convolutional networks
Deep Learning
Deep Learning Applications
◮ Computer vision
◮ Natural language processing
◮ Deep Reinforcement Learning
The importance of being deep, revisited
Take-home message
Deep Reinforcement Learning
Reinforcement Learning in one slide
◮ State space S
◮ Action space A
◮ Transition model p(s, a, s′) ∈ [0, 1]
◮ Reward r(s) (bounded)

Value functions and policies:
V^π(s) = r(s) + γ Σ_{s′} p(s, π(s), s′) V^π(s′)
V*(s) = max_π V^π(s)
π*(s) = argmax_{a∈A} Σ_{s′} p(s, a, s′) V*(s′)
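For a finite MDP these fixed-point equations can be solved by value iteration; a NumPy sketch with the slide's notation (P[s, a, s′] the transition model, r[s] the reward):

import numpy as np

def value_iteration(P, r, gamma=0.9, n_iter=1000):
    # P: (S, A, S) transition probabilities; r: (S,) rewards
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(n_iter):
        Q = P @ V                       # Q[s, a] = sum_s' P[s, a, s'] V[s']
        V = r + gamma * Q.max(axis=1)   # Bellman optimality update
    Q = P @ V
    return V, Q.argmax(axis=1)          # V* and the greedy policy pi*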
Playing Atari
Mnih et al. 2015
Input: 4 consecutive frames
◮ 84 × 84 (reduced, gray-scaled) pixels × 4 (the last four frames)

Architecture
◮ 1st hidden layer: 16 8 × 8 filters with stride 4, ReLU
◮ 2nd hidden layer: 32 4 × 4 filters with stride 2, ReLU
◮ last hidden layer: fully connected, 256 ReLU
◮ output layer: fully connected, one output per valid action (|A| ranges from 4 to 18 depending on the game)
◮ Decision: select the action with maximal output
Playing Atari, 2
Training: minimize the squared temporal-difference error (y − Q(s, a, θ))², with target
y = r + γ max_{a′∈A} Q(s′, a′, θ)
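A sketch of this target and loss, with the Q-network abstracted as a callable returning one value per action (illustrative, not the DeepMind code):

import numpy as np

def td_target(r, s_next, q_net, gamma=0.99, terminal=False):
    # y = r + gamma * max_a' Q(s', a', theta); terminal states bootstrap nothing
    return r if terminal else r + gamma * np.max(q_net(s_next))

def td_loss(q_net, s, a, r, s_next, terminal=False):
    # squared TD error (y - Q(s, a, theta))^2, minimized by gradient descent
    y = td_target(r, s_next, q_net, terminal=terminal)
    return (y - q_net(s)[a]) ** 2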
Playing Atari, 3
Tricks
◮ Experience replay: store the transitions {(s_t, a_t, r_t, s_{t+1})}
◮ Inner loop: minibatches of 32 uniformly drawn samples (avoids correlated updates)
◮ All positive rewards clipped to +1, all negative rewards to −1
◮ Select an action every 4 time frames and apply it for 4 time frames
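A sketch of the experience-replay trick, storing transitions and sampling uniform minibatches to decorrelate updates:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # the oldest transitions are evicted

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        # uniform sampling breaks the correlation between consecutive updates
        return random.sample(self.buffer, batch_size)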
Results
Playing Atari, 4
What is impressive
◮ Several games
◮ A single architecture
◮ The same hyper-parameters!
Overview
Neural Nets
◮ Main ingredients
◮ Invariances and convolutional networks
Deep Learning
Deep Learning Applications
◮ Computer vision
◮ Natural language processing
◮ Deep Reinforcement Learning
The importance of being deep, revisited
Take-home message
Do Deep Nets Really Need To Be Deep ?
Caruana http://research.microsoft.com/apps/video/dl.aspx?id=232373
Principle
◮ Train an ensemble of deep NNs
◮ Use the ensemble as a teacher
◮ Find (optimize) a shallow NN that approximates the teacher
TIMIT
Contents: TIMIT speech corpus; 462 speakers in the training set, 50 speakers in the validation set, 24 speakers in the test set.
Pre-processing: the raw waveform audio data were processed with a 25ms Hamming window shifting by 10ms, to extract Fourier-transform-based filter banks with 40 coefficients (plus energy) distributed on a mel-scale, together with their first and second temporal derivatives. Including +/− 7 nearby frames yields the final 1845-dimensional input vector. The input features were normalized by subtracting the mean and dividing by the standard deviation on each dimension. All 61 phoneme labels are represented in tri-state, i.e. three states per phoneme, yielding 183-dimensional target label vectors for training. At decoding time these are mapped to 39 classes as in [13] for scoring.
Results
NNs
◮ DNN: three fully connected feedforward hidden layers (2000 rectified linear units per layer)
◮ CNN: convolutional architecture
◮ Shallow neural nets with 8,000, 50,000, and 400,000 hidden units
Architecture of shallow NN
◮ A linear bottleneck followed by a non-linear layer
What matters is not the deep architecture, after all...
On the validation set
What matters is not the deep architecture, after all...
On the test set
Why does this work ?
Why does it work
◮ The labels are much more informative: training pairs (x, p), with p the log-probability of each class (before the softmax); this carries much more information than the softmax output.
◮ Data augmentation: the teacher can label anything, at no extra labelling cost.
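A sketch of the resulting student criterion, following Ba & Caruana's formulation: regress on the teacher's pre-softmax logits rather than on one-hot labels:

import numpy as np

def distillation_loss(student_logits, teacher_logits):
    # match the teacher's logits (before the softmax): they reveal, e.g.,
    # which wrong classes the teacher finds plausible, unlike a one-hot label
    return 0.5 * np.sum((student_logits - teacher_logits) ** 2)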
Not covered
Neural Turing Machines
Alex Graves
http://msrvideo.vo.msecnd.net/rmcvideos/260037/dl/260037.mp4
Not covered, 2
Morphing of representations
Leon Gatys
Not covered, 3
Spatial transformer
Jaderberg et al., 15
Overview
Neural Nets
◮ Main ingredients
◮ Invariances and convolutional networks
Deep Learning
Deep Learning Applications
◮ Computer vision
◮ Natural language processing
◮ Deep Reinforcement Learning
The importance of being deep, revisited
Take-home message
DNN as a representation builder
Features learned from large datasets (e.g. ImageNet)
◮ can be useful for many other problems
◮ as initialization for another DNN:
higher layers are more specific and can be tuned on *your* data, while reusing general features from the lower layers (e.g. edge detectors)
◮ as indices for a large database (see Locality-Sensitive Hashing)
◮ as a feature layer for e.g. SVMs
DNN as a massive computer science technology
DNN training is made possible
◮ With tons of data
◮ With specific computational platforms
The entry ticket is expensive
◮ See TensorFlow
DNN as a functional primitive
Huge models
Next frontiers
Questions
◮ Interpretation
◮ Do we still need (relational) logic?
Next applications
◮ Signal processing ?