Deep Learning: Review & Discussion
Chiyuan Zhang, CSAIL, CBMM
2015.07.22
Overview
- What has been done?
- Applications
- Main Challenges
- Empirical Analysis
- Theoretical Analysis
- What is to be done?
Deep Learning
What has been done?
Applications
- Computer Vision
- ConvNets, dominating
- Speech Recognition
- Deep Nets, Recurrent Neural Networks (RNNs), dominating, industrial deployment
- Natural Language Processing
- Matched previous state-of-the-art, but no revolutionary results yet
- Reinforcement Learning, Structured Prediction, Graphical Models, Unsupervised Learning, …
- “Unrolling” iteration as NN layer
Image Classification
- Imagenet Large Scale Visual Recognition Challenge
(ILSVRC)
http://image-net.org/challenges/LSVRC/
- Tasks
- Classification: 1000-way multiclass learning
- Detection: classify and locate (bounding box)
- State-of-the-art
- ConvNets since 2012
- Olga Russakovsky, …, Andrej Karpathy, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575 [cs.CV].
Surpassing “Human Level” Performance
- Try it yourself: http://cs.stanford.edu/people/karpathy/ilsvrc/
- For humans
- Difficult & painful task (1000 classes)
- One person trained himself with 500 images and tested on 1500 (!!) images
- ~1 minute to classify 1 image: ~25 hours…
- ~5% error, the so-called "human level" performance
- Human and machines are making different kinds of errors, for details see
http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
ConvNets on ImageNet
- e.g. Google's "Inception" (GoogLeNet), 27 layers, ~7M parameters; VGG, ~100M parameters (Table 2, arXiv:1409.1556); a parameter-count sketch follows the architecture figure below
- ImageNet challenge training set: ~1.2M images (p > N, i.e. more parameters than training examples)
- Typically takes ~1 week to train on a decent GPU node
- Models pre-trained on ImageNet turn out to be very good feature extractors or initialization models for many other vision-related tasks, even on different datasets; popular in both academia and industry (startups)
[Figure: GoogLeNet "Inception" architecture, a stack of 1x1/3x3/5x5/7x7 Conv, MaxPool, AveragePool, LocalRespNorm, and DepthConcat layers, ending in FC layers and three softmax outputs (softmax0, softmax1, softmax2).]
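As a rough sanity check on such parameter counts, the sketch below shows how convolutional and fully connected layer sizes are computed. The example layer shapes are illustrative (the FC example uses VGG-style dimensions), not the exact Inception configuration.

```python
# A rough sketch of layer parameter counting (illustrative layer shapes only;
# the FC example uses VGG-style dimensions, not the Inception configuration).

def conv_params(k, c_in, c_out, bias=True):
    """Parameters of a k x k convolution: one k*k*c_in filter per output channel."""
    return k * k * c_in * c_out + (c_out if bias else 0)

def fc_params(n_in, n_out, bias=True):
    """Parameters of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

# A 3x3 convolution with 256 input and 256 output channels:
print(conv_params(3, 256, 256))        # 590,080 parameters
# A fully connected layer from a 7x7x512 feature map to 4096 units:
print(fc_params(7 * 7 * 512, 4096))    # ~102.8M parameters: why FC-heavy nets are so large
```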
Fancier applications
Image Captioning
- Andrej Karpathy and Li Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. CVPR 2015.
- Kelvin Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015.
- Remi Lebret et al. Phrase-based Image Captioning. ICML 2015.
- …
Unrolling Iterative Algorithms as Layers of Deep Nets
- Zheng, Shuai et al. “Conditional Random Fields as Recurrent Neural Networks.” arXiv.org cs.CV (2015).
Unrolling Multiplicative NMF Iterations
Jonathan Le Roux et al. Deep NMF for Speech Separation. ICASSP 2015.
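A minimal sketch of the unrolling idea applied to NMF: run a fixed number of multiplicative updates for the activations H (Lee & Seung, Frobenius objective) and view each update as one network layer. Deep NMF unties and learns per-layer parameters; here the basis W is simply fixed, so this is only the skeleton of the idea.

```python
import numpy as np

# Unrolling multiplicative NMF updates: each iteration of H is one "layer".
# W would be untied and learned per layer in Deep NMF; here it is fixed.

rng = np.random.RandomState(0)
V = np.abs(rng.randn(64, 100))          # nonnegative data (e.g. a spectrogram)
W = np.abs(rng.randn(64, 20))           # basis (fixed in this sketch)
H = np.abs(rng.randn(20, 100))          # initial activations

eps = 1e-12
for layer in range(10):                 # 10 unrolled iterations = 10 "layers"
    H = H * (W.T @ V) / (W.T @ W @ H + eps)
    err = np.linalg.norm(V - W @ H)
    print(f"layer {layer}: reconstruction error {err:.3f}")
```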
Speech Recognition
- RNNs: handle non-fixed-length input, using context / memory for the current prediction
- Very deep neural networks when unfolded in time, hence hard to train (a minimal unrolling sketch follows below)
Image source: Li Deng and Dong Yu. Deep Learning – Methods and Applications.
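A minimal vanilla-RNN forward pass (toy sizes, random weights, assumed tanh nonlinearity) makes the "very deep when unfolded" point concrete: gradients must flow back through one matrix multiplication and nonlinearity per time step.

```python
import numpy as np

# A toy vanilla RNN: unfolding over T time steps gives a T-layer-deep
# computation that shares the same weights at every step.

rng = np.random.RandomState(0)
T, d_in, d_h = 50, 10, 32
W_xh = rng.randn(d_h, d_in) * 0.1
W_hh = rng.randn(d_h, d_h) * 0.1
b_h = np.zeros(d_h)

x = rng.randn(T, d_in)                  # one input sequence
h = np.zeros(d_h)
for t in range(T):                      # each step is one "layer" in the unfolded net
    h = np.tanh(W_xh @ x[t] + W_hh @ h + b_h)

print(h.shape)  # final hidden state; gradients must flow back through all T steps
```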
Realtime conversation translation
Reinforcement Learning & more
- Google DeepMind. Human-level control through deep reinforcement learning. Nature, Feb. 2015.
- Google DeepMind. Neural Turing Machines. arXiv 2014.
[Figure: DQN vs. the best linear learner on Atari 2600 games, scores normalized to human-level performance (from the Nature 2015 paper). DQN is at or above human level on most games (e.g. Breakout, Boxing, Video Pinball) and below human level on games such as Montezuma's Revenge, Private Eye, Gravitar, Frostbite, Asteroids, and Ms. Pac-Man.]
Deep Learning
What are the challenges?
Convergence of Optimization
- Gradients diminishing, lower layers hard to train
- ReLU, empirically faster convergence
- Gradients explode or diminish
- Clever initialization (preserve variance / scale in each layer); a minimal sketch follows this list
- Xavier and variants: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ArXiv 2015.
- Identity: Q. V. Le, N. Jaitly, G. E. Hinton. A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. ArXiv 2015.
- Memory gates: LSTM, Highway Networks (Rupesh Kumar Srivastava, Klaus Greff, Jürgen Schmidhuber. Highway Networks.
ArXiv 2015), etc.
- Batch normalization: Sergey Ioffe, Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing
Internal Covariate Shift. ICML 2015.
- Many more tricks out there…
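The variance-preserving initialization idea above can be shown in a few lines of numpy. This is only a rough sketch under assumed toy settings: a deep stack of ReLU layers initialized with a naive small constant standard deviation versus the sqrt(2 / fan_in) scaling proposed by He et al.; layer widths are arbitrary.

```python
import numpy as np

# Variance-preserving ("He") initialization for ReLU layers vs. a naive
# small-constant initialization. Toy layer widths; no training, forward pass only.

rng = np.random.RandomState(0)
x = rng.randn(1000, 512)

def forward(x, std_fn, n_layers=20, width=512):
    h = x
    for _ in range(n_layers):
        W = rng.randn(h.shape[1], width) * std_fn(h.shape[1])
        h = np.maximum(0.0, h @ W)      # ReLU
    return h

naive = forward(x, lambda fan_in: 0.01)
he    = forward(x, lambda fan_in: np.sqrt(2.0 / fan_in))
print("naive init, activation std after 20 layers:", naive.std())  # collapses toward 0
print("He init,    activation std after 20 layers:", he.std())     # stays O(1)
```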
Regularization
- “Baidu Overfitting Imagenet”: http://www.image-net.org/challenges/LSVRC/announcement-June-2-2015
- Data augmentation commonly used in
- computer vision (random translation, rotation, cropping, mirroring…); a minimal cropping/mirroring sketch follows below
- speech recognition
- e.g. Andrew Y. Ng et al. Deep Speech: Scaling up end-to-end speech recognition. ArXiv 2015. ~100,000 hours (~11 years) of augmented speech data
Overfitting problems do exist in deep learning
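A minimal sketch of the random crop + horizontal mirroring augmentation used in vision pipelines. The image size and crop size below are arbitrary, and real pipelines also add rotations, color jitter, etc.

```python
import numpy as np

# Random crop + horizontal flip, the most common image augmentation pair.

def augment(img, crop=24, rng=np.random):
    """img: H x W x C array. Returns a randomly cropped, possibly mirrored patch."""
    H, W, _ = img.shape
    top  = rng.randint(0, H - crop + 1)
    left = rng.randint(0, W - crop + 1)
    patch = img[top:top + crop, left:left + crop, :]
    if rng.rand() < 0.5:
        patch = patch[:, ::-1, :]       # horizontal flip
    return patch

img = np.random.rand(32, 32, 3)          # a fake CIFAR-sized image
print(augment(img).shape)                # (24, 24, 3)
```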
Regularization
- Dropout
- Intuition: units are forced to be robust to missing co-adapted units; acts like model averaging (a minimal sketch follows the figure below)
- Justification
- Wager, Stefan, Sida Wang, and Percy S. Liang. "Dropout Training as Adaptive Regularization." NIPS 2013.
- David McAllester. A PAC-Bayesian Tutorial
with A Dropout Bound. ArXiv 2013.
- Variations: DropConnect, DropLabel…
Overfitting problems do exist in deep learning
[Figure: dropout results on MNIST and TIMIT; figure source: http://winsty.net/talks/dropout.pptx]
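A minimal sketch of (inverted) dropout on a single activation vector: units are zeroed with probability p at training time and the survivors are rescaled, so nothing changes at test time. This is the standard formulation, shown on toy data.

```python
import numpy as np

# Inverted dropout: zero each unit with probability p during training and
# rescale survivors by 1/(1-p), so the test-time forward pass is unchanged.

def dropout(h, p=0.5, train=True, rng=np.random):
    if not train or p == 0.0:
        return h
    mask = (rng.rand(*h.shape) >= p) / (1.0 - p)
    return h * mask

h = np.random.randn(8)
print(dropout(h, p=0.5, train=True))     # roughly half the units zeroed, rest scaled by 2
print(dropout(h, train=False))           # unchanged at test time
```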
Regularization
- (Structured) sparsity comes into play
- Computer vision: ConvNets — sparse connection with weight sharing
- Speech recognition: RNNs — time index correspondence, weight sharing
- Unrolling: structure from algorithms
- Behnam Neyshabur, Ryota Tomioka, Nathan Srebro. Norm-Based Capacity Control in
Neural Networks. COLT 2015.
- Q: is the sparsity pattern learnable?
Overfitting problems do exist in deep learning
Computation
- Hashing
- e.g. K. Q. Weinberger et al. Compressing Neural Networks with the Hashing Trick. ICML 2015.
- Limited numerical precision computing with stochastic rounding
- Suyog Gupta et al. Deep Learning with Limited Numerical Precision. ICML 2015.
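The stochastic rounding scheme of Gupta et al. rounds a value down or up to the fixed-point grid at random, rounding up with probability equal to the fractional part, so the result is unbiased in expectation. A minimal sketch (the resolution and toy values below are arbitrary):

```python
import numpy as np

# Stochastic rounding to a fixed-point grid with resolution eps:
# round up with probability equal to the fractional part, so the
# rounded value is unbiased in expectation (unlike nearest rounding).

def stochastic_round(x, eps=2 ** -8, rng=np.random):
    scaled = x / eps
    floor = np.floor(scaled)
    frac = scaled - floor                    # in [0, 1)
    up = (rng.rand(*np.shape(x)) < frac)     # round up with probability frac
    return (floor + up) * eps

x = np.full(100000, 0.001)                   # below half a grid step of 2^-8
print(stochastic_round(x).mean())            # ~0.001 on average (unbiased)
print((np.round(x / 2 ** -8) * 2 ** -8).mean())  # nearest rounding collapses to 0
```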
Deep Learning
Existing Empirical Analysis
Network Visualization
- Visualizing the learned filters
- Visualizing high-response input images
- Adversarial images
- Reconstruction (what kind of information is preserved)
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.
Matthew D. Zeiler, Rob Fergus. Visualizing and Understanding Convolutional Networks. ECCV 2014.
[Figure caption: "the top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach."]
Adversarial images for a trained CNN (or any classifier)
- 1st column: original images.
- 2nd column: perturbations.
- 3rd column: perturbed images, all classified as "ostrich, Struthio camelus".
Christian Szegedy, …, Rob Fergus. Intriguing properties of neural networks. ICLR 2014.
Anh Nguyen, Jason Yosinski, Jeff Clune. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. CVPR 2015. http://www.evolvingai.org/fooling
See also supernormal stimuli for humans and animals: https://imgur.com/a/ibMUn
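The adversarial-perturbation idea can be illustrated on a toy model. The sketch below is not the method of Szegedy et al. (they solve a box-constrained L-BFGS problem on a trained ConvNet); it only shows the underlying intuition of nudging the input against the gradient of the predicted class score, here for a hypothetical logistic model with random weights.

```python
import numpy as np

# Toy illustration: a small input perturbation aligned against the gradient of
# the class score can change a classifier's prediction dramatically.

rng = np.random.RandomState(0)
w = rng.randn(100) * 0.1                     # a "trained" weight vector (hypothetical)
x = rng.randn(100)

def prob(x):                                 # P(class 1 | x) under a logistic model
    return 1.0 / (1.0 + np.exp(-w @ x))

print("original prediction:", prob(x))
# gradient of log P(class 1) w.r.t. x is (1 - p) * w, whose sign is sign(w);
# step against it with a small per-coordinate budget
x_adv = x - 0.25 * np.sign(w)
print("perturbed prediction:", prob(x_adv))
print("perturbation size (max abs):", np.max(np.abs(x_adv - x)))
```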
Reconstruction from each layer of a CNN
- Aravindh Mahendran, Andrea Vedaldi. Understanding Deep Image
Representations by Inverting Them. CVPR 2015.
- Jonathan Long, Ning Zhang, Trevor Darrell. Do Convnets Learn
Correspondence? NIPS 2014.
Learning to reconstruct (from a trained CNN)
- Alexey Dosovitskiy, Thomas Brox. Inverting Convolutional Networks with Convolutional Networks. arXiv:1506.02753, 2015.
- Learn a CNN to map from a layer representation back into the image space. Unlike auto-encoders, the existing CNN is trained discriminatively and then fixed.
- Note that spatial information is more or less preserved even in the fc layers.
Deep Nets are easy to train?
- Ian J. Goodfellow, Oriol Vinyals, Andrew M. Saxe. Qualitatively characterizing neural network optimization problems. ICLR 2015.
- Y. LeCun et al. The Loss Surfaces of Multilayer Networks. AISTATS 2015.
From the abstract: "We study the connection between the highly non-convex loss function of a simple model of the fully-connected feed-forward neural network and the Hamiltonian of the spherical spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity. These assumptions enable us to explain the complexity of the fully decoupled neural network through the prism of the results from random matrix theory. We show that for large-size decoupled networks the lowest critical values of the random loss function form a layered structure and they are located in a well-defined band lower-bounded by the global minimum. The number of local minima outside that band diminishes exponentially with the size of the network. We empirically verify that the mathematical model exhibits similar behavior as the computer simulations, despite the presence of high dependencies in real networks. We conjecture that both simulated annealing and SGD converge to the band of low critical points, and that all critical points found there are local minima of high quality measured by the test error. This emphasizes a major difference between large- and small-size networks where for the latter poor quality local minima have non-zero probability of being recovered. Finally, we prove that recovering the global minimum becomes harder as the network size increases and that it is in practice irrelevant as global minimum often leads to overfitting."
see also https://charlesmartin14.wordpress.com/2015/03/25/why-does-deep-learning-work/
Deep vs Shallow (empirical performance)
- Lei Jimmy Ba, Rich Caruana. Do Deep Nets Really Need to be Deep? NIPS 2014. Train a shallow net to mimic a deeper one, i.e. train with soft labels produced by the deeper net (a toy sketch follows below).
Does it imply that deep nets have a similar capacity to shallow nets, but are easier to train (from discriminative labels)?
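A toy sketch of the mimic-training setup, under simplifying assumptions: the "teacher" here is just an arbitrary fixed nonlinear function (not a trained deep net), and the "student" is a logistic model fit to the teacher's soft outputs by gradient descent rather than to hard 0/1 labels.

```python
import numpy as np

# Mimic training: fit a student model to a teacher's soft outputs.

rng = np.random.RandomState(0)
X = rng.randn(2000, 20)
teacher_logits = np.tanh(X @ rng.randn(20, 5)) @ rng.randn(5, 1)   # hypothetical teacher
soft = 1.0 / (1.0 + np.exp(-teacher_logits))                       # soft targets in (0, 1)

# Student: logistic regression trained by gradient descent on the cross-entropy
# against the teacher's soft targets.
w = np.zeros((20, 1))
for step in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (p - soft) / len(X)
    w -= 0.5 * grad

print("mean |student - teacher| probability gap:",
      np.mean(np.abs(1.0 / (1.0 + np.exp(-X @ w)) - soft)))
```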
Deep vs Shallow (capacity of hypothesis space)
- Olivier Delalleau and Yoshua Bengio. Shallow vs. Deep Sum-Product Networks. NIPS 2011.
- Monica Bianchini and Franco Scarselli. On the complexity of shallow and deep neural network classifiers. ESANN 2014. B(·) is the sum of Betti numbers, a topological complexity measure of a set (here the decision region {x : f(x) ≥ 0}).
Deep Nets vs Kernel Methods
- Zhiyun Lu et al. How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets. arXiv:1411.4000, 2014.
- Random Fourier features, MKL, parallel optimization… seems to require a lot of tuning, tricks, and manpower (measured by the number of authors); a random-feature sketch follows below
Results on CIFAR-10. Note ConvNets can achieve much lower error (18%) than densely connected DNNs on this dataset.
see also: Po-Sen Huang et al. Kernel Methods Match Deep Neural Networks On TIMIT. ICASSP 2014.
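For reference, the random Fourier feature construction mentioned above (originally due to Rahimi & Recht) can be sketched in a few lines; the dimensions and kernel bandwidth below are arbitrary toy values.

```python
import numpy as np

# Random Fourier features: a random cosine feature map whose inner products
# approximate the Gaussian RBF kernel k(x, y) = exp(-gamma * ||x - y||^2).

rng = np.random.RandomState(0)
d, D, gamma = 10, 4000, 0.5              # input dim, number of features, kernel bandwidth

W = rng.randn(D, d) * np.sqrt(2 * gamma)  # frequencies ~ N(0, 2*gamma*I)
b = rng.uniform(0, 2 * np.pi, D)

def phi(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x = rng.randn(d)
y = x + 0.3 * rng.randn(d)
exact = np.exp(-gamma * np.sum((x - y) ** 2))   # exact RBF kernel value
approx = phi(x) @ phi(y)                        # random-feature approximation
print(exact, approx)                            # close for large D
```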
Deep Learning
Existing Theoretical Analysis
Provable learning of random sparse networks
- Sanjeev Arora, Aditya Bhaskara, Rong Ge, Tengyu Ma. Provable Bounds for Learning Some Deep Representations. ICML 2014.
- In practice: layer-wise pre-training used to be popular, but was gradually abandoned as larger amounts of training data became available.
Learning 2 or 3-layer Nets with Quadratic Nonlinearity
- Roi Livni, Shai Shalev-Shwartz, Ohad Shamir. On the Computational Efficiency of Training Neural Networks. NIPS 2014.
t: depth; n: # nodes; L: weight size constraint; 𝝉²: the square activation function.
Learning Network with 1 Hidden Layer
- Francis Bach. Breaking the Curse of Dimensionality with Convex Neural Networks. arXiv:1412.8690, 2014.
- Generalization bounds (approximation and estimation errors)
- Formulated as learning from continuously infinitely many basis functions
Learning Boolean Networks
- Dustin G. Mixon, Jesse Peterson. Learning Boolean functions with concentrated spectra. ArXiv:
1507.04319, 2015.
- Learning a 1-hidden-layer Boolean network with a highly concentrated Fourier transform.
- See also https://dustingmixon.wordpress.com/2015/07/17/a-relaxation-of-deep-learning/
Deep Learning
Open Problems (?)
Depth?
- Do deeper networks define a richer function space than shallow networks?
- Both empirical & theoretical analyses exist (see previous sections)
- Are deeper networks easier to learn than shallow networks?
- Statistically
- Computationally
- Is there a trade-off between depth and other factors (width, sample size, ease of optimization)?
- Empirically, people have started to explore training networks with hundreds of layers, although the current state-of-the-art networks typically have 10~30 layers.
Structure?
- Structure (sparse connections) seems to be a very important factor in many successful networks
- If structure is unknown, can we learn it?
- e.g. given samples from a statistical model defined with a sparsely
connected deep network, can we estimate the sparsity pattern and parameter values?
- Learning unknown invariances
Regularization? Dropout?
- Can dropout help to discover underlying sparsity structure?
- Other regularization technique?
SGD?
- Why does SGD work on non-convex objective functions?
- cf. existing empirical & theoretical analyses of loss surfaces in deep learning (previous sections)
- Is there an alternative algorithm that
- has theoretical guarantee / justification?
- has some nice properties? e.g.
- Easier to parallelize (SGD is sequential between mini-batches)
- Biologically plausible (neuroscientists would like it)
Rectified Linear Unit (ReLU)
- ReLU is usually found to be better than sigmoid activation functions (converges faster, and to a better solution); a small illustration follows below
- Intuitions exist, but is there rigorous justification?
- Can we characterize the properties of a "nice" activation function?
- Other possible “good” activation functions?
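One frequently cited intuition (not a rigorous justification) is that the sigmoid derivative is bounded by 0.25 and saturates for large inputs, while the ReLU derivative is exactly 1 on the active half, so backpropagated gradients shrink far less. A few lines make this concrete:

```python
import numpy as np

# Sigmoid derivative vs. ReLU derivative: saturation and gradient shrinkage.

x = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
sigmoid = 1.0 / (1.0 + np.exp(-x))
print("sigmoid'(x):", sigmoid * (1 - sigmoid))   # <= 0.25, ~0.002 at |x| = 6
print("relu'(x):   ", (x > 0).astype(float))     # exactly 0 or 1

# Product of 20 per-layer derivative factors of at most 0.25 vs. exactly 1.0:
print(0.25 ** 20, 1.0 ** 20)                     # ~9e-13 vs 1
```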
Local minima or equivalent solutions
- Empirically, different random initializations lead to different solutions, but they are almost equally good as measured by classification performance
- A huge number of equivalent solutions exist
- For ReLU, rescaling two adjacent layers reciprocally does not change the final output (a tiny check follows below)
- Permuting filter indices consistently in the network does not change the final output
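A tiny numpy check of the rescaling symmetry mentioned above, for a hypothetical 2-layer ReLU network with random weights: multiplying the first layer by alpha and dividing the second by alpha leaves the function unchanged, by positive homogeneity of ReLU.

```python
import numpy as np

# Rescaling symmetry of ReLU networks: f(x) is unchanged when the first layer
# is scaled by alpha > 0 and the second by 1/alpha.

rng = np.random.RandomState(0)
W1, W2 = rng.randn(32, 10), rng.randn(5, 32)
x = rng.randn(10)
alpha = 3.7

f      = W2 @ np.maximum(0, W1 @ x)
f_resc = (W2 / alpha) @ np.maximum(0, (alpha * W1) @ x)
print(np.max(np.abs(f - f_resc)))        # ~1e-15: the two parameterizations are equivalent
```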
Other problems?
- Unsupervised learning
- Structured prediction
- Weakly supervised learning
- …