

SLIDE 1

Deep Learning: Review & Discussion

Chiyuan Zhang, CSAIL, CBMM 2015.07.22

SLIDE 2

Overview

  • What has been done?
  • Applications
  • Main Challenges
  • Empirical Analysis
  • Theoretical Analysis
  • What is to be done?
SLIDE 3

Deep Learning

What has been done?

SLIDE 4

Applications

  • Computer Vision
      • ConvNets, dominating
  • Speech Recognition
      • Deep nets and Recurrent Neural Networks (RNNs), dominating; industrial deployment
  • Natural Language Processing
      • Matched previous state-of-the-art, but no revolutionary results yet
  • Reinforcement Learning, Structured Prediction, Graphical Models, Unsupervised Learning, …
  • “Unrolling” an iteration as an NN layer
SLIDE 5

Image Classification

  • ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
    http://image-net.org/challenges/LSVRC/
  • Tasks
      • Classification: 1000-way multiclass learning
      • Detection: classify and locate (bounding box)
  • State-of-the-art
      • ConvNets since 2012
      • Olga Russakovsky, …, Andrej Karpathy, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575 [cs.CV].

SLIDE 6

SLIDE 7

Surpassing “Human Level” Performance

  • Try it yourself: http://cs.stanford.edu/people/karpathy/ilsvrc/
  • For humans, a difficult and painful task (1000 classes)
      • One person trained himself on 500 images and tested on 1500 (!!) images
      • ~1 minute to classify 1 image: ~25 hours…
      • ~5% error, the so-called “human-level” performance
  • Humans and machines make different kinds of errors; for details see
    http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/

SLIDE 8

ConvNets on ImageNet

  • e.g. Google’s “Inception”: 27 layers, ~7M parameters; VGG: ~100M parameters (Table 2, arXiv:1409.1556)
  • ImageNet challenge training set: ~1.2M images (p > N, more parameters than training examples)
  • Typically takes ~1 week to train on a decent GPU node
  • Models pre-trained on ImageNet turn out to be very good feature extractors or initialization models for many other vision-related tasks, even on different datasets; popular in both academia and industry (startups)

[Figure: the GoogLeNet “Inception” architecture: a deep stack of Conv/MaxPool/LocalRespNorm layers and DepthConcat inception modules, with two auxiliary softmax classifiers plus the main softmax output.]

SLIDE 9

Fancier applications

Image Captioning

  • Andrej Karpathy and Li Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. CVPR 2015.
  • Kelvin Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015.
  • Remi Lebret et al. Phrase-based Image Captioning. ICML 2015.
  • …

SLIDE 10

Zheng, Shuai et al. “Conditional Random Fields as Recurrent Neural Networks.” arXiv.org cs.CV (2015).

SLIDE 11

Unrolling Iterative Algorithms as Layers of Deep Nets

  • Zheng, Shuai et al. “Conditional Random Fields as Recurrent Neural Networks.” arXiv.org cs.CV (2015).
SLIDE 12

Unrolling Multiplicative NMF Iterations

Jonathan Le Roux et al. Deep NMF for Speech Separation. ICASSP 2015.
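To make the unrolling idea concrete, here is a minimal numpy sketch (not the paper's code) of unrolling the standard multiplicative NMF update for H in V ≈ WH as a fixed number of "layers"; untying the dictionary across layers, as Deep NMF does, would turn each W into a trainable parameter. All names are illustrative.

```python
import numpy as np

def unrolled_nmf(V, W_layers, eps=1e-12):
    """Unroll multiplicative NMF updates for H (V ≈ W H, all nonnegative).

    Each Lee-Seung update H <- H * (W^T V) / (W^T W H) becomes one "layer".
    Passing a different W per layer (untied weights) gives the trainable
    deep-unfolding architecture in the spirit of Deep NMF.
    """
    k, n = W_layers[0].shape[1], V.shape[1]
    H = np.full((k, n), 1.0 / k)                 # nonnegative initialization
    for W in W_layers:                           # one multiplicative update per layer
        H = H * (W.T @ V) / (W.T @ W @ H + eps)
    return H

# toy usage: 5 unrolled layers sharing one random nonnegative dictionary
rng = np.random.default_rng(0)
W = np.abs(rng.normal(size=(20, 4)))
V = np.abs(rng.normal(size=(20, 30)))
H = unrolled_nmf(V, [W] * 5)
print(H.shape)                                   # (4, 30), entries stay nonnegative
```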

SLIDE 13

Speech Recognition

  • RNNs: non-fixed-length input, using context / memory for the current prediction
  • Very deep neural network when unfolded in time, hard to train

Image source: Li Deng and Dong Yu. Deep Learning – Methods and Applications.

Real-time conversation translation
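To see why an RNN is "very deep when unfolded in time", here is a minimal vanilla-RNN forward pass in numpy (a sketch with illustrative names; real systems add gating such as LSTM precisely to ease training over long unfoldings):

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Unfold a vanilla RNN over time: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).

    A T-step input behaves like a T-layer network with tied weights, which
    is why gradients through long sequences tend to vanish or explode.
    """
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x in xs:                                  # one "layer" per time step
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        hs.append(h)
    return hs

rng = np.random.default_rng(0)
T, d_in, d_h = 50, 8, 16                          # 50 steps ≈ a 50-layer net
xs = rng.normal(size=(T, d_in))
hs = rnn_forward(xs, 0.1 * rng.normal(size=(d_h, d_in)),
                 0.1 * rng.normal(size=(d_h, d_h)), np.zeros(d_h))
print(len(hs), hs[-1].shape)                      # 50 (16,)
```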

SLIDE 14

Reinforcement Learning & more

  • Google DeepMind. Human-level control through deep reinforcement learning. Nature, Feb. 2015.
  • Google DeepMind. Neural Turing Machines. ArXiv 2014.

[Figure: DQN vs. best linear learner across 49 Atari games, normalized to professional human performance; DQN is at or above human level on games such as Video Pinball, Boxing, and Breakout, and far below on games such as Montezuma's Revenge, Private Eye, and Gravitar.]

SLIDE 15

Deep Learning

What are the challenges?

SLIDE 16

Convergence of Optimization

  • Gradients diminish; lower layers are hard to train
      • ReLU: empirically faster convergence
  • Gradients explode or diminish
      • Clever initialization (preserve variance / scale in each layer)
          • Xavier and variants: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ArXiv 2015.
          • Identity: Q. V. Le, N. Jaitly, G. E. Hinton. A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. ArXiv 2015.
      • Memory gates: LSTM, Highway Networks (Rupesh Kumar Srivastava, Klaus Greff, Jürgen Schmidhuber. Highway Networks. ArXiv 2015), etc.
      • Batch normalization: Sergey Ioffe, Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015.
  • Many more tricks out there…
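As a concrete instance of "preserve variance / scale in each layer", here is a minimal numpy sketch of the He et al. initialization for ReLU layers (Var[w] = 2/fan_in); the depth and widths are arbitrary, for illustration only:

```python
import numpy as np

def he_init(fan_in, fan_out, rng):
    """He et al. initialization for ReLU layers: Var[w] = 2 / fan_in,
    chosen so that the variance of ReLU activations is preserved from
    layer to layer instead of shrinking or blowing up."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

rng = np.random.default_rng(0)
x = rng.normal(size=512)
for _ in range(30):                               # push a signal through 30 ReLU layers
    x = np.maximum(he_init(512, 512, rng) @ x, 0.0)
print(round(float(x.std()), 2))                   # stays O(1) rather than collapsing
```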
SLIDE 17

Regularization

  • “Baidu Overfitting ImageNet”: http://www.image-net.org/challenges/LSVRC/announcement-June-2-2015
  • Data augmentation commonly used in
      • computer vision (random translation, rotation, cropping, mirroring…)
      • speech recognition
          • e.g. Andrew Y. Ng et al. Deep Speech: Scaling up end-to-end speech recognition. ArXiv 2015. 100,000 hours (~11 years) of augmented speech data

Overfitting problems do exist in deep learning

SLIDE 18

Regularization

  • Dropout
      • Intuition: forced to be robust; model averaging
  • Justification
      • Stefan Wager, Sida Wang, and Percy Liang. “Dropout Training as Adaptive Regularization.” NIPS 2013.
      • David McAllester. A PAC-Bayesian Tutorial with A Dropout Bound. ArXiv 2013.
  • Variations: DropConnect, DropLabel…

Overfitting problems do exist in deep learning

[Figure: dropout results on MNIST and TIMIT; source: http://winsty.net/talks/dropout.pptx]
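A minimal sketch of (inverted) dropout, the variant most common in practice; names are illustrative, not from the cited papers:

```python
import numpy as np

def dropout_forward(h, p_drop, rng, train=True):
    """Inverted dropout: at training time, zero each unit with probability
    p_drop and rescale the survivors by 1/(1 - p_drop), so the expected
    activation is unchanged and test time needs no correction."""
    if not train or p_drop == 0.0:
        return h
    mask = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask

rng = np.random.default_rng(0)
h = np.ones(100_000)
print(round(float(dropout_forward(h, 0.5, rng).mean()), 2))  # ≈ 1.0 in expectation
```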

SLIDE 19

Regularization

  • (Structured) sparsity comes into play
  • Computer vision: ConvNets — sparse connection with weight sharing
  • Speech recognition: RNNs — time index correspondence, weight sharing
  • Unrolling: structure from algorithms
  • Behnam Neyshabur, Ryota Tomioka, Nathan Srebro. Norm-Based Capacity Control in Neural Networks. COLT 2015.

  • Q: is the sparsity pattern learnable?

Overfitting problems do exist in deep learning
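To make "sparse connection with weight sharing" concrete: a 1-D convolution is just a fully connected layer whose weight matrix is constrained to be banded (sparse) with the same few weights repeated on every row (shared). A small numpy sketch:

```python
import numpy as np

def conv1d_as_matrix(w, n):
    """Build the matrix of a valid 1-D convolution with kernel w over an
    n-dimensional input: each row reuses the same k weights (sharing)
    and is zero everywhere else (sparsity)."""
    k = len(w)
    M = np.zeros((n - k + 1, n))
    for i in range(n - k + 1):
        M[i, i:i + k] = w
    return M

w = np.array([1.0, -2.0, 1.0])                    # a 3-tap kernel
x = np.arange(8.0)
M = conv1d_as_matrix(w, len(x))
assert np.allclose(M @ x, np.convolve(x, w[::-1], mode="valid"))
print(M)                                          # banded Toeplitz structure
```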

SLIDE 20

Computation

  • Hashing
      • e.g. K. Q. Weinberger et al. Compressing Neural Networks with the Hashing Trick. ICML 2015.
  • Limited numerical precision computing with stochastic rounding
      • Suyog Gupta et al. Deep Learning with Limited Numerical Precision. ICML 2015.
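The key trick in Gupta et al. is stochastic rounding: round to the low-precision grid randomly, in proportion to the remainder, so the rounding is unbiased and small gradient contributions are not systematically lost. A toy numpy sketch (the grid spacing is illustrative):

```python
import numpy as np

def stochastic_round(x, step, rng):
    """Round x to a grid of spacing `step`, going up with probability equal
    to the fractional remainder, so E[stochastic_round(x)] = x exactly."""
    scaled = np.asarray(x) / step
    lo = np.floor(scaled)
    return (lo + (rng.random(np.shape(scaled)) < (scaled - lo))) * step

rng = np.random.default_rng(0)
samples = stochastic_round(np.full(100_000, 0.30), step=0.25, rng=rng)
print(round(float(samples.mean()), 3))  # ≈ 0.30, although the grid is {0.25, 0.5}
```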

SLIDE 21

Deep Learning

Existing Empirical Analysis

SLIDE 22

Network Visualization

  • Visualizing the learned filters
  • Visualizing high-response input images
  • Adversarial images
  • Reconstruction (what kind of information is preserved)

Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.

SLIDE 23

  • Matthew D. Zeiler, Rob Fergus. Visualizing and Understanding Convolutional Networks. ECCV 2014.

“The top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach.”

SLIDE 24

Adversarial images for a trained CNN (or any classifier)

  • 1st column: original images.
  • 2nd column: perturbations.
  • 3rd column: perturbed images, all classified as “ostrich, Struthio camelus”.

Christian Szegedy, …, Rob Fergus. Intriguing properties of neural networks. ICLR 2014.
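The general recipe behind such adversarial examples is to follow the loss gradient with respect to the input rather than the weights. Szegedy et al. use box-constrained L-BFGS on a trained CNN; the toy sketch below applies the same idea to a random linear softmax classifier (everything here is illustrative, not the paper's setup):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def adversarial_step(x, target, W, b, eps):
    """Take one gradient step on the INPUT to push a linear softmax
    classifier toward class `target` (cross-entropy gradient w.r.t. x)."""
    p = softmax(W @ x + b)
    grad_x = W.T @ (p - np.eye(len(b))[target])
    return x - eps * grad_x

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 10)), np.zeros(3)
x = rng.normal(size=10)
print("before:", int(np.argmax(W @ x + b)))
for _ in range(50):                               # small accumulated perturbation
    x = adversarial_step(x, target=2, W=W, b=b, eps=0.02)
print("after: ", int(np.argmax(W @ x + b)))       # prediction pushed toward class 2
```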

SLIDE 25

Anh Nguyen, Jason Yosinski, Jeff Clune. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. CVPR 2015. http://www.evolvingai.org/fooling. See also supernormal stimuli for humans and animals: https://imgur.com/a/ibMUn

SLIDE 26

Reconstruction from each layer of a CNN

  • Aravindh Mahendran, Andrea Vedaldi. Understanding Deep Image Representations by Inverting Them. CVPR 2015.
  • Jonathan Long, Ning Zhang, Trevor Darrell. Do Convnets Learn Correspondence? NIPS 2014.
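Mahendran and Vedaldi reconstruct an image by gradient descent on the input so that its representation matches a target representation (plus natural-image priors, omitted here). A toy numpy sketch with a fixed random ReLU layer standing in for a trained CNN; all names are illustrative:

```python
import numpy as np

def invert_representation(phi_target, W, steps=2000, lr=0.1):
    """Find x with relu(W x) ≈ phi_target by gradient descent on x.
    W plays the role of the fixed, discriminatively trained network."""
    x = np.zeros(W.shape[1])
    for _ in range(steps):
        pre = W @ x
        phi = np.maximum(pre, 0.0)
        x -= lr * (W.T @ ((phi - phi_target) * (pre > 0)))
    return x

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 16)) / 4.0               # overcomplete random "feature" layer
x_true = rng.normal(size=16)
x_rec = invert_representation(np.maximum(W @ x_true, 0.0), W)
print(round(float(np.linalg.norm(x_rec - x_true)), 3))   # small reconstruction error
```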

SLIDE 27

Learning to reconstruct (from a trained CNN)

  • Alexey Dosovitskiy, Thomas Brox. Inverting Convolutional Networks with Convolutional Networks. ArXiv: 1506.02753, 2015.
  • Learn a CNN to map from a layer representation back into image space. Unlike auto-encoders, the existing CNN is trained discriminatively, and fixed.
  • Note that spatial information is to some extent preserved even in fc layers.

SLIDE 28

Deep Nets are easy to train?

  • Ian J. Goodfellow, Oriol Vinyals, Andrew M. Saxe. Qualitatively characterizing neural network optimization problems. ICLR 2015.
SLIDE 29

  • Y. LeCun et al. The Loss Surfaces of Multilayer Networks. AISTATS 2015.

“We study the connection between the highly non-convex loss function of a simple model of the fully-connected feed-forward neural network and the Hamiltonian of the spherical spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity. These assumptions enable us to explain the complexity of the fully decoupled neural network through the prism of the results from random matrix theory. We show that for large-size decoupled networks the lowest critical values of the random loss function form a layered structure and they are located in a well-defined band lower-bounded by the global minimum. The number of local minima outside that band diminishes exponentially with the size of the network. We empirically verify that the mathematical model exhibits similar behavior as the computer simulations, despite the presence of high dependencies in real networks. We conjecture that both simulated annealing and SGD converge to the band of low critical points, and that all critical points found there are local minima of high quality measured by the test error. This emphasizes a major difference between large- and small-size networks where for the latter poor quality local minima have non-zero probability of being recovered. Finally, we prove that recovering the global minimum becomes harder as the network size increases and that it is in practice irrelevant as global minimum often leads to overfitting.”

see also https://charlesmartin14.wordpress.com/2015/03/25/why-does-deep-learning-work/

SLIDE 30

Deep vs Shallow (empirical performance)

  • Lei Jimmy Ba, Rich Caruana. Do Deep Nets Really Need to be Deep? NIPS 2014. Train a shallow net to mimic a deeper one, i.e. train with soft labels produced by the deeper net.

Does this imply that deep nets have similar capacity to shallow nets, but are easier to train (from discriminative labels)?
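A toy sketch of the mimic idea: regress a simple "student" onto the teacher's real-valued outputs instead of hard labels. Ba and Caruana fit a wide one-hidden-layer student to the teacher's logits with L2 loss; a linear student keeps the sketch tiny, and all names are illustrative:

```python
import numpy as np

def train_mimic(X, teacher_out, lr=0.05, steps=3000):
    """Fit a linear student to the teacher's soft outputs with L2 loss."""
    W = np.zeros((X.shape[1], teacher_out.shape[1]))
    for _ in range(steps):
        W -= lr * (X.T @ (X @ W - teacher_out)) / len(X)
    return W

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
teacher_out = np.tanh(X @ rng.normal(size=(20, 5)))   # stand-in "deep" teacher
W = train_mimic(X, teacher_out)
agree = (np.argmax(X @ W, 1) == np.argmax(teacher_out, 1)).mean()
print(round(float(agree), 2))    # the student matches most teacher decisions
```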

SLIDE 31

Deep vs Shallow (capacity of hypothesis space)

  • Olivier Delalleau and Yoshua Bengio. Shallow vs. Deep Sum-Product Networks. NIPS 2011.
  • Monica Bianchini and Franco Scarselli. On the complexity of shallow and deep neural network classifiers. ESANN 2014. B(·) is the sum of Betti numbers, a topological complexity measure of a set (here the set {f ≥ 0}).

SLIDE 32

Deep Nets vs Kernel Methods

  • Zhiyun Lu et al. How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets. ArXiv: 1411.4000, 2014.
  • Random Fourier features, MKL, parallel optimization… seems to require a lot of tuning, tricks, and man-power (measured by the number of authors)

Results on CIFAR-10. Note that ConvNets can achieve much lower error (18%) than densely connected DNNs on this dataset.

see also: Po-Sen Huang et al. Kernel Methods Match Deep Neural Networks On TIMIT. ICASSP 2014.

SLIDE 33

Deep Learning

Existing Theoretical Analysis

SLIDE 34

Provable learning of random sparse networks

  • Sanjeev Arora, Aditya Bhaskara, Rong Ge, Tengyu Ma. Provable Bounds for Learning Some Deep Representations. ICML 2014.
  • In practice: layer-wise pre-training used to be popular, but was gradually abandoned as larger amounts of training data became available.

SLIDE 35

Learning 2 or 3-layer Nets with Quadratic Nonlinearity

  • Roi Livni, Shai Shalev-Shwartz, Ohad Shamir. On the Computational Efficiency of Training Neural Networks. NIPS 2014.

t: depth; n: # nodes; L: weight size constraint; τ²: the square activation function.

SLIDE 36

Learning Network with 1 Hidden Layer

  • Francis Bach. Breaking the Curse of Dimensionality with Convex Neural Networks. ArXiv: 1412.8690, 2014.
  • Generalization bounds (approximation and estimation errors)
  • Formulated as learning from a continuum of infinitely many basis functions
SLIDE 37

Learning Boolean Networks

  • Dustin G. Mixon, Jesse Peterson. Learning Boolean functions with concentrated spectra. ArXiv: 1507.04319, 2015.
  • Learning a 1-hidden-layer Boolean network with a highly concentrated Fourier transform.
  • See also https://dustingmixon.wordpress.com/2015/07/17/a-relaxation-of-deep-learning/
SLIDE 38

Deep Learning

Open Problems (?)

SLIDE 39

Depth?

  • Are deeper networks a richer function space than shallow networks?
      • Both empirical & theoretical analyses exist (see previous sections)
  • Are deeper networks easier to learn than shallow networks?
      • Statistically
      • Computationally
  • Is there a trade-off between depth and blah blah?
  • Empirically, people have started to explore training networks with hundreds of layers, although the current state-of-the-art networks are typically 10~30 layers.
SLIDE 40

Structure?

  • It seems structure (sparse connections) is a very important factor in many successful networks
  • If the structure is unknown, can we learn it?
      • e.g. given samples from a statistical model defined by a sparsely connected deep network, can we estimate the sparsity pattern and the parameter values?
  • Learning unknown invariances
SLIDE 41

Regularization? Dropout?

  • Can dropout help to discover underlying sparsity structure?
  • Other regularization techniques?
SLIDE 42

SGD?

  • Why does SGD work on non-convex objective functions?
      • c.f. the existing empirical & theoretical analyses of objective-function surfaces in deep learning referenced above.
  • Is there an alternative algorithm that
      • has theoretical guarantees / justification?
      • has some nice properties? e.g.
          • easier to parallelize (SGD is sequential between mini-batches)
          • biologically plausible (neuroscientists would like it)
SLIDE 43

Rectified Linear Unit (ReLU)

  • ReLU is usually found to be better than sigmoid activation functions (converges faster, and to better solutions)
  • Intuitions exist, but is there a rigorous justification?
  • Can we characterize the properties of a “nice” activation function?
  • Other possible “good” activation functions?
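One common intuition, made concrete below: the sigmoid's derivative is at most 1/4, so products of derivatives along a backprop path shrink geometrically with depth, whereas ReLU passes a derivative of exactly 1 wherever it is active. A toy numpy sketch (weights ignored, so this shows only the activation part of the story, not a rigorous justification):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
d_sigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))  # at most 0.25 everywhere
d_relu = lambda z: (z > 0).astype(float)               # exactly 1 when active

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 1000))          # pre-activations along 8-layer backprop paths
print(np.prod(d_sigmoid(z), axis=0).max())   # ≤ 0.25**8 ≈ 1.5e-5: shrinks fast
print(np.prod(d_relu(z), axis=0).max())      # 1.0 on paths whose units all stay active
```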
SLIDE 44

Local Minima or Equivalent Solutions

  • Empirically, different random initializations lead to different solutions, but ones almost equally good as measured by classification performance
  • A huge number of equivalent solutions exist
      • For ReLU, rescaling two adjacent layers reciprocally does not change the final output
      • Permuting filter indices consistently throughout the network does not change the final output
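Both invariances are easy to verify numerically; a small sketch for a two-layer ReLU net (shapes arbitrary, for illustration):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(16, 8)), rng.normal(size=(4, 16))
x = rng.normal(size=8)
f = lambda A, B: B @ relu(A @ x)

# Scaling: relu(c z) = c relu(z) for c > 0, so (c W1, W2 / c) is equivalent
c = 3.7
print(np.allclose(f(W1, W2), f(c * W1, W2 / c)))          # True

# Permutation: reorder hidden units and the next layer's columns together
perm = rng.permutation(16)
print(np.allclose(f(W1, W2), f(W1[perm], W2[:, perm])))   # True
```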
SLIDE 45

Other problems?

  • Unsupervised learning
  • Structured prediction
  • Weakly supervised learning