

SLIDE 1

a shallow survey of deep learning

Applications, Models, Algorithms and Theory (?)
Chiyuan Zhang, April 22, 2015

CSAIL, Poggio Lab

SLIDE 2

the state of the art

SLIDE 3

the state of the art

Deep learning is achieving state-of-the-art performance in many standard prediction tasks.
∙ Computer vision: image classification, image summarization, etc.
∙ Speech recognition: acoustic modeling, language modeling, etc.
∙ Natural language processing: language modeling, word embedding, etc.
∙ Reinforcement learning: non-linear Q-function, etc.
∙ ...

SLIDE 4

the state of the art

Imagenet Large Scale Visual Recognition Challenge (ILSVRC)¹

1000 categories, 1.2 million images, classification, localization and detection.

¹ http://image-net.org/challenges/LSVRC/

SLIDE 5

the state of the art

[Chart: ILSVRC top-5 error by year, 2010–2014 plus arXiv 2015 results, falling from ∼30% to below the “Human” line.]

LSVRC on Imagenet: deep learning entries since 2012², surpassing “human performance” this year.

² Olga Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575 [cs.CV].

SLIDE 6

the state of the art

Human-Level Performance
∙ Difficult for humans because of fine-grained visual categories.
∙ Humans and computers make different kinds of errors³.
∙ Try it out: http://cs.stanford.edu/people/karpathy/ilsvrc/

³ http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/

SLIDE 7

the state of the art

[Figure: the GoogLeNet “Inception” architecture — a deep stack of Conv (1x1, 3x3, 5x5, 7x7), MaxPool/AveragePool, LocalRespNorm, and DepthConcat modules, with three softmax outputs (softmax0, softmax1, softmax2).]

The “Inception” model (GoogLeNet)⁴ that won ILSVRC 2014: 27 layers, ∼7 million parameters.

⁴ Christian Szegedy et al. Going Deeper with Convolutions. arXiv:1409.4842 [cs.CV].

SLIDE 8

the state of the art

(Deep) neural networks can be easily fooled⁵,⁶.

⁵ Rob Fergus et al. Intriguing properties of neural networks. arXiv:1312.6199 [cs.CV]
⁶ Ian J. Goodfellow et al. Explaining and Harnessing Adversarial Examples. arXiv:1412.6572 [stat.ML]

SLIDE 9

the state of the art

image source: Li Deng and Dong Yu. Deep Learning – Methods and Applications.

Deep learning in speech recognition. Deployed in production.

e.g. Baidu Deep Speech⁷, an RNN trained on 100,000 hours of data (100,000/365/5 ≈ 55, i.e. roughly 55 years of listening to speech five hours a day).

⁷ Andrew Ng et al. Deep Speech: Scaling up end-to-end speech recognition. arXiv:1412.5567 [cs.CL]

SLIDE 10

the state of the art

Image captioning: parse image, generate text (CNN + RNN).

SLIDE 11

the state of the art

Image captioning (not quite there yet)
∙ Andrej Karpathy and Li Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. http://cs.stanford.edu/people/karpathy/deepimagesent/
∙ Demo page: http://cs.stanford.edu/people/karpathy/deepimagesent/generationdemo/
∙ Toronto group’s image-to-text demo: http://www.cs.toronto.edu/~nitish/nips2014demo/index.html

SLIDE 12

the state of the art

Reinforcement learning to play video games⁸

⁸ Google DeepMind. Human-level control through deep reinforcement learning. Nature, Feb. 2015.

SLIDE 13

the state of the art

Attracted a lot of media attention, because learning to play video games sounds much closer to AI.

[Chart: per-game scores of DQN vs. the best linear learner, normalized to human performance; the games range from Montezuma's Revenge, Private Eye, and Gravitar (below human level) up to Breakout, Boxing, and Video Pinball (at human level or above), with DQN at or above human level on the majority of the games.]

SLIDE 14

models

SLIDE 15

basic setting

∙ Unsupervised Learning: Given signals x1, . . . , xN ∈ X (e.g. a collection of images), learn something about px. Typically, the goal is to find some representation that captures the main properties of the data distribution and that is compact, or sparse, or “high-level”, etc.
  ∙ Clustering: K-means clustering, Gaussian Mixture Models, etc.
  ∙ Dictionary Learning: Principal Component Analysis (Singular Value Decomposition), Sparse Coding, etc.
∙ Supervised Learning: Given i.i.d. samples {(x1, y1), . . . , (xN, yN)} (e.g. image-category pairs) of px,y, learn something about py|x. The typical goal is to approximate the regression function ρ(x) = E[y|x], or to approximate py|x itself. Note px,y is unknown.

SLIDE 16

deep unsupervised learning

[Diagram: auto-encoder with input layer x1…x6, hidden layer h1…h3, and reconstruction layer x̃1…x̃6.]

Auto-encoder
Minimize the reconstruction error ∥x − x̃∥², where
h = σ(Wx), x̃ = σ(W⊤h)
Here W is the parameter to learn. When the nonlinearity σ(·) is the identity, this reduces to PCA. After training, the hidden layer contains a compact, latent representation of the input.
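A minimal numpy sketch of the tied-weight auto-encoder above (the sigmoid nonlinearity and the 6-input / 3-hidden sizes from the diagram are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 6))  # the single parameter matrix (tied weights)

def autoencoder_loss(W, x):
    h = sigmoid(W @ x)                  # encode: h = sigma(W x)
    x_tilde = sigmoid(W.T @ h)          # decode with tied weights: x~ = sigma(W^T h)
    return np.sum((x - x_tilde) ** 2)   # reconstruction error ||x - x~||^2

x = rng.random(6)
print(autoencoder_loss(W, x))
```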

SLIDE 17

deep unsupervised learning

Denoising Auto-encoder
Reconstruct a noise-corrupted input. Minimize ∥x − x̃∥², where
x̃ = σ(W⊤h), h = σ(W(x + ε))
ε can be white noise or domain-specific noise.

Stacked Auto-encoder (deep models)
Layer-wise training: let h(0) = x; for l = 1, . . . , L:
∙ Train an auto-encoder by W∗ = arg min_W ∥h(l−1) − σ(W⊤σ(Wh(l−1)))∥²
∙ Let h(l) = σ(W∗h(l−1))
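A toy sketch of this layer-wise scheme, with the inner arg min approximated by crude finite-difference gradient descent (a stand-in for real back-propagation; the noise level, sizes, learning rate, and step counts are all made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def recon_loss(W, v_in, v_target):
    h = sigmoid(W @ v_in)                     # h = sigma(W (x + eps))
    return np.sum((v_target - sigmoid(W.T @ h)) ** 2)

def train_autoencoder(data, n_hidden, lr=0.5, steps=100, fd=1e-5, seed=0):
    """Crude denoising auto-encoder training: minimize ||x - x~||^2 by
    finite-difference gradient descent on the tied weight matrix W."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n_hidden, data.shape[1]))
    for _ in range(steps):
        v = data[rng.integers(len(data))]
        v_noisy = v + rng.normal(scale=0.1, size=v.shape)   # corrupt the input
        grad = np.zeros_like(W)
        for idx in np.ndindex(*W.shape):                    # numeric gradient
            Wp = W.copy()
            Wp[idx] += fd
            grad[idx] = (recon_loss(Wp, v_noisy, v) - recon_loss(W, v_noisy, v)) / fd
        W -= lr * grad
    return W

# Greedy stacking: h(0) = x; train layer l on h(l-1), then map the data up.
data = np.random.default_rng(1).random((20, 6))
reps, weights = data, []
for n_hidden in (4, 3):
    W = train_autoencoder(reps, n_hidden)
    weights.append(W)
    reps = sigmoid(reps @ W.T)                # h(l) = sigma(W* h(l-1))
```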

SLIDE 18

deep unsupervised learning

[Diagram: a Restricted Boltzmann Machine (visible x, hidden h) and layers x, h1, h2, h3 stacked into a Deep Belief Network.]

Restricted Boltzmann Machines
P(x, h) = (1/Z) e^{−E(x,h)}, E(x, h) = −a⊤x − b⊤h − x⊤Wh
Conditional probabilities factorize, and are easy to compute:
P(x|h) = ∏_{i=1}^{m} P(xi|h), P(h|x) = ∏_{j=1}^{n} P(hj|x)
Layer-wise (pre-)training with maximum (log-)likelihood to estimate the model parameters.
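For binary units, the factorized conditionals above reduce to sigmoids of affine functions. A minimal sketch (sizes are made up; the Gibbs step at the end is the basic sampling primitive behind contrastive-divergence pre-training, which the slide does not spell out):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, n = 6, 3                        # visible / hidden sizes (made up)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(m, n))
a, b = np.zeros(m), np.zeros(n)    # visible / hidden biases

# E(x, h) = -a'x - b'h - x'Wh  =>  the conditionals factorize:
def p_h_given_x(x):
    return sigmoid(b + W.T @ x)    # vector of P(h_j = 1 | x)

def p_x_given_h(h):
    return sigmoid(a + W @ h)      # vector of P(x_i = 1 | h)

# One step of block Gibbs sampling between the two layers.
x = (rng.random(m) < 0.5).astype(float)
h = (rng.random(n) < p_h_given_x(x)).astype(float)
x_new = (rng.random(m) < p_x_given_h(h)).astype(float)
```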

SLIDE 19

deep unsupervised learning

(Deep) Unsupervised Learning is
∙ Typically used to initialize supervised deep models.
  ∙ Unsupervised pre-training is rarely used nowadays, when abundant supervised training data is available.
∙ Empirically not performing as well as supervised models, but considered very important, because
  ∙ Unlabeled data is much cheaper to get than labeled data.
  ∙ Humans are not trained by looking at 1 million images with labels.
∙ Able to do probabilistic inference, sampling, etc.

SLIDE 20

deep supervised learning

Deep Neural Networks (DNNs), a.k.a. Multi-Layer Perceptrons (MLPs)

An MLP with one hidden layer is a universal approximator for continuous functions.
x(l+1) = σ(W(l)x(l) + b(l))
Typical nonlinearity σ(·): sigmoid, rectified linear unit, etc.
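A minimal sketch of the layer recursion x(l+1) = σ(W(l)x(l) + b(l)), using a ReLU nonlinearity and made-up layer sizes:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, layers):
    """Apply x(l+1) = sigma(W(l) x(l) + b(l)) for each (W, b) pair."""
    for W, b in layers:
        x = relu(W @ x + b)
    return x

# Toy 3-layer net: 8 -> 16 -> 16 -> 4 (illustrative sizes).
rng = np.random.default_rng(0)
sizes = [8, 16, 16, 4]
layers = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
y = mlp_forward(rng.random(8), layers)
```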

SLIDE 21

deep supervised learning

Deep Neural Networks
∙ Softmax is typically used at the last layer to produce a probability distribution over the output alphabet.
  ∙ e.g. in speech recognition, an MLP qθ(y|x) is trained to approximate the posterior distribution over HMM states y given the observation x.
∙ Deeper models usually outperform shallow models with the same number of parameters.
  ∙ But we do not know why...
∙ The objective function is non-convex, non-linear, non-smooth; usually trained with stochastic gradient descent to a local optimum.

[Plot: typical learning curves of SGD.]
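A sketch of the softmax output layer together with one SGD step on the resulting cross-entropy loss −log q(y|x); only the last layer's gradient is shown, and all shapes are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
W, b = rng.normal(scale=0.1, size=(5, 8)), np.zeros(5)

def sgd_step(x, y, lr=0.1):
    """One stochastic gradient step on -log q(y|x) for a softmax layer."""
    global W, b
    p = softmax(W @ x + b)
    grad_logits = p.copy()
    grad_logits[y] -= 1.0         # d(-log p_y)/d(logits) = p - onehot(y)
    W -= lr * np.outer(grad_logits, x)
    b -= lr * grad_logits
    return -np.log(p[y])          # current loss

loss = sgd_step(rng.random(8), y=2)
```

The p − onehot(y) form of the gradient is what makes softmax plus cross-entropy so convenient as the top layer for back-propagation.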

SLIDE 22

deep supervised learning

Recurrent Neural Networks (RNNs)

image source: http://arxiv.org/abs/1412.5567

∙ Hidden layers take inputs from both the lower layers and the same layer at the previous (or next) time step.
∙ Expanded over time: a (very) deep neural network with shared weights.
∙ Widely used in speech recognition and natural language processing to process sequence data.
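A minimal sketch of the recurrence (assuming the common form h_t = tanh(Wx x_t + Wh h_{t−1} + b); the slide does not fix a particular parameterization):

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, b):
    """h_t = tanh(Wx x_t + Wh h_{t-1} + b): the same weights are shared
    across all time steps, so unrolling gives a very deep network."""
    h = np.zeros(Wh.shape[0])
    states = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)
        states.append(h)
    return states

rng = np.random.default_rng(0)
d_in, d_h = 4, 8                     # made-up dimensions
Wx = rng.normal(scale=0.1, size=(d_h, d_in))
Wh = rng.normal(scale=0.1, size=(d_h, d_h))
states = rnn_forward(rng.random((10, d_in)), Wx, Wh, np.zeros(d_h))
```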

SLIDE 23

deep supervised learning

∙ RNNs are very difficult to train.
  ∙ Gradients diminish or explode when back-propagated through time.
∙ Many variants have been proposed to improve the (empirical) convergence properties.
  ∙ Long Short-Term Memory (LSTM)
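One standard remedy for the exploding-gradient side of this problem (not named on the slide, but widely used alongside LSTMs) is to rescale the gradient when its norm exceeds a threshold; a minimal sketch:

```python
import numpy as np

def clip_gradient(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so that their global L2 norm is at
    most max_norm (mitigates exploding gradients in back-prop through time)."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads
```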

SLIDE 24

deep supervised learning

Convolutional Neural Networks (CNNs, ConvNets)
∙ Sparse connections vs. dense connections in DNNs.
∙ A filter (receptive field) operates on local patches, and defines a feature map as it moves around the spatial domain of the input.
∙ A pooling layer sub-samples a local neighborhood (with MEAN or MAX).

image source: http://deeplearning.net/tutorial/lenet.html

SLIDE 25

deep supervised learning

Convolutional Neural Networks ∙ ConvNets are dominant in computer vision related tasks.

∙ All the winners of the Imagenet challenge are ConvNets in recent years.

∙ ConvNets typically have fewer parameters than densely connected DNN counterparts, but are more computationally expensive to evaluate and train.
  ∙ Because of sparse connections / shared weights.

∙ ConvNets are reminiscent of feed-forward models in Computational Neuroscience.

SLIDE 26

invariant representations

SLIDE 27

invariant representations

Understanding vision as an information processing system, Marr's Tri-Level Hypothesis:
1. Computational level: what problem does the system solve, and why does it do these things?
2. Algorithmic / representational level: how does the system do what it does; specifically, what representations does it use, and what processes does it employ to build and manipulate those representations?
3. Physical level: how is the system physically realised (in the case of biological vision, what neural structures and neuronal activities implement the visual system)?

SLIDE 28

invariant representation

Fact: visual signals contain nuisance factors (e.g. viewpoint, illumination, etc.) that do not change the semantics of the scene.
Hypothesis: an invariant (and selective) representation is employed for typical visual tasks (object recognition, scene understanding, etc.).

image source: Another 53 objects database, http://www.vision.ee.ethz.ch/datasets/index.en.html.

SLIDE 29

invariant representations

A Formal Model
Assumption: a collection of transformations g ∈ G forms a group G (e.g. affine transformations). The transformations applied to input signals x ∈ X are irrelevant with respect to the semantic labels:
p(y|x) = p(y|gx), ∀x ∈ X, ∀y ∈ Y, ∀g ∈ G
Representation: a representation map is a function Φ : X → F. The map is
∙ invariant if Φ(x) = Φ(gx) for any x ∈ X and g ∈ G.
∙ selective if Φ(x) = Φ(x′) ⇒ ∃g ∈ G, x′ = gx.

SLIDE 32

group orbit as invariant representation

Definition (Group Orbit)
The orbit Ox of x under the group of transformations G is Ox = {gx : g ∈ G}.

Orbit is invariant
If x′ = g0x for some (unknown) g0 ∈ G, then
Ox′ = {g(g0x) : g ∈ G} = {g ◦ g0 x : g ∈ G} = {g′x : g′ ∈ G} = Ox

Orbit is selective
If Ox = Ox′, there must be g, g′ ∈ G such that gx = g′x′. Therefore x′ = (g′)⁻¹ ◦ g x, and (g′)⁻¹ ◦ g ∈ G.

SLIDE 33

computing the orbit map

Questions
∙ How to embed the orbit into a Euclidean space R^r?

∙ Most machine learning algorithms only work in a real-valued vector space.

∙ How to represent the orbit?

∙ As a set, the ordering of the elements in the orbit should be ignored. Some statistics should be used, e.g. MAX, MEAN, Histogram, etc.

∙ Is the feature map biologically plausible?

∙ What kind of computations are needed to implement it? Are neurons capable of performing them? ∙ Are there parameters in the feature extraction system? How could those parameters be initialized and learned? Are the learning procedures consistent with existing computational neuroscience literature?

SLIDE 36

computing the orbit map

Orbit ⟹ Probability Distribution
When the group G is compact, the Haar measure µG on G can be normalized as µ̃G(·) = µG(·)/µG(G), so that it induces a probability distribution νx on X for each orbit Ox:
νx(A) = µ̃G({g ∈ G : gx ∈ A}), ∀A ⊂ X
νx is well defined (not depending on the choice of x within the orbit) by the translation-invariance property of the Haar measure.

SLIDE 37

computing the orbit map

High-dimensional Probability Distribution ⟹ 1D Distribution
For a unit vector t, projecting onto t induces a 1D probability distribution:
ν⟨x,t⟩(A) = νx({x′ ∈ X : ⟨x′, t⟩ ∈ A}), ∀A ⊂ R
The original distribution is completely characterized by all of its 1D projected distributions (Cramér-Wold Theorem).

SLIDE 38

computing the orbit map

Approximating by concatenation of 1D histograms
1. Approximate Cramér-Wold by finitely many projections.
   ∙ We call the projecting directions t1, . . . , tK templates.
2. Approximate the 1D probability distributions by discrete histograms:
   z^k_m(x) = Eg[ηm(⟨gx, tk⟩)]
   where k = 1, . . . , K, and ηm(·) is the m-th histogram-bin indicator function for m = 1, . . . , M.
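A sketch of z^k_m(x) for a finite group, taking G to be the circular shifts of a 1D signal and using random templates (all sizes, the bin range, and the templates themselves are illustrative):

```python
import numpy as np

def invariant_features(x, templates, n_bins=8):
    """z^k_m(x) = E_g[eta_m(<gx, t_k>)] for G = circular shifts of x:
    average the histogram-bin indicators of the projections over the orbit."""
    d = len(x)
    orbit = np.stack([np.roll(x, s) for s in range(d)])  # {gx : g in G}
    feats = []
    for t in templates:
        proj = orbit @ t                                  # <gx, t_k> for every g
        hist, _ = np.histogram(proj, bins=n_bins, range=(-3, 3))
        feats.append(hist / d)                            # empirical E_g[eta_m(.)]
    return np.concatenate(feats)

rng = np.random.default_rng(0)
x = rng.normal(size=16)
templates = rng.normal(size=(4, 16)) / 4.0               # rough unit templates t_k
z = invariant_features(x, templates)
# Invariance check: a shifted input traces out the same orbit,
# hence produces exactly the same histogram features.
assert np.allclose(z, invariant_features(np.roll(x, 5), templates))
```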

SLIDE 39

computing the orbit map

Biologically Plausible Implementation
∙ Templates tk are stored as neuron input weights.
∙ The projections ⟨Φg(x), tk⟩ are the input accumulation of neurons.
∙ Histogram binning is implemented by the neuron nonlinearity ηm.
  ∙ Instead of histogram binning, a step function can be used to implement cumulative histogram binning.
  ∙ Other nonlinearities: sigmoid, MAX (OR), MEAN, n-th moments, etc.

[Plot: the sigmoid 1/(1 + e⁻ˣ) and the step function.]

SLIDE 40

computing the orbit map

Memory-based Learning ∙ When the group G is unitary, ⟨gx, tk⟩ = ⟨x, g−1tk⟩ ∙ Model parameter: gtk for k = 1, . . . , K and g ∈ G.

∙ Group G needs to be finitely sampled if not finite. ∙ Transformed templates are stored in neuron synapses.

∙ Learning by memorizing

∙ Each template is an instance of observed “object”. ∙ Transformed versions of each “object” are collected and stored during learning.

∙ Very natural for visual templates: consider a child playing with a toy; a continuous movie of the 3D affine-transformed toy (template) is easily obtained.
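A numeric check of the identity ⟨gx, tk⟩ = ⟨x, g⁻¹tk⟩ for circular shifts (which are unitary), showing why storing the transformed templates suffices at test time:

```python
import numpy as np

rng = np.random.default_rng(0)
x, t = rng.normal(size=16), rng.normal(size=16)

# g = shift by s; for circular shifts, g^{-1} = shift by -s.
for s in range(16):
    lhs = np.dot(np.roll(x, s), t)        # <gx, t>: transform the input
    rhs = np.dot(x, np.roll(t, -s))       # <x, g^{-1} t>: use a stored template
    assert np.isclose(lhs, rhs)

# So only the memorized templates {g t_k} are needed, never g itself.
stored = np.stack([np.roll(t, -s) for s in range(16)])
projections = stored @ x                  # all <gx, t> at once
```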

SLIDE 42

learning the templates

SLIDE 43

choosing the templates

What template to use?
∙ “Objects” as templates: natural, biologically plausible memory-based learning.
∙ Gabor templates: found in low-level visual receptive fields; optimal if simultaneous translation and scale invariance is desired.
∙ Random templates: possible to get finite-sample guarantees.
∙ Learned templates: connection with convolutional neural networks (ConvNets).

SLIDE 44

convolutional neural networks

ConvNets in 1D
∙ Convolution: map input x to output h with a template t
  h[i] = Σj x[j] t[i − j] = Σj x[j] t̃[j − i] = ⟨x, Φi(t̃)⟩
  where Φi(·) is a translation by i grid points, and t̃ is the mirrored reflection of t.
∙ Pooling: map h to pooled output z
  z[i] = max_{j∈N(i)} h[j]

This is a special case of the approximate orbit map when G is the 1D translation group, and MAX statistics are accumulated locally (local invariance).
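A minimal sketch of these two operations (np.convolve computes exactly h[i] = Σj x[j]t[i − j]; the pooling width and signal sizes are made up):

```python
import numpy as np

def conv1d(x, t):
    """h[i] = sum_j x[j] t[i-j], i.e. h[i] = <x, Phi_i(t~)> where t~ is
    the mirrored template (this is exactly what np.convolve computes)."""
    return np.convolve(x, t, mode="valid")

def max_pool1d(h, width=2):
    """z[i] = max over the neighborhood N(i): local MAX statistics."""
    n = len(h) // width
    return h[:n * width].reshape(n, width).max(axis=1)

rng = np.random.default_rng(0)
x, t = rng.normal(size=32), rng.normal(size=5)
z = max_pool1d(conv1d(x, t))
```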

SLIDE 45

convolutional neural networks

Discriminatively learning the templates
∙ Let T = (t1, . . . , tK) and FT(·) be the invariant feature map parameterized by the templates T.
∙ Consider a multi-class (phoneme) classification problem:
  ∙ X: input space of (speech) signals
  ∙ Y: the space of (phonetic) labels
  ∙ fθ(y|z): a classifier (e.g. a multi-layer neural network), parameterized by θ, that predicts the probability of label y ∈ Y given the observation representation z = FT(x) for the input signal x.

SLIDE 46

convolutional neural networks

Discriminatively learning the templates
∙ Given a training set of i.i.d. samples (x1, y1), . . . , (xN, yN).
∙ Solve the following optimization problem (maximum log-likelihood estimation):
  min_{T,θ} L(T, θ) = −(1/N) Σ_{i=1}^{N} log fθ(yi | FT(xi))
∙ The optimization is solved jointly with respect to θ and T, so that the invariant feature extractor FT is adapted to the given classification problem.
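The objective in code, with FT and fθ as placeholders (the identity feature map and the fixed softmax classifier below are illustrative stand-ins only; any differentiable pair would do):

```python
import numpy as np

def nll_loss(feature_map, classifier, xs, ys):
    """L(T, theta) = -(1/N) sum_i log f_theta(y_i | F_T(x_i))."""
    losses = [-np.log(classifier(feature_map(x))[y]) for x, y in zip(xs, ys)]
    return np.mean(losses)

# Toy stand-ins: identity features and a random-weight softmax classifier.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 8))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()
loss = nll_loss(lambda x: x, lambda z: softmax(W @ z),
                xs=rng.random((5, 8)), ys=rng.integers(0, 3, size=5))
```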

SLIDE 47

convolutional neural networks

Greedy algorithm for solving the optimization
∙ Iteratively update the parameters via stochastic gradient descent:
  θnew ← θold − α ∂L/∂θ,  Tnew ← Told − α ∂L/∂T
∙ The gradients are computed with chain rules (back-propagation):
  ∂L/∂tk = ∂h/∂tk (convolution) · ∂z/∂h (pooling) · ∂L/∂z (classifier)

SLIDE 48

Questions?
