SLIDE 1

Multilayer Networks

Léon Bottou COS 424 – 3/11/2010

SLIDE 2

Agenda

Goals
– Classification, clustering, regression, other.
Representation
– Parametric vs. kernels vs. nonparametric.
– Probabilistic vs. nonprobabilistic.
– Linear vs. nonlinear.
– Deep vs. shallow.
Capacity Control
– Explicit: architecture, feature selection.
– Explicit: regularization, priors.
– Implicit: approximate optimization.
– Implicit: Bayesian averaging, ensembles.
Operational Considerations
– Loss functions.
– Budget constraints.
– Online vs. offline.
Computational Considerations
– Exact algorithms for small datasets.
– Stochastic algorithms for big datasets.
– Parallel algorithms.

SLIDE 3

Summary

  • 1. Brains and machines.
  • 2. Multilayer networks.
  • 3. Modular back-propagation.
  • 4. Examples
  • 5. Tricks

SLIDE 4

Cybernetics

Mature communication technologies: telegraph, telephone, radio, ...
Nascent computing technologies: ENIAC (1946).
Norbert Wiener (1948): Cybernetics, or Control and Communication in the Animal and the Machine.
Redefining the man–machine boundary.

SLIDE 5

What should a computer be?

A universal machine to process information.
– Which structure? What building blocks?
– Which model to emulate: the biological computer or the mathematical computer?
Mathematical logic offers a lot more guidance:
→ Turing machines.
→ Von Neumann architecture.
→ Software and hardware.
→ Today's computer science.

SLIDE 6

An engineering perspective on the brain

The brain as a computer
– Compact.
– Energy efficient (≈ 20 Watts).
– Amazingly good for perception and informal reasoning.
Bill of materials
– ≈ 90%: support, energy, cooling.
– ≈ 10%: signalling wires.
A lot of wires in a small box
– Severe wiring constraints force a very specific architecture.
– Local connections (98%) vs. long-distance connections (2%).
– Layered structure (at least in the visual system).
– This is not a universal machine!
– But this machine defines what we believe is interesting!

SLIDE 7

Computing with artificial neurons?

[Figure: a linear threshold element computing sign(w′x) from retina and associative-area inputs.]

McCulloch and Pitts (1943)
– Neurons as linear threshold units.
Perceptron (1957), Adaline (1961)
– Training linear threshold units.
– A viable computing primitive?
⇐ People really tried things!
– Madaline, NeoCognitron.
– But how to train them?

SLIDE 8

Computing with artificial neurons?

Circuits of linear threshold units?
– You can do complicated things that actually work...
– But how to train them?
Fukushima's NeoCognitron (1980)
– Leveraging symmetries and invariances.

SLIDE 9

Minsky and Papert “Perceptrons” (1969)

Circuits of logic gates
– Linear threshold unit ≈ logic gate.
– Computers ≈ lots of logic gates.
– Which functions require what kind of circuit?
Counter-examples
– Easily solvable on a general purpose computer.
– Demand deep circuits to solve effectively.
– The perceptron can train a single logic gate!
– Training deep circuits seems hopeless.
In the background
– Universal computers need a universal representation of knowledge.
– Mathematical logic offers first order logic.
– First order logic can represent a lot more than perceptrons.
– This is absolutely correct.

SLIDE 10

Choose your Evil

Training first order logic / training deep circuits of logic gates
– Symbolic domains, discrete spaces.
– Combinatorial explosion.
– Non-polynomial.
Continuous approximations
– Replace the threshold by a sigmoid function.
– Continuous and differentiable.
– Usually nonconvex.

Circuits of linear units → Multilayer networks (1985)
First order logic → Markov Logic Networks (2010)
Human logic → ?

SLIDE 11

Multilayer networks, 1980s style

“ANN accurately predicts the effectiveness of the Micro-Compact Heat Exchanger and compares well with those obtained from the finite element simulation. [. . . ] computational effort has been minimized and simulation time has been drastically reduced.”

SLIDE 12

Multilayer networks, modularized

The generic brick

∂L/∂w = (∂L/∂y) × (∂y/∂w)    and    ∂L/∂x = (∂L/∂y) × (∂y/∂x)

Forward pass in a two-layer network
– Present example x, compute the output f(x), compute the loss L(x, y, w).
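To make the generic brick concrete, here is a minimal sketch (illustrative names, not code from the lecture) of a linear brick that stores its input during the forward pass and applies the two chain-rule products during the backward pass:

```python
import numpy as np

class LinearBrick:
    """Minimal 'generic brick': y = W x, with the two chain-rule products
    dL/dW = (dL/dy) x^T  and  dL/dx = W^T (dL/dy)."""

    def __init__(self, n_in, n_out, rng=np.random):
        self.W = 0.1 * rng.randn(n_out, n_in)   # small initial weights
        self.dW = np.zeros_like(self.W)         # gradient accumulator

    def forward(self, x):
        self.x = x                              # keep the input for the backward pass
        return self.W @ x

    def backward(self, dy):
        self.dW = np.outer(dy, self.x)          # dL/dW = (dL/dy) x^T
        return self.W.T @ dy                    # dL/dx, passed to the brick below
```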

SLIDE 13

Back-propagation algorithm

Backward pass in the two-layer network
– Set ∂L/∂L = 1, then compute the gradients ∂L/∂y and ∂L/∂w for all boxes.

  • Update weights

– For instance with a stochastic gradient update.

w ← w − γ_t (∂L/∂w)(x, y, w)
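A self-contained sketch of one such backward pass and update, on a small two-layer network with a sigmoid hidden layer and squared loss (the function name, the loss, and the shapes are assumptions for illustration):

```python
import numpy as np

def sgd_step(W1, W2, x, target, gamma):
    """One forward pass, one backward pass, one stochastic gradient update
    on f(x) = W2 sigma(W1 x) with squared loss (illustrative sketch)."""
    # forward pass
    a = W1 @ x
    h = 1.0 / (1.0 + np.exp(-a))        # sigma(a)
    f = W2 @ h
    L = 0.5 * np.sum((f - target) ** 2)
    # backward pass: start from dL/dL = 1 and apply the chain rule box by box
    df = f - target                     # dL/df
    dW2 = np.outer(df, h)               # dL/dW2
    dh = W2.T @ df                      # dL/dh
    da = h * (1.0 - h) * dh             # dL/da through the sigmoid
    dW1 = np.outer(da, x)               # dL/dW1
    # stochastic gradient update  w <- w - gamma_t dL/dw
    return W1 - gamma * dW1, W2 - gamma * dW2, L
```

Repeating this step over randomly drawn training examples gives the stochastic gradient training loop.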

SLIDE 14

Modules

Build representations with any piece you need. In the table below, a checked variable denotes the back-propagated gradient (x̌ = ∂L/∂x, y̌ = ∂L/∂y, w̌ = ∂L/∂w, Ľ = ∂L/∂L).

Module (symbol): forward pass; backward gradients
– Linear (Wx): y = Wx; x̌ = W⊤y̌, W̌ = y̌ x⊤
– Euclidean ((x − w)²): y_k = (x − w_k)²; x̌ = 2(x − w_k) y̌_k, w̌_k = 2(w_k − x) y̌_k
– Sigmoid: y_i = σ(x_i); x̌_i = σ′(x_i) y̌_i
– MSE loss (MSE): L = (x − y)²; x̌ = 2(x − y) Ľ
– Perceptron loss (Perceptron): L = max{0, −yx}; x̌ = −y · 1{yx ≤ 0} · Ľ
– Log loss (LogLoss): L = log(1 + e^(−yx)); x̌ = −y (1 + e^(yx))⁻¹ Ľ
– · · ·
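For instance, the Sigmoid and MSE rows translate directly into forward/backward pairs; the sketch below uses made-up function names and the same convention that a checked variable is the back-propagated gradient:

```python
import numpy as np

# Sigmoid module:  y_i = sigma(x_i),  x_check_i = sigma'(x_i) y_check_i
def sigmoid_forward(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_backward(y, y_check):
    return y * (1.0 - y) * y_check      # sigma'(x) = sigma(x) (1 - sigma(x))

# MSE loss module:  L = (x - y)^2 (summed over components),  x_check = 2 (x - y) L_check
def mse_forward(x, y):
    return np.sum((x - y) ** 2)

def mse_backward(x, y, L_check=1.0):
    return 2.0 * (x - y) * L_check
```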

SLIDE 15

Combine modules

SLIDE 16

Composite modules

Convolutional module – many linear modules with shared parameters. Remember the NeoCognitron?
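A one-dimensional sketch of this weight sharing (names are illustrative): the same weight vector w is applied at every position, so the backward pass accumulates the weight gradient over all positions.

```python
import numpy as np

def conv1d_forward(x, w):
    """Many linear modules sharing the same parameters w, one per position."""
    k = len(w)
    return np.array([w @ x[t:t + k] for t in range(len(x) - k + 1)])

def conv1d_backward(x, w, y_check):
    """Because the weights are shared, their gradient accumulates the
    contribution of every position where w was applied."""
    k = len(w)
    x_check = np.zeros_like(x, dtype=float)
    w_check = np.zeros_like(w, dtype=float)
    for t, yc in enumerate(y_check):
        x_check[t:t + k] += w * yc
        w_check += x[t:t + k] * yc
    return x_check, w_check
```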

SLIDE 17

CNNs for signal processing

Time-Delay Neural Networks
– 1990: speaker-independent phoneme recognition.
– 1991: speaker-independent word recognition.
– 1992: continuous speech recognition.

SLIDE 18

CNNs for image analysis

2D Convolutional Neural Networks
– 1989: isolated handwritten digit recognition.
– 1991: face recognition, sonar image analysis.
– 1993: vehicle recognition.
– 1994: zip code recognition.
– 1996: check reading.

INPUT: 32×32 image
→ (convolutions) C1: 6 feature maps, 28×28
→ (subsampling) S2: 6 feature maps, 14×14
→ (convolutions) C3: 16 feature maps, 10×10
→ (subsampling) S4: 16 feature maps, 5×5
→ (full connection) C5: layer of 120 units
→ (full connection) F6: layer of 84 units
→ (Gaussian connections) OUTPUT: 10 units
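The quoted feature-map sizes are consistent with 5×5 convolutions without padding and 2×2 subsampling (an assumption; the slide does not state the kernel sizes):

```python
# Assuming 5x5 convolutions without padding and 2x2 subsampling,
# the quoted feature-map sizes follow one from another:
size = 32            # INPUT 32x32
size = size - 5 + 1  # C1: 28x28
size = size // 2     # S2: 14x14
size = size - 5 + 1  # C3: 10x10
size = size // 2     # S4: 5x5
```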

SLIDE 19

CNNs for character recognition

[Figure: handwritten digits with the corresponding activations of layers C1, S2, C3, S4, C5, F6 and the output.]

SLIDE 20

CNNs for face recognition

Note: same code as the digit recognizer.

SLIDE 21

Combining CNNs and HMM

[Figure: graph transformer network combining the SDNN with an HMM-style graph. The SDNN output is composed with a character-model transducer into an interpretation graph; forward scorers compare the constrained graph (desired sequence) with the full interpretation graph, and a Viterbi pass over the composed graph produces the answer.]

SLIDE 22

Combining CNNs and HMM

[Figure: sample digit strings showing, for each, the input image, the F6 activations, the SDNN output, and the final answer.]

SLIDE 23

Combining CNNs and FSTs

[Figure: check-reading graph transformer network. A sample check is processed by field-location, segmentation, and recognition transformers into field, segmentation, recognition, and interpretation graphs, composed with a grammar, and read out by a Viterbi transformer as the best amount.]

Check reading involves
– locating the fields,
– segmenting the characters,
– recognizing the characters,
– making sense of the string.
Global training
– Integrate all these modules into a single trainable system.
Deployment
– Deployed in 1996-1997.
– Was still in use in 2007.
– Processing ≈ 15% of the US checks.

SLIDE 24

Optimisation for multilayer network

The simplest multilayer network
– Two weights w1, w2.
– One example {(1, 1)}.
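The slide does not spell out the network; as an illustration, assume the one-unit-per-layer chain f(x) = w2 · tanh(w1 · x) with squared loss on the single example. Evaluating that loss on a grid of (w1, w2) reproduces the landscape described on the next slide:

```python
import numpy as np

# Hypothetical reconstruction: f(x) = w2 * tanh(w1 * x), squared loss on
# the single example (x, y) = (1, 1), evaluated over a grid of weights.
w1, w2 = np.meshgrid(np.linspace(-4, 4, 401), np.linspace(-4, 4, 401))
loss = (1.0 - w2 * np.tanh(w1 * 1.0)) ** 2
# The grid shows a ravine of near-zero loss along w2 * tanh(w1) = 1
# (close to the hyperbola w1 * w2 = 1 where w1 is small), a flat saddle
# around the origin, higher loss in the quadrants where w1 * w2 < 0, and
# a surface that flattens in the w1 direction once tanh saturates.
```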

SLIDE 25

Optimisation for multilayer network

Landscape
– Ravine along w1 w2 = 1.
– Massive saddle point near the origin.
– Mountains in the quadrants w1 w2 < 0.
– Plateaux in the distance.
Tricks of the trade
– How to initialize the weights?
– How to avoid the great saddle point?
– etc.

SLIDE 26

Capacity control through optimization

Idea
– Initialize the weights with quite small values (but not too small!). This exercises the linear part of the sigmoid, so the whole network initially implements a linear function.
– As learning progresses, the weights increase and the function slowly becomes more and more nonlinear.
Early stopping
– Monitor both the training and validation errors during training.
– The training error illustrates the optimisation process.
– Stop training when the validation error stops improving.
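A sketch of the early-stopping recipe (the helpers `step` and `valid_error`, and the `patience` rule, are assumptions rather than anything prescribed by the slide):

```python
import copy

def train_with_early_stopping(model, step, valid_error, patience=10):
    """Run training passes until the validation error has stopped improving
    for `patience` consecutive passes, then return the best snapshot."""
    best_err, best_model, since_best = float("inf"), copy.deepcopy(model), 0
    while since_best < patience:
        step(model)                   # one optimisation pass over the training set
        err = valid_error(model)      # monitor the validation error
        if err < best_err:
            best_err, best_model, since_best = err, copy.deepcopy(model), 0
        else:
            since_best += 1           # no improvement this pass
    return best_model, best_err
```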
