
SLIDE 1

Neural networks

Slides adapted from Stuart Russell


SLIDE 2

Brains

10^11 neurons of > 20 types, 10^14 synapses, 1ms–10ms cycle time. Signals are noisy “spike trains” of electrical potential.

[Figure: schematic of a neuron — cell body (soma) with nucleus, dendrites, axon with axonal arborization, and synapses receiving axons from other cells.]


SLIDE 3

McCulloch–Pitts “unit”

Output is a “squashed” linear function of the inputs:

a_i ← g(in_i) = g(Σ_j W_{j,i} a_j)

[Figure: a single unit — input links deliver activations a_j, weighted by W_{j,i}; a fixed bias input a_0 = −1 carries the bias weight W_{0,i}; the input function computes in_i, the activation function g produces the output a_i = g(in_i), which is sent along the output links.]

A gross oversimplification of real neurons, but its purpose is to develop understanding of what networks of simple units can do
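A minimal sketch of such a unit in NumPy (the names, the fixed bias input a_0 = −1, and the weight values here are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def unit(a, W, g):
    """One McCulloch-Pitts unit: a_i = g(sum_j W_j,i * a_j)."""
    in_i = np.dot(W, a)       # weighted sum of input activations
    return g(in_i)

step = lambda x: 1.0 if x >= 0 else 0.0   # threshold activation

# inputs: fixed bias input a_0 = -1 (assumed convention), then a_1, a_2
a = np.array([-1.0, 1.0, 0.0])
W = np.array([0.5, 1.0, 1.0])   # W_0 is the bias weight (values illustrative)
print(unit(a, W, step))         # 1.0, since -0.5 + 1.0 + 0.0 >= 0
```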


SLIDE 4

Activation functions

[Figure: two plots of g(in_i) against in_i, each saturating at +1.]

(a) is a step function or threshold function; (b) is a sigmoid function 1/(1 + e^{−x}).

Changing the bias weight W_{0,i} moves the threshold location.


SLIDE 5

Network structures

Feed-forward networks:
– single-layer perceptrons
– multi-layer perceptrons

Feed-forward networks implement functions, have no internal state.

Recurrent networks:
– recurrent neural nets have directed cycles with delays ⇒ have internal state (like flip-flops), can oscillate, etc.


SLIDE 6

Feed-forward example

[Figure: feed-forward network with input units 1 and 2, hidden units 3 and 4, and output unit 5, connected by weights W_{1,3}, W_{1,4}, W_{2,3}, W_{2,4}, W_{3,5}, W_{4,5}.]

Feed-forward network = a parameterized family of nonlinear functions:

a_5 = g(W_{3,5} · a_3 + W_{4,5} · a_4)
    = g(W_{3,5} · g(W_{1,3} · a_1 + W_{2,3} · a_2) + W_{4,5} · g(W_{1,4} · a_1 + W_{2,4} · a_2))

Adjusting weights changes the function: do learning this way!
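This family is small enough to transcribe directly (a sketch; the weight values are arbitrary, since the slide leaves them unspecified):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# arbitrary example weights (the slide does not fix them)
W13, W14, W23, W24, W35, W45 = 0.5, -0.3, 0.8, 0.1, 1.2, -0.7

def network(a1, a2, g=sigmoid):
    a3 = g(W13 * a1 + W23 * a2)   # hidden unit 3
    a4 = g(W14 * a1 + W24 * a2)   # hidden unit 4
    a5 = g(W35 * a3 + W45 * a4)   # output unit 5
    return a5

print(network(1.0, 0.0))
```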


SLIDE 7

Single-layer perceptrons

[Figure: left, a single-layer perceptron network — input units connected directly to output units by weights W_{j,i}; right, a surface plot of the perceptron output over inputs x_1 and x_2, showing a soft threshold “cliff”.]

Adjusting weights moves the location, orientation, and steepness of the cliff.


SLIDE 8

Expressiveness of perceptrons

Consider a perceptron with g = step function (Rosenblatt, 1957, 1960). Represents a linear separator in input space:

Σ_j W_j x_j > 0    or    W · x > 0

Can represent AND, OR, NOT, majority, etc.:

AND: W_0 = 1.5, W_1 = 1, W_2 = 1
OR:  W_0 = 0.5, W_1 = 1, W_2 = 1
NOT: W_0 = −0.5, W_1 = −1

But not XOR:

[Figure: (a) x_1 and x_2, (b) x_1 or x_2 — each separable by a line in the (x_1, x_2) plane; (c) x_1 xor x_2 — no line separates the positive from the negative points.]
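A quick check of these weights in code, reading the bias as a threshold (output 1 iff W · x ≥ W_0, the convention implied by the weight values above; a sketch, not the original slide code):

```python
import numpy as np

def perceptron(x, w, w0):
    # output 1 iff w . x >= w0  (bias input a_0 = -1 with weight w0)
    return int(np.dot(w, x) - w0 >= 0)

AND = lambda x1, x2: perceptron([x1, x2], [1, 1], 1.5)
OR  = lambda x1, x2: perceptron([x1, x2], [1, 1], 0.5)
NOT = lambda x1:     perceptron([x1], [-1], -0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, AND(x1, x2), OR(x1, x2))
# No single (w, w0) can represent XOR: its positive cases (0,1), (1,0)
# and negative cases (0,0), (1,1) are not linearly separable.
```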


SLIDE 9

Multilayer perceptrons

Layers are usually fully connected; numbers of hidden units typically chosen by hand

[Figure: input units with activations a_k feed hidden units (weights W_{k,j}, activations a_j), which feed output units (weights W_{j,i}, activations a_i).]


SLIDE 10

Expressiveness of MLPs

All continuous functions w/ 2 layers, all functions w/ 3 layers

[Figure: two surface plots of h_W(x_1, x_2) — a ridge formed from two opposite-facing soft thresholds, and a bump formed from two perpendicular ridges.]

Combine two opposite-facing threshold functions to make a ridge. Combine two perpendicular ridges to make a bump. Add bumps of various sizes and locations to fit any surface. Proof requires exponentially many hidden units.
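These constructions are easy to reproduce numerically (a sketch; the steepness and edge locations are arbitrary, and the sigmoid stands in for the soft threshold):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ridge(x, steepness=10.0, left=-1.0, right=1.0):
    # difference of two opposite-facing soft thresholds:
    # ~1 between `left` and `right`, ~0 elsewhere
    return sigmoid(steepness * (x - left)) - sigmoid(steepness * (x - right))

def bump(x1, x2):
    # two perpendicular ridges combined (and re-thresholded) give a bump
    return sigmoid(10.0 * (ridge(x1) + ridge(x2) - 1.5))

print(bump(0.0, 0.0), bump(3.0, 0.0))   # high inside, low outside
```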


SLIDE 11

Back-propagation learning

At each epoch, sum gradient updates for all examples and apply.

Training curve for 100 restaurant examples: finds exact fit.

[Figure: total error on the training set versus number of epochs.]

Typical problems: slow convergence, local minima


SLIDE 12

Handwritten digit recognition

3-nearest-neighbor = 2.4% error
400–300–10 unit MLP = 1.6% error
LeNet (1998): 768–192–30–10 unit MLP = 0.9% error
SVMs: ≈ 0.6% error
Current best: 0.24% error (committee of convolutional nets)


SLIDE 13


Example: ALVINN

[Pomerleau, 1995]

[Figure: the ALVINN network maps a camera image of the road ahead to a steering direction.]

SLIDE 14

Backpropagation

Slides adapted from Kyunghyun Cho

SLIDE 15

Learning as an Optimization

Ultimately, learning is (mostly)

θ̂ = arg min_θ (1/N) Σ_{n=1}^{N} c((x_n, y_n) | θ) + λ Ω(θ, D),

where c((x, y) | θ) is a per-sample cost function.

SLIDE 16

Gradient Descent

Gradient-descent algorithm:

θ_t = θ_{t−1} − η ∇L(θ_{t−1}),

where, in our case,

L(θ) = (1/N) Σ_{n=1}^{N} l((x_n, y_n) | θ).

Let us assume that Ω (θ, D) = 0.
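A minimal NumPy sketch of this update rule (the toy loss and all names are illustrative assumptions, not from the slides):

```python
import numpy as np

def gradient_descent(theta, grad_L, eta=0.1, steps=100):
    """theta_t = theta_{t-1} - eta * grad L(theta_{t-1})."""
    for _ in range(steps):
        theta = theta - eta * grad_L(theta)
    return theta

# toy example: L(theta) = ||theta||^2 / 2, so grad L(theta) = theta
theta = gradient_descent(np.array([3.0, -2.0]), grad_L=lambda th: th)
print(theta)   # converges toward the minimizer at 0
```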

SLIDE 17

Stochastic Gradient Descent

Often, it is too costly to compute L(θ) due to a large training set.

Stochastic gradient descent algorithm:

θ_t = θ_{t−1} − η_t ∇l((x′, y′) | θ_{t−1}),

where (x′, y′) is a randomly chosen sample from D, and

Σ_{t=1}^{∞} η_t → ∞   and   Σ_{t=1}^{∞} η_t² < ∞.

Let us assume that Ω (θ, D) = 0.
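The same loop with a single randomly drawn sample per step and a decaying step size η_t = η_0/t, which satisfies both conditions above (a sketch; the toy data and gradient are illustrative):

```python
import numpy as np

def sgd(theta, data, grad_l, eta0=0.5, steps=1000,
        rng=np.random.default_rng(0)):
    """theta_t = theta_{t-1} - eta_t * grad l((x', y') | theta_{t-1})."""
    for t in range(1, steps + 1):
        x, y = data[rng.integers(len(data))]   # randomly chosen sample
        eta_t = eta0 / t        # sum eta_t = inf, sum eta_t^2 < inf
        theta = theta - eta_t * grad_l(x, y, theta)
    return theta

# toy example: least squares on scalar pairs, l = (y - theta*x)^2 / 2
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
grad = lambda x, y, th: -(y - th * x) * x
print(sgd(1.0, data, grad))     # approaches theta ~= 2
```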

SLIDE 18

Almost there…

How do we compute the gradient efficiently for neural networks?

SLIDE 19

Backpropagation Algorithm – (1) Forward Pass

• Forward Computation:

L(f(h_1(x_1, x_2, θ_{h1}), h_2(x_1, x_2, θ_{h2}), θ_f), y)

Multilayer perceptron with a single hidden layer:

L(x, y, θ) = (1/2) ‖y − Uᵀ φ(Wᵀ x)‖²
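Written out as code, this forward pass and loss might look as follows (a sketch; φ is assumed to be the sigmoid, and all shapes are illustrative):

```python
import numpy as np

def phi(z):                   # elementwise nonlinearity (assumed sigmoid)
    return 1.0 / (1.0 + np.exp(-z))

def loss(x, y, W, U):
    """L(x, y, theta) = 1/2 * || y - U^T phi(W^T x) ||^2"""
    h = phi(W.T @ x)          # hidden activations
    y_hat = U.T @ h           # network output
    return 0.5 * np.sum((y - y_hat) ** 2)

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)
W, U = rng.normal(size=(3, 4)), rng.normal(size=(4, 2))
print(loss(x, y, W, U))
```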

SLIDE 20

Backpropagation Algorithm – (2) Chain Rule

• Chain rule of derivatives:

∂L/∂x_1 = (∂L/∂f)(∂f/∂x_1) = (∂L/∂f) · (∂f/∂h_1 · ∂h_1/∂x_1 + ∂f/∂h_2 · ∂h_2/∂x_1)

SLIDE 21

Backpropagation Algorithm – (3) Shared Derivatives

• Local derivatives are shared:

∂L/∂x_1 = (∂L/∂f) · (∂f/∂h_1 · ∂h_1/∂x_1 + ∂f/∂h_2 · ∂h_2/∂x_1)
∂L/∂x_2 = (∂L/∂f) · (∂f/∂h_1 · ∂h_1/∂x_2 + ∂f/∂h_2 · ∂h_2/∂x_2)

SLIDE 22

Backpropagation Algorithm – (4) Local Computation

• Each node computes
– Forward: h(a_1, a_2, …, a_q)
– Backward: ∂h/∂a_1, ∂h/∂a_2, …, ∂h/∂a_q
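One way to realize this node-local computation in code (an illustrative sketch — the class and method names are assumptions, not any particular framework's API):

```python
class MulNode:
    """Example node h(a1, a2) = a1 * a2."""
    def forward(self, a1, a2):
        self.a1, self.a2 = a1, a2   # cache inputs for the backward pass
        return a1 * a2
    def backward(self):
        # local partials dh/da1, dh/da2
        return self.a2, self.a1

node = MulNode()
h = node.forward(3.0, 4.0)
print(h, node.backward())           # 12.0 (4.0, 3.0)
```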

SLIDE 23

Backpropagation Algorithm – Requirements

• Each node computes a differentiable function¹
• Directed Acyclic Graph²

¹ Well… ?
² Well… ?
SLIDE 24

Backpropagation Algorithm – Automatic Differentiation

• Generalized approach to computing partial derivatives
• As long as your neural network fits the requirements, you do not need to derive the derivatives yourself!
• Theano, Torch, …
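For example, with PyTorch, a modern descendant of Torch, the partial derivatives come for free (a minimal sketch assuming `torch` is installed):

```python
import torch

x1 = torch.tensor(1.0, requires_grad=True)
x2 = torch.tensor(2.0, requires_grad=True)

h1, h2 = x1 * x2, x1 + x2     # a small computation graph (a DAG)
f = h1 ** 2 + torch.sin(h2)

f.backward()                  # automatic reverse-mode differentiation
print(x1.grad, x2.grad)       # df/dx1, df/dx2, no manual derivation needed
```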

SLIDE 50

What is a word embedding?

Suppose you have a dictionary of words. The i-th word in the dictionary is represented by an embedding:

w_i ∈ R^d,

i.e. a d-dimensional vector, which is learnt! d is typically in the range 50 to 1000.

Similar words should have similar embeddings (share latent features).

Embeddings can also be applied to symbols as well as words (e.g. Freebase nodes and edges).

Discuss later: we can also have embeddings of phrases, sentences, documents, or even other modalities such as images.
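Concretely, the embeddings form a |V| × d matrix and looking up a word is just selecting a row (a minimal sketch with a toy vocabulary; in practice E is learnt, here it is random):

```python
import numpy as np

vocab = ["the", "cat", "dog", "sat"]   # toy dictionary
d = 50                                 # embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d))   # learnt in practice, random here

w_cat, w_dog = E[vocab.index("cat")], E[vocab.index("dog")]
# after training, similar words should have high cosine similarity
cos = w_cat @ w_dog / (np.linalg.norm(w_cat) * np.linalg.norm(w_dog))
print(cos)
```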


SLIDE 51

Learning an Embedding Space

[Figure: example embedding of 115 countries (Bordes et al., ’11).]


SLIDE 52

Convolutional Neural Networks for Sentence Classification

How well can we do with a simple CNN?

Collobert-Weston style CNN with pre-trained embeddings from word2vec
