SLIDE 1

Feedforward neural nets

CSE 250B

SLIDE 2

Outline

1 Architecture 2 Expressivity 3 Learning

SLIDE 3

The architecture

x → h(1) → h(2) → · · · → h(ℓ) → y

SLIDE 4

The value at a hidden unit

z1, z2, · · · , zm → h

How is h computed from z1, . . . , zm?

SLIDE 5

The value at a hidden unit

z1, z2, · · · , zm → h

How is h computed from z1, . . . , zm?

  • h = σ(w1 z1 + w2 z2 + · · · + wm zm + b)
  • σ(·) is a nonlinear activation function, e.g. “rectified linear”:

    σ(u) = u if u ≥ 0, and 0 otherwise
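
A small NumPy sketch of this computation (illustrative only; the particular values of w, b, and z below are made up):

    import numpy as np

    def relu(u):
        # rectified linear activation: u if u >= 0, else 0
        return np.maximum(0.0, u)

    def hidden_unit(z, w, b):
        # h = sigma(w1 z1 + ... + wm zm + b)
        return relu(np.dot(w, z) + b)

    z = np.array([0.5, -1.0, 2.0])   # values of the parent nodes z1, ..., zm
    w = np.array([0.1, 0.4, -0.3])   # weights on the incoming edges
    b = 0.2                          # bias term
    print(hidden_unit(z, w, b))      # 0.0, since the pre-activation is negative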
SLIDE 6

Common activation functions

  • Threshold function or Heaviside step function

    σ(z) = 1 if z ≥ 0, and 0 otherwise

  • Sigmoid

    σ(z) = 1 / (1 + e^(−z))

  • Hyperbolic tangent

σ(z) = tanh(z)

  • ReLU (rectified linear unit)

σ(z) = max(0, z)
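
A short NumPy sketch of these four activation functions (illustrative, not from the slides):

    import numpy as np

    def heaviside(z):
        # threshold / Heaviside step: 1 if z >= 0, else 0
        return (z >= 0).astype(float)

    def sigmoid(z):
        # maps any real z into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def relu(z):
        # rectified linear unit: max(0, z)
        return np.maximum(0.0, z)

    z = np.linspace(-3, 3, 7)
    for f in (heaviside, sigmoid, np.tanh, relu):
        print(f.__name__, f(z))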

SLIDE 7

Why do we need nonlinear activation functions?

x → h(1) → h(2) → · · · → h(ℓ) → y

Without the nonlinearity σ, every layer computes an affine map, and a composition of affine maps is itself affine, so the whole network would collapse to a single linear layer.
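
To see this concretely, here is a small sketch (not from the slides): two linear layers with no activation in between compute exactly the same function as a single affine map.

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
    W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
    x = rng.normal(size=3)

    # two layers applied with no nonlinearity in between
    two_layers = W2 @ (W1 @ x + b1) + b2

    # the equivalent single affine map: W = W2 W1, b = W2 b1 + b2
    one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)

    print(np.allclose(two_layers, one_layer))   # True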

SLIDE 8

The output layer

Classification with k labels: want k probabilities summing to 1.

z1, z2, z3, · · · , zm → y1, y2, · · · , yk

SLIDE 9

The output layer

Classification with k labels: want k probabilities summing to 1.

z1, z2, z3, · · · , zm → y1, y2, · · · , yk

  • y1, . . . , yk are linear functions of the parent nodes zi.
  • Get probabilities using softmax:

Pr(label j) = e^(yj) / (e^(y1) + · · · + e^(yk)).
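
A minimal softmax sketch (the values of y1, . . . , yk below are made up):

    import numpy as np

    def softmax(y):
        # subtracting the max is a standard numerical-stability trick;
        # it does not change the resulting probabilities
        e = np.exp(y - np.max(y))
        return e / e.sum()

    y = np.array([2.0, 0.5, -1.0])   # linear outputs y1, ..., yk
    p = softmax(y)
    print(p, p.sum())                # k probabilities that sum to 1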

SLIDE 10

The complexity

x → h(1) → h(2) → · · · → h(ℓ) → y

SLIDE 11

Outline

1 Architecture 2 Expressivity 3 Learning

SLIDE 12

Approximation capability

Let f : Rd → R be any continuous function. There is a neural net with a single hidden layer that approximates f arbitrarily well.

SLIDE 13

Approximation capability

Let f : Rd → R be any continuous function. There is a neural net with a single hidden layer that approximates f arbitrarily well.

  • The hidden layer may need a lot of nodes.
  • For certain classes of functions:
      • Either: one hidden layer of enormous size
      • Or: multiple hidden layers of moderate size
SLIDE 14

Stone-Weierstrass theorem I

If f : [a, b] → R is continuous, then there is a sequence of polynomials Pn such that Pn has degree n and

    sup_{x ∈ [a,b]} |Pn(x) − f(x)| → 0 as n → ∞.
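
A quick numerical illustration (not from the slides; it uses least-squares polynomial fits as a stand-in for the approximating sequence guaranteed by the theorem):

    import numpy as np

    f = lambda t: np.cos(3 * t)          # a continuous function on [a, b]
    a, b = -1.0, 1.0
    x = np.linspace(a, b, 400)

    for n in (1, 3, 5, 9):
        coeffs = np.polyfit(x, f(x), deg=n)                   # degree-n polynomial fit
        sup_err = np.max(np.abs(np.polyval(coeffs, x) - f(x)))
        print("degree", n, "sup error ~", round(sup_err, 5))  # decreases with n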

SLIDE 15

Stone-Weierstrass theorem II

Let K ⊂ Rd be some bounded set. Suppose there is a collection of functions A such that:

  • A is an algebra: closed under addition, scalar multiplication, and multiplication.
  • A does not vanish on K: for any x ∈ K, there is some h ∈ A with h(x) ≠ 0.
  • A separates points in K: for any x ≠ y ∈ K, there is some h ∈ A with h(x) ≠ h(y).

Then for any continuous function f : K → R and any ε > 0, there is some h ∈ A with

    sup_{x ∈ K} |f(x) − h(x)| ≤ ε.

SLIDE 16

Example: exponentiated linear functions

For domain K = Rd, let A be all linear combinations of {e^(w·x+b) : w ∈ Rd, b ∈ R}.

1 Is an algebra. 2 Does not vanish. 3 Separates points.

SLIDE 17

Variation: RBF kernels

For domain K = Rd, and any σ > 0, let A be all linear combinations of {e^(−‖x−u‖²/σ²) : u ∈ Rd}. Any continuous function is approximated arbitrarily well by A.

SLIDE 18

A class of activation functions

For domain K = Rd, let A be all linear combinations of

    {σ(w · x + b) : w ∈ Rd, b ∈ R}

where σ : R → R is continuous and non-decreasing with

    σ(z) → 1 as z → ∞  and  σ(z) → 0 as z → −∞.

This also satisfies the conditions of the approximation result.
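
A rough one-dimensional sketch of what such an approximation can look like (the target function, the random choice of w and b, and the least-squares fit of the coefficients are all illustrative choices, not part of the result):

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda t: np.sin(2 * t)                 # target continuous function
    x = np.linspace(-3, 3, 500)

    # a dictionary of sigmoids sigma(w x + b) with randomly chosen w, b
    m = 50
    w = rng.normal(scale=3.0, size=m)
    b = rng.normal(scale=3.0, size=m)
    features = 1.0 / (1.0 + np.exp(-(np.outer(x, w) + b)))   # shape (500, m)

    # coefficients of the linear combination, fit by least squares
    coeffs, *_ = np.linalg.lstsq(features, f(x), rcond=None)
    approx = features @ coeffs
    print("sup error ~", np.max(np.abs(approx - f(x))))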

SLIDE 19

Outline

1 Architecture 2 Expressivity 3 Learning

SLIDE 20

Learning a net: the loss function

Classification problem with k labels.

  • Parameters of entire net: W
  • For any input x, net computes probabilities of labels:

PrW (label = j|x)

SLIDE 21

Learning a net: the loss function

Classification problem with k labels.

  • Parameters of entire net: W
  • For any input x, net computes probabilities of labels:

PrW (label = j|x)

  • Given data set (x(1), y(1)), . . . , (x(n), y(n)), loss function:

    L(W) = − Σ_{i=1..n} ln PrW(y(i) | x(i))

(also called cross-entropy).
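
A small NumPy sketch of this loss (the scores and labels below are made-up values):

    import numpy as np

    def softmax(scores):
        e = np.exp(scores - scores.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def cross_entropy(scores, labels):
        # scores: n x k outputs of the net; labels: n integers in {0, ..., k-1}
        probs = softmax(scores)
        # L(W) = - sum_i ln Pr_W(y(i) | x(i))
        return -np.sum(np.log(probs[np.arange(len(labels)), labels]))

    scores = np.array([[2.0, 0.1, -1.0],
                       [0.3, 1.5,  0.2]])
    labels = np.array([0, 1])
    print(cross_entropy(scores, labels))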

SLIDE 22

Nature of the loss function

[Figure: two plots of the loss L(w) against a parameter w]

SLIDE 23

Variants of gradient descent

Initialize W and then repeatedly update.

1 Gradient descent

Each update involves the entire training set.

2 Stochastic gradient descent

Each update involves a single data point.

3 Mini-batch stochastic gradient descent

Each update involves a modest, fixed number of data points.
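
A toy sketch of the mini-batch update (the least-squares problem, step size, and batch size below are illustrative assumptions, not from the slides); setting batch_size to n recovers gradient descent and setting it to 1 recovers stochastic gradient descent:

    import numpy as np

    rng = np.random.default_rng(0)

    # toy problem: L(W) = sum_i (x(i) . W - y(i))^2
    X = rng.normal(size=(200, 3))
    Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

    def grad_L(W, Xb, Yb):
        # gradient of the squared error over one batch
        return 2 * Xb.T @ (Xb @ W - Yb)

    W = np.zeros(3)
    eta, batch_size = 0.005, 32        # batch_size = 200 -> gradient descent; 1 -> SGD

    for epoch in range(20):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            W -= eta * grad_L(W, X[idx], Y[idx])

    print(W)    # close to the true coefficients [1.0, -2.0, 0.5]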

SLIDE 24

Derivative of the loss function

Update for a specific parameter: derivative of loss function wrt that parameter.

x → h(1) → h(2) → · · · → h(ℓ) → y

SLIDE 25

Chain rule

1 Suppose h(x) = g(f(x)), where x ∈ R and f, g : R → R.

Then: h′(x) = g′(f(x)) f′(x)

SLIDE 26

Chain rule

1 Suppose h(x) = g(f(x)), where x ∈ R and f, g : R → R.

Then: h′(x) = g′(f(x)) f′(x)

2 Suppose z is a function of y, which is a function of x.

x → y → z

Then: dz/dx = (dz/dy) · (dy/dx)
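
A quick numerical check of the rule (the particular functions y(x), z(y), and the point x0 are arbitrary choices):

    import math

    def y(x):   return x ** 2          # y as a function of x
    def z(yv):  return math.sin(yv)    # z as a function of y

    x0, eps = 1.3, 1e-6

    # finite-difference derivative of the composition z(y(x)) at x0
    dz_dx = (z(y(x0 + eps)) - z(y(x0 - eps))) / (2 * eps)

    # chain rule: (dz/dy at y(x0)) times (dy/dx at x0)
    dz_dy = math.cos(y(x0))
    dy_dx = 2 * x0

    print(dz_dx, dz_dy * dy_dx)        # agree up to finite-difference error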

SLIDE 27

A single chain of nodes

A neural net with one node per hidden layer:

x = h0 → h1 → h2 → h3 → · · · → hℓ

For a specific input x,

  • hi = σ(wihi−1 + bi)
  • The loss L can be gleaned from hℓ
SLIDE 28

A single chain of nodes

A neural net with one node per hidden layer:

x = h0 → h1 → h2 → h3 → · · · → hℓ

For a specific input x,

  • hi = σ(wihi−1 + bi)
  • The loss L can be gleaned from hℓ

To compute dL/dwi we just need dL/dhi:

    dL/dwi = (dL/dhi) (dhi/dwi) = (dL/dhi) σ′(wi hi−1 + bi) hi−1

SLIDE 29

Backpropagation

  • On a single forward pass, compute all the hi.
  • On a single backward pass, compute dL/dhℓ, . . . , dL/dh1

x = h0 → h1 → h2 → h3 → · · · → hℓ

SLIDE 30

Backpropagation

  • On a single forward pass, compute all the hi.
  • On a single backward pass, compute dL/dhℓ, . . . , dL/dh1

x = h0 → h1 → h2 → h3 → · · · → hℓ

From hi+1 = σ(wi+1 hi + bi+1), we have

    dL/dhi = (dL/dhi+1) (dhi+1/dhi) = (dL/dhi+1) σ′(wi+1 hi + bi+1) wi+1
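
Putting the forward and backward passes together for this chain (a sketch with made-up weights; the squared-error loss used here is an assumption, since the slides do not fix a particular loss):

    def sigma(u):        return max(0.0, u)            # ReLU
    def sigma_prime(u):  return 1.0 if u > 0 else 0.0

    w = [0.9, 1.2, 0.7]          # w1, ..., w_ell
    b = [0.1, 0.2, 0.3]          # b1, ..., b_ell
    x, target = 1.5, 0.5
    ell = len(w)

    # forward pass: h[0] = x, h[i] = sigma(w_i h_{i-1} + b_i)
    h = [x]
    for i in range(ell):
        h.append(sigma(w[i] * h[i] + b[i]))

    loss = (h[ell] - target) ** 2          # assumed loss L

    # backward pass: dL/dh_ell, ..., dL/dh_1, and dL/dw_i along the way
    dL_dh = [0.0] * (ell + 1)
    dL_dw = [0.0] * ell
    dL_dh[ell] = 2 * (h[ell] - target)
    for i in reversed(range(ell)):
        pre = w[i] * h[i] + b[i]                              # pre-activation of layer i+1
        dL_dw[i] = dL_dh[i + 1] * sigma_prime(pre) * h[i]     # dL/dw for the weight into layer i+1
        dL_dh[i] = dL_dh[i + 1] * sigma_prime(pre) * w[i]     # dL/dh_i

    print(loss, dL_dw)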

SLIDE 31

Two-dimensional examples

What kind of net to use for this data?

SLIDE 32

Two-dimensional examples

What kind of net to use for this data?

  • Input layer: 2 nodes
  • One hidden layer: H nodes
  • Output layer: 1 node
  • Input → hidden: linear functions, ReLU activation
  • Hidden → output: linear function, sigmoid activation
SLIDE 33

Example 1

How many hidden units should we use?

SLIDE 34

Example 1

H = 2

SLIDE 35

Example 1

H = 2

SLIDE 36

Example 2

How many hidden units should we use?

SLIDE 37

Example 2

H = 4

SLIDE 38

Example 2

H = 4

SLIDE 39

Example 2

H = 4

SLIDE 40

Example 2

H = 4

SLIDE 41

Example 2

H = 8: overparametrized

SLIDE 42

Example 3

How many hidden units should we use?

SLIDE 43

Example 3

H = 4

SLIDE 44

Example 3

H = 8

SLIDE 45

Example 3

H = 16

SLIDE 46

Example 3

H = 16

SLIDE 47

Example 3

H = 16

SLIDE 48

Example 3

H = 32

SLIDE 49

Example 3

H = 32

SLIDE 50

Example 3

H = 32

SLIDE 51

Example 3

H = 64

SLIDE 52

Example 3

H = 64

SLIDE 53

Example 3

H = 64

SLIDE 54

PyTorch snippet

Declaring and initializing the network:

    d, H = 2, 8
    model = torch.nn.Sequential(
        torch.nn.Linear(d, H),
        torch.nn.ReLU(),
        torch.nn.Linear(H, 1),
        torch.nn.Sigmoid())
    lossfn = torch.nn.BCELoss()

A gradient step:

    ypred = model(x)
    loss = lossfn(ypred, y)
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for param in model.parameters():
            param -= eta * param.grad
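
For context, here is one way the snippet might be wrapped into a complete run; the synthetic data, learning rate, and number of epochs are illustrative choices, not from the slides. It prints the final training loss and accuracy.

    import torch

    # synthetic two-dimensional binary classification data (illustrative only)
    torch.manual_seed(0)
    x = torch.randn(200, 2)
    y = (x[:, 0] * x[:, 1] > 0).float().unsqueeze(1)   # label 1 in two opposite quadrants

    d, H, eta = 2, 8, 0.1
    model = torch.nn.Sequential(
        torch.nn.Linear(d, H),
        torch.nn.ReLU(),
        torch.nn.Linear(H, 1),
        torch.nn.Sigmoid())
    lossfn = torch.nn.BCELoss()

    for epoch in range(2000):
        ypred = model(x)
        loss = lossfn(ypred, y)
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for param in model.parameters():
                param -= eta * param.grad

    accuracy = ((model(x) > 0.5).float() == y).float().mean()
    print(float(loss), float(accuracy))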