

  1. Feedforward neural nets CSE 250B

  2. Outline 1 Architecture 2 Expressivity 3 Learning

  3. The architecture. [Diagram: layers stacked from input $x$ up through hidden layers $h^{(1)}, h^{(2)}, \ldots, h^{(\ell)}$ to output $y$.]

  4. The value at a hidden unit. [Diagram: a unit $h$ with incoming edges from $z_1, z_2, \ldots, z_m$.] How is $h$ computed from $z_1, \ldots, z_m$?

  5. The value at a hidden unit. [Diagram: a unit $h$ with incoming edges from $z_1, z_2, \ldots, z_m$.] How is $h$ computed from $z_1, \ldots, z_m$?
     • $h = \sigma(w_1 z_1 + w_2 z_2 + \cdots + w_m z_m + b)$
     • $\sigma(\cdot)$ is a nonlinear activation function, e.g. "rectified linear": $\sigma(u) = u$ if $u \ge 0$, and $0$ otherwise

  6. Common activation functions
     • Threshold function or Heaviside step function: $\sigma(z) = 1$ if $z \ge 0$, and $0$ otherwise
     • Sigmoid: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
     • Hyperbolic tangent: $\sigma(z) = \tanh(z)$
     • ReLU (rectified linear unit): $\sigma(z) = \max(0, z)$
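
  A small sketch of these four activation functions in NumPy (the function names are my own):

      import numpy as np

      def heaviside(z):          # threshold / Heaviside step function
          return np.where(z >= 0, 1.0, 0.0)

      def sigmoid(z):            # 1 / (1 + e^{-z})
          return 1.0 / (1.0 + np.exp(-z))

      def tanh(z):               # hyperbolic tangent
          return np.tanh(z)

      def relu(z):               # rectified linear unit: max(0, z)
          return np.maximum(0.0, z)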

  7. Why do we need nonlinear activation functions? [Diagram: layers $x, h^{(1)}, h^{(2)}, \ldots, h^{(\ell)}, y$.]
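
  One way to answer the question (a sketch, not from the slides): with no nonlinearity between layers, the whole net collapses to a single affine map of $x$, so depth adds nothing.

      import numpy as np

      rng = np.random.default_rng(0)
      W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
      b1, b2 = rng.standard_normal(4), rng.standard_normal(2)
      x = rng.standard_normal(3)

      # Two "layers" with no activation function...
      two_layers = W2 @ (W1 @ x + b1) + b2

      # ...equal one affine layer with W = W2 W1 and b = W2 b1 + b2.
      one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)

      print(np.allclose(two_layers, one_layer))   # True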

  8. The output layer
     Classification with $k$ labels: want $k$ probabilities summing to 1. [Diagram: output units $y_1, y_2, \ldots, y_k$, each connected to hidden units $z_1, z_2, z_3, \ldots, z_m$.]

  9. The output layer
     Classification with $k$ labels: want $k$ probabilities summing to 1. [Diagram: output units $y_1, y_2, \ldots, y_k$, each connected to hidden units $z_1, z_2, z_3, \ldots, z_m$.]
     • $y_1, \ldots, y_k$ are linear functions of the parent nodes $z_i$.
     • Get probabilities using softmax: $\Pr(\text{label } j) = \dfrac{e^{y_j}}{e^{y_1} + \cdots + e^{y_k}}$.
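
  A sketch of the softmax computation, with the usual max-subtraction for numerical stability (an implementation detail not on the slide):

      import numpy as np

      def softmax(y):
          """Turn scores y_1, ..., y_k into probabilities e^{y_j} / (e^{y_1} + ... + e^{y_k})."""
          e = np.exp(y - np.max(y))   # subtracting max(y) leaves the ratios unchanged
          return e / e.sum()

      print(softmax(np.array([2.0, 1.0, 0.1])))   # k probabilities summing to 1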

  10. The complexity. [Diagram: layers $x, h^{(1)}, h^{(2)}, \ldots, h^{(\ell)}, y$.]

  11. Outline 1 Architecture 2 Expressivity 3 Learning

  12. Approximation capability
     Let $f : \mathbb{R}^d \to \mathbb{R}$ be any continuous function. There is a neural net with a single hidden layer that approximates $f$ arbitrarily well.

  13. Approximation capability
     Let $f : \mathbb{R}^d \to \mathbb{R}$ be any continuous function. There is a neural net with a single hidden layer that approximates $f$ arbitrarily well.
     • The hidden layer may need a lot of nodes.
     • For certain classes of functions:
        • Either: one hidden layer of enormous size
        • Or: multiple hidden layers of moderate size
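
  A tiny concrete instance (my own example, not from the slides): $f(x) = |x|$ is represented exactly by a single hidden layer with two ReLU units, since $|x| = \max(0, x) + \max(0, -x)$.

      import numpy as np

      def relu(z):
          return np.maximum(0.0, z)

      def net(x):
          # hidden layer: two ReLU units with input weights +1 and -1, zero bias
          h = relu(np.array([x, -x]))
          # output layer: linear, with both weights equal to 1
          return h.sum()

      xs = np.linspace(-3, 3, 7)
      print(all(np.isclose(net(x), abs(x)) for x in xs))   # True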

  14. Stone-Weierstrass theorem I
     If $f : [a, b] \to \mathbb{R}$ is continuous, then there is a sequence of polynomials $P_n$ such that $P_n$ has degree $n$ and $\sup_{x \in [a,b]} |P_n(x) - f(x)| \to 0$ as $n \to \infty$.

  15. Stone-Weierstrass theorem II
     Let $K \subset \mathbb{R}^d$ be some bounded set. Suppose there is a collection of functions $A$ such that:
     • $A$ is an algebra: closed under addition, scalar multiplication, and multiplication.
     • $A$ does not vanish on $K$: for any $x \in K$, there is some $h \in A$ with $h(x) \neq 0$.
     • $A$ separates points in $K$: for any $x \neq y \in K$, there is some $h \in A$ with $h(x) \neq h(y)$.
     Then for any continuous function $f : K \to \mathbb{R}$ and any $\epsilon > 0$, there is some $h \in A$ with $\sup_{x \in K} |f(x) - h(x)| \le \epsilon$.

  16. Example: exponentiated linear functions
     For domain $K = \mathbb{R}^d$, let $A$ be all linear combinations of $\{ e^{w \cdot x + b} : w \in \mathbb{R}^d, b \in \mathbb{R} \}$.
     1 Is an algebra.
     2 Does not vanish.
     3 Separates points.
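
  A quick check of these three claims (spelling out what the slide leaves implicit): products of the generators stay in the family, since

      $e^{w \cdot x + b} \, e^{w' \cdot x + b'} = e^{(w + w') \cdot x + (b + b')}$,

  so linear combinations of them are closed under multiplication (and trivially under addition and scalar multiplication), making $A$ an algebra. Each generator is strictly positive, so $A$ does not vanish. And for $x \neq y$, taking $w = x - y$ and $b = 0$ gives $e^{w \cdot x} / e^{w \cdot y} = e^{\|x - y\|^2} \neq 1$, so $A$ separates points.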

  17. Variation: RBF kernels
     For domain $K = \mathbb{R}^d$ and any $\sigma > 0$, let $A$ be all linear combinations of $\{ e^{-\|x - u\|^2 / \sigma^2} : u \in \mathbb{R}^d \}$. Any continuous function is approximated arbitrarily well by $A$.

  18. A class of activation functions
     For domain $K = \mathbb{R}^d$, let $A$ be all linear combinations of $\{ \sigma(w \cdot x + b) : w \in \mathbb{R}^d, b \in \mathbb{R} \}$, where $\sigma : \mathbb{R} \to \mathbb{R}$ is continuous and non-decreasing with $\sigma(z) \to 1$ as $z \to \infty$ and $\sigma(z) \to 0$ as $z \to -\infty$. This also satisfies the conditions of the approximation result.

  19. Outline 1 Architecture 2 Expressivity 3 Learning

  20. Learning a net: the loss function
     Classification problem with $k$ labels.
     • Parameters of entire net: $W$
     • For any input $x$, the net computes probabilities of labels: $\Pr_W(\text{label} = j \mid x)$

  21. Learning a net: the loss function
     Classification problem with $k$ labels.
     • Parameters of entire net: $W$
     • For any input $x$, the net computes probabilities of labels: $\Pr_W(\text{label} = j \mid x)$
     • Given a data set $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})$, the loss function is $L(W) = -\sum_{i=1}^{n} \ln \Pr_W(y^{(i)} \mid x^{(i)})$ (also called cross-entropy).
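
  A sketch of computing this loss for a fixed $W$, assuming a hypothetical function probs(W, x) that returns the net's $k$ output probabilities for input x:

      import numpy as np

      def cross_entropy(W, data, probs):
          """L(W) = - sum_i ln Pr_W(y_i | x_i), where data is a list of (x, y) pairs
          with integer labels y, and probs(W, x) is assumed to return the k label
          probabilities for input x."""
          return -sum(np.log(probs(W, x)[y]) for x, y in data)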

  22. Nature of the loss function. [Two plots of the loss $L(w)$ as a function of a parameter $w$.]

  23. Variants of gradient descent
     Initialize $W$ and then repeatedly update.
     1 Gradient descent: each update involves the entire training set.
     2 Stochastic gradient descent: each update involves a single data point.
     3 Mini-batch stochastic gradient descent: each update involves a modest, fixed number of data points.
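
  A minimal sketch of all three variants as one loop, assuming a hypothetical function grad_L(W, X, y) that returns the gradient of the loss on the given examples; batch_size = n gives plain gradient descent, batch_size = 1 gives SGD, and values in between give mini-batch SGD.

      import numpy as np

      def train(W, X, y, grad_L, eta=0.1, batch_size=32, epochs=10):
          """Generic (mini-batch stochastic) gradient descent; grad_L is hypothetical."""
          n = len(X)
          for _ in range(epochs):
              order = np.random.permutation(n)             # shuffle once per epoch
              for start in range(0, n, batch_size):
                  idx = order[start:start + batch_size]    # the current batch
                  W = W - eta * grad_L(W, X[idx], y[idx])  # one update per batch
          return W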

  24. Derivative of the loss function
     Update for a specific parameter: derivative of the loss function with respect to that parameter. [Diagram: layers $x, h^{(1)}, h^{(2)}, \ldots, h^{(\ell)}, y$.]

  25. Chain rule
     1 Suppose $h(x) = g(f(x))$, where $x \in \mathbb{R}$ and $f, g : \mathbb{R} \to \mathbb{R}$. Then: $h'(x) = g'(f(x)) \, f'(x)$

  26. Chain rule
     1 Suppose $h(x) = g(f(x))$, where $x \in \mathbb{R}$ and $f, g : \mathbb{R} \to \mathbb{R}$. Then: $h'(x) = g'(f(x)) \, f'(x)$
     2 Suppose $z$ is a function of $y$, which is a function of $x$: $x \to y \to z$. Then: $\dfrac{dz}{dx} = \dfrac{dz}{dy} \dfrac{dy}{dx}$

  27. A single chain of nodes
     A neural net with one node per hidden layer: $x = h_0 \to h_1 \to h_2 \to h_3 \to \cdots \to h_\ell$. For a specific input $x$,
     • $h_i = \sigma(w_i h_{i-1} + b_i)$
     • The loss $L$ can be gleaned from $h_\ell$

  28. A single chain of nodes
     A neural net with one node per hidden layer: $x = h_0 \to h_1 \to h_2 \to h_3 \to \cdots \to h_\ell$. For a specific input $x$,
     • $h_i = \sigma(w_i h_{i-1} + b_i)$
     • The loss $L$ can be gleaned from $h_\ell$
     To compute $dL/dw_i$ we just need $dL/dh_i$: $\dfrac{dL}{dw_i} = \dfrac{dL}{dh_i} \dfrac{dh_i}{dw_i} = \dfrac{dL}{dh_i} \, \sigma'(w_i h_{i-1} + b_i) \, h_{i-1}$

  29. Backpropagation
     • On a single forward pass, compute all the $h_i$.
     • On a single backward pass, compute $dL/dh_\ell, \ldots, dL/dh_1$.
     [Diagram: the chain $x = h_0 \to h_1 \to h_2 \to h_3 \to \cdots \to h_\ell$.]

  30. Backpropagation
     • On a single forward pass, compute all the $h_i$.
     • On a single backward pass, compute $dL/dh_\ell, \ldots, dL/dh_1$.
     From $h_{i+1} = \sigma(w_{i+1} h_i + b_{i+1})$, we have $\dfrac{dL}{dh_i} = \dfrac{dL}{dh_{i+1}} \dfrac{dh_{i+1}}{dh_i} = \dfrac{dL}{dh_{i+1}} \, \sigma'(w_{i+1} h_i + b_{i+1}) \, w_{i+1}$
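
  A minimal sketch of the forward and backward passes for this chain, assuming sigmoid activations and, as a stand-in not specified on the slides, the squared loss $L = (h_\ell - y)^2 / 2$:

      import numpy as np

      def sigmoid(u):
          return 1.0 / (1.0 + np.exp(-u))

      def chain_backprop(x, y, w, b):
          """Chain x = h_0 -> ... -> h_l with h_i = sigmoid(w[i-1]*h_{i-1} + b[i-1])."""
          l = len(w)
          h = [x]                                   # forward pass: store every h_i
          for i in range(1, l + 1):
              h.append(sigmoid(w[i - 1] * h[i - 1] + b[i - 1]))
          dL_dh = [0.0] * (l + 1)
          dL_dw, dL_db = [0.0] * l, [0.0] * l
          dL_dh[l] = h[l] - y                       # dL/dh_l for the squared loss
          for i in range(l, 0, -1):                 # backward pass
              s_prime = h[i] * (1 - h[i])           # sigma'(w_i h_{i-1} + b_i) for sigmoid
              dL_dw[i - 1] = dL_dh[i] * s_prime * h[i - 1]   # slide 28's formula
              dL_db[i - 1] = dL_dh[i] * s_prime
              dL_dh[i - 1] = dL_dh[i] * s_prime * w[i - 1]   # slide 30's recursion
          return dL_dw, dL_db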

  31. Two-dimensional examples What kind of net to use for this data?

  32. Two-dimensional examples
     What kind of net to use for this data?
     • Input layer: 2 nodes
     • One hidden layer: H nodes
     • Output layer: 1 node
     • Input → hidden: linear functions, ReLU activation
     • Hidden → output: linear function, sigmoid activation

  33. Example 1 How many hidden units should we use?

  34. Example 1 H = 2

  35. Example 1 H = 2

  36. Example 2 How many hidden units should we use?

  37. Example 2 H = 4

  38. Example 2 H = 4

  39. Example 2 H = 4

  40. Example 2 H = 4

  41. Example 2 H = 8: overparametrized

  42. Example 3 How many hidden units should we use?

  43. Example 3 H = 4

  44. Example 3 H = 8

  45. Example 3 H = 16

  46. Example 3 H = 16

  47. Example 3 H = 16

  48. Example 3 H = 32

  49. Example 3 H = 32

  50. Example 3 H = 32

  51. Example 3 H = 64

  52. Example 3 H = 64

  53. Example 3 H = 64

  54. PyTorch snippet
     Declaring and initializing the network:

         import torch

         d, H = 2, 8
         model = torch.nn.Sequential(
             torch.nn.Linear(d, H),
             torch.nn.ReLU(),
             torch.nn.Linear(H, 1),
             torch.nn.Sigmoid())
         lossfn = torch.nn.BCELoss()

     A gradient step:

         ypred = model(x)
         loss = lossfn(ypred, y)
         model.zero_grad()
         loss.backward()
         with torch.no_grad():
             for param in model.parameters():
                 param -= eta * param.grad
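
  The snippet above shows a single step; a possible way to use it end to end (a sketch with hypothetical toy data, not from the slides) is to wrap that step in a loop, reusing d, model, and lossfn from above:

      import torch

      # Hypothetical toy data: n two-dimensional points with 0/1 labels.
      n = 200
      x = torch.randn(n, d)
      y = (x[:, 0] * x[:, 1] > 0).float().unsqueeze(1)   # arbitrary labeling rule

      eta = 0.1
      for step in range(1000):
          ypred = model(x)                   # forward pass
          loss = lossfn(ypred, y)            # cross-entropy (BCE) loss
          model.zero_grad()                  # clear old gradients
          loss.backward()                    # backpropagation
          with torch.no_grad():
              for param in model.parameters():
                  param -= eta * param.grad  # gradient step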
