
Lecture 18: Anatomy of NN
CS109A Introduction to Data Science
Pavlos Protopapas and Kevin Rader


  1. Lecture 18: Anatomy of NN CS109A Introduction to Data Science Pavlos Protopapas and Kevin Rader

  2. Outline: Anatomy of a NN; Design choices • Activation function • Loss function • Output units • Architecture

  4. Anatomy of an artificial neural network (ANN): a single neuron (node) with input X and output Y.

  5. Anatomy of an artificial neural network (ANN): inside the neuron, the input X undergoes an affine transformation $h$, and the activation produces the output $Y = g(h)$. We will talk later about the choice of activation function. So far we have only talked about the sigmoid as an activation function, but there are other choices.
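
A minimal numpy sketch of this single neuron, assuming a sigmoid activation; the input, weights, and bias values below are made up for illustration and are not from the slide:

    import numpy as np

    def g(h):                            # activation: sigmoid, one possible choice
        return 1.0 / (1.0 + np.exp(-h))

    X = np.array([2.0, -1.0])            # input
    w, b = np.array([0.5, 0.25]), -0.1   # made-up weights and bias
    h = w @ X + b                        # affine transformation
    Y = g(h)                             # activation gives the output
    print(h, Y)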

  6. Anatomy of an artificial neural network (ANN): input layer, hidden layer, output layer. Each hidden node applies an affine transformation followed by the activation $g$: $a_1 = W_1^T X = W_{11} X_1 + W_{12} X_2 + W_{10}$, $h_1 = g(a_1)$; $a_2 = W_2^T X = W_{21} X_1 + W_{22} X_2 + W_{20}$, $h_2 = g(a_2)$. The output function combines the hidden units, $\hat{Y} = f(h_1, h_2)$, and the loss function compares the prediction with the observed value, $\mathcal{L}(\hat{Y}, Y)$. We will talk later about the choice of the output layer and the loss function. So far we have considered the sigmoid as the output and the Bernoulli log-likelihood (binary cross-entropy) as the loss.
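
As a concrete illustration of slide 6, here is a minimal numpy sketch of the forward pass for a two-input, two-hidden-unit network with a sigmoid output and binary cross-entropy loss; all weight values below are made up for the example:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    X = np.array([0.5, -1.2])                     # inputs X_1, X_2
    W = np.array([[0.3, -0.7,  0.1],              # W_11, W_12, W_10 (bias)
                  [1.1,  0.4, -0.2]])             # W_21, W_22, W_20 (bias)
    w_out = np.array([0.8, -0.5, 0.05])           # output-layer weights and bias

    a = W[:, :2] @ X + W[:, 2]                    # affine transformations a_1, a_2
    h = sigmoid(a)                                # activations h_1, h_2

    Y_hat = sigmoid(w_out[:2] @ h + w_out[2])     # output function (sigmoid output unit)
    Y = 1.0                                       # observed label
    loss = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))   # binary cross-entropy
    print(h, Y_hat, loss)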

  7. Anatomy of an artificial neural network (ANN): input layer, hidden layer 1, hidden layer 2, output layer. (Diagram: inputs $X_1, X_2$, weights $W_{ij}$ on each layer, output $Y$.)

  8. Anatomy of an artificial neural network (ANN): input layer, hidden layers 1 through n, output layer. We will talk later about the choice of the number of layers.

  9. Anatomy of an artificial neural network (ANN): input layer; hidden layers 1 through n, each with 3 nodes; output layer.

  10. Anatomy of an artificial neural network (ANN): input layer; hidden layers 1 through n, each with m nodes; output layer. We will talk later about the choice of the number of nodes.

  11. Anatomy of an artificial neural network (ANN): input layer with d inputs; hidden layers 1 through n, each with m nodes; output layer. The number of inputs is specified by the data.
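
To make these architecture choices concrete, here is a minimal sketch, assuming TensorFlow/Keras is available, of a network with d inputs, n hidden layers of m nodes each, and a single output node; the particular values of d, n, and m are arbitrary examples:

    from tensorflow.keras.layers import Dense, Input
    from tensorflow.keras.models import Sequential

    d, n, m = 2, 3, 4   # number of inputs, hidden layers, and nodes per hidden layer (example values)

    model = Sequential(
        [Input(shape=(d,))]                                     # number of inputs is fixed by the data
        + [Dense(m, activation="sigmoid") for _ in range(n)]    # n hidden layers with m nodes each
        + [Dense(1, activation="sigmoid")]                      # single output node
    )
    model.summary()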

  12. Why layers? Representation matters!

  13. Learning Multiple Components

  14. Depth = Repeated Compositions

  15. Neural Networks: hand-written digit recognition on the MNIST data.

  16. Depth = Repeated Compositions

  17. Outline: Anatomy of a NN; Design choices • Activation function • Loss function • Output units • Architecture

  19. Activation function: $h = g(W^T X + c)$. The activation function should • ensure non-linearity • ensure gradients remain large through the hidden units. Common choices are • sigmoid • ReLU, leaky ReLU, generalized ReLU, maxout • softplus • tanh • swish

  21. Beyond Linear Models. Linear models • can be fit efficiently (via convex optimization) • have limited model capacity. Alternative: $f(x) = w^T \phi(x)$, where $\phi$ is a non-linear transform.
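
A minimal numpy sketch of this idea, using a hand-chosen RBF basis as the non-linear transform phi and an ordinary least-squares fit for w; the target function, centers, and bandwidth are made up for illustration:

    import numpy as np

    def phi(x, centers, gamma=1.0):
        # Fixed, manually engineered RBF features: one feature per center
        return np.exp(-gamma * (x[:, None] - centers[None, :]) ** 2)

    x = np.linspace(-3, 3, 200)
    y = np.sin(2 * x)                                  # toy target function
    centers = np.linspace(-3, 3, 10)                   # hand-picked basis centers
    Phi = phi(x, centers)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)        # convex fit: least squares
    y_hat = Phi @ w                                    # f(x) = w^T phi(x)
    print(np.mean((y - y_hat) ** 2))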

  22. Traditional ML. Manually engineer $\phi$ • domain specific, enormous human effort. Or use a generic transform • maps to a higher-dimensional space • kernel methods, e.g. RBF kernels • overfitting: does not generalize well to the test set • cannot encode enough prior information

  23. Deep Learning. Directly learn $\phi$ • $f(x; \theta) = w^T \phi(x; \theta)$, where $\theta$ are the parameters of the transform • $\phi$ defines the hidden layers • non-convex optimization • can encode prior beliefs, generalizes well
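
By contrast, a minimal numpy sketch of a learned transform: here phi is a hidden layer whose parameters theta = (W1, b1) would themselves be fit by (non-convex) optimization rather than chosen by hand; the random values below just stand in for learned ones:

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(10, 1)), np.zeros((10, 1))   # theta: parameters of the transform
    w = rng.normal(size=(1, 10))                           # output weights

    def phi(x):
        # Learned non-linear transform: the hidden layer
        return np.tanh(W1 @ x + b1)

    def f(x):
        return w @ phi(x)                                  # f(x; theta) = w^T phi(x; theta)

    print(f(np.array([[0.5]])))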

  24. Activation function: $h = g(W^T X + c)$. The activation function should • ensure non-linearity • ensure gradients remain large through the hidden units. Common choices are • sigmoid • ReLU, leaky ReLU, generalized ReLU, maxout • softplus • tanh • swish
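
For reference, a minimal numpy sketch (illustrative only) of several of the activation functions listed above; the generalized ReLU, maxout, and swish are sketched after their own slides below:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def relu(x):
        return np.maximum(0.0, x)

    def softplus(x):
        return np.log1p(np.exp(x))          # smooth approximation to ReLU

    x = np.linspace(-3, 3, 7)
    for name, fn in [("sigmoid", sigmoid), ("relu", relu), ("softplus", softplus), ("tanh", np.tanh)]:
        print(name, np.round(fn(x), 3))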

  26. ReLU and Softplus. (Plots of the two activations; the annotation $-b/W$ marks where the affine input $Wx + b$ crosses zero.)

  27. Generalized ReLU. Generalization: for $\alpha_i > 0$, $g(x_i, \alpha_i) = \max(0, x_i) + \alpha_i \min(0, x_i)$.
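
A one-line numpy sketch of the generalized ReLU above; the slope alpha = 0.1 is an arbitrary example (alpha fixed and small gives the leaky ReLU, alpha learned gives the parametric ReLU):

    import numpy as np

    def generalized_relu(x, alpha=0.1):      # requires alpha > 0; 0.1 is just an example slope
        return np.maximum(0.0, x) + alpha * np.minimum(0.0, x)

    print(generalized_relu(np.array([-2.0, -0.5, 0.0, 1.5])))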

  28. Maxout: the max of k linear functions. Directly learn the activation function: $g(x) = \max_{i \in \{1, \dots, k\}} (\alpha_i x + \beta_i)$.
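
A minimal numpy sketch of a maxout unit, assuming k = 3 linear functions with made-up slopes and intercepts:

    import numpy as np

    alphas = np.array([1.0, -1.0, 0.5])      # made-up slopes of the k linear pieces
    betas = np.array([0.0, 0.0, 1.0])        # made-up intercepts

    def maxout(x):
        # Evaluate all k linear functions and take the pointwise maximum
        return np.max(alphas[:, None] * x[None, :] + betas[:, None], axis=0)

    print(maxout(np.array([-2.0, 0.0, 2.0])))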

  29. Swish: A Self-Gated Activation Function. Currently, the most successful and widely used activation function is the ReLU. Swish tends to work better than ReLU on deeper models across a number of challenging datasets: $g(x) = x\,\sigma(x)$.
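
And a one-line numpy sketch of swish, printed next to ReLU at the same points for comparison:

    import numpy as np

    def swish(x):
        return x / (1.0 + np.exp(-x))        # x * sigmoid(x)

    x = np.array([-2.0, -0.5, 0.0, 1.5])
    print(swish(x), np.maximum(0.0, x))      # swish vs. ReLU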

  30. Outline: Anatomy of a NN; Design choices • Activation function • Loss function • Output units • Architecture

  31. Loss Function. Cross-entropy between the training data and the model distribution (i.e., the negative log-likelihood): $J(W) = -\mathbb{E}_{x, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x)$. We do not need to design a separate loss function for each model, and the gradient of the cost function must be large enough to guide learning.
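
A minimal numpy sketch of this loss as the average negative log-likelihood over a small batch, assuming a Bernoulli (binary) output model; the labels and predicted probabilities are made up:

    import numpy as np

    y = np.array([1, 0, 1, 1])                    # observed labels
    p_hat = np.array([0.9, 0.2, 0.7, 0.4])        # made-up model probabilities p_model(y = 1 | x)

    # Negative log-likelihood, averaged over the empirical data distribution
    nll = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
    print(nll)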

  32. Loss Function. Example: sigmoid output + squared loss, $L_{SE} = (y - \hat{y})^2 = (y - \sigma(\phi))^2$. Flat surfaces: the gradient vanishes when the sigmoid saturates.

  33. Cost Function. Example: sigmoid output + cross-entropy loss, $L_{CE}(y, \hat{y}) = -\{\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\}$.
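
To see why the squared loss gives flat surfaces while the cross-entropy does not, here is a small numpy sketch (made-up numbers) comparing the gradients of the two losses with respect to the pre-activation phi when the sigmoid is badly saturated:

    import numpy as np

    def sigmoid(phi):
        return 1.0 / (1.0 + np.exp(-phi))

    y, phi = 1.0, -8.0                       # true label is 1, but phi is large and negative
    y_hat = sigmoid(phi)                     # ~0.0003: confidently wrong, sigmoid saturated

    # d/dphi of the squared loss (y - sigmoid(phi))^2
    grad_se = -2 * (y - y_hat) * y_hat * (1 - y_hat)
    # d/dphi of the cross-entropy loss simplifies to (y_hat - y)
    grad_ce = y_hat - y

    print(grad_se, grad_ce)                  # grad_se is ~0 (flat surface); grad_ce is ~ -1 (still large)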

  34. Design Choices • Activation function • Loss function • Output units • Architecture • Optimizer

  35. Output Units: a table of Output Type | Output Distribution | Output Layer | Cost Function, filled in over the next slides, beginning with the Binary row.

  36. Link function: $X \Rightarrow \phi(X) = W^T X \Rightarrow P(y = 0) = \frac{1}{1 + e^{-\phi(X)}}$. The output unit is $\hat{Y} = \sigma(\phi(X)) = P(y = 0)$.

  37. Output Units. Output Type | Output Distribution | Output Layer | Cost Function: Binary | Bernoulli | Sigmoid | Binary Cross Entropy.

  38. Output Units. Binary | Bernoulli | Sigmoid | Binary Cross Entropy; next output type: Discrete.

  39. Link function for the multi-class problem: the output unit is the softmax, $\hat{Y}_k = \frac{e^{\phi_k(X)}}{\sum_{j=1}^{K} e^{\phi_j(X)}}$.
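
A minimal numpy sketch of the softmax output unit (the max-subtraction is a standard numerical-stability trick, an implementation detail not on the slide):

    import numpy as np

    def softmax(phi):
        z = np.exp(phi - np.max(phi))        # subtracting the max does not change the result
        return z / np.sum(z)

    phi = np.array([2.0, 1.0, -1.0])         # made-up scores phi_k(X), one per class
    p = softmax(phi)
    print(p, p.sum())                        # class probabilities summing to 1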

  40. Output Units. Binary | Bernoulli | Sigmoid | Binary Cross Entropy. Discrete | Multinoulli | Softmax | Cross Entropy.

  41. Output Units. Binary | Bernoulli | Sigmoid | Binary Cross Entropy. Discrete | Multinoulli | Softmax | Cross Entropy. Continuous | Gaussian | Linear | MSE.

  42. Output Units (complete table). Output Type | Output Distribution | Output Layer | Cost Function: Binary | Bernoulli | Sigmoid | Binary Cross Entropy. Discrete | Multinoulli | Softmax | Cross Entropy. Continuous | Gaussian | Linear | MSE. Continuous | Arbitrary | - | GANs.
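
The table maps directly onto output-layer and loss pairs in code. A minimal sketch, assuming TensorFlow/Keras and arbitrary sizes, of the three standard heads (the GAN case needs a full adversarial setup and is omitted):

    from tensorflow.keras.layers import Dense

    # Binary output: Bernoulli distribution -> sigmoid layer, loss="binary_crossentropy"
    binary_head = Dense(1, activation="sigmoid")

    # Discrete output over K classes: Multinoulli -> softmax layer, loss="categorical_crossentropy"
    K = 10                                             # example number of classes
    discrete_head = Dense(K, activation="softmax")

    # Continuous output: Gaussian -> linear layer, loss="mse"
    continuous_head = Dense(1, activation="linear")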

  43. Design Choices • Activation function • Loss function • Output units • Architecture • Optimizer

  44.-50. NN in action (a sequence of figure-only slides).

  51. Universal Approximation Theorem. Think of a neural network as function approximation: the data follow $Y = f(X) + \epsilon$ and the NN provides an estimate $\hat{Y} = \hat{f}(X)$. • One hidden layer is enough to represent an approximation of any function to an arbitrary degree of accuracy. So why go deeper? • A shallow net may need (exponentially) more width • A shallow net may overfit more
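
For reference, one common statement of the theorem (the Cybenko/Hornik form, paraphrased here rather than taken from the slide): for a sigmoidal activation $g$, any continuous function $f$ on a compact set can be approximated by a single hidden layer of $m$ units, i.e. for any $\epsilon > 0$ there exist weights such that

    \left| f(x) - \sum_{j=1}^{m} w_j \, g\!\left(a_j^{\top} x + b_j\right) \right| < \epsilon
    \quad \text{for all } x \text{ in the compact set, for some sufficiently large } m.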

  52. Better Generalization with Depth (Goodfellow 2017)

  53. Large, Shallow Nets Overfit More (Goodfellow 2017)
