
Lecture 19: Anatomy of NN (CS109A Introduction to Data Science)



  1. Lecture 19: Anatomy of NN. CS109A Introduction to Data Science. Pavlos Protopapas, Kevin Rader and Chris Tanner

  2. Outline: Anatomy of a NN; Design choices • Activation function • Loss function • Output units • Architecture

  3. Outline: Anatomy of a NN; Design choices • Activation function • Loss function • Output units • Architecture

  4. Anatomy of artificial neural network (ANN): diagram of a single neuron, with input X, weights W, output Y, and the node in between.

  5. Anatomy of artificial neural network (ANN): inside the node, an affine transformation h of the inputs X (a weighted sum with weights W plus a bias) is followed by an activation Z = g(h). We will talk later about the choice of activation function. So far we have only talked about the sigmoid as an activation function, but there are other choices.

  6. Anatomy of artificial neural network (ANN): input layer, hidden layer, output layer. Each hidden unit computes an affine transformation, h_1 = W_11 X_1 + W_12 X_2 + W_10 and h_2 = W_21 X_1 + W_22 X_2 + W_20, followed by an activation z_1 = g(h_1), z_2 = g(h_2). The output function combines these into ŷ = o(z_1, z_2), which is compared to the target through the loss function ℒ(ŷ, y). We will talk later about the choice of the output layer and the loss function. So far we have considered a sigmoid output and the Bernoulli log-likelihood.
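The forward pass on slide 6 can be written out directly. Below is a minimal NumPy sketch (not from the lecture), assuming a 2-input, 2-hidden-unit network with sigmoid activations, a sigmoid output unit, and binary cross-entropy loss; all function and variable names are illustrative.

```python
import numpy as np

def g(h):
    """Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-h))

def forward(x, W1, b1, w2, b2):
    h = W1 @ x + b1          # affine transformation of the inputs
    z = g(h)                 # hidden-layer activations z_1, z_2
    y_hat = g(w2 @ z + b2)   # output unit (sigmoid for a binary target)
    return y_hat

def loss(y_hat, y):
    """Binary cross-entropy, i.e. the negative Bernoulli log-likelihood."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2])                      # inputs X_1, X_2
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)  # hidden-layer weights and biases
w2, b2 = rng.normal(size=2), 0.0               # output-layer weights and bias
print(loss(forward(x, W1, b1, w2, b2), y=1.0))
```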

  7. Anatomy of artificial neural network (ANN): diagram of a network with an input layer (X_1, X_2), hidden layer 1, hidden layer 2, and an output layer.

  8. Anatomy of artificial neural network (ANN): input layer, hidden layers 1 through n, output layer. We will talk later about the choice of the number of layers.

  9. Anatomy of artificial neural network (ANN): input layer, hidden layers 1 through n with 3 nodes each, output layer.

  10. Anatomy of artificial neural network (ANN): input layer, hidden layers 1 through n with m nodes each, output layer. We will talk later about the choice of the number of nodes.

  11. Anatomy of artificial neural network (ANN): input layer with d inputs, hidden layers 1 through n with m nodes each, output layer. The number of inputs is specified by the data.
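To make these design choices concrete, here is an illustrative sketch (not from the lecture) of how they typically appear in code, using tf.keras as an example library; the values of d and m below are assumptions for the example.

```python
import tensorflow as tf

d = 10   # number of inputs: fixed by the data (assumed value here)
m = 3    # nodes per hidden layer: a design choice

# Two hidden layers and a single sigmoid output unit, mirroring the diagrams above.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(d,)),
    tf.keras.layers.Dense(m, activation='relu'),    # hidden layer 1
    tf.keras.layers.Dense(m, activation='relu'),    # hidden layer 2
    tf.keras.layers.Dense(1, activation='sigmoid'), # output layer
])
model.summary()
```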

  12. Anatomy of artificial neural network (ANN): diagram labelling the input layer, hidden layer 1, hidden layer 2, and output layer.

  13. Anatomy of artificial neural network (ANN): diagram labelling the input layer, hidden layer 1, hidden layer 2, and output layer.

  14. Why layers? Representation matters!

  15. Learning Multiple Components

  16. Depth = Repeated Compositions

  17. Neural Networks. Hand-written digit recognition: MNIST data

  18. Depth = Repeated Compositions

  19. Beyond Linear Models. Linear models: • Can be fit efficiently (via convex optimization) • Limited model capacity. Alternative: f(x) = w^T φ(x), where φ is a non-linear transform.

  20. Traditional ML. Manually engineer φ: • Domain specific, enormous human effort. Generic transform: • Maps to a higher-dimensional space • Kernel methods, e.g. RBF kernels • Overfitting: does not generalize well to the test set • Cannot encode enough prior information.

  21. Deep Learning. Directly learn φ: • f(x; θ) = w^T φ(x; θ) • φ(x; θ) is an automatically learned representation of x • For deep networks, φ is the function learned by the hidden layers of the network • θ are the learned weights • Non-convex optimization • Can encode prior beliefs, generalizes well.
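As a contrast with the fixed transforms of slides 19 and 20, the sketch below (an illustration, not from the lecture) fits a linear model f(x) = w^T φ(x) on a manually engineered polynomial φ; in deep learning, φ would instead be computed by the hidden layers and its parameters θ learned along with w.

```python
import numpy as np

def phi(x):
    """A fixed, manually engineered non-linear transform: polynomial features."""
    return np.array([1.0, x, x**2, x**3])

def f(x, w):
    return w @ phi(x)   # f(x) = w^T phi(x)

# Fit only w by least squares on toy data; phi itself never changes.
xs = np.linspace(-1, 1, 50)
ys = np.sin(3 * xs)
Phi = np.stack([phi(x) for x in xs])
w, *_ = np.linalg.lstsq(Phi, ys, rcond=None)
print(f(0.5, w))
```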

  22. Outline: Anatomy of a NN; Design choices • Activation function • Loss function • Output units • Architecture

  23. Outline: Anatomy of a NN; Design choices • Activation function • Loss function • Output units • Architecture

  24. Activation function, h = g(W^T X + b). The activation function should: • Provide non-linearity • Ensure gradients remain large through the hidden unit. Common choices are: • sigmoid • tanh • ReLU, leaky ReLU, generalized ReLU, Maxout • softplus • swish

  25. Activation function, h = g(W^T X + b). The activation function should: • Provide non-linearity • Ensure gradients remain large through the hidden unit. Common choices are: • sigmoid • tanh • ReLU, leaky ReLU, generalized ReLU, Maxout • softplus • swish

  26. Activation function, h = g(W^T X + b). The activation function should: • Provide non-linearity • Ensure gradients remain large through the hidden unit. Common choices are: • sigmoid • tanh • ReLU, leaky ReLU, generalized ReLU, Maxout • softplus • swish

  27. Sigmoid (aka logistic): y = 1 / (1 + e^(-x)). The derivative is close to zero over much of the domain, which leads to “vanishing gradients” in backpropagation.

  28. Hyperbolic tangent (tanh): y = (e^x − e^(−x)) / (e^x + e^(−x)). Same problem of “vanishing gradients” as the sigmoid.
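A quick numerical check of the vanishing-gradient point for both the sigmoid (slide 27) and tanh (slide 28); this is an illustrative sketch, not lecture code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
dsigmoid = sigmoid(x) * (1 - sigmoid(x))   # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
dtanh = 1 - np.tanh(x) ** 2                # tanh'(x) = 1 - tanh(x)^2

# Both derivatives collapse toward zero once |x| is even moderately large.
print(dsigmoid)  # ~[4.5e-05, 6.6e-03, 0.25, 6.6e-03, 4.5e-05]
print(dtanh)     # ~[8.2e-09, 1.8e-04, 1.00, 1.8e-04, 8.2e-09]
```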

  29. Rectified linear unit (ReLU): y = max(0, x). Two major advantages: 1. No vanishing gradient when x > 0. 2. Provides sparsity (a form of regularization), since y = 0 when x < 0.
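A minimal sketch of the ReLU and its gradient (illustrative, not from the lecture), showing both advantages above.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Subgradient: 0 for x <= 0, 1 for x > 0.
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]  -> outputs are exactly zero for x < 0 (sparsity)
print(relu_grad(x))  # [0.  0.  0.  1.  1. ]  -> gradient is 1, not vanishing, when x > 0
```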

  30. Leaky ReLU: y = max(0, x) + α min(0, x), where α takes a small value. • Tries to fix the “dying ReLU” problem: the derivative is non-zero everywhere. • Some people report success with this form of activation function, but the results are not always consistent.

  31. Generalized ReLU. Generalization: for α_i > 0, g(x_i, α_i) = max(0, x_i) + α_i min(0, x_i).
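The leaky and generalized ReLUs of slides 30 and 31 form one family; below is a small sketch (illustrative, not from the lecture) where α = 0 recovers the plain ReLU and a small fixed α gives the leaky variant (a learned per-unit α_i would give the parametric version).

```python
import numpy as np

def generalized_relu(x, alpha):
    """max(0, x) + alpha * min(0, x)."""
    return np.maximum(0.0, x) + alpha * np.minimum(0.0, x)

x = np.array([-2.0, -0.5, 1.0])
print(generalized_relu(x, alpha=0.0))   # plain ReLU:  [ 0.     0.     1. ]
print(generalized_relu(x, alpha=0.01))  # leaky ReLU:  [-0.02  -0.005  1. ]
```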

  32. Softplus: y = log(1 + e^x). The softplus is a smooth approximation of the rectifier, and its derivative is the logistic sigmoid (itself a smooth approximation of the rectifier's derivative).
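A small numerical check (illustrative, not lecture code) that the derivative of softplus is the logistic sigmoid.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))   # log(1 + e^x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Compare a central-difference derivative of softplus with the sigmoid.
x = np.linspace(-3, 3, 7)
eps = 1e-6
numeric_grad = (softplus(x + eps) - softplus(x - eps)) / (2 * eps)
print(np.allclose(numeric_grad, sigmoid(x), atol=1e-5))  # True
```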

  33. Maxout: the max of k linear functions; it directly learns the activation function. g(x) = max_{i ∈ {1, …, k}} (w_i^T x + b_i)
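A sketch of a single maxout unit (illustrative, not from the lecture), with k affine pieces whose weights would normally be learned.

```python
import numpy as np

def maxout(x, W, b):
    """W has shape (k, d) and b has shape (k,); returns max_i (w_i^T x + b_i)."""
    return np.max(W @ x + b)

rng = np.random.default_rng(1)
k, d = 3, 2
W, b = rng.normal(size=(k, d)), rng.normal(size=k)  # would be learned in practice
print(maxout(np.array([0.5, -1.0]), W, b))
```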

  34. Swish: a self-gated activation function, g(x) = x σ(x). Currently, the most successful and widely used activation function is the ReLU; Swish tends to work better than ReLU on deeper models across a number of challenging datasets.
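A minimal sketch of swish (illustrative, not lecture code): the input multiplied by its own sigmoid gate, which makes it smooth and slightly non-monotonic for negative inputs.

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))   # x * sigmoid(x)

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(swish(x))  # ~[-0.072, -0.269, 0.0, 0.731, 3.928]
```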

  35. Outline: Anatomy of a NN; Design choices • Activation function • Loss function • Output units • Architecture

  36. Loss Function. Likelihood for a given point: p(y_i | W; x_i). Assuming independence, the likelihood for all measurements is L(W; X, Y) = p(Y | W; X) = ∏_i p(y_i | W; x_i). Maximize the likelihood, or equivalently maximize the log-likelihood: log L(W; X, Y) = Σ_i log p(y_i | W; x_i). Turn this into a loss function: ℒ(W; X, Y) = −log L(W; X, Y).

  37. Loss Function. We do not need to design separate loss functions if we follow this simple procedure. Examples: • If the distribution is Normal, the likelihood is p(y_i | W; x_i) = (1/√(2πσ²)) e^(−(y_i − ŷ_i)² / (2σ²)), which gives the MSE loss ℒ(W; X, Y) = Σ_i (y_i − ŷ_i)². • If the distribution is Bernoulli, the likelihood is p(y_i | W; x_i) = p_i^{y_i} (1 − p_i)^{1−y_i}, which gives the cross-entropy loss ℒ(W; X, Y) = −Σ_i [y_i log p_i + (1 − y_i) log(1 − p_i)].
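The two resulting losses in code form, as an illustrative sketch (not from the lecture): the Normal likelihood gives, up to constants, the sum of squared errors, and the Bernoulli likelihood gives the binary cross-entropy.

```python
import numpy as np

def mse_loss(y, y_hat):
    """Negative Normal log-likelihood, dropping additive and multiplicative constants."""
    return np.sum((y - y_hat) ** 2)

def cross_entropy_loss(y, p):
    """Negative Bernoulli log-likelihood (binary cross-entropy)."""
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])   # observed binary targets
p = np.array([0.9, 0.2, 0.7])   # predicted values / probabilities
print(mse_loss(y, y_hat=p))
print(cross_entropy_loss(y, p=p))
```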

  38. Design Choices: Activation function • Loss function • Output units • Architecture • Optimizer

  39. Output Units. Table columns: Output Type, Output Distribution, Output Layer, Loss Function. First row, Output Type: Binary.

  40. Output Units. Output Type: Binary; Output Distribution: Bernoulli.

  41. Output Units. Output Type: Binary; Output Distribution: Bernoulli; Loss Function: Binary Cross-Entropy.

  42. Output Units. Output Type: Binary; Output Distribution: Bernoulli; Output Layer: ?; Loss Function: Binary Cross-Entropy.
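The slide leaves the output layer as a question; for a binary target with a Bernoulli distribution, the conventional choice is a single sigmoid unit paired with binary cross-entropy. The sketch below is illustrative, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

z = np.array([2.0, -1.0, 0.3])   # pre-activations from the last hidden layer
p = sigmoid(z)                   # sigmoid output layer: probabilities in (0, 1)
y = np.array([1.0, 0.0, 1.0])    # binary targets
print(binary_cross_entropy(y, p))
```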
