
Neural Networks: Representations



  1. Neural Networks Representations

  2. Learning in the net • Problem: Given a collection of input-output pairs, learn the function

  3. Learning for classification • When the net must learn to classify.. – Learn the classification boundaries that separate the training instances

  4. Learning for classification • In reality – In general the classes are not really cleanly separated • So what is the function we learn?

  5. In reality: Trivial linear example • Two-dimensional example – Blue dots (on the floor) on the “red” side – Red dots (suspended at Y=1) on the “blue” side – No line will cleanly separate the two colors

  6. Non-linearly separable data: 1-D example • One-dimensional example for visualization – All (red) dots at Y=1 represent instances of class Y=1 – All (blue) dots at Y=0 are from class Y=0 – The data are not linearly separable • In this 1-D example, a linear separator is a threshold • No threshold will cleanly separate red and blue dots
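
As a concrete (hypothetical) illustration of why no single threshold works for overlapping classes, the sketch below generates overlapping 1-D data and scans every candidate threshold; the best achievable accuracy stays below 100%. The data and the search routine are illustrative assumptions, not part of the slides.

```python
import numpy as np

# Hypothetical 1-D data: the two classes overlap along x, as in the figure.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.0, 1.0, 50),   # class y=0
                    rng.normal(+1.0, 1.0, 50)])  # class y=1
y = np.concatenate([np.zeros(50), np.ones(50)])

def best_threshold_accuracy(x, y):
    """Try every midpoint between sorted samples as the rule 'x > t  ->  y=1'."""
    xs = np.sort(x)
    candidates = (xs[:-1] + xs[1:]) / 2
    best = 0.0
    for t in candidates:
        acc = np.mean((x > t) == y)
        best = max(best, acc, 1 - acc)   # also allow the flipped rule 'x <= t -> y=1'
    return best

print("best achievable accuracy with a single threshold:",
      best_threshold_accuracy(x, y))    # < 1.0 because the classes overlap
```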

  7. Undesired Function • One-dimensional example for visualization – All (red) dots at Y=1 represent instances of class Y=1 – All (blue) dots at Y=0 are from class Y=0 – The data are not linearly separable • In this 1-D example, a linear separator is a threshold • No threshold will cleanly separate red and blue dots

  8. What if? • One-dimensional example for visualization – All (red) dots at Y=1 represent instances of class Y=1 – All (blue) dots at Y=0 are from class Y=0 – The data are not linearly separable • In this 1-D example, a linear separator is a threshold • No threshold will cleanly separate red and blue dots

  9. What if? • At one value of X there are 90 instances of the red class (y=1) and 10 of the blue class (y=0). What must the value of the function be at this X? – 1, because red dominates? – 0.9, the average (90/100)?

  10. What if? • What must the value of the function be at this X? Estimate: – 1, because red dominates? – 0.9, the average? A real-valued estimate like this is potentially much more useful than a simple 1/0 decision, and also potentially more realistic

  11. What if? • Should an infinitesimal nudge of the red dot change the function estimate entirely? If not, how do we estimate $P(y=1\mid x)$? (The positions of the red and blue X values are different.) • What must the value of the function be at this X? Estimate: – 1, because red dominates? – 0.9, the average? A real-valued estimate is potentially much more useful than a simple 1/0 decision, and also potentially more realistic

  12. The probability of y=1 • Consider this differently: at each point, look at a small window around that point • Plot the average value within the window – This is an approximation of the probability of Y=1 at that point
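
A minimal sketch of this windowed-average idea, with synthetic 1-D data and a hand-picked window width (both are assumptions, not values from the slides): the fraction of y=1 points inside the window around each location approximates P(y=1|x).

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 1-D data: class 1 becomes more likely as x grows.
x = rng.uniform(-3, 3, 500)
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-2 * x))).astype(float)

def windowed_probability(x, y, centers, width=0.5):
    """Average of y inside a window around each center ~ P(y=1 | x=center)."""
    probs = []
    for c in centers:
        mask = np.abs(x - c) <= width / 2
        probs.append(y[mask].mean() if mask.any() else np.nan)
    return np.array(probs)

centers = np.linspace(-3, 3, 13)
print(np.round(windowed_probability(x, y, centers), 2))
# The printed values rise smoothly from ~0 to ~1, tracing out P(y=1|x).
```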

  25. The logistic regression model $P(y=1\mid x) = \dfrac{1}{1+e^{-(w_0+w_1 x)}}$ • Class 1 becomes increasingly probable going left to right – Very typical in many problems

  26. The logistic perceptron $P(y\mid x) = \dfrac{1}{1+e^{-(w_0+w_1 x)}}$ • A sigmoid perceptron with a single input models the a posteriori probability of the class given the input
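
As a sketch (the weights below are illustrative placeholders, not fitted values), this model is just a sigmoid applied to an affine function of the input:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_y1_given_x(x, w0, w1):
    """Posterior probability P(y=1 | x) under the logistic model."""
    return sigmoid(w0 + w1 * x)

# Illustrative weights: class 1 becomes increasingly probable left to right.
print(p_y1_given_x(np.array([-2.0, 0.0, 2.0]), w0=0.0, w1=1.5))
# -> roughly [0.05, 0.5, 0.95]
```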

  27. Non-linearly separable data • Two-dimensional example – Blue dots (on the floor) on the “red” side – Red dots (suspended at Y=1) on the “blue” side – No line will cleanly separate the two colors

  28. Logistic regression • When x is a 2-D variable: $P(y=1\mid x_1,x_2) = \dfrac{1}{1+e^{-(w_0+w_1 x_1+w_2 x_2)}}$, decision: is $y > 0.5$? • This is the perceptron with a sigmoid activation – It actually computes the probability that the input belongs to class 1 – Decision boundaries may be obtained by comparing the probability to a threshold • These boundaries will be lines (hyperplanes in higher dimensions) • The sigmoid perceptron is a linear classifier
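
A sketch of the 2-D case with illustrative weights: thresholding the sigmoid output at 0.5 amounts to testing the sign of $w_0 + w_1 x_1 + w_2 x_2$, so the decision boundary is a line.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w0, w):
    """X: (N, 2) points. Returns P(y=1 | x) and the 0/1 decision at threshold 0.5."""
    p = sigmoid(w0 + X @ w)
    return p, (p > 0.5).astype(int)

w0, w = -1.0, np.array([2.0, -1.0])        # illustrative weights, not fitted values
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [2.0, 1.0]])
p, decision = predict(X, w0, w)
print(p, decision)
# p > 0.5 exactly when 2*x1 - x2 - 1 > 0: the boundary is the line x2 = 2*x1 - 1.
```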

  29. Estimating the model $P(y=1\mid x) = f(x) = \dfrac{1}{1+e^{-(w_0+w_1 x)}}$ • Given the training data (many $(x, y)$ pairs, represented by the dots), estimate $w_0$ and $w_1$ for the curve

  30. Estimating the model • Easier to represent using a $y = +1/-1$ notation: $P(y=1\mid x) = \dfrac{1}{1+e^{-(w_0+w_1 x)}}$ and $P(y=-1\mid x) = \dfrac{1}{1+e^{(w_0+w_1 x)}}$, so $P(y\mid x) = \dfrac{1}{1+e^{-y(w_0+w_1 x)}}$
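
Why a single expression covers both classes (using only the slide's own formulas plus the identity $1-\sigma(z)=\sigma(-z)$):

$$
P(y=-1\mid x) \;=\; 1-\frac{1}{1+e^{-(w_0+w_1 x)}} \;=\; \frac{e^{-(w_0+w_1 x)}}{1+e^{-(w_0+w_1 x)}} \;=\; \frac{1}{1+e^{(w_0+w_1 x)}},
\qquad\text{hence}\qquad
P(y\mid x)=\frac{1}{1+e^{-y(w_0+w_1 x)}}\ \text{ for } y=\pm 1.
$$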

  31. Estimating the model • Given: training data pairs $(x_1, y_1), \ldots, (x_N, y_N)$ • The $x_i$s are vectors, the $y_i$s are binary (0/1) class values • Total probability of the data: $P = \prod_i P(y_i \mid x_i)$

  32. Estimating the model • Likelihood: $L(w_0, w_1) = \prod_i P(y_i \mid x_i) = \prod_i \dfrac{1}{1+e^{-y_i(w_0+w_1 x_i)}}$ • Log likelihood: $\log L(w_0, w_1) = -\sum_i \log\left(1+e^{-y_i(w_0+w_1 x_i)}\right)$

  33. Maximum Likelihood Estimate $\hat{w}_0, \hat{w}_1 = \arg\max_{w_0, w_1} \log L(w_0, w_1)$ • Equals (note argmin rather than argmax) $\hat{w}_0, \hat{w}_1 = \arg\min_{w_0, w_1} \sum_i \log\left(1+e^{-y_i(w_0+w_1 x_i)}\right)$ • Identical to minimizing the KL divergence between the desired output and the actual output • Cannot be solved in closed form; needs gradient descent
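
A minimal gradient-descent sketch for this estimate. The synthetic data, the hand-picked learning rate, and the use of the average rather than the sum of per-sample losses (same minimizer) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic 1-D training data with labels in {+1, -1}.
x = rng.uniform(-3, 3, 200)
y = np.where(rng.uniform(size=200) < 1 / (1 + np.exp(-(0.5 + 2.0 * x))), 1.0, -1.0)

def loss_and_grad(w0, w1, x, y):
    """Average logistic loss (1/N) * sum_i log(1 + exp(-y_i (w0 + w1 x_i))) and its gradient."""
    z = y * (w0 + w1 * x)
    loss = np.mean(np.log1p(np.exp(-z)))
    s = -y / (1.0 + np.exp(z))            # d(per-sample loss) / d(w0 + w1 x)
    return loss, np.mean(s), np.mean(s * x)

w0, w1, lr = 0.0, 0.0, 0.5                # learning rate is a hand-picked assumption
for step in range(5000):                  # plain (batch) gradient descent
    loss, g0, g1 = loss_and_grad(w0, w1, x, y)
    w0 -= lr * g0
    w1 -= lr * g1

print(f"estimated w0 = {w0:.2f}, w1 = {w1:.2f}, final average loss = {loss:.3f}")
```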

  34. So what about this one? • Non-linear classifiers..

  35. First consider the separable case.. • When the net must learn to classify..

  36. First consider the separable case.. • For a “sufficient” net

  37. First consider the separable case.. • For a “sufficient” net • This final perceptron is a linear classifier

  38. First consider the separable case.. • For a “sufficient” net • This final perceptron is a linear classifier over the output of the penultimate layer

  39. First consider the separable case.. • For perfect classification, the output of the penultimate layer must be linearly separable

  40. First consider the separable case.. • The rest of the network may be viewed as a transformation that maps data from non-linearly separable classes to linearly separable features – We can now attach any linear classifier above it for perfect classification – Need not be a perceptron – In fact, slapping an SVM on top of the features may be more generalizable!

  41. First consider the separable case.. • The rest of the network may be viewed as a transformation that maps data from non-linearly separable classes to linearly separable features – We can now attach any linear classifier above it for perfect classification – Need not be a perceptron – In fact, for binary classification an SVM on top of the features may be more generalizable!
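
A hedged sketch of this idea with scikit-learn: `penultimate_features` below is a stand-in (with fixed random weights) for whatever extracts the last hidden layer of a trained network, and a linear SVM is fit on those features. The data, weights, and dimensions are all illustrative assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))                          # hypothetical 2-D inputs
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)    # labels that no single line separates

# Stand-in for the trained network up to its penultimate layer: one hidden layer
# with fixed random weights. In practice these weights come from training the net.
W1 = rng.normal(size=(2, 16))
b1 = rng.normal(size=16)

def penultimate_features(X):
    """ReLU activations of the last hidden layer: the (ideally linearly separable) features."""
    return np.maximum(0.0, X @ W1 + b1)

Z = penultimate_features(X)
svm = LinearSVC(max_iter=10000)   # any linear classifier can sit on top of the features
svm.fit(Z, y)
print("linear SVM accuracy on hidden-layer features:", svm.score(Z, y))
```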
