Neural Networks Representations, Fall 2017: Learning in the net

  1. Neural Networks Representations Fall 2017

  2. Learning in the net
  • Problem: Given a collection of input-output pairs, learn the function

  3. Learning for classification
  • When the net must learn to classify...
  [figure: data in the (x1, x2) plane]

  4. Learning for classification
  • In reality:
    – In general, the classes are not really cleanly separated
  • So what is the function we learn?

  5. In reality: Trivial linear example
  • Two-dimensional example
    – Blue dots (on the floor) on the "red" side
    – Red dots (suspended at Y=1) on the "blue" side
    – No line will cleanly separate the two colors

  6. Non-linearly separable data: 1-D example
  • One-dimensional example for visualization
    – All (red) dots at Y=1 represent instances of class Y=1
    – All (blue) dots at Y=0 are from class Y=0
    – The data are not linearly separable
  • In this 1-D example, a linear separator is a threshold
  • No threshold will cleanly separate red and blue dots

  7. Undesired Function
  • Same one-dimensional setup as slide 6: the data are not linearly separable, and no threshold will cleanly separate the red and blue dots

  8. What if?
  • Same setup as slide 6: no threshold will cleanly separate the red and blue dots

  9. What if?
  [figure: at one value of x there are 90 red (Y=1) instances and 10 blue (Y=0) instances]
  • What must the value of the function be at this X?
    – 1, because red dominates?
    – 0.9, the average?

  10. What if?
  • What must the value of the function be at this X? Estimate: ≈ P(1|X)
    – 1, because red dominates?
    – 0.9, the average?
  • A probability estimate is potentially much more useful than a simple 1/0 decision, and also potentially more realistic

  11. What if?
  • What must the value of the function be at this X? Estimate: ≈ P(1|X)
  • Should an infinitesimal nudge of the red dot change the function estimate entirely?
  • If not, how do we estimate P(1|X)? (since the positions of the red and blue X values are different)

  12. The probability of y=1
  • Consider this differently: at each point, look at a small window around that point
  • Plot the average value within the window
    – This is an approximation of the probability of Y=1 at that point

  13.–24. The probability of y=1 (slides 13–24 repeat the same bullets as slide 12 while the figure is built up point by point)
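
A minimal sketch of the windowed estimate described on slides 12–24, assuming 1-D inputs and 0/1 labels; the function name, window width, and toy data below are illustrative rather than taken from the slides.

```python
import numpy as np

def windowed_prob_estimate(x, y, query_points, half_width=0.5):
    """Estimate P(y=1 | x) at each query point as the average of the 0/1
    labels of all training samples whose x lies inside a small window."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    estimates = []
    for q in query_points:
        in_window = np.abs(x - q) <= half_width
        # Fraction of class-1 samples in the window, or NaN if the window is empty
        estimates.append(y[in_window].mean() if in_window.any() else np.nan)
    return np.array(estimates)

# Example: 90 red (y=1) and 10 blue (y=0) samples near x = 2.0, as on slide 9
x_train = np.concatenate([np.random.normal(2.0, 0.1, 90),
                          np.random.normal(2.0, 0.1, 10)])
y_train = np.concatenate([np.ones(90), np.zeros(10)])
print(windowed_prob_estimate(x_train, y_train, [2.0]))   # roughly 0.9
```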

  25. The logistic regression model
  • P(y=1 | x) = 1 / (1 + e^(−(w0 + w1·x)))
  • Class 1 becomes increasingly probable going left to right
    – Very typical in many problems
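
The closed-form model above can be evaluated directly; a small sketch, with weight values chosen only for illustration.

```python
import numpy as np

def p_y1_given_x(x, w0, w1):
    """Logistic model: P(y=1 | x) = 1 / (1 + exp(-(w0 + w1*x)))."""
    return 1.0 / (1.0 + np.exp(-(w0 + w1 * x)))

# Class 1 becomes increasingly probable from left to right
for x in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print(x, p_y1_given_x(x, w0=0.0, w1=2.0))
```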

  26. The logistic perceptron
  • P(y=1 | x) = 1 / (1 + e^(−(w0 + w1·x)))
  • A sigmoid perceptron with a single input models the a posteriori probability of the class given the input
  [figure: single-input perceptron with bias w0, weight w1, and a sigmoid output y]

  27. Non-linearly separable data
  • Two-dimensional example
    – Blue dots (on the floor) on the "red" side
    – Red dots (suspended at Y=1) on the "blue" side
    – No line will cleanly separate the two colors

  28. Logistic regression
  • When X is a 2-D variable: P(Y=1 | X) = 1 / (1 + e^(−(Σ_i w_i·x_i + w0)))
  • Decision: y > 0.5?
  • This is the perceptron with a sigmoid activation
    – It actually computes the probability that the input belongs to class 1
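
A sketch of the same sigmoid perceptron for vector inputs, including the y > 0.5 decision from the slide; the weights below are made up for illustration.

```python
import numpy as np

def p_class1(x, w, w0):
    """Sigmoid perceptron: P(Y=1 | X) = 1 / (1 + exp(-(w·x + w0)))."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + w0)))

w, w0 = np.array([1.5, -2.0]), 0.3     # illustrative weights and bias
x = np.array([0.8, 0.1])               # a 2-D input
p = p_class1(x, w, w0)
decision = 1 if p > 0.5 else 0         # "Decision: y > 0.5?"
print(p, decision)
```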

  29. Estimating the model
  • P(y=1 | x) = f(x) = 1 / (1 + e^(−(w0 + w1·x)))
  • Given the training data (many (x, y) pairs represented by the dots), estimate w0 and w1 for the curve

  30. Estimating the model
  • Easier to represent using a y = +1/−1 notation:
    P(y=1 | x) = 1 / (1 + e^(−(w0 + w1·x)))    P(y=−1 | x) = 1 / (1 + e^(w0 + w1·x))
  • Both cases combine into a single expression:
    P(y | x) = 1 / (1 + e^(−y·(w0 + w1·x)))

  31. Estimating the model
  • Given: training data (X_1, y_1), (X_2, y_2), …, (X_N, y_N)
  • The X's are vectors; the y's are binary class labels (written as ±1, as on the previous slide)
  • Total probability of the data:
    P((X_1, y_1), …, (X_N, y_N)) = Π_i P(X_i, y_i)
                                 = Π_i P(y_i | X_i) · P(X_i)
                                 = Π_i [1 / (1 + e^(−y_i·(w0 + wᵀX_i)))] · P(X_i)

  32. Estimating the model
  • Likelihood:
    P(Training data) = Π_i [1 / (1 + e^(−y_i·(w0 + wᵀX_i)))] · P(X_i)
  • Log likelihood:
    log P(Training data) = Σ_i log P(X_i) − Σ_i log(1 + e^(−y_i·(w0 + wᵀX_i)))
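
A sketch of the weight-dependent part of the log likelihood, assuming the ±1 label convention of slide 30; the P(X_i) terms do not depend on the weights, so they are omitted.

```python
import numpy as np

def neg_log_likelihood(w0, w, X, y):
    """Sum over samples of log(1 + exp(-y_i * (w0 + w·X_i))), with y_i in {-1, +1}.
    Minimizing this in (w0, w) is the same as maximizing the data log likelihood."""
    margins = y * (X @ w + w0)
    return np.sum(np.logaddexp(0.0, -margins))   # numerically stable log(1 + e^(-m))
```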

  33. Maximum Likelihood Estimate
  • (ŵ0, ŵ1) = argmax_{w0,w1} log P(Training data)
  • Equals (note argmin rather than argmax):
    (ŵ0, ŵ1) = argmin_{w0,w1} Σ_i log(1 + e^(−y_i·(w0 + wᵀX_i)))
  • Identical to minimizing the KL divergence between the desired output y and the actual output 1 / (1 + e^(−(w0 + wᵀX_i)))
  • Cannot be solved directly; needs gradient descent
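
Since the minimization has no closed-form solution, here is a minimal gradient-descent sketch; the learning rate, step count, and toy data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Gradient descent on  sum_i log(1 + exp(-y_i (w0 + w·X_i))),  y_i in {-1, +1}."""
    n, d = X.shape
    w0, w = 0.0, np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w + w0)
        coef = -y * sigmoid(-margins)      # d(loss)/d(margin) for each sample
        w  -= lr * (X.T @ coef) / n
        w0 -= lr * coef.sum() / n
    return w0, w

# Toy data: class +1 centred at (1, 1), class -1 centred at (-1, -1)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1, 0.5, (50, 2)), rng.normal(-1, 0.5, (50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])
print(fit_logistic(X, y))   # decision boundary passes roughly through the origin
```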

  34. So what about this one?
  • Non-linear classifiers...

  35. First consider the separable case...
  • When the net must learn to classify...
  [figure: two separable classes in the (x1, x2) plane]

  36. First consider the separable case...
  • For a "sufficient" net

  37. First consider the separable case...
  • For a "sufficient" net
  • This final perceptron is a linear classifier

  38. First consider the separable case...
  • For a "sufficient" net
  • This final perceptron is a linear classifier over the output of the penultimate layer

  39. First consider the separable case...
  • For perfect classification, the output of the penultimate layer must be linearly separable
  [figure: network with inputs x1, x2 and penultimate-layer outputs y1, y2 feeding the final perceptron]
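
A small sketch of the last point using the classic XOR pattern with hand-picked (not learned) weights: the raw inputs are not linearly separable, but the penultimate-layer outputs are, so the final perceptron can separate them with a single line.

```python
import numpy as np

def step(z):
    """Threshold activation: 1 if z > 0, else 0."""
    return (z > 0).astype(float)

# XOR-style data: class 1 iff exactly one input is 1 (not linearly separable)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([0, 1, 1, 0])

# Hidden layer: unit 1 fires for "x1 OR x2", unit 2 fires for "x1 AND x2"
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-0.5, -1.5])
H = step(X @ W_hidden.T + b_hidden)            # penultimate-layer outputs

# Final perceptron: the single line  h1 - h2 > 0.5  separates the classes in H-space
y_hat = step(H @ np.array([1.0, -1.0]) - 0.5)
print(H)                                       # [[0,0], [1,0], [1,0], [1,1]]
print(y_hat, np.array_equal(y_hat, labels))    # [0, 1, 1, 0]  True
```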
