

  1. Review • Linear separability (and use of features) • Class probabilities for linear discriminants – sigmoid (logistic) function • Applications: USPS, fMRI • [figure from book: sigmoid class probabilities over features φ1, φ2]
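For reference, the sigmoid model mentioned above, in the usual notation with features φ(x) and weights w (the notation is assumed, not taken from the slide):

$$\sigma(a) = \frac{1}{1 + e^{-a}}, \qquad P(Y = 1 \mid x, w) = \sigma\big(w^\top \phi(x)\big), \qquad P(Y = 0 \mid x, w) = 1 - P(Y = 1 \mid x, w).$$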

  2. Review • Generative vs. discriminative – maximum conditional likelihood • Logistic regression • Weight space – each example adds a penalty to all weight vectors that misclassify it – penalty is approximately piecewise linear • [figure: per-example penalty over weight space]

  3. Example • [figure: 2D example data set]

  4. –log(P(Y1..3 | X1..3, W)) • [figure: negative conditional log-likelihood surface over W for three examples]
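The slide itself is only a surface plot; the quantity in its title is presumably the conditional negative log-likelihood of three training examples, which in the standard y_i ∈ {0, 1} parameterization reads

$$-\log P(Y_{1..3} \mid X_{1..3}, w) \;=\; -\sum_{i=1}^{3}\Big[\, y_i \log \sigma(w^\top x_i) + (1 - y_i)\log\big(1 - \sigma(w^\top x_i)\big)\Big],$$

i.e., the sum of the per-example, approximately piecewise-linear penalties described on slide 2.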

  5. Generalization: multiple classes • One weight vector per class: Y ∈ {1, 2, …, C} – P(Y = k) = … – Z_k = … • In 2-class case: …
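The definitions are left blank on the slide (presumably filled in during lecture); the standard softmax construction consistent with the Z_k notation is

$$Z_k = \exp(w_k^\top x), \qquad P(Y = k \mid x, W) = \frac{Z_k}{\sum_{j=1}^{C} Z_j},$$

and with C = 2 this reduces to P(Y = 1 | x) = σ((w_1 − w_2)ᵀ x), the sigmoid from slide 1.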

  6. Multiclass example • [figure from book: multiclass decision regions]

  7. Priors and conditional MAP • P(Y | X, W) = … , Z = … • As in linear regression, can put a prior on W – common priors: L2 (ridge), L1 (sparsity) • max_w P(W = w | X, Y)
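A hedged reconstruction of the objective behind the last bullet, assuming a penalty strength λ (not specified on the slide): the conditional MAP problem with an L2 (Gaussian) or L1 (Laplace) prior is

$$\max_w \; \log P(Y \mid X, w) - \lambda \lVert w \rVert_2^2 \quad \text{(ridge)} \qquad \text{or} \qquad \max_w \; \log P(Y \mid X, w) - \lambda \lVert w \rVert_1 \quad \text{(sparsity)}.$$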

  8. Software • Logistic regression software is easily available: most stats packages provide it – e.g., glm function in R – or, http://www.cs.cmu.edu/~ggordon/IRLS-example/ • Most common algorithm: Newton’s method on log-likelihood (or L2-penalized version) – called “iteratively reweighted least squares” – for L1, slightly harder (less software available)
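For concreteness, a minimal NumPy sketch of the Newton/IRLS update for two-class logistic regression (the function names and the small ridge term are my own; this is not the code behind the linked IRLS example):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(X, y, n_iter=25, ridge=1e-6):
    """Newton's method / IRLS for two-class logistic regression.
    X: (n, d) design matrix; y: (n,) labels in {0, 1}.
    ridge: small L2 penalty that also keeps the Hessian invertible."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(X @ w)                              # current predicted probabilities
        grad = X.T @ (y - p) - ridge * w                # gradient of penalized log-likelihood
        R = p * (1.0 - p)                               # Newton "reweighting" terms
        H = X.T @ (X * R[:, None]) + ridge * np.eye(d)  # negative Hessian
        w = w + np.linalg.solve(H, grad)                # Newton step
    return w
```

Each Newton step solves a weighted least-squares system, which is where the name “iteratively reweighted least squares” comes from.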

  9. Historical application: Fisher iris data • [figure: P(I. virginica) vs. petal length]

  10. [figure-only slide]

  11. Bayesian regression • In linear and logistic regression, we’ve looked at – conditional MLE: max_w P(Y | X, w) – conditional MAP: max_w P(W = w | X, Y) • But of course, a true Bayesian would turn up their nose at both – why?
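The objection developed on the next few slides, made explicit: MLE and MAP both return a single point estimate of w, while a Bayesian keeps the whole posterior

$$P(w \mid X, Y) \propto P(Y \mid X, w)\, P(w)$$

and averages over it instead of maximizing.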

  12. Sample from posterior • [figure: samples drawn from the posterior over W]

  13. Predictive distribution • [figure: predictive distribution]
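Written out in the usual notation (not shown on the slide), the predictive distribution for a test input x* is

$$P(y^* \mid x^*, X, Y) = \int P(y^* \mid x^*, w)\, P(w \mid X, Y)\, dw,$$

an average of the model’s predictions over the posterior; this integral is exactly the kind of quantity the sampling methods later in the lecture approximate.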

  14. Overfitting • Overfit: training likelihood ≫ test likelihood – often a result of overconfidence • Overfitting is an indicator that the MLE or MAP approximation is a bad one • Bayesian inference rarely overfits – may still lead to bad results for other reasons! – e.g., not enough data, bad model class, …

  15. So, we want the predictive distribution • Most of the time… – Graphical model is big and highly connected – Variables are high-arity or continuous • Can’t afford exact inference – Inference reduces to numerical integration (and/or summation) – We’ll look at randomized algorithms

  16. Numerical integration • [figure: function to be integrated numerically]

  17. 2D is 2 easy! • We care about high-D problems • Often, much of the mass is hidden in a tiny fraction of the volume – simultaneously try to discover it and estimate its amount

  18. Application: SLAM

  19. Integrals in multi-million-D • Eliazar and Parr, IJCAI-03

  20. Simple 1D problem • [figure: 1D integrand]

  21. Uniform sampling • [figure: the same 1D integrand with uniformly drawn sample points]

  22. Uniform sampling • E(f(X)) = (1/V) ∫ f(x) dx for X uniform over a region of volume V • So, V · E(f(X)) is the desired integral – estimate it by V times the sample average • But standard deviation can be big • Can reduce it by averaging many samples • But only at rate 1/√N
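A small runnable illustration of the uniform estimator and its 1/√N behaviour; the integrand f below is a stand-in, since the slide’s actual function isn’t recoverable:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # stand-in integrand (assumption: not the function from the slides)
    return np.exp(-x**2) * (1.0 + np.sin(3.0 * x))

a, b = -1.0, 1.0      # integration region; its "volume" is V = b - a
V = b - a

for N in (100, 10_000, 1_000_000):
    x = rng.uniform(a, b, size=N)            # X ~ uniform over [a, b]
    samples = f(x)
    estimate = V * samples.mean()            # V * E[f(X)] approximates the integral
    stderr = V * samples.std() / np.sqrt(N)  # error shrinks only like 1/sqrt(N)
    print(f"N={N:>9,}  integral ≈ {estimate:.5f} ± {stderr:.5f}")
```

Going from 10,000 to 1,000,000 samples buys only one extra digit of accuracy, which is the 1/√N rate the slide refers to.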

  23. Importance sampling • Instead of X ~ uniform, use X ~ Q(x) • Q = … • Should have Q(x) large where f(x) is large • Problem:

  24. Importance sampling • Instead of X ~ uniform, use X ~ Q(x) • Q = … • Should have Q(x) large where f(x) is large • Problem: E_Q(f(X)) = ∫ Q(x) f(x) dx, which is no longer the integral we want

  25. Importance sampling • Define h(x) ≡ f(x)/Q(x) • Then E_Q(h(X)) = ∫ Q(x) h(x) dx = ∫ Q(x) f(x)/Q(x) dx = ∫ f(x) dx

  26. Importance sampling • So, take samples of h(X) instead of f(X) • W_i = 1/Q(X_i) is the importance weight • Q = 1/V yields uniform sampling
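A runnable sketch of this estimator for ∫ f(x) dx over the real line, using a Gaussian proposal; both the integrand f and the choice of Q are stand-ins, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # same stand-in integrand; Q must be positive wherever f is non-negligible
    return np.exp(-x**2) * (1.0 + np.sin(3.0 * x))

mu, sigma = 0.0, 0.7      # proposal Q = Normal(mu, sigma); an arbitrary choice
N = 10_000
x = rng.normal(mu, sigma, size=N)
q = np.exp(-(x - mu)**2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))

h = f(x) / q              # h(X) = f(X)/Q(X); its mean under Q is the integral of f
print("IS estimate of the integral:", h.mean())
print("standard error             :", h.std() / np.sqrt(N))
```

Equivalently, this averages W_i f(X_i) with W_i = 1/Q(X_i), matching the weights defined on the slide.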

  27. Importance sampling • [figure: the 1D integrand with samples drawn from a non-uniform proposal Q]

  28. Variance • How does this help us control variance? • Suppose: – f big – Q small • Then h = f/Q: … • Variance of each weighted sample is … • Optimal Q?
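The slide leaves the answers blank; the standard ones are: where f is big but Q is small, h = f/Q blows up, the variance of each weighted sample is

$$\mathrm{Var}_Q\big(h(X)\big) = \int \frac{f(x)^2}{Q(x)}\,dx - \Big(\int f(x)\,dx\Big)^2,$$

and (for f ≥ 0) it is minimized, in fact driven to zero, by the usually unattainable choice Q*(x) ∝ f(x); more generally Q* ∝ |f|.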

  29. Importance sampling, part II • Suppose we want ∫ f(x) dx = ∫ P(x) g(x) dx = E_P(g(X)) • Pick N samples X_i from proposal Q(X) • Average W_i g(X_i), where the importance weight is W_i =

  30. Importance sampling, part II • Suppose we want ∫ f(x) dx = ∫ P(x) g(x) dx = E_P(g(X)) • Pick N samples X_i from proposal Q(X) • Average W_i g(X_i), where the importance weight is W_i = P(X_i)/Q(X_i) • Check: E_Q(W g(X)) = ∫ Q(x) [P(x)/Q(x)] g(x) dx = ∫ P(x) g(x) dx

  31. Two variants of IS • Same algorithm, different terminology – want ∫ f(x) dx vs. E_P(f(X)) – W = 1/Q vs. W = P/Q

  32. Parallel importance sampling • Suppose we want ∫ f(x) dx = ∫ P(x) g(x) dx = E_P(g(X)) • But P(x) is unnormalized (e.g., represented by a factor graph): we know only Z·P(x)

  33. Parallel IS • Pick N samples X_i from proposal Q(X) • If we knew W_i = P(X_i)/Q(X_i), could do IS • Instead, set … and … • Then: …

  34. Parallel IS • Final estimate: …
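The blanks on the last two slides are presumably the self-normalized (normalized) importance-sampling construction: use the computable unnormalized weights W̃_i = Z·P(X_i)/Q(X_i) and divide by their sum, so the unknown Z cancels. A minimal sketch under that assumption:

```python
import numpy as np

def self_normalized_is(g, unnorm_p, x_q, q_density):
    """Estimate E_P[g(X)] from samples x_q ~ Q when only an unnormalized
    density Z*P(x) can be evaluated (e.g., a product of factor-graph potentials)."""
    w = unnorm_p(x_q) / q_density(x_q)     # unnormalized weights; the unknown Z cancels
    return np.sum(w * g(x_q)) / np.sum(w)  # weighted average, normalized by the weight sum

# Toy check: P is a standard normal known only up to its constant; Q = Normal(0, 2).
rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=100_000)
q_density = lambda t: np.exp(-t**2 / 8.0) / (2.0 * np.sqrt(2.0 * np.pi))
p_unnorm  = lambda t: np.exp(-t**2 / 2.0)  # drops the 1/sqrt(2*pi) normalizer
print(self_normalized_is(lambda t: t**2, p_unnorm, x, q_density))  # ≈ 1.0 = Var of N(0,1)
```

As a side benefit, the average of the unnormalized weights, (1/N) Σ W̃_i, is itself an estimate of the normalizing constant Z.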
