

  1. Non parametric methods. Course of Machine Learning, Master Degree in Computer Science, University of Rome “Tor Vergata”. Giorgio Gambosi, a.a. 2018-2019.

  2. Probability distribution estimates
  • The statistical approach to classification requires the (at least approximate) knowledge of $p(C_i \mid x)$: in fact, an item $x$ is assigned to the class $C_i$ such that
    $$i = \operatorname{argmax}_k \, p(C_k \mid x)$$
  • The same holds in the regression case, where $p(y \mid x)$ has to be estimated.

  3. Probability distribution estimates: hypotheses
  What do we assume to know of the class distributions, given a training set $X, t$?
  • Case 1. The probabilities $p(x \mid C_i)$ are known: an item $x$ is assigned to the class $C_i$ such that
    $$i = \operatorname{argmax}_j \, p(C_j \mid x)$$
    where $p(C_j \mid x)$ can be derived through Bayes’ rule and the prior probabilities, since $p(C_k \mid x) \propto p(x \mid C_k)\, p(C_k)$.

  4. Probability distribution estimates: hypotheses
  • Case 2. The type of probability distribution $p(x \mid \theta)$ is known: an estimate of the parameter values $\theta_i$ is performed for each class, taking into account for each class $C_i$ the subset $X_i, t_i$ of items belonging to that class, that is, such that $t = i$. Different approaches to parameter estimation:
    1. Maximum likelihood: $\theta_i^{ML} = \operatorname{argmax}_\theta \, p(X_i, t_i \mid \theta)$ is computed. Item $x$ is assigned to class $C_i$ if
       $$i = \operatorname{argmax}_j \, p(C_j \mid x) = \operatorname{argmax}_j \, p(x \mid \theta_j^{ML})\, p(C_j)$$
    2. Maximum a posteriori: $\theta_i^{MAP} = \operatorname{argmax}_\theta \, p(\theta \mid X_i, t_i)$ is computed. Item $x$ is assigned to class $C_i$ if
       $$i = \operatorname{argmax}_j \, p(C_j \mid x) = \operatorname{argmax}_j \, p(x \mid \theta_j^{MAP})\, p(C_j)$$
    3. Bayesian estimate: the distribution $p(\theta \mid X_i, t_i)$ is estimated for each class and, from it,
       $$p(x \mid C_i) = \int_\theta p(x \mid \theta)\, p(\theta \mid X_i, t_i)\, d\theta$$
       Item $x$ is assigned to class $C_i$ if
       $$i = \operatorname{argmax}_j \, p(C_j \mid x) = \operatorname{argmax}_j \, p(C_j)\, p(x \mid C_j) = \operatorname{argmax}_j \, p(C_j) \int_\theta p(x \mid \theta)\, p(\theta \mid X_j, t_j)\, d\theta$$
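As a rough illustration of Case 2 with the maximum-likelihood approach, here is a minimal sketch in Python/NumPy. It assumes, purely as an example not stated on the slide, Gaussian class-conditional densities with diagonal covariance; all function and variable names are hypothetical.

```python
import numpy as np

def fit_ml_gaussians(X, t):
    """Per-class ML estimates theta_i^ML = (mean, variance) and priors p(C_i)."""
    params = {}
    for c in np.unique(t):
        Xc = X[t == c]                         # subset X_i of items with t = c
        params[c] = (Xc.mean(axis=0),          # ML mean (assumed Gaussian model)
                     Xc.var(axis=0) + 1e-9,    # ML per-dimension variance (regularized)
                     len(Xc) / len(X))         # prior p(C_c)
    return params

def log_gaussian(x, mean, var):
    # log density of a diagonal-covariance Gaussian
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def classify(x, params):
    """Assign x to argmax_j p(x | theta_j^ML) p(C_j)."""
    return max(params, key=lambda c: log_gaussian(x, *params[c][:2]) + np.log(params[c][2]))
```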

  5. Probability distribution estimates: hypotheses
  • Case 3. No knowledge of the probabilities is assumed.
  • In the previous cases, (parametric) models are used for a synthetic description of the data in $X, t$.
  • In this case, there are no models (and no parameters): the training set items appear explicitly in the class distribution estimates.
  • The class distributions $p(x \mid C_i)$ are estimated directly from the data.
  • These are denoted as non parametric models: indeed, an unbounded number of parameters is used.

  6. Histograms
  • Elementary type of non parametric estimate.
  • The domain is partitioned into $m$ $d$-dimensional intervals (bins).
  • The probability $P_x$ that an item belongs to the bin containing item $x$ is estimated as $\frac{n(x)}{N}$, where $n(x)$ is the number of elements in that bin and $N$ is the total number of items.
  • The probability density in the interval corresponding to the bin containing $x$ is then estimated as the ratio between the above probability and the interval width $\Delta(x)$ (typically, a constant $\Delta$):
    $$p_H(x) = \frac{n(x)/N}{\Delta(x)} = \frac{n(x)}{N\,\Delta(x)}$$
  [Figure: histogram density estimates of the same data for bin widths $\Delta = 0.04$, $\Delta = 0.08$, $\Delta = 0.25$]
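A minimal sketch of this estimator in one dimension, assuming equal-width bins over a fixed interval; the function and parameter names are illustrative, not taken from the slides.

```python
import numpy as np

def histogram_density(data, x, delta=0.04, lo=0.0, hi=1.0):
    """Histogram estimate p_H(x) = n(x) / (N * delta) with equal-width bins on [lo, hi)."""
    data = np.asarray(data, float)
    N = len(data)
    edges = np.arange(lo, hi + delta, delta)          # bin edges
    counts, _ = np.histogram(data, bins=edges)        # n(.) for every bin
    j = min(int((x - lo) // delta), len(counts) - 1)  # index of the bin containing x
    return counts[j] / (N * delta)
```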

  7. Histograms: problems
  • The density estimate depends on the position of the first bin; in the case of multivariate data, it also depends on the bin orientation.
  • The resulting estimate is not continuous.
  • Curse of dimensionality: the number of bins grows as a polynomial of order $d$ in the number of intervals per dimension: in high-dimensional spaces many bins may turn out to be empty, unless a large number of items is available.
  • In practice, histograms can be applied only to low-dimensional datasets ($d = 1, 2$).

  8. Kernel density estimators
  • Probability that an item falls in the region $R(x)$ containing $x$:
    $$P_x = \int_{R(x)} p(z)\, dz$$
  • Given $n$ items $x_1, x_2, \ldots, x_n$, the probability that $k$ among them fall in $R(x)$ is given by the binomial distribution
    $$p(k) = \binom{n}{k} P_x^k (1 - P_x)^{n-k} = \frac{n!}{k!\,(n-k)!}\, P_x^k (1 - P_x)^{n-k}$$
  • Mean and variance of the ratio $r = \frac{k}{n}$ are
    $$E[r] = P_x \qquad \operatorname{var}[r] = \frac{P_x (1 - P_x)}{n}$$
  • $P_x$ is the expected fraction of items in $R(x)$, and the ratio $r$ is an estimate of it. As $n \to \infty$ the variance decreases and $r$ tends to $E[r] = P_x$. Hence, in general,
    $$r = \frac{k}{n} \simeq P_x$$

  9. Nonparametric estimates
  • Let the volume of $R(x)$ be sufficiently small. Then, the density $p(x)$ is almost constant in the region and
    $$P_x = \int_{R(x)} p(z)\, dz \simeq p(x)\, V$$
    where $V$ is the volume of $R(x)$.
  • Since $P_x \simeq \frac{k}{n}$, it then derives that
    $$p(x) \simeq \frac{k}{nV}$$

  10. Approaches to nonparametric estimates
  Two alternative ways to exploit the estimate $p(x) \simeq \frac{k}{nV}$ (a sketch of the second follows below):
  1. Fix $V$ and derive $k$ from the data (kernel density estimation).
  2. Fix $k$ and derive $V$ from the data (K-nearest neighbor).
  It can be shown that, in both cases, under suitable conditions, the estimator tends to the true density $p(x)$ as $n \to \infty$.
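As a rough sketch of the second option (fix $k$, derive $V$ from the data), here is a one-dimensional K-nearest-neighbor density estimate. Taking the region $R(x)$ as the smallest interval centered on $x$ containing the $k$ nearest items, and all names, are illustrative assumptions.

```python
import numpy as np

def knn_density(data, x, k=5):
    """K-nearest-neighbor density estimate p(x) ~ k / (n V) in one dimension."""
    data = np.asarray(data, float)
    n = len(data)
    dists = np.sort(np.abs(data - x))   # distances from x to all training items
    radius = dists[k - 1]               # distance to the k-th nearest item
    V = 2 * radius                      # length of the smallest interval around x holding k items
    return k / (n * V)
```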

  11. Kernel density estimation: Parzen windows
  • Region associated to a point $x$: hypercube with edge length $h$ (and volume $h^d$) centered on $x$.
  • A kernel function $k(u)$ (Parzen window) is used to count the number of items in the unit hypercube centered on the origin $0$:
    $$k(u) = \begin{cases} 1 & |u_i| \le 1/2, \; i = 1, \ldots, d \\ 0 & \text{otherwise} \end{cases}$$
  • As a consequence, $k\!\left(\frac{x - x'}{h}\right) = 1$ iff $x'$ is in the hypercube of edge length $h$ centered on $x$.
  • The number of items in the hypercube is then
    $$K = \sum_{i=1}^{n} k\!\left(\frac{x - x_i}{h}\right)$$

  12. Kernel density estimation: Parzen windows
  • The estimated density is
    $$p(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^d}\, k\!\left(\frac{x - x_i}{h}\right)$$
  • Since $k(u) \ge 0$ and $\int k(u)\, du = 1$, it derives that
    $$k\!\left(\frac{x - x_i}{h}\right) \ge 0 \qquad \text{and} \qquad \int k\!\left(\frac{x - x_i}{h}\right) dx = h^d$$
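A minimal sketch of this Parzen-window estimator with the hypercube kernel, written in plain Python/NumPy; the function names are illustrative.

```python
import numpy as np

def hypercube_kernel(u):
    """Parzen window k(u): 1 if every component of u lies in [-1/2, 1/2], else 0."""
    return float(np.all(np.abs(u) <= 0.5))

def parzen_density(X, x, h=1.0):
    """Estimate p(x) = (1 / (n h^d)) * sum_i k((x - x_i) / h)."""
    X = np.asarray(X, float).reshape(len(X), -1)   # n items, d dimensions
    n, d = X.shape
    K = sum(hypercube_kernel((x - xi) / h) for xi in X)
    return K / (n * h ** d)
```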

  13. Kernel density estimation: Parzen windows
  As a consequence, it results that $p(x)$ is a probability density. In fact,
  $$p(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^d}\, k\!\left(\frac{x - x_i}{h}\right) \ge 0$$
  and
  $$\int p(x)\, dx = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^d} \int k\!\left(\frac{x - x_i}{h}\right) dx = \frac{1}{n h^d} \sum_{i=1}^{n} \int k\!\left(\frac{x - x_i}{h}\right) dx = \frac{1}{n h^d}\, n h^d = 1$$
  Clearly, the window size $h$ has a relevant effect on the estimate.

  14. Kernel density estimation: Parzen windows
  [Figure: Parzen-window estimates of the same data for $h = 2$, $h = 1$, $h = \varepsilon$]

  15. Kernels and smoothing
  Drawbacks of the Parzen window:
  1. discontinuity of the estimates;
  2. items in a region centered on $x$ have uniform weights: their distance from $x$ is not taken into account.
  Solution: use smooth kernel functions $\kappa_h(u)$ to assign larger weights to points nearer to the origin.
  Assumed characteristics of $\kappa_h(u)$:
  $$\int \kappa_h(x)\, dx = 1 \qquad \int x\, \kappa_h(x)\, dx = 0 \qquad \int x^2\, \kappa_h(x)\, dx > 0$$

  16. Kernels and smoothing
  Usually, kernels are based on smooth radial functions (functions of the distance from the origin):
  1. Gaussian: $\kappa(u) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{u^2}{2\sigma^2}}$, unlimited support
  2. Epanechnikov: $\kappa(u) = \frac{3}{4}(1 - u^2)$ for $|u| \le 1$, limited support
  3. ...
  [Figure: plots of the window kernel $k(u)$ and of two smooth kernels $\kappa(u)$ as functions of $u$]
  The resulting estimate:
  $$p(x) = \frac{1}{nh} \sum_{i=1}^{n} \kappa\!\left(\frac{x - x_i}{h}\right) = \frac{1}{n} \sum_{i=1}^{n} \kappa_h(x - x_i)$$
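A minimal sketch of the resulting smooth-kernel estimate in one dimension, using the two kernels named above (the Gaussian taken with $\sigma = 1$ and the standard Epanechnikov normalization, both assumptions for concreteness); names are illustrative.

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian kernel with sigma = 1."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def epanechnikov_kernel(u):
    """Epanechnikov kernel: (3/4)(1 - u^2) on |u| <= 1, zero outside."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

def kde(data, x, h=0.5, kernel=gaussian_kernel):
    """Smooth-kernel estimate p(x) = (1 / (n h)) * sum_i kappa((x - x_i) / h)."""
    data = np.asarray(data, float)
    return kernel((x - data) / h).sum() / (len(data) * h)
```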

  17. Kernels and smoothing
  [Figure: kernel density estimate $p(x)$ for $h = 1$]

  18. Kernels and smoothing
  [Figure: kernel density estimate $p(x)$ for $h = 2$]

  19. Kernels and smoothing
  [Figure: kernel density estimate $p(x)$ for $h = 0.5$]

  20. Kernels and smoothing
  [Figure: kernel density estimate $p(x)$ for $h = 0.25$]

  21. Kernel regression
  Kernel smoothing methods can also be applied to regression: in this case, the value corresponding to any item $x$ is predicted by referring to the items in the training set (and in particular to the items which are closer to $x$). In this case, the conditional expectation should be returned:
  $$f(x) = E[y \mid x] = \int y\, p(y \mid x)\, dy = \int y\, \frac{p(x, y)}{p(x)}\, dy = \frac{\int y\, p(x, y)\, dy}{\int p(x, y)\, dy}$$
  Applying kernels, we have
  $$p(x, y) \approx \frac{1}{n} \sum_{i=1}^{n} \kappa_h(x - x_i)\, \kappa_h(y - t_i)$$

  22. Kernel regression
  This results in
  $$f(x) = \frac{\frac{1}{n} \sum_{i=1}^{n} \kappa_h(x - x_i) \int y\, \kappa_h(y - t_i)\, dy}{\frac{1}{n} \sum_{i=1}^{n} \kappa_h(x - x_i) \int \kappa_h(y - t_i)\, dy}$$
  and, since $\int \kappa_h(y - t_i)\, dy = 1$ and $\int y\, \kappa_h(y - t_i)\, dy = t_i$, we get
  $$f(x) = \frac{\sum_{i=1}^{n} \kappa_h(x - x_i)\, t_i}{\sum_{i=1}^{n} \kappa_h(x - x_i)}$$

  23. Kernel regression
  By setting
  $$w_i(x) = \frac{\kappa_h(x - x_i)}{\sum_{j=1}^{n} \kappa_h(x - x_j)}$$
  we can write
  $$f(x) = \sum_{i=1}^{n} w_i(x)\, t_i$$
  that is, the predicted value is computed as a linear combination of all the target values, weighted by kernels (Nadaraya-Watson).
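A minimal sketch of the Nadaraya-Watson estimator for one-dimensional inputs; the Gaussian kernel, the bandwidth value, and the names are illustrative choices.

```python
import numpy as np

def nadaraya_watson(X, t, x, h=0.5):
    """Predict f(x) = sum_i w_i(x) t_i with w_i(x) = kappa_h(x - x_i) / sum_j kappa_h(x - x_j)."""
    X, t = np.asarray(X, float), np.asarray(t, float)
    weights = np.exp(-0.5 * ((x - X) / h) ** 2)   # unnormalized Gaussian kernel weights
    return np.dot(weights / weights.sum(), t)     # normalized, kernel-weighted average of targets
```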

  24. Locally weighted regression
  In the Nadaraya-Watson model, the prediction is performed by means of a weighted combination of constant values (the target values in the training set). Locally weighted regression improves on that approach by referring to a weighted version of the sum-of-squared-differences loss function used in regression.
  If a value $y$ has to be predicted for a provided item $x$, a “local” version of the loss function is considered, with weights depending on the “distance” between $x$ and $x_i$:
  $$L(x) = \sum_{i=1}^{n} \kappa_h(x - x_i)\, (w^T x_i - t_i)^2$$
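A minimal sketch of locally weighted (linear) regression for scalar inputs: for each query $x$ the weighted least-squares problem above is solved and the local linear model is evaluated at $x$. The Gaussian weighting, the added intercept term, and all names are illustrative assumptions.

```python
import numpy as np

def locally_weighted_prediction(X, t, x, h=0.5):
    """Minimize L(x) = sum_i kappa_h(x - x_i) (w^T x_i - t_i)^2, then predict at x."""
    X, t = np.asarray(X, float), np.asarray(t, float)
    Xb = np.column_stack([X, np.ones(len(X))])   # design matrix with an intercept column
    xb = np.array([x, 1.0])
    k = np.exp(-0.5 * ((X - x) / h) ** 2)        # kernel weights kappa_h(x - x_i)
    A = Xb.T @ (k[:, None] * Xb)                 # Xb^T W Xb
    b = Xb.T @ (k * t)                           # Xb^T W t
    w = np.linalg.solve(A, b)                    # weighted least-squares solution
    return xb @ w                                # local linear prediction w^T [x, 1]
```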
