Instance-based Learning




1. Instance-based Learning. CE-717: Machine Learning, Sharif University of Technology. M. Soleymani, Fall 2018

2. Outline
- Non-parametric approach
- Unsupervised: non-parametric density estimation
  - Parzen windows
  - $k_n$-nearest neighbor density estimation
- Supervised: instance-based learners
  - Classification
    - kNN classification
    - Weighted (or kernel) kNN
  - Regression
    - kNN regression
    - Locally weighted linear regression

3. Introduction
- Estimation of arbitrary density functions
  - Parametric density functions cannot usually fit the densities we encounter in practical problems; e.g., parametric densities are typically unimodal.
- Non-parametric methods don't assume that the form of the underlying densities is known in advance.
- Non-parametric methods (for classification) can be categorized into:
  - Generative: estimate $p(\mathbf{x} \mid \mathcal{C}_i)$ from $\mathcal{D}_i$ using non-parametric density estimation
  - Discriminative: estimate $p(\mathcal{C}_i \mid \mathbf{x})$ from $\mathcal{D}$ directly

4. Parametric vs. non-parametric methods
- Parametric methods need to find parameters from data and then use the inferred parameters to decide on new data points
  - Learning: finding parameters from data
- Non-parametric methods
  - Training examples are used explicitly
  - A training phase is not required
- Both supervised and unsupervised learning methods can be categorized into parametric and non-parametric methods.

5. Histogram approximation idea
- Histogram approximation of an unknown pdf:
  - $P(b_m) \approx k_n(b_m)/n$, for $m = 1, \dots, M$
  - $k_n(b_m)$: number of samples (among the $n$) that lie in bin $b_m$
- The corresponding estimated pdf, with $\bar{x}_m$ the mid-point of bin $b_m$ and $h$ the bin width:
  - $\hat{p}(x) = \dfrac{k_n(b_m)/n}{h}$ for $|x - \bar{x}_m| \le \dfrac{h}{2}$
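A minimal NumPy sketch of this histogram estimator (the function name, grid, and sample data are my own choices, not from the slides): it counts the samples in the bin containing the query point and divides by $n\,h$.

```python
import numpy as np

def histogram_density(x, samples, bin_edges):
    """Histogram pdf estimate: (k_n(b_m)/n) / h for the bin b_m that contains x."""
    n = len(samples)
    counts, _ = np.histogram(samples, bins=bin_edges)    # k_n for every bin
    widths = np.diff(bin_edges)                          # bin width h (per bin)
    m = np.searchsorted(bin_edges, x, side="right") - 1  # index of the bin containing x
    return counts[m] / (n * widths[m])

# Usage: 1000 standard-normal samples, 16 bins of width 0.5 on [-4, 4].
rng = np.random.default_rng(0)
samples = rng.standard_normal(1000)
edges = np.linspace(-4, 4, 17)
print(histogram_density(0.1, samples, edges))            # should be close to N(0,1) at 0.1
```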

6. Non-parametric density estimation
- Probability of falling in a region $\mathcal{R}$: $P = \int_{\mathcal{R}} p(\mathbf{x}')\, d\mathbf{x}'$ (a smoothed version of $p(\mathbf{x})$)
- $\mathcal{D} = \{\mathbf{x}^{(i)}\}_{i=1}^{n}$: a set of samples drawn i.i.d. according to $p(\mathbf{x})$
- The probability that $k$ of the $n$ samples fall in $\mathcal{R}$:
  - $P(k) = \binom{n}{k} P^{k} (1 - P)^{n-k}$
  - $E[k] = nP$
- This binomial distribution peaks sharply about its mean:
  - $k \approx nP \;\Rightarrow\; \dfrac{k}{n}$ as an estimate for $P$
  - More accurate for larger $n$

7. Non-parametric density estimation
- We can estimate the smoothed $p(\mathbf{x})$ by estimating $P$.
- Assumptions: $p(\mathbf{x})$ is continuous and the region $\mathcal{R}$ enclosing $\mathbf{x}$ is so small that $p$ is nearly constant in it:
  - $P = \int_{\mathcal{R}} p(\mathbf{x}')\, d\mathbf{x}' \approx p(\mathbf{x}) \times V$, where $V = \mathrm{Vol}(\mathcal{R})$
  - $\mathbf{x} \in \mathcal{R} \;\Rightarrow\; p(\mathbf{x}) \approx \dfrac{P}{V} \approx \dfrac{k/n}{V}$
- Let $V$ approach zero if we want to find $p(\mathbf{x})$ instead of the averaged version.
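A quick worked example with made-up numbers (not from the slides): if $k = 30$ of $n = 1000$ samples fall in a region of volume $V = 0.05$, the estimate is

$$\hat{p}(\mathbf{x}) \approx \frac{k/n}{V} = \frac{30/1000}{0.05} = 0.6 .$$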

8. Necessary conditions for convergence
- $p_n(\mathbf{x})$ is the estimate of $p(\mathbf{x})$ using $n$ samples:
  - $V_n$: the volume of the region around $\mathbf{x}$
  - $k_n$: the number of samples falling in the region
  - $p_n(\mathbf{x}) = \dfrac{k_n/n}{V_n}$
- Necessary conditions for convergence of $p_n(\mathbf{x})$ to $p(\mathbf{x})$:
  - $\lim_{n \to \infty} V_n = 0$
  - $\lim_{n \to \infty} k_n = \infty$
  - $\lim_{n \to \infty} k_n/n = 0$

9. Non-parametric density estimation: main approaches
- Two approaches to satisfying the conditions:
- k-nearest-neighbor density estimator: fix $k$ and determine the value of $V$ from the data
  - The volume grows until it contains the $k$ nearest neighbors of $\mathbf{x}$
  - Converges to the true probability density in the limit $n \to \infty$ when $k$ grows with $n$ (e.g., $k_n = k_1 \sqrt{n}$)
- Kernel density estimator (Parzen window): fix $V$ and determine $k$ from the data
  - The number of points falling inside the volume can vary from point to point
  - Converges to the true probability density in the limit $n \to \infty$ when $V$ shrinks suitably with $n$ (e.g., $V_n = V_1 / \sqrt{n}$)

10. Parzen window
- Extension of the histogram idea: hyper-cubes with side length $h$ (i.e., volume $h^d$) are centered on the samples.
- Hypercube as a simple window function:
  - $\varphi(\mathbf{u}) = \begin{cases} 1 & \text{if } |u_1| \le \frac{1}{2} \wedge \dots \wedge |u_d| \le \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}$
- $p_n(\mathbf{x}) = \dfrac{1}{n} \sum_{i=1}^{n} \dfrac{1}{h_n^d}\, \varphi\!\left(\dfrac{\mathbf{x} - \mathbf{x}^{(i)}}{h_n}\right)$
- $k_n = \sum_{i=1}^{n} \varphi\!\left(\dfrac{\mathbf{x} - \mathbf{x}^{(i)}}{h_n}\right)$: the number of samples in the hypercube around $\mathbf{x}$
- $V_n = h_n^d$
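A minimal NumPy sketch of this hypercube Parzen estimator (names and test data are my own, not from the slides): it counts the samples whose scaled offsets lie inside the unit cube and divides by $n\,h^d$.

```python
import numpy as np

def parzen_hypercube_density(x, samples, h):
    """Parzen estimate with a hypercube window of side length h.

    x:       query point, shape (d,)
    samples: training data, shape (n, d)
    """
    n, d = samples.shape
    u = (x - samples) / h                       # scaled offsets (x - x_i) / h, shape (n, d)
    inside = np.all(np.abs(u) <= 0.5, axis=1)   # phi(u) = 1 iff every |u_j| <= 1/2
    k_n = inside.sum()                          # number of samples in the cube around x
    V_n = h ** d                                # volume of the hypercube
    return (k_n / n) / V_n

# Usage: 1-D standard-normal data, estimate near the mode.
rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 1))
print(parzen_hypercube_density(np.array([0.0]), data, h=0.5))
```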

11. Window function
- Necessary conditions on the window function so that the result is a legitimate density function:
  - $\varphi(\mathbf{x}) \ge 0$
  - $\int \varphi(\mathbf{x})\, d\mathbf{x} = 1$
- Windows are also called kernels or potential functions.

12. Density estimation: non-parametric
- With a Gaussian window of width $h$ the estimate becomes
  - $\hat{p}_n(x) = \dfrac{1}{n} \sum_{i=1}^{n} \dfrac{1}{\sqrt{2\pi}\, h} \exp\!\left(-\dfrac{(x - x^{(i)})^2}{2h^2}\right)$
- Equivalently, with $\sigma = h$:
  - $\hat{p}_n(x) = \dfrac{1}{n} \sum_{i=1}^{n} \mathcal{N}(x \mid x^{(i)}, \sigma^2)$
- Sample points: 1, 1.2, 1.4, 1.5, 1.6, 2, 2.1, 2.15, 4, 4.3, 4.7, 4.75, 5
- The choice of $\sigma$ is crucial.
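A short NumPy sketch of this Gaussian-kernel estimate on the sample points listed above (the helper name and the evaluation grid are my own choices):

```python
import numpy as np

def gaussian_parzen(x_grid, samples, sigma):
    """Average of Gaussian kernels N(x | x_i, sigma^2), evaluated on a grid of points."""
    diffs = x_grid[:, None] - samples[None, :]               # shape (len(grid), n)
    kernels = np.exp(-diffs**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    return kernels.mean(axis=1)

# The 1-D sample points from the slide.
data = np.array([1, 1.2, 1.4, 1.5, 1.6, 2, 2.1, 2.15, 4, 4.3, 4.7, 4.75, 5])
grid = np.linspace(0, 6, 601)
for sigma in (0.02, 0.1, 0.5, 1.5):       # the widths compared on the next slide
    print(sigma, gaussian_parzen(grid, data, sigma).max())
```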

13. Density estimation: non-parametric
- [Figure: the Gaussian-kernel estimate on the previous slide's sample points for $\sigma = 0.02$, $\sigma = 0.1$, $\sigma = 0.5$, and $\sigma = 1.5$]

14. Window (or kernel) function: width parameter
- $p_n(\mathbf{x}) = \dfrac{1}{n} \sum_{i=1}^{n} \dfrac{1}{h_n^d}\, \varphi\!\left(\dfrac{\mathbf{x} - \mathbf{x}^{(i)}}{h_n}\right)$
- Choosing $h_n$:
  - Too large: low resolution
  - Too small: much variability
- For unlimited $n$, by letting $V_n$ slowly approach zero as $n$ increases, $p_n(\mathbf{x})$ converges to $p(\mathbf{x})$. [Duda, Hart, and Stork]

15. Parzen window: example
- $\varphi(u) = \mathcal{N}(0, 1)$, $p(x) = \mathcal{N}(0, 1)$, $h_n = h/\sqrt{n}$ [Duda, Hart, and Stork]

16. Width parameter
- For fixed $n$, a smaller $h$ results in higher variance, while a larger $h$ leads to higher bias.
- For a fixed $h$, the variance decreases as the number of sample points $n$ tends to infinity
  - For a large enough number of samples, the smaller $h$, the better the accuracy of the resulting estimate
- In practice, where only a finite number of samples is available, a compromise between $h$ and $n$ must be made.
- $h$ can be set using techniques like cross-validation when the density estimate is used for learning tasks such as classification.
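One common cross-validation recipe (a sketch of my own, not prescribed by the slides) is to pick the $h$ that maximizes the leave-one-out log-likelihood of a Gaussian Parzen estimate:

```python
import numpy as np

def loo_log_likelihood(samples, h):
    """Leave-one-out log-likelihood of a Gaussian Parzen estimate with width h."""
    n = len(samples)
    diffs = samples[:, None] - samples[None, :]
    k = np.exp(-diffs**2 / (2 * h**2)) / (np.sqrt(2 * np.pi) * h)
    np.fill_diagonal(k, 0.0)                   # leave each point out of its own estimate
    p_loo = k.sum(axis=1) / (n - 1)
    return np.log(p_loo + 1e-300).sum()        # small constant guards against log(0)

data = np.array([1, 1.2, 1.4, 1.5, 1.6, 2, 2.1, 2.15, 4, 4.3, 4.7, 4.75, 5])
candidates = [0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 1.5]
best_h = max(candidates, key=lambda h: loo_log_likelihood(data, h))
print("selected width:", best_h)
```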

17. Practical issues: curse of dimensionality
- A large $n$ is necessary to find an acceptable density estimate in high-dimensional feature spaces
- $n$ must grow exponentially with the dimensionality $d$
  - If $n$ equidistant points are required to densely fill a one-dimensional interval, $n^d$ points are needed to fill the corresponding $d$-dimensional hypercube.
  - We need an exponentially large quantity of training data to ensure that the cells are not empty
- There are also complexity (time and storage) requirements
- [Figure: grids of points for $d = 1$, $d = 2$, $d = 3$]
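As a worked example with an assumed per-axis resolution (my number, not the slides'): if 10 equidistant points suffice along one axis, a comparable grid needs $10^d$ points in $d$ dimensions, i.e., $10$ for $d = 1$, $100$ for $d = 2$, $1000$ for $d = 3$, and already $10^{10}$ for $d = 10$.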

18. $k_n$-nearest neighbor estimation
- The cell volume is a function of the point location
- To estimate $p(\mathbf{x})$, let the cell around $\mathbf{x}$ grow until it captures $k_n$ samples, called the $k_n$ nearest neighbors of $\mathbf{x}$.
- Two possibilities can occur:
  - High density near $\mathbf{x}$ $\Rightarrow$ the cell will be small, which provides good resolution
  - Low density near $\mathbf{x}$ $\Rightarrow$ the cell will grow large and stops only when higher-density regions are reached

19. $k_n$-nearest neighbor estimation
- $p_n(\mathbf{x}) = \dfrac{k_n/n}{V_n} \;\Rightarrow\; V_n \approx \dfrac{1}{p(\mathbf{x})} \times \dfrac{k_n}{n}$
- A family of estimates is obtained by setting $k_n = k_1 \sqrt{n}$ and choosing different values for $k_1$; e.g., with $k_1 = 1$: $V_n \approx \dfrac{1}{p(\mathbf{x}) \sqrt{n}}$
- $V_n$ is a function of $\mathbf{x}$
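A minimal NumPy sketch of this estimator (helper names and test data are my own), using a Euclidean ball that grows until it reaches the $k$-th nearest neighbor:

```python
import numpy as np
from math import gamma, pi

def knn_density(x, samples, k):
    """k_n-NN density estimate: p(x) ~ (k/n) / V_n, where V_n is the volume of the
    d-dimensional ball whose radius is the distance to the k-th nearest neighbor."""
    n, d = samples.shape
    dists = np.sort(np.linalg.norm(samples - x, axis=1))
    r_k = dists[k - 1]                                     # radius to the k-th neighbor
    V_n = (pi ** (d / 2) / gamma(d / 2 + 1)) * r_k ** d    # volume of a d-ball
    return (k / n) / V_n

# Usage: 2-D standard-normal data, k_n = k_1 * sqrt(n) with k_1 = 1.
rng = np.random.default_rng(0)
data = rng.standard_normal((2000, 2))
k_n = int(np.sqrt(len(data)))
print(knn_density(np.zeros(2), data, k_n))                 # true value is 1/(2*pi) ~ 0.159
```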

20. $k_n$-nearest neighbor estimation: example
- Discontinuities in the slopes [Bishop]

21. $k_n$-nearest neighbor estimation: example
- $k_n = \sqrt{n}$; for $n = 1$: $p_1(x) = \dfrac{1}{2\,|x - x^{(1)}|}$ [Duda, Hart, and Stork]

22. Non-parametric density estimation: summary
- Generality of distributions
  - With enough samples, convergence to an arbitrarily complicated target density can be obtained.
- The number of required samples must be very large to assure convergence
  - It grows exponentially with the dimensionality of the feature space
- These methods are very sensitive to the choice of window width or number of nearest neighbors
- There may be severe requirements for computation time and storage (needed to save all training samples)
  - The 'training' phase simply requires storage of the training set.
  - The computational cost of evaluating $p(\mathbf{x})$ grows linearly with the size of the data set.

23. Nonparametric learners
- Memory-based or instance-based learners
- Lazy learning: (almost) all the work is done at test time.
- Generic description:
  - Memorize the training data $(\mathbf{x}^{(1)}, y^{(1)}), \dots, (\mathbf{x}^{(n)}, y^{(n)})$.
  - Given a test point $\mathbf{x}$, predict $\hat{y} = f(\mathbf{x};\, \mathbf{x}^{(1)}, y^{(1)}, \dots, \mathbf{x}^{(n)}, y^{(n)})$.
  - $f$ is typically expressed in terms of the similarity of the test sample $\mathbf{x}$ to the training samples $\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(n)}$.
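As a minimal sketch of one such $f$ (a plain kNN classifier by majority vote; names and toy data are my own, not from the slides):

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    """Predict the label of x by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)    # similarity measured by distance
    nearest = np.argsort(dists)[:k]                # indices of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage with two well-separated 2-D classes.
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(np.array([0.5, 0.5]), X_train, y_train, k=3))   # expected: 0
```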

24. Parzen window & generative classification
- Decide $\mathcal{C}_1$ if
  - $\dfrac{\frac{1}{n_1} \frac{1}{V} \sum_{\mathbf{x}^{(i)} \in \mathcal{D}_1} \varphi\!\left(\frac{\mathbf{x} - \mathbf{x}^{(i)}}{h}\right) \times P(\mathcal{C}_1)}{\frac{1}{n_2} \frac{1}{V} \sum_{\mathbf{x}^{(i)} \in \mathcal{D}_2} \varphi\!\left(\frac{\mathbf{x} - \mathbf{x}^{(i)}}{h}\right) \times P(\mathcal{C}_2)} > 1$
- Otherwise decide $\mathcal{C}_2$
- $n_j = |\mathcal{D}_j|$ ($j = 1, 2$): number of training samples in class $\mathcal{C}_j$
- $\mathcal{D}_j$: set of training samples labeled as $\mathcal{C}_j$
- For large $n$, this has both high time and high memory requirements
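A compact sketch of this class-conditional Parzen rule (the names and toy data are mine; I also use a Gaussian window instead of the slide's hypercube, which is a substitution on my part):

```python
import numpy as np

def parzen_class_conditional(x, X_class, h):
    """(1/n_j) * sum of Gaussian kernel values: a Parzen estimate of p(x | C_j)."""
    d = X_class.shape[1]
    sq_dists = np.sum((X_class - x) ** 2, axis=1)
    kernels = np.exp(-sq_dists / (2 * h**2)) / ((2 * np.pi) ** (d / 2) * h**d)
    return kernels.mean()

# Toy usage: two Gaussian classes with equal priors.
rng = np.random.default_rng(0)
X1 = rng.normal(0, 1, size=(100, 2))     # training samples labeled C_1
X2 = rng.normal(3, 1, size=(100, 2))     # training samples labeled C_2
x = np.array([0.2, 0.1])
score1 = parzen_class_conditional(x, X1, h=0.5) * 0.5   # p(x | C_1) * P(C_1)
score2 = parzen_class_conditional(x, X2, h=0.5) * 0.5   # p(x | C_2) * P(C_2)
print("decide C_1" if score1 > score2 else "decide C_2")
```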

25. Parzen window & generative classification: example
- [Figure: results for a smaller $h$ and a larger $h$] [Duda, Hart, and Stork]

26. Estimate the posterior
- $p(\mathbf{x} \mid y = j) = \dfrac{k_j}{n_j V}$
- $p(y = j) = \dfrac{n_j}{n}$
- $p(\mathbf{x}) = \dfrac{k}{n V}$
- $p(y = j \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid y = j)\, p(y = j)}{p(\mathbf{x})} = \dfrac{k_j}{k}$
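A minimal sketch of this posterior estimate via the $k$ nearest neighbors (helper names and toy data are mine): the estimate of $p(y = j \mid \mathbf{x})$ is simply the fraction of the $k$ neighbors of $\mathbf{x}$ that carry label $j$.

```python
import numpy as np

def knn_posterior(x, X_train, y_train, k=10):
    """Estimate p(y = j | x) = k_j / k over the k nearest neighbors of x."""
    nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return dict(zip(labels, counts / k))           # class label -> k_j / k

# Toy usage with two overlapping 2-D classes.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
print(knn_posterior(np.array([1.0, 1.0]), X_train, y_train, k=10))
```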

