
Nearest neighbors. Kernel functions, SVM. Decision trees. Petr Pošík - PowerPoint PPT Presentation



  1. CZECH TECHNICAL UNIVERSITY IN PRAGUE, Faculty of Electrical Engineering, Department of Cybernetics. Nearest neighbors. Kernel functions, SVM. Decision trees. Petr Pošík, Czech Technical University in Prague, Faculty of Electrical Engineering, Dept. of Cybernetics. P. Pošík © 2020, Artificial Intelligence (43 slides).

  2. Nearest neighbors

  3. Method of k nearest neighbors
  ■ A simple, non-parametric, instance-based method for supervised learning, applicable to both classification and regression.
  ■ Do not confuse k-NN with
    • k-means (a clustering algorithm),
    • NN (neural networks).
  ■ Training: just remember the whole training dataset T.
  ■ Prediction: to get the model prediction for a new data point x (the query),
    • find the set $N_k(x)$ of the k nearest neighbors of x in T using a certain distance measure,
    • in case of classification, determine the predicted class $\hat{y} = h(x)$ as the majority vote among the nearest neighbors, i.e.
      $\hat{y} = h(x) = \arg\max_y \sum_{(x', y') \in N_k(x)} I(y' = y)$,
      where $I(P)$ is the indicator function (returns 1 if P is true, 0 otherwise),
    • in case of regression, determine the predicted value $\hat{y} = h(x)$ as the average of the values y of the nearest neighbors, i.e.
      $\hat{y} = h(x) = \frac{1}{k} \sum_{(x', y') \in N_k(x)} y'$.
  ■ What is the influence of k on the final model?
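The prediction rule above is easy to implement directly. The following is a minimal sketch, assuming Euclidean distance and NumPy arrays; the names (knn_predict, X_train, y_train, x_query) are illustrative and not from the slides.

```python
# Minimal sketch of the k-NN prediction rule described above.
# Assumptions: Euclidean distance, training data as NumPy arrays.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    # Distances from the query x to every point of the stored training set T.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k nearest neighbors, i.e. the set N_k(x).
    nn_idx = np.argsort(dists)[:k]
    nn_targets = y_train[nn_idx]
    if task == "classification":
        # Majority vote: arg max over y of the number of neighbors with y' = y.
        return Counter(nn_targets.tolist()).most_common(1)[0][0]
    # Regression: average of the neighbors' target values, (1/k) * sum of y'.
    return float(nn_targets.mean())

# Usage (illustrative): knn_predict(X, y, np.array([2.0, 3.5]), k=5)
```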

  4. Question
  The influence of method parameters on model flexibility:
  ■ Polynomial models: the larger the degree of the polynomial, the higher the model flexibility.
  ■ Basis expansion: the more basis functions we derive, the higher the model flexibility.
  ■ Regularization: the higher the penalty on coefficient size, the lower the model flexibility.
  What is the influence of the number of neighbors k on the flexibility of k-NN?
  A The flexibility of k-NN does not depend on k.
  B The flexibility of k-NN grows with growing k.
  C The flexibility of k-NN drops with growing k.
  D The flexibility of k-NN first drops with growing k, then it grows again.

  5. k-NN classification: Example
  [Figure: six example classification plots for different values of k; both axes range from 0 to 10.]
  ■ Only with 1-NN are all training examples classified correctly (unless there are two identical observations with different labels).
  ■ Unbalanced classes may be an issue: the more frequent class takes over with increasing k.
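Both effects can be checked with a quick experiment: fit k-NN classifiers with increasing k on an imbalanced two-class dataset and inspect the training accuracy and the minority-class recall. A sketch using scikit-learn; the dataset and all parameter values are arbitrary choices for illustration.

```python
# Illustrative sketch (not from the slides): the effect of increasing k on an
# imbalanced two-class problem. Requires scikit-learn.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Two classes with roughly 90% / 10% prior probabilities.
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)

for k in (1, 5, 25, 101):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    pred = clf.predict(X)
    train_acc = (pred == y).mean()
    minority_recall = (pred[y == 1] == 1).mean()
    # 1-NN reproduces the training labels; with large k the majority class
    # takes over and the minority class is rarely predicted.
    print(f"k={k:3d}  train acc={train_acc:.2f}  minority recall={minority_recall:.2f}")
```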

  6. k-NN regression example
  The training data:
  [Figure: plot of the training data.]

  7. k-NN regression example
  [Figure: six plots of the predicted values for different values of k.]
  ■ For small k, the surface is rugged.
  ■ For large k, too much averaging (smoothing) takes place.
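The rugged-versus-smooth behaviour can be reproduced on a 1-D toy problem. A sketch using scikit-learn's k-NN regressor; the sine-plus-noise data and the values of k are invented for illustration.

```python
# Illustrative sketch (not from the slides): k-NN regression on noisy 1-D data,
# comparing a small and a large k. Requires NumPy and scikit-learn.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, size=80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=80)

X_grid = np.linspace(0, 10, 500).reshape(-1, 1)
for k in (1, 5, 40):
    model = KNeighborsRegressor(n_neighbors=k).fit(X, y)
    y_hat = model.predict(X_grid)
    # Small k: the prediction jumps between individual noisy samples (rugged).
    # Large k: heavy averaging flattens the underlying sine shape (over-smoothing).
    print(f"k={k:2d}  std of prediction = {y_hat.std():.2f}")
```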

  8. k-NN Summary
  Comments:
  ■ For 1-NN, the division of the input space into convex cells is called a Voronoi tessellation.
  ■ A weighted variant can be constructed:
    • each of the k nearest neighbors gets a weight inversely proportional to its distance to the query point,
    • the prediction is then made by weighted voting (in case of classification) or weighted averaging (in case of regression).
  ■ In regression tasks, instead of averaging you can use e.g. (weighted) linear regression to compute the prediction.
  Advantages:
  ■ A simple and widely applicable method.
  ■ Works for both classification and regression tasks.
  ■ Works for both categorical and continuous predictors (independent variables).
  Disadvantages:
  ■ The whole training set must be stored (although there are methods for training-set reduction).
  ■ During prediction, the distances to all training data points must be computed (this can be alleviated, e.g., by storing the training set in a KD-tree).
  Overfitting prevention:
  ■ Choose the right value of k, e.g., using cross-validation (see the sketch below).
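The remedies mentioned above (distance weighting, a KD-tree for faster neighbor search, and cross-validation for choosing k) are available in common libraries. A sketch using scikit-learn; the dataset and the range of k values are arbitrary choices for illustration.

```python
# Illustrative sketch (not from the slides): a distance-weighted k-NN classifier
# backed by a KD-tree, with k chosen by cross-validation. Requires scikit-learn.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

best_k, best_score = None, -1.0
for k in range(1, 32, 2):
    clf = KNeighborsClassifier(n_neighbors=k,
                               weights="distance",   # weight neighbors by 1/distance
                               algorithm="kd_tree")  # KD-tree instead of brute-force search
    score = cross_val_score(clf, X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print(f"selected k = {best_k}, cross-validated accuracy = {best_score:.3f}")
```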

  9. Support vector machine

  10. Revision
  Optimal separating hyperplane:
  ■ A way to find a linear classifier that is optimal in a certain sense by means of a quadratic program (the dual task for the soft-margin version):
    maximize, w.r.t. $\alpha_1, \ldots, \alpha_{|T|}, \mu_1, \ldots, \mu_{|T|}$,
      $\sum_{i=1}^{|T|} \alpha_i - \frac{1}{2} \sum_{i=1}^{|T|} \sum_{j=1}^{|T|} \alpha_i \alpha_j y^{(i)} y^{(j)} x^{(i)T} x^{(j)}$
    subject to $\alpha_i \geq 0$, $\mu_i \geq 0$, $\alpha_i + \mu_i = C$, and $\sum_{i=1}^{|T|} \alpha_i y^{(i)} = 0$.
  ■ The parameters of the hyperplane are given in terms of a weighted linear combination of the support vectors:
    $w = \sum_{i=1}^{|T|} \alpha_i y^{(i)} x^{(i)}$, and $w_0 = y^{(k)} - x^{(k)} w^T$ for a suitable support vector $(x^{(k)}, y^{(k)})$.
  Basis expansion:
  ■ Instead of a linear model $\langle w, x \rangle$, create a linear model of nonlinearly transformed features $\langle w', \Phi(x) \rangle$, which represents a nonlinear model in the original space.
  What if we put these two things together?
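The relation between the dual solution and the hyperplane parameters can be checked numerically. A sketch using scikit-learn's SVC with a linear kernel, whose dual_coef_ attribute stores the products alpha_i * y^(i) for the support vectors; the blob data and C are arbitrary choices for illustration.

```python
# Illustrative sketch (not from the slides): recover w of a soft-margin linear
# SVM from its dual solution. Requires NumPy and scikit-learn.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
y = np.where(y == 0, -1, 1)          # labels in {-1, +1}, as in the dual above

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y^(i) for the support vectors only, so
# w = sum_i alpha_i y^(i) x^(i) reduces to a sum over the support vectors.
w = clf.dual_coef_ @ clf.support_vectors_
w0 = clf.intercept_

print(w, clf.coef_)   # the two should match for the linear kernel
print(w0)             # the bias term reported by the fitted model
```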
