
CS 391L: Machine Learning: Instance Based Learning
Raymond J. Mooney
University of Texas at Austin

Instance-Based Learning
• Unlike other learning algorithms, instance-based learning does not construct an explicit abstract generalization; it classifies new instances by direct comparison with, and similarity to, known training instances.
• Training can be very easy: just memorize the training instances.
• Testing can be very expensive, requiring detailed comparison to all past training instances.
• Also known as:
  – Case-based
  – Exemplar-based
  – Nearest Neighbor
  – Memory-based
  – Lazy Learning

Similarity/Distance Metrics
• Instance-based methods assume a function for determining the similarity or distance between any two instances.
• For continuous feature vectors, Euclidean distance is the generic choice:

$$d(x_i, x_j) = \sqrt{\sum_{p=1}^{n} \left( a_p(x_i) - a_p(x_j) \right)^2}$$

  where $a_p(x)$ is the value of the $p$th feature of instance $x$.
• For discrete features, assume the distance between two values is 0 if they are the same and 1 if they are different (e.g. Hamming distance for bit vectors).
• To compensate for differences in units across features, scale all continuous values to the interval [0,1].

Other Distance Metrics
• Mahalanobis distance
  – Scale-invariant metric that normalizes for variance.
• Cosine Similarity
  – Cosine of the angle between the two vectors.
  – Used in text and other high-dimensional data.
• Pearson correlation
  – Standard statistical correlation coefficient.
  – Used for bioinformatics data.
• Edit distance
  – Used to measure distance between unbounded length strings.
  – Used in text and bioinformatics.

K-Nearest Neighbor
• Calculate the distance between a test point and every training instance.
• Pick the k closest training examples and assign the test instance to the most common category amongst these nearest neighbors.
• Voting among multiple neighbors helps decrease susceptibility to noise.
• Usually use an odd value for k to avoid ties.

5-Nearest Neighbor Example
[Figure: 5-nearest neighbor example]
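The K-Nearest Neighbor steps above (scale features to [0,1], compute Euclidean distance to every training instance, take a majority vote among the k closest) can be sketched roughly as follows. This is a minimal illustration, not code from the lecture: the function names `min_max_scale`, `euclidean`, and `knn_classify` and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter

def min_max_scale(rows):
    """Scale each continuous feature (column) of the training set to [0, 1].

    Returns the scaled rows plus a function that applies the same scaling
    to a new instance (hypothetical helper, not from the slides).
    """
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]

    def scale(row):
        # Constant features carry no information, so map them to 0.
        return [(v - lo) / (hi - lo) if hi > lo else 0.0
                for v, lo, hi in zip(row, lows, highs)]

    return [scale(r) for r in rows], scale

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(train_x, train_y, query, k=5):
    """Linear scan over all training instances, then a majority vote."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data (invented): two features measured in very different units.
train = [[1.0, 200.0], [2.0, 180.0], [1.5, 220.0],
         [8.0, 900.0], [9.0, 950.0], [7.5, 870.0]]
labels = ["neg", "neg", "neg", "pos", "pos", "pos"]

scaled, scale = min_max_scale(train)
print(knn_classify(scaled, labels, scale([2.5, 250.0]), k=3))  # neg
```

Note that training here is only the scaling pass and storing the data; all the distance computation happens at query time, which is the cost trade-off described in the slides.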

Implicit Classification Function
• Although it is not necessary to explicitly calculate it, the learned classification rule is based on regions of the feature space closest to each training example.
• For 1-nearest neighbor with Euclidean distance, the Voronoi diagram gives the complex polyhedra segmenting the space into the regions closest to each point.

Efficient Indexing
• Linear search to find the nearest neighbors is not efficient for large training sets.
• Indexing structures can be built to speed testing.
• For Euclidean distance, a kd-tree can be built that reduces the expected time to find the nearest neighbor to O(log n) in the number of training examples.
  – Nodes branch on threshold tests on individual features and leaves terminate at nearest neighbors.
• Other indexing structures are possible for other metrics or string data.
  – Inverted index for text retrieval.

Nearest Neighbor Variations
• Can be used to estimate the value of a real-valued function (regression) by taking the average function value of the k nearest neighbors to an input point.
• All training examples can be used to help classify a test instance by giving every training example a vote that is weighted by the inverse square of its distance from the test instance.

Feature Relevance and Weighting
• Standard distance metrics weight each feature equally when determining similarity.
  – Problematic if many features are irrelevant, since similarity along many irrelevant features could mislead the classification.
• Features can be weighted by some measure that indicates their ability to discriminate the category of an example, such as information gain.
• Overall, instance-based methods favor global similarity over concept simplicity.
[Figure: training data marked + and –, with a test instance marked ??]

Rules and Instances in Human Learning Biases
• Psychological experiments show that people from different cultures exhibit distinct categorization biases.
• "Western" subjects favor simple rules (straight stem) and classify the target object in group 2.
• "Asian" subjects favor global similarity and classify the target object in group 1.

Other Issues
• Can reduce storage of training instances to a small set of representative examples.
  – Support vectors in an SVM are somewhat analogous.
• Can hybridize with rule-based methods or neural-net methods.
  – Radial basis functions in neural nets and Gaussian kernels in SVMs are similar.
• Can be used for more complex relational or graph data.
  – Similarity computation is complex since it involves some sort of graph isomorphism.
• Can be used in problems other than classification.
  – Case-based planning
  – Case-based reasoning in law and business.
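The two options listed under Nearest Neighbor Variations can be sketched as below: a vote over all training examples weighted by the inverse square of distance, and regression by averaging the values of the k nearest neighbors. The names `weighted_vote` and `knn_regress`, the small `eps` term (an assumption added here to avoid division by zero when a query coincides with a training point), and the toy data are illustrative only.

```python
import math
from collections import defaultdict

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def weighted_vote(train_x, train_y, query, eps=1e-12):
    """Every training example votes with weight 1 / distance^2.

    eps is an assumed safeguard, not part of the slides.
    """
    scores = defaultdict(float)
    for x, label in zip(train_x, train_y):
        d2 = sum((a - b) ** 2 for a, b in zip(x, query))
        scores[label] += 1.0 / (d2 + eps)
    return max(scores, key=scores.get)

def knn_regress(train_x, train_values, query, k=3):
    """Estimate a real value as the mean over the k nearest neighbors."""
    ranked = sorted(zip(train_x, train_values),
                    key=lambda pair: euclidean(pair[0], query))
    nearest = ranked[:k]
    return sum(value for _, value in nearest) / len(nearest)

# Toy one-dimensional data (invented for illustration).
points = [[0.0], [1.0], [2.0], [10.0]]
labels = ["a", "a", "b", "b"]
values = [0.0, 1.0, 4.0, 100.0]

# The very close neighbor at 2.0 dominates the weighted vote, although an
# unweighted 3-nearest-neighbor vote would pick "a" (two of three neighbors).
print(weighted_vote(points, labels, [1.8]))      # b
print(knn_regress(points, values, [1.2], k=3))   # mean of 1.0, 4.0 and 0.0
```

Because the weights fall off with squared distance, distant examples contribute almost nothing, so the weighted vote behaves like a soft version of nearest neighbor without needing to choose k.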

Conclusions
• IBL methods classify test instances based on similarity to specific training instances rather than forming explicit generalizations.
• Typically trade decreased training time for increased testing time.
