Performance Metrics for Graph Mining Tasks
Outline:
– Introduction to Performance Metrics
– Supervised Learning Performance Metrics
– Unsupervised Learning Performance Metrics
– Optimizing Metrics
– Statistical Significance
A performance metric measures how well a data mining algorithm performs on a given dataset. For example, if we apply a classification algorithm to a dataset, we first check how many of the data points were classified correctly. This is a performance metric, and its formal name is "accuracy." Performance metrics also help us decide if one algorithm is better or worse than another. For example, suppose classification algorithm A classifies 80% of the data points correctly and classification algorithm B classifies 90% correctly. We immediately realize that algorithm B is doing better. There are some intricacies, however, that we will discuss in this chapter.
Outline: Supervised Learning Performance Metrics
A 2×2 matrix is used to tabulate the results of a 2-class supervised learning problem; entry (i, j) represents the number of elements with actual class label i that were predicted to have class label j. Here + and − are the two class labels.

                          Predicted Class
                          +                   −
Actual Class   +    True Positive       False Negative
               −    False Positive      True Negative
Results from a Classification Algorithm

Vertex ID   Actual Class   Predicted Class
1           +              +
2           +              +
3           +              +
4           +              +
5           +              −
(Vertices 6 to 8 have actual class −; two of them are predicted + and one is predicted −.)

Corresponding 2×2 matrix for the given table:

                     Predicted Class
                     +         −
Actual Class   +     4         1        C = 5
               −     2         1        D = 3
                   A = 6     B = 2      T = 8

We walk through the different metrics using this example.
Using the notation of the 2×2 matrix above (A = number predicted "+", C = number truly "+", T = total number of points):
– Accuracy = (True Positives + True Negatives) / T, the fraction of points classified correctly
– Error rate = (False Positives + False Negatives) / T
– Recall = True Positives / C, the fraction of points that are truly "+" that are also predicted as "+"
– Precision = True Positives / A, the fraction of points predicted as "+" that are truly "+"
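As a quick illustration, here is a minimal Python sketch that tabulates the 2×2 matrix and computes these metrics for the eight-vertex example above. The predictions for vertices 6 to 8 are assumed values chosen only to be consistent with the matrix on the previous slide.

```python
# Minimal sketch of the 2x2 confusion-matrix metrics above.
# Vertices 6-8 below are assumed assignments consistent with the matrix shown.
actual    = ["+", "+", "+", "+", "+", "-", "-", "-"]
predicted = ["+", "+", "+", "+", "-", "+", "+", "-"]

# Tabulate the 2x2 confusion matrix: entry (i, j) = actual i, predicted j
f11 = sum(a == "+" and p == "+" for a, p in zip(actual, predicted))  # true positives
f10 = sum(a == "+" and p == "-" for a, p in zip(actual, predicted))  # false negatives
f01 = sum(a == "-" and p == "+" for a, p in zip(actual, predicted))  # false positives
f00 = sum(a == "-" and p == "-" for a, p in zip(actual, predicted))  # true negatives

T = f11 + f10 + f01 + f00      # total number of points (8)
A, C = f11 + f01, f11 + f10    # predicted-"+" column sum, actual-"+" row sum

print("accuracy :", (f11 + f00) / T)   # 5/8 = 0.625
print("error    :", (f10 + f01) / T)   # 3/8 = 0.375
print("recall   :", f11 / C)           # 4/5 = 0.8
print("precision:", f11 / A)           # 4/6 ~ 0.67
```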
An n×n matrix, where n is the number of classes, and entry (i, j) represents the number of elements with class label i but predicted to have class label j
                          Predicted Class
                 Class 1   Class 2   Class 3   Marginal Sum of Actuals
Actual  Class 1     2         1         1                4
Class   Class 2     1         2         1                4
        Class 3     1         2         3                6
Marginal Sum
of Predictions      4         5         5             T = 14
2×2 Matrix Specific to Class 1

                              Predicted Class
                         Class 1 (+)   Not Class 1 (−)
Actual  Class 1 (+)           2               2          C = 4
Class   Not Class 1 (−)       2               8          D = 10
                            A = 4           B = 10       T = 14

Accuracy = (2 + 8)/14 = 10/14; Error rate = (2 + 2)/14 = 4/14; Recall = 2/4; Precision = 2/4
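The same numbers can be reproduced programmatically. The sketch below (plain NumPy) collapses the 3×3 confusion matrix above into the 2×2 matrix specific to Class 1 and recomputes the four metrics.

```python
import numpy as np

# Collapse the 3x3 confusion matrix into the 2x2 matrix specific to Class 1.
M = np.array([[2, 1, 1],
              [1, 2, 1],
              [1, 2, 3]])   # rows = actual class, columns = predicted class

k = 0                                         # index of the class of interest (Class 1)
tp = M[k, k]                                  # actual 1, predicted 1          -> 2
fn = M[k, :].sum() - tp                       # actual 1, predicted not-1      -> 2
fp = M[:, k].sum() - tp                       # actual not-1, predicted 1      -> 2
tn = M.sum() - tp - fn - fp                   # actual not-1, predicted not-1  -> 8
T = M.sum()                                   # 14

print("accuracy :", (tp + tn) / T)   # 10/14
print("error    :", (fp + fn) / T)   # 4/14
print("recall   :", tp / (tp + fn))  # 2/4
print("precision:", tp / (tp + fp))  # 2/4
```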
For a class L, the F-measure is the ratio of twice the number of vertices correctly predicted as L to the sum of the number of vertices that truly belong to L and the number of vertices predicted as L.
The bias for class L is the ratio of the number of points predicted as L to the number of points that truly belong to L. Bias helps understand whether a model is over- or under-predicting a class.
ROC (Receiver Operating Characteristic) curves: metrics that are plotted on a graph to obtain a visual picture of the performance of two-class classifiers.
[ROC plot: True Positive Rate (y-axis) vs. False Positive Rate (x-axis), each ranging from 0 to 1]
– (0, 1) is the ideal point
– (0, 0): predicts the −ve class all the time
– (1, 1): predicts the +ve class all the time
– The diagonal corresponds to random guessing (AUC = 0.5)
Plot the performance of multiple models on the same axes to decide which one performs best.
[ROC plot divided into regions relative to the AUC = 0.5 diagonal]
– Models that lie below the diagonal perform worse than random. Note: such models can be negated to move them above the diagonal.
– Models that lie in the upper left have good performance. Note: this is where you aim to get the model.
– Models that lie in the lower left are conservative: they will not predict "+" unless there is strong evidence, so they have low false positives but high false negatives.
– Models that lie in the upper right are liberal: they will predict "+" with little evidence, so they have high false positives.
[ROC plot with three models: M1 at (FPR, TPR) = (0.1, 0.8), M2 at (0.5, 0.5), M3 at (0.3, 0.5)]
M1's performance lies furthest in the upper-left direction, closest to the ideal point (0, 1), and hence M1 is considered the best model.
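A minimal sketch of this comparison: ranking the three models by the Euclidean distance of their (FPR, TPR) point from the ideal corner (0, 1). The distance-to-ideal heuristic is an assumption used here for illustration, not a criterion taken from the slides.

```python
import math

# Rank models by how close their ROC point is to the ideal corner (0, 1);
# smaller distance is better (a simple illustrative heuristic).
models = {"M1": (0.1, 0.8), "M2": (0.5, 0.5), "M3": (0.3, 0.5)}

for name, point in sorted(models.items(), key=lambda kv: math.dist(kv[1], (0.0, 1.0))):
    print(name, "distance to ideal:", round(math.dist(point, (0.0, 1.0)), 3))
# M1 is closest to (0, 1), matching the conclusion that M1 is the best model.
```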
Cross-validation, also called rotation estimation, is a way to analyze how a predictive data mining model will perform on an unknown dataset, i.e., how well the model generalizes. Strategy:
1. Divide the dataset into two non-overlapping subsets.
2. One subset is called the "test" set and the other the "training" set.
3. Build the model using the "training" set.
4. Obtain predictions for the "test" set.
5. Use the "test" set predictions to calculate all the performance metrics.
Typically cross-validation is performed for multiple iterations, selecting a different non-overlapping test and training set each time.
– Hold-out method: randomly select a portion of the data (e.g., 1/3rd) as the test set and the remaining 2/3rd as training.
– k-fold cross-validation: divide the data into k non-overlapping partitions, using one partition as the test set and the remaining k−1 partitions for training, and rotate so that each partition serves as the test set once.
Note: Selection of data points is typically done in a stratified manner, i.e., the class distribution in the test set is kept similar to that of the training set.
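A minimal k-fold cross-validation sketch (plain NumPy) is shown below; the dataset and the majority-class "model" are placeholders, and stratified selection of the folds is omitted for brevity.

```python
import numpy as np

# Minimal k-fold cross-validation sketch; any classifier could replace the
# placeholder "majority class" model used here.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 4))           # hypothetical feature matrix
y = rng.integers(0, 2, size=90)        # hypothetical binary labels
k = 3                                  # number of folds

indices = rng.permutation(len(X))      # shuffle once, then split into k folds
folds = np.array_split(indices, k)

accuracies = []
for i in range(k):
    test_idx = folds[i]                                                  # one fold as test
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])  # rest as training

    # Placeholder "model": predict the majority class of the training fold.
    majority = np.bincount(y[train_idx]).argmax()
    predictions = np.full(len(test_idx), majority)

    accuracies.append(np.mean(predictions == y[test_idx]))

print("per-fold accuracy:", [round(a, 2) for a in accuracies])
print("mean accuracy    :", round(float(np.mean(accuracies)), 2))
```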
Outline: Unsupervised Learning Performance Metrics
One way to test the effectiveness of an unsupervised learning method is to take a dataset D with known class labels, strip off the labels, and provide the set as input to the unsupervised learning algorithm U. The resulting clusters are then compared with the prior knowledge (the stripped labels) to judge the performance of U. To evaluate performance, a pairwise contingency table is constructed as follows.
                     Same Cluster    Different Cluster
Same Class                u11              u10
Different Class           u01              u00

(A) To fill the table, initialize u11, u01, u10, u00 to 0.
(B) Then, for each pair of points (v, w): increment u11 if v and w have the same class label and the same cluster, u10 if they have the same class label but different clusters, u01 if they have different class labels but the same cluster, and u00 if they have different class labels and different clusters.
Rand statistic: R = (u11 + u00) / (u11 + u10 + u01 + u00), where both placing a pair of points with the same class label in the same cluster and placing a pair of points with different class labels in different clusters are given equal importance, i.e., it accounts for both the specificity and sensitivity of the clustering.
Jaccard coefficient: J = u11 / (u11 + u10 + u01), used when placing a pair of points with the same class label in the same cluster is primarily important.
Example Matrix
                     Same Cluster    Different Cluster
Same Class                 9                4
Different Class            3               12
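From the definitions above, this example gives Rand = (9 + 12) / (9 + 4 + 3 + 12) = 21/28 = 0.75 and Jaccard = 9 / (9 + 4 + 3) = 9/16 ≈ 0.56. The sketch below shows how the u-counts and both indices can be computed from raw class labels and cluster assignments; the labels and clusters in it are made up purely for illustration.

```python
from itertools import combinations

# Fill the pairwise table from class labels and cluster assignments, then
# compute the Rand statistic and Jaccard coefficient defined above.
classes  = ["A", "A", "A", "B", "B", "B"]     # made-up class labels
clusters = [ 1 ,  1 ,  2 ,  2 ,  2 ,  2 ]     # made-up cluster assignments

u11 = u10 = u01 = u00 = 0
for v, w in combinations(range(len(classes)), 2):
    same_class   = classes[v]  == classes[w]
    same_cluster = clusters[v] == clusters[w]
    if   same_class and same_cluster:       u11 += 1
    elif same_class and not same_cluster:   u10 += 1
    elif same_cluster and not same_class:   u01 += 1
    else:                                   u00 += 1

rand    = (u11 + u00) / (u11 + u10 + u01 + u00)
jaccard = u11 / (u11 + u10 + u01)
print(u11, u10, u01, u00, round(rand, 2), round(jaccard, 2))

# For the example matrix above: Rand = (9 + 12) / 28 = 0.75,
# Jaccard = 9 / (9 + 4 + 3) = 0.5625.
```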
Given that the number of points is T, the ideal-matrix is a T×T matrix where cell (i, j) is 1 if points i and j belong to the same class and 0 if they belong to different classes. The observed-matrix is a T×T matrix where cell (i, j) is 1 if points i and j belong to the same cluster and 0 if they belong to different clusters. The clustering can then be evaluated by the correlation between the ideal and observed matrices, which have the same rank. The two matrices, in this case, are symmetric and, hence, it is sufficient to analyze the entries above (or below) the diagonal of each matrix.
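A minimal sketch of this matrix-based comparison, assuming the two matrices are compared via the correlation of their above-diagonal entries; the class labels and cluster assignments are made up for illustration.

```python
import numpy as np

# Build the ideal and observed T x T matrices, take the entries above the
# diagonal (both matrices are symmetric), and correlate them.
classes  = np.array(["A", "A", "A", "B", "B", "B"])
clusters = np.array([ 1,   1,   2,   2,   2,   2 ])

ideal    = (classes[:, None]  == classes[None, :]).astype(int)   # 1 if same class
observed = (clusters[:, None] == clusters[None, :]).astype(int)  # 1 if same cluster

iu = np.triu_indices(len(classes), k=1)          # indices above the diagonal
correlation = np.corrcoef(ideal[iu], observed[iu])[0, 1]
print("correlation between ideal and observed matrices:", round(correlation, 3))
```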
In the absence of prior knowledge we have to rely on information from the clusters themselves to evaluate performance.
1. Cohesion measures how closely related the objects in the same cluster are.
2. Separation measures how distinct or well-separated a cluster is from the other clusters.
Here, g_i refers to cluster i, W is the total number of clusters, x and y are data points, and proximity can be any similarity measure (e.g., cosine similarity):
Cohesion(g_i) = (1 / |g_i|²) · Σ_{x ∈ g_i} Σ_{y ∈ g_i} proximity(x, y)
Separation(g_i, g_j) = (1 / (|g_i| · |g_j|)) · Σ_{x ∈ g_i} Σ_{y ∈ g_j} proximity(x, y)
We want the cohesion to be close to 1 and the separation to be close to 0.
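A minimal sketch of cohesion and separation with cosine similarity as the proximity measure; the averaged form of the formulas and the two small clusters are assumptions made for illustration.

```python
import numpy as np

# Cohesion and separation with cosine similarity as the proximity measure.
def cosine(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

g1 = np.array([[1.0, 0.1], [0.9, 0.2], [1.1, 0.0]])   # made-up cluster g1
g2 = np.array([[0.1, 1.0], [0.2, 0.9]])               # made-up cluster g2

def cohesion(g):
    # average proximity over all pairs of points within the same cluster
    return float(np.mean([cosine(x, y) for x in g for y in g]))

def separation(g_i, g_j):
    # average proximity between points of two different clusters
    return float(np.mean([cosine(x, y) for x in g_i for y in g_j]))

print("cohesion(g1)      :", round(cohesion(g1), 3))       # close to 1 is good
print("separation(g1, g2):", round(separation(g1, g2), 3)) # close to 0 is good
```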
Outline: Optimizing Metrics
The sum of squared errors (SSE) is typically used in clustering algorithms to measure the quality of the clusters obtained. It takes into consideration the distance between each point in a cluster and its cluster center (the centroid or some other chosen representative). For a point d_j in cluster g_i, where m_i is the cluster center of g_i and W is the total number of clusters, SSE is defined as follows:
SSE = Σ_{i=1}^{W} Σ_{d_j ∈ g_i} dist(d_j, m_i)²
This value is small when points are close to their cluster centers, indicating a good clustering; clustering algorithms therefore aim to minimize SSE.
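A minimal SSE sketch over made-up points and cluster assignments:

```python
import numpy as np

# SSE: sum of squared distances of each point to its cluster center,
# accumulated over all W clusters. Points and assignments are made up.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],    # cluster 0
                   [5.0, 5.0], [5.1, 4.8]])               # cluster 1
assignment = np.array([0, 0, 0, 1, 1])

sse = 0.0
for i in np.unique(assignment):             # loop over the W clusters
    members = points[assignment == i]
    center = members.mean(axis=0)           # centroid m_i of cluster g_i
    sse += np.sum((members - center) ** 2)  # squared distances to the center
print("SSE:", round(float(sse), 4))
```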
Preserved variability is typically used in eigenvector-based dimension reduction techniques to quantify the variance preserved by the chosen dimensions. Given that the points are originally represented in r dimensions and k dimensions are retained (k << r), with eigenvalues λ1 ≥ λ2 ≥ ... ≥ λ(r−1) ≥ λr, the preserved variability (PV) is calculated as follows:
PV = (Σ_{i=1}^{k} λ_i) / (Σ_{i=1}^{r} λ_i)
The value of this parameter depends on the number of dimensions chosen: the more included, the higher the value. Choosing all the dimensions results in the perfect score of 1.
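A minimal preserved-variability sketch, assuming the eigenvalues come from the covariance matrix of the data (as in PCA); the data is made up.

```python
import numpy as np

# Preserved variability: keep the k largest eigenvalues of the covariance
# matrix and compute PV = (sum of top-k eigenvalues) / (sum of all eigenvalues).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])  # r = 5 dims

eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))  # ascending order
eigenvalues = eigenvalues[::-1]                            # sort descending

k = 2
pv = eigenvalues[:k].sum() / eigenvalues.sum()
print("preserved variability with k =", k, ":", round(float(pv), 3))
# Choosing all r dimensions gives PV = 1 by construction.
```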
Statistical significance of metrics. Scenario:
– We obtain, say, cohesion = 0.99 for clustering algorithm A. At first look, 0.99 feels like a very good score.
– However, it is possible that the underlying data is structured in such a way that you would get 0.99 no matter how you cluster the data.
– In that case, 0.99 is not very significant. One way to decide is by using statistical significance estimation.
We discuss the Monte Carlo procedure for this on the next slide.
The Monte Carlo procedure uses random sampling to assess whether a particular performance metric value we obtain could have been attained at random. For example, if we obtain a cohesion score of 0.99 for a cluster of size 5, we would be inclined to think that the cluster is very cohesive. However, this value could have resulted from the nature of the data and not from the algorithm. To test the significance of this 0.99 value we:
1. Sample N (usually 1000) random sets of size 5 from the dataset.
2. Compute the cohesion score of each random set.
3. Count the number R of random sets whose score is at least as good as the observed 0.99.
4. Estimate the p-value as R/N; a small p-value (e.g., below 0.05) indicates that the observed score is statistically significant.
Steps 1-4 are the Monte Carlo method for p-value estimation.
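A minimal sketch of steps 1-4 for the cohesion example; the dataset, the "observed" cluster, and the cohesion measure are all made up for illustration.

```python
import numpy as np

# Monte Carlo p-value estimation: sample N random sets of the same size,
# score each with the same metric, and report the fraction scoring at least
# as high as the observed value.
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 3))
data /= np.linalg.norm(data, axis=1, keepdims=True)   # unit vectors for cosine

def cohesion(points):
    sims = points @ points.T                           # pairwise cosine similarities
    return float(sims.mean())

observed_cluster = data[:5]                            # a "cluster" of size 5
observed = cohesion(observed_cluster)

N = 1000
random_scores = np.array([cohesion(data[rng.choice(len(data), 5, replace=False)])
                          for _ in range(N)])
p_value = float(np.mean(random_scores >= observed))    # step 4: empirical p-value
print("observed cohesion:", round(observed, 3), " p-value:", round(p_value, 3))
```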
Metrics that compare the performance of different algorithms. Scenario:
1) Model 1 provides an accuracy of 70% and Model 2 provides a higher overall accuracy.
2) At first look Model 2 seems better; however, it could be that Model 1 predicts Class1 better than Model 2 does.
3) And Class1 is indeed more important than Class2 for our problem.
4) We can use model comparison methods to take this notion of "importance" into consideration when we pick one model over another.
Cost-based analysis is an important model comparison method, discussed in the next few slides.
In real-world applications, certain aspects of model performance are considered more important than others. For example, if a person with cancer is diagnosed as cancer-free, or vice versa, the prediction model should be especially penalized. This penalty can be introduced in the form of a cost matrix.
Cost Matrix
                           Predicted Class
                 +                                  −
Actual    +   c11 (associated with f11 or u11)   c10 (associated with f10 or u10)
Class     −   c01 (associated with f01 or u01)   c00 (associated with f00 or u00)

Given the cost and confusion matrices for a model M (below), the cost of model M is given as:
Cost(M) = c11·f11 + c10·f10 + c01·f01 + c00·f00
Cost Matrix                               Confusion Matrix
             Predicted Class                           Predicted Class
             +       −                                 +       −
Actual  +   c11     c10                  Actual  +    f11     f10
Class   −   c01     c00                  Class   −    f01     f00
This analysis is typically used to select one model when we have more than one choice, e.g., models obtained from different algorithms or from different parameter settings of the learning algorithms.
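A minimal sketch of the cost computation: element-wise multiply the cost matrix with each model's confusion matrix and sum. The cost matrix assumes only a false negative costs 100, mirroring the example that follows; confusion-matrix entries not shown in the original slides are illustrative.

```python
import numpy as np

# Cost-based model comparison: Cost(M) = sum over i, j of c_ij * f_ij.
cost = np.array([[0, 100],     # c11, c10  (only a false negative is penalized here)
                 [0,   0]])    # c01, c00

confusion = {
    "Mx": np.array([[4, 1],    # f11, f10
                    [1, 2]]),  # f01, f00  (bottom row illustrative)
    "My": np.array([[3, 2],
                    [1, 2]]),
}

costs = {name: int((cost * f).sum()) for name, f in confusion.items()}
print(costs)                                  # {'Mx': 100, 'My': 200}
print("best model:", min(costs, key=costs.get))
```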
Example: comparing two models, Mx and My.
Cost matrix: misclassifying an actual "+" as "−" (a false negative) costs 100; the other outcomes cost 0.
Confusion matrix of Mx: of the actual "+" points, 4 are predicted "+" and 1 is predicted "−" (one false negative).
Confusion matrix of My: of the actual "+" points, 3 are predicted "+" and 2 are predicted "−" (two false negatives).
Cost of Mx: 100. Cost of My: 200. Since C_Mx < C_My, purely based on the cost model, Mx is the better model.