Link prediction in graph construction for supervised and semi-supervised learning
Lilian Berton, Jorge Valverde-Rebaza and Alneu de Andrade Lopes
Laboratory of Computational Intelligence (LABIC), University of São Paulo (USP), Brazil
July 2015

Outline
1. Introduction
2. Proposal
3. Experiments
4. Conclusion
Jorge Valverde-Rebaza, Link prediction in graph construction

Motivation
Networks, or graphs, are a powerful relational representation that has been employed in many machine learning tasks.
Link prediction is an important problem in network analysis that has attracted increasing attention in recent years.
Many social, biological and information systems can be naturally described as networks, while other data are available only as flat data.
To apply graph-based methods to flat data, it is necessary to convert the data into a network; furthermore, converting flat data to relational data can help improve classification accuracy.
Although many methods for graph construction have been proposed, it is still an open problem.

Objective and hypothesis
Propose a new method for graph construction using the link prediction intuition.
If a network is very sparse, for example when a minimum spanning tree is used, it lacks structural information needed by the inference algorithms.
If a network is very dense, for example when kNN with k > 10 is used, the excess edges become noise in the graph.
Starting from a basic graph structure, it is possible to add predicted edges, generating a new (balanced) graph structure.
This can improve the quality of the graphs, leading to better classification accuracy in supervised and semi-supervised learning (SSL) domains.

Graph Construction
Many data sets are available in a flat, tabular format.
To apply a graph-based algorithm, it is necessary to convert the data into a network.
We apply k-nearest neighbors (kNN), mutual kNN (MkNN) and minimum/maximum spanning trees (MinST/MaxST) to generate an initial graph.
Figure: Graph construction methods. Panels: (a) 3NN, (b) M3NN, (c) MinST, (d) MaxST.
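The four initial graphs can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the Euclidean distance, the use of Prim's algorithm and the toy data are all assumptions made for the example.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def directed_knn(X, k):
    """Boolean matrix with entry (i, j) set when j is one of i's k nearest points."""
    D = pairwise_distances(X)
    np.fill_diagonal(D, np.inf)            # a point is not its own neighbor
    n = D.shape[0]
    nearest = np.argsort(D, axis=1)[:, :k]
    N = np.zeros((n, n), dtype=bool)
    N[np.repeat(np.arange(n), k), nearest.ravel()] = True
    return N

def knn_graph(X, k):
    """kNN graph: keep an edge if either endpoint selects the other."""
    N = directed_knn(X, k)
    return N | N.T

def mutual_knn_graph(X, k):
    """Mutual kNN graph: keep an edge only if both endpoints select each other."""
    N = directed_knn(X, k)
    return N & N.T

def spanning_tree(cost):
    """Minimum spanning tree of a dense cost matrix (Prim's algorithm)."""
    cost = np.asarray(cost, dtype=float)
    n = cost.shape[0]
    A = np.zeros((n, n), dtype=bool)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best_cost = cost[0].copy()             # cheapest known link into the tree
    best_from = np.zeros(n, dtype=int)     # tree vertex providing that link
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best_cost)))
        i = best_from[j]
        A[i, j] = A[j, i] = True
        in_tree[j] = True
        closer = cost[j] < best_cost
        best_cost = np.where(closer, cost[j], best_cost)
        best_from = np.where(closer, j, best_from)
    return A

# Toy one-dimensional data set (hypothetical).
X = np.array([[0.0], [1.0], [3.0], [10.0]])
D = pairwise_distances(X)
knn1 = knn_graph(X, 1)
mknn1 = mutual_knn_graph(X, 1)
min_st = spanning_tree(D)
max_st = spanning_tree(-D)                 # MaxST: minimize the negated distances
```

On this toy data the 1NN graph and the MinST coincide (a chain), while the mutual 1NN graph keeps only the edge between the first two points, which shows how much sparser the mutual variant can be.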

Link Prediction (LP)
Link prediction (LP) addresses the problem of predicting the existence of missing relations or new ones.
Common Neighbors (c): $s^{c}_{v_i,v_j} = |\Gamma(v_i) \cap \Gamma(v_j)|$
Weighted CN (w): $s^{w}_{v_i,v_j} = \sum_{v_k \in \Gamma(v_i) \cap \Gamma(v_j)} \bigl( w(v_i,v_k) + w(v_k,v_j) \bigr)$
Katz (k): $s^{k}_{v_i,v_j} = \sum_{l=1}^{\infty} \beta^{l} \cdot |\mathrm{paths}^{\langle l \rangle}_{v_i,v_j}| = \beta A_{v_i,v_j} + \beta^{2} (A^{2})_{v_i,v_j} + \dots$
Here $\Gamma(v)$ is the set of neighbors of $v$, $w$ is the edge weight, $A$ is the adjacency matrix and $\beta$ is a damping factor.
Figure: Link prediction process.
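The three measures can be computed for all vertex pairs at once with matrix operations. The sketch below is illustrative only: the function names are hypothetical, the graph is assumed undirected, and the Katz series is evaluated through its closed form $(I - \beta A)^{-1} - I$, which is valid only when $\beta$ is smaller than the reciprocal of the largest eigenvalue magnitude of $A$.

```python
import numpy as np

def common_neighbors(A):
    """Entry (i, j) counts the vertices adjacent to both i and j."""
    A = np.asarray(A, dtype=float)
    return A @ A

def weighted_common_neighbors(W):
    """Entry (i, j) sums w(i, k) + w(k, j) over the common neighbors k."""
    W = np.asarray(W, dtype=float)
    A = (W != 0).astype(float)             # an edge exists where the weight is non-zero
    return W @ A + A @ W

def katz(A, beta):
    """Katz score: sum over l >= 1 of beta^l * (A^l), in closed form."""
    A = np.asarray(A, dtype=float)
    radius = np.abs(np.linalg.eigvalsh(A)).max()   # A is assumed symmetric
    if beta * radius >= 1:
        raise ValueError("beta must be smaller than 1 / spectral radius")
    I = np.eye(A.shape[0])
    return np.linalg.inv(I - beta * A) - I

# Path graph 0 - 1 - 2 with hypothetical edge weights 2 and 3.
W = np.array([[0, 2, 0],
              [2, 0, 3],
              [0, 3, 0]], dtype=float)
A = (W != 0).astype(float)

cn = common_neighbors(A)                   # cn[0, 2] is 1: vertex 1 is shared
wcn = weighted_common_neighbors(W)         # wcn[0, 2] is 2 + 3 = 5
kz = katz(A, beta=0.1)
```

Only the off-diagonal entries of pairs that are not yet connected matter for link prediction; the diagonal and the scores of existing edges are simply ignored.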

Proposal
To predict new links, a score $s_{v_i,v_j}$ is assigned to each pair of disconnected vertices $v_i$ and $v_j$.
All non-observed links are ranked according to their scores, and links connecting more similar nodes are assumed to be more likely to exist.
A percentage of the top-ranked links can then be added to the initial graph.
Figure: LP construction steps. Panels: (a) Dataset, (b) MinST, (c) MinST+LP (Katz, 30%).
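The ranking-and-selection step can be sketched as below. This is a hedged illustration rather than the authors' code: `add_top_links` is a hypothetical name, and the slides do not state what the percentage is taken of, so the example assumes it is a fraction of the disconnected pairs that received a strictly positive score.

```python
import numpy as np

def add_top_links(A, scores, fraction):
    """Return a copy of A with the top-scoring non-observed links added.

    Assumption: `fraction` is applied to the number of disconnected pairs
    whose score is strictly positive.
    """
    A = np.asarray(A, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    n = A.shape[0]
    iu, ju = np.triu_indices(n, k=1)       # each unordered pair once
    candidate = ~A[iu, ju] & (scores[iu, ju] > 0)
    iu, ju = iu[candidate], ju[candidate]
    order = np.argsort(-scores[iu, ju], kind="stable")   # highest score first
    n_new = int(np.ceil(fraction * len(order)))
    top = order[:n_new]
    B = A.copy()
    B[iu[top], ju[top]] = True
    B[ju[top], iu[top]] = True
    return B

# Path graph 0 - 1 - 2 - 3 scored with common neighbors.
A = np.zeros((4, 4), dtype=bool)
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = True
scores = A.astype(float) @ A.astype(float)

half = add_top_links(A, scores, 0.5)       # adds one of the two scored pairs
full = add_top_links(A, scores, 1.0)       # adds (0, 2) and (1, 3)
```

The pair (0, 3) has no common neighbor, so it is never added here; a path-based measure such as Katz would give it a small positive score instead.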

Datasets
Table: Data set descriptions for SSL classification
Data set     # Instances   # Attributes   # Classes
g241c        1500          241            2
g241n        1500          241            2
Digit 1      1500          241            2
USPS         1500          241            2
COIL 2       1500          241            2

Table: Data set descriptions for supervised classification
Data set     # Instances   # Attributes   # Classes
Wine         178           13             3
Ecoli        336           8              8
Customers    440           8              2
Cancer       699           10             2
Blood        748           5              2
Gaussians3   500           2              2
Gaussians5   500           2              2

SSL experimental setup
PCA was applied to reduce the data to 50 dimensions, since in high-dimensional data the distance to the nearest neighbor approaches the distance to the farthest neighbor, which degrades the quality of the graph.
10 and 100 labeled vertices were randomly selected.
We apply MinST, MaxST, kNN and MkNN with 1 ≤ k ≤ 20, and the LP graphs (our proposal), which combine the same methods with an LP measure: MinST+LP, MaxST+LP, kNN+LP and MkNN+LP with 1 ≤ k ≤ 5.
The weight matrix W uses the binary weighting approach.
Label inference was performed with the Local and Global Consistency (LGC) algorithm.
The average accuracy over 30 runs was used as the evaluation measure.
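For readers unfamiliar with LGC, the sketch below shows its usual closed-form solution, $F^{*} = (I - \alpha S)^{-1} Y$ with $S = D^{-1/2} W D^{-1/2}$, on a binary-weighted graph. It is an illustration under stated assumptions: the function name, the value $\alpha = 0.99$ and the toy graph are not taken from the slides.

```python
import numpy as np

def lgc(W, labeled, n_classes, alpha=0.99):
    """Local and Global Consistency label inference (closed form).

    W        : symmetric weight matrix (binary weights in this example)
    labeled  : dict mapping a vertex index to its known class
    alpha    : assumed trade-off parameter, must lie in (0, 1)
    """
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    degree = W.sum(axis=1)
    inv_sqrt = np.zeros(n)
    inv_sqrt[degree > 0] = 1.0 / np.sqrt(degree[degree > 0])
    S = inv_sqrt[:, None] * W * inv_sqrt[None, :]   # D^{-1/2} W D^{-1/2}
    Y = np.zeros((n, n_classes))
    for vertex, cls in labeled.items():
        Y[vertex, cls] = 1.0
    F = np.linalg.solve(np.eye(n) - alpha * S, Y)
    return F.argmax(axis=1)

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2 - 3 (toy graph).
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0

predicted = lgc(W, labeled={0: 0, 5: 1}, n_classes=2)
```

With one labeled vertex per triangle, each triangle inherits the label of its own labeled vertex, since every walk to the other label has to cross the single bridging edge.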

Supervised experimental setup
For the Cancer data set, instances with missing values were removed.
We apply MinST, MaxST, kNN and MkNN with 1 ≤ k ≤ 20, and the LP graphs (our proposal), which combine the same methods with an LP measure: MinST+LP, MaxST+LP, kNN+LP and MkNN+LP with 1 ≤ k ≤ 3.
The weight matrix W uses the opposite of the Euclidean distance.
The relational algorithms used for classification were: nobayes, nolb-lr-binary, nolb-lr-count, nolb-lr-mode and prn.
The accuracy of 10-fold cross-validation was used as the evaluation measure.

Results
Figure: Nemenyi post-hoc test for semi-supervised classification. Critical difference (CD) diagram over ranks 1 to 8 for kNN, MkNN, MinST, MaxST, kNN+LP, MkNN+LP, MinST+LP and MaxST+LP.
Figure: Nemenyi post-hoc test for supervised classification. Critical difference (CD) diagram over ranks 1 to 8 for the same eight methods.
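The CD bar in such diagrams is the critical difference of the Nemenyi test: two methods differ significantly when their average ranks differ by at least $CD = q_{\alpha} \sqrt{k(k+1)/(6N)}$, for $k$ methods compared over $N$ data sets. The sketch below only illustrates the formula; the significance level and the number of data sets used in the figures are not stated in the slides, so the values in the example are assumptions.

```python
import math

def nemenyi_cd(q_alpha, n_methods, n_datasets):
    """Critical difference between average ranks for the Nemenyi test."""
    return q_alpha * math.sqrt(n_methods * (n_methods + 1) / (6.0 * n_datasets))

# Hypothetical setting: 8 methods, 5 data sets, and q_alpha = 3.031,
# the tabulated critical value for 8 methods at the 0.05 level.
cd = nemenyi_cd(q_alpha=3.031, n_methods=8, n_datasets=5)
```

Because the critical difference shrinks with the square root of the number of data sets, a comparison over only a handful of data sets needs large rank gaps to reach significance.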

Parameter analysis
Figure: Distribution of the parameter k and the top percentage of links used by the graph construction methods in the supervised classification.

Average degree
Figure: Average degree for kNN, MkNN, MST and the LP versions (kNN+LP, MkNN+LP, MST+LP) applied to the Gaussians3 data set. Horizontal axis: k or % of links × 10; vertical axis: average degree. LP versions use k = 3 and the common neighbors measure.
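The quantity plotted is the average degree, $2|E| / |V|$ for an undirected graph. A minimal sketch, assuming the graph is stored as a symmetric 0/1 adjacency matrix (the function name and toy graph are hypothetical):

```python
import numpy as np

def average_degree(A):
    """Average vertex degree of an undirected graph given its adjacency matrix."""
    A = np.asarray(A, dtype=float)
    return A.sum() / A.shape[0]            # each edge is counted twice in A.sum()

# A spanning tree on n vertices has n - 1 edges, so its average degree is 2(n - 1)/n.
path = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    path[i, j] = path[j, i] = 1.0

tree_degree = average_degree(path)         # 2 * 3 / 4
```

This explains why a spanning tree stays just below an average degree of 2 regardless of the data, while the average degree of a kNN graph grows with k.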

Conclusions
Link prediction (LP) has been used in many fields of science, such as online social networks, where links can be recommended as promising friendships.
Here LP was used for graph construction: starting from an initial graph structure, edges are predicted, generating a new balanced graph.
The proposed graphs were evaluated in supervised and semi-supervised classification, providing improvements in accuracy.
The graphs are sparse and represent the neighborhood of a point well.
In future work, other baseline methods could be tested, as well as other LP measures.
Our approach could also be applied in other domains of machine learning that use graph-based methods.

Thank you Jorge Valverde-Rebaza jvalverr@icmc.usp.br
