Learning to Predict Interactions in Networks
Charles Elkan, University of California, San Diego
Research with Aditya Menon
December 1, 2011
In a social network ... can we predict future friendships? (Image: flickr.com/photos/greenem)
◮ Values of explicit variables represent side-information.
◮ Latent values represent the position of each node in the network.
◮ The probability that an edge exists is a function of both:
$p(\text{edge between } i \text{ and } j) = \sigma(\alpha_i^T \Lambda \alpha_j + x_i^T W x_j + v^T z_{ij})$
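As a minimal numpy sketch of this scoring function (all names are illustrative, and σ is the logistic function):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def edge_probability(alpha_i, alpha_j, Lam, x_i, x_j, W, v, z_ij):
    """sigma(alpha_i' Lam alpha_j + x_i' W x_j + v' z_ij):
    latent term + node side-information term + edge side-information term."""
    latent = alpha_i @ Lam @ alpha_j     # latent positions of the two nodes
    explicit = x_i @ W @ x_j             # explicit node features
    dyadic = v @ z_ij                    # features of the dyad itself
    return sigmoid(latent + explicit + dyadic)
```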
◮ $(r_i, c_i)$ is a dyad; $y_i$ is a label.
◮ Often, but not necessarily, transductive.
◮ $r_i$ and $c_i$ can come from the same set or from different sets.
◮ $y_i$ can be unordered, ordered, or real-valued.
◮ Latent features play a similar role to explicit features.
◮ Computationally, learning performs an SVD (singular value decomposition), as illustrated below.
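As a hedged illustration of this connection, a rank-K truncated SVD of an interaction matrix yields latent feature vectors for rows and columns (the toy matrix and names below are invented for the example):

```python
import numpy as np

# Toy interaction matrix: rows are users, columns are movies (0 = unobserved).
M = np.array([[5., 4., 0.],
              [4., 0., 1.],
              [0., 1., 5.]])

K = 2  # number of latent features
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Rank-K latent feature vectors, one per row entity and per column entity.
row_latent = U[:, :K] * np.sqrt(s[:K])
col_latent = Vt[:K, :].T * np.sqrt(s[:K])

# Inner products of latent vectors approximate the observed interactions.
approx = row_latent @ col_latent.T
```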
◮ Contrast: non-predictive tasks, e.g. community detection.
◮ Contrast: MCMC, all-pairs betweenness, SVD, etc. use too much computation for large networks.
◮ Contrast: comparisons made only against classic methods.
◮ An algorithm but no goal function, e.g. betweenness.
◮ Uses only graph structure, e.g. commute time.
◮ Should also use known properties of nodes and edges.
◮ Assumes that the only structure is communities or blocks.
◮ Experimentally, humans are bad at learning network structures.
◮ And they learn non-social networks just as well as social ones.
◮ the edge involves him/herself, or
◮ one node of the edge has low or high degree.
◮ whether another person is a loner or gregarious,
◮ whether a person is a friend or enemy of oneself,
◮ in high school, whether another student is a geek or jock,
◮ etc.
◮ Need probabilities of ratings, e.g. p(5 stars | user, movie).
◮ Link types may be {friend, colleague, family}.
◮ For Amazon, labels may be {viewed, purchased, returned}.
◮ Combine information from latent and explicit feature vectors.
◮ Models probabilities of labels given an example.
◮ Purely discriminative: no attempt to model x.
◮ Labels can be nominal and/or have structure.
◮ Combines multiple sources of information correctly.
$p(y \mid r, c) \propto \exp(\theta^y_{rc})$
$\theta^y_{rc} = (\alpha^y_{r:})^T \beta^y_{c:} = \sum_{k=1}^{K} \alpha^y_{rk} \beta^y_{ck}$
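A minimal numpy sketch of this model, with one latent vector per (label, user) and per (label, movie); array shapes and names are assumptions for illustration:

```python
import numpy as np

def lfl_probabilities(alpha, beta, r, c):
    """p(y | r, c) proportional to exp(alpha[y, r] . beta[y, c]).
    alpha: (num_labels, num_rows, K); beta: (num_labels, num_cols, K)."""
    scores = np.einsum('yk,yk->y', alpha[:, r, :], beta[:, c, :])
    scores -= scores.max()               # for numerical stability
    p = np.exp(scores)
    return p / p.sum()
```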
◮ In practice, a single vector of movie characteristics suffices: $\beta^y_c = \beta_c$ for all labels $y$.
◮ The characteristics predicting that a user will rate 1 star versus 5 stars are the same.
$p(y \mid r, c) \propto \exp((\alpha^y_r)^T \beta^y_c + (v^y)^T s_{rc})$
◮ But then all users have the same rankings of movies.
◮ Remedy: $p(y \mid r, c) \propto \exp((\alpha^y_r)^T \beta^y_c + u_r^T V^y m_c)$
Full model:
$p(y \mid r, c) \propto \exp((\alpha^y_r)^T \beta^y_c + (v^y)^T s_{rc} + u_r^T V^y m_c)$
◮ $\alpha^y_r$ and $\beta^y_c$ are latent feature vectors in $\mathbb{R}^K$.
◮ $K$ is the number of latent features.
◮ Fix a base class for identifiability.
◮ Intercept terms for each user and movie are important.
◮ Use L2 regularization.
◮ Train with stochastic gradient descent (SGD), as sketched below.
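A hedged sketch of one such SGD step on the log loss −log p(y | r, c) with L2 regularization (learning rate and regularization strength are illustrative; intercepts and the base-class constraint are omitted for brevity):

```python
import numpy as np

def sgd_step(alpha, beta, r, c, y_true, lr=0.01, lam=0.1):
    """One stochastic gradient step for the LFL model on example (r, c, y_true)."""
    scores = np.einsum('yk,yk->y', alpha[:, r, :], beta[:, c, :])
    p = np.exp(scores - scores.max())
    p /= p.sum()
    err = p.copy()
    err[y_true] -= 1.0                   # gradient of log loss w.r.t. scores
    grad_alpha = err[:, None] * beta[:, c, :] + lam * alpha[:, r, :]
    grad_beta = err[:, None] * alpha[:, r, :] + lam * beta[:, c, :]
    alpha[:, r, :] -= lr * grad_alpha
    beta[:, c, :] -= lr * grad_beta
```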
◮ With unordered labels, the difference between ratings 1 and 5 equals the difference between 4 and 5.
◮ The expectation $E[y] = \sum_{y=1}^{|Y|} y \, p(y \mid r, c)$ is a summary function (snippet below).
◮ A standard latent feature model is limited to one factorization.
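For example, the expected rating is computed directly from the label probabilities; here the probabilities could come from the lfl_probabilities sketch above:

```python
def expected_rating(p, labels=(1, 2, 3, 4, 5)):
    """Summary prediction E[y] = sum over y of y * p(y | r, c)."""
    return sum(y * py for y, py in zip(labels, p))
```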
◮ Handle unordered labels for multiclass link prediction.
◮ Exploit numerical structure of labels for collaborative filtering.
◮ Incorporate side-information in a cold-start setting.
◮ More detailed study of link prediction.
◮ Complementarity of explicit and latent features.
◮ MMSB and IRM assume interactions are determined by cluster membership.
◮ The IBP has binary latent features.
[Figure: test-set MAE versus estimated standard deviation for the ten movies with the lowest and the highest rating variance, including "Kazaam", "Grateful Dead", "Lawnmower Man 2: Beyond Cyberspace", "The Rescuers", "Problem Child 2", "Prizzi's Honor", "Meatballs III", "Homeward Bound: The Incredible Journey", "Pokemon the Movie 2000", and "The Fly".]
◮ Standard: no cold-start for users or movies.
◮ Cold-start users: randomly discard the ratings of 50 users.
◮ Cold-start users + movies: randomly discard the ratings of 50 users and of 50 movies.
Test set MAE:

Setting                     Baseline   LFL
Standard                    0.7162     0.7063
Cold-start users            0.8039     0.7118
Cold-start users + movies   0.9608     0.7451
◮ Unsupervised scores, e.g. common neighbors, the Katz measure, Adamic-Adar (sketched below).
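As a hedged illustration, these scores for a candidate pair (i, j) of an undirected 0/1 adjacency matrix A (the Katz damping beta and the truncation length are illustrative):

```python
import numpy as np

def common_neighbors(A, i, j):
    return int(A[i] @ A[j])

def adamic_adar(A, i, j):
    """Sum of 1 / log(degree) over the common neighbors of i and j."""
    common = np.flatnonzero(A[i] * A[j])
    return float(np.sum(1.0 / np.log(A[common].sum(axis=1))))

def katz(A, i, j, beta=0.05, max_len=6):
    """Damped count of paths of every length: sum_l beta^l * [A^l]_ij.
    Dense matrix powers are fine for small graphs only."""
    score, power = 0.0, np.eye(A.shape[0])
    for l in range(1, max_len + 1):
        power = power @ A
        score += beta ** l * power[i, j]
    return score
```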
◮ For author-author linking, side-information can be words in the authors' publications.
◮ For country-country conflict, side-information is geographic distance, population, GDP, and polity.
◮ latent vectors $\alpha_i \in \mathbb{R}^k$ for each node $i$
◮ scaling factors $\Lambda \in \mathbb{R}^{k \times k}$ for asymmetric graphs
◮ weights $W \in \mathbb{R}^{d \times d}$ for node features
◮ weights $v \in \mathbb{R}^{d'}$ for edge features.
The predicted probability of an edge between nodes $i$ and $j$ is
$\hat{G}_{ij} = \sigma(\alpha_i^T \Lambda \alpha_j + x_i^T W x_j + v^T z_{ij})$
Training minimizes a regularized loss over the observed dyads, schematically
$\min_{\alpha, \Lambda, W, v} \sum_{(i,j)} \ell(\hat{G}_{ij}, G_{ij}) + \Omega(\alpha, \Lambda, W, v)$
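A minimal sketch of this training loss under the stated model; Z mapping a dyad to its edge-feature vector and all other names are illustrative:

```python
import numpy as np

def training_loss(alpha, Lam, W, v, X, Z, dyads, labels, lam=0.1):
    """Regularized log loss over observed dyads.
    alpha: (n, k) latent vectors; X: (n, d) node features."""
    loss = 0.0
    for (i, j), y in zip(dyads, labels):
        score = alpha[i] @ Lam @ alpha[j] + X[i] @ W @ X[j] + v @ Z[(i, j)]
        p = 1.0 / (1.0 + np.exp(-score))
        loss -= y * np.log(p) + (1 - y) * np.log(1 - p)
    reg = 0.5 * lam * sum(np.sum(M * M) for M in (alpha, Lam, W, v))
    return loss + reg
```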
◮ Not influenced by the relative sizes of the positive and negative classes.
◮ Sampling is popular, but loses information.
◮ Weighting is merely heuristic.
To optimize AUC directly, minimize a pairwise ranking loss over positive and negative dyads, schematically
$\min_{\alpha, \Lambda, W, v} \sum_{(i,j)\,\text{positive}} \sum_{(p,q)\,\text{negative}} \ell(\hat{G}_{ij} - \hat{G}_{pq}) + \Omega(\alpha, \Lambda, W, v)$
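A hedged sketch of such a pairwise surrogate, here the logistic loss averaged over all (positive, negative) score pairs; the inputs are assumed to be numpy arrays of model scores:

```python
import numpy as np

def pairwise_ranking_loss(pos_scores, neg_scores):
    """Smooth AUC surrogate: mean log(1 + exp(-(s_pos - s_neg)))
    over every positive/negative pair of dyad scores."""
    diffs = pos_scores[:, None] - neg_scores[None, :]
    return float(np.mean(np.log1p(np.exp(-diffs))))
```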
◮ Latent features versus unsupervised scores.
◮ Latent features versus explicit features.
◮ Computational biology: protein-protein interaction network, metabolic pathway network.
◮ Citation networks: NIPS authors, condensed-matter physicists.
◮ Social phenomena: military conflicts between countries, the US electric power grid.
Dataset     Nodes   |T+|    |T−|         +ve:−ve ratio   Average degree
Prot-Prot   2617    23710   6,824,979    1:300           9.1
Metabolic   668     5564    440,660      1:80            8.3
NIPS        2865    9466    8,198,759    1:866           3.3
Condmat     14230   2392    429,232      1:179           0.17
Conflict    130     320     16580        1:52            2.5
PowerGrid   4941    13188   24,400,293   1:2000          2.7

◮ Prot-Prot: protein-protein interaction data from Noble. Each protein has a 76-dimensional explicit feature vector.
◮ Metabolic: pathway interaction data for S. cerevisiae from the KEGG/PATHWAY database [ISMB]. Each node has three feature sets: a 157-dimensional vector of phylogenetic information, a 145-dimensional vector of gene expression information, and a 23-dimensional vector of gene location information.
◮ NIPS: each node has a 14035-dimensional bag-of-words feature vector, the words used by the author in her publications. LSI reduces the number of features to 100.
◮ Condmat: co-author network of condensed-matter physicists [Newman].
◮ Conflict: military disputes between countries [MID 3.0]. Each node has 3 features: population, GDP, and polity. Each dyad has 6 features, e.g. the countries' geographic distance.
◮ PowerGrid: US electric power grid network [Watts and Strogatz].
◮ Two OCR images are similar if they are versions of the same character.
◮ Two eHarmony members are compatible if they were mutually interested in each other.
◮ An advertisement is relevant for a query if a user clicks on it.
◮ An action is suitable for a state if it has high long-term value.
◮ Distance is defined only if u and v belong to the same space.
◮ In information retrieval, u can be a query in one language and v a document in another language.
◮ Because queries are shorter than documents, the relatedness of queries to documents is not symmetric.
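A minimal sketch of a bilinear relatedness score that handles exactly this situation: u and v may have different dimensions, and W need not be square or symmetric (names and sizes are invented for the example):

```python
import numpy as np

def relatedness(u, W, v):
    """Bilinear score s(u, v) = u' W v, with u in R^m, v in R^n, W of size m x n."""
    return float(u @ W @ v)

# Example: a 3-dimensional query against a 5-dimensional document.
rng = np.random.default_rng(0)
u, v = rng.normal(size=3), rng.normal(size=5)
W = rng.normal(size=(3, 5))
score = relatedness(u, W, v)
```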