Heterogeneous Subgraph Features for Information Networks
Andreas Spitz, Diego Costa, Kai Chen, Jan Greulich, Johanna Geiß, Stefan Wiesberg, and Michael Gertz
June 10, 2018, GRADES-NDA, Houston, Texas, USA
Heidelberg University, Germany


slide-1
SLIDE 1

Heterogeneous Subgraph Features for Information Networks

Andreas Spitz, Diego Costa, Kai Chen, Jan Greulich, Johanna Geiß, Stefan Wiesberg, and Michael Gertz June 10, 2018 — GRADES-NDA, Houston, Texas, USA

Heidelberg University, Germany Database Systems Research Group

slide-5
SLIDE 5

Learning and Predicting in Heterogeneous Networks

Many information networks are heterogeneous:

◮ Scientific publication networks
◮ Knowledge bases
◮ Metabolic networks
◮ …

How do you learn in heterogeneous networks?

◮ With features, of course
◮ But how do you get the features?

slide-8
SLIDE 8

Problems of Established Feature Extraction Approaches

Classic features:

◮ Require domain knowledge
◮ Are time-consuming to engineer
◮ Require metadata that may not be available

Neural node embeddings:

◮ Sample neighbourhoods through random walks
◮ Require extensive parameter tuning

Alternative idea: use labeled subgraph counts as features.

slide-9
SLIDE 9

Heterogeneous Subgraph Features

slide-11
SLIDE 11

Motivation: Heterogeneous Subgraph Features

Labeled subgraphs around a node:

◮ Encode neighbourhood information
◮ Are extremely diverse in heterogeneous networks

Conjecture: The subgraph neighbourhood of a node is representative of its function and label.

slide-15
SLIDE 15

Isomorphism of Subgraphs

Problem: Depending on the iteration order, the nodes of structurally identical subgraphs may be visited in different orders, so naive enumeration would require an explicit isomorphism test to recognize them as the same subgraph.

slide-17
SLIDE 17

Heterogeneous Subgraph Encoding

Core approach:

◮ Explore the local neighbourhood around each node
◮ Represent subgraphs by their characteristic string
◮ Count subgraphs by hashing the characteristic string
◮ Use the counts of subgraphs as node features

Characteristic string construction:

◮ Encode each node as a block
◮ Blocks start with the node label
◮ Subsequent entries denote neighbours of all given labels
◮ Blocks are sorted lexicographically

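As a sketch of this encoding step, the block construction and hashing can be mocked up in Python (function names, label strings, and the block syntax here are illustrative, not the paper's implementation):

```python
from collections import Counter, defaultdict

def characteristic_string(nodes, edges):
    """Pseudo-canonical encoding of a labeled subgraph.

    nodes: dict of node id -> label, edges: iterable of (u, v) pairs.
    Each node becomes a block: its own label followed by the sorted
    labels of its neighbours; blocks are then sorted lexicographically,
    so the result is independent of the node iteration order.
    """
    nbrs = defaultdict(list)
    for u, v in edges:
        nbrs[u].append(nodes[v])
        nbrs[v].append(nodes[u])
    blocks = (nodes[n] + ":" + ",".join(sorted(nbrs[n])) for n in nodes)
    return "|".join(sorted(blocks))

# Two traversal orders of the same star-shaped subgraph yield one
# encoding, so a plain Counter over the strings doubles as the
# subgraph-count feature vector.
g1 = ({"m": "movie", "a": "actor", "d": "director"}, [("m", "a"), ("m", "d")])
g2 = ({"d": "director", "m": "movie", "a": "actor"}, [("m", "d"), ("m", "a")])
features = Counter(characteristic_string(*g) for g in (g1, g2))
```

Because the string is canonical for the visited subgraph, counting reduces to hashing strings rather than running isomorphism tests.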
slide-19
SLIDE 19

Encoding Collisions

Heterogeneous degree sequences:

◮ Are a pseudo-canonical encoding
◮ May result in colliding encodings

Encoding collisions:

◮ Can only be enumerated (no closed formula)
◮ Depend on the network structure and the labels
◮ Have negligible frequency in practice

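Such collisions are easy to exhibit for a degree-sequence-style encoding. In the sketch below (an illustrative encoding, not the paper's exact one), a six-cycle and two disjoint triangles over identically labeled nodes receive the same characteristic string even though they are not isomorphic:

```python
from collections import defaultdict

def encode(nodes, edges):
    # Per-node blocks of own label plus sorted neighbour labels,
    # blocks sorted lexicographically (degree-sequence-style sketch).
    nbrs = defaultdict(list)
    for u, v in edges:
        nbrs[u].append(nodes[v])
        nbrs[v].append(nodes[u])
    return "|".join(sorted(nodes[n] + ":" + ",".join(sorted(nbrs[n]))
                           for n in nodes))

# A six-cycle and two disjoint triangles over six nodes of one label
# are not isomorphic, yet every node sees exactly two neighbours of
# label "x", so the two encodings collide.
labels = {i: "x" for i in range(6)}
hexagon = [(i, (i + 1) % 6) for i in range(6)]
triangles = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]
collision = encode(labels, hexagon) == encode(labels, triangles)
```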
slide-22
SLIDE 22

Heuristic for Hub Mitigation

Real-world networks have:

◮ Skewed degree distributions
◮ Highly connected nodes (hubs)

Due to hubs:

◮ Feature extraction time is strongly increased
◮ Random walks retrieve non-local information

Intuition: Do not explore beyond nodes with degree > dmax.
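The cutoff can be sketched as a breadth-first exploration that records hubs but never expands them (a minimal illustration with hypothetical names; the actual extractor enumerates subgraphs up to a fixed edge count rather than building a single BFS tree):

```python
from collections import deque

def explore(adj, start, max_edges, d_max):
    """Collect up to max_edges local edges around `start`, but never
    expand a hub: a node with degree > d_max is kept when reached,
    while its own neighbourhood is left unexplored.

    adj: dict node -> list of neighbours (toy adjacency-list graph).
    """
    seen, edges, queue = {start}, set(), deque([start])
    while queue and len(edges) < max_edges:
        u = queue.popleft()
        if len(adj[u]) > d_max and u != start:
            continue  # hub reached: record it, do not explore beyond it
        for v in adj[u]:
            if len(edges) >= max_edges:
                break
            edges.add(tuple(sorted((u, v))))
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return edges

# "h" is a hub of degree 4: it enters the result via the edge (a, h),
# but its other neighbours c, d, e are never visited.
adj = {"a": ["b", "h"], "b": ["a"], "h": ["a", "c", "d", "e"],
       "c": ["h"], "d": ["h"], "e": ["h"]}
local = explore(adj, "a", max_edges=5, d_max=2)
```

This keeps extraction time bounded near hubs while still recording that a hub is adjacent to the starting node.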

slide-23
SLIDE 23

Evaluation: Label Prediction

slide-25
SLIDE 25

Label Prediction: Task Definition

Given:

◮ Heterogeneous network
◮ Some nodes with missing labels

Predict:

◮ Missing node labels

Formal approach:

◮ Model as a classification task using logistic regression
◮ Evaluate with the F1-score

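For reference, the evaluation metric can be computed in a few lines of pure Python (the slide does not state which averaging is used, so the macro-averaged variant below is an assumption):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class harmonic mean of precision and
    recall, averaged over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# Toy labels: class "a" is recovered half the time, class "b" is
# predicted with one false positive.
score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```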
slide-26
SLIDE 26

Label Prediction: Data Sets

Movie network (IMDB):

◮ Star-shaped structure around movies
◮ Low edge density

Scientific publication network (MAG):

◮ Intermediate structure
◮ Papers form the core component

Entity cooccurrence network (LOAD):

◮ Cooccurrences of named entities in text
◮ Strongly connected structure
◮ High edge density

slide-27
SLIDE 27

Feature Engineering and Extraction

Subgraph features:

◮ Maximum number of edges: 5
◮ No exploration beyond the 10% highest-degree nodes
◮ Masked starting node label

Embedded features:

◮ DeepWalk
◮ LINE
◮ node2vec

slide-28
SLIDE 28

Extraction Runtime Estimation (seconds per node)

          subgraph features (mean / 75% / 90% / 95% / max)   node2vec   DeepWalk   LINE
LOAD      32.1 / 19.6 / 29.7 / 53.0 / 1046                     0.19       0.11     0.66
IMDB       2.6 /  1.7 /  3.0 /  6.7 /   47                     0.01       0.01     0.64
MAG       25.2 / 10.4 / 11.0 / 19.5 / 2493                     0.02       0.01     0.49

Percentages denote the nodes for which the extraction finished in at most the shown time.

slide-29
SLIDE 29

Evaluation Results (Training Size)

[Figure: F1 score vs. training size (10% to 90%) for the MAG, LOAD, and IMDB networks, comparing subgraph features with node2vec, DeepWalk, and LINE.]

slide-30
SLIDE 30

Evaluation Results (Missing Labels)

[Figure: F1 score vs. fraction of missing labels (0% to 75%) for the MAG, LOAD, and IMDB networks, comparing subgraph features with node2vec, DeepWalk, and LINE.]

slide-31
SLIDE 31

Evaluation: Institution Ranking

slide-33
SLIDE 33

Institution Ranking: Task Definition

Given:

◮ Scientific publication network
◮ A range of years
◮ A set of conferences

Predict a ranking of institutions:

◮ For upcoming conferences
◮ By accepted papers
◮ For the next conference

Formal approach:

◮ Model as a regression task for the institution relevance score
◮ Evaluate with normalized discounted cumulative gain (NDCG@20)

KDD Cup 2016. https://kddcup2016.azurewebsites.net
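The ranking metric can be sketched as follows (a standard NDCG@k implementation with log2 discounting; the KDD Cup's exact gain definition is an assumption here):

```python
import math

def ndcg_at_k(relevances, k=20):
    """NDCG@k for one ranked list: DCG of the predicted ordering
    divided by the DCG of the ideal, descending-relevance ordering.

    relevances: true relevance scores listed in the order in which
    the model ranked the institutions.
    """
    def dcg(scores):
        return sum(s / math.log2(i + 2) for i, s in enumerate(scores[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# A perfectly ordered list scores 1.0; burying the only relevant
# institution at rank 3 halves the score.
perfect = ndcg_at_k([3, 2, 1])
buried = ndcg_at_k([0, 0, 1])
```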

slide-34
SLIDE 34

Institution Ranking: Data Set

Subset of the Microsoft Academic Graph:

◮ Institutions I
◮ Authors A
◮ Papers P
◮ Publication data from 2011 to 2016

Data preparation:

◮ Focus on 5 conferences: KDD, FSE, ICML, MM, MOBICOM
◮ Use citations to a depth of 3

slide-35
SLIDE 35

Feature Types and Extraction

Classic features (manually engineered):

◮ Previous relevance scores, publication counts, etc. (8)
◮ Linguistic features (32)

Subgraph features:

◮ Maximum number of edges: 5
◮ No maximum-degree exploration limit

Embedded features:

◮ DeepWalk
◮ LINE
◮ node2vec

slide-36
SLIDE 36

NDCG Scores for Institution Ranking

[Figure: NDCG scores per conference (KDD, FSE, ICML, MM, MOBICOM) for Linear Regression, Decision Tree, Random Forest, and Bayesian Ridge, comparing Classic, Subgraph, Combined, node2vec, DeepWalk, and LINE features.]

slide-37
SLIDE 37

Average NDCG Scores for Institution Ranking

           LinRegr   DecTree   RanForest   BayRidge
classic      0.65      0.58       0.64       0.51
subgraph     0.58      0.51       0.68       0.65
combined     0.62      0.46       0.68       0.60
node2vec     0.18      0.19       0.39       0.27
DeepWalk     0.14      0.17       0.25       0.18
LINE         0.17      0.23       0.56       0.23

slide-38
SLIDE 38

Feature Importance Analysis (Random Forest)


slide-39
SLIDE 39

Summary & Resources

slide-42
SLIDE 42

Summary

Heterogeneous subgraph features:

◮ Extracted by local exploration and enumeration
◮ Avoid isomorphism tests by encoding degree sequences

In comparison to classic features:

◮ Similar performance
◮ Require no domain knowledge for extraction
◮ No engineering process necessary

In comparison to embedded features:

◮ Better predictive performance
◮ Longer extraction time

slide-43
SLIDE 43

Resources

The implementation is available online:

◮ C++ (core extraction routines)
◮ Python (wrapper)

https://dbs.ifi.uni-heidelberg.de/resources/hsgf/
