Heterogeneous Subgraph Features for Information Networks
Andreas Spitz, Diego Costa, Kai Chen, Jan Greulich, Johanna Geiß, Stefan Wiesberg, and Michael Gertz
June 10, 2018, GRADES-NDA, Houston, Texas, USA
Heidelberg University, Germany


slide-1
SLIDE 1

Heterogeneous Subgraph Features for Information Networks

Andreas Spitz, Diego Costa, Kai Chen, Jan Greulich, Johanna Geiß, Stefan Wiesberg, and Michael Gertz June 10, 2018 — GRADES-NDA, Houston, Texas, USA

Heidelberg University, Germany Database Systems Research Group

slide-5
SLIDE 5

Learning and Predicting in Heterogeneous Networks

Many information networks are heterogeneous:

◮ Scientific publication networks
◮ Knowledge bases
◮ Metabolic networks
◮ …

How do you learn in heterogeneous networks?

◮ With features, of course
◮ But how do you get the features?

slide-8
SLIDE 8

Problems of Established Feature Extraction Approaches

Classic features:

◮ Require domain knowledge
◮ Are time-consuming to engineer
◮ Require metadata that may not be available

Neural node embeddings:

◮ Sample neighbourhoods through random walks
◮ Require extensive parameter tuning

Alternative idea: use labeled subgraph counts as features.

slide-9
SLIDE 9

Heterogeneous Subgraph Features

slide-11
SLIDE 11

Motivation: Heterogeneous Subgraph Features

Labeled subgraphs around a node:

◮ Encode neighbourhood information
◮ Are extremely diverse in heterogeneous networks

Conjecture: The subgraph neighbourhood of a node is representative of its function and label.

slide-15
SLIDE 15

Isomorphism of Subgraphs

Problem: Depending on the iteration order, the nodes of structurally identical subgraphs may be visited in different orders, so naive enumeration would require an explicit isomorphism test to recognize them as the same subgraph.

slide-17
SLIDE 17

Heterogeneous Subgraph Encoding

Core approach:

◮ Explore the local neighbourhood around each node
◮ Represent subgraphs by their characteristic string
◮ Count subgraphs by hashing the characteristic string
◮ Use the counts of subgraphs as node features

Characteristic string construction:

◮ Encode each node as a block
◮ Blocks start with the node label
◮ Subsequent entries denote neighbours of all given labels
◮ Blocks are sorted lexicographically

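As a sketch of this encoding step, the block construction and hashing can be mocked up in Python (function names, label strings, and the block syntax here are illustrative, not the paper's implementation):

```python
from collections import Counter, defaultdict

def characteristic_string(nodes, edges):
    """Pseudo-canonical encoding of a labeled subgraph.

    nodes: dict of node id -> label, edges: iterable of (u, v) pairs.
    Each node becomes a block: its own label followed by the sorted
    labels of its neighbours; blocks are then sorted lexicographically,
    so the result is independent of the node iteration order.
    """
    nbrs = defaultdict(list)
    for u, v in edges:
        nbrs[u].append(nodes[v])
        nbrs[v].append(nodes[u])
    blocks = (nodes[n] + ":" + ",".join(sorted(nbrs[n])) for n in nodes)
    return "|".join(sorted(blocks))

# Two traversal orders of the same star-shaped subgraph yield one
# encoding, so a plain Counter over the strings doubles as the
# subgraph-count feature vector.
g1 = ({"m": "movie", "a": "actor", "d": "director"}, [("m", "a"), ("m", "d")])
g2 = ({"d": "director", "m": "movie", "a": "actor"}, [("m", "d"), ("m", "a")])
features = Counter(characteristic_string(*g) for g in (g1, g2))
```

Because the string is canonical for the visited subgraph, counting reduces to hashing strings rather than running isomorphism tests.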
slide-19
SLIDE 19

Encoding Collisions

Heterogeneous degree sequences:

◮ Are a pseudo-canonical encoding
◮ May result in colliding encodings

Encoding collisions:

◮ Can only be enumerated (no closed formula)
◮ Depend on the network structure and the labels
◮ Have negligible frequency in practice

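Such collisions are easy to exhibit for a degree-sequence-style encoding. In the sketch below (an illustrative encoding, not the paper's exact one), a six-cycle and two disjoint triangles over identically labeled nodes receive the same characteristic string even though they are not isomorphic:

```python
from collections import defaultdict

def encode(nodes, edges):
    # Per-node blocks of own label plus sorted neighbour labels,
    # blocks sorted lexicographically (degree-sequence-style sketch).
    nbrs = defaultdict(list)
    for u, v in edges:
        nbrs[u].append(nodes[v])
        nbrs[v].append(nodes[u])
    return "|".join(sorted(nodes[n] + ":" + ",".join(sorted(nbrs[n]))
                           for n in nodes))

# A six-cycle and two disjoint triangles over six nodes of one label
# are not isomorphic, yet every node sees exactly two neighbours of
# label "x", so the two encodings collide.
labels = {i: "x" for i in range(6)}
hexagon = [(i, (i + 1) % 6) for i in range(6)]
triangles = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]
collision = encode(labels, hexagon) == encode(labels, triangles)
```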
slide-22
SLIDE 22

Heuristic for Hub Mitigation

Real-world networks have:

◮ Skewed degree distributions
◮ Highly connected nodes (hubs)

Due to hubs:

◮ Feature extraction time is strongly increased
◮ Random walks retrieve non-local information

Intuition: Do not explore beyond nodes with degree > dmax.
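The cutoff can be sketched as a breadth-first exploration that records hubs but never expands them (a minimal illustration with hypothetical names; the actual extractor enumerates subgraphs up to a fixed edge count rather than building a single BFS tree):

```python
from collections import deque

def explore(adj, start, max_edges, d_max):
    """Collect up to max_edges local edges around `start`, but never
    expand a hub: a node with degree > d_max is kept when reached,
    while its own neighbourhood is left unexplored.

    adj: dict node -> list of neighbours (toy adjacency-list graph).
    """
    seen, edges, queue = {start}, set(), deque([start])
    while queue and len(edges) < max_edges:
        u = queue.popleft()
        if len(adj[u]) > d_max and u != start:
            continue  # hub reached: record it, do not explore beyond it
        for v in adj[u]:
            if len(edges) >= max_edges:
                break
            edges.add(tuple(sorted((u, v))))
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return edges

# "h" is a hub of degree 4: it enters the result via the edge (a, h),
# but its other neighbours c, d, e are never visited.
adj = {"a": ["b", "h"], "b": ["a"], "h": ["a", "c", "d", "e"],
       "c": ["h"], "d": ["h"], "e": ["h"]}
local = explore(adj, "a", max_edges=5, d_max=2)
```

This keeps extraction time bounded near hubs while still recording that a hub is adjacent to the starting node.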

slide-23
SLIDE 23

Evaluation: Label Prediction

slide-25
SLIDE 25

Label Prediction: Task Definition

Given:

◮ Heterogeneous network
◮ Some nodes with missing labels

Predict:

◮ Missing node labels

Formal approach:

◮ Model as a classification task using logistic regression
◮ Evaluate with the F1-score

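For reference, the evaluation metric can be computed in a few lines of pure Python (the slide does not state which averaging is used, so the macro-averaged variant below is an assumption):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class harmonic mean of precision and
    recall, averaged over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# Toy labels: class "a" is recovered half the time, class "b" is
# predicted with one false positive.
score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```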
slide-26
SLIDE 26

Label Prediction: Data Sets

Movie network (IMDB):

◮ Star-shaped structure around movies
◮ Low edge density

Scientific publication network (MAG):

◮ Intermediate structure
◮ Papers form the core component

Entity cooccurrence network (LOAD):

◮ Cooccurrences of named entities in text
◮ Strongly connected structure
◮ High edge density

slide-27
SLIDE 27

Feature Engineering and Extraction

Subgraph features:

◮ Maximum number of edges: 5
◮ No exploration beyond the 10% highest-degree nodes
◮ Masked starting node label

Embedded features:

◮ DeepWalk
◮ LINE
◮ node2vec

slide-28
SLIDE 28

Extraction Runtime Estimation (seconds per node)

          subgraph features (mean / 75% / 90% / 95% / max)   node2vec   DeepWalk   LINE
LOAD      32.1 / 19.6 / 29.7 / 53.0 / 1046                     0.19       0.11     0.66
IMDB       2.6 /  1.7 /  3.0 /  6.7 /   47                     0.01       0.01     0.64
MAG       25.2 / 10.4 / 11.0 / 19.5 / 2493                     0.02       0.01     0.49

Percentages denote the nodes for which the extraction finished in at most the shown time.

slide-29
SLIDE 29

Evaluation Results (Training Size)

[Figure: F1 score vs. training size (10% to 90%) for the MAG, LOAD, and IMDB networks, comparing subgraph features with node2vec, DeepWalk, and LINE.]

slide-30
SLIDE 30

Evaluation Results (Missing Labels)

[Figure: F1 score vs. fraction of missing labels (0% to 75%) for the MAG, LOAD, and IMDB networks, comparing subgraph features with node2vec, DeepWalk, and LINE.]

slide-31
SLIDE 31

Evaluation: Institution Ranking

slide-33
SLIDE 33

Institution Ranking: Task Definition

Given:

◮ Scientific publication network
◮ A range of years
◮ A set of conferences

Predict a ranking of institutions:

◮ For upcoming conferences
◮ By accepted papers
◮ For the next conference

Formal approach:

◮ Model as a regression task for the institution relevance score
◮ Evaluate with normalized discounted cumulative gain (NDCG@20)

KDD Cup 2016. https://kddcup2016.azurewebsites.net
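The ranking metric can be sketched as follows (a standard NDCG@k implementation with log2 discounting; the KDD Cup's exact gain definition is an assumption here):

```python
import math

def ndcg_at_k(relevances, k=20):
    """NDCG@k for one ranked list: DCG of the predicted ordering
    divided by the DCG of the ideal, descending-relevance ordering.

    relevances: true relevance scores listed in the order in which
    the model ranked the institutions.
    """
    def dcg(scores):
        return sum(s / math.log2(i + 2) for i, s in enumerate(scores[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# A perfectly ordered list scores 1.0; burying the only relevant
# institution at rank 3 halves the score.
perfect = ndcg_at_k([3, 2, 1])
buried = ndcg_at_k([0, 0, 1])
```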

slide-34
SLIDE 34

Institution Ranking: Data Set

Subset of the Microsoft Academic Graph:

◮ Institutions I
◮ Authors A
◮ Papers P
◮ Publication data from 2011 to 2016

Data preparation:

◮ Focus on 5 conferences: KDD, FSE, ICML, MM, MOBICOM
◮ Use citations to a depth of 3

slide-35
SLIDE 35

Feature Types and Extraction

Classic features (manually engineered):

◮ Previous relevance scores, publication counts, etc. (8)
◮ Linguistic features (32)

Subgraph features:

◮ Maximum number of edges: 5
◮ No maximum-degree exploration limit

Embedded features:

◮ DeepWalk
◮ LINE
◮ node2vec

slide-36
SLIDE 36

NDCG Scores for Institution Ranking

[Figure: NDCG scores per conference (KDD, FSE, ICML, MM, MOBICOM) for Linear Regression, Decision Tree, Random Forest, and Bayesian Ridge, comparing Classic, Subgraph, Combined, node2vec, DeepWalk, and LINE features.]

slide-37
SLIDE 37

Average NDCG Scores for Institution Ranking

           LinRegr   DecTree   RanForest   BayRidge
classic      0.65      0.58       0.64       0.51
subgraph     0.58      0.51       0.68       0.65
combined     0.62      0.46       0.68       0.60
node2vec     0.18      0.19       0.39       0.27
DeepWalk     0.14      0.17       0.25       0.18
LINE         0.17      0.23       0.56       0.23

slide-38
SLIDE 38

Feature Importance Analysis (Random Forest)


slide-39
SLIDE 39

Summary & Resources

slide-42
SLIDE 42

Summary

Heterogeneous subgraph features:

◮ Extracted by local exploration and enumeration
◮ Avoid isomorphism tests by encoding degree sequences

In comparison to classic features:

◮ Similar performance
◮ Require no domain knowledge for extraction
◮ No engineering process necessary

In comparison to embedded features:

◮ Better predictive performance
◮ Longer extraction time

slide-43
SLIDE 43

Resources

The implementation is available online:

◮ C++ (core extraction routines)
◮ Python (wrapper)

https://dbs.ifi.uni-heidelberg.de/resources/hsgf/
