Cross-Graph Learning of Multi-Relational Associations
Hanxiao Liu, Yiming Yang
Carnegie Mellon University
{hanxiaol, yiming}@cs.cmu.edu
June 22, 2016
Outline
◮ Task Description
◮ New Contributions
◮ Framework
◮ Scalable Inference
◮ Empirical Evaluation
◮ Summary
Task Description
Goal: Predict associations among heterogeneous graphs.
Figure (a): Drug-Target Interaction — compound and protein nodes, with structure-similarity edges among compounds, sequence-similarity edges among proteins, and interact edges linking the two.

Figure (b): Citation Network — paper, author, and venue nodes, linked by coauthorship, citation, shared-foci, write, publish, and attend relations.

Example: “John published a reinforcement learning paper at ICML.” corresponds to the tuple (John, RL Paper, ICML).
New Contributions
◮ A unified framework for integrating heterogeneous information in multiple graphs.
◮ Transductive learning to leverage both labeled data (sparse) and unlabeled data (massive).
◮ A convex approximation for scalable inference over the combinatorial number of possible tuples.
Framework
Notation
◮ $G^{(1)}, G^{(2)}, \dots, G^{(J)}$ are the individual graphs;
◮ $n_j$ is the number of nodes in $G^{(j)}$;
◮ $(i_1, i_2, \dots, i_J)$ is a tuple (multi-relation);
◮ $f_{i_1, i_2, \dots, i_J}$ is the predicted score for the tuple;
◮ $f$ is a tensor in $\mathbb{R}^{n_1 \times n_2 \times \cdots \times n_J}$.
Framework

Product graph $P$ induced from $G^{(1)}, \dots, G^{(J)}$.

Figure: the product graph $P$ built from three individual graphs $G^{(1)}, G^{(2)}, G^{(3)}$.

Tensor product: $P\big(G^{(1)}, G^{(2)}, G^{(3)}\big) = G^{(1)} \otimes G^{(2)} \otimes G^{(3)}$
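For instance, the adjacency matrix of the tensor-product graph is simply the Kronecker product of the individual adjacency matrices. A minimal numpy sketch, with three hypothetical toy graphs standing in for $G^{(1)}, G^{(2)}, G^{(3)}$:

```python
import numpy as np

# Hypothetical toy adjacency matrices for G(1), G(2), G(3).
G1 = np.array([[0, 1], [1, 0]])                    # 2 nodes
G2 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])   # 3 nodes
G3 = np.array([[0, 1], [1, 0]])                    # 2 nodes

# Tensor (Kronecker) product graph: one node per tuple (i1, i2, i3),
# so P has n1 * n2 * n3 = 12 nodes.
P = np.kron(np.kron(G1, G2), G3)
print(P.shape)  # (12, 12)
```

The node set of $P$ is exactly the set of tuples, which is why scores over tuples can be treated as labels over the product graph.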
Framework
Why product graph?
◮ Mapping heterogeneous graphs onto a unified graph for label propagation (transductive learning).
Framework

Assuming

$$\mathrm{vec}(f) \sim \mathcal{N}(0, P) \tag{1}$$

which implies

$$-\log p(f \mid P) \propto \mathrm{vec}(f)^\top P^{-1}\, \mathrm{vec}(f) =: \|f\|_P^2 \tag{2}$$

Optimization problem:

$$\min_f \; \ell_O(f) + \frac{\gamma}{2} \|f\|_P^2 \tag{3}$$
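As a sanity check of (2), the sketch below (with a small, randomly generated positive-definite $P$ standing in for a product graph) confirms that the Gaussian log-density of $\mathrm{vec}(f)$ equals $-\tfrac{1}{2}\,\mathrm{vec}(f)^\top P^{-1} \mathrm{vec}(f)$ plus a constant that does not depend on $f$:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
P = A @ A.T + n * np.eye(n)        # random positive-definite "product graph"

f = rng.standard_normal(n)         # a toy vec(f)
quad = f @ np.linalg.inv(P) @ f    # vec(f)^T P^{-1} vec(f)

# log N(f; 0, P) = -0.5 * quad + const, with const independent of f:
const = multivariate_normal.logpdf(f, mean=np.zeros(n), cov=P) + 0.5 * quad
g = rng.standard_normal(n)         # a second sample yields the same constant
const2 = multivariate_normal.logpdf(g, mean=np.zeros(n), cov=P) \
         + 0.5 * (g @ np.linalg.inv(P) @ g)
assert np.isclose(const, const2)
```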
Framework
For computational tractability, we focus on the spectral graph product (SGP) family of $P$.

Spectral Graph Product (SGP). The eigensystem of $P_\kappa\big(G^{(1)}, \dots, G^{(J)}\big)$ is parametrized by the eigensystems of the individual graphs, i.e.,

$$\Big(\, \kappa\big(\lambda_{i_1}, \dots, \lambda_{i_J}\big),\; \bigotimes_j v_{i_j} \Big)_{i_1, \dots, i_J} \tag{4}$$

where $\lambda_{i_j}$ / $v_{i_j}$ is the $i_j$-th eigenvalue/eigenvector of the $j$-th graph.
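To make (4) concrete, here is a minimal numpy sketch for $J = 2$ symmetric graphs (the adjacency matrices and the choice $\kappa(x, y) = x \cdot y$ are illustrative assumptions): the SGP's eigenvalues are $\kappa$ applied to all pairs of individual eigenvalues, and its eigenvectors are Kronecker products of the individual eigenvectors.

```python
import numpy as np

def sgp(G, H, kappa):
    """Assemble P_kappa(G, H) from the individual eigensystems, as in (4)."""
    lam_g, V_g = np.linalg.eigh(G)                        # eigenpairs of G
    lam_h, V_h = np.linalg.eigh(H)                        # eigenpairs of H
    lam = kappa(lam_g[:, None], lam_h[None, :]).ravel()   # kappa over all pairs
    V = np.kron(V_g, V_h)                                 # product eigenvectors
    return V @ np.diag(lam) @ V.T

# Hypothetical toy adjacency matrices.
G = np.array([[0., 1.], [1., 0.]])
H = np.array([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
P = sgp(G, H, kappa=lambda x, y: x * y)   # a 6-node product graph
```

Note that $P_\kappa$ is assembled explicitly here only because the toy graphs are tiny; the point of the SGP machinery is to avoid this step at scale.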
Framework

Nice properties of SGP:

Subsuming basic operations:

$$\kappa(x, y) = x \times y \;\Longrightarrow\; P_\kappa(G, H) = G \otimes H \quad \text{(Tensor)} \tag{5}$$

$$\kappa(x, y) = x + y \;\Longrightarrow\; P_\kappa(G, H) = G \oplus H \quad \text{(Cartesian)} \tag{6}$$

Supporting graph diffusions:

$$\sigma_{\text{Heat}}(P_\kappa) = I + P_\kappa + \tfrac{1}{2} P_\kappa^2 + \cdots = P_{e^\kappa} \tag{7}$$

$$\sigma_{\text{von Neumann}}(P_\kappa) = I + P_\kappa + P_\kappa^2 + \cdots = P_{\frac{1}{1-\kappa}} \tag{8}$$

Order-insensitive: if $\kappa$ is commutative, then the SGP is commutative (up to graph isomorphism).
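These properties can be checked numerically. The sketch below (same toy setup as before; the `sgp` helper is repeated so the snippet stands alone) verifies (5), (6), and the heat-diffusion identity (7) via `scipy.linalg.expm`; the von Neumann identity (8) can be checked the same way with a matrix inverse, provided the spectral radius of $P_\kappa$ is below one.

```python
import numpy as np
from scipy.linalg import expm

def sgp(G, H, kappa):
    # Same helper as in the previous sketch: P_kappa from eigensystems (4).
    lg, Vg = np.linalg.eigh(G)
    lh, Vh = np.linalg.eigh(H)
    V = np.kron(Vg, Vh)
    return V @ np.diag(kappa(lg[:, None], lh[None, :]).ravel()) @ V.T

G = np.array([[0., 1.], [1., 0.]])
H = np.array([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])

# (5): kappa(x, y) = x * y recovers the tensor (Kronecker) product.
assert np.allclose(sgp(G, H, lambda x, y: x * y), np.kron(G, H))

# (6): kappa(x, y) = x + y recovers the Cartesian product
# G ⊕ H = G ⊗ I + I ⊗ H.
cart = np.kron(G, np.eye(3)) + np.kron(np.eye(2), H)
assert np.allclose(sgp(G, H, lambda x, y: x + y), cart)

# (7): the heat diffusion of an SGP is again an SGP, with kernel e^kappa.
kappa = lambda x, y: x * y
assert np.allclose(expm(sgp(G, H, kappa)),
                   sgp(G, H, lambda x, y: np.exp(kappa(x, y))))
```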
Scalable Inference

For a general GP, the semi-norm is computed as

$$\|f\|_P^2 = \mathrm{vec}(f)^\top P^{-1}\, \mathrm{vec}(f) \tag{9}$$

For an SGP, $P_\kappa$ no longer has to be explicitly computed:

$$\|f\|_{P_\kappa}^2 = \sum_{i_1, i_2, \dots, i_J}^{n_1, n_2, \dots, n_J} \frac{f\big(v_{i_1}, \dots, v_{i_J}\big)^2}{\kappa\big(\lambda_{i_1}, \dots, \lambda_{i_J}\big)} \tag{10}$$

◮ $f(v_{i_1}, v_{i_2}, \dots, v_{i_J}) = f \times_1 v_{i_1} \times_2 v_{i_2} \cdots \times_J v_{i_J}$ (mode products).
◮ However, even evaluating (10) is expensive.
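For $J = 2$, (10) says the semi-norm can be computed entirely through the individual eigensystems. A sketch under the earlier assumptions (toy graphs; $\kappa(x, y) = e^{x+y}$ chosen positive so that $P_\kappa$ is invertible), checking the spectral computation against the explicit quadratic form (9):

```python
import numpy as np

G = np.array([[0., 1.], [1., 0.]])
H = np.array([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
kappa = lambda x, y: np.exp(x + y)   # positive, so P_kappa is invertible

lg, Vg = np.linalg.eigh(G)
lh, Vh = np.linalg.eigh(H)

rng = np.random.default_rng(0)
f = rng.standard_normal((2, 3))      # toy score tensor (J = 2)

# Eq. (10): mode products f(v_{i1}, v_{i2}) = v_{i1}^T f v_{i2} for all pairs,
# i.e. the coefficients of f in the product-eigenvector basis.
coeff = Vg.T @ f @ Vh
norm_spectral = np.sum(coeff**2 / kappa(lg[:, None], lh[None, :]))

# Eq. (9): explicit computation with the assembled P_kappa.
V = np.kron(Vg, Vh)
P = V @ np.diag(kappa(lg[:, None], lh[None, :]).ravel()) @ V.T
norm_explicit = f.ravel() @ np.linalg.inv(P) @ f.ravel()

assert np.isclose(norm_spectral, norm_explicit)
```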
Scalable Inference

Using a low-rank SGP:
◮ $f$ lies in the linear span of the eigenvectors of $P$.
◮ Eigenvectors of high volatility (non-smooth bases) can be pruned away.

Figure: Eigenvectors of G (blue), H (red), and P(G, H).
Scalable Inference

Restrict $f$ to the linear span of “smooth” bases of $P$:

$$f(\alpha) = \sum_{i_1, i_2, \dots, i_J = 1}^{d_1, d_2, \dots, d_J} \alpha_{i_1, i_2, \dots, i_J} \bigotimes_j v_{i_j} \tag{11}$$

where the core tensor $\alpha \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_J}$ and $d_j \ll n_j$. The semi-norm becomes

$$\|f(\alpha)\|_{P_\kappa}^2 = \sum_{i_1, i_2, \dots, i_J = 1}^{d_1, d_2, \dots, d_J} \frac{\alpha_{i_1, i_2, \dots, i_J}^2}{\kappa\big(\lambda_{i_1}, \lambda_{i_2}, \dots, \lambda_{i_J}\big)} \tag{12}$$

We then optimize w.r.t. $\alpha$ instead of $f$. Parameter size: $\prod_j n_j \to \prod_j d_j$.
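A sketch of (11)–(12) for $J = 2$, keeping only the $d_j$ eigenvectors with the largest eigenvalues per graph as a stand-in for the “smooth” bases (this selection rule, like the toy graphs, is an illustrative assumption): $f(\alpha)$ expands to an $n_1 \times n_2$ score matrix, while the semi-norm reduces to a cheap elementwise sum over the $d_1 \times d_2$ core.

```python
import numpy as np

G = np.array([[0., 1.], [1., 0.]])
H = np.array([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
kappa = lambda x, y: np.exp(x + y)

lg, Vg = np.linalg.eigh(G)
lh, Vh = np.linalg.eigh(H)

# Keep the d_j eigenvectors per graph with the largest eigenvalues.
d1, d2 = 2, 2
ig, ih = np.argsort(lg)[-d1:], np.argsort(lh)[-d2:]
lg, Vg = lg[ig], Vg[:, ig]
lh, Vh = lh[ih], Vh[:, ih]

rng = np.random.default_rng(0)
alpha = rng.standard_normal((d1, d2))   # small core tensor

# Eq. (11): f(alpha) expanded over the retained product bases.
f = Vg @ alpha @ Vh.T                   # n1 x n2 score matrix

# Eq. (12): the semi-norm is a cheap sum over the d1 x d2 core.
norm = np.sum(alpha**2 / kappa(lg[:, None], lh[None, :]))
print(f.shape, norm)
```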
Scalable Inference
Figure: Tucker decomposition, where $\alpha$ is the core tensor.
Scalable Inference

Revised optimization objective:

$$\min_{\alpha \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_J}} \; \ell_O(f(\alpha)) + \frac{\gamma}{2} \|f(\alpha)\|_{P_\kappa}^2 \tag{13}$$

Ranking loss function (with $O$ the observed tuples and $\bar{O}$ the unobserved ones):

$$\ell_O(f) = \frac{1}{|O \times \bar{O}|} \sum_{\substack{(i_1, \dots, i_J) \in O \\ (i'_1, \dots, i'_J) \in \bar{O}}} \big[\, f_{i'_1 \dots i'_J} - f_{i_1 \dots i_J} \,\big]_+^2 \tag{14}$$

Gradient w.r.t. the core tensor:

$$\nabla_\alpha = \frac{\partial \ell_O}{\partial f} \left( \frac{\partial f_{i_1, \dots, i_J}}{\partial \alpha} - \frac{\partial f_{i'_1, \dots, i'_J}}{\partial \alpha} \right) + \gamma\, \alpha \oslash \kappa \tag{15}$$

where $\oslash$ denotes elementwise division by the tensor of $\kappa(\lambda_{i_1}, \dots, \lambda_{i_J})$ values. Tensor algebra is carried out on the GPU.
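Putting the pieces together, a minimal CPU-side sketch of a gradient-descent loop for (13)–(15) with $J = 2$, under the same assumptions as the earlier sketches (toy graphs, $\kappa(x, y) = e^{x+y}$, top-$d$ bases, hypothetical observed/unobserved tuple sets); a unit margin is added to the squared hinge, an illustrative choice so that learning starts from $\alpha = 0$:

```python
import numpy as np

G = np.array([[0., 1.], [1., 0.]])
H = np.array([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
kappa = lambda x, y: np.exp(x + y)

# Top-d "smooth" bases per graph, as in the low-rank sketch above.
d1, d2 = 2, 2
lg, Vg = np.linalg.eigh(G); lh, Vh = np.linalg.eigh(H)
ig, ih = np.argsort(lg)[-d1:], np.argsort(lh)[-d2:]
lg, Vg, lh, Vh = lg[ig], Vg[:, ig], lh[ih], Vh[:, ih]
K = kappa(lg[:, None], lh[None, :])          # kappa tensor for the ⊘ term

O = [(0, 0), (1, 2)]                         # observed tuples (illustrative)
O_bar = [(0, 1), (0, 2), (1, 0), (1, 1)]     # unobserved tuples

alpha = np.zeros((d1, d2))
gamma, lr = 0.1, 0.1
for step in range(500):
    f = Vg @ alpha @ Vh.T                    # eq. (11)
    grad = gamma * alpha / K                 # eq. (15): gamma * alpha ⊘ kappa
    for (i1, i2) in O:
        for (j1, j2) in O_bar:
            # Squared hinge with unit margin on f_observed - f_unobserved.
            viol = max(1.0 - (f[i1, i2] - f[j1, j2]), 0.0)
            if viol > 0:
                basis_i = np.outer(Vg[i1], Vh[i2])   # df_i / dalpha
                basis_j = np.outer(Vg[j1], Vh[j2])   # df_j / dalpha
                grad += 2 * viol * (basis_j - basis_i) / (len(O) * len(O_bar))
    alpha -= lr * grad

print(Vg @ alpha @ Vh.T)   # observed tuples should now outrank unobserved ones
```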
Empirical Evaluation

Datasets:
◮ Enzyme: 445 compounds, 664 proteins.
◮ DBLP: 34K authors, 11K papers, 22 venues.

Representative baselines:
◮ TF/GRTF: Tensor Factorization / Graph-Regularized TF
◮ NN: One-class Nearest Neighbor
◮ RSVM: Ranking SVMs
◮ LTKM: Low-Rank Tensor Kernel Machines
Empirical Evaluation
Our method: “TOP” (blue).
Figure: MAP, AUC, and Hits@5 (%) versus training size (12.5%, 25%, 50%, 100%) on Enzyme (above) and DBLP (below), comparing TOP against LTKM, NN, RSVM, TF, and GRTF.
Summary

Contributions:
◮ A unified framework for integrating heterogeneous information in multiple graphs.
◮ Transductive learning to leverage both labeled data (sparse) and unlabeled data (massive).
◮ A convex approximation for scalable inference over the combinatorial number of possible tuples.

Future/On-going Work:
◮ Learning structured associations.
◮ Larger problems: Microsoft Academic Graph (37 GB).