

SLIDE 1

Cross-Graph Learning of Multi-Relational Associations

Hanxiao Liu, Yiming Yang
Carnegie Mellon University
{hanxiaol, yiming}@cs.cmu.edu
June 22, 2016


SLIDE 2

Outline

Task Description
New Contributions
Framework
Scalable Inference
Empirical Evaluation
Summary


SLIDE 3

Task Description

Goal: Predict associations among heterogeneous graphs.

(a) Drug–Target Interaction: Compound and Protein nodes, with Structure Similarity edges among compounds, Sequence Similarity edges among proteins, and Interact edges between the two graphs.

(b) Citation Network: Author, Paper, and Venue nodes, with Coauthorship (author–author), Citation (paper–paper), and Shared Foci (venue–venue) edges within each graph, and Write, Publish, and Attend edges across them.

Example: "John published a reinforcement learning paper at ICML." corresponds to the tuple (John, RL Paper, ICML).

SLIDE 4

Outline

Task Description
New Contributions
Framework
Scalable Inference
Empirical Evaluation
Summary


SLIDE 5

New Contributions

◮ A unified framework for integrating heterogeneous information across multiple graphs.

◮ Transductive learning to leverage both labeled data (sparse) and unlabeled data (massive).

◮ A convex approximation for scalable inference over the combinatorial number of possible tuples.


SLIDE 6

Outline

Task Description
New Contributions
Framework
Scalable Inference
Empirical Evaluation
Summary


SLIDE 7

Framework

Notation

◮ $G^{(1)}, G^{(2)}, \dots, G^{(J)}$ are individual graphs;
◮ $n_j$ is the number of nodes in $G^{(j)}$;
◮ $(i_1, i_2, \dots, i_J)$ is a tuple (multi-relation);
◮ $f_{i_1, i_2, \dots, i_J}$ is the predicted score for the tuple;
◮ $f$ is a tensor in $\mathbb{R}^{n_1 \times n_2 \times \cdots \times n_J}$.

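As a concrete illustration of this notation, here is a minimal sketch in NumPy (our own toy setup, not the authors' code; all sizes and names are assumptions) of $J = 3$ graphs and the score tensor $f$:

import numpy as np

# Toy setup for J = 3 graphs and a score tensor over all tuples.
rng = np.random.default_rng(0)

def random_graph(n):
    # a symmetric weighted adjacency matrix with zero diagonal
    A = rng.random((n, n))
    A = (A + A.T) / 2
    np.fill_diagonal(A, 0.0)
    return A

n1, n2, n3 = 4, 5, 3                 # n_j = number of nodes in G(j)
G1, G2, G3 = random_graph(n1), random_graph(n2), random_graph(n3)
f = np.zeros((n1, n2, n3))           # f[i1, i2, i3] = score of tuple (i1, i2, i3)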

SLIDE 8

Framework

Product Graph $P$ induced from $G^{(1)}, \dots, G^{(J)}$.

Tensor product: $P\big(G^{(1)}, G^{(2)}, G^{(3)}\big) = G^{(1)} \otimes G^{(2)} \otimes G^{(3)}$

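A minimal sketch of this construction (our own illustration, on toy adjacency matrices): the tensor graph product is the Kronecker product of the adjacency matrices, so each node of $P$ is a tuple of nodes and edge weights multiply.

import numpy as np

# Sketch of the tensor product P = G(1) (x) G(2) (x) G(3) on toy graphs.
rng = np.random.default_rng(0)
G1, G2, G3 = [(G + G.T) / 2 for G in
              (rng.random((4, 4)), rng.random((5, 5)), rng.random((3, 3)))]

# Each node of P is a tuple (i1, i2, i3); edge weights multiply.
P = np.kron(np.kron(G1, G2), G3)
print(P.shape)   # (60, 60) = (4*5*3, 4*5*3)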


SLIDE 10

Framework

Why a product graph?

◮ It maps the heterogeneous graphs onto a single unified graph, enabling label propagation (transductive learning).


SLIDE 11

Framework

Assuming
$$\mathrm{vec}(f) \sim \mathcal{N}(0, P) \tag{1}$$
which implies:
$$-\log p(f \mid P) \propto \mathrm{vec}(f)^\top P^{-1} \mathrm{vec}(f) =: \|f\|_P^2 \tag{2}$$
Optimization problem:
$$\min_f \; \ell_O(f) + \frac{\gamma}{2} \|f\|_P^2 \tag{3}$$

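For intuition, a small sketch of Eqs. (2)–(3) on a toy problem (our own stand-in for $P$; we assume $P$ is symmetric positive definite so that $P^{-1}$ exists):

import numpy as np

# Toy evaluation of the semi-norm ||f||_P^2 = vec(f)^T P^{-1} vec(f).
rng = np.random.default_rng(0)
n1, n2 = 4, 5
A = rng.standard_normal((n1 * n2, n1 * n2))
P = A @ A.T + (n1 * n2) * np.eye(n1 * n2)    # toy SPD stand-in for P

f = rng.standard_normal((n1, n2))            # toy score tensor (J = 2)
vec_f = f.reshape(-1)                        # vec(f)
sq_norm = vec_f @ np.linalg.solve(P, vec_f)  # vec(f)^T P^{-1} vec(f)

# The objective in Eq. (3) adds this penalty (times gamma / 2)
# to a loss over the observed tuples.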


SLIDE 14

Framework

For computational tractability, we focus on the spectral graph product family of P.

Spectral Graph Product (SGP): the eigensystem of $P_\kappa\big(G^{(1)}, \dots, G^{(J)}\big)$ is parametrized by the eigensystems of the individual graphs, i.e.,
$$\Big\{ \Big( \kappa\big(\lambda_{i_1}, \dots, \lambda_{i_J}\big),\; \textstyle\bigotimes_j v_{i_j} \Big) \Big\}_{i_1, \dots, i_J} \tag{4}$$
where $\lambda_{i_j}$ / $v_{i_j}$ is the $i_j$-th eigenvalue/eigenvector of the $j$-th graph.

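A sketch of Eq. (4) for $J = 2$ (our own code; the function name is an assumption, and $\kappa$ defaults to the multiplicative kernel of the tensor product):

import numpy as np

# The SGP eigensystem from the factor eigensystems (J = 2 sketch).
def sgp_eigensystem(G, H, kappa=lambda lg, lh: lg * lh):
    lam_g, V_g = np.linalg.eigh(G)                # eigenpairs of graph 1
    lam_h, V_h = np.linalg.eigh(H)                # eigenpairs of graph 2
    lam = kappa(lam_g[:, None], lam_h[None, :])   # kappa(lambda_i1, lambda_i2)
    # The (i1, i2)-th eigenvector of P_kappa is np.kron(V_g[:, i1], V_h[:, i2]);
    # we keep the factors rather than materializing all n1*n2 vectors.
    return lam, V_g, V_h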

SLIDE 15

Framework

Nice properties of SGP:

Subsuming basic operations:
$$\kappa(x, y) = x \times y \;\Longrightarrow\; P_\kappa(G, H) = G \otimes H \quad \text{(Tensor)} \tag{5}$$
$$\kappa(x, y) = x + y \;\Longrightarrow\; P_\kappa(G, H) = G \oplus H \quad \text{(Cartesian)} \tag{6}$$

Supporting graph diffusions:
$$\sigma_{\mathrm{Heat}}(P_\kappa) = I + P_\kappa + \tfrac{1}{2} P_\kappa^2 + \cdots = P_{e^\kappa} \tag{7}$$
$$\sigma_{\mathrm{von\ Neumann}}(P_\kappa) = I + P_\kappa + P_\kappa^2 + \cdots = P_{\frac{1}{1-\kappa}} \tag{8}$$

Order-insensitive: if $\kappa$ is commutative, then the SGP is commutative (up to graph isomorphism).

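Since an SGP is defined through its spectrum, these diffusions reduce to elementwise transforms of the eigenvalues. A toy sketch (our own; it assumes eigenvalue magnitudes below 1 so the von Neumann series converges):

import numpy as np

# Diffusions act on the SGP spectrum elementwise, per Eqs. (7)-(8).
lam = np.linspace(-0.9, 0.9, 5)      # toy SGP eigenvalues kappa(...)
heat = np.exp(lam)                   # sigma_Heat:        kappa -> e^kappa
von_neumann = 1.0 / (1.0 - lam)      # sigma_von-Neumann: kappa -> 1/(1-kappa)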


SLIDE 18

Outline

Task Description
New Contributions
Framework
Scalable Inference
Empirical Evaluation
Summary


SLIDE 19

Scalable Inference

For a general GP, the semi-norm is computed as
$$\|f\|_P^2 = \mathrm{vec}(f)^\top P^{-1} \mathrm{vec}(f) \tag{9}$$
For an SGP, $P_\kappa$ no longer has to be explicitly computed:
$$\|f\|_{P_\kappa}^2 = \sum_{i_1, i_2, \dots, i_J}^{n_1, n_2, \dots, n_J} \frac{f\big(v_{i_1}, \dots, v_{i_J}\big)^2}{\kappa\big(\lambda_{i_1}, \dots, \lambda_{i_J}\big)} \tag{10}$$

◮ $f(v_{i_1}, v_{i_2}, \dots, v_{i_J}) = f \times_1 v_{i_1} \times_2 v_{i_2} \cdots \times_J v_{i_J}$
◮ However, even evaluating (10) is expensive.

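A sketch of Eq. (10) for $J = 2$ (our own code): the multi-linear projections $f(v_{i_1}, v_{i_2})$ are computed by mode products against the factor eigenvectors, so $P_\kappa$ is never materialized. We assume $\kappa$ returns strictly positive values (e.g., the graphs are given as positive-definite kernels) so the division is well defined.

import numpy as np

def sgp_seminorm(f, G, H, kappa=lambda lg, lh: lg * lh):
    lam_g, V_g = np.linalg.eigh(G)
    lam_h, V_h = np.linalg.eigh(H)
    coef = V_g.T @ f @ V_h                        # all f(v_i1, v_i2) at once
    lam = kappa(lam_g[:, None], lam_h[None, :])   # kappa(lambda_i1, lambda_i2)
    return np.sum(coef ** 2 / lam)                # Eq. (10)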



SLIDE 23

Scalable Inference

Using low-rank SGP

◮ f lies in the linear span of the eigenvectors of P.
◮ Eigenvectors of high volatility can be pruned away.

Figure: Eigenvectors of G (blue), H (red), and P(G, H).

SLIDE 24

Scalable Inference

Restrict f to the linear span of "smooth" bases of P:
$$f(\alpha) = \sum_{i_1, i_2, \dots, i_J = 1}^{d_1, d_2, \dots, d_J} \alpha_{i_1, i_2, \dots, i_J} \bigotimes_j v_{i_j} \tag{11}$$
where the core tensor $\alpha \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_J}$ with $d_j \ll n_j$. The semi-norm becomes
$$\|f(\alpha)\|_{P_\kappa}^2 = \sum_{i_1, i_2, \dots, i_J = 1}^{d_1, d_2, \dots, d_J} \frac{\alpha_{i_1, i_2, \dots, i_J}^2}{\kappa\big(\lambda_{i_1}, \lambda_{i_2}, \dots, \lambda_{i_J}\big)} \tag{12}$$
We then optimize w.r.t. $\alpha$ instead of $f$. Parameter size: $\prod_j n_j \to \prod_j d_j$.

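A sketch of Eq. (11) for $J = 2$ (our own code): $f(\alpha)$ is a Tucker-style expansion of a small core tensor against the $d_j$ retained eigenvectors of each graph. Which eigenvectors count as "smooth" depends on the graph matrix used; the top-$d_j$ selection below is just an assumption for illustration.

import numpy as np

# Reconstruct f(alpha) from a d1 x d2 core against retained bases.
def reconstruct(alpha, V_g, V_h):
    d1, d2 = alpha.shape
    # np.linalg.eigh sorts ascending, so the last d_j columns have the
    # largest eigenvalues; we treat those as the retained "smooth" bases.
    B_g, B_h = V_g[:, -d1:], V_h[:, -d2:]
    return B_g @ alpha @ B_h.T            # f(alpha): an n1 x n2 score matrix

# Parameter count drops from n1*n2 (entries of f) to d1*d2 (entries of alpha).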


SLIDE 26

Scalable Inference

Figure: Tucker decomposition, where α is the core tensor.

SLIDE 27

Scalable Inference

Revised optimization objective:
$$\min_{\alpha \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_J}} \; \ell_O\big(f(\alpha)\big) + \frac{\gamma}{2} \big\|f(\alpha)\big\|_{P_\kappa}^2 \tag{13}$$
Ranking loss function:
$$\ell_O(f) = \frac{1}{|O \times \bar{O}|} \sum_{(i_1, \dots, i_J) \in O} \; \sum_{(i'_1, \dots, i'_J) \in \bar{O}} \big[ f_{i_1 \dots i_J} - f_{i'_1 \dots i'_J} \big]_+^2 \tag{14}$$
with gradient
$$\nabla_\alpha = \frac{\partial \ell_O}{\partial f} \left( \frac{\partial f_{i_1, \dots, i_J}}{\partial \alpha} - \frac{\partial f_{i'_1, \dots, i'_J}}{\partial \alpha} \right) + \gamma\, \alpha \oslash \kappa \tag{15}$$
where $\oslash$ denotes element-wise division. Tensor algebra is carried out on the GPU.

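A sketch of the pairwise ranking loss for $J = 2$ (our own code; the slide extraction leaves the sign inside the hinge ambiguous, so we implement the reading where observed tuples in O should outscore non-observed tuples in Ō, penalizing violations with a squared hinge):

import numpy as np

# Pairwise squared-hinge ranking loss over observed tuples O and
# non-observed tuples O_bar; f is a score matrix, tuples are (i1, i2).
def ranking_loss(f, O, O_bar):
    total = 0.0
    for (i1, i2) in O:                         # observed tuples
        for (j1, j2) in O_bar:                 # sampled non-observed tuples
            violation = f[j1, j2] - f[i1, i2]  # > 0 when the ranking is violated
            total += max(0.0, violation) ** 2  # [.]_+^2
    return total / (len(O) * len(O_bar))       # normalize by |O x O_bar|

# usage with a toy score matrix:
# f = np.zeros((4, 5)); ranking_loss(f, O=[(0, 1)], O_bar=[(2, 3)])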


SLIDE 30

Outline

Task Description
New Contributions
Framework
Scalable Inference
Empirical Evaluation
Summary


SLIDE 31

Empirical Evaluation

Datasets
◮ Enzyme: 445 compounds, 664 proteins.
◮ DBLP: 34K authors, 11K papers, 22 venues.

Representative Baselines
◮ TF/GRTF: Tensor Factorization / Graph-Regularized TF
◮ NN: One-class Nearest Neighbor
◮ RSVM: Ranking SVMs
◮ LTKM: Low-Rank Tensor Kernel Machines



SLIDE 33

Empirical Evaluation

Our method: “TOP” (blue).

Figure: Performance on Enzyme (above) and DBLP (below): MAP, AUC, and Hits@5 (%) vs. training size (12.5%, 25%, 50%, 100%), comparing TOP, LTKM, NN, RSVM, TF, and GRTF.


SLIDE 34

Outline

Task Description
New Contributions
Framework
Scalable Inference
Empirical Evaluation
Summary


SLIDE 35

Summary

Contribution

◮ A unified framework for integrating heterogeneous information across multiple graphs.

◮ Transductive learning to leverage both labeled data (sparse) and unlabeled data (massive).

◮ A convex approximation for scalable inference over the combinatorial number of possible tuples.

Future/On-going Work

◮ Learning structured associations.
◮ Larger problems: Microsoft Academic Graph (37 GB).



SLIDE 37

Thank You
