An Interpretable Knowledge Transfer Model for Knowledge Base Completion - PowerPoint PPT Presentation



SLIDE 1

An Interpretable Knowledge Transfer Model for Knowledge Base Completion

Qizhe Xie, Xuezhe Ma, Zihang Dai, Eduard Hovy

Carnegie Mellon University Language Technologies Institute

August 2, 2017

1 / 28

SLIDE 2

Outline

Introduction
  Task
  Motivation
Model
Experiments
  Main Results
  Performance on Rare Relations
  Interpretability
  Analysis on Sparseness

2 / 28

SLIDE 3

Outline

Introduction
  Task
  Motivation
Model
Experiments
  Main Results
  Performance on Rare Relations
  Interpretability
  Analysis on Sparseness

3 / 28

SLIDE 4

Task: Knowledge base completion (KBC)

◮ Recover missing facts in knowledge bases

◮ Given many observed triples, such as (Leonardo DiCaprio, won award, Oscar)

◮ Predict missing facts: (Leonardo DiCaprio, Profession, ?)

◮ Embedding-based approaches
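To make the embedding-based approach concrete, here is a minimal, hypothetical sketch of answering a (h, r, ?) query by ranking every entity with a translation-style energy (the TransE scoring function introduced later in this deck); the embeddings and shapes are illustrative only, not the authors' code.

```python
import numpy as np

def rank_tails(h_emb, r_emb, entity_embs, ell=1):
    """Score every candidate tail t for a query (h, r, ?) by the
    translation energy ||h + r - t||; lower energy = more plausible.
    Returns entity indices sorted from best to worst candidate."""
    energies = np.linalg.norm(h_emb + r_emb - entity_embs, ord=ell, axis=1)
    return np.argsort(energies)

# Toy example: entity 2 sits exactly at h + r, so it should rank first.
entities = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
h, r = entities[0], np.array([1.0, 2.0])
print(rank_tails(h, r, entities)[0])  # 2
```

The predicted answer to the query is simply the top-ranked entity; Hits@10 (used in the experiments below) asks whether the true tail appears among the first ten indices.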

4 / 28

SLIDE 5

Data Sparsity Issue

Figure 1: Frequencies of relations are subject to Zipf’s law. (Per-relation frequency and log-frequency plots for (a) WN18 and (b) FB15k.)

5 / 28

SLIDE 6

Problems Our Model Tackles

◮ Data sparsity: transfer learning

  ◮ On WN18, the rarer the relation, the greater the improvement

◮ Interpretability: ℓ0-regularized representation

  ◮ Reverse relations, undirected relations, and similar relations are identified by the sparse representation

◮ Model size: compression

  ◮ On FB15k, the number of parameters can be reduced to 1/90 of the original model

6 / 28

SLIDE 7

Outline

Introduction
  Task
  Motivation
Model
Experiments
  Main Results
  Performance on Rare Relations
  Interpretability
  Analysis on Sparseness

7 / 28

SLIDE 8

Notation and Previous Models

◮ Data: triples (h, r, t)

  ◮ Training data: (h = Leonardo DiCaprio, r = won award, t = Oscar)
  ◮ Test data: (h = Leonardo DiCaprio, r = Profession, t = ?)

◮ Energy function fr(h, t) for triples (h, r, t)

  ◮ Minimize the energy of true triples and maximize the energy of false triples

◮ TransE [Bordes et al., 2013]: fr(h, t) = ‖h + r − t‖ℓ

  Parameters: entity embeddings h, t and relation embedding r

◮ STransE [Nguyen et al., 2016]: fr(h, t) = ‖Wr,1h + r − Wr,2t‖ℓ

  Parameters: relation-specific projection matrices Wr,1, Wr,2 and embeddings

◮ All parameters are trained by SGD
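The two energy functions can be written down directly; this is a small numpy sketch (not the authors' implementation), with `ord=1` standing in for the ℓ1 norm.

```python
import numpy as np

def transe_energy(h, r, t, ell=1):
    """TransE: f_r(h, t) = ||h + r - t|| under the l_ell norm (ell = 1 or 2)."""
    return np.linalg.norm(h + r - t, ord=ell)

def stranse_energy(h, r, t, W1, W2, ell=1):
    """STransE: f_r(h, t) = ||W_{r,1} h + r - W_{r,2} t||,
    with relation-specific projection matrices W1, W2."""
    return np.linalg.norm(W1 @ h + r - W2 @ t, ord=ell)

# A true triple whose relation is the exact translation t - h has zero energy.
h, t = np.array([1.0, 0.0]), np.array([0.0, 1.0])
r = t - h
print(transe_energy(h, r, t))         # 0.0
# With identity projections, STransE reduces to TransE.
I = np.eye(2)
print(stranse_energy(h, r, t, I, I))  # 0.0
```

Training pushes true triples toward low energy and corrupted triples toward high energy, typically via a margin loss over the SGD-updated parameters.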

8 / 28

SLIDE 9

STransE: Parametrizing Each Relation Separately

◮ Prone to the data sparsity problem

9 / 28

SLIDE 10

Sharing Parameters through Common Concepts

◮ Relation-concept mapping example with attention weights
◮ Parametrize concepts instead of relations
◮ Relation matrices are weighted averages of concept matrices with attention weights, e.g. Wr1,1 = 0.2D1 + 0.8D2
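The parameter sharing above amounts to a weighted sum of shared concept matrices; a minimal sketch with illustrative shapes and weights (the dimensions and the number of concepts are assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
n_concepts, dim = 4, 3

# Shared concept projection matrices D_1..D_n, learned once for the whole KB
# instead of one pair of matrices per relation.
D = rng.standard_normal((n_concepts, dim, dim))

# Attention weights tying relation r1's head projection to concepts 1 and 2.
alpha = np.array([0.2, 0.8, 0.0, 0.0])

# W_{r1,1} = sum_i alpha_i * D_i  (here 0.2*D_1 + 0.8*D_2, as on the slide)
W_r1_head = np.tensordot(alpha, D, axes=1)

assert np.allclose(W_r1_head, 0.2 * D[0] + 0.8 * D[1])
```

Because rare relations reuse the same concept matrices as frequent ones, gradient updates from frequent relations improve the shared parameters that rare relations depend on.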

10 / 28

SLIDE 11

Sharing Parameters through Common Concepts

◮ Suppose a ground-truth mapping is given; then

  ◮ Transfer learning can be done effectively through parameter sharing
  ◮ We can interpret similar relations
  ◮ All parameters are trainable by SGD

◮ Concepts need to be learned end-to-end
◮ How do we obtain the mapping?

11 / 28

SLIDE 12

Dense Mapping

◮ Dense attention: construct a dense bipartite graph and train attention weights
◮ Problems:

  ◮ Uninterpretable: in practice, even with ℓ1 regularization, we get distributed weights, e.g. Wr1,1 = 0.2D1 + 0.52D2 + 0.1D3 + 0.15D4 + 0.03D5
  ◮ Inefficient: computation involves all concept matrices
  ◮ Unnecessary: intuitively, each relation can be composed of at most K concepts

12 / 28

SLIDE 13

Sparse Mapping

◮ Problem: the sparse (ℓ0-constrained) mapping is not differentiable
◮ An approximate approach:

  ◮ Given the current embeddings, a correct mapping should minimize the loss function
  ◮ For each relation, assign a single concept and compute the loss
  ◮ Greedily choose the top K concepts that minimize the loss

13 / 28
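One way to read this approximation: score each concept in isolation and keep the best K. A hypothetical sketch, where the real `single_concept_loss` would be the model's margin loss over the training triples with all other parameters frozen:

```python
import numpy as np

def select_topk_concepts(single_concept_loss, n_concepts, K):
    """For one relation: evaluate the training loss with each single
    concept assigned to it (everything else frozen at current values),
    then greedily keep the K concepts with the lowest loss."""
    losses = np.array([single_concept_loss(i) for i in range(n_concepts)])
    return sorted(np.argsort(losses)[:K].tolist())

# Toy stand-in loss: pretend concepts 1 and 3 fit this relation best.
toy_losses = {0: 3.0, 1: 0.5, 2: 2.0, 3: 0.7}
print(select_topk_concepts(toy_losses.get, 4, 2))  # [1, 3]
```

This sidesteps the non-differentiability: the discrete assignment is chosen by evaluation rather than by gradient, while the continuous parameters stay trainable by SGD.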

SLIDE 14

Block Iterative Optimization

◮ Randomly initialize mappings and concepts
◮ Repeat:

  ◮ Optimize embeddings and attention weights with SGD
  ◮ Reassign mappings

14 / 28
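The alternation is a simple two-block loop; a skeletal sketch with hypothetical callbacks standing in for the real SGD epoch and the greedy reassignment step:

```python
def block_iterative_optimize(n_rounds, sgd_epoch, reassign_mappings):
    """Alternate two blocks: (1) with the relation-to-concept mappings
    fixed, run SGD on embeddings and attention weights; (2) with those
    parameters fixed, reassign each relation's concept mapping."""
    for _ in range(n_rounds):
        sgd_epoch()
        reassign_mappings()

# Toy run recording the alternation order.
trace = []
block_iterative_optimize(2, lambda: trace.append("sgd"),
                         lambda: trace.append("reassign"))
print(trace)  # ['sgd', 'reassign', 'sgd', 'reassign']
```

Each block improves the objective with the other block held fixed, which is why random initial mappings can still converge to meaningful concept assignments.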

SLIDE 15

A Better Sampling Approach: Domain Sampling

◮ The loss function involves negative sampling
◮ Sample from domain-specific entities with an adaptive probability
◮ E.g., negative samples for (Steve Jobs, was born in, US):

  ◮ Uniform negative sample: (Steve Jobs, was born in, CMU)
  ◮ Domain negative sample: (Steve Jobs, was born in, China)

15 / 28
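A minimal sketch of such a negative sampler; the entity names, the domain set, and the fixed mixing probability `p` are illustrative assumptions (in the talk the probability is adaptive):

```python
import random

def corrupt_tail(triple, tail_domain, all_entities, p, rng=random):
    """Replace the tail: with probability p draw from the relation's tail
    domain (a hard negative, e.g. another country for 'was born in'),
    otherwise draw uniformly from all entities."""
    h, r, t = triple
    pool = tail_domain if rng.random() < p else all_entities
    return h, r, rng.choice([e for e in pool if e != t])

countries = ["US", "China", "France"]
everything = countries + ["CMU", "Oscar"]
neg = corrupt_tail(("Steve Jobs", "was born in", "US"),
                   countries, everything, p=1.0)
print(neg)  # tail is some country other than "US"
```

Domain negatives are harder to distinguish from true triples than uniform negatives, so they yield more informative gradients for the margin loss.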

SLIDE 16

Outline

Introduction
  Task
  Motivation
Model
Experiments
  Main Results
  Performance on Rare Relations
  Interpretability
  Analysis on Sparseness

16 / 28

SLIDE 17

Main Results

Model                                Additional Information  WN18 MR    WN18 H@10    FB15k MR  FB15k H@10
SE [Bordes et al., 2011]             No                      985        80.5         162       39.8
Unstructured [Bordes et al., 2014]   No                      304        38.2         979       6.3
TransE [Bordes et al., 2013]         No                      251        89.2         125       47.1
TransH [Wang et al., 2014]           No                      303        86.7         87        64.4
TransR [Lin et al., 2015b]           No                      225        92.0         77        68.7
CTransR [Lin et al., 2015b]          No                      218        92.3         75        70.2
KG2E [He et al., 2015]               No                      348        93.2         59        74.0
TransD [Ji et al., 2015]             No                      212        92.2         91        77.3
TATEC [García-Durán et al., 2016]    No                      –          –            58        76.7
NTN [Socher et al., 2013]            No                      –          66.1         –         41.4
DISTMULT [Yang et al., 2015]         No                      –          94.2         –         57.7
STransE [Nguyen et al., 2016]        No                      206 (244)  93.4 (94.7)  69        79.7
ITransF                              No                      205        94.2         65        81.0
ITransF (domain sampling)            No                      223        95.2         77        81.4
RTransE [García-Durán et al., 2015]  Path                    –          –            50        76.2
PTransE [Lin et al., 2015a]          Path                    –          –            58        84.6
NLFeat [Toutanova and Chen, 2015]    Node + Link Features    –          94.3         –         87.0
Random Walk [Wei et al., 2016]       Path                    –          94.8         –         74.7
IRN [Shen et al., 2016]              External Memory         249        95.3         38        92.7

Table 1: Link prediction results on the two datasets. Hits@10 is the top-10 accuracy; higher Hits@10 or lower Mean Rank (MR) indicates better performance.

17 / 28

SLIDE 18

Performance on Rare Relations

Figure 2: Average Hits@10 on WN18 relations, ordered from frequent to rare, for ITransF (ours) vs. STransE.

18 / 28

SLIDE 19

Performance on Rare Relations

Figure 3: Average Hits@10 on relations of different frequencies (Frequent / Medium / Rare bins) for ITransF (ours) vs. STransE on (a) WN18 and (b) FB15k.

19 / 28

SLIDE 20

Interpretability: How Is Knowledge Shared?

◮ Each relation’s head and tail have their own concepts.

Figure 4: Heatmap visualization of attention weights on (a) WN18 and (b) FB15k.

20 / 28

SLIDE 21

Interpretability: How Is Knowledge Shared?

◮ Each relation’s head and tail have their own concepts.
◮ Interpretation:

  ◮ Reverse relations: hyponym and hypernym; award winning work and won award for.

Figure 5: Heatmap visualization of attention weights on (a) WN18 and (b) FB15k.

21 / 28

SLIDE 22

Interpretability: How Is Knowledge Shared?

◮ Each relation’s head and tail have their own concepts.
◮ Interpretation:

  ◮ Reverse relations: hyponym and hypernym; award winning work and won award for.
  ◮ Undirected relations: spouse; similar to.

Figure 6: Heatmap visualization of attention weights on (a) WN18 and (b) FB15k.

22 / 28

SLIDE 23

Interpretability: How Is Knowledge Shared?

◮ Each relation’s head and tail have their own concepts.
◮ Interpretation:

  ◮ Reverse relations: hyponym and hypernym; award winning work and won award for.
  ◮ Undirected relations: spouse; similar to.
  ◮ Similar relations: was nominated for and won award for; instance hypernym and hypernym.

(a) WN18 (b) FB15k

23 / 28

SLIDE 24

Interpretability of ℓ1 regularized dense mapping

Figure 8: Heatmap visualization of the ℓ1-regularized dense mapping on (a) WN18 and (b) FB15k.

◮ The mapping cannot be sparse without performance loss.

24 / 28

SLIDE 25

A Byproduct of Parameter Sharing: Model Compression

Figure 9: Hits@10 with different numbers of concepts for ITransF, compared to STransE and CTransR, on (a) FB15k and (b) WN18.

◮ On FB15k, the model can be compressed by nearly 90 times.

25 / 28

SLIDE 26

Analysis on Sparseness

◮ Does sparseness hurt performance?

Method      WN18 MR  WN18 H@10  Time   FB15k MR  FB15k H@10  Time
Dense       199      94.0       4m34s  69        79.4        4m30s
Dense + ℓ1  228      94.2       4m25s  131       78.9        5m47s
Sparse      207      94.1       2m32s  67        79.6        1m52s

Table 2: Performance of the model with a dense graph, or with a sparse graph using only 15 or 22 concepts. The time gap is more significant when more concepts are used.

◮ How does our approach compare to sparse encoding methods?

Method                                             WN18 MR  WN18 H@10  FB15k MR  FB15k H@10
Pretrain + Sparse Encoding [Faruqui et al., 2015]  211      86.6       66        79.1
Ours                                               205      94.2       65        81.0

Table 3: Different methods to obtain sparse representations

26 / 28

SLIDE 27

Conclusion

◮ Propose a knowledge embedding model which can discover shared hidden concepts
◮ Perform transfer learning through parameter sharing
◮ Design a learning algorithm to induce an interpretable sparse representation
◮ Outperform baselines on two benchmark datasets for the knowledge base completion task

27 / 28

SLIDE 28

Reference

Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2014). A Semantic Matching Energy Function for Learning with Multi-relational Data. Machine Learning, 94(2):233–259.

Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and Yakhnenko, O. (2013). Translating Embeddings for Modeling Multi-relational Data. In Advances in Neural Information Processing Systems 26, pages 2787–2795.

Bordes, A., Weston, J., Collobert, R., and Bengio, Y. (2011). Learning Structured Embeddings of Knowledge Bases. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pages 301–306.

Faruqui, M., Tsvetkov, Y., Yogatama, D., Dyer, C., and Smith, N. A. (2015). Sparse Overcomplete Word Vector Representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on

28 / 28