Semantic Documents Relatedness using Concept Graph Representation




SLIDE 1

Semantic Documents Relatedness using Concept Graph Representation

  • Date: 2016/07/12
  • Author: Yuan Ni et al., IBM Research, China
  • Source: ACM WSDM’16
  • Advisor: Jia-ling Koh
  • Speaker: Yi-hui Lee

SLIDE 2

WSDM 2016

  • The 9th ACM International Conference on Web Search and Data Mining

  • Feb 22-25, 2016
  • San Francisco, USA

SLIDE 3

Authors

  • Yuan Ni, Qiong Kai Xu, Feng Cao, Hui Jia Zhu
  • IBM Research, China
  • Yosi Mass, Dafna Sheinwald
  • IBM Research, Haifa
  • Shao Sheng Cao
  • Xidian University

SLIDE 4

Outline

  • Introduction
  • Approach
  • Experiment
  • Conclusion

SLIDE 5

Introduction

  • Bag-of-words models
  • Do not address the multiple meanings of the same word (polysemy) or the synonymy of words.
  • LSA, LDA, Word2Vec, …
  • The latent topics are hard to interpret.

SLIDE 6

Introduction

  • Proposed a document representation
  • Use topic models that are based on explicit concepts.
  • Consider the relationships among the concepts.
  • Use neural network based methods to represent concepts as continuous vectors.

SLIDE 7

Introduction

  • A document is represented as a concept graph.
  • Nodes represent concepts extracted from the document through references to entities in a knowledge base.
  • Edges represent the semantic and structural relationships among the concepts.

SLIDE 8

Outline

  • Introduction
  • Approach
  • Experiment
  • Conclusion

SLIDE 9

Approach

Spotlight or TagMe

  • 1. Context association
  • 2. Category association
  • 3. Structure association

SLIDE 10

Mention Detector

  • “Gambling increases aggregate demand for goods and services in the economy.”
  • Extract concepts (nodes) from the document
  • gambling
  • aggregate demand
  • goods and services
  • economy
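
The extraction step above can be illustrated with a toy dictionary lookup. This is only a sketch: the deck names Spotlight and TagMe as the actual mention detectors, and the lexicon, the concept identifiers, and the `detect_mentions` helper below are hypothetical stand-ins.

```python
import re

# Hypothetical mini-lexicon: surface form -> knowledge-base concept identifier.
LEXICON = {
    "gambling": "Gambling",
    "aggregate demand": "Aggregate_demand",
    "goods and services": "Goods_and_services",
    "goods": "Goods",
    "economy": "Economy",
}

def detect_mentions(text, lexicon=LEXICON):
    """Return (surface form, concept) pairs, preferring the longest match."""
    # Longest forms first, so "goods and services" wins over "goods".
    forms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(f) for f in forms) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.group(1).lower(), lexicon[m.group(1).lower()])
        for m in pattern.finditer(text)
    ]

sentence = "Gambling increases aggregate demand for goods and services in the economy."
mentions = detect_mentions(sentence)
```

A real detector also disambiguates between candidate entities; the lookup here skips that step entirely.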

SLIDE 11

Knowledge Base

  • 1. definition

  • 2. synonym


SLIDE 12

SLIDE 13

Concept Graph Builder

  • Find relationships (edges) between the concepts through the three kinds of association

  • Context association
  • Category association
  • Structure association

SLIDE 14

Context Association

  • Relative frequency of occurrence of both concepts in the same contexts.
  • A large number of shared incoming links indicates a high context association.

Formula annotations:
  • The concept
  • The set of incoming links to the concept
  • The total number of concepts in the knowledge base
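
The formula for this slide did not survive in the transcript. Its annotations (incoming-link sets and the knowledge-base size) match the well-known Wikipedia link-based measure of Milne and Witten, so the sketch below assumes that form. The function name and the toy link sets are hypothetical.

```python
import math

def context_association(in_a, in_b, total_concepts):
    """Relatedness from shared incoming links (assumed Milne-Witten form).

    in_a, in_b: sets of concepts linking to each of the two concepts.
    total_concepts: number of concepts in the knowledge base, assumed
    larger than either link set.
    """
    shared = in_a & in_b
    if not shared:
        return 0.0
    big = max(len(in_a), len(in_b))
    small = min(len(in_a), len(in_b))
    distance = (math.log(big) - math.log(len(shared))) / (
        math.log(total_concepts) - math.log(small)
    )
    # Clamp so that weakly related pairs never score below zero.
    return max(0.0, 1.0 - distance)

# Toy example: two of the incoming links are shared.
score = context_association({1, 2, 3, 4}, {3, 4, 5}, total_concepts=1000)
```

More shared incoming links shrink the distance term, which is the behaviour the second bullet describes.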

SLIDE 15

Category Association

  • How close the concepts are in the knowledge base taxonomy of categories.
  • If two concepts belong to the same Wikipedia category, they are associated with the same topic.
  • Measure the category association between two concepts by first finding the pairwise similarity between any two individual categories they belong to.

Example (figure): the concept Obama with the categories black, politic, president, republic.

Formula annotations:
  • Information content score for the taxonomy
  • Common ancestor of ci and cj with the highest information content
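
The pairwise formula itself is missing from the transcript. The slide only names an information content (IC) score and the common ancestor with the highest IC. The sketch below assumes a Lin-style normalization, which is one common choice and not necessarily the authors'. The toy taxonomy and its IC values are hypothetical.

```python
def ancestors(category, parents):
    """Return the category itself plus every ancestor reachable via parent links."""
    seen, stack = set(), [category]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen

def pairwise_category_similarity(ci, cj, parents, ic):
    """Assumed Lin-style score: 2 * IC(best common ancestor) / (IC(ci) + IC(cj))."""
    common = ancestors(ci, parents) & ancestors(cj, parents)
    if not common:
        return 0.0
    best = max(common, key=lambda c: ic[c])  # most informative common ancestor
    denom = ic[ci] + ic[cj]
    return 2.0 * ic[best] / denom if denom else 0.0

# Hypothetical toy taxonomy (child -> parents) with made-up IC values.
PARENTS = {
    "Presidents": ["Politicians"],
    "Senators": ["Politicians"],
    "Politicians": ["People"],
    "People": [],
}
IC = {"People": 0.0, "Politicians": 0.5, "Presidents": 1.0, "Senators": 1.0}

sim = pairwise_category_similarity("Presidents", "Senators", PARENTS, IC)
```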

SLIDE 16

Category Association

  • Different metrics are proposed to measure the information content of a node in the taxonomy.
  • An intrinsic metric (I) considers only the topological properties of the taxonomy backbone.
  • An extrinsic metric (E) also considers the instances that belong to each category in the taxonomy.
  • The more specific a node is in the taxonomy, the higher its information content.

SLIDE 17

Category Association

Formula annotations, first intrinsic metric (I):
  • The depth of the category in the taxonomy
  • The maximum depth of the taxonomy

Formula annotations, second intrinsic metric (I):
  • The number of descendants of a category in the taxonomy
  • The set of descendants of the category
  • The set of all categories in the taxonomy

Formula annotations, extrinsic metric (E), which also uses the number of instances that belong to a category:
  • The number of instances that belong to the category or its descendants
  • The total number of instances in DBpedia
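
Only the annotations of the three information content formulas survive, so the functions below are assumed forms built from the quantities those annotations name: a depth ratio, a descendant-count measure, and an instance-frequency measure. They are meant to show the shared behaviour (more specific categories score higher), not to reproduce the paper's equations. All function names are hypothetical.

```python
import math

def ic_depth(depth, max_depth):
    """Intrinsic (assumed form): deeper categories are more informative."""
    return depth / max_depth

def ic_descendants(num_descendants, num_categories):
    """Intrinsic (assumed form): fewer descendants means more specific."""
    return 1.0 - math.log(num_descendants + 1) / math.log(num_categories)

def ic_extrinsic(num_instances, total_instances):
    """Extrinsic (assumed form): rarer categories are more informative.

    num_instances counts instances of the category or its descendants.
    """
    return -math.log(num_instances / total_instances)
```

In each case a leaf or rarely used category receives a higher score than a broad one near the root.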

SLIDE 18

Category Association

  • Groupwise category similarity, denoted by cat(C1, C2), is defined as the average best similarity over all pairs.
  • Given two concepts m1 and m2, let C1 = {c11, c12, … c1p} and C2 = {c21, c22, … c2q} denote the respective sets of categories that m1 and m2 belong to.

Formula annotations:
  • The maximal pairwise similarity between c1i and any category in C2
  • The maximal pairwise similarity between c2j and any category in C1
  • The number of categories in C1, the categories of concept m1
  • The number of categories in C2, the categories of concept m2
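
Read together, the definition and the annotations suggest the following computation: sum each category's best match in the other set, in both directions, and divide by p + q. The sketch below follows that reading. The exact-match `toy_sim` function is a hypothetical placeholder for the real pairwise category similarity.

```python
def groupwise_category_similarity(c1, c2, sim):
    """Average best pairwise similarity over both category sets.

    cat(C1, C2) = (sum_i max_j sim(c1i, c2j) + sum_j max_i sim(c2j, c1i)) / (p + q)
    """
    if not c1 or not c2:
        return 0.0
    best_1 = sum(max(sim(a, b) for b in c2) for a in c1)
    best_2 = sum(max(sim(b, a) for a in c1) for b in c2)
    return (best_1 + best_2) / (len(c1) + len(c2))

def toy_sim(a, b):
    """Hypothetical pairwise similarity: 1.0 for identical categories, else 0.0."""
    return 1.0 if a == b else 0.0

score = groupwise_category_similarity(["a", "b"], ["b", "c", "d"], toy_sim)
```

Here only the category "b" is shared, so each direction contributes 1.0 and the result is 2 / 5.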

SLIDE 19

Structure Association

  • Wikipedia infoboxes contain information about concepts and their various types of relationships.
  • Induce a structural graph, G(V, E), whose edges e ∈ E are labeled by predicates pred(e) that indicate the type of the relationship.
  • Frequent predicates represent a general, less significant relationship.

SLIDE 20

Concept Centrality Finder

  • Assign weights to the concepts (nodes) to reflect their relevance to the aspects of the document.

  • Global method
  • Concept graph based weights
  • Local method
  • Content based weights

SLIDE 21

Concept Graph Based Weights

  • According to different criteria of “importance”, which suit different purposes, different graph properties of a node are considered for the evaluation of its centrality.
  • Degree
  • Closeness
  • Betweenness

Formula annotations:
  • The set of concepts in the concept graph
  • Weight parameters
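
As an illustration of centrality-based weights, the sketch below computes degree and closeness centrality on a small unweighted graph and mixes them linearly. The mixing form, the parameters `alpha` and `beta`, and the toy graph are assumptions. The slide's own formula is not preserved, and betweenness is left out for brevity.

```python
from collections import deque

def degree_centrality(graph):
    """Fraction of the other nodes each node is connected to (needs >= 2 nodes)."""
    n = len(graph)
    return {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}

def closeness_centrality(graph):
    """Reachable-node count divided by total shortest-path distance (BFS)."""
    scores = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in graph[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores

def combined_weight(graph, alpha=0.5, beta=0.5):
    """Hypothetical linear mix of the two centralities."""
    deg, clo = degree_centrality(graph), closeness_centrality(graph)
    return {v: alpha * deg[v] + beta * clo[v] for v in graph}

# Toy concept graph as an undirected adjacency map (edge weights ignored).
GRAPH = {
    "economy": {"gambling", "aggregate demand", "goods and services"},
    "gambling": {"economy"},
    "aggregate demand": {"economy"},
    "goods and services": {"economy"},
}
weights = combined_weight(GRAPH)
```

In this star-shaped graph the hub concept "economy" receives the highest weight under both criteria.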

SLIDE 22

Content Based Weights

  • The local method is based on the content similarity between the Wikipedia page of the concept m and the given document d.
  • Combine with the centrality based weights.
  • Filter out concepts that may be erroneously detected by the mention detection tool.
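
A minimal sketch of the content-based filter, assuming a plain bag-of-words cosine as the content similarity and a hypothetical cut-off of 0.1. The page texts are invented, and the slide does not say which similarity function or threshold the authors used.

```python
import math
from collections import Counter

def cosine_bow(text_a, text_b):
    """Cosine similarity between whitespace-tokenized term-frequency vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def filter_concepts(document, concept_pages, threshold=0.1):
    """Keep concepts whose page text is similar enough to the document."""
    scores = {c: cosine_bow(page, document) for c, page in concept_pages.items()}
    return {c: s for c, s in scores.items() if s >= threshold}

document = "gambling increases aggregate demand for goods and services in the economy"
pages = {  # hypothetical page snippets
    "Gambling": "gambling is the wagering of money on games",
    "Jaguar": "jaguar is a large cat species",
}
kept = filter_concepts(document, pages)
```

The "Jaguar" page shares no terms with the document, so a mention wrongly linked to it would be dropped.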

SLIDE 23

Similarity Calculator

  • Compute the semantic relatedness between given documents in terms of features of concept graphs.

  • Pairwise concept similarity
  • Document similarity

SLIDE 24

Pairwise Concept Similarity

  • Concept2Vector measures the similarity between two concepts, using a neural network to represent concepts as continuous vectors.
  • He is an alumnus of Georgetown University, where he was a member of Kappa Kappa Psi and Phi Beta Kappa and earned a Rhodes Scholarship to attend the Oxford University.
  • He is an alumnus of Georgetown_University, where he was a member of Kappa_Kappa_Psi and Phi_Beta_Kappa_Society and earned a Rhodes_Scholarship to attend the University_of_Oxford.
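
The two example sentences differ only in that each mention is replaced by a single concept token. A sketch of that rewriting step, using the pairs visible on the slide, is shown below. The `ANCHORS` table and the `link_concepts` helper are hypothetical, and how the mapping is obtained in practice is not stated here.

```python
import re

# Surface form -> concept token, taken from the slide's example sentences.
ANCHORS = {
    "Georgetown University": "Georgetown_University",
    "Kappa Kappa Psi": "Kappa_Kappa_Psi",
    "Phi Beta Kappa": "Phi_Beta_Kappa_Society",
    "Rhodes Scholarship": "Rhodes_Scholarship",
    "Oxford University": "University_of_Oxford",
}

def link_concepts(sentence, anchors=ANCHORS):
    """Replace each known mention with its concept token in a single pass."""
    forms = sorted(anchors, key=len, reverse=True)  # prefer longer mentions
    pattern = re.compile("|".join(re.escape(f) for f in forms))
    return pattern.sub(lambda m: anchors[m.group(0)], sentence)

raw = (
    "He is an alumnus of Georgetown University, where he was a member of "
    "Kappa Kappa Psi and Phi Beta Kappa and earned a Rhodes Scholarship "
    "to attend the Oxford University."
)
linked = link_concepts(raw)
```

After this step each concept is one token, so a word-embedding model can learn a vector for it alongside ordinary words.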

SLIDE 25

Pairwise Concept Similarity

  • Use the Skip-Gram model to generate infrequent-term concept vectors.
  • Wikipedia: 4.27 million concepts
  • DBpedia: 4.58 million concepts
  • The coverage of the concepts is 93.2%

SLIDE 26

Document Similarity

  • ConceptGraphSim comprises both a pairwise similarity and cosine similarity.

Formula annotations:
  • $\{m_{1i}\}_{i=1}^{p}$: the concepts in the concept graph of D1
  • $\{w_{1i}\}_{i=1}^{p}$: the weights of the concepts of D1
  • The best pairwise similarity for m2j (m1i respectively)
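
The ConceptGraphSim formula is not preserved, but the annotations point to a weighted best-match average over the concepts of both documents. The sketch below assumes that form, with cosine similarity between concept vectors as the pairwise measure and normalization by the total weight. The toy vectors and weights are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def concept_graph_sim(doc1, doc2, vectors):
    """Weighted average of best pairwise similarities (assumed form).

    doc1, doc2: dicts mapping concept -> weight.
    vectors: dict mapping concept -> embedding vector.
    """
    if not doc1 or not doc2:
        return 0.0

    def directed(src, dst):
        # Each source concept contributes its weight times its best match.
        return sum(
            w * max(cosine(vectors[m], vectors[n]) for n in dst)
            for m, w in src.items()
        )

    total_weight = sum(doc1.values()) + sum(doc2.values())
    if not total_weight:
        return 0.0
    return (directed(doc1, doc2) + directed(doc2, doc1)) / total_weight

# Hypothetical 2-dimensional concept vectors and concept weights.
VECTORS = {"economy": [1.0, 0.0], "finance": [1.0, 0.0], "music": [0.0, 1.0]}
same_topic = concept_graph_sim({"economy": 1.0}, {"finance": 0.5}, VECTORS)
different_topic = concept_graph_sim({"economy": 1.0}, {"music": 1.0}, VECTORS)
```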

SLIDE 27

Experiment

  • Experimental testbed
  • Data sets: LP50, which consists of 50 news articles from the Australian Broadcasting Corporation (ABC)
  • Evaluation metric: Pearson linear correlation
  • Experimental setup: process Wikipedia to generate the Concept2Vector model
  • Generate vectors for 3 million words and 4 million concepts
  • The vector for each term has 200 dimensions
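
For reference, the evaluation metric can be computed as below: a plain implementation of the Pearson linear correlation between system scores and human judgments. The sample score lists are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical system scores vs. human relatedness ratings for four document pairs.
system_scores = [0.1, 0.4, 0.35, 0.8]
human_scores = [1.0, 2.5, 2.0, 4.5]
r = pearson(system_scores, human_scores)
```

A value near 1 means the system ranks and spaces document pairs much like the human raters do.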

SLIDE 28

Compare to State-of-the-art

Method                   Pearson correlation
LSA                      0.60
GED                      0.63
ESA (paper)              0.720
ESA (implemented)        0.727
ConceptGraphSim          0.745
WikiWalk + ESA           0.772
ConceptGraphSim + ESA    0.786
ConceptsLearned          0.808

  • Might be overfitted to LP50
  • Not domain specific

SLIDE 29

Effect of Centrality Metrics and Concept Weight Strategies

SLIDE 30

Effect of Pairwise Concept Similarity

SLIDE 31

Effect of IC on Category Level Association

SLIDE 32

Conclusions

  • Propose a concept graph for representing a document using its detected concepts.
  • Present a novel similarity measure, ConceptGraphSim, between two documents by comparing their concept graphs.
  • The combination of concept based centrality with neural networks has a high potential that can be further exploited to achieve better results.