
Semantic Documents Relatedness using Concept Graph Representation

Semantic Documents Relatedness using Concept Graph Representation Date : 2016/07/12 Author : Yuan Ni et al. IBM Research, China Source : ACM WSDM16 Advisor : Jia-ling Koh Speaker : Yi-hui Lee WSDM 2016 The 9th ACM International


  1. Semantic Documents Relatedness using Concept Graph Representation • Date: 2016/07/12 • Author: Yuan Ni et al., IBM Research, China • Source: ACM WSDM’16 • Advisor: Jia-ling Koh • Speaker: Yi-hui Lee

  2. WSDM 2016 • The 9th ACM International Conference on Web Search and Data Mining • Feb 22-25, 2016 • San Francisco, USA

  3. Authors • Yuan Ni, Qiong Kai Xu, Feng Cao, Hui Jia Zhu (IBM Research, China) • Yogi Mass, Dafna Sheinwald (IBM Research, Haifa) • Shaw Sheng Cao (Xidian University)

  4. Outline • Introduction • Approach • Experiment • Conclusion

  5. Introduction • Bag-of-words models: do not address polysemy (multiple meanings of the same word) or synonymy of words. • LSA, LDA, Word2Vec, …: the latent topics are hard to interpret.

  6. Introduction • Proposes a document representation that: • Uses topic models based on explicit concepts. • Considers the relationships among the concepts. • Uses neural network based methods to represent concepts as continuous vectors.

  7. Introduction • A document is represented as a concept graph. • Nodes represent concepts extracted from the document through references to entities in a knowledge base. • Edges represent the semantic and structural relationships among the concepts.

  8. Outline • Introduction • Approach • Experiment • Conclusion

  9. Approach • Mention detection with Spotlight or TagMe • Concept graph built from three associations: 1. Context association 2. Category association 3. Structure association

  10. Mention Detector • Extracts the concepts (nodes) mentioned in the document. • Example: “Gambling increases aggregate demand for goods and services in the economy.” • gambling • aggregate demand • goods and services • economy

  11. Knowledge Base • 1. definition • 2. synonym • …

  13. Concept Graph Builder • Finds relationships (edges) between the concepts through three kinds of association: • Context association • Category association • Structure association

  14. Context Association • Relative frequency with which both concepts occur in the same contexts. • A large number of shared incoming links indicates a high context association. • Quantities in the formula: the two concepts being compared; the sets of incoming links to each concept; the total number of concepts in the knowledge base.
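
A minimal sketch of a context association score computed from incoming-link sets. The slide lists only the quantities involved, so the formula here is an assumption: it follows the Wikipedia link-based measure of Milne and Witten, which uses exactly those quantities (the two incoming-link sets and the total number of concepts). The function name and the toy link sets are hypothetical.

```python
import math

def context_association(links_a, links_b, total_concepts):
    """Link-based relatedness in [0, 1] (assumed Milne-Witten style formula).

    links_a, links_b: sets of concepts that link to each of the two concepts.
    total_concepts: number of concepts in the knowledge base; assumed to be
    larger than either link set.
    """
    shared = links_a & links_b
    if not shared:
        return 0.0
    larger = max(len(links_a), len(links_b))
    smaller = min(len(links_a), len(links_b))
    distance = (math.log(larger) - math.log(len(shared))) / (
        math.log(total_concepts) - math.log(smaller)
    )
    # Clamp so that very weakly related pairs score 0 rather than negative.
    return max(0.0, 1.0 - distance)

# Toy example: two concepts sharing two of their incoming links.
score = context_association({1, 2, 3, 4}, {3, 4, 5}, total_concepts=1000)
```

More shared incoming links shrink the distance term, so the score rises toward 1, which matches the slide's statement that many shared links indicate a high context association.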

  15. Category Association • How close the concepts are in the knowledge base taxonomy of categories. • If two concepts belong to the same Wikipedia category, they are associated with the same topic. • The category association between two concepts is measured by first finding the pairwise similarity between any two individual categories they belong to. • Quantities in the formula: the common ancestor of c_i and c_j with the highest information content; the information content score for the taxonomy. • Example concept: Obama, with category labels black, republic, president, politic.
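
The pairwise step can be sketched as a search for the most informative common ancestor. This is a sketch under assumptions: the similarity is taken to be the information content of that ancestor (a Resnik-style choice, which the slide does not confirm), and the toy taxonomy, the `parents` mapping, and the `ic` values are hypothetical.

```python
def ancestors(category, parents):
    """Return the category itself plus every ancestor reachable via `parents`."""
    seen = set()
    stack = [category]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen

def category_similarity(c_i, c_j, parents, ic):
    """Information content of the most informative common ancestor (assumed)."""
    common = ancestors(c_i, parents) & ancestors(c_j, parents)
    if not common:
        return 0.0
    return max(ic[c] for c in common)

# Hypothetical taxonomy: child category -> list of parent categories.
parents = {
    "presidents": ["politicians"],
    "senators": ["politicians"],
    "politicians": ["people"],
    "people": [],
}
# Hypothetical information content scores (more specific = higher).
ic = {"people": 0.0, "politicians": 0.5, "presidents": 0.9, "senators": 0.9}

sim = category_similarity("presidents", "senators", parents, ic)
```

Here the two categories meet at "politicians", so the similarity is that category's score rather than the score of the more general "people".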

  16. Category Association • Different metrics are proposed to measure the information content of a node in the taxonomy. • An intrinsic metric (I) considers only the topological properties of the taxonomy backbone. • An extrinsic metric (E) also considers the instances that belong to each category in the taxonomy. • The more specific a node is in the taxonomy, the higher its information content.

  17. Category Association • Quantities in the information content formulas: • The depth of the category in the taxonomy (I) • Also the number of instances that belong to a category (E) • The maximum depth of the taxonomy • The number of descendants of a category in the taxonomy (I) • The number of instances that belong to the category or its descendants • The total number of instances in DBPedia • The set of all categories • The set of descendants in the taxonomy of the category
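
The quantities above are enough to illustrate how intrinsic and extrinsic information content might be computed. The three formulas below are common formulations from the literature, used here as assumptions; they are not necessarily the ones on the slide, and all function names are hypothetical.

```python
import math

def ic_depth(depth, max_depth):
    """Intrinsic: deeper categories are more specific (assumed linear scaling)."""
    return depth / max_depth

def ic_descendants(num_descendants, num_categories):
    """Intrinsic: fewer descendants means more specific.

    Assumes num_categories > 1 so the logarithm in the denominator is nonzero.
    """
    return 1.0 - math.log(num_descendants + 1) / math.log(num_categories)

def ic_instances(num_instances, total_instances):
    """Extrinsic: negative log of the share of instances under the category.

    Assumes the category (with its descendants) has at least one instance.
    """
    return -math.log(num_instances / total_instances)

# A category covering 10 of 1000 instances is more informative than one covering 100.
specific = ic_instances(10, 1000)
general = ic_instances(100, 1000)
```

All three variants agree with the slide's qualitative claim: a more specific node gets a higher score.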

  18. Category Association • Given two concepts m_1 and m_2, let C_1 = {c_11, c_12, …, c_1p} and C_2 = {c_21, c_22, …, c_2q} denote the respective sets of categories that m_1 and m_2 belong to. • The groupwise category similarity, denoted cat(C_1, C_2), is defined as the average best similarity over all pairs. • Quantities in the formula: the maximal pairwise similarity between c_1i and any category in C_2; the maximal pairwise similarity between c_2j and any category in C_1; the number of categories in C_1, which is related to concept m_1; the number of categories in C_2, which is related to concept m_2.
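
A minimal sketch of an "average best similarity" over two category sets. The exact normalization is an assumption: the best matches in both directions are summed and divided by the total number of categories. `pair_sim` stands for any pairwise category similarity, and the exact-match function used in the example is a hypothetical stand-in.

```python
def groupwise_similarity(cats_1, cats_2, pair_sim):
    """Average of each category's best match in the other set (assumed form)."""
    if not cats_1 or not cats_2:
        return 0.0
    # Best match in C_2 for every category of C_1, and vice versa.
    best_1 = sum(max(pair_sim(a, b) for b in cats_2) for a in cats_1)
    best_2 = sum(max(pair_sim(a, b) for a in cats_1) for b in cats_2)
    return (best_1 + best_2) / (len(cats_1) + len(cats_2))

def exact_match(a, b):
    """Stand-in pairwise similarity: 1.0 for identical categories, else 0.0."""
    return 1.0 if a == b else 0.0

cat_score = groupwise_similarity(["x", "y"], ["x"], exact_match)
```

With this stand-in, "x" finds a perfect match in both directions and "y" finds none, so two of the three best-match terms equal 1.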

  19. Structure Association • Wikipedia infoboxes contain information about concepts and their various types of relationships. • These induce a structural graph G(V, E) whose edges e ∈ E are labeled by predicates pred(e) that indicate the type of the relationship. • Frequent predicates represent a general, less significant relationship.
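
One way to make frequent predicates count less is an inverse-frequency weight over the edge labels. This is a sketch under that assumption; the slide does not give the weighting formula, and the triples and predicate names below are hypothetical.

```python
import math
from collections import Counter

def predicate_weights(edges):
    """Weight each predicate by log(total edges / edges with that predicate).

    edges: list of (subject, predicate, object) triples.
    """
    counts = Counter(pred for _, pred, _ in edges)
    total = sum(counts.values())
    return {pred: math.log(total / n) for pred, n in counts.items()}

# Hypothetical structural graph: "type" is frequent, "spouse" is rare.
edges = [
    ("a", "type", "b"),
    ("c", "type", "d"),
    ("e", "type", "f"),
    ("a", "spouse", "c"),
]
pred_weights = predicate_weights(edges)
```

The rare predicate receives the larger weight, so an edge labeled with it signals a more significant relationship than an edge with a ubiquitous label.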

  20. Concept Centrality Finder • Assigns weights to the concepts (nodes) to reflect their relevance to the aspects of the document. • Global method: concept graph based weights • Local method: content based weights

  21. Concept Graph Based Weights • Different criteria of “importance” suit different purposes, so different graph properties of a node are considered when evaluating its centrality: • Degree • Closeness • Betweenness • Quantities in the formulas: the set of concepts in the concept graph; weight parameters.
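
A minimal sketch of graph-based concept weights on an unweighted, undirected concept graph. Only degree and closeness are implemented, and betweenness is left out for brevity. Combining them linearly with `alpha` and `beta` is an assumption about how the weight parameters are used, and the edges of the toy graph are hypothetical.

```python
from collections import deque

def degree_centrality(graph):
    """Fraction of the other nodes each node is connected to (needs >= 2 nodes)."""
    n = len(graph)
    return {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}

def closeness_centrality(graph):
    """Reachable node count divided by total shortest-path distance (BFS)."""
    scores = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores

def concept_weights(graph, alpha=0.5, beta=0.5):
    """Assumed linear combination of the two centrality scores."""
    deg = degree_centrality(graph)
    clo = closeness_centrality(graph)
    return {v: alpha * deg[v] + beta * clo[v] for v in graph}

# Hypothetical concept graph: node -> set of neighboring concepts.
graph = {
    "gambling": {"economy"},
    "aggregate demand": {"economy", "goods and services"},
    "goods and services": {"aggregate demand", "economy"},
    "economy": {"gambling", "aggregate demand", "goods and services"},
}
weights = concept_weights(graph)
```

In this toy graph "economy" touches every other concept, so it gets the highest weight under both measures.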

  22. Content Based Weights • The local method is based on the content similarity between the Wikipedia page of the concept m and the given document d. • Combined with the centrality based weights. • Filters out concepts that may have been erroneously detected by the mention detection tool.
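
A minimal sketch of content based filtering and weighting. Several choices here are assumptions: content similarity is a bag-of-words cosine, the combination with centrality is a product, and the `threshold` value, page texts, and function names are hypothetical.

```python
import math
from collections import Counter

def bow_cosine(text_a, text_b):
    """Cosine similarity between whitespace-tokenized bag-of-words vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0

def filter_concepts(document, concept_pages, centrality, threshold=0.1):
    """Drop concepts whose page is unlike the document; weight the rest."""
    kept = {}
    for concept, page in concept_pages.items():
        content = bow_cosine(page, document)
        if content >= threshold:
            # Assumed combination: content similarity times centrality weight.
            kept[concept] = content * centrality.get(concept, 0.0)
    return kept

document = "gambling increases aggregate demand in the economy"
concept_pages = {
    "economy": "the economy is the production and demand of goods",
    "jazz": "jazz is a music genre",
}
kept = filter_concepts(document, concept_pages, {"economy": 1.0, "jazz": 1.0})
```

The "jazz" page shares no words with the document, so a concept like it, detected by mistake, would be filtered out before any graph comparison.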

  23. Similarity Calculator • Computes the semantic relatedness between the given documents from features of their concept graphs. • Pairwise concept similarity • Document similarity

  24. Pairwise Concept Similarity • Concept2Vector measures the similarity between two concepts, using a neural network to represent concepts as continuous vectors. • Original text: He is an alumnus of Georgetown University, where he was a member of Kappa Kappa Psi and Phi Beta Kappa and earned a Rhodes Scholarship to attend the Oxford University. • Text with mentions replaced by concepts: He is an alumnus of Georgetown_University, where he was a member of Kappa_Kappa_Psi and Phi_Beta_Kappa_Society and earned a Rhodes_Scholarship to attend the University_of_Oxford.
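
The rewriting step shown in the example, where each mention is replaced by a single concept token, can be sketched as follows. The `mentions` mapping is a hypothetical stand-in for the output of a mention detector, and the sentence is a shortened version of the slide's example.

```python
import re

def annotate(text, mentions):
    """Replace each surface form with its concept token in a single pass.

    Longer surface forms are tried first so that a short mention never
    splits a longer one.
    """
    if not mentions:
        return text
    surfaces = sorted(mentions, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in surfaces))
    return pattern.sub(lambda m: mentions[m.group(0)], text)

# Hypothetical detector output: surface form -> knowledge base concept.
mentions = {
    "Georgetown University": "Georgetown_University",
    "Rhodes Scholarship": "Rhodes_Scholarship",
    "Oxford University": "University_of_Oxford",
}
sentence = (
    "He is an alumnus of Georgetown University and earned a "
    "Rhodes Scholarship to attend the Oxford University."
)
annotated = annotate(sentence, mentions)
```

After this step each concept is one token, so a word embedding model trained on the rewritten text learns one vector per concept alongside the ordinary word vectors.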

  25. Pairwise Concept Similarity • Uses the Skip-Gram model to generate concept vectors for infrequent terms. • Wikipedia: 4.27 million concepts • DBPedia: 4.58 million concepts • The coverage of the concepts is 93.2%
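
For readers unfamiliar with Skip-Gram, the sketch below generates the (target, context) training pairs the model learns from; the network that turns these pairs into vectors is omitted. The window size and the token sequence are hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """List every (target, context) pair within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# Concept tokens and ordinary words share one sequence.
tokens = ["alumnus", "of", "Georgetown_University"]
pairs = skipgram_pairs(tokens, window=1)
```

Because concept tokens such as `Georgetown_University` appear in the same sequences as words, they receive vectors in the same space.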

  26. Document Similarity • ConceptGraphSim combines a pairwise similarity and a cosine similarity. • Quantities in the formula: {m_1i} (i = 1, …, p) are the concepts in the concept graph of D_1; {w_1i} (i = 1, …, p) are the weights of the concepts of D_1; the best pairwise similarity for m_2j (m_1i respectively).
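
A minimal sketch of a document similarity built from weighted best pairwise matches. The aggregation is an assumption: each concept contributes its best cosine match in the other document, scaled by its weight, and the sum over both directions is normalized by the total weight. The vectors and weights are toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def document_similarity(doc_1, doc_2):
    """Each document is a list of (concept_vector, weight) pairs."""
    if not doc_1 or not doc_2:
        return 0.0

    def directed(src, dst):
        # Weighted sum of each source concept's best match in the other document.
        return sum(w * max(cosine(vec, other) for other, _ in dst) for vec, w in src)

    total_weight = sum(w for _, w in doc_1) + sum(w for _, w in doc_2)
    if total_weight == 0:
        return 0.0
    return (directed(doc_1, doc_2) + directed(doc_2, doc_1)) / total_weight

# Toy documents with two-dimensional concept vectors.
doc_a = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.5)]
doc_b = [([0.0, 1.0], 1.0)]
doc_c = [([1.0, 0.0], 1.0)]
```

Under this form a heavily weighted concept with no good counterpart lowers the score more than a lightly weighted one, which is the role the centrality weights are meant to play.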

  27. Experiment • Experimental testbed • Data set: LP50, which consists of 50 news articles from the Australian Broadcasting Corporation (ABC) • Evaluation metric: Pearson linear correlation • Experimental setup: Wikipedia is processed to generate the Concept2Vector model • Vectors are generated for 3 million words and 4 million concepts • The vector for each term has 200 dimensions
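
The evaluation metric can be computed directly from two score lists, for example system similarities against human ratings for the same document pairs. The sketch below implements the standard Pearson formula; the example lists are made-up illustrations and not LP50 data.

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up scores: a perfectly linear and a perfectly inverse relationship.
same_direction = pearson([1, 2, 3], [2, 4, 6])
opposite_direction = pearson([1, 2, 3], [3, 2, 1])
```

The correlation is insensitive to the scale of the system scores, so methods with different output ranges can be compared against the same human ratings.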

  28. Compare to State-of-the-art

      Method                   Pearson correlation
      LSA                      0.60
      GED                      0.63
      ESA paper                0.720
      ESA implemented          0.727
      ConceptGraphSim          0.745
      WikiWalk + ESA           0.772
      ConceptGraphSim + ESA    0.786
      ConceptsLearned          0.808

      Slide notes: • Might be overfitted to the LP50 • Not domain specific

  29. Effect of Centrality Metrics and Concept Weight Strategies

  30. Effect of Pairwise Concept Similarity

  31. Effect of IC on Category Level Association

  32. Conclusions • Proposes a concept graph for representing a document using its detected concepts. • Presents a novel similarity measure, ConceptGraphSim, which compares two documents through their concept graphs. • The combination of concept based centrality with neural networks has high potential that can be further exploited to achieve better results.
