  1. Semantic Tagging Using Topic Models Exploiting Wikipedia Category Network Nitesh Prakash, Duncan Rule, Boh Young Suh

  2. Introduction Goal: tag web articles with most probable Wikipedia categories - What is the article “about” in terms of categories? - Helpful for information access and retrieval

  3. Model Overview (sOntoLDA) Modify LDA to suit the problem's needs - Pre-define topics as Wiki categories - Use prior knowledge to improve the topic-word distribution - Wikipedia articles labeled with categories - Represent the prior knowledge with the λ matrix - LDA: φ ~ Dir(β); sOntoLDA: φ ~ Dir(β × λ)
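To make the modified prior concrete, here is a minimal numpy sketch contrasting the two draws; the matrix name lam, its random contents, and the tiny dimensions are illustrative placeholders, not the presenters' actual setup.

import numpy as np

rng = np.random.default_rng(0)
num_categories, vocab_size = 4, 10
beta = 0.01                                      # symmetric base hyperparameter
lam = rng.random((num_categories, vocab_size))   # stand-in for the tf-idf prior matrix

# Plain LDA: every category shares the same symmetric Dirichlet prior.
phi_lda = rng.dirichlet(np.full(vocab_size, beta), size=num_categories)

# sOntoLDA: the prior is rescaled per category by the knowledge matrix,
# phi_c ~ Dir(beta * lam_c), pushing mass toward category-relevant words.
phi_sonto = np.vstack([rng.dirichlet(beta * lam[c] + 1e-12)
                       for c in range(num_categories)])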

  4. Building Prior Matrix (λ) How do we represent prior word-topic knowledge? - Start with a tf-idf matrix - Each "document" is the set of Wiki articles tagged with a given category - Add subcategories down to a specific level ℓ
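One possible construction of this matrix with scikit-learn's TfidfVectorizer is sketched below; category_articles and its contents are hypothetical stand-ins for the real per-category collections of Wikipedia article text.

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical input: each "document" is the concatenated text of the
# Wikipedia articles tagged with one category (subcategories included
# down to level l); the texts here are placeholders.
category_articles = {
    "Health": "text of all articles under Health ...",
    "Dentistry": "text of all articles under Dentistry ...",
}

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
lam = vectorizer.fit_transform(category_articles.values())  # categories x vocabulary

categories = list(category_articles.keys())
vocab = vectorizer.get_feature_names_out()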

  5. Inference using Gibbs Sampling
     ● Now that we have a generative LDA model and the λ priors, we need to reverse the process to infer categories from the observed documents.
     ● The denominator cannot be computed directly: it has C^n terms, where n is the number of words in the vocabulary.
     ● Collapsed Gibbs Sampling uses a Markov Chain Monte Carlo method to converge to a posterior distribution over categories c, conditioned on the observed words w and the hyperparameters α and β.
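A compact collapsed Gibbs sampler along these lines might look as follows. This is a sketch, assuming the prior enters the word-category term as β·λ; the function and variable names are illustrative rather than the presenters' code.

import numpy as np

def gibbs_sample(docs, lam, alpha=0.1, beta=0.01, iters=200, seed=0):
    # docs: list of documents, each a list of word ids; lam: (C, V) prior matrix
    rng = np.random.default_rng(seed)
    C, V = lam.shape
    n_dc = np.zeros((len(docs), C))              # category counts per document
    n_cw = np.zeros((C, V))                      # word counts per category
    n_c = np.zeros(C)                            # total word count per category
    z = [rng.integers(C, size=len(d)) for d in docs]   # random initial assignments

    for d, doc in enumerate(docs):               # accumulate initial counts
        for i, w in enumerate(doc):
            c = z[d][i]
            n_dc[d, c] += 1; n_cw[c, w] += 1; n_c[c] += 1

    beta_lam = beta * lam                        # asymmetric word prior per category
    beta_sum = beta_lam.sum(axis=1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                c = z[d][i]                      # remove the current assignment
                n_dc[d, c] -= 1; n_cw[c, w] -= 1; n_c[c] -= 1
                # full conditional: p(c) proportional to
                # (n_dc + alpha) * (n_cw + beta*lam) / (n_c + sum_w beta*lam)
                p = (n_dc[d] + alpha) * (n_cw[:, w] + beta_lam[:, w]) / (n_c + beta_sum)
                c = rng.choice(C, p=p / p.sum())
                z[d][i] = c                      # record and restore counts
                n_dc[d, c] += 1; n_cw[c, w] += 1; n_c[c] += 1

    theta = (n_dc + alpha) / (n_dc.sum(axis=1, keepdims=True) + C * alpha)
    phi = (n_cw + beta_lam) / (n_c[:, None] + beta_sum[:, None])
    return theta, phi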

  6. Inference using Gibbs Sampling
     ● Probability of a category given a document
     ● Probability of a word given a category
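In standard collapsed-Gibbs notation these two quantities are estimated from the assignment counts, for example as below; the βλ term reflects the modified prior, and the exact form on the original slide may differ.

\theta_{d,c} = \frac{n_{d,c} + \alpha}{\sum_{c'} n_{d,c'} + C\alpha},
\qquad
\phi_{c,w} = \frac{n_{c,w} + \beta\,\lambda_{c,w}}{\sum_{w'} n_{c,w'} + \beta \sum_{w'} \lambda_{c,w'}}

where n_{d,c} counts the words in document d assigned to category c, and n_{c,w} counts the assignments of word w to category c.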

  7. Health Tagging Example
     ● Structure of and relationships between Wikipedia categories as represented by SKOS properties
     ● Sub-categories and super-categories
     ● Consider super-categories in addition to exact matches
     ● Categories assigned to an article on "tooth brushing" and the related category hierarchy
     [Figure: category hierarchy around the article "Tooth Brushing", showing nodes such as Health, Health Care, Sciences, Self Care, Dentistry, Personal Hygiene, Hygiene Products, Dentistry Branches, Chiropractic Treatment, Oral Hygiene, and Techniques, with assigned category probabilities (e.g., 0.1533, 0.0478, 0.0403, 0.0302, 0.0227)]

  8. Experiments 1. How well does the model predict the categories of a collection of Wikipedia articles? 2. Assign Wikipedia tags to Reuters News articles and compare the top-k topics

  9. Preprocessing: Final Topic Graph
     ● 1,353 categories
     ● 30,300 articles
     ● Vocabulary size: 99,665

  10. Evaluation metric Precision@k and Mean Average Precision (MAP)
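Both metrics are straightforward to compute from ranked category lists. A small sketch, assuming each document's predictions are its categories sorted by θ_d and that `relevant` is the set of gold Wikipedia categories:

import numpy as np

def precision_at_k(predicted, relevant, k):
    # fraction of the top-k predicted categories that are in the gold set
    return sum(1 for c in predicted[:k] if c in relevant) / k

def average_precision(predicted, relevant):
    # mean of precision@i over the ranks i where a relevant category appears
    hits, score = 0, 0.0
    for i, c in enumerate(predicted, start=1):
        if c in relevant:
            hits += 1
            score += hits / i
    return score / max(len(relevant), 1)

def mean_average_precision(all_predicted, all_relevant):
    return np.mean([average_precision(p, r)
                    for p, r in zip(all_predicted, all_relevant)])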

  11. Tagging Wikipedia articles: results

  12. Real-world document set Evaluation on Reuters news (2,914 articles) Applied the "Hierarchical match" method used for the Wikipedia dataset Removed words not defined in the Prior Matrix (λ)
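The vocabulary-filtering step could be as simple as the following; `vocab` is assumed to be the vocabulary of the tf-idf prior matrix built earlier, and `tokenized_reuters_docs` is a hypothetical list of tokenized Reuters articles.

# Drop tokens that have no column in the prior matrix (lambda), so every
# remaining word has a defined prior for each category.
vocab_set = set(vocab)
filtered_docs = [[w for w in doc if w in vocab_set]
                 for doc in tokenized_reuters_docs]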

  13. Example of topic and word distribution

  14. Conclusions
     ● Prior knowledge from Wikipedia's hierarchical ontology can be successfully used for semantic tagging of documents
     ● Future work: - Expand to other topics - Explore richer topic models - Incorporate the hierarchical structure of categories

  15. Questions?
