Semantic Tagging Using Topic Models Exploiting Wikipedia Category Network - PowerPoint PPT Presentation

Nitesh Prakash, Duncan Rule, Boh Young Suh
Introduction
Goal: tag web articles with the most probable Wikipedia categories
- What is the article “about” in terms of categories?
- Helpful for information access and retrieval
Model Overview (sOntoLDA)
Modify LDA to suit the problem’s needs
- Pre-define topics as Wikipedia categories
- Use prior knowledge to improve the topic-word distribution
- Wikipedia articles are labeled with categories
- Represent the prior knowledge with a matrix λ

LDA: φ ~ Dir(β)    sOntoLDA: φ ~ Dir(β × λ)
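As a rough illustration of the modified prior (the symbols φ, β, λ are standard LDA notation reconstructed here, and the weights below are made up), scaling the Dirichlet hyperparameters element-wise by the prior matrix pushes a category’s word distribution toward the words the prior supports:

```python
import random

def sample_dirichlet(alphas, rng):
    # Draw from Dir(alphas) via the standard normalized-Gamma construction.
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(0)
beta = 0.5                      # symmetric hyperparameter, as in plain LDA
lam = [5.0, 5.0, 0.1, 0.1]      # hypothetical prior weights over a 4-word vocabulary

plain = sample_dirichlet([beta] * 4, rng)               # LDA:      φ ~ Dir(β)
biased = sample_dirichlet([beta * l for l in lam], rng)  # sOntoLDA: φ ~ Dir(β × λ)
```

Words with larger λ entries get larger pseudo-counts, so they tend to receive more of the probability mass in the sampled distribution.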
Building the Prior Matrix (λ)
How do we represent prior word-topic knowledge?
- Start with tf-idf matrix
- Each “document” is the set of Wiki articles tagged with a given category
- Add subcategories down to a specific level ℓ
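The construction above can be sketched as follows; the three-category toy corpus and the plain tf-idf weighting are illustrative assumptions, not the exact preprocessing used here:

```python
import math
from collections import Counter

# Hypothetical mini-corpus: each "document" concatenates the Wikipedia
# articles tagged with one category (subcategories down to level l would
# be merged into the same pseudo-document).
category_docs = {
    "Dentistry":    "tooth enamel dentist tooth cavity".split(),
    "Hygiene":      "soap wash brush tooth clean".split(),
    "Chiropractic": "spine adjustment joint manipulation".split(),
}

def tfidf_prior(category_docs):
    """Return prior[category][word] = tf-idf of word in that category's pseudo-document."""
    n_docs = len(category_docs)
    df = Counter()                      # document frequency of each word
    for words in category_docs.values():
        df.update(set(words))
    prior = {}
    for cat, words in category_docs.items():
        tf = Counter(words)
        prior[cat] = {w: (tf[w] / len(words)) * math.log(n_docs / df[w])
                      for w in tf}
    return prior

prior = tfidf_prior(category_docs)
```

Words unique to one category get a high idf and so a strong prior weight there; words shared by every category get zero weight under this simple scheme.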
Inference using Gibbs Sampling
- Given the generative sOntoLDA model and the priors, we need to reverse the process and infer the latent categories from the observed documents
- The posterior’s denominator cannot be computed directly: it contains C^n terms, where C is the number of categories and n is the number of words in the vocabulary
- Collapsed Gibbs sampling uses Markov chain Monte Carlo to converge to a posterior distribution over categories c, conditioned on the observed words w and the hyperparameters α and β
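A minimal collapsed Gibbs sampler for an LDA-style model whose topic-word hyperparameter is scaled by a prior matrix might look like the sketch below. The update rule follows standard collapsed LDA; the function name, toy data, and defaults are assumptions, not the authors’ implementation:

```python
import random

def collapsed_gibbs(docs, n_cats, vocab_size, lam, alpha=0.1, beta=0.01,
                    n_iter=200, seed=0):
    """Collapsed Gibbs sampling where the topic-word prior is scaled
    element-wise by lam[c][w] (a sketch of the sOntoLDA idea)."""
    rng = random.Random(seed)
    # Count tables: document-category, category-word, category totals.
    ndc = [[0] * n_cats for _ in docs]
    ncw = [[0] * vocab_size for _ in range(n_cats)]
    nc = [0] * n_cats
    # Random initial category assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            c = rng.randrange(n_cats)
            zs.append(c)
            ndc[d][c] += 1; ncw[c][w] += 1; nc[c] += 1
        z.append(zs)
    beta_sum = [sum(beta * lam[c][w] for w in range(vocab_size))
                for c in range(n_cats)]
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                c = z[d][i]               # remove the current assignment
                ndc[d][c] -= 1; ncw[c][w] -= 1; nc[c] -= 1
                # Full conditional: (n_dc + α) * (n_cw + β·λ) / (n_c + Σ β·λ)
                weights = [(ndc[d][k] + alpha) *
                           (ncw[k][w] + beta * lam[k][w]) / (nc[k] + beta_sum[k])
                           for k in range(n_cats)]
                c = rng.choices(range(n_cats), weights=weights)[0]
                z[d][i] = c
                ndc[d][c] += 1; ncw[c][w] += 1; nc[c] += 1
    return ndc, ncw

# Two toy "documents" over a 4-word vocabulary; lam biases category 0
# toward words 0-1 and category 1 toward words 2-3.
docs = [[0, 1, 0, 1], [2, 3, 2, 3]]
lam = [[5.0, 5.0, 0.1, 0.1], [0.1, 0.1, 5.0, 5.0]]
ndc, ncw = collapsed_gibbs(docs, n_cats=2, vocab_size=4, lam=lam)
```

Setting every λ entry to 1 recovers plain collapsed LDA, which is the sense in which the prior matrix is a drop-in modification.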
Inference using Gibbs Sampling
- Probability of a category given a document
- Probability of a word given a category
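Using the standard collapsed-LDA estimators (with the topic-word hyperparameter scaled by the prior matrix λ, as assumed above), these two distributions can be read off the Gibbs count tables $n_{d,c}$ and $n_{c,w}$:

```latex
\hat{\theta}_{d,c} = \frac{n_{d,c} + \alpha}{\sum_{c'} \left( n_{d,c'} + \alpha \right)}
\qquad
\hat{\varphi}_{c,w} = \frac{n_{c,w} + \beta\,\lambda_{c,w}}{\sum_{w'} \left( n_{c,w'} + \beta\,\lambda_{c,w'} \right)}
```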
Tagging Example
- Structure of, and relationships between, Wikipedia categories as represented by SKOS properties
- Subcategories and supercategories
- Consider super-categories in addition to exact matches
- Categories assigned to an article on “tooth brushing” and the related category hierarchy
Category                           Probability
Oral Hygiene                       0.1533
Dentistry                          0.0478
Self Care                          0.0403
Personal Hygiene Products          0.0302
Chiropractic Treatment Techniques  0.0227

(Surrounding hierarchy shown in the figure: Health, Health Care, Health Sciences, Dentistry Branches, Hygiene, Personal Hygiene, Tooth Brushing)
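The “super-categories in addition to exact match” idea can be sketched as a bounded walk up the category graph; the parent links below are a hypothetical slice, not the real Wikipedia data:

```python
# Hypothetical parent links (a tiny slice of a category graph).
parents = {
    "Oral Hygiene": ["Hygiene", "Dentistry"],
    "Dentistry": ["Health Sciences"],
    "Hygiene": ["Health"],
    "Health Sciences": ["Health"],
}

def hierarchical_match(predicted, gold, max_up=2):
    """True if a predicted category equals a gold category or one of its
    ancestors within max_up levels up the graph."""
    frontier = set(gold)
    for _ in range(max_up + 1):
        if frontier & set(predicted):
            return True
        frontier = {p for c in frontier for p in parents.get(c, [])}
    return False

print(hierarchical_match(["Dentistry"], ["Oral Hygiene"]))  # True: super-category match
```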
Experiments
1. How well does the model predict the categories of a collection of Wikipedia articles?
2. Assign Wikipedia tags to Reuters news articles and compare the top-k topics
Preprocessing
Final Topic Graph
- 1,353 categories
- 30,300 articles
- Vocabulary size 99,665
Evaluation metric
Precision@k and Mean Average Precision (MAP)
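Both metrics are standard; a minimal sketch (the ranking and gold set below are made-up examples, not results from the experiments):

```python
def precision_at_k(predicted, relevant, k):
    """Fraction of the top-k predicted categories that are relevant."""
    return sum(1 for c in predicted[:k] if c in relevant) / k

def mean_average_precision(rankings):
    """rankings: list of (predicted ranking, set of relevant categories)."""
    aps = []
    for predicted, relevant in rankings:
        hits, score = 0, 0.0
        for i, c in enumerate(predicted, start=1):
            if c in relevant:
                hits += 1
                score += hits / i        # precision at each relevant rank
        aps.append(score / len(relevant) if relevant else 0.0)
    return sum(aps) / len(aps)

ranked = ["Oral Hygiene", "Self Care", "Dentistry"]
gold = {"Oral Hygiene", "Dentistry"}
print(precision_at_k(ranked, gold, 2))              # 0.5
print(mean_average_precision([(ranked, gold)]))     # ≈ 0.833
```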
Tagging Wikipedia Articles: Results
Evaluation on Reuters news (2,914 articles)
- Applied the “hierarchical match” method used for the Wikipedia dataset
- Removed words not defined in the prior matrix (λ)
Real-world document set
Example of topic and word distribution
Conclusions
- Prior knowledge from Wikipedia’s hierarchical ontology can be successfully exploited for semantic tagging of documents

Future work
- Expand to other topics
- Explore richer topic models
- Incorporate hierarchical structure of categories