SLIDE 1

Semantic Tagging Using Topic Models Exploiting Wikipedia Category Network

Nitesh Prakash, Duncan Rule, Boh Young Suh

SLIDE 2

Introduction

Goal: tag web articles with the most probable Wikipedia categories

  • What is the article “about” in terms of categories?
  • Helpful for information access and retrieval

SLIDE 3

Model Overview (sOntoLDA)

Modify LDA to suit the problem’s needs

  • Pre-define topics as Wikipedia categories
  • Use prior knowledge to improve the topic-word distribution
  • Wikipedia articles are labeled with categories
  • Represent this prior knowledge as a matrix

LDA: φ ~ Dir(β)    sOntoLDA: φ ~ Dir(β × λ), with λ the prior matrix
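The prior modification above can be sketched numerically. The function name, the variable `lam` for the prior matrix, and the toy numbers are illustrative assumptions, not the paper's code:

```python
import numpy as np

def sontolda_prior(beta, lam):
    """Scale a symmetric Dirichlet pseudo-count beta by a C x V
    prior-knowledge matrix lam (rows = categories, columns = words).
    Words that the prior associates strongly with a category get a
    larger pseudo-count in that category's topic-word Dirichlet."""
    return beta * lam  # element-wise product

# Toy example (made-up weights): 2 categories, 3 vocabulary words
lam = np.array([[2.0, 0.5, 1.0],
                [0.1, 3.0, 1.0]])
prior = sontolda_prior(0.01, lam)
```

The result is an asymmetric Dirichlet prior: instead of one shared pseudo-count, each (category, word) cell carries its own prior mass.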

SLIDE 4

Building the Prior Matrix (λ)

How do we represent prior word-topic knowledge?

  • Start with tf-idf matrix
  • Each “document” is the set of Wiki articles tagged with a given category
  • Add subcategories down to a specific level ℓ
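The construction above can be sketched with a stdlib-only tf-idf computation; the data layout (`{category: merged token list}`) and function name are assumptions for illustration:

```python
import math
from collections import Counter

def tfidf_prior(category_docs):
    """category_docs maps each category to the concatenated tokens of all
    Wikipedia articles tagged with it (subcategories merged down to a
    chosen level).  Returns {category: {word: tf-idf weight}} -- one row
    of the prior matrix per category."""
    n = len(category_docs)
    df = Counter()                          # in how many categories each word occurs
    for tokens in category_docs.values():
        df.update(set(tokens))
    prior = {}
    for cat, tokens in category_docs.items():
        tf = Counter(tokens)
        prior[cat] = {w: (c / len(tokens)) * math.log(n / df[w])
                      for w, c in tf.items()}
    return prior

# Toy corpus (made-up): two categories acting as tf-idf "documents"
prior = tfidf_prior({
    "Dentistry": ["tooth", "brush", "tooth"],
    "Hygiene":   ["soap", "brush"],
})
```

Words that occur in every category get zero weight, so the prior only boosts words that discriminate between categories.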
SLIDE 5

Inference using Gibbs Sampling

  • Given the generative model and the priors, we reverse the process to infer the latent categories from the observed documents.
  • The denominator of the exact posterior cannot be computed: it contains Cⁿ terms, where n is the number of words in the vocabulary.
  • Collapsed Gibbs sampling uses Markov chain Monte Carlo to converge to a posterior distribution over categories c, conditioned on the observed words w and the hyperparameters α and β.
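The sampler described above can be sketched with the standard collapsed-Gibbs update for LDA, with topics fixed to categories and the asymmetric prior from the previous slides. All names and shapes here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def collapsed_gibbs(docs, C, V, alpha, beta_lam, iters=50, seed=0):
    """docs: list of documents, each a list of word ids in [0, V).
    beta_lam: C x V asymmetric prior (e.g. beta scaled by the tf-idf
    prior matrix).  Returns the document-category and category-word
    count matrices after sampling."""
    rng = np.random.default_rng(seed)
    n_dc = np.zeros((len(docs), C))     # category counts per document
    n_cw = np.zeros((C, V))             # word counts per category
    n_c = np.zeros(C)                   # total words per category
    z = []                              # current category assignments
    for d, doc in enumerate(docs):
        zd = rng.integers(0, C, size=len(doc))
        z.append(zd)
        for w, c in zip(doc, zd):
            n_dc[d, c] += 1
            n_cw[c, w] += 1
            n_c[c] += 1
    beta_sum = beta_lam.sum(axis=1)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                c = z[d][i]             # remove the current assignment
                n_dc[d, c] -= 1
                n_cw[c, w] -= 1
                n_c[c] -= 1
                # p(c | rest) = doc-category term * category-word term
                p = (n_dc[d] + alpha) * (n_cw[:, w] + beta_lam[:, w]) / (n_c + beta_sum)
                c = rng.choice(C, p=p / p.sum())
                z[d][i] = c
                n_dc[d, c] += 1
                n_cw[c, w] += 1
                n_c[c] += 1
    return n_dc, n_cw

# Toy run: 2 documents, 2 categories, vocabulary of 4 words
n_dc, n_cw = collapsed_gibbs([[0, 1, 2, 0], [2, 3]], C=2, V=4,
                             alpha=0.1, beta_lam=0.01 * np.ones((2, 4)),
                             iters=20)
```

Sampling each assignment conditioned on all the others avoids ever enumerating the Cⁿ terms of the exact posterior.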
SLIDE 6

Inference using Gibbs Sampling

  • Probability of a category given a document
  • Probability of a word given a category
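The two distributions on this slide are typically read off the final Gibbs counts as smoothed relative frequencies. A sketch under the same assumed notation (`n_dc` = per-document category counts, `n_cw` = per-category word counts):

```python
import numpy as np

def estimate_distributions(n_dc, n_cw, alpha, beta_lam):
    """theta[d, c] ~ p(category c | document d)
       phi[c, w]  ~ p(word w | category c)
    Smoothed relative frequencies of the Gibbs-sampler count matrices."""
    theta = n_dc + alpha
    theta = theta / theta.sum(axis=1, keepdims=True)
    phi = n_cw + beta_lam
    phi = phi / phi.sum(axis=1, keepdims=True)
    return theta, phi

# Toy counts (made-up): 2 documents, 2 categories, 3 words
theta, phi = estimate_distributions(
    np.array([[3., 1.], [0., 2.]]),
    np.array([[2., 1., 1.], [0., 1., 1.]]),
    alpha=0.1, beta_lam=0.01 * np.ones((2, 3)))
```

Sorting a document's row of `theta` gives its ranked category tags.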

SLIDE 7

Tagging Example

  • Structure of and relationships between Wikipedia categories as represented by SKOS properties
  • Subcategories and supercategories
  • Consider super-categories in addition to an exact match
  • Categories assigned to an article on “tooth brushing”, with the related category hierarchy

[Figure: category hierarchy around “Tooth Brushing” (Health → Health Care / Health Sciences; Dentistry; Hygiene), with assigned-category weights: Oral Hygiene 0.1533, Dentistry 0.0478, Self Care 0.0403, Personal Hygiene Products 0.0302, Chiropractic Treatment Techniques 0.0227]
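The “super-categories in addition to exact match” idea can be sketched as a walk up the category graph; the function name and `parents` layout are assumptions for illustration:

```python
def hierarchical_match(predicted, gold, parents):
    """True if a predicted category equals a gold category or is one of
    its super-categories.  'parents' maps a category to the set of its
    direct parent categories."""
    def ancestors(cat):
        seen, stack = set(), [cat]
        while stack:
            for p in parents.get(stack.pop(), ()):
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen
    return any(predicted == g or predicted in ancestors(g) for g in gold)

# Fragment of the hierarchy from this slide's example
parents = {"Tooth Brushing": {"Oral Hygiene"},
           "Oral Hygiene": {"Hygiene"},
           "Hygiene": {"Health"}}
ok = hierarchical_match("Hygiene", {"Tooth Brushing"}, parents)
```

This credits a tag like “Hygiene” for an article on tooth brushing even though the exact gold label sits lower in the hierarchy.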

SLIDE 8

Experiments

1. How well does the model predict the categories of a collection of Wikipedia articles?

2. Assign Wikipedia tags to Reuters news articles and compare the top-k topics
SLIDE 9

Preprocessing

Final Topic Graph

  • 1,353 categories
  • 30,300 articles
  • Vocabulary size 99,665
SLIDE 10

Evaluation metric

Precision@k and Mean Average Precision (MAP)
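Both metrics can be sketched in a few lines; the data layout (ranked prediction lists paired with gold category sets) is an assumption:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k predicted categories that are in the gold set."""
    return sum(1 for c in ranked[:k] if c in relevant) / k

def mean_average_precision(results):
    """results: list of (ranked predictions, gold category set) pairs.
    Average precision rewards placing correct categories near the top
    of the ranking; MAP averages it over all documents."""
    aps = []
    for ranked, relevant in results:
        hits, total = 0, 0.0
        for rank, c in enumerate(ranked, start=1):
            if c in relevant:
                hits += 1
                total += hits / rank
        aps.append(total / max(len(relevant), 1))
    return sum(aps) / len(aps)
```

Precision@k ignores ranking inside the top k, while MAP is sensitive to exactly where each correct category appears.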

SLIDE 11

Tagging Wikipedia Articles: Results

SLIDE 12

Evaluation on Reuters news (2,914 articles)

Applied the “hierarchical match” method used for the Wikipedia dataset
Removed words not defined in the prior matrix (λ)

Real-world document set

SLIDE 13

Example of topic and word distribution

SLIDE 14

Conclusions

  • Prior knowledge from Wikipedia’s hierarchical ontology can be used successfully for semantic tagging of documents

Future work

  • Expand to other topics
  • Explore richer topic models
  • Incorporate the hierarchical structure of categories
SLIDE 15

Questions?