A Phrase Mining Framework for Recursive Construction of a Topical Hierarchy - PowerPoint PPT Presentation
Akshat Pandey, Dipshil Agrawal, Shrirag Kodoor
March 27th, 2018
- 1. Introduction
1.1. Aim: Creation of a high-quality hierarchical organization of the concepts in a data set
1.2. Motivations:
◮ Summarization: User could familiarize themselves with a new
domain by browsing hierarchy
◮ Search: User could discover which phrases are representative
of their topic of interest
◮ Browsing: User could search for relevant work done by others,
possibly discovering subtopics to focus on
1.3. Data set:
◮ Content-representative documents: a concise description of an
accompanying full document, e.g., titles (in scientific papers)
◮ Probabilistic priors for which terms are most likely to generate
representative phrases
- 1. Introduction
1.4. Framework Features
◮ Phrase-centric approach: Determine topical frequency of
phrases instead of unigrams
◮ Ranking of topical phrases: Rank phrases based on 4 specified
criteria (Problem Formulation)
◮ Recursive clustering for hierarchy construction: Topic
inference is based on term co-occurrence network clustering, which can be performed recursively
- 2. Problem Formulation
2.1. Problem
◮ Traditional hierarchy formulation: Phrase is a consecutive
sequence of unigrams. Too sensitive to term order variation or morphological structure.
◮ Proposed solution: Define phrase as an order-free set of terms
appearing in the same document
2.2. Example
◮ Mining frequent patterns: { “mining frequent patterns”,
“frequent pattern mining”, “mining top-k frequent closed patterns” }
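The order-free definition can be sketched in a few lines (illustrative helper names, not from the paper): a phrase matches a document whenever all of its terms appear there, regardless of order.

```python
# Illustrative sketch (not from the paper): treat a phrase as an
# order-free set of terms, so word-order variants match the same
# documents. A real system would also stem terms ("pattern"/"patterns").

def term_set(text):
    """Canonicalize text as a frozenset of its lowercased terms."""
    return frozenset(text.lower().split())

def phrase_occurs(phrase, document):
    """True if every term of the phrase appears in the document."""
    return term_set(phrase) <= term_set(document)

# "mining frequent patterns" matches a title that reorders and extends it:
print(phrase_occurs("mining frequent patterns",
                    "mining top-k frequent closed patterns"))  # True
```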
- 2. Problem Formulation
2.3. Criteria
◮ Coverage: Phrase for a topic should cover many documents
within that topic
◮ Purity: Frequent in documents belonging to that topic, not in
documents within other topics
◮ Phraseness: A group of terms forms a phrase if they co-occur
significantly more often than their expected chance co-occurrence frequency, given that each term in the phrase occurs independently
◮ Completeness: A phrase is not complete if it is a subset of a
longer phrase
2.4. Topical Frequency
◮ Measures which represent these criteria can all be
characterized by topical frequency
◮ Topical frequency of a phrase is the count of the number of
times the phrase is attributed to a topic
- 3. CATHY Framework
3.1 Overview
◮ Construct the term co-occurrence network for the entire
document collection
◮ For a topic t, cluster the co-occurrence network G_t into
subtopic sub-networks and estimate the sub-topical frequency for its sub-topical phrases using a generative model
◮ For each topic, extract candidate phrases based on estimated
topical frequency
◮ For each topic, rank the topical phrases based on topical
frequency
◮ Recursively apply steps 2-5 to each subtopic to construct the
hierarchy in a top-down fashion
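The five steps above can be sketched as a recursive driver. The three helpers are trivial placeholders standing in for the real clustering, extraction, and ranking steps; all names are illustrative, not from the paper.

```python
# Recursive sketch of the CATHY pipeline above. Placeholders only:
# real CATHY clusters with a generative model and ranks by topical
# frequency, as described on the following slides.

def cluster_network(links, k):
    """Placeholder for step 2: split the link list into k sub-networks."""
    return [links[i::k] for i in range(k)]

def extract_phrases(links):
    """Placeholder for step 3: candidate phrases from linked term pairs."""
    return [frozenset(pair) for pair in links]

def rank_phrases(phrases):
    """Placeholder for step 4: real CATHY ranks by topical frequency."""
    return sorted(phrases, key=sorted)

def build_hierarchy(links, depth, max_depth, k):
    """Steps 2-5, applied recursively (step 5) to each subtopic."""
    node = {"phrases": rank_phrases(extract_phrases(links)), "children": []}
    if depth < max_depth:
        for sub in cluster_network(links, k):
            node["children"].append(build_hierarchy(sub, depth + 1, max_depth, k))
    return node

links = [("frequent", "patterns"), ("query", "database"),
         ("pattern", "mining"), ("sql", "query")]
tree = build_hierarchy(links, depth=0, max_depth=1, k=2)
print(len(tree["children"]))  # 2
```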
- 3. CATHY Framework
3.1 Clustering: Estimating Topical Frequency
◮ 3.1.1 A generative model for term co-occurrence network
analysis
◮ This approach uses a generative model of the term
co-occurrence network to estimate topical frequency
◮ The observed information is the total number of links between
every pair of nodes
◮ The parameters which must be learned are the role of each node in
each topic and the expected number of links in each topic
- 3. CATHY Framework
3.1.1 A generative model for term co-occurrence network analysis
◮ Θ^z_i, Θ^z_j are the multinomial probabilities p(w_i|z), p(w_j|z), with Σ_i Θ^z_i = 1
◮ ρ_z is the number of iterations in which links are generated
◮ e^z_ij ≈ Poisson(ρ_z Θ^z_i Θ^z_j), when ρ_z is large
◮ e_ij = Σ_{z=1}^k e^z_ij, where k is the number of topics and z is the topic
◮ E(Σ_{i,j} e^z_ij) = ρ_z Σ_{i,j} Θ^z_i Θ^z_j = ρ_z, by the expectation
property of the Poisson distribution, so ρ_z is the expected number of links in topic z
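The expectation property on this slide can be checked numerically. The sketch below uses assumed example values for Θ^z and ρ_z, samples e^z_ij ∼ Poisson(ρ_z Θ^z_i Θ^z_j) for every node pair, and verifies that the total link count averages ρ_z.

```python
import random

def poisson_sample(lam, rng):
    """Sample a Poisson variate (Knuth's multiplication method)."""
    threshold, k, p = 2.718281828459045 ** (-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

rng = random.Random(42)
theta = [0.5, 0.3, 0.2]   # assumed node weights Theta^z, summing to 1
rho = 100.0               # assumed expected number of links in topic z
trials = 2000
# Sum e^z_ij over all ordered node pairs (i, j) in each trial.
totals = [sum(poisson_sample(rho * ti * tj, rng)
              for ti in theta for tj in theta)
          for _ in range(trials)]
mean_links = sum(totals) / trials
print(mean_links)  # close to rho = 100
```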
- 3. CATHY Framework
3.1.1 A generative model for term co-occurrence network analysis
◮ p({e_ij} | Θ, ρ) = Π_{w_i,w_j ∈ W} (Σ_{z=1}^k ρ_z Θ^z_i Θ^z_j)^{e_ij} exp(−Σ_{z=1}^k ρ_z Θ^z_i Θ^z_j) / e_ij!
◮ E-step: ê^z_ij = e_ij · ρ_z Θ^z_i Θ^z_j / Σ_{t=1}^k ρ_t Θ^t_i Θ^t_j
◮ M-step:
◮ ρ_z = Σ_{i,j} ê^z_ij
◮ Θ^z_i = Σ_j ê^z_ij / ρ_z
◮ If ê^z_ij ≥ 1, then apply the same model recursively on the
sub-network
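A minimal sketch of this E-step/M-step loop, assuming a symmetric link-count dictionary as input (both (i, j) and (j, i) present); the names and initialization choices are illustrative, not the paper's implementation.

```python
import random

def em_link_decomposition(e, k, iters=100, seed=0):
    """Decompose observed link counts e[(i, j)] into k topics.

    Implements the E-step and M-step above; e should contain both
    orientations of each undirected link.
    """
    rng = random.Random(seed)
    nodes = sorted({u for pair in e for u in pair})
    # Random normalized initialization of Theta^z; uniform split of rho.
    theta = []
    for _ in range(k):
        raw = {u: rng.random() for u in nodes}
        s = sum(raw.values())
        theta.append({u: v / s for u, v in raw.items()})
    total = sum(e.values())
    rho = [total / k] * k
    for _ in range(iters):
        # E-step: e_hat^z_ij = e_ij * rho_z Th^z_i Th^z_j / sum_t rho_t Th^t_i Th^t_j
        e_hat = [{} for _ in range(k)]
        for (i, j), eij in e.items():
            w = [rho[z] * theta[z][i] * theta[z][j] for z in range(k)]
            norm = sum(w) or 1.0
            for z in range(k):
                e_hat[z][(i, j)] = eij * w[z] / norm
        # M-step: rho_z = sum_ij e_hat^z_ij; Theta^z_i = sum_j e_hat^z_ij / rho_z
        for z in range(k):
            rho[z] = sum(e_hat[z].values())
            for u in nodes:
                theta[z][u] = sum(c for (i, _), c in e_hat[z].items()
                                  if i == u) / (rho[z] or 1.0)
    return rho, theta

links = {("a", "b"): 50, ("b", "a"): 50,
         ("c", "d"): 40, ("d", "c"): 40,
         ("a", "c"): 2, ("c", "a"): 2}
rho, theta = em_link_decomposition(links, k=2)
```

Two invariants follow directly from the update formulas: the ρ_z always sum to the total observed link count, and each Θ^z remains a distribution over nodes.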
- 3. CATHY Framework
◮ 3.1.2 Topical frequency estimation
◮ The topical frequency estimation is based on two assumptions:
◮ When generating a topic-z phrase, each of the terms is
generated with the multinomial distribution Θ^z
◮ The total number of topic-z phrases of length n is proportional
to ρ_z
◮ f_z(P) = f_{par(z)}(P) · (ρ_z Π_{i=1}^n Θ^z_{x_i}) / (Σ_{t ∈ C_{par(z)}} ρ_t Π_{i=1}^n Θ^t_{x_i}),
where par(z) is the parent topic of z and C_{par(z)} is its set of child topics
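The formula can be read as splitting a phrase's parent-topic frequency among the child topics in proportion to ρ_t Π_i Θ^t_{x_i}. A minimal sketch with assumed toy parameters:

```python
# Sketch of the topical-frequency split above: f_z(P) divides the
# parent's frequency among child topics in proportion to
# rho_t * prod_i Theta^t_{x_i}. All parameter values are assumed.

def split_topical_frequency(f_parent, phrase_terms, children):
    """children maps child topic -> (rho_t, {term: Theta^t_term})."""
    weights = {}
    for t, (rho_t, theta_t) in children.items():
        w = rho_t
        for term in phrase_terms:
            w *= theta_t.get(term, 0.0)
        weights[t] = w
    norm = sum(weights.values())
    return {t: (f_parent * w / norm if norm else 0.0)
            for t, w in weights.items()}

children = {"ML": (100.0, {"learning": 0.30, "support": 0.10}),
            "DB": (80.0, {"learning": 0.01, "support": 0.02})}
freqs = split_topical_frequency(60.0, ["support", "learning"], children)
```

The child frequencies always sum back to the parent frequency, and the phrase "support learning" is attributed almost entirely to the child topic whose Θ^t favors its terms.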
- 3. CATHY Framework
3.2 Topical Phrase Extraction
◮ Approach defines an algorithm to mine frequent topical
patterns
◮ The goal is to extract patterns with topical frequency larger
than some threshold for every topic z
◮ Steps
◮ To get candidate phrases use a traditional pattern mining
algorithm
◮ Then filter them using the topical frequency estimation
◮ To remove incomplete phrases, use the notion of closed
patterns and maximal patterns
◮ For two phrases P, P′ in a topic such that P ⊂ P′, P is
filtered out if f_z(P′) ≥ γ f_z(P)
◮ 0 ≤ γ ≤ 1, where γ closer to 0 approximates maximal patterns
and γ closer to 1 approximates closed patterns
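A small sketch of this completeness filter (helper names are illustrative): a phrase is dropped when some strict superset retains at least a γ fraction of its topical frequency.

```python
# Sketch of the completeness filter above. With gamma near 0 only
# maximal patterns survive; near 1, closed patterns. Frequencies are
# assumed example values.

def filter_incomplete(phrase_freqs, gamma):
    """phrase_freqs maps frozenset(terms) -> topical frequency f_z."""
    return {p: f for p, f in phrase_freqs.items()
            if not any(p < q and fq >= gamma * f
                       for q, fq in phrase_freqs.items())}

freqs = {frozenset({"frequent", "patterns"}): 100.0,
         frozenset({"mining", "frequent", "patterns"}): 95.0,
         frozenset({"mining", "frequent", "patterns", "top-k"}): 10.0}
closed = filter_incomplete(freqs, gamma=0.9)    # drops only the 2-term phrase
maximal = filter_incomplete(freqs, gamma=0.05)  # keeps only the longest
```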
- 3. CATHY Framework
3.3 Ranking
◮ Comparability: the ranking function must be able to compare
phrases of different lengths
◮ Consider the occurrence probability of seeing a phrase P in a
random document with topic z
- 3. CATHY Framework
3.3 Ranking
◮ Can calculate the occurrence probability of a phrase P
conditioned on topic z as p(P|z) = f_z(P)/m_z, where m_z is the
number of documents where a topic-z phrase has been seen at least
once
◮ The independent occurrence probability, assuming each term of P
occurs independently: p_indep(P|z) = Π_{i=1}^n f_z(w_{x_i})/m_z
◮ The mixture contrastive probability is the probability of a
phrase P conditioned on a mixture of multiple sibling topics Z
◮ p(P|Z) = Σ_{t ∈ Z} f_t(P) / Σ_{t ∈ Z} m_t
◮ The three criteria are unified by the following ranking
function:
◮ r_z(P) = p(P|z) · (log [p(P|z) / p(P|Z)] + ω log [p(P|z) / p_indep(P|z)])
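The ranking function can be sketched directly; ω and the probability values below are assumed examples.

```python
import math

# Sketch of the unified ranking function r_z(P) above: the first log
# term rewards purity (contrast against sibling topics Z), the second,
# weighted by omega, rewards phraseness (contrast with independence).

def rank_score(p_z, p_mix, p_indep, omega=0.5):
    """r_z(P) = p(P|z)(log[p(P|z)/p(P|Z)] + omega*log[p(P|z)/p_indep])."""
    return p_z * (math.log(p_z / p_mix) + omega * math.log(p_z / p_indep))

# A phrase frequent in z, rare in sibling topics, and far above its
# independence baseline outranks an equally frequent but generic one:
distinctive = rank_score(0.20, 0.05, 0.01)
generic = rank_score(0.20, 0.18, 0.15)
```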
- 4. Related Work
Ontology Learning
◮ Topic hierarchies, concept hierarchies, ontologies provide a
hierarchical organization of data at different levels of granularity
◮ This framework is different from a subsumption hierarchy
◮ Approach uses statistics-based techniques, without resorting
to external knowledge resources
Topical key phrase extraction and ranking
◮ Key phrases are traditionally extracted as n-grams using
statistical modeling
◮ This approach relaxes the restriction that a phrase must be a
consecutive n-gram, and instead uses document co-location which is effective when considering the content-representative documents used
- 4. Related Work
Topic Modeling
◮ Traditional topic-modeling techniques (LDA) have a more
restrictive definition of phrases, and cannot find hierarchical topics
◮ These techniques are not used due to the sparseness of the
data set and because they cannot be used recursively
- 5. Experiments
5.1 Datasets
◮ DBLP - titles of CS papers related to Databases, IR, ML, and
NLP
◮ Library - University of Illinois Library catalogue in 6 categories:
Titles of books from Architecture, Literature, Mass Media, Motion Pictures, Music, and Theater
- 5. Experiments
5.2 Methods for Comparison
◮ SpecClus: Baseline - extracts all concepts from the text and
then hierarchically clusters them. Similarity between two phrases is their co-occurrence count in the data set
◮ hPAM: Second baseline - state-of-the-art topic modeling
approach - takes documents as input and outputs a specified number of supertopics and subtopics
◮ hPAMrr: Implement a method that re-ranks the unigrams in
each topic generated by hPAM
◮ CATHYcp: This version of CATHY only considers the
coverage and purity criteria
◮ CATHY: All criteria are used
- 5. Experiments
5.3 Topical Hierarchy of DBLP Paper Titles
◮ Assesses, via a human study, the ability of each method to
construct topical phrases that appear high quality to human judges
◮ Create hierarchies using all 5 methods
◮ Topic Intrusion tasks
◮ Judges are shown a topic t and T candidate child topics
◮ One of the child topics is not actually a child topic; the judge
must pick the intruder
◮ Tests quality of parent-child relationships
◮ Phrase Intrusion tasks
◮ Judges are shown T phrases, all but one drawn from the same topic
◮ One of the phrases does not belong with the others; the judge
must pick the intruder
◮ Evaluates how well hierarchy is able to separate phrases in
different topics
- 5. Experiments
5.4 Topical Hierarchy of Book Titles
◮ Examine how well a high quality topical phrase can predict its
category and vice-versa
◮ Construct a hierarchy and measure the coverage-conscious
mutual information (CCMI) at K of the labels with the top-level branches
5.5 On Defining Term Co-occurrence
◮ Traditional methods of key phrase extraction only consider
phrases to be sequences of terms which explicitly occur in the text
◮ This approach consistently defines term co-occurrence to
mean co-occurring in the same document
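Under this definition, building the term co-occurrence network from content-representative documents is straightforward; a minimal sketch:

```python
from itertools import combinations
from collections import Counter

# Sketch of document-level co-occurrence: every pair of distinct terms
# appearing in the same short document adds one link to the network.

def build_cooccurrence_network(documents):
    """Return a Counter mapping sorted term pairs to link counts."""
    links = Counter()
    for doc in documents:
        for pair in combinations(sorted(set(doc.lower().split())), 2):
            links[pair] += 1
    return links

titles = ["mining frequent patterns",
          "frequent pattern mining algorithms"]
net = build_cooccurrence_network(titles)
print(net[("frequent", "mining")])  # 2: the pair co-occurs in both titles
```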