A Phrase Mining Framework for Recursive Construction of a Topical Hierarchy
SLIDE 1

A Phrase Mining Framework for Recursive Construction of a Topical Hierarchy

Akshat Pandey Dipshil Agrawal Shrirag Kodoor March 27th, 2018.

SLIDE 2
  • 1. Introduction

1.1 Aim: Create a high-quality hierarchical organization of the concepts in a data set
1.2 Motivations:

◮ Summarization: users can familiarize themselves with a new domain by browsing the hierarchy
◮ Search: users can discover which phrases are representative of their topic of interest
◮ Browsing: users can search for relevant work done by others, possibly discovering subtopics to focus on

1.3 Data set:

◮ Content-representative documents: a concise description of an accompanying full document, e.g. titles of scientific papers
◮ These provide probabilistic priors for which terms are most likely to generate representative phrases

SLIDE 3
  • 1. Introduction

1.4 Framework Features

◮ Phrase-centric approach: determine the topical frequency of phrases instead of unigrams
◮ Ranking of topical phrases: rank phrases based on 4 specified criteria (see Problem Formulation)
◮ Recursive clustering for hierarchy construction: topic inference is based on term co-occurrence network clustering, which can be performed recursively

SLIDE 4
  • 2. Problem Formulation

2.1 Problem

◮ Traditional hierarchy formulation: a phrase is a consecutive sequence of unigrams. This is too sensitive to term order variation and morphological structure.
◮ Proposed solution: define a phrase as an order-free set of terms appearing in the same document

2.2 Example

◮ Mining frequent patterns: { “mining frequent patterns”, “frequent pattern mining”, “mining top-k frequent closed patterns” }

SLIDE 5
  • 2. Problem Formulation

2.3 Criteria

◮ Coverage: a phrase for a topic should cover many documents within that topic
◮ Purity: frequent in documents belonging to that topic, not in documents within other topics
◮ Phraseness: a set of terms is a phrase if they co-occur significantly more often than the chance co-occurrence frequency expected if each term in the phrase occurred independently
◮ Completeness: a phrase is not complete if it is a subset of a longer phrase

2.4 Topical Frequency

◮ Measures representing these criteria can all be characterized by topical frequency
◮ The topical frequency of a phrase is the count of the number of times the phrase is attributed to a topic

SLIDE 6
  • 3. CATHY Framework

3.1 Overview

◮ Construct the term co-occurrence network for the entire document collection
◮ For a topic t, cluster its co-occurrence network Gt into subtopic sub-networks and estimate the sub-topical frequency of its sub-topical phrases using a generative model
◮ For each topic, extract candidate phrases based on estimated topical frequency
◮ For each topic, rank the topical phrases based on topical frequency
◮ Recursively apply steps 2-4 to each subtopic to construct the hierarchy in a top-down fashion
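The top-down recursion above can be sketched in a few lines. This is only an illustrative skeleton: `cluster` is a trivial stand-in for the generative-model clustering of Section 3.1, and the phrase extraction/ranking steps are elided; all names are ours, not the paper's.

```python
def cluster(network, k):
    """Stub: split a co-occurrence network (here just a set of term ids)
    into k sub-networks. A real implementation would run the EM model
    described in Section 3.1."""
    items = sorted(network)
    return [(set(items[i::k]), i) for i in range(k)]

def build_hierarchy(network, topic, k, max_depth):
    """Recursively apply steps 2-4 to each subtopic, top-down (sketch).
    Phrase extraction and ranking per subtopic are omitted for brevity."""
    if max_depth == 0 or len(network) < k:
        return {"topic": topic, "children": []}
    return {"topic": topic,
            "children": [build_hierarchy(sub, t, k, max_depth - 1)
                         for sub, t in cluster(network, k)]}
```

Each recursive call sees only the sub-network of its parent topic, which is what makes the construction top-down.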

SLIDE 7
  • 3. CATHY Framework

3.1 Clustering: Estimating Topical Frequency

◮ 3.1.1 A generative model for term co-occurrence network analysis
◮ This approach uses a generative model of the term co-occurrence network to estimate topical frequency
◮ The observed information is the total number of links between every pair of nodes
◮ The parameters to be learned are the role of each node in each topic and the expected number of links in each topic

SLIDE 8
  • 3. CATHY Framework

3.1.1 A generative model for term co-occurrence network analysis

◮ Θ^z_i and Θ^z_j are the multinomial probabilities p(w_i|z) and p(w_j|z)
◮ ρ_z is the number of iterations in which links are generated
◮ e^z_{ij} ≈ Poisson(ρ_z Θ^z_i Θ^z_j) when ρ_z is large
◮ Σ_i Θ^z_i = 1
◮ e_{ij} = Σ_{z=1}^{k} e^z_{ij}, where k is the number of topics and z indexes the topic
◮ E(Σ_{i,j} e^z_{ij}) = ρ_z Σ_{i,j} Θ^z_i Θ^z_j = ρ_z, by the expectation property of the Poisson distribution; ρ_z is thus the expected number of links in topic z

SLIDE 9
  • 3. CATHY Framework

3.1.1 A generative model for term co-occurrence network analysis

◮ Likelihood:
  p({e_{ij}} | Θ, ρ) = Π_{w_i, w_j ∈ W} (Σ_{z=1}^{k} ρ_z Θ^z_i Θ^z_j)^{e_{ij}} exp(−Σ_{z=1}^{k} ρ_z Θ^z_i Θ^z_j) / e_{ij}!
◮ E-step: ê^z_{ij} = e_{ij} · ρ_z Θ^z_i Θ^z_j / Σ_{t=1}^{k} ρ_t Θ^t_i Θ^t_j
◮ M-step:
  ◮ ρ_z = Σ_{i,j} ê^z_{ij}
  ◮ Θ^z_i = Σ_j ê^z_{ij} / ρ_z
◮ If ê^z_{ij} ≥ 1, apply the same model recursively on the sub-network
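The E- and M-steps above can be sketched with NumPy. This is a toy implementation under the assumption that the link counts fit in a dense symmetric matrix; the function and variable names are ours.

```python
import numpy as np

def cathy_em(E, k, iters=100, seed=0):
    """EM for the per-topic Poisson link model (sketch).

    E: (n, n) symmetric matrix of observed co-occurrence counts e_ij.
    Returns rho (k,), the expected link count per topic, and
    Theta (k, n), the per-topic term distributions (rows sum to 1).
    """
    rng = np.random.default_rng(seed)
    n = E.shape[0]
    Theta = rng.random((k, n))
    Theta /= Theta.sum(axis=1, keepdims=True)   # enforce sum_i Theta^z_i = 1
    rho = np.full(k, E.sum() / k)

    for _ in range(iters):
        # rate[z, i, j] = rho_z * Theta^z_i * Theta^z_j
        rate = rho[:, None, None] * Theta[:, :, None] * Theta[:, None, :]
        # E-step: attribute each observed link to topics proportionally
        resp = rate / np.maximum(rate.sum(axis=0), 1e-12)
        e_hat = resp * E[None, :, :]            # ê^z_ij
        # M-step: re-estimate rho_z and Theta^z_i from ê^z_ij
        rho = e_hat.sum(axis=(1, 2))
        Theta = e_hat.sum(axis=2) / np.maximum(rho[:, None], 1e-12)
    return rho, Theta
```

Note that the M-step guarantees Σ_z ρ_z equals the total observed link count and each Θ^z row stays a proper distribution, mirroring the constraints on the slide.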

SLIDE 10
  • 3. CATHY Framework

◮ 3.1.2 Topical frequency estimation
◮ The topical frequency estimation is based on two assumptions:
  ◮ When generating a topic-z phrase, each of its terms is generated with the multinomial distribution Θ^z
  ◮ The total number of topic-z phrases of length n is proportional to ρ_z
◮ f_z(P) = f_{par(z)}(P) · ρ_z Π_{i=1}^{n} Θ^z_{x_i} / Σ_{t ∈ C_{par(z)}} ρ_t Π_{i=1}^{n} Θ^t_{x_i}, where par(z) is the parent of topic z and C_{par(z)} is its set of child topics
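This estimator just distributes a phrase's parent-topic frequency over the child topics in proportion to how likely each child is to generate the phrase. A minimal sketch (names are ours):

```python
import numpy as np

def topical_frequency(f_parent, rho, Theta, phrase):
    """Distribute f_par(z)(P) over the k child topics (sketch).

    f_parent: frequency of phrase P in the parent topic
    rho: (k,) expected link counts of the child topics
    Theta: (k, n) per-child-topic term distributions
    phrase: list of term indices x_1..x_n (order-free)
    Returns (k,) array of f_z(P), one entry per child topic z.
    """
    # weight_t = rho_t * prod_i Theta^t_{x_i}
    weights = rho * np.prod(Theta[:, phrase], axis=1)
    return f_parent * weights / weights.sum()
```

By construction the estimates over all child topics sum back to the parent frequency.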

SLIDE 11
  • 3. CATHY Framework

3.2 Topical Phrase Extraction

◮ The approach defines an algorithm to mine frequent topical patterns
◮ The goal is to extract patterns with topical frequency larger than some threshold for every topic z
◮ Steps:
  ◮ Get candidate phrases using a traditional pattern mining algorithm
  ◮ Filter them using the topical frequency estimation
  ◮ Remove incomplete phrases using the notion of closed patterns and maximal patterns
◮ For two phrases P, P′ in a topic such that P ⊂ P′, P is filtered out when f_z(P′) ≥ γ f_z(P)
◮ 0 ≤ γ ≤ 1, where γ closer to 0 yields maximal patterns and γ closer to 1 yields closed patterns
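The completeness filter can be illustrated with a short sketch, representing phrases as tuples of terms and using a dict as a stand-in for the estimated topical frequencies f_z; the function name is ours.

```python
def filter_incomplete(phrases, freq, gamma):
    """Drop a phrase P if some strict superset P' keeps at least
    gamma of P's topical frequency, i.e. f_z(P') >= gamma * f_z(P).
    gamma near 0 approximates maximal patterns, near 1 closed patterns."""
    kept = []
    for P in phrases:
        dominated = any(set(P) < set(Q) and freq[Q] >= gamma * freq[P]
                        for Q in phrases)
        if not dominated:
            kept.append(P)
    return kept
```

For example, with γ = 0.9 a phrase is removed whenever a longer phrase containing it retains at least 90% of its frequency, which matches the intuition that the shorter phrase is incomplete.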

SLIDE 12
  • 3. CATHY Framework

3.3 Ranking

◮ Comparability: the ranking function must be able to compare phrases of different lengths
◮ Consider the occurrence probability of seeing a phrase P in a random document with topic t

SLIDE 13
  • 3. CATHY Framework

3.3 Ranking

◮ The occurrence probability of a phrase P conditioned on topic z is p(P|z) = f_z(P) / m_z, where m_z is the number of documents in which some topic-z phrase is seen at least once
◮ The independent occurrence probability treats each term x_i of P = {x_1, …, x_n} as occurring independently: p_indep(P|z) = Π_{i=1}^{n} f_z(x_i) / m_z
◮ The mixture contrastive probability is the probability of phrase P conditioned on a mixture Z of multiple sibling topics: p(P|Z) = Σ_{t ∈ Z} f_t(P) / m_Z
◮ The three criteria are unified by the following ranking function:
  r_z(P) = p(P|z) · ( log [p(P|z) / p(P|Z)] + ω log [p(P|z) / p_indep(P|z)] )
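The ranking function combines a purity term (contrast against sibling topics) and a phraseness term (contrast against the independence baseline), both weighted by coverage p(P|z). A small sketch, assuming the frequencies and document counts are given directly (names are ours):

```python
import math

def rank_phrase(f_z_P, f_mix_P, term_freqs, m_z, m_Z, omega=0.5):
    """Ranking score r_z(P) (sketch).

    f_z_P: topical frequency f_z(P) of phrase P in topic z
    f_mix_P: sum of f_t(P) over topic z and its sibling topics
    term_freqs: f_z(x_i) for each individual term x_i of P
    m_z, m_Z: document counts for topic z and the sibling mixture Z
    omega: weight trading off phraseness against purity
    """
    p_z = f_z_P / m_z                                  # p(P|z)
    p_mix = f_mix_P / m_Z                              # p(P|Z)
    p_indep = math.prod(f / m_z for f in term_freqs)   # independence baseline
    return p_z * (math.log(p_z / p_mix) + omega * math.log(p_z / p_indep))
```

A phrase that is both pure (rare in sibling topics) and phrase-like (terms co-occur far above chance) scores high; a phrase common across siblings whose terms are individually frequent can score below zero.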

SLIDE 14
  • 4. Related Work

Ontology Learning

◮ Topic hierarchies, concept hierarchies, and ontologies provide a hierarchical organization of data at different levels of granularity
◮ This framework is different from a subsumption hierarchy
◮ The approach uses statistics-based techniques, without resorting to external knowledge resources

Topical key phrase extraction and ranking

◮ Key phrases are traditionally extracted as n-grams using statistical modeling
◮ This approach relaxes the restriction that a phrase must be a consecutive n-gram, and instead uses document co-location, which is effective for the content-representative documents used

SLIDE 15
  • 4. Related Work

Topic Modeling

◮ Traditional topic-modeling techniques (e.g. LDA) have a more restrictive definition of phrases, and cannot find hierarchical topics
◮ These techniques are not used due to the sparseness of the data set and because they cannot be applied recursively

SLIDE 16
  • 5. Experiments

5.1 Datasets

◮ DBLP: titles of CS papers related to Databases, IR, ML, and NLP
◮ Library: University of Illinois Library catalogue in 6 categories: titles of books from Architecture, Literature, Mass Media, Motion Pictures, Music, and Theater

SLIDE 17
  • 5. Experiments

5.2 Methods for Comparison

◮ SpecClus (baseline): extracts all concepts from the text and then hierarchically clusters them; similarity between two phrases is their co-occurrence count in the data set
◮ hPAM (second baseline): a state-of-the-art topic modeling approach that takes documents as input and outputs a specified number of supertopics and subtopics
◮ hPAMrr: a method that re-ranks the unigrams in each topic generated by hPAM
◮ CATHYcp: a version of CATHY that only considers the coverage and purity criteria
◮ CATHY: all criteria are used

SLIDE 18
  • 5. Experiments

5.3 Topical Hierarchy of DBLP Paper Titles

◮ Assesses, via a human study, the method's ability to construct topical phrases that human judges consider high quality
◮ Hierarchies are created using all 5 methods
◮ Topic Intrusion tasks:
  ◮ Judges are shown a topic t and T candidate child topics
  ◮ One of the child topics is not actually a child topic; the judge must pick the wrong one
  ◮ Tests the quality of parent-child relationships
◮ Phrase Intrusion tasks:
  ◮ Judges are shown T candidate phrases
  ◮ One of the phrases does not belong with the others; the judge must pick it out
  ◮ Evaluates how well the hierarchy separates phrases in different topics

SLIDE 19
  • 5. Experiments

5.3 Topical Hierarchy of DBLP Paper Titles

SLIDE 20
  • 5. Experiments

5.4 Topical Hierarchy of Book Titles

◮ Examines how well a high-quality topical phrase can predict its category and vice versa
◮ Construct a hierarchy and measure the coverage-conscious mutual information (CCMI) at K between the labels and the top-level branches

5.5 On Defining Term Co-occurrence

◮ Traditional methods of key phrase extraction only consider phrases to be sequences of terms that explicitly occur in the text
◮ This approach instead consistently defines term co-occurrence to mean co-occurring in the same document

SLIDE 21
  • 5. Experiments

5.4 Topical Hierarchy of Book Titles

SLIDE 22

Questions?