Representing Documents via Latent Keyphrase Inference
- April 15th, 2016
Document Representation in Vector Space
- Critical for document retrieval and categorization

Traditional Methods
- Bag-of-words or bag-of-phrases
- Cons: sparse on short texts
Topic Models
- Topic models [LDA]
- Cons: difficult for humans to infer topic semantics
- Each topic is a distribution over words; each document is a mixture of corpus-wide topics
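As a concrete illustration of the topic-model baseline, here is a minimal LDA sketch using scikit-learn; the toy corpus, topic count, and parameter choices are illustrative assumptions, not from the talk.

```python
# Minimal LDA sketch: each topic is a distribution over words,
# each document a mixture of corpus-wide topics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "neural network training deep learning",
    "deep learning neural network model",
    "database query index storage",
    "query optimization database index",
]

# Bag-of-words counts feed the topic model.
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # shape: (n_docs, n_topics)

# Each row is a document's mixture over the 2 topics (rows sum to 1),
# but the topics themselves carry no human-readable label -- the
# interpretability problem the slide points out.
print(doc_topics.shape)
```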
Concept-Based Models
- Concept-based models [ESA]
- Cons: low coverage of concepts in human-curated knowledge bases
- Every Wikipedia article represents a concept
- Article words are associated with the concept via TF-IDF weights, which help infer concepts from a document
- Example inferred concepts: Leopard [0.84], Roar [0.77]
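The concept-inference step above can be sketched as follows, assuming TF-IDF weighting over toy "concept articles"; the concepts, document text, and scoring are illustrative stand-ins, not ESA's exact formulation.

```python
# ESA-style sketch: each "Wikipedia article" stands for one concept;
# a document is scored against each concept's TF-IDF word vector.
from sklearn.feature_extraction.text import TfidfVectorizer

concepts = {
    "Leopard": "leopard big cat spotted predator africa roar",
    "Roar": "roar loud sound lion leopard vocalization",
    "Database": "database table query index sql storage",
}
names = list(concepts)

vec = TfidfVectorizer()
concept_matrix = vec.fit_transform([concepts[n] for n in names])

# Score each concept by how strongly the document's words load on it.
doc = "spotted leopard in africa let out a roar"
doc_vec = vec.transform([doc])
scores = (concept_matrix @ doc_vec.T).toarray().ravel()

ranked = sorted(zip(names, scores), key=lambda p: -p[1])
print(ranked[0][0])  # the most related concept
```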
Embedding Models
- Word/document embedding models [word2vec, paragraph2vec]
- Cons: difficult to explain what each dimension means
Proposed Method
- Use domain keyphrases as the entries in the vector
- Identify document keyphrases (a subset of domain keyphrases)
- Unsupervised model
- Corpus → Domain Keyphrases <K1, K2, …, KM>
- Where do domain keyphrases come from in a given corpus?
  - Mining Quality Phrases from Massive Text Corpora [SIGMOD 2015]
- How to identify document keyphrases?
  - Mentions can be latent (short text)
  - Relatedness scores
- Powered by Bayesian inference on the "domain keyphrase silhouette"
  - Domain keyphrase silhouette: a topic centered on a domain keyphrase
  - "Reverse" topic models
  - Learned from the corpus
Hierarchical Bayesian Network (DAG)
- Learning a hierarchical Bayesian network (DAG) over binary variables
- Task 1: model learning, i.e. learning link weights
- Task 2: structure learning, i.e. learning the network structure
Noisy-OR Parameterization
- Use Z to represent K (domain keyphrases)
- Noisy-OR links, with a noise/prior term aggregated over all other links connected to a node
- A parent node activates its children more easily when the link weight is larger
- A child node is influenced by all its parents
- Toy example
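The noisy-OR activation rule described above can be sketched as follows; the leak term, link weights, and parent states are illustrative assumptions for a toy example.

```python
# Noisy-OR sketch: each active parent independently tries to activate
# the child; a leak term stands in for noise / all other influences.
def noisy_or(parent_states, link_weights, leak=0.01):
    """P(child = 1 | parents) = 1 - (1 - leak) * prod over active
    parents of (1 - weight)."""
    fail = 1.0 - leak
    for active, w in zip(parent_states, link_weights):
        if active:
            fail *= (1.0 - w)
    return 1.0 - fail

# A child is influenced by all its parents, and a larger link weight
# makes activation easier.
p_weak = noisy_or([1, 0], [0.2, 0.9])    # only the weak parent is on
p_strong = noisy_or([1, 1], [0.2, 0.9])  # both parents are on
print(p_weak, p_strong)
```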
Model Learning via EM
- Training data: documents
- Expectation step: for each document, collect sufficient statistics
  - Link-firing probability (parent and child both activated)
  - Node-activation probability
- Maximization step: update link weights
- Content units are fully observed; document keyphrases are partially observed
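A minimal sketch of the M-step implied by these bullets, assuming the E-step has already aggregated expected link-firing and parent-activation counts over all documents; the counting scheme, clipping, and toy numbers are assumptions, not the paper's exact update.

```python
# M-step sketch: re-estimate each link weight as
#   E[#(parent and child both active)] / E[#(parent active)],
# clipped away from 0/1 for numerical stability.
def m_step(link_fire_counts, parent_active_counts):
    eps = 1e-6
    new_weights = {}
    for link, fired in link_fire_counts.items():
        parent = link[0]
        active = parent_active_counts[parent]
        w = fired / max(active, eps)
        new_weights[link] = min(max(w, eps), 1.0 - eps)
    return new_weights

# Toy expected counts, standing in for E-step sufficient statistics.
fire = {("ml", "classification"): 30.0, ("ml", "regression"): 12.0}
active = {"ml": 40.0}
new_w = m_step(fire, active)
print(new_w)
```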
- Domain keyphrases are connected to content units
  - Helps infer document keyphrases from content units
- Domain keyphrases are interconnected
  - Helps infer document keyphrases from other keyphrases
Structure Learning
- Data-driven DAG, similar to an ontology
- Heuristic: two nodes are connected only if they are
  - Closely related (measured with word2vec)
  - Co-occurring frequently
- Links always point to the less frequent node
- Works well in practice
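The connection heuristic can be sketched as below; the embeddings, frequencies, co-occurrence counts, and thresholds (`sim_thr`, `cooc_thr`) are all illustrative assumptions rather than the paper's settings.

```python
# Structure-learning heuristic sketch: connect two keyphrases only if
# their embeddings are close AND they co-occur often, and always point
# the link from the more frequent node to the less frequent one.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def build_edges(emb, freq, cooc, sim_thr=0.7, cooc_thr=5):
    edges = []
    for (a, b), count in cooc.items():
        if count >= cooc_thr and cosine(emb[a], emb[b]) >= sim_thr:
            # Parent is the more frequent node; link points to the rarer one.
            parent, child = (a, b) if freq[a] >= freq[b] else (b, a)
            edges.append((parent, child))
    return edges

emb = {"machine learning": (0.9, 0.1), "svm": (0.8, 0.2), "pizza": (0.0, 1.0)}
freq = {"machine learning": 1000, "svm": 200, "pizza": 500}
cooc = {("machine learning", "svm"): 40, ("machine learning", "pizza"): 8}
print(build_edges(emb, freq, cooc))
```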
Inference
- Exact inference is slow: computing posterior probabilities in noisy-OR networks is NP-hard
- Use approximate inference instead
  - Prune irrelevant nodes with an efficient scoring function
  - Gibbs sampling
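The Gibbs-sampling step can be sketched generically for binary variables as below; the unnormalized joint here is a toy stand-in for scoring keyphrase configurations, not the paper's silhouette model.

```python
# Gibbs sampling sketch over binary keyphrase variables: resample each
# node from its conditional given all others, obtained by comparing the
# unnormalized joint with the node switched on vs. off.
import random

def gibbs(nodes, joint, iters=2000, seed=0):
    rng = random.Random(seed)
    state = {n: 0 for n in nodes}
    counts = {n: 0 for n in nodes}
    for _ in range(iters):
        for n in nodes:
            p_on = joint({**state, n: 1})
            p_off = joint({**state, n: 0})
            state[n] = 1 if rng.random() < p_on / (p_on + p_off) else 0
        for n in nodes:
            counts[n] += state[n]
    # Posterior activation estimates = visit frequencies.
    return {n: counts[n] / iters for n in nodes}

# Toy unnormalized joint: "svm" strongly prefers on, "pizza" off.
def joint(s):
    return (0.9 if s["svm"] else 0.1) * (0.2 if s["pizza"] else 0.8)

post = gibbs(["svm", "pizza"], joint)
print(post)
```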
Experiments
- Two text-related tasks to evaluate document representation quality
  - Phrase relatedness
  - Document classification
- Two datasets
Baselines
- ESA (Explicit Semantic Analysis)
- KBLink: uses the link structure in Wikipedia
- BoW (bag-of-words)
- ESA-C: extends ESA by replacing Wikipedia with the domain corpus
- LSA (Latent Semantic Analysis)
- LDA (Latent Dirichlet Allocation)
- Word2Vec: a neural network computing word embeddings
- EKM: uses explicit keyphrase detection
[Figure: running time (ms) vs. number of samples (1,000–9,000) on Academia and Yelp]
[Figure: running time (ms) vs. number of quality phrases after pruning (10–500) on Academia and Yelp]
[Figure: running time (ms) vs. number of words (0–800) on Academia and Yelp]
Conclusions
- We introduced a novel document representation method using latent keyphrases
  - Each dimension is explainable
  - Works for short text
  - Works for closed-domain text
- We developed an efficient inference method for real-time keyphrase identification
- Future work
  - Better structure-learning approaches
  - Combination with knowledge bases
  - Inference methods other than Gibbs sampling
- Code available at http://jialu.info