Representing Documents via Latent Keyphrase Inference (April 15th, 2016) - PowerPoint PPT Presentation




SLIDE 1

Representing Documents via Latent Keyphrase Inference

  • April 15th, 2016
SLIDE 2


Document Representation in Vector Space

Critical for document retrieval, categorization

SLIDE 3

• Bag-of-Words or Phrases
• Cons: Sparse on short texts

Traditional Methods

SLIDE 4

• Topic models [LDA]
• Cons: Difficult for humans to infer topic semantics

Each topic is a distribution over words, each document is a mixture of corpus-wide topics

SLIDE 5

• Concept-based models [ESA]
• Cons: Low coverage of concepts in human-curated knowledge base

Every Wikipedia article represents a concept

Article words are associated with the concept (TF.IDF weights), which helps infer concepts from a document

Concept: Panthera, with associated words Cat [0.92], Leopard [0.84], Roar [0.77]

SLIDE 6

• Word/document embedding models [word2vec, paragraph2vec]
• Cons: Difficult to explain what each dimension means

SLIDE 7

• Use domain keyphrases as the entries in the vector
• Identify document keyphrases (a subset of domain keyphrases) by evaluating the relatedness between (doc, domain keyphrase) pairs
• Unsupervised model

Document Representation Using Keyphrases

Corpus Domain Keyphrases <K1, K2, …, KM>
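The representation on this slide, a fixed vector whose entries are the domain keyphrases <K1, K2, …, KM>, can be sketched in a few lines. The scores here are hypothetical placeholders for the relatedness the model infers; this only illustrates the vector layout itself.

```python
# Sketch: a document as a vector over the fixed domain-keyphrase vocabulary.
# `doc_scores` stands in for the (doc, domain keyphrase) relatedness scores
# the model would infer; the keyphrases below are illustrative.

def keyphrase_vector(doc_scores, domain_keyphrases):
    """Map a {keyphrase: score} dict onto the fixed keyphrase vocabulary."""
    return [doc_scores.get(k, 0.0) for k in domain_keyphrases]

domain_keyphrases = ["support vector machine", "topic model", "word embedding"]
doc_scores = {"topic model": 0.8, "word embedding": 0.3}
vec = keyphrase_vector(doc_scores, domain_keyphrases)
# vec == [0.0, 0.8, 0.3]
```

Absent keyphrases simply get a zero entry, so the vector stays sparse but every dimension remains interpretable as a named keyphrase.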

SLIDE 8

• Where to get domain keyphrases from a given corpus?
  • Mining Quality Phrases from Massive Text Corpora [SIGMOD15]
• How to identify document keyphrases?
  • Can be latent mentions (short text)
  • Relatedness scores

Challenges

SLIDE 9

• Powered by Bayesian inference on the “Domain Keyphrase Silhouette”
  • Domain Keyphrase Silhouette: a topic centered on a domain keyphrase
  • “Reverse” topic models
  • Learned from the corpus

How to identify document keyphrases?

SLIDE 10


Framework for Latent Keyphrase Inference (LAKI)

SLIDE 11


SLIDE 12

• Learning a Hierarchical Bayesian Network (DAG)
• Binary variables
• Domain Keyphrase Silhouette
• Task 1, Model Learning: learning link weights
• Task 2, Structure Learning: learning network structure

SLIDE 13


Noise / prior: aggregated over all other links connected with the node

• Use Z to represent K (domain keyphrases) and T (content units)
• Noisy-OR:
  • A parent node more easily activates its children when the link weight is larger
  • A child node is influenced by all its parents

Task 1: Model Learning given Structure

Toy example
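The noisy-OR relation on this slide can be sketched directly: each link weight is the probability that an active parent alone turns the child on, and a leak term stands in for the noise/prior link. This is a generic noisy-OR sketch with illustrative numbers, not the paper's implementation.

```python
def noisy_or_prob(parent_active, weights, leak=0.01):
    """P(child = 1) under a noisy-OR gate.

    parent_active: list of 0/1 parent states
    weights: per-link activation probabilities (a larger weight means an
             active parent turns the child on more easily)
    leak: noise/prior term covering causes outside the model
    """
    p_off = 1.0 - leak              # probability the child stays off
    for z, w in zip(parent_active, weights):
        if z:                       # each active parent independently
            p_off *= (1.0 - w)      # fails to fire with prob (1 - w)
    return 1.0 - p_off

# A child with two active parents is more likely on than with one:
one_parent = noisy_or_prob([1, 0], [0.6, 0.7])
two_parents = noisy_or_prob([1, 1], [0.6, 0.7])
# two_parents > one_parent
```

With the leak set to zero and a single active parent, the child's activation probability is exactly that link's weight, which matches the "larger link weight, easier activation" reading of the slide.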

SLIDE 14

• Training data: documents
• Expectation step: for each document, collect sufficient statistics
  • Link firing (parent and child both activated) probability
  • Node activation probability
• Maximization step: update link weights

Fully observed: content units. Partially observed: document keyphrases.

Maximum Likelihood Estimation
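The M-step update implied by the slide can be sketched for a single link. The per-document posteriors `p_fire` (parent and child both active and the link fired) and `p_parent` (parent active) are hypothetical stand-ins for the sufficient statistics the E-step would collect; the exact estimator in the paper may differ.

```python
# Minimal EM-style M-step sketch for one link (parent -> child), assuming
# the E-step already produced per-document posteriors:
#   p_fire[d]   = P(link fired, i.e. parent and child both active) in doc d
#   p_parent[d] = P(parent active) in doc d

def m_step_link_weight(p_fire, p_parent, eps=1e-12):
    """New link weight = expected link firings / expected parent activations."""
    return sum(p_fire) / (sum(p_parent) + eps)

w = m_step_link_weight([0.9, 0.2, 0.0], [1.0, 0.5, 0.1])
# w ≈ 1.1 / 1.6 ≈ 0.6875
```

This is the usual maximum-likelihood shape for noisy-OR parameters: the weight moves toward the fraction of parent activations that actually fired the link.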

SLIDE 15

• Domain keyphrases are connected to content units
  • Helps infer document keyphrases from content units
• Domain keyphrases are interconnected
  • Helps infer document keyphrases from other keyphrases

Task 2: Structure Learning

SLIDE 16

• Data-driven, DAG, similar to an ontology
• Heuristic: two nodes are connected only if they are
  • Closely related: word2vec similarity
  • Co-occurring frequently
• Links always point to less frequent nodes
• Works well in practice

A Heuristic Approach
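The edge heuristic above can be sketched as follows. The thresholds, the embedding and count inputs, and the helper names are all illustrative assumptions; only the three rules (embedding similarity, frequent co-occurrence, link toward the less frequent node) come from the slide.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def maybe_edge(a, b, emb, cooccur, freq, sim_thresh=0.5, co_thresh=5):
    """Return a directed edge (parent, child) or None.

    emb: node -> embedding vector; cooccur: {frozenset pair: count};
    freq: node -> corpus frequency. The more frequent node is the parent,
    which keeps links pointing at less frequent nodes.
    """
    if cosine(emb[a], emb[b]) < sim_thresh:
        return None                     # not closely related
    if cooccur.get(frozenset((a, b)), 0) < co_thresh:
        return None                     # do not co-occur often enough
    return (a, b) if freq[a] >= freq[b] else (b, a)

emb = {"machine learning": [1.0, 0.1], "svm": [0.9, 0.2]}
cooccur = {frozenset(("machine learning", "svm")): 12}
freq = {"machine learning": 1000, "svm": 200}
edge = maybe_edge("machine learning", "svm", emb, cooccur, freq)
# edge == ("machine learning", "svm")
```

Directing every link from the more frequent node to the less frequent one is one simple way to avoid cycles, consistent with the slide's DAG requirement.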

SLIDE 17


SLIDE 18

• Exact inference is slow: it is NP-hard to compute posterior probabilities for noisy-OR networks
• Approximate inference instead:
  • Pruning irrelevant nodes using an efficient scoring function
  • Gibbs sampling

Inference
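The pruning step mentioned on this slide can be sketched as a cheap pre-filter before sampling. The scoring function here (summed link weight to the document's observed content units) is an illustrative assumption, not necessarily the paper's exact one.

```python
# Sketch of the pruning idea: score each candidate keyphrase cheaply
# against the document's observed content units, keep only the top ones,
# and run the expensive sampler on that small subgraph.

def prune_candidates(content_units, link_weight, top_k=100):
    """Keep the top_k keyphrases by summed link weight to observed units.

    link_weight: {(keyphrase, content_unit): weight} for network edges.
    """
    scores = {}
    for (kp, unit), w in link_weight.items():
        if unit in content_units:
            scores[kp] = scores.get(kp, 0.0) + w
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]

weights = {("data mining", "clustering"): 0.7,
           ("data mining", "graph"): 0.4,
           ("databases", "sql"): 0.9}
kept = prune_candidates({"clustering", "graph"}, weights, top_k=1)
# kept == ["data mining"]
```

Keyphrases with no link to any observed content unit never enter the score table, so they are dropped before sampling ever sees them.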

SLIDE 19

• Two text-related tasks to evaluate document representation quality:
  • Phrase relatedness
  • Document classification
• Two datasets

Experiments

SLIDE 20


• ESA (Explicit Semantic Analysis)
• KBLink: uses link structure in Wikipedia
• BoW (bag-of-words)
• ESA-C: extends ESA by replacing Wikipedia with the domain corpus
• LSA (Latent Semantic Analysis)
• LDA (Latent Dirichlet Allocation)
• Word2Vec: a neural network computing word embeddings
• EKM: uses explicit keyphrase detection

Methods

SLIDE 21


Phrase Relatedness Correlation
Document Classification

SLIDE 22


Case Study

SLIDE 23


SLIDE 24


Time Complexity

[Charts: Running Time (ms) vs. #Samples, #Quality Phrases After Pruning, and #Words, each on the Academia and Yelp datasets]

SLIDE 25


Breakdown of Processing Time

SLIDE 26


• We have introduced a novel document representation method using latent keyphrases
  • Each dimension is explainable
  • Works for short text
  • Works for closed-domain text
• We have developed an efficient inference method for real-time keyphrase identification
• Future work:
  • Better structure learning approaches
  • Combining with knowledge bases
  • Trying inference methods other than Gibbs sampling
• Code available at http://jialu.info

Conclusion