

SLIDE 1

Mining Topics in Documents Standing on the Shoulders of Big Data Zhiyuan (Brett) Chen and Bing Liu

SLIDE 2

Topic Models

• Widely used in many applications
• Most of them are unsupervised

SLIDE 3

However, topic models:
• Require a large amount of docs
• Generate incoherent topics

SLIDE 4

Example Task: Application

Finding product features from reviews
• Most products do not even have 100 reviews
• LDA performs very poorly with 100 reviews

SLIDE 5

Can we improve modeling using Big Data?

SLIDE 6

Human Learning

A person sees a new situation and uses previous experience (Years of Experience)

SLIDE 7

Model Learning

A model sees a new domain and uses the data of many previous domains (Big Data)


SLIDE 8

Motivation

Learn as humans do: Lifelong Learning
• Retain the results learned in the past
• Use them to help learning in the future

SLIDE 9

Proposed Model Flow

• Retain the topics from previous domains
• Learn the knowledge from these topics
• Apply the knowledge to a new domain
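A minimal sketch of this loop in Python (the function names are illustrative placeholders, not the authors' API):

```python
# Lifelong topic-modeling flow: retain -> learn -> apply.
# `extract_knowledge` and `run_amc` are hypothetical callables standing
# in for the knowledge-mining and knowledge-guided inference stages.

def lifelong_topic_model(prior_domain_topics, new_docs, num_topics,
                         extract_knowledge, run_amc):
    # Retain: prior_domain_topics is the knowledge base of topics
    # already mined from past domains.
    # Learn: distill must-links / cannot-links from those topics.
    must_links, cannot_links = extract_knowledge(prior_domain_topics)
    # Apply: bias inference on the new (small) domain with that knowledge.
    return run_amc(new_docs, num_topics, must_links, cannot_links)
```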

SLIDE 10

What’s the knowledge representation?

SLIDE 11

How does a person gain knowledge? Should / Should not

SLIDE 12

Knowledge Representation

• Should => Must-Links, e.g., {battery, life}
• Should not => Cannot-Links, e.g., {battery, beautiful}
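A tiny illustration of one possible encoding (my choice, not the paper's data structure): each link is an unordered word pair.

```python
# frozenset makes {battery, life} order-free and hashable, so links
# can be stored in sets; purely an illustrative encoding.
must_links = {frozenset({"battery", "life"})}
cannot_links = {frozenset({"battery", "beautiful"})}

assert frozenset({"life", "battery"}) in must_links  # order-insensitive
```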

SLIDE 13

Knowledge Extraction

Motivation: a person learns a piece of knowledge when it occurs repeatedly. A piece of knowledge is reliable if it appears frequently.

SLIDE 14

Frequent Itemset Mining (FIM)

• Discover frequent word sets
• Multiple minimum supports frequent itemset mining (Liu et al., KDD 1999)
• Directly applied to extract Must-Links
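A simplified sketch of the idea, assuming a single minimum support (the cited MS-FIM algorithm of Liu et al. uses per-item minimum supports): each past topic's top words form a transaction, and word pairs recurring in enough transactions become must-links.

```python
from collections import Counter
from itertools import combinations

def mine_must_links(prior_topics, min_support=3):
    """Word pairs that recur across prior-domain topics become
    must-links (single-minimum-support simplification of MS-FIM)."""
    pair_counts = Counter()
    for topic_words in prior_topics:                 # one "transaction"
        for pair in combinations(sorted(set(topic_words)), 2):
            pair_counts[pair] += 1
    return {frozenset(p) for p, n in pair_counts.items() if n >= min_support}

topics = [["battery", "life", "hour"],
          ["battery", "life", "charge"],
          ["battery", "life", "power"],
          ["screen", "bright", "display"]]
print(mine_must_links(topics))  # {frozenset({'battery', 'life'})}
```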

SLIDE 15

Extracting Cannot-Links

• O(V^2) cannot-links in total
• Many words do not appear in a given domain
• Extract cannot-links only among the top topical words
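A sketch of that restriction (a simplified rule, not the paper's exact criterion): enumerate pairs only among the top topical words and keep those that never share a topic across the prior domains.

```python
from itertools import combinations

def mine_cannot_links(prior_topics, top_words, max_cooccur=0):
    """Cannot-link candidates among top topical words only, avoiding
    the full O(V^2) pair space (simplified extraction rule)."""
    links = set()
    for w1, w2 in combinations(sorted(top_words), 2):
        shared = sum(1 for t in prior_topics if w1 in t and w2 in t)
        if shared <= max_cooccur:            # (almost) never together
            links.add(frozenset({w1, w2}))
    return links

topics = [["battery", "life"], ["screen", "beautiful"]]
print(mine_cannot_links(topics, ["battery", "beautiful"]))
# {frozenset({'battery', 'beautiful'})}
```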

SLIDE 16

Related Work about Cannot-Links

Only two topic models were proposed to deal with cannot-type knowledge:
• DF-LDA (Andrzejewski et al., ICML 2009)
• MC-LDA (Chen et al., EMNLP 2013)

SLIDE 17

However, both of them assume the knowledge to be correct and manually provided.

SLIDE 18

Knowledge Verification

Motivation: a person’s experience may not be applicable to a particular situation. The knowledge needs to be verified against each particular domain.

SLIDE 19

Must-Link Graph

• Vertex: a must-link
• Edge: two must-links with overlapping original topics

Example: {Bank, Money} {Bank, Finance} {Bank, River}
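A sketch of the construction under one reading of the slide (here the edge test is simply whether two must-links share a word):

```python
from itertools import combinations

def must_link_graph(must_links):
    """Vertices are must-links; an edge joins two must-links that
    share a word (simplified edge test)."""
    links = list(must_links)
    edges = [(a, b) for a, b in combinations(links, 2) if a & b]
    return links, edges

links, edges = must_link_graph([frozenset({"bank", "money"}),
                                frozenset({"bank", "finance"}),
                                frozenset({"bank", "river"})])
print(len(edges))  # 3 -- all share "bank", but {bank, river} reflects
                   # a different sense; PMI (next slide) tells it apart
```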

SLIDE 20

Pointwise Mutual Information

• Estimate the correctness of a must-link
• A positive PMI value implies positive semantic correlation
• Will be used in the Gibbs sampling
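The standard definition is PMI(w1, w2) = log(P(w1, w2) / (P(w1) P(w2))), with the probabilities estimated as document frequencies in the target domain. A minimal sketch (the smoothing constant is an addition of mine to avoid log(0)):

```python
import math

def pmi(w1, w2, docs, eps=1e-12):
    """PMI from document co-occurrence; a positive value suggests the
    two words are semantically correlated in this domain, so the
    must-link is kept."""
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    return math.log((p12 + eps) / (p1 * p2 + eps))

docs = [{"battery", "life", "long"}, {"battery", "life"},
        {"screen", "beautiful"}, {"battery", "charge"}]
print(pmi("battery", "life", docs) > 0)  # True: must-link verified
```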

SLIDE 21

Cannot-Links Verification

• Most words do not co-occur with most other words
• Low co-occurrence does not imply negative semantic correlation

SLIDE 22

Proposed Gibbs Sampler

M-GPU (multi-generalized Pólya urn) model
• Must-links: increase the prob of both words of a must-link
• Cannot-links: decrease the prob of one of the words of a cannot-link

SLIDE 23

Example

See word speed under topic 0:
• Increase prob of seeing fast under topic 0, given must-link {speed, fast}
• Decrease prob of seeing beauty under topic 0, given cannot-link {speed, beauty}

SLIDE 24

M-GPU

Increase prob by putting must-link words into the sampled topic:
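A minimal count-level sketch of the must-link side, with an assumed promotion weight `mu` (the paper's weighting and urn bookkeeping are richer than this):

```python
def promote_must_links(word, topic, counts, must_graph, mu=0.3):
    """Generalized Polya urn step: assigning `word` to `topic` also
    adds a fractional count `mu` for each word must-linked with it,
    pulling linked words into the same topic.

    counts[topic]: dict word -> (possibly fractional) count
    must_graph:    dict word -> set of must-linked words
    """
    counts[topic][word] = counts[topic].get(word, 0.0) + 1.0
    for linked in must_graph.get(word, ()):
        counts[topic][linked] = counts[topic].get(linked, 0.0) + mu

counts = {0: {}}
promote_must_links("speed", 0, counts, {"speed": {"fast"}})
print(counts[0])  # {'speed': 1.0, 'fast': 0.3}
```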


SLIDE 27

M-GPU

Decrease prob by transferring cannot-link word into other topic with higher word prob:
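And a matching sketch of the cannot-link side (again count-level only; the actual sampler transfers word instances during Gibbs sampling):

```python
def transfer_cannot_links(word, topic, counts, cannot_graph):
    """After `word` lands in `topic`, push each cannot-linked word out
    of `topic` and into the other topic where that word already has
    the highest count, making the pair less likely to stay together."""
    others = [t for t in counts if t != topic]
    for linked in cannot_graph.get(word, ()):
        if not others or counts[topic].get(linked, 0.0) <= 0:
            continue
        dest = max(others, key=lambda t: counts[t].get(linked, 0.0))
        counts[topic][linked] -= 1.0
        counts[dest][linked] = counts[dest].get(linked, 0.0) + 1.0

counts = {0: {"beauty": 2.0}, 1: {"beauty": 5.0}}
transfer_cannot_links("speed", 0, counts, {"speed": {"beauty"}})
print(counts)  # beauty: 1.0 in topic 0, 6.0 in topic 1
```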


SLIDE 29

M-GPU

Note that we do not increase the number of topics as MC-LDA did. Rationale: cannot-links may not be correct, e.g., {battery, life}.

SLIDE 30

Evaluation

• 100 domains (50 Electronics, 50 Non-Electronics), 1,000 reviews each
• 100 reviews for each test domain
• Knowledge extracted from the 1,000 reviews of the other domains

SLIDE 31

Model Comparison

• AMC (AMC-M: must-links only)
• LTM (Chen et al., 2014)
• GK-LDA (Chen et al., 2013)
• DF-LDA (Andrzejewski et al., 2009)
• MC-LDA (Chen et al., 2013)
• LDA (Blei et al., 2003)

SLIDE 32

Topic Coherence

• Proposed by Mimno et al., EMNLP 2011
• Higher score means more coherent topics
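The metric, sketched below (assumes every top word occurs in at least one document, which holds when the words come from the model's own topics):

```python
import math

def topic_coherence(top_words, docs):
    """UMass coherence (Mimno et al., EMNLP 2011):
    C = sum over pairs l < m of log((D(v_m, v_l) + 1) / D(v_l)),
    where D(v) = #docs containing v, D(v, v') = #docs containing both.
    Higher (closer to 0) means more coherent."""
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = sum(top_words[l] in d for d in docs)
            d_ml = sum(top_words[m] in d and top_words[l] in d for d in docs)
            score += math.log((d_ml + 1) / d_l)
    return score

docs = [{"battery", "life"}, {"battery", "life", "charge"}, {"screen"}]
print(topic_coherence(["battery", "life", "charge"], docs))  # ~0.405
```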

SLIDE 33

Topic Coherence Results

SLIDE 34

Human Evaluation Results

Red: AMC; Blue: LTM; Green: LDA

SLIDE 35

Electronics vs. Non-Electronics

SLIDE 36

Conclusions

• Learn as humans do
• Use big data to help small data
• Knowledge extraction and verification
• M-GPU model

SLIDE 37

Future Work

• Knowledge engineering: how to store/maintain the knowledge
• Domain order, domain selection

SLIDE 38

Q&A