Topic Modeling using Topics from Many Domains, Lifelong Learning and Big Data
Zhiyuan Chen and Bing Liu
University of Illinois at Chicago
liub@cs.uic.edu
ICML-2014, Beijing, June 22-24, 2014 2
Introduction
Topic models, such as LDA (Blei et al., 2003),
pLSA (Hofmann, 1999) and their variants
Widely used to discover topics in documents
Unsupervised models are often insufficient because their objective functions may not correlate well with human judgments (Chang et al., 2009)
Knowledge-based topic models (KBTM) do better (Andrzejewski et al., 2009; Mukherjee & Liu, 2012, etc.)
But they are not automatic: they need user-given prior knowledge for each domain
How to Improve Further?
We can invent better topic models
But how about: learn like humans do?
What we learned in the past helps future learning
Whenever we see a new situation, we almost know it already; few aspects are really new
It shares a lot with what we have seen in the past (a systems approach)
Take a Major Step Forward
Knowledge-based modeling is still traditional
Knowledge is provided by the user and assumed correct
Not automatic (each domain needs new knowledge from the user)
Question: Can we mine prior knowledge
systematically and automatically?
Answer: Yes, with big data (many domains)
Implication: learn forever; past learning results help future learning (lifelong learning)
Why? (an example from opinion mining)
Topic overlap across domains: although every domain is different, there is a fair amount of topic overlapping across domains, e.g.,
Every product review domain has the topic price
Most electronic products share the topic battery
Some products share the topic screen
If we have good topics from a large number of past domain collections (big data),
then for a new collection we can use the existing topics
to generate high-quality prior knowledge automatically
An example
We have reviews from 3 domains, and each domain gives a topic about price:
Domain 1: {price, color, cost, life}
Domain 2: {cost, picture, price, expensive}
Domain 3: {price, money, customer, expensive}
Mining quality knowledge: require words to appear in at least two domains. We get:
{price, cost} and {price, expensive}
Each set of words is likely to belong to the same topic
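The mining step above can be sketched as a simple frequent itemset miner with minimum support 2 across domains. This is a minimal illustration, not the authors' implementation; the function name `mine_pk_sets` is ours:

```python
from itertools import combinations
from collections import Counter

# Top words of the "price" topic from three review domains (from the slide).
topics = [
    {"price", "color", "cost", "life"},           # Domain 1
    {"cost", "picture", "price", "expensive"},    # Domain 2
    {"price", "money", "customer", "expensive"},  # Domain 3
]

def mine_pk_sets(topics, min_support=2, size=2):
    """Return word pairs that co-occur in at least `min_support` topics."""
    counts = Counter()
    for topic in topics:
        for pair in combinations(sorted(topic), size):
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= min_support}

print(sorted(mine_pk_sets(topics)))
# → [('cost', 'price'), ('expensive', 'price')]
```

With support threshold 2, only {price, cost} and {price, expensive} survive, exactly the two pk-sets on the slide.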
Run a KBTM: an Example (cont.)
If we run a KBTM on the reviews of Domain 1, we may find a new topic about price:
{price, cost, expensive, color}
We get 3 coherent words in the top 4 positions, rather than only 2 words as in the old topic
Old: {price, color, cost, life}
A good improvement of the topic
Problem Statement
Given a large set of document collections D = {D1, …, Dn}, learn from D to produce results S.
Goal: given a new (test) collection Dt, learn from Dt with the help of S (and possibly D); Dt ∈ D or Dt ∉ D.
The results learned this way should be better than those learned without the guidance of S (and D).
LTM – Lifelong learning Topic Model
Cold start (initialization)
Run LDA on each Di ∈ D => topics Si
S = ∪ Si
Given a new domain collection Dt
Run LDA on Dt => topics At
Find matching topics Mj from S for each topic aj ∈ At
Mine knowledge kj from each Mj
Kt = ∪ kj
Run a KBTM on Dt with the help of Kt => new At
KBTM uses Kt and also deals with wrong knowledge in Kt
Update S with At
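The loop above can be sketched in Python. Since the slide does not specify the modeling components, they are passed in as callables; all names here are our own, not the authors' code:

```python
def ltm(domains, new_domain, run_lda, run_kbtm, match_topics, mine_knowledge):
    """Sketch of the LTM lifelong-learning loop from the slide.

    run_lda, run_kbtm, match_topics, mine_knowledge are supplied by the
    caller; this function only wires them together.
    """
    # Cold start: run LDA on every prior domain and pool the topics into S.
    topic_base = []
    for d in domains:
        topic_base.extend(run_lda(d))

    # New task: initial topics, then knowledge-guided re-modeling.
    initial_topics = run_lda(new_domain)
    knowledge = set()
    for topic in initial_topics:
        matched = match_topics(topic, topic_base)   # M_j: similar p-topics
        knowledge |= mine_knowledge(matched)        # k_j: mined pk-sets

    final_topics = run_kbtm(new_domain, knowledge)  # KBTM guided by K^t
    topic_base.extend(final_topics)                 # update S for the future
    return final_topics, topic_base
```

Each pass through a new domain grows the topic base S, which is what makes later tasks easier than earlier ones.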
Prior Topic Generation (cold start)
Run LDA on each Di ∈ D to produce a set of topics Ti, called prior topics (or p-topics).
LTM Topic Model
(1) Mine prior knowledge (pk-sets); (2) use the prior knowledge to guide modeling.
Knowledge Mining Function
Topic match: find similar topics Mj from the p-topics for each current topic
Pattern mining: find frequent itemsets from the matched topics Mj
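A minimal sketch of the topic-match step, assuming similarity is measured by overlap among top words (the paper uses its own matching criterion; the threshold, `top_n`, and function name are our assumptions):

```python
def match_p_topics(current_topic, p_topics, min_overlap=2, top_n=10):
    """Return prior topics whose top words overlap the current topic's
    top words in at least `min_overlap` positions.

    Topics are ordered lists of top words (most probable first).
    """
    cur = set(current_topic[:top_n])
    return [p for p in p_topics if len(cur & set(p[:top_n])) >= min_overlap]

# Current topic from the new domain vs. two p-topics from the topic base.
current = ["price", "cost", "color", "life"]
p_topics = [
    ["cost", "picture", "price", "expensive"],  # shares price, cost -> match
    ["battery", "charge", "power", "hours"],    # no overlap -> no match
]
matched = match_p_topics(current, p_topics)
```

The matched p-topics are then fed to the frequent itemset mining shown on the earlier example slide to produce pk-sets.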
Model inference: Gibbs sampling
How to use prior knowledge (pk-sets)?
e.g., {price, cost} & {price, expensive}
How to tackle wrong knowledge?
Graphical model: same as LDA, but the model inference is different
Generalized Pólya Urn Model (GPU) (Mimno et al., 2011)
Idea: when assigning a topic t to a word w, also assign a fraction of t to the words that share a prior knowledge set (pk-set) with w.
Dealing with Wrong Knowledge
Some pieces of automatically generated
knowledge (pk-sets) may be wrong.
Deal with them in sampling (decide the fraction) and ensure that words in a pk-set {x, x′} are truly associated:

$$\mathbb{B}_{x',x} = \begin{cases} 1 & x = x' \\ \mu \times \mathrm{PMI}(x, x') & \{x, x'\} \text{ is a pk-set} \\ 0 & \text{otherwise} \end{cases}$$

$$\mathrm{PMI}(x, x') = \log \frac{P(x, x')}{P(x)\,P(x')}$$
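A small illustration of the promotion weight, with PMI estimated from document co-occurrence. The value μ = 0.3 is a placeholder, not the paper's setting, and the function names are ours:

```python
import math

def pmi(docs, x, y):
    """Pointwise mutual information of words x, y from document co-occurrence.

    docs is a list of sets of words; probabilities are document frequencies.
    """
    n = len(docs)
    px = sum(x in d for d in docs) / n
    py = sum(y in d for d in docs) / n
    pxy = sum(x in d and y in d for d in docs) / n
    return math.log(pxy / (px * py)) if pxy > 0 else float("-inf")

def promotion(x_prime, x, pk_sets, docs, mu=0.3):
    """B[x', x] from the slide: 1 on the diagonal,
    mu * PMI(x, x') for pk-set pairs, 0 otherwise."""
    if x_prime == x:
        return 1.0
    if frozenset((x_prime, x)) in pk_sets:
        return mu * pmi(docs, x_prime, x)
    return 0.0
```

PMI here acts as a confidence weight: a pk-set pair that rarely co-occurs in the new domain's documents gets little or no promotion, which is how wrong knowledge is damped.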
Gibbs Sampler
$$P(z_j = u \mid \mathbf{z}^{-j}, \mathbf{x}, \beta, \gamma) \propto \frac{n^{-j}_{d,u} + \beta}{\sum_{u'=1}^{U}\left(n^{-j}_{d,u'} + \beta\right)} \times \frac{\sum_{x'=1}^{W} \mathbb{B}_{x',x_j}\, n^{-j}_{u,x'} + \gamma}{\sum_{w=1}^{W}\left(\sum_{x'=1}^{W} \mathbb{B}_{x',w}\, n^{-j}_{u,x'} + \gamma\right)}$$
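The modified sampler can be illustrated as follows. With an identity promotion matrix it reduces to the standard LDA conditional; the variable names are ours, and the counts are assumed to already exclude the word being sampled:

```python
def topic_posterior(word, doc_topic, topic_word, B, vocab, beta, gamma):
    """Unnormalized P(z = u | ...) for one word under the GPU-modified sampler.

    doc_topic[u]   : topic counts in the current document (word excluded)
    topic_word[u]  : dict of topic-word counts (word excluded)
    B[(x', x)]     : promotion matrix entries (missing entries mean 0)
    """
    U = len(doc_topic)
    total_d = sum(doc_topic)
    probs = []
    for u in range(U):
        # Document-topic part: how much the document already likes topic u.
        left = (doc_topic[u] + beta) / (total_d + U * beta)
        # Topic-word part: promoted counts for this word under topic u.
        num = sum(B.get((xp, word), 0.0) * topic_word[u].get(xp, 0)
                  for xp in vocab) + gamma
        den = sum(
            sum(B.get((xp, w), 0.0) * topic_word[u].get(xp, 0)
                for xp in vocab) + gamma
            for w in vocab
        )
        probs.append(left * num / den)
    return probs

# Toy example: 2 topics, 2-word vocabulary, identity promotion matrix.
vocab = ["price", "cost"]
B = {("price", "price"): 1.0, ("cost", "cost"): 1.0}
topic_word = [{"price": 5, "cost": 5}, {"price": 0, "cost": 0}]
probs = topic_posterior("price", [3, 1], topic_word, B, vocab, 0.1, 0.1)
```

Adding off-diagonal entries such as `B[("cost", "price")]` is what lets a pk-set partner pull probability mass toward the shared topic.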
Evaluation
We used review collections D from 50 domains
Each domain has 1,000 reviews; four domains with 10,000 reviews each are used for the large-data test
Test settings: two settings to evaluate LTM, representing two possible uses of LTM
Seen the test domain before, i.e., Dt ∈ D
Not seen the test domain before, i.e., Dt ∉ D
Topic Coherence (Mimno et al., EMNLP-2011)
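The metric can be computed directly from document frequencies, as defined by Mimno et al. (2011). This sketch assumes topics are given as ordered lists of top words; the function name is ours:

```python
import math

def umass_coherence(top_words, docs):
    """Topic coherence of Mimno et al. (EMNLP-2011):
    sum over ordered word pairs (m > l) of
    log((D(v_m, v_l) + 1) / D(v_l)),
    where D is document frequency over the corpus."""
    docs = [set(d) for d in docs]
    def df(*words):
        return sum(all(w in d for w in words) for d in docs)
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += math.log((df(top_words[m], top_words[l]) + 1) / df(top_words[l]))
    return score
```

Higher (less negative) values mean the topic's top words tend to co-occur in the same documents, which correlates with human judgments of topic quality.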
Topic Coherence on 4 Large Datasets
Can LTM improve with larger data?
Split a large dataset into 10 smaller ones
Here we use only one domain's data
Better topic coherence and better efficiency (30%)
Summary
Proposed a lifelong learning topic model, LTM
It keeps a large topic base S
For each new topic modeling task:
Run LDA to generate a set of initial topics
Find matching old topics from S
Mine quality knowledge from the old topics
Use the knowledge to help generate better topics
With big data (from diverse domains): we can do what we cannot do or have not done before