Topic Modeling using Topics from Many Domains, Lifelong Learning and Big Data
Zhiyuan Chen and Bing Liu
University of Illinois at Chicago
liub@cs.uic.edu
ICML-2014, Beijing, June 22-24, 2014 2
Introduction
Topic models, such as LDA (Blei et al., 2003),
pLSA (Hofmann, 1999) and their variants
Widely used to discover topics in documents
Unsupervised models are often insufficient because their objective functions may not correlate well with human judgments (Chang et al., 2009)
Knowledge-based topic models (KBTM) do better (Andrzejewski et al., 2009; Mukherjee & Liu, 2012, etc.)
But they are not automatic: they need user-given prior knowledge for each domain
How to Improve Further?
We can invent better topic models
But how about: learn like humans do?
What we learned in the past helps future learning
Whenever we see a new situation, we almost know it already; few aspects are really new
It shares a lot with what we have seen in the past (a systems approach)
Take a Major Step Forward
Knowledge-based modeling is still traditional
Knowledge is provided by the user and assumed correct
Not automatic (each domain needs new knowledge from the user)
Question: Can we mine prior knowledge
systematically and automatically?
Answer: Yes, with big data (many domains)
Implication: learn forever; past learning results help future learning (lifelong learning)
Why? (an example from opinion mining)
Topic overlap across domains: although every domain is different, there is a fair amount of topic overlapping across domains, e.g.,
Every product review domain has the topic price
Most electronic products share the topic battery
Some products share the topic screen
If we have good topics from a large number of past domain collections (big data),
then for a new collection we can use the existing topics
to generate high-quality prior knowledge automatically
An example
We have reviews from 3 domains, and each domain gives a topic about price:
Domain 1: {price, color, cost, life}
Domain 2: {cost, picture, price, expensive}
Domain 3: {price, money, customer, expensive}
Mining quality knowledge: require words to appear in at least two domains. We get:
{price, cost} and {price, expensive}
Each set of words is likely to belong to the same topic
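The mining step above can be sketched as a simple frequent itemset miner with minimum support 2 across domains. This is a minimal illustration, not the authors' implementation; the function name `mine_pk_sets` is ours:

```python
from itertools import combinations
from collections import Counter

# Top words of the "price" topic from three review domains (from the slide).
topics = [
    {"price", "color", "cost", "life"},           # Domain 1
    {"cost", "picture", "price", "expensive"},    # Domain 2
    {"price", "money", "customer", "expensive"},  # Domain 3
]

def mine_pk_sets(topics, min_support=2, size=2):
    """Return word pairs that co-occur in at least `min_support` topics."""
    counts = Counter()
    for topic in topics:
        for pair in combinations(sorted(topic), size):
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= min_support}

print(sorted(mine_pk_sets(topics)))
# → [('cost', 'price'), ('expensive', 'price')]
```

With support threshold 2, only {price, cost} and {price, expensive} survive, exactly the two pk-sets on the slide.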
Run a KBTM: an Example (cont.)
If we run a KBTM on the reviews of Domain 1, we may find a new topic about price:
{price, cost, expensive, color}
We get 3 coherent words in the top 4 positions, rather than only 2 words as in the old topic
Old: {price, color, cost, life}
A good improvement of the topic
Problem Statement
Given a large set of document collections D = {D1, …, Dn}, learn from D to produce results S.
Goal: given a new (test) collection Dt, learn from Dt with the help of S (and possibly D); Dt ∈ D or Dt ∉ D.
The results learned this way should be better than those learned without the guidance of S (and D).
LTM – Lifelong learning Topic Model
Cold start (initialization)
Run LDA on each Di ∈ D => topics Si
S = ∪ Si
Given a new domain collection Dt
Run LDA on Dt => topics At
Find matching topics Mj from S for each topic aj ∈ At
Mine knowledge kj from each Mj
Kt = ∪ kj
Run a KBTM on Dt with the help of Kt => new At
KBTM uses Kt and also deals with wrong knowledge in Kt
Update S with At
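The loop above can be sketched in Python. Since the slide does not specify the modeling components, they are passed in as callables; all names here are our own, not the authors' code:

```python
def ltm(domains, new_domain, run_lda, run_kbtm, match_topics, mine_knowledge):
    """Sketch of the LTM lifelong-learning loop from the slide.

    run_lda, run_kbtm, match_topics, mine_knowledge are supplied by the
    caller; this function only wires them together.
    """
    # Cold start: run LDA on every prior domain and pool the topics into S.
    topic_base = []
    for d in domains:
        topic_base.extend(run_lda(d))

    # New task: initial topics, then knowledge-guided re-modeling.
    initial_topics = run_lda(new_domain)
    knowledge = set()
    for topic in initial_topics:
        matched = match_topics(topic, topic_base)   # M_j: similar p-topics
        knowledge |= mine_knowledge(matched)        # k_j: mined pk-sets

    final_topics = run_kbtm(new_domain, knowledge)  # KBTM guided by K^t
    topic_base.extend(final_topics)                 # update S for the future
    return final_topics, topic_base
```

Each pass through a new domain grows the topic base S, which is what makes later tasks easier than earlier ones.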
Prior Topic Generation (cold start)
Run LDA on each Di ∈ D to produce a set of topics Ti, called prior topics (or p-topics).
LTM Topic Model
(1) Mine prior knowledge (pk-sets); (2) use the prior knowledge to guide modeling.
Knowledge Mining Function
Topic match: find similar topics Mj from the p-topics for each current topic
Pattern mining: find frequent itemsets from the matched topics Mj
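A minimal sketch of the topic-match step, assuming similarity is measured by overlap among top words (the paper uses its own matching criterion; the threshold, `top_n`, and function name are our assumptions):

```python
def match_p_topics(current_topic, p_topics, min_overlap=2, top_n=10):
    """Return prior topics whose top words overlap the current topic's
    top words in at least `min_overlap` positions.

    Topics are ordered lists of top words (most probable first).
    """
    cur = set(current_topic[:top_n])
    return [p for p in p_topics if len(cur & set(p[:top_n])) >= min_overlap]

# Current topic from the new domain vs. two p-topics from the topic base.
current = ["price", "cost", "color", "life"]
p_topics = [
    ["cost", "picture", "price", "expensive"],  # shares price, cost -> match
    ["battery", "charge", "power", "hours"],    # no overlap -> no match
]
matched = match_p_topics(current, p_topics)
```

The matched p-topics are then fed to the frequent itemset mining shown on the earlier example slide to produce pk-sets.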
Model inference: Gibbs sampling
How to use prior knowledge (pk-sets)?
e.g., {price, cost} & {price, expensive}
How to tackle wrong knowledge?
Graphical model: same as LDA, but the model inference is different
Generalized Pólya Urn Model (GPU) (Mimno et al., 2011)
Idea: when assigning a topic t to a word w, also assign a fraction of t to the words that share a prior knowledge set (pk-set) with w.
Dealing with Wrong Knowledge
Some pieces of automatically generated
knowledge (pk-sets) may be wrong.
Deal with them in sampling (decide the fraction) and ensure that words in a pk-set {x, x′} are truly associated:

$$\mathbb{B}_{x',x} = \begin{cases} 1 & x = x' \\ \mu \times \mathrm{PMI}(x, x') & \{x, x'\} \text{ is a pk-set} \\ 0 & \text{otherwise} \end{cases}$$

$$\mathrm{PMI}(x, x') = \log \frac{P(x, x')}{P(x)\,P(x')}$$
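A small illustration of the promotion weight, with PMI estimated from document co-occurrence. The value μ = 0.3 is a placeholder, not the paper's setting, and the function names are ours:

```python
import math

def pmi(docs, x, y):
    """Pointwise mutual information of words x, y from document co-occurrence.

    docs is a list of sets of words; probabilities are document frequencies.
    """
    n = len(docs)
    px = sum(x in d for d in docs) / n
    py = sum(y in d for d in docs) / n
    pxy = sum(x in d and y in d for d in docs) / n
    return math.log(pxy / (px * py)) if pxy > 0 else float("-inf")

def promotion(x_prime, x, pk_sets, docs, mu=0.3):
    """B[x', x] from the slide: 1 on the diagonal,
    mu * PMI(x, x') for pk-set pairs, 0 otherwise."""
    if x_prime == x:
        return 1.0
    if frozenset((x_prime, x)) in pk_sets:
        return mu * pmi(docs, x_prime, x)
    return 0.0
```

PMI here acts as a confidence weight: a pk-set pair that rarely co-occurs in the new domain's documents gets little or no promotion, which is how wrong knowledge is damped.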
Gibbs Sampler
$$P(z_j = u \mid \mathbf{z}^{-j}, \mathbf{x}, \beta, \gamma) \propto \frac{n^{-j}_{d,u} + \beta}{\sum_{u'=1}^{U}\left(n^{-j}_{d,u'} + \beta\right)} \times \frac{\sum_{x'=1}^{W} \mathbb{B}_{x',x_j}\, n^{-j}_{u,x'} + \gamma}{\sum_{w=1}^{W}\left(\sum_{x'=1}^{W} \mathbb{B}_{x',w}\, n^{-j}_{u,x'} + \gamma\right)}$$
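The modified sampler can be illustrated as follows. With an identity promotion matrix it reduces to the standard LDA conditional; the variable names are ours, and the counts are assumed to already exclude the word being sampled:

```python
def topic_posterior(word, doc_topic, topic_word, B, vocab, beta, gamma):
    """Unnormalized P(z = u | ...) for one word under the GPU-modified sampler.

    doc_topic[u]   : topic counts in the current document (word excluded)
    topic_word[u]  : dict of topic-word counts (word excluded)
    B[(x', x)]     : promotion matrix entries (missing entries mean 0)
    """
    U = len(doc_topic)
    total_d = sum(doc_topic)
    probs = []
    for u in range(U):
        # Document-topic part: how much the document already likes topic u.
        left = (doc_topic[u] + beta) / (total_d + U * beta)
        # Topic-word part: promoted counts for this word under topic u.
        num = sum(B.get((xp, word), 0.0) * topic_word[u].get(xp, 0)
                  for xp in vocab) + gamma
        den = sum(
            sum(B.get((xp, w), 0.0) * topic_word[u].get(xp, 0)
                for xp in vocab) + gamma
            for w in vocab
        )
        probs.append(left * num / den)
    return probs

# Toy example: 2 topics, 2-word vocabulary, identity promotion matrix.
vocab = ["price", "cost"]
B = {("price", "price"): 1.0, ("cost", "cost"): 1.0}
topic_word = [{"price": 5, "cost": 5}, {"price": 0, "cost": 0}]
probs = topic_posterior("price", [3, 1], topic_word, B, vocab, 0.1, 0.1)
```

Adding off-diagonal entries such as `B[("cost", "price")]` is what lets a pk-set partner pull probability mass toward the shared topic.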
Evaluation
We used review collections D from 50 domains
Each domain has 1,000 reviews; four domains with 10,000 reviews each are used for the large-data test
Test settings: two settings to evaluate LTM, representing two possible uses of LTM
Seen the test domain before, i.e., Dt ∈ D
Not seen the test domain before, i.e., Dt ∉ D
Topic Coherence (Mimno et al., EMNLP-2011)
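The metric can be computed directly from document frequencies, as defined by Mimno et al. (2011). This sketch assumes topics are given as ordered lists of top words; the function name is ours:

```python
import math

def umass_coherence(top_words, docs):
    """Topic coherence of Mimno et al. (EMNLP-2011):
    sum over ordered word pairs (m > l) of
    log((D(v_m, v_l) + 1) / D(v_l)),
    where D is document frequency over the corpus."""
    docs = [set(d) for d in docs]
    def df(*words):
        return sum(all(w in d for w in words) for d in docs)
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += math.log((df(top_words[m], top_words[l]) + 1) / df(top_words[l]))
    return score
```

Higher (less negative) values mean the topic's top words tend to co-occur in the same documents, which correlates with human judgments of topic quality.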
Topic Coherence on 4 Large Datasets
Can LTM improve with larger data?
Split a large dataset into 10 smaller ones
Here we use only one domain's data
Better topic coherence and better efficiency (30%)
Summary
Proposed a lifelong learning topic model, LTM
It keeps a large topic base S
For each new topic modeling task:
Run LDA to generate a set of initial topics
Find matching old topics from S
Mine quality knowledge from the old topics
Use the knowledge to help generate better topics
With big data (from diverse domains): we can do what we cannot do or have not done before