Reading Tea Leaves: How Humans Interpret Topic Models


SLIDE 1

Reading Tea Leaves: How Humans Interpret Topic Models

By Jonathan Chang, Jordan Boyd-Graber, (Chong Wang), et al. NIPS 2009 Presented by Stephen Mayhew Feb 2013

SLIDE 2

Motivation

  • How to evaluate topic models?
  • “Anecdotally”, “empirically”
  • Intrinsic vs. extrinsic
SLIDE 3

SVM Document Classification on Reuters-21578

SLIDE 4

Human Metrics

  • 1. Word intrusion
  • 2. Topic intrusion

A crowdsourced approach using Amazon Mechanical Turk, evaluating three different topic models: LDA, pLSI, and CTM.

SLIDE 5

Word Intrusion

“Spot the intruder word.”

Process:

  • 1. Select a topic at random
  • 2. Choose the 5 most probable words from the topic
  • 3. Choose an improbable word from this topic (which is probable in another topic)
  • 4. Shuffle
  • 5. Present to subject
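
The steps above can be sketched as follows. This is a minimal sketch, not the paper's code; the `topics` format (topic id mapped to words sorted by decreasing probability) is a hypothetical convenience:

```python
import random

def build_word_intrusion_task(topics, topic_id, rng=None):
    """Construct one word-intrusion task.

    `topics` maps a topic id to its vocabulary sorted by decreasing
    probability under that topic (a hypothetical format).
    """
    rng = rng or random.Random()
    top_words = topics[topic_id][:5]              # 5 most probable words
    donor_id = rng.choice([t for t in topics if t != topic_id])
    # Intruder: probable in another topic, absent from this topic's head.
    intruder = next(w for w in topics[donor_id] if w not in top_words)
    shown = top_words + [intruder]
    rng.shuffle(shown)                            # shuffle before presenting
    return shown, intruder
```

The subject then sees the six shuffled words and is asked to spot the one that does not belong.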
SLIDE 6

Word Intrusion

If the topic is coherent, subjects will agree on the outlier. If the topic is incoherent, subjects will choose the outlier at random.

SLIDE 7

Topic Intrusion

“Spot the intruder topic.”

Process:

  • 1. Choose a document
  • 2. Choose the three highest-prob. topics for this document
  • 3. Choose one low-prob. topic for this document
  • 4. Shuffle
  • 5. Present to subject
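
The same construction sketched for topics, assuming the document's topic weights are available as (topic_id, probability) pairs (a hypothetical format for a fitted model's output):

```python
import random

def build_topic_intrusion_task(doc_topic_probs, rng=None):
    """Construct one topic-intrusion task for a document.

    `doc_topic_probs` is a list of (topic_id, probability) pairs for
    the document (illustrative; any model's document-topic weights
    can be massaged into this shape).
    """
    rng = rng or random.Random()
    ranked = sorted(doc_topic_probs, key=lambda tp: tp[1], reverse=True)
    top_three = [t for t, _ in ranked[:3]]        # three highest-prob. topics
    intruder = ranked[-1][0]                      # one low-prob. topic
    shown = top_three + [intruder]
    rng.shuffle(shown)                            # shuffle before presenting
    return shown, intruder
```

Each topic is then displayed to the subject as its list of top words, alongside the document text.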
SLIDE 8

Topic Intrusion

SLIDE 9

Word Intrusion: how to measure it

Model precision (MP):

$\mathrm{MP}_k^m = \frac{1}{S} \sum_{s=1}^{S} \mathbb{1}\left(i_{k,s}^m = w_k^m\right)$

where $w_k^m$ is the true intruder word for topic $k$ of model $m$, $i_{k,s}^m$ is the word selected by subject $s$, and $S$ is the number of subjects.

Which is just a fancy way of saying: number of people correct / total number of people.
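
As a sanity check, that is simply a mean over subjects; a minimal sketch with illustrative variable names:

```python
def model_precision(chosen_words, true_intruder):
    """MP for one (model, topic) pair: the fraction of subjects whose
    selected word equals the true intruder word."""
    return sum(w == true_intruder for w in chosen_words) / len(chosen_words)
```

For example, if three of four subjects spot the intruder, `model_precision(["bank", "ice", "bank", "bank"], "bank")` gives 0.75.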

SLIDE 10

Word Intrusion

SLIDE 11

NYT corpus, 50 topic LDA model

SLIDE 12

Topic intrusion: how to measure it

Topic Log Odds (TLO):

$\mathrm{TLO}_d^m = \frac{1}{S} \sum_{s=1}^{S} \left( \log \hat{\theta}_{d,\, j_{d,*}^m} - \log \hat{\theta}_{d,\, j_{d,s}^m} \right)$

where $\hat{\theta}_d$ holds the topic probabilities of document $d$, $j_{d,*}^m$ is the true intruding topic, and $j_{d,s}^m$ is the topic selected by subject $s$.

Translation: the normalized difference between the probability mass of the actual “intruder” and the selected “intruder”. Upper bound is 0; higher is better.
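
A minimal sketch of that computation, assuming the document's topic probabilities are available as a dict (an illustrative format):

```python
import math

def topic_log_odds(theta, true_intruder, chosen_intruders):
    """TLO for one (model, document) pair.

    `theta` maps topic id -> the document's probability of that topic
    (illustrative format). Each subject contributes
    log theta[true intruder] - log theta[chosen intruder], so the score
    is 0 when every subject finds the true intruder and negative when
    subjects pick higher-probability (non-intruding) topics instead.
    """
    return sum(math.log(theta[true_intruder]) - math.log(theta[c])
               for c in chosen_intruders) / len(chosen_intruders)
```

When a subject chooses a genuinely probable topic rather than the intruder, the log difference is negative, pulling the average below its upper bound of 0.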

SLIDE 13

Topic Intrusion

SLIDE 14

Wikipedia corpus, 50 topic LDA model

SLIDE 15

SLIDE 16

Problems

Measures homogeneity (synonymy), not topic strength (coherence).

  • Example document: curling
  • Possible topic: broom, ice, Canada, rock, sheet, stone
  • Consider syntactic differences: organization, physicality, proportions, red