SLIDE 1

Latent Topic Feedback for Information Retrieval

David Andrzejewski David Buttler

Center for Applied Scientific Computing Lawrence Livermore National Laboratory (USA)

August 22, 2011

Andrzejewski and Buttler (LLNL) Latent Topic Feedback for IR KDD 2011 1 / 18

SLIDE 4

Search box: "euro opposition"

BigCo Internal Document Navigation Portal

Returned documents:
- Hurd in passionate Maastricht defense (Financial Times, 14 May 91)
- Russian President Yeltsin invited to G7 (Financial Times, 24 Mar 92)
- Small companies may lose in EC deals (Financial Times, 14 May 91)

Related topics:
- Tory Euro sceptics: social chapter, Liberal Democrat MPs, Labour, bill, Commons debate
- Emu: economic monetary union, Maastricht treaty, member states, European, Europe, Community

SLIDE 5

Corpus navigation challenges

Condition → impaired IR technique:
- Non-expert user → keyword queries
- Lack of metadata → faceted search
- Specialized domain → WordNet
- Small user base → query log mining, relevance feedback
- Proprietary data → crowdsourcing

Who has these problems? Private organizations and government agencies.


SLIDE 14

Topic modeling with Latent Dirichlet Allocation (LDA)

Blei et al, JMLR 2003

Example documents: "Human embryonic stem cell research may benefit patients with genetic risk factors..."; "Patients at risk for drug-resistant infection..."

SLIDE 16

How can we exploit latent topics?

Implicitly: language model smoothing (Wei & Croft, SIGIR 2006)
This approach: explicit user feedback on topics

1. How to show topics?
2. Which topics to show?
3. How to use feedback?

SLIDE 21

Question 1 - How to show topics to user?

“Top N” term lists are hard to interpret, so we combine several presentation techniques:
- topic label (Lau et al, COLING 2010)
- topic n-grams (Blei & Lafferty, arXiv 2009)
- capitalization recovery

Example (Topic 11):
- Terms: oil, gas, production, exploration, sea, north, company, field, energy, petroleum, companies
- Label: Petroleum
- N-grams / recovered capitalization: state oil company, North Sea, natural gas, production, exploration, field, energy


SLIDE 27

Question 2 - Which topics to show?

Problems

A) Too many topics to present them all (T > 100)
B) Incoherent “junk” topics, for example:
- Topic 248: ve, year, ll, time, don, good, lot, back, years, things, make
- Topic 18: january, february, december, march, month, year, rose, feb, sales, fell, increase

slide-29
SLIDE 29

Problem A - Narrowing down the topics

Pseudo-relevance feedback → enriched topics E
Topic covariance Σ → related topics R
Top 2 documents, top 2 enriched, top 2 related → at most 12 topics shown

E = ∪_{d ∈ Dq} k-argmax_t θ_d(t)
R = ∪_{t ∈ E} k-argmax_{t′ ∉ E} Σ(t, t′)
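The selection step above can be sketched in Python. This is a minimal illustration under assumed data shapes (theta as a document-topic proportion matrix, Sigma as a topic covariance matrix); the function and argument names are hypothetical, not the paper's code.

```python
import numpy as np

def select_topics(theta, Sigma, top_doc_ids, k=2):
    """Select enriched topics E (top-k topics of each top-ranked document)
    and related topics R (top-k covariance neighbors of each enriched
    topic, excluding E). theta: (D, T) doc-topic proportions;
    Sigma: (T, T) topic covariance."""
    E = set()
    for d in top_doc_ids:
        # k-argmax_t theta_d(t): the k most probable topics in document d
        E.update(np.argsort(theta[d])[::-1][:k])
    R = set()
    for t in E:
        # k-argmax over t' not in E of Sigma(t, t')
        order = [tp for tp in np.argsort(Sigma[t])[::-1] if tp not in E]
        R.update(order[:k])
    return E, R
```

The candidate set shown to the user would then be E ∪ R, before junk-topic filtering.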


SLIDE 33

Problem B - Identifying junk topics

Newman et al (JCDL 2010): word co-occurrences in Wikipedia → topic PMI score

Coherent topic (PMI = 3.85): chicken, food, fried, pork, rice, hot, sauce, beef, meat, sweet
Incoherent topic (PMI = 0.63): patent, patents, inventors, lewis, clark, gillette, spiegel, ice, pole, expedition

SLIDE 36

Problem B - Discarding junk topics

Final topics shown: enriched and related, minus dropped → {E ∪ R} \ D

1. Compute a PMI score for each topic t:
   PMI(t) = 1/(k(k − 1)) Σ_{(w, w′) ∈ Wt} PMI(w, w′)
2. Worst PMI scores → dropped topics D:
   D = {t | t ∈ E ∪ R and PMI(t) < PMI25}
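A toy sketch of the PMI scoring step. The co-occurrence and unigram count tables would come from a reference corpus such as Wikipedia; the dictionary-based interface here is an assumption for illustration, not the paper's implementation.

```python
import itertools
import math

def topic_pmi(topic_words, cooc, counts, total):
    """Average pairwise PMI over the top-k words of a topic:
    PMI(t) = 1/(k(k-1)) * sum over ordered word pairs of PMI(w, w').
    cooc maps word pairs to co-occurrence counts, counts maps words to
    unigram counts, total is the number of observation windows."""
    k = len(topic_words)
    total_pmi = 0.0
    for w, wp in itertools.permutations(topic_words, 2):
        # symmetric lookup: the pair may be stored in either order
        p_joint = cooc.get((w, wp), cooc.get((wp, w), 0)) / total
        p_w, p_wp = counts[w] / total, counts[wp] / total
        if p_joint > 0:
            total_pmi += math.log(p_joint / (p_w * p_wp))
    return total_pmi / (k * (k - 1))
```

Topics whose score falls below the 25th percentile would then land in the dropped set D.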


SLIDE 38

Question 3 - How to incorporate feedback?

The feedback mechanism should:
- preserve the original query intent
- incorporate the feedback
- “plug and play” with existing search technologies

Topic-driven query expansion: a weighted combination of the original query words q and the top 10 topic words Wz.
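The expansion step might be sketched as follows. This is a hedged illustration: the 0.25 topic mixing weight and the uniform per-word topic weights are assumptions made for simplicity, since the deck's later example assigns different weights to different topic words.

```python
def expand_query(query_terms, topic_words, topic_weight=0.25):
    """Build an Indri-style #weight(...) query that mixes the original
    query terms with the top words of the selected topic. The 0.25
    mixing weight and uniform per-word weights are hypothetical."""
    q_w = (1 - topic_weight) / len(query_terms)   # weight per original term
    t_w = topic_weight / len(topic_words)         # weight per topic word
    parts = [f"{q_w:.3f} {w}" for w in query_terms]
    parts += [f"{t_w:.3f} {w}" for w in topic_words]
    return "#weight(" + ", ".join(parts) + ")"
```

With the two-term query "euro opposition" and eight topic words, this yields the 0.375 weights on the original terms seen in the later example.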


SLIDE 44

Example TREC query

Corpus: 210K news articles (Financial Times, 1992-1994)
Query: “euro opposition” (political opposition to the Euro currency union)
Ground truth: 98 articles judged relevant


SLIDE 47

“euro opposition” topics

- debate (PMI percentile 47): Tory Euro sceptics, social chapter, Liberal Democrat MPs, Labour, bill, Commons
- business: PERSONAL FILE, Born, 2 years ago, past years, man, time, job, career
- Emu (PMI percentile 63): economic monetary union, Maastricht treaty, member states, European, Europe, Community, Emu
- George (PMI percentile 60): President George Bush, White House, Mr Clinton, administration, Democratic, Republican, Washington


SLIDE 52

“Emu” topic feedback

Original query expanded with the Indri weighted query operator:
#weight(0.375 euro, 0.375 opposition, 0.031 European, ..., 0.015 Emu)

Topic expansion gains (the slide shows an ROC curve of true/false positive rates):
- NDCG15: +0.22
- NDCG: +0.07
- MAP: +0.02
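For reference, NDCG, the headline measure in the gains list, can be computed with the standard formula. This is a generic sketch over a ranked list of relevance judgments, not the paper's evaluation code.

```python
import math

def ndcg(relevances, k=None):
    """Normalized discounted cumulative gain over a ranked list of
    relevance grades. With k=15 this corresponds to the NDCG15 measure
    reported on the slide."""
    rels = relevances[:k]
    # DCG: gain discounted by log2 of the (1-based) rank + 1
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    # IDCG: the DCG of the ideal (descending-relevance) ordering
    ideal = sorted(relevances, reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

A perfect ranking scores 1.0; pushing relevant documents down the list lowers the score.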


SLIDE 57

Experimental results

TREC datasets:
- 6 newswire corpora, 814K documents total
- T = 500 topics learned per corpus
- 850 queries total (some overlap)

Assume the user will select the “right” topic if it is presented.
Summary (h = a helpful topic exists, s = we show it to the user):
- average number of topics shown: 7.76
- P(h) ≈ 40%, P(s|h) ≈ 40%, so P(h ∧ s) = 15.6%
- adding related topics helps (without them: P(h ∧ s) = 10.9%, avg shown = 2.70)
- discarding junk topics does not hurt (without discarding: P(h ∧ s) = 16.8%, avg shown = 9.79)


SLIDE 69

Important idea

Even when topics do not improve NDCG and similar metrics, they may still be useful and informative to the user.


SLIDE 71

Conclusions and Future Work

Conclusions
- Explicit topic feedback can improve relevance
- The selection approach can find relevant topics

Future work
- Better topics? (fancier topic models, user guidance)
- Better topic selection? (user modeling, learning to rank)
- Validate assumptions and the presentation strategy (user study)
- Compare / combine with implicit topic usage


slide-79
SLIDE 79

Acknowledgments

Web demo: Kevin Lawrence (Florida A&M)
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
LLNL-PRES-491932

Andrzejewski and Buttler (LLNL) Latent Topic Feedback for IR KDD 2011 18 / 18

slide-81
SLIDE 81

Predicting relevant topics

Can we predict which topics will improve relevance?
Short answer: no (well, I couldn't get it to work...)
Linear / logistic regression

Feature                  Interpretation
PMI(t)                   topic quality
Entropy(P(d|t))          document-concentration of topic
log(P(q|t))              query probability under the topic
log(Σ_{d∈Dq} θ_d(t))     topic probability across top documents

Missed helpful topics: "far" from top baseline documents

Andrzejewski and Buttler (LLNL) Latent Topic Feedback for IR KDD 2011 19 / 18
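A minimal sketch of how three of the candidate features above could be computed from standard LDA outputs (the array names `theta` and `phi`, the toy dimensions, and the random data are illustrative assumptions, not the paper's code; PMI(t) is omitted because it needs external co-occurrence counts):

```python
import numpy as np

# Hypothetical LDA outputs:
#   theta: (num_docs, num_topics) document-topic proportions, rows sum to 1
#   phi:   (num_topics, vocab_size) topic-word distributions, rows sum to 1
rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(4), size=10)   # 10 docs, 4 topics
phi = rng.dirichlet(np.ones(50), size=4)     # 4 topics, 50-word vocabulary

query = [3, 17, 42]    # query as vocabulary indices
top_docs = [0, 2, 5]   # D_q: top documents retrieved for the query

def topic_features(t):
    # Entropy(P(d|t)): normalize theta[:, t] over docs; low entropy
    # means the topic is concentrated in few documents
    p_d_given_t = theta[:, t] / theta[:, t].sum()
    entropy = -np.sum(p_d_given_t * np.log(p_d_given_t))
    # log P(q|t): query likelihood under the topic-word distribution
    log_p_q = np.sum(np.log(phi[t, query]))
    # log(Σ_{d∈Dq} θ_d(t)): topic mass across the top retrieved documents
    log_mass = np.log(theta[top_docs, t].sum())
    return entropy, log_p_q, log_mass

for t in range(theta.shape[1]):
    print(t, topic_features(t))
```

These per-topic feature vectors would then feed the linear / logistic regression mentioned above.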

slide-85
SLIDE 85

Negative feedback

Could also allow user to mark topic as not relevant
Use Indri #not operator to form new query
Intuitively appealing, but did not seem to help in experiments...

Andrzejewski and Buttler (LLNL) Latent Topic Feedback for IR KDD 2011 20 / 18
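As a sketch, such a reformulated query could be assembled like this (only the `#not` operator comes from the slide; the `#weight`/`#combine` structure, the 0.8/0.2 weights, and the helper name are illustrative assumptions):

```python
def negative_topic_query(query_terms, topic_terms, w_query=0.8, w_neg=0.2):
    """Build an Indri structured query that down-weights a topic the
    user marked as not relevant, via Indri's #not belief operator.
    Illustrative sketch, not the paper's exact reformulation."""
    q = " ".join(query_terms)
    t = " ".join(topic_terms)
    return (f"#weight( {w_query} #combine( {q} ) "
            f"{w_neg} #not( #combine( {t} ) ) )")

print(negative_topic_query(["euro", "opposition"],
                           ["emu", "sceptics", "maastricht"]))
```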

slide-88
SLIDE 88

“law enforcement dogs”

Label: heroin
Terms: seized kg cocaine, drug traffickers, kg heroin, police, arrested, drugs, marijuana

[ROC curve: TPR vs. FPR]

Andrzejewski and Buttler (LLNL) Latent Topic Feedback for IR KDD 2011 21 / 18
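For reference, the (FPR, TPR) points behind an ROC plot like the one above can be computed from ranked topic scores as follows (an illustrative helper, not the paper's evaluation code; it assumes at least one positive and one negative label):

```python
def roc_points(scores, labels):
    """Sweep a threshold down the ranking and emit (FPR, TPR) points.
    scores: higher = predicted more relevant; labels: 1 = topic
    actually improved retrieval relevance."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

print(roc_points([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
# → [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```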

slide-89
SLIDE 89

“King Hussein, peace”

Label: Amman
Terms: Majesty King Husayn, al Aqabah, peace process, Jordan, Jordanian, Amman, Arab

[ROC curve: TPR vs. FPR]

Andrzejewski and Buttler (LLNL) Latent Topic Feedback for IR KDD 2011 22 / 18

slide-90
SLIDE 90

“bank failures”

Label: FDIC
Terms: Federal Deposit Insurance, William Seidman, Insurance Corp, banks, bank, FDIC, banking

[ROC curve: TPR vs. FPR]

Andrzejewski and Buttler (LLNL) Latent Topic Feedback for IR KDD 2011 23 / 18

slide-91
SLIDE 91

“US-USSR Arms Control Agreements”

Label: missile
Terms: Strategic Defense Initiative, United States, arms control, treaty, nuclear, missiles, range

[ROC curve: TPR vs. FPR]

Andrzejewski and Buttler (LLNL) Latent Topic Feedback for IR KDD 2011 24 / 18

slide-92
SLIDE 92

“Possible Contributions of Gene Mapping to Medicine”

Label: called
Terms: British journal Nature, immune system, genetically engineered, cells, research, researchers, scientists

[ROC curve: TPR vs. FPR]

Andrzejewski and Buttler (LLNL) Latent Topic Feedback for IR KDD 2011 25 / 18

slide-93
SLIDE 93

“New Space Satellite Applications”

Label: communications
Terms: European Space Agency, Air Force, Cape Canaveral, satellite, launch, rocket, satellites

[ROC curve: TPR vs. FPR]

Andrzejewski and Buttler (LLNL) Latent Topic Feedback for IR KDD 2011 26 / 18

slide-94
SLIDE 94

Wikipedia: Tax competition (4-word window)

... governmental strategy of attracting foreign direct investment,...

Andrzejewski and Buttler (LLNL) Latent Topic Feedback for IR KDD 2011 27 / 18
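The 4-word window illustrated above suggests co-occurrence counting along these lines, which is what PMI-based topic-quality scores are built from (an illustrative sketch; the tokenization and exact window convention are assumptions):

```python
from collections import Counter

def window_cooccurrences(tokens, window=4):
    """Count unordered word pairs that co-occur within a sliding
    window of `window` words. PMI for a pair (w1, w2) would then be
    log( P(w1, w2) / (P(w1) * P(w2)) ) estimated from these counts."""
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            pair_counts[frozenset((tokens[i], tokens[j]))] += 1
    return word_counts, pair_counts

text = "governmental strategy of attracting foreign direct investment"
wc, pc = window_cooccurrences(text.split())
print(pc[frozenset(("foreign", "investment"))])
```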
