

Novelty & Diversity

CISC489/689-010, Lecture #25
Monday, May 18th
Ben Carterette


IR Tasks


• Standard task: ad hoc retrieval
  – User submits query, receives ranked list of top-scoring documents
• Cross-language retrieval
  – User submits query in language E, receives ranked list of top-scoring documents in languages F, G, …
• Question answering
  – User submits natural language question and receives natural language answer
• Common thread: documents are scored independently of one another




Independent Document Scoring

• Scoring documents independently means the score of a document is computed without considering other documents that might be relevant to the query
  – Example: 10 documents that are identical to each other will all receive the same score
  – These 10 documents would then be ranked consecutively
• Does a user really want to see 10 copies of the same document?


Duplicate Removal

• Duplicate removal (or de-duping) is a simple way to reduce redundancy in the ranked list
• Identify documents that have the same content and remove all but one
• Simple approaches (see the sketch below):
  – Fingerprinting: break documents down into blocks and measure similarity between blocks
  – If there are many blocks with high similarity, the documents are probably duplicates
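To make the fingerprinting idea concrete, here is a minimal sketch in Python. It hashes overlapping word blocks and compares the resulting fingerprints; the block size and the 0.8 threshold are illustrative assumptions, not values from the lecture.

    # Minimal fingerprinting sketch: hash overlapping word blocks, then
    # compare fingerprints. Block size and threshold are illustrative.
    from typing import Set

    def fingerprint(text: str, block_size: int = 8) -> Set[int]:
        """Break a document into overlapping word blocks and hash each block."""
        words = text.split()
        blocks = [" ".join(words[i:i + block_size])
                  for i in range(max(1, len(words) - block_size + 1))]
        return {hash(b) for b in blocks}

    def likely_duplicates(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
        """Documents sharing many identical blocks are probably duplicates."""
        fa, fb = fingerprint(doc_a), fingerprint(doc_b)
        overlap = len(fa & fb) / max(1, len(fa | fb))  # Jaccard over block hashes
        return overlap >= threshold

Exact hash matching only catches near-verbatim copies, which is exactly the limitation the next slide raises.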




Redundancy and Novelty

• Simple de-duping is not necessarily enough
  – Picture 10 documents that contain the same information but are written in very different styles
  – A user probably doesn't need all 10
    • Though 2 might be OK
  – De-duping will not reduce the redundancy
• We would like ways to identify documents that contain novel information
  – Information that is not present in the documents that have already been ranked


Example: Two Biographies of Lincoln




Novelty Ranking

• Maximal Marginal Relevance (MMR) – Carbonell & Goldstein, SIGIR 1998
• Combine a query-document score S(Q, D) with a similarity score based on the similarity between D and the (k−1) documents that have already been ranked
  – If D has a low score, give it low marginal relevance
  – If D has a high score but is very similar to the documents already ranked, give it low marginal relevance
  – If D has a high score and is different from the other documents, give it high marginal relevance
• The kth ranked document is the one with maximum marginal relevance


MMR

MMR(Q, D) = λ·S(Q, D) − (1 − λ)·max_i sim(D, D_i)

Top-ranked document:    D_1 = argmax_D MMR(Q, D) = argmax_D S(Q, D)
Second-ranked document: D_2 = argmax_D MMR(Q, D) = argmax_D [λ·S(Q, D) − (1 − λ)·sim(D, D_1)]
Third-ranked document:  D_3 = argmax_D MMR(Q, D) = argmax_D [λ·S(Q, D) − (1 − λ)·max{sim(D, D_1), sim(D, D_2)}]
…

When λ = 1, MMR ranking is identical to normal ranked retrieval
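A minimal MMR re-ranking sketch in Python. The score table and the sim function are assumed to come from the underlying retrieval system (e.g., cosine similarity); λ = 0.7 and k = 10 are illustrative defaults, not values from the lecture.

    # Greedy MMR re-ranking: at each step, pick the candidate with maximum
    # marginal relevance given the documents already ranked.
    from typing import Callable, Dict, Hashable, List

    def mmr_rank(scores: Dict[Hashable, float],
                 sim: Callable[[Hashable, Hashable], float],
                 lam: float = 0.7, k: int = 10) -> List[Hashable]:
        ranked: List[Hashable] = []
        candidates = set(scores)
        while candidates and len(ranked) < k:
            def mmr(d):
                # max similarity to already-ranked documents; 0 for the first pick
                penalty = max((sim(d, r) for r in ranked), default=0.0)
                return lam * scores[d] - (1 - lam) * penalty
            best = max(candidates, key=mmr)  # k-th document = max marginal relevance
            ranked.append(best)
            candidates.remove(best)
        return ranked

With lam = 1 the penalty term vanishes and the loop reproduces the normal score-ordered ranking, matching the observation above.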




A Probabilistic Approach

• "Beyond Independent Relevance", Zhai et al., SIGIR 2003
• Calculate four probabilities for a document D:
  – P(Rel, New | D) = P(Rel | D)·P(New | D)
  – P(Rel, ¬New | D) = P(Rel | D)·P(¬New | D)
  – P(¬Rel, New | D) = P(¬Rel | D)·P(New | D)
  – P(¬Rel, ¬New | D) = P(¬Rel | D)·P(¬New | D)
  – Four probabilities reduce to two: P(Rel | D) and P(New | D)


A Probabilistic Approach

• The document score is a cost function of the probabilities:
  • c1 = cost of a new relevant document
  • c2 = cost of a redundant relevant document
  • c3 = cost of a new nonrelevant document
  • c4 = cost of a redundant nonrelevant document

S(Q, D) = c1·P(Rel|D)·P(New|D) + c2·P(Rel|D)·P(¬New|D) + c3·P(¬Rel|D)·P(New|D) + c4·P(¬Rel|D)·P(¬New|D)
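As a concrete sketch, the combination is a few lines of Python; the default cost values in the signature are illustrative assumptions, not values from the paper.

    # Expected-cost score for one document; since S is a cost, a lower
    # value means the document should be ranked earlier.
    # p_rel = P(Rel|D), p_new = P(New|D); c1..c4 as defined on the slide.
    def cost_score(p_rel: float, p_new: float,
                   c1: float = 0.0, c2: float = 1.0,
                   c3: float = 0.5, c4: float = 0.5) -> float:
        return (c1 * p_rel * p_new
                + c2 * p_rel * (1 - p_new)
                + c3 * (1 - p_rel) * p_new
                + c4 * (1 - p_rel) * (1 - p_new))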



A Probabilistic Approach

• Assume the following:
  – c1 = 0 – there is no cost for a new relevant document
  – c2 > 0 – there is some cost for a redundant relevant document
  – c3 = c4 – the cost of a nonrelevant document is the same whether it's new or not
• Scoring function reduces to

S(Q, D) = P(Rel|D)·(1 − c3/c2 − P(New|D))
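The slide leaves the reduction step implicit; it is a couple of lines of algebra (writing R for Rel and N for New, then dropping the additive constant c3 and the positive factor c2, neither of which affects the ranking):

    \begin{aligned}
    S(Q,D) &= c_2\,P(R|D)\bigl(1 - P(N|D)\bigr) + c_3\bigl(1 - P(R|D)\bigr) \\
           &\overset{\text{rank}}{=}\; P(R|D)\bigl(1 - P(N|D)\bigr) - \tfrac{c_3}{c_2}\,P(R|D)
            \;=\; P(R|D)\Bigl(1 - \tfrac{c_3}{c_2} - P(N|D)\Bigr)
    \end{aligned}

Since S is an expected cost, documents with lower values of this expression are ranked first: a relevant, highly novel document drives the score down.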

A Probabilistic Approach

• Requires estimates of P(Rel | D) and P(New | D)
• P(Rel | D) = P(Q | D), the query-likelihood language model score
• P(New | D) is trickier
  – One possibility: KL-divergence between the language model of document D and the language model of the ranked documents
  – Recall that KL-divergence is a sort of "similarity" between probability distributions/language models




Novelty Probability

• P(New | D)
• The smoothed language model for D is

    P(w | D) = (1 − α_D)·tf_{w,D}/|D| + α_D·ctf_w/|C|

• If we let C be the set of documents ranked above D, then α_D can be thought of as a "novelty coefficient"
  – Higher α_D means the document is more like the ones ranked above it
  – Lower α_D means the document is less like the ones ranked above it

Novelty Probability

• Find the value of α_D that maximizes the likelihood of the document D (see the sketch below):

    P(New | D) = argmax_{α_D} ∏_{w∈D} [ (1 − α_D)·tf_{w,D}/|D| + α_D·ctf_w/|C| ]

• This is a novel use of the smoothing parameter: instead of giving small probability to terms that don't appear, use it to estimate how different the document is from the background
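A minimal sketch of the α_D estimate, assuming whitespace-tokenized word lists and a grid search over α (an illustrative choice; EM would also work). The counts are toy stand-ins for real index statistics.

    # Estimate the "novelty coefficient" alpha_D by maximizing the smoothed
    # log-likelihood of D against C, the documents ranked above it.
    import math
    from collections import Counter

    def estimate_alpha(doc_words, ranked_words):
        tf = Counter(doc_words)          # term frequencies in D
        ctf = Counter(ranked_words)      # term frequencies in C
        dlen, clen = len(doc_words), max(1, len(ranked_words))
        best_alpha, best_ll = 0.0, float("-inf")
        for step in range(1, 100):       # grid search over alpha in (0, 1)
            a = step / 100.0
            ll = sum(cnt * math.log((1 - a) * cnt / dlen + a * ctf[w] / clen)
                     for w, cnt in tf.items())
            if ll > best_ll:
                best_alpha, best_ll = a, ll
        return best_alpha                # high alpha: D is well explained by C

A document whose words are well explained by the ranked documents gets a high α_D (it is more like the ones above it); a document with unseen content pushes α_D toward 0. The slides treat this maximizing α_D as the basis for P(New | D).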



Probabilistic Model Summary

• Estimate P(Rel | D) using usual language model approaches
• Estimate P(New | D) using the smoothing parameter
• Combine P(Rel | D) and P(New | D) using the cost-based scoring function and rank documents accordingly


Evaluating Novelty

• Evaluation by precision, recall, average precision, etc., is also based on independent assessments of relevance
  – Example: if one of 10 duplicate documents is relevant, all 10 must be relevant
  – A system that ranks those 10 documents at ranks 1 to 10 gets better precision than a system that finds 5 relevant documents that are very different
• The evaluation does not reflect the utility to the users




Subtopic Assessment

• Instead of judging documents for relevance to the query/information need, judge them with respect to subtopics of the information need
• Example: an information need about applications of robots, with subtopics such as "spot-welding robots", "pipe-laying robots", and "controlling inventory"


Subtopics and Documents

• A document can be relevant to one or more subtopics
  – Or to none, in which case it is not relevant
• We want to evaluate the ability of the system to find non-duplicate subtopics
  – If document 1 is relevant to "spot-welding robots" and "pipe-laying robots" and document 2 is the same, document 2 does not give any extra benefit
  – If document 2 is relevant to "controlling inventory", it does give extra benefit




Evaluating Novelty

• We can evaluate novelty by evaluating the ability of the system to find unique subtopics
• Zhai et al. introduced S-precision and S-recall (see the sketch below)
• S-recall at rank k: number of unique subtopics in the top-k ranked documents divided by total number of unique subtopics
• S-precision at rank k:
  – First calculate S-recall at rank k
  – Then S-precision at k is the minimum rank at which the same S-recall could be achieved divided by k
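A minimal S-recall sketch, assuming each ranked document is represented as the set of subtopic IDs it was judged relevant to; the names and representation are illustrative.

    # S-recall at rank k: unique subtopics covered by the top-k documents,
    # divided by the total number of unique subtopics for the query.
    from typing import List, Set

    def s_recall(ranking: List[Set[str]], all_subtopics: Set[str], k: int) -> float:
        covered = set().union(*ranking[:k])  # subtopics seen in the top k
        return len(covered & all_subtopics) / len(all_subtopics)

S-precision additionally needs the minimum rank at which that S-recall is achievable, which is where the set-cover difficulty on the next slide comes in.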



Novelty Evaluation

• One problem: S-precision is NP-complete to compute
  – Specifically, calculating the minimum rank at which a given S-recall could be achieved is an instance of minimum set cover
  – Minimum set cover: given items U and a collection C of subsets of U, find the smallest subset of C that contains all items in U
  – U = subtopics, C = documents
• Some queries will be very difficult to evaluate
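Because the exact minimum rank is a minimum set cover, a practical evaluation tool would likely approximate it. Below is a greedy-cover sketch (the classic approximation, not part of the lecture), reusing the subtopic-set representation from the S-recall sketch above.

    # Greedy set cover: repeatedly pick the judged document covering the
    # most still-uncovered target subtopics. The count of documents used
    # is an upper bound on the true minimum rank.
    from typing import List, Set

    def greedy_min_rank(judged: List[Set[str]], target: Set[str]) -> int:
        covered: Set[str] = set()
        pool = list(judged)
        rank = 0
        while not target <= covered and pool:
            best = max(pool, key=lambda s: len((s & target) - covered))
            covered |= best & target
            pool.remove(best)
            rank += 1
        return rank

    def s_precision(ranking: List[Set[str]], judged: List[Set[str]], k: int) -> float:
        target = set().union(*ranking[:k])           # subtopics found in the top k
        return greedy_min_rank(judged, target) / k   # approximate min-rank / k

The greedy bound can overstate the minimum rank, so S-precision computed this way is only an approximation of the measure as defined.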



Novelty Data

• To do experiments, we need a collection of documents that have been judged w.r.t. subtopics of information needs
• There is not much data available
  – Only set used in the literature: 20 information needs, 210,000 news articles judged for subtopic relevance


Some Experimental Results

From Zhai et al., "Beyond Independent Relevance", SIGIR 2003




Novelty & Diversity

• This is a growing area of interest in IR research
• A number of papers have been published in the last year
• Still not much data available for experiments
• TREC 2009 will have a diversity retrieval task
  – Slightly different from novelty: find documents that answer different interpretations of a query
• I am co-organizing a workshop on the subject at SIGIR this summer