Exploiting Time-based Synonyms in Searching Document Archives

Nattiya Kanhabua and Kjetil Nørvåg, Database System Group, Norwegian University of Science and Technology, Trondheim, Norway. JCDL 2010, June 21 - 25, Gold Coast, Australia.


slide-1
SLIDE 1

Exploiting Time-based Synonyms in Searching Document Archives

Nattiya Kanhabua and Kjetil Nørvåg

Database System Group Norwegian University of Science and Technology Trondheim, Norway

JCDL 2010, June 21 - 25, Gold Coast, Australia

Kanhabua and Nørvåg Exploiting Time-based Synonyms in Search

slide-2
SLIDE 2

Outline

1. Introduction
     Problem Statement
     Contributions
2. Synonym Detection
     Entity Recognition and Synonym Extraction
     Improving the Accuracy of Time
3. Query Expansion
     Time-based Synonyms
     Ranking Time-independent Synonyms
     Ranking Time-dependent Synonyms
4. Evaluation
     Experiment Setting
     Experimental Results
5. Conclusions
     Conclusions and Future Work


slide-8
SLIDE 8

Problem statement

In recent years, document archives have become publicly available
  E.g., the Internet Archive, digital libraries, and news archives

Searching such resources is not straightforward
  Their contents are strongly time-dependent
  Example: the query “Pope Benedict XVI” restricted to dates “before 2005” is unable to retrieve documents about “Joseph Alois Ratzinger”

To improve retrieval effectiveness, query expansion using synonyms with respect to time can be employed


slide-11
SLIDE 11

Observation

Named entities (people, organizations, locations, etc.) constitute a major fraction of queries [Sanderson SIGIR’2008]
  They are very dynamic in appearance, i.e., relationships between terms change over time
  E.g., changes of roles, name alterations, or semantic shift

Synonyms are different words with similar meanings
  In our context, synonyms are terms used as name variants (other names, titles, or roles) of a named entity
  E.g., “Cardinal Joseph Ratzinger” is a synonym of “Pope Benedict XVI” before 2005


slide-14
SLIDE 14

What are time-based synonyms?

Time-independent synonyms are invariant to time
Time-dependent synonyms are relevant to a particular time period, i.e., entity-synonym relationships change over time

slide-15
SLIDE 15

Application

News archive search
  Search terms are named entities
  Publication dates of documents are temporal criteria

Scenario 1
  Query: “Pope Benedict XVI”, written before 2005
  Documents about “Joseph Alois Ratzinger” are relevant

Scenario 2
  Query: “Hillary R. Clinton”, written from 1997 to 2002
  Documents about “New York Senator” and “First Lady of the United States” are relevant

Challenge
  Semantic gaps in searching archives, i.e., a lack of knowledge about a query and its synonyms at a particular time


slide-20
SLIDE 20

Contributions

1. Formal models
     Wikipedia viewed as a temporal resource
2. Proposed approaches
     Discover time-based synonyms over time
     Improve the accuracy of time of synonyms
     Expand a query using time-based synonyms
3. Experiments
     Evaluate extracting and improving time of synonyms
     Evaluate query expansion using time-based synonyms


slide-24
SLIDE 24

Recognizing named entities

Step 1: Partition Wikipedia with respect to the time granularity g = month to obtain its snapshots $W = \{W_{t_1}, \ldots, W_{t_z}\}$
Step 2: For each snapshot $W_{t_k} \in W$, identify named entity pages to obtain a set of named entities $E_{t_k} = \{e_1, \ldots, e_j\}$
Step 3: For each named entity $e_i \in E_{t_k}$, find a set of entity-synonym relationships $S_{t_k} = \{\xi_{1,1}, \ldots, \xi_{n,m}\}$

Figure: A snapshot of Wikipedia and current revisions at time $t_k$

Example: rules for identifying named entity pages [Bunescu and Paşca EACL’2006]
  1) Multi-word titles in which all content words are capitalized, e.g., President_of_the_United_States ⇒ named entity
  2) Single-word titles with multiple capital letters, e.g., UNICEF and WHO are named entities
  3) At least 75% of the title's occurrences in the article text itself are capitalized

Example: an entity-synonym relationship
  $e_i$: President_of_the_United_States
  $t_k$: 11/2001
  $s_j$: “George W. Bush”
  $\xi_{i,j}$: $(e_i, s_j)$, i.e., (President_of_the_United_States, “George W. Bush”)
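The three title rules attributed to [Bunescu and Paşca EACL’2006] can be sketched as a small predicate. This is only an illustration: the function name `is_named_entity`, the `FUNCTION_WORDS` list, and the simple regular-expression counting of occurrences are assumptions, not the authors' implementation, and sentence-initial capitalization is not handled.

```python
import re

# Hypothetical list of function words ignored by rule 1 (an assumption).
FUNCTION_WORDS = {"a", "an", "and", "at", "for", "in", "of", "on", "the", "to"}


def is_named_entity(title, article_text="", threshold=0.75):
    """Apply the three title heuristics to one Wikipedia page title."""
    words = title.replace("_", " ").split()
    if not words:
        return False
    if len(words) > 1:
        # Rule 1: multi-word title whose content words are all capitalized.
        content = [w for w in words if w.lower() not in FUNCTION_WORDS]
        return bool(content) and all(w[0].isupper() for w in content)
    word = words[0]
    # Rule 2: single-word title with more than one capital letter.
    if sum(ch.isupper() for ch in word) > 1:
        return True
    # Rule 3: at least `threshold` of the occurrences in the text are capitalized.
    pattern = r"\b" + re.escape(word) + r"\b"
    occurrences = re.findall(pattern, article_text, flags=re.IGNORECASE)
    if not occurrences:
        return False
    capitalized = sum(1 for occ in occurrences if occ[0].isupper())
    return capitalized / len(occurrences) >= threshold
```

For instance, `is_named_entity("President_of_the_United_States")` and `is_named_entity("UNICEF")` both hold under these assumptions, while a title such as `List_of_rivers` fails rule 1.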

slide-28
SLIDE 28

Extracting synonyms

Step 1: For each entity $e_i \in E_{t_k}$, find its synonyms by extracting anchor texts from article links
Step 2: Accumulate entity-synonym relationships for all entities at time $t_k$, i.e., a synonym snapshot $S_{t_k} = \{\xi_{1,1}, \ldots, \xi_{n,m}\}$

Example
  In the link [[President_of_the_United_States|Barack Obama]], “Barack Obama” is the anchor text linking to the article President_of_the_United_States
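Anchor-text extraction from piped wiki links can be sketched with a regular expression. The names `PIPED_LINK` and `extract_synonyms` are hypothetical, and the pattern deliberately ignores unpiped links, section anchors, and templates; it is a minimal sketch, not the authors' extraction pipeline.

```python
import re
from collections import Counter, defaultdict

# Matches piped wiki links of the form [[Target|anchor text]].
PIPED_LINK = re.compile(r"\[\[([^\[\]|]+)\|([^\[\]|]+)\]\]")


def extract_synonyms(wikitext, entities):
    """Count anchor texts that link to pages in `entities` (a set of titles)."""
    synonyms = defaultdict(Counter)
    for target, anchor in PIPED_LINK.findall(wikitext):
        # Normalize the link target to the underscore form of page titles.
        title = target.strip().replace(" ", "_")
        if title in entities:
            synonyms[title][anchor.strip()] += 1
    return synonyms
```

Running this over every article of one snapshot and merging the counters would give the synonym snapshot for that month, with the counts usable later as term frequencies.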


slide-30
SLIDE 30

Extracting synonyms

Output: entity-synonym relationships and time periods

Named Entity              Synonym                      Time Period
Pope Benedict XVI         Cardinal Joseph Ratzinger    05/2005 - 03/2009*
Pope Benedict XVI         Joseph Ratzinger             05/2005 - 03/2009
Pope Benedict XVI         Pope Benedict XVI            05/2005 - 03/2009
Barack Obama              Barack Hussein Obama II      02/2007 - 03/2009
Barack Obama              Sen. Barack Obama            07/2007 - 03/2009
Barack Obama              Senator Barack Obama         05/2006 - 03/2009
Hillary Rodham Clinton    Hillary Clinton              08/2003 - 03/2009
Hillary Rodham Clinton    Sen. Hillary Clinton         03/2007 - 03/2009
Hillary Rodham Clinton    Senator Clinton              11/2007 - 03/2009

* The time periods of synonyms are the timestamps of the Wikipedia articles (8 years) in which they appear, not temporal expressions extracted from the contents
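One way to turn monthly synonym snapshots into time periods like those in the table is to record the first and last snapshot in which each entity-synonym pair appears. The function name `synonym_periods` and the input layout are assumptions, and gaps between appearances are ignored in this sketch.

```python
def synonym_periods(snapshots):
    """Map each (entity, synonym) pair to its (first, last) snapshot month.

    `snapshots` maps a (year, month) tuple to the set of entity-synonym
    pairs found in that monthly snapshot.
    """
    periods = {}
    for month in sorted(snapshots):
        for pair in snapshots[month]:
            # Keep the earliest month seen so far and extend the end month.
            first, _ = periods.get(pair, (month, month))
            periods[pair] = (first, month)
    return periods
```

Because the period comes from article timestamps only, it can lag behind real-world usage, which is the inaccuracy the footnote above points out.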


slide-32
SLIDE 32

Improving the accuracy of time using burst detection

Analyze the New York Times Annotated Corpus (NYT) to discover more accurate time periods
  1.8M articles from January 1987 to June 2007 (20 years)
Use the burst detection algorithm [Kleinberg KDD’2002]
  Generate bursty periods of $\xi_{i,j}$ by computing a rate of occurrence from document streams
  Output bursty intervals and burst weights, i.e., periods of occurrence and intensity

slide-33
SLIDE 33

Improving the accuracy of time using burst detection

Output: results from the burst-detection algorithm

Synonym             Entity                    Burst Weight    Start      End
President Reagan    Ronald Reagan             5506.858        01/1987    02/1989
President Ronald    Ronald Reagan             100.401         01/1989    03/1990
President Ronald    Ronald Reagan             67.208          07/1990    02/1993
Senator Clinton     Hillary Rodham Clinton    18.214          01/2001    10/2001
Senator Clinton     Hillary Rodham Clinton    17.732          05/2002    01/2003
Senator Clinton     Hillary Rodham Clinton    172.356         06/2003    11/2004
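To make the idea of bursty intervals and burst weights concrete, here is a simplified two-state, batched burst detector in the spirit of [Kleinberg KDD’2002]. It is a sketch under stated assumptions, not the implementation used in the experiments: the function name `detect_bursts`, the per-batch counts as input, and the defaults `s=2.0` and `gamma=1.0` (mirroring the ratio and gamma values listed in the experiment setting) are illustrative choices.

```python
import math


def detect_bursts(relevant, total, s=2.0, gamma=1.0):
    """Return (start, end, weight) for each burst in a batched document stream.

    relevant[t] counts documents in batch t that mention the synonym;
    total[t] counts all documents in batch t.
    """
    n = len(relevant)
    p0 = sum(relevant) / sum(total)      # base rate of occurrence
    if not 0.0 < p0 < 1.0:
        raise ValueError("overall rate must lie strictly between 0 and 1")
    p1 = min(s * p0, 0.9999)             # elevated rate of the bursty state
    rates = (p0, p1)

    def fit_cost(state, r, d):
        # Negative binomial log-likelihood; the binomial coefficient is
        # dropped because it is identical for both states.
        p = rates[state]
        return -(r * math.log(p) + (d - r) * math.log(1.0 - p))

    up_cost = gamma * math.log(n)        # cost of entering the bursty state

    # Viterbi search for the cheapest state sequence, starting in state 0.
    cost = [0.0, math.inf]
    back = []
    for r, d in zip(relevant, total):
        new_cost, pointers = [], []
        for state in (0, 1):
            options = [(cost[prev] + (up_cost if state > prev else 0.0), prev)
                       for prev in (0, 1)]
            best, prev = min(options)
            new_cost.append(best + fit_cost(state, r, d))
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)

    # Trace the cheapest path backwards from the final batch.
    state = 0 if cost[0] <= cost[1] else 1
    states = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        states.append(state)
    states.reverse()

    # Collect maximal runs of the bursty state and their weights.
    bursts, start = [], None
    for t, st in enumerate(states + [0]):
        if st == 1 and start is None:
            start = t
        elif st == 0 and start is not None:
            weight = sum(fit_cost(0, relevant[i], total[i]) -
                         fit_cost(1, relevant[i], total[i])
                         for i in range(start, t))
            bursts.append((start, t - 1, weight))
            start = None
    return bursts
```

With monthly batches, the returned start and end indices map back to month boundaries, and the weight measures how much better the bursty state explains the counts than the base state over that interval.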


slide-35
SLIDE 35

Classifying synonyms into two types

Definition

Class A: time-independent
  Robust to change over time and good synonym candidates for an ordinary search (no temporal criteria provided)
  E.g., “Barack Hussein Obama II” is a time-independent synonym of “Barack Obama”

Class B: time-dependent
  Related to a particular time in the past and good synonym candidates for a temporal search where changes in semantics must be considered
  E.g., “Cardinal Joseph Ratzinger” is a time-dependent synonym of “Pope Benedict XVI” before 2005


slide-37
SLIDE 37

Ranking time-independent synonyms

Definition

Time-independent synonyms are weighted by a mixture model of a temporal feature and a frequency feature:

  $\mathrm{TIDP}(s_j) = \mu \cdot pf(s_j) + (1 - \mu) \cdot \overline{tf}(s_j)$

  $pf(s_j)$ is the time partition frequency, i.e., the number of time partitions in which $s_j$ occurs
  $\overline{tf}(s_j)$ is the averaged tf of $s_j$ over all time partitions: $\overline{tf}(s_j) = \frac{\sum_i tf(s_j, p_i)}{pf(s_j)}$
  $\mu$ underlines the relative importance of the temporal feature and the frequency feature; $\mu = 0.5$ yields the best performance in the experiments

Intuition

The model measures the popularity of synonyms based on two factors:
  Robustness to change over time, i.e., the more partitions a synonym occurs in, the more robust to time it is
  High usage over time, i.e., a high value of averaged frequencies over time
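The TIDP mixture can be computed directly from a synonym's term frequency in each time partition. The function name `tidp` and the list-of-counts input are assumptions for illustration; the score is left unnormalized, exactly as the formula is written on the slide.

```python
def tidp(tf_by_partition, mu=0.5):
    """Score one synonym from its term frequency in each time partition."""
    occupied = [tf for tf in tf_by_partition if tf > 0]
    pf = len(occupied)                 # number of partitions in which it occurs
    if pf == 0:
        return 0.0
    avg_tf = sum(occupied) / pf        # averaged tf over those partitions
    return mu * pf + (1 - mu) * avg_tf
```

For a synonym with frequencies `[3, 0, 5, 4]`, the partition frequency is 3 and the averaged tf is 4, so `tidp([3, 0, 5, 4])` evaluates to 0.5 · 3 + 0.5 · 4 = 3.5.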


slide-40
SLIDE 40

Ranking time-dependent synonyms

Definition

Given time $t_k$, time-dependent synonyms at $t_k$ are weighted by

  $\mathrm{TDP}(s_j, t_k) = tf(s_j, t_k)$

  $tf(s_j, t_k)$ is the term frequency of $s_j$ at $t_k$

Intuition

Only term frequencies are used to measure the importance of synonyms
Time partitions are not considered because only synonyms in a particular time period $t_k$ are of interest
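Ranking by TDP amounts to sorting an entity's synonyms by their term frequency at the requested time. The function name `rank_time_dependent`, the nested-dictionary input, and the counts used in the usage note are hypothetical.

```python
def rank_time_dependent(tf_at_time, t_k, top_k=3):
    """Rank synonyms of one entity by TDP(s_j, t_k) = tf(s_j, t_k)."""
    # Synonyms that do not occur at t_k get a frequency of zero.
    scores = {s: tfs.get(t_k, 0) for s, tfs in tf_at_time.items()}
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(s, tf) for s, tf in ranked if tf > 0][:top_k]
```

With made-up counts such as `{"Cardinal Ratzinger": {2004: 12}, "Pope Benedict XVI": {2006: 40}}`, a request for 2004 returns only “Cardinal Ratzinger”, because synonyms absent from that period are dropped.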


slide-43
SLIDE 43

Overview of experiments

Our experimental evaluation is divided into three main parts:

1. Extracting and improving the accuracy of time of synonyms
2. Query expansion using time-independent synonyms
3. Query expansion using time-dependent synonyms

slide-44
SLIDE 44

Extracting and improving time of synonyms

Data collection:
  The whole history of English Wikipedia
    All pages and revisions from 03/2001 to 03/2008: 85 snapshots (01/03/2001, 01/04/2001, ..., 01/03/2008), about 2.8 terabytes
    4 additional snapshots (24/05/2008, 27/07/2008, 08/10/2008, 06/03/2009)
  The New York Times Annotated Corpus contains over 1.8 million articles from January 1987 to June 2007
Tools:
  MWDumper http://www.mediawiki.org/wiki/Mwdumper
  Oracle Berkeley DB version 4.7.25
  Burst detection algorithm implemented by Kleinberg
    Number of states: 2
    Ratio of rate of second state to base state: 2
    Ratio of rate of each subsequent state to previous state: 2
    Gamma parameter of the HMM: 1
Measurement: Accuracy
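The snapshot count can be checked with a few lines that enumerate the first day of every month in the stated range. The helper name `monthly_snapshots` is an assumption; it only illustrates the month granularity and is not part of the tooling listed above.

```python
from datetime import date


def monthly_snapshots(start, end):
    """Return the first day of every month from `start` to `end`, inclusive."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append(date(year, month, 1))
        # Advance one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months
```

Calling `monthly_snapshots(date(2001, 3, 1), date(2008, 3, 1))` yields 85 dates, which matches the number of monthly snapshots reported for 03/2001 to 03/2008.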

slide-45
SLIDE 45

Query expansion using time-independent synonyms

Data collection: TREC Robust Track (2004)
  250 topics (topics 301-450 and topics 601-700)
Tools: Terrier, an open source search engine developed by the University of Glasgow
  BM25 probabilistic model with Generic Divergence From Randomness (DFR) weighting
  Expand the query with the top-k synonyms $\{s_1, \ldots, s_k\}$, using their TIDP scores as boosting weights:
    $q_{exp} = q_{org}\ \ s_1{}^{\wedge w_1}\ \ s_2{}^{\wedge w_2}\ \ldots\ s_k{}^{\wedge w_k}$
Measurement: Mean Average Precision (MAP), R-precision and Recall
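Building the expanded query string can be sketched as follows. The function name `expand_query`, the quoted-phrase form, and the two-decimal caret weights are illustrative assumptions; the exact query syntax accepted by Terrier should be checked against its documentation.

```python
def expand_query(original, scored_synonyms, k=2):
    """Append the top-k synonyms to the query, each boosted by its score."""
    # Sort by descending score, breaking ties alphabetically.
    top = sorted(scored_synonyms.items(), key=lambda item: (-item[1], item[0]))[:k]
    boosted = ['"%s"^%.2f' % (syn, weight) for syn, weight in top]
    return " ".join([original] + boosted)
```

With hypothetical scores, `expand_query("Barack Obama", {"Barack Hussein Obama II": 3.5, "Senator Barack Obama": 2.0})` produces `Barack Obama "Barack Hussein Obama II"^3.50 "Senator Barack Obama"^2.00`.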

slide-46
SLIDE 46

Query expansion using time-dependent synonyms

Data collection: NewsLibrary.com contains more than 182 million newspaper articles from thousands of credible U.S. publications
  Select 20 strongly time-dependent queries
Measurement: Precision at 10, 20 and 30 retrieved documents

Examples of temporal queries

Temporal Query: Named Entity     Temporal Query: Time Period    Synonym
American Broadcasting Company    1995-2000                      Disney/ABC
Barack Obama                     2005-2007                      Senator Obama
Eminem                           1999-2004                      Slim Shady
George H. W. Bush                1988-1992                      President George H.W. Bush
George W. Bush                   2000-2007                      President George W. Bush
Hillary Rodham Clinton           2001-2007                      Senator Clinton
Kmart                            1987-1987                      Kresge
Pope Benedict XVI                1988-2005                      Cardinal Ratzinger
Ronald Reagan                    1987-1989                      Reagan Revolution
Virgin Media                     1999-2002                      Telewest Communications
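Expanding a temporal query means selecting the synonyms whose time period overlaps the query period and adding them to the keyword part. The function names `synonyms_at` and `temporal_query`, the year-level granularity, and the `OR`-joined output are assumptions made for this sketch.

```python
def synonyms_at(periods, entity, start_year, end_year):
    """Return synonyms of `entity` whose period overlaps [start_year, end_year]."""
    hits = []
    for (ent, syn), (first, last) in periods.items():
        # Two closed intervals overlap when each starts before the other ends.
        if ent == entity and first <= end_year and last >= start_year:
            hits.append(syn)
    return sorted(hits)


def temporal_query(entity, start_year, end_year, periods):
    """Build the expanded keyword part plus the unchanged time constraint."""
    terms = [entity] + synonyms_at(periods, entity, start_year, end_year)
    return " OR ".join('"%s"' % t for t in terms), (start_year, end_year)
```

Using the period 1988-2005 for “Cardinal Ratzinger” from the table, a query for “Pope Benedict XVI” restricted to 1990-2004 would be expanded with that synonym, while a query for an entity with no overlapping synonym is left unchanged.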


slide-48
SLIDE 48

Extracting and improving time of synonyms

Statistics and accuracy of entity-synonym relationships extracted from Wikipedia

NER Method    #NE          #NE-Syn.     Avg. Syn. per NE    Accuracy (%)
BPF-NERW      2,574,319    3,199,115    1.2                 51
BPCF-NERW     473,829      488,383      1.0                 73

BPF-NERW: Bunescu and Paşca’s Named Entity Recognition of Wikipedia titles with Filtering criteria: 1) time interval < 6 months, and 2) average frequency < 2
BPCF-NERW: BPF-NERW with only the Categories of “people”, “organization” or “company”
Note: 500 randomly selected entity-synonym relationships were used for assessing the accuracy of time periods

slide-49
SLIDE 49

Robust2004 query statistics

Two methods for recognizing named entities in queries:

1. Exactly matched Wikipedia page (MW-NERQ)
2. Exactly matched Wikipedia page and top-k related Wikipedia pages (MRW-NERQ)
     k = 2; values of k > 2 bring noise into the NERQ process

Number of queries using the two different NER methods

Type                MW-NERQ    MRW-NERQ
Named entity        42         149
Not named entity    208        101
Total               250        250

slide-50
SLIDE 50

Query expansion using time-independent synonyms

MAP, R-precision, and Recall (* indicates statistically significant at p < 0.05)

            MW-NERQ                           MRW-NERQ
Method      MAP       R-precision   Recall    MAP      R-precision   Recall
PM          .2889     .3309         .6185     .2455    .2904         .5629
PRF         .3469     .3711         .6944     .3002    .3227         .6761
SQE-PRF     .3608*    .3652         .7405*    .2507    .2665         .5932
SWQE-PRF    .3653*    .3861*        .7388     .2885    .3080         .6504

PM: Probabilistic Model without query expansion
PRF: Pseudo Relevance Feedback using the Rocchio algorithm
SQE-PRF: Top-k Synonyms Query Expansion with Pseudo Relevance Feedback
SWQE-PRF: Top-k Synonyms TIDP-Weighted Query Expansion with Pseudo Relevance Feedback
Note: 40 expansion terms, top-10 retrieved documents, DFR term weighting model, i.e., Bose-Einstein 1
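For readers unfamiliar with the measures in the table, the per-query quantities can be sketched as below; MAP is the mean of `average_precision` over all queries. The function names and the set-based relevance judgments are assumptions, and this is the textbook definition rather than the evaluation tool used in the experiments.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def r_precision(ranked, relevant):
    """Precision within the top R results, where R = number of relevant docs."""
    r = len(relevant)
    return sum(1 for doc in ranked[:r] if doc in relevant) / r if r else 0.0


def recall(ranked, relevant):
    """Fraction of the relevant documents that were retrieved at all."""
    return len(set(ranked) & relevant) / len(relevant) if relevant else 0.0
```

For a ranking `["d1", "d2", "d3", "d4"]` with relevant set `{"d1", "d3", "d9"}`, average precision is (1/1 + 2/3) / 3 = 5/9, and both R-precision and recall are 2/3.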

slide-51
SLIDE 51

Query expansion using time-dependent synonyms

P@10, P@20 and P@30 (* indicates statistically significant at p < 0.05)

Method    P@10      P@20      P@30
TQ        .1000     .0500     .0333
TSQ       .5200*    .3800*    .2800*

TQ: search a Temporal Query, i.e., a keyword $w_q$ and time $t_q$
TSQ: search a Temporal Query and expand with Synonyms with respect to time $t_q$
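Precision at k, the measure reported in the table, is simply the share of relevant documents among the first k results. The function name `precision_at_k` and the set-based judgments are assumptions for this short sketch.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k
```

Read this way, a P@10 of .1000 corresponds to one relevant document among the top 10 on average, and .5200 to about five.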

slide-52
SLIDE 52

Introduction Synonym Detection Query Expansion Evaluation Conclusions Experiment Setting Experimental Results

QUEST: Query Expansion using Synonyms over Time



slide-54
SLIDE 54


Conclusions and future work

- Extract time-based synonyms from Wikipedia
- Improve time of synonyms using NYT
- Perform query expansion using the time-based synonyms
- Conduct extensive experiments showing significant increase in retrieval effectiveness

Future work:

- Combine time-dependent synonyms and temporal language models to determine time of queries
- Exploit temporal information extraction techniques to discover synonyms at particular time points
- Improve temporal text mining/clustering using time-based synonyms



slide-61
SLIDE 61


QUEST: Query Expansion using Synonyms over Time
http://research.idi.ntnu.no/wislab/quest/

Thank you!
