
August 6-10, 2012 The 6th Russian Summer School in IR (RuSSIR'2012) 1

Web Dynamics

Analysis and Implications from Search Perspective

1

Lecturers

  • Dr. Ismail Sengör Altingövde

– Senior researcher
– L3S Research Center
– Hannover, Germany

  • Dr. Nattiya Kanhabua

– Postdoctoral researcher
– L3S Research Center
– Hannover, Germany

2

L3S Research Center

  • Computer science and interdisciplinary research on all aspects related to the Web

3

People

  • 80 researchers: 53% post-docs & PhD students,

38% scientific research assistants & int’l interns

  • Researchers from 15 countries: US and Latin America, Asia, Europe, and Africa

  • Equal opportunity: 27% females

4


Full project inventory: www.l3s.de

5

L3S: Some Projects

  • Cubrik: Human-enhanced time-aware multimedia search
  • ARCOMEM: From collect-all archives to community memories
  • Terence: Adaptive learning for poor comprehenders
  • M-Eco: The medical ecosystem detecting disease outbreaks

6

Wanted!

7

Outline

Day 1: Introduction

  • Evolution of the Web
  • Overview of research topics
  • Content and query analysis

Day 2: Evolution of Web Search Results

  • Short-term impacts
  • Longitudinal analysis

Day 3: Indexing the Past

  • Indexing and searching versioned documents

Day 4: Retrieval and Ranking

  • Searching the past
  • Searching the future

8


Evolution of the Web

  • The Web is changing over time in many aspects:

– Size: web pages are added and deleted all the time
– Content: web pages are edited and modified
– Query: users’ information needs change; entity relationships change over time
– Usage: users’ behaviors change over time

[Ke et al., CN 2006; Risvik et al., CN 2002] [Dumais, SIAM-SDM 2012; WebDyn 2010]

9

Size dynamics

  • Challenge

– Crawling, indexing, and caching

10

Web and index sizes

Timeline, 1995-2012:

– 2000: First billion-URL index, the world’s largest (≈5,000 PCs in clusters)
– 2004: Index grows to 4.2 billion pages
– 2008: Google counts 1 trillion unique URLs

Current estimates: http://www.worldwidewebsize.com/

15

Content dynamics

  • Challenge

– Document representation and retrieval

WayBack Machine: a web archive search tool by the Internet Archive

16



Query dynamics

  • Challenge

– Time-sensitive queries
– Query understanding and processing

Google Insights for Search: http://www.google.com/insights/search/ Query: Halloween

18

Usage dynamics

  • Challenge

– Browsing and search behavior
– User preference (e.g., “likes”)

19




Example application

  • Searching documents created/edited over time

– E.g., web archives, news archives, blogs, or emails: “temporal document collections”

Example: retrieve documents about Pope Benedict XVI written before 2005. Term-based IR approaches may give unsatisfactory results.

22




Wayback Machine

  • A web archive search tool by the Internet Archive

– Query by a URL, e.g., http://www.ntnu.no

Limitations: no keyword query, no relevance ranking

(Screenshots retrieved on 15 January 2011)

25


Google News Archive Search

  • A news archive search tool by Google

– Query by keywords
– Rank results by relevance or date

Limitation: does not consider terminology changes over time

27

Research topics

  • Content Analysis

– Determining timestamps of documents
– Temporal information extraction

  • Query Analysis

– Determining time of queries
– Named entity evolution
– Query performance prediction

  • Evolution of Search Results

– Short-term impacts on result caches
– Longitudinal analysis of search results

28


Research topics (cont’d)

  • Indexing

– Indexing and query processing techniques for the versioned document collections

  • Retrieval and Ranking

– Searching the past
– Searching the future

29

Content Analysis

(1) Determining timestamps of documents
(2) Temporal information extraction

Motivation

  • Incorporating the time dimension into

search can increase retrieval effectiveness

– Only if temporal information is available

  • Research question

– How to determine the temporal information of documents?

[Kanhabua, SIGIR Forum 2012]

31



Two time aspects

  • 1. Publication or modified time

– Problem: determining timestamps of documents
– Method: rule-based techniques or temporal language models

  • 2. Content or event time

– Problem: temporal information extraction
– Method: natural language processing, or time and event recognition algorithms

(Figure: content time vs. publication time)

33



Problem Statements

Determining time of documents

  • Difficult to find a trustworthy time for web documents

– Time gap between crawling and indexing
– Decentralization and relocation of web documents
– No standard metadata for time/date

User: “I found a bible-like document, but I have no idea when it was created.”
System: “Let me see… This document was probably written around 850 A.D., with 95% confidence.”

“For a given document with uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence?”

36


Current approaches

  • 1. Content-based
  • 2. Link-based
  • 3. Hybrid

37

Content-based approach

Temporal Language Models

  • Based on the statistical usage of words over time
  • Compare each word of a non-timestamped document with a reference corpus
  • Tentative timestamp: the time partition that mostly overlaps in word usage

Example reference corpus (partition, word, frequency):

Partition  Word        Freq
1999       tsunami     1
1999       Japan       1
1999       tidal wave  1
2004       tsunami     1
2004       Thailand    1
2004       earthquake  1

A non-timestamped document contains: tsunami, Thailand

Similarity scores: Score(1999) = 1; Score(2004) = 1 + 1 = 2. The most likely timestamp is 2004.

[de Jong et al., AHC 2005]

42
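The toy example above can be sketched in a few lines. This is an illustrative sketch of word-overlap scoring against partitioned corpus statistics, not the authors' implementation; the corpus data and function name are invented for the example.

```python
# Reference corpus from the slide: time partition -> word frequencies
corpus = {
    1999: {"tsunami": 1, "japan": 1, "tidal wave": 1},
    2004: {"tsunami": 1, "thailand": 1, "earthquake": 1},
}

def date_document(words, corpus):
    # Score each partition by the frequency of overlapping words,
    # then pick the partition with the highest overlap as the timestamp.
    scores = {p: sum(freqs.get(w, 0) for w in words) for p, freqs in corpus.items()}
    best = max(scores, key=scores.get)
    return best, scores

best, scores = date_document(["tsunami", "thailand"], corpus)
```

With the slide's document {tsunami, Thailand}, the 2004 partition matches both words while 1999 matches only one, reproducing the example's conclusion.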

Normalized log-likelihood ratio

  • Variant of the Kullback-Leibler divergence
  • Measures the similarity of a document and time partitions
  • C is the background model estimated on the whole corpus
  • Linear interpolation smoothing avoids zero probabilities for unseen words

One common formulation (the slide's formula image did not survive extraction) is:

NLLR(d, p) = Σ_{w ∈ d} P(w|d) · log( P(w|p) / P(w|C) )

[Kraaij, SIGIR Forum 2005]

43
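A minimal sketch of that scoring function, assuming the NLLR formulation given above with linear interpolation smoothing; the language models and the smoothing weight `lam` are toy values, not from the cited work.

```python
import math

def nllr(doc_lm, partition_lm, background_lm, lam=0.9):
    """Normalized log-likelihood ratio between a document model and a
    time-partition model, smoothing the partition model against the
    background model C by linear interpolation."""
    score = 0.0
    for w, p_w_d in doc_lm.items():
        # Smoothed partition probability avoids zero for unseen words
        p_w_p = lam * partition_lm.get(w, 0.0) + (1 - lam) * background_lm.get(w, 0.0)
        p_w_c = background_lm.get(w, 0.0)
        if p_w_p > 0 and p_w_c > 0:
            score += p_w_d * math.log(p_w_p / p_w_c)
    return score

background = {"tsunami": 0.3, "thailand": 0.2, "japan": 0.2, "earthquake": 0.3}
p1999 = {"tsunami": 0.4, "japan": 0.6}
p2004 = {"tsunami": 0.4, "thailand": 0.3, "earthquake": 0.3}
doc = {"tsunami": 0.5, "thailand": 0.5}
```

For this document the 2004 partition scores higher than 1999, since "thailand" is unseen in the 1999 model and is penalized through smoothing.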

Improving temporal LMs

  • Enhancement techniques

1. Semantic-based data preprocessing
– Intuition: direct comparison between extracted words and corpus partitions has limited accuracy
– Approach: integrate semantic-based techniques into document preprocessing

2. Search statistics to enhance similarity scores
– Intuition: search statistics from Google Zeitgeist (GZ) can increase the probability of a tentative time partition
– Approach: linearly combine a GZ score with the normalized log-likelihood ratio

3. Temporal entropy as term weights
– Intuition: a term's weight depends on how good the term is at separating time partitions (discriminativeness)
– Approach: temporal entropy, based on the term selection measure of Lochbaum and Streeter

[Kanhabua et al., ECDL 2008]

46

Semantic-based preprocessing

Intuition: Direct comparison between extracted words and corpus partitions has limited accuracy Approach: Integrate semantic-based techniques into document preprocessing

Semantic-based preprocessing techniques:

– Part-of-speech tagging: select only interesting classes of words, e.g., nouns, verbs, and adjectives
– Collocation extraction: co-occurrence of different words can alter the meaning, e.g., “United States”
– Word sense disambiguation: identify the correct sense of a word from context, e.g., “bank”
– Concept extraction: compare concepts instead of original words, e.g., “tsunami” and “tidal wave” share the common concept “disaster”
– Word filtering: select the top-ranked words according to TF-IDF scores for comparison

[Kanhabua et al., ECDL 2008]

47

Leveraging search statistics

Intuition: search statistics from Google Zeitgeist (GZ) can increase the probability of a tentative time partition
Approach: linearly combine a GZ score with the normalized log-likelihood ratio

– P(wi) is the probability that wi occurs: P(wi) = 1.0 for a gaining query, P(wi) = 0.5 for a declining query
– f(R) converts a rank into a weight; a higher-ranked query is more important
– An inverse partition frequency: ipf = log(N/n)

[Kanhabua et al., ECDL 2008]

50

Temporal entropy

Intuition: a term's weight depends on how good the term is at separating time partitions (discriminative)
Approach: temporal entropy, based on the term selection measure of Lochbaum and Streeter

A measure of the temporal information that a word conveys. It captures the importance of a term in a document collection, whereas TF-IDF weights a term in a particular document. It tells how good a term is at separating one partition from the others: a term occurring in few partitions has higher temporal entropy than one appearing in many partitions, and the higher a term's temporal entropy, the better it represents a partition.

One common formulation (the slide's formula image did not survive extraction) is:

TE(wi) = 1 + (1 / log Np) · Σ_p P(p|wi) · log P(p|wi)

where P(p|wi) is the probability that partition p contains term wi, and Np is the total number of partitions in the corpus.

[Kanhabua et al., ECDL 2008]

54
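A small sketch of the temporal-entropy formulation given above, assuming P(p|wi) is estimated from the term's frequency per partition; the toy data is invented for illustration.

```python
import math

def temporal_entropy(term_freq_by_partition, num_partitions):
    """TE(w) = 1 + (1 / log Np) * sum_p P(p|w) * log P(p|w).
    term_freq_by_partition maps partition -> frequency of the term there."""
    total = sum(term_freq_by_partition.values())
    te = 1.0
    for tf in term_freq_by_partition.values():
        p = tf / total  # estimate of P(p|w)
        te += p * math.log(p) / math.log(num_partitions)
    return te

# A term concentrated in one partition is maximally discriminative (TE = 1)...
focused = temporal_entropy({"2004": 10}, num_partitions=10)
# ...while a term spread evenly over all partitions is not (TE = 0).
spread = temporal_entropy({str(y): 1 for y in range(10)}, num_partitions=10)
```

The two extremes match the slide's intuition: few partitions means high temporal entropy, many partitions means low.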

Link-based approach

  • Dating a document using its neighbors

1. Web pages linking to the document

  • Incoming links

2. Web pages pointed by the document

  • Outgoing links

3. Media assets associated with the document

  • E.g., images
  • Average the last-modified dates of its neighbors to estimate the timestamp

[Nunes et al., WIDM 2007]

55
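The averaging step above can be sketched as follows. This is an illustrative sketch of neighbor-date averaging, assuming the neighbor dates have already been collected; the dates are hypothetical.

```python
from datetime import date

def estimate_timestamp(neighbor_dates):
    """Link-based dating sketch: average the last-modified dates of a page's
    neighbors (in-links, out-links, media assets) to estimate its timestamp."""
    ordinals = [d.toordinal() for d in neighbor_dates]
    return date.fromordinal(sum(ordinals) // len(ordinals))

# Hypothetical last-modified dates of three neighboring pages
neighbors = [date(2007, 1, 10), date(2007, 3, 10), date(2007, 2, 7)]
estimate = estimate_timestamp(neighbors)  # -> 2007-02-08
```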

Hybrid approach

  • Inferring timestamps using machine

learning

– Exploit links, the contents of a web page, and its neighbors
– Features: linguistic, position, page format, and tags

[Chen et al., SIGIR 2010]

56


Temporal information extraction

  • Three types of temporal expressions
  • 1. Explicit: time mentions being mapped directly to

a time point or interval, e.g., “July 4, 2012”

  • 2. Implicit: imprecise time point or interval, e.g.,

“Independence Day 2012”

  • 3. Relative: resolved to a time point or interval

using other types or the publication date, e.g., “next month”

[Alonso et al., SIGIR Forum 2007]

57

Current approaches

  • Extract temporal expressions from

unstructured text

– Time and event recognition algorithms

  • Harvest temporal knowledge from semi-

structured contents like Wikipedia infoboxes

– Rule-based methods

[Verhagen et al., ACL 2005; Strötgen et al., SemEval 2010] [Hoffart et al., AIJ 2012]

58

References

  • [Alonso et al., SIGIR Forum 2007] Omar Alonso, Michael Gertz, Ricardo A. Baeza-Yates: On the

value of temporal information in information retrieval. SIGIR Forum 41(2): 35-41 (2007)

  • [Chen et al., SIGIR 2010] Zhumin Chen, Jun Ma, Chaoran Cui, Hongxing Rui, Shaomang Huang: Web

page publication time detection and its application for page rank. SIGIR 2010: 859-860

  • [Dumais, SIAM-SDM 2012] Susan T. Dumais: Temporal Dynamics and Information Retrieval. SIAM-

SDM 2012

  • [de Jong et al., AHC 2005] Franciska de Jong, Henning Rode, Djoerd Hiemstra: Temporal language

models for the disclosure of historical text. AHC 2005: 161-168

  • [Kanhabua et al., ECDL 2008] Nattiya Kanhabua, Kjetil Nørvåg: Improving Temporal Language

Models for Determining Time of Non-timestamped Documents. ECDL 2008: 358-370

  • [Kanhabua, SIGIR Forum 2012] Nattiya Kanhabua: Time-aware approaches to information retrieval.

SIGIR Forum 46(1): 85 (2012)

  • [Ke et al., CN 2006] Yiping Ke, Lin Deng, Wilfred Ng, Dik Lun Lee: Web dynamics and their

ramifications for the development of Web search engines. Computer Networks 50(10): 1430-1447 (2006)

  • [Kraaij, SIGIR Forum 2005] Wessel Kraaij: Variations on language modeling for information retrieval.

SIGIR Forum 39(1): 61 (2005)

  • [Nunes et al., WIDM 2007] Sérgio Nunes, Cristina Ribeiro, Gabriel David: Using neighbors to date web

documents. WIDM 2007: 129-136

  • [Risvik et al., CN 2002] Knut Magne Risvik, Rolf Michelsen: Search engines and Web dynamics.

Computer Networks 39(3): 289-302 (2002)

59

References (cont’d)


  • [Strötgen et al., SemEval 2010] Jannik Strötgen, Michael Gertz: Heideltime: High quality rule-based

extraction and normalization of temporal expressions. SemEval 2010: 321-324

  • [WebDyn 2010] Web Dynamics course: http://www.mpi-inf.mpg.de/departments/d5/teaching/ss10/dyn/,

Max-Planck Institute for Informatics, Saarbrücken, Germany, 2010

  • [Verhagen et al., ACL 2005] Marc Verhagen, Inderjeet Mani, Roser Sauri, Jessica Littman, Robert

Knippen, Seok Bae Jang, Anna Rumshisky, John Phillips, James Pustejovsky: Automating Temporal Annotation with TARSQI. ACL 2005

60


Questions?

61

Query Analysis

(1) Determining time of queries
(2) Named entity evolution
(3) Query performance prediction

Temporal queries

  • Temporal information needs

– Searching temporal document collections

  • E.g., digital libraries, web/news archives

– Historians, librarians, journalists or students

  • Temporal queries exist in both standard

collections and the Web

– Relevancy is dependent on time
– Documents are about events at a particular time

[Berberich et al., ECIR 2010]

63

Distribution of relevant documents (qrels) over time

Query classes by temporal distribution: recency queries, time-sensitive queries, time-insensitive queries

[Li et al., CIKM 2003]

64


Types of temporal queries

  • Two types of temporal queries
  • 1. Explicit: time is provided, e.g., "Presidential election 2012"
  • 2. Implicit: time is not provided, e.g., "Germany World Cup"

– The temporal intent can be inferred implicitly, i.e., the query refers to the World Cup event in 2006

  • Studies of web search query logs show a significant fraction of temporal queries

– 1.5% of web queries are explicit
– ~7% of web queries are implicit
– 13.8% of queries contain explicit time, and 17.1% of queries have an implicitly provided temporal intent

[Nunes et al., ECIR 2008; Metzler et al., SIGIR 2009; Zhang et al., EMNLP 2010]

65

Challenges with temporal queries

  • Semantic gaps: lacking knowledge about

1. Possibly relevant time of queries
2. Terminology / named entity changes over time

query → time1, time2, …, timek  (suggested times)

69

Determining time of queries

  • Problem statements

– Implicit temporal queries: users have no knowledge about the relevant time of a query
– Difficult to achieve high accuracy using only keywords
– Relevant results associated with a particular time are not given

  • Research question

– How to determine the time of an implicit temporal query and use the determined time for improving search results?

70

Current approaches

  • 1. Query log analysis
  • 2. Search result analysis

[Kanhabua et al., ECDL 2010]

71

Query log analysis

  • Mining query logs

– Analyze query frequencies over time to identify the relevant time of queries
– Re-rank search results of implicit temporal queries using the determined time

[Metzler et al., SIGIR 2009; Zhang et al., EMNLP 2010]

72


Search result analysis

  • Using temporal language models

– Determine the time of queries when no time is given explicitly
– Re-rank search results using the determined time

  • Exploiting time from search snippets

– Extract temporal expressions (i.e., years) from the contents of the top-k retrieved web snippets for a given query
– A content-based, language-independent approach

[Kanhabua et al., ECDL 2010; Campos et al., TempWeb 2012]

73
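The snippet-based idea above can be sketched as follows: extract year mentions from the top-k snippets and rank years by frequency. This is a simplified, hypothetical sketch; the snippets and function name are invented for the example.

```python
import re
from collections import Counter

def query_time_from_snippets(snippets):
    """Determine an implicit query's time from the top-k search snippets:
    collect year mentions and return them ordered by frequency."""
    years = Counter()
    for s in snippets:
        years.update(re.findall(r"\b(19\d{2}|20\d{2})\b", s))
    return [int(y) for y, _ in years.most_common()]

# Hypothetical top-k snippets for the query "Germany World Cup"
snippets = [
    "Germany hosted the FIFA World Cup in 2006.",
    "The 2006 World Cup final was played in Berlin.",
    "Germany reached the semi-final in 2010.",
]
times = query_time_from_snippets(snippets)  # most frequent year first
```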

Determining time of queries

  • Approach I. Dating using keywords*
  • Approach II. Dating using top-k documents*

– Queries are short keywords
– Inspired by pseudo-relevance feedback

  • Approach III. Using timestamps of top-k documents

– No temporal language models are used

*Using the temporal language models proposed by de Jong et al.

[Kanhabua et al., ECDL 2010]

74

  • I. Dating using keywords

(Figure: query's temporal profiles)

[Kanhabua et al., ECDL 2010]

76


  • II. Dating using top-k documents

(Figure: query's temporal profiles)

[Kanhabua et al., ECDL 2010]

78

  • III. Using timestamps of documents

(Figure: query's temporal profiles)

[Kanhabua et al., ECDL 2010]

80


Re-ranking search results

  • Intuition: documents published close to the time of the query are more relevant

– Assign document priors based on publication dates

Example: a query against a news archive, with determined times 2005, 2004, 2006, … A document from 2009 ranks high in the initial results, but after re-ranking a document from 2005 is promoted.

[Kanhabua et al., ECDL 2010]

82
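The re-ranking step above can be sketched with a time-based document prior. The exponential-decay prior is an illustrative choice, not the cited paper's model; the result list and scores are hypothetical.

```python
def rerank(results, query_times, decay=0.5):
    """Re-rank results by multiplying each retrieval score with a prior
    that favors documents published close to the determined query times."""
    def prior(pub_year):
        gap = min(abs(pub_year - t) for t in query_times)
        return decay ** gap  # prior decays with distance from the query time

    return sorted(results, key=lambda r: r["score"] * prior(r["year"]), reverse=True)

# Hypothetical initial results: a recent document outranks an on-topic older one
results = [
    {"doc": "D2009", "year": 2009, "score": 0.9},
    {"doc": "D2005", "year": 2005, "score": 0.8},
]
reranked = rerank(results, query_times=[2005, 2004, 2006])
```

After re-ranking, the 2005 document moves to the top, matching the slide's example.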

Challenges of temporal search

  • Semantic gaps: lacking knowledge about

1. Possibly relevant time of queries
2. Named entity changes over time

query → synonym@2001, synonym@2002, …, synonym@2011  (suggested synonyms)

83




Named entity evolution

Problem Statements

  • Queries of named entities (people, companies, places)

– Highly dynamic in appearance, i.e., relationships between terms change over time
– E.g., changes of roles, name alterations, or semantic shift

Scenario 1: Query "Pope Benedict XVI", written before 2005. Documents about "Joseph Alois Ratzinger" are relevant.

Scenario 2: Query "Hillary R. Clinton", written from 1997 to 2002. Documents about "New York Senator" and "First Lady of the United States" are relevant.

86

Research question

  • How to detect named entity changes in web

documents?

Named entity evolution

87

QUEST Demo: http://research.idi.ntnu.no/wislab/quest/

88


Current approaches

  • 1. Temporal co-occurrence
  • 2. Temporal association rule mining
  • 3. Temporal knowledge extraction

– Ontology
– Wikipedia history

89

Temporal co-occurrence

  • Temporal co-occurrence

– Measure the degree of relatedness of two entities at different times by comparing term contexts
– Requires a recurrent computation at query time, which reduces efficiency and scalability

[Berberich et al., WebDB 2009]

90

Association rule mining

  • Temporal association rule mining

– Discover semantically identical concepts (or named entities) that are used at different times
– Two entities are semantically related if their associated events occur multiple times in a collection
– Events are represented as sentences containing a subject, a verb, objects, and nouns

[Kaluarachchi et al., CIKM 2010]

91

Temporal knowledge extraction

  • YAGO ontology

– Extract named entities from the YAGO ontology
– Track named entity evolution using the New York Times Annotated Corpus

  • Wikipedia history

– Define a time-based synonym as a term semantically related to a named entity at a particular time period
– Extract synonyms of named entities from anchor texts in article links, using the whole history of Wikipedia

[Mazeika et al., CIKM 2011; Kanhabua et al., JCDL 2010]

92


Searching with name changes

  • Extract time-based synonyms from Wikipedia

– Synonyms are words with similar meanings
– In this context, synonyms are name variants (name changes, titles, or roles) of a named entity

  • E.g., "Cardinal Joseph Ratzinger" is a synonym of

"Pope Benedict XVI" before 2005

  • Two types of time-based synonyms

1. Time-independent
2. Time-dependent

[Kanhabua et al., JCDL 2010]

93

Recognize named entities

[Kanhabua et al., JCDL 2010]

96


Find synonyms

  • Find a set of entity-synonym relationships at time tk
  • For each ei ∈ Etk, extract anchor texts from article links:

– Entity: President_of_the_United_States
– Synonym: George W. Bush
– Time: 11/2004

Other anchor texts pointing to President_of_the_United_States include "George W. Bush", "President George W. Bush", and "President Bush (43)".

[Kanhabua et al., JCDL 2010]

100
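The extraction step above can be sketched as collecting (entity, anchor text, time) triples into a synonym index. This is an illustrative sketch in the spirit of the Wikipedia anchor-text approach; the link data is hypothetical.

```python
from collections import defaultdict

def extract_synonyms(anchor_links):
    """Group anchor texts by the entity page they link to, keeping the
    time of each link as the time of the entity-synonym relationship."""
    synonyms = defaultdict(set)
    for entity, anchor, time in anchor_links:
        synonyms[entity].add((anchor, time))
    return synonyms

# Hypothetical (target entity, anchor text, timestamp) triples from article links
links = [
    ("President_of_the_United_States", "George W. Bush", "11/2004"),
    ("President_of_the_United_States", "President Bush (43)", "11/2004"),
    ("Pope_Benedict_XVI", "Cardinal Joseph Ratzinger", "03/2005"),
]
syn = extract_synonyms(links)
```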


Initial results

  • Time periods are not accurate

Note: the time periods of synonyms are the timestamps of Wikipedia articles (an 8-year span)

[Kanhabua et al., JCDL 2010]

101

Enhancement using NYT

  • Analyze the NYT Annotated Corpus to discover accurate time periods

– 20-year time span (1987-2007)

  • Use a burst detection algorithm

– Time periods of synonyms = burst intervals

(Figure: burst intervals compared with the initial results)

[Kanhabua et al., JCDL 2010]

104


Query expansion

  • 1. A user enters an entity as a query
  • 2. The system retrieves synonyms with respect to the query
  • 3. The user selects synonyms to expand the query

QUEST Demo: http://research.idi.ntnu.no/wislab/quest/

[Kanhabua et al., ECML PKDD 2010]

110

Query prediction problems

  • 1. Performance prediction

– Predict the retrieval effectiveness wrt. a ranking model (query → predict → precision = ? recall = ? MAP = ?)

  • 2. Retrieval model prediction

– Predict the retrieval model that is most suitable (query → predict → the ranking that maximizes precision/recall/MAP)

114

Query performance prediction

  • Problem statement

– Predict the effectiveness (e.g., MAP) that a query will achieve in advance of, or during, retrieval
– high MAP → "good"; low MAP → "poor"

  • Objective

– Apply query enhancement techniques to improve the overall performance
– Query suggestion is applied for "poor" queries

[Hauff et al., CIKM 2008; Hauff et al., ECIR 2010; Carmel et al., 2010]

115

Temporal query performance prediction

  • First study of performance prediction for temporal queries

– Propose 10 time-based pre-retrieval predictors

  • Both text and time are considered
  • Experiment

– Collection: NYT Corpus and 40 temporal queries

  • Results

– Time-based predictors outperform keyword-based predictors
– Combined predictors outperform single predictors in most cases

  • Open issue

– Consider time uncertainty

[Kanhabua et al., SIGIR 2011]

116


Time-aware ranking prediction

  • Problem statement

– Two time aspects: publication time and content time

  • Content time = temporal expressions mentioned in documents

– Difference in effectiveness for temporal queries when ranking using publication time or content time

[Kanhabua et al., SIGIR 2012]

117

Learning to select time-aware ranking

  • First study of the impact on retrieval effectiveness of ranking models using two time aspects
  • Three features from analyzing top-k results

– Temporal KL-divergence [Diaz et al., SIGIR 2004]
– Content Clarity [Cronen-Townsend et al., SIGIR 2002]
– Divergence of retrieval scores [Peng et al., ECIR 2010]

[Kanhabua et al., SIGIR 2012]

118

Temporal KL-divergence

  • Measure the difference between the distribution over time of the top-k retrieved documents of q and the collection

– Consider both time dimensions

[Diaz et al., SIGIR 2004]

119

Content Clarity

  • The content clarity is measured by the Kullback-Leibler (KL) divergence between the distribution of terms of the retrieved documents and the background collection

[Cronen-Townsend et al., SIGIR 2002]

120
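Both the temporal KL-divergence and the clarity feature reduce to a KL divergence between a distribution estimated from the top-k results and a background distribution from the collection. A minimal sketch, assuming both distributions are given over a shared (hypothetical) vocabulary and smoothed with a small epsilon:

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over a shared vocabulary; eps smooths zero
    probabilities in the background model q."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Term distribution of the top-k results vs. the whole collection
# (hypothetical 4-term vocabulary). A peaked query model diverging
# from a flat background yields a high clarity score.
top_k = [0.7, 0.2, 0.05, 0.05]
collection = [0.25, 0.25, 0.25, 0.25]
clarity = kl_divergence(top_k, collection)
print(round(clarity, 3))  # -> 0.515
```

For the temporal variant, the same computation is done over distributions of publication dates (or content times) instead of terms.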


Divergence of ranking scores

  • Measure the divergence of scores from the base ranking, e.g., a non-time-aware ranking model

– To determine the extent to which a ranking model alters the scores of the initial ranking

  • Features

1. averaged scores of the base ranking
2. averaged scores of PT-Rank
3. averaged scores of CT-Rank
4. divergence from the base ranking model

[Peng et al., ECIR 2010]

121

Discussion

  • Results

– A small number of top-k documents achieves better performance
– The larger k is, the more irrelevant documents are introduced into the analysis

  • Open issue

– When compared with the optimal case, there is still room for further improvement

[Kanhabua et al., SIGIR 2012]

122

References

  • [Berberich et al., WebDB 2009] Klaus Berberich, Srikanta J. Bedathur, Mauro Sozio, Gerhard Weikum: Bridging the Terminology Gap in Web Archive Search. WebDB 2009
  • [Berberich et al., ECIR 2010] Klaus Berberich, Srikanta J. Bedathur, Omar Alonso, Gerhard Weikum: A Language Modeling Approach for Temporal Information Needs. ECIR 2010: 13-25
  • [Campos et al., TempWeb 2012] Ricardo Campos, Gaël Dias, Alípio Jorge, Celia Nunes: Enriching temporal query understanding through date identification: How to tag implicit temporal queries? TWAW 2012: 41-48
  • [Carmel et al., 2010] David Carmel, Elad Yom-Tov: Estimating the Query Difficulty for Information Retrieval. Morgan & Claypool Publishers 2010
  • [Cronen-Townsend et al., SIGIR 2002] Stephen Cronen-Townsend, Yun Zhou, W. Bruce Croft: Predicting query performance. SIGIR 2002: 299-306
  • [Diaz et al., SIGIR 2004] Fernando Diaz, Rosie Jones: Using temporal profiles of queries for precision prediction. SIGIR 2004: 18-24
  • [Hauff et al., CIKM 2008] Claudia Hauff, Vanessa Murdock, Ricardo A. Baeza-Yates: Improved query difficulty prediction for the web. CIKM 2008: 439-448
  • [Hauff et al., ECIR 2010] Claudia Hauff, Leif Azzopardi, Djoerd Hiemstra, Franciska de Jong: Query Performance Prediction: Evaluation Contrasted with Effectiveness. ECIR 2010: 204-216
  • [Kaluarachchi et al., CIKM 2010] Amal Chaminda Kaluarachchi, Aparna S. Varde, Srikanta J. Bedathur, Gerhard Weikum, Jing Peng, Anna Feldman: Incorporating terminology evolution for query translation in text retrieval with association rules. CIKM 2010: 1789-1792

123

References (cont'd)

  • [Kanhabua et al., JCDL 2010] Nattiya Kanhabua, Kjetil Nørvåg: Exploiting time-based synonyms in searching document archives. JCDL 2010: 79-88
  • [Kanhabua et al., ECDL 2010] Nattiya Kanhabua, Kjetil Nørvåg: Determining Time of Queries for Re-ranking Search Results. ECDL 2010: 261-272
  • [Kanhabua et al., SIGIR 2011] Nattiya Kanhabua, Kjetil Nørvåg: Time-based query performance predictors. SIGIR 2011: 1181-1182
  • [Kanhabua et al., SIGIR 2012] Nattiya Kanhabua, Klaus Berberich, Kjetil Nørvåg: Learning to select a time-aware retrieval model. (To appear) SIGIR 2012
  • [Li et al., CIKM 2003] Xiaoyan Li, W. Bruce Croft: Time-based language models. CIKM 2003: 469-475
  • [Metzler et al., SIGIR 2009] Donald Metzler, Rosie Jones, Fuchun Peng, Ruiqiang Zhang: Improving search relevance for implicitly temporal queries. SIGIR 2009: 700-701
  • [Mazeika et al., CIKM 2011] Arturas Mazeika, Tomasz Tylenda, Gerhard Weikum: Entity timelines: visual analytics and named entity evolution. CIKM 2011: 2585-2588
  • [Nunes et al., ECIR 2008] Sérgio Nunes, Cristina Ribeiro, Gabriel David: Use of Temporal Expressions in Web Search. ECIR 2008: 580-584
  • [Peng et al., ECIR 2010] Jie Peng, Craig Macdonald, Iadh Ounis: Learning to Select a Ranking Function. ECIR 2010: 114-126
  • [Zhang et al., EMNLP 2010] Ruiqiang Zhang, Yuki Konda, Anlei Dong, Pranam Kolari, Yi Chang, Zhaohui Zheng: Learning Recurrent Event Queries for Web Search. EMNLP 2010: 1129-1139

124


Question?

125

Evolution of Web Search Results

(1) Short-term impacts: Caching Results
(2) Longitudinal analysis of search results

126

Architecture of a Search Engine

  • How do search results change?
  • 1. Crawl the Web
  • 2. Index the content
  • 3. Go to step 1

[Figure: WWW → Crawler → Document Collection → Indexer → Indexes → Query processor → Answers; the query-processing side runs on-line]

128


Crawling the Dynamic Web

129

Indexing the Dynamic Web

  • How to refresh the index?

– Batch update (re-build)
– Re-merge
– In-place

  • Batch update

– Shadowing
– Simplest; the old index can keep serving at high rates

[Lester et al., IPM 2006]

130

Indexing the Dynamic Web

  • Re-merge

– A "buffer" B of new index entries
– Compute queries over index I and B and merge results
– Merge B with I when size(B) > threshold
– Optimizations: logarithmic and geometric merging

[Lester et al., IPM 2006]

131

Indexing the Dynamic Web

  • In-place

– Over-allocation: leave free space at the end of each list
– Add new entries to the free space; otherwise, relocate the list to a new position on disk

[Lester et al., IPM 2006]

132


Query Processing

  • Is updating the underlying index enough?

If a query is submitted for the first time, it is processed over an up-to-date index. But... what about the cached results?

133

Why Result Caching?

  • Around 50% of a query stream is composed of repeating queries

[Fagni et al., TOIS 2006]

Query popularity follows a heavy-tail distribution

134

Result Caches

  • Result cache: top-k URLs with snippets per query
  • Result caching helps to reduce

– query response time
– traffic volume hitting the backend

[Figure: "bin laden dead" → cache miss (compulsory) goes to the BACKEND; subsequent lookups are hits]

135

Problem: dynamicity!

  • Key observations:

– Caches can be very large (thanks to cheap hardware)
– The Web changes rapidly (thanks to users)

[Figure: a cached result for "bin laden dead" may go stale within minutes, days, or weeks, depending on the query]

136


A deeper look: Capacity vs. Hits

  • Query log: Yahoo! Search engine

– A few cache nodes over 9 days

  • Cache capacity

– can be very large
– eviction policy is less of a concern

  • Hit rate: fraction of queries that are hits
  • Higher capacity → more hits!

[Cambazoglu et al., WWW 2010] (Slide provided by the authors)

137

Capacity vs. age

  • Hit age: time since the last update
  • Higher capacity → higher age!

[Cambazoglu et al., WWW 2010] (Slide provided by the authors)

138

Strategies for cache freshness

  • Decoupled approaches: the cache does not know what changed
  • Coupled approaches: the cache uses clues from the index to predict what changed

[Cambazoglu et al., WWW 2010]

139

Decoupled approaches: Flushing

  • Solution: flush periodically

– Coincides with the new index
– Bounds average age
– Impacts hit rate negatively

[Cambazoglu et al., WWW 2010] (Slide provided by the authors)

140

Decoupled approaches: TTL

  • Time-to-live (TTL)

– Assign a fixed TTL value to each cached result
– Pros: practical, almost no implementation cost
– Cons: blind strategy, may be sub-optimal

[Figure: result cache entries q1…qk, each stored as (qi, Ri, TTL(qi))]

141
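The fixed-TTL scheme above can be sketched in a few lines. This is a minimal illustration (not a production cache): each entry carries an expiry time, an expired entry is treated as a miss, and the clock is injectable so the behavior is testable; all names here are hypothetical.

```python
import time

class TTLResultCache:
    """Minimal fixed-TTL result cache: each cached entry expires
    `ttl_seconds` after insertion, as in the scheme on the slide."""
    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock           # injectable for testing
        self.entries = {}            # query -> (results, expiry_time)

    def put(self, query, results):
        self.entries[query] = (results, self.clock() + self.ttl)

    def get(self, query):
        hit = self.entries.get(query)
        if hit is None:
            return None              # compulsory miss
        results, expiry = hit
        if self.clock() >= expiry:
            del self.entries[query]  # expired -> treat as miss
            return None
        return results

# Usage with a fake clock, so expiry is deterministic:
t = [0.0]
cache = TTLResultCache(10, clock=lambda: t[0])
cache.put("bin laden dead", ["r1", "r2"])
print(cache.get("bin laden dead"))  # -> ['r1', 'r2']
t[0] = 11.0
print(cache.get("bin laden dead"))  # -> None (expired)
```

The "blind strategy" drawback is visible here: the expiry time is fixed at insertion, regardless of how often the backend result actually changes.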

Decoupled approaches: TTL

  • TTL per query

– Stale results → acceptable: search engines do not have a perfect view anyway!
– Stable hit rate; average age is still bounded!

[Cambazoglu et al., WWW 2010] (Slide provided by the authors)

142

Intelligent Refreshing

  • Enhancement to the TTL mechanism
  • Updates entries

– Re-execute queries
– Use idle cycles

  • Ideally

– update: low activity
– use: high activity

  • How to select queries for refreshing?

[Cambazoglu et al., WWW 2010] (Slide provided by the authors)

143

Intelligent refreshing

  • A new query goes to the young-and-cold bucket!
  • Fixed "query interval" for shifting age
  • Lazy update of temperature
  • Refresh hot and old entries first!

[Figure: cache entries bucketed by age × temperature]

[Cambazoglu et al., WWW 2010] (Slide provided by the authors)

144


Refresh-rate adjustment

  • Idle cycles?
  • Critical mechanism

– Prevent overloads

  • Feedback from the query processors

– Latency to process a query

  • Track recent latency

– Adjust the rate accordingly

[Figure: Cache ↔ Query Processors, exchanging queries and (results, latency)]

[Cambazoglu et al., WWW 2010] (Slide provided by the authors)

145

Design summary

  • Large capacity: millions of entries
  • TTL: bounds the age of entries
  • Refreshes: updates cache entries
  • Refresh rate adjustment: latency feedback

[Cambazoglu et al., WWW 2010] (Slide provided by the authors)

146

Cache evaluation

  • Simulation

– Yahoo! query log

  • Baseline policies

– No refresh
– Cyclic refresh

[Cambazoglu et al., WWW 2010] (Slide provided by the authors)

148


Performance in production

[Figure: production measurements with refreshes on vs. refreshes off]

[Cambazoglu et al., WWW 2010] (Slide provided by the authors)

149

Degradation

  • Under high load

– Processors degrade results
– Shorter lifetime

  • TTL mechanism

– Prioritize for refreshing
– Adjusts TTL when degraded

  • Refreshes

– Replace with better results

[Cambazoglu et al., WWW 2010] (Slide provided by the authors)

150

Adaptive TTL

  • Up to now we considered fixed TTL values
  • Are all queries equal?

– "quantum physics"
– "barcelona FC"
– "ecir 2012"

  • Another promising direction is assigning adaptive TTL values

[Alici et al., ECIR 2012] (Slide provided by the authors)

152


Adaptive I: Average TTL

  • Observe the past update frequency of the top-k results of a query and compute the average
  • Simple, but

– Needs history
– May not capture bursty update periods

[Figure: past update intervals 5, 1, 6, 4; when q is submitted again, TTL = (5+1+6+4)/4 = 4]

[Alici et al., ECIR 2012] (Slide provided by the authors)

153

Adaptive II: Incremental TTL

  • Adjust the new TTL value based on the current TTL value
  • Each time a cached result Rcached with an expired TTLcurr is requested:

compute Rnew
if Rnew ≠ Rcached:                      // STALE
    TTLnew ← TTLmin                     // catch bursty updates!
else:
    TTLnew ← TTLcurr + F(TTLcurr)       // F(.): linear, poly., exp.

[Alici et al., ECIR 2012] (Slide provided by the authors)

154
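The incremental-TTL rule above can be sketched as a runnable function. This is a minimal sketch with a linear growth function F(t) = t and hypothetical TTLmin/TTLmax bounds (the cap is my addition to keep the TTL from growing without limit; the paper's exact F and bounds may differ).

```python
def next_ttl(cached_result, fresh_result, ttl_curr,
             ttl_min=60, ttl_max=86400, grow=lambda t: t):
    """One step of the incremental-TTL rule: reset to ttl_min on a
    stale result, otherwise grow the TTL by grow(ttl_curr) (linear
    here; polynomial or exponential growth are alternatives)."""
    if fresh_result != cached_result:   # STALE: catch bursty updates
        return ttl_min
    return min(ttl_curr + grow(ttl_curr), ttl_max)

print(next_ttl(["a"], ["a"], 120))  # unchanged result -> 240
print(next_ttl(["a"], ["b"], 240))  # stale result -> 60
```

Queries whose results rarely change thus earn ever-longer TTLs, while a single observed change drops the TTL back to the minimum.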

Adaptive III: Machine-learned

  • Build an ML model

– Query- and result-specific feature set F
– Assume a query result changes at time point ti

  • Create a training instance with features Fi
  • Set the target feature (TTLnew) as the time period from ti to the time point tj where the query result changes again: tj - ti

[Figure: update intervals 5, 1, 6, 4 yield training instances <F1, 5>, <F2, 1>, …, <Fn, ?>]

[Alici et al., ECIR 2012] (Slide provided by the authors)

155

Features for ML

[Alici et al., ECIR 2012] (Slide provided by the authors)

156


Analysis of Web Search Results

  • Setup:

– 4,500 queries sampled from the AOL log
– Submitted to the Yahoo! API daily from Nov 2010 to April 2011 (120 days)
– For days di and di+1: if Ri ≠ Ri+1 (considering the top-10 URLs and their order), we consider the result as updated

[Alici et al., ECIR 2012] (Slide provided by the authors)

157

Distribution of result updates

  • On average, query results are updated every 2 days!

[Alici et al., ECIR 2012] (Slide provided by the authors)

158

Effect of query frequency

  • More frequent → results updated more frequently
  • Less frequent → scattered

[Alici et al., ECIR 2012] (Slide provided by the authors)

159

Effect of query length

  • No correlation between query length and

update frequency

[Alici et al., ECIR 2012] (Slide provided by the authors)

160


Simulation

  • Assume all 4,500 queries are submitted daily for 120 days

– On day 0, all results are cached
– On the following days, whenever the TTL expires, the new result (i.e., the result for that day from the Yahoo! API) is computed and replaces the old one

[Alici et al., ECIR 2012] (Slide provided by the authors)

161

Simulation setup

  • Evaluation metrics [Blanco 2010]

– At the end of each day, we compute:

False Positive Ratio = Redundant query executions / Number of unique queries
Stale Traffic Ratio = Stale results returned / Number of query occurrences

[Alici et al., ECIR 2012] (Slide provided by the authors)

162

Simulation setup

  • While computing the ST ratio for a given query at day i:

Strict policy:
– if Rcached is not strictly equal to Rday-i: staleCount++

Relaxed policy:
– if Rcached is not strictly equal to Rday-i: staleness += 1 - JaccardSim(Rcached, Rday-i)

[Alici et al., ECIR 2012] (Slide provided by the authors)

163
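The strict and relaxed staleness policies differ only in how much a changed result contributes to the stale-traffic count. A minimal sketch (variable names are hypothetical; the strict check compares the ordered top-k lists, while the Jaccard similarity is computed on the URL sets):

```python
def jaccard(a, b):
    """Jaccard similarity of two result lists, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def staleness_increment(cached, fresh, relaxed=True):
    """Per-query contribution to the stale-traffic count: 1 under the
    strict policy, 1 - Jaccard(cached, fresh) under the relaxed one."""
    if cached == fresh:
        return 0.0
    return 1.0 - jaccard(cached, fresh) if relaxed else 1.0

cached = ["u1", "u2", "u3"]
fresh  = ["u1", "u2", "u4"]
print(staleness_increment(cached, fresh))                 # -> 0.5
print(staleness_increment(cached, fresh, relaxed=False))  # -> 1.0
```

The relaxed policy thus charges a result that shares most of its URLs with the fresh one far less than a completely different result.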

Performance: Incremental TTL

[Figure: FP vs. ST trade-off under the strict and relaxed policies]

[Alici et al., ECIR 2012] (Slide provided by the authors)

164


Performance: Incremental TTL

[Figure: results split into less-often updated vs. more-often updated queries]

[Alici et al., ECIR 2012] (Slide provided by the authors)

165

Performance: Average and Machine-Learned TTL

[Figure: FP vs. ST trade-off under the strict and relaxed policies]

[Alici et al., ECIR 2012] (Slide provided by the authors)

166

Coupled approaches: CIP

  • Cache invalidation policy (CIP)
  • Key idea: compare all added/updated/deleted documents to cached queries

– Incremental index update → updates reflected on-line!

[Figure: all queries in the cache(s) and all changes in the backend index feed the CIP module]

[Blanco et al., SIGIR 2010]

167

Coupled approaches: CIP

  • Incremental index update: updates reflected on-line!
  • Content changes are sent to the CIP module to invalidate queries (offline)

[Blanco et al., SIGIR 2010]

168


Synopses

  • Synopsis: a vector of a document's top-scoring TF-IDF terms
  • η: length of the synopsis
  • δ: modification threshold → consider a document d as updated only if diff(d, dold) > δ

[Blanco et al., SIGIR 2010]

169

Invalidation Approaches

  • Invalidation logic (for added docs)

– Basic: invalidate if the synopsis matches the query (i.e., contains all query terms)
– Scored: compare Sim(synopsis, query) to the score of the k-th result

[Blanco et al., SIGIR 2010]

170

Invalidation Approaches

  • Invalidation logic (for deleted docs):

– Invalidate all query results including a deleted document

  • An update is a deletion followed by an addition
  • Further apply TTL to guarantee an age bound
  • Scored + TTL

[Blanco et al., SIGIR 2010]

171

Evaluation

  • Really hard to evaluate if you are not sitting in the production department of a real search engine company!

172


Simulation Setup

  • Data: English Wikipedia dump

– Snapshot at Jan 1, 2006 ≈ 1 million pages
– All adds/deletes/updates for the following 30 days

  • Queries: 10,000 from the AOL log

[Blanco et al., SIGIR 2010; Alici et al. SIGIR 2011]

173

Simulation setup

  • Evaluation metric

– The query result is updated if the two top-10 lists are not exactly the same

False Positive Ratio = Redundant query executions / Number of unique queries
Stale Traffic Ratio = Stale results returned / Number of query occurrences

[Blanco et al., SIGIR 2010]

174

Performance

  • Setup

– Scored + TTL
– Full synopsis

  • Smaller δ or TTL:

– Lower ST, higher FP

[Figure: ST/FP trade-off curves for TTL 1≤t≤5 and for δ=0 (2≤t≤10), δ=0.005 (3≤t≤10), δ=0.01 (5≤t≤20)]

[Blanco et al., SIGIR 2010]

175

TIF: Timestamp-based Invalidation Framework

  • Devise a new invalidation mechanism

– better than TTL and close to CIP in detecting stale results
– better than CIP and close to TTL in efficiency and practicality

[Alici et al., SIGIR 2011] (Slide provided by the authors)

176


Timestamp-based Invalidation

  • The value of the timestamp (TS) on an item shows the last time the item was updated
  • TIF has two components:

– Offline (indexing time): decide on term and document timestamps
– Online (query time): decide on the staleness of the query result

[Alici et al., SIGIR 2011] (Slide provided by the authors)

177

TIF Architecture

[Figure: documents assigned to the node pass through the document parser into the search node's index; index updates produce document TS updates (TS(d1)…TS(dD)) and term TS updates (TS(t1)…TS(tT)); the result cache stores entries (qi, Ri, TS(qi)), and the invalidation logic answers miss/stale (0/1) for each incoming query qi]

[Alici et al., SIGIR 2011] (Slide provided by the authors)

182

TS Update Policies: Documents

  • For a newly added document d

– TS(d) = now()

  • For a deleted document d

– TS(d) = infinite

  • For an updated document d

– if diff(dnew, dold) > L: TS(d) = now()
– diff(di, dj) = |length(di) - length(dj)|

[Alici et al., SIGIR 2011] (Slide provided by the authors)

183

TS Update Policies: Terms

  • Frequency-based update

[Figure: a term t starts with TS(t) = T0 and posting-list length PLL_TS = 5; when the number of added postings exceeds F × PLL_TS, TS(t) is set to now() and PLL_TS becomes 6]

[Alici et al., SIGIR 2011] (Slide provided by the authors)

184


TS Update Policies: Terms

  • Score-based update

[Figure: postings p1…p5 sorted w.r.t. the scoring function; TS(t) = T0 with score threshold STS = Score(p3); when the score of an added posting p6 exceeds STS, TS(t) is set to now() and STS is recomputed after re-sorting]

[Alici et al., SIGIR 2011] (Slide provided by the authors)

185

Result Invalidation Policy

  • A search node decides a result is stale if:

– C1: ∃ d ∈ R s.t. TS(d) > TS(q) (d was deleted or revised after the generation of the query result)

or,

– C2: ∀ t ∈ q: TS(t) > TS(q) (all query terms appeared in new documents after the generation of the query result)

  • Also apply TTL to avoid stale accumulation

[Alici et al., SIGIR 2011] (Slide provided by the authors)

186
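The two invalidation conditions reduce to a handful of timestamp comparisons at query time, which is what makes TIF cheap. A minimal sketch, assuming timestamps are plain integers and the maps from doc/term ids to timestamps are given (all names and the example values are hypothetical):

```python
def is_stale(query_terms, result_docs, ts_query, doc_ts, term_ts):
    """TIF result-invalidation sketch: a cached result is stale if
    C1: some returned document was deleted/revised after the result
        was generated, or
    C2: every query term has appeared in new documents since then.
    `doc_ts` / `term_ts` map ids to their last-update timestamps."""
    c1 = any(doc_ts.get(d, 0) > ts_query for d in result_docs)
    c2 = all(term_ts.get(t, 0) > ts_query for t in query_terms)
    return c1 or c2

doc_ts  = {"d1": 5, "d2": 9}        # d2 revised at time 9
term_ts = {"euro": 3, "2012": 8}
# Result generated at time 7: d2 changed afterwards -> stale (C1)
print(is_stale(["euro", "2012"], ["d1", "d2"], 7, doc_ts, term_ts))  # -> True
```

Note the contrast with CIP: here the node only compares a few stored timestamps per lookup instead of traversing a query index for every changed document.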

Simulation Setup

  • Data: English Wikipedia dump

– Snapshot at Jan 1, 2006 ≈ 1 million pages
– All adds/deletes/updates for the following 30 days

  • Queries: 10,000 from the AOL log

[Alici et al., SIGIR 2011]

187

Simulation setup

  • Evaluation metrics [Blanco 2010]

– The query result is updated if the two top-10 lists are not exactly the same

False Positive Ratio = Redundant query executions / Number of unique queries
Stale Traffic Ratio = Stale results returned / Number of query occurrences

[Alici et al., SIGIR 2011] (Slide provided by the authors)

188


Performance: all queries

Frequency-based term TS update Score-based term TS update [Alici et al., SIGIR 2011] (Slide provided by the authors)

189

Performance: single-term queries

Frequency-based term TS update Score-based term TS update [Alici et al., SIGIR 2011] (Slide provided by the authors)

190

Invalidation Cost

TIF:
– Data transfer: send <q, R, TS(q)> to the search nodes
– Invalidation operations: compare TS values

CIP:
– Data transfer: send all <q, R> and all docs to the CIP module
– Invalidation operations: traverse the query index for every document

[Alici et al., SIGIR 2011] (Slide provided by the authors)

191

Performance: TIF

  • A simple yet effective invalidation approach
  • Predicting stale queries

– Better than TTL, close to CIP

  • Efficiency and practicality

– Straightforward in a distributed system

192


Grand Summary

– Fixed TTL (with refreshing): decoupled; evaluated by hit rate and hit age on a Yahoo! Web sample
– Adaptive TTL: decoupled; evaluated by ST and FP ratios on AOL queries
– TIF: coupled; evaluated by ST and FP ratios on Wikipedia + AOL
– CIP: coupled; evaluated by ST and FP ratios on Wikipedia + AOL

Moving down the list, the ST and FP ratios decrease; moving up, the invalidation cost decreases

193

Open Research Directions

Investigate:

  • User satisfaction vs. freshness
  • Complex ranking functions
  • Alternative index update strategies
  • Combinations

– Adaptive TTL + TIF or CIP
– Adaptive TTL + refresh strategy

194

Evolution of Web Search Results

  • Earlier works consider the changes in the content of the

– Web
– queries

  • How do real-life search engines react to this dynamicity?
  • Compare results from the Yahoo! API

– for 630K real-life queries (from the AOL log)
– obtained in 2007 and 2010

[Altingovde et al., SIGIR 2011] (Slide provided by the authors)

195

What is Novel?

  • Queries are real, not synthetic
  • Query set is large
  • Results from a search engine at two very

distant points in time

  • Focus on the properties of results, but not

the underlying content

[Altingovde et al., SIGIR 2011] (Slide provided by the authors)

196


We Seek to Answer:

  • How is the growth of the Web reflected in top-ranked query results?
  • Do the query results totally change over time?
  • Are results located deeper in sites?
  • Is there any change in result title and snippet properties?

[Altingovde et al., SIGIR 2011] (Slide provided by the authors)

197

Avg. no. of results

  • Almost tripled (from 16M to 52M), but not uniformly

[Altingovde et al., SIGIR 2011] (Slide provided by the authors)

198

  • No. of unique URLs
  • 20% of the URLs returned at the highest rank in

2010 were at the same position in 2007!

[Altingovde et al., SIGIR 2011] (Slide provided by the authors)

199

  • No. of unique domains
  • The increase in unique domain names in 2010 is more pronounced than the increase in the number of unique URLs (diversity? coverage?)
  • Even higher overlap for top-1 domains

[Altingovde et al., SIGIR 2011] (Slide provided by the authors)

200


Research Directions

  • How are the results diversified at these two different time points?

– Can we deduce this from snippets?

  • How does the level of bias change in query results?

[Altingovde et al., SIGIR 2011] (Slide provided by the authors)

201

References

  • [Alici et al., SIGIR 2011] Sadiye Alici, Ismail Sengör Altingövde, Rifat Ozcan, Berkant Barla Cambazoglu, Özgür Ulusoy: Timestamp-based result cache invalidation for web search engines. SIGIR 2011: 973-982
  • [Alici et al., ECIR 2012] Sadiye Alici, Ismail Sengör Altingövde, Rifat Ozcan, Berkant Barla Cambazoglu, Özgür Ulusoy: Adaptive Time-to-Live Strategies for Query Result Caching in Web Search Engines. ECIR 2012: 401-412
  • [Altingovde et al., SIGIR 2011] Ismail Sengör Altingövde, Rifat Ozcan, Özgür Ulusoy: Evolution of web search results within years. SIGIR 2011: 1237-1238
  • [Blanco et al., SIGIR 2010] Roi Blanco, Edward Bortnikov, Flavio Junqueira, Ronny Lempel, Luca Telloli, Hugo Zaragoza: Caching search engine results over incremental indices. SIGIR 2010: 82-89
  • [Cambazoglu et al., WWW 2010] Berkant Barla Cambazoglu, Flavio Paiva Junqueira, Vassilis Plachouras, Scott A. Banachowski, Baoqiu Cui, Swee Lim, Bill Bridge: A refreshing perspective of search engine caching. WWW 2010: 181-190
  • [Fagni et al., TOIS 2006] Tiziano Fagni, Raffaele Perego, Fabrizio Silvestri, Salvatore Orlando: Boosting the performance of Web search engines: Caching and prefetching query results by exploiting historical usage data. ACM Trans. Inf. Syst. 24(1): 51-78 (2006)
  • [Lester et al., IPM 2006] Nicholas Lester, Justin Zobel, Hugh E. Williams: Efficient online index maintenance for contiguous inverted lists. Inf. Process. Manage. 42(4): 916-933 (2006)

202

Thank you!

Questions???

203

Indexing the Past

(1) Indexing and searching methods for versioned document collections

204


Time Travel?

  • Time travel is easy (in IR)! You only need:

– The data – The mechanisms to search

  • indexing + query processing

– (And fasten your seat belts!)

205

Why Search the Past?

  • Historical information needs (e.g., an analyst working on the social reactions on the net after 9/11)
  • To find relevant resources that no longer exist
  • To discover trends, opinions, etc. for a certain time period (what did people think about UFOs at the beginning of the millennium?)

206

Data: Preserving the Web

  • Non-profit organizations:

– Internet Archive, European Archive, ...

  • Several EU projects

– Planets, PROTAGE, PrestoPRIME, Scape, ENSURE, BlogForever, LiWA, LAWA, ARCOMEM, ...

http://archive.org/

207

Data

  • There are also other "versioned" data collections

– Wikis (Wikipedia)
– Repositories (code, documents, organization intranets)
– RSS feeds, news, blogs, etc. with continuous updates
– Personal desktops

Collections including multiple versions of each document, each with a different timestamp

208


Indexing

  • Various earlier approaches:

– [Anick et al., SIGIR 1992], [Herscovici et al., ECIR 2007]

  • Focus on recent work from two different perspectives

– Indexing schemes concentrating on partitioning-related issues
  [Berberich et al., SIGIR 2007; Anand et al., CIKM 2010; Anand et al., SIGIR 2011]
– Indexing schemes concentrating on index size
  [He et al., CIKM 2009; He et al., CIKM 2010; He et al., SIGIR 2011]

209

Time-travel Queries

  • "Queries that combine the content and temporal predicates" [Berberich et al., SIGIR 2007]
  • Interesting query types

– Point-in-time: "euro 2012 articles" @ 1/July/2012
– Interval: "euro 2012 articles" between 01.06-01.07.2012
– Durability in a time interval [Hou U et al., SIGMOD 2010]: "search engine research papers" that are in the top-10 results for 75% of the time between the years 2000 and 2012

210

Time-point Queries: Indexing

  • Formal model:

– For a document d, each version has begin/end timestamps (validity interval):

  • For the current version, de = ∞
  • For d, all validity times are disjoint

(Timeline: version d0 valid in [t0, t1), ..., current version d3 valid in [t3, ∞)) [Berberich et al., SIGIR 2007]
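The validity-interval model can be sketched in a few lines (an illustrative sketch, not the authors' implementation; the `Posting` tuple and the `end=None` convention for the open-ended current version are assumptions):

```python
from collections import namedtuple

# A time-travel posting: doc id, term frequency, validity interval [begin, end).
# end=None encodes the open-ended interval of the current version (de = infinity).
Posting = namedtuple("Posting", ["doc", "tf", "begin", "end"])

def valid_at(posting, t):
    """True if the posting's validity interval contains time point t."""
    return posting.begin <= t and (posting.end is None or t < posting.end)

def time_point_query(posting_list, t):
    """Keep only postings whose version was valid at query time t."""
    return [p for p in posting_list if valid_at(p, t)]

postings = [Posting("d", 1, 0, 1), Posting("d", 2, 2, 3), Posting("d", 2, 3, None)]
print(time_point_query(postings, 2))  # only the version valid in [2, 3) matches
```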

211

Indexing

  • Key idea

– Keep “validity intervals” within posting lists

Postings extended from <d, tf> to <d, tf, db, de>. Example: term v with tf(v) = 1 in [t0, t1), 0 in [t1, t2), and 2 in [t2, t3) and [t3, ∞) yields:

<d, 1, t0, t1> <d, 2, t2, t3> <d, 2, t3, ∞>

  • Problem: Index size explodes!

– Even for Wikipedia, 9·10^9 entries

[Berberich et al., SIGIR 2007]

212


Remedy: Coalescing

  • Combine adjacent postings with similar

payloads

v: <d, 1, t0, t1> <d, 2, t2, t3> <d, 2, t3, ∞>  →  v: <d, 1, t0, t1> <d, 2, t2, ∞> [Berberich et al., SIGIR 2007]
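A lossless variant of coalescing, merging only temporally adjacent postings with identical payloads, can be sketched as follows (the tuple layout is an assumption; the paper's approximate algorithm also merges near-equal payloads at a bounded error):

```python
def coalesce(postings):
    """Merge temporally adjacent postings of the same document with equal tf.

    Postings are (doc, tf, begin, end) tuples sorted by begin time; end=None
    means the interval is open-ended. Lossless variant: only identical
    payloads are merged.
    """
    out = []
    for doc, tf, begin, end in postings:
        if out:
            pdoc, ptf, pbegin, pend = out[-1]
            # Adjacent (previous interval ends where this one begins) and same payload
            if pdoc == doc and ptf == tf and pend == begin:
                out[-1] = (doc, tf, pbegin, end)
                continue
        out.append((doc, tf, begin, end))
    return out

# <d,2,[t2,t3)> and <d,2,[t3,inf)> coalesce into <d,2,[t2,inf)>
print(coalesce([("d", 1, 0, 1), ("d", 2, 2, 3), ("d", 2, 3, None)]))
```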

213

Optimization Problem

[Berberich et al., SIGIR 2007]

214

ATC

  • Linear-time approximate algorithm

[Berberich et al., SIGIR 2007]

215

It works!

[Berberich et al., SIGIR 2007]

216


Temporal Partitioning (Slicing)

  • Even after coalescing, wasted I/O for

redundant postings that do not overlap with the query’s time point

Slice v, [t0, t1): <d, 1, t0, t1>    Slice v, [t2, ∞): <d, 2, t2, ∞>

Remark: Validity intervals still reside in the postings! [Berberich et al., SIGIR 2007]

217

Trade-off

  • Optimal approaches

– Space-optimal – Performance-optimal

  • Trading off space and performance:

– Performance-Guarantee approach – Space-Bound approach

[Berberich et al., SIGIR 2007]

218

Computational Complexity

  • Performance-Guarantee Approach

– DP: time and space complexity O(n^2)

  • Space-Bound Approach

– DP solution is prohibitively expensive – Approximate solution using simulated annealing

  • Time O(n^2), Space O(n)
  • Note: n is the number of all unique time-interval

boundaries for a given term’s posting list

[Berberich et al., SIGIR 2007]

219

Time-point Queries: Result

  • Given a space bound of 3 (i.e., 3x the

optimal space), close-to-optimal performance is achievable!

– (Reminder: Partitioning is not very cheap!)

[Berberich et al., SIGIR 2007]

220


Time-point Queries: Result

  • Time-point queries can be handled like

this:

  • Example

221

Time-interval Queries?

  • When more than one partition must be

accessed, wasted I/O due to repeated postings! (e.g., 3x more postings in SB!)

  • Example

[Anand et al., CIKM 2010]

222

Solution Approaches

Partition selection [Anand et al., CIKM 2010]

Can we avoid accessing all partitions related to a given query?

Document partitioning (sharding) [Anand et al., SIGIR 2011]

Can we partition postings in a list in a different way i.e., other than using time information?

223

Partition Selection

  • Problem: Optimize result quality by

accessing a subset of “affected partitions” without exceeding a given I/O upper bound

Optimization criterion: maximize the fraction of original query results (relative recall).

Two types of constraints: a) Size-based: allowed to read up to a fixed number of postings (focus on sequential access) b) Equi-cost: allowed to read up to a fixed number of partitions (focus on random access) [Anand et al., CIKM 2010]

224


Assumption

  • Assume an oracle exists “providing us the

cardinalities of the individual partitions as well as the result of their intersection/union”

  • Later, KMV synopses will be employed as

an approximation

[Anand et al., CIKM 2010]

225

Single-term queries

  • Size-based partition selection
  • Equi-cost partition selection
  • DP solutions exist, but expensive!
  • Approximation

– Reduce the problem to budgeted max coverage

N [Anand et al., CIKM 2010]

226

GreedySelect Algorithm

  • Cost(P):

– Size-based: no. of postings in the partition – Equi-cost: 1

  • Benefit(P):

– No. of unread postings in P

  • At each step:

– Select the partition with highest B(P) / C(P) – Update benefits of the unselected partitions

Recall: we assume an oracle providing these numbers!

[Anand et al., CIKM 2010]
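The greedy selection step can be sketched as follows (a simplified single-term sketch; in practice the cardinalities come from an oracle or KMV synopses rather than exact sets, and the `budget` and `cost` parameters are illustrative):

```python
def greedy_select(partitions, budget, cost):
    """Greedy partition selection (budgeted max coverage style).

    partitions: dict name -> set of result doc ids.
    cost: function name -> cost (posting count for size-based, 1 for equi-cost).
    Repeatedly picks the partition with the highest benefit/cost ratio,
    where benefit = number of yet-uncovered docs.
    """
    selected, covered, spent = [], set(), 0
    remaining = dict(partitions)
    while remaining:
        best = max(remaining, key=lambda p: len(remaining[p] - covered) / cost(p))
        gain = remaining[best] - covered
        # Stop when nothing new is gained or the budget would be exceeded
        # (a simplification; a real implementation might try cheaper partitions).
        if not gain or spent + cost(best) > budget:
            break
        selected.append(best)
        covered |= gain
        spent += cost(best)
        del remaining[best]
    return selected, covered

parts = {"P1": {1, 2, 3}, "P2": {3, 4}, "P3": {5}}
sel, cov = greedy_select(parts, budget=2, cost=lambda p: 1)  # equi-cost: each partition costs 1
print(sel, cov)
```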

227

Example

228


Multi-term Queries

  • Conjunctive semantics → intersect

partitions of query terms

  • Optimization objective: increase the

coverage of postings that are in this intersection space of partitions

Size-based N Equi-cost [Anand et al., CIKM 2010]

229

Simple Math does the job!

  • Compute a “union of intersections”
  • x: a tuple with one partition chosen from each term
  • x is a tuple from the Cartesian product of the

partitions for each query term

– X = P_term1 × P_term2 × ... × P_termk, and then – x = (P_term1,i, P_term2,j, ..., P_termk,l) ∈ X

Choose a partition from this term’s partitions [Anand et al., CIKM 2010]

230

GreedySelect, again!

  • Apply GreedySelect over X to pick x ϵ X
  • C(x)

– Sum of the sizes of unselected partitions in x,

  • or, the number of unselected partitions in x

  • B(x)

– No. of new documents that appear in the intersection of the partitions in x

  • Modify the algorithm to update benefits and

costs after picking each tuple x.

[Anand et al., CIKM 2010]

231

Cartesian Product or Join?

  • Problem: Set X might be very large!
  • Remedy: Observe that partitions of

different terms that have no temporal overlap cannot have any intersecting docs!

Apply the algorithm over the t-join set!

[Anand et al., CIKM 2010]

232

Performance of Partition Selection

  • Relative recall: 50% of affected partitions

might be enough!

– 3 datasets, Wikipedia seems harder!

Wikipedia UKGOV NY Times [Anand et al., CIKM 2010]

233

Performance of Partition Selection

  • Compare query run times for:

– Unpartitioned index, – No partition selection, – Partition selection with I/O bound (all index files are compressed!)

[Anand et al., CIKM 2010]

234

Real Life: KMV Synopses

  • Instead of assuming an “oracle” for

cardinality estimation, use KMV synopses

  • A KMV synopsis for a multiset S:

– Fix a hash function h – Apply h to each distinct value in S – k-smallest of the hashed values form KMV

[Beyer et al., SIGMOD 2007] Effective sketches for sets supporting arbitrary multiset operations: ∪, ∩, \ [Anand et al., CIKM 2010]
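A minimal KMV synopsis and the basic distinct-value estimator D ≈ (k−1)/U(k) from [Beyer et al., SIGMOD 2007] can be sketched like this (SHA-1 as the hash function is an illustrative choice, not the paper's):

```python
import hashlib

def kmv_synopsis(values, k):
    """k minimum values synopsis: hash each distinct value, keep the k smallest hashes."""
    hashes = sorted({int(hashlib.sha1(str(v).encode()).hexdigest(), 16) for v in values})
    return hashes[:k]

def estimate_distinct(synopsis, k, max_hash=2 ** 160):
    """Basic KMV estimator: D ~= (k - 1) / U(k), where U(k) is the k-th
    smallest hash normalized to [0, 1] (SHA-1 gives 160-bit hashes)."""
    if len(synopsis) < k:
        return len(synopsis)  # saw fewer than k distinct values: count is exact
    return (k - 1) / (synopsis[k - 1] / max_hash)

print(estimate_distinct(kmv_synopsis(range(1000), 10), 10))
```

Union of two synopses is simply the k smallest values of their merged hash sets, which is what makes KMV convenient for the partition intersections/unions needed here.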

235

Partition Selection with KMV

  • For each partition of each term:

– create a flat file storing KMV synopses (5% and 10% samples)

[Anand et al., CIKM 2010]

236


Partition Selection with KMV

  • KMV is promising

– 5% is enough!

Wikipedia UKGOV NY Times [Anand et al., CIKM 2010]

237

Document partitioning (Sharding)

  • Another solution to avoid reading and

processing repeated postings with cross-cutting validity intervals in temporal partitioning

  • Example

[Anand et al., SIGIR 2011]

238

Sharding

  • Key idea:

– Instead of partitioning temporally, partition postings based on doc-ids – No repetition as in slicing – Example

[Anand et al., SIGIR 2011]

239

Sharding

  • Entries in a shard are ordered according to

their begin time

  • For each shard, an auxiliary data structure:

– List of pairs: (query begin time, offset in shard) – Maintained at practical granularities of begin times (like days) → can fit in memory! – For a query with a given begin time

  • Seek to the offset position in the shard and then read

sequentially

[Anand et al., SIGIR 2011]

240


Why several shards?

  • There can still be wasted disk I/O while reading a

shard:

  • The problem is postings with long validity intervals

(i.e., subsuming a lot of other postings)

(Figure: the earliest posting that includes the query begin time causes reading 5 useless postings) [Anand et al., SIGIR 2011]

241

Idealized Sharding

  • For a given posting list L, find a minimal set of

shards that satisfy staircase property (SP)

if begin(p) ≤ begin(q) → end(p) ≤ end(q)

  • Greedy solution:

– For each posting p in L (in begin-time order)

  • Add p to an existing shard s if SP is satisfied
  • Otherwise, start a new shard with p

[Anand et al., SIGIR 2011]
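The greedy construction above can be sketched as follows (the `(doc, begin, end)` tuple representation is an assumption):

```python
def idealized_sharding(postings):
    """Greedy idealized sharding: each shard satisfies the staircase
    property begin(p) <= begin(q) -> end(p) <= end(q).

    postings: (doc, begin, end) tuples sorted by begin time.
    """
    shards = []
    for p in postings:
        end = p[2]
        for shard in shards:
            # Postings arrive in begin order, so appending p keeps the
            # staircase property iff end >= the shard's last end time.
            if end >= shard[-1][2]:
                shard.append(p)
                break
        else:
            shards.append([p])  # no existing shard fits: start a new one
    return shards

# The long-lived posting "a" and the later "d" share one shard;
# the short-lived "b" and "c" go to a second shard.
shards = idealized_sharding([("a", 0, 10), ("b", 1, 3), ("c", 2, 5), ("d", 4, 12)])
print(shards)
```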

242

Merging Shards

  • Idealized sharding can yield several

shards

– random access per shard is expensive!

  • A greedy merging algorithm

– Penalty ≤ Cost_random / Cost_sequential – Penalty (pairwise): wasted sequential reads for merging two shards – Merge in ascending choice order and then smallest size order

More on this later! [Anand et al., SIGIR 2011]

243

Performance

  • Sharding improves query processing times

w.r.t. temporal partitioning or no partitioning at all

  • Merging shards results in shorter QP times
  • Time measurements are taken on warm

caches!!!

[Anand et al., SIGIR 2011]

244


A Grand Summary

  • A full posting list with validity intervals

– High sequential access + CPU cost

  • 1 random access per query term
  • Temporal partitioning (slicing)

– Reduce seq access (but, repetitions) – Reduce CPU cost – ≥1 random accesses per query term (with partition selection)

  • Document partitioning (sharding)

– Reduce seq access (no repetitions + time map) – Reduce CPU cost (staircase property) – ≥1 random accesses per query term (read all shards?!)

245

An Alternative Perspective

  • Work from the Polytechnic Institute of NYU
  • Focus on the index size

Approaches up to now consider each version of a document separately: no special attention to the overlap between versions

246

Index Compression

  • Key ideas

– Small integers can be represented with smaller codes – Doc ids are not so small: instead, compress the gaps between the ids – Term frequencies are already small – Example
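The d-gap idea can be illustrated with a small sketch (variable-byte coding is one simple instance; the `to_gaps` and `vbyte_encode` helpers are illustrative, not a specific library's API):

```python
def to_gaps(doc_ids):
    """Sorted doc ids -> first id followed by the gaps between consecutive ids."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def vbyte_encode(n):
    """Variable-byte code: 7 payload bits per byte; the high bit marks the last byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

# Consecutive ids give tiny gaps that fit in one byte each ...
print(to_gaps([1, 2, 3]))           # [1, 1, 1]
# ... while scattered ids leave large gaps needing two bytes each here.
print(to_gaps([1000, 3000, 6000]))  # [1000, 2000, 3000]
```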

247

Indexing Versions

  • Assume validity intervals are stored separately

→ time constraints handled @ post-processing

  • Versions of di are represented as di,j

With validity intervals: v: <d1, 1, t0, t1> <d1, 2, t2, t3> <d1, 2, t3, ∞>  →  per-version postings: v: <d1,1, 1> <d1,2, 2> <d1,3, 2> Question: How can we reduce the storage space for such a representation? [He et al., CIKM 2009]

248


Versions Are Similar

  • There is a high overlap between versions!

(Figure: snapshots @ May 15, June 15, and July 15 with highly overlapping content)

249

How to Exploit?

  • Simplest idea:

– Assign consecutive doc IDs to consecutive versions of the same document – Allows small d-gaps for overlapping terms among the versions

d1: http://romip.ru/russir2012/, [May 15, June 15]   d2: http://romip.ru/russir2012/, [June 15, July 15]   d3: http://romip.ru/russir2012/, [July 15, Aug 15]

Random ids: russir <d1000, 1> <d3000, 2> <d6000, 2>   vs.   Consecutive ids: russir <d1, 1> <d2, 2> <d3, 2> [He et al., CIKM 2009]

250

MSA Approach

  • Multiple Segment Alignment [Herscovici et al., ECIR’07]

– Given d with some versions, a virtual document Vi,j is all terms occurring in (only) versions i through j – Reduces the number of postings but increases the document space! (theoretically, up to N^2 virtual documents!)

[He et al., CIKM 2009]

251

MSA: Example

[Herscovici et al., ECIR 2007]

252


Indexing the Differences

  • DIFF approach [Anick et al., SIGIR’92]

– For every pair of consecutive versions di and di+1, create a virtual document that is the symmetric difference between these versions.

[He et al., CIKM 2009]
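Treating each version as a set of terms, the DIFF construction can be sketched as follows (an illustrative simplification that ignores term frequencies):

```python
def diff_virtual_docs(versions):
    """DIFF-style virtual documents: the first version is kept in full; for
    each pair of consecutive versions (as term sets), index the symmetric
    difference (terms added or removed) instead of the full version."""
    virtuals = [set(versions[0])]
    for prev, cur in zip(versions, versions[1:]):
        virtuals.append(set(prev) ^ set(cur))
    return virtuals

v1 = {"web", "dynamics", "search"}
v2 = {"web", "dynamics", "search", "archive"}  # one term added
print(diff_virtual_docs([v1, v2]))  # second virtual doc holds only the change
```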

253

DIFF: Example

  • Example:
  • QP

254

Performance

  • Two index compression schemes

– Interpolative coding (IPC) – PForDelta with optimizations (PFD)

  • Two datasets: Wiki, Ireland

Index sizes for doc ids:

            Wikipedia  Ireland
Random-IPC     4957     2908
Random-PFD     5499     3289

Query processing times (in-memory): Sorted is the best! [He et al., CIKM 2009]

255

Two-level Indexing: Roots

  • Assume we have clusters of documents:

C1 = {d1}, C2 = {d2, d3}, C3 = {d4}

t1: <d1, 1> <d2, 2> <d3, 6> <d4, 2>    t2: <d2, 2> <d4, 1>

  • We have queries restricted to clusters:

– {t1, t2} in C3 → intersect lists → {d2, d4} → restrict to C3 → d4

256


Cluster-skipping index

  • Skip redundant postings during QP

– Reduce decompression cost [Altingovde et al., TOIS 2008]

  • Skipping is widely used for QP in practice

t1: [C1: <d1, 1>] [C2: <d2, 2> <d3, 6>] [C3: <d4, 2>]    t2: [C2: <d2, 2>] [C3: <d4, 1>]

[Altingovde et al., TOIS 2008]

257

Two-level indexing

  • Apply the same idea for versions:

– First level index: document ids – Second level index: a bitvector for versions No need for indexing separate ids for each version → implicit from the bitvector.

[He et al., CIKM 2009]

258

Two-level indexing

  • In actual implementation, group blocks of

doc ids and bitvectors of versions

  • Bitvectors best compressed by hierarchical

Huffman coding!

Per-version postings: russir <d1, 1> <d2, 2> <d3, 2>  →  Two-level: russir <d> [1 1 1 0] (0/1: the first three versions include “russir”) with TFs [1 2 2 0] (actual term frequencies) [He et al., CIKM 2009]
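Building the first-level posting with its second-level version bitvector can be sketched as follows (uncompressed and illustrative only; `build_two_level` and the dict representation are assumptions, not the paper's implementation):

```python
def build_two_level(doc_versions, term):
    """Two-level posting for one term: (doc id, bitvector over versions).

    doc_versions: dict doc id -> list of version term sets. The bitvector
    marks which versions contain the term, so per-version doc ids need not
    be stored explicitly."""
    postings = []
    for doc, versions in doc_versions.items():
        bits = [1 if term in v else 0 for v in versions]
        if any(bits):
            postings.append((doc, bits))
    return postings

docs = {"d": [{"russir", "web"}, {"russir"}, {"web"}]}
print(build_two_level(docs, "russir"))  # [('d', [1, 1, 0])]
```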

259

Two-level indexing: HUFF

  • Query Processing: Decompress bitvectors

of documents that are in the intersection of query terms

russir: <d1> [1 2 2 0] <d2> [1 1 2 0] <d4> [1 0 1 0]    schedule: <d3> [1 2 2 1] <d4> [1 2 2 0] <d5> [1 3 2 1]    Final result: d4,1 and d4,3 [He et al., CIKM 2009]

260


Two level indexing

  • Same idea can be applied to MSA and

DIFF

– For virtual documents, create bitvectors as before – Compress the first level with IPC, the second level with PFD

  • Even better compression performance

[He et al., CIKM 2010] Index sizes for doc ids: HUFF 213 / 577

261

Two level indexing: QP

  • Queries without temporal constraints

[He et al., SIGIR 2011]

262

Two level indexing: QP

  • Queries with temporal constraints

– Post-processing

[He et al., SIGIR 2011]

263

Two level indexing: QP

  • Queries with temporal constraints

– Partitioning, again!

[He et al., SIGIR 2011]

264


QP Performance

[He et al., SIGIR 2011]

265

Grand Summary

Partition-focused

– Time information kept in postings – Redundancy solved by partitioning

  • Process only relevant

partitions

– Lossy compression of versions (coalescing)

Size-focused

– Time information kept separately – Redundancy solved by 2-level compression

  • Process compact blocks
  • Hierarchical partitions

– Lossless compression of versions

266

References

  • [Altingovde et al., TOIS 2008] Ismail Sengör Altingövde, Engin Demir, Fazli Can, Özgür Ulusoy: Incremental cluster-based retrieval using compressed cluster-skipping inverted files. ACM Trans. Inf. Syst. 26(3) (2008)
  • [Anand et al., CIKM 2010] Avishek Anand, Srikanta J. Bedathur, Klaus Berberich, Ralf Schenkel: Efficient temporal keyword search over versioned text. CIKM 2010: 699-708
  • [Anand et al., SIGIR 2011] Avishek Anand, Srikanta J. Bedathur, Klaus Berberich, Ralf Schenkel: Temporal index sharding for space-time efficiency in archive search. SIGIR 2011: 545-554
  • [Anick et al., SIGIR 1992] Peter G. Anick, Rex A. Flynn: Versioning a Full-text Information Retrieval System. SIGIR 1992: 98-111
  • [Berberich et al., SIGIR 2007] Klaus Berberich, Srikanta J. Bedathur, Thomas Neumann, Gerhard Weikum: A time machine for text search. SIGIR 2007: 519-526
  • [Beyer et al., SIGMOD 2007] Kevin S. Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis, Rainer Gemulla: On synopses for distinct-value estimation under multiset operations. SIGMOD Conference 2007: 199-210
  • [He et al., CIKM 2009] Jinru He, Hao Yan, Torsten Suel: Compact full-text indexing of versioned document collections. CIKM 2009: 415-424
  • [He et al., CIKM 2010] Jinru He, Junyuan Zeng, Torsten Suel: Improved index compression techniques for versioned document collections. CIKM 2010: 1239-1248
  • [He et al., SIGIR 2011] Jinru He, Torsten Suel: Faster temporal range queries over versioned text. SIGIR 2011: 565-574
  • [Herscovici et al., ECIR 2007] Michael Herscovici, Ronny Lempel, Sivan Yogev: Efficient Indexing of Versioned Document Sequences. ECIR 2007: 76-87
  • [Hou U et al., SIGMOD 2010] Leong Hou U, Nikos Mamoulis, Klaus Berberich, Srikanta J. Bedathur: Durable top-k search in document archives. SIGMOD Conference 2010: 555-566

267

Thank you!

Questions???

268


Retrieval and Ranking Models

(1) Searching the past (2) Searching the future

RECAP

Two time dimensions

1. Publication or modified time 2. Content or event time

270

(Figure: the two time dimensions, content time and publication time)

271

Searching the past

  • Historical or temporal information needs

– A journalist working on the historical story of a particular news article – A Wikipedia contributor finding relevant information that has not been written about yet

272


  • Time must be explicitly modeled in order to

increase the effectiveness of ranking

– To order search results so that the most relevant ones come first

  • Time uncertainty should be taken into account

– Two temporal expressions can refer to the same time period even though they are written differently – E.g., the query “Independence Day 2011”

  • A retrieval model relying on term-matching only will fail to

retrieve documents mentioning “July 4, 2011”

Challenge

274

Query- and document model

  • A temporal query consists of:

– Query keywords – Temporal expressions

  • A document consists of:

– Terms, i.e., bag-of-words – Publication time and temporal expressions

[Berberich et al., ECIR 2010]

275



Time-aware ranking models

  • Follow two main approaches

1. Mixture model [Kanhabua et al., ECDL 2010]

  • Linearly combining textual- and temporal similarity

2. Probabilistic model [Berberich et al., ECIR 2010]

  • Generating a query from the textual part and the temporal part

of a document independently

277

Mixture model

  • Linearly combine textual- and temporal similarity

– α indicates the importance of similarity scores

  • Both scores are normalized before combining

– Textual similarity can be determined using any term- based retrieval model

  • E.g., tf.idf or a unigram language model
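A sketch of the mixture model with an exponential-decay temporal similarity (the decay rate, α, and the normalization are illustrative choices, not the exact settings of [Kanhabua et al., ECDL 2010]):

```python
import math

def temporal_similarity(t_query, t_doc, decay=0.1):
    """Exponential decay in the time distance between query and document time."""
    return math.exp(-decay * abs(t_query - t_doc))

def mixture_score(text_score, t_query, t_doc, alpha=0.5, decay=0.1):
    """Linear mixture of (already normalized) textual and temporal similarity;
    alpha weighs the importance of the textual part."""
    return alpha * text_score + (1 - alpha) * temporal_similarity(t_query, t_doc, decay)

# Two docs with equal text score: the temporally closer one ranks higher.
s_near = mixture_score(0.8, t_query=2011, t_doc=2011)
s_far = mixture_score(0.8, t_query=2011, t_doc=2001)
assert s_near > s_far
```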

278

Mixture model


How to determine temporal similarity?

279

Temporal similarity

  • Assume that temporal expressions in the query are

generated independently from a two-step generative model:

– P(tq|td) can be estimated based on publication time using an exponential decay function [Kanhabua et al., ECDL 2010] – Linear interpolation smoothing is applied to eliminate zero probabilities

  • I.e., an unseen temporal expression tq in d

280


Temporal similarity

(Figure: similarity score decaying with the time distance, shown for two documents d1 and d2)

281

  • Five time-aware ranking models

– LMT [Berberich et al., ECIR 2010] – LMTU [Berberich et al., ECIR 2010] – TS [Kanhabua et al., ECDL 2010] – TSU [Kanhabua et al., ECDL 2010] – FuzzySet [Kalczynski et al., Inf. Process. 2005]

Comparison of time-aware ranking

[Kanhabua et al., SIGIR 2011]

282

  • Experiment

– New York Times Annotated Corpus – 40 temporal queries [Berberich et al., ECIR 2010]

  • Result

– TSU outperforms other methods significantly for most metrics

  • Conclusions

– Although TSU achieves the best performance, it can only be applied to a collection with time metadata – LMT and LMTU can be applied to any collection without time metadata, but time extraction is needed

Discussion

[Kanhabua et al., SIGIR 2011]

283

Searching the future

  • People are naturally curious about the future

– What will happen to EU economies in the next 5 years? – What will be the potential effects of climate change?

284


Previous work

  • Searching the future

– Extract temporal expressions from news articles – Retrieve future information using a probabilistic model, i.e., multiplying textual similarity and a time confidence

  • Supporting analysis of future-related

information in news and Web

– Extract future mentions from news snippets obtained from search engines – Summarize and aggregate results using clustering methods, but no ranking

[Baeza-Yates SIGIR Forum 2005; Jatowt et al., JCDL 2009]

285

Recorded Future

[http://www.recordedfuture.com/]

286

Yahoo! Time Explorer

[Matthews et al., HCIR 2010]

287

Ranking news predictions

  • Over 32% of 2.5M documents from Yahoo! News

(July’09 – July’10) contain at least one prediction

  • Retrieve predictions related to a news story in news

archives and rank by relevance

[Kanhabua et al., SIGIR 2011a]

288


Related news predictions

[Kanhabua et al., SIGIR 2011a]

289


  • Four classes of features

– Term similarity, entity-based similarity, topic similarity and temporal similarity

  • Rank results using a learning-to-rank technique

Approach

[Kanhabua et al., SIGIR 2011a]

292


  • Step 1: Document annotation.

– Extract temporal expressions using time and event recognition – Normalize them to dates so they can be anchored on a timeline – Output: sentences annotated with named entities and dates, i.e., predictions

  • Step 2: Retrieving predictions.

– Automatically generate a query from a news article being read – Retrieve predictions that match the query – Rank predictions by relevance (i.e., a prediction is “relevant” if it is about the topics of the article)

System architecture

[Kanhabua et al., SIGIR 2011a]

293

  • Capture the term similarity between q and p

1. TF-IDF scoring function

  • Problem: keyword matching, short texts
  • Predictions may not match the query terms

2. Field-aware ranking function, e.g., bm25f

  • Search the context of a prediction, i.e., surrounding

sentences

Term similarity

[Kanhabua et al., SIGIR 2011a]

294

  • Measure the similarity

between q and p using annotated entities in dp, p, and q

– Features commonly employed in entity ranking

Entity-based similarity

[Kanhabua et al., SIGIR 2011a]

295

  • Compute the similarity between q and p on topic

– Latent Dirichlet allocation [Blei et al., J. Mach. Learn. 2003] for modeling topics

1. Train a topic model 2. Infer topics 3. Compute topic similarity

Topic similarity

[Kanhabua et al., SIGIR 2011a]

296



  • Hypothesis I. Predictions temporally closer to the

query are more relevant

Temporal similarity

  • Hypothesis II. Predictions extracted from more

recent documents are more relevant

[Kanhabua et al., SIGIR 2011a]

299

  • Learning-to-rank: Given an unseen (q, p), p is

ranked using a model trained over a set of labeled query/prediction pairs

– SVM-MAP [Yue et al., SIGIR 2007] – RankSVM [Joachims, KDD 2002] – SGD-SVM [Zhang, ICML 2004] – PegasosSVM [Shalev-Shwartz et al., ICML 2007] – PA-Perceptron [Crammer et al., J. Mach. Learn. 2006]

Ranking method

[Kanhabua et al., SIGIR 2011a]

300


  • 42 future-related topics

Relevance judgments

[Kanhabua et al., SIGIR 2011a]

301

  • New York Times Annotated Corpus

– 1.8 million articles, over 20 years – More than 25% contain at least one prediction

  • Annotation process uses several language processing

tools

– OpenNLP for tokenizing, sentence splitting, part-of- speech tagging, shallow parsing – SuperSense tagger for named entity recognition – TARSQI for extracting temporal expressions

  • Apache Lucene for indexing and retrieving.

– 44,335,519 sentences and 548,491 predictions – 939,455 future dates (avg. 1.7 future dates per prediction)

Experiments

[Kanhabua et al., SIGIR 2011a]

302

  • Results

– Topic features play an important role in ranking – The five lowest-weighted features are entity-based features

  • Open issues

– Extract predictions from other sources, e.g., Wikipedia, blogs, comments, etc. – Sentiment analysis for future-related information

Discussion

[Kanhabua et al., SIGIR 2011a]

303

References

  • [Baeza-Yates, SIGIR Forum 2005] Ricardo A. Baeza-Yates: Searching the future. SIGIR workshop MF/IR 2005
  • [Berberich et al., ECIR 2010] Klaus Berberich, Srikanta J. Bedathur, Omar Alonso, Gerhard Weikum: A Language Modeling Approach for Temporal Information Needs. ECIR 2010: 13-25
  • [Blei et al., J. Mach. Learn. 2003] David M. Blei, Andrew Y. Ng, Michael I. Jordan: Latent Dirichlet Allocation. Journal of Machine Learning Research 3: 993-1022 (2003)
  • [Crammer et al., J. Mach. Learn. 2006] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer: Online Passive-Aggressive Algorithms. Journal of Machine Learning Research 7: 551-585 (2006)
  • [Jatowt et al., JCDL 2009] Adam Jatowt, Kensuke Kanazawa, Satoshi Oyama, Katsumi Tanaka: Supporting analysis of future-related information in news archives and the web. JCDL 2009: 115-124
  • [Joachims, KDD 2002] Thorsten Joachims: Optimizing search engines using clickthrough data. KDD 2002: 133-142
  • [Kalczynski et al., Inf. Process. 2005] Pawel Jan Kalczynski, Amy Chou: Temporal Document Retrieval Model for business news archives. Inf. Process. Manage. 41(3): 635-650 (2005)
  • [Kanhabua et al., SIGIR 2011] Nattiya Kanhabua, Kjetil Nørvåg: A comparison of time-aware ranking methods. SIGIR 2011: 1257-1258
  • [Kanhabua et al., SIGIR 2011a] Nattiya Kanhabua, Roi Blanco, Michael Matthews: Ranking related news predictions. SIGIR 2011: 755-764
  • [Matthews et al., HCIR 2010] Michael Matthews, Pancho Tolchinsky, Roi Blanco, Jordi Atserias, Peter Mika, Hugo Zaragoza: Searching through time in the new york times. HCIR workshop 2010
  • [Shalev-Shwartz et al., ICML 2007] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, Andrew Cotter: Pegasos: primal estimated sub-gradient solver for SVM. Math. Program. 127(1): 3-30 (2011)
  • [Yue et al., SIGIR 2007] Yisong Yue, Thomas Finley, Filip Radlinski, Thorsten Joachims: A support vector method for optimizing average precision. SIGIR 2007: 271-278
  • [Zhang, ICML 2004] Tong Zhang: Solving large scale linear prediction problems using stochastic gradient descent algorithms. ICML 2004

304


Questions?

305