Query Log Analysis for Enhancing Web Search
Salvatore Orlando, University of Venice, Italy Fabrizio Silvestri, ISTI - CNR, Pisa, Italy
From tutorials given at IEEE / WIC / ACM WI/IAT'09
and ECIR’09
History Teaches Everything... Even the Future!
The 250 most frequent queried terms in the “famous” AOL query log!
Thanks to http://www.wordle.net for the tagcloud generator
The data is sorted by anonymous user ID and sequentially arranged. The goal of this collection is to provide real query log data that is based on real
search research. The data set includes {AnonID, Query, QueryTime, ItemRank, ClickURL}.
AnonID - an anonymous user ID number.
Query - the query issued by the user, case shifted with most punctuation removed.
QueryTime - the time at which the query was submitted for search.
ItemRank - if the user clicked on a search result, the rank of the item on which they clicked is listed.
ClickURL - if the user clicked on a search result, the domain portion of the URL in the clicked result is listed.
Each line in the data represents one of two types of events:
In the first case (query only), there is data in only the first three columns/fields.
In the second case (click through), there is data in all five columns. For click through events, the query that preceded the click through is included. Note that if a user clicked on more than one result in the list returned from a single query, there will be TWO lines in the data to represent the two events. Also note that if the user requested the next "page" of results for some query, this appears as a subsequent identical query with a later time stamp.
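The two event types described above can be told apart while parsing. The following is a minimal sketch, assuming the five fields are tab-separated and that query-only lines leave the last two fields empty; the `LogEvent` class, the `parse_line` helper and the sample lines are hypothetical and not taken from the real log.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogEvent:
    anon_id: str
    query: str
    query_time: str
    item_rank: Optional[int] = None   # set only for click-through events
    click_url: Optional[str] = None   # set only for click-through events

    @property
    def is_click(self) -> bool:
        return self.click_url is not None


def parse_line(line: str) -> LogEvent:
    """Parse one log line; tab-separated fields are an assumption."""
    fields = line.rstrip("\n").split("\t")
    anon_id, query, query_time = fields[0], fields[1], fields[2]
    # Click-through events carry data in all five columns.
    if len(fields) >= 5 and fields[3] and fields[4]:
        return LogEvent(anon_id, query, query_time, int(fields[3]), fields[4])
    # Query-only events carry data in the first three columns only.
    return LogEvent(anon_id, query, query_time)


# Made-up sample lines: one query, two clicks on its results, then a
# request for the next page (same query, later time stamp).
sample = [
    "42\tweb search\t2006-03-01 10:00:00\t\t",
    "42\tweb search\t2006-03-01 10:00:00\t1\thttp://www.example.com",
    "42\tweb search\t2006-03-01 10:00:00\t3\thttp://www.example.org",
    "42\tweb search\t2006-03-01 10:02:10\t\t",
]
events = [parse_line(line) for line in sample]
clicks = [e for e in events if e.is_click]
```

Here `clicks` holds the two click-through events, one line per clicked result, while the first and last events are query-only.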
Fabrizio Silvestri: Mining Query Logs: Turning Search Usage Data into Knowledge. Foundations and Trends in Information Retrieval. (To Appear).
Computer, vol. 35, no. 3, pp. 107–109, 2002.
topically categorized web query log,” J. Am. Soc. Inf. Sci. Technol., vol. 58, no. 2, pp. 166–178, 2007.
Figure: three plots. Queries ordered by popularity vs. popularity; terms ordered by popularity vs. popularity; URLs ordered by number of clicks vs. number of clicks.
prefetching query results by exploiting historical usage data,” ACM Trans. Inf. Syst., vol. 24, no. 1, pp. 51–78, 2006.
Characteristic          1997    1999    2001
Mean terms per query    2.4     2.4     2.6
Terms per query
  1 term                26.3%   29.8%   26.9%
  2 terms               31.5%   33.8%   30.5%
  3+ terms              43.1%   36.4%   42.6%
Mean queries per user   2.5     1.9     2.3
Computer, vol. 35, no. 3, pp. 107–109, 2002.
In 2008: 2.5 terms per query.
. Junqueira, V. Plachouras, and F. Silvestri, “Design trade-offs for search engine caching,” ACM Trans. Web, vol. 2, no. 4, pp. 1–28, 2008.
noted that the top position reported by the WSE strongly influences user behavior, beyond snippets
number of clicks a given position obtained in two different conditions: the normal ranking, and the ranking with the top two positions swapped
dropping in the number
result
effectiveness metrics may negatively influence the scientific value of research results
thus experiments may not be reproducible
people, e.g., metrics are tested on small human-annotated testbeds
communities
user profiles
search
aiming at enhancing performance (like caching)
poorly correlated
most important terms contained in each document according to tf × idf) and the query vector space (all the terms contained in the group of queries for which a document was clicked)
and only a small percentage of documents have similarity above 0.8
325-332, ACM, 2002.
technique
queries to improve effectiveness of query expansion is the
retrieved by similar past queries (the TREC queries and databases were used)
from the resulting top-scoring documents, a set of “important” terms is automatically extracted to enrich the query
documents and web search engine queries
<query, (list of clicked docIDs)>
Query Term Set Document Term Set
A link is inserted
sessions Term tq occurs in a query of a session. Term td occurs in a clicked document within the same session.
W = degree of term correlation
expansion method
and a candidate term td for query expansion
Naïve hypothesis on independence
query
candidate terms
selected as expansion terms for query Q
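The idea of linking query terms to document terms through sessions, and then scoring candidate terms for a whole query under an independence hypothesis, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the toy sessions, the simplified estimate of the weight W as a conditional frequency, and the choice of summing log(1 + W) over the query terms. It is not claimed to reproduce the exact formulas of the cited paper.

```python
import math
from collections import defaultdict

# Hypothetical toy sessions: (query terms, terms of the clicked documents).
sessions = [
    ({"jaguar", "car"}, {"jaguar", "engine", "dealer"}),
    ({"jaguar", "price"}, {"dealer", "engine", "price"}),
    ({"car", "rental"}, {"rental", "dealer"}),
]


def term_correlations(sessions):
    """Estimate W(tq, td) as the fraction of the sessions whose query
    contains tq in which td occurs in a clicked document."""
    query_count = defaultdict(int)
    pair_count = defaultdict(int)
    for query_terms, doc_terms in sessions:
        for tq in query_terms:
            query_count[tq] += 1
            for td in doc_terms:
                pair_count[(tq, td)] += 1
    return {pair: n / query_count[pair[0]] for pair, n in pair_count.items()}


def expansion_terms(query_terms, weights, top_k=2):
    """Score each candidate td by summing log(1 + W(tq, td)) over the query
    terms (independence assumption) and return the top_k candidates."""
    scores = defaultdict(float)
    for (tq, td), w in weights.items():
        if tq in query_terms and td not in query_terms:
            scores[td] += math.log(1.0 + w)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [td for td, _ in ranked[:top_k]]


weights = term_correlations(sessions)
suggested = expansion_terms({"jaguar", "car"}, weights)
```

On this toy data the candidates most strongly linked to both query terms are selected, so `suggested` contains "dealer" and "engine".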
make use of logs to expand queries
documents retrieved for a query to expand the query itself
hand-crafted queries), and the following table summarizes the average results
325-332, ACM, 2002.
ACM Trans. Inf. Syst., vol. 18, no. 1, pp. 79-112, 2000.
Method          Precision
baseline        17%
local context   22%
log-based       30%
already proposed by Scholer et al.
they share a high statistical similarity
document itself
considered as Surrogate Documents, and can be used as a source of terms for query expansion
12th CIKM, pp. 2-9, 2003.
Figure: Past Queries and the Full Document Collection
Each past query q is naturally associated with the K most relevant documents returned by a search engine
comparative experiments". Inf. Process. Manage., vol. 36, no. 6, pp. 779-808, 2000.
Each document d can end up associated with many queries. Only the M closest queries are kept w.r.t. the Okapi BM25 similarity measure
Surrogate Document
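The construction of surrogate documents can be sketched as follows. This is a minimal illustration under stated assumptions: the `bm25_scores` and `build_surrogates` helpers, the toy collection and the parameter values are hypothetical, and the BM25 variant used (with a non-negative idf) is one common formulation, not necessarily the one used in the cited work.

```python
import math
from collections import Counter, defaultdict


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` (dict id -> list of terms) for `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs.values()) / n_docs
    df = Counter(t for d in docs.values() for t in set(d))
    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        score = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(terms) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


def build_surrogates(past_queries, docs, K=2, M=2):
    """Associate each query with its K best documents, then keep for each
    document only its M best-scoring queries as the surrogate text."""
    candidates = defaultdict(list)  # doc_id -> [(score, query), ...]
    for q in past_queries:
        scores = bm25_scores(q.split(), docs)
        top = sorted(scores, key=lambda d: (-scores[d], d))[:K]
        for doc_id in top:
            if scores[doc_id] > 0:
                candidates[doc_id].append((scores[doc_id], q))
    return {
        doc_id: [q for _, q in sorted(pairs, key=lambda p: (-p[0], p[1]))[:M]]
        for doc_id, pairs in candidates.items()
    }


# Hypothetical toy collection and past queries.
docs = {
    "d1": "recent earthquake in california".split(),
    "d2": "volcanic activity and tectonic plates".split(),
    "d3": "california travel guide".split(),
}
past_queries = ["california earthquake", "tectonic plates",
                "earthquake magnitude", "california travel"]
surrogates = build_surrogates(past_queries, docs, K=1, M=2)
```

With K=1 and M=2, document d1 ends up described by the two earthquake queries, which then act as its surrogate document and as a source of expansion terms.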
for expanding queries?
means that, in some sense, the query terms have topical relationships with each other.
because the terms contained in the associated surrogate documents have already been chosen by users as descriptors of topics
because the surrogate document has many more terms than an individual query
earthquakes earthquake recent nevada seismograph tectonic faults perpetual 1812 kobe magnitude california volcanic activity plates past motion seismological
earthquakes tectonics earthquake geology geological
CIKM, pp. 2-9, ACM Press, 2003.
same session) submitted
refine their search, instead of having the query uncontrollably stuffed with a lot of terms
extent, similar to “Florida Orlando”, since they share term “Orlando”
membership
Workshops, pp. 207-216, 2002.
q1 also issue query q2 afterwards, query q2 is suggested for query q1
to generate query suggestions according to the above idea
. B. Golgher, E. S. de Moura, and N. Ziviani, “Using association rules to discover search engines related queries" in LA-WEB '03, p. 66, IEEE Computer Society, 2003.
application of association rules
corresponding to an unordered user session, where items are queries qi
where A and B are disjoint sets of queries
A and B are singletons are indeed extracted: qi ⇒ qj, where qi ≠ qj
qi ⇒ q1, qi ⇒ q2, qi ⇒ q3, ...., qi ⇒ qm
from a real Brazilian search engine
qj) appeared in at least 3 user sessions
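The association-rule view of related queries can be sketched in a few lines. This is a minimal illustration, assuming sessions are unordered sets of queries and only singleton rules qi ⇒ qj are extracted; the toy sessions, the `mine_rules` and `suggest` helpers and the use of confidence for ranking are assumptions, while the minimum support of 3 sessions follows the threshold mentioned above.

```python
from collections import Counter
from itertools import permutations

# Hypothetical sessions, each an unordered set of queries.
sessions = [
    {"cheap flights", "hotel rome", "rome map"},
    {"cheap flights", "hotel rome"},
    {"cheap flights", "hotel rome", "car rental"},
    {"cheap flights", "car rental"},
    {"rome map"},
]


def mine_rules(sessions, min_support=3, min_confidence=0.0):
    """Return rules qi => qj as {(qi, qj): confidence}, keeping the pairs
    that co-occur in at least `min_support` sessions."""
    single = Counter(q for s in sessions for q in s)
    pair = Counter(p for s in sessions for p in permutations(sorted(s), 2))
    rules = {}
    for (qi, qj), support in pair.items():
        confidence = support / single[qi]
        if support >= min_support and confidence >= min_confidence:
            rules[(qi, qj)] = confidence
    return rules


def suggest(query, rules):
    """Queries qj suggested for `query`, by decreasing confidence."""
    related = [(conf, qj) for (qi, qj), conf in rules.items() if qi == query]
    return [qj for conf, qj in sorted(related, key=lambda r: (-r[0], r[1]))]


rules = mine_rules(sessions)
suggestions = suggest("hotel rome", rules)
```

Only the pair ("cheap flights", "hotel rome") reaches the support threshold here, so it yields one rule in each direction.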
query text along with the text of clicked URLs.
the input one
a selection of terms extracted from the documents pointed by the user clicked URLs.
means algorithm
vector-space approach
*http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview
Query vector q: one component for each term ti of the vocabulary (all different words are considered). Each component is built from:
Percentage of clicks that URL u receives when answered in response to query q
Number of occurrences of the term in the document pointed to by URL u
Sum over all the clicked URLs u for query q
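A query vector built from these three ingredients can be sketched as below. This is a hedged illustration only: the click counts, the page texts and the helper names are hypothetical, the "percentage of clicks" is read as the share of the clicks for the query that went to each URL, and any further normalization used in the original work is left out.

```python
import math
from collections import Counter, defaultdict

# Hypothetical click data: query -> {clicked URL: number of clicks}.
clicks = {
    "jaguar car": {"u1": 3, "u2": 1},
    "jaguar dealer": {"u1": 1},
    "big cats": {"u3": 2},
}
# Hypothetical text of the documents pointed to by the clicked URLs.
pages = {
    "u1": "jaguar dealer jaguar price",
    "u2": "car rental",
    "u3": "jaguar lion tiger",
}


def query_vector(query):
    """Component for term ti: sum over the clicked URLs u of
    (share of the clicks for `query` that went to u) * (count of ti in u)."""
    total = sum(clicks[query].values())
    vec = defaultdict(float)
    for url, n in clicks[query].items():
        for term, tf in Counter(pages[url].split()).items():
            vec[term] += (n / total) * tf
    return dict(vec)


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


v1, v2, v3 = (query_vector(q) for q in clicks)
```

Queries whose clicks land on the same documents get similar vectors even when their text differs, which is what a vector-space clustering of queries relies on: here `cosine(v1, v2)` is larger than `cosine(v1, v3)`.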
(I) for an input query the most similar cluster is selected
(II) ranking of the queries of the cluster, according to:
returned by the query that captured the attention of users (clicked documents)
clustering)
queries
TodoCL search engine
user study.
queries yielded more precise, higher-quality suggestions
2006.
for the same issued query, depending on
the same query “game theory”
and theoretical studies
game theory real-world economy problems
web search", in Proc. of Workshop on New Technologies for Personalized Inf. Access (PIA '05), 2005.
user's profile, built automatically by exploiting knowledge mined from query logs
variations among individuals, re-ranking results according to a personalization function may be insufficient (or even dangerous)
Yu, and W. Meng, in 11th CIKM '02, pp. 558-565, ACM Press, 2002.
retrieval history of each user
search engine to classify web pages
makes it possible to focus on the most relevant results for each user
snippet-indexes", in CIKM '06, pp. 277-286, ACM, 2006
the user recognizes in their snippets certain combinations of terms that are related to their information needs
reflects the evolving interests of a group of searchers
re-ranking of the search results
search, RM, are revised with reference to the community’s snippet index IC
to community preferences.
returned to the user as RT
Collaborative Web Search (CWS)
systems:
these queries share some minimal overlapping terms within qT
two related queries do not contain any common terms
queries is indexed along with the surrogate clicked documents (snippets)
Q1=“Captain Kirk”, can potentially be returned in response to query Q2=“Starship Enterprise”
result previously selected in response to Q2
Systems
Figure: a Broker in front of IR Core 1, IR Core 2, …, IR Core k, each with its own index (idx); the query is t1,t2,…tq and the query results are r1,r2,…rr
Larger, but slower memory Smaller, but faster memory
CPU
Figure: a Broker with a Result Cache, in front of several IR Cores, each with its own Posting Cache
This is true in an ideal world.
lists for term new and for term york
Evangelos P. Markatos: On caching search engine query results. Computer Communications 24(2): 137-143 (2001)
Figure: queries ordered by popularity (x-axis) vs. popularity (y-axis)
Store these queries forever! Static Caching
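Static caching of the most popular queries can be sketched as follows. This is a minimal illustration with hypothetical query streams and helper names: the cache is filled once with the most frequent queries of a training log and never updated, and its hit ratio is then measured on a later stream.

```python
from collections import Counter


def build_static_cache(training_queries, size):
    """Keep the `size` most frequent queries of the training log."""
    return {q for q, _ in Counter(training_queries).most_common(size)}


def hit_ratio(cache, test_queries):
    """Fraction of the test queries that are answered from the cache."""
    hits = sum(1 for q in test_queries if q in cache)
    return hits / len(test_queries)


# Hypothetical query streams.
training = ["news", "weather", "news", "maps", "news", "weather", "rare query"]
test = ["news", "weather", "another rare query", "news"]

cache = build_static_cache(training, size=2)
ratio = hit_ratio(cache, test)
```

Because query popularity is heavily skewed, even a small static cache can answer a large share of the stream; in this toy example two cached queries serve three of the four test requests.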
prefetching query results by exploiting historical usage data,” ACM Trans. Inf. Syst., vol. 24, no. 1, pp. 51–78, 2006.
2x query throughput
prefetching query results by exploiting historical usage data,” ACM Trans. Inf. Syst., vol. 24, no. 1, pp. 51–78, 2006.
to the servers managing those terms.
needs reindexing the entire collection from scratch when an update occurs)
Figure: term partitioning over four servers
IR Core 1: T1...T3
IR Core 2: T4...T6
IR Core 3: T7...T9
IR Core 4: T10...T12
Term labels shown across the slide builds: T2 T4 T3 T1 T1 T8
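With a term-partitioned index, a query only needs to reach the servers managing its terms. The sketch below illustrates that routing step using the four-server layout from the figure (IR Core 1 managing T1...T3, and so on). The intermediate term names implied by the ranges (T5, T6, ...) and the `route` helper are assumptions made for illustration.

```python
# Term ranges from the figure: four servers, three terms each.
partition = {
    "IR Core 1": ["T1", "T2", "T3"],
    "IR Core 2": ["T4", "T5", "T6"],
    "IR Core 3": ["T7", "T8", "T9"],
    "IR Core 4": ["T10", "T11", "T12"],
}
# Reverse lookup: which server manages each term.
term_to_server = {t: s for s, terms in partition.items() for t in terms}


def route(query_terms):
    """Group the terms of a query by the server that manages them."""
    plan = {}
    for term in query_terms:
        plan.setdefault(term_to_server[term], []).append(term)
    return plan


plan = route(["T2", "T4", "T3"])
```

For the query T2 T4 T3 only IR Core 1 and IR Core 2 are contacted, and IR Core 1 receives two of the three terms. This unevenness is why the popularity of terms matters for load balancing in term-distributed systems.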
Figure: terms ordered by popularity (x-axis) vs. popularity (y-axis)
balancing for term-distributed parallel
Washington, USA, August 06 - 11, 2006.
partitioning in parallel web search
Suzhou, China, June 06 - 08, 2007.