SLIDE 1

Going Beyond the Document-Query Lexical Match

Oren Kurland

Faculty of Industrial Engineering and Management Technion

1 / 29

SLIDE 2

Search engines

2 / 29

SLIDE 3

The ad hoc retrieval task

Relevance Ranking Rank documents in a corpus by their relevance to the information need expressed by a given query

3 / 29

SLIDE 4

The vector space model

Salton ’68

q = “Technion”

  • q =< 0, . . . , 0, 1, 0, . . . , 0 >

d = “Technion faculty student Technion”

  • d =< 0, . . . , 0, 1, 0, . . . , 0, 1, 0, . . . , 0, 2, 0, . . . , 0 >

score_VS(d; q) := cos(d, q)

Term weighting scheme: TF.IDF
  • TF: the number of occurrences of the term in the document
  • IDF: the inverse of the document frequency of the term

4 / 29
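The vector-space score above can be sketched in a few lines of Python. This is a toy illustration, not production code: `df` is an assumed precomputed document-frequency table and `n_docs` the corpus size.

```python
import math
from collections import Counter

def tfidf_vector(text, df, n_docs):
    """Build a sparse TF.IDF vector: term frequency times log inverse
    document frequency. Terms absent from df are dropped."""
    tf = Counter(text.split())
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf if w in df}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

With the slide's example, score_VS(d; q) = cosine(tfidf_vector("technion", ...), tfidf_vector("technion faculty student technion", ...)).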

SLIDE 5

The language modeling approach

Ponte&Croft ’98

Ranking:

score_LM(d; q) := Π_{w∈q} p(w|d)

p(w|d) is the probability that w is generated from a language model induced from d

A language model (smoothed with the corpus model):

p("Hello" | "Hello Hello World") := (1 − λ)·(2/3) + λ·p("Hello" | corpus)

5 / 29
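The query-likelihood score with the slide's smoothing scheme (Jelinek-Mercer interpolation with the corpus model) can be sketched as follows; `lam` plays the role of λ, and the corpus is passed as one long string for simplicity:

```python
import math
from collections import Counter

def lm_score(query, doc, corpus, lam=0.5):
    """Query-likelihood ranking: log of the product over query terms of
    (1 - lam) * p(w|d) + lam * p(w|corpus)."""
    d_tf, c_tf = Counter(doc.split()), Counter(corpus.split())
    d_len, c_len = sum(d_tf.values()), sum(c_tf.values())
    score = 0.0
    for w in query.split():
        p_d = d_tf[w] / d_len        # maximum-likelihood estimate from d
        p_c = c_tf[w] / c_len        # corpus (background) estimate
        score += math.log((1 - lam) * p_d + lam * p_c)
    return score
```

For the slide's example, p("Hello" | "Hello Hello World") corresponds to the term inside the log with p_d = 2/3.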

SLIDE 6

The document-query similarity estimate

Retrieval frameworks:
  • Probabilistic retrieval (Maron&Kuhns ’60); Okapi BM25 (Robertson et al. ’93)
  • Vector space model (Salton ’68)
  • The inference network model (Turtle&Croft ’90)
  • Pivoted document length normalization (Singhal et al. ’96)
  • Language modeling (Ponte&Croft ’98)
  • Divergence from randomness (Amati&van Rijsbergen ’00)

Are all these the same? It’s all about TF, IDF and length normalization (Fang et al. ’04, ’09)

Axiomatization of document-query similarity functions used for ranking (Fang et al. ’05)

6 / 29

SLIDE 7

Web search

A variety of relevance signals:
  • The similarity between the page and the query (query dependent)
  • The similarity between the anchor text and the query (query dependent)
  • The PageRank score of the page (query independent)
  • Additional document quality measures, e.g., spam score, entropy (query independent)
  • The clickthrough rate for the page (query independent)
  • ...

7 / 29

SLIDE 8

Learning to rank

A training set: {(f(qi, dj), l(qi, dj))}_{i,j}
  • qi: query
  • dj: document
  • f(qi, dj): a representation for the pair (qi, dj)
  • l(qi, dj): a relevance judgment for the pair (qi, dj)

Minimize a loss function using pointwise/pairwise/listwise approaches (Liu ’09)

8 / 29
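A minimal sketch of the pairwise flavor, assuming a linear scoring model; this is a generic hinge-loss update for illustration, not any specific published ranker:

```python
def pairwise_hinge_step(w, f_rel, f_nonrel, lr=0.1):
    """One pairwise learning-to-rank update: if the relevant document's
    score does not exceed the non-relevant one's by a margin of 1,
    nudge the weight vector w toward the relevant document's features."""
    score = lambda f: sum(wi * fi for wi, fi in zip(w, f))
    if score(f_rel) - score(f_nonrel) < 1.0:  # margin violated
        w = [wi + lr * (r - n) for wi, r, n in zip(w, f_rel, f_nonrel)]
    return w
```

Repeating this step over all training pairs (f(qi, dj), l(qi, dj)) drives the model to order relevant above non-relevant documents per query.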

SLIDE 9

Observations

Relevance is determined based on whether the document content satisfies the information need expressed by the query

The document-query similarity is among the most important features for ranking pages in Web search (Liu ’09)

Can document-query similarity estimates be further improved?

9 / 29

SLIDE 10

The surface-level document query similarity

The vocabulary mismatch problem: relevant documents might not contain some, or even all, query terms
  • Short queries
  • Short documents (e.g., tweets)

Example:
  • query: “shipment vehicles”
  • document: “cargo freight truck”

10 / 29

SLIDE 11

The risk minimization framework

Lafferty&Zhai ’01

11 / 29

SLIDE 12

Semantic matching

Li&Xu ’13

  • Query reformulation
  • Term dependence models
  • Translation models
  • Topic models
  • Latent space models

12 / 29

SLIDE 13

Short queries

Automatic query expansion:
  • Global methods analyze the corpus or external resources in a query-independent fashion
  • Local methods rely on some initial search

Global methods: using WordNet (Voorhees ’94), a large external corpus (Diaz&Metzler ’06), Wikipedia (Xu et al. ’09)

Translation model (Berger&Lafferty ’99):

p(q|d) := Π_{w∈q} Σ_{w′∈Lexicon} p(w′|d)·T(w′|w)

Estimating T using mutual information (Karimzadehgan&Zhai ’09); effective for microblog search (Karimzadehgan et al. ’13)

13 / 29
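The translation-model score can be sketched as below. The translation table `T` is a toy stand-in for a learned table (e.g., one estimated via mutual information); its entries here are made-up illustrative values:

```python
from collections import Counter

def translation_score(query, doc, T):
    """Translation-model retrieval in the spirit of Berger&Lafferty '99:
    p(q|d) = product over query terms w of
             sum over document terms w' of p(w'|d) * T(w', w).
    T maps (w_prime, w) pairs to translation probabilities; summing over
    the document's own terms (rather than the full lexicon) is a
    simplification, since other terms contribute p(w'|d) = 0."""
    tf = Counter(doc.split())
    dlen = sum(tf.values())
    p = 1.0
    for w in query.split():
        p *= sum((tf[wp] / dlen) * T.get((wp, w), 0.0) for wp in tf)
    return p
```

With the earlier vocabulary-mismatch example, a table entry like T[("cargo", "shipment")] lets "cargo freight truck" match the query "shipment vehicles" despite zero term overlap.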

SLIDE 14

Pseudo-feedback-based query expansion

Utilize information from documents that are highly ranked by an initial search performed in response to the query

Relevance modeling (Lavrenko&Croft ’01): a generative theory of relevance. The query and the relevant documents are sampled from the same language model (the relevance model, R)

p(w|R) := λ·p(w|q) + (1 − λ)·Σ_{d∈Dinit} p(w|d)·p(d|q)

score(d; q) := −KL(p(·|R) ‖ p(·|d))

State-of-the-art (unigram) pseudo-feedback-based query expansion approach (Lv&Zhai ’09)

How do we set λ? Adaptive/selective query expansion ...

14 / 29
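The relevance-model estimate can be sketched directly from the formula; `top_docs` are the documents highly ranked by the initial search, and `doc_scores` are their p(d|q) values (assumed already normalized to sum to 1):

```python
from collections import Counter

def relevance_model(query, top_docs, doc_scores, lam=0.5):
    """Relevance-model estimate in the spirit of Lavrenko&Croft '01:
    p(w|R) = lam * p(w|q) + (1 - lam) * sum_d p(w|d) * p(d|q)."""
    q_tf = Counter(query.split())
    q_len = sum(q_tf.values())
    p_r = Counter()
    for w in q_tf:                      # the lam * p(w|q) component
        p_r[w] += lam * q_tf[w] / q_len
    for doc, p_dq in zip(top_docs, doc_scores):
        tf = Counter(doc.split())
        dlen = sum(tf.values())
        for w, c in tf.items():         # the (1 - lam) * p(w|d) * p(d|q) mass
            p_r[w] += (1 - lam) * (c / dlen) * p_dq
    return p_r
```

Documents are then ranked by the (negated) KL divergence between p(·|R) and each document model.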

SLIDE 15

Beyond bag-of-terms (unigram) representations

Markov Random Fields (Metzler&Croft ’05)

Q: query composed of the terms q1, q2, . . .
D: document
p(Q, D) = ?

15 / 29

SLIDE 16

Markov Random Fields

P(D|Q) rank= Σ_{c∈Cliques(G)} λc·f(c)

G: graph; f(c): feature function
  • Unigram features: f_T(c) := log p(qi|D)
  • Ordered phrase features: f_O(c) := log p(ow(qi, . . . , qi+k)|D)
  • Unordered phrase features: f_U(c) := log p(uw(qi, . . . , qi+k)|D)

Additional models:
  • Linear discriminant model (Gao et al. ’05)
  • Differential concept weighting (Shi&Nie ’10)
  • Modeling higher-order term (concept) dependencies using query hypergraphs (Bendersky&Croft ’12)

16 / 29
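A sequential-dependence-style score in this spirit can be sketched as a weighted sum of the three feature types. The add-0.5 smoothing and window size here are crude simplifications for illustration; the original model uses Dirichlet-smoothed language models and learned clique weights:

```python
import math
from collections import Counter

def sdm_score(query, doc, weights=(0.8, 0.1, 0.1)):
    """MRF-style scoring sketch (after Metzler&Croft '05): a weighted sum
    of unigram, ordered-bigram, and unordered-window log features."""
    q, d = query.split(), doc.split()
    tf, bigrams, n = Counter(d), Counter(zip(d, d[1:])), len(d)

    def uw_count(a, b, window=4):
        # co-occurrences of a and b within an unordered window
        return sum(1 for i in range(n)
                   for j in range(i + 1, min(i + window, n))
                   if {d[i], d[j]} == {a, b})

    lam_t, lam_o, lam_u = weights
    score = sum(lam_t * math.log((tf[w] + 0.5) / (n + 1)) for w in q)
    for a, b in zip(q, q[1:]):
        score += lam_o * math.log((bigrams[(a, b)] + 0.5) / (n + 1))
        score += lam_u * math.log((uw_count(a, b) + 0.5) / (n + 1))
    return score
```

Documents that contain the query terms as an ordered phrase get credit from all three feature types, not just the unigram one.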

SLIDE 17

Latent concept expansion

Metzler&Croft ’07

p(E|Q) := Σ_{D∈Dinit} (f_QD(Q, D) + f_D(D) + f_QD(E, D) + f_Q(E))

Additional models:
  • Using hierarchical Markov Random Fields for query expansion (Lang et al. ’10)
  • Learning concept importance (Bendersky et al. ’11)

17 / 29

SLIDE 18

Parametrized concept weighting

Bendersky et al. ’11

score(D; Q) := Σ_{T∈T} Σ_{c∈T} λc·f(c, D)

c: concept
T : types of concepts: query terms, phrases (bigrams), biterms, expansion terms

λc := Σ_{ϕ∈ΦT} wϕ·ϕ(c)

ΦT : a set of feature (importance) functions for a concept of type T (e.g., using the corpus, Google n-grams, Wikipedia, a search log)

score(D; Q) = Σ_{T∈T} Σ_{c∈T} Σ_{ϕ∈ΦT} wϕ·ϕ(c)·f(c, D)

18 / 29
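The parametrized weighting above can be sketched as follows. The dictionary layout and the feature values in the usage below are illustrative assumptions; in the actual framework the weights w_ϕ are learned:

```python
def concept_score(doc_match, importance_feats, feat_weights):
    """Parametrized concept weighting sketch (in the spirit of
    Bendersky et al. '11): score(D; Q) = sum_c lambda_c * f(c, D),
    with lambda_c = sum_phi w_phi * phi(c).
    doc_match: {concept: f(c, D)} document matching scores;
    importance_feats: {concept: {feat_name: phi(c)}};
    feat_weights: {feat_name: w_phi}."""
    score = 0.0
    for c, f_cd in doc_match.items():
        lam_c = sum(feat_weights.get(name, 0.0) * val
                    for name, val in importance_feats.get(c, {}).items())
        score += lam_c * f_cd
    return score
```

Each concept's contribution is its document match f(c, D) scaled by an importance weight λc assembled from its features.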

SLIDE 19

Positional language models

Lv&Zhai ’09

c(w, i): count of term w at position i in document D
k(i, j): the term count propagated to position i from position j

c′(w, i) := Σ_{j=1}^{N} c(w, j)·k(i, j)

p(w|D, i) := c′(w, i) / Σ_{w′∈Vocabulary} c′(w′, i)

score(Q, D, i) := −KL(p(·|Q) ‖ p(·|D, i))

Query expansion: a positional relevance language model (Lv&Zhai ’10)

19 / 29
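The propagated counts c′(w, i) can be sketched with a Gaussian propagation kernel, one of the kernels considered in this line of work; the kernel width `sigma` is an illustrative choice:

```python
import math
from collections import Counter

def positional_lm(doc_terms, i, sigma=2.0):
    """Positional language model sketch (after Lv&Zhai '09):
    c'(w, i) = sum_j c(w, j) * k(i, j), with the Gaussian kernel
    k(i, j) = exp(-(i - j)^2 / (2 * sigma^2)); the model at position i
    is the normalized propagated counts."""
    c_prime = Counter()
    for j, w in enumerate(doc_terms):
        c_prime[w] += math.exp(-((i - j) ** 2) / (2 * sigma ** 2))
    total = sum(c_prime.values())
    return {w: c / total for w, c in c_prime.items()}
```

Terms near position i dominate its language model, so matching against p(·|D, i) rewards documents whose query-related content is locally concentrated.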

SLIDE 20

Matching in a latent space

The term-document matrix:

         w1          w2
d1   f(w1; d1)   f(w2; d1)
d2   f(w1; d2)   f(w2; d2)

Latent Semantic Analysis (LSA; Deerwester et al. ’90): low-rank approximation using SVD

A_k = argmin_{X: rank(X)=k} ‖A − X‖_F

Probabilistic Latent Semantic Analysis (pLSA; Hofmann ’99)

Supervised methods for doc-query matching in a latent space (Bai et al. ’09, Huang et al. ’13, Wu et al. ’13)

20 / 29

SLIDE 21

The cluster hypothesis

The cluster hypothesis (Jardine&van Rijsbergen ’71, van Rijsbergen ’79): closely associated documents tend to be relevant to the same requests

Leveraging the hypothesis: enrich a document representation using information induced from its corpus context

21 / 29

SLIDE 22

Smoothing document representations

p(w|D) := λ1·c(w, D)/Σ_{w′} c(w′, D) + λ2·c(w, corpus)/Σ_{w′} c(w′, corpus) + λ3·Σ_{t∈Topics} p(w|t)·p(t|D)

Topics:
  • Clusters with which D is associated (Kurland&Lee ’04, Liu&Croft ’04)
  • LDA (Blei et al. ’03; Wei&Croft ’06), pLSA (Hofmann ’99; Lu et al. ’11), or PAM (Li&McCallum ’06; Yi&Allan ’09)

22 / 29
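The three-way mixture can be sketched directly; the topic distributions and mixture weights in the usage are toy values, since in practice they come from clustering or a fitted topic model:

```python
from collections import Counter

def smoothed_p(w, doc, corpus, topics, p_topic_doc, lams=(0.6, 0.3, 0.1)):
    """Cluster/topic-smoothed document model sketch:
    p(w|D) = lam1 * p_mle(w|D) + lam2 * p_mle(w|corpus)
           + lam3 * sum_t p(w|t) * p(t|D).
    topics: {topic: {word: p(w|t)}}; p_topic_doc: {topic: p(t|D)}."""
    lam1, lam2, lam3 = lams
    d_tf, c_tf = Counter(doc.split()), Counter(corpus.split())
    p_d = d_tf[w] / sum(d_tf.values())
    p_c = c_tf[w] / sum(c_tf.values())
    p_t = sum(topics[t].get(w, 0.0) * p_topic_doc.get(t, 0.0) for t in topics)
    return lam1 * p_d + lam2 * p_c + lam3 * p_t
```

The topic component lets a document assign non-zero probability to query terms it never contains, as long as its clusters or topics do.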

SLIDE 23

Smoothing document representations

Empirical observations (Yi&Allan ’09):
  • Using more sophisticated topic models doesn’t yield improved retrieval effectiveness
  • Using nearest-neighbors clusters as “topics” results in retrieval performance as good as that of using topic models
  • Pseudo-feedback-based query expansion (specifically, relevance modeling) outperforms using topic models

Cluster-based smoothing is highly effective for microblog retrieval (Efron ’11)

23 / 29

SLIDE 24

A different approach to utilizing corpus context

Cluster ranking

Pipeline:
  • Query
  • Initial list of documents (document ranking method)
  • Set of clusters (clustering method)
  • Ranking of clusters (cluster ranking method)
  • Ranking of documents (each cluster is replaced with its documents)

24 / 29
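The pipeline's last two steps can be sketched as below. Scoring a cluster by the mean query similarity of its documents is one simple choice of cluster ranking method, used here for illustration:

```python
def cluster_rank(query_sim, clusters):
    """Cluster-ranking sketch: score each cluster by the mean query
    similarity of its documents, rank the clusters, then replace each
    cluster with its documents (ordered by query similarity).
    query_sim: {doc: similarity score}; clusters: list of doc lists."""
    ranked_clusters = sorted(
        clusters,
        key=lambda c: sum(query_sim[d] for d in c) / len(c),
        reverse=True)
    ranking, seen = [], set()
    for cluster in ranked_clusters:
        for d in sorted(cluster, key=query_sim.get, reverse=True):
            if d not in seen:            # a document may appear in
                seen.add(d)              # several overlapping clusters
                ranking.append(d)
    return ranking
```

Note how a moderately scored document can be promoted past a higher-scoring one when it belongs to a strong cluster.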

SLIDE 25

The optimal cluster

[Figure: p@5 of doc-query similarity ranking vs. query expansion vs. an oracle experiment that selects the optimal cluster]

25 / 29

SLIDE 26

Ranking clusters using Markov Random Fields

Raiber&Kurland ’13

Winner of the Web track in TREC 2013

p(c|q) := p(c, q) / p(q) rank= p(c, q) rank= Σ_{l∈Cliques(G)} λl·fl(l)

fl(l): feature function defined over the clique l

26 / 29

SLIDE 27

Challenges

Query = “oren kurland dblp”

Search #1 vs. Search #2

27 / 29

SLIDE 28

Challenges

  • Prediction over queries: fixing the query-document similarity estimate, which queries will result in better performance than others? (Carmel&Yom-Tov ’10)
  • Prediction over similarity functions: fixing the query, which similarity function will result in better performance than others? (Balasubramanian&Allan ’10)
  • Fusion/rank aggregation of similarity functions: aggregating the results attained using different document-query similarity measures (Fox&Shaw ’92)
  • Fusion of document representations: effectively integrating cluster/topic-based representations with passage-based representations (Krikon&Kurland ’10)
  • Adaptive/selective query expansion: should we expand a given query? (Cronen-Townsend et al. ’02, Amati et al. ’04)
  • Cluster ranking: devising novel query-cluster similarity measures
  • Adversarial retrieval: devising document-query similarity functions in light of search engine optimization (Raiber et al. ’10)
  • More semantic analysis? (e.g., Symonds et al. ’12-’14; Bruza&Song ’02)

28 / 29

SLIDE 29

Summary

From surface-level document-query similarity to:
  • Automatic query expansion; more generally, query reformulation (e.g., query segmentation, query reduction)
  • Translation models
  • Term-dependence models
  • Dimension reduction, topic modeling, and (supervised) latent-space models
  • Cluster ranking

Query-document similarity estimates: there is still a long way to go

29 / 29