SLIDE 1

Scale Effects in Web Search

Soroush Ebadian, Parand Alizadeh Under Supervision of Prof. Fazli Social and Economic Networks, Spring 1397

23/2/1397 Sharif University of Technology

1/32

SLIDE 2

Contents

  • Overview on Problem Space
  • Data Description
  • Direct Effects of Scale
  • Indirect Effects of Scale
  • Discussion & Conclusion

SLIDE 3
  • Overview on Problem Space
  • Data Description
  • Direct Effects of Scale
  • Indirect Effects of Scale
  • Discussion & Conclusion

SLIDE 4

Analysis of Web Search Markets

  • Two different worlds:
  • Ranking based on algorithmic innovation and fixed document features
  • Learning from historical queries is critical to ranking quality

Little is known about which one we live in.

SLIDE 5

Analysis of Web Search Markets (cont.)

  • Learning tends to slow down with each additional data point
  • Can any viable entrant easily achieve this?!
  • Fig. 1: A learning curve averaged over many trials

SLIDE 6

Authors of Paper

  • Microsoft AI & Research: 4/5
  • HomeAway Inc. : 1/5

SLIDE 7
  • Overview on Problem Space
  • Data Description
  • Direct Effects of Scale
  • Indirect Effects of Scale
  • Discussion & Conclusion

SLIDE 8

Data Description

  • Two search engines with the same restrictions
  • More than 6 months of data
  • Based on Click-Through Rates (CTR)

               Impressions      Clicks
  Provider 1   > 200 billion    > 100 billion
  Provider 2   > 300 billion    > 150 billion

Table 1. Summary statistics

SLIDE 9
  • Overview on Problem Space
  • Data Description
  • Direct Effects of Scale
  • Indirect Effects of Scale
  • Discussion & Conclusion

SLIDE 10

Benchmark & Target Data

  • Raw log retention time is legally limited
  • Benchmark data: first 3 months
  • Target data: next 9 months
  • Pairs <H(q,d), CTR(q,d)>:
  • H(q,d): historical measure of query q before day d
  • CTR(q,d): CTR of query q on day d
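The construction of the <H(q,d), CTR(q,d)> pairs can be sketched over a toy log. This is a minimal illustration; the record format (day, query, impressions, clicks) and the choice of cumulative impressions as the historical measure H are assumptions, since the slides do not fix the raw-log schema:

```python
from collections import defaultdict

def build_pairs(log):
    """Build <H(q,d), CTR(q,d)> pairs from a toy click log.

    `log` is a hypothetical list of (day, query, impressions, clicks)
    records. H(q,d) is taken here as the cumulative impressions of q
    before day d (one plausible "historical measure").
    """
    daily = defaultdict(lambda: [0, 0])  # (day, query) -> [imps, clicks]
    for day, q, imps, clicks in log:
        daily[(day, q)][0] += imps
        daily[(day, q)][1] += clicks

    history = defaultdict(int)  # query -> impressions seen before today
    pairs = []
    for (day, q), (imps, clicks) in sorted(daily.items()):
        pairs.append((history[q], clicks / imps))  # (H(q,d), CTR(q,d))
        history[q] += imps
    return pairs

log = [(1, "maps", 100, 20), (2, "maps", 200, 50), (3, "maps", 100, 30)]
print(build_pairs(log))  # [(0, 0.2), (100, 0.25), (300, 0.3)]
```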

SLIDE 11

CTR & Historical Occurrences Positive Correlation

  • Generated 270 pairs, grouped into buckets by H(q,d)
  • Fig. 2. CTR shows a positive correlation with the number of historical occurrences (legend: Provider 1, Provider 2; x-axis buckets: 0-10, 10-100, 100-1k, 10k-100k, 100k-1m, 1m-10m, 10m-100m)
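The bucketing by H(q,d) can be sketched with decade-style buckets matching the figure's x-axis. A minimal illustration; the exact bucket boundaries are assumptions:

```python
import math

def bucket(h):
    """Assign a historical-occurrence count to a log-scale bucket
    like the figure's x-axis (0-10, 10-100, ..., 10m-100m)."""
    if h < 10:
        return "0-10"
    lo = 10 ** int(math.log10(h))  # lower decade boundary
    names = {10: "10", 100: "100", 1_000: "1k", 10_000: "10k",
             100_000: "100k", 1_000_000: "1m", 10_000_000: "10m",
             100_000_000: "100m"}
    return f"{names[lo]}-{names[lo * 10]}"

print(bucket(5), bucket(50), bucket(2_500_000))  # 0-10 10-100 1m-10m
```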

SLIDE 12

Regression Analysis

CTR = −0.0530[−0.085, −0.021] + 0.3287[0.315, 0.343] sqrt(log(x))
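As a sanity check, the fitted Provider 1 curve can be evaluated directly. A minimal sketch; the slides do not state the log base, so the natural log is assumed here:

```python
import math

def predicted_ctr(x, intercept=-0.0530, slope=0.3287):
    """Predicted CTR from the fitted Provider 1 model:
    CTR = intercept + slope * sqrt(log(x)).
    Natural log is assumed (not stated on the slide)."""
    return intercept + slope * math.sqrt(math.log(x))

# Predicted CTR grows with history, but with sharply diminishing returns.
for x in (10, 1_000, 100_000, 10_000_000):
    print(x, round(predicted_ctr(x), 3))
```

The sqrt-of-log shape is what makes the "learning slows down with each additional data point" claim concrete: each extra order of magnitude of history buys less CTR than the previous one.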

  • Fig. 3. Provider 1, relationship between CTR and the number of historical examples.

SLIDE 13

Regression Analysis (cont.)

CTR = −0.3871[−0.486, −0.288] + 0.4792[0.438, 0.520] sqrt(log(x))

  • Fig. 4. Provider 2, relationship between CTR and the number of historical examples.

SLIDE 14

Scale Effect Analysis on New Queries

  • Popular queries may be easier to satisfy
  • Control for the same “query difficulty”:
  • (1) the query has fewer than 200 clicks in the three-month benchmark
  • (2) the total number of clicks of the query is between 1000 and 2000 (in a year)

  • Provider 1: 8000 queries
  • Provider 2: 10000 queries

SLIDE 15

Scale Effect Analysis on New Queries (cont.)

  • CTR(q, c): CTR of q over the period from click c+1 to click c+100
  • c ∈ {100, 200, . . . , 900}
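CTR(q, c) as defined above can be sketched over a toy impression stream. An illustration only; representing each impression as a 1 (click) or 0 (no click) is an assumption:

```python
def window_ctr(events, c, width=100):
    """CTR of a query over its (c+1)-th through (c+width)-th clicks.

    `events` is a hypothetical chronological list of per-impression
    outcomes (1 = click, 0 = no click). All impressions between the
    c-th click and the (c+width)-th click count toward the window.
    """
    total_clicks = 0   # clicks seen so far, to locate the window
    imps = clicks = 0  # impressions and clicks inside the window
    for outcome in events:
        if total_clicks >= c and clicks < width:
            imps += 1
            clicks += outcome
        total_clicks += outcome
        if clicks >= width:
            break
    return clicks / imps if imps else None

# Window of clicks 2..3 spans 3 impressions, 2 of them clicks.
print(window_ctr([0, 1, 0, 1, 1, 0, 1], c=1, width=2))
```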

SLIDE 16

Scale Effect Exists in Both

  • Fig. 5. Provider 1, relationship between CTR and the number of historical examples for new queries only.

SLIDE 17

Scale Effect Exists in Both (cont.)

  • Fig. 6. Provider 2, relationship between CTR and the number of historical examples for new queries only.

SLIDE 18
  • Overview on Problem Space
  • Data Description
  • Direct Effects of Scale
  • Indirect Effects of Scale
  • Discussion & Conclusion

SLIDE 19

Constructing Bipartite Knowledge Graph

  • G = <Q, D, E>
  • Q = queries, D = documents
  • eij = click count between qi and dj
  • Represent each query as a bag of words; distinct queries reduce by 7%
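The graph construction can be sketched as follows. A toy illustration; normalizing a query to its sorted bag of tokens is an assumption about how duplicate queries collapse:

```python
from collections import Counter, defaultdict

def build_click_graph(clicks):
    """Build the bipartite query-document click graph G = <Q, D, E>.

    `clicks` is a hypothetical list of (query, document) click events;
    e_ij is the click count between query q_i and document d_j.
    Queries are normalized to bags of words (sorted tokens), which is
    what collapses distinct query strings.
    """
    def bag(q):
        return tuple(sorted(q.lower().split()))

    edges = defaultdict(Counter)  # bag(query) -> {doc: click count}
    for q, d in clicks:
        edges[bag(q)][d] += 1
    return edges

g = build_click_graph([("cheap flights", "d1"),
                       ("flights cheap", "d1"),
                       ("cheap flights", "d2")])
print(len(g))  # 1 distinct query after bag-of-words normalization
```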

SLIDE 20

Summary of Query-Document Graph

  • Cardinality Q: 4.82 billion
  • Cardinality D: 3.26 billion
  • Cardinality E: 11.6 billion
  • Total clicks: > 100 billion

SLIDE 21

Clustering Documents

  • Construct a document similarity matrix using cosine similarity
  • Convert similarity weights to 0 or 1 using a threshold

  • Construct document similarity graph
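The similarity-and-threshold steps can be sketched as below. A minimal illustration; representing each document by its query-click vector is an assumption, since the slides do not fix the feature vectors:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse click vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_edges(docs, threshold=0.8):
    """Thresholded document-similarity graph: keep an edge between two
    documents iff their cosine similarity is >= threshold.
    `docs` maps a document id to a hypothetical query-click vector."""
    ids = sorted(docs)
    return {(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]
            if cosine(docs[a], docs[b]) >= threshold}

docs = {"d1": Counter(maps=4, directions=1),
        "d2": Counter(maps=3, directions=1),
        "d3": Counter(weather=5)}
print(similarity_edges(docs))  # {('d1', 'd2')}
```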

SLIDE 22

Clustering Documents (cont.)

  • Find connected components of the document similarity graph
  • Each connected component is an intent-cluster

  • Construct query/intent-cluster graph
  • Eij = fraction of clicks from qi to clusterj

SLIDE 23

Algorithm 1. Find Connected Components

  • 1. Every document pair is a separate cluster
  • 2. Identify link nodes between pairs and merge

  • 3. Repeat 2 until convergence
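The merge-until-convergence procedure amounts to computing connected components. A minimal union-find sketch of the same idea (not the paper's actual large-scale implementation):

```python
def connected_components(edges):
    """Connected components via union-find; each resulting component
    corresponds to one intent-cluster. `edges` are document pairs
    from the thresholded similarity graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(c) for c in clusters.values())

print(connected_components([("d1", "d2"), ("d2", "d3"), ("d4", "d5")]))
# [['d1', 'd2', 'd3'], ['d4', 'd5']]
```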

SLIDE 24

Evaluation of Clusters

  • Form a 100-query test set and get all their clusters
  • Score edges with 0 or 1 using human auditors
  • Compare thresholds 0.7, 0.8, 0.9 and 0.95

SLIDE 25

Evaluation of Clusters (cont.)

  • Precision: fraction of pairs judged to be relevant
  • Weighted Precision: precision with a Markov weight applied to each pair
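The two precision measures can be sketched as follows. Since the slides do not define the Markov weighting scheme, `weights` is left as a hypothetical pair-to-weight mapping:

```python
def precision(judgments):
    """Fraction of audited pairs judged relevant (1) vs not (0).
    `judgments` maps a document pair to its auditor score."""
    return sum(judgments.values()) / len(judgments)

def weighted_precision(judgments, weights):
    """Precision with a per-pair weight applied. The exact Markov
    weights are not spelled out on the slides, so `weights` is a
    hypothetical pair -> weight mapping."""
    total = sum(weights[p] for p in judgments)
    return sum(weights[p] * j for p, j in judgments.items()) / total

judgments = {("d1", "d2"): 1, ("d1", "d3"): 0, ("d2", "d3"): 1}
weights = {("d1", "d2"): 3.0, ("d1", "d3"): 1.0, ("d2", "d3"): 1.0}
print(round(precision(judgments), 3))           # 0.667
print(weighted_precision(judgments, weights))   # 0.8
```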

SLIDE 26

Evaluation of Clusters (cont.)

  • Pseudo Recall: 1 for threshold 0.7; otherwise, the fraction of pairs each method recovers
  • Weighted Recall: pseudo recall with a Markov weight applied to each pair

SLIDE 27

Evaluation of Clusters (cont.)

Threshold   Precision   W. Precision   Pseudo Recall   W. Recall
0.7         0.69        0.79           1               1
0.8         0.7         0.84           0.76            1.054
0.9         0.68        0.83           0.45            1.04
0.95        0.66        0.83           0.26            1.03

Table 2. Precision and Recall by threshold

SLIDE 28
  • Fig. 7. CDF of the number of intent clusters with an edge to the submitted query.

SLIDE 29
  • Fig. 8. CDF of the number of queries per intent cluster.

SLIDE 30

Impact on CTR

SLIDE 31

Discussion & Conclusion

  • It is unclear whether an increase in scale makes the search problem easier or harder
  • Search engines are among the most complicated engineering tasks ever attempted

SLIDE 32

Thanks for your attention!
